ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
jambalaya has joined #panfrost
kenzie7 has joined #panfrost
warpme___ has quit []
kenzie7 has quit []
benpouls_ has joined #panfrost
kenzie7 has joined #panfrost
benpoulson has quit [Ping timeout: 480 seconds]
floof58 has quit [Ping timeout: 480 seconds]
floof58 has joined #panfrost
floof58 is now known as Guest408
floof58 has joined #panfrost
Guest408 has quit [Ping timeout: 480 seconds]
benpouls_ has quit [Remote host closed the connection]
floof58 has quit [Ping timeout: 480 seconds]
benpoulson has joined #panfrost
floof58 has joined #panfrost
benpoulson has quit [Remote host closed the connection]
benpoulson has joined #panfrost
davidlt has joined #panfrost
hanetzer1 has joined #panfrost
hanetzer has quit [Ping timeout: 480 seconds]
carlosstive[m] has quit [autokilled: Spambot. Mail support@oftc.net if you think this is in error. (2022-08-18 04:12:09)]
hanetzer2 has joined #panfrost
hanetzer1 has quit [Ping timeout: 480 seconds]
davidlt has quit [Ping timeout: 480 seconds]
soreau has quit [Read error: Connection reset by peer]
soreau has joined #panfrost
nlhowell has joined #panfrost
davidlt has joined #panfrost
tomeu5 has joined #panfrost
sergi8 has joined #panfrost
ndufresne5 has joined #panfrost
ndufresne has quit [Read error: Connection reset by peer]
sergi has quit [Write error: connection closed]
MajorBiscuit has joined #panfrost
tomeu has quit [Ping timeout: 480 seconds]
Major_Biscuit has quit [Ping timeout: 480 seconds]
warpme___ has joined #panfrost
indy_ has joined #panfrost
indy has quit [Ping timeout: 480 seconds]
rkanwal has joined #panfrost
atler has joined #panfrost
indy_ is now known as indy
anarsoul|2 has joined #panfrost
anarsoul has quit [Read error: No route to host]
zhxt_ has quit [Read error: Connection reset by peer]
zhxt_ has joined #panfrost
anarsoul has joined #panfrost
anarsoul|2 has quit [Read error: Connection reset by peer]
warpme___ has quit []
Danct12 has quit [Read error: Connection reset by peer]
rkanwal has quit [Read error: Connection reset by peer]
rkanwal has joined #panfrost
ndufresne5 is now known as ndufresne
nlhowell has quit [Ping timeout: 480 seconds]
soreau has quit [Read error: Connection reset by peer]
soreau has joined #panfrost
<greenjustin> Would folks here happen to know much about the performance of Mali texture pipelines? I've found this really good doc for Midgard, but I can't seem to find anything publicly available about Bifrost or Valhall, was curious if someone here had any performance numbers: https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/the-mali-gpu-an-abstract-machine-part-3---the-midgard-shader-core
<robmur01> there's a matching set of blogs about Bifrost, not sure about Valhall though
<robmur01> oh, my bad, it's actually part 4 of that same blog series
<greenjustin> oh! thank you so much!
<greenjustin> ok so it's still about 1 texel per core
<greenjustin> *per clock
<robmur01> there are various other nuggets scattered around in Arm docs like the best practices guide and GPU datasheet
<robmur01> of course subsequent GPUs iterated quite hard on the initial G71 design, so it's ended up probably being the least representative of Bifrost overall
<greenjustin> Ah, and it looks like valhall is 4 texels per clock per core
<greenjustin> Yeah but I think the MT8183 is G71, right? So that's probably still relevant to what I'm doing
zekra has joined #panfrost
zekra has left #panfrost [#panfrost]
stepri01 has quit [Remote host closed the connection]
Danct12 has joined #panfrost
MajorBiscuit has quit [Quit: WeeChat 3.5]
karolherbst has quit [Read error: Connection reset by peer]
karolherbst_ has joined #panfrost
karolherbst_ is now known as karolherbst
Lyude has quit [Quit: Bouncer restarting]
Lyude has joined #panfrost
davidlt has quit [Ping timeout: 480 seconds]
davidlt has joined #panfrost
rkanwal has quit [Ping timeout: 480 seconds]
rasterman has joined #panfrost
alyssa has joined #panfrost
<alyssa> greenjustin: there are some nuggets in https://developer.arm.com/documentation/102849/latest/
pch has quit [Ping timeout: 480 seconds]
pch has joined #panfrost
davidlt has quit [Ping timeout: 480 seconds]
paulk has joined #panfrost
<greenjustin> alyssa: Thank you!
<greenjustin> This might be a dumb question, but does the pixel format matter for the texel performance?
<alyssa> n
<alyssa> yeah, absolutely
<alyssa> memory bandwidth if nothing else
<greenjustin> oh interesting, like would I be able to fetch 4 texels a cycle if the format were e.g. R8?
<alyssa> ETC2 requires half the memory bandwidth of RGBA8, which requires 1/4 of RGBA32
* alyssa shrugs
<alyssa> I don't know the specific uarch details of the Mali texturing pipe
<alyssa> but in general you can expect better performance with better formats
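A rough sketch of the memory bandwidth point above, using only the standard per-texel sizes of a few uncompressed formats. The 800 MHz clock and the 1 texel per clock rate are hypothetical numbers picked for illustration, not measured Mali figures, and the calculation ignores caches entirely.

```python
# Bytes per texel for a few uncompressed formats (standard sizes).
BYTES_PER_TEXEL = {"R8": 1, "RGBA8": 4, "RGBA32F": 16}


def fetch_bandwidth_gb_per_s(fmt, texels_per_second):
    """Raw GB/s needed to stream texels of `fmt`, assuming every fetch misses the cache."""
    return BYTES_PER_TEXEL[fmt] * texels_per_second / 1e9


# Hypothetical rate: 1 texel per clock at 800 MHz on a single core.
HYPOTHETICAL_TEXELS_PER_SECOND = 1 * 800_000_000

for fmt in BYTES_PER_TEXEL:
    gb = fetch_bandwidth_gb_per_s(fmt, HYPOTHETICAL_TEXELS_PER_SECOND)
    print(f"{fmt}: {gb:.1f} GB/s")
```

At the same texel rate, R8 needs a quarter of the traffic of RGBA8, which in turn needs a quarter of RGBA32F. Whether that shows up as throughput or only as lower power still depends on the factors discussed in the conversation.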
<greenjustin> That's interesting. Yeah I figured chroma subsampling for example would greatly improve memory bandwidth, but the docs make it almost sound like the texture fetch unit itself can only process about 1 texel request or w/e per cycle for some reason
<alyssa> sure
<alyssa> "for some reason" remember that hardware is deeply pipelined
<alyssa> uarch design isn't like optimizing software
<alyssa> (e.g. doing more in one cycle can slow things down overall if the extra propagation delay forces you to lower the clock rate of the whole chip)
<alyssa> That data sheet is also greatly simplified from the real hw behaviours, of course.
MajorBiscuit has joined #panfrost
<greenjustin> Interesting. I guess I assumed it worked more like CPU uarch, where for example you could issue a whole bunch of store instructions and they'd likely run simultaneously on a superscalar arch
<greenjustin> At least up until you saturate the bus
<greenjustin> (Or write combiner, or whatever)
<greenjustin> I guess in my head I would have assumed the texel unit worked the same way. E.g. you can fetch 1 ARGB pixel, or 4 R8 pixels per clock
* alyssa shrugs
<alyssa> I'm not a hardware person
<alyssa> I will say that texturing is /very/ complicated
<alyssa> it's not just a texel fetch
<alyssa> s/texturing/sampling/
<alyssa> and cache friendliness is the single biggest indicator of performance overall...
<greenjustin> That's fair yeah. I could easily see implementing an optimization like that introducing more propagation delay than it's worth.
<alyssa> overall R8 will still be more efficient than ARGB8
<alyssa> whether that's noticeable as lower FPS or just lower power consumption will depend on piles and piles of factors
<alyssa> the "1 pixel per clock" Arm quotes in the data sheet is just talking about the texturing unit proper, I think
<alyssa> which is good to know but maybe not super helpful on its own
<alyssa> A single texture request can take 100s of cycles, if the cache is cold and you're having a real bad day
<greenjustin> Right, I assume computing actual performance characteristics is tough and depends on a bunch of factors. Some of which might not even be public information?
<greenjustin> I was just trying to upper bound the throughput for a shader we've been experimenting with. I'm fairly certain it's texture fetch bound, so I've been using "texel fetch per clock" * "number of shaders" * "max clock speed"
<alyssa> not sure what the second factor is in for
<alyssa> "texel fetch per clock" * "max clock speed" is the upper bound on the texturing unit performance
<alyssa> but that doesn't factor in the (potentially much slower) memory bus
<alyssa> or the caching system
<alyssa> it's probably best just to benchmark the 'real' behaviour for your workload rather than try to imagine best case scenarios that are impossible to actually achieve, idk
Intdtti has quit [autokilled: This host violated network policy and has been banned. Mail support@oftc.net if you think this is in error. (2022-08-18 20:56:55)]
<greenjustin> Oh? Is there not 1 texture fetch unit per shader core?
_op62 has quit [autokilled: This host violated network policy and has been banned. Mail support@oftc.net if you think this is in error. (2022-08-18 20:58:13)]
<alyssa> oh, misunderstood you
<alyssa> sorry
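The upper-bound estimate discussed above ("texel fetch per clock" * "number of shader cores" * "max clock speed") can be sketched as below, together with the memory bus cap that the estimate leaves out. All numeric values (core count, clock, bus bandwidth) are hypothetical placeholders, and the function names are made up for this illustration. The result is a ceiling, not a prediction.

```python
def texture_unit_bound(texels_per_clock_per_core, shader_cores, clock_hz):
    """Peak texels/s if every core's texture unit is busy on every cycle."""
    return texels_per_clock_per_core * shader_cores * clock_hz


def memory_bus_bound(bus_bytes_per_second, bytes_per_texel):
    """Peak texels/s the memory bus could deliver if nothing hits the cache."""
    return bus_bytes_per_second / bytes_per_texel


def optimistic_texel_rate(texels_per_clock_per_core, shader_cores, clock_hz,
                          bus_bytes_per_second, bytes_per_texel):
    """The smaller of the two ceilings; real workloads will land below it."""
    return min(
        texture_unit_bound(texels_per_clock_per_core, shader_cores, clock_hz),
        memory_bus_bound(bus_bytes_per_second, bytes_per_texel),
    )


# Hypothetical part: 1 texel/clock/core, 3 cores, 800 MHz, 10 GB/s bus, RGBA8.
rate = optimistic_texel_rate(1, 3, 800_000_000, 10_000_000_000, 4)
print(f"{rate / 1e9:.2f} Gtexel/s")
```

With an all-miss access pattern the bus term shrinks quickly as the format widens, which is one reason benchmarking the actual workload is more informative than either ceiling.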
MajorBiscuit has quit [Quit: WeeChat 3.5]
hyrc has joined #panfrost
hyrc has quit []
rasterman has quit [Quit: Gettin' stinky!]
alyssa has quit [Quit: leaving]
<daniels> greenjustin: locality is definitely the most important thing, which I appreciate is difficult for your uses … we also don’t have any great insight into how the DDK schedules fetches, which is critical here as Bifrost has a fun hardware design around dependency scheduling
<greenjustin> daniels: Yeah I'm almost certain our cache miss rate is terrible and there isn't much we can do about it. What I'm starting to think though is that even with a 100% cache hit rate, the GPU is just not suited for this type of work
<daniels> greenjustin: splitting the two planes into independent passes may or may not help. the hardware does have dual-issue tex support but iirc that only kicks in when using identical normalised coords, so wouldn’t happen with texelFetch
megatradeusa[m] has quit [Excess Flood]
<daniels> (as a thought, you might even be better off doing compute rather than vert/frag and playing with the dimensions there to see what effect that has?)
<greenjustin> daniels: I think an even bigger problem than the texelFetch might be the 2 pixel *writes* per clock. By my calculations, the maximum throughput just based on that bottleneck alone is ~8.66 gigapixels per second
<greenjustin> vs our SIMD version is like 11.7 GP/s per core
<daniels> right, hence the suggestion of either doing one plane at a time rather than aggregating and writing both in a single pass - or going the other way where you can go much wider with compute to try to get maximum locality from tile reads
megatradeusa[m] has joined #panfrost
<daniels> I don’t know if doing writes out from compute is going to be better or worse than the frag store unit, but could be worth a shot …
<greenjustin> oh I think I know what you're saying
<greenjustin> like tell GL the textures are super wide and 1 pixel tall and use that to better exploit the write combiner?
<daniels> well. at the moment you’re using a GL draw call
<daniels> with texture reads and fragment writes
<daniels> if you use a compute shader (or OpenCL) with ins and outs as direct memory access, rather than executing once per output pixel for the fragment shader, you can define your execution ‘width’ which effectively gangs your invocations together
<daniels> so you can choose that based on e.g. input tile dimensions, and then play around with whether to use the texture fetch unit or do I/O as mem loads & stores through SSBOs
<daniels> it may be massively worse, or it might be quite a bit better
<daniels> but if you’re basically just doing weird blits, you could very well benefit from going to compute shaders rather than the full draw pipeline
<greenjustin> that's an interesting idea, I will have to look into that
<greenjustin> thank you!
<daniels> np!
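A minimal sketch of the "play with the dimensions" idea for a compute dispatch: given an image size and a candidate work-group size (what GLSL declares as `local_size_x` / `local_size_y`), work out how many groups to dispatch and how many invocations fall outside the image. The helper names and the 1920x1080 example are assumptions for illustration only. Which shape actually performs best on a given Mali part would have to be measured.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def dispatch_groups(width, height, local_x, local_y):
    """Work-group counts needed to cover a width x height image."""
    return ceil_div(width, local_x), ceil_div(height, local_y)


def idle_invocations(width, height, local_x, local_y):
    """Invocations launched outside the image because of rounding up."""
    groups_x, groups_y = dispatch_groups(width, height, local_x, local_y)
    return groups_x * local_x * groups_y * local_y - width * height


# Compare a square tile with a wide, one-pixel-tall group on a 1920x1080 image.
for local in [(16, 16), (64, 1)]:
    groups = dispatch_groups(1920, 1080, *local)
    idle = idle_invocations(1920, 1080, *local)
    print(f"local={local} groups={groups} idle={idle}")
```

The group counts are what would be passed to a dispatch call such as `glDispatchCompute`. A shader using them needs a bounds check so the idle invocations do not write outside the image.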
bluetail has quit [Quit: Ping timeout (120 seconds)]
bluetail has joined #panfrost