ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel driver has landed in mainline, userspace driver is part of mesa - Logs at https://oftc.irclog.whitequark.org/lima/
dllud has joined #lima
dllud_ has quit [Read error: Connection reset by peer]
Daanct12 has joined #lima
enunes has joined #lima
Daanct12 has quit [Ping timeout: 480 seconds]
dllud_ has joined #lima
dllud has quit [Read error: Connection reset by peer]
dllud has joined #lima
dllud_ has quit [Remote host closed the connection]
adjtm has quit [Quit: Leaving]
drod has joined #lima
<anarsoul> marex: technically you still have to read the whole texture, regardless of its format. I'm not sure which one will be more performant, you need to do your own benchmarking
<anarsoul> marex: and don't forget about cache! :)
<anarsoul> utgard processes fragments in 2x2 groups, so cache hit rate should be good for planar yuv (and for packed yuyv)
<anarsoul> marex: I think you can compare fs shader length for both cases, the shorter one will likely be faster
<marex> anarsoul: well for packed YUV, it would be literally load/mul
<marex> for planar YUV, it would be three loads, some swizzling, and then mul
<anarsoul> marex: and probably some coords math
<marex> so I think planar yuv is not good
<anarsoul> marex: it's not so obvious to me. Basically i420 has only 1 u/v value per 4 y, so it'll use less memory bandwidth. So it'll depend on the shader
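The bandwidth argument above can be checked with simple arithmetic. The following is a minimal sketch, not anything from the lima driver: the function names are made up for illustration, and it assumes even frame dimensions, 8-bit samples, and no row padding.

```python
def i420_frame_bytes(width, height):
    """Bytes in one I420 frame: full-resolution Y plane plus
    U and V planes that are each subsampled 2x2."""
    luma = width * height
    chroma_plane = (width // 2) * (height // 2)
    return luma + 2 * chroma_plane


def yuyv_frame_bytes(width, height):
    """Bytes in one packed YUYV frame: each 4-byte group
    (Y0 U Y1 V) covers two horizontally adjacent pixels."""
    return (width // 2) * 4 * height


if __name__ == "__main__":
    w, h = 1920, 1080
    print("I420:", i420_frame_bytes(w, h), "bytes")
    print("YUYV:", yuyv_frame_bytes(w, h), "bytes")
```

Under these assumptions I420 averages 1.5 bytes per pixel against 2 bytes per pixel for YUYV, so a full-frame read moves about 25% less data. Whether that wins overall still depends on the shader, as noted above.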
<anarsoul> marex: keep in mind that utgard pp is a VLIW architecture, so it does a lot of operations per instruction
<anarsoul> with I420 I guess the shortest it could get is 3 instructions, since it needs 3 samplers
<anarsoul> with YUYV it's only 1 sampler, but you'll likely need a conditional to get correct Y for your pixel
<anarsoul> and also some coords math
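The conditional and coordinate math mentioned for YUYV can be illustrated on the CPU. This is a sketch only: `yuyv_pixel` is a hypothetical helper, not shader code or a lima/mesa function, and it assumes a tightly packed buffer with an even width.

```python
def yuyv_pixel(buf, width, x, y):
    """Return (Y, U, V) for pixel (x, y) from a packed YUYV buffer.

    Each 4-byte group Y0 U Y1 V covers two horizontally adjacent pixels,
    so both pixels share U and V but each has its own luma sample.
    """
    # Coordinate math: round x down to the even pixel that starts the group,
    # then convert to a byte offset (2 bytes per pixel on average).
    group = (y * width + (x & ~1)) * 2
    y0, u, y1, v = buf[group:group + 4]
    # The conditional: even pixels take Y0, odd pixels take Y1.
    luma = y0 if x % 2 == 0 else y1
    return luma, u, v


if __name__ == "__main__":
    row = bytes([10, 100, 20, 200, 30, 110, 40, 210])  # one row, 4 pixels
    for px in range(4):
        print(px, yuyv_pixel(row, 4, px, 0))
```

A fragment shader sampling the same data as a half-width four-channel texture would need the equivalent of the parity test and the halved x coordinate, which is the extra work referred to above.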
<marex> anarsoul: but the hardware has two samplers, doesn't it ?
<anarsoul> marex: no, one sampler per instruction
<marex> but then, if I keep using different memory addresses per instruction, won't that have awful DRAM access pattern ?
<anarsoul> marex: it does fragment processing in 2x2 groups, so it'll be OK for I420 for sure
<anarsoul> 1 cache miss (for 1st fragment), 3 cache hits for fragments 2-4
<marex> and with packed YUV ?
<anarsoul> 1 cache miss, 1 cache hit?
<anarsoul> I'm not sure if it fetches whole cache line at once or not
<marex> anarsoul: the line is just 1 pixel ?
<marex> if it did fetch an entire line of pixels, then there would be 1 miss and multiple hits
<anarsoul> marex: I think cache line is 64 bytes
<anarsoul> but I'm not sure if mali prefetches whole cacheline
<marex> anarsoul: that would only make sense, since 64B is also nice for DDR DRAM
<marex> that's burst length
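The hit/miss guesses above can be explored with a toy model. This sketch assumes the 64-byte line size guessed in the discussion, a cold cache, and a plain linear (row-major) texture layout. None of these are confirmed for Mali hardware, which may tile textures in memory, so the numbers are illustrative only. `lines_touched` is a hypothetical helper.

```python
LINE = 64  # assumed cache line size in bytes (a guess from the discussion)


def lines_touched(x, y, stride, bytes_per_texel, scale_x=1, scale_y=1):
    """Count distinct cache lines read by the 2x2 fragment quad whose
    top-left fragment is (x, y), for one texture plane.

    stride is the plane's row pitch in bytes; scale_x / scale_y model
    subsampling (2, 2 for I420 chroma; 2, 1 for YUYV read as 4-byte texels).
    """
    lines = set()
    for dy in (0, 1):
        for dx in (0, 1):
            tx = (x + dx) // scale_x
            ty = (y + dy) // scale_y
            lines.add((ty * stride + tx * bytes_per_texel) // LINE)
    return len(lines)


if __name__ == "__main__":
    width = 1920
    print("I420 Y plane :", lines_touched(0, 0, width, 1))
    print("I420 U plane :", lines_touched(0, 0, width // 2, 1, 2, 2))
    print("YUYV packed  :", lines_touched(0, 0, width * 2, 4, 2, 1))
```

In this model a subsampled chroma plane gives one miss and three hits per quad, while any full-height plane in a linear layout touches at least two lines because the quad spans two rows. Real counts are better taken from the cache performance counters.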
<enunes> marex: just passing by but there are performance counters for the caches, if you had both implementations you could get some data using https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6077
<anarsoul> enunes: while you're here, can you re-test !16136 with ppir lowering in place, but just drop special handling of ppir_op_ddy? i.e. handle it as ppir_op_ddx
<anarsoul> I still think it's incorrect to completely drop it, since op_ddx and op_ddy apparently need both arguments
<enunes> anarsoul: I'll give it a try
<marex> enunes: thank you
enunes has quit [Ping timeout: 480 seconds]
enunes has joined #lima
drod has quit [Remote host closed the connection]
adjtm has joined #lima