#lima on 2022-04-25 — irc logs at oftc.irclog.whitequark.org

2022-03-22 11:56 ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4** GPUs - Kernel driver has landed in mainline, userspace driver is part of mesa - Logs at https://oftc.irclog.whitequark.org/lima/

05:14 dllud has joined #lima

05:14 dllud_ has quit [Read error: Connection reset by peer]

06:17 Daanct12 has joined #lima

07:44 enunes has joined #lima

10:54 Daanct12 has quit [Ping timeout: 480 seconds]

11:21 dllud_ has joined #lima

11:21 dllud has quit [Read error: Connection reset by peer]

11:25 dllud has joined #lima

11:25 dllud_ has quit [Remote host closed the connection]

14:23 adjtm has quit [Quit: Leaving]

16:56 drod has joined #lima

17:01 <anarsoul> marex: technically you still have to read the whole texture, regardless of its format. I'm not sure which one will be more performant, you need to do your own benchmarking

17:02 <anarsoul> marex: and don't forget about cache! :)

17:03 <anarsoul> utgard processes fragments in 2x2 groups, so cache hit rate should be good for planar yuv (and for packed yuyv)

18:18 <anarsoul> marex: I think you can compare fs shader length for both cases, the shortest will likely be faster

18:29 <marex> anarsoul: well for packed YUV, it would be literally load/mul

18:30 <marex> for planar YUV, it would be three loads, some swizzling, and then mul

18:30 <anarsoul> marex: and probably some coords math
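[editor's note] The "mul" step that both shader variants end with can be sketched on the CPU as follows. This is only an illustration: the log does not say which color matrix or range is in use, so the full-range BT.601 (JFIF) coefficients and the name `yuv_to_rgb` are assumptions.

```python
def yuv_to_rgb(y, u, v):
    """Convert one 8-bit YUV sample to 8-bit RGB.

    Assumes full-range BT.601 (JFIF) coefficients; the log does not
    state which matrix the actual shader would use.
    """
    d = u - 128  # chroma is stored with a +128 offset
    e = v - 128
    r = y + 1.402 * e
    g = y - 0.344136 * d - 0.714136 * e
    b = y + 1.772 * d

    def clamp(x):
        # Round to the nearest integer and clamp to the 8-bit range.
        return max(0, min(255, int(round(x))))

    return clamp(r), clamp(g), clamp(b)
```

In a fragment shader the same arithmetic would typically be a single matrix multiply plus an offset; the per-format difference discussed here is in how Y, U and V are fetched before this step.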

18:36 <marex> so I think planar yuv is not good

18:39 <anarsoul> marex: it's not so obvious to me. Basically i420 has only 1 u/v value per 4 y, so it'll use less memory bandwidth. So it'll depend on the shader
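[editor's note] The bandwidth argument can be made concrete by counting the bytes each format stores per frame. A minimal sketch, assuming 8-bit samples and no row padding or alignment; the function names are hypothetical:

```python
def i420_frame_bytes(width, height):
    """Bytes in one I420 frame: a full-resolution Y plane plus
    U and V planes subsampled by 2 in both directions."""
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return width * height + 2 * chroma_w * chroma_h


def yuyv_frame_bytes(width, height):
    """Bytes in one packed YUYV frame: every pair of horizontally
    adjacent pixels shares 4 bytes (Y0 U Y1 V)."""
    return ((width + 1) // 2) * 4 * height
```

For a 1920x1080 frame this gives 3110400 bytes for I420 and 4147200 bytes for YUYV, so I420 holds 1.5 bytes per pixel against 2 for YUYV. Whether that translates into less DRAM traffic on the GPU depends on caching and the access pattern, which is what the rest of the discussion is about.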

18:41 <anarsoul> marex: keep in mind that utgard pp is a VLIW architecture, so it does a lot of operations per instruction

18:42 <anarsoul> with I420 I guess the shortest it could get is 3 instructions, since it needs 3 samplers

18:42 <anarsoul> with YUYV it's only 1 sampler, but you'll likely need a conditional to get correct Y for your pixel

18:42 <anarsoul> and also some coords math
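[editor's note] The fetch patterns being compared can be sketched on the CPU like this: three lookups with scaled coordinates for I420, versus one lookup plus a conditional on the pixel's parity for YUYV. This is an illustration only, assuming tightly packed buffers with no row padding; the function names are hypothetical and this is not shader code for the hardware.

```python
def sample_i420(y_plane, u_plane, v_plane, width, x, y):
    """Return (Y, U, V) for pixel (x, y): one lookup per plane."""
    chroma_w = (width + 1) // 2
    luma = y_plane[y * width + x]
    # Coordinate math: chroma planes are half resolution in both axes.
    u = u_plane[(y // 2) * chroma_w + x // 2]
    v = v_plane[(y // 2) * chroma_w + x // 2]
    return luma, u, v


def sample_yuyv(buf, width, x, y):
    """Return (Y, U, V) for pixel (x, y): one lookup of a 4-byte
    macropixel (Y0 U Y1 V), then pick the right Y."""
    pairs_per_row = (width + 1) // 2
    base = (y * pairs_per_row + x // 2) * 4
    y0, u, y1, v = buf[base:base + 4]
    # The conditional: even pixels use Y0, odd pixels use Y1.
    luma = y0 if x % 2 == 0 else y1
    return luma, u, v
```

A small usage example: with `buf = bytes([10, 20, 30, 40])` and `width = 2`, `sample_yuyv(buf, 2, 0, 0)` returns `(10, 20, 40)` and `sample_yuyv(buf, 2, 1, 0)` returns `(30, 20, 40)`.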

18:43 <marex> anarsoul: but the hardware has two samplers, doesn't it ?

18:43 <anarsoul> marex: no, one sampler per instruction

18:43 <anarsoul> https://gitlab.freedesktop.org/panfrost/mali-isa-docs/blob/master/Utgard-PP.md

18:44 <marex> but then, if I keep using different memory addresses per instruction, won't that have awful DRAM access pattern ?

18:45 <anarsoul> marex: it does fragment processing in 2x2 groups, so it'll be OK for I420 for sure

18:45 <anarsoul> 1 cache miss (for 1st fragment), 3 cache hits for 2-4 fragments

18:46 <marex> and with packed YUV ?

18:48 <anarsoul> 1 cache miss, 1 cache hit?

18:48 <anarsoul> I'm not sure if it fetches whole cache line at once or not

18:54 <marex> anarsoul: the line is just 1 pixel ?

18:54 <marex> if it did fetch an entire line of pixels, then there would be 1 miss and multiple hits

18:56 <anarsoul> marex: I think cache line is 64 bytes

18:56 <anarsoul> but I'm not sure if mali prefetches whole cacheline

19:00 <marex> anarsoul: that would only make sense, since 64B is also nice for DDR DRAM

19:00 <marex> that's burst length

19:27 <enunes> marex: just passing by but there are performance counters for the caches, if you had both implementations you could get some data using https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6077

19:48 <anarsoul> enunes: while you're here, can you re-test !16136 with ppir lowering in place, but just drop special handling of ppir_op_ddy? i.e. handle it as ppir_op_ddx

19:49 <anarsoul> I still think it's incorrect to completely drop it, since op_ddx and op_ddy apparently need both arguments

19:50 <enunes> anarsoul: I'll give it a try

20:04 <marex> enunes: thank you

21:49 enunes has quit [Ping timeout: 480 seconds]

22:17 enunes has joined #lima

22:33 drod has quit [Remote host closed the connection]

22:37 adjtm has joined #lima