ChanServ changed the topic of #lima to: Development channel for open source lima driver for ARM Mali4xx GPUs - Kernel driver has landed in mainline, userspace driver is part of mesa - Logs at https://oftc.irclog.whitequark.org/lima/
<anarsoul[m]>
Nope. We do not transform the program in regalloc (besides spilling registers)
Daanct12 has joined #lima
<enunes>
anarsoul: in addition to these, I had a simple copy propagation pass which removed a few more movs; it could also include modifier folding while doing that. I had it in an MR a few months ago but dropped it from there to unblock things and never actually pushed another MR for it, can push it again
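As a rough illustration of the kind of pass enunes describes (not the actual lima code), here is a copy propagation sketch in C over a made-up, SSA-like IR: every plain mov has the uses of its destination rewritten to read its source, after which the mov is dead and can be dropped. Modifier folding would merge neg/abs into the users at the same rewrite point instead of skipping modified movs. All types and names below are hypothetical.

/* Minimal, self-contained copy propagation sketch over a toy IR.
 * The types (toy_instr, toy_src, ...) are hypothetical, not ppir. */
#include <stdbool.h>
#include <stddef.h>

enum toy_op { TOY_OP_MOV, TOY_OP_ADD, TOY_OP_MUL };

struct toy_src {
   int reg;        /* register number read by this source */
   bool negate;    /* source modifiers */
   bool absolute;
};

struct toy_instr {
   enum toy_op op;
   int dest;              /* register number written */
   struct toy_src src[2];
   int num_srcs;
   bool dead;             /* marked for removal */
};

/* Propagate plain movs: every later read of mov->dest becomes a read of
 * mov->src[0], then the mov is marked dead.  Single block only, and it
 * assumes each register is written once (SSA-like) to keep the sketch
 * short.  Folding negate/absolute into the users is where modifier
 * folding would plug in; here modified movs are simply skipped. */
void copy_propagate(struct toy_instr *instrs, size_t count)
{
   for (size_t i = 0; i < count; i++) {
      struct toy_instr *mov = &instrs[i];
      if (mov->op != TOY_OP_MOV)
         continue;
      if (mov->src[0].negate || mov->src[0].absolute)
         continue;

      for (size_t j = i + 1; j < count; j++) {
         struct toy_instr *use = &instrs[j];
         for (int s = 0; s < use->num_srcs; s++) {
            if (use->src[s].reg == mov->dest)
               use->src[s].reg = mov->src[0].reg;
         }
      }
      mov->dead = true;
   }
}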
<anarsoul[m]>
sure, sounds good
<enunes>
anarsoul: for 3) I think the challenge was blocks with discard
<anarsoul[m]>
well, it produces an extra mov whenever the store_output source is not an SSA value
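A sketch of why a non-SSA source costs a mov here (made-up names, not lima code): an SSA def has exactly one writer, so that writer can be retargeted to write the output register directly, whereas a general register may have several writers and therefore needs an explicit copy at the store point.

#include <stdbool.h>
#include <stddef.h>

struct fake_value {
   bool is_ssa;                /* single static assignment? */
   struct fake_instr *parent;  /* sole writer when is_ssa is true */
};

struct fake_instr {
   int dest_reg;
};

/* Emit the store: either retarget the unique producer or insert a mov. */
void lower_store_output(struct fake_value *src, int output_reg,
                        void (*emit_mov)(int dst, struct fake_value *s))
{
   if (src->is_ssa && src->parent) {
      /* Single writer: make it write the output register directly. */
      src->parent->dest_reg = output_reg;
   } else {
      /* Register source: no unique writer to retarget, so an extra mov
       * into the output register is needed at the store point. */
      emit_mov(output_reg, src);
   }
}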
<anarsoul[m]>
btw, I also implemented using the combiner for vector multiplication when one of the sources is a scalar; it improves instruction count for some shaders, but regresses others due to increased register pressure
<anarsoul[m]>
shader-db still says that instructions are helped
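One way the instruction-count vs. register-pressure trade-off could be handled is to gate the combining on an estimate of live registers. The sketch below is purely illustrative: MAX_PP_REGS, fake_mul and the "one extra live register" model are assumptions, not lima/ppir code.

#include <stdbool.h>

#define MAX_PP_REGS 6 /* assumed register budget, for illustration only */

struct fake_mul {
   bool src_is_scalar;   /* one multiplication source is a scalar */
};

/* The caller passes how many registers are live across the multiply;
 * combining extends the scalar's live range, modeled here as needing
 * one extra live register. */
bool should_use_combiner(const struct fake_mul *mul, int live_regs)
{
   if (!mul->src_is_scalar)
      return false;
   return live_regs + 1 <= MAX_PP_REGS;
}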
<anarsoul[m]>
I ran glmark2-es on 24.3.4 and on git main to compare, but it looks like the glmark shaders didn't get much improvement. It's still noticeable though
<anarsoul[m]>
we probably should start using fp16 for varyings. It would halve memory bandwidth requirements for geometry-heavy workloads and would likely give a noticeable performance improvement
<anarsoul[m]>
tried it, it's a really small change but unfortunately no measurable performance difference in glmark2
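Back-of-the-envelope arithmetic behind the bandwidth claim: storing varyings as fp16 instead of fp32 halves the per-vertex varying footprint. The vertex and varying counts below are made up for illustration.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
   const uint64_t vertices      = 1000 * 1000; /* hypothetical vertex count */
   const uint64_t varying_vec4s = 4;           /* hypothetical varying count */
   const uint64_t comps         = varying_vec4s * 4;

   uint64_t fp32_bytes = vertices * comps * sizeof(float); /* 4 bytes/comp */
   uint64_t fp16_bytes = vertices * comps * 2;             /* 2 bytes/comp */

   printf("fp32 varyings: %llu MiB per frame\n",
          (unsigned long long)(fp32_bytes >> 20));
   printf("fp16 varyings: %llu MiB per frame\n",
          (unsigned long long)(fp16_bytes >> 20));
   return 0;
}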
Daanct12 has quit [Quit: WeeChat 4.5.1]
chewitt has quit [Quit: Zzz..]
<anarsoul>
enunes: so for 3) discard is indeed the challenge, and we cannot get rid of the mov here; however we also create a mov for ppir_op_dummy, which is usually a register load
<anarsoul>
that one can be optimized if the "end" block's only instruction is this mov
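The condition anarsoul describes could look roughly like the check below. The block/node types are made up for illustration, not the real ppir structures: when the "end" block contains nothing except the mov created for ppir_op_dummy (typically a register load), that extra copy could be folded away.

#include <stdbool.h>
#include <stddef.h>

enum fake_op { FAKE_OP_MOV, FAKE_OP_OTHER };

struct fake_node {
   enum fake_op op;
   struct fake_node *next; /* next node in the same block */
};

struct fake_block {
   struct fake_node *first; /* head of this block's node list */
};

/* True when the end block's only instruction is a mov, i.e. the case
 * where the copy could be dropped and the register loaded directly at
 * the use site instead. */
bool end_block_is_single_mov(const struct fake_block *end)
{
   const struct fake_node *n = end->first;
   return n != NULL && n->op == FAKE_OP_MOV && n->next == NULL;
}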