alyssa changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard + Bifrost + Valhall - Logs - I don't know anything about WSI. That's my story and I'm sticking to it.
alyssa has joined #panfrost
<alyssa> cphealy: Did you get a chance to test the shadowing fix? thank you
<alyssa> Oh, you did, didn't see the email, whoops sorry
<alyssa> 1/2 perf drop on that benchmark ... that's very unfortunate :|
<cphealy> alyssa: I'm pretty confident that the tests I ran are valid. I used the latest released glmark2-es2-wayland with TOT Mesa.
<cphealy> Though the fact that I don't see any of the benefit you mentioned gives me a little concern that I did something wrong.
<cphealy> Which specific glmark2 benchmark would you expect to improve?
<alyssa> cphealy: right
<alyssa> The new solution will use somewhat more CPU in exchange for a lot less GPU on certain workloads
<alyssa> I can easily see that being a huge win for me on RK3399 (fast CPU, slow GPU) but not so much on your board (fast GPU, slow CPU)...
<cphealy> When you tested, were you using a SoC with big ARM cores?
<alyssa> Ye
<cphealy> Any chance you can re-run on RK3399 with the big cores turned off?
<alyssa> Any hint how to do that? :)
<cphealy> Not yet, give me a few min though.. ;-)
<alyssa> thanks
<alyssa> (won't be able to for ~45 minutes, no rush)
<alpernebbi> taskset -c 0-3 or echo 0 | sudo tee /sys/devices/system/cpu/cpu{4,5}/online might work
<alyssa> alpernebbi: thanks!
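(For the record, the two approaches above amount to the sketch below; cpu0-3 little / cpu4-5 big is the RK3399 layout, and the `echo` is a stand-in for the actual benchmark.)

```shell
# 1) Affinity only: pin the process to the little cores (cpu0-3 on RK3399).
#    taskset ships with util-linux; the echo stands in for the real workload.
if command -v taskset >/dev/null 2>&1; then
  taskset -c 0 echo "pinned to cpu0"
else
  echo "taskset (util-linux) not installed"
fi

# 2) Hot-unplug the big cores entirely (needs root, so shown but not run;
#    undo by writing 1 back):
#   echo 0 | sudo tee /sys/devices/system/cpu/cpu{4,5}/online
```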
<cphealy> alpernebbi: you beat me to it!
<cphealy> CPU numbering differs between platforms, so you should first determine which cores are the big ones to know which cpu numbers to disable.
<alyssa> yeah, but alpernebbi has the same machine I do :)
<cphealy> Ahh, it's probably the right answer then... ;-)
<alyssa> alpernebbi: Also, boo, I made it years without thinking "hexacore" and almost managed to forget it ;)
<cphealy> For other platforms, one can check the "cpu_capacity" sysfs for each CPU core to see which ones have the higher capacity.
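(On kernels that get capacities from the devicetree, that check is a one-line loop over sysfs; `cpu_capacity` only exists on asymmetric systems, so this prints nothing but the trailing marker on SMP machines.)

```shell
# Print each CPU's relative capacity; big cores report larger numbers.
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  if [ -r "$c/cpu_capacity" ]; then
    echo "$(basename "$c"): $(cat "$c/cpu_capacity")"
  fi
done
echo "capacity scan complete"
```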
<alpernebbi> a bit sleepy, so I thought "what is hexacore, some genre of music" for a moment
<alyssa> hahaha
<alyssa> "hex core" maybe?
<alyssa> I can't remember the silly marketing back when rk3399 was new and shiny
<cphealy> 2 big cores and 4 little cores
<alpernebbi> hex core sounds like some magic artifact
<cphealy> that's hexacore
<alpernebbi> cphealy: yeah latin for 6 or something
<alpernebbi> just having a sleepy moment
<cphealy> ;-)
<alpernebbi> btw I never tried the sysfs one, but I did use taskset -c 4,5 for qemu and it was enough for me
<alyssa> Curious
<alyssa> with performance governors for cpu/gpu on rk3399, everything on the system is super responsive
<alyssa> so I'm wondering maybe the kernel scheduling (for both CPU and GPU) is just crap on this machine and that's why stuff is so janky most of the time
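(Whatever the distro default is, the active governor can be inspected, and with root switched to performance, through the standard cpufreq sysfs; a sketch, with the write left commented out.)

```shell
# Read cpu0's current cpufreq governor; the same path exists per CPU.
g=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$g" ]; then
  cat "$g"
else
  echo "cpufreq sysfs not available"
fi
# To pin the highest OPP (needs root):
#   echo performance | sudo tee "$g"
```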
<robmur01> A lot of responsiveness on a not-very-busy system can be down to interrupt handlers running on idle CPUs, which are thus clocked right down, further confounded by CPU0 often being the weediest little CPU yet bearing the brunt of most default affinity
<robmur01> tricky problem to solve well with software-controlled DVFS
<robmur01> try punting IRQ affinity for things that matter to the big cores, which will do a lot better even at their lowest freq
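(Concretely that means finding the interrupt line in /proc/interrupts and writing the big-core range to its smp_affinity_list; "panfrost" as the IRQ name is an assumption here, so check your own board's listing. The root-only write is shown commented out.)

```shell
# Look up the GPU interrupt; adjust the pattern for your platform.
grep -i panfrost /proc/interrupts || echo "no panfrost IRQ on this machine"
# Then steer it to the big cores (cpu4-5 on RK3399), as root:
#   echo 4-5 | sudo tee /proc/irq/<N>/smp_affinity_list
```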
<cphealy> If only I had big cores in my SoC... ;-)
<alyssa> robmur01: nod... I guess the combination of software DVFS and software big.little is a mess
<robmur01> interrupts are basically the most pathological form of a bursty workload
<robclark> what are you comparing performance gov to? IME schedutil needs a lot of hinting from userspace about what tasks are important to move to big cores, vs what are not time-critical.. android has a lot of cgroup+uclamp stuff around that
<alyssa> robclark: whatever debian's default is
<alyssa> I suppose I should be grateful mainline+Debian works on this machine at all ;-D
<robmur01> probably doesn't help if the CPU has time to go idle and clock down while the GPU is busy for a frame, and vice versa. Any kind of scaling algorithm is liable to need different tuning for different workloads
<robclark> heh, you think that is hard to get right.. now move that game into a VM ;-)
<alyssa> oof
<alyssa> cphealy: Reproduced the glmark2 unhappiness with the shadowing stuff on RK3399
<anarsoul> alyssa: fix is in the works? :)
<alyssa> anarsoul: Still trying to understand
<alyssa> Lot of time spent in memcpy now
<alyssa> I guess that makes sense
<alyssa> buffer is 290816 bytes and it's shadowed 4x each frame
<alyssa> so just over 1MB of memcpying every frame
<alyssa> versus just under 2MB of copying incurred from flushing
<alyssa> so less overall system memory bw, but more visible because it's on the CPU now
<alyssa> I guess?
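(The arithmetic checks out: four shadow copies of that buffer per frame is just over a mebibyte.)

```shell
# 290816-byte buffer, shadowed 4x per frame:
echo $((290816 * 4))                                    # -> 1163264 bytes
awk 'BEGIN { printf "%.2f MiB\n", 290816 * 4 / 1048576 }'   # -> 1.11 MiB
```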
<robclark> alyssa: you should try TC plus allow_cpu_storage
<cphealy> is that CPU memcpying done using NEON instructions?
<alyssa> robclark: Plumbing in TC in the next few hours before the bpoint seems hard ;-)
<robclark> heh, well..
<robclark> cphealy: I guess the issue is probably readback from writecombine buffers
<alyssa> OOI, does TC help in real workloads?
<alyssa> as opposed to viewperf
<robclark> yeah.. and in particular the cpu-storage trick for shadowing buffers is nice because you are memcpy from cached/malloced to WC gpu buffers instead of WC->WC
<alyssa> oh that would solve this nicely
<alyssa> cphealy: You interested? ;-p
<robclark> the case I see TC hurt are really more just scheduler issues.. scheduler interprets light load split over two threads as "these completely independent threads aren't heavily loaded" without realizing the association between the two
<cphealy> ha, you wouldn't want me writing NEON code. Just curious if CPU memcpy could be faster with Panfrost.
<alyssa> cphealy: found the actual issue though
<alyssa> thank you for your diligent benchmarking, this would've slipped through otherwise!
<cphealy> No problem
<cphealy> We are a team on Panfrost now
<alyssa> :-D
<alyssa> OK, with these patches, the "interleaved=true, map" case is doubled in perf
<alyssa> but the non-interleaved map is hurt a little bit and the non-interleaved subdata is halved in perf
<alyssa> ~~averages out though~~ investigating the non-interleaved case now that I have a better idea what's going on
<alyssa> right.. The subdata case is going to suck without the TC optimization
<robmur01> or rather; probably not, until now
* robmur01 can't read diffs properly
* alyssa is unsure there's much to be done to help with the subdata case
<greenjustin> Surprised there's no prfm in this memcpy implementation
<greenjustin> Suppose it doesn't matter much if you're copying from coherent memory though
<robmur01> generally prfm does more harm than good for simple access patterns that the stride prefetcher can deal with itself
<alyssa> scratch that... this is supposed to work ok
<greenjustin> That's fair, especially on older chips where prfm doesn't co-issue for free
<alyssa> but apparently having some spooky action at a distance
<robmur01> I think pretty much everything since Cortex-A9, bar original ThunderX, has a competent stride prefetcher
<alyssa> 100% gdb cpu usage delight
<alyssa> something is seriously broken here
<alyssa> oh, nvm, user error
<alyssa> whole bunch of vbo's. right.
<alyssa> yeah, I really don't see what the driver can do here without TC
<anarsoul> what is TC?
<alyssa> threaded context
<alyssa> cphealy: Pushed a new version of the resource shadowing fix
<alyssa> The subdata case may be slower but the other cases should be faster
<alyssa> and even if they're not, I'm inclined to land given the massive perf improvement on real workloads (i.e. not a glmark2 case designed specifically to emulate poorly written old apps)
<alyssa> Let me know how perf is with that for you
<alyssa> (massive win on RK3399 anyway)
<alpernebbi> yay for rk3399 wins!
<alyssa> alpernebbi: :D
<cphealy> alyssa: I'll give it a try in a few, tnx!
<alyssa> +1
<alyssa> stepri01: Unfortunately, our UAPI build problems aren't solved yet :(
<alyssa> This is needed to fix the C++ build
<alyssa> Of course I'm not supposed to land that Mesa change without first landing in the kernel
<alyssa> I don't have a current kernel tree checked out and don't have the disk space to spare on this machine to fix that for a 1 line patch
<alyssa> so I would appreciate it if you could write the obvious 1 line fix (as there), add my reviewed-by, and push to drm-misc-fixes as before
<alyssa> (ideally by Tuesday so the Mesa side fix makes it into 22.3-rc1)
<alyssa> Thank you :)
alyssa has quit [Quit: leaving]