<marcan>
should slow down the HV tick on at least the secondaries
<marcan>
that plus the BHL means it doesn't scale
<marcan>
I forget if I take the BHL to do local FIQ handling... I should probably move at least that to finer-grained locking, so the HV ticks don't fight each other. that'll at least make it scale.
<kevans91>
big hairy lock?
<marcan>
big hypervisor lock
<arnd>
I ran into a funny performance issue on the second half of the T6002. I noticed that a BLAS/GEMM benchmark is a lot slower on 10 cores of the M1 Ultra than on an M1 Max MBP. With some more investigation I found that the first 10 cores are about the same, but the second set of firestorm cores appears slower. Doing the test individually on each core shows 30 GFLOPS on any icestorm, 90 GFLOPS on the first eight firestorm cores, and 60 GFLOPS on the second eight.
<j`ey>
ok maybe it doesn't actually need any fixes, after a quick look
<arnd>
j`ey: what was the issue? I see the firestorm cores running at 3228000 kHz in /sys/devices/system/cpu/cpu12/cpufreq/cpuinfo_cur_freq during the test
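A quick way to cross-check every core at once, assuming the standard cpufreq sysfs layout (this loop is a sketch, not from the discussion):

    # print the reported frequency (in kHz) for each CPU
    for f in /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq; do
        echo "$f: $(cat "$f")"
    done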
<_jannau_>
looks like cpufreq doesn't work on the second die. m1n1 initializes all cores to 2 GHz (max for icestorm, 2/3 of max for firestorm)
<arnd>
ok, that explains it then
<arnd>
I guess cpuinfo_cur_freq shows what the kernel has requested, not what the hardware actually does
<_jannau_>
yes, 3.228 GHz is even wrong for the cores of the 1st die. that's only reached when the other cores of the cluster are in deep sleep
<arnd>
so it thinks that the cores on the second die always run at the same clock as the corresponding ones on the first. looking at cur_freq confirms that
<_jannau_>
that's wrong, looks like I botched the cpufreq dt
<jannau>
forgot to update the core's "apple,freq-domain = <&cpufreq_hw x>;" for the second die
<jannau>
should be 3, 4, 5 instead of 0, 1, 2
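A sketch of the intended mapping, assuming three frequency domains per die; the label names are illustrative (only cpu_p00_d1 comes up later in the discussion):

    /* die 0 clusters use domains 0, 1, 2; die 1 must use its own: */
    cpu_e00_d1 { apple,freq-domain = <&cpufreq_hw 3>; };  /* icestorm cluster      */
    cpu_p00_d1 { apple,freq-domain = <&cpufreq_hw 4>; };  /* 1st firestorm cluster */
    cpu_p10_d1 { apple,freq-domain = <&cpufreq_hw 5>; };  /* 2nd firestorm cluster */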
<arnd>
jannau, right, I just tried that locally
<arnd>
I had to change the reg-names as well, but now it seems to work
<arnd>
jannau: should I send you a proper patch, or do you just want to fold it in?
<arnd>
not sure if marcan ends up redoing it all anyway
<_jannau_>
I'll fold it in, we will probably end up redoing it anyway
<arnd>
ok
<arnd>
I had a typo in cpu_p00_d1: it needs to be 4, not 3
<arnd>
with the wrong number, CPU 12 dropped down to 18 GFLOPS...
<_jannau_>
thanks. strange that I still saw a 2x performance improvement in kcbench compared to m1 max
<j`ey>
redo the measurements later :D
<arnd>
jannau: what is the maximum frequency under constant load? If the m1 max gets throttled to the same 2GHz after a while, that still works out
<_jannau_>
3 GHz is the expected all-core full-load frequency for the performance cores. even in the macbook pro it should be sustainable for a longer period of time. we need to look at the actual core frequency
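One way to check the clock a core actually sustains, assuming perf can read the cycle counter (the busy loop is just a sketch):

    # pin a ~1 s busy loop to CPU 12 and count cycles;
    # cycles / elapsed time approximates the effective clock
    taskset -c 12 perf stat -e cycles -- timeout 1 sh -c 'while :; do :; done'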
<arnd>
it scales almost perfectly now: 591 GFLOPS for either set of 8 firestorm cores, 1171 GFLOPS for all 16
<arnd>
interestingly it goes down to 886 GFLOPS if I use all 20 cores including the small ones, but I think that's just blis/gemm not being aware of big/little cores, and making the big ones wait
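If the library can't be taught about big.LITTLE, pinning the run to the firestorm cores should avoid making the big cores wait; the core numbering and the benchmark name below are assumptions:

    # assumed numbering: icestorm 0-1 / 10-11, firestorm 2-9 / 12-19
    taskset -c 2-9,12-19 ./gemm_benchmark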
<jannau>
10% faster, 70s instead of 78s for make vmlinux with linux-5.15 arm64 defconfig. I guess I have to redo it on the m1 max as well
<j`ey>
it was 5mins on my m1 air D: (for 5.18ish defconfig)
<sven>
yay, nvmem (not to be confused with NVMe) which was even more boring than watchdog got merged :D
<j`ey>
sven: woot
<arnd>
jannau: on this many cores the total runtime is not all that meaningful, because half the build time is spent in single-threaded work like linking or parsing the Makefiles. it's often better to compare CPU time (user+system from /bin/time, or the output of perf stat) to see how much work the CPUs actually got done
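For instance, assuming GNU time is installed at /usr/bin/time and a -j matching the 20 cores:

    /usr/bin/time make -j20 vmlinux   # compare user+sys, not "real"
    perf stat -- make -j20 vmlinux    # or read the task-clock / CPUs-utilized lines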
<maz>
j`ey: 5 minutes for 'make vmlinux'? that's odd. it takes half that time on my mini (make -j9 vmlinux).
<j`ey>
I can't remember what -j I used now, less than 9 for sure. I also only have the 8GB ram model
<maz>
8GB is more than enough. the compiler used may have a significant impact though (I use GCC 10.2.1)
<j`ey>
I was having OOM issues with LLVM, so felt like being conservative :P will run more tests at some point. I was using llvm/clang
<j`ey>
(OOM issues *building* LLVM)
<j`ey>
maz: any news on your studio?
<maz>
j`ey: landing expected first half of May...
<j`ey>
oof
<mps>
j`ey: `real 2m 26.38s` on the mbp with `busybox time make -j9 vmlinux`, though with the current linux-asahi alpine config rather than defconfig
<j`ey>
hmmm real 4m 4.61s
<j`ey>
I can't imagine clang could be that much slower.. retesting with gcc
<mps>
what is the PAGE_SIZE of the running kernel?
<j`ey>
16
<mps>
for me, the complete pkg build (apk for alpine) takes about 4m 40s
<j`ey>
real 5m 51.31s, with gcc...
<arnd>
the time it takes to build a kernel can differ hugely based on a single Kconfig option such as CONFIG_DEBUG_INFO, the exact toolchain version, or how the compiler was built
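For an A/B comparison of a single option, the kernel's own scripts/config helper can toggle it in place (the -j value here is arbitrary):

    # same tree, same toolchain; only DEBUG_INFO differs between runs
    ./scripts/config --file .config -d DEBUG_INFO
    make olddefconfig
    /usr/bin/time make -j9 vmlinux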
<j`ey>
I'll try mps's config later, since we have the same gcc (from alpine)
<arnd>
j`ey: clang is usually some 10% to 20% slower than gcc for a defconfig build, but it's more sensitive to options that lead to larger indirect header inclusions
<arnd>
in the gcc source tree, run ./contrib/download_prerequisites to make it build local copies of the mpc/mpfr/isl/gmp libraries instead of the distro versions
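Roughly, for a from-source gcc build (the paths and configure flags here are placeholders, not from the discussion):

    cd gcc                              # a gcc source checkout
    ./contrib/download_prerequisites    # fetches mpc/mpfr/isl/gmp locally
    mkdir build && cd build
    ../configure --disable-multilib
    make -j"$(nproc)"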