ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard + Bifrost + Valhall - Logs https://oftc.irclog.whitequark.org/panfrost - I don't know anything about WSI. That's my story and I'm sticking to it.
<jdavidberger>
htop shows 3 idle cores and one that maxes out at like 5-10%. It's only doing ~10 dispatches over a few seconds for the test, so I don't think the CPU is slowing it down. It gets very, very close to 25 if nothing else is running and the shaders are cached and everything, which is suspiciously close to 32 * 800MHz
<cphealy>
That is definitely suspiciously close to 32*800MHz!
<cphealy>
What method are you using to come up with the value of 38 GFLOPS?
<jdavidberger>
From the datasheet, a gen1 G52 is 32 operations per cycle but gen2 should be 48 operations per cycle, so 48 * 800MHz = 38.4 GFLOPS
<cphealy>
"FP32 Operations/Cycle"?
<HdkR>
And then you learn that 32 operations/cycle != 32 FP32 operations/cycle
<cphealy>
How are you interpreting the 32/48 value for Mali G52 as gen1 and gen2?
<cphealy>
I wasn't aware that there are different gens of G52.
<jdavidberger>
HdkR: True; but that linked datasheet lists it as FP32 operations/clock
<jdavidberger>
cphealy: That is pretty unclear in the datasheet, but the hardware revision is r1p0, and the max thread count reported by the GPU matches the 768 number for the RK3566
<cphealy>
I think the 32/48 is independent of r1p0 vs some other revision. My understanding is that a newer revision is more along the lines of the same design with bugs fixed; basically the same as Cortex-A55 r1p0 vs Cortex-A55 r1p1, as an example.
<jdavidberger>
Maybe -- the official documentation from ARM leaves something to be desired here, for sure. But the GPU reporting 768 threads available via the drmIoctl call makes me think it's the second gen, and I gotta think the 32->48 bump comes from the max threads going from 512->768
<cphealy>
I have a different theory on the "FP32 Operations/Cycle" reporting 32/48: If you look at the top of that table, you will see an entry for "Arithmetic Units". With every GPU that has two numbers for "FP32 Operations/Cycle", you will see two numbers for "Arithmetic Units", so the correct FP32 Operations/Cycle number is likely tied directly to how many arithmetic units the GPU has.
<cphealy>
It could be that you have a G52 with 2 arithmetic units as opposed to 3.
<cphealy>
According to this datasheet: https://www.boardcon.com/download/Rockchip_RK3566_Datasheet_V1.1.pdf the GPU is an ARM Mali G52 1-Core-2EE. That would mean you have a single shader core with 2 execution units, so 32 is the correct number for FP32 Operations/Cycle.
<jdavidberger>
That makes sense. The 768-thread figure really threw me off in that chart. It's a bummer, but I'm glad I only spent one day trying to talk it into going faster
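Putting numbers on the exchange above: if the per-EE throughput implied by the table is 16 FP32 operations per execution engine per cycle (32 for the 2EE configuration, 48 for 3EE), the peak figures work out as in this back-of-the-envelope C sketch. The clock and EE counts are the ones quoted in the conversation; treat the per-EE figure as an inference from those numbers rather than an ARM-documented constant.

    /* Back-of-the-envelope peak FP32 throughput for a Mali G52 configuration.
     * Assumption (inferred from the figures above): each Execution Engine
     * retires 16 FP32 operations per cycle, so 1 core with 2 EEs gives 32
     * ops/cycle and a 3EE part gives 48. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_hz        = 800e6; /* RK3566 G52 clock quoted above */
        const int    shader_cores    = 1;     /* "G52 1-Core-2EE" per the datasheet */
        const int    ees_per_core    = 2;
        const int    fp32_ops_per_ee = 16;    /* assumption: 32 / 2EE = 48 / 3EE */

        double gflops = shader_cores * ees_per_core * fp32_ops_per_ee * clock_hz / 1e9;
        printf("theoretical peak: %.1f GFLOPS\n", gflops); /* 25.6 here; 38.4 for a 3EE part */
        return 0;
    }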
hanetzer has joined #panfrost
<cphealy>
;-)
<jdavidberger>
I have the application code I need to run specced out at ~20 GFLOPS. Hopefully I'm not that far off. Thanks for the help; it's good not to spin my wheels for nothing
camus has joined #panfrost
alpernebbi has quit [Ping timeout: 480 seconds]
alpernebbi has joined #panfrost
wicastC has quit [Remote host closed the connection]
wicastC has joined #panfrost
davidlt_ has joined #panfrost
davidlt__ has joined #panfrost
rcf has quit [Quit: WeeChat 3.8-dev]
hanetzer1 has joined #panfrost
rcf has joined #panfrost
hanetzer has quit [Ping timeout: 480 seconds]
davidlt_ has quit [Ping timeout: 480 seconds]
davidlt__ has quit [Ping timeout: 480 seconds]
hanetzer2 has joined #panfrost
hanetzer1 has quit [Ping timeout: 480 seconds]
guillaume_g has joined #panfrost
guillaume_g has quit []
furry has joined #panfrost
furry has left #panfrost [#panfrost]
jdavidberger has quit [Quit: Leaving.]
rasterman has joined #panfrost
chewitt has quit [Quit: Zzz..]
<robmur01>
cphealy: spot on - G52r0 implicitly has 3EE shader cores, G52r1 is configurable for 2EE or 3EE
<HdkR>
q4a: They support the extension, yes. The hardware converts them to compute shaders.
<HdkR>
Well, the driver converts them to compute shaders :P
<robmur01>
right, just because the *driver* reports a capability doesn't mean it has to be implemented in hardware. Take llvmpipe, for instance ;)
<q4a>
then how do you check whether it's a software or hardware implementation?
<HdkR>
It's all compute shader baby
<HdkR>
No Mali supports GS in hardware :)
MajorBiscuit has joined #panfrost
guillaume_g has joined #panfrost
guillaume_g has quit []
TheKit[m] has joined #panfrost
jelly has quit [Read error: Connection reset by peer]
jelly-hme has joined #panfrost
rcf1 has joined #panfrost
rcf has quit [Quit: WeeChat 3.4.1]
jdavidberger has joined #panfrost
alyssa has joined #panfrost
<alyssa>
Has MRT + blend shaders been broken on Midgard this entire time?
<alyssa>
Answer is likelier than you'd think
<cphealy>
robmur01: Are there readable registers in the GPU that expose how many AUs each shader core has? Also, are there readable registers in the GPU that expose how many shader cores the GPU has?
<stepri01>
shader cores is easy - SHADER_PRESENT is a bitmap of which cores are implemented, so the number of bits set is the number of shader cores
<stepri01>
number of AUs is stored in CORE_FEATURES (on GPUs where it means something)
<stepri01>
(or Execution Engines to use the correct term)
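A minimal userspace sketch of the two queries stepri01 describes, assuming the Panfrost DRM UAPI exposes them as DRM_PANFROST_PARAM_SHADER_PRESENT and DRM_PANFROST_PARAM_CORE_FEATURES via DRM_IOCTL_PANFROST_GET_PARAM (header install path and error handling simplified):

    /* Count shader cores by popcounting SHADER_PRESENT and dump CORE_FEATURES,
     * using the Panfrost GET_PARAM ioctl. Param names and header path follow
     * the panfrost_drm.h UAPI as understood here; adjust the render node as
     * needed for your system. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <xf86drm.h>              /* drmIoctl(), from libdrm */
    #include <libdrm/panfrost_drm.h>  /* install path may vary by distro */

    static int get_param(int fd, unsigned int param, unsigned long long *value)
    {
        struct drm_panfrost_get_param gp = { .param = param };
        int ret = drmIoctl(fd, DRM_IOCTL_PANFROST_GET_PARAM, &gp);
        *value = gp.value;
        return ret;
    }

    int main(void)
    {
        int fd = open("/dev/dri/renderD128", O_RDWR);
        unsigned long long shader_present = 0, core_features = 0;

        if (fd < 0 || get_param(fd, DRM_PANFROST_PARAM_SHADER_PRESENT, &shader_present))
            return 1;
        get_param(fd, DRM_PANFROST_PARAM_CORE_FEATURES, &core_features);

        /* SHADER_PRESENT is a bitmap of implemented cores: popcount it. */
        printf("shader cores: %d\n", __builtin_popcountll(shader_present));
        printf("CORE_FEATURES: 0x%llx\n", core_features);
        return 0;
    }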
<cphealy>
stepri01: Execution Engines is equivalent to Arithmetic Units in public ARM Mali datasheet vernacular, correct?
<alyssa>
stepri01: FWIW, userspace does want to know the clock speed for clinfo...
<alyssa>
right now I hardcode 800MHz...
<alyssa>
IDK what any app can actually do with the information lol
<stepri01>
alyssa: Yes I know - I tried to argue against providing the data (because it's almost certainly useless) but I just got pointed to the spec and didn't have much of an argument :(
<stepri01>
hardcoding a random number seems like a good idea
<robclark>
stepri01, alyssa: drm/msm exposes max clk.. I use it for things like calculating % utilization from perfcntrs.. IMO it is a perfectly reasonable thing to expose to userspace
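As a sketch of the calculation robclark is describing (names are illustrative, not any particular driver's API): divide busy cycles from a perf counter by the cycles the GPU could have executed at its maximum clock over the same window.

    /* Rough GPU utilization estimate from a busy-cycles perf counter and the
     * max clock exposed by the kernel. Note stepri01's caveat below: with DVFS
     * the GPU may be running below max_clk_hz, so this understates how busy it
     * is relative to its *current* clock. */
    static double gpu_utilization_pct(unsigned long long busy_cycles,
                                      double max_clk_hz, double window_seconds)
    {
        double possible_cycles = max_clk_hz * window_seconds;
        return possible_cycles > 0.0 ? 100.0 * (double)busy_cycles / possible_cycles
                                     : 0.0;
    }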
<stepri01>
robclark: The real problem is that "no idea" isn't an allowed response, and in some situations the kernel really doesn't know
<stepri01>
It's also badly specified as things like DVFS mean that it's not the actual frequency
<robclark>
surely the kernel knows the max freq.. it doesn't have to report the current freq, only the max
<stepri01>
only if it is actually managing the clocks. On an FPGA platform it might not be known, and with a software model there isn't necessarily such a thing as a clock
<stepri01>
so you end up with a "lie with a hardcoded number" path in the driver and wonder what exactly you gain by trying to give a real number on real hardware
<stepri01>
beyond the specific case of profiling using hardware-specific counters, the number is useless and no application should use it
<stepri01>
so why it exists in a supposedly hardware-agnostic spec is beyond me
stepri01 has quit [Quit: leaving]
<robclark>
yeah, not sure why it is in the spec.. but I don't think weird developer-only edge cases would convince me that it isn't something the kernel should expose
p0g0 has quit [Ping timeout: 480 seconds]
<alyssa>
especially given that neither FPGAs nor software models are available to us mere mortals
davidlt has joined #panfrost
<jdavidberger>
Is there some concept of max invocations on Bifrost that is possibly less than just GL_MAX_COMPUTE_WORK_GROUP_COUNT * GL_MAX_COMPUTE_WORK_GROUP_SIZE?
<jdavidberger>
Specifically, when I run glDispatchCompute(65535,1,1) with a local size of {256,1,1} I expect 0xFFFF00 invocations, and I'm seeing ... I think 0x2AAA80 invocations based on timings; hard to tell, but definitely less
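One way to pin this down without guessing from timings is to count the invocations in the shader with an atomic on an SSBO. A rough GLES 3.1 sketch follows (EGL/context setup and program compilation omitted; `prog` is assumed to be a compute program built from the shader string below):

    #include <GLES3/gl31.h>
    #include <stdio.h>

    /* Compute shader that just counts how many invocations actually ran. */
    static const char *counter_src =
        "#version 310 es\n"
        "layout(local_size_x = 256) in;\n"
        "layout(std430, binding = 0) buffer Counter { uint invocations; };\n"
        "void main() { atomicAdd(invocations, 1u); }\n";

    void count_invocations(GLuint prog)
    {
        GLuint buf, zero = 0;
        glGenBuffers(1, &buf);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, buf);
        glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(zero), &zero, GL_DYNAMIC_READ);
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buf);

        glUseProgram(prog);
        glDispatchCompute(65535, 1, 1);
        glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_BUFFER_UPDATE_BARRIER_BIT);

        GLuint *count = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, sizeof(GLuint),
                                         GL_MAP_READ_BIT);
        /* Expect 65535 * 256 = 16776960 (0xFFFF00) if every invocation ran. */
        printf("invocations: %u\n", count ? *count : 0u);
        glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    }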
davidlt has quit [Ping timeout: 480 seconds]
jelly-hme is now known as jelly
jdavidberger has quit [Ping timeout: 480 seconds]
jdavidberger has joined #panfrost
davidlt has joined #panfrost
davidlt_ has joined #panfrost
davidlt has quit [Ping timeout: 480 seconds]
davidlt_ has quit [Ping timeout: 480 seconds]
Guest3758 has joined #panfrost
<alyssa>
jdavidberger: Check dmesg; any chance the job is timing out?