marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
DarkShadow44 has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]
DarkShadow44 has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
Emantor has quit [Quit: ZNC - http://znc.in]
Emantor has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
<bloom> piles of ALU
<bloom> dougall: hey, I have a reversing question for you
<bloom> I'd like to know what the thresholds are for register usage --> thread occupancy
<bloom> The cmdstream just specifies register count/4, but that seems more fine grained than real hw would implement
<bloom> But! Apple docs talk about occupancy perf counters?
<bloom> So it ought to be possible to figure it out from the other direction like you've done with the cpu?
<bloom> Oh, even better maxTotalThreadsPerThreadgroup is literally in the Metal API. nice!
<bloom> "Advanced Metal Shader Optimization" from WWDC'16 is still relevant :)
<bloom> "the shader cores... feature a constant execution and prefetch"
<dougall> hmm... yeah, i've been wanting to get at those counters, but i'm not quite sure how to approach it
<dougall> i think a granularity of four is possibly correct... register granularity is observable by using out-of-range register ids. occupancy would be 'floor(total / round_up(register_count))' right?
<bloom> Possibly, possibly not.
<bloom> Granularity of 4 is almost certainly the allocation of registers
<bloom> but that might not be the allocation of threads
<bloom> (that could be granularity of 8 instead, for example. etc)
Necrosporus has quit [Ping timeout: 252 seconds]
robinp has quit [Read error: Connection reset by peer]
robinp has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
Necrosporus has joined #asahi-gpu
phiologe has quit [Ping timeout: 250 seconds]
phiologe has joined #asahi-gpu
<bloom> note to self: r5 appears preloaded with vertex id in vs
<bloom> a bit of a hunch looking at a funny shader I compiled -- looks like integer ALU issues 2 instructions at once, maybe?
<bloom> the way it schedules the scalarization of vec4 arithmetic is suggestive
<bloom> dougall: Oh, ouch, embarrassing - misread the sqrt op
<bloom> the thing I called sqrt, call it f(x)
<bloom> it's actually sqrt(x) = f(x) * x
<bloom> so it's in fact rsqrt
<bloom> but we already.. had rsqrt
<dougall> "Implemented as x * rsqrt(x) with special cases handled correctly" - so i guess how does sqrt(x) differ from x * rsqrt(x)?
<dougall> (aside from precision)
<bloom> Probably the usual suspects: NaN, Inf, signed zero
<bloom> rsqrt(0) is probably NaN and NaN * 0 = NaN, yet sqrt(0) = 0
<bloom> so rsqrt_special has to define rsqrt(0) to be finite (which is wrong)
<dougall> ah, yeah, that'd make sense :)
<bloom> also need sqrt(-0.0) = -0.0
<bloom> which holds if we set rsqrt(-0.0) = 0.0 since 0.0 * -0.0 = -0.0
<bloom> likewise, we want sqrt(+inf) = +inf
<bloom> but rsqrt(+inf) = 0.0 and 0.0 * +inf = NaN (indeterminate form)
<bloom> So rsqrt_special(+inf) needs to be some positive number.
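[Note] The special-case reasoning above can be checked with a small Python model. This is only a sketch: `rsqrt` follows the IEEE 754 convention (where 1/sqrt(+0) is +inf rather than NaN, though the product with 0 is NaN either way), and `rsqrt_special` is a hypothetical stand-in for whatever the hardware op really returns. The chosen values 0.0 and 1.0 are assumptions; any finite value at zero and any positive finite value at +inf would do.

```python
import math

def rsqrt(x: float) -> float:
    """Plain reciprocal square root with IEEE 754 style edge cases."""
    if math.isnan(x) or x < 0.0:          # note: -0.0 < 0.0 is False
        return math.nan
    if x == 0.0:                          # matches both +0.0 and -0.0
        return math.copysign(math.inf, x)
    if math.isinf(x):
        return 0.0
    return 1.0 / math.sqrt(x)

def rsqrt_special(x: float) -> float:
    """Hypothetical variant tuned so that x * rsqrt_special(x) == sqrt(x)."""
    if x == 0.0:
        return 0.0    # finite, so 0 * f(0) stays 0 and keeps the sign of x
    if x == math.inf:
        return 1.0    # positive and finite, so inf * f(inf) stays +inf
    return rsqrt(x)

def sqrt_via_rsqrt(x: float) -> float:
    return x * rsqrt_special(x)

# The naive product breaks at exactly the points discussed above.
print(0.0 * rsqrt(0.0))            # nan
print(math.inf * rsqrt(math.inf))  # nan
print(sqrt_via_rsqrt(-0.0))        # -0.0
```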
The_DarkFire_[m] has joined #asahi-gpu
<bloom> dougall: Ok, I just r/e'd thread count
<bloom> actually r/e is a stretch
<bloom> Just dumped maxTotalThreadsPerThreadgroup and varied register pressure systematically
<dougall> ah, what's it look like?
<bloom> So, in terms of the "register quadwords" field in the cmdstream:
<bloom> (call that Q)
<bloom> If Q <= 13, then you have 1024 threads.
<bloom> If Q >= 14, you have less. I was about to say I had the formula but my formula is buggy, hang on
<dougall> (is one register quadword like r0-r3 or like r0l-r1h?)
<bloom> r0-r3
* bloom just bruteforces
<bloom> I guess that's a linear allocation, just a lot of rounding needed
<bloom> So.. a 192 KiB register file
<bloom> uh no
<bloom> uh yes
<dougall> haha
<bloom> it's past midnight i cant units
<dougall> yeah, i think that all makes sense
<bloom> numbers still feel suspect.
<bloom> What am I missing
<bloom> SZ = (128*4)*384
<bloom> >>> SZ / (29*16)
<bloom> 423.7241379310345
<bloom> which is less than the 448 actually issued
<bloom> I guess this is all off-by-one
<bloom> uh no
<bloom> ah!
<bloom> Ahhh!
<bloom> If you take SZ = 384 * (32 * 4 * 4), everything is too small except the biggest
<bloom> but if you doubled that, you would expect the last thread count to double
<bloom> but if you take SZ to be somewhere in between, everything rounds right
<bloom> 212992 = ((384 + 448)/2) * 32 * 4 * 4
<bloom> ^ smack in the middle
<bloom> and indeed:
<bloom> >>> { x: math.floor((SZ_ / (x*4*4)) / 64) * 64 for x in range(13, 32) }
<bloom> {13: 1024, 14: 896, 15: 832, 16: 832, 17: 768, 18: 704, 19: 640, 20: 640, 21: 576, 22: 576, 23: 576, 24: 512, 25: 512, 26: 512, 27: 448, 28: 448, 29: 448, 30: 384, 31: 384}
<bloom> Expanding out:
<bloom> min(1024, math.floor((SZ_ / (math.ceil(reg_count / 4)*4*4)) / 64) * 64)
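[Note] The "somewhere in between" observation can be made precise: each (Q, thread count) pair bounds the register file size from below and above, and intersecting those bounds gives the range of sizes that reproduce the table. A sketch, assuming the table printed above matches the measured thread counts; the function and constant names are hypothetical.

```python
# Thread counts per "register quadwords" value Q, copied from the table above.
OBSERVED = {13: 1024, 14: 896, 15: 832, 16: 832, 17: 768, 18: 704, 19: 640,
            20: 640, 21: 576, 22: 576, 23: 576, 24: 512, 25: 512, 26: 512,
            27: 448, 28: 448, 29: 448, 30: 384, 31: 384}

GRANULE = 64             # threads are dispatched in groups of 64
MAX_THREADS = 1024       # hard cap on threads
BYTES_PER_QUADWORD = 16  # four 32-bit registers

def consistent_size_range(observed):
    """Return (lo, hi) such that every size in [lo, hi) reproduces `observed`."""
    lo, hi = 0, float("inf")
    for q, threads in observed.items():
        per_thread = q * BYTES_PER_QUADWORD
        # The file must hold at least `threads` threads...
        lo = max(lo, per_thread * threads)
        # ...but not one more group of 64, unless the cap hides that.
        if threads < MAX_THREADS:
            hi = min(hi, per_thread * (threads + GRANULE))
    return lo, hi

print(consistent_size_range(OBSERVED))  # (212992, 214016)
```

Under these assumptions 212992 bytes is the smallest size that fits the data, and 196608 (192 KiB) falls outside the range.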
<dougall> nice!
<bloom> Here's a better formula
<bloom> The register file is M = 53248 words.
<bloom> Every thread requires R words from the register file.
<bloom> Threads may only be dispatched in groups of 64.
<bloom> No more than 1024 threads may be dispatched.
<bloom> Therefore, we may dispatch `min(1024, align_down(M / R, 64))` threads.
<bloom> (where align_down(x, y) = floor(x / y) * y)
mxw39 has quit [Ping timeout: 240 seconds]
mxw39 has joined #asahi-gpu
<bloom> Possible addendum: But threads can only require multiples of 4 word-sized registers.
<bloom> Therefore, we may dispatch `min(1024, align_down(M / align_up(R, 4), 64))` threads.
<bloom> (where align_up(x, y) = ceil(x / y) * y)
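[Note] The final formula translates directly into integer arithmetic. A minimal sketch, assuming R counts 32-bit registers and using the constants quoted above; the function name `max_threads` is hypothetical.

```python
REGISTER_FILE_WORDS = 53248  # M, in 32-bit words
MAX_THREADS = 1024           # hard cap on dispatched threads
DISPATCH_GRANULE = 64        # threads are dispatched in groups of 64
ALLOC_GRANULE = 4            # registers are allocated in multiples of 4

def align_down(x: int, y: int) -> int:
    return (x // y) * y

def align_up(x: int, y: int) -> int:
    return -(-x // y) * y    # ceiling division without floats

def max_threads(reg_count: int) -> int:
    """Threads dispatchable for a shader using `reg_count` 32-bit registers."""
    if reg_count < 1:
        raise ValueError("reg_count must be at least 1")
    r = align_up(reg_count, ALLOC_GRANULE)
    return min(MAX_THREADS, align_down(REGISTER_FILE_WORDS // r, DISPATCH_GRANULE))

print(max_threads(52))   # 1024 (Q = 13)
print(max_threads(56))   # 896  (Q = 14)
print(max_threads(124))  # 384  (Q = 31)
```

Flooring M / R before aligning down to 64 gives the same result as the floating-point version, since both floors compose.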
method_ has joined #asahi-gpu
<bloom> "total, the M1 GPU contains up to 128 EUs and 1024 ALUs,[12] which by Apple's claim can execute nearly 25,000 threads simultaneously"
<bloom> I assume "nearly 25,000" is marketing speak for 1024 * 24 = 24576 threads
<bloom> which means scaling our estimated size up by 24x to 212992 * 24 = 5111808 = 4.9 MB of register file on the M1 GPU!
<bloom> Straight from the horse's mouth https://www.apple.com/mac/m1/
<bloom> 🐴
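[Note] The unit conversions behind the 4.9 MB figure, spelled out. This only restates the arithmetic above; the 24x factor is taken from the marketing number and is otherwise unexplained, and the variable names are hypothetical.

```python
WORD_BYTES = 4                   # one 32-bit register
words_per_file = 53248           # estimated register file size in words
bytes_per_file = words_per_file * WORD_BYTES
print(bytes_per_file)            # 212992

threads_per_file = 1024
copies = 24                      # assumption: "nearly 25,000" means 1024 * 24
print(threads_per_file * copies) # 24576

total_bytes = bytes_per_file * copies
print(total_bytes)               # 5111808
print(total_bytes / 2**20)       # 4.875 (MiB)
```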
<dougall> yeah... that number is right, but i'm a bit confused where the 24x comes from? like is the register file per-core? (i assume so, and that'd be an 8x, but then where's the 3x?)
<bloom> dougall: Can't tell.
<bloom> But the 24k is from apple's marketing
<bloom> not sure where the 3x is.
<dougall> yeah, i'm quite confused... "128 execution units", so 16 per core? when a simd-group is 32? is that like AMDs thing where you just put each half through the pipeline over two cycles?
<bloom> dunno where the 128 execution units comes from
<bloom> some of that is speculation from anandtech
<dougall> (that's from the announcement video iirc)
<bloom> Oh, I see.
<method_> interesting
<dougall> yeah, i'm sure it'll make more sense as we figure more out, but definitely plenty of puzzles left :)
<bloom> or not
<bloom> a lot of this stuff is irrelevant/invisible to software
vlixa has quit [Remote host closed the connection]
vlixa has joined #asahi-gpu
<glibc> oh, they don't use a fast sin at all? That's interesting indeed
odmir has joined #asahi-gpu
<bloom> Oh, there's a cute optimization I need to check the legitimacy of
<bloom> If two 16-bit values are in a packed register (r0l/r0h say) can we use bitop_mov on the 32-bit register (r0) for vectorization?
<bloom> I bet so.
<bloom> Likewise using mov_imm as a 32-bit thing
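[Note] The idea is that two 16-bit moves between packed halves equal one 32-bit move of the containing register. A toy model of that equivalence, assuming r0l is the low half and r0h the high half of r0; whether the hardware ops behave this way is exactly what still needs checking, and every name here is hypothetical.

```python
def pack(lo: int, hi: int) -> int:
    """Pack two 16-bit halves into one 32-bit register value."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def mov16(regs: dict, dst: str, dst_high: bool, src: str, src_high: bool) -> None:
    """Copy one 16-bit half of `src` into one 16-bit half of `dst`."""
    value = (regs[src] >> (16 if src_high else 0)) & 0xFFFF
    shift = 16 if dst_high else 0
    regs[dst] = (regs[dst] & ~(0xFFFF << shift) & 0xFFFFFFFF) | (value << shift)

def mov32(regs: dict, dst: str, src: str) -> None:
    """Copy the whole 32-bit register."""
    regs[dst] = regs[src] & 0xFFFFFFFF

# Scalar version: two 16-bit moves (r1l = r0l, r1h = r0h).
scalar = {"r0": pack(0x1234, 0xABCD), "r1": 0}
mov16(scalar, "r1", False, "r0", False)
mov16(scalar, "r1", True, "r0", True)

# Vectorized version: a single 32-bit move (r1 = r0).
vector = {"r0": pack(0x1234, 0xABCD), "r1": 0}
mov32(vector, "r1", "r0")

print(scalar == vector)  # True
```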
* bloom spun up an optimizer for agx
<bloom> handles the core floating point stuff
morelightning[m] has quit [Quit: Idle for 30+ days]
artemist has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
m42uko has quit [Quit: Leaving.]
m42uko has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
vlixa has quit [Remote host closed the connection]
vlixa has joined #asahi-gpu
odmir has quit [Ping timeout: 268 seconds]
odmir has joined #asahi-gpu
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
vlixa has quit [Remote host closed the connection]
vlixa has joined #asahi-gpu