ChanServ changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
Etrien__ has quit [Read error: Connection reset by peer]
ella-0 has quit [Read error: Connection reset by peer]
Dcow has joined #asahi-gpu
Dcow has quit [Ping timeout: 480 seconds]
<lina>
phire: Performance really doesn't matter, and I need input/output in f32 format... I think I'm just going to wrap the existing kernel softfloat stuff for now
<phire>
fair
<lina>
If upstream doesn't like it I can always replace it with something else
<phire>
upstream linux, or upstream rust-on-linux?
<lina>
both ^^
<lina>
I've been looking at the new buffer stuff. The buffers are sized based on the core count, and are used even if I turn off half the cores. So it seems the "2 GPUs" is a red herring. It's 8 GPUs!
<lina>
All that stuff is also used on M1 Pro and Max
<lina>
So they planned for this scalability from the get go, it seems
<lina>
It's just that the M1 doesn't need it, since it's a single cluster
Etrien has quit [Remote host closed the connection]
<lina>
As for the deflake buffers, for some inexplicable reason macOS allocates them as exactly *9x* the M1 sizes... but with 8x it works, so maybe that's an off-by-one on their side (or they were padding for debugging)
Etrien has joined #asahi-gpu
<lina>
(deflake = preempt, I really need to replace that wording)
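A minimal Rust sketch of the two sizing observations above; the struct, field, and parameter names are invented, and only the arithmetic (buffers scale with cluster count, preempt buffers at 9x vs 8x the M1 size) comes from the log:

```rust
/// Hypothetical config; only "8 clusters of 8 cores" is from the discussion.
struct GpuConfig {
    num_clusters: usize, // 1 on M1, 8 on the part discussed here
}

impl GpuConfig {
    /// Most of the new shared buffers appear to be a fixed per-cluster
    /// size multiplied by the cluster count.
    fn scaled_buffer_size(&self, per_cluster: usize) -> usize {
        per_cluster * self.num_clusters
    }

    /// Preemption ("deflake") buffers: macOS allocates num_clusters + 1
    /// times the M1 size (possibly an off-by-one or debug padding), but
    /// num_clusters times has been observed to be enough.
    fn preempt_buffer_size(&self, m1_size: usize, match_macos: bool) -> usize {
        let copies = if match_macos { self.num_clusters + 1 } else { self.num_clusters };
        m1_size * copies
    }
}
```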
<phire>
yeah, I was suspicious about the actual "gpu count" based on some of the initdata
<lina>
And then, of the new tile buffers, most seem to be a fixed size * cluster count, but one seems to be aux tilemaps, and its size is the tilemap size * 4 * nclusters
<lina>
I don't know where the 4 comes from, it seems the last 3/4 of each block end up unused...
<lina>
Maybe it's a spill thing or maybe I have a stride setting wrong somewhere
<phire>
so it's 4 cores per cluster?
<lina>
8 cores per cluster
* phire
scrolls down to correct die shot
<lina>
Ohh I found the stride... and yup, I have it wrong, I remember a factor of 4 I was wondering about here. This also makes sense, I remember messing with this one on M1 and it not mattering...
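A small sketch of the aux tilemap sizing and stride just described; the names are made up, and only the tilemap_size * 4 * nclusters arithmetic is from the observations above:

```rust
/// Factor-of-4 stride observed between per-cluster aux tilemaps.
const AUX_TILEMAP_STRIDE_FACTOR: usize = 4;

/// Total aux tilemap buffer: tilemap size * 4 * number of clusters.
fn aux_tilemaps_size(tilemap_size: usize, num_clusters: usize) -> usize {
    tilemap_size * AUX_TILEMAP_STRIDE_FACTOR * num_clusters
}

/// With the stride set correctly, each cluster's slot starts
/// 4 * tilemap_size into the buffer; with a wrong stride the last 3/4 of
/// each block looks unused, which is what was seen at first.
fn aux_tilemap_offset(tilemap_size: usize, cluster: usize) -> usize {
    cluster * AUX_TILEMAP_STRIDE_FACTOR * tilemap_size
}
```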
<phire>
if the granularity is that small, I wonder if it can split vertex workloads across clusters. Or if vertex workloads are limited to a single cluster
<lina>
It can, it's what these buffers are for.
<lina>
I can restrict which clusters it uses for vertex via a mask, and it slows down if I do that
<lina>
Fancy.
<lina>
If I restrict it to just one cluster, the buffers are not used at all (any cluster will do)
<lina>
Same as M1
<lina>
Also, for some queues/jobs, macOS does just that, rolling the cluster it picks
<lina>
But I can't get it to actually run multiple jobs across clusters like this, so I think this is just some kind of power/load spreading for simple jobs, like the compositor
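A hedged sketch of the cluster mask behaviour described above; the struct and field are invented, and only the observed behaviour (masking off clusters slows vertex work down, a single-cluster mask sidesteps the shared buffers) comes from the log:

```rust
/// Hypothetical per-job configuration with a vertex cluster mask.
#[derive(Clone, Copy)]
struct VertexJobConfig {
    cluster_mask: u8, // one bit per cluster; 0xff = all 8 clusters
}

impl VertexJobConfig {
    /// Restricting the mask to a single cluster matches the M1 case:
    /// the cross-cluster coordination buffers go unused.
    fn needs_cross_cluster_buffers(&self) -> bool {
        self.cluster_mask.count_ones() > 1
    }
}
```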
<phire>
interesting, so it's splitting vertex workloads at a pretty fine granularity. Within a single draw?
<lina>
macOS only ever uses either 1 cluster rolling, or all clusters, and it doesn't seem to be per client at least. Might be a heuristic.
<lina>
I think so, since I think the glmark2 bunny is a single draw?
chadmed has quit [Quit: Konversation terminated!]
<lina>
And I mean it has to balance across cores anyway... balancing across clusters just means, I'm guessing, extra handling for tile maps and buffer management metadata
<lina>
I imagine that all that logic comes from die 0, and is unused in die 1, only the actual compute units and tiling output logic are used in die 1
Dcow has joined #asahi-gpu
<phire>
yeah, the mesh is just a single DrawArrays call
<phire>
or draw VBO if that's enabled. not sure which path is used by default
<lina>
If you look at the die shot, there are 8 compute units per cluster and then a bunch of structures that are also per cluster. And then there's one common area that is global in the M1 Pro, and not even symmetric in the M1 Max
<lina>
So I'm guessing that little block is the top-level buffer management / dispatch stuff, the per-cluster structures are the actual tiling/sorting/whatever, and then there's the actual cores
<lina>
And on the M1 Max, the little block on die 1 would be unused, everything else would be used, with the tiling/sorting stuff coordinating via the extra buffers
<phire>
that might actually be three common areas. One at the top, one in the middle, one at the bottom. top and bottom seem to be different
<phire>
the bottom one might be to do with inter-die work sharing
<lina>
It's possible the way it works is there is an extra merge pass. Like, all the cores tile a random slice of geometry into separate buffers (and only have to coordinate to allocate buffer pages from the shared buffer manager), then once that's done the pointers are merged/chained (could be a linked list already, not sure exactly how it is organized) and then vertex gets to work on the result
<lina>
That makes perfect sense, come to think of it
<lina>
I've also noticed it uses more tile buffer pages than the M1, and that makes sense since allocation is at the 32K page level and with more slices, that's more overhead
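A tiny sketch of the page-rounding overhead just mentioned, assuming only the 32 KiB allocation granularity from the log; everything else is illustrative:

```rust
const TILE_HEAP_PAGE_SIZE: usize = 32 * 1024; // allocation granularity per the log

/// Each slice rounds up to whole pages, so splitting the same geometry
/// across more per-cluster slices consumes more pages overall.
fn pages_needed(bytes_per_slice: &[usize]) -> usize {
    bytes_per_slice
        .iter()
        .map(|b| (b + TILE_HEAP_PAGE_SIZE - 1) / TILE_HEAP_PAGE_SIZE)
        .sum()
}

fn main() {
    // ~100 KiB tiled as one slice: 4 pages...
    assert_eq!(pages_needed(&[100 * 1024]), 4);
    // ...the same amount split into 8 cluster slices: 8 pages.
    assert_eq!(pages_needed(&[13 * 1024; 8]), 8);
}
```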
<phire>
theory: top is inter-cluster dispatch; middle is the inter-cluster coherency fabric; bottom is for inter-die coherency/communication
<lina>
I think top might include the ASC
<lina>
Are you sure middle is not symmetric? It looks symmetric to me...
<phire>
I mean, it's made of 2-4 symmetric slices. but those slices seem to be tied together into a single block
Dcow has quit [Ping timeout: 480 seconds]
<phire>
in a way that reminds me of zen 2's l3 cache. but it doesn't look like a cache
<lina>
I think top is shared/ASC/bufmgr, bottom is actually partially unrelated to the GPU (it has all the extra PLLs for the M1 Max, those are needed by the other blocks), and there's nothing specific for inter-die since the architecture already scales in that direction with just Apple Fabric
<phire>
so I'm thinking tags?
<phire>
yeah, you might be right that it's just something sitting in otherwise unused space
<phire>
(at the bottom)
<lina>
It has 7 PLLs and there are no other PLLs lying around in the M1 Max exclusive half of the die, so that has to be it
<phire>
how can you tell they are PLLs?
<phire>
(that's not something I've learned to spot on a die shot)
<lina>
The little square blocks like that always scream PLL, they're clearly analog blocks and very different from any logic stuff, but not attached to any IO edges
<lina>
And you see them all over the place (there's 4 on the top side of the GPU, but also a row of them near the top of the die, and one in each DDR IO cell)
<lina>
And the count kind of adds up (7 extra for M1 Max: 1 for AVE/scaler stuff, 1 for the mystery unused neural engine, 2 extra DCPs, 2 for the GPU clusters, 1 for prores or just random)
<phire>
makes sense
chadmed has joined #asahi-gpu
<lina>
Oh yeah, and this also explains why the preemption buffers are so small, and replicated like this. They hold the tiling accelerator/command processor state, not compute unit state.
<lina>
So it can't preempt individual core launches (at whatever granularity they use for that), which we know because there are no tile memory buffers.
<lina>
(Or maybe it can trigger tiles to early-exit and just flush to the FB, like a partial render?)
<lina>
(Either way it definitely can't snapshot tile memory wholesale)
<phire>
yes, was just going to say
<phire>
there is already memory allocated for tiles, though we don't know if it's used
<lina>
For tiles?
<phire>
the framebuffer
<lina>
Well yes
<lina>
I mean, it has to be used for partial renders anyway
<lina>
I guess I should check whether preemption writes out partial data... should be easy...
<phire>
some of those compute jobs are long-running. I wonder if compute launches require buffers for compute unit state
<lina>
I wonder, they might have something special for compute
<phire>
but vertex/fragment jobs are usually short enough that you can just wait for them to end
<phire>
but I would think individual tile invocations could be long enough that it's worth preempting them
<phire>
I'm not sure what their expected preemption granularity is. it would be weird for a single tile to execute for longer than ~10us
chadmed has quit [Quit: Konversation terminated!]
<lina>
I don't see any depth meta buffer data with preemption but no TVB overflows, so I don't think that's it
<lina>
I think it just can't preempt individual tile launches
<phire>
I wonder if it just aborts any tile that doesn't finish quick enough
<lina>
OK, so the final tile pointer array is == the first cluster-block of the temporary tile pointer array after a render
<lina>
So it looks like that just gets copied over (sounds inefficient?) and they probably either chain the rest of the pointers behind, or do some merging of top-level structures (but evidently not the actual vertex buffer data, this is all variable size anyway so they need to have some kind of linked list or growable array mechanism anyway)
<phire>
so you can actually see it merging?
<phire>
I guess it does need to be sorted into API order, which would be hard without a pass over it
<lina>
I'd have to look into the pointee data and poll it and see what happens to it, I'm only looking at the top level now
<lina>
If they split the input draw arrays by cluster in API order, then it's trivial
<lina>
Not sure how that would work with multiple draws though
<phire>
the other option is a centralized vertex-writeback block, basically equivalent to the ROPs for pixels on desktop gpus
<phire>
each compute core sends final vertex data to a centralized block, which sorts it and writes it to the TVB
<lina>
They have separate root tilemaps for every cluster, so they are definitely processing everything in chunks and then merging at the end
<lina>
And that wouldn't scale well...
<phire>
or N centralized blocks, addressed by screenspace location
<lina>
I mean I'm sure they have that within a cluster
<lina>
But between clusters, don't think so
<phire>
yes, there would be scaling issues
<phire>
but merging just seems wasteful
<phire>
but merging also has better scaling?
<phire>
Hang on, is there any reason not to just implement reading from N TVBs (one per cluster) during the 3d invocation?
<phire>
and reading and merging them in a single pass
<lina>
The structures are trees, so they can probably just chain pointers
<lina>
The top level tile array is just 5 bytes per tile, a memcpy of that seems wasteful but is trivial
<lina>
And then the rest of the top level pointers are probably chained together at the next level, or something like that
<lina>
So when I say "merging" it's just writing out some pointers
<lina>
The actual vertex data pages are definitely not merged, that'd be wasteful
<lina>
It's just chaining together slices for each tile
<phire>
oh, point. we are just merging the indices
<phire>
it's a lot less wasteful when you put it that way
<lina>
The top level tile array just has 5-byte memory pointers to the initial position of every tile, and then there's at least one intermediate meta/pointer array structure before the raw data (it's been a while since I dumped these buffers...)
<lina>
And since that next level is allocated out of the heap, it has to have chain pointers
<lina>
So it's very likely they just trivially chain those
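A speculative Rust sketch of the top-level structures described above: 5 bytes (40 bits) per tile pointing into heap-allocated, chainable per-tile lists. The byte order and everything beyond "5 bytes per tile" and "the final array matches cluster 0's block" are assumptions:

```rust
const TILE_ENTRY_SIZE: usize = 5; // 5-byte entry per tile in the top-level array

/// Decode one 40-bit tile pointer (little-endian is a guess).
fn tile_pointer(raw: &[u8], tile_index: usize) -> u64 {
    let e = &raw[tile_index * TILE_ENTRY_SIZE..][..TILE_ENTRY_SIZE];
    u64::from_le_bytes([e[0], e[1], e[2], e[3], e[4], 0, 0, 0])
}

/// "Merging" at the top level would then amount to copying cluster 0's
/// array to the final location; the other clusters' slices presumably
/// get chained in at the next level via heap chain pointers.
fn merge_top_level(final_array: &mut [u8], cluster0_array: &[u8]) {
    final_array.copy_from_slice(cluster0_array);
}
```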
<phire>
except you have to respect ordering, because stupid reasons
<phire>
there are very slight graphical differences (where depths are equal) if you render triangles out of API order
<lina>
Yeah, so the question is whether they do something at the dispatch level to make sure every cluster gets an ordered slice of the overall workload. I think they do, because the first tilemap is ~complete, and they get sparser as you step through the buffer.
<lina>
So that tells me the first cluster processed all the clear ops and every tile, and then the other clusters progressively processed additional slices of geometry for busy tiles
<lina>
Probably in draw order...
<phire>
and much larger differences if blending or image load/store are enabled
<lina>
And then if you can do that, chaining just works
<lina>
Also, they might have primitive IDs in the TVB data anyway, I think they need that later during rasterization? So maybe it doesn't matter...
<lina>
For a single draw op it's trivial, just split the primitives into equal 8ths. More interesting is what happens with multiple draw ops.
<lina>
It's possible they have some sort of pass to count the total number of primitives first, then split based on that?
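A minimal sketch of the dispatch-level split speculated about above: give each cluster a contiguous, in-order slice of the draw's primitives, so chaining the per-cluster results in cluster order preserves API order. The function is purely illustrative:

```rust
use std::ops::Range;

/// Split a draw's primitives into num_clusters contiguous, ordered slices.
fn split_primitives(total_prims: usize, num_clusters: usize) -> Vec<Range<usize>> {
    let per_cluster = (total_prims + num_clusters - 1) / num_clusters;
    (0..num_clusters)
        .map(|c| {
            let start = (c * per_cluster).min(total_prims);
            let end = ((c + 1) * per_cluster).min(total_prims);
            start..end
        })
        .collect()
}

fn main() {
    // A single draw split into equal 8ths, one slice per cluster.
    let slices = split_primitives(100_000, 8);
    assert_eq!(slices[0], 0..12_500);
    assert_eq!(slices[7], 87_500..100_000);
    // Multiple draws would need a primitive count (or estimate) up front
    // before the same split could be applied across them.
}
```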
<phire>
actually, maybe they cheat and their depth sorting implementation also sorts by primitive ID
<phire>
which would have a side-effect that it has to fall back to single-cluster vertex processing when doing alpha blending
<lina>
That seems really wasteful...
<phire>
yeah, and apple's docs go on and on about how alpha blending is expensive
<phire>
why not make it a bit more expensive?
<lina>
wwwww
<phire>
like more than 95% of your geometry in a typical game will be opaque
<phire>
because alpha blending is somewhat expensive everywhere
<lina>
Yeah, but if they have to do some kind of flush around alpha draws...
<phire>
they already do
<lina>
Anyway, I should get ready for the stream (and have a sandwich or something) ^^
<phire>
alpha blending already puts the TBDR renderer into a very sub-optimal mode
<lina>
At the fragment level though, right? Not the vertex level...
<phire>
depending on the exact implementation, you might have to allocate some of the tile buffer for a copy of the depth buffer
<phire>
hmm, does it really need a full flush to switch vertex modes?
<phire>
the mask is surely per job, so you can just put the alpha blended verts in a new job
<phire>
though, I can't actually remember what level we called "job"
alyssa has joined #asahi-gpu
<alyssa>
"Apple Fabric"
<alyssa>
is there a luxury clothing line I didn't hear about?
<sven>
it's their interconnect which connects most (all?) things inside the SoC
<_jannau_>
they started selling square greyish samples for $19 a piece last year
<alyssa>
_jannau_: Ooh, got it, thank you