ChanServ changed the topic of #asahi-gpu to: Asahi Linux GPU development (no user support, NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
Z750 has quit [Quit: Ping timeout (120 seconds)]
Z750 has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
nsklaus has quit [Ping timeout: 480 seconds]
thelounge606 has quit [Remote host closed the connection]
cr1901 has quit [Quit: Leaving]
cr1901 has joined #asahi-gpu
cr1901 has quit []
cr1901 has joined #asahi-gpu
cr1901 has quit []
cr1901 has joined #asahi-gpu
lonjil2 has quit []
lonjil has joined #asahi-gpu
cr1901_ has joined #asahi-gpu
cr1901_ has quit [Remote host closed the connection]
possiblemeatball has quit [Quit: Quit]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
cr1901 has quit [Read error: Connection reset by peer]
cr1901 has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
odak_ has quit [Quit: odak_]
ourdumbfuture has joined #asahi-gpu
pyropeter3 has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
pyropeter2 has quit [Ping timeout: 480 seconds]
amarioguy has quit [Remote host closed the connection]
odak_ has joined #asahi-gpu
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
hightower2 has joined #asahi-gpu
mlp has quit [Read error: Connection reset by peer]
flibit has joined #asahi-gpu
<lina> Looked at the Emacs issue and it's Xwayland/X11... I think it's another rendering loop, but Xorg doesn't like running under apitrace so I can't really check...
flibitijibibo has quit [Ping timeout: 480 seconds]
<lina> alyssa: Have you ever used apitrace with X?
nimprod3l has joined #asahi-gpu
zzywysm has quit [Ping timeout: 480 seconds]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
nimprod3l has quit [Quit: Leaving]
<lina> Figured it out... it's a driver bug
<lina> We claim to support texture barriers but we don't (and can't without special handling)...
bcrumb has quit [Quit: WeeChat 3.8]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
bcrumb has quit [Quit: WeeChat 3.8]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
<lina> firefox: ../src/asahi/compiler/agx_performance.c:30: agx_occupancy_for_register_count: Assertion `!"" "Register count must be less than the maximum"' failed.
<lina> alyssa: Is that one expected?
bcrumb has quit [Quit: WeeChat 3.8]
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
nsklaus has joined #asahi-gpu
cylm has joined #asahi-gpu
alyssa has joined #asahi-gpu
<alyssa> lina: sleepy nya nya nya nya nya nya nya nya nya bat man
<alyssa> 07:27 <lina> alyssa: Have you ever used apitrace with X?
<alyssa> Yeah there's some awful incantation to do it that I can never remember
<alyssa> 08:46 <lina> firefox: ../src/asahi/compiler/agx_performance.c:30: agx_occupancy_for_register_count: Assertion `!"" "Register count must be less than the maximum"' failed.
<alyssa> I admit that's a terrible error message but what that means is that "this shader needs to spill registers and there's no spilling implemented"
<alyssa> firefox has worked fine for me so I wonder how that one reproduced
<alyssa> like I believe it, some webrender shaders are chunky
thelounge6065 has joined #asahi-gpu
as400 has quit [Remote host closed the connection]
ourdumbfuture has joined #asahi-gpu
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
odak_ has quit [Ping timeout: 480 seconds]
<lina> alyssa: 3ec: c17fbf33 sample_mask 255, 63
<lina> is that... right?
yamii has joined #asahi-gpu
yamii_ has quit [Read error: Connection reset by peer]
<alyssa> that's fine
<alyssa> are you still live?
<lina> Yes
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
flibit has quit []
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
mlp has joined #asahi-gpu
mlp has quit []
mlp has joined #asahi-gpu
<lina> alyssa: Do we really support GL_EXT_multisampled_render_to_texture already? That's the implicit resolve stuff, right?
<lina> At least I remember we didn't do it for depth yet...
<alyssa> for colour render targets it should Just Work
<alyssa> for depth buffers I thought it did but I could've been wrong
<alyssa> there are like, no tests for this ...
<alyssa> there's an unmerged piglit that has problems on other drivers
<lina> Steven says it's broken and that's why MSAA didn't work when I tested it earlier
<lina> I can try turning it off and see if it fixes it...
mlp has quit []
mlp has joined #asahi-gpu
<alyssa> I would believe it
<alyssa> should be easy to fix hopefully if it's broken
<steven> Yeah, Lina was noticing that the MSRTT path for MSAA in Darwinia didn't have a visual difference (and I saw that locally too). Using the traditional blit-to-resolve path instead works fine of course
<alyssa> if you could do that I'd appreciate it!
<lina> BTW, the Ultra thing was an MSAA interaction, there may be more lurking... but I think this is the only one relevant for clustering. So if there's anything else with tiling and MSAA, you'll see it on M1.
<alyssa> fun
<lina> (Kernel side)
<alyssa> regardless I can't help with msrtt
<alyssa> if you don't/can't fix it, i'll merge your disable patch
<lina> There is exactly one number calculated based on FB tile dimensions that is only relevant for clustering and it needed to be *samples
alyssa has left #asahi-gpu [#asahi-gpu]
<steven> Is it functionally possible to specify a multisampled texture with a sample count of 1 (i.e. effectively making it single-sampled despite being GL_TEXTURE_2D_MULTISAMPLE)? If so I kind of wonder if it's as simple as the sample count not propagating all the way through
i509vcb has joined #asahi-gpu
nimprod3l has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
hightower2 has quit [Ping timeout: 480 seconds]
alyssa has joined #asahi-gpu
<alyssa> I have not really tested this
<alyssa> but yknow. should work. probably
<lina> I just finished the kernel fix! Reviewing now ^^
* alyssa running through CTS now
<steven> alyssa: thank you! looking now
<alyssa> as predicted, it's not a quickfix
<alyssa> and I expect that code to regress performance somewhat
<alyssa> we can recover that later. correctness first.
<alyssa> lina: i must say i would not have had the persistence to work through the Darwinia side of this
<alyssa> we make a good team, you and me :~)
<lina> ^^
<steven> first line of lower_sample_mask_write is "return false" but there's a bunch of changes after it -- was that return intentional?
<alyssa> probably not
<alyssa> yep definitely not
<lina> alyssa: Confirmed it fixes Darwinia ^^
<alyssa> lina: hold on i fixed the thing steven pointed out
<alyssa> try again with the fix :-D
<alyssa> (I repushed)
<steven> thanks alyssa!
<alyssa> cheers
<lina> Still works!
<alyssa> I'll revisit this post-CTS to deal with the performance hit
<alyssa> but CTS seems happy with this if you are
possiblemeatball has joined #asahi-gpu
<alyssa> so eMRT is up next
<alyssa> but first, lunch
alyssa has quit [Quit: leaving]
nimprod3l has quit [Remote host closed the connection]
nimprod3l has joined #asahi-gpu
<lina> Tested all the discard cases, works ^^
aafeke_ has joined #asahi-gpu
nimprod3l has quit [Quit: Leaving]
hightower2 has joined #asahi-gpu
aafeke_ has quit [Quit: aafeke_]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
cylm has quit [Ping timeout: 480 seconds]
ourdumbfuture has joined #asahi-gpu
possiblemeatball has quit [Quit: Quit]
as400 has joined #asahi-gpu
c10l484 has quit []
c10l484 has joined #asahi-gpu
alyssa has joined #asahi-gpu
<alyssa> I'm combing through a trace of emrt this afternoon
<alyssa> I have a good handle on the transform they're doing for the fragment shaders
<alyssa> IDK if we'll do the same thing. But understanding first.
<alyssa> it relies on a few hardware features to work:
<alyssa> * special register 20 giving the core (or cluster? what's the difference?) index
<alyssa> lina already found that one
<alyssa> (and documented it)
<alyssa> * special register 32 giving a sort of tile ID
<alyssa> this one is pretty subtle
<alyssa> in each core (cluster?), there can be multiple tiles being processed concurrently
<alyssa> up to max_concurrent_tiles
<alyssa> sr32 assigns an ID to each tile from 0 to max_concurrent_tiles - 1, such that each concurrently processed tile has a unique index
<alyssa> * thread_position_in_threadgroup.xy defined in fragment shaders as relative to the current tile
<alyssa> * threads_per_threadgroup defined in fragment shaders as the tile size
<alyssa> * special register 60 giving a coverage mask that's affected by.. I guess z/s testing maybe?
<alyssa> with all that information the actual lowering is nice and simple
<alyssa> the driver allocates a buffer like:
<alyssa> pixel_t buffer[# cores][max concurrent tiles per core][tile height][tile width];
<alyssa> then instead of doing tilebuffer access, the fragment shader load/stores from
<alyssa> buffer[core ID][concurrent tile ID][position in tile.y][position in tile.x]
<alyssa> What's nice about this scheme?
<alyssa> * That buffer is pretty small in practice. a few kilobytes per byte of the format.
<alyssa> * It's naturally tiled, so the cache behaviour isn't horrible, although not as good as a real twiddled format
<alyssa> * It's straightforward to address in software
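A minimal sketch of the addressing described above, assuming a plain row-major layout for buffer[# cores][max concurrent tiles per core][tile height][tile width]. The struct, its field names, and both helper functions are hypothetical and are not taken from the driver or from Apple's code. It only illustrates how the core index (sr20), concurrent tile ID (sr32), and position in tile would combine into a byte offset, and why the buffer stays small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout description; every name here is illustrative. */
struct emrt_layout {
    size_t num_cores;            /* range of the core (cluster?) index, sr20 */
    size_t max_concurrent_tiles; /* range of the concurrent tile ID, sr32 */
    size_t tile_height;          /* threads_per_threadgroup.y */
    size_t tile_width;           /* threads_per_threadgroup.x */
    size_t bytes_per_pixel;      /* sizeof(pixel_t) */
};

/* Total size in bytes of buffer[cores][tiles][height][width]. */
static size_t emrt_buffer_size(const struct emrt_layout *l)
{
    return l->num_cores * l->max_concurrent_tiles *
           l->tile_height * l->tile_width * l->bytes_per_pixel;
}

/* Byte offset of buffer[core][tile][y][x], assuming row-major order. */
static size_t emrt_pixel_offset(const struct emrt_layout *l,
                                size_t core, size_t tile,
                                size_t y, size_t x)
{
    assert(core < l->num_cores && tile < l->max_concurrent_tiles);
    assert(y < l->tile_height && x < l->tile_width);

    size_t index =
        ((core * l->max_concurrent_tiles + tile) * l->tile_height + y) *
            l->tile_width + x;
    return index * l->bytes_per_pixel;
}
```

With made-up numbers (2 cores, 4 concurrent tiles, 32x32 tiles, 4 bytes per pixel) the whole buffer is 32768 bytes, and the size does not depend on the framebuffer dimensions.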
c10l484 has quit []
<alyssa> So... the more spicy half is how do you get that buffer back into the compressed render target
c10l484 has joined #asahi-gpu
<alyssa> this seems to involve dispatching a tile shader, because of course it does T_T
<alyssa> not going to go through this line by line
<alyssa> but broadly:
<alyssa> 1. First it wait_pix 768, 3.. the big barrier needed
<alyssa> 2. If thread_index_in_threadgroup != 0, skip 3. (That is, step 3 executes once per tile instead of once per pixel)
<alyssa> 3. For each render target in the tilebuffer, do an image_write_block interleaved with a memory_barrier 0, 2, 10 (f5a2). It's not clear to me why the barriers are needed.
<alyssa> 4. threadgroup (tile) barrier
<alyssa> 5. Load the pixel colour from the buffer
<alyssa> 6. Store the colour into the tilebuffer, with a special sample mask source I haven't decoded
Guest3100 has joined #asahi-gpu
<alyssa> 7. Another threadgroup/tile barrier (to make sure all colours are written before proceeding)
<alyssa> 8. If thread_index_in_threadgroup != 0, skip 9. (That is, step 9 executes once per tile instead of once per pixel)
Guest3100 has quit [Remote host closed the connection]
<alyssa> 9. image_write_block from the tilebuffer to the spilled render target
<alyssa> 10. done
Malaph has joined #asahi-gpu
<alyssa> what makes this a bit tricky is that we don't have tile shaders implemented, and neither GL nor VK has tile shaders so I'm not inclined to change that until we have a 'real' use case
<alyssa> the only thing tile shaders are actually used for here (as opposed to regular fragment shaders) is the tile barriers
<alyssa> although we could use multiple draws to avoid the barriers. slower but whatever, that's not the bottleneck here :-p
<alyssa> store pipeline becomes empty at any rate
<alyssa> clear pipeline has the same fragment transform applied of course
<alyssa> partial reload is the standard partial reload + usual fragment transform
<alyssa> partial store is the spiciest
<alyssa> First, it image_write_blocks each non-spilled target with the barriers interleaved as before
<alyssa> then it loops over each pixel in the tile and st_tiles it with magic sources I don't understand yet
<alyssa> then it image_write_blocks that
<alyssa> e8: 09048604fc208000 st_tile r1l_r1h_r2l_r2h, u8norm, 1, xyzw, 64, 1337, 6
<alyssa> this is the weird st_tile
systwi_ has quit [Ping timeout: 480 seconds]
<alyssa> well that's one the other is
mlp has quit [Read error: Connection reset by peer]
<alyssa> fe: 09148604fc208000 st_tile r5l_r5h_r6l_r6h, u8norm, 1, xyzw, 64, 1337, 6
mlp has joined #asahi-gpu
systwi has joined #asahi-gpu
<alyssa> would like to know more about the structure of those last 3 sources ..
<alyssa> I guess messing with tile shader's imageblock stuff might help figure that out
<alyssa> haven't decided if I'm doing this approach to eMRT though
<alyssa> ok, I've given this some thought
<alyssa> For now, I won't be implementing anything like what Apple does for this
<alyssa> instead will do the Simplest Thing That Could Possibly Work
<alyssa> which I think should get similar performance, just with a dramatically worse (but still tolerable) memory footprint
<alyssa> and if anything ever hits this other than the CTS we can revisit
<alyssa> there are times in driver dev to be clever and I don't think this is one
<alyssa> but that's for tomorrow
<alyssa> =)
<alyssa> or maybe today. i wonder how far I can get in a few minutes so I can stop thinking about it
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
possiblemeatball has joined #asahi-gpu
Malaph has quit []
ourdumbfuture has joined #asahi-gpu
* alyssa keeps lowering the bar to produce the Easiest Possible Implementation
<alyssa> I just want to pass CTS, yo
<alyssa> bindless images, sure why not, I already implement those
<alyssa> why not use em
<alyssa> ("Slow?" "Shhh")
<alyssa> it doesn't need to be fast
<alyssa> it just needs to work
possiblemeatball has quit [Quit: Quit]
nsklaus has quit [Ping timeout: 480 seconds]