ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
Z750 has quit [Quit: Ping timeout (120 seconds)]
Z750 has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
nsklaus has quit [Ping timeout: 480 seconds]
thelounge606 has quit [Remote host closed the connection]
cr1901 has quit [Quit: Leaving]
cr1901 has joined #asahi-gpu
cr1901 has quit []
cr1901 has joined #asahi-gpu
cr1901 has quit []
cr1901 has joined #asahi-gpu
lonjil2 has quit []
lonjil has joined #asahi-gpu
cr1901_ has joined #asahi-gpu
cr1901_ has quit [Remote host closed the connection]
possiblemeatball has quit [Quit: Quit]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
cr1901 has quit [Read error: Connection reset by peer]
cr1901 has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
odak_ has quit [Quit: odak_]
ourdumbfuture has joined #asahi-gpu
pyropeter3 has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
pyropeter2 has quit [Ping timeout: 480 seconds]
amarioguy has quit [Remote host closed the connection]
odak_ has joined #asahi-gpu
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
hightower2 has joined #asahi-gpu
mlp has quit [Read error: Connection reset by peer]
flibit has joined #asahi-gpu
<lina>
Looked at the Emacs issue and it's Xwayland/X11... I think it's another rendering loop, but Xorg doesn't like running under apitrace so I can't really check...
flibitijibibo has quit [Ping timeout: 480 seconds]
<lina>
alyssa: Have you ever used apitrace with X?
nimprod3l has joined #asahi-gpu
zzywysm has quit [Ping timeout: 480 seconds]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
nimprod3l has quit [Quit: Leaving]
<lina>
Figured it out... it's a driver bug
<lina>
We claim to support texture barriers but we don't (and can't without special handling)...
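(For context, this is what the texture-barrier pattern looks like from the API side: a minimal sketch assuming a desktop GL 4.5 core context (ARB_texture_barrier) with entry points already loaded; the helper names and object handles are hypothetical, not taken from the Emacs/Xwayland trace.)

    #include <GL/glcorearb.h>  /* prototypes assumed provided by the app's GL loader */

    /* Stand-ins for the application's own draw calls. */
    static void draw_pass_writing(void) { /* draw call that writes texels of tex */ }
    static void draw_pass_reading(void) { /* draw call that samples those texels */ }

    static void
    feedback_loop(GLuint fbo, GLuint tex)
    {
       /* The same texture is both the colour attachment of fbo and a sampler
        * source for the shader -- a rendering feedback loop.
        */
       glBindFramebuffer(GL_FRAMEBUFFER, fbo);
       glBindTexture(GL_TEXTURE_2D, tex);

       draw_pass_writing();
       glTextureBarrier();   /* make the writes above visible to texture fetches */
       draw_pass_reading();
    }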
bcrumb has quit [Quit: WeeChat 3.8]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
bcrumb has quit [Quit: WeeChat 3.8]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
<lina>
firefox: ../src/asahi/compiler/agx_performance.c:30: agx_occupancy_for_register_count: Assertion `!"" "Register count must be less than the maximum"' failed.
<lina>
alyssa: Is that one expected?
bcrumb has quit [Quit: WeeChat 3.8]
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
nsklaus has joined #asahi-gpu
cylm has joined #asahi-gpu
alyssa has joined #asahi-gpu
<alyssa>
lina: sleepy nya nya nya nya nya nya nya nya nya bat man
<alyssa>
07:27 <lina> alyssa: Have you ever used apitrace with X?
<alyssa>
Yeah there's some awful incantation to do it that I can never remember
<alyssa>
08:46 <lina> firefox: ../src/asahi/compiler/agx_performance.c:30: agx_occupancy_for_register_count: Assertion `!"" "Register count must be less than the maximum"' failed.
<alyssa>
I admit that's a terrible error message but what that means is that "this shader needs to spill registers and there's no spilling implemented"
<alyssa>
firefox has worked fine for me so I wonder how that one reproduced
<alyssa>
like I believe it, some webrender shaders are chunky
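(To unpack that: the assert fires when a shader's register demand exceeds the hardware limit and there is no spiller to fall back on. A rough sketch of the shape of such a check; the limit, the struct, and the occupancy curve below are placeholders, not the real values in agx_performance.c.)

    #include <assert.h>

    #define MAX_REGISTERS 256 /* placeholder, not the real AGX limit */

    struct occupancy {
       unsigned max_threads; /* threads resident per core at this register count */
    };

    static struct occupancy
    occupancy_for_register_count(unsigned nr_regs)
    {
       /* Without a register spiller, a shader needing more registers than the
        * hardware provides simply cannot be compiled -- hence the assertion.
        */
       assert(nr_regs >= 1 && nr_regs < MAX_REGISTERS &&
              "Register count must be less than the maximum");

       /* Fewer live registers lets more threads stay resident (made-up curve). */
       return (struct occupancy){ .max_threads = 1024u * MAX_REGISTERS / nr_regs };
    }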
thelounge6065 has joined #asahi-gpu
as400 has quit [Remote host closed the connection]
ourdumbfuture has joined #asahi-gpu
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
odak_ has quit [Ping timeout: 480 seconds]
<lina>
alyssa: 3ec: c17fbf33 sample_mask 255, 63
<lina>
is that... right?
yamii has joined #asahi-gpu
yamii_ has quit [Read error: Connection reset by peer]
<alyssa>
that's fine
<alyssa>
are you still live?
<lina>
Yes
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
flibit has quit []
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
mlp has joined #asahi-gpu
mlp has quit []
mlp has joined #asahi-gpu
<lina>
alyssa: Do we really support GL_EXT_multisampled_render_to_texture already? That's the implicit resolve stuff, right?
<lina>
At least I remember we didn't do it for depth yet...
<alyssa>
for colour render targets it should Just Work
<alyssa>
for depth buffers I thought it did but I could've been wrong
<alyssa>
there are like, no tests for this ...
<alyssa>
there's an unmerged piglit that has problems on other drivers
<lina>
Steven says it's broken and that's why MSAA didn't work when I tested it earlier
<lina>
I can try turning it off and see if it fixes it...
mlp has quit []
mlp has joined #asahi-gpu
<alyssa>
I would believe it
<alyssa>
should be easy to fix hopefully if it's broken
<steven>
Yeah, Lina was noticing that the MSRTT path for MSAA in Darwinia didn't make any visual difference (and I saw that locally too). Using the traditional blit-to-resolve path instead works fine, of course
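(For reference, the two paths being compared, as a GLES sketch. The object names, the 4x sample count, and the assumption that the EXT entry point is already loaded are illustrative, not taken from Darwinia or the driver.)

    #include <GLES3/gl3.h>     /* core GLES 3.0: glBlitFramebuffer etc. */
    #include <GLES2/gl2ext.h>  /* GL_EXT_multisampled_render_to_texture */

    /* Path A: implicit resolve. A plain single-sampled texture is attached with
     * a sample count; the driver renders 4x MSAA in the tilebuffer and resolves
     * when the tile is stored, with no separate resolve pass.
     * (glFramebufferTexture2DMultisampleEXT is normally fetched through
     * eglGetProcAddress; shown as a direct call for brevity.)
     */
    static void
    attach_msrtt_color(GLuint fbo, GLuint color_tex)
    {
       glBindFramebuffer(GL_FRAMEBUFFER, fbo);
       glFramebufferTexture2DMultisampleEXT(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                                            GL_TEXTURE_2D, color_tex, 0, 4);
    }

    /* Path B: traditional blit-to-resolve. Render into explicitly multisampled
     * storage, then resolve with a blit into the single-sampled destination.
     */
    static void
    resolve_by_blit(GLuint msaa_fbo, GLuint resolved_fbo, GLsizei w, GLsizei h)
    {
       glBindFramebuffer(GL_READ_FRAMEBUFFER, msaa_fbo);
       glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolved_fbo);
       glBlitFramebuffer(0, 0, w, h, 0, 0, w, h, GL_COLOR_BUFFER_BIT, GL_NEAREST);
    }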
<alyssa>
if you could do that I'd appreciate it!
<lina>
BTW, the Ultra thing was an MSAA interaction; there may be more lurking... but I think this is the only one relevant for clustering. So if there's anything else with tiling and MSAA, you'll see it on M1.
<alyssa>
fun
<lina>
(Kernel side)
<alyssa>
regardless I can't help with msrtt
<alyssa>
if you don't/can't fix it, i'll merge your disable patch
<lina>
There is exactly one number calculated based on FB tile dimensions that is only relevant for clustering and it needed to be *samples
alyssa has left #asahi-gpu [#asahi-gpu]
<steven>
Is it functionally possible to specify a multisampled texture with a sample count of 1 (i.e. effectively making it single-sampled despite being GL_TEXTURE_2D_MULTISAMPLE)? If so, I kind of wonder if it's as simple as the sample count not propagating all the way through
<lina>
I just finished the kernel fix! Reviewing now ^^
* alyssa
running through CTS now
<steven>
alyssa: thank you! looking now
<alyssa>
as predicted, it's not a quickfix
<alyssa>
and I expect that code to regress performance somewhat
<alyssa>
we can recover that later. correctness first.
<alyssa>
lina: i must say i would not have had the persistence to work through the Darwinia side of this
<alyssa>
we make a good team, you and me :~)
<lina>
^^
<steven>
first line of lower_sample_mask_write is "return false" but there's a bunch of changes after it -- was that return intentional?
<alyssa>
probably not
<alyssa>
yep definitely not
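(The bug pattern steven is pointing at, boiled down. Only the function name and the stray return false come from the discussion; the struct, opcode, and body below are simplified stand-ins, not the actual Mesa pass.)

    #include <stdbool.h>

    struct instr { int op; };          /* stand-in IR instruction type */
    enum { OP_SAMPLE_MASK_WRITE = 1 }; /* hypothetical opcode */

    static bool
    lower_sample_mask_write(struct instr *instr, void *data)
    {
       return false; /* <-- leftover early return: everything below is dead code,
                        so the pass silently lowers nothing */

       if (instr->op != OP_SAMPLE_MASK_WRITE)
          return false;

       /* ...the actual lowering would go here... */
       (void)data;
       return true; /* report progress once something changed */
    }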
<lina>
alyssa: Confirmed it fixes Darwinia ^^
<alyssa>
lina: hold on i fixed the thing steven pointed out
<alyssa>
try again with the fix :-D
<alyssa>
(I repushed)
<steven>
thanks alyssa!
<alyssa>
cheers
<lina>
Still works!
<alyssa>
I'll revisit this post-CTS to deal with the performance hit
<alyssa>
but CTS seems happy with this if you are
possiblemeatball has joined #asahi-gpu
<alyssa>
so eMRT is up next
<alyssa>
but first, lunch
alyssa has quit [Quit: leaving]
nimprod3l has quit [Remote host closed the connection]
nimprod3l has joined #asahi-gpu
<lina>
Tested all the discard cases, works ^^
aafeke_ has joined #asahi-gpu
nimprod3l has quit [Quit: Leaving]
hightower2 has joined #asahi-gpu
aafeke_ has quit [Quit: aafeke_]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
cylm has quit [Ping timeout: 480 seconds]
ourdumbfuture has joined #asahi-gpu
possiblemeatball has quit [Quit: Quit]
as400 has joined #asahi-gpu
c10l484 has quit []
c10l484 has joined #asahi-gpu
alyssa has joined #asahi-gpu
<alyssa>
I'm combing through a trace of emrt this afternoon
<alyssa>
I have a good handle on the transform they're doing for the fragment shaders
<alyssa>
IDK if we'll do the same thing. But understanding first.
<alyssa>
it relies on a few hardware features to work:
<alyssa>
* special register 20 giving the core (or cluster? what's the difference?) index
<alyssa>
lina already found that one
<alyssa>
(and documented it)
<alyssa>
* special register 32 giving a sort of tile ID
<alyssa>
this one is pretty subtle
<alyssa>
in each core (cluster?), there can be multiple tiles being processed concurrently
<alyssa>
up to max_concurrent_tiles
<alyssa>
sr32 assigns an ID to each tile from 0 to max_concurrent_tiles - 1, such that each concurrently processed tile has a unique index
<alyssa>
* thread_position_in_threadgroup.xy defined in fragment shaders as relative to the current tile
<alyssa>
* threads_per_threadgroup defined in fragment shaders as the tile size
<alyssa>
* special register 60 giving a coverage mask that's affected by.. I guess z/s testing maybe?
<alyssa>
with all that information the actual lowering is nice and simple
<alyssa>
the driver allocates a buffer like:
<alyssa>
pixel_t buffer[# cores][max concurrent tiles per core][tile height][tile width];
<alyssa>
then instead of doing tilebuffer access, the fragment shader load/stores from
<alyssa>
buffer[core ID][concurrent tile ID][position in tile.y][position in tile.x]
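(Putting that addressing into plain C: the special register numbers and the layout are the ones just described, while the helper name, pixel type, and struct are illustrative.)

    #include <stdint.h>

    typedef uint32_t pixel_t; /* really sized per render target format */

    struct emrt_buffer {
       pixel_t *base;               /* [cores][max concurrent tiles][tile h][tile w] */
       uint32_t max_concurrent_tiles;
       uint32_t tile_w, tile_h;
    };

    /* core_id            = special register 20
     * concurrent_tile_id = special register 32 (0 .. max_concurrent_tiles - 1)
     * x, y               = thread_position_in_threadgroup.xy (pixel within tile)
     */
    static pixel_t *
    emrt_pixel(const struct emrt_buffer *b, uint32_t core_id,
               uint32_t concurrent_tile_id, uint32_t x, uint32_t y)
    {
       uint32_t tile = core_id * b->max_concurrent_tiles + concurrent_tile_id;
       return &b->base[(tile * b->tile_h + y) * b->tile_w + x];
    }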
<alyssa>
What's nice about this scheme?
<alyssa>
* That buffer is pretty small in practice: a few kilobytes per byte of the format.
<alyssa>
* It's naturally tiled, so the cache behaviour isn't horrible, although not as good as a real twiddled format
<alyssa>
* It's straightforward to address in software
c10l484 has quit []
<alyssa>
So... the more spicy half is how do you get that buffer back into the compressed render target
c10l484 has joined #asahi-gpu
<alyssa>
this seems to involve dispatching a tile shader, because of course it does T_T
<alyssa>
not going to go through this line by line
<alyssa>
but broadly:
<alyssa>
1. First it does a wait_pix 768, 3... the big barrier that's needed
<alyssa>
2. If thread_index_in_threadgroup != 0, skip 3. (That is, step 3 executes once per tile instead of once per pixel)
<alyssa>
3. For each render target in the tilebuffer, do an image_write_block interleaved with a memory_barrier 0, 2, 10 (f5a2). It's not clear to me why the barriers are needed.
<alyssa>
4. threadgroup (tile) barrier
<alyssa>
5. Load the pixel colour from the buffer
<alyssa>
6. Store the colour into the tilebuffer, with a special sample mask source I haven't decoded
Guest3100 has joined #asahi-gpu
<alyssa>
7. Another threadgroup/tile barrier (to make sure all colours are written before proceeding)
<alyssa>
8. If thread_index_in_threadgroup != 0, skip 9. (That is, step 9 executes once per tile instead of once per pixel)
Guest3100 has quit [Remote host closed the connection]
<alyssa>
9. image_write_block from the tilebuffer to the spilled render target
<alyssa>
10. done
Malaph has joined #asahi-gpu
<alyssa>
what makes this a bit tricky is that we don't have tile shaders implemented, and neither GL nor VK has tile shaders, so I'm not inclined to change that until we have a 'real' use case
<alyssa>
the only thing tile shaders are actually used for here (as opposed to regular fragment shaders) is the tile barriers
<alyssa>
although we could use multiple draws to avoid the barriers. Slower, but whatever, that's not the bottleneck here :-p
<alyssa>
store pipeline becomes empty at any rate
<alyssa>
clear pipeline has the same fragment transform applied of course
<alyssa>
partial reload is the standard partial reload + usual fragment transform
<alyssa>
partial store is the spiciest
<alyssa>
First, it image_write_blocks each non-spilled target with the barriers interleaved as before
<alyssa>
then it loops over each pixel in the tile and st_tiles it with magic sources I don't understand yet