ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
Z750 has quit [Quit: Ping timeout (120 seconds)]
Z750 has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
nsklaus has quit [Ping timeout: 480 seconds]
thelounge606 has quit [Remote host closed the connection]
cr1901 has quit [Quit: Leaving]
cr1901 has joined #asahi-gpu
cr1901 has quit []
cr1901 has joined #asahi-gpu
cr1901 has quit []
cr1901 has joined #asahi-gpu
lonjil2 has quit []
lonjil has joined #asahi-gpu
cr1901_ has joined #asahi-gpu
cr1901_ has quit [Remote host closed the connection]
possiblemeatball has quit [Quit: Quit]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
cr1901 has quit [Read error: Connection reset by peer]
cr1901 has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
odak_ has quit [Quit: odak_]
ourdumbfuture has joined #asahi-gpu
pyropeter3 has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
pyropeter2 has quit [Ping timeout: 480 seconds]
amarioguy has quit [Remote host closed the connection]
odak_ has joined #asahi-gpu
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
hightower2 has joined #asahi-gpu
mlp has quit [Read error: Connection reset by peer]
flibit has joined #asahi-gpu
<lina>
Looked at the Emacs issue and it's Xwayland/X11... I think it's another rendering loop, but Xorg doesn't like running under apitrace so I can't really check...
flibitijibibo has quit [Ping timeout: 480 seconds]
<lina>
alyssa: Have you ever used apitrace with X?
nimprod3l has joined #asahi-gpu
zzywysm has quit [Ping timeout: 480 seconds]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
nimprod3l has quit [Quit: Leaving]
<lina>
Figured it out... it's a driver bug
<lina>
We claim to support texture barriers but we don't (and can't without special handling)...
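(For context, this is what the texture-barrier pattern looks like from the API side: a minimal sketch assuming a desktop GL 4.5 core context (ARB_texture_barrier) with entry points already loaded; the helper names and object handles are hypothetical, not taken from the Emacs/Xwayland trace.)

    #include <GL/glcorearb.h>  /* prototypes assumed provided by the app's GL loader */

    /* Stand-ins for the application's own draw calls. */
    static void draw_pass_writing(void) { /* draw call that writes texels of tex */ }
    static void draw_pass_reading(void) { /* draw call that samples those texels */ }

    static void
    feedback_loop(GLuint fbo, GLuint tex)
    {
       /* The same texture is both the colour attachment of fbo and a sampler
        * source for the shader -- a rendering feedback loop.
        */
       glBindFramebuffer(GL_FRAMEBUFFER, fbo);
       glBindTexture(GL_TEXTURE_2D, tex);

       draw_pass_writing();
       glTextureBarrier();   /* make the writes above visible to texture fetches */
       draw_pass_reading();
    }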
bcrumb has quit [Quit: WeeChat 3.8]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
bcrumb has quit [Quit: WeeChat 3.8]
bcrumb has joined #asahi-gpu
bcrumb has quit []
bcrumb has joined #asahi-gpu
<lina>
firefox: ../src/asahi/compiler/agx_performance.c:30: agx_occupancy_for_register_count: Assertion `!"" "Register count must be less than the maximum"' failed.
<lina>
alyssa: Is that one expected?
bcrumb has quit [Quit: WeeChat 3.8]
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
nsklaus has joined #asahi-gpu
cylm has joined #asahi-gpu
alyssa has joined #asahi-gpu
<alyssa>
lina: sleepy nya nya nya nya nya nya nya nya nya bat man
<alyssa>
07:27 <lina> alyssa: Have you ever used apitrace with X?
<alyssa>
Yeah there's some awful incantation to do it that I can never remember
<alyssa>
08:46 <lina> firefox: ../src/asahi/compiler/agx_performance.c:30: agx_occupancy_for_register_count: Assertion `!"" "Register count must be less than the maximum"' failed.
<alyssa>
I admit that's a terrible error message but what that means is that "this shader needs to spill registers and there's no spilling implemented"
<alyssa>
firefox has worked fine for me so I wonder how that one reproduced
<alyssa>
like I believe it, some webrender shaders are chunky
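(To unpack that: the assert fires when a shader's register demand exceeds the hardware limit and there is no spiller to fall back on. A rough sketch of the shape of such a check; the limit, the struct, and the occupancy curve below are placeholders, not the real values in agx_performance.c.)

    #include <assert.h>

    #define MAX_REGISTERS 256 /* placeholder, not the real AGX limit */

    struct occupancy {
       unsigned max_threads; /* threads resident per core at this register count */
    };

    static struct occupancy
    occupancy_for_register_count(unsigned nr_regs)
    {
       /* Without a register spiller, a shader needing more registers than the
        * hardware provides simply cannot be compiled -- hence the assertion.
        */
       assert(nr_regs >= 1 && nr_regs < MAX_REGISTERS &&
              "Register count must be less than the maximum");

       /* Fewer live registers lets more threads stay resident (made-up curve). */
       return (struct occupancy){ .max_threads = 1024u * MAX_REGISTERS / nr_regs };
    }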
thelounge6065 has joined #asahi-gpu
as400 has quit [Remote host closed the connection]
ourdumbfuture has joined #asahi-gpu
odak_ has quit [Quit: odak_]
odak_ has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
odak_ has quit [Ping timeout: 480 seconds]
<lina>
alyssa: 3ec: c17fbf33 sample_mask 255, 63
<lina>
is that... right?
yamii has joined #asahi-gpu
yamii_ has quit [Read error: Connection reset by peer]
<alyssa>
that's fine
<alyssa>
are you still live?
<lina>
Yes
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
flibit has quit []
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
mlp has joined #asahi-gpu
mlp has quit []
mlp has joined #asahi-gpu
<lina>
alyssa: Do we really support GL_EXT_multisampled_render_to_texture already? That's the implicit resolve stuff, right?
<lina>
At least I remember we didn't do it for depth yet...
<alyssa>
for colour render targets it should Just Work
<alyssa>
for depth buffers I thought it did but I could've been wrong
<alyssa>
there are like, no tests for this ...
<alyssa>
there's an unmerged piglit that has problems on other drivers
<lina>
Steven says it's broken and that's why MSAA didn't work when I tested it earlier
<lina>
I can try turning it off and see if it fixes it...
mlp has quit []
mlp has joined #asahi-gpu
<alyssa>
I would believe it
<alyssa>
should be easy to fix hopefully if it's broken
<steven>
Yeah, Lina was noticing that the MSRTT path for MSAA in Darwinia didn't make any visual difference (and I saw that locally too). Using the traditional blit-to-resolve path instead works fine, of course
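(For reference, the two paths being compared, as a GLES sketch. The object names, the 4x sample count, and the assumption that the EXT entry point is already loaded are illustrative, not taken from Darwinia or the driver.)

    #include <GLES3/gl3.h>     /* core GLES 3.0: glBlitFramebuffer etc. */
    #include <GLES2/gl2ext.h>  /* GL_EXT_multisampled_render_to_texture */

    /* Path A: implicit resolve. A plain single-sampled texture is attached with
     * a sample count; the driver renders 4x MSAA in the tilebuffer and resolves
     * when the tile is stored, with no separate resolve pass.
     * (glFramebufferTexture2DMultisampleEXT is normally fetched through
     * eglGetProcAddress; shown as a direct call for brevity.)
     */
    static void
    attach_msrtt_color(GLuint fbo, GLuint color_tex)
    {
       glBindFramebuffer(GL_FRAMEBUFFER, fbo);
       glFramebufferTexture2DMultisampleEXT(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                                            GL_TEXTURE_2D, color_tex, 0, 4);
    }

    /* Path B: traditional blit-to-resolve. Render into explicitly multisampled
     * storage, then resolve with a blit into the single-sampled destination.
     */
    static void
    resolve_by_blit(GLuint msaa_fbo, GLuint resolved_fbo, GLsizei w, GLsizei h)
    {
       glBindFramebuffer(GL_READ_FRAMEBUFFER, msaa_fbo);
       glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolved_fbo);
       glBlitFramebuffer(0, 0, w, h, 0, 0, w, h, GL_COLOR_BUFFER_BIT, GL_NEAREST);
    }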
<alyssa>
if you could do that I'd appreciate it!
<lina>
BTW, the Ultra thing was an MSAA interaction; there may be more lurking... but I think this is the only one relevant for clustering. So if there's anything else with tiling and MSAA, you'll see it on M1.
<alyssa>
fun
<lina>
(Kernel side)
<alyssa>
regardless I can't help with msrtt
<alyssa>
if you don't/can't fix it, i'll merge your disable patch
<lina>
There is exactly one number calculated based on FB tile dimensions that is only relevant for clustering and it needed to be *samples
alyssa has left #asahi-gpu [#asahi-gpu]
<steven>
Is it functionally possible to specify a multisampled texture with a sample count of 1 (i.e. effectively making it single-sampled despite being GL_TEXTURE_2D_MULTISAMPLE)? If so, I kind of wonder if it's as simple as the sample count not propagating all the way through
<lina>
I just finished the kernel fix! Reviewing now ^^
* alyssa
running through CTS now
<steven>
alyssa: thank you! looking now
<alyssa>
as predicted, it's not a quickfix
<alyssa>
and I expect that code to regress performance somewhat
<alyssa>
we can recover that later. correctness first.
<alyssa>
lina: i must say i would not have had the persistence to work through the Darwinia side of this
<alyssa>
we make a good team, you and me :~)
<lina>
^^
<steven>
first line of lower_sample_mask_write is "return false" but there's a bunch of changes after it -- was that return intentional?
<alyssa>
probably not
<alyssa>
yep definitely not
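(The bug pattern steven is pointing at, boiled down. Only the function name and the stray return false come from the discussion; the struct, opcode, and body below are simplified stand-ins, not the actual Mesa pass.)

    #include <stdbool.h>

    struct instr { int op; };          /* stand-in IR instruction type */
    enum { OP_SAMPLE_MASK_WRITE = 1 }; /* hypothetical opcode */

    static bool
    lower_sample_mask_write(struct instr *instr, void *data)
    {
       return false; /* <-- leftover early return: everything below is dead code,
                        so the pass silently lowers nothing */

       if (instr->op != OP_SAMPLE_MASK_WRITE)
          return false;

       /* ...the actual lowering would go here... */
       (void)data;
       return true; /* report progress once something changed */
    }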
<lina>
alyssa: Confirmed it fixes Darwinia ^^
<alyssa>
lina: hold on i fixed the thing steven pointed out
<alyssa>
try again with the fix :-D
<alyssa>
(I repushed)
<steven>
thanks alyssa!
<alyssa>
cheers
<lina>
Still works!
<alyssa>
I'll revisit this post-CTS to deal with the performance hit
<alyssa>
but CTS seems happy with this if you are
possiblemeatball has joined #asahi-gpu
<alyssa>
so eMRT is up next
<alyssa>
but first, lunch
alyssa has quit [Quit: leaving]
nimprod3l has quit [Remote host closed the connection]
nimprod3l has joined #asahi-gpu
<lina>
Tested all the discard cases, works ^^
aafeke_ has joined #asahi-gpu
nimprod3l has quit [Quit: Leaving]
hightower2 has joined #asahi-gpu
aafeke_ has quit [Quit: aafeke_]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
cylm has quit [Ping timeout: 480 seconds]
ourdumbfuture has joined #asahi-gpu
possiblemeatball has quit [Quit: Quit]
as400 has joined #asahi-gpu
c10l484 has quit []
c10l484 has joined #asahi-gpu
alyssa has joined #asahi-gpu
<alyssa>
I'm combing through a trace of emrt this afternoon
<alyssa>
I have a good handle on the transform they're doing for the fragment shaders
<alyssa>
IDK if we'll do the same thing. But understanding first.
<alyssa>
it relies on a few hardware features to work:
<alyssa>
* special register 20 giving the core (or cluster? what's the difference?) index
<alyssa>
lina already found that one
<alyssa>
(and documented it)
<alyssa>
* special register 32 giving a sort of tile ID
<alyssa>
this one is pretty subtle
<alyssa>
in each core (cluster?), there can be multiple tiles being processed concurrently
<alyssa>
up to max_concurrent_tiles
<alyssa>
sr32 assigns an ID to each tile from 0 to max_concurrent_tiles - 1, such that each concurrently processed tile has a unique index
<alyssa>
* thread_position_in_threadgroup.xy defined in fragment shaders as relative to the current tile
<alyssa>
* threads_per_threadgroup defined in fragment shaders as the tile size
<alyssa>
* special register 60 giving a coverage mask that's affected by.. I guess z/s testing maybe?
<alyssa>
with all that information the actual lowering is nice and simple
<alyssa>
the driver allocates a buffer like:
<alyssa>
pixel_t buffer[# cores][max concurrent tiles per core][tile height][tile width];
<alyssa>
then instead of doing tilebuffer access, the fragment shader load/stores from
<alyssa>
buffer[core ID][concurrent tile ID][position in tile.y][position in tile.x]
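(Putting that addressing into plain C: the special register numbers and the layout are the ones just described, while the helper name, pixel type, and struct are illustrative.)

    #include <stdint.h>

    typedef uint32_t pixel_t; /* really sized per render target format */

    struct emrt_buffer {
       pixel_t *base;               /* [cores][max concurrent tiles][tile h][tile w] */
       uint32_t max_concurrent_tiles;
       uint32_t tile_w, tile_h;
    };

    /* core_id            = special register 20
     * concurrent_tile_id = special register 32 (0 .. max_concurrent_tiles - 1)
     * x, y               = thread_position_in_threadgroup.xy (pixel within tile)
     */
    static pixel_t *
    emrt_pixel(const struct emrt_buffer *b, uint32_t core_id,
               uint32_t concurrent_tile_id, uint32_t x, uint32_t y)
    {
       uint32_t tile = core_id * b->max_concurrent_tiles + concurrent_tile_id;
       return &b->base[(tile * b->tile_h + y) * b->tile_w + x];
    }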
<alyssa>
What's nice about this scheme?
<alyssa>
* That buffer is pretty small in practice: a few kilobytes per byte of the format.
<alyssa>
* It's naturally tiled, so the cache behaviour isn't horrible, although not as good as a real twiddled format
<alyssa>
* It's straightforward to address in software
c10l484 has quit []
<alyssa>
So... the more spicy half is how do you get that buffer back into the compressed render target
c10l484 has joined #asahi-gpu
<alyssa>
this seems to involve dispatching a tile shader, because of course it does T_T
<alyssa>
not going to go through this line by line
<alyssa>
but broadly:
<alyssa>
1. First it does a wait_pix 768, 3... the big barrier that's needed
<alyssa>
2. If thread_index_in_threadgroup != 0, skip 3. (That is, step 3 executes once per tile instead of once per pixel)
<alyssa>
3. For each render target in the tilebuffer, do an image_write_block interleaved with a memory_barrier 0, 2, 10 (f5a2). It's not clear to me why the barriers are needed.
<alyssa>
4. threadgroup (tile) barrier
<alyssa>
5. Load the pixel colour from the buffer
<alyssa>
6. Store the colour into the tilebuffer, with a special sample mask source I haven't decoded
Guest3100 has joined #asahi-gpu
<alyssa>
7. Another threadgroup/tile barrier (to make sure all colours are written before proceeding)
<alyssa>
8. If thread_index_in_threadgroup != 0, skip 9. (That is, step 9 executes once per tile instead of once per pixel)
Guest3100 has quit [Remote host closed the connection]
<alyssa>
9. image_write_block from the tilebuffer to the spilled render target
<alyssa>
10. done
Malaph has joined #asahi-gpu
<alyssa>
what makes this a bit tricky is that we don't have tile shaders implemented, and neither GL nor VK has tile shaders, so I'm not inclined to change that until we have a 'real' use case
<alyssa>
the only thing tile shaders are actually used for here (as opposed to regular fragment shaders) is the tile barriers
<alyssa>
although we could use multiple draws to avoid the barriers. Slower, but whatever, that's not the bottleneck here :-p
<alyssa>
store pipeline becomes empty at any rate
<alyssa>
clear pipeline has the same fragment transform applied of course
<alyssa>
partial reload is the standard partial reload + usual fragment transform
<alyssa>
partial store is the spiciest
<alyssa>
First, it image_write_blocks each non-spilled target with the barriers interleaved as before
<alyssa>
then it loops over each pixel in the tile and st_tiles it with magic sources I don't understand yet