#asahi-gpu on 2021-04-11 — irc logs at oftc.irclog.whitequark.org

2021-01-11 09:46 marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu

00:27 <DarkShadow44> bloom: Even for assembling shaders?

01:16 <bloom> TBD

01:16 <bloom> still experimenting

01:16 <bloom> Nothing is set in stone yet

01:21 <DarkShadow44> bloom: okay then

01:22 <DarkShadow44> Because I still have my proposed architecture from my asmtool

01:23 <bloom> Okey dokey

01:26 <DarkShadow44> would be good to have feedback before I implement too much though, don't wanna duplicate the effort

01:27 <bloom> Duplicating effort?

01:27 <bloom> I thought we were just having fun :)

01:27 zkrx has quit [Ping timeout: 240 seconds]

01:28 <DarkShadow44> heh, sure

01:29 <DarkShadow44> but making something useful in the process would be nice too :P

01:30 odmir has quit [Remote host closed the connection]

01:31 odmir has joined #asahi-gpu

01:36 <bloom> I guess

01:36 odmir has quit [Ping timeout: 268 seconds]

01:43 zkrx has joined #asahi-gpu

01:45 <bloom> Anyway uhm

01:45 <bloom> I started writing the compiler

01:45 <bloom> have most of the big ideas sketched in my head, we'll see how long it takes me to get to simple shaders

01:48 <DarkShadow44> bloom: compiler, like in from NIR to bytecode?

01:52 <bloom> NIR to real code, yes

01:52 <DarkShadow44> directly or using an IR?

02:03 odmir has joined #asahi-gpu

02:14 <dougall> bloom: oh, wow, that's awesome! fun to see how the goal of generating html turned into the possibility of generating xml/c

02:16 <dougall> (i'll try to remember what i was doing and make a little progress on memory ops today)

02:34 odmir has quit [Remote host closed the connection]

02:34 odmir has joined #asahi-gpu

02:39 <bloom> dougall: yeah, still trying to figure out what the Right way to approach it is

02:39 <bloom> I suspect whatever I did today is Not it

02:44 <dougall> haha yeah, applegpu.py doesn't feel Right either - i'd like to rewrite it a bit more deliberately, but it's enough to make RE and doc progress for now

02:46 <bloom> nod

02:47 <bloom> I suspect we don't want emulation, or the Python disasm/asm stack, in Mesa.

02:47 <bloom> Just want the actual encoding details, so then we can spit out C code to do what we need

02:48 <bloom> but not sure if there's much point in defining an intermediate XML schema for the sake of it, or trying to wire up robclark's isaspec (which will have impedance mismatches of its own)

02:50 <dougall> yeah, the python definitely feels like a bad fit for mesa - i hadn't seen isaspec, but it sounds like it has the right goals... hmm

02:56 <bloom> there's also a question of what level of magic is appropriate

02:57 <bloom> for the compilers at work, I have full visibility into the ISA so I can do proper top-down design.. so having one master ISA.xml that everything is generated from works well

02:57 phiologe has quit [Ping timeout: 250 seconds]

02:57 phiologe has joined #asahi-gpu

02:58 <bloom> for hw with an active r/e effort, that may backfire. otoh, being able to add instructions easily matters far more for r/e!

03:04 <bloom> there's likely a healthy balance..

03:04 <dougall> yeah - it's definitely a strange challenge - after thinking on it, i don't have a clear opinion either way...

03:06 <dougall> i like less magic, i like declarative descriptions, and either will potentially have to be updated as r/e progresses

03:08 <bloom> "i like less magic, i like declarative descriptions," unfortunately writing in languages designed in the 70s means these are mutually ^

03:08 <dougall> hahaha

03:10 <bloom> srs :(

03:11 * bloom tries to wrap her head around apple control flow

03:12 <dougall> yeah, it's a bit crazy - did you see this thread? (scroll up) https://twitter.com/dougallj/status/1361063810493083648

03:13 <bloom> oh!

03:16 <dougall> (i don't know if apple compiler implementation specific stuff belongs in the docs, but the fact that you kind of have to re-structure your cfg to compile C was an intuition of Erika's that i didn't have at all)

03:16 * bloom nods

03:17 <bloom> I expect this will have big implications on how the IR will be structured

03:17 <bloom> The CFG we get from NIR is very simple, which is great for normal hw with actual branch instructions, but...

03:18 <bloom> (Basically we get `if { ... }` `loop { .. }` `break` and `continue` and that's it. while loops become things like `loop { if not cond { break } .. }`

03:18 <bloom> )
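
Editor's sketch: the `while` lowering bloom describes above can be illustrated with a small rewrite over a toy tuple-based tree. This is not NIR's actual data structure or API. The node shapes (`("while", cond, body)`, `("loop", body)`, `("if", cond, body)`, `("break",)`) and the name `lower_while` are hypothetical, chosen only for illustration.

```python
def lower_while(node):
    """Rewrite every ('while', cond, body) node into a bare loop with a guarded break.

    Hypothetical toy tree, not NIR:
      ('while', cond, body) -> ('loop', [('if', ('not', cond), [('break',)])] + body)
    """
    kind = node[0]
    if kind == "while":
        _, cond, body = node
        # Leave the loop as soon as the condition no longer holds.
        guard = ("if", ("not", cond), [("break",)])
        return ("loop", [guard] + [lower_while(stmt) for stmt in body])
    if kind == "if":
        _, cond, body = node
        return ("if", cond, [lower_while(stmt) for stmt in body])
    if kind == "loop":
        return ("loop", [lower_while(stmt) for stmt in node[1]])
    # Leaf statements ('break', 'continue', plain ops) pass through unchanged.
    return node


lowered = lower_while(("while", "x", [("op", "foo")]))
print(lowered)
```

After this pass only `if`, `loop`, `break` and `continue` remain, which is the restricted control flow described in the message above.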

03:19 <bloom> if { ..} corresponds naturally to if_cmp/pop_exec.. the other 3 not so much

03:19 <krbtgt> WRT the earlier convo on basically starting with GPU RE; it is tricky for sure, and i'm certainly an idiot at it. me and a friend were looking at REing ancient SGI GPUs and we found quite a few things out but didn't get as far as say disassembly

03:20 <bloom> loop { ... } I guess is just `*: ...; jmp_exec_any *`

03:22 <dougall> yeah break and loop aren't bad (if you look at metal compiler output)... break is an icmpsel to r0l then pop_exec 0 (to update execmask)

03:23 <dougall> not sure about "continue"/"else" though

03:23 <bloom> oh, you can update r0l directly, right :p

03:23 <dougall> yeah - break is the only place i've seen that done, but that was really helpful to figure out what on earth was going on

03:25 <dougall> (you can only update r0l on threads that are active, which is either obvious or confusing)

03:25 <bloom> can it be both obvious and confusing?

03:25 <dougall> haha yep

03:25 <bloom> all of this seems both intuitively obvious and completely impossible to wrap my head around :-p

03:27 <bloom> something else that may be either obvious or confusing -- for optimal perf, the compiler needs to do full divergence analysis

03:28 <bloom> (I think)

03:28 * bloom tries to come up with example

03:29 <dougall> 'full * analysis' does sound optimal, but that mostly goes over my head - is that the uniform hoisting optimisation thing?

03:30 <bloom> "Divergence analysis" ~~calculates the sums of partial derivatives of a vector field~~ determines what values might differ across threads in a SIMD group

03:30 <dougall> hahaha

03:30 <bloom> All the silliness is needed for divergent (data-dependent) branches "if (some_varying) { ... }"

03:31 <bloom> But if we can prove at compile-time that all the threads in a given SIMD group will branch together, there's often optimizations to be had

03:32 <bloom> (Most obvious case is "if (uniform) { .. }" but also less obviously "if (ballot(..)) { ... }"

03:32 <bloom> )
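
Editor's sketch: a minimal version of the divergence analysis described above, written as a fixed-point pass over a toy SSA-like instruction list. Everything here is an assumption made for illustration: the `(dest, op, sources)` instruction shape, the name `divergent_values`, and the choice to treat `ballot` as producing a uniform result (following the remark above). Control dependence (values defined under a divergent branch) is deliberately ignored, so this is not a complete analysis.

```python
def divergent_values(instrs, divergent_sources, convergent_ops=frozenset({"ballot"})):
    """Return the set of value names that may differ across threads in a SIMD group.

    instrs: list of (dest, op, sources) tuples (hypothetical toy IR).
    divergent_sources: names that are divergent by definition, e.g. a thread ID.
    convergent_ops: ops whose result is the same on every thread
                    even when their inputs are divergent.
    """
    divergent = set(divergent_sources)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for dest, op, sources in instrs:
            if dest in divergent or op in convergent_ops:
                continue
            # Data dependence: any divergent input makes the result divergent.
            if any(src in divergent for src in sources):
                divergent.add(dest)
                changed = True
    return divergent


program = [
    ("a", "add", ["tid", "u0"]),   # depends on the thread ID
    ("b", "mul", ["u0", "u1"]),    # only uniform inputs
    ("c", "ballot", ["a"]),        # same result on every thread
    ("d", "add", ["c", "b"]),
]
print(sorted(divergent_values(program, {"tid"})))
```

A branch on `b`, `c` or `d` in this toy program could then be treated as uniform, while a branch on `a` could not.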

03:32 <bloom> Although I'm not actually seeing a good way to use that info here, hm.

03:33 <bloom> if (varying) { expensive() } ----> if_icmp / expensive() / pop_exec

03:33 <bloom> if (uniform) { expensive() } ----> if_icmp / jmp_exec_none * / expensive() / * pop_exec

03:35 <bloom> This doesn't really prove anything since the latter compile is always legal, just not always optimal.
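
Editor's sketch: a toy model of the two lowerings above. It assumes a per-thread nesting counter in `r0l` where 0 means the thread is active, which matches how `r0l` is talked about in this log. The exact update rules used below for the `if` and for `pop_exec` are an assumption for illustration, not confirmed hardware behaviour, and the Python names (`if_cond`, `pop_exec`, `active_mask`, `run_if`) are hypothetical.

```python
def if_cond(r0l, cond, n=1):
    """Toy 'if': active threads that fail cond go inactive, inactive ones nest deeper."""
    out = []
    for depth, taken in zip(r0l, cond):
        if depth > 0:
            out.append(depth + n)   # already inactive: one level deeper
        elif not taken:
            out.append(n)           # active but condition false: deactivate
        else:
            out.append(0)           # stays active
    return out


def pop_exec(r0l, n=1):
    """Toy 'pop_exec n': leave n nesting levels, never going below 0."""
    return [max(0, depth - n) for depth in r0l]


def active_mask(r0l):
    return [depth == 0 for depth in r0l]


def run_if(r0l, cond, skip_when_empty):
    """Model 'if / [jump if no thread is active] / body / pop_exec'.

    Returns the final counters and whether the body was executed at all.
    """
    r0l = if_cond(r0l, cond)
    ran_body = not (skip_when_empty and not any(active_mask(r0l)))
    return pop_exec(r0l), ran_body


# With the skip, an all-false condition avoids the body entirely.
print(run_if([0, 0, 0, 0], [False] * 4, skip_when_empty=True))
print(run_if([0, 0, 0, 0], [False] * 4, skip_when_empty=False))
```

In this model both forms leave the counters in the same state, which is why the form with the extra jump is always legal. It only pays off when every thread can fail the condition together.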

03:36 <dougall> yeah jmp_exec_none/any are the only conditional branches i know of, which should benefit from uniformity either way... but apple does love proving things are uniform and hoisting them into a shader that only runs once

03:36 <bloom> ^ indeed, and more aggressively than any other gpu i've played with

03:42 <dougall> (i should look into how they handle convergence (ballot/etc) more - it does sound like an optimisation opportunity, and maybe i'm missing some instructions/flags there)

03:43 <bloom> I still don't see what while_*cmp is for

03:45 <bloom> while(x) { ... } ---> start: icmpsel r0l, x, #0, #1; pop_exec 0; jmp_exec_none end; ....; jmp_exec_any start; pop_exec 0;

03:45 <bloom> seems a very natural translation

03:47 <bloom> oh but that's broken for nested loops

03:47 <bloom> er is it?

03:48 <bloom> if icmpsel runs on a thread, by definition r0l = 0, so that's fine

03:48 <dougall> yeah - it's not entirely clear in my mind... I think their while loop codegen starts with a push_exec 2... you're missing an initial push/final pop of some kind

03:49 <bloom> the last pop should've been n=1, sorry

03:50 <bloom> push_exec isn't documented but that's just a special case of if_icmp, ok

03:50 <dougall> (yeah, icmpsel #1 is more like a break than an 'if' - like if you have "if { your while } else { foo }" the "1"s and the else threads have the same value)

03:51 <bloom> ohhhh

03:51 <bloom> yikes, ok, right. thanks, knew I was doing something wrong

03:55 <dougall> yeah - it wasn't easy figuring out what these instructions do, but figuring out how to use them from what they do feels trickier (using metal compiler idioms seems easier, but it'd be good to understand it properly)

03:56 <bloom> nod

03:56 <bloom> so in NIR, we'd have `if { loop { if { break } foo } } else { bar }`,

03:56 <bloom> peeling that back to Metal idioms will be one of the more memorable parts of compiler bringup ;)

03:56 * bloom tries on pen/paper

04:09 TheJollyRoger has quit [Remote host closed the connection]

04:10 TheJollyRoger has joined #asahi-gpu

04:14 <bloom> dougall: for that particular example, I think you could do:

04:15 <bloom> if_icmp 1; * icmpsel r0l, #2, #0; update_exec; ....; jmp_exec_any *; else_cmp 1; ...; pop_exec 2;

04:16 <bloom> but I fully expect even if this works, it's a hack

04:21 <bloom> actually, is it?

04:24 <dougall> hmm - i want to write an emulator to test that, although i guess i kinda did (maybe? i don't think i implemented jumps)... but yeah, i can't be sure either way at a glance

04:25 odmir has quit [Remote host closed the connection]

04:26 odmir has joined #asahi-gpu

04:30 odmir has quit [Ping timeout: 240 seconds]

05:26 tomtastic has quit [Ping timeout: 240 seconds]

05:28 tomtastic has joined #asahi-gpu

06:54 mxw39 has quit [Quit: Konversation terminated!]

09:37 angelXwind has quit [Ping timeout: 240 seconds]

09:47 akemin_dayo has joined #asahi-gpu

10:05 Necrosporus has quit [Ping timeout: 240 seconds]

10:45 Necrosporus has joined #asahi-gpu

14:48 <bloom> dougall: After giving it some thought last night i believe this would work, but it would break if there was something after the loop before the else, etc

14:49 <bloom> You really need push_exec in general, unless you can guarantee no divergence

14:51 <glibc> ramping up on Metal... I can compile .metallib files, which I understand are LLVM IR containers. I can compile them in a metal app; how can I get the output binary?

15:10 * bloom intends on a full SSA backend compiler

15:10 <bloom> The ISA is regular enough and RA requirements simple enough it should be doable

15:52 <bloom> Ewwwwwwww

15:55 <bloom> dougall: not sure if you've figured this out yet, but it looks like the `i` bit on wait corresponds to the `u2` immediate on device_load

15:55 <bloom> to allow a more fine grained scoreboarding

15:55 <bloom> so the metal compiler can output code like

15:55 <bloom> `load 0, load 1, wait 0, do something with 0, wait 1, do something with 1`

15:55 <bloom> and then the load 1 can be pipelined with the `do something with 0`
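
Editor's sketch: a toy timing model of why per-slot waits help. It is not a model of the real hardware. The cycle costs, the per-load latencies, the `wait_all` pseudo-op and the name `total_cycles` are all assumptions made up for illustration. Only the `load N` / `wait N` pairing comes from the messages above.

```python
def total_cycles(program):
    """Count cycles for a straight-line program in a toy in-order model.

    Instructions (hypothetical encoding):
      ('load', slot, latency)  issue in 1 cycle, data ready `latency` cycles after issue
      ('wait', slot)           stall until that slot's load has completed
      ('wait_all',)            stall until every outstanding load has completed
      ('alu', cost)            take `cost` cycles
    """
    cycle = 0
    ready_at = {}  # slot -> cycle at which its data arrives
    for instr in program:
        op = instr[0]
        if op == "load":
            _, slot, latency = instr
            ready_at[slot] = cycle + latency
            cycle += 1
        elif op == "wait":
            cycle = max(cycle, ready_at.pop(instr[1], 0))
        elif op == "wait_all":
            cycle = max([cycle, *ready_at.values()])
            ready_at.clear()
        elif op == "alu":
            cycle += instr[1]
        else:
            raise ValueError(f"unknown op: {op}")
    return cycle


loads = [("load", 0, 10), ("load", 1, 30)]
fine = loads + [("wait", 0), ("alu", 5), ("wait", 1), ("alu", 5)]
coarse = loads + [("wait_all",), ("alu", 5), ("alu", 5)]
print(total_cycles(fine), total_cycles(coarse))
```

With per-slot waits the work on slot 0 overlaps the slower load into slot 1, so the fine-grained schedule finishes earlier in this model.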

16:00 josiahmendes[m] has quit [Quit: Idle for 30+ days]

16:03 <bloom> "Ewwwwwwww" was at AGX architecturally lacking support for basic vertex attributes.

16:06 <bloom> ....why is `stack_adjust/stack_store/stack_load` the fastest way to convert i32 to u8norm...

16:27 odmir has joined #asahi-gpu

16:31 odmir has quit [Ping timeout: 240 seconds]

16:33 <glibc> I guess I can build a binary archive and serialize it, but I wonder if there is an easier way to get the raw compiled binaries.

17:12 tmlind has joined #asahi-gpu

17:22 odmir_ has joined #asahi-gpu

17:33 <DarkShadow44> glibc: What exactly do you want to do?

17:51 <glibc> I wanted a .metal -> .disasm workflow, which I now have :)

17:57 <DarkShadow44> glibc: Alright, otherwise I could have offered you my version. Do you also use this code https://github.com/DarkShadow44/AsahiLinux-gpu/blob/asmtool/asmtool/asmtool/metallib.c ?

18:03 <glibc> no, I wasn't aware of this -- but this looks quite nice :) Thanks for the link.

18:04 <DarkShadow44> FWIW, I just copied that - but also integrated it into my tool - so you can just feed it source and it outputs disassembly

18:05 <glibc> cool, that's exactly what I wanted. (I was still missing a clean method for extracting the raw payload)

18:10 <DarkShadow44> I take it, you're also trying to re the ISA?

18:11 <glibc> taking humble baby steps, but yeah

18:16 <DarkShadow44> heh, wishing you the best of luck! :)

18:18 * bloom has always ignored the 'nice' methods of doing this, preferring to just savagely parse GPU memory at runtime

18:19 <DarkShadow44> bloom: You can extract shaders from GPU memory after they've been uploaded?

18:21 <glibc> DarkShadow44: thanks :)

18:21 <DarkShadow44> glibc: One more thing I forgot: Keep in mind that the metallib way of getting a shader produces different bytecode from loading a shader normally

18:22 <DarkShadow44> they're compiled as "functions" (for lack of a better word) instead of a standalone program

18:22 <DarkShadow44> so if you try to run them, you better check they load/store data at the right offsets :P

18:22 <DarkShadow44> ...if you want to run them in the first place

18:23 <bloom> DarkShadow44: yep.

18:23 <bloom> and it avoids ^^ those sorts of issues

18:23 <bloom> at the cost of introducing a slew of other issues..

18:24 <DarkShadow44> cool, I kinda thought once it's uploaded you can't access it anymore

18:24 <DarkShadow44> downloading that is kinda savage indeed, but I like the idea ^

18:29 <DarkShadow44> bloom: Say, do you know if isaspec from mesa can be used to create an assembler? Because from reading their docs it seems like this is not yet the case

18:32 solarkraft has quit [Ping timeout: 240 seconds]

19:31 odmir_ has quit [Remote host closed the connection]

19:32 odmir has joined #asahi-gpu

19:36 odmir has quit [Ping timeout: 240 seconds]

20:02 odmir has joined #asahi-gpu

20:11 <bloom> DarkShadow44: I'm not sure if they added support or not, it's definitely on their wishlist though.

20:22 radex1 has quit [Quit: WeeChat 3.0]

20:26 artemist has quit [Ping timeout: 276 seconds]

20:33 artemist has joined #asahi-gpu

20:36 odmir has quit [Ping timeout: 260 seconds]

20:39 <bloom> dougall: Any reason why applegpu.py separates {A,B,C,D} from {A,B,C,D}t? It looks like the fields are always adjacent

20:40 <bloom> I guess just more convenient that way. Probably will merge them for whatever i end up doing.

20:59 odmir has joined #asahi-gpu

21:32 odmir has quit [Ping timeout: 240 seconds]

21:44 odmir has joined #asahi-gpu

22:00 yrlf has quit [Quit: The Lounge - https://thelounge.chat]

22:01 yrlf has joined #asahi-gpu

22:18 odmir has quit [Ping timeout: 268 seconds]

22:18 <bloom> dougall: Also, patches incoming for `blend` and maybe `ld_var` if I have time before getting bored and reading a novel :p

22:29 odmir has joined #asahi-gpu

23:02 odmir has quit [Ping timeout: 240 seconds]

23:03 odmir has joined #asahi-gpu

23:14 <bloom> Note to self: >4 render targets has some seriously weird shader stuff

23:15 <dougall> bloom: A vs At etc is arbitrary, as you say, i just wanted to have splitting in one place, and try to separate 'instruction' from 'data' in the bit diagram (A can be a literal immediate, although also sometimes the low bit indicates whether or not the reg is 64-bit, so it's not perfect)

23:15 <bloom> Ack

23:17 <dougall> ah yeah, great find on i/u2! i think that makes sense

23:18 <bloom> probably worth confirming by issuing a lot more loads

23:18 <bloom> but if the idea is right it should be obvious how to construct test programs for that