marcan changed the topic of #asahi-gpu to: Asahi Linux: porting Linux to Apple Silicon macs | GPU / 3D graphics stack black-box RE and development (NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
<DarkShadow44> bloom: Even for assembling shaders?
<bloom> TBD
<bloom> still experimenting
<bloom> Nothing is set in stone yet
<DarkShadow44> bloom: okay then
<DarkShadow44> Because I still have my proposed architecture from my asmtool
<bloom> Okey dokey
<DarkShadow44> would be good to have feedback before I implement too much though, don't wanna duplicate the effort
<bloom> Duplicating effort?
<bloom> I thought we were just having fun :)
zkrx has quit [Ping timeout: 240 seconds]
<DarkShadow44> heh, sure
<DarkShadow44> but making something useful in the process would be nice too :P
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
<bloom> I guess
odmir has quit [Ping timeout: 268 seconds]
zkrx has joined #asahi-gpu
<bloom> Anyway uhm
<bloom> I started writing the compiler
<bloom> have most of the big ideas sketched in my head, we'll see how long it takes me to get to simple shaders
<DarkShadow44> bloom: compiler, like in from NIR to bytecode?
<bloom> NIR to real code, yes
<DarkShadow44> directly or using an IR?
odmir has joined #asahi-gpu
<dougall> bloom: oh, wow, that's awesome! fun to see how the goal of generating html turned into the possibility of generating xml/c
<dougall> (i'll try to remember what i was doing and make a little progress on memory ops today)
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
<bloom> dougall: yeah, still trying to figure out what the Right way to approach it is
<bloom> I suspect whatever I did today is Not it
<dougall> haha yeah, applegpu.py doesn't feel Right either - i'd like to rewrite it a bit more deliberately, but it's enough to make RE and doc progress for now
<bloom> nod
<bloom> I suspect we don't want emulation, or the Python disasm/asm stack, in Mesa.
<bloom> Just want the actual encoding details, so then we can spit out C code to do what we need
<bloom> but not sure if there's much point in defining an intermediate XML schema for the sake of it, or trying to wire up robclark's isaspec (which will have impedance mismatches of its own)
<dougall> yeah, the python definitely feels like a bad fit for mesa - i hadn't seen isaspec, but it sounds like it has the right goals... hmm
<bloom> there's also a question of what level of magic is appropriate
<bloom> for the compilers at work, I have full visibility into the ISA so I can do proper top-down design.. so having one master ISA.xml that everything is generated from works well
phiologe has quit [Ping timeout: 250 seconds]
phiologe has joined #asahi-gpu
<bloom> for hw with an active r/e effort, that may backfire. otoh, being able to add instructions easily matters far more for r/e!
<bloom> there's likely a healthy balance..
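A minimal sketch of the "one declarative description, everything generated from it" idea discussed above. The instruction name, opcode value, and field layout here are invented for illustration and are NOT the real AGX encoding; the point is only that adding an instruction during r/e means adding one table entry.

```python
# Hypothetical layout, invented for illustration only (not the real AGX ISA).
# Each field is (name, bit offset, bit width).
FIELDS = {
    "toy_add": [("opcode", 0, 7), ("dest", 7, 6), ("src0", 13, 6), ("src1", 19, 6)],
}
OPCODES = {"toy_add": 0x0E}  # made-up opcode value


def encode(name, **operands):
    """Pack operands into an integer using the declarative field table."""
    word = 0
    for field, shift, width in FIELDS[name]:
        value = OPCODES[name] if field == "opcode" else operands[field]
        if value >> width:  # also rejects negative values
            raise ValueError(f"{field}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def decode(name, word):
    """Unpack every field of an encoded word using the same table."""
    return {
        field: (word >> shift) & ((1 << width) - 1)
        for field, shift, width in FIELDS[name]
    }
```

The same table could in principle drive a generator that prints C pack/unpack helpers instead of interpreting the fields at runtime, which is closer to what the discussion has in mind for Mesa.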
<dougall> yeah - it's definitely a strange challenge - after thinking on it, i don't have a clear opinion either way...
<dougall> i like less magic, i like declarative descriptions, and either will potentially have to be updated as r/e progresses
<bloom> "i like less magic, i like declarative descriptions," unfortunately writing in languages designed in the 70s means these are mutually ^
<dougall> hahaha
<bloom> srs :(
* bloom tries to wrap her head around apple control flow
<dougall> yeah, it's a bit crazy - did you see this thread? (scroll up) https://twitter.com/dougallj/status/1361063810493083648
<bloom> oh!
<dougall> (i don't know if apple compiler implementation specific stuff belongs in the docs, but the fact that you kind of have to re-structure your cfg to compile C was an intuition of Erika's that i didn't have at all)
* bloom nods
<bloom> I expect this will have big implications on how the IR will be structured
<bloom> The CFG we get from NIR is very simple, which is great for normal hw with actual branch instructions, but...
<bloom> (Basically we get `if { ... }` `loop { .. }` `break` and `continue` and that's it. while loops become things like `loop { if not cond { break } .. }`
<bloom> )
<bloom> if { ..} corresponds naturally to if_cmp/pop_exec.. the other 3 not so much
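The `while` to `loop { if not cond { break } .. }` rewrite described above can be sketched on a toy structured tree. The tuple representation below is an assumption made for this example; it is not NIR's actual data structure or API.

```python
def lower_while(node):
    """Rewrite ("while", cond, body) nodes into loop/if/break form, recursively.

    Toy node shapes (assumed for this sketch):
      ("while", cond, [nodes]), ("loop", [nodes]), ("if", cond, [nodes]),
      ("break",), ("continue",), or any other tuple for a plain instruction.
    """
    kind = node[0]
    if kind == "while":
        _, cond, body = node
        guard = ("if", ("not", cond), [("break",)])
        return ("loop", [guard] + [lower_while(n) for n in body])
    if kind == "loop":
        return ("loop", [lower_while(n) for n in node[1]])
    if kind == "if":
        return ("if", node[1], [lower_while(n) for n in node[2]])
    return node  # break, continue, and plain instructions pass through unchanged
```

For instance, `lower_while(("while", "x", [("op", "foo")]))` yields a `loop` whose first statement is the guarded `break`.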
<krbtgt> WRT the earlier convo on basically starting with GPU RE; it is tricky for sure, and i'm certainly an idiot at it. me and a friend were looking at REing ancient SGI GPUs and we found quite a few things out but didn't get as far as say disassembly
<bloom> loop { ... } I guess is just `*: ...; jmp_exec_any *`
<dougall> yeah break and loop aren't bad (if you look at metal compiler output)... break is an icmpsel to r0l then pop_exec 0 (to update execmask)
<dougall> not sure about "continue"/"else" though
<bloom> oh, you can update r0l directly, right :p
<dougall> yeah - break is the only place i've seen that done, but that was really helpful to figure out what on earth was going on
<dougall> (you can only update r0l on threads that are active, which is either obvious or confusing)
<bloom> can it be both obvious and confusing?
<dougall> haha yep
<bloom> all of this seems both intuitively obvious and completely impossible to wrap my head around :-p
<bloom> something else that may be either obvious or confusing -- for optimal perf, the compiler needs to do full divergence analysis
<bloom> (I think)
* bloom tries to come up with example
<dougall> 'full * analysis' does sound optimal, but that mostly goes over my head - is that the uniform hoisting optimisation thing?
<bloom> "Divergence analysis" ~~calculates the sums of partial derivatives of a vector field~~ determines what values might differ across threads in a SIMD group
<dougall> hahaha
<bloom> All the silliness is needed for divergent (data-dependent) branches "if (some_varying) { ... }"
<bloom> But if we can prove at compile-time that all the threads in a given SIMD group will branch together, there's often optimizations to be had
<bloom> (Most obvious case is "if (uniform) { .. }" but also less obviously "if (ballot(..)) { ... }"
<bloom> )
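A minimal sketch of the data-flow half of divergence analysis as described above: a value is divergent if any input is divergent, except for ops like `ballot` whose result is the same for every thread in the group. The instruction tuples and op names are assumptions for this example, and control dependence (values merged after a divergent branch) is deliberately left out.

```python
def divergent_values(instrs, divergent_sources):
    """Return the set of value names that may differ across a SIMD group.

    instrs: list of (dest, op, [srcs]) tuples in a toy SSA form.
    divergent_sources: names known to vary per thread (e.g. a thread id).
    """
    divergent = set(divergent_sources)
    changed = True
    while changed:  # iterate to a fixed point so loop-carried values settle
        changed = False
        for dest, op, srcs in instrs:
            if dest in divergent:
                continue
            if op == "ballot":
                continue  # group-wide result: uniform even with divergent input
            if any(src in divergent for src in srcs):
                divergent.add(dest)
                changed = True
    return divergent
```

Any branch condition that is not in the returned set can be treated as uniform, which is the property the `if (uniform)` and `if (ballot(..))` cases rely on.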
<bloom> Although I'm not actually seeing a good way to use that info here, hm.
<bloom> if (varying) { expensive() } ----> if_icmp / expensive() / pop_exec
<bloom> if (uniform) { expensive() } ----> if_icmp / jmp_exec_none * / expensive() / * pop_exec
<bloom> This doesn't really prove anything since the latter compile is always legal, just not always optimal.
<dougall> yeah jmp_exec_none/any are the only conditional branches i know of, which should benefit from uniformity either way... but apple does love proving things are uniform and hoisting them into a shader that only runs once
<bloom> ^ indeed, and more aggressively than any other gpu i've played with
<dougall> (i should look into how they handle convergence (ballot/etc) more - it does sound like an optimisation opportunity, and maybe i'm missing some instructions/flags there)
<bloom> I still don't see what while_*cmp is for
<bloom> while(x) { ... } ---> start: icmpsel r0l, x, #0, #1; pop_exec 0; jmp_exec_none end; ....; jmp_exec_any start; pop_exec 0;
<bloom> seems a very natural translation
<bloom> oh but that's broken for nested loops
<bloom> er is it?
<bloom> if icmpsel runs on a thread, by definition r0l = 0, so that's fine
<dougall> yeah - it's not entirely clear in my mind... I think their while loop codegen starts with a push_exec 2... you're missing an initial push/final pop of some kind
<bloom> the last pop should've been n=1, sorry
<bloom> push_exec isn't documented but that's just a special case of if_icmp, ok
<dougall> (yeah, icmpsel #1 is more like a break than an 'if' - like if you have "if { your while } else { foo }" the "1"s and the else threads have the same value)
<bloom> ohhhh
<bloom> yikes, ok, right. thanks, knew I was doing something wrong
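A toy model of the per-thread `r0l` nesting counter that this exchange revolves around. The semantics below are assumptions pieced together from the discussion (a thread is active when its counter is 0, a failed test parks it at depth `n`, already-inactive threads nest deeper, and direct writes to `r0l` only land on active threads); they have not been checked against hardware, and the method names are made up for this sketch.

```python
class SimdGroup:
    """Toy execution-mask model; semantics are assumed, not verified."""

    def __init__(self, n_threads):
        self.r0l = [0] * n_threads  # per-thread nesting counter

    def active(self):
        return [count == 0 for count in self.r0l]

    def if_cond(self, conds, n=1):
        # Active threads failing the test are parked at depth n;
        # threads that were already inactive sink n levels deeper.
        self.r0l = [
            (0 if ok else n) if count == 0 else count + n
            for count, ok in zip(self.r0l, conds)
        ]

    def pop_exec(self, n=1):
        # Leaving n nesting levels re-activates threads parked at depth <= n.
        self.r0l = [max(count - n, 0) for count in self.r0l]

    def set_r0l(self, values):
        # A direct write (as used for `break`) only affects active threads.
        self.r0l = [
            value if count == 0 else count
            for count, value in zip(self.r0l, values)
        ]
```

Under these assumptions, two nested `if_cond` calls followed by two `pop_exec(1)` calls return every thread to active, which is the invariant a structured `if { if { .. } }` needs.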
<dougall> yeah - it wasn't easy figuring out what these instructions do, but figuring out how to use them from what they do feels trickier (using metal compiler idioms seems easier, but it'd be good to understand it properly)
<bloom> nod
<bloom> so in NIR, we'd have `if { loop { if { break } foo } } else { bar }`,
<bloom> peeling that back to Metal idioms will be one of the more memorable parts of compiler bringup ;)
* bloom tries on pen/paper
TheJollyRoger has quit [Remote host closed the connection]
TheJollyRoger has joined #asahi-gpu
<bloom> dougall: for that particular example, I think you could do:
<bloom> if_icmp 1; * icmpsel r0l, #2, #0; update_exec; ....; jmp_exec_any *; else_cmp 1; ...; pop_exec 2;
<bloom> but I fully expect even if this works, it's a hack
<bloom> actually, is it?
<dougall> hmm - i want to write an emulator to test that, although i guess i kinda did (maybe? i don't think i implemented jumps)... but yeah, i can't be sure either way at a glance
odmir has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
tomtastic has quit [Ping timeout: 240 seconds]
tomtastic has joined #asahi-gpu
mxw39 has quit [Quit: Konversation terminated!]
angelXwind has quit [Ping timeout: 240 seconds]
akemin_dayo has joined #asahi-gpu
Necrosporus has quit [Ping timeout: 240 seconds]
Necrosporus has joined #asahi-gpu
<bloom> dougall: After giving it some thought last night i believe this would work, but it would break if there was something after the loop before the else, etc
<bloom> You really need push_exec in general, unless you can guarantee no divergence
<glibc> ramping up on Metal... I can compile .metallib files, which I understand are LLVM IR containers. I can compile them in a metal app; how can I get the output binary?
* bloom intends on a full SSA backend compiler
<bloom> The ISA is regular enough and RA requirements simple enough it should be doable
<bloom> Ewwwwwwww
<bloom> dougall: not sure if you've figured this out yet, but it looks like the `i` bit on wait corresponds to the `u2` immediate on device_load
<bloom> to allow a more fine grained scoreboarding
<bloom> so the metal compiler can output code like
<bloom> `load 0, load 1, wait 0, do something with 0, wait 1, do something with 1`
<bloom> and then the load 1 can be pipelined with the `do something with 0`
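A rough timing sketch of why per-slot waits help, following the `load 0, load 1, wait 0, ...` sequence above. The in-order model, the op names, and the latencies are all invented for illustration; only the idea that a wait can target one slot comes from the discussion.

```python
def total_cycles(program):
    """Cycle count of a toy in-order program with scoreboard slots.

    Ops (all hypothetical): ("load", slot, latency), ("wait", slot),
    ("alu", cycles).
    """
    now = 0
    ready_at = {}  # slot -> cycle at which all loads tagged with it are done
    for op, *args in program:
        if op == "load":
            slot, latency = args
            now += 1  # one cycle to issue
            ready_at[slot] = max(ready_at.get(slot, 0), now + latency)
        elif op == "wait":
            (slot,) = args
            now = max(now, ready_at.get(slot, 0))
        elif op == "alu":
            (cycles,) = args
            now += cycles
        else:
            raise ValueError(f"unknown op {op!r}")
    return now


# Two slots: work on the first load overlaps the slower second load.
fine = [("load", 0, 10), ("load", 1, 20), ("wait", 0), ("alu", 8),
        ("wait", 1), ("alu", 8)]
# One shared slot: the first wait stalls until both loads have landed.
coarse = [("load", 0, 10), ("load", 0, 20), ("wait", 0), ("alu", 8),
          ("wait", 0), ("alu", 8)]
```

With these made-up latencies the two-slot program finishes in 30 cycles and the single-slot one in 38, because the first ALU block no longer has to wait for the slower load.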
josiahmendes[m] has quit [Quit: Idle for 30+ days]
<bloom> "Ewwwwwwww" was at AGX architecturally lacking support for basic vertex attributes.
<bloom> ....why is `stack_adjust/stack_store/stack_load` the fastest way to convert i32 to u8norm...
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
<glibc> I guess I can build a binary archive and serialize it, but I wonder if there is an easier way to get the raw compiled binaries.
tmlind has joined #asahi-gpu
odmir_ has joined #asahi-gpu
<DarkShadow44> glibc: What exactly do you want to do?
<glibc> I wanted a .metal -> .disasm workflow, which I now have :)
<DarkShadow44> glibc: Alright, otherwise I could have offered you my version. Do you also use this code https://github.com/DarkShadow44/AsahiLinux-gpu/blob/asmtool/asmtool/asmtool/metallib.c ?
<glibc> no, I wasn't aware of this -- but this looks quite nice :) Thanks for the link.
<DarkShadow44> FWIW, I just copied that - but also integrated it into my tool - so you can just feed it source and it outputs disassembly
<glibc> cool, that's exactly what I wanted. (I was still missing a clean method for extracting the raw payload)
<DarkShadow44> I take it, you're also trying to re the ISA?
<glibc> taking humble baby steps, but yeah
<DarkShadow44> heh, wishing you the best of luck! :)
* bloom has always ignored the 'nice' methods of doing this, preferring to just savagely parse GPU memory at runtime
<DarkShadow44> bloom: You can extract shaders from GPU memory after they've been uploaded?
<glibc> DarkShadow44: thanks :)
<DarkShadow44> glibc: One more thing I forgot: Keep in mind that the metallib way of getting a shader produces different bytecode from loading a shader normally
<DarkShadow44> they're compiled as "functions" (for lack of a better word) instead of a standalone program
<DarkShadow44> so if you try to run them, you better check they load/store data at the right offsets :P
<DarkShadow44> ...if you want to run them in the first place
<bloom> DarkShadow44: yep.
<bloom> and it avoids ^^ those sorts of issues
<bloom> at the cost of introducing a slew of other issues..
<DarkShadow44> cool, I kinda thought once it's uploaded you can't access it anymore
<DarkShadow44> downloading that is kinda savage indeed, but I like the idea ^
<DarkShadow44> bloom: Say, do you know if isaspec from mesa can be used to create an assembler? Because from reading their docs it seems like this is not yet the case
solarkraft has quit [Ping timeout: 240 seconds]
odmir_ has quit [Remote host closed the connection]
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
<bloom> DarkShadow44: I'm not sure if they added support or not, it's definitely on their wishlist though.
radex1 has quit [Quit: WeeChat 3.0]
artemist has quit [Ping timeout: 276 seconds]
artemist has joined #asahi-gpu
odmir has quit [Ping timeout: 260 seconds]
<bloom> dougall: Any reason why applegpu.py separates {A,B,C,D} from {A,B,C,D}t? It looks like the fields are always adjacent
<bloom> I guess just more convenient that way. Probably will merge them for whatever i end up doing.
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
yrlf has quit [Quit: The Lounge - https://thelounge.chat]
yrlf has joined #asahi-gpu
odmir has quit [Ping timeout: 268 seconds]
<bloom> dougall: Also, patches incoming for `blend` and maybe `ld_var` if I have time before getting bored and reading a novel :p
odmir has joined #asahi-gpu
odmir has quit [Ping timeout: 240 seconds]
odmir has joined #asahi-gpu
<bloom> Note to self: >4 render targets has some seriously weird shader stuff
<dougall> bloom: A vs At etc is arbitrary, as you say, i just wanted to have splitting in one place, and try to separate 'instruction' from 'data' in the bit diagram (A can be a literal immediate, although also sometimes the low bit indicates whether or not the reg is 64-bit, so it's not perfect)
<bloom> Ack
<dougall> ah yeah, great find on i/u2! i think that makes sense
<bloom> probably worth confirming by issuing a lot more loads
<bloom> but if the idea is right it should be obvious how to construct test programs for that