ChanServ changed the topic of #asahi-gpu to: Asahi Linux GPU development (no user support, NO binary reversing) | Keep things on topic | GitHub: https://alx.sh/g | Wiki: https://alx.sh/w | Logs: https://alx.sh/l/asahi-gpu
jeisom_ has quit [Quit: Leaving]
jeisom has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
Armlin has joined #asahi-gpu
Armlin has quit [Read error: Connection reset by peer]
Armlin has joined #asahi-gpu
hightower3 has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
hightower4 has quit [Ping timeout: 480 seconds]
crabbedhaloablut has joined #asahi-gpu
iaguis_ has joined #asahi-gpu
jeisom has quit [Ping timeout: 480 seconds]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
iaguis__ has joined #asahi-gpu
iaguis has quit [Ping timeout: 480 seconds]
iaguis_ has quit [Ping timeout: 480 seconds]
Armlin has quit []
homura has joined #asahi-gpu
homura has quit []
JTL has quit []
JTL has joined #asahi-gpu
JTL has quit []
JTL has joined #asahi-gpu
JTL has quit []
pg12 has quit [Remote host closed the connection]
<lina> alyssa: Soooo.... there's no mailbox ^^;;
<lina> It's all firmware and shared memory, there is no mechanism to interrupt a shader or anything like that.
<lina> Instead what they do is they have a compute shader calculate how much memory is needed, and write that out to some shared memory, then when the next shader runs that shared memory location is passed to the firmware as an alloc request, and then the firmware allocates memory before running it.
<lina> Each job has an alloc request and a free request, and writes out the alloc/free request counts of the next job
pg12 has joined #asahi-gpu
<lina> For mesh, there are 3 compute jobs: the first one calculates the first allocation, the second one writes to it and calculates the second allocation, then the third one writes out the mesh data/stuff into the second allocation, then the final 3d/ta job uses that.
<lina> So the "just in time" allocation mechanism is just there so the jobs don't have to pingpong through the kernel for allocs, but rather they can be queued at the firmware eagerly and only when the firmware is about to run a job does it poke the kernel to allocate its memory
<lina> In typical Apple fashion, it's overcomplicated with two layers of accounting for buffer size/usage...
<lina> Also everything is done in discrete "block count" units but I think the firmware doesn't care about where we alloc or how big a block is, since that's just up to the shaders/kernel
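[editor's note: the alloc/free request chaining described above can be sketched as a toy model; all class and field names here are invented, only the flow (each job's shader writing the next job's request counts, and the firmware satisfying them in block units before running it) comes from the log]

```python
# Toy model of the just-in-time allocation chain described above: each job
# carries an alloc/free request in block units, and its shader writes out
# the request counts for the *next* job into shared memory. The firmware
# satisfies the alloc before running a job. All names are hypothetical.

class SharedMem:
    def __init__(self):
        self.next_alloc_blocks = 0  # written by the previous job's shader
        self.next_free_blocks = 0

class Firmware:
    def __init__(self, total_blocks):
        self.free_blocks = total_blocks

    def run_job(self, shared, shader):
        # Before running a job, read the request the previous job wrote
        # out and satisfy the allocation (poking the kernel if short).
        need = shared.next_alloc_blocks
        assert need <= self.free_blocks, "kernel would be poked to allocate"
        self.free_blocks -= need
        shader(shared)                    # job runs, writes next job's counts
        self.free_blocks += shared.next_free_blocks

fw = Firmware(total_blocks=64)
shared = SharedMem()

def sizing_shader(s):    # job 1: computes how much memory job 2 needs
    s.next_alloc_blocks = 3
    s.next_free_blocks = 0

def consumer_shader(s):  # job 2: uses the allocation, then requests its free
    s.next_alloc_blocks = 0
    s.next_free_blocks = 3

fw.run_job(shared, sizing_shader)
fw.run_job(shared, consumer_shader)
print(fw.free_blocks)  # pool is back to 64 blocks once the chain completes
```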
<lina> The only interesting thing in the shaders is those shared memory counts are written to like this:
<lina> 1ca: 4591280510f83200 device_store.TODO 0, i32, xy, r18_r19, r12_r13, 2, signed, lsl 2, 1, 1, 0, 3, 0
<lina> My guess is that's an uncached store or something like that, since the firmware needs to be coherent with this stuff.
<lina> Ohh but I think the firmware may have a 64-bit free blockmap, so max 64 blocks. Blocks are 64MiB on macOS, which means 4GiB max buffer size, which makes sense as a max.
<lina> So we could use a different blocksize, but we'd lower our max usage if we do that.
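[editor's note: the blocksize trade-off above is simple arithmetic, assuming the 64-bit blockmap guess is right]

```python
# If the firmware tracks blocks in a 64-bit bitmask, the maximum buffer
# size is just 64 * block_size, whatever block size the kernel picks.
MAX_BLOCKS = 64  # one bit per block in the (guessed) 64-bit blockmap

def max_buffer_size(block_size):
    return MAX_BLOCKS * block_size

MiB = 1 << 20
GiB = 1 << 30

print(max_buffer_size(64 * MiB) // GiB)  # macOS's 64 MiB blocks -> 4 GiB cap
print(max_buffer_size(16 * MiB) // GiB)  # smaller blocks lower the cap -> 1 GiB
```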
iaguis__ is now known as iaguis
mairacanal has quit [Remote host closed the connection]
chadmed has joined #asahi-gpu
<lina> I think for us, the ABI for this would look essentially the same as the firmware ABI, except we use a BO and offsets for the count/shared buffers with the firmware (then we can map it in firmware land without making it world writable like Apple does, they're so bad at this...)
<lina> And the requirement is that everything is submitted as one ioctl, from alloc to free
<lina> Our ioctl could pass in a max required blocks size, and then at submit time the kernel ensures the global shared allocation has at least that much space. Then we mutex job execution using the global buffers (considering this another resource, blocking via a fence), but lazily drag buffer blocks between VMs so we avoid remapping and zeroing out when the jobs actually use less than the max memory.
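[editor's note: a minimal sketch of the submit-time scheme proposed above, purely illustrative; the class, the lock standing in for the fence, and the allocate-or-expand policy are assumptions layered on the proposal, not real driver code]

```python
import threading

class GlobalHeap:
    """Sketch: userspace passes a max block count at submit time, the kernel
    grows the shared allocation if needed (never shrinking), and jobs that
    use the global buffers are mutually exclusive. Names are invented."""
    def __init__(self):
        self.blocks = 0
        self.lock = threading.Lock()  # stands in for blocking via a fence

    def submit(self, max_blocks, job):
        with self.lock:                    # serialize jobs on the global buffers
            if self.blocks < max_blocks:
                self.blocks = max_blocks   # allocate-or-expand
            job(self.blocks)

heap = GlobalHeap()
heap.submit(8, lambda n: print("running with", n, "blocks"))
heap.submit(4, lambda n: print("running with", n, "blocks"))  # no shrink: still 8
```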
chadmed has quit [Quit: Konversation terminated!]
chadmed has joined #asahi-gpu
ourdumbfuture has joined #asahi-gpu
ourdumbfuture has quit []
ourdumbfuture has joined #asahi-gpu
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
alyssa has joined #asahi-gpu
<alyssa> 09:00 <lina> 1ca: 4591280510f83200 device_store.TODO 0, i32, xy, r18_r19, r12_r13, 2, signed, lsl 2, 1, 1, 0, 3, 0
<alyssa> Yeah, we're assuming the .TODO variants have different cache bits but we're not sure what any of the bits mean yet
<alyssa> so that tracks
<alyssa> 08:57 <lina> For mesh, there are 3 compute jobs: the first one calculates the first allocation, the second one writes to it and calculates the second allocation, then the third one writes out the mesh data/stuff into the second allocation, then the final 3d/ta job uses that.
<alyssa> This seems deeply cursed
<alyssa> To be clear, "job" in this context is a thing kicked off by a RunCompute op?
<alyssa> (rather than an individual dispatch or a whole ioctl?)
<alyssa> 10:08 <lina> Our ioctl could pass in a max required blocks size
<alyssa> I don't know that information
<alyssa> So it would be set unconditionally to the size of the entire global heap
<alyssa> The entire point of this mechanism is that I /don't/ know any reasonable bounds ahead-of-time
<alyssa> (If I did I'd just static allocate in userspace)
ourdumbfuture has joined #asahi-gpu
jeisom has joined #asahi-gpu
dylanchapell has quit [Remote host closed the connection]
cylm has joined #asahi-gpu
<lina> alyssa: Yes, "job" is a RunCompute
<lina> alyssa: I know we don't know the max required blocks, but it seems more useful to allow that to be set from userspace (e.g. via env variable) than a single global kernel module parameter?
<lina> Then we could even do driconf overrides for known problem apps
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
dylanchapell has joined #asahi-gpu
mairacanal has joined #asahi-gpu
<lina> alyssa: Do you want me to look at this more or is this enough to make up a plan? I *think* I can easily test this whole mechanism from the m1n1 python stuff, since it doesn't really have to involve shaders (I can just pre-populate the buffers with whatever I want and dump whatever the firmware did to them)
<lina> (And since as far as I can tell the mechanism is uniform, it doesn't even have to be compute, I can just have a render pass request allocation)
commandoline has quit [Quit: Bye!]
commandoline has joined #asahi-gpu
<i509vcb> lina: Is the mentioned theoretical 4GiB maximum buffer size related to the usage of the buffer or is it more of a strict limit? (Want to understand this before blindly updating the limit in agxv)
<lina> I think macOS just restricts it to 4GiB of address space
<lina> I don't think the GPU/firmware care about that, but there does seem to be what looks like a 64-bit bitmask of blocks involved, so whatever blocksize we pick we can only have up to 64 of them.
<lina> And using blocks larger than 64MiB seems... kind of wasteful
<lina> i509vcb: What limit are you thinking about? This is mostly for internal mesh/geom/etc emulation, not a buffer exposed to users
commandoline has quit [Quit: Bye!]
ourdumbfuture has joined #asahi-gpu
commandoline has joined #asahi-gpu
<alyssa> 13:17 lina | Then we could even do driconf overrides for known problem apps
<alyssa> Maybe? shrug
<lina> I mean if we need to allocate at ioctl time it's not much harder to have it allocate-or-expand and take the buffer size from userspace
os has quit [Ping timeout: 480 seconds]
<lina> That has nothing to do with this
<lina> This is a very specific mesh/geom emu buffer thing
<lina> Not "any buffer"
<i509vcb> Okay then I guess I just ignore the above
<lina> What's the value right now?
<i509vcb> No idea, I think it's either max memory allocation size or UINT64_MAX
<lina> Yeah
<lina> User address space right now is 256GiB (half of the 512GiB lower half, since the rest is reserved for other purposes) and Apple don't sell machines with that much RAM, so something based on RAM size probably makes more sense if you want a "useful" value
commandoline has quit []
<lina> From what I heard we want to try to de-hardcode most of this address space stuff, so eventually it might be closer to the full 512GiB lower half (full AS is 40 bits / 1 TiB but half is kernel land)
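[editor's note: the address-space figures above work out as follows]

```python
# Address-space arithmetic from the discussion: the full GPU AS is 40 bits
# (1 TiB), the upper half is kernel land, and today's carveouts leave
# userspace only half of the remaining 512 GiB lower half.
GiB = 1 << 30

full_as  = 1 << 40        # 40-bit address space = 1 TiB
lower    = full_as // 2   # kernel owns the upper half
user_now = lower // 2     # current hardcoded carveouts halve it again

print(lower // GiB)       # 512
print(user_now // GiB)    # 256
```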
commandoline has joined #asahi-gpu
<alyssa> lina: relatedly, did we decide that SVM is hard-impossible on current-gen AGX?
<lina> No, that's what this is for. Apparently Nvidia has the same problem and somehow they make it work?
<alyssa> oh, ok
<lina> (with GPU AS < CPU AS)
<alyssa> (I didn't really follow that conversation)
<lina> (I don't know how they deal with that but if they do, no reason not to try here too)
<alyssa> :+1:
<lina> This will be another UAPI break once I flip it around
<lina> We probably want some code in mesa to rand() the carveouts to an extent, to prove this actually works
<lina> The globals that mention these addresses annoy me, but for all we know the firmware never uses them...
<lina> I guess I should grep the firmware a bit, see if I can confirm or deny any of it
<alyssa> apple doesn't do SVM on macOS do they?
<lina> no
<alyssa> fun
<lina> OK, the "magic page" address at 0x6fffff8000 is written to a hardware register, so that one does something...
<lina> I don't think any of the others are used for anything, so we can probably ignore those globals and let userspace define those addresses
<lina> I guess we should move the magic page to the top at 0x7fffff8000 and then just say we "almost" have 40 bits of address space for userspace to allocate however it wants
<lina> Which is the best we can do for SVM
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
compassion1785 has quit [Quit: lounge quit]
cylm has quit [Ping timeout: 480 seconds]
cylm has joined #asahi-gpu
compassion1785 has joined #asahi-gpu
<karolherbst> are you managing the VM in userspace or kernel space?
<karolherbst> and anyway, as long as you don't do full system SVM the address space size doesn't matter
<karolherbst> memory used for SVM needs to be allocated manually and you can just use mmap instead of malloc and restrict the area from where to allocate
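[editor's note: karolherbst's point, that without full-system SVM shared allocations are made explicitly and can be restricted to a window valid in both address spaces, can be modelled roughly like this; the window bounds, alignment, and allocator are all made up for illustration]

```python
# Sketch: SVM memory is allocated explicitly (mmap rather than malloc), so
# userspace can restrict it to a window that exists in both the CPU AS and
# the smaller GPU AS. The same address is then valid on both sides.

class SvmWindow:
    def __init__(self, base, size):
        self.base, self.end = base, base + size
        self.cursor = base

    def alloc(self, size, align=1 << 14):   # align to the 16 KiB GPU page
        addr = (self.cursor + align - 1) & ~(align - 1)
        if addr + size > self.end:
            raise MemoryError("outside the shared window")
        self.cursor = addr + size
        return addr   # usable as both a CPU and a GPU address

win = SvmWindow(base=0x10_0000_0000, size=1 << 30)  # hypothetical 1 GiB window
a = win.alloc(4096)
print(hex(a))  # first allocation lands at the (already aligned) window base
```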
<karolherbst> I still have to write a proper prototype for all of this, but I'm kinda having a plan
ourdumbfuture has joined #asahi-gpu
rappet has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
rappet has joined #asahi-gpu
cylm has quit [Ping timeout: 480 seconds]
<lina> karolherbst: Userspace, except the carveout for kernel BOs
<lina> It's the VM_BIND thing
<karolherbst> right...
<lina> But I mean, we could have an SVM mode which syncs the CPU page tables like I think nouveau has?
<karolherbst> yeah, the carveout needs to happen, but as long as you don't use mmu_notifiers to mirror arbitrary CPU memory, having the address spaces be a different size isn't a problem
<karolherbst> lina: yeah....
<karolherbst> and then you'd have to carveout all invalid regions
<lina> Yeah...
<lina> Anyway I don't know how this really works, I just want to make sure I understand what we can do ^^
<karolherbst> I don't have a solution for that, but luckily nothing yet requires it
<lina> I should sleep ^^
<karolherbst> yeah.. me neither :D
<karolherbst> heh
<karolherbst> have a good rest!
<karolherbst> but anyway.. userspace needs a way to map arbitrary host memory into the GPU AS, which is generally called "userptr"
<karolherbst> and that's all you need for basic SVM
<karolherbst> I have a prototype for that against iris, I just have to fix it
<lina> Yeah, userptr... as long as you're on a 16K kernel ^^
<lina> (That's never going to work on 4K...)
<karolherbst> nah, that's fine
<karolherbst> iris can map arbitrary pointers, you just offset it
<lina> I mean on AGX
<lina> The GPU pages are 16K so good luck mapping arbitrary CPU 4K pages...
<karolherbst> so you map the entire page, but the driver deals with the alignment
<karolherbst> mhhh
<karolherbst> ohh right...
<karolherbst> that would be 4 pages...
<karolherbst> though it _might_ just work anyway
<lina> Yeah, which won't be contiguous
<karolherbst> ohh
<karolherbst> yeah if that's required then you are screwed :D
<lina> And can't be made contiguous reliably even if you have weird migration stuff
<lina> So yeah that's never going to work
<karolherbst> yeah...
<karolherbst> well..
<karolherbst> in which case no SVM...
<karolherbst> but whatever
<lina> Yeah, no SVM on 4K
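[editor's note: the 16K-vs-4K problem above in numbers; the `mappable` helper is a made-up illustration of the contiguity requirement, not driver code]

```python
# One 16 KiB GPU page covers four 4 KiB CPU pages, and a userptr mapping
# would need those four frames to be physically contiguous and 16 KiB
# aligned -- which a 4K kernel can't guarantee for arbitrary user memory.
GPU_PAGE = 16 * 1024
CPU_PAGE = 4 * 1024

print(GPU_PAGE // CPU_PAGE)  # 4 CPU pages per GPU page

def mappable(phys_frames):
    """A run of 4 KiB physical frames can back one GPU page only if it is
    a contiguous, 16 KiB-aligned run of exactly four frames."""
    first = phys_frames[0]
    return (len(phys_frames) == 4
            and first % GPU_PAGE == 0
            and all(f == first + i * CPU_PAGE
                    for i, f in enumerate(phys_frames)))

print(mappable([0x4000, 0x5000, 0x6000, 0x7000]))  # contiguous + aligned: True
print(mappable([0x4000, 0x5000, 0x9000, 0xA000]))  # scattered frames: False
```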
<lina> Anyway, nn ^^
<karolherbst> it's such a niche feature, just that SyCL/HIP require it :')
<lina> 4K is just for gamers anyway ^^
<karolherbst> heh
<karolherbst> userspace could always try though and the kernel either accepts a mapping or not....
<karolherbst> :D
<karolherbst> but yeah...
<karolherbst> good night
<lina> (That's already how it works on 4K kernels if you try to give it unaligned BOs ^^)
<lina> It would work with hugepages w
<karolherbst> mhh
<karolherbst> I wonder if there is mmap magic userspace could do
<lina> There's no general solution to this problem
<lina> There are probably a bunch of weird specific solutions ^^
<karolherbst> just `MAP_HUGE_2MB` I guess...
<lina> Like my broken shm hack to get order-2 allocs
<jannau> I can send a subscriber link
<lina> Anyway really, nn ^^
<lina> jannau: I'll read it later but that would be nice ^^
<jannau> done
<karolherbst> uhh.. I can access lwn via RH for free, but I might have to fix some VPN stuff for it.. let's see...
<karolherbst> ohh.. I can also generate links.. handy
<jannau> it would allow using multi-page folios for mmap but it's far off from being merged and might not be application controllable
<karolherbst> yeah..
<alyssa> lina: I assume the answer is yes, but 4K VMs on a 16K system are also affected (no SVM inside the VM?)
<alyssa> ?
delroth has quit [Remote host closed the connection]
delroth has joined #asahi-gpu
chadmed has quit [Ping timeout: 480 seconds]
nightstrike has joined #asahi-gpu
maximbaz has quit [Quit: bye]
maximbaz has joined #asahi-gpu
ave36309 has quit [Ping timeout: 480 seconds]
ave36309 has joined #asahi-gpu
crabbedhaloablut has quit []
nightstrike has quit [Quit: Connection closed for inactivity]
nela has quit [Quit: bye!]
nela has joined #asahi-gpu
Misthios has quit [Quit: Misthios]
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
Cyrinux9474 has quit []
Cyrinux9474 has joined #asahi-gpu
<i509vcb> alyssa: any pointers for implementing load_layer_id on agxv? I noticed a few drivers involve creating an nir_variable but I suspect this might not be right for agxv
<i509vcb> (trying to get blits to at least try to run)
<i509vcb> I'll peek there I guess
<alyssa> that MR needs to be merged into agxv/main
<alyssa> and then if you're really getting load_layer_id and not VARYING_SLOT_LAYER load_inputs, then use nir_lower_sysvals_to_varyings
<alyssa> and the code in that MR will take care of it
<alyssa> you'll need to extend lower_sysvals_to_varyings in the obvious way in that case
ourdumbfuture has quit [Quit: My Mac has gone to sleep. ZZZzzz…]
ourdumbfuture has joined #asahi-gpu
jeisom has quit [Ping timeout: 480 seconds]
darkapex1 has joined #asahi-gpu
darkapex has quit [Ping timeout: 480 seconds]