ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
<icecream95> Oops... does AArch64 allow raising faults when doing atomic operations on device memory?
<icecream95> I guess I'd better just use volatile * for reading from the MCU pages then
<icecream95> I don't think that atomics are required when reading from the userspace proxy pages either, because of the `dsb sy` after writing cached memory
* icecream95 takes another look at the "Barrier Litmus Tests" in the ARM ARM
<HdkR> Most platforms will even raise SIGBUS on unaligned atomic device memory :P
<icecream95> Do reads from write-combine memory by another thread require barriers or atomic instructions?
<icecream95> Wait a second, this memory came from a MAP_ANONYMOUS mmap, it'll be cached, oops
<HdkR> weak-memory model ARM means you'll always need barriers or acquire-load
<icecream95> Hmm I think I might actually need a barrier
erle has joined #panfrost
Daanct12 has joined #panfrost
camus has joined #panfrost
camus1 has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
jolan has joined #panfrost
<icecream95> userfaultfd: Function not implemented
<icecream95> # CONFIG_USERFAULTFD is not set
<icecream95> oops
Daaanct12 has joined #panfrost
Daanct12 has quit [Remote host closed the connection]
icecream95 has quit [Ping timeout: 480 seconds]
<HdkR> Sounds about right, I don't think the Arm64 default has it enabled
<steev> correct
davidlt has joined #panfrost
Daanct12 has joined #panfrost
Daaanct12 has quit [Read error: Connection reset by peer]
guillaume_g has joined #panfrost
Major_Biscuit has joined #panfrost
Daanct12 has quit [Remote host closed the connection]
icecream95 has joined #panfrost
Major_Biscuit has quit []
MajorBiscuit has joined #panfrost
MajorBiscuit has quit [Quit: WeeChat 3.5]
MajorBiscuit has joined #panfrost
rasterman has joined #panfrost
<icecream95> Oops... there appears to be a race condition where pandecode_dump_stream can become NULL when it shouldn't
<icecream95> I guess pandecode_next_frame was called from another thread during decode
<icecream95> Maybe I should add some locks to pandecode... the submit lock usually works, except that pandecode_next_frame can be called without it
rkanwal has joined #panfrost
digetx has joined #panfrost
camus1 has quit [Read error: Connection reset by peer]
camus has joined #panfrost
icecream95 has quit [Ping timeout: 480 seconds]
erle has quit [Ping timeout: 480 seconds]
erle has joined #panfrost
falk689 has joined #panfrost
MajorBiscuit has quit [Quit: WeeChat 3.5]
alyssa has joined #panfrost
megi has joined #panfrost
camus has quit []
davidlt has quit [Ping timeout: 480 seconds]
erle has quit [Ping timeout: 480 seconds]
MajorBiscuit has joined #panfrost
erle has joined #panfrost
davidlt has joined #panfrost
Daanct12 has joined #panfrost
Danct12 has quit [Read error: Connection reset by peer]
guillaume_g has quit []
Danct12 has joined #panfrost
Daanct12 has quit [Ping timeout: 480 seconds]
nlhowell has joined #panfrost
MajorBiscuit has quit [Quit: WeeChat 3.5]
rkanwal has quit [Quit: rkanwal]
<alyssa> robher: panfrost_mmu_flush_range has the pattern:
<alyssa> pm_runtime_get_noresume()
<alyssa> if (pm_runtime_active()) flush()
<alyssa> pm_runtime_put_sync_autosuspend()
<alyssa> I don't understand what the first and last calls are for.
<alyssa> Why not simply "if (pm_runtime_active()) flush()"?
<alyssa> (This is causing lockdep splat)
<alyssa> If pm_runtime_put_sync_autosuspend results in a suspend, the devfreq lock is taken.
<alyssa> panfrost_mmu_flush_range is called from the shrinker, which takes the fs_reclaim lock.
<alyssa> But, when probing devfreq, we take the devfreq lock (obviously) followed by the fs_reclaim lock (for malloc, via dev_set_name).
<alyssa> resulting in a circular dependency reported by lockdep, if we probe in parallel with shrinking
<alyssa> OTOH that seems impossible, so maybe this is a false positive in lockdep?
<robher> Seems impossible to me too offhand.
<alyssa> "Except when there’s very strong justification for all the complexity, the real fix is to change the locking and make it simpler"
<alyssa> Hard to do when one of the lock is deep into the kernel core..
<alyssa> Although... why do we hold a lock when probing anyway?
<alyssa> panfrost can't race itself to probe the same device..
<robher> Or at least during the memory alloc?
<robher> This is all bringing back painful memories.
<alyssa> memories of panfrost.ko or just locking in general?
<alyssa> seemingly we have little control over either lock.
<alyssa> I guess any driver that makes devfreq calls within the shrinker callback will hit that
<robclark> devfreq calls in shrinker?! That sounds like a place you just don't want to go
<alyssa> robclark: okay so I'm not crazy then? :)
<alyssa> shrinker -> panfrost_mmu_unmap -> pm_runtime_put_sync_autosuspend -> panfrost_device_suspend -> panfrost_devfreq_suspend
<robclark> in general I'd try and void anything that ends you up in other subsystem's code
<robclark> hmm, I guess a bit sad if you need runpm ref to unref?
<alyssa> hence my original question of what those calls do
<alyssa> Hmm, maybe this is part of the mystery:
<alyssa> * Note that the return value of this function can only be trusted if it is
<alyssa> * called under the runtime PM lock of @dev or under conditions in which
<alyssa> * runtime PM cannot be either disabled or enabled for @dev and its runtime PM
<alyssa> * status cannot change.
<alyssa> (For pm_runtime_active)
<alyssa> the get_noresume/put_sync_autosuspend dance ensures that if the device is awake, it will stay awake. maybe?
<alyssa> if that's all, then pm_runtime_put_sync_autosuspend can become instead pm_runtime_put_noidle
<alyssa> in that case, we can't trigger a suspend which avoids the splat
<alyssa> fwiw, git blame says I acked the patch but didn't review it, which means I didn't understand it then and don't now C:
<alyssa> I see the same get_noresume/active/put_autosuspend pattern in panfrost_mmu_release_ctx. Not sure if that's similarly wrong
<alyssa> seems to have fixed that splat
<alyssa> there are, of course, other kernel bugs that I'm hitting with the same reproducer
<alyssa> massive numbers of threads doing massive amounts of allocations finds some bugs, hmm!
<HdkR> Just throw more threads at the problem, you'll either make problems or find problems. Easy!
<alyssa> HdkR: truth
<alyssa> Next bug on the list is a WARN triggering for an error handling path
<alyssa> I rather suspect it's a bogus warning because other drivers handle the error without any fanfare.
davidlt has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95> MR opened for decoding v10 command streams: 16 files changed, 2108 insertions(+), 183 deletions(-)
<alyssa> this is going to take a while to review
<alyssa> (Just to calibrate your expectations: I do intend to merge v10 decoding. It may take some respins of that series given how little we know how CSF so far.)
<alyssa> pan/decode changes are sort of "whatever", but we do need to be very careful about GenXML
<alyssa> (Perhaps more careful than we have been in the best....)
<icecream95> The area which has been least r/e'd is the different command-stream instructions, of which many are unknown
anholt has joined #panfrost
<icecream95> Though most commonly the standard mov instructions are used to upload descriptors
<icecream95> ...And my MR adds instruction decoding to decode.c but not GenXML
<icecream95> (Though see my csf-ins branch for implementing the pack side of that)
<icecream95> s/best/past/?
<icecream95> (Though arguably the best times are all in the past?)
<alyssa> Sure
<alyssa> I was trying to figure out today what we want this to look like in Mesa.
<alyssa> mi_builder is relevant prior art.
<alyssa> (Not saying that's exactly what we should do, but the hardware seems at least superficially similar.)
anholt has quit [Quit: Leaving]
<icecream95> alyssa: For an example of launching a compute batch using my current implementation of packing: https://gitlab.freedesktop.org/icecream95/mesa/-/blob/csf/src/panfrost/csf_test/test.c#L1163
<alyssa> pan_pack_ins(i, CS_SELECT_BUFFER, cfg) { cfg.index = 3; }
<alyssa> pan_pack_ins(i, CS_STATE, cfg) { cfg.state = 8; }
<alyssa> icecream95: ^ what do these do?
<alyssa> Do you know?
<icecream95> No idea!
<alyssa> I respect that ^^
<icecream95> Possibly SELECT_BUFFER switches between register sets for the command stream
<icecream95> Possibly STATE is ... I don't know what it's for
<alyssa> Alright. I'd prefer giving them boring names then (CS_UNK23 or something)
<alyssa> How about CS_CALL?
<icecream95> No, I don't know which is the link register, if that is your question
<alyssa> I didn't mean that specifically, just emit_cs_call in general
<icecream95> Though switching register sets (or whatever SELECT_BUFFER does) allows multiple levels of calls
<icecream95> emit_cs_call calls out to another command stream, and returns when reaching end?
* alyssa supposes she's going to need to set up the DDK to answer these questions..
<alyssa> (or better, hardware.....)
<icecream95> The kernel driver I use is https://gitlab.com/icecream95/kbase-valhall/
<alyssa> ack
<icecream95> Note that with that the blob will hang after submitting because no-one executes the `str` commands in the command stream to signal job completion
<alyssa> sure
<icecream95> I do that in my panloader branch, maybe there could be a debug flag to make that happen in Mesa pandecode?
<icecream95> That could be called PAN_MESA_DEBUG=nosync. Rather than waiting for jobs to finish right after submission, never wait at all!
<alyssa> heh
<alyssa> Re <enum name="CS Iterator"> values
<alyssa> AFAIK, there are only three iterators: compute, tiler, fragment
<alyssa> https://developer.arm.com/documentation/102811/0100 calls them compute, vertex, fragment
<alyssa> at any rate, if there are 4 values, that's not an iterator enum
<alyssa> that said: it's SUPER arm to reserve 0
<icecream95> Maybe one of them is "MCU"?
<alyssa> so for 3 values, a 2-bit field makes sense with values 1,2,3
<alyssa> (13 & 3) = 1
<alyssa> so maybe it's reserved/vertex/fragment/compute
<alyssa> and there are some flags?
<icecream95> wait a second that isn't in the MR I posted you shouldn't be looking at that!!!!
<alyssa> I'm one of those people who reads the end of the book while still at the beginning and then is disappointed in the plot twists
<alyssa> you got me
<icecream95> I guess it could be flags, maybe for whether the tiler is enabled or not?
* alyssa shrugs
<alyssa> + pandecode_log("add x%02x, x%02x, #0x%x\n",
<icecream95> But that implies the existence of a non-IDVS vertex job type, which I haven't seen yet
<alyssa> I am fairly sure that these commands have an associated register file (MMIO mapped)
<alyssa> so I would assume that the source/dests here are registers
<icecream95> Yes, which is why there is an 'x' prefix rather than '0x'... I'm copying aarch64 conventions
<alyssa> wouldn't aarch64 convention be x%u?
<icecream95> Okay, I'm not completely copying them then
<icecream95> I started r/e by staring at hexdumps, and so decided to use hex for register names
<alyssa> (and x%u probably makes more sense if we're satisfied this is a "real" ISA of sorts)
<icecream95> But the other reason for using hex is to track what fields have been ported to CS registers or not
<icecream95> Maybe XML comments would be better suited for that
<alyssa> Yeah, I used XML comments for v9 bring up (commenting out v7 XML fragments that I hadn't mapped yet) and it worked well
<icecream95> Should I convert v10.xml from using hex then?
<alyssa> Not sure yet, I am still trying to understand the problem space
<alyssa> It bothers me that we still don't know *what* the cs layout "descriptors" *are*
<alyssa> Internally, I mean
<alyssa> Are these the v9 descriptors, and CSF is just shuffling them into place?
<alyssa> (Then what does CSF help with at all? lower overhead because you can do better dirty tracking?)
<icecream95> Note that Draw and the "shader environments" have both CS and non-CS variants...
<alyssa> Are these hardware registers (in the kernel MMIO register sense), and there are no descriptors anymore?
<icecream95> Non-CS variants are unchanged from v9 (I think)
<alyssa> Yes, that part is also bouncing around in my head
<alyssa> Those two possibilities induce very different GenXML designs
<alyssa> The former case induces more or less what you've done
<alyssa> You could plausibly model everything as descriptors like before -- no GenXML changes even, except for the shuffling -- and then in the driver's draw_vbo/launch_grid hooks, do something like:
<alyssa> foreach word in descriptor:
<alyssa> if word != current_state: emit_move_instruction(word)
<alyssa> current_state = word
<alyssa> (And changes from there would be purely about lowering CPU overhead, not architectural details)
<alyssa> The latter case, however, warrants a *completely* different design of the XML compared to pre-CSF Mali
<icecream95> But then you have to load the current state, which somewhat negates the advantage...
<alyssa> sure, but dealing with that is "just" a CPU optimization at that point
<alyssa> (At this point I'm just trying to understand the hardware, not necessarily how to handle it in a real driver)
<alyssa> The latter case, however, warrants a *completely* different design of the XML compared to pre-CSF Mali
<alyssa> where each group of contiguous related words becomes its own element of the XML and given first-class register status
<alyssa> as in (IIRC) Intel's and Broadcom's GenXML variants
<alyssa> there are no descriptors at that point, just state that used to be a descriptor in pre-CSF times
<alyssa> and then the draw call code looks completely different: each group of state is emitted independently based on dirty flags, and there is no "current state of the reg file" tracked except those dirty flags
<alyssa> The latter would be a bit annoying because it implies rewriting all the draw/launch code again instead of sharing the v9 code.
<icecream95> I note that general purpose registers seem to start at x64... I don't know if that's hardware or just ABI
<alyssa> Then again for v9 I had to do more or less that anyway...
<alyssa> Interesting
<alyssa> Could be either.
<alyssa> I think that's my next question -- are these register offsets fixed?
anholt has joined #panfrost
<alyssa> s/offsets/bases/
<icecream95> Note that pan_cmdstream.c in my csf branch does have code for launching IDVS and compute jobs
<icecream95> A bit hacky, though...
<alyssa> or are there some unknown bits in the COMPUTE_LAUNCH and IDVS commands that specify the base, and then the physical registers / descriptor is read from registers [base : base + size)?
<alyssa> (This too has implications for the right XML design and maybe the driver)
<icecream95> No idea about that, I don't even know why IDVS jobs take two registers as an argument
<icecream95> Though note this: <struct name="Fragment launch" layout="ins" op="7"/>
<icecream95> No fields!
<icecream95> (None that I know of, at least)
<alyssa> no fields, or there's a field that the DDK always sets to zero? ;)
<icecream95> One thing I found interesting is that Scissor and fragment job coordinates now use the same set of registers.. and the latter isn't divided by the tile size anymore, possibly because 32x32 tiles
<alyssa> heh, cute
<alyssa> I'm looking at your pan_cmdstream.c code now
<alyssa> Definitely "not ready for merge, but might find itself in a shipping product if you're not careful" ;-)
* icecream95 is maybe not careful enough
<alyssa> git@gitlab.freedesktop.org: Permission denied (publickey,keyboard-interactive).
<alyssa> uh
<icecream95> But why does the initial merge have to be of a completely optimised driver, why couldn't we fix things later?
<alyssa> doesn't have to be optimized, but the architecture should be sound
<alyssa> + <field name="Unk register 1" size="8" start="32" type="register" default="0x42"/>
<alyssa> + <field name="Unk register 2" size="8" start="40" type="register" default="0x4a"/>
<alyssa> I notice this spells JB, i.e. "job" ;-p
<icecream95> Tell Arm about that one, so they can update their documentation.
<icecream95> In an email I sent I proposed a syntax for being able to partially upload CS descriptors, for example to only upload scissor minimum values:
<icecream95> pan_pack_cs(&batch->cs_vertex, SCISSOR, cfg, .min = 1) {
alyssa has quit [Quit: leaving]
rasterman has quit [Quit: Gettin' stinky!]