#panfrost on 2022-07-06 — irc logs at oftc.irclog.whitequark.org

2022-03-22 11:57 ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular

00:49 <icecream95> Oops... does AArch64 allow raising faults when doing atomic operations on device memory?

00:49 <icecream95> I guess I'd better just use volatile * for reading from the MCU pages then

00:52 <icecream95> I don't think that atomics are required when reading from the userspace proxy pages either, because of the `dsb sy` after writing cached memory

00:52 * icecream95 takes another look at the "Barrier Litmus Tests" in the ARM ARM

00:59 <HdkR> Most platforms will even raise SIGBUS on unaligned atomic device memory :P

01:10 <icecream95> Do reads from write-combine memory by another thread require barriers or atomic instructions?

01:11 <icecream95> Wait a second, this memory came from a MAP_ANONYMOUS mmap, it'll be cached, oops

01:13 <HdkR> weak-memory model ARM means you'll always need barriers or acquire-load

01:21 <icecream95> Hmm I think I might actually need a barrier
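
The acquire-load that HdkR mentions can be sketched with C11 atomics. This is a minimal illustration for ordinary cached memory (such as a MAP_ANONYMOUS mapping) shared between two threads, not code from Mesa; `struct mailbox`, `mailbox_publish` and `mailbox_try_read` are hypothetical names, and device memory follows different rules.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mailbox shared between a writer and a reader thread. */
struct mailbox {
    uint32_t payload;   /* plain data, written before the flag */
    atomic_uint ready;  /* 0 = empty, 1 = payload is valid */
};

/* Writer: the release store keeps the payload write ordered before the flag. */
static void mailbox_publish(struct mailbox *m, uint32_t value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Reader: the acquire load keeps the payload read ordered after the flag. */
static int mailbox_try_read(struct mailbox *m, uint32_t *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return 0;
    *out = m->payload;
    return 1;
}
```

With a plain volatile read of the flag instead, a weakly ordered CPU could in principle observe the flag before the payload, which is the situation the barrier is meant to rule out.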

01:36 erle has joined #panfrost

01:39 Daanct12 has joined #panfrost

01:42 camus has joined #panfrost

01:47 camus1 has joined #panfrost

01:52 camus has quit [Ping timeout: 480 seconds]

02:29 jolan has joined #panfrost

02:55 <icecream95> userfaultfd: Function not implemented

02:55 <icecream95> # CONFIG_USERFAULTFD is not set

02:55 <icecream95> oops

03:32 Daaanct12 has joined #panfrost

03:32 Daanct12 has quit [Remote host closed the connection]

03:41 icecream95 has quit [Ping timeout: 480 seconds]

03:49 <HdkR> Sounds about right, I don't think the Arm64 default has it enabled

04:19 <steev> correct

04:43 davidlt has joined #panfrost

05:36 Daanct12 has joined #panfrost

05:38 Daaanct12 has quit [Read error: Connection reset by peer]

05:39 guillaume_g has joined #panfrost

06:44 Major_Biscuit has joined #panfrost

06:50 Daanct12 has quit [Remote host closed the connection]

06:56 icecream95 has joined #panfrost

06:59 Major_Biscuit has quit []

07:18 MajorBiscuit has joined #panfrost

07:23 MajorBiscuit has quit [Quit: WeeChat 3.5]

07:26 MajorBiscuit has joined #panfrost

08:08 rasterman has joined #panfrost

08:40 <icecream95> Oops... there appears to be a race condition where pandecode_dump_stream can become NULL when it shouldn't

08:42 <icecream95> I guess pandecode_next_frame was called from another thread during decode

08:43 <icecream95> Maybe I should add some locks to pandecode... the submit lock usually works, except that pandecode_next_frame can be called without it

09:42 rkanwal has joined #panfrost

10:10 digetx has joined #panfrost

10:16 camus1 has quit [Read error: Connection reset by peer]

10:16 camus has joined #panfrost

10:24 icecream95 has quit [Ping timeout: 480 seconds]

10:52 erle has quit [Ping timeout: 480 seconds]

11:02 erle has joined #panfrost

11:21 falk689 has joined #panfrost

12:35 MajorBiscuit has quit [Quit: WeeChat 3.5]

12:40 alyssa has joined #panfrost

13:22 megi has joined #panfrost

14:07 camus has quit []

14:29 davidlt has quit [Ping timeout: 480 seconds]

15:22 erle has quit [Ping timeout: 480 seconds]

15:25 MajorBiscuit has joined #panfrost

15:43 erle has joined #panfrost

16:14 davidlt has joined #panfrost

16:23 Daanct12 has joined #panfrost

16:23 Danct12 has quit [Read error: Connection reset by peer]

16:26 guillaume_g has quit []

16:34 Danct12 has joined #panfrost

16:40 Daanct12 has quit [Ping timeout: 480 seconds]

16:46 nlhowell has joined #panfrost

16:50 MajorBiscuit has quit [Quit: WeeChat 3.5]

16:56 rkanwal has quit [Quit: rkanwal]

18:08 <alyssa> robher: panfrost_mmu_flush_range has the pattern:

18:08 <alyssa> pm_runtime_get_noresume()

18:09 <alyssa> if (pm_runtime_active()) flush()

18:09 <alyssa> pm_runtime_put_sync_autosuspend()

18:09 <alyssa> I don't understand what the first and last calls are for.

18:09 <alyssa> Why not simply "if (pm_runtime_active()) flush()"?

18:11 <alyssa> (This is causing lockdep splat)

18:11 <alyssa> If pm_runtime_put_sync_autosuspend results in a suspend, the devfreq lock is taken.

18:12 <alyssa> panfrost_mmu_flush_range is called from the shrinker, which takes the fs_reclaim lock.

18:13 <alyssa> But, when probing devfreq, we take the devfreq lock (obviously) followed by the fs_reclaim lock (for malloc, via dev_set_name).

18:14 <alyssa> resulting in a circular dependency reported by lockdep, if we probe in parallel with shrinking

18:14 <alyssa> OTOH that seems impossible, so maybe this is a false positive in lockdep?

18:16 <robher> Seems impossible to me too offhand.

18:18 * alyssa reads https://blog.ffwll.ch/2020/08/lockdep-false-positives.html

18:18 <alyssa> "Except when there’s very strong justification for all the complexity, the real fix is to change the locking and make it simpler"

18:18 <alyssa> Hard to do when one of the locks is deep into the kernel core..

18:20 <alyssa> Although... why do we hold a lock when probing anyway?

18:20 <alyssa> panfrost can't race itself to probe the same device..

18:20 <robher> Or at least during the memory alloc?

18:20 <robher> This is all bringing back painful memories.

18:22 <alyssa> memories of panfrost.ko or just locking in general?

18:24 <alyssa> seemingly we have little control over either lock.

18:27 <alyssa> I guess any driver that makes devfreq calls within the shrinker callback will hit that

18:29 <robclark> devfreq calls in shrinker?! That sounds like a place you just don't want to go

18:30 <alyssa> robclark: okay so I'm not crazy then? :)

18:30 <alyssa> shrinker -> panfrost_mmu_unmap -> pm_runtime_put_sync_autosuspend -> panfrost_device_suspend -> panfrost_devfreq_suspend

18:30 <robclark> in general I'd try and avoid anything that ends you up in other subsystem's code

18:31 <robclark> hmm, I guess a bit sad if you need runpm ref to unref?

18:31 <alyssa> hence my original question of what those calls do

18:37 <alyssa> Hmm, maybe this is part of the mystery:

18:37 <alyssa> * Note that the return value of this function can only be trusted if it is

18:37 <alyssa> * called under the runtime PM lock of @dev or under conditions in which

18:37 <alyssa> * runtime PM cannot be either disabled or enabled for @dev and its runtime PM

18:37 <alyssa> * status cannot change.

18:38 <alyssa> (For pm_runtime_active)

18:38 <alyssa> the get_noresume/put_sync_autosuspend dance ensures that if the device is awake, it will stay awake. maybe?

18:40 <alyssa> if that's all, then pm_runtime_put_sync_autosuspend can become instead pm_runtime_put_noidle

18:41 <alyssa> in that case, we can't trigger a suspend which avoids the splat
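
The difference between the two ways of dropping the reference can be pictured with a userspace toy model. This is only a sketch of the reasoning above, not kernel code: the `toy_*` names are hypothetical stand-ins for the runtime PM helpers, the autosuspend delay is ignored, and "suspend" is reduced to a counter that represents the path where the devfreq lock would be taken.

```c
#include <stdbool.h>

/* Toy stand-in for a device's runtime PM state. */
struct toy_dev {
    int usage_count;    /* references that keep the device awake */
    bool active;        /* true while the device is powered */
    int suspend_calls;  /* times the suspend path was entered */
};

/* Take a reference without waking the device. */
static void toy_get_noresume(struct toy_dev *d) { d->usage_count++; }

/* Drop a reference without ever entering the suspend path. */
static void toy_put_noidle(struct toy_dev *d) { d->usage_count--; }

/* Drop a reference and suspend right away if it was the last one. */
static void toy_put_sync_autosuspend(struct toy_dev *d)
{
    if (--d->usage_count == 0 && d->active) {
        d->active = false;
        d->suspend_calls++;  /* the devfreq lock would be taken here */
    }
}

/* The flush pattern under discussion; returns whether a flush happened. */
static bool toy_flush(struct toy_dev *d, bool use_put_noidle)
{
    bool flushed = false;

    toy_get_noresume(d);     /* device cannot go to sleep under us */
    if (d->active)
        flushed = true;      /* the MMU flush would go here */
    if (use_put_noidle)
        toy_put_noidle(d);
    else
        toy_put_sync_autosuspend(d);
    return flushed;
}
```

In this model the `toy_put_noidle` variant never reaches the suspend path from inside the flush, which is the property that avoids taking the devfreq lock under the shrinker.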

18:42 <alyssa> fwiw, git blame says I acked the patch but didn't review it, which means I didn't understand it then and don't now C:

18:50 <alyssa> I see the same get_noresume/active/put_autosuspend pattern in panfrost_mmu_release_ctx. Not sure if that's similarly wrong

18:58 <alyssa> seems to have fixed that splat

18:59 <alyssa> there are, of course, other kernel bugs that I'm hitting with the same reproducer

18:59 <alyssa> massive numbers of threads doing massive amounts of allocations finds some bugs, hmm!

19:01 <HdkR> Just throw more threads at the problem, you'll either make problems or find problems. Easy!

19:02 <alyssa> HdkR: truth

19:04 <alyssa> Next bug on the list is a WARN triggering for an error handling path

19:04 <alyssa> I rather suspect it's a bogus warning because other drivers handle the error without any fanfare.

20:23 davidlt has quit [Ping timeout: 480 seconds]

20:24 icecream95 has joined #panfrost

20:48 <icecream95> MR opened for decoding v10 command streams: 16 files changed, 2108 insertions(+), 183 deletions(-)

20:51 <alyssa> this is going to take a while to review

20:59 <alyssa> (Just to calibrate your expectations: I do intend to merge v10 decoding. It may take some respins of that series given how little we know about CSF so far.)

20:59 <alyssa> pan/decode changes are sort of "whatever", but we do need to be very careful about GenXML

21:00 <alyssa> (Perhaps more careful than we have been in the best....)

21:01 <icecream95> The area which has been least r/e'd is the different command-stream instructions, of which many are unknown

21:01 anholt has joined #panfrost

21:01 <icecream95> Though most commonly the standard mov instructions are used to upload descriptors

21:01 <icecream95> ...And my MR adds instruction decoding to decode.c but not GenXML

21:03 <icecream95> (Though see my csf-ins branch for implementing the pack side of that)

21:03 <icecream95> s/best/past/?

21:04 <icecream95> (Though arguably the best times are all in the past?)

21:05 <alyssa> Sure

21:06 <alyssa> I was trying to figure out today what we want this to look like in Mesa.

21:06 <alyssa> mi_builder is relevant prior art.

21:07 <alyssa> (Not saying that's exactly what we should do, but the hardware seems at least superficially similar.)

21:09 anholt has quit [Quit: Leaving]

21:16 <icecream95> alyssa: For an example of launching a compute batch using my current implementation of packing: https://gitlab.freedesktop.org/icecream95/mesa/-/blob/csf/src/panfrost/csf_test/test.c#L1163

21:17 <alyssa> pan_pack_ins(i, CS_SELECT_BUFFER, cfg) { cfg.index = 3; }

21:17 <alyssa> pan_pack_ins(i, CS_STATE, cfg) { cfg.state = 8; }

21:17 <alyssa> icecream95: ^ what do these do?

21:18 <alyssa> Do you know?

21:18 <icecream95> No idea!

21:18 <alyssa> I respect that ^^

21:18 <icecream95> Possibly SELECT_BUFFER switches between register sets for the command stream

21:18 <icecream95> Possibly STATE is ... I don't know what it's for

21:19 <alyssa> Alright. I'd prefer giving them boring names then (CS_UNK23 or something)

21:20 <alyssa> How about CS_CALL?

21:20 <icecream95> No, I don't know which is the link register, if that is your question

21:21 <alyssa> I didn't mean that specifically, just emit_cs_call in general

21:21 <icecream95> Though switching register sets (or whatever SELECT_BUFFER does) allows multiple levels of calls

21:21 <icecream95> emit_cs_call calls out to another command stream, and returns when reaching end?

21:21 * alyssa supposes she's going to need to set up the DDK to answer these questions..

21:22 <alyssa> (or better, hardware.....)

21:23 <icecream95> The kernel driver I use is https://gitlab.com/icecream95/kbase-valhall/

21:23 <alyssa> ack

21:24 <icecream95> Note that with that the blob will hang after submitting because no-one executes the `str` commands in the command stream to signal job completion

21:24 <alyssa> sure

21:25 <icecream95> I do that in my panloader branch, maybe there could be a debug flag to make that happen in Mesa pandecode?

21:31 <icecream95> That could be called PAN_MESA_DEBUG=nosync. Rather than waiting for jobs to finish right after submission, never wait at all!

21:33 <alyssa> heh

21:34 <alyssa> Re <enum name="CS Iterator"> values

21:34 <alyssa> AFAIK, there are only three iterators: compute, tiler, fragment

21:35 <alyssa> https://developer.arm.com/documentation/102811/0100 calls them compute, vertex, fragment

21:35 <alyssa> at any rate, if there are 4 values, that's not an iterator enum

21:36 <alyssa> that said: it's SUPER arm to reserve 0

21:36 <icecream95> Maybe one of them is "MCU"?

21:36 <alyssa> so for 3 values, a 2-bit field makes sense with values 1,2,3

21:36 <alyssa> (13 & 3) = 1

21:36 <alyssa> so maybe it's reserved/vertex/fragment/compute

21:36 <alyssa> and there are some flags?
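
The arithmetic behind that guess can be written out as below. This is a sketch of the hypothesis only: the 2-bit iterator field, the reserved/vertex/fragment/compute ordering and the idea that the remaining bits are flags are all speculation from the discussion, and `split_iterator_field` is a hypothetical helper.

```c
#include <stdint.h>

/* Speculative split of the raw "CS Iterator" value. */
struct iterator_guess {
    unsigned iterator;  /* low 2 bits: 0 reserved, 1 vertex, 2 fragment, 3 compute (guess) */
    unsigned flags;     /* everything above the low 2 bits (guess) */
};

static struct iterator_guess split_iterator_field(uint32_t raw)
{
    struct iterator_guess g;

    g.iterator = raw & 3;  /* for raw = 13 (0b1101) this is 1 */
    g.flags = raw >> 2;    /* for raw = 13 this is 3 (0b11) */
    return g;
}
```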

21:37 <icecream95> wait a second that isn't in the MR I posted you shouldn't be looking at that!!!!

21:38 <alyssa> I'm one of those people who reads the end of the book while still at the beginning and then is disappointed in the plot twists

21:38 <alyssa> you got me

21:38 <icecream95> I guess it could be flags, maybe for whether the tiler is enabled or not?

21:39 * alyssa shrugs

21:39 <alyssa> + pandecode_log("add x%02x, x%02x, #0x%x\n",

21:39 <icecream95> But that implies the existence of a non-IDVS vertex job type, which I haven't seen yet

21:39 <alyssa> I am fairly sure that these commands have an associated register file (MMIO mapped)

21:40 <alyssa> so I would assume that the source/dests here are registers

21:40 <icecream95> Yes, which is why there is an 'x' prefix rather than '0x'... I'm copying aarch64 conventions

21:40 <alyssa> wouldn't aarch64 convention be x%u?

21:41 <icecream95> Okay, I'm not completely copying them then

21:41 <icecream95> I started r/e by staring at hexdumps, and so decided to use hex for register names

21:41 <alyssa> (and x%u probably makes more sense if we're satisfied this is a "real" ISA of sorts)

21:42 <icecream95> But the other reason for using hex is to track what fields have been ported to CS registers or not

21:43 <icecream95> Maybe XML comments would be better suited for that

21:43 <alyssa> Yeah, I used XML comments for v9 bring up (commenting out v7 XML fragments that I hadn't mapped yet) and it worked well

21:45 <icecream95> Should I convert v10.xml from using hex then?

21:46 <alyssa> Not sure yet, I am still trying to understand the problem space

21:46 <alyssa> It bothers me that we still don't know *what* the cs layout "descriptors" *are*

21:46 <alyssa> Internally, I mean

21:47 <alyssa> Are these the v9 descriptors, and CSF is just shuffling them into place?

21:47 <alyssa> (Then what does CSF help with at all? lower overhead because you can do better dirty tracking?)

21:48 <icecream95> Note that Draw and the "shader environments" have both CS and non-CS variants...

21:48 <alyssa> Are these hardware registers (in the kernel MMIO register sense), and there are no descriptors anymore?

21:48 <icecream95> Non-CS variants are unchanged from v9 (I think)

21:48 <alyssa> Yes, that part is also bouncing around in my head

21:49 <alyssa> Those two possibilities induce very different GenXML designs

21:49 <alyssa> The former case induces more or less what you've done

21:50 <alyssa> You could plausibly model everything as descriptors like before -- no GenXML changes even, except for the shuffling -- and then in the driver's draw_vbo/launch_grid hooks, do something like:

21:50 <alyssa> foreach word in descriptor:

21:51 <alyssa> if word != current_state: emit_move_instruction(word)

21:51 <alyssa> current_state = word

21:51 <alyssa> (And changes from there would be purely about lowering CPU overhead, not architectural details)
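
The loop above could look roughly like this in C. It is a sketch under assumptions: the descriptor is treated as an array of 32-bit words that map one-to-one onto CS registers, the driver keeps a shadow copy of the last values it emitted, and `emit_changed_words`, `emit_move_fn` and the `move_log` recorder are hypothetical names rather than Mesa API.

```c
#include <stddef.h>
#include <stdint.h>

/* Callback that would pack one "move immediate to register" instruction. */
typedef void (*emit_move_fn)(void *ctx, unsigned reg, uint32_t value);

/* Emit a move for every word that differs from the shadow copy,
 * update the shadow, and return how many moves were emitted. */
static unsigned emit_changed_words(uint32_t *shadow, const uint32_t *desc,
                                   size_t num_words, emit_move_fn emit, void *ctx)
{
    unsigned emitted = 0;

    for (size_t i = 0; i < num_words; i++) {
        if (desc[i] != shadow[i]) {
            emit(ctx, (unsigned)i, desc[i]);
            shadow[i] = desc[i];
            emitted++;
        }
    }
    return emitted;
}

/* Trivial recorder used in place of a real instruction packer. */
struct move_log {
    unsigned count;
    unsigned last_reg;
    uint32_t last_value;
};

static void log_move(void *ctx, unsigned reg, uint32_t value)
{
    struct move_log *log = ctx;

    log->count++;
    log->last_reg = reg;
    log->last_value = value;
}
```

Submitting the same descriptor twice emits nothing the second time, which is where the reduced overhead would come from, at the cost of keeping and comparing against the shadow copy.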

21:52 <alyssa> The latter case, however, warrants a *completely* different design of the XML compared to pre-CSF Mali

21:52 <icecream95> But then you have to load the current state, which somewhat negates the advantage...

21:52 <alyssa> sure, but dealing with that is "just" a CPU optimization at that point

21:53 <alyssa> (At this point I'm just trying to understand the hardware, not necessarily how to handle it in a real driver)

21:53 <alyssa> The latter case, however, warrants a *completely* different design of the XML compared to pre-CSF Mali

21:53 <alyssa> where each group of contiguous related words becomes its own element of the XML and given first-class register status

21:53 <alyssa> as in (IIRC) Intel's and Broadcom's GenXML variants

21:54 <alyssa> there are no descriptors at that point, just state that used to be a descriptor in pre-CSF times

21:55 <alyssa> and then the draw call code looks completely different: each group of state is emitted independently based on dirty flags, and there is no "current state of the reg file" tracked except those dirty flags

21:55 <alyssa> The latter would be a bit annoying because it implies rewriting all the draw/launch code again instead of sharing the v9 code.
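
For contrast, the dirty-flag design described for the second case might be sketched like this. The group names are placeholders picked for illustration, not a list of real v10 state groups, and `mark_dirty` and `emit_dirty_groups` are hypothetical helpers; a real driver would pack registers where this sketch only counts.

```c
#include <stdint.h>

/* Placeholder state groups; each would be its own XML element. */
enum state_group { GROUP_SCISSOR, GROUP_VIEWPORT, GROUP_BLEND, GROUP_COUNT };

struct draw_state {
    uint32_t dirty;                 /* one bit per state_group */
    unsigned emitted[GROUP_COUNT];  /* times each group was emitted */
};

/* Called whenever the API changes state belonging to a group. */
static void mark_dirty(struct draw_state *s, enum state_group g)
{
    s->dirty |= 1u << g;
}

/* At draw time, emit only the dirty groups and clear the flags.
 * No copy of the register file is kept, only the flags. */
static unsigned emit_dirty_groups(struct draw_state *s)
{
    unsigned count = 0;

    for (unsigned g = 0; g < GROUP_COUNT; g++) {
        if (s->dirty & (1u << g)) {
            s->emitted[g]++;  /* the group's registers would be packed here */
            count++;
        }
    }
    s->dirty = 0;
    return count;
}
```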

21:56 <icecream95> I note that general purpose registers seem to start at x64... I don't know if that's hardware or just ABI

21:56 <alyssa> Then again for v9 I had to do more or less that anyway...

21:56 <alyssa> Interesting

21:56 <alyssa> Could be either.

21:56 <alyssa> I think that's my next question -- are these register offsets fixed?

21:56 anholt has joined #panfrost

21:56 <alyssa> s/offsets/bases/

21:57 <icecream95> Note that pan_cmdstream.c in my csf branch does have code for launching IDVS and compute jobs

21:57 <icecream95> A bit hacky, though...

21:57 <alyssa> or are there some unknown bits in the COMPUTE_LAUNCH and IDVS commands that specify the base, and then the physical registers / descriptor is read from registers [base : base + size)?

21:58 <alyssa> (This too has implications for the right XML design and maybe the driver)

21:59 <icecream95> No idea about that, I don't even know why IDVS jobs take two registers as an argument

21:59 <icecream95> Though note this: <struct name="Fragment launch" layout="ins" op="7"/>

21:59 <icecream95> No fields!

21:59 <icecream95> (None that I know of, at least)

22:00 <alyssa> no fields, or there's a field that the DDK always sets to zero? ;)

22:00 <icecream95> One thing I found interesting is that Scissor and fragment job coordinates now use the same set of registers.. and the latter isn't divided by the tile size anymore, possibly because 32x32 tiles

22:01 <alyssa> heh, cute

22:01 <alyssa> I'm looking at your pan_cmdstream.c code now

22:02 <alyssa> Definitely "not ready for merge, but might find itself in a shipping product if you're not careful" ;-)

22:02 * icecream95 is maybe not careful enough

22:03 <alyssa> git@gitlab.freedesktop.org: Permission denied (publickey,keyboard-interactive).

22:03 <alyssa> uh

22:03 <icecream95> But why does the initial merge have to be of a completely optimised driver, why couldn't we fix things later?

22:04 <alyssa> doesn't have to be optimized, but the architecture should be sound

22:05 <alyssa> + <field name="Unk register 1" size="8" start="32" type="register" default="0x42"/>

22:06 <alyssa> + <field name="Unk register 2" size="8" start="40" type="register" default="0x4a"/>

22:06 <alyssa> I notice this spells JB, i.e. "job" ;-p

22:06 <icecream95> Tell Arm about that one, so they can update their documentation.

22:07 <icecream95> In an email I sent I proposed a syntax for being able to partially upload CS descriptors, for example to only upload scissor minimum values:

22:08 <icecream95> pan_pack_cs(&batch->cs_vertex, SCISSOR, cfg, .min = 1) {

22:20 alyssa has quit [Quit: leaving]

23:11 rasterman has quit [Quit: Gettin' stinky!]