ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
DrNick2 has joined #dri-devel
mattrope has joined #dri-devel
<zmike> robclark: can you check out the first 2 patches also?
<zmike> they're pretty small
<robclark> yeah, I looked at them to understand the later primconvert patches
<zmike> ah so your rb applies to all the gallium patches?
<robclark> only parts I really skipped over were zink
<zmike> cool
<robclark> yet
pzanoni` is now known as pzanoni
<zmike> thanks!
<robclark> zmike: re: CTS run, I wasn't sure if zink was using all the emulation (since I skipped over those parts), or if the only parts tested were what zink needed
<zmike> oh you mean force enabling full emulation
<robclark> right
<zmike> I did that locally
<zmike> not in ci
<robclark> ok, that should be fine
<robclark> ie. shouldn't really matter what hw you are testing on
<zmike> yea
<zmike> well I'm testing through lavapipe anyway
<robclark> I do kinda wish we had some gallium level perf_debug() sorta thing, for emitting perf warnings when hitting slow fallbacks
<zmike> would be neat
<zmike> I've got my hands full wrangling trace into being useful these days
<robclark> maybe I'll take a stab at typing something before I add freedreno caps for this
<zmike> I'd review
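A rough sketch of what such a gallium-level perf_debug() helper might look like; every name here (perf_debug, perf_debug_cb, perf_debug_state) is hypothetical rather than existing gallium API:

    #include <stdarg.h>
    #include <stdio.h>

    /* Hypothetical callback a state tracker would register to receive
     * performance warnings (e.g. forwarded to KHR_debug output). */
    typedef void (*perf_debug_cb)(void *data, const char *msg);

    struct perf_debug_state {
       perf_debug_cb cb;
       void *data;
    };

    /* Emit a warning when a driver hits a slow fallback; stays cheap
     * when nobody is listening. */
    static void
    perf_debug(struct perf_debug_state *st, const char *fmt, ...)
    {
       char msg[512];
       va_list ap;

       if (!st->cb)
          return;

       va_start(ap, fmt);
       vsnprintf(msg, sizeof(msg), fmt, ap);
       va_end(ap);
       st->cb(st->data, msg);
    }

A driver would then call something like perf_debug(&ctx->perf, "slow primconvert fallback for quads") on its fallback paths.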
DrNick2 has quit []
ngcortes__ has quit [Remote host closed the connection]
camus1 has joined #dri-devel
tarceri_ is now known as tarceri
<tarceri> airlied: are you able to give this a go on that old machine you were using https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11523 ?
camus has quit [Ping timeout: 480 seconds]
GloriousEggroll has quit [Quit: Death is life's way of telling you you've been fired.]
<airlied> tarceri: will try and schedule it this week
<tarceri> airlied: thanks! we can bump up the % of items deleted in one go if needed but this should be a good improvement
Lucretia has quit []
Lightkey has quit [Ping timeout: 480 seconds]
Lightkey has joined #dri-devel
boistordu has joined #dri-devel
boistordu_ex has quit [Ping timeout: 480 seconds]
khfeng has joined #dri-devel
luzipher_ has joined #dri-devel
luzipher__ has quit [Ping timeout: 480 seconds]
<zmike> airlied mareko: one of you want to ab this so we can merge stuff? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11548
<zmike> or literally anyone
<robclark> zmike: a-b
<zmike> \o/
<robclark> I think general rule is try to make sure this isn't actually a thing you are causing, but otherwise auto a-b for disabling out of service ci farms/runners
blue__penquin has joined #dri-devel
<zmike> I thought the rule was that it had to have an actual ab
<zmike> it already killed 2 MRs
<robclark> In worst case, if someone got it wrong and their MR was actually the thing causing things, we can push reverts of offending MR
<zmike> yea
<robclark> but if I'm asleep, and freedreno ci farm gets hit by a meteor then go ahead and push a disable ;-)
<zmike> I've reassigned the MRs that hit this
<zmike> 👍
<robclark> (might be a good idea around branch point for whoever is making release branch to pay attention to issues like that.. but in general sorting out issues the next day is fine.. we don't need people to be awake 24-7 ;-))
<zmike> I'm just trying to jam this primconvert thing in before I go to bed myself
<zmike> battling ci while tired is the worst kind of battling ci
<robclark> last couple weeks have been smoother as far as CI, or at least I haven't seen as many flakes (and have even had some time to fix a bunch of fails).. but yeah, real hardware is hard
<zmike> yeah it's been v nice recently
<zmike> could barely disguise all the compile fails I was getting
<robclark> I suppose it is an open question of how to deal with things that pass enough to land but are legit issues.. but I guess angry villagers with ci-pitchforks do seem to get those issues resolved
heat_ has joined #dri-devel
heat has quit [Remote host closed the connection]
Peste_Bubonica has quit [Quit: Leaving]
aravind has joined #dri-devel
jcline_ has joined #dri-devel
jcline has quit [Ping timeout: 480 seconds]
sarnex has quit [Quit: Quit]
sarnex has joined #dri-devel
jcline_ has quit [Ping timeout: 480 seconds]
jcline has joined #dri-devel
<imirkin> zmike: huh? "PrimitiveRestartForPatches is now explicitly set to false because no driver supports it" -- nvc0 supported it...
luzipher__ has joined #dri-devel
<airlied> zmike: can confirm nvc0 set it
<imirkin> well, nvc0 setting it was explicitly removed in that change :) i guess the thing to do would be to define a BITFIELD_MASK(PIPE_PRIM_MAX) ^ BIT(PIPE_PRIM_PATCHES) somewhere
<imirkin> and use that all over
Emantor has quit [Quit: ZNC - http://znc.in]
luzipher_ has quit [Ping timeout: 480 seconds]
Emantor has joined #dri-devel
tzimmermann has joined #dri-devel
mattrope has quit [Remote host closed the connection]
Duke`` has joined #dri-devel
sdutt has quit [Remote host closed the connection]
heat_ has quit [Ping timeout: 480 seconds]
<airlied> daniels, tomeu7 : the iris runners are acting a bit random
<airlied> had a fair few misc timeouts, they start running tests then fall over
<airlied> grr now the a530 run dies
<airlied> someone has angered the ci gods tonight
itoral has joined #dri-devel
<daniels> airlied: please link me some jobs? not sure what ‘misc timeouts’ means
<mareko> imirkin: the PATCHES bit must be set if you have tessellation
<imirkin> mareko: yeah ... i was using a bit of shorthand
<mareko> imirkin: because if you don't set the bit in the prim restart cap, u_vbuf will lower it (which is incorrect with GL)
<imirkin> nvc0 also doesn't use u_vbuf :)
<daniels> airlied: tbf the other 28 jobs of the last 30 passed without incident
<mareko> imirkin: you just said to use BITFIELD_MASK(PIPE_PRIM_MAX) ^ BIT(PIPE_PRIM_PATCHES)
<mareko> which would break GL
<mareko> because of u_vbuf
pnowack has joined #dri-devel
pnowack has quit [Remote host closed the connection]
pnowack has joined #dri-devel
<imirkin> mareko: yeah... would need to be moderated a bit
<imirkin> mareko: basically you'd want to ensure that u_vbuf never touches patches
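For reference, a sketch of the two masks under discussion; BITFIELD_MASK() and BITFIELD_BIT() are Mesa's util/macros.h helpers (imirkin's BIT() shorthand), and the macro names below are made up:

    #include "util/macros.h"      /* BITFIELD_MASK(), BITFIELD_BIT() */
    #include "pipe/p_defines.h"   /* PIPE_PRIM_* */

    /* every primitive type, PATCHES included -- required for u_vbuf
     * users with tessellation, since u_vbuf lowering restart for
     * patches is incorrect for GL */
    #define RESTART_PRIMS_ALL        BITFIELD_MASK(PIPE_PRIM_MAX)

    /* everything except PATCHES -- only safe for drivers like nvc0
     * that bypass u_vbuf */
    #define RESTART_PRIMS_NO_PATCHES (BITFIELD_MASK(PIPE_PRIM_MAX) ^ \
                                      BITFIELD_BIT(PIPE_PRIM_PATCHES))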
<daniels> airlied: so I looked through job history and it's annoying for sure (happens ~0.6% of the time), still trying to work out if it's kernel or machine death, but it's not like anything is on fire?
<daniels> aha, just seen the crocus MR now where AML-y also hung. yeah, that one's preceded by infinite GPU-hang recovery fail spam in dmesg
<airlied> daniels: yeah I hit two iris fails in a row, then the a530 shot at me
<daniels> grr, now I'm properly awake
RobertC has joined #dri-devel
luzipher_ has joined #dri-devel
alatiera has quit [Quit: Ping timeout (120 seconds)]
Duke`` has quit [Ping timeout: 480 seconds]
alatiera has joined #dri-devel
alatiera is now known as Guest276
<daniels> zmike: thanks for disabling T760, good fix ... they both wandered off into the sunset together at the same time in different ways when doing a kernel boot run
andrey-konovalov has joined #dri-devel
<daniels> no idea wtf but our monitoring has already been shouting about it overnight, so someone will look when they wake up
luzipher__ has quit [Ping timeout: 480 seconds]
thelounge05 has joined #dri-devel
Guest276 has quit [Ping timeout: 480 seconds]
mlankhorst has joined #dri-devel
RobertC has quit [Ping timeout: 480 seconds]
thelounge05 is now known as alatiera
luzipher__ has joined #dri-devel
luzipher_ has quit [Ping timeout: 480 seconds]
bcarvalho_ has joined #dri-devel
bcarvalho has quit [Ping timeout: 480 seconds]
danvet has joined #dri-devel
thellstrom has joined #dri-devel
bcarvalho_ has quit []
lynxeye has joined #dri-devel
qyliss has quit [Quit: bye]
qyliss has joined #dri-devel
qyliss has quit []
qyliss has joined #dri-devel
qyliss has quit []
qyliss has joined #dri-devel
bcarvalho has joined #dri-devel
imirkin has quit [Ping timeout: 480 seconds]
daleh00k has quit []
<danvet> bnieuwenhuizen, so finally managed to send out my rfc patch for radv
<danvet> together with the import/export ioctls from jekstrand
<danvet> and the implicit sync fix from könig (in drm-misc-next) I think at least in theory it should work and give you all the pieces
<danvet> in practice it might be a few oops instead :-)
imirkin has joined #dri-devel
<HdkR> oh, new ioctls?
luzipher_ has joined #dri-devel
<HdkR> Let me know if it is merged so I can update my downstream implementation :)
luzipher__ has quit [Ping timeout: 480 seconds]
agx_ has left #dri-devel [#dri-devel]
agx_ has joined #dri-devel
<MrCooper> zmike: I get "WARNING: Some incorrect rendering might occur because the selected Vulkan device (AMD RADV RAVEN) doesn't support base Zink requirements: feats.features.alphaToOne", is this known?
<pq> danvet, "Re: [PATCH] dma-buf: fix and rework dma_buf_poll v3"; I'm not sure what to say... I guess I haven't been that much involved with that side to have an opinion.
<danvet> pq, currently dma-buf poll only waits for either exclusive or the shared slots
<pq> danvet, or to be more precise, I have not had a use cases where I would poll for writable.
<danvet> but spec says it should wait for both for write access
<danvet> pq, yeah maybe that's why no one cares
<pq> right, I can't say anything about that
<danvet> so with current code if you want to write, and _not_ overwrite the previous wait
<danvet> you need to set both IN|OUT
<danvet> but what we actually documented would do that for you automatically
<pq> except polling for IN|OUT means IN or OUT, not both
<danvet> well yeah but then wait until both are signalled or something like that
<danvet> make that IN & OUT :-)
<danvet> (just wait for the other after you got one from the first wait or something like that)
<danvet> anyway I don't think there's much of a use-case for waiting only for readers, but not any previous writer
<pq> tbh, even the concepts of shared and exclusive fences are not too familiar to me - a bit too far under the hood
<danvet> implicit sync only has one writer at most, conceptually
Lucretia has joined #dri-devel
<pq> sounds fine to me then, FWIW
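A minimal userspace sketch of the workaround described above: poll the dma-buf fd for both directions and loop until each has signalled, since poll() wakes as soon as either one is ready:

    #include <poll.h>
    #include <stdbool.h>

    /* Wait until a dma-buf is safe for CPU writes: both the exclusive
     * (writer) and the shared (reader) fences must have signalled. */
    static bool
    wait_dmabuf_for_write(int dmabuf_fd)
    {
       short pending = POLLIN | POLLOUT;

       while (pending) {
          struct pollfd pfd = { .fd = dmabuf_fd, .events = pending };

          if (poll(&pfd, 1, -1) < 0)
             return false;
          if (pfd.revents & (POLLERR | POLLNVAL))
             return false;

          /* clear whichever direction(s) signalled, keep waiting on
           * the rest */
          pending &= ~pfd.revents;
       }
       return true;
    }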
<bbrezillon> danvet: Re: What is 'pfdev->reset.pending' protecting => it's making sure we don't queue new jobs when a reset is pending (at least that's what it's used for after 'drm/panfrost: Queue jobs on the hardware')
<danvet> bbrezillon, but isn't that caught by stopping the scheduler kthread?
<bbrezillon> (let me check what it was used for before that)
<bbrezillon> danvet: nope, because the scheduler thread is not stopped immediately
<danvet> it's new with your sched work extraction
<danvet> https://paste.debian.net/1202108/ is what I'm thinking
<danvet> there's some inherent races, but fundamentally it's panfrost_scheduler_stop that makes sure we don't leak
<bbrezillon> leak jobs?
<danvet> well whatever it is you want to stop with reset.pending
agx_ has quit []
<bbrezillon> the thing is, the scheduler is not stopped until the timeout and/or reset handler are called
<danvet> oops deleted too much
<bbrezillon> so we need a way to prevent queueing things to HW queues when a reset is pending
<danvet> https://paste.debian.net/1202109/ we still need to schedule the reset
<danvet> bbrezillon, yup
agx has joined #dri-devel
<danvet> sched_stop does that
<danvet> until that's called there's going to be more victims, which is unfortunate, but can't be avoided
<bbrezillon> danvet: sorry, I have to go, but I'll be back in ~20min
<bbrezillon> well, it's avoided with this reset.pending field ;)
<danvet> at least my locking gut feel is that with these async workers you always have races, and it's best to 100% build on top of the work primitives and their ordering
<danvet> and not roll your own
<danvet> and I /think/ schedule_work + kthread_park in sched_stop already give you all the guarantees you want
<danvet> bbrezillon, yeah but that's _after_ we already stopped the queue
<danvet> which means the check right above this one will already make sure we're not re-scheduling the reset work
<danvet> so all you're preventing here is making a fundamentally unavoidable race maybe a few instructions smaller
<danvet> while making the code a lot more complex :-)
rasterman has joined #dri-devel
<lynxeye> bbrezillon: What's the issue with queuing more work when the GPU is hung? You stop the scheduler in the timeout handler, then all jobs are removed from the HW runqueue and once you restart the scheduler after you handled the reset all jobs that were queued after the bad one are re-inserted by the scheduler.
<lynxeye> So you don't kick jobs, just because they happen to be on the HW queue while one jobs hit a timeout.
camus has joined #dri-devel
camus1 has quit [Read error: Connection reset by peer]
Anorelsan has joined #dri-devel
<lynxeye> bbrezillon: From a coarse look panfrost_job_timedout just looks wrong. What you need to do in there is stop *all* schedulers of all HW queues, wait until all HW queues don't make any progress anymore (if your hardware doesn't stop them all on fault), then execute device reset, then restart *all* schedulers. This way you correctly replay all jobs after the device is working again, so the race window of timeout hit -> schedulers stopped can be infinitely large without causing any issues.
<bbrezillon> danvet: before some recent changes (not queued yet), we were calling drm_sched_fault() for any kind of faults, even some where the GPU was still alive, so the scheduler attached to the queue was not stopped until we reached the timeout handler, and new jobs could be queued in the meantime, jobs that were interrupted during the reset and could not restart properly after that
<bbrezillon> we're migrating to a threaded interrupt handler, so calling drm_sched_stop() directly from there might be an option now
<danvet> bbrezillon, hm but I'm not seeing anything touching that over there?
<danvet> bbrezillon, anyway I think I have a rough understanding, and we just need more standard work queue locking patterns and less self-rolled stuff here I think
<bbrezillon> lynxeye: I've tried that already, and this approach was racy. I've tried various things to fix those races, but moving the reset logic to a separate work was the simplest option
<bbrezillon> danvet: I agree with your last statement :)
<danvet> bbrezillon, well I might still be completely wrong about what's going on :-)
<danvet> bbrezillon, anyway need to grocery shop, then lunch, hopefully afterwards I can hack up a quick demo patch
<lynxeye> bbrezillon: I don't see where you would race there, except with hardware still alive. The timeout path isn't exactly a time critical path, so you can invest some time to make extra sure the hardware is quiesced. Like stop schedulers to stop queuing, then wait for all alive queues to drain.
<danvet> lynxeye, yeah I think a lot of the locking and tricks can be removed
<danvet> and just rely on schedule_work doing the right thing
<danvet> maybe with a check so that we don't reset twice unnecessarily too often (but fundamentally that's a race that can't really be avoided I think, with multiple queues feeding to a single reset point)
<danvet> so first order just schedule_work would be enough
<bbrezillon> which means keeping a separate work for the reset, right?
<danvet> bbrezillon, yeah that part is a very solid design choice
<bbrezillon> ok, good
<danvet> because it means you can rely on work guarantees
<bbrezillon> exactly
<danvet> like that they're single-threaded
<danvet> and the re-queue race handling guarantees of schedule_work
<danvet> also, the implied barrier of schedule_work
<danvet> everything becomes so much easier in general with that design pattern
<danvet> so much so that usually people then break it by adding additional locking that's not needed, because they can't believe it's really not needed :-)
<danvet> bbrezillon, where does this "remove the bad job from the list" happen, and what "list"?
<danvet> oh that commit isn't in upstream
<danvet> ah no, I removed it :-)
<danvet> ok I read that part of the scheduler code
<danvet> I'm not sure what race you're trying to plug there
<danvet> hm I think there's another leak when free_guilty is set
<danvet> or maybe not
<hch12907> MrCooper: that zink warning is known, happens on Renoir too
<MrCooper> k, is this going to be fixed in RADV?
<hch12907> alphaToOne is not enabled for a bunch of AMD graphics on AMDVLK windows too, doesn't seem like an issue exclusive to RADV
<MrCooper> could it be worked around in zink then?
libv_ has joined #dri-devel
Lightkey has quit [Ping timeout: 480 seconds]
libv has quit [Ping timeout: 480 seconds]
luzipher__ has joined #dri-devel
libv_ is now known as libv
luzipher_ has quit [Ping timeout: 480 seconds]
RobertC has joined #dri-devel
<bbrezillon> danvet: ok, so there are several things. First, the reset and timeout handler can run concurrently, since both are queued to the system queue which, if I'm correct, is multi-threaded/multi-CPU. Is that correct?
<bbrezillon> if this is the case, then calling drm_sched_stop() without first making sure the scheduler is still running will trigger this warning https://elixir.bootlin.com/linux/v5.13-rc7/source/kernel/kthread.c#L593 when kthread_park() is called
Lightkey has joined #dri-devel
<hch12907> MrCooper: you have to ask zmike for that; right now zink passes the alpha_to_one boolean directly to VkPipelineMultisampleStateCreateInfo
<bbrezillon> manipulation of the pending_list also needs to be protected with a lock, and then there's the problem of the 'bad' job being removed from the pending_list before the ->timedout_job() method is called, and re-inserted when drm_sched_stop() is called, but if the drm_sched_stop() in the reset handler is called before the one in the timeout handler, we will leak the job
bcarvalho has quit [Quit: Leaving]
<bbrezillon> (the job leak only happens once a 'stop scheduler only once' solution is in place ofc)
bcarvalho has joined #dri-devel
<danvet> bbrezillon, lynxeye ok I think I got it
<danvet> spent some time pondering on a walk
<danvet> fundamentally I think the right solution is to put the reset into a single threaded work to solve all the problems
<danvet> problem is, you're trying to achieve this by only changing panfrost code
<danvet> instead of adjusting drm/scheduler to fit your requirements
<danvet> so I think what we need is
<danvet> 1) panfrost allocates a tdr work queue which is single-threaded
<danvet> 2) we pass that wq as an optional argument to sched_init
<danvet> 3) drm/scheduler uses that wq to queue up the tdr work
<danvet> 4) panfrost timeout callback does the global reset directly, including stopping all queues and everything
<danvet> no locks needed beyond the ordering that drm/scheduler already ensures
<danvet> the fundamental issue is that drm/scheduler assumes that there's only one tdr doing the reset
<danvet> so we need to make sure that's the case by single-threading all the tdr from all queues onto one non-concurrent wq
<danvet> trying to fix this in panfrost only results in horrible code, and I'm not sure it's even possible to do it correctly
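Sketched as code, the proposal might look like this; drm_sched_init() did not take a workqueue argument at the time, so drm_sched_init_with_timeout_wq() below is an assumed API extension, and the struct is a stand-in for the real panfrost_device:

    #include <linux/workqueue.h>
    #include <drm/gpu_scheduler.h>

    #define NUM_JOB_SLOTS 3

    struct panfrost_device_sketch {
       struct workqueue_struct *reset_wq;
       struct drm_gpu_scheduler sched[NUM_JOB_SLOTS];
    };

    static int init_tdr_serialization(struct panfrost_device_sketch *pfdev,
                                      const struct drm_sched_backend_ops *ops)
    {
       int i, ret;

       /* 1) one ordered workqueue: at most one work item runs at a
        * time, single-threading tdr across all hardware queues */
       pfdev->reset_wq = alloc_ordered_workqueue("panfrost-reset", 0);
       if (!pfdev->reset_wq)
          return -ENOMEM;

       for (i = 0; i < NUM_JOB_SLOTS; i++) {
          /* 2) + 3) assumed extension: hand the wq to the scheduler so
           * it queues its per-queue tdr work there instead of on the
           * concurrent system workqueue */
          ret = drm_sched_init_with_timeout_wq(&pfdev->sched[i], ops,
                                               pfdev->reset_wq);
          if (ret)
             return ret;
       }

       /* 4) ->timedout_job() can now stop every queue and run the
        * global reset inline: the ordered wq guarantees no other tdr
        * handler runs concurrently */
       return 0;
    }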
<bbrezillon> it could work, if we store faulty jobs in the scheduler struct
<bbrezillon> instead of passing them to the ->timedout_job() method
<bbrezillon> cause there can be several jobs (at most one per queue) that failed when the global work is scheduled
khfeng has quit [Remote host closed the connection]
khfeng has joined #dri-devel
<bbrezillon> oh, and this global tdr itself can't be attached to drm_sched_job_timedout(), since this function is assuming the work object is embedded in drm_gpu_scheduler, so we need something aggregating timeouts for all queues, and scheduling the timer with the closest deadline
camus1 has joined #dri-devel
camus has quit [Remote host closed the connection]
itoral has quit []
<danvet> bbrezillon, that's not quite what I had in mind
<danvet> it's still a per-sched tdr
<danvet> but a global queue
Anorelsan has quit [Quit: Leaving]
<danvet> so that at any given moment, there's guaranteed to be only one tdr running
<danvet> so single tdr wq globally, but per-queue tdr work
<danvet> so that we don't have to rip up the entire per-queue guilty job tracking
<danvet> which is another thing that I think we should avoid
<bbrezillon> oh, ok
<danvet> bbrezillon, I was for a bit considering de-midlayering the tdr flow
<danvet> i.e. that the tail of it, including bad job cleanup and everything would be a separate function that drivers would call themselves
<danvet> but then you need to keep track of jobs and everything and it gets messy
<danvet> and all we really need is that any given moment there's only one tdr job running
<danvet> bbrezillon, also, i915 will have exactly the same problem
<danvet> well i915 is even more fun, since we can recover through a preempt, per-engine reset, or global reset
<danvet> so 3 levels of escalation
<danvet> I think jekstrand and mbrost will owe you one if you improve drm/scheduler to make this easy on drivers
<zmike> MrCooper: yes, radv prints that
<zmike> imirkin: huh I thought I handled that
<lynxeye> danvet: bbrezillon: Shouldn't amdgpu have solved all those issues? They also have multiple submit queues and can recover on multiple levels iirc.
<bbrezillon> lynxeye: last time I looked they had pretty complex sync mechanisms to have it working
<danvet> lynxeye, yeah amdgpu does a lot of very funky stuff
<bbrezillon> which is why I moved to an approach where things are serialized with a work
<bbrezillon> but that's even better if we can do that at the drm_sched level
<bbrezillon> s/better/simpler for me/ :-)
<danvet> lynxeye, they have a true monster lock with amdgpu_device_lock_hive_adev
<bbrezillon> but I didn't consider using my own ordered workqueue, which is actually a nice trick
<danvet> lynxeye, the problem is that they don't just need an ordered tdr queue
<danvet> they also need to sync against atomic commits
<danvet> which is an entire different level of horrible
<danvet> lynxeye, but aside from that it's pretty close to what I'm suggesting
<danvet> it does the entire tdr across everything involved from the timeout callback
<danvet> except they don't ensure single-threadedness through a single workqueue
<danvet> but through their giantic lock
<danvet> they also have some local reset that doesn't need drm_sched_stop, which they do first
<bbrezillon> yeah, I feel like amdgpu reset logic is several orders of magnitude more complex than the panfrost one
<danvet> multi-gpu hive reset and the "we also reset the display and everything else for lolz"
<danvet> are imo the absolute worst hw design decisions
<danvet> it's slightly better than "welp, let's just reboot", but not much
<bbrezillon> tbh, I'll be happy if I can have all my races fixed with danvet's solution :-)
<bbrezillon> (well, races were fixed already, but panfrost locking became a nightmare, and lockdep started complaining after I added dma fence annotations in the reset handler :))
<danvet> lynxeye, or do you think there's a race left if we single-thread all the tdr work across all queues?
<danvet> hm drm_sched_fault also looks like it might race
<danvet> but I guess that's a different problem
<bbrezillon> what's the issue with drm_sched_fault()?
<danvet> so scenario:
<danvet> 1. gpu dies in a fault
<danvet> 2. tdr fires because bad luck, starts resetting the gpu, but not yet reset it
<danvet> 3. irq handler fires, notices the fault bit, which isn't cleared yet
<danvet> irq handler re-queues tdr work
<danvet> 4. tdr work completes, new jobs get queued into hw
<danvet> 5. tdr work runs against, next task gets shot
<bbrezillon> yep, but it can already happen with what we have now
<lynxeye> danvet: sched_fault doesn't queue the tdr, it just sets the timeout to 0, so if you are already in tdr handling it should be a no-op
<danvet> oh yeah I mean this is a general bug in drm_sched_fault, not in the panfrost implementation
<danvet> lynxeye, it requeues when the work is running already
<danvet> at least that's how they generally work
<danvet> otherwise feel free to adjust the timing to make it a requeue
<danvet> if it doesn't requeue while it's running already you'd have missed wakeup issues everywhere
<bbrezillon> if it doesn't requeue we have a problem if a job triggers an unrecoverable fault right after the resubmit :-/
<danvet> so that's why work items requeue as soon as they started running
<danvet> bbrezillon, yup
yk has quit [Remote host closed the connection]
<bbrezillon> danvet: according to https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L1697, it does requeue when delay is 0, regardless of the tdr state
<bbrezillon> and I think I have a solution to avoid spurious reset
<danvet> bbrezillon, for sched_fault?
<bbrezillon> yes
<bbrezillon> that's basically what I'm doing right now in the reset handler
<bbrezillon> 1/ mask interrupts 2/ synchronise irqs 3/ stop running jobs 4/ handle remaining job irqs before resetting
<bbrezillon> 5/ cancel timeout handlers of all HW queues after the reset
<bbrezillon> that guarantees that all faults coming in after the reset are caused by jobs that were resubmitted (which are valid faults), not older unhandled faults
thellstrom has quit [Remote host closed the connection]
<bbrezillon> when I say 'what I'm doing right now', I mean what's done in https://patchwork.kernel.org/project/dri-devel/list/?series=504375, which is not merged yet
<danvet> bbrezillon, I'm not understanding how this closes the race
<danvet> step 5) smells very fishy
<bbrezillon> since things are serialized at the wq level, cancelling any subsequent tdr (triggered by drm_sched_fault()) should work
<bbrezillon> I mean, the reset is happening in the ordered workqueue context now
aravind has quit []
haasn` has joined #dri-devel
<bbrezillon> unless you tell me it's not possible to cancel a pending tdr from inside another tdr handler
<bbrezillon> BTW, I omitted 6/resubmit jobs 7/restart the schedulers
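Put together, the ordering described in steps 1-7 might look roughly like the sketch below (reusing the stand-in struct from the earlier sketch); the *_placeholder helpers stand in for the real register pokes and IRQ processing in the series linked above:

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>
    #include <drm/gpu_scheduler.h>

    static void mask_job_irqs_placeholder(struct panfrost_device_sketch *p);
    static void handle_pending_job_irqs_placeholder(struct panfrost_device_sketch *p);
    static void device_reset_placeholder(struct panfrost_device_sketch *p);

    static void reset_work_sketch(struct panfrost_device_sketch *pfdev,
                                  int js_irq)
    {
       int i;

       /* 1/ mask job interrupts so the hw raises no new events */
       mask_job_irqs_placeholder(pfdev);

       /* 2/ make sure no handler invocation is still in flight */
       synchronize_irq(js_irq);

       /* 3/ soft-stop running jobs and park the scheduler threads */
       for (i = 0; i < NUM_JOB_SLOTS; i++)
          drm_sched_stop(&pfdev->sched[i], NULL);

       /* 4/ drain events latched before the mask, so stale faults are
        * not misattributed to resubmitted jobs later */
       handle_pending_job_irqs_placeholder(pfdev);

       device_reset_placeholder(pfdev);

       /* 5/ cancel tdr work queued via drm_sched_fault() in the
        * meantime; safe because this work itself runs on the ordered
        * wq, so no other tdr can be executing concurrently */
       for (i = 0; i < NUM_JOB_SLOTS; i++)
          cancel_delayed_work(&pfdev->sched[i].work_tdr);

       /* 6/ + 7/ resubmit surviving jobs, then restart the schedulers */
       for (i = 0; i < NUM_JOB_SLOTS; i++) {
          drm_sched_resubmit_jobs(&pfdev->sched[i]);
          drm_sched_start(&pfdev->sched[i], true);
       }
    }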
haasn has quit [Ping timeout: 480 seconds]
Stary_ has joined #dri-devel
Stary has quit [Ping timeout: 480 seconds]
blue__penquin has quit []
<danvet> bbrezillon, well you have your threaded irq handler running in parallel with the tdr work
<danvet> and those can race with the async irq delivery on top in funny ways
<bbrezillon> even if I call synchronize_irq()?
<bbrezillon> after masking the interrupts
<danvet> maybe
<danvet> ordering gets very very funny here
<danvet> also you don't want to synchronize against your irq handler
<danvet> you need to synchronize against the hw fault thing itself
sarnex has quit [Quit: Quit]
<danvet> i.e. guarantee you match up the fault with the right job
<danvet> since your irq handler can race against the scheduler hammering in a new job
<bbrezillon> there's one handler for all queues
<danvet> yeah but lots of scheduler threads
<bbrezillon> queues have been stopped at that point
<danvet> yeah that might work I guess
thellstrom has joined #dri-devel
<bbrezillon> I need this synchronization to soft-stop non-faulty jobs anyway
<bbrezillon> (so they can be resumed after the reset, otherwise they'll trigger new faults)
sdutt has joined #dri-devel
sarnex has joined #dri-devel
sarnex has quit []
<danvet> bbrezillon, oh how do you soft-stop other queues?
<danvet> ah there's a timeout
<bbrezillon> yep, there's not much we can do if soft-stop takes too long, so some jobs might end up faulting after the reset in case the soft-stop failed
<bbrezillon> but I guess that's fine, given it's unlikely to happen
<tarceri> someone needs to update the developers page on mesa3d.org just clicked on it by accident. "Intel has recently contributed the new GLSL compiler in Mesa 7.9"
ddavenport has joined #dri-devel
sarnex has joined #dri-devel
<daniels> tarceri: that someone could be you!
<tarceri> yeah it could be, but probably won't be today hehe. Too late to write anything that makes sense right now
<daniels> pscht, it's not even midnight :P
ddavenport has quit [Remote host closed the connection]
sarnex has quit [Quit: Quit]
<tarceri> daniels: We were silly enough to go back for baby number 3. I've slept in and also had an afternoon nap and I'm still tired :)
<tarceri> 6 months in now so not so bad, but last night was rough :P
pcercuei has joined #dri-devel
<daniels> tarceri: oh, congrats!
<daniels> I think you've earned a pass then
<tarceri> haha cheers mate
camus1 has quit []
Peste_Bubonica has joined #dri-devel
randomher0 has joined #dri-devel
<bbrezillon> danvet: and here's a new deadlock after moving everything to the timeout handler https://gitlab.freedesktop.org/-/snippets/2238 :-(
<pepp> does a X11 compositor import apps' framebuffers using dma-buf itself? or is this done by Xorg/glamor and the compositor gets a X11 object (pixmap?) to draw?
sdutt has quit []
sdutt has joined #dri-devel
<danvet> bbrezillon, hm the cancel_delayed_work stuff I'm not sure why you need that?
<danvet> bbrezillon, yeah you can't take that lock in there
<danvet> strictly speaking at least
<MrCooper> pepp: a compositor which uses OpenGL gets a dma-buf via DRI3 for Xorg's window pixmap, which is not the same as any of the buffers a direct rendering client draws to
<bbrezillon> danvet: to cancel work queued by the job interrupt handling logic calling drm_sched_fault()
<bbrezillon> before we reset the GPU
<danvet> bbrezillon, hm ...
<danvet> feels fragile, but might work
<bbrezillon> that's what we've been discussing above :)
<bbrezillon> but that's unrelated to the deadlock :)
<danvet> bbrezillon, ok so the drm_gem_object_put in panfrost_job_cleanup is murky
<bbrezillon> danvet: does that mean I can't call drm_sched_stop()?
<danvet> or well your cleanup code
<danvet> the standard code pattern is to trylock in the gem_bo cleanup code
<danvet> and if the trylock fails, push the final cleanup onto a separate worker
<bbrezillon> ok
<danvet> so maybe do the same trick of dropping the annotation right now, but explain why for each
sarnex has joined #dri-devel
<danvet> if you revamp the bo locking that's an entire different can of worms
<danvet> since ideal would be to just use dma_resv_lock as bo lock
<danvet> and then maybe share a bunch of the infra around that in shmem helpers
<danvet> like the cleanup work queue
<danvet> bbrezillon, I guess if you patch up drm_gem_shmem_free_object it would work, but with the current design of separate mutex for everything instead of just dma_resv_lock it gets rather ugly
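For reference, a minimal sketch of the trylock-or-defer pattern mentioned above, assuming dma_resv_lock doubles as the BO lock; the struct and teardown helper are made up for illustration:

    #include <linux/dma-resv.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>
    #include <drm/drm_gem.h>

    struct sketch_bo {
       struct drm_gem_object base;
       struct work_struct free_work;   /* runs the deferred teardown */
    };

    static void teardown_mappings_locked_placeholder(struct sketch_bo *bo);

    static void sketch_gem_free_object(struct drm_gem_object *obj)
    {
       struct sketch_bo *bo = container_of(obj, struct sketch_bo, base);

       if (!dma_resv_trylock(obj->resv)) {
          /* the final unref came from a context where blocking on the
           * resv lock could recurse or deadlock (e.g. the tdr path),
           * so hand the teardown to a worker instead */
          schedule_work(&bo->free_work);
          return;
       }

       /* uncontended: tear down mappings and free right away */
       teardown_mappings_locked_placeholder(bo);
       dma_resv_unlock(obj->resv);
       drm_gem_object_release(obj);
       kfree(bo);
    }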
<pepp> MrCooper: "Xorg's window pixmap" is for instance the surface created by the app using eglCreateWindowSurface, right?
<MrCooper> no, it's a pixmap which Xorg uses for storing the window contents
<MrCooper> Xorg creates that pixmap implicitly when the compositor redirects the window
<danvet> panfrost_gem_mapping_put might have the same issue
ddavenport has joined #dri-devel
<bbrezillon> danvet: ouch
<danvet> bbrezillon, in reality you probably can't hit this as long as you don't have a shrinker or anything like that
<danvet> it's another memalloc recursion
<danvet> bbrezillon, so I'd do the hack + big comment for now
<danvet> that way at least the locking issues with the tdr are clean and will stay clean
<danvet> and I think from that pov the new tdr is much better
<bbrezillon> hack == disable annotation around drm_sched_stop(), right?
<pepp> MrCooper: so this pixmap is a copy of the window content, and the compositor imports it to draw the composited desktop. Correct?
<MrCooper> not "a copy of", other than that yeah
<MrCooper> it's the canonical contents
elongbug has joined #dri-devel
<pepp> MrCooper: hmm... is this pixmap the one created by (glamor_)pixmap_from_fd?
<MrCooper> no
<MrCooper> that is used for creating pixmaps corresponding to DRI3 buffers
<pepp> meh, it's all too confusing :(
<MrCooper> the contents of those buffers are copied to the window (i.e. effectively to the window pixmap) via PresentPixmap
<MrCooper> the compositor gets a damage event for that and then draws the window contents using its representation of the window pixmap
ddavenport has quit [Remote host closed the connection]
RobertChiras has joined #dri-devel
RobertC has quit [Remote host closed the connection]
aravind has joined #dri-devel
<pepp> MrCooper: thanks a lot for your answers. FWIW the context is MR !11449
mattrope has joined #dri-devel
aravind has quit []
aravind has joined #dri-devel
bl4ckb0ne has quit [Remote host closed the connection]
emersion has quit [Remote host closed the connection]
emersion has joined #dri-devel
<danvet> bbrezillon, yup
<danvet> with big comment explaining why
emersion has quit [Remote host closed the connection]
<danvet> karolherbst, [PATCH] remove unused varialble "struct device *dev"
<danvet> lacking drm/nouveau: prefix, but it's for you I think
RobertChiras has quit [Ping timeout: 480 seconds]
Duke`` has joined #dri-devel
emersion has joined #dri-devel
bl4ckb0ne has joined #dri-devel
<karolherbst> yeah, looks like it
<karolherbst> I think we should move those patches through drm-misc or so
ddavenport has joined #dri-devel
<danvet> karolherbst, we still haven't fixed you up with drm-misc commit rights!
<danvet> mripard, tzimmermann mlankhorst ack for karolherbst for pushing oddball nouveau fixes to drm-misc?
camus has joined #dri-devel
adavy has quit [Ping timeout: 480 seconds]
adavy has joined #dri-devel
soreau has quit [Remote host closed the connection]
soreau has joined #dri-devel
aravind has quit []
Peste_Bubonica has quit [Quit: Leaving]
camus1 has joined #dri-devel
camus has quit [Ping timeout: 480 seconds]
hch12907 has quit [Remote host closed the connection]
ddavenport has quit [Remote host closed the connection]
GloriousEggroll has joined #dri-devel
<danvet> bbrezillon, can you review/merge the 3 panfrost patches I sent out directly?
<danvet> they should be just stand-alone fixes
heat_ has joined #dri-devel
tobiasjakobi has joined #dri-devel
camus1 has quit [Remote host closed the connection]
<tzimmermann> karolherbst, danvet, sure, why not? do you need commit rights, or just want to push a patch?
gouchi has joined #dri-devel
<karolherbst> tzimmermann: in the end I just want to push patches from time to time
<danvet> that's kinda what commit rights are for
<karolherbst> yeah.. I guess
<karolherbst> :p
<karolherbst> I wouldn't mind doing MRs, but I guess drm-misc is not ready for this
<danvet> well I guess you could try help make it happen
<danvet> we'd need some gitlab build bots and some scripts
<karolherbst> yeah, after I am done doing it for nouveau
<danvet> but that's about it
<danvet> well could do it in drm-misc for nouveau :-)
<karolherbst> we already have our own gitlab project with everything set up
ecrn has joined #dri-devel
<karolherbst> dunno if it makes sense to do it in drm-misc long term
<bbrezillon> danvet: done
<danvet> bbrezillon, should I just push them to drm-misc-next?
<bbrezillon> yep
<daniels> btw we should be ready from an infra capacity PoV to welcome DRM with open arms
<daniels> so it would be nice to bleed in a couple more projects (not drm-intel/drm-tip since they're the heaviest for both commit volume + pulls AFAICT?) and test the waters
libv_ has joined #dri-devel
<danvet> bbrezillon, ok, will do later today
<danvet> bbrezillon, btw since you have multiple engines, don't you hit issues with oversync with the current uapi?
<danvet> it's always a write hazard, and not even opt-out for vk driver
libv has quit [Ping timeout: 480 seconds]
hch12907 has joined #dri-devel
<bbrezillon> danvet: not sure what oversync is. If you mean that we're syncing more than required, then it's definitely the case
<bbrezillon> as in (no access flags on implicit deps on BOs, and no way to disable implicit deps)
alanc has quit [Remote host closed the connection]
alanc has joined #dri-devel
valentind has quit [Remote host closed the connection]
valentind has joined #dri-devel
khfeng has quit [Ping timeout: 480 seconds]
<danvet> bbrezillon, yeah that's what I mean with oversync
<danvet> although I think the best way to do this is going to be slightly different with jekstrand 's import/export stuff
<danvet> at least for vk
<danvet> hm and it looks incomplete
<danvet> you have the read vs write thing
<danvet> but not the follow implicit sync vs not thing
<danvet> or I'm missing something
<danvet> bnieuwenhuizen, do you expect me to do the revision to switch from fd to context?
ngcortes has joined #dri-devel
lynxeye has quit [Quit: Leaving.]
qyliss has quit [Quit: bye]
<mlankhorst> karolherbst: sure go right ahead
qyliss has joined #dri-devel
ngcortes has quit [Remote host closed the connection]
ngcortes has joined #dri-devel
<tzimmermann> i'll ack the request
<karolherbst> okay, cool
tzimmermann has quit [Quit: Leaving]
karolherbst has quit [Quit: Konversation terminated!]
unerlige2 has left #dri-devel [#dri-devel]
gouchi has quit [Remote host closed the connection]
karolherbst has joined #dri-devel
unerlige has joined #dri-devel
gouchi has joined #dri-devel
bcarvalho has quit [Quit: Leaving]
ecrn has quit [Remote host closed the connection]
<bnieuwenhuizen> danvet: happy to take this stuff over
<bnieuwenhuizen> since you and Christian figured out the dma_resv mess this looks way more feasible for me to solve :)
<danvet> bnieuwenhuizen, yeah I'd be very happy to offload this to you
<danvet> there's still a bunch of tricky auditing for me left to do around dma_resv
<bnieuwenhuizen> yeah just was apprehensive to mess around while Christian was still up in arms around semantics
<danvet> bnieuwenhuizen, and I guess once we have radv sorted the next step would be to make sure radeonsi doesn't oversync too
<bnieuwenhuizen> but that seems mostly solved now :)
<danvet> yeah totally makes sense
<danvet> and yeah I think we've resolved about 3 years of talking past each another these last 2 weeks or so
cphealy_ has joined #dri-devel
cphealy has quit [Remote host closed the connection]
gouchi has quit [Remote host closed the connection]
elongbug has quit [Remote host closed the connection]
cphealy_ has quit []
gouchi has joined #dri-devel
gpoo has joined #dri-devel
heat_ has quit [Remote host closed the connection]
yoslin has quit [Quit: WeeChat 3.2]
randomher0 has quit [Ping timeout: 480 seconds]
<zmike> did v3d ci fall off a cliff?
<daniels> zmike: yeah, they all went off simultaneously, so either Igalia's NUC immolated, or someone put a bulldozer through their fibre
<daniels> please disable all the VC4+V3D jobs
<daniels> samuelig, jasuarez, tanty: ^ I assume you already know this, but ...
<jasuarez> Ups
<jasuarez> Let me check
<Sachiel> zmike killing all the CIs
<jasuarez> I'm rebooting it
<jasuarez> Seems something wrong happened
iive has joined #dri-devel
<jasuarez> zmike: daniels fixed
<daniels> jasuarez: thanks for the quick fix!
<jasuarez> i've restarted the jobs
<daniels> jasuarez: as well as for the expert insight of 'something wrong happened' :D
<jasuarez> thanks for warning!
<jasuarez> :)
<daniels> jasuarez: hope the rest of your night is more relaxing
<jasuarez> we have a red button labelled "Press in case of emergency" :D
<daniels> haha
<zmike> jasuarez: thanks!
<jasuarez> Exactly! 😁
Daanct12 has joined #dri-devel
Danct12 has quit [Ping timeout: 480 seconds]
boistordu has quit [Remote host closed the connection]
boistordu has joined #dri-devel
iive has quit [Ping timeout: 480 seconds]
gouchi has quit [Remote host closed the connection]
Duke`` has quit [Ping timeout: 480 seconds]
rpigott has quit [Remote host closed the connection]
rpigott has joined #dri-devel
andrey-konovalov has quit [Ping timeout: 480 seconds]
pnowack has quit [Quit: pnowack]
bcarvalho has joined #dri-devel
mlankhorst has quit [Ping timeout: 480 seconds]
Daanct12 has quit [Ping timeout: 480 seconds]
Danct12 has joined #dri-devel
YuGiOhJCJ has joined #dri-devel
rasterman has quit [Quit: Gettin' stinky!]
pcercuei has quit [Quit: dodo]
reductum has quit [Quit: WeeChat 2.8]
reductum has joined #dri-devel
fcarrijo has joined #dri-devel
<pinchartl> dianders: I've found a bit of time to look at the sn65dsi86 again. the good news is that the latest changes in drm-misc don't break it completely for my use case :-)
<dianders> pinchartl: \o/ Yay! That was my belief when I looked at everything, but good you were able to confirm.
<dianders> Now you can race with bamse (who's trying to implement the PWM function of the chip) to see who has to rebase their changes on who. ;-)
danvet has quit [Ping timeout: 480 seconds]
<pinchartl> fortunately PWM shouldn't affect me :-)
<pinchartl> I've replied to a few of your e-mails, once we agree on a few open questions, I'll send a v2
<dianders> pinchartl: Just finished responding to your most recent email. I don't think there were any open questions to me on your earlier email (the one in regards to patch 10), but if I missed one then please yell!
<pinchartl> no, it's mostly the last one
sarnex has quit [Ping timeout: 480 seconds]
<pinchartl> regarding the scrambler, I have to run tests