#dri-devel on 2023-05-02 — irc logs at oftc.irclog.whitequark.org

2022-12-21 00:45 ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar

00:05 Haaninjo has quit [Quit: Ex-Chat]

00:11 zzoon_2 has joined #dri-devel

00:16 Piraty has quit [Remote host closed the connection]

00:28 mbrost_ has quit [Remote host closed the connection]

00:28 mbrost_ has joined #dri-devel

00:30 pallavim has quit [Ping timeout: 480 seconds]

00:33 ngcortes has quit [Read error: Connection reset by peer]

00:39 Piraty has joined #dri-devel

00:41 mbrost_ has quit [Ping timeout: 480 seconds]

00:44 Piraty has quit []

00:47 Piraty has joined #dri-devel

00:50 co1umbarius has joined #dri-devel

00:51 Piraty has quit [Remote host closed the connection]

00:51 columbarius has quit [Ping timeout: 480 seconds]

00:52 Piraty has joined #dri-devel

00:54 Kayden has joined #dri-devel

00:55 penguin42 has quit [Remote host closed the connection]

01:20 mbrost_ has joined #dri-devel

01:27 mbrost_ has quit [Remote host closed the connection]

01:27 mbrost_ has joined #dri-devel

01:38 mbrost_ has quit [Ping timeout: 480 seconds]

01:39 mbrost_ has joined #dri-devel

01:53 mbrost_ has quit [Ping timeout: 480 seconds]

01:54 mbrost_ has joined #dri-devel

02:04 TMM has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

02:04 TMM has joined #dri-devel

02:09 mbrost_ has quit [Ping timeout: 480 seconds]

02:28 mbrost_ has joined #dri-devel

02:34 dcl^ has joined #dri-devel

02:45 aravind has joined #dri-devel

02:51 Guest12549 has quit [Quit: ZNC 1.8.2 - https://znc.in]

02:52 peelz has joined #dri-devel

02:53 peelz is now known as Guest242

02:57 heat has quit [Ping timeout: 480 seconds]

03:07 aravind has quit [Ping timeout: 480 seconds]

03:08 aravind has joined #dri-devel

03:46 alanc-away has quit [Remote host closed the connection]

03:47 alanc-away has joined #dri-devel

03:54 mbrost_ has quit [Remote host closed the connection]

03:54 mbrost_ has joined #dri-devel

04:36 mbrost__ has joined #dri-devel

04:38 mbrost has joined #dri-devel

04:41 mbrost_ has quit [Ping timeout: 480 seconds]

04:44 danvet has joined #dri-devel

04:45 mbrost__ has quit [Ping timeout: 480 seconds]

04:53 orbea has quit [Remote host closed the connection]

04:53 orbea has joined #dri-devel

05:05 Zopolis4_ has joined #dri-devel

05:18 itoral has joined #dri-devel

05:42 mbrost has quit [Read error: Connection reset by peer]

05:42 Company has quit [Quit: Leaving]

05:42 bmodem has joined #dri-devel

05:44 mbrost has joined #dri-devel

05:45 mbrost_ has joined #dri-devel

05:53 mbrost has quit [Ping timeout: 480 seconds]

05:53 dcz has joined #dri-devel

06:12 tzimmermann has joined #dri-devel

06:13 camus1 has quit [Ping timeout: 480 seconds]

06:16 mvlad has joined #dri-devel

06:27 rasterman has joined #dri-devel

06:32 vliaskov has joined #dri-devel

06:49 camus has joined #dri-devel

06:49 frieder has joined #dri-devel

06:49 frieder has quit [Remote host closed the connection]

06:51 i-garrison has quit [Ping timeout: 480 seconds]

06:51 i-garrison has joined #dri-devel

06:55 kzd has quit [Quit: kzd]

06:57 jfalempe has joined #dri-devel

06:58 sghuge has quit [Remote host closed the connection]

06:58 sghuge has joined #dri-devel

07:01 pjakobsson has quit [Remote host closed the connection]

07:14 frieder has joined #dri-devel

07:15 Zopolis4_ has quit [Quit: Connection closed for inactivity]

07:19 frieder has quit []

07:19 frieder has joined #dri-devel

07:20 tursulin has joined #dri-devel

07:21 pcercuei has joined #dri-devel

07:28 kts has joined #dri-devel

07:28 urja has quit [Read error: Connection reset by peer]

07:29 urja has joined #dri-devel

07:30 Haaninjo has joined #dri-devel

07:49 <airlied> Venemo: hey you might know, do mesh shaders outputs get fixed function clipped like any outputs from the vert stages?

07:57 Ziemas has quit [Ping timeout: 480 seconds]

07:58 lynxeye has joined #dri-devel

08:09 vliaskov has quit [Ping timeout: 480 seconds]

08:10 vliaskov has joined #dri-devel

08:12 pochu has joined #dri-devel

08:14 kts has quit [Quit: Konversation terminated!]

08:17 swalker_ has joined #dri-devel

08:18 swalker_ is now known as Guest262

08:19 swalker__ has joined #dri-devel

08:20 <javierm> danvet: I wrote on Friday the patch you suggested for mutter but have some issues with the compat layer, the damaged clips are added to the plane's atomic state before committing

08:21 <javierm> danvet: I've added my findings in the MR https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/2979, it would be great if you could take a look when you have some time

08:21 <javierm> tzimmermann: probably you can help here too ^

08:21 mbrost_ has quit []

08:26 Guest262 has quit [Ping timeout: 480 seconds]

08:28 elongbug has joined #dri-devel

08:42 <bbrezillon> danvet: With drm_sched_resubmit_jobs() being deprecated, I wonder who's supposed to set the job's parent field back to something non-NULL before drm_sched_start() is called (those are set to NULL in drm_sched_stop(), when the fence callback is removed). Are drivers supposed to manually iterate over the pending_list to set it up?

08:46 mbrost has joined #dri-devel

08:47 mbrost_ has joined #dri-devel

08:48 <dj-death> any radv folk willing to glance over the NIR/Spirv bits of KHR_ray_tracing_position_fetch? : https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22734

08:54 mbrost has quit [Ping timeout: 480 seconds]

09:02 mwalle has joined #dri-devel

09:14 devilhorns has joined #dri-devel

09:17 thellstrom has joined #dri-devel

09:21 gawin has joined #dri-devel

09:24 <javierm> emersion: ah, Zack did post a v2 that I missed before https://lore.kernel.org/all/50fd57193508f33a1e559ef74599c9e52764850d.camel@vmware.com/T/

09:30 <mareko> airlied: yes

09:31 <mareko> airlied: mesh shaders are just like vertex shaders, but are launched as compute shaders that must produce "VS outputs"

09:36 <emersion> javierm: ah, so what's missing?

09:38 <javierm> emersion: only a single thing that danvet mentioned, I've just asked in the mailing list to Zack if plans to post a v3 or we could help with that

09:38 <javierm> emersion: https://lore.kernel.org/all/YvPfedG%2FuLQNFG7e@phenom.ffwll.local/

09:39 <javierm> basically not using a DRIVER_VIRTUAL driver cap and instead add a new plane type

09:42 <javierm> a DRM_PLANE_TYPE_VIRTUAL_CURSOR or something like that. Since other than that the cursor, the virtual machine KMS drivers behave by the uAPI contract

09:46 <javierm> emersion: but I'm very happy that this is almost ready to land, I first thought that would need to implement the whole thing to get virtio-gpu out of the mutter atomic deny list

09:47 YuGiOhJCJ has joined #dri-devel

09:50 <emersion> hm, not sure a new plane type makes sense

09:50 <emersion> you'd also need the not-fun part: igt

09:55 _xav_ has quit [Ping timeout: 480 seconds]

10:03 _xav_ has joined #dri-devel

10:08 djbw has quit [Ping timeout: 480 seconds]

10:10 mbrost_ has quit [Ping timeout: 480 seconds]

10:12 <danvet> javierm, don't use dirtyfb and page_flip together

10:12 <danvet> dirtyfb is for frontbuffer rendering, no page flip

10:12 <danvet> and hence the fb check to make sure that you actually report damage for the current fb

10:12 <danvet> page_flip is for a new buffer and has an implicit damage of everything

10:13 <danvet> if you want both page flip and damage, you must use atomic

10:13 <danvet> javierm, I don't have an account there, please add this to the mr ^^^

10:14 <danvet> bbrezillon, I'm lost, or well not on top of sched discussions enough to have any idea of what you should do :-/

10:14 <danvet> emersion, btw on your hdr hackfest summary, unstable uabi isn't a thing in upstream

10:15 <emersion> hm

10:15 <danvet> all the "hide it behind knobs" we've done was for uapi we planned to keep forever, but weren't sure we have all the pieces and the kernel+userspace stack was bug-free enough to unleash it on users

10:15 <danvet> but uapi you plan to actually break/remove is a completely different kind of thing

10:16 <danvet> unless you are very careful with not leaking users and make sure all users can fall back to something you wont get regression reports for

10:16 <danvet> it'll be a regression and the experimental uabi becomes rather permanent

10:16 <emersion> i'm personally in the "let's not merge vendor specific stuff" camp FWIW

10:17 <emersion> but this is something we wanted to discuss on-list

10:17 <emersion> maybe it's no big deal if it's stable

10:17 <danvet> I'm no fan of vendor kms props either

10:18 <emersion> harry seemed okay with maintaining this uAPI forever for this hw

10:18 <danvet> your writeup at least sounded like that's the compromise because you can remove it again

10:18 <danvet> and there's solid chances that's not how it'll play out

10:18 <emersion> yeah… that's how we wanted to compromise

10:18 <danvet> yeah if it's for some specific gpu just to get things going it might be ok

10:18 zzoon_2 has quit [Ping timeout: 480 seconds]

10:19 <danvet> yeah that's not really how it works :-)

10:19 <emersion> okay, good to know

10:19 <emersion> yup, it would just be for AMD DCN 3

10:19 <emersion> but then nothign stops somebody else from wanting to do the same thing

10:19 <emersion> or expand to more hw

10:19 <emersion> then things get into a meh state

10:20 <danvet> yeah I think we need really solid reasons that we really can't get off a good per-plane color mgmt api without going vendor specific at first

10:20 <emersion> anyways, this is all mostly to deal with the generic uAPI being stuck

10:20 <emersion> i really hope we can un-stuck it now

10:21 <emersion> a lot of frustration seems to come from the cost of adding new uAPI

10:21 <emersion> (from me first tbh)

10:21 <emersion> we've also discussed mandatory vkms impl for new uAPI BTW

10:22 <emersion> (which is at odds with the above)

10:22 <emersion> (vkms not caught up enough for this to be practical just yet anyways)

10:23 <Venemo> airlied: yes, mesh shader output rasterization works exactly the same as any other

10:24 <javierm> danvet: ah, I see. So then was my misunderstanding how the legacy KMS API should be used. Thanks for the clarification, I'll just drop that MR then and instead focus on using atomic for virtio-gpu

10:25 <danvet> emersion, yeah I think vkms for blending uapi would be nice, so that you can validate the igts against something everyone can use

10:25 <emersion> javierm: nice :)

10:27 <javierm> emersion, danvet: it was meant to be a temporary workaround but I understand now that legacy KMS + damage clipping + page flip isn't a supported combination

10:28 <javierm> emersion, danvet: about uAPI, I remember than in media/v4l2 at some point new features were merged but just the uAPI not exposed

10:29 <danvet> yeah that'd be an option too

10:29 <javierm> that seems to be a good compromise to at least experiment and try to find patterns between the different drivers to define a generic uAPI

10:30 <emersion> uAPI not exposed?

10:30 <emersion> how does that work?

10:31 <javierm> emersion: people will need to run a patched kernel but it's better than keeping all the patches out-of-tree for a long time

10:31 <emersion> ah, so just a patch to #define ENABLE_WHATEVER basically?

10:31 <emersion> that's a bit weird

10:32 <javierm> emersion: it is, yeah. But better than getting stuck with a uAPI that was defined too early

10:32 <emersion> yea

10:33 <ccr> what is needed is a time machine. travel to future to acquire the perfect uAPI and see what mistakes were made, etc.

10:36 anholt_ has joined #dri-devel

10:38 <bbrezillon> danvet: any idea would I should ask?

10:39 <danvet> könig? or whoever marked that function obsolete

10:40 <bbrezillon> also tried hooking up native fence support, as you suggested last time, and it's not clear to me when the core can check the FW seqno against the drm_sched_job->parent->seqno

10:41 anholt has quit [Ping timeout: 480 seconds]

10:41 <bbrezillon> danvet: yep, it's könig, but he doesn't seem to be on IRC

10:41 <danvet> yeah könig's not on irc

10:41 <bbrezillon> oh well, I'll write an email

11:09 invertedoftc096 has quit []

11:11 heat has joined #dri-devel

11:14 elongbug has quit [Remote host closed the connection]

11:14 elongbug has joined #dri-devel

11:24 <karolherbst> jenatali: if you have some time, mind reviewing the clc patch in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22776 ?

11:29 <gawin> karolherbst: that snippet for nv30 where should be put? have you sent patch for next stable release?

11:29 <karolherbst> I have no idea yet

11:30 <karolherbst> would have to figure something out what would be the best place for it

11:30 hansg has joined #dri-devel

11:30 <karolherbst> I've asked Ben tho

11:33 hansg has quit []

11:38 apinheiro has joined #dri-devel

11:56 mbrost has joined #dri-devel

11:57 mbrost_ has joined #dri-devel

12:03 bmodem has quit [Ping timeout: 480 seconds]

12:05 mbrost has quit [Ping timeout: 480 seconds]

12:27 TMM has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

12:27 TMM has joined #dri-devel

12:27 apinheiro has quit [Quit: Leaving]

12:29 itoral has quit [Remote host closed the connection]

12:51 bmodem has joined #dri-devel

13:11 bmodem1 has joined #dri-devel

13:18 bmodem has quit [Ping timeout: 480 seconds]

13:25 dcz has quit [Ping timeout: 480 seconds]

13:26 dcz has joined #dri-devel

13:28 Jeremy_Rand_Talos__ has quit [Remote host closed the connection]

13:28 Jeremy_Rand_Talos__ has joined #dri-devel

13:33 dcl^ has quit [Remote host closed the connection]

13:39 dcz_ has joined #dri-devel

13:40 dcz has quit [Read error: Connection reset by peer]

13:46 invertedoftc096 has joined #dri-devel

13:48 dcz has joined #dri-devel

13:51 dcz_ has quit [Ping timeout: 480 seconds]

13:52 heat has quit [Remote host closed the connection]

13:52 heat has joined #dri-devel

13:55 mbrost_ has quit [Ping timeout: 480 seconds]

13:59 mbrost_ has joined #dri-devel

14:00 mbrost__ has joined #dri-devel

14:01 JohnnyonFlame has quit [Ping timeout: 480 seconds]

14:01 stuarts has joined #dri-devel

14:08 mbrost_ has quit [Ping timeout: 480 seconds]

14:16 rasterman has quit [Quit: Gettin' stinky!]

14:30 fxkamd has joined #dri-devel

14:32 <swick[m]> emersion: you mean the adaptive sync CTS?

14:32 <emersion> yea

14:33 <emersion> there are more CTSes but haven't looked at the others

14:33 <swick[m]> not being able to read DisplayID from the dedicated bus doesn't seem to be a problem... I've not seen a single monitor exposing data there

14:33 greenjustin is now known as Guest304

14:33 <emersion> do you have a command to read everything from user-space via I2C?

14:34 greenjustin has joined #dri-devel

14:35 <emersion> i tried to come up with one but pretty sure i got it wrong

14:37 <swick[m]> ugh, on a custom kernel without i2c-dev rn..

14:37 heat has quit [Read error: Connection reset by peer]

14:37 heat has joined #dri-devel

14:38 <swick[m]> I used i2cdetect and i2cget FWIW

14:39 <emersion> yea that's what i (tried to) use as well

14:42 <swick[m]> and EDDC specifies the addresses

14:52 <emersion> i'm trying `i2cget 3 0xA0 0x00 b 32` and it's not working

14:52 <emersion> "SET Segment 0, device A0h or A4h, start address 00h, READ 128 bytes"

15:11 JohnnyonFlame has joined #dri-devel

15:11 <bbrezillon> mbrost__: Do you have a branch with the latest drm_sched-kthread-to-wq patches, or should I assume https://gitlab.freedesktop.org/drm/xe/kernel/-/commit/18b968ab6a98eb1801a385d6a6a82967f3ebb322 is the latest version?

15:11 greenjustin has quit [Remote host closed the connection]

15:12 greenjustin has joined #dri-devel

15:12 <mbrost__> This my latest version of Xe + DRM scheduler changes: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-upstreaming/-/tree/upstream_prep?ref_type=heads

15:14 <mbrost__> I want to post another, non-RFC, rev of this soon too https://patchwork.freedesktop.org/series/116055/

15:15 <mbrost__> The branch I shared should be more or less Xe based on that RFC + feedback

15:16 <mbrost__> oh, I might have mis-understood your question... multitasking...

15:18 <mbrost__> the version you posted is close, in the branch I shared I allow the user to pass in the run_wq

15:19 <mbrost__> apply this on top... https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-upstreaming/-/commit/d55965795198583f222f2e191cd3edfbdfa64d6a

15:22 Company has joined #dri-devel

15:22 pallavim has joined #dri-devel

15:25 devilhorns has quit [Quit: Leaving]

15:26 frieder has quit [Remote host closed the connection]

15:29 alanc-away has quit [Remote host closed the connection]

15:30 alanc-away has joined #dri-devel

15:34 bmodem has joined #dri-devel

15:36 kzd has joined #dri-devel

15:37 robmur01 has quit [Ping timeout: 480 seconds]

15:40 bmodem1 has quit [Ping timeout: 480 seconds]

15:46 YuGiOhJCJ has quit [Quit: YuGiOhJCJ]

15:47 <swick[m]> emersion: what do you mean with not working? most displays just don't have anything there...

15:47 <emersion> i mean that the parameters i give to i2cget are invalid

15:47 <emersion> it complains about them being out of range

15:49 <swick[m]> there was a gotcha wrt the address but I already forgot

15:52 aravind has quit [Ping timeout: 480 seconds]

15:53 mszyprow has joined #dri-devel

15:57 <swick[m]> emersion: ah yes, the address i2c tools wants is the one without the read/write bit, so A0h right shifted by one

15:58 JohnnyonFlame has quit [Ping timeout: 480 seconds]

15:59 <swick[m]> we should probably add a script to libdisplay-info to read the EDID and DisplayID blobs directly from i2c

15:59 <emersion> that would be handy

16:00 <emersion> hm still getting the error

16:00 <emersion> Error: Chip address out of range (0x08-0x77)!

16:01 <emersion> maybe i'm not understanding what chip vs. data address is?

16:01 fxkamd has quit []

16:01 lemonzest has quit [Quit: WeeChat 3.6]

16:02 <swick[m]> i2cget 3 0x50 0x00 does not work?

16:03 <emersion> bleh sorry, i used 0x80 by mistake

16:04 <emersion> i get the expected "read failed"

16:04 <emersion> hm, but that's the same address for EDID though?

16:05 <emersion> the spec says, for EDID:

16:05 <emersion> "SET Segment 0, device A0h, start address 00h, READ 128 bytes"

16:05 <emersion> and for DisplayID:

16:05 <emersion> "SET Segment 0, device A0h or A4h, start address 00h, READ 128 bytes"

16:05 <emersion> now i'm confused

16:05 <bbrezillon> mbrost__: thanks, I think I had a version with the custom run_wq already

16:05 <emersion> i'd like to check that i can grab the EDID at least

16:05 robmur01 has joined #dri-devel

16:05 <mbrost__> cool, that should be the latest

16:06 <swick[m]> emersion: A0h can contain either EDID or DisplayID, A4h can only contain DisplayID afaiu

16:06 <emersion> i see

16:06 <swick[m]> but A0h should be readable

16:06 <mbrost__> I'll probably get all drm sched changes out today on the list

16:06 <emersion> hm

16:08 CME has quit []

16:11 <bbrezillon> mbrost__: drm_sched_main() still has a loop to dequeue jobs until there's a reason to stop dequeuing (dep no signaled, or all job slots filled). I was wondering if it wouldn't be better to dequeue one job at a time (or at least limit the number of jobs you dequeue) so that you don't block other works scheduled on the same run_wq.

16:12 <bbrezillon> That's particularly important if we want to use the same ordered (single-threaded) workqueue for both the job dequeueing and tdr, to guarantee that nothing tries to queue stuff to the HW while we're resetting it

16:13 bmodem has quit [Ping timeout: 480 seconds]

16:14 <bbrezillon> mbrost__: which is basically what ckönig suggested here https://lore.kernel.org/dri-devel/5c4f4e89-6126-7701-2023-2628db1b7caa@amd.com/

16:15 lemonzest has joined #dri-devel

16:19 mbrost__ has quit [Ping timeout: 480 seconds]

16:28 swalker__ has quit [Remote host closed the connection]

16:33 <jenatali> Down to 41 CTS fails :)

16:40 greenjustin_ has joined #dri-devel

16:42 tzimmermann has quit [Quit: Leaving]

16:46 greenjustin has quit [Ping timeout: 480 seconds]

16:51 kts has joined #dri-devel

17:02 mszyprow has quit [Ping timeout: 480 seconds]

17:02 mbrost has joined #dri-devel

17:04 <mbrost> bbrezillon, that is kinda tricky

17:05 <mbrost> give me a few and let see what it looks like

17:07 rsalvaterra has quit []

17:08 rsalvaterra has joined #dri-devel

17:08 <mbrost> but let's say you use the system_wq, you can as many cores on the machine running in parallel as long as each item is its work_struct

17:08 <mbrost> the system_wq is just a wrapper for a pool of kthreads I think

17:10 <mbrost> so IMO I don't see each work_struct running in a loop while it has work as that big of a deal

17:11 tursulin has quit [Ping timeout: 480 seconds]

17:11 <mbrost> Another option to would be pass in unique run_wq to each scheduler and now you timeslicing

17:11 tlwoerner has quit [Quit: Leaving]

17:11 <mbrost> if you have more work_struct than cores...

17:11 JohnnyonFlame has joined #dri-devel

17:12 <mbrost> or pass in a few ordered wq to many schedulers

17:12 tlwoerner has joined #dri-devel

17:13 kts has quit [Quit: Konversation terminated!]

17:14 <mbrost> but either way let prototype 1 dequeue per pass of the main loop if think that would be better, I'm not going argue this one either way

17:16 Guest304 has quit [Remote host closed the connection]

17:16 greenjustin_ is now known as greenjustin

17:17 <bbrezillon> mbrost: so, multithreaded workqueue improves the situation, yes, but ckönig was actually suggesting using a single-threaded workqueue, so we don't have to call drm_sched_stop/start() to stop/start the drm_sched while we're resetting the GPU

17:19 <mbrost> well you can pass in any run wq you want but confused how you could away from not calling start / stop

17:19 <mbrost> the existing code calls start / stop on the kthread

17:20 <mbrost> we call start / stop in our reset flows and they rock solid, esp when compared to the i915

17:21 <mbrost> do you mean the same ordered wq as the reset one / tdr?

17:22 <bbrezillon> well, if you have a single-threaded wq for your drm_sched workers, and you queue your reset worker to this queue, you're guaranteed that the reset function will only be executed when all drm_sched workers are idle

17:22 <bbrezillon> yes, that's what ckönig suggested, if I'm correct

17:23 <mbrost> yea that is true

17:24 <mbrost> nothing preventing you from doing that now but yea in that case i guess dequeuing 1 item would make a bit more sense

17:25 <mbrost> i don't think I want to design Xe to share a WQ though, calling start / stop is easy enough

17:25 <bbrezillon> mbrost: ok. I guess we need a replacement for drm_sched_job_resubmit() then...

17:26 <bbrezillon> didn't just what you were using in the Xe driver

17:26 <bbrezillon> *didn't check

17:26 <mbrost> and like I said our reset flows are rock solid, I've written tests to hammer these paths as I was super paranoid about this not working after working on the i915

17:26 <mbrost> we just kill the entity scheduler if it hangs

17:27 <bbrezillon> drm_sched_stop()

17:27 <bbrezillon> I guess

17:27 <mbrost> no recovery, I think this faith's idea for all VM bind drivers i think

17:28 <mbrost> We have a buy in from our UMDs

17:28 <mbrost> 1 sec, i'll point you to our TDR callback

17:29 <bbrezillon> you mean remaining jobs are just cancelled?

17:29 yussef has joined #dri-devel

17:29 <mbrost> https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L786

17:29 <mbrost> yea and entity is banned

17:31 <mbrost> that function should be pretty well commented

17:31 alanc-away has quit []

17:32 alanc has joined #dri-devel

17:32 <bbrezillon> hm, okay. So that's for the 'one entity is going crazy, but the GPU and FW-proc are still functional' case

17:33 <mbrost> yes

17:33 <bbrezillon> we do have global hangs where we need to stop all entities, reset the GPU and FW, and start all entities (and by start, I mean kick the previously active entities at the FW level, and call drm_sched_start())

17:34 <mbrost> We also have GT resets (entire GPU is trashed) but in practice that should be impossible or very, very rare

17:34 <mbrost> in that case, we try to find the bad entity, ban it, resubmit the others

17:35 <bbrezillon> that's where synchronization gets messy, because the reset operation, which is always queued on a workqueue, has to wait for all drm_sched workers to be idle

17:36 <mbrost> A GT reset basically should only be triggered by junk hardware or if our KMD has a bug

17:36 <mbrost> yea, let me point you to our GT reset code...

17:37 <mbrost> https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1362

17:37 <mbrost> https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1401

17:38 <bbrezillon> mbrost: would you mind replying to Christian regarding the drm_sched_{stop,start}()? I feel like your opinions are diverging here, and more and more driver keep depending on your work ;-)

17:38 <mbrost> yea we do have to loop over every entity and stop it

17:38 <bbrezillon> yep, that's what we were planning to do initially

17:38 <mbrost> there is nothing stopping from the current interfaces for doing it either way

17:39 <mbrost> s/for/from

17:39 <bbrezillon> it's just that, with drm_sched_resubmit_jobs() being deprecated, I was a bit lost as to what drivers were supposed to do to reassign the drm_sched_job::parent fence

17:40 <mbrost> that is news to me it being deprecated

17:40 Duke`` has joined #dri-devel

17:41 <mbrost> the seems like a unilateral decision, drivers should be allowed to do this either way

17:42 <bbrezillon> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.3&id=5efbe6aa7a0e

17:43 <mbrost> is this merged?

17:43 <bbrezillon> it's in Linus' tree :)

17:43 <mbrost> ugh... I need to pay better attention to the list

17:44 <bbrezillon> was merged in 6.3-r1 apparently

17:45 <mbrost> Xe doesn't need most of the nonsense in that function anyways

17:45 <mbrost> really all we need is loop over pending jobs and call run_job

17:47 <bbrezillon> well, the only thing we'd need in the powervr driver is a loop re-assigning the parent fence, we don't even need to call run_job() (the ringbuf should already be filled and it content preserved, unless I'm mistaken)

17:47 <mbrost> ok maybe we just open code the resubmit

17:48 <bbrezillon> just felt odd to iterate over the pending_list manually

17:48 <mbrost> in Xe the parent fence would be the same anyways

17:48 <bbrezillon> but if that's how it's supposed to be done, I'm fine with that

17:48 <bbrezillon> yep, same here, we don't re-allocate the fence

17:49 <bbrezillon> it's the same object, same seqno

17:49 <mbrost> and ref count is still correct too

17:50 <bbrezillon> in any case, we should document what drivers are expected to do between the stop and start calls, because it's unclear right now

17:51 <bbrezillon> like, the bare minimum is to re-assign parent fences, otherwise pending jobs are dropped when drm_sched_start() is called

17:51 <mbrost> again I don't think we should force a driver to do anything

17:53 <mbrost> I'm confused by that last statement

17:53 <bbrezillon> well, it's not about forcing them to do anything, but if you call drm_sched_start() after having kicked the queue, your driver is likely to end up with use-after-free bugs

17:53 <mbrost> in Xe you can run_job as many times as you want you always get the same fence back

17:53 <bbrezillon> at least that's what happens if you keep in-flight jobs in some internal list/queue

17:54 <mbrost> which is assigned to the parent

17:55 <mbrost> it is fence which looks at a memory location for a value to be greater or equal to the jobs seqn

17:55 <bbrezillon> oh, you're taking about this statement https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L562

17:55 <bbrezillon> yep, same here: u32 mem object containing the seqno

17:57 <bbrezillon> and the dma_fence is just a wrapper around that, with dma_fence::seqno being checked against the FW fence object seqno

17:57 <bbrezillon> run_job() does queue things to the ringbuf though

17:57 <mbrost> i think I understand what you are getting at

17:58 JohnnyonFlame has quit [Ping timeout: 480 seconds]

17:58 <bbrezillon> but we can easily add a "submitted" boolean to avoid resubmitting it

17:58 rasterman has joined #dri-devel

17:59 <bbrezillon> and even if we were actually resubmitting, that's not a big deal, as long as we reset the ringbuf pointer before kicking the queue

17:59 <mbrost> we stop the signaling of fences returned from run_job in our reset flows

17:59 <mbrost> xe_hw_fence_irq_stop

17:59 <mbrost> xe_hw_fence_irq_start

18:00 tobiasjakobi has joined #dri-devel

18:01 tobiasjakobi has quit []

18:03 <mbrost> resets are confusing

18:04 <mbrost> if it would help I can post a kernel doc patch explaining how Xe does these and avoids races

18:04 <mbrost> it might already done this at one point...

18:07 <bbrezillon> dunno, I really thought everyone was on-board with this drm_sched_resubmit_jobs() deprecation

18:08 <bbrezillon> looks like it's not really the case...

18:13 <mbrost> If you do a GT reset (think of it as a power cycle) and you don't want lose all pending work, run_job needs to be called again

18:13 <mbrost> on all pending jobs

18:14 <mbrost> get rid of drm_sched_resubmit_jobs() fine, Xe will just have a loop where it calls run job

18:16 <bbrezillon> then what's the point of deprecating it, if people keep open-coding it...

18:16 <mbrost> yea that what i'm confused about

18:17 <bbrezillon> I mean, I can code it differently, and optimize the resubmit logic, but why should I bother, it's not a hot-path anyway

18:17 <mbrost> how would a pending job complete after a power cycle unless you stick it back on the hardware

18:21 <bbrezillon> tried reading amdgpu code, and I don't quite get how they do it...

18:28 mbrost has quit [Ping timeout: 480 seconds]

18:30 mbrost has joined #dri-devel

18:31 <mbrost> i might have lost a few comments

18:31 macromorgan is now known as Guest324

18:31 vliaskov has quit [Quit: Leaving]

18:31 macromorgan has joined #dri-devel

18:34 <mbrost> 'implement

18:34 <mbrost> + * recovery after a job timeout.' - yea that isn't how Xe is using it

18:34 <mbrost> Xe uses after the entire device is reset

18:36 lynxeye has quit [Quit: Leaving.]

18:39 Guest324 has quit [Ping timeout: 480 seconds]

18:39 CME has joined #dri-devel

18:42 jewins has joined #dri-devel

18:48 mvlad has quit [Remote host closed the connection]

18:51 djbw has joined #dri-devel

18:57 JohnnyonFlame has joined #dri-devel

19:02 junaid has joined #dri-devel

19:08 camus1 has joined #dri-devel

19:14 camus has quit [Ping timeout: 480 seconds]

19:21 iive has joined #dri-devel

19:26 CME has quit []

19:26 CME has joined #dri-devel

19:28 mbrost has quit [Remote host closed the connection]

19:28 mbrost has joined #dri-devel

19:31 elongbug has quit [Remote host closed the connection]

19:32 elongbug has joined #dri-devel

19:36 mbrost has quit [Remote host closed the connection]

19:36 mbrost has joined #dri-devel

19:42 gawin has quit [Ping timeout: 480 seconds]

19:49 junaid has quit [Remote host closed the connection]

19:49 elongbug has quit [Read error: Connection reset by peer]

19:53 mbrost has quit [Ping timeout: 480 seconds]

20:15 mbrost has joined #dri-devel

20:26 gawin has joined #dri-devel

20:40 stuarts has quit [Remote host closed the connection]

20:42 Duke`` has quit [Ping timeout: 480 seconds]

20:56 apinheiro has joined #dri-devel

21:04 dcz has quit [Ping timeout: 480 seconds]

21:05 <mbrost> bbrezillon: drm sched changes we discussed: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-upstreaming/-/commit/2cdb17a975e12e9206b5a9c433581699c67c619f

21:05 <anholt_> Have you wished that piglit didn't take so long? I have an MR for you! https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/804

21:05 <mbrost> Seems to work just fine, it would be fun with we had some benchmarks and see if this had a difference

21:08 <jenatali> anholt_: <3

21:09 <zmike> anholt++

21:12 <anholt_> I swear we need to have static int called = 0; assert(called++ < 1000) in piglit_probe_pixel.

21:14 <jenatali> Sounds like a good idea to me

21:29 thellstrom has quit [Ping timeout: 480 seconds]

21:35 ngcortes has joined #dri-devel

21:37 <DemiMarie> danvet emersion: What about hiding the uAPI behind BROKEN?

21:37 <DemiMarie> I assume no distro will ship a CONFIG_BROKEN=y kernel, and if they do (or patch out the dependency) they get to keep both pieces.

21:38 stuarts has joined #dri-devel

21:40 <emersion> DemiMarie: what is CONFIG_BROKEN?

21:45 ngcortes has quit [Ping timeout: 480 seconds]

21:48 pcercuei has quit [Quit: dodo]

21:49 <DemiMarie> emersion: It is a catchall, never-enabled Kconfig entry used to disable broken code (hence the name). If there is a Kconfig that should never actually be set, one can use `depends on BROKEN` to make sure it is in fact never set.

21:51 <emersion> i see

21:55 dcz has joined #dri-devel

21:58 Zopolis4_ has joined #dri-devel

21:59 mbrost_ has joined #dri-devel

22:00 ngcortes has joined #dri-devel

22:00 mszyprow has joined #dri-devel

22:01 mbrost__ has joined #dri-devel

22:01 konstantin_ has joined #dri-devel

22:01 mbrost has quit [Ping timeout: 480 seconds]

22:03 gawin has quit [Quit: Konversation terminated!]

22:04 konstantin has quit [Ping timeout: 480 seconds]

22:08 mbrost_ has quit [Ping timeout: 480 seconds]

22:11 rasterman has quit [Quit: Gettin' stinky!]

22:17 pallavim has quit [Ping timeout: 480 seconds]

22:18 apinheiro has quit [Quit: Leaving]

22:24 mszyprow has quit [Ping timeout: 480 seconds]

22:29 mbrost__ has quit [Ping timeout: 480 seconds]

22:31 i509vcb has joined #dri-devel

22:33 pallavim has joined #dri-devel

22:47 mbrost has joined #dri-devel

22:48 mbrost_ has joined #dri-devel

22:50 mbrost_ has quit [Remote host closed the connection]

22:50 mbrost_ has joined #dri-devel

22:53 iive has quit [Quit: They came for me...]

22:55 mbrost has quit [Ping timeout: 480 seconds]

23:06 RSpliet has quit [Quit: Bye bye man, bye bye]

23:07 RSpliet has joined #dri-devel

23:15 danvet has quit [Ping timeout: 480 seconds]

23:19 alatiera2 has joined #dri-devel

23:20 alatiera has quit [Ping timeout: 480 seconds]

23:20 alatiera2 is now known as alatiera

23:21 <airlied> zmike: you any clues on the lvp memory model fails that CI is seeing?

23:30 jernej has quit [Quit: Free ZNC ~ Powered by LunarBNC: https://LunarBNC.net]

23:30 jernej has joined #dri-devel

23:32 penguin42 has joined #dri-devel

23:33 <penguin42> TToTD: If you hit a PROFILING_INFO_NOT_AVAILABLE it might be because you missed a wait() on an event - it took me a while to figure that out

23:33 <penguin42> especially since ROCm didn't complain about it

23:34 <zmike> airlied: I had a ticket where I was tracking all the known fails

23:34 <zmike> if it's not in there I have no idea

23:38 <airlied> ah there was a flake lodged by DavidHeidelberg[m]

23:41 <DavidHeidelberg[m]> ?

23:44 <airlied> you logged a lavapipe flake a while back, seems to be more prevalent now, will have to make the effort to track it down I suppose

23:44 FireBurn has joined #dri-devel

23:44 <airlied> zmike: just saw 20520 fell down the crack

23:46 <DavidHeidelberg[m]> right

23:46 <jenatali> Anybody know who I should ping to get opinions on !22800?

23:48 <airlied> jenatali: whoever wrote your qsort :-P

23:48 <jenatali> Heh, blame the C standard authors for not requiring it to be stable :P

23:50 <zmike> airlied: oh I forgot about that

23:50 <zmike> I'm not actually sure it's correct now that I look again

23:50 oneforall2 has quit [Remote host closed the connection]

23:50 alanc has quit [Remote host closed the connection]

23:52 <airlied> Venemo, mareko : do mesh shaders interact with transform feedback?

23:53 <airlied> ah d3d12 doesn't seem to support it at least

23:53 <jenatali> Right

23:54 greenjustin has quit [Ping timeout: 480 seconds]

23:56 <jenatali> Huh, actually having working caches seems to have taken ~20 minutes off my CTS run time. I'll take it

23:56 <jenatali> Maybe it means I can bump the CI factor from 4 to 3...

23:57 oneforall2 has joined #dri-devel

23:59 <zmike> airlied: no, there is no xfb with mesh