ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
ngcortes has quit [Remote host closed the connection]
martin19_ has joined #dri-devel
martin19 has quit [Read error: Connection reset by peer]
lemonzest has quit [Quit: Quitting]
Lucretia has quit []
Lightkey has quit [Ping timeout: 480 seconds]
Lightkey has joined #dri-devel
camus has joined #dri-devel
YuGiOhJCJ has joined #dri-devel
soreau has quit [Quit: Leaving]
soreau has joined #dri-devel
flto has quit [Remote host closed the connection]
flto has joined #dri-devel
khfeng has joined #dri-devel
luzipher__ has joined #dri-devel
gpoo has quit [Ping timeout: 480 seconds]
luzipher_ has quit [Ping timeout: 480 seconds]
aravind has joined #dri-devel
camus1 has joined #dri-devel
camus has quit [Ping timeout: 480 seconds]
luzipher_ has joined #dri-devel
luzipher__ has quit [Ping timeout: 480 seconds]
sdutt has quit []
sdutt has joined #dri-devel
blue__penquin has joined #dri-devel
<ishitatsuyuki> 1. This line (https://gitlab.freedesktop.org/mesa/mesa/-/blob/fb586a8e3c7259d94f06fb764ac25310e54d3e5a/src/amd/compiler/aco_insert_waitcnt.cpp#L323) suggests that all threads in a wavefront can access LDS coherently without issuing waitcnt. Is this understanding correct?
<ishitatsuyuki> pendingchaos: Hi, I have a question regarding LDS coherency
<ishitatsuyuki> 2. Does the coherency also apply in wave64 mode? When writing my shader I've seen behavior suggesting that it's not coherent, and I suspect subgroupBarrier() might not work correctly in such cases
Duke`` has joined #dri-devel
vpandya has joined #dri-devel
<vpandya> hello, what are the configure/build commands to build lavapipe with meson? Also, how do I use it to run a simple vulkan application?
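For the record, a lavapipe build with meson looks roughly like this. This is a sketch from memory for a Mesa tree of this era; treat the exact option values and the ICD json path as assumptions and check `meson_options.txt` in your checkout:

```sh
# Configure Mesa with only the software Vulkan driver (lavapipe)
# plus the software Gallium rasterizer (llvmpipe).
meson setup build/ -Dvulkan-drivers=swrast -Dgallium-drivers=swrast
ninja -C build/

# Point the Vulkan loader at the freshly built ICD and run an app.
VK_ICD_FILENAMES=$PWD/build/src/gallium/targets/lavapipe/lvp_icd.x86_64.json vkcube
```

Because lavapipe is selected purely through the loader's `VK_ICD_FILENAMES` override, no install step is needed to test it.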
tzimmermann has joined #dri-devel
<JoshuaAshton> Maybe Venemo can answer that
<JoshuaAshton> (the aco thing)
martin19_ has quit [Ping timeout: 480 seconds]
itoral has joined #dri-devel
lemonzest has joined #dri-devel
adjtm is now known as Guest755
adjtm has joined #dri-devel
<vpandya> Can llvmpipe do JIT execution on non-x86 platforms?
<vpandya> I am looking at relevant code which sets up llvmpipe's target for JIT
<HdkR> vpandya: Yes
<vpandya> I think I should look for something like LLVMInitializeNativeTarget in code base
Guest755 has quit [Ping timeout: 480 seconds]
mlankhorst has joined #dri-devel
<Venemo> ishitatsuyuki: LDS access (or any DS instruction in general) always needs a waitcnt lgkmcnt
<ishitatsuyuki> hmm
<ishitatsuyuki> any idea about the linked code above?
<ishitatsuyuki> I interpreted that as an undocumented optimization: accesses within a warp don't need a waitcnt
<Venemo> The highlighted line is for barriers
<ishitatsuyuki> oh
<ishitatsuyuki> I see
<Venemo> It determines the kind of waitcnt that the barrier needs
<ishitatsuyuki> hmm
<ishitatsuyuki> "barrier" includes memory barrier right?
<Venemo> Yes, but there is more than one kind of memory
<ishitatsuyuki> yeah so that's what I'm wondering about
<Venemo> And the barrier can have different scope not just workgroup
<ishitatsuyuki> the thing I noticed first is that subgroupBarrier() didn't have the same effect as barrier() somehow, in a shader with local_size=64 (corresponding to wave64)
<ishitatsuyuki> the former resulted in a race while the latter yielded expected synchronization
<Venemo> Different barriers have different requirements
<Venemo> If more than one wave in a workgroup is accessing the same LDS space then you not only need to wait for LDS but the waves must also wait for each other
<ishitatsuyuki> I think both subgroupBarrier and barrier are full execution+memory barriers, except they have different scopes
<ishitatsuyuki> And in this case, workgroup==subgroup
<Venemo> If the workgroup size is the same as the subgroup size then that is not necessary, since there is only one wave. So it doesn't have to wait for other waves (because there are no other waves), it just only has to wait for the LDS
<ishitatsuyuki> Does it need to wait for the LDS *write*, at the *hardware* level?
<ishitatsuyuki> So again, the perform_barrier code seems to suggest that LDS is internally synchronized within a subgroup
<ishitatsuyuki> actually I should come up with a repro for what I think is a compiler bug
<ishitatsuyuki> later on that
Duke`` has quit [Ping timeout: 480 seconds]
<Venemo> ishitatsuyuki: LDS is a different hardware unit than the ALU so the ALU needs to execute a waitcnt instruction to wait for the result from LDS
<Venemo> Ideally this waitcnt is right before the shader tries to use that result
<ishitatsuyuki> I know that *reads* need to be waited on
<ishitatsuyuki> but the confusion is about *writes*
<ishitatsuyuki> my understanding is that a memory fence basically translates to a waitcnt lgkmcnt for writes
<ishitatsuyuki> is that correct?
<Venemo> Yes, that sounds like it
<Venemo> Well, if you think you found a bug you can easily check the shader disassembly using RADV_DEBUG=shaders
<ishitatsuyuki> then the linked code seems to say that the fence can be omitted if the sync is subgroup scope
<ishitatsuyuki> ok, I'll work on the repro
<Venemo> What exactly is the problem you have?
<ishitatsuyuki> basically subgroupBarrier() not working the same as barrier() when workgroup==subgroup
<Venemo> Not working the same, in what manner?
<ishitatsuyuki> only half (32 threads) of the wave64 are properly synchronized
<ishitatsuyuki> barrier() -> correct, subgroupBarrier() -> race
<Venemo> Race between what?
<Venemo> Can you show me the shader?
<ishitatsuyuki> give me a bit of time, I threw away that working tree yesterday
<tzimmermann> pinchartl, hi. you already reviewed the irq_enabled cleanup. could you also take a look at the rsp armada patch? it's just a one-liner https://patchwork.freedesktop.org/patch/441193/
<ishitatsuyuki> ah I actually realized what I was doing wrong
<pinchartl> tzimmermann: you can add my Rb
<ishitatsuyuki> called subgroupBarrier inside a non-uniform control flow
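For anyone searching later: like barrier(), subgroupBarrier() must be executed in uniform control flow, or the synchronization is undefined. A minimal GLSL sketch of the broken vs. fixed pattern (hypothetical shader, not the actual one discussed here; compute_value() and use() are made-up helpers):

```glsl
shared uint lds_data[64];

void main() {
    lds_data[gl_LocalInvocationIndex] = compute_value(); // hypothetical helper

    // BROKEN: the barrier sits inside a divergent branch, so not every
    // invocation reaches it -- the read below can race with the writes.
    if (gl_LocalInvocationIndex < 32u) {
        subgroupBarrier();
        use(lds_data[gl_LocalInvocationIndex + 32u]); // hypothetical helper
    }

    // FIXED: execute the barrier in uniform control flow, then diverge.
    subgroupBarrier();
    if (gl_LocalInvocationIndex < 32u) {
        use(lds_data[gl_LocalInvocationIndex + 32u]); // now sees the writes
    }
}
```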
sdutt has quit [Remote host closed the connection]
<tzimmermann> pinchartl, thanks
<ishitatsuyuki> sorry, wasn't a compiler issue at all
<Venemo> ishitatsuyuki: okay
<Venemo> If you need help let me know :)
camus1 has quit [Remote host closed the connection]
camus has joined #dri-devel
alanc has quit [Remote host closed the connection]
pnowack has joined #dri-devel
alanc has joined #dri-devel
frieder has joined #dri-devel
blue__penquin has quit [Quit: Connection closed for inactivity]
martin19 has joined #dri-devel
RobertC has joined #dri-devel
danvet has joined #dri-devel
thellstrom has joined #dri-devel
<dschuermann> ishitatsuyuki: that's interesting. according to the ISA, there is no difference to be expected
<ishitatsuyuki> dschuermann: It wasn't a compiler bug, it was my error in the shader code
<dschuermann> unrelated to the barrier?
<ishitatsuyuki> and actually the fault was me accidentally using AMDVLK without noticing...
<ishitatsuyuki> and since it uses wave32 by default the assumption that subgroup==workgroup==64 didn't hold
<ishitatsuyuki> subgroup memory fences are apparently noop in ACO btw
<dschuermann> amdvlk uses wave32 by default for CS? that is interesting...
<ishitatsuyuki> yeah, same for AMD on Windows
<dschuermann> ishitatsuyuki: yeah, Instructions of the same type are returned in the order they were issued. as wave64 just double-issues (for each half), it shouldn't change the order
aravind has quit [Ping timeout: 480 seconds]
<dschuermann> ishitatsuyuki: if you create a workgroup of 64, they should use wave64, though?
<ishitatsuyuki> no... they always use wave32
<dschuermann> ok, that sounds like we should do some benchmarking
<ishitatsuyuki> amdvlk is easily worse at least for my shader
<dschuermann> then you can still use subgroup_size_control
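If a shader really depends on wave64 regardless of the driver's default, VK_EXT_subgroup_size_control lets the app pin the size per stage. A rough C sketch (assumes the extension is enabled and 64 falls within the device's reported min/maxSubgroupSize limits):

```c
/* Chain this into the compute stage's
 * VkPipelineShaderStageCreateInfo::pNext at pipeline creation time. */
VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT required_size = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT,
    .pNext = NULL,
    .requiredSubgroupSize = 64, /* force wave64 */
};
```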
whald has joined #dri-devel
aravind has joined #dri-devel
andrey-konovalov has joined #dri-devel
qawsed420_- has joined #dri-devel
bcarvalho has joined #dri-devel
andrey-konovalov has quit [Ping timeout: 480 seconds]
lynxeye has joined #dri-devel
rasterman has joined #dri-devel
ickle has joined #dri-devel
Lucretia has joined #dri-devel
luzipher__ has joined #dri-devel
luzipher_ has quit [Ping timeout: 480 seconds]
elongbug has joined #dri-devel
frieder has quit [Ping timeout: 480 seconds]
vpandya has quit [Quit: Connection closed for inactivity]
Daaanct12 has joined #dri-devel
frieder has joined #dri-devel
Daanct12 has quit [Ping timeout: 480 seconds]
vivijim has joined #dri-devel
blue__penquin has joined #dri-devel
<bbrezillon> danvet, lynxeye: when/if you have time, could you have a look at "drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr" and "drm/sched: Declare entity idle only after HW submission"?
gpoo has joined #dri-devel
jcristau has joined #dri-devel
pcercuei has joined #dri-devel
iive has joined #dri-devel
jcristau has quit []
hch12907 has quit [Ping timeout: 480 seconds]
jcristau has joined #dri-devel
martin19 has quit [Ping timeout: 480 seconds]
vivijim has quit [Remote host closed the connection]
jcristau has quit []
jcristau has joined #dri-devel
frieder has quit [Ping timeout: 480 seconds]
<bbrezillon> lynxeye: thanks
camus1 has joined #dri-devel
NiksDev has joined #dri-devel
camus has quit [Read error: Connection reset by peer]
frieder has joined #dri-devel
camus has joined #dri-devel
camus1 has quit [Read error: Connection reset by peer]
itoral has quit []
Peste_Bubonica has joined #dri-devel
vivijim has joined #dri-devel
camus has quit [Remote host closed the connection]
camus has joined #dri-devel
flto_ has joined #dri-devel
flto has quit [Ping timeout: 480 seconds]
thellstrom has quit [Quit: thellstrom]
Daaanct12 has quit [Quit: Quitting]
Danct12 has joined #dri-devel
flto_ has quit []
flto has joined #dri-devel
martin19 has joined #dri-devel
agd5f has joined #dri-devel
<danvet> bbrezillon, lynxeye some overview docs for the concurrency design of drm/scheduler would be good
Danct12 has quit [Quit: Quitting]
andrey-konovalov has joined #dri-devel
luzipher_ has joined #dri-devel
hch12907 has joined #dri-devel
Danct12 has joined #dri-devel
luzipher__ has quit [Ping timeout: 480 seconds]
<bbrezillon> danvet, lynxeye: https://gitlab.freedesktop.org/-/snippets/2279 ?
camus1 has joined #dri-devel
thellstrom has joined #dri-devel
camus has quit [Ping timeout: 480 seconds]
<danvet> bbrezillon, s/recover/recovery/
<danvet> and make the callback kerneldoc so it links
<danvet> ?
<danvet> bbrezillon, maybe some more details in @timeout_job kerneldoc about the sequencing that's expected?
<danvet> or maybe put the entire timeout handling discussion in there?
minecrell has quit [Quit: :( ]
<bbrezillon> danvet: ok, so everything moved to the ->job_timedout section, and no mention of that in the overview?
minecrell has joined #dri-devel
<danvet> bbrezillon, yeah I think for those details that makes the most sense ..
minecrell is now known as Guest794
Guest794 is now known as minecrell
minecrell has quit []
minecrell has joined #dri-devel
<bbrezillon> I can also move it to the drm_sched_init() doc (don't know which one is best)
luzipher__ has joined #dri-devel
sdutt has joined #dri-devel
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
luzipher_ has quit [Ping timeout: 480 seconds]
<thellstrom> danvet, mlankhorst, mauld: I'm off for vacation on wednesday so if you have any comments on the gem_migration series, it would be great to have them tonight.
<mlankhorst> I was hoping for CI results, but I think it's too far behind
<danvet> bbrezillon, I'd include how drm_sched_stop/start (or whatever they were) gives you all the guarantees
<danvet> and that you need single-threaded workqueue if you need guarantees across a scheduler
tlwoerner has quit [Remote host closed the connection]
tlwoerner has joined #dri-devel
<bbrezillon> danvet: isn't an ordered wq enough to guarantee that work items execute sequentially?
<danvet> bbrezillon, sure, but that doesn't sync against the kthread
<danvet> and drm_sched_stop only syncs against your own kthread
<bbrezillon> no, I meant, s/single-threaded/ordered/
<danvet> and I'm not seeing that spelled out yet in docs, would be good to add that
<bbrezillon> stop/start are still needed
<bbrezillon> and I agree this should be documented
<danvet> bbrezillon, I mean your para is good, but it's kinda a lonely island in a sea of undocumented things in this area
<ishitatsuyuki> what does an amdgpu soft_recovery do? cancel a single job that timed out?
andrey-konovalov has quit [Remote host closed the connection]
<ishitatsuyuki> in that case, is it likely that a faulty application would get the system permanently stuck, since the kernel only attempts soft_recovery unless that also times out?
<danvet> ishitatsuyuki, maybe I got confused, but on recently reading the amdgpu tdr it falls back to full chip reset if soft recovery fails?
<danvet> full chip reset is a bit an adventure, but in theory it works and should be able to recover
<ishitatsuyuki> I'd appreciate it if you could link to some reference for that
<ishitatsuyuki> I think it depends on how you define soft recovery *fails*
sdutt has quit []
sdutt has joined #dri-devel
<bbrezillon> danvet: how about this one https://gitlab.freedesktop.org/-/snippets/2279
<bbrezillon> ?
<ishitatsuyuki> hmm actually an application should terminate if the GPU job ended with an error
<ishitatsuyuki> I need to do some more tests on this
<danvet> bbrezillon, stellar
<ishitatsuyuki> I'm looking into this recovery thing because in practice recovery has never worked with my GNOME setup
<danvet> bbrezillon, maybe clarify that you need to stop all schedulers potentially impacted by the reset
<danvet> you don't need to globally stop them all
<danvet> e.g. multi-gpu, or maybe you have multiple domains or whatever
<bbrezillon> danvet: done (you can check the new version at the same URL)
<MrCooper> ishitatsuyuki: when soft recovery succeeds, one only notices a freeze for a few seconds, then things continue normally; recovering from a full GPU reset would require the display server (and possibly apps using GPU acceleration as well) to react to it (via robustness extensions) by destroying the existing GPU context and creating a new one
luzipher_ has joined #dri-devel
luzipher__ has quit [Ping timeout: 480 seconds]
<ishitatsuyuki> hmm, I think I found the problem with soft recovery: it often fails to kill the offender
<ishitatsuyuki> actually nevermind. it's probably compositor-killer ignoring errors
<jekstrand> cwabbott: Nice seeing ir3 subgroups finally making progress!
<cwabbott> jekstrand: thanks - currently working through all the failures
<jekstrand> Yeah.....
<cwabbott> the worst thing is that this is a very invasive compiler thing and yet there are no tests involving tricky control flow at all
<cwabbott> i had liveness wrong at the beginning (not accounting for fallthrough edges with scalar registers), but 0 tests failed because of it
RobertC has quit [Ping timeout: 480 seconds]
frieder has quit [Remote host closed the connection]
<MrCooper> ishitatsuyuki: soft recovery doesn't kill anything
<MrCooper> (except for the shader invocations which hung, I guess)
<ishitatsuyuki> it does seem to make applications lose context if they opt into GL_LOSE_CONTEXT_ON_RESET
<ishitatsuyuki> but otherwise, the application receives no notification
<ishitatsuyuki> I'm still digging into the exact behavior, but this doesn't sound right
<MrCooper> hmm, yeah I guess the app whose shader hung does get that
<MrCooper> (in my case it was usually Firefox lately, and it fell back to software rendering in response)
<ishitatsuyuki> my firefox just hangs, but maybe I configured something wrong
<ishitatsuyuki> in this case, it's the wrong victim
<ishitatsuyuki> because the long running invocation is compositor-killer
<ishitatsuyuki> but that's not the biggest concern
<ishitatsuyuki> the concern is that the error might be swallowed if the application doesn't opt in to GL_LOSE_CONTEXT_ON_RESET, but I need to verify this...
Bennett has joined #dri-devel
<ishitatsuyuki> OpenGL didn't have any means of signaling context lost before KHR_robustness, so it's opt-in by design...
<ishitatsuyuki> uh ok, now I figured it out
<ishitatsuyuki> in any case, the "fix" is to make applications properly handle resets
<ishitatsuyuki> thanks for the pointers guys
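The app-side opt-in boils down to creating the context with robust access plus a LOSE_CONTEXT_ON_RESET notification strategy, then polling the reset status. A hedged C sketch (recreate_context() is a hypothetical helper; the query itself is from GL 4.5 / KHR_robustness):

```c
/* Assumes a context created with *_CONTEXT_OPENGL_ROBUST_ACCESS and a
 * reset notification strategy of LOSE_CONTEXT_ON_RESET. */
GLenum status = glGetGraphicsResetStatus();
if (status != GL_NO_ERROR) {
    /* The context is lost: no rendering on it will ever work again.
     * Tear it down, create a fresh context, and re-upload all GPU state. */
    recreate_context(); /* hypothetical helper */
}
```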
frieder has joined #dri-devel
Duke`` has joined #dri-devel
GloriousEggroll has joined #dri-devel
frieder_ has joined #dri-devel
<Venemo> ishitatsuyuki: unfortunately none of the current desktop environments or apps are designed to survive a full GPU reset
<ishitatsuyuki> true
<jekstrand> cwabbott: Yeah....
frieder has quit [Ping timeout: 480 seconds]
frieder_ has quit [Remote host closed the connection]
<jekstrand> cwabbott: If qcom has a solid concept of uniform/scalar stuff and you make heavy use of it, that helps.
<cwabbott> jekstrand: they have it, but it's kinda weirdly gimped and mostly used for subgroup stuff
<cwabbott> i think they rely on the preamble for optimizing most uniform calculations
<Venemo> cwabbott: sadly the CTS is not good at these edge cases. If you look closely I bet you that a great many subgroup tests could be constant folded and would end up not testing anything subgroup related at all
mmenzyns has joined #dri-devel
<jekstrand> cwabbott: :(
<jekstrand> Venemo: They're pretty good at making sure we can't fold.
<Venemo> Well, maybe they improved
<jekstrand> Venemo: So for subgroup ops in uniform control-flow, I think they're pretty good. In non-uniform control-flow or optimizations involving subgroup ops, however, that's basically untested.
<Venemo> Last I looked, there were lots of tests where the subgroup functionality could have been entirely optimized out
<jekstrand> I don't remember seeing that
<Venemo> The reason we didn't is because very few apps use any of these
<Venemo> Hm
<Venemo> I wish I remembered which ones these were
<jekstrand> There are some where our simple NIR opts eat them for lunch but it's not quite a full fold. I don't remember which those were.
<Venemo> Me neither, has been a long while since I last worked on this
ngcortes has joined #dri-devel
cedric has quit []
<zmike> if only there was another test suite that had cases for subgroups, perhaps one in a higher level graphics api that could be tested through emulation if only a pending MR could be landed
luzipher__ has joined #dri-devel
luzipher_ has quit [Ping timeout: 480 seconds]
zackr has joined #dri-devel
tzimmermann has quit [Quit: Leaving]
cedric has joined #dri-devel
lemonzest has quit [Quit: Quitting]
cedric is now known as bluebugs
gouchi has joined #dri-devel
khfeng has quit [Ping timeout: 480 seconds]
aravind has quit [Ping timeout: 480 seconds]
elongbug has quit [Remote host closed the connection]
gouchi has quit [Remote host closed the connection]
gouchi has joined #dri-devel
Ahuj has joined #dri-devel
<jekstrand> Plagman: It would be really awesome if we could get someone at LunarG to look at the Mesa ICD interface code and figure out how to bump to the latest version.
<jekstrand> Plagman: I suspect it's easier to figure out how to bump mesa than understand the ICD interface
<Plagman> i have approximately 0 time to try to parse/understand what you're asking but if you write me an email with exactly what you need i'll make it happen
<jekstrand> Ok
<jekstrand> will do
<Plagman> sorry, bit crazy here rn
<jekstrand> no worries
<jekstrand> sent
<Plagman> i have routed it appropriately, thanks!
Daanct12 has joined #dri-devel
<austriancoder> anholt: any idea what could be wrong here? https://gitlab.freedesktop.org/mesa/mesa/-/jobs/11347597 "invalid reference format"
Danct12 has quit [Ping timeout: 480 seconds]
Daanct12 has quit [Quit: Quitting]
Danct12 has joined #dri-devel
blue__penquin has quit [Quit: Connection closed for inactivity]
zackr has quit [Remote host closed the connection]
kem has quit [Ping timeout: 480 seconds]
kem has joined #dri-devel
kem has quit []
kem has joined #dri-devel
<anholt> austriancoder: I've got nothing, sorry.
<anholt> you've tried restarting?
<austriancoder> two times.. yes
<austriancoder> daniels: ^
<daniels> austriancoder: can you please go the 'container registry' section of your project, removing the tags, and trying again?
<austriancoder> daniels: done.. lets hope for the best
iive has quit [Ping timeout: 480 seconds]
sagar_ has quit [Ping timeout: 480 seconds]
iive has joined #dri-devel
qawsed420_- has quit []
sagar_ has joined #dri-devel
camus has joined #dri-devel
mbrost has joined #dri-devel
camus1 has quit [Ping timeout: 480 seconds]
lynxeye has quit [Quit: Leaving.]
Duke`` has quit [Ping timeout: 480 seconds]
luzipher_ has joined #dri-devel
Ahuj has quit [Ping timeout: 480 seconds]
luzipher__ has quit [Ping timeout: 480 seconds]
gouchi has quit [Remote host closed the connection]
camus1 has joined #dri-devel
camus has quit [Ping timeout: 480 seconds]
rasterman has quit [Quit: Gettin' stinky!]
bcarvalho has quit [Quit: Leaving]
i-garrison has quit []
pepp has quit [Remote host closed the connection]
pepp has joined #dri-devel
i-garrison has joined #dri-devel
mixfix41_ has joined #dri-devel
mixfix41 has quit [Ping timeout: 480 seconds]
rpigott has quit [Ping timeout: 480 seconds]
rpigott has joined #dri-devel
pcercuei has quit [Quit: dodo]
i-garrison has quit [Read error: No route to host]
i-garrison has joined #dri-devel
tobiasjakobi has joined #dri-devel
danvet has quit [Ping timeout: 480 seconds]
iive has quit []
thellstrom1 has joined #dri-devel
thellstrom has quit [Remote host closed the connection]
mlankhorst has quit [Ping timeout: 480 seconds]
luzipher__ has joined #dri-devel
camus has joined #dri-devel
luzipher_ has quit [Ping timeout: 480 seconds]
camus1 has quit [Read error: Connection reset by peer]
phomes has joined #dri-devel
ngcortes has quit [Remote host closed the connection]
pnowack has quit [Quit: pnowack]
Lucretia has quit []
Corbin has quit [Ping timeout: 480 seconds]
martin19 has quit [Ping timeout: 481 seconds]