ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
Daaanct12 has quit [Quit: Quitting]
Danct12 has joined #dri-devel
heat_ has quit [Ping timeout: 480 seconds]
alanc has quit [Remote host closed the connection]
alanc has joined #dri-devel
yuq825 has joined #dri-devel
Haaninjo has quit [Quit: Ex-Chat]
<ishitatsuyuki>
Lynne, I probably need to understand what kind of structure your algorithm is. Can you write it out in pseudocode? e.g. Step 1: Do prefix sum for x[k][i]..x[k][j] for all k
<Lynne>
step 1 is load all values needed from the source images and compute the s1 - s2 diff
<Lynne>
store that per-pixel vector in the integral image
<Lynne>
then compute the horizontal prefix sum, followed by the vertical prefix sum
<Lynne>
which gives you the integral image
<Lynne>
finally, get the a, b, c, d (a rectangle) vector values from the integral for each individual block at a given offset, compute the weight, and add it in the weights array
<Lynne>
opencl does it by having a separate pass for each step, and when I tried to do that, the performance was 20 _times_ slower on vulkan than opencl, because of all the pipeline barriers
<Lynne>
even though I was using a prefix sum algorithm which is around 100 times faster than the naive sum that opencl uses
<Lynne>
so I decided to merge the horizontal, vertical and weights passes into a single shader, to eliminate pretty much all of the pipeline barriers
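[editor's note: the pipeline described above — per-pixel diff, horizontal prefix sum, vertical prefix sum, then four-corner rectangle fetches from the integral image — can be sketched on the CPU as plain Python. This is an illustration with hypothetical helper names, not the actual Vulkan shader code.]

```python
def integral_image(diff):
    """diff: 2D list of per-pixel (s1 - s2) differences.
    Returns the integral image via a horizontal then vertical prefix sum."""
    h, w = len(diff), len(diff[0])
    ii = [row[:] for row in diff]
    # horizontal prefix sum
    for y in range(h):
        for x in range(1, w):
            ii[y][x] += ii[y][x - 1]
    # vertical prefix sum -> integral image
    for y in range(1, h):
        for x in range(w):
            ii[y][x] += ii[y - 1][x]
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum over the inclusive rectangle (x0,y0)..(x1,y1) using the four
    corner fetches (the a, b, c, d values mentioned above)."""
    a = ii[y1][x1]
    b = ii[y0 - 1][x1] if y0 > 0 else 0
    c = ii[y1][x0 - 1] if x0 > 0 else 0
    d = ii[y0 - 1][x0 - 1] if (x0 > 0 and y0 > 0) else 0
    return a - b - c + d
```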
Leopold__ has joined #dri-devel
<Lynne>
merging the horizontal and vertical passes works fine, but merging the weights pass causes any integral image loads after height amount of _horizontal_ pixels to give old, pre-vertical pass data
<ishitatsuyuki>
this does sound like a use case requiring a pipeline barrier, unfortunately
<ishitatsuyuki>
because you need to wait for the entire prefix sum to finish before proceeding to the next stage
<ishitatsuyuki>
there is https://w3.impa.br/~diego/projects/NehEtAl11/, which gives you reduced bandwidth for integral image by keeping the horizontal-only prefix sum in shared memory
<ishitatsuyuki>
but it's significantly more complicated to implement
<ishitatsuyuki>
another approach is to do decoupled lookback but with 2D indices, where you calculate in the order of block (0,0), (1,0),(0,1),(2,0),(1,1),(0,2),... which will give you the block on top and left on each step
<ishitatsuyuki>
at the end of the day you will still need a barrier between generating the integral image and the weight computation
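[editor's note: the 2D decoupled-lookback ordering described above — visiting blocks along anti-diagonals so the block above and the block to the left always finish first — can be sketched as follows. Hypothetical helper name, illustration only.]

```python
def antidiagonal_order(nx, ny):
    """Yield block coordinates in the order (0,0),(1,0),(0,1),(2,0),(1,1),
    (0,2),... so every block is emitted after its top and left neighbours."""
    order = []
    for d in range(nx + ny - 1):          # walk anti-diagonals x + y == d
        for y in range(d + 1):
            x = d - y
            if x < nx and y < ny:
                order.append((x, y))
    return order
```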
Guest10794 has quit [Ping timeout: 480 seconds]
<Lynne>
why there and not between the horizontal and vertical prefix sums too?
<ishitatsuyuki>
for your current approach you need a barrier there too, but I described alternative algorithms that don't require synchronization or even a memory store between the horizontal/vertical prefix sums
<Lynne>
why need a pipeline barrier at all?
<ishitatsuyuki>
because computing vertical prefix sum requires the entire horizontal prefix sum to finish?
<Lynne>
why is barrier() not enough?
<ishitatsuyuki>
barrier only synchronizes within the workgroup
<Lynne>
it's just copypasted from other code, I'll remove it to make it clearer
<ishitatsuyuki>
a single workgroup does sound like why it's slow though
<ishitatsuyuki>
a GPU has many WGPs (in amd terms) and a workgroup runs within a single WGP
ngcortes_ has joined #dri-devel
<Lynne>
it's a max-sized workgroup, 1024, with each invocation handling 4 pixels at a time for a 4k image
<ishitatsuyuki>
for a single workgroup, it should work if you put controlBarrier(wg,wg,buf,acqrel | makeavail | makevisible) between each pass
<Lynne>
nope, doesn't, even if I splat it before each prefix_sum call
<Lynne>
on neither nvidia nor radv
ngcortes has quit [Ping timeout: 480 seconds]
<Lynne>
btw, each dispatch handles 4 displacements (xoffs/yoffs) at a time, and for a research radius of 3, there are 36 dispatches (((2*r)*(2*r) - 1)/4) that have to be done
<Lynne>
so we still do multiple wgps, it's just that each dispatch handles one
<ishitatsuyuki>
ok, utilization should be fine with that
<Lynne>
the default research radius is 15 btw so that's 900ish dispatches, hence you can see why barriers kill performance -_-
<Lynne>
needing to do 3 passes would result in 3x the number of dispatches and barriers
<ishitatsuyuki>
use the meme split barrier feature that no one uses? :P (vkCmdSetEvent)
<Lynne>
that would still need a memory barrier, though, wouldn't it?
<ishitatsuyuki>
consider it an async barrier
<Lynne>
I'll leave it as a challenge for someone to do better than my current approach
<ishitatsuyuki>
yeah fair
jewins has quit [Ping timeout: 480 seconds]
<ishitatsuyuki>
back to debugging, did you try putting the barrier right after prefix sum as well?
<Lynne>
yup, splatted it everywhere, the integral image buffer has coherent flags too
<ishitatsuyuki>
i'm afraid i'm out of ideas again
<Lynne>
oh, hey, maybe I could test with llvmpipe
<Lynne>
well, that's weird, the part which is broken on a GPU is also broken on lavapipe
<Lynne>
but the part which is fine on a gpu is pure black on lavapipe
<Lynne>
limiting lavapipe to a single thread doesn't help either
<Lynne>
do none of the thousands of synchronization options actually do anything in vulkan?
<Lynne>
I know a lot of them are there just to satisfy some alien hardware's synchronization requirements, but still
<HdkR>
Gitlab having some performance issues right now?
<HdkR>
Managing to clone at a blazing 38KB/s
Daanct12 has joined #dri-devel
aravind has joined #dri-devel
Daaanct12 has joined #dri-devel
bmodem has joined #dri-devel
Danct12 has quit [Ping timeout: 480 seconds]
Daanct12 has quit [Ping timeout: 480 seconds]
<Lynne>
anyone with any ideas or willing to run my code?
<Lynne>
it's the last roadblock to merging the entire video patchset in ffmpeg
<zmike>
what "synchronization options" are you referring to
zf_ has joined #dri-devel
zf has quit [Read error: Connection reset by peer]
<ishitatsuyuki>
you first identify the slow pass (in your case, there should be only a single compute pass), then go to instruction timing
<ishitatsuyuki>
the numbers should give you a rough idea of "cost" of instructions
JohnnyonFlame has joined #dri-devel
devilhorns has quit []
FireBurn has quit [Quit: Konversation terminated!]
Daaanct12 has quit [Remote host closed the connection]
Daaanct12 has joined #dri-devel
<tleydxdy>
when I look at all the fds opened by a vulkan game and their corresponding drm_file, some of them have the correct pid but they don't seem to be doing any work, and some have the pid of Xwayland and are doing all the work. does anyone know why that is?
<tleydxdy>
I assume it is because those fds are sent over by the X server, but why is it using those to do all the rendering work?
Daaanct12 has quit [Remote host closed the connection]
<alyssa>
Strictly I think maybe I could get away with xfb_size if I make my dispatch more complicated...
yuq8251 has left #dri-devel [#dri-devel]
<tleydxdy>
emersion I see, so if the underlying wsi for vulkan is not X11 DRI3 then the pattern should not be seen?
<emersion>
on Wayland, all DRM FDs should be opened by the client *
<emersion>
( * except DRM leasing stuff)
<emersion>
(but that's only for VR apps, and the DRM FD send by the compositor is read-only before the DRM lease is accepted by the compositor)
<tleydxdy>
got it
<jenatali>
alyssa: Ack, but without seeing how it's used it's hard for me to really get why it's needed
<jenatali>
XFB is one of those things that's magic from my POV, I don't know any of the implementation details
<danvet>
emersion, yeah but I thought for vk you still open the render node
<danvet>
since winsys is only set up later
<emersion>
what do you mean?
<danvet>
emersion, like you can set up your entire render pipeline and allocate all the buffers without winsys connection
<emersion>
maybe Mesa will open render nodes, but these won't be coming from the compositor
<danvet>
and so no way to get the DRI3 fd
<danvet>
and only when you set up winsys will that part happen
<danvet>
so I kinda expected that the render nodes will have most buffers and the winsys one opened by xwayland just winsys buffers
<emersion>
sure. the question was about FDs coming from Xwayland though
<emersion>
ah
<danvet>
but it seems to be the other way round
<danvet>
per tleydxdy at least
<emersion>
maybe the swapchain buffers are allocated via Xwayland's FD
<danvet>
for glx/egl it'll all be on the xwayland fd
<emersion>
the swapchain is tied to WSI
<danvet>
300mb swapchain seems a bit much :-)
<tleydxdy>
yeah, also all the gfx engine time is logged on the xwayland fd
<tleydxdy>
so it's also doing cs_ioctl on that
<emersion>
that's weird
sgruszka has quit [Remote host closed the connection]
<tleydxdy>
the game is unity, fwiw, and as far as I can tell it's not doing anything special
kts has quit [Quit: Leaving]
cheako has joined #dri-devel
aravind has quit [Ping timeout: 480 seconds]
<gfxstrand>
danvet: That's how things work initially, yes. I think some drivers are trying to get a master and use that if they can these days.
<gfxstrand>
They shouldn't be getting it from the winsys, though. That'd be weird.
<danvet>
gfxstrand, well the master you only need for display, and for that you pretty much have to lease it
<danvet>
unless bare metal winsys
<danvet>
no one other than the current compositor should be master
<alyssa>
jenatali: purely software xfb implementation, see linked MR :~)
<jenatali>
Ah I missed that link
<alyssa>
~~just you wait for geometry shaders~~
<jenatali>
Got it, so you run an additional VS with rast discard and just augment the VS to write out the xfb data?
<alyssa>
Yep
sebastien has joined #dri-devel
<alyssa>
I mean, that's conceptually how it works for GLES3.0 level transform feedback
<alyssa>
and that's what panfrost does
<alyssa>
all the real fun comes in when you want the full big GL thing
sebastien is now known as Guest10882
<jenatali>
Hm. Can't you mix VS+XFB+rast?
<alyssa>
indexed draws, triangle strips, all that fun stuff
<jenatali>
I haven't looked at the ES limitations for XFB so I dunno
<alyssa>
GLES3.0 is just glDrawArrays() with POINTS/LINES/STRIPS
<alyssa>
1 vertex in, 1 vertex our
<alyssa>
*out
<alyssa>
which is all panfrost does (and hence panfrost fails a bunch of piglits for xfb)
<MrCooper>
emersion tleydxdy: there are pending kernel patches which will correctly attribute DRM FDs passed from an X server to a DRI3 client to the latter
<alyssa>
s/STRIPS/TRIANGLES/
Guest10882 has quit []
<alyssa>
for full GL there's all sorts of batshit interactions allowed, e.g. indirect indexed draw + primitive restart + TRIANGLE_STRIPS + XFB
<alyssa>
how is that supposed to work? don't even worry about it ;-)
<emersion>
MrCooper: my preference would've been to fix the X11 protocol, instead of fixing the kernel…
<alyssa>
spec has a really inane requirement that you can draw strips/loops/fans but they need to be streamed out like plain lines/triangles
siddh has quit [Quit: Reconnecting]
siddh has joined #dri-devel
<alyssa>
(e.g. drawing 4 vertices with TRIANGLE_STRIPS would emit 6 vertices for streamout, duplicating the shared edge)
<alyssa>
in that case the linked MR does the stupid simple approach of invoking the transform feedback shader 6 times (instead of 4) and doing some arithmetic to work out which vertex should be processed in a given invocation
<alyssa>
this is suboptimal but hopefully nothing ever hits this other than piglit
<alyssa>
(..hopefully)
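[editor's note: the strip-to-list expansion described above (4 strip vertices streamed out as 6 list vertices, duplicating the shared edge) amounts to mapping each streamout invocation back to a source strip vertex. A minimal Python sketch, assuming standard GL triangle-strip winding where odd-numbered triangles swap their first two vertices; hypothetical helper name.]

```python
def strip_source_index(i):
    """For streamout of a triangle strip decomposed into plain triangles:
    output vertex i (of 3*(n-2) total) comes from which strip vertex?"""
    tri, corner = divmod(i, 3)
    if tri % 2 == 0:
        # even triangles keep winding: (t, t+1, t+2)
        return tri + corner
    # odd triangles swap the first two vertices to preserve orientation:
    # (t+1, t, t+2)
    return tri + (1, 0, 2)[corner]
```

So a 4-vertex strip expands to the 6 streamout vertices 0,1,2, 2,1,3, with vertices 1 and 2 duplicated, matching the example in the log.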
siddh has quit []
siddh has joined #dri-devel
jewins has joined #dri-devel
fxkamd has quit []
Duke`` has joined #dri-devel
<jenatali>
alyssa: Right that all makes sense. But can you not mix XFB+rast?
<jenatali>
Or alternatively, does GLES3 not allow SSBOs/atomics in VS?
<alyssa>
VS side effects are optional in all the khronos apis
<alyssa>
for mali, works on older hw but not newer ones due to arm's questionable interpretations of the spec
<alyssa>
for agx, IDK, haven't tried, Metal doesn't allow it and I don't know what's happening internally
<alyssa>
(VS side effects are unpredictable on tilers in general, the spec language is very 'forgiving' here)
<alyssa>
wouldn't help a ton in every case, consider e.g. GL_TRIANGLE_FANS with 1000 triangles drawn
<alyssa>
vertex 0 needs to be written out 1000 times
<alyssa>
all other vertices are written out just once
<alyssa>
re side effects being unpredictable, I *think* this means the decoupled approach is kosher even if we allow vertex shader side effects
<alyssa>
but I'd need to spec lawyer to find out
<karolherbst>
alyssa: are SSBO writes a thing in vertex shaders?
<alyssa>
15:23 < alyssa> VS side effects are optional in all the khronos apis
<karolherbst>
right.. was more like about is it a thing in your driver/hardware
<alyssa>
15:23 < alyssa> for mali, works on older hw but not newer ones due to arm's questionable interpretations of the spec
<alyssa>
15:24 < alyssa> for agx, IDK, haven't tried, Metal doesn't allow it and I don't know what's happening internally
<alyssa>
15:24 < alyssa> (VS side effects are unpredictable on tilers in general, the spec language is very 'forgiving' here)
alyssa has left #dri-devel [#dri-devel]
<daniels>
'newer ones' being anything with IDVS?
<jenatali>
I was very surprised when I learned that Vulkan not only allows side effects, but also wave ops and even quad ops in VS. Like wtf is the meaning of a quad of vertex invocations?
<jenatali>
FWIW D3D does wave ops, but not quads
<gfxstrand>
jenatali: Well, when you render with GL_QUADS...
* gfxstrand
shows herself out
<jenatali>
Which Vulkan doesn't have, right?
<jenatali>
... right?
<gfxstrand>
There was a quads extension but we killed it. :)
<gfxstrand>
Also, quad lane groups in a VS have nothing whatsoever to do with GL_QUAD. I was just making dumb jokes.
<gfxstrand>
They're literally just groups of 4 lanes which you can do stuff on.
<jenatali>
Yeah I know, I'm just also confirming :)
<gfxstrand>
They do make sense with certain CS patterns you can do with the NV derivatives extension, though.
<jenatali>
I guess I could lower quad ops to plain wave ops and support them in VS
<jenatali>
Yeah D3D defines quad ops + derivatives in CS
dsrt^ has joined #dri-devel
<gfxstrand>
Yeah, that's really all they are
<gfxstrand>
In fact, I think we have NIR lowering for it
<gfxstrand>
Yup. lower_quad
<mareko>
quad ops work in VS if num_patches == 4 or 2 and TCS is present
<mareko>
I mean num_input_cp
<jenatali>
Oh cool, I should just run that on non-CS/FS and then I can support quad ops everywhere
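[editor's note: lowering quad ops to plain wave ops, as discussed above, works because a quad is just the wave's lanes grouped by (lane & ~3). A rough Python model of the lane-index arithmetic such a lowering computes — an illustration of the idea, not the actual NIR lower_quad code.]

```python
def quad_swap_horizontal(lane):
    """Partner lane for a horizontal quad swap: flip bit 0."""
    return lane ^ 1

def quad_swap_vertical(lane):
    """Partner lane for a vertical quad swap: flip bit 1."""
    return lane ^ 2

def quad_swap_diagonal(lane):
    """Partner lane for a diagonal quad swap: flip both bits."""
    return lane ^ 3

def quad_broadcast_lane(lane, src):
    """Wave-level shuffle index that reads quad-local lane `src` (0..3)
    from this lane's own quad."""
    return (lane & ~3) | src
```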
djbw has joined #dri-devel
<macromorgan>
so question... I'm trying to troubleshoot a problem with unbalanced regulator disables that happens only on suspend and shutdown.
MajorBiscuit has joined #dri-devel
<macromorgan>
As best I can tell, when I try to shut down a panel, mipi_dsi_drv_shutdown gets called, which runs the panel_nv3051d_shutdown function, which calls drm_panel_unprepare, which calls panel_nv3051d_unprepare. Then I also see panel_bridge_post_disable calling panel_nv3051d_unprepare
<macromorgan>
should there not be a shutdown function for the panel?
MajorBiscuit has quit []
MajorBiscuit has joined #dri-devel
<macromorgan>
a lot of panels have a "prepared" or "enabled" flag, but when I was upstreaming the driver I was told not to do that
<jenatali>
Has the branchpoint happened?
Leopold has joined #dri-devel
rasterman has joined #dri-devel
MajorBiscuit has quit [Ping timeout: 480 seconds]
<jenatali>
Oh yep, there it is. Would be nice to have a dedicated label for the post-branch MR that bumps the version for the next release. eric_engestrom
<jenatali>
I'd subscribe to that
Leopold__ has quit [Ping timeout: 480 seconds]
swalker__ has quit [Remote host closed the connection]
heat has joined #dri-devel
heat has quit [Remote host closed the connection]
heat has joined #dri-devel
JohnnyonFlame has quit [Ping timeout: 480 seconds]
bmodem has quit [Ping timeout: 480 seconds]
iive has joined #dri-devel
tursulin has quit [Ping timeout: 480 seconds]
stuarts has joined #dri-devel
vliaskov__ has quit [Ping timeout: 480 seconds]
kts has joined #dri-devel
kts has quit [Remote host closed the connection]
kts has joined #dri-devel
frieder has quit [Ping timeout: 480 seconds]
ngcortes has joined #dri-devel
jkrzyszt has quit [Ping timeout: 480 seconds]
MajorBiscuit has joined #dri-devel
lynxeye has quit [Quit: Leaving.]
jeeeun841351 has joined #dri-devel
jeeeun84135 has quit [Ping timeout: 480 seconds]
<eric_engestrom>
jenatali: I've created the label ~mesa-release and I'll write up some doc in a bit, and hopefully we (dcbaker and I) won't forget to use it too often :]
<dcbaker>
eric_engestrom: thanks for doing that
<jenatali>
Thanks!
vliaskov__ has joined #dri-devel
MajorBiscuit has quit [Quit: WeeChat 3.6]
stuarts has quit [Remote host closed the connection]
<dcbaker>
karolherbst: I left you a few comments, it's really annoying that they treat the command line as always up for change
<karolherbst>
yeah...
<karolherbst>
get_version seems to work alright, nice
<dcbaker>
sweet
<karolherbst>
it's a bit annoying that rust.bindgen is already taken otherwise it could be some higher level struct and rust.bindgen => rust.bindgen.generate and we could just add rust.bindgen.version_compare().... but oh well...
<dcbaker>
Yeah. A long time ago I'd written a find_program() cache, which would have made this a bit simpler: you could have done something like find_program('bindgen').get_version().version_compare(...), and since all calls to find_program would use the same cache, the lookup would be effectively free and we could just recommend that
<dcbaker>
unfortunately I never got it working quite right
<karolherbst>
yeah.. but that's also kinda annoying
<karolherbst>
I'd kinda prefer wrapping those things so it's always in control of meson
mbrost has joined #dri-devel
Haaninjo has joined #dri-devel
Leopold has quit []
<karolherbst>
dcbaker: anyway... would be cool to get my rust stuff resolved for 1.2 so I only have to bump the version once :D
<karolherbst>
do I need to add any kwarg stuff?
Leopold_ has joined #dri-devel
<dj-death>
oh noes, gitlab 504
JohnnyonFlame has quit [Ping timeout: 480 seconds]
stuarts has joined #dri-devel
<dcbaker>
karolherbst: Yeah, I left you a comment, otherwise I think that looks good
gouchi has joined #dri-devel
<anholt>
starting on the 1.3.5.1 CTS update (with a couple extra bugfixes pulled in)
rasterman has quit [Quit: Gettin' stinky!]
<karolherbst>
now I need somebody else or me to figure out that isystem stuff ...
ngcortes has quit [Ping timeout: 480 seconds]
prahladk has joined #dri-devel
kts has quit [Quit: Leaving]
alyssa has joined #dri-devel
<alyssa>
daniels: yeah, Arm's implementation of IDVS is "creative"
<alyssa>
gfxstrand: lol at VK_QUADS ops
mbrost has quit [Remote host closed the connection]
prahladk has quit []
FloGrauper[m] has quit []
madhavpcm has quit []
MatrixTravelerbot[m]1 has quit []
tuxayo has quit []
LaughingMan[m] has quit []
cleverca22[m] has quit []
tintou has quit []
bluepqnuin has quit []
Celmor[m] has quit []
Harvey[m] has quit []
Guest10825 has quit []
arisu has quit []
zzxyb[m] has quit []
chema has quit []
hch12907 has quit []
cmeissl[m] has quit []
neobrain[m] has quit []
Andy[m]1 has quit []
kallisti5[m] has quit []
YuGiOhJCJ has joined #dri-devel
Ella[m] has quit []
bylaws has quit []
ngcortes has joined #dri-devel
danvet has quit [Ping timeout: 480 seconds]
Duke`` has quit [Ping timeout: 480 seconds]
ngcortes has quit [Remote host closed the connection]
ngcortes has joined #dri-devel
<karolherbst>
quads? reasonable primitives
<alyssa>
* Catmull-Clark has entered the chat
<robclark>
alyssa: idvs sounds _kinda_ like qcom's VS vs binning shader (except that adreno VS also calcs position/psize)
<alyssa>
robclark: same idea, yeah
<alyssa>
the problem isn't the concept, it's an implementation detail :~)
<robclark>
hmm, ok
gouchi has quit [Quit: Quitte]
Leopold_ has quit []
Leopold_ has joined #dri-devel
JohnnyonFlame has joined #dri-devel
bluetail42 has joined #dri-devel
bluetail4 has quit [Ping timeout: 480 seconds]
kts has joined #dri-devel
fxkamd has joined #dri-devel
<Kayden>
there's some mention of nir_register in src/freedreno/ir3/* still, is this dead code now that the backend is SSA?
<anholt>
Kayden: indirect temps are still registers
<Kayden>
oh, interesting, ok
<gfxstrand>
We should convert ir3 to load/store_scratch
<gfxstrand>
Unless you really can indirect on the register file and really want to be doing that.
<karolherbst>
anholt: or scratch mem if lowered to it
<Kayden>
was just a little surprised to see it there still
<Kayden>
wasn't sure if it was leftover or still used :)
<karolherbst>
I don't know if I or somebody else ported codegen to scratch, but I think it was done...
<karolherbst>
ahh nope
<gfxstrand>
Kayden: I mean, the Intel vec4 back-end still uses it last I checked... 😭
<karolherbst>
or was it...
<karolherbst>
mhhh
<Kayden>
gfxstrand: not surprised to see it in general, just in ir3 :)
<karolherbst>
what was the pass again to lower to scratch?
<gfxstrand>
Kayden: Ues, but we should kill NIR register indirects in general.
<Kayden>
ah.
<Kayden>
yeah, probably
<gfxstrand>
I suppose I do have a haswell sitting in the corner over there.
<anholt>
Kayden: register indirects turn into register array accesses. large temps get turned into scratch.
<karolherbst>
I'm sure it's almost trivial to remove `nir_register` in codegen as it already supports scratch memory
<gfxstrand>
NAK won't support it
<karolherbst>
yeah, no point in supporting it on nv hw
<gfxstrand>
There's no point on Intel, either. They go to scratch in the vec4 back-end it's just that the back-end code to do that has been around for a long time and no one has bothered to clean it up.
<gfxstrand>
Technically, Intel can do indirect reads
<gfxstrand>
And indirect stores if the indirect is uniform and the stars align.
<alyssa>
I don't have a great plan for ir3 nir_register use.
pcercuei has quit [Quit: dodo]
elongbug has quit [Read error: Connection reset by peer]
iive has quit [Quit: They came for me...]
<karolherbst>
anybody here ever played around with onednn? I kinda want to know what I need to actually use it
<karolherbst>
bonus point: make it non painful
bluebugs has joined #dri-devel
<gfxstrand>
That sounds like an Intel invention
<karolherbst>
it is
<gfxstrand>
Of course...
<gfxstrand>
They have to have One of everything...
<karolherbst>
but apparently it has a CL backend
<karolherbst>
and it can run pytorch
<karolherbst>
and stuff
<karolherbst>
dunno
<karolherbst>
just want to see what CL extensions it needs
<karolherbst>
but I think it's INTEL_subgroup and INTEL_UVM
jewins1 has joined #dri-devel
pzanoni` has joined #dri-devel
jssummer has joined #dri-devel
mattrope_ has joined #dri-devel
djbw_ has joined #dri-devel
jhli_ has quit []
jhli has joined #dri-devel
jewins has quit [Ping timeout: 480 seconds]
pzanoni has quit [Ping timeout: 480 seconds]
stuarts has quit [Ping timeout: 480 seconds]
djbw has quit [Ping timeout: 480 seconds]
mattrope has quit [Ping timeout: 480 seconds]
jssummers has joined #dri-devel
pzanoni has joined #dri-devel
mattrope has joined #dri-devel
djbw__ has joined #dri-devel
<karolherbst>
ehhh..
<karolherbst>
why is oneDNN checking if the platform name is "Intel" 🙃
jssummer has quit [Ping timeout: 480 seconds]
pzanoni` has quit [Ping timeout: 480 seconds]
mattrope_ has quit [Ping timeout: 480 seconds]
jewins1 has quit [Ping timeout: 480 seconds]
djbw_ has quit [Ping timeout: 480 seconds]
<karolherbst>
they even have a vendor id check 🙃
<psykose>
because it was made by intel
<karolherbst>
you know that this won't stop me!
<karolherbst>
(but rusticl not having all the required features will! 🙃)
<gfxstrand>
What features are you missing? Intel subgroups you should be able to pretty much just turn on