<anholt>
before I go typing a thing: do zink's descriptor template updates update every descriptor in the set?
<zmike>
yep
<anholt>
looks like making use of that will be good for 74 -> 102 fps on gfxbench driveroverhead (submits disabled. we also suck at cmd streams and so you only get 64ish fps when processing submits)
<zmike>
suppose it'd be possible to create incremental templates for doing only partial updates, but the overhead of tracking more precise changes would probably negate the gains
<zmike>
what exactly is this case doing that's hitting perf issues in descriptor updates?
<anholt>
the issue we have is a partial update means you have to memcpy the un-updated descriptors, and that's a read of WC memory.
<zmike>
is it just doing like 1 sampler update for a set of 16?
<zmike>
i.e., only updating the first sampler in a set
<anholt>
there are only 5 descriptors in the set I think, let's see what they are.
<zmike>
be interested to know types and which ones are being updated
<anholt>
5 UBOs, apparently.
<zmike>
all updated?
<zmike>
should be possible to check this trivially with GALLIUM_TRACE btw
<anholt>
actually not sure where 5 ubos are even coming from
<anholt>
looks like it's just a loop of set_constant_buffer(VERTEX, 0), draw_vbo.
<zmike>
hm so that should be updating the push set
<zmike>
which is its own set
<zmike>
ahhh
<zmike>
okay because it'll update all the stages yeah I see
<zmike>
so I think the solution to this would be something like
<zmike>
in zink_descriptors_init create more push templates in different combinations
<zmike>
store them to an array
<anholt>
but you're always going to have the layout have descriptors for every possible stage?
<zmike>
yeah that's kinda unavoidable
<anholt>
prediction: partial update's going to be slower for us than that.
<anholt>
since partial update means WC read. better to just send us the whole mess then.
<zmike>
ah
<zmike>
well I think that's what it should be doing currently?
<zmike>
the issue is it's currently optimized to just slam in the same layout for every pipeline
<anholt>
so I just need to check that the template covers the full layout, at which point I can skip the WC read.
<zmike>
and changing that adds a lot of complexity
<zmike>
yeah probably
<zmike>
you can see in zink_descriptors_init the push template that gets created for it
<zmike>
and it'll always update all the descriptors every time
<zmike>
so it's never partial
<zmike>
zink never does partial set updates except for bindless
<zmike>
so you should never have wc read
<anholt>
ok, so if I fix this, I think we're at no-submit driveroverhead 102 fps zink+tu vs 108 fps freedreno, and yes-hw driveroverhead 64 vs 69 fps.
<zmike>
ooo
<zmike>
that's pretty competitive
<anholt>
(well, I also had to force sysmem. so two things to fix, at which point I will declare driveroverhead Fine)
<zmike>
nice 💪
<zmike>
anholt: btw there's cases for this in vkoverhead too now
<zmike>
template and no-template
<anholt>
looks like it doesn't, that uses the non-push path.
<zmike>
ohh it's unique to push descriptors?
<zmike>
ok, I should add cases for that too
<anholt>
for push, we have to copy the old set over to the new space. if you're doing non-push update, then it's in place.
<anholt>
(and the lifetimes are in your hands)
<zmike>
gotcha
<zmike>
didn't expect any other drivers would prefer the non-push path
<zmike>
there's a driver workaround already to disable it on amd
<anholt>
with my upcoming MR, I think full-update push should work out the same as non-push where you manually manage the space for your full updates.