ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
pendingchaos has quit [Ping timeout: 480 seconds]
pendingchaos has joined #panfrost
rasterman has quit [Quit: Gettin' stinky!]
pendingchaos has quit [Ping timeout: 480 seconds]
cphealy has joined #panfrost
camus has joined #panfrost
pendingchaos has joined #panfrost
camus1 has joined #panfrost
alpernebbi has quit [Ping timeout: 480 seconds]
alpernebbi has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
camus has joined #panfrost
camus1 has quit [Read error: Connection reset by peer]
Lyude has quit [Quit: WeeChat 3.4]
Lyude has joined #panfrost
JulianGro has joined #panfrost
samuelig has quit [Server closed connection]
samuelig has joined #panfrost
FLHerne has joined #panfrost
FLHerne is now known as Guest2059
minicom has joined #panfrost
wilkom has joined #panfrost
enick_689 has joined #panfrost
erlehmann has quit [Ping timeout: 480 seconds]
Major_Biscuit has joined #panfrost
erlehmann has joined #panfrost
MajorBiscuit has joined #panfrost
Major_Biscuit has quit [Ping timeout: 480 seconds]
tolszak has joined #panfrost
rasterman has joined #panfrost
Guest2059 has quit []
FLHerne has joined #panfrost
FLHerne is now known as Guest2079
Guest2079 has quit [Remote host closed the connection]
FLHerne_ has joined #panfrost
FLHerne_ is now known as FLHerne
nlhowell has joined #panfrost
rasterman has quit [Quit: Gettin' stinky!]
rasterman has joined #panfrost
JulianGro has quit [Ping timeout: 480 seconds]
nlhowell is now known as Guest2089
nlhowell has joined #panfrost
Guest2089 has quit [Ping timeout: 480 seconds]
oftcpass has joined #panfrost
tolszak has quit [Ping timeout: 480 seconds]
nlhowell has quit [Ping timeout: 480 seconds]
nlhowell has joined #panfrost
tolszak has joined #panfrost
oftcpass has quit [Ping timeout: 480 seconds]
tjcorley has quit [Ping timeout: 480 seconds]
tjcorley has joined #panfrost
<alyssa> bbrezillon: how do blits work in (pan)vk?
<alyssa> are they like draws (special "draw a quad with the contents of this image" added to a render pass?)
<alyssa> or are they self-contained batches? (1 fragment job per blit)
jekstrand has joined #panfrost
<anholt> bbrezillon: I'm trying to uprev to cts 1.3.1 and the custom caselist for panvk is totally breaking
<anholt> I think I need to switch you over to just using include filters if you really want to be doing caselist subsetting.
<alyssa> anholt: caselist subsetting is non-optional for panvk right now, too much unimplemented
<alyssa> jekstrand: welcome ^^
JulianGro has joined #panfrost
<alyssa> bbrezillon: If it's a self-contained batch, then on Bifrost+ we can always use pre-frame shaders
<jekstrand> alyssa: I even added #panfrost to my auto-join list so I should be here from now on.
<alyssa> bbrezillon: I.e. to blit from image 1 (a, b, c, d) to image 2 (x, y, z, w), we make a fragment job rendering to image 2 with bounding box (x, y, z, w) with an empty polygon list but a preframe shader sampling from image 1
<alyssa> bypassing the tiler entirely
<macc24> jekstrand: i recognize your nick from somewhere
<alyssa> macc24: Jason Ekstrand, gfx ninja
<alyssa> just joined Collabora
<alyssa> he used to work for AMD or Intel or Valve or something like that
<macc24> hmm
<alyssa> bbrezillon: If I'm not mistaken, GENX(pan_blit) is only used in the inner loop of meta_blit
<alyssa> and that loop has the structure of "open new batch, pan_blit, close batch"
<jekstrand> alyssa: Yeah, I'm sure it was one of those.
<alyssa> where closing a batch creates a fragment job
<alyssa> Meaning panvk wants the "blit as its own dedicated batch" too, the API to pan_blit just doesn't make that clear
MajorBiscuit has quit [Quit: WeeChat 3.4]
<jekstrand> bbrezillon: I think my biggest question about !14406 is, "why?" Is it really not possible to implement secondaries any better way than a full capture+replay?
<jekstrand> I guess that's sort-of explained in the header.
<bbrezillon> alyssa: oh, ok
<bbrezillon> (Re: no need for a tiler job for blits)
<bbrezillon> jekstrand: we're currently evaluating this option
<bbrezillon> Manas (AKA sin3point14) is working on a native secondary cmdbuf implementation for panvk
<jekstrand> dozen's going to need this, isn't it?
<jekstrand> I don't think D3D12 has them
<bbrezillon> but the preliminary perf results are a bit disappointing (the memcpy + address relocation logic is worse than the SW-replay approach, at least for simple draw replays)
<jekstrand> !
<jekstrand> That's... surprising.
<jekstrand> But I guess it could be
<bbrezillon> jekstrand: yeah, I wouldn't exclude a silly mistake in our code
<bbrezillon> so I wouldn't trust those preliminary results just yet
<jekstrand> k
<alyssa> bbrezillon: alright, wasn't sure if this was an oversight (understandable one, we have 5 major archs and 2000 pages of API specs to worry about..), or a deliberate design
<bbrezillon> and yes, we need it for dozen, at least until d3d12 provides us with a way to duplicate a cmdlist
<alyssa> if oversight, will see if I can excise that code path
<bbrezillon> alyssa: cool
<jekstrand> bbrezillon: Dozen makes total sense to me.
<alyssa> bbrezillon: re address relocation, there's a lot less of it on Valhall
<jekstrand> bbrezillon: For panvk, this is one of those things that makes me suddenly put my maintainer hat on and start asking questions. If there are architectural decisions that are required to support secondaries efficiently natively, it's better to make those early than a giant refactor later.
<alyssa> hardware-allocated varying buffers <3
<jekstrand> But, knowing little to nothing about Mali hardware, I can totally believe that it's intractable for some reason I'm not aware of.
<bbrezillon> alyssa: ah, that's awesome news!
<alyssa> bbrezillon: not 100% sure how to do XFB on valhall though ;-p
<bbrezillon> jekstrand: it's doable, but doesn't seem to provide the expected perf improvements
<alyssa> what's the hot path?
<bbrezillon> memcpy() is pretty hot
<bbrezillon> and the other one in BO allocation
<alyssa> don't you have both of those (effectively) for replay though?
<alyssa> what are you memcpy'ing *from*?
<bbrezillon> but the SW-implem has pretty much the same BO alloc overhead
<bbrezillon> CPU-only buffer (basically an mmap(private)) to a BO that's big enough to contain all the descs contained in the cmdbuf
<alyssa> mmap(private)? not just malloc?
<bbrezillon> the advantage of mmap() is that you can make the buffer grow without copying with mremap()
<alyssa> right..
<bbrezillon> and our test case (a modified version of dEQP-VK.api.command_buffers.record_many_draws_secondary_1 where we execute a secondary cmdbuf containing thousands of draws 50 times instead of once) generates around 18MB of descs
<bbrezillon> the relocation logic is pretty cheap (just like the SW-cmd-queue replay) compared to the allocation/write instructions
<bbrezillon> the thing with the current relocation step is that we
<bbrezillon> 1/ copy holes (which represent around 6% of the total amount of memory)
<bbrezillon> 2/ write twice to the address slots (once during the BO copy, and once when relocating)
<bbrezillon> I guess we could have similar perfs if we weren't copying the whole BO, and instead relocated/wrote in one step
<bbrezillon> but I'm doubtful we'd end up with significantly better perfs, at least not with simple draws (might be a bit different with blits, or other more complex operations though)
<jekstrand> bbrezillon: When you're evaluating perf, the perf of vkCmdExecuteCommands() is what matters. If the over-all cost is the same but weighted such that vkCmdExecuteCommands() is cheaper, that's still a win.
<jekstrand> (Sorry, I can't tell exactly what bits are being referred to when you say it's about the same.)
<bbrezillon> jekstrand: yep, I'm filtering on CmdExecuteCommands, of course
<jekstrand> Ok, cool
<jekstrand> So why is that doing any BO allocation?
<jekstrand> It should just be memcpy and reloc
<bbrezillon> it's allocating a BO in the primary cmdbuf
<jekstrand> Right, to memcpy into, I suppose
<bbrezillon> yep
<jekstrand> Is that just part of the natural BO growing or do you always have to allocate for CmdExecuteCommands()?
<bbrezillon> I even tried to pass MAP_POPULATE to pre-populate the CPU MMU and avoid faults during the copy, but it didn't help much
<bbrezillon> CPU-visible BOs are currently not growable
<bbrezillon> so we just allocate a 64kB chunk at a time (or more if we know we'll need more, which is the case here, since the src buffer is 18MB large)
<bbrezillon> jekstrand: BTW, I don't know if you noticed, but I have a slightly reworked implem of the secondary-cmdbuf here https://gitlab.freedesktop.org/bbrezillon/mesa/-/commits/panvk-exec-cmd-alt
tomeu829 has joined #panfrost
<bbrezillon> anholt: can we filter out tests with `dEQP-VK.<tests>.*` rules? if yes, then we can probably go for a skip list
<anholt> bbrezillon: skips and includes are both regexes.
CounterPillow_ has quit [charon.oftc.net helix.oftc.net]
megi1 has quit [charon.oftc.net helix.oftc.net]
mriesch has quit [charon.oftc.net helix.oftc.net]
br has quit [charon.oftc.net helix.oftc.net]
vstehle has quit [charon.oftc.net helix.oftc.net]
Stary has quit [charon.oftc.net helix.oftc.net]
jernej has quit [charon.oftc.net helix.oftc.net]
<bbrezillon> anholt: ok, I'll take a look tomorrow then, unless you want to give it a try
<anholt> bbrezillon: if you have any hints of how you generated your list, that would help, since I'm putting the MR together currently
<bbrezillon> we just test dEQP-VK.api.copy_and_blit.* and dEQP-VK.pipeline.blend.*
<bbrezillon> so anything else should be skipped
CounterPillow_ has joined #panfrost
megi1 has joined #panfrost
br has joined #panfrost
mriesch has joined #panfrost
vstehle has joined #panfrost
Stary has joined #panfrost
sigmaris has joined #panfrost
jernej has joined #panfrost
stepri01 has joined #panfrost
stebler[m] has joined #panfrost
mmind00 has joined #panfrost
<anholt> bbrezillon: great!
<alyssa> bbrezillon: Is implementing nir_load_push_constant todo..?
<bbrezillon> alyssa: IIRC, I had something in my WIP branch
<alyssa> ack
<bbrezillon> right now it's just a dedicated UBO
<alyssa> might snarf it, having meta shaders do load_push_constant is a lot easier to verify than "load_ubo and pray it gets pushed"
<alyssa> er wait, that doesn't implement push constants, that lowers them
<bbrezillon> yeah, I considered implementing it as real push constants
<bbrezillon> but then I realized it might take slots that could be used by sysvals
<alyssa> right..
<bbrezillon> so I decided to leave them as UBOs and let the UBO -> push_constant opt pass do its job
<bbrezillon> (this being said, this opt pass is disabled in panvk, 'cause I wanted to keep things simple at first :p)
<alyssa> heh :)
<alyssa> once we support pilot shaders the push code will be reworked I imagine
<alyssa> freedreno/vulkan/tu_shader.c
<alyssa> um
<alyssa> (In the mean time I'd rather not break the downstream push optimizations)
<bbrezillon> ah, nice
<bbrezillon> jekstrand: FYI, here are the preliminary perf results we got https://gitlab.freedesktop.org/-/snippets/4417
<bbrezillon> panvk_v7_native_CmdExecuteCommands() is the native implementation, but for some reason, the memcpy that's inside this function doesn't appear under it in perf report
<bbrezillon> panvk_v7_sw_CmdExecuteCommands() is the SW-queue implementation
<bbrezillon> they both execute the same secondary cmdbuf (that's recorded in both forms in the vkCmdXxx() functions)
<jekstrand> bbrezillon: Well, that is disappointing...
<bbrezillon> couldn't agree more
<bbrezillon> as I said, I'm pretty confident we can reach the same level of perfs with some optimizations, but I'm not sure things will be significantly better with the native implem
<anholt> is native doing some WC reads or something?
<bbrezillon> nope, we're using a CPU-only buffer (allocated with mmap(MAP_PRIVATE))
<bbrezillon> as the source
<bbrezillon> the destination is a WC buffer, but we don't read from it (at least, we shouldn't)
<bbrezillon> I didn't figure out why we have this page fault in the memcpy path though
<jekstrand> anholt: I was going to ask that too but I figured I'd already asked enough obvoius questions. 😂
<anholt> ha
<bbrezillon> I did an mlock() to make things resident on the CPU-buffer (just as a test)
<bbrezillon> and I keep having those faults
<bbrezillon> also added MAP_POPULATE to the BO mmap() to pre-populate the CPU pagetable
<anholt> oh, interesting. I was wondering if this was something like your cmd pool didn't have bos in it.
<bbrezillon> but if I trust perf, that only accounts for 1% of the 2.5% diff we have between the 2 implems
<bbrezillon> anholt: yep, the cmd pool is definitely empty (no directly available bufs) when we start executing the secondary cmd buf
<bbrezillon> but the same goes for the SW implementation
rasterman has quit [Quit: Gettin' stinky!]
evx256 has joined #panfrost
nlhowell has quit [Ping timeout: 480 seconds]
<jekstrand> bbrezillon: Is it just me or has no one implemented DestroyCommandPool for panvk?
<jekstrand> Ok, didn't expect it to be in a different file....
<bbrezillon> yeah, having panvk_cmd_buffer.c doesn't make much sense
<bbrezillon> we should probably merge them
<jekstrand> Yeah, ANV's split is awkward too
<bbrezillon> even if that means duplicating some code
<bbrezillon> I mean, duplicating code in the .o
<jekstrand> yeah
<bbrezillon> all the files prefixed panvk_vX are per-arch
<jekstrand> Yup. Figured that one out already. :)
<alyssa> jekstrand: not going to complain about panvk_per_arch? :-p
<jekstrand> alyssa: Yeah, it's a bit long.
<bbrezillon> alyssa: VKGENX()? can't use GENX() because it's suffixing stuff with vX, and we need a prefix for entrypoints generation
q4a has joined #panfrost
<jekstrand> vX()?
* jekstrand isn't going to be opinionated here. -ENOTMYPROJECT
<bbrezillon> wfm. s/panvk_per_arch(/vX(/ should be pretty easy to review compared to a new feature addition ;-)
stebler[m] has quit [Server closed connection]
stebler[m] has joined #panfrost
q4a has left #panfrost [#panfrost]
tolszak has quit [Ping timeout: 480 seconds]