ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
<daniels> [tapping-temple-meme] can't watch CTS scroll by if you don't have a CTS
<icecream95> Hmm.. it seems that we may not be updating blend constants properly?
<icecream95> How does Panfrost handle blend constants at the moment?
<jekstrand> They're sysvals that get uploaded as part of the sysvals UBO.
<icecream95> Really? vmul.FMUL.f16 TMP0.xyz, R0.xyz, <0.0470581, 0.282471, 0.933105>
<icecream95> Maybe for panvk?
<jekstrand> Or maybe they're immediates?
<jekstrand> Yeah, I think alyssa mentioned that recently...
<jekstrand> For blend shaders, I think they're immediates.
<jekstrand> They're sysvals when you do blending in the FS
<icecream95> I note that constants are not included in pan_blend_shader_key!
<jekstrand> uh oh...
<jekstrand> Are they getting patched in after the fact?
<icecream95> I think that used to happen, but not any more (because Bifrost is more complex)?
* jekstrand is way outside of his area of expertise at this point. Sorry.
<icecream95> Oh wait, constants *are* seemingly handled later: memcmp(iter->constants, state->constants
<icecream95> Let me guess: key.has_constants is wrong?
<icecream95> Ohhh... the code for reusing variants after hitting PAN_BLEND_SHADER_MAX_VARIANTS has an obvious bug in it
<icecream95> Or maybe multiple obvious bugs
<icecream95> And now "pan/blit: Prepare for Valhall port" is causing segfaults
<alyssa> *hides*
<icecream95> Wait, did I already fix that bug somewhere? It seems familiar
<icecream95> Looks like the code related to patched_s is pretty broken
<alyssa> *hides*
<alyssa> icecream95: Hopefully your fix is just hitting the delete key? :P
<icecream95> alyssa: I don't know how to avoid returning a pointer to a stack value with just delete
<alyssa> Oh oof
<alyssa> I thought I fixed tat
<alyssa> that
<alyssa> In my defence I'm pretty sure all of Panfrost's stencil handling is and has always been busted
<alyssa> and we just keep adding band-aid fixes...
<icecream95> I really want to have tracking for whether stencil has ever been written to, as otherwise it just keeps getting preloaded...
<alyssa> seems reasonable
<alyssa> alyssa@rootfs:~/build-es/external/openglcts/modules$ ls *.qpa | wc -l
<alyssa> 12
<alyssa> this needs to get to 56, wow!
* alyssa upside-down face
<icecream95> alyssa: Reminds me of my hack for parsing .qpa files to get the images with sed and grep
<alyssa> I know better than to ask questions
<icecream95> Hmm actually the blend shader variant code is less broken than I thought... only missing a memcpy
atler is now known as Guest128
atler has joined #panfrost
Guest128 has quit [Ping timeout: 480 seconds]
<alyssa> lessbroken(tm)
<alyssa> alyssa@rootfs:~/build-es/external/openglcts/modules$ ls *.qpa | wc -l
<alyssa> 30
<icecream95> alyssa: Case in point (stencil handling being broken): I think panfrost_invalidate_resource is broken for invalidating depth when separate stencil is used
<alyssa> kinda mesmerizing, i guess
<alyssa> icecream95: i'd buy it
<alyssa> it being broken, I mean
<alyssa> i would not pay for separate stencil
<alyssa> i would pay for depth, though, in case there's a buy 1 get 1 free going on
<icecream95> I'd pay for depth, so long as all the values in the buffer were dernomal
<icecream95> Oh wait, you can't invalidate depth but not stencil, can you?
<icecream95> What are people doing with resources that compression is so effective? 9%: 4820 KB -> 476 KB (total 41 MB saved)
camus has joined #panfrost
<alyssa> sounds like an app that should be using ASTC...
<alyssa> 56/56 sessions passed, conformance test PASSED
<alyssa> heck yeah!
<alyssa> <Summary Type="es31" Conformant="True">
<icecream95> alyssa: To beat that ratio with ASTC would require a block size of at least 8x6, which might be too lossy to be usable
<alyssa> icecream95: ah, I see
<alyssa> makes me wonder about the app, though ...
<icecream95> Even SuperTuxKart saves about 400 MB with compression
<alyssa> O_O
<alyssa> icecream95: btw, did you ever try wiring up "LCRA but 1000x faster" to Midgard?
<alyssa> LCRA was built for Midgard, trading off perf/memory usage for precise handling of arbitrary vectors
<alyssa> Something that is, ttbomk, impossible to do in a graph colouring RA...
<icecream95> alyssa: Not yet, because the larger bottleneck was guessing register pressure for how many push uniforms to use.
<icecream95> RA isn't so bad on Midgard as vectorisation means that you have 4x fewer nodes
<alyssa> Linear scan (and by extension, tree scan, i.e. SSA-based RA) can do that, but..
<alyssa> 1. Linear scan on its own sucks.
<alyssa> 2. TTBOMK, effective SSA-based RA for vector architectures is an open problem. ir3 is state of the art in this respect but... ir3 is still a scalar ISA, aside from intrinsics.
<icecream95> mir_estimate_pressure could maybe be sped up with: if (max_live > 96) return 7; in the inner loop
<alyssa> 3. Hybrid RA (i.e. local linear scan, global graph colouring) could maybe work but hybrid RA has some fundamental problems of its own. IGC used it once, I think
<icecream95> I've found a lot of bugs today... Assertion `batch->maxx > batch->minx' failed
<alyssa> So.. I suspect that your much improved version of LCRA (exploiting sparseness) is probably state-of-the-art for vec4 architecutres.
<alyssa> if we were academics, I'd suspect it's publishable
<alyssa> admittedly there are some really weird RA papers out there, and I haven't surveyed what's out there in a few years
<icecream95> Ooh I know I need to get a patent on the algorithm! /s
<alyssa> (honourable mention to Pereira's "register allocation by puzzle solving" paper :-p)
<alyssa> 02:12 < icecream95> RA isn't so bad on Midgard as vectorisation means that you have 4x fewer nodes
<alyssa> That also fed into why LCRA was a reasonable idea for Midgard and a bad one for Bifrost.
<alyssa> As you know, with dense data structures, LCRA is Θ(N^2) space and time complexity [worse if you spill.]
<alyssa> The flip side is that 1/4 of the nodes means 1/16 the resource demand..
<alyssa> In all honesty, I don't know what the space *or* time complexity of "LCRA+nodearray" is.
<icecream95> Well.. it's closer to N^3 with your version which tries each possible register one by one....
<alyssa> Ω(N) and O(N^2) but that's a big gap
floof58 is now known as Guest137
floof58 has joined #panfrost
<alyssa> O(R N^2) for N nodes and R registers
<alyssa> but actually O(R S N^3) for S values spilled, and S is O(N) so in that sense it's N^3 :(
<alyssa> In a formal sense, the "try all regs at once with NEON" and "try one at a time" are the same. Constant number of registers
<icecream95> The final 10x or so for the 1000x speedup was from making spilling not redo nodes that have already been calculated
<icecream95> alyssa: But if you only have 5 nodes in your program, you won't use all 64 registers
<alyssa> (And if you imagine a significantly larger reg file, your algorithm will slow down proprtionately. It's asymptoptically the same.)
<alyssa> that.. muddies the waters, yes
Guest137 has quit [Ping timeout: 480 seconds]
<alyssa> 02:24 < icecream95> The final 10x or so for the 1000x speedup was from making spilling not redo nodes that have already been calculated
<alyssa> I'm not convinced this works.
<alyssa> assuming by calculated you mean coloured ("solved")
<alyssa> although it's late and I should've been done with work like 5 hours ago so I'm signing off for now
<alyssa> night!
<icecream95> Ah no, I think we still reset solutions, jsut liveness
<icecream95> (isn't recalculated)
<icecream95> gnight
* jekstrand attempts to enable robustBufferAccess
<jekstrand> Passed: 228/1068 (21.3%)
<jekstrand> Failed: 0/1068 (0.0%)
* jekstrand kicks off one more run. I think descriptor sets will be good to go after this. \o/
tlwoerner has quit [Remote host closed the connection]
tlwoerner has joined #panfrost
tlwoerner has quit [Remote host closed the connection]
tlwoerner has joined #panfrost
pch has quit []
pch has joined #panfrost
Daanct12 has joined #panfrost
indy has quit [Read error: Connection reset by peer]
indy has joined #panfrost
guillaume_g has joined #panfrost
tlwoerner has quit [Remote host closed the connection]
tlwoerner has joined #panfrost
pch has quit [Remote host closed the connection]
pch has joined #panfrost
<icecream95> Uh oh.. there seems to be one SkQP test which fails because of transaction elinimation, though it needs most of the rest of SkQP to run for it to trigger...
rasterman has joined #panfrost
robmur01_ is now known as robmur01
<icecream95> Hmm.. the tests pass with noafbc.. what difference does that make?
* icecream95 goes looking for the branch which checks CRCs in software
* icecream95 tries to work out what crc_clear_color does
<icecream95> Hmm... CRC buffers are often coming back full of zeroes, which isn't great
stepri01 has quit [Remote host closed the connection]
stepri01 has joined #panfrost
Rathann has quit [Remote host closed the connection]
robmur01 has quit [Ping timeout: 480 seconds]
<icecream95> I have no idea where the clear_val | 0xc000000000000000 formula for crc_clear_color came from, it's completely wrong
Daanct12 has quit [Remote host closed the connection]
Daanct12 has joined #panfrost
robmur01 has joined #panfrost
<icecream95> I *think* that the correct thing to do is to just calculate an actual CRC. But rather than use an actual CRC implementation, it makes sense to calculate it in bigger and bigger parts:
robmur01_ has joined #panfrost
Daaanct12 has joined #panfrost
Daanct12 has quit [Read error: Connection reset by peer]
robmur01 has quit [Ping timeout: 480 seconds]
robmur01_ is now known as robmur01
<icecream95> https://github.com/madler/spoof/blob/master/spoof.c#L201 explains how to do calculate sparse CRCs, which I think could be usable
<icecream95> Or maybe the intent is just to write *any* clear-unique value as the clear colour, in which case why bother with the shifting?
<icecream95> That means that a tile can now be given two different CRC values, either the clean or non-clean tile one, so transaction elimination won't work quite so well
<icecream95> ("a tile": "a certain set of pixel values making up a tile")
rkanwal has joined #panfrost
erle has quit [Read error: Connection reset by peer]
icecream95 has quit [Ping timeout: 480 seconds]
Daaanct12 has quit [Remote host closed the connection]
erle has joined #panfrost
icecream95 has joined #panfrost
<icecream95> Oh wait, empty tiles use the top bit of the CRC as a flag bit, so they can't alias anyway
<icecream95> (So the current value for crc_clear_color is fine)
icecream95 has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
anarsoul has quit [Read error: Connection reset by peer]
anarsoul has joined #panfrost
rasterman has quit [Quit: Gettin' stinky!]
<icecream95> And if things aren't already confusing enough, v7 appears to calculate the CRC *after* compression, whereas v6 and earlier do it after tiling but while still uncompressed
icecream95 has quit []
rasterman has joined #panfrost
Guest860 has quit []
leah has joined #panfrost
<alyssa> icecream95: crc_clear_color is ignored unless empty_tiled_write_enable is set
<alyssa> and is new in v7 unless the XML got botched
JulianGro has joined #panfrost
<cwabbott> alyssa: actually SSA-based RA for midgard would probably be even easier than a scalar ISA, I think
<cwabbott> the main problem is having contiguous register classes with no alignment, which iirc midgard doesn't have
<alyssa> hmm?
<cwabbott> there was even a thesis where someone implemented ssa-based RA for ARM's compiler for midgard
<alyssa> I vaguely recall that thesis... I thought they did a hybrid or graph colouring though, not real SSA RA?
<alyssa> Maybe not
<cwabbott> I thought it was
floof58 is now known as Guest186
floof58 has joined #panfrost
Guest186 has quit [Ping timeout: 480 seconds]
tris- has joined #panfrost
tris_ has quit [Read error: Connection reset by peer]
`join_subline has quit [Read error: Connection reset by peer]
`join_subline has joined #panfrost
Danct12 has quit [Remote host closed the connection]
JulianGro has quit [Remote host closed the connection]
guillaume_g has quit []
MajorBiscuit has joined #panfrost
MajorBiscuit has quit [Quit: WeeChat 3.4]
MajorBiscuit has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
JulianGro has joined #panfrost
Rathann has joined #panfrost
karolherbst has quit [Quit: Konversation terminated!]
MajorBiscuit has quit [Quit: WeeChat 3.4]
karolherbst has joined #panfrost
urja has quit [Ping timeout: 480 seconds]
<alyssa> jekstrand: I think I've reviewed all your stack
<alyssa> cwabbott: Assuming you're talking about Max Andersson's thesis, it seems to just be graph colouring and linear scan
<alyssa> Ah, but there's another thesis implementing SSA-based RA for ARM's compiler for *bifrost*
<alyssa> Never saw that one until now
urja has joined #panfrost
karolherbst has quit [reticulum.oftc.net coulomb.oftc.net]
JulianGro has quit [reticulum.oftc.net coulomb.oftc.net]
atler has quit [reticulum.oftc.net coulomb.oftc.net]
kenzie has quit [reticulum.oftc.net coulomb.oftc.net]
JulianGroOld[m] has quit [reticulum.oftc.net coulomb.oftc.net]
go4godvin has quit [reticulum.oftc.net coulomb.oftc.net]
jenneron[m] has quit [reticulum.oftc.net coulomb.oftc.net]
psydroid[m] has quit [reticulum.oftc.net coulomb.oftc.net]
jschwart has quit [reticulum.oftc.net coulomb.oftc.net]
br has quit [reticulum.oftc.net coulomb.oftc.net]
macc24 has quit [reticulum.oftc.net coulomb.oftc.net]
CounterPillow has quit [reticulum.oftc.net coulomb.oftc.net]
mriesch has quit [reticulum.oftc.net coulomb.oftc.net]
ndufresne has quit [reticulum.oftc.net coulomb.oftc.net]
pH5 has quit [reticulum.oftc.net coulomb.oftc.net]
HdkR has quit [reticulum.oftc.net coulomb.oftc.net]
tomeu has quit [reticulum.oftc.net coulomb.oftc.net]
Stary has quit [reticulum.oftc.net coulomb.oftc.net]
mmind00 has quit [reticulum.oftc.net coulomb.oftc.net]
suihkulokki has quit [reticulum.oftc.net coulomb.oftc.net]
rtp has quit [reticulum.oftc.net coulomb.oftc.net]
karolherbst has joined #panfrost
atler has joined #panfrost
kenzie has joined #panfrost
JulianGroOld[m] has joined #panfrost
jenneron[m] has joined #panfrost
jschwart has joined #panfrost
pH5 has joined #panfrost
macc24 has joined #panfrost
rtp has joined #panfrost
tomeu has joined #panfrost
ndufresne has joined #panfrost
mriesch has joined #panfrost
Stary has joined #panfrost
br has joined #panfrost
HdkR has joined #panfrost
mmind00 has joined #panfrost
JulianGro has joined #panfrost
CounterPillow has joined #panfrost
suihkulokki has joined #panfrost
go4godvin has joined #panfrost
psydroid[m] has joined #panfrost
Danct12 has joined #panfrost
<alyssa> "It is simply not yet known if peephole optimization is
<alyssa> important for performance in Mali GPUs"
<alyssa> great sentence.
go4godvin is now known as Guest205
<alyssa> "The pattern specifics not be disclosed in this report as the Valhall ISA is not public"
<alyssa> Booo
rkanwal has quit [Quit: rkanwal]
rasterman- has joined #panfrost
rasterman- has quit []
rasterman has quit [Remote host closed the connection]
rasterman has joined #panfrost
Danct12 has quit [Quit: Quitting]
<robmur01> you'll just have to keep fixing that then!
<alyssa> :D
Danct12 has joined #panfrost
<jekstrand> alyssa: \o/
<jekstrand> alyssa: dual-src blend will take a bit more work yet
<alyssa> jekstrand: oh... okay
Rathann has quit [Ping timeout: 480 seconds]
icecream95 has joined #panfrost
<icecream95> alyssa: v6.xml has the empty tile fields but not crc_clear_color. Did the XML get botched?
<alyssa> possibly
<alyssa> er.. maybe not
<alyssa> empty tile works different on v6 and v7 apparently
<alyssa> (and is probably subtly broken on either :p)
<icecream95> On v7, taking a bit out of the CRC means that collisions will happen more often...
rasterman has quit [Quit: Gettin' stinky!]
<jekstrand> alyssa: Specifically (WRT dual-src blend), it currently requires that the other source gets passed in as part of the options struct. The pass isn't smart enough to scrape it out of the shader.
<jekstrand> Those smarts can be added; it'll just take a bit more work.
<jekstrand> The annoying part, of course, is that we don't know what order the two writes will happen at the end of the shader so we need to be clever about trying to scrape it all together.
<jekstrand> There's an intel pass which does something similar for alpha-to-coverage, IIRC.
<icecream95> jekstrand: I wrote a pass which reordered the writes, but I think it has since been removed?
<jekstrand> icecream95: Yeah, shouldn't be hard to do if you're willing to assume that they're all in the last block (which we are)
<icecream95> I used exec_node_* functions for that... is that supposed to be a supported way of manipulating NIR?
<jekstrand> It's generally not great.
<jekstrand> If you're going to pull it out and stick it right back in, it works, but nir_instr_remove/insert is better
<jekstrand> There are a few passes which have to do it that way for very specific reasons but I frown on it in general.
<icecream95> alyssa: The reason that I was poking about with empty_tile_write and friends is because I was seeing CRC data not being updated even with the write flag being set, so the values from a previous use of the resource remain
<alyssa> icecream95: :|
<alyssa> Not sure what would cause that off hand
<alyssa> jekstrand: I think we still have a NIR pass for that
<alyssa> I think based on what icecream95 did
<icecream95> alyssa: c4da1b84d3f ("panfrost: Remove pan_nir_reorder_writeout")
<alyssa> icecream95: you wrote 2 nir passes, I combined them, no?
<alyssa> I'm pretty sure that's your code
<icecream95> er yes, I forgot about the other one
<jekstrand> alyssa: Does Mali have no lod opcode?
<icecream95> "that's [my] code" How could you ever tell?
<alyssa> jekstrand: what would a lod opcode do?
<alyssa> icecream95: It just feels very you and not very me
<alyssa> I also don't remember writing that code and faintly remember merging it
<jekstrand> alyssa: given a coordinate, it takes the necessary derivatives and, instead of sampling, just gives you the LOD.
<alyssa> I also would not have implemented 2 src blending myself probably.
<alyssa> jekstrand: uh.. I don't think so, directly.
<alyssa> I think indirectly, maybe, hold on let me think through this
<alyssa> maybe you could snarf the LOD out of the gradient descriptor produced by GRDESC or GRDESC_DER
<jekstrand> Well, it's nir_texop_lod if you want to play with it. There are Vulkan CTS tests for it.
<jekstrand> dEQP-VK.glsl.texture_functions.query.texturequerylod.*
<alyssa> Okay, yes, I see how to do this
<alyssa> Admittedly I'm not sure if Arm intended the hardware to support this
<jekstrand> idk. Vulkan requires it as does desktop GL.
<jekstrand> I'm a bit surprised GLES doesn't have it
<alyssa> (vs a side effect of how textureGrad is implemented)
<alyssa> jekstrand: are there any interactions with anisotropic filtering?
<jekstrand> alyssa: I doubt it. I don't think anisotropic affects LOD computations.
<alyssa> OK
<alyssa> is there a lowering for it btw?
<alyssa> oh, there is but it's a pile of ALU, delghtful
<jekstrand> Yeah, you really don't want to do it all in ALU if the texture unit has support.
<alyssa> Yeah, ok
<alyssa> I'll probably wire that up tomorrow then
<jekstrand> \o/
<alyssa> currently procrastinating on the thing I'm procrastinating on
<jekstrand> :D
<jekstrand> alyssa: FYI: I switched the image size stuff to 16-bit. It's sitting as a FIXUP patch in the MR right now.
<jekstrand> I figure textureSize() isn't that common and halving the amount of descriptor data we have to upload is a good idea.
<alyssa> jekstrand: If we're going to do the fixup, would we rather just crawl the actual image descriptor?
<alyssa> (OTOH, saves only a ludicrously trivial amount of space and complicates porting because we don't have genxml nir_builder yet)
<jekstrand> alyssa: Maybe? But that means adding a UBO for it and, on bifrost, since it's all one table not one-per-descriptor-set, that's gonna be annoying.
<jekstrand> IDK. Maybe it's not THAT bad if we make it a fixed-index one like sysvals and push
<alyssa> rright
<alyssa> bifrost problems, sigh
<alyssa> (fixed in valhall)
<jekstrand> Yeah, on valhall, it's pretty obvious that we can just crawl the descriptors
<icecream95> alyssa: The problem with CRC data not being written happens with both v6 and v7 but T860 seems to be fine
<alyssa> Curious.
<alyssa> We don't set any empty tile flags though, right?
<icecream95> nope
<alyssa> Hm. then that should be midgard compatible behaviour.
<alyssa> ...Wait, since when do you have T860 hw? :p
<alyssa> to be clear, you're rendering into the resource, not writing to it any other way?
<icecream95> Same time as I got G52
<alyssa> Ah
<icecream95> The bug only happens with AFBC (I assume because of the clear to avoid invalid headers), so there is no other way to update it
* alyssa racks her brain for AFBC+CRC interactions
* icecream95 writes a version of init_afbc_headers for other layouts
<alyssa> icecream95: if you do so, please put it in pan_layout and unit test it?
<alyssa> also, which layouts specifically?
<icecream95> alyssa: This is just for debugging
<alyssa> ^ My project while waiting for CTS yesterday, not ready for merge but probably stuff relevant
<alyssa> icecream95: Ah, ok. Do your worst then ;)
<alyssa> icecream95: Oh, crap.
<icecream95> Hmm?
<alyssa> CRC on v6 is probably broken with MRT
<icecream95> MRT is not being used here, I think
<alyssa> Panfrost is written with the assumption that CRC happens on a single render target, 0 on v6
<alyssa> I think that's wrong, v6 uses all render targets
<icecream95> Really? :)
<alyssa> the hard way to deal is tracking CRC per framebuffer and not per render target and being super duper careful
<alyssa> the easy way is "if multiple render targets are enabled and PAN_ARCH == 6, disable CRC for the fragment job"
<icecream95> Now I need to work out whether the CRC is calculated for each RT "one-by-one", or interleaved
<alyssa> Have fun :)
<icecream95> (Quotes because calculation can be done in parallel regardless)
<alyssa> icecream95: Ooh, I have a thought what might be wrong, um
<alyssa> do you mind pasting the failing framebuffer descriptor?
<alyssa> (from pandecode)
* icecream95 goes back to writing debug code
<icecream95> https://0x0.st/oARe.txt I think
<icecream95> alyssa: ^^
<alyssa> Hum, no.
<alyssa> Was wondering if there were funky dimensions or bounding boxes or tile sizes or something
<alyssa> But that's as 'regular' as it gets.
<icecream95> Hmm.. if I force clean_pixel_write_enable to true then the CRC data stays valid, but the test fails?