ChanServ changed the topic of #panfrost to: Panfrost - FLOSS Mali Midgard & Bifrost - Logs https://oftc.irclog.whitequark.org/panfrost - <macc24> i have been here before it was popular
atler is now known as Guest1449
atler has joined #panfrost
Guest1449 has quit [Ping timeout: 480 seconds]
stano_ has joined #panfrost
stano has quit [Ping timeout: 480 seconds]
camus has joined #panfrost
camus1 has quit [Ping timeout: 480 seconds]
rando25892 has quit [Ping timeout: 480 seconds]
camus1 has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus1 has quit [Read error: Connection reset by peer]
vstehle1 has quit [Ping timeout: 480 seconds]
camus1 has joined #panfrost
camus has quit [Remote host closed the connection]
vstehle1 has joined #panfrost
rando25892 has joined #panfrost
rando25892 has quit [Remote host closed the connection]
rando25892 has joined #panfrost
<tomeu>
bbrezillon: when executing, you mean on batch_close?
<bbrezillon>
tomeu: I mean when vkCmdExecuteCommands() is called
<bbrezillon>
you should have a valid state (FB, pipeline, ...) when that happens
<bbrezillon>
tomeu: BTW, you shouldn't really 'close' the batch when recording secondary commands (at least not the sort of close we do for primary cmd buffers)
<bbrezillon>
correction: you only need to special case the close for secondary cmdbufs that are supposed to be called inside a render pass initiated by the primary cmdbuf
<tomeu>
ok, ok, I think I'm starting to see how this is supposed to work
rasterman has joined #panfrost
<tomeu>
bbrezillon: we indeed need to do some closing, as for example thousands of draw commands could be recorded in the same secondary buffer, causing an overflow of the job index
<tomeu>
bbrezillon: what are you referring to when you say special casing the closes of "incomplete" secondary buffers?
<bbrezillon>
tomeu: depends how you handle that I guess. I mean, if those draws are part of a render pass initiated by the primary command buf, I'd let the ExecuteCommands() do the split
<bbrezillon>
tomeu: I mean you shouldn't emit the TLS, FBD, ... when closing the batch (I'm not even sure close_batch() should be called actually)
wwilly has joined #panfrost
<bbrezillon>
when recording a secondary cmdbuf, store tiler/vertex jobs in CPU memory and keep a reference to each job issued. Then, when ExecuteCommands is called, you patch all those jobs and queue them to the primary batch
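A minimal sketch of the deferred-patching idea bbrezillon describes: jobs recorded into a secondary command buffer stay in CPU memory with a note of which words still need an address, and are only patched and queued once the primary batch is known. Every name here (`RecordedJob`, `SecondaryCmdBuf`, `PrimaryBatch`, the three-word job layout, the `tls`/`fbd` slot names) is hypothetical and chosen for illustration; it is not the Panfrost code or the real Mali job descriptor layout.

```python
from dataclasses import dataclass, field


@dataclass
class RecordedJob:
    kind: str                     # "vertex" or "tiler"
    words: list                   # toy job payload kept in CPU memory
    # word index -> symbolic name of the address to fill in at execute time
    patch_slots: dict = field(default_factory=dict)


class SecondaryCmdBuf:
    """Records jobs without knowing the final TLS/FBD addresses."""

    def __init__(self):
        self.jobs = []

    def draw(self, vertex_count):
        # One vertex job and one tiler job per draw; the pointer words are
        # left unset (None) and remembered in patch_slots.
        for kind in ("vertex", "tiler"):
            self.jobs.append(
                RecordedJob(kind, [vertex_count, None, None], {1: "tls", 2: "fbd"})
            )


class PrimaryBatch:
    """Stands in for the batch owned by the primary command buffer."""

    def __init__(self, tls_addr, fbd_addr):
        self.addrs = {"tls": tls_addr, "fbd": fbd_addr}
        self.queued = []

    def execute_commands(self, secondary):
        for job in secondary.jobs:
            # Copy first so the recorded job stays unpatched and the
            # secondary buffer can be executed again elsewhere.
            words = list(job.words)
            for index, name in job.patch_slots.items():
                words[index] = self.addrs[name]
            self.queued.append((job.kind, words))
```

Copying before patching is what keeps the secondary buffer reusable: the same recorded jobs can be patched against a different batch on a later execute.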
<tomeu>
ok, so we would not be using any of the batch code
<bbrezillon>
of course, if you're not in a primary render pass, or have all the information you need (FB and render pass passed through VkCommandBufferInheritanceInfo), you can issue the draws normally
<bbrezillon>
tomeu: you might be able to re-use some bits, but most of it would be different, yes
<tomeu>
ok, we will see how we can reshuffle things so we don't end up with 2 different drivers :)
<bbrezillon>
actually, even if you're not supposed to be called in a primary render pass, if VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT is set, you have to keep the various descs in CPU memory, and copy them to GPU mem when ExecuteCommands is called
<bbrezillon>
tomeu: yep, those changes are quite invasive, anything containing pointers has to be patched when vkCmdExecuteCommands()
<bbrezillon>
is called
<bbrezillon>
tomeu: there's still the option of recording vkCmdXxx() calls in a high-level representation (basically keeping the VkXxx structs around for each of these calls) and replaying them when ExecuteCommands() is called
<tomeu>
yeah, was wondering about that, and if it's something that could be reused among vulkan drivers
<tomeu>
was thinking of the perf implications of that
<bbrezillon>
well, you get the Vk -> HW-desc conversion overhead each time the secondary command buf is executed, so it's not ideal
<bbrezillon>
but it's way simpler to implement :)
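The record-and-replay alternative discussed above can be sketched roughly as follows: the secondary buffer only stores the call name and its arguments, and the primary buffer re-runs each call when it executes the secondary, so the translation to hardware state is repeated on every execution. All class and method names (`RecordingCmdBuf`, `PrimaryCmdBuf`, `hw_jobs`) are hypothetical; this is a toy model of the idea, not a Vulkan binding or Mesa code.

```python
class RecordingCmdBuf:
    """Secondary command buffer that stores calls instead of HW descriptors."""

    def __init__(self):
        self.calls = []

    def bind_pipeline(self, pipeline):
        self.calls.append(("bind_pipeline", (pipeline,)))

    def draw(self, vertex_count, first_vertex):
        self.calls.append(("draw", (vertex_count, first_vertex)))


class PrimaryCmdBuf:
    """Primary command buffer that knows the framebuffer and emits jobs."""

    def __init__(self, framebuffer):
        self.framebuffer = framebuffer
        self.pipeline = None
        self.hw_jobs = []

    def bind_pipeline(self, pipeline):
        self.pipeline = pipeline

    def draw(self, vertex_count, first_vertex):
        # The "Vk -> HW-desc conversion" happens here, against the state
        # that is current in the primary buffer at replay time.
        self.hw_jobs.append(
            {
                "fb": self.framebuffer,
                "pipeline": self.pipeline,
                "count": vertex_count,
                "first": first_vertex,
            }
        )

    def execute_commands(self, secondary):
        # Replay every recorded call as if it had been issued directly.
        for name, args in secondary.calls:
            getattr(self, name)(*args)
```

Executing the same secondary buffer twice redoes the conversion twice, which is the overhead bbrezillon mentions; in exchange there is nothing to patch.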
<tomeu>
guess the idea behind secondary cmdbufs is that translating vulkan state to GPU state is done once and reused multiple times, but if due to the GPU design we anyway need to patch it so heavily, it just might not be a win on mali
<bbrezillon>
I guess it depends how often the secondary buf is re-used, and it's hard to tell how much CPU time can be saved without implementing both solutions, but I'm all for implementing the simplest approach first and leaving this potential optimization for later
<tomeu>
yeah, me too
<tomeu>
especially, as "Specifying the exact framebuffer that the secondary command buffer will be executed with may result in better performance at command buffer execution time."
<tomeu>
the spec warns about it
wwilly_ has joined #panfrost
wwilly has quit [Ping timeout: 480 seconds]
camus has joined #panfrost
<bbrezillon>
tomeu: sure, but that's not enough. I mean, emitting draws outside of an existing batch (which you can do if you have a FB) when we could have merged those with the batch coming from the primary cmdbuf has a cost (extra FB reloads)
camus1 has quit [Read error: Connection reset by peer]
warpme_ has joined #panfrost
<daniels>
icecream95: there's nothing stopping you from getting involved and working if you want to - you can buy the same phone, download the same images, work on the same hardware
<tomeu>
in the RAW32 case, the result is all zeroes
<tomeu>
I'm not sure why the stride is different
<bbrezillon>
the clear color is different
<bbrezillon>
but it's probably normal (different clear value for uint and unorm)
<bbrezillon>
looks like the FB is not cleared
<bbrezillon>
Color:(0, 0, 0, 0)
<bbrezillon>
are you sure you pass the right value through the uniform?
<bbrezillon>
also need to check the shader (should be an uint color in one case, and a float in the other)
<bbrezillon>
blend descriptors are also worth a check (they should not be the same for UNORM and UINT)
<bbrezillon>
tomeu: did you check the TestResults.qpa output?
<tomeu>
yeah, there's no dump, but mentions that the first pixel in the failing case is all zeroes
<tomeu>
by blend descriptor, you mean the internal conversion descriptor?
<tomeu>
that's the same because it's using the same shader
<tomeu>
(that's how it's done in buf2img, which I used as a template)
camus1 has joined #panfrost
camus has quit [Remote host closed the connection]
camus has joined #panfrost
camus1 has quit [Read error: Connection reset by peer]
nlhowell has quit [Ping timeout: 480 seconds]
camus1 has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
wwilly has joined #panfrost
wwilly__ has joined #panfrost
wwilly_ has quit [Ping timeout: 480 seconds]
wwilly has quit [Ping timeout: 480 seconds]
<bbrezillon>
tomeu: buf2img is a bit different though, it's expected to do a raw copy, but what you want here is draw a rect with a fixed color, and the clear value will depend on the format
<bbrezillon>
tomeu: do you have this code pushed to a public repo?
<alyssa>
will be moved to collabora.com shortly, am having technical difficulties oops
<alyssa>
No GPUs were harmed in the making of this document.
<macc24>
fun fact: sku0 duets are shipping lol got cadmium user with one
<HdkR>
alyssa: Good job!
<alyssa>
HdkR: Thank you <3
<alyssa>
HdkR: My Twitter is being very loud :-p
<HdkR>
So many birds
<anarsoul>
alyssa: so ARM didn't provide ISA documentation for Valhall?
<alyssa>
anarsoul: Collabora is providing ISA documentation for Valhall
<alyssa>
That is my story and I'm sticking to it.
<alyssa>
daniels: ^
<urja>
alyssa: on page 6 the table for the blend shader ABI is not rendered as an actual table, i think
<urja>
or page 5 as the page claims itself to be, 6 as by the pdf reader :P
<alyssa>
uhhh
<robclark>
alyssa: heh, arm still uses the same program-binary format? I think qcom is on their ~3rd iteration (for ir3)
<alyssa>
urja: Oh, oof, thanks. Fixing
<alyssa>
robclark: I mean, I think they keep shuffling things around, but the "scan for magic word and then start disassembling" hack still works :-p
<robclark>
oh, qcom's first iteration was "scan for magic word".. they later went to a more structured thing with headers and offsets to different sections (which have offsets to different sections, and so on)
<alyssa>
Nod, that sounds like MBS
<alyssa>
again I didn't actually pay attention to the MBS, just knew enough to look for OBC :p
<alyssa>
*OBJC
<alyssa>
urja: Fixed now, thanks for spotting it!
* urja
refreshes
<urja>
yup, fixed :)
<alyssa>
urja: Didn't realize that part of pandoc markdown was case sensitive
<alyssa>
Need to find a home for the docs
<alyssa>
most of it is generated from ISA.xml which will be upstreamed to Mesa when I can get around to git rebase'ing harder
<alyssa>
the intro/appendix is extra prose (CC BY-SA) which I guess I can stick on gitlab.freedesktop.org
<urja>
I was sorta amused by "Putting it together gives a code sequence for sin" :D
<icecream95>
alyssa: "number of instructions to the next BLEND intsruction minus". Missing "one"? So an off-by-one error on the off-by-one?
<alyssa>
yes, and it should be fixed now
<alyssa>
well already was fixed but uh
<icecream95>
"instructions et giving the same building blocks". ?
camus has joined #panfrost
camus1 has quit [Ping timeout: 480 seconds]
<icecream95>
"FMA_RSCALE.f32 scaled, x, #24" seems to be missing a source
<alyssa>
uhhh indeed it is
<alyssa>
Corrected, thanks
rasterman has quit [Quit: Gettin' stinky!]
camus1 has joined #panfrost
camus has quit [Ping timeout: 480 seconds]
<icecream95>
I do wonder why 0.75 << 20 (intBitsToFloat(0x49400000)) was chosen as the "SINCOS_BIAS", it should only differ from 0.5 << 20 for values where sin wouldn't give a sensible answer anyway
* icecream95
tries some negative numbers and finds out why
<icecream95>
(Otherwise for negative numbers the exponent would decrease so the answer would be shifted left by one place)
<alyssa>
Not sure what the story with the magic # is but it's the same for bifrost/valhall
<icecream95>
alyssa: Floats have 24 mantissa^W significand bits.
<icecream95>
After multiplying by 2/pi, the range of values is [0, 4) (because 2*pi * 2/pi = 4), so we want to put (value % 4) in the lower 6 bits of the float
<icecream95>
4 is represented in floating-point with a significand of 0x800000 >> 24 (0.5) and an exponent of 3, as 0.5 << 3 == 4
<icecream95>
We want to find a number that, when added to 4, shifts the 1 bit in the significand to bit 7 ^H 6
<icecream95>
0x800000 >> x = 0x40. log2(0x800000) - x = log2(0x40). x = log2(0x800000) - log2(0x40) = 23 - 6 = 17
<icecream95>
But 4 already has an exponent of three, so the magic value must have an exponent of 17 + 3 = 20
<icecream95>
(The significand is fixed point 0:24 in the range [0.5, 1), so the first bit is always 1 and not stored)
<alyssa>
Ah, clever. Nice find!
<icecream95>
The Valhall documentation could probably make it clearer that it is only the lower six bits of the float/significand that equal 32/pi*(x mod 2pi), and add parentheses to show that it is 32/pi*(x mod 2pi) and not (32/pi*x) mod 2pi
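icecream95's derivation can be checked numerically with a small sketch. It assumes the bias is added to x * 2/pi and that the low six bits of the resulting float32 are then read as the table index; the helper names are made up for illustration, and the arithmetic is done in Python doubles and rounded to float32 afterwards, which only approximates a single-rounding hardware FMA.

```python
import math
import struct

SINCOS_BIAS_BITS = 0x49400000  # 0.75 * 2**20, the value quoted above
HALF_BIAS_BITS = 0x49000000    # 0.5 * 2**20, the alternative being compared


def bits_to_f32(bits):
    """Reinterpret a 32-bit integer as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32_to_bits(value):
    """Round a Python float to single precision and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def low_six_bits(x, bias_bits=SINCOS_BIAS_BITS):
    """Low six bits of float32(bias + x * 2/pi).

    With a bias in [2**19, 2**20) one unit in the last place is 1/16, so a
    range of 4 quarter-turns spans 64 steps, i.e. exactly six bits.
    """
    total = bits_to_f32(bias_bits) + x * (2.0 / math.pi)
    return f32_to_bits(total) & 0x3F


if __name__ == "__main__":
    for x in (0.0, math.pi / 2, math.pi, -math.pi / 2):
        print(x, low_six_bits(x), low_six_bits(x, HALF_BIAS_BITS))
```

With the 0.75 bias, -pi/2 gives 48 (that is, -16 mod 64), matching the positive cases 0, 16 and 32 for 0, pi/2 and pi. With the 0.5 bias the sum drops below 2**19, the exponent decreases, and the same input gives 32: the result is shifted left by one place, as noted above for negative numbers.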
<anholt_>
oh, this isn't the channel I meant to drop that in.
<anholt_>
but enjoy
* icecream95
learnt about the floating point format from the ZX Spectrum BASIC manual, which had far more of this sort of low-level stuff than any of the C++ books he later read