<Lynne>
no need for a separate option, VAAPI_MPEG4_ENABLED is already there to break all systems
<ccr>
:)
sdutt has joined #dri-devel
aravind has quit [Read error: Connection reset by peer]
aravind has joined #dri-devel
<Lynne>
it's weird how mpeg-4 decoding is so broken on all systems, on mine I need a hard reboot if I try to use it, whilst on nvidia there's a ton of artifacts and errors
<Lynne>
though it's unsurprising if anyone's seen the spec, it's ridiculous how many features they put into it with no effort made to make them work together
mclasen has joined #dri-devel
<Lynne>
they even had OBMC in the base video codec, which was way ahead of its time, but made it impossible to use as they defined it as unsupported in all profiles
<ccr>
design by committee. everyone wants THEIR kitchen sink included in the standard.
<ccr>
except for hardware manufacturers who want just the simplest possible thing they can copy the HW blocks from the previous codec(s)
sdutt has quit []
sdutt has joined #dri-devel
<Lynne>
mmm, don't think so, mpeg-4 is vastly different to mpeg-2
<Lynne>
mpeg-4 had different, actually specified DCTs, rather than mpeg-2's, <lol don't care, just as long as it's IEEE compatible>, which led to desyncs over long GOPs
<clever>
GOPs? is that due to different FPUs being differently accurate?
<Lynne>
IEEE (forgot the standard number) DCTs just defined a target precision from a real floating point DCT
<Lynne>
fixed or float is an implementation detail
<clever>
i have tried to implement softfloat on an "int only" vector core in the past
<Lynne>
that IEEE standard served as a way to spec JPEGs, which referenced this standard, hence why you had both fixed point and float jpeg encoders
<HdkR>
ccr: As long as I get my kitchen sink I can be happy
<HdkR>
:P
<clever>
ah, so that's different from the IEEE 754 spec?
<Lynne>
okay, looked it up, the IEEE precision target spec is IEEE 1180-1990
<Lynne>
yes, it's different from the IEEE float spec
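As a rough illustration of what IEEE 1180-1990 asks for (the error limits and rounding details below are placeholders, not quoted from the standard): feed many random coefficient blocks through both a double-precision reference IDCT and the implementation under test, then check the per-pixel error statistics.
    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch of an IEEE 1180-style conformance check.  ref_idct()/test_idct()
       are hypothetical callbacks; the real standard prescribes specific input
       ranges and bounds the peak, mean and mean-squared error per pixel. */
    int idct_conforms(int trials,
                      void (*ref_idct)(const int16_t in[64], double out[64]),
                      void (*test_idct)(const int16_t in[64], int16_t out[64]))
    {
        for (int t = 0; t < trials; t++) {
            int16_t coeffs[64], got[64];
            double want[64];
            for (int i = 0; i < 64; i++)
                coeffs[i] = (int16_t)(rand() % 512 - 256);  /* random input block */
            ref_idct(coeffs, want);
            test_idct(coeffs, got);
            for (int i = 0; i < 64; i++) {
                double err = got[i] - want[i];
                if (err > 1.0 || err < -1.0)   /* peak-error check only, for brevity */
                    return 0;
            }
        }
        return 1;
    }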
<clever>
one of my side-projects is to RE an h264/mpeg2 encoder block
<clever>
my rough guess as to how things work, is that you first have to run a motion estimator over it, to compute the motion vectors
<clever>
and then you compute the frame from only the motion vectors, and then compute a delta to correct the errors the motion vectors can't correct
<clever>
and then ... something it
<Lynne>
err, sort of, you run motion estimation, you get the MVs + residuals for each block, you DCT the residuals and you encode both MVs and residuals
<clever>
ah, residuals sounds like what i said, the delta between oldframe+MV and newframe
<Lynne>
yes, it's just the difference in the pixel values
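A minimal sketch of that step, with hypothetical names (8x8 blocks, 8-bit pixels): motion-compensate from the reference frame with the chosen MV, subtract from the current frame, and the result is the residual that gets DCT'd and entropy-coded alongside the MV.
    #include <stdint.h>

    /* Residual for one 8x8 block.  `stride` is the row pitch of both frames. */
    static void compute_residual(const uint8_t *ref, const uint8_t *cur,
                                 int stride, int mv_x, int mv_y,
                                 int16_t residual[8][8])
    {
        const uint8_t *pred = ref + mv_y * stride + mv_x;  /* motion-compensated prediction */
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++)
                residual[y][x] = (int16_t)(cur[y * stride + x] - pred[y * stride + x]);
    }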
<clever>
and for the hw i'm targeting, the motion vector stuff is computed on a duplicate copy of the frame that is half resolution
<Lynne>
yup, that's reasonable
<clever>
> The H264 blocks need the frames in a weird column format(*), and also a second 2x2 subsampled version of the image to do a coarse motion search on. The ISP can produce both these images efficiently, and there isn't an easy way to configure the outside world to produce and pass in this pair of images simultaneously.
<clever>
a forum post from somebody that knows how the hardware works
<clever>
> (*) If you divide your image into 128 column wide strips with both the luma and respective U/V (NV12) interleaved chroma, and then glue these strips together end on end, that's about right. The subsampled image is either planar or a similar column format but 32 pixels wide. Cleverer people than me designed it for optimised SDRAM access patterns.
<clever>
and that 2nd part sounds like it's to optimize axi bus bursts
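A very rough guess at what that repacking could look like, purely for illustration (the exact interleaving inside a strip isn't documented here; the forum post only says this is "about right"):
    #include <string.h>
    #include <stdint.h>

    /* Cut an NV12 frame into 128-pixel-wide vertical strips and lay each strip
       out contiguously, luma rows followed by the strip's interleaved CbCr rows,
       strips glued end to end.  Assumes width is a multiple of 128 and height is
       even; `luma` is W*H bytes, `chroma` is the W*(H/2)-byte CbCr plane. */
    static void repack_nv12_columns(const uint8_t *luma, const uint8_t *chroma,
                                    int width, int height, uint8_t *out)
    {
        const int strip_w = 128;
        for (int sx = 0; sx < width; sx += strip_w) {
            for (int y = 0; y < height; y++) {        /* this strip's luma rows */
                memcpy(out, luma + y * width + sx, strip_w);
                out += strip_w;
            }
            for (int y = 0; y < height / 2; y++) {    /* this strip's CbCr rows */
                memcpy(out, chroma + y * width + sx, strip_w);
                out += strip_w;
            }
        }
    }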
<Lynne>
for starters, you can just use MVs of all zeroes and skip MV search altogether
<clever>
yeah, that sounds like it would still be valid enough for the decoder to be happy, but the residuals will use up more space in the file
<Lynne>
then you can implement whatever MV search you want, be it old-school downsampled one or new fancy hexagonal searches
<clever>
from the notes i've gathered, it sounds like there is hardware acceleration for the MV search
<Lynne>
"hardware accerelation" is just slang for fast SAD when it comes to motion estimation
<clever>
SAD?
<Lynne>
sum of absolute differences
<clever>
ah
<clever>
so basically just: int sum = 0; for (int i = 0; i < count; i++) { sum += abs(a[i] - b[i]); } ?
<Lynne>
yes, it's probably just SIMD, where you load 2 registers with one 8-pixel row from both the ref block and the compared block and it gives you a sum
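For reference, that description maps almost directly onto x86's psadbw; a minimal sketch of an 8x8 block SAD (the SSE2 version is only an illustration of the idea, not what the VCE actually runs):
    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* SAD of one 8x8 block of 8-bit pixels; `stride` is the row pitch in bytes. */
    static unsigned sad_8x8(const uint8_t *ref, const uint8_t *cur, int stride)
    {
        __m128i acc = _mm_setzero_si128();
        for (int y = 0; y < 8; y++) {
            __m128i r = _mm_loadl_epi64((const __m128i *)(ref + y * stride));
            __m128i c = _mm_loadl_epi64((const __m128i *)(cur + y * stride));
            acc = _mm_add_epi64(acc, _mm_sad_epu8(r, c));  /* one row -> one sum */
        }
        return (unsigned)_mm_cvtsi128_si32(acc);
    }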
<clever>
ah, and the vector core (separate from the h264 core) also has that same operation
<clever>
the vector core always operates on sets of 16, and can optionally repeat on up to 64 sets of 16, giving you 1024 ops in one opcode
<clever>
but my understanding is that the h264 core is entirely independent
<Lynne>
it's probably doing that type of SAD on fixed-length blocks internally
<clever>
Lynne: but what are a and b in that computation?
<Lynne>
the reference block and the compared block
oneforall2 has joined #dri-devel
<Lynne>
you want to find one with the smallest delta, so that's why you use SAD
<Lynne>
that's what indicates where the block has moved to
<clever>
but how does that become an MV? the only way i can see is to create hundreds of reference blocks (each offset by a different xy), and then compare?
<clever>
and find the offset with the smallest difference
<Lynne>
you pick a block, which is always on an 8x8 grid (for mpeg-2)
<Lynne>
that's your reference
<clever>
one sec...
<Lynne>
then you pick another "block of pixels", which is, say 4 pixels to the right
<Lynne>
you measure the SAD, and if it's low enough, that's your MV, otherwise you keep searching
<clever>
ah
<clever>
so you're just brute-force checking every possible MV until you find a SAD that's low enough
<clever>
for every macroblock in the frame
<Lynne>
optimally, yes, otherwise you use a close-enough algorithm
<Lynne>
which is what e.g. diamond search or hexagonal search is
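For illustration, a brute-force ("full search") version of that, reusing the sad_8x8 sketch above; real encoders replace the two nested loops with diamond/hexagon patterns or seed the search with neighbouring MVs, and this assumes the candidate window stays inside the reference frame:
    #include <stdint.h>

    typedef struct { int x, y; } mv_t;

    /* Try every candidate offset within +/-range pixels of the current 8x8
       block and keep the one with the lowest SAD; (0,0) -- the "all zeroes"
       MV mentioned earlier -- is the starting point. */
    static mv_t full_search(const uint8_t *ref, const uint8_t *cur,
                            int stride, int range)
    {
        mv_t best = {0, 0};
        unsigned best_sad = sad_8x8(ref, cur, stride);
        for (int dy = -range; dy <= range; dy++) {
            for (int dx = -range; dx <= range; dx++) {
                unsigned sad = sad_8x8(ref + dy * stride + dx, cur, stride);
                if (sad < best_sad) {
                    best_sad = sad;
                    best.x = dx;
                    best.y = dy;
                }
            }
        }
        return best;
    }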
<clever>
you could also use previous MVs as a hint
<clever>
assume the camera/object is still moving in the same direction
<Lynne>
yes, you can start searching by using the past 2 MVs as a trend
<clever>
this explains why encoding is so much more complex, because a lot of the math you're doing is basically going to waste
<Lynne>
yeah, and add to that how limited hardware encoders are, and you can guess why software implementations are always better
<Lynne>
*limited in terms of memory, which is expensive to build into a chip
<clever>
and also why i want better documentation for this hardware encoder
<clever>
so i could create a hybrid, of both
<clever>
for more context, i'm talking about the rpi line of SoCs
<clever>
there are at least 4 cpu clusters in every chip
<clever>
arm, vpu, qpu, vce
<clever>
vpu is where the bulk of the firmware runs, it's dual core with scalar and vector ops, and the vector register bank is 4096 bytes in size
<Lynne>
you wouldn't benefit that much from hybrid implementations
<clever>
qpu is for shaders and 3d rendering
<Lynne>
many have tried to offload encoding to a GPU or a dedicated unit, and it's just not worth it in most cases
<clever>
vce handles mpeg2/vc1/h264 encode/decode, it has its own cpu core and ISA to manage coordinating all of the actual hw accel blocks
<clever>
it sounds like vce is a specialized core, with vectorized ops to do things like SAD and other stuff
<Lynne>
yes, but the overhead involved from copying from system RAM to <memory the chip can see> is significant
<Lynne>
and keep in mind this is for a simple encoder
<clever>
yeah
<Lynne>
what e.g. x264 does is SATD, where it measures the difference in the transformed domain
<clever>
the VPU does have a solution to that dram latency problem
<clever>
while the VPU is dual-core, and each core has its own 4096 byte vector register bank
<Lynne>
and measures the number of bits it would take to encode with a given MV
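Roughly, SATD takes the pixel differences, runs a small Hadamard transform over them, and sums the absolute transformed coefficients, which tracks the coding cost of a residual better than plain SAD. A 4x4 sketch (ignoring the scale factor a real encoder would apply):
    #include <stdlib.h>
    #include <stdint.h>

    static int satd_4x4(const uint8_t *a, const uint8_t *b, int stride)
    {
        int d[16], t[16], sum = 0;
        for (int y = 0; y < 4; y++)                  /* pixel differences */
            for (int x = 0; x < 4; x++)
                d[y * 4 + x] = a[y * stride + x] - b[y * stride + x];
        for (int y = 0; y < 4; y++) {                /* horizontal 1-D Hadamard */
            int s0 = d[y*4+0] + d[y*4+2], s1 = d[y*4+1] + d[y*4+3];
            int s2 = d[y*4+0] - d[y*4+2], s3 = d[y*4+1] - d[y*4+3];
            t[y*4+0] = s0 + s1; t[y*4+1] = s0 - s1;
            t[y*4+2] = s2 + s3; t[y*4+3] = s2 - s3;
        }
        for (int x = 0; x < 4; x++) {                /* vertical pass + |.| sum */
            int s0 = t[0*4+x] + t[2*4+x], s1 = t[1*4+x] + t[3*4+x];
            int s2 = t[0*4+x] - t[2*4+x], s3 = t[1*4+x] - t[3*4+x];
            sum += abs(s0 + s1) + abs(s0 - s1) + abs(s2 + s3) + abs(s2 - s3);
        }
        return sum;
    }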
<clever>
there is only 1 actual vector computation core, shared between them
<clever>
so while core0 is doing computation, core1 can do load/store in parallel
iive has joined #dri-devel
<clever>
and then they can swap roles, and keep the computation core busy 100% of the time
<Lynne>
sure, whatever
<Lynne>
what I really would like is for GPUs to expose the codec primitives
<clever>
exactly
<Lynne>
NVIDIA actually made an extension for opengl that did that
<clever>
and then you can build your own things, by re-arranging those primitives
<Lynne>
yup, codec instructions have plenty of uses even outside of codec usecases, for example for smooth FPS interpolation
<clever>
i suspect that the VCE is capable of scalar opcodes, because the mpeg2/vc1 license checks are implemented on the VCE
<clever>
which also leads into a potential legal issue
kts has joined #dri-devel
<clever>
RPF didn't want to pay the mpeg2 hw encoding patent fees, because that would drive up the cost of every board
<clever>
so they disable it in software, and then have the end-user pay to turn it back on
<Lynne>
err, didn't RPI remove license fees for RPI2 and onwards?
<clever>
they only removed it on the pi4, by just dropping mpeg2/vc1 support
<clever>
i suspect the hardware still supports it, they just disabled it at compile time
<Lynne>
so H264 support is always built-in?
<clever>
yeah, h264 is always enabled
<clever>
and my research says h264, vc1, and mpeg2, are all done in the same hardware block
<clever>
it's just a matter of which ones the software lets you use
<Lynne>
broadcom'll either delete that part of the SoC in future chips, or enable it permanently for mpeg-2, because its patents ran out IIRC
<clever>
i heard that the patent hasn't run out in malaysia
<clever>
and until they can prove you're not using the board in malaysia, you must still pay the fee
<Lynne>
probably the former, you don't make money on a codec with no patent fees after all, do you :)
<clever>
and with how powerful the arm core is on the pi4, they just decided to get out of that mess, and remove mpeg2/vc1 in software
<clever>
but h264 remains, because it was already fee free
<Lynne>
(its fee was likely subsidised in the SoC's cost, but sssh)
<clever>
mpeg2/vc1 could have been also?
<clever>
my understanding is that rpf decided to not subsidise those into the cost, because it would drive the price of the board up
<clever>
but its a bit fuzzy, because one hardware block can apparently do all 3 codecs
<clever>
which does bring up a question, what is the technical difference between mpeg2 and h264 encoders?
hansg has joined #dri-devel
<Lynne>
biggest one is h264 has intra prediction
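A toy example of what spatial intra prediction buys you, using the H.264-style DC mode for a 4x4 block (predict the whole block as the rounded average of the already-decoded neighbours, then code only the residual); the function name is made up:
    #include <stdint.h>

    /* `top` and `left` are the 4 reconstructed pixels above and to the left
       of the block.  H.264 has many more directional modes; this is the
       simplest one. */
    static void intra_dc_4x4(const uint8_t *top, const uint8_t *left,
                             uint8_t pred[4][4])
    {
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += top[i] + left[i];
        uint8_t dc = (uint8_t)((sum + 4) >> 3);   /* rounded average of 8 neighbours */
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                pred[y][x] = dc;
    }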
<clever>
but could you then basically just disable intraprediction and magically have an mpeg2 encoder?
<Lynne>
yeah, I think that's what x262 did, which hacked x264 to make it an mpeg-2 only encoder
<Lynne>
of course the bitstream is still different
<Lynne>
that's why you can't decode mpeg-2 with an h264 decoder
<clever>
yeah, there would be some differences in the serializer as well
<Lynne>
no clue what that is, we just call it bitstream
<Lynne>
you never "serialize" a bitstream, that's just silly
<clever>
so if i do figure out h264 encode/decode, and document it well enough, the other codecs could be made from that
<Lynne>
copying everything beforehand into well-structured data is unnecessary and wasteful, codecs are designed to allow decoding in parallel with parsing the bitstream
flacks has quit [Quit: Quitter]
mclasen has quit []
<clever>
yeah, makes sense
mclasen has joined #dri-devel
<Lynne>
in theory, unless they've fused some steps
<Lynne>
I'm not sure why you'd want to encode mpeg-2 on a raspberry pi when h264 is supported though?
<clever>
if the consumer on the other end is mpeg2 only
<clever>
or if you want to instead decode mpeg2 content
mattrope has joined #dri-devel
jewins has joined #dri-devel
flacks has joined #dri-devel
kts has quit [Quit: Konversation terminated!]
hansg has quit [Quit: Leaving]
fxkamd has joined #dri-devel
mclasen has quit [Remote host closed the connection]
FireBurn has joined #dri-devel
maxzor has quit [Ping timeout: 480 seconds]
<FireBurn>
I seem to be hitting two separate bugs on my Navy Flounder card when playing Horizon Zero Dawn
gouchi has quit [Remote host closed the connection]
ngcortes has quit [Read error: Connection reset by peer]
gouchi has joined #dri-devel
ngcortes has joined #dri-devel
<jekstrand>
Dang.... M1 Ultra...
* jekstrand
might have to join Asahi now...
heat has joined #dri-devel
* airlied
wonders will it do multiple monitors by having each M1 do one monitor :-P
tzimmermann has quit [Quit: Leaving]
<daniels>
don’t forget to get the Studio Display, with the same SoC as two iPhones ago, or the most recent iPad
listout has joined #dri-devel
ngcortes has quit [Remote host closed the connection]
<airlied>
daniels: that statement cuts both ways :-)
<airlied>
it being an old SoC is both annoying and very impressive :-P
<daniels>
airlied: only 2.5 years old?
<airlied>
daniels: imagine how bad it would be if it was a qualcomm :-P
<daniels>
airlied: the GPU is worse than the current flagship Qualcomm (which is on a better process), but CPU perf is roughly equivalent
<airlied>
daniels: how does the memory bw compare?
<gawin>
finally got r300 passthrough working \o/
<airlied>
ah the au$12k apple machine for the top of the line
<gawin>
I wonder if my old i5 is enough to emulate powerpc
<daniels>
airlied: same
mbrost has quit [Ping timeout: 480 seconds]
dliviu has quit [Ping timeout: 480 seconds]
dliviu has joined #dri-devel
rkanwal has quit [Ping timeout: 480 seconds]
listout has quit []
rasterman has quit [Quit: Gettin' stinky!]
csileeeeeeoe has quit [Remote host closed the connection]
<ajax>
does anyone know of a less painful way of reading the vulkan specs?
ppascher has quit [Ping timeout: 480 seconds]
<ajax>
i hate to praise the gl extn spec convention but it has a certain Brevity to recommend it
<Sachiel>
you can try the "Say something blatantly wrong and have someone here correct you" method
<zmike>
every time you need to reference info in the vk spec it's like a treasure hunt
<zmike>
and you'll emerge from it, hours later, bruised and battered, different than you were when you set out
<MTCoster>
ajax: i use dash on mac for offline docs, it keeps a pretty snappy index. i think there’s a linux equivalent that shares the same source docsets
<jekstrand>
ajax: Start with the overview description in the appendix. That has links to all added structs and entrypoints.
<ajax>
jekstrand: hey, your name is on KHR_timeline_semaphore, maybe you know where to point me
<ajax>
say i was trying to emulate GLX_EXT_swap_interval in vulkan. 0 is immediate present, 1 is fifo, -1 is fifo relaxed. but abs(interval) > 1 means i need to get an event back from the display for presentation times that _didn't happen_. and/or a way to query the refresh interval, i suppose.
<ajax>
what other extension would give me a timeline semaphore fitting either of those descriptions?
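The interval values 0/1/-1 do map cleanly onto existing present modes; it's only abs(interval) > 1 that has no equivalent, which is the gap being discussed. A trivial sketch of the easy part:
    #include <vulkan/vulkan.h>

    static VkPresentModeKHR present_mode_for_interval(int interval)
    {
        if (interval == 0)
            return VK_PRESENT_MODE_IMMEDIATE_KHR;     /* no vsync */
        if (interval < 0)
            return VK_PRESENT_MODE_FIFO_RELAXED_KHR;  /* tear if we miss vblank */
        return VK_PRESENT_MODE_FIFO_KHR;              /* interval == 1 */
    }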
<jekstrand>
uh oh
<ajax>
or, am i barking up the wrong set of trees
<jekstrand>
There's no Vulkan extension for that right now AFAIK
ngcortes has quit [Remote host closed the connection]
<ajax>
so you're saying i need to write VK_EXT_present_queue_msc or something
<jekstrand>
Something was proposed some time ago that was something akin to a timeline semaphore that gets triggered on every vblank or something but I don't think anything ever came of it.
<jekstrand>
Yeah, I think that's roughly what'd have to happen.
<jekstrand>
Not sure if that's actually what we want, though. Does anyone use abs(interval) > 1?
<jekstrand>
i.e. is it something we want to optimize for or just make possible?
<ajax>
my instinct is to say it's fairly rare, but then i remember that serato has an fps cap slider in the preferences that is clearly just swap interval
<jekstrand>
:(
<ajax>
but the reason to do _that_ is to reserve enough cpu that the audio dsp work you're doing gets enough cycles, so maybe doing this by guessing refresh rate and putting a thread to sleep until some time passes is good enough
<jekstrand>
Perhaps.
<jekstrand>
If you did want a VK_EXT, I can imagine it working one of two ways:
<jekstrand>
1) Some exception to let you vkQueuePresent the same image multiple times in FIFO mode and stack them up.
<jekstrand>
2) An integer for how many frames you want an image to last in FIFO mode.
<jekstrand>
I suspect 2 is probably easier to implement and less likely to have weird side-effects. 1 is a bit more flexible but maybe that's not a good thing here.
<ajax>
1 sounds awful from the caller's perspective, i don't want to do 2N presents to do 1 swapbuffers
idr has quit [Quit: Leaving]
<ajax>
sure would be nice to just get a timeline semaphore whose value was the msc tho
<dcbaker>
@alyssa: I've had to revert a panfrost patch from the staging/22.0 branch because it's causing CI failures
<dcbaker>
the patch is "panfrost: Fix set_sampler_views for big GL"
<dcbaker>
not sure what you want to do about it
<ajax>
jekstrand: is wp_presentation the state of the art for this kind of thing in wayland?
<daniels>
yeah
<ajax>
so... same deal then, you don't get events unless you tried to commit something.
<daniels>
you’ll need to use that and guesstimate when to submit based on last_present + (n * refresh_interval)
<daniels>
we did type up a bunch of target-time stuff, but it never went anywhere due to lack of actual real-world users
<ajax>
enh. you estimate when to wake up and unblock submission, which is right when the last frame of the interval starts. after that you let the app block naturally.
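A minimal sketch of that scheme, with hypothetical names: sleep until (interval - 1) refresh periods after the last observed presentation time (e.g. from wp_presentation feedback or OML msc events), then let the normal FIFO present block for the final period.
    #include <stdint.h>
    #include <time.h>

    static void throttle_to_interval(uint64_t last_present_ns,
                                     uint64_t refresh_ns, int interval)
    {
        if (interval <= 1)
            return;                                   /* FIFO already covers this */
        uint64_t target = last_present_ns + (uint64_t)(interval - 1) * refresh_ns;
        struct timespec ts = {
            .tv_sec  = (time_t)(target / 1000000000ull),
            .tv_nsec = (long)(target % 1000000000ull),
        };
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
    }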
<ajax>
if you have links i'd be curious, otherwise i may end up just rewriting OML_sync_control here
<daniels>
I meant target-time in Wayland protocol, not in Vulkan-side (that’s all in GitLab somewhere but terminally stalled)
<ajax>
i did too. present already gives me msc events, i think i know how to do this for x11