ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar
tzimmermann has quit [Ping timeout: 480 seconds]
jewins has quit [Ping timeout: 480 seconds]
pcercuei has quit [Quit: dodo]
columbarius has joined #dri-devel
co1umbarius has quit [Ping timeout: 480 seconds]
nchery has quit [Quit: Leaving]
danvet has quit [Ping timeout: 480 seconds]
Emantor_ has quit []
sdutt has quit [Ping timeout: 480 seconds]
Emantor has joined #dri-devel
<karolherbst> wow.. rust macros are insane :O
oneforall2 has quit [Quit: Leaving]
oneforall2 has joined #dri-devel
<clever> anholt_: did you work on the v3d end of the pi much, or just the 2d/hvs end?
imre has quit [Remote host closed the connection]
heat has quit [Ping timeout: 480 seconds]
YuGiOhJCJ has joined #dri-devel
mattrope has quit [Quit: Leaving]
lemonzest has joined #dri-devel
Duke`` has joined #dri-devel
i-garrison has quit []
i-garrison has joined #dri-devel
alanc has quit [Remote host closed the connection]
alanc has joined #dri-devel
Daanct12 has joined #dri-devel
Daaanct12 has joined #dri-devel
Danct12 has quit [Remote host closed the connection]
rsalvaterra has quit [Quit: Leaving...]
danvet has joined #dri-devel
Daanct12 has quit [Ping timeout: 480 seconds]
rasterman has joined #dri-devel
frieder has joined #dri-devel
frieder has quit []
rasterman has quit [Quit: Gettin' stinky!]
rsalvaterra has joined #dri-devel
Tooniis[m] has quit []
Tooniis[m] has joined #dri-devel
Tooniis[m] has quit []
Tooniis[m] has joined #dri-devel
gouchi has joined #dri-devel
pcercuei has joined #dri-devel
alatiera has quit [Quit: The Lounge - https://thelounge.chat]
pekkari has joined #dri-devel
alatiera has joined #dri-devel
achrisan has quit []
Ahuj has joined #dri-devel
Hi-Angel has joined #dri-devel
pnowack has joined #dri-devel
mlankhorst has joined #dri-devel
Hi-Angel has quit [Quit: Konversation terminated!]
Hi-Angel has joined #dri-devel
rcf has quit [Ping timeout: 480 seconds]
alatiera has quit [Quit: Ping timeout (120 seconds)]
alatiera has joined #dri-devel
alatiera is now known as Guest6297
Ahuj has quit [Ping timeout: 480 seconds]
iive has joined #dri-devel
pnowack has quit [Quit: pnowack]
mlankhorst has quit [Ping timeout: 480 seconds]
Adrinael_ has joined #dri-devel
Adrinael has quit [Read error: Connection reset by peer]
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
Company has joined #dri-devel
NiksDev has joined #dri-devel
heat has joined #dri-devel
Adrinael_ has quit []
Adrinael has joined #dri-devel
Guest6297 has quit []
thelounge53 has joined #dri-devel
<karolherbst> jenatali: how are you managing the CL queue in your OpenCL impl?
<jenatali> karolherbst: What do you mean?
<karolherbst> jenatali: like.. the cl_command_queue thing
<jenatali> Yeah but what do you mean by managing?
<karolherbst> in clover we are chaining the cl_event objects I think, but I was under the impression you had a different solution there
<jenatali> Every command creates a "task," and if the caller wants an event, they get a pointer to the task
<jenatali> The tasks sit in a queue until the queue is flushed, at which point the tasks are drained into a task pool
<jenatali> The task pool has a worker thread which picks up any ready tasks, turns them into D3D commands, and executes them
<karolherbst> mhhhh
<jenatali> By default, queues are in-order, so each task has a dependency on the task before it in the queue
NiksDev has quit [Ping timeout: 480 seconds]
<jenatali> It's really inefficient right now because tasks aren't considered ready until the previous task has finished executing; in an ideal world I'd walk the dependency chain and batch together all tasks that would be made ready
<jenatali> But that read like a violation of the spec to me so I played it safe to start
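Roughly the shape of the design jenatali describes — per-queue pending lists, a shared ready pool filled at flush time, and a worker that only picks up tasks whose dependency has finished. Everything below (names, types) is a hypothetical sketch in C, not code from any real implementation:

```c
/* Hypothetical sketch of the task/queue/pool design described above. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
    struct task *dep;      /* previous task in an in-order queue */
    struct task *next;     /* intrusive list link */
    bool done;
    void (*execute)(struct task *t);  /* turn the task into GPU commands */
};

struct task_pool {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    struct task *ready;    /* flushed tasks waiting for the worker */
};

/* clFlush(): dump the queue's pending tasks into the shared pool. */
static void queue_flush(struct task_pool *pool, struct task *pending)
{
    pthread_mutex_lock(&pool->lock);
    while (pending) {
        struct task *t = pending;
        pending = t->next;
        t->next = pool->ready;
        pool->ready = t;
    }
    pthread_cond_signal(&pool->cond);
    pthread_mutex_unlock(&pool->lock);
}

/* One worker per device: run any task whose dependency has finished. */
static void *worker(void *arg)
{
    struct task_pool *pool = arg;
    pthread_mutex_lock(&pool->lock);
    for (;;) {
        struct task **link = &pool->ready, *t = NULL;
        for (; *link; link = &(*link)->next) {
            if (!(*link)->dep || (*link)->dep->done) {  /* "ready" */
                t = *link;
                *link = t->next;
                break;
            }
        }
        if (!t) {
            pthread_cond_wait(&pool->cond, &pool->lock);
            continue;
        }
        pthread_mutex_unlock(&pool->lock);
        t->execute(t);
        pthread_mutex_lock(&pool->lock);
        t->done = true;
        pthread_cond_broadcast(&pool->cond);  /* deps may now be satisfied */
    }
    return NULL;
}
```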
<karolherbst> well with gallium we can just keep pushing work into the driver
<karolherbst> so I don't really care all that much about this part
<karolherbst> as I'd rely on the driver to block
<karolherbst> (once a hw queue or whatever gets too full)
<jenatali> Yeah my main concern was about having the CPU report event A as done before event B as ready, if B depended on A
<karolherbst> mhh
<karolherbst> at least with nouveau it's all in order
<karolherbst> I think I'd map a cl_command_queue to a gallium context as clover is doing it already and rely on its properties
<karolherbst> and that is very much in order and single threaded afaik
<jenatali> That gets tricky when you get cross-queue dependencies though
<karolherbst> we do have fence objects but yeah...
<karolherbst> cross queue deps sound nasty, is that even legal?
<jenatali> Queue A has a task, then queue B has a task which depends on it, then queue A has another task which depends on that one, and none of those have been flushed yet
<jenatali> Absolutely. I've seen it in retail apps
<karolherbst> uhh
<jenatali> Then you flush queue A and expect all 3 tasks to complete ;)
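That scenario, spelled out against the stock OpenCL host API (queue_a, queue_b and kern are assumed to exist; error handling omitted):

```c
/* Cross-queue dependency chain, none of it flushed yet. */
cl_event ev_a1, ev_b, ev_a2;
size_t gws = 64;

/* task 1 on queue A */
clEnqueueNDRangeKernel(queue_a, kern, 1, NULL, &gws, NULL, 0, NULL, &ev_a1);
/* task 2 on queue B depends on task 1 */
clEnqueueNDRangeKernel(queue_b, kern, 1, NULL, &gws, NULL, 1, &ev_a1, &ev_b);
/* task 3 on queue A depends on task 2 */
clEnqueueNDRangeKernel(queue_a, kern, 1, NULL, &gws, NULL, 1, &ev_b, &ev_a2);

/* Per the expectation described above, finishing queue A must complete
 * all three tasks, which drags queue B's task along with it. */
clFinish(queue_a);
```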
<karolherbst> I guess that's how you sync across devices
<karolherbst> implementing CL from scratch isn't all that fun because of those silly details :D
<jenatali> I dunno, I enjoyed it :)
<karolherbst> yeah well, you didn't learn a new language while doing it :p
<jenatali> True :P
<clever> anholt_: ah, i found the last major blocker, the shader code itself, was not aligned correctly!
<karolherbst> jenatali: but in theory I like the idea, I just have to see how much Rust likes me requiring shared mutable state :/
<karolherbst> although I suspect as long as the object itself stays immutable
<karolherbst> or well immutable enough
<karolherbst> and atomics are considered immutable
<jenatali> That was to solve a Photoshop problem IIRC
<karolherbst> ohh right..
<karolherbst> but I am so far away from actually running stuff atm :D
<karolherbst> I didn't even wire up the compilation stack
<karolherbst> write/read_buffer stuff is just the next thing I have to work on
<karolherbst> and that kind of requires having a plan for all the event stuff
<jenatali> Yeah, just pointing out you probably want to not write yourself into a corner with the design
<karolherbst> yeah...
<karolherbst> the stupid thing about CL is just, that like everything has to be thread safe except cl_kernel objects
<karolherbst> it's soo annoying
<jenatali> Yep
<karolherbst> although I think there is a little more
<jenatali> Hooray, I finally got an EGL implementation up and running on Windows :D
<karolherbst> yay
<jenatali> karolherbst: FWIW, what I ended up doing was having a platform-wide lock any time any thread is enqueueing work, since you can have not only cross-queue but also cross-device dependencies
<karolherbst> although I am wondering if I need this queue + worker thread architecture or if I can just call into the driver and let the driver figure those things out... mhh... but queue + threads do have the advantage that you could split work across multiple threads for an out-of-order queue
<jenatali> I only have one thread for the entire device
<karolherbst> what the hell is even "CL_QUEUE_SIZE" supposed to be
<karolherbst> "Specifies the size of the device queue in bytes?!?!?."
<jenatali> The benefit is that if you have multiple queues and flush them all, the only thing that needs to be synchronized is dumping those tasks into the ready task pool, and then the worker thread will pick up all of them
<jenatali> karolherbst: I think that's only for the on-device queues
<karolherbst> yeah...
<karolherbst> I guess there is some opaque size
<karolherbst> but..
<karolherbst> how can the application even make any sense out of it?
<karolherbst> so what if the on device queue is 5MiB big
<jenatali> Yeah I dunno how the app's supposed to figure out what takes up that space
heat has quit [Ping timeout: 480 seconds]
<jenatali> Guess it's just supposed to be captured variables or kernel args? I dunno
<karolherbst> I think I'd use a worker thread per queue actually as our architecture would allow this already
<karolherbst> but CPU overhead is not really a big concern with CL
<karolherbst> or is it?
<jenatali> Yeah
<jenatali> I thought about doing that, but it got complicated trying to think about cross-queue sync IIRC
<karolherbst> currently I have the PipeScreen on the cl_device_id object and I'd use a PipeContext per cl_command_queue.. I think that makes the most sense
<karolherbst> and cl_context is just this.. weird collection of devices
<jenatali> Yeah...
<jenatali> I don't remember the exact reason, but I essentially have a context per device as well, instead of a context per queue
<karolherbst> mhhh
<karolherbst> yeah well.. that makes threading impossible I guess :p
<karolherbst> but maybe a d3d12 context is more than I think it is
<karolherbst> guess it's easier if you have one worker per device also
<karolherbst> so no need for another context
<jenatali> D3D12 doesn't have contexts, but I'm using a helper library that does have contexts
<karolherbst> ahh
<karolherbst> clover has this tendency to do those copies on the CPU :/ it's very annoying if you just want to check out how clover implemented things :D
<jenatali> Those copies?
<karolherbst> writeBuffer e.g.
<karolherbst> that's done on the CPU
<jenatali> Ahh
<karolherbst> we do use hw copies, but for actual device to device copies
<karolherbst> a.k.a. copyBuffer
<karolherbst> but we do have the gallium interfaces for user data
<karolherbst> write_buffer -> pipe_context::buffer_subdata I guess
<karolherbst> which is doing a CPU copy :D
<karolherbst> but maybe that's fine.. dunno
<karolherbst> one could create a pipe_resource from user memory and use blit mhhh
<karolherbst> oh well...
<karolherbst> I guess I'll experiment with that
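For reference, the two gallium paths being weighed here. buffer_subdata, resource_from_user_memory and resource_copy_region are the real interfaces; the wrappers are a hypothetical sketch, and resource_from_user_memory is optional per driver:

```c
#include "pipe/p_context.h"
#include "pipe/p_screen.h"
#include "util/u_box.h"
#include "util/u_inlines.h"

/* Path 1: CPU copy into the resource (what clover effectively does). */
static void write_buffer_cpu(struct pipe_context *ctx,
                             struct pipe_resource *dst,
                             unsigned offset, unsigned size, const void *ptr)
{
    ctx->buffer_subdata(ctx, dst, PIPE_MAP_WRITE, offset, size, ptr);
}

/* Path 2: wrap the user pointer in a resource and let the GPU blit. */
static void write_buffer_gpu(struct pipe_context *ctx,
                             struct pipe_resource *dst,
                             unsigned offset, unsigned size, void *ptr)
{
    struct pipe_resource tmpl = {
        .target = PIPE_BUFFER,
        .format = PIPE_FORMAT_R8_UNORM,
        .width0 = size,
        .height0 = 1,
        .depth0 = 1,
        .array_size = 1,
    };
    /* not all drivers implement this hook */
    struct pipe_resource *src =
        ctx->screen->resource_from_user_memory(ctx->screen, &tmpl, ptr);
    struct pipe_box box;
    u_box_1d(0, size, &box);
    ctx->resource_copy_region(ctx, dst, 0, offset, 0, 0, src, 0, &box);
    pipe_resource_reference(&src, NULL);
}
```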
thelounge53 has quit []
thelounge53 has joined #dri-devel
rcf has joined #dri-devel
mlankhorst has joined #dri-devel
pekkari has quit [Quit: Konversation terminated!]
bcarvalho__ has joined #dri-devel
bcarvalho_ has quit [Ping timeout: 480 seconds]
luckyxxl has joined #dri-devel
rsalvaterra_ has joined #dri-devel
rsalvaterra has quit [Ping timeout: 480 seconds]
rcf has quit [Quit: WeeChat 3.1]
rcf has joined #dri-devel
rsalvaterra_ has quit []
rsalvaterra has joined #dri-devel
V has joined #dri-devel
rcf has quit []
rcf has joined #dri-devel
tobiasjakobi has joined #dri-devel
macromorgan is now known as Guest6316
macromorgan has joined #dri-devel
Guest6316 has quit [Remote host closed the connection]
tobiasjakobi has quit [Remote host closed the connection]
mlankhorst has quit [Ping timeout: 480 seconds]
yogesh_m1 has joined #dri-devel
yogesh_mohan has quit [Ping timeout: 480 seconds]
pnowack has joined #dri-devel
The_Company has joined #dri-devel
Ahuj has joined #dri-devel
Company has quit [Ping timeout: 480 seconds]
tobiasjakobi has joined #dri-devel
tobiasjakobi has quit [Remote host closed the connection]
<clever> how exactly does the translation from GLSL to NIR to vc4 work within mesa? how would i go about writing a util to do offline pre-compilation of shaders?
jernej_ is now known as jernej
<imirkin_> clever: use ARB_get_program_binary
<robclark> clever: I'd start with adding disk_cache support for vc4.. and then maybe build on that?
<clever> robclark: the other complication, is that i want to pre-compile vc4 shaders, on a host without vc4
<imirkin_> clever: if you don't have a vc4 host, then you're in for a lot more pain
<robclark> maybe drm-shim could help compiling on other host.. but in a lot of cases you actually have to run the game/whatever to get the actual shader variants used
<imirkin_> oh yeah, good point, drm-shim should make it possible
<clever> imirkin_: for more context, i'm driving the vc4 3d core directly, in baremetal
<imirkin_> how many shaders do you need? like 5 or 5000?
<clever> imirkin_: its more, that i dont want to deal with writing shaders by hand in raw asm, i want to learn a more useful thing like GLSL, and just compile it
<clever> no need to re-invent all of mesa
<imirkin_> i understand
<imirkin_> but
<imirkin_> is it going to be a fixed set of shaders
<clever> currently, i just need 2 fixed shaders
<imirkin_> do you want to be able to run arbitrary glsl things
<clever> fragment and vertex
<imirkin_> so then you could probably just dump the binary out "by hand" for them
<imirkin_> from mesa
<clever> but having the ability to do arbitrary glsl things would be a useful demo
<robclark> another idea is write an assembler for vc4?
<clever> robclark: already got one, but id still need to write the asm first
<imirkin_> clever: sure, but as long as you're OK with a manual step in between and it's not a ton of shaders...
<clever> robclark: https://github.com/cleverca22/gl/blob/master/texture.s is something i wrote years ago, to do fragment shading of a texture with alpha blending
<clever> imirkin_: i can use scripting to heavily automate that
<clever> i'm thinking i just need to force opengl into loading the vc4 drivers (on x86), then compile a shader, and then GetProgramBinary the compiled shader
<clever> and skip the hw init step
<imirkin_> yeah
<imirkin_> you can use drm-shim to make that happen
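The dump step itself is small — assuming a GL context on the vc4 driver (via drm-shim), something along these lines; note the binary mesa returns is a driver-defined serialized blob, so it may still need unwrapping before it's raw shader code:

```c
/* Sketch: link a program and dump its binary with ARB_get_program_binary.
 * Context/window-system setup and shader compilation are omitted. */
GLuint prog = glCreateProgram();
/* ... glAttachShader() the compiled vertex + fragment shaders ... */
glProgramParameteri(prog, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE);
glLinkProgram(prog);

GLint len = 0;
glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &len);

void *blob = malloc(len);
GLenum format;
GLsizei written;
glGetProgramBinary(prog, len, &written, &format, blob);

FILE *f = fopen("shader.bin", "wb");
fwrite(blob, 1, written, f);
fclose(f);
```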
<clever> this is the current state of my output
<clever> a single polygon and a single fragment shader, with 3 varyings per vertex, directly treating the varyings as RGB
<imirkin_> yeah, there's something very wrong
<imirkin_> oh, unless it's not a flat polygon?
<clever> its a triangle
<imirkin_> erm
<imirkin_> wrong term
<imirkin_> unless the triangle is angled relative to the viewport plane
<imirkin_> i.e. if it has "depth"
<clever> it shouldnt be
<clever> this code is generating the vertex and varying data
<imirkin_> that looks off then
<imirkin_> too much blue
<imirkin_> and the blue is weirdly invading the red along the edge
<clever> a lot of that banding is artifacts from the phone camera interacting with the phosphor grid in the crt
<imirkin_> yeah i'm not talking about that
<clever> it looks much smoother to my squishy eye
<imirkin_> not about smoothness. it's about proportions
<imirkin_> too much blue.
<clever> the coordinates are also not taking aspect ratio into account
<clever> the code assumes the pixels are square
<clever> but its a 720x480 canvas, on a 4:3 crt
<imirkin_> supposed to be an equal amount of red/green/blue
<imirkin_> i don't see that at all in the image
<clever> to my naked eye, there is a thin band of yellow in the top-center
<clever> i need to hook up the lcd tv as well, to get around all of those artifacts
<imirkin_> this is what the "unit" triangle is supposed to look like
<clever> or get hdmi init going
<clever> yeah, thats pretty close to what i have on my tv
<imirkin_> ok. that's not at all what it looks like in that pic
<clever> yeah
<imirkin_> in your pic, like 2/3 of it is blue
<clever> the cellphone camera made a horrid mess
<clever> i think the AWB in the phone, is trying to subtract green
<clever> because the picture is too green
<imirkin_> maybe
<clever> let me set a black bg, and re-take
<clever> hmmm, now its blinding the camera, and washed out
<clever> HDR mode?
<imirkin_> well, i mean i trust you if it looks right, then it looks right
<imirkin_> take a look at this program:
<imirkin_> it's designed to work with drm-shim
<imirkin_> you could easily add get binary support to that
<clever> ah, that sounds like a good starting place
<imirkin_> or you could add something in vc4 to dump stuff to stdout or whatever
<clever> i dug around a bit, and found that vc4_update_compiled_shaders is involved in generating the shader binary
<imirkin_> it was created for computing compilation stats on various drivers
<imirkin_> but you can repurpose it for your needs
<clever> nice
<imirkin_> yeah, i think vc4 works around various missing hw features by adding shader-based workarounds/implementations
<clever> also, just for the heck of it, i'm currently running the vc4 3d core, at 1.28mhz
<clever> currently, the rendering phase takes 98.521ms
<clever> some rough math, says that this frame then took 126,106 clock cycles to render
<clever> imirkin_: how does this one look?
<clever> heh, the red turned into an orange....
<clever> i should probably just give up on getting color accurate photos of a CRT
<imirkin_> :)
<clever> let me get one last image...
<clever> imirkin_: this is the exact same shader code, and nearly identical everything else, running under linux, poking the hardware thru /dev/mem, and rendering to hdmi
<clever> its also changing the vertex data, because spinning makes it better!
<imirkin_> looks fine
<imirkin_> spinning is always better
<clever> making it spin on baremetal is one of the next goals
<clever> that will involve deferring page-flips until vsync, and handling irq's
<clever> i also need to confirm if i even have a cos() and sin() implementation, lol
<clever> i'm currently cheating, -O let gcc call them at compile-time
gouchi has quit [Remote host closed the connection]
<imirkin_> cheating is always good.
<clever> until you change the angle, and it needs to compute at runtime, leading to linker errors
thelounge53 has quit []
<imirkin_> cheating can have downsides ;)
thelounge53 has joined #dri-devel
<clever> i'm also updating the display list on every vsync irq
<clever> i should instead be preparing one outside of irq, and only doing page-flip in irq
<clever> imirkin_: also, my randomly picked goal, is to just make it spin a tea-pot, lol
<imirkin_> glutTeapot()? :)
<clever> but on that angle, there are some questions...
<clever> -rw-r--r-- 1 clever users 461K Jul 2 03:26 'Utah_teapot_(solid).stl'
<clever> are there simpler ways to generate such a model, more in line with how SVG operates?
<clever> to compute the model at runtime, rather than adding nearly half a MB of binary data to the program
<imirkin_> i mean ... sure ... you could come up with a mathematical representation of the teapot and tessellate it?
<clever> was more asking, is such sample code already out in the wild?
gouchi has joined #dri-devel
<imirkin_> i'm not aware of it
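For what it's worth, the original Utah teapot is defined as 32 bicubic Bézier patches over roughly 300 control points, so a runtime tessellator only needs the control-point table plus an evaluator like this (hypothetical helper, Bernstein form):

```c
/* Evaluate one bicubic Bézier patch at (u,v); p[4][4] are control points.
 * Hypothetical helper -- the teapot dataset itself still has to be
 * embedded, but it's ~300 control points instead of half a megabyte of
 * pre-tessellated triangles. */
typedef struct { float x, y, z; } vec3;

static float bernstein3(int i, float t)
{
    const float s = 1.0f - t;
    const float c[4] = { s*s*s, 3*t*s*s, 3*t*t*s, t*t*t };
    return c[i];
}

static vec3 bezier_patch(const vec3 p[4][4], float u, float v)
{
    vec3 out = { 0, 0, 0 };
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float w = bernstein3(i, u) * bernstein3(j, v);
            out.x += w * p[i][j].x;
            out.y += w * p[i][j].y;
            out.z += w * p[i][j].z;
        }
    }
    return out;
}
```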
<clever> hmmm, and it looks like i have zero chance of doing full-frame 3d, with the ram off, lol
<clever> a single 720x480 RGBA8888 frame is 1.3mb
<imirkin_> RGB565 to the rescue?
<clever> 675kb then, but i only have 128kb of L2 cache to work with
<imirkin_> i hear AMD makes video cards
<bnieuwenhuizen> time for some L3?
<clever> i'm trying to push the rpi to its limits, with as little code as possible
<clever> and turning off random hw blocks, just to make the task more fun :P
<bnieuwenhuizen> if you're good at timing you might see if there is scanline info so you can generate parts of the image on demand?
<clever> bnieuwenhuizen: there is a current scanline field in the 2d composition hardware
<clever> and there is an h-sync irq
<clever> in theory, i could just configure the composition hardware to have 2 64 pixel high stripes of image data, repeating
<clever> and then render one stripe of 3d at a time
<clever> and if i race it fast enough, then i can keep ahead of the read pointer
<clever> treating it like a ring-buffer, kinda
<clever> so that would be 720x128 then
<clever> 180kb, still too big!
<bnieuwenhuizen> is 32 high not ok?
<clever> bnieuwenhuizen: the 3d core on the rpi generates image data in 64x64 tiles
<clever> basically, it can only render a 64x64 tile of the screen, and there is extra support logic, to schedule multiple times, and clip things
<bnieuwenhuizen> time for 4:2:0 then
<clever> i think enabling 4x multi-sampling reduces that to a 32x32 tile
<imirkin_> so maybe don't do that?
<clever> as-in, it renders 64x64, but then shrinks it down to 32x32 when it writes to ram
<imirkin_> it renders 64x64 samples probably
<bnieuwenhuizen> can't you do 32x32 in a compute shader btw?
<imirkin_> which corresponds to 32x32 pixels once resolved
<clever> bnieuwenhuizen: ive not even tried compute shaders yet
<clever> imirkin_: exactly
<bnieuwenhuizen> I hear software rasterizers in compute shader is all the norm these days
<imirkin_> with GL_EXT_multisample_render_to_texture or whatever it's called
rasterman has joined #dri-devel
<clever> imirkin_: i believe the 64x64 -> 32x32, is a dedicated hw step, when storing that tile back to ram
<imirkin_> clever: yes, a "resolve"
<imirkin_> but you can't just do that if you're meant to preserve the samples' info
<imirkin_> although vc4 might not need to support that given its GL level
<clever> in theory, i can just re-configure things, so the 32x32 tiles still fill the entire screen
<clever> but only render 2 strips of them, so 720x64
<clever> and use those 2 strips as A/B buffers, racing ahead of the scanline
<clever> that would get me down to 90kb of image data
<clever> which only leaves 38kb for code, lol
<clever> bnieuwenhuizen: but you mentioned 4:2:0, let me see what output formats i have...
<clever> i think the 3d core is limited to rgb only
<clever> it can either do bgr565 dithered, rgba8888, or bgr565
<clever> for now, i'll just turn the ram on, lol
jessica_24 has quit [Quit: Connection closed for inactivity]
<clever> imirkin_: i'm also entirely in the dark on how vertex shaders work, got any tutorial links handy?
<clever> all i really know, is that it converts the xyz to xy, and deals with projection, rotation, and translation
<imirkin_> clever: at the vc4 level (it gets more complicated with higher GL versions), it reads attributes and generates a gl_Position and other varyings to be passed to the fragment shader for interpolation
<clever> i implemented it all once before in qbasic, but that was a very crude algo, and not really what a gpu expects
<imirkin_> the gl_Position (+ viewport settings) drive rasterization details
<imirkin_> (rasterization is the process of determining which pixels on the pixel grid are covered by the triangle in question)
<clever> ah, i think vc4 calls that binning
<imirkin_> binning is the process of determining which tiles are covered
<clever> this code, is currently generating xyz coords, but with a flat z=1 across the whole process
<imirkin_> so that not every tile has to process every polygon
<imirkin_> with tiling architectures, you rerun the whole geometry for each tile
<clever> i assume that with vertex shaders, these coords are passed into the VS, which then emits xy back out, and that lands ... somewhere in ram
<imirkin_> but then if you know that some triangle is only on certain tiles, then you can skip rasterizing that triangle for the tiles where it's not
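Both steps boil down to the same signed-area edge test; a simplified sketch (screen space, ignoring fill rules and precision):

```c
/* Simplified rasterization coverage test: a point is inside the triangle
 * if it's on the same side of all three edges. */
static float edge(float ax, float ay, float bx, float by, float px, float py)
{
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax);
}

static int covered(const float v[3][2], float px, float py)
{
    float e0 = edge(v[0][0], v[0][1], v[1][0], v[1][1], px, py);
    float e1 = edge(v[1][0], v[1][1], v[2][0], v[2][1], px, py);
    float e2 = edge(v[2][0], v[2][1], v[0][0], v[0][1], px, py);
    /* Binning runs the same test conservatively against tile corners,
     * so tiles the triangle can't touch skip it entirely. */
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
           (e0 <= 0 && e1 <= 0 && e2 <= 0);
}
```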
<clever> yeah, during debug, i had to parse the "tile allocation data" by hand
<clever> and its just a compressed primitive index list
<imirkin_> the details of where vertex shader results are stored is very hw-specific
<clever> for what is in the tile
<clever> all kinds of fancy tricks, like using 4bit ints, relative to the last polygon
<imirkin_> (and i know nothing about vc4 specifically)
<clever> https://docs.broadcom.com/doc/12358545 has the vc4 docs, if you're interested in following along
<clever> page 78, there is "GL Shader state record", "NV Shader state record", and "VG shader state record"
<clever> currently, i'm using "NV Shader state record", (i believe NV == non-vertex), so it just runs the fragment shader directly on the vertex data i supplied
<imirkin_> can't say that i am
<clever> ahh, and right on the next page (81), shaded vertex format in memory!
<clever> ah yeah, that is what i have to supply, when in non-vertex shading mode
heat has joined #dri-devel
gouchi has quit [Remote host closed the connection]
<clever> and zero mention of non-shaded vertex data!
<clever> hmmm, when using a "GL Shader state record", you actually supply it with 3 shaders, fragment, vertex, and coordinate!
<clever> the fragment shader gets some code, a list of uniforms, and a varying count
<alyssa> imirkin_: lol
<alyssa> clever: That's normalish for tilers
<clever> the vertex shader gets some attribute arrays, code, and list of uniforms
<clever> the coordinate shader gets more attributes, code, and a list of uniforms
<alyssa> yes that is how coordinate shaders work
<clever> so each shader, has its own set of uniforms, and the vertex/coordinate shaders can each select a subset of the attributes
<clever> alyssa: hmmmm, is this maybe going to feed x/y/z in thru attributes, and then generate the fully shaded vertex data i already make (in non-gl software)?
<clever> alyssa: where could i find out more about coordinate shaders? google isnt giving any good hits
<alyssa> it uh
<alyssa> sounds like you already figured everything about them out
<alyssa> it's not a very complex concept as far as 3D goes :)
<alyssa> clever: re mesa layers of crap, that's horribly out of date
<clever> alyssa: this is the closest ive gotten to "vertex shading", find the x and y difference, divide by z difference to give perspective, no rotation at all, lol
<alyssa> the rest is just matrix math
<alyssa> have you taken a linear algebra class?
<alyssa> (first year uni, typically)
<clever> i never finished high school
<clever> but i did expand the pythagorean theorem from 2d to 3d easily enough
<alyssa> if you're serious about graphics, 100% recommend reading a book on linear algebra
<clever> some traces of that are in the code i just linked
<clever> but it was overflowing the poor ints in qbasic, lol
<alyssa> I don't think the math is any harder than what's done in high school, there's just more of it
<clever> and i was always ahead of what school taught, in random areas
<clever> reading thru https://learnopengl.com/Getting-started/Coordinate-Systems , at the perspective projection part
<clever> i now see why the vc4 hardware wants 1/w as a float
<clever> x * (1/w) is cheaper than x/w, if 1/w is pre-computed
<alyssa> Yep
<alyssa> Mali wants the same
<clever> and the vector compute core on the rpi (separate from shaders) lacks a division opcode
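A hypothetical plain-C version of that last transform stage — one reciprocal per vertex, multiplies everywhere else (per the docs the real vc4 shaded format packs Xs/Ys as 12.4 fixed point; floats here for clarity):

```c
/* After the MVP multiply you have clip-space (x, y, z, w); the hardware
 * wants screen x/y plus a precomputed 1/w so w-correct interpolation
 * needs only multiplies -- no divider, as noted above for the VPU. */
struct shaded_vertex {
    float xs, ys;   /* screen-space position */
    float zs;       /* depth */
    float wc_recip; /* 1.0 / clip-space w */
};

static struct shaded_vertex project(float x, float y, float z, float w,
                                    float half_width, float half_height)
{
    struct shaded_vertex out;
    out.wc_recip = 1.0f / w;            /* the one divide per vertex... */
    out.xs = half_width  * (x * out.wc_recip + 1.0f);  /* ...then only */
    out.ys = half_height * (y * out.wc_recip + 1.0f);  /* multiplies   */
    out.zs = z * out.wc_recip;
    return out;
}
```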
<alyssa> you're the librepi person? :'o
<clever> yeah
<alyssa> do I.. do I still have any code left in your tree? Lol
<clever> what part did you work on?
<alyssa> this line looks like my doing :-p
<clever> ah, i see your commits in the git log
<clever> there are some design limits in rpi-open-firmware, that ive been wanting to get around
<clever> so ive been re-doing everything under LK
<clever> alyssa: that part of the code is now over here
<clever> but something is wrong with it, and linux never prints a single byte
<alyssa> 🤷
<alyssa> I've garbage collected videocore, sorry :(
<clever> emoji also dont render on this irc client
<clever> and that bug, is basically outside the VC4 area
<clever> its all arm->arm
<clever> i just need to wire jtag up again, and dump it
<alyssa> what are all these htonl calls?
<clever> byte order swaps
<alyssa> wwwwhy?
<clever> device-tree is big-endian
<clever> so i must write BE to those fields, or it just wont work
<alyssa> but... doesn't libfdt take care of that?
<clever> fdt_setprop just expects a blob of binary data and a size
<alyssa> right, ok
<clever> this property, is an array of 32bit values
<clever> and it can be 64bit values, but those are encoded as 2 32bit values
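libfdt's typed helpers do handle the swap; it's only the raw fdt_setprop() path that needs explicit conversion. A sketch with hypothetical property names, and fdt/node/base/value assumed from context:

```c
#include <libfdt.h>
#include <stdint.h>

/* A 64-bit value written as two big-endian 32-bit cells, the way the
 * raw fdt_setprop() path requires. */
uint32_t cells[2] = {
    cpu_to_fdt32(base >> 32),         /* high cell first */
    cpu_to_fdt32(base & 0xffffffff),  /* then the low cell */
};
fdt_setprop(fdt, node, "example,prop", cells, sizeof(cells));

/* For a single u32 property, the typed helper swaps for you: */
fdt_setprop_u32(fdt, node, "example,other-prop", value);
```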
<clever> i suspect half of the problem is changes in LK, since arch_chain_load was last used on arm
<clever> it needs to first disable the MMU and flush the caches
heat has quit [Remote host closed the connection]
<alyssa> I also, errr, don't understand why you're driving the v3d in a bootloader?
<clever> but when you disable the MMU, it starts treating PC as physical, not virtual
<clever> alyssa: the power gating to v3d has to be flipped on, before linux even has a chance of driving that hw with the existing drivers
<clever> and i wanted to confirm the 3d is fully working, before i bother testing it in linux
<clever> there is already a problem in the 2d area, that this would have helped with, kinda
<clever> if the arm core, writes even a single HVS register (the 2d core), it gets an async external abort
<clever> but, having a full demo (https://youtu.be/u7DzPvkzEGA) that shows the HVS is fully powered on and working, when driven by the VPU
<clever> that prooves its not power gating that is at fault
<alyssa> still seems like reusing mesa would be a lot easier than trying to open code the demo
<alyssa> (doable? easily, have done it for every gpu i've brought up. useful when you already have a perfectly cromulent open driver? I dunno)
<clever> alyssa: can mesa run on an entirely new arch, that it has never been cross-compiled to before, without a kernel?
<clever> also, the code i ported to LK, is the original hackdriver code, from before mesa even had vc4 support
Duke`` has quit [Ping timeout: 480 seconds]
<alyssa> hah! nice :-)
<clever> i now have 4 different graphics demos
luckyxxl has quit []
<clever> https://www.youtube.com/watch?v=V6ogpgieJrQ this would be hackdriver, running under linux, poking the v3d via /dev/mem
<clever> back in 2014!
<clever> https://www.youtube.com/watch?v=JFmCin3EJIs this would be LK, running as a baremetal arm kernel, poking the HVS config, but it relied on the blobs to init the hdmi hw
<alyssa> phire's code?
<clever> a fork of it
<clever> i made it spin, and put some of it in kernel during the later stages
<alyssa> what memories!
<clever> https://www.youtube.com/watch?v=u7DzPvkzEGA this is now baremetal VPU LK, configuring the HVS and VEC (ntsc generator) from scratch, with zero help from the blobs
<alyssa> (also wild to read the demo code after reading so much beautiful vc4/v3d code in linux+mesa)
<clever> alyssa: and then this is the latest demo, hackdriver, now running on the VPU, and doing power gating enable
<clever> alyssa: the hardest part of porting it, was this chunk to enable power to the graphical sub-subsystems (hvs isnt graphical??) and figuring out alignments that the docs didnt specify
<clever> alyssa: and now i want to make the VPU spin a teapot, just because i can :P
<alyssa> because I can, I can get behind that!
Ahuj has quit [Ping timeout: 480 seconds]
<alyssa> Good luck :-)
<clever> do you remember if the vc4 accepts raster textures?
<clever> there it is, chapter 4 of https://docs.broadcom.com/doc/12358545 ....
<clever> looks like it must be LT or T format
<clever> yeah, looks like that ruins my plans of putting /dev/fb0 on the side of a spinning cube, lol
<clever> something would need to do a (semi-costly) linear->t-format conversion, on every frame
<clever> alyssa: how crazy would it have been, to have the "dumb" framebuffer linux is using, wrapped on a teapot, without linux even being aware of that? :P
<clever> hmmmm, i can kinda picture, how i might use the VPU's vector core, to do that translation....
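A simplified picture of what that per-frame conversion costs — a generic linear-to-microtile copy. The real T-format additionally groups 64-byte microtiles into 1 KiB subtiles and 4 KiB tiles with alternating row direction, so treat this as an illustration of the cost, not a working converter:

```c
/* Linear -> tiled copy at the microtile level only: 4x4 pixel microtiles
 * at 32bpp (64 bytes each). Every pixel is read and written once per
 * frame, which is where the "semi-costly" comes from. */
#include <stdint.h>

static void linear_to_microtiles(uint32_t *dst, const uint32_t *src,
                                 unsigned width, unsigned height)
{
    for (unsigned ty = 0; ty < height / 4; ty++)
        for (unsigned tx = 0; tx < width / 4; tx++)
            for (unsigned y = 0; y < 4; y++)
                for (unsigned x = 0; x < 4; x++)
                    *dst++ = src[(ty * 4 + y) * width + (tx * 4 + x)];
}
```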
Ahuj has joined #dri-devel
<clever> alyssa: but back to what i was saying, i have repeatedly asked the engineers for some tips, on why the arm core cant touch the HVS
<clever> alyssa: until that is fixed, none of the existing mesa drivers will work
<clever> alyssa: they have all claimed its something like power gating, but given that i can drive it to do all of these things, its definitely not gated!
cleverca22[m] has joined #dri-devel
Ahuj has quit [Ping timeout: 480 seconds]
rasterman has quit [Quit: Gettin' stinky!]
iive has quit []
<pinchartl> is there a good free software graphics library that provides an image object that can support multi-planar formats (such as NV12) and provide CPU access to pixels ? I'm thinking about the same level of abstraction as pixman_image_t or the OpenCV Mat
<alyssa> pinchartl: opencv doesn't cut it?
<pinchartl> not entirely sure. I think it can support conversion from multi-planar formats to the native BGR format
<pinchartl> I'm asking because I'm trying to figure out how to design an image/surface class in libcamera
<pinchartl> we have a FrameBuffer class that represents a multi-planar frame buffer with dmabufs, offsets and strides
<pinchartl> so that's very similar to the DRM model
<pinchartl> it's all nice, works well, but doesn't provide CPU access
<pinchartl> and it's not nice to give such a frame buffer to applications and tell them to figure out how to map it
<clever> pinchartl: the same libcamera used by the rpi?
<alyssa> oh, I see the problem. right..
<pinchartl> as there are nasty things to take into account, such as page alignment when calling mmap(), discontiguous planes using different dmabufs vs. contiguous planes using the same dmabufs + offsets, ...
<alyssa> yeah
<pinchartl> so I'm trying to implement a good helper for that
<pinchartl> it's not very difficult
<pinchartl> but it requires a class to model the "mapped framebuffer"
<clever> > (*) If you divide your image into 128 column wide strips with both the luma and respective U/V (NV12) interleaved chroma, and then glue these strips together end on end, that's about right.
<pinchartl> that one is nasty to design, as I can very easily see it turning into a full-fledged image processing API
<clever> pinchartl: also, the h264 encoder needs an even more weird planar format
<pinchartl> and I don't want that :-)
<pinchartl> clever: yes, same libcamera
<alyssa> Do you need something more sophistciated than `struct { void *plane0, *plane1; unsigned stride0, stride1; } map_cpu(SnazzyNV12Framebuffer)`..?
<clever> pinchartl: i think what its doing, is its cutting the entire image up, into 128 pixel wide strips, and each strip is then in 3 planes
<clever> so you have plane-1a, plane-2a, plane-3a, plane-1b, plane-2b, plane-3b
<clever> where 1/2/3 are the color planes, and a/b/c are the 128 pixel wide strips of the image
<alyssa> mh
<alyssa> if the images are nonlinear in memory I'm not sure there is /any/ sane API to do a CPU mapping
<clever> then the h264 accel needs it in that whack format
<pinchartl> alyssa: possibly not. my problem is that I'm trying to draw the line at the right place between a very ad-hoc solution that will not be very flexible, and yet another 2D image processing library that I have no time to develop. I was thus wondering if there was an existing image/surface implementation that I could use as a model
<clever> and currently, no api can deal with that
<alyssa> pinchartl: Okay. First question, are the images linear in memory then?
<clever> so the ISP just converts for you, and lets you use more normal formats
<pinchartl> clever: I don't have that issue on the camera side
<clever> (the source for what i said)
<pinchartl> NV12 is normal NV12 there
<alyssa> Then IMO don't fix what isn't broken. Just make a two-plane map() function that hides the dma-buf ugliness and call it a day.
<pinchartl> alyssa: on all the devices I have to support now, yes. the planes can be contiguous or disjoint, but within a plane, it's linear
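Roughly the helper being suggested, with the page-alignment wrinkle handled explicitly; the types and the surrounding buffer model are hypothetical, only the mmap() semantics are real:

```c
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct mapped_nv12 {
    void *plane[2];      /* CPU pointers to Y and interleaved UV */
    unsigned stride[2];
    void *base[2];       /* actual mmap() bases, kept for munmap() */
    size_t maplen[2];
};

/* Map one plane of a dmabuf; works for both disjoint fds and a single
 * contiguous fd with per-plane offsets. */
static int map_plane(struct mapped_nv12 *m, int i, int dmabuf_fd,
                     off_t offset, size_t len, unsigned stride)
{
    off_t pg = sysconf(_SC_PAGESIZE);
    off_t aligned = offset & ~(pg - 1);  /* mmap offsets must be page-aligned */
    size_t slack = offset - aligned;

    void *p = mmap(NULL, len + slack, PROT_READ | PROT_WRITE,
                   MAP_SHARED, dmabuf_fd, aligned);
    if (p == MAP_FAILED)
        return -1;

    m->base[i] = p;
    m->maplen[i] = len + slack;
    m->plane[i] = (uint8_t *)p + slack;  /* real start of the plane */
    m->stride[i] = stride;
    return 0;
}
```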
<clever> pinchartl: yeah, i believe the ISP does bayer to normal planar, and if you want to later h264 encode, it passes it thru the ISP again, to shuffle the planar format up
<pinchartl> I wouldn't fully rule out tiled formats in the future, although that's more common on the display side than the camera side
<clever> pinchartl: so because you're not dealing with that shuffled planar, the ISP has to do an extra copy, if you want to generate h264 streams
<alyssa> (And if it were tiled or something, IMO better off just blitting from a linear staging resource and letting the user pretend it's linear, unless you have pressing perf issues. )
<clever> and broadcom wont let RPF release the docs for the ISP
<alyssa> (This is how it works in Mesa. see `transfer_map()` prototype)
<alyssa> (which works seamless for linear, tiled or compressed by either directly mapping, staging and CPU tiling, or staging and GPU blitting)
<alyssa> that's good enough for OpenGL, dunno if it's good enough for cameras
pendingchaos has quit [Ping timeout: 480 seconds]
<pinchartl> if I ever have to deal with tiled formats, I'll want to map them through a tiler device to have a linear CPU view. how to do the mapping will be an interesting question, that calls for a system-wide graphics memory management library I think. definitely out of scope for libcamera, but I'll use it once it will exist :-)
<clever> pinchartl: weird, why does #libcamera require ssl to join?
<pinchartl> clever: anti-spam measure, we got spam in the beginning from non-ssl connections, so it was a cheap measure to fight against it
<clever> ah
<pinchartl> and everybody should encrypt their connection anyway :-)
<pinchartl> maybe the spammers went away, it was during the freenode to OFTC transition
<clever> i still havent bothered setting up ssl on this client
<clever> yeah, that was a huge mess
<pinchartl> I didn't try to disable it to check if they were still there :-)
<pinchartl> alyssa: thanks for the advice
<clever> it does seem to have died off now
<pinchartl> I really fear that at some point I'll need a 2D image processing library that can work natively on YUV planar formats
<pinchartl> when that day comes, I'll cry and then do something about it
<pinchartl> but that's not for today
<clever> pinchartl: i believe the ISP is capable of transforming to/from tiled formats, and also between yuv and rgb
<clever> and of course, bayer is a valid input too
<pinchartl> not sure about the tiling part, for the rest, sure
<clever> i have been pushing to get more open-source everything on the rpi
<pinchartl> I'll happily let the RPi developers handle tiling
<clever> pinchartl: https://i.imgur.com/2Fxr0U3.jpg is my most recent feat, the 3d core is now running, without a single blob involved
<pinchartl> their camera team is very supportive
<clever> start.elf is fully open source on this demo
<pinchartl> and the camera implementation does use a closed-source firmware, but it's really a thin glue layer now, Linux has direct control of the ISP
<clever> from what i heard last, linux is only in control of the unicam (csi input)
<clever> the ISP is under control of start(4).elf, and linux is just issuing commands to the blob over an RPC
<clever> and then depending on which kms overlay you load, different parties are involved in the 2d output
<pinchartl> Linux has direct access to the unicam hardware. for the ISP, it does go through the firmware, but the implementation is much better today
<pinchartl> the firmware used to implement full control of the ISP, exposing a high-level API
<pinchartl> now it's a thin glue layer, the ISP features are exposed directly to Linux, and libcamera controls the ISP
<clever> at that point, what do they have left as a secret?
<clever> why not just have proper linux drivers?
lemonzest has quit [Quit: WeeChat 3.2]
<pinchartl> I don't have all the details, but I think it's partly due to the control of the graphics side. as the VC4 was traditionally controlled from the firmware for display (this is changing too), power management is implemented in the firmware as far as I understand
<clever> with the work ive been doing, i can boot linux on a pi2 or pi3, without any blobs involved at any point
<pinchartl> but feature-wise, the firmware now exposes the ISP features to Linux
<clever> in theory, getting the unicam to work under that, is just a matter of adding power-management code around the unicam block
<clever> but if you want the ISP to work, you're currently out of options
<pinchartl> unicam is fully controlled by Linux already as far as I can tell, including PM
<clever> the real kms drivers, dont cover PM
<clever> they assume a lot of the hardware has been pre-initialized
<clever> and there is still a magic handshake missing, that stops the arm from touching the HVS, even when its fully working
<pinchartl> while it may be nice to avoid the closed firmware to use the ISP, I'm already happy that the full set of ISP features is now exposed to Linux
<clever> and having that full feature set, does at least make RE'ing the isp simpler
<pinchartl> there's no closed-source part of the camera algorithms anymore
<clever> it gives a much better idea of what its doing behind the curtain, and lets you tweak knobs, and see what registers change
<pinchartl> which is really great
<pinchartl> RPi has done a really good job open-sourcing the camera stack
<pinchartl> and when it comes to the thin firmware that is left
<pinchartl> I don't think they would mind dropping that
<clever> something i mentioned in this channel earlier, is that there are only 2 hw blocks on a pi2/pi3, that you may want to use, and still lack drivers
<clever> the h264/mpeg/vc1 accel block
<clever> and the isp
<pinchartl> but I don't see it as a priority feature-wise, and it's something that can be done in the background, it's mostly the ISP kernel driver that will need to change, the rest of the stack shouldn't be too affected
<clever> once those get docs, or a loadable blob, the open firmware could be a viable feature-complete replacement
<pinchartl> we don't support the pi2 though
<pinchartl> only the pi3 and pi4
<clever> are you just ignoring arm32 support?
<pinchartl> no, it's not about arm32, just about different ISP generations
<clever> ahh
<pinchartl> with limited resources, you have to draw the line somewhere :-)
<clever> i wasnt aware the isp changed between pi2 and pi3
<pinchartl> I don't know all the details
<clever> my understanding, was that the pi0-pi3 lineup, is essentially identical, if you dont look at the arm block
<pinchartl> the RPi camera team open-sourced their stack and work with us to integrate it in libcamera. I'm really grateful for that, they've been very supportive
<pinchartl> and focussing on pi3 and pi4 makes sense with limited resources
<clever> yeah
<clever> in my case, pi2 and pi3 support, was more of an accident, because i got stuck on one bug, and just changed target for a bit
<clever> the pi3 support was failing hard, because i didnt enable certain arm L2 access permissions
<pinchartl> my earlier question about an image class was actually related to this, I'm working on fixing a regression in libcamera that RPi has found to break everything :-)
<clever> so linux couldnt flush the L2 cache, and things literally became an incoherent mess
<pinchartl> I want to fix it before Monday
<clever> before i had figured that out, i gave up, and switched to the pi2
<clever> but that, failed because i hadnt enabled SMP support, and the mutex opcodes were illegal
<clever> by pure chance, the same control register flag, fixed both of those bugs, lol
clever has quit [Quit: Changing server]
clever has joined #dri-devel
pendingchaos has joined #dri-devel
pendingchaos has quit []
pendingchaos has joined #dri-devel
pcercuei has quit [Quit: dodo]
tzimmermann__ has joined #dri-devel