#dri-devel on 2021-09-04 — irc logs at oftc.irclog.whitequark.org

2021-07-26 22:56 ChanServ changed the topic of #dri-devel to: <ajax> nothing involved with X should ever be unable to find a bar

00:03 tzimmermann has quit [Ping timeout: 480 seconds]

00:15 jewins has quit [Ping timeout: 480 seconds]

00:16 pcercuei has quit [Quit: dodo]

00:48 columbarius has joined #dri-devel

00:50 co1umbarius has quit [Ping timeout: 480 seconds]

01:08 nchery has quit [Quit: Leaving]

01:12 danvet has quit [Ping timeout: 480 seconds]

01:20 Emantor_ has quit []

01:20 sdutt has quit [Ping timeout: 480 seconds]

01:20 Emantor has joined #dri-devel

01:26 <karolherbst> wow.. rust macros are insane :O

01:28 oneforall2 has quit [Quit: Leaving]

01:29 oneforall2 has joined #dri-devel

01:31 <clever> anholt_: did you work on the v3d end of the pi much, or just the 2d/hvs end?

03:22 imre has quit [Remote host closed the connection]

03:36 heat has quit [Ping timeout: 480 seconds]

03:36 YuGiOhJCJ has joined #dri-devel

03:43 mattrope has quit [Quit: Leaving]

05:42 lemonzest has joined #dri-devel

05:49 Duke`` has joined #dri-devel

06:16 i-garrison has quit []

06:18 i-garrison has joined #dri-devel

06:38 alanc has quit [Remote host closed the connection]

06:38 alanc has joined #dri-devel

07:03 Daanct12 has joined #dri-devel

07:08 Daaanct12 has joined #dri-devel

07:08 Danct12 has quit [Remote host closed the connection]

07:13 rsalvaterra has quit [Quit: Leaving...]

07:14 danvet has joined #dri-devel

07:14 Daanct12 has quit [Ping timeout: 480 seconds]

07:31 rasterman has joined #dri-devel

07:52 frieder has joined #dri-devel

07:53 frieder has quit []

07:56 rasterman has quit [Quit: Gettin' stinky!]

08:17 rsalvaterra has joined #dri-devel

08:25 Tooniis[m] has quit []

08:25 Tooniis[m] has joined #dri-devel

08:27 Tooniis[m] has quit []

08:27 Tooniis[m] has joined #dri-devel

08:28 gouchi has joined #dri-devel

08:44 pcercuei has joined #dri-devel

08:54 alatiera has quit [Quit: The Lounge - https://thelounge.chat]

08:57 pekkari has joined #dri-devel

09:15 alatiera has joined #dri-devel

09:24 achrisan has quit []

09:37 Ahuj has joined #dri-devel

10:06 Hi-Angel has joined #dri-devel

10:26 pnowack has joined #dri-devel

10:52 mlankhorst has joined #dri-devel

10:59 Hi-Angel has quit [Quit: Konversation terminated!]

10:59 Hi-Angel has joined #dri-devel

12:09 rcf has quit [Ping timeout: 480 seconds]

12:50 alatiera has quit [Quit: Ping timeout (120 seconds)]

12:51 alatiera has joined #dri-devel

12:51 alatiera is now known as Guest6297

12:55 Ahuj has quit [Ping timeout: 480 seconds]

13:17 iive has joined #dri-devel

13:25 pnowack has quit [Quit: pnowack]

13:36 mlankhorst has quit [Ping timeout: 480 seconds]

13:45 Adrinael_ has joined #dri-devel

13:45 Adrinael has quit [Read error: Connection reset by peer]

13:45 YuGiOhJCJ has quit [Quit: YuGiOhJCJ]

13:46 Company has joined #dri-devel

13:47 NiksDev has joined #dri-devel

13:48 heat has joined #dri-devel

13:49 Adrinael_ has quit []

13:49 Adrinael has joined #dri-devel

13:50 Guest6297 has quit []

13:50 thelounge53 has joined #dri-devel

14:02 <karolherbst> jenatali: how are you managing the CL queue in your OpenCL impl?

14:02 <jenatali> karolherbst: What do you mean?

14:03 <karolherbst> jenatali: like.. the cl_command_queue thing

14:03 <jenatali> Yeah but what do you mean by managing?

14:03 <karolherbst> in clover we are chaining the cl_event objects I think, but I was under the impression you had a different solution there

14:04 <jenatali> Every command creates a "task," and if the caller wants an event, they get a pointer to the task

14:04 <jenatali> The tasks sit in a queue until the queue is flushed, at which point the tasks are drained into a task pool

14:05 <jenatali> The task pool has a worker thread which picks up any ready tasks, turns them into D3D commands, and executes them

14:05 <karolherbst> mhhhh

14:05 <jenatali> By default, queues are in-order, so each task has a dependency on the task before it in the queue

14:05 NiksDev has quit [Ping timeout: 480 seconds]

14:06 <jenatali> It's really inefficient right now because tasks aren't considered ready until the previous task has finished executing, in an ideal world I'd walk the dependency chain and batch together all tasks that would be made ready

14:06 <jenatali> But that read like a violation of the spec to me so I played it safe to start

14:07 <karolherbst> well with gallium we can just keep pushing work into the driver

14:07 <karolherbst> so I don't really care all that much about this part

14:07 <karolherbst> as I'd rely on the driver to block

14:07 <karolherbst> (once a hw queue or whatever gets too full)

14:08 <jenatali> Yeah my main concern was about having the CPU report event A as done before event B as ready, if B depended on A

14:08 <karolherbst> mhh

14:08 <karolherbst> at least with nouveau it's all in order

14:09 <karolherbst> I think I'd map a cl_command_queue to a gallium context as clover is doing it already and rely on its properties

14:09 <karolherbst> and that is very much in order and single threaded afaik

14:09 <jenatali> That gets tricky when you get cross-queue dependencies though

14:09 <karolherbst> we do have fence objects but yeah...

14:10 <karolherbst> cross queue deps sound nasty, that's even legal?

14:10 <jenatali> Queue A has a task, then queue B has a task which depends on it, then queue A has another task which depends on that one, and none of those have been flushed yet

14:10 <jenatali> Absolutely. I've seen it in retail apps

14:10 <karolherbst> uhh

14:10 <jenatali> Then you flush queue A and expect all 3 tasks to complete ;)

14:11 <karolherbst> I guess that's how you sync across devices

14:12 <karolherbst> implementing CL from scratch isn't all that fun because of those silly details :D

14:12 <jenatali> I dunno, I enjoyed it :)

14:13 <karolherbst> yeah well, you didn't learn a new language while doing it :p

14:13 <jenatali> True :P

14:13 <clever> anholt_: ah, i found the last major blocker, the shader code itself, was not aligned correctly!

14:14 <karolherbst> jenatali: but in theory I like the idea, I just have to see how much Rust likes me to require shared mutable state :/

14:14 <karolherbst> although I suspect as long as the object itself stay immutable

14:14 <karolherbst> or well immutable enough

14:14 <karolherbst> and atomics are considered immutable

14:16 <jenatali> karolherbst: https://github.com/microsoft/OpenCLOn12/commit/62a34f6d3bf3e183f7b05140b3e220436b856ef9

14:16 <jenatali> That was to solve a Photoshop problem IIRC

14:16 <karolherbst> ohh right..

14:17 <karolherbst> but I am so far away from actually running stuff atm :D

14:17 <karolherbst> I didn't even wired up the compilation stack

14:17 <karolherbst> *wire

14:18 <karolherbst> write/read_buffer stuff is just the things I have to work on next

14:18 <karolherbst> and that kind of requires having a plan for all the event stuff

14:21 <jenatali> Yeah, just pointing out you probably want to not write yourself into a corner with the design

14:25 <karolherbst> yeah...

14:25 <karolherbst> the stupid thing about CL is just, that like everything has to be thread safe except cl_kernel objects

14:25 <karolherbst> it's soo annoying

14:25 <jenatali> Yep

14:26 <karolherbst> although I think there is a little more

14:27 <jenatali> Hooray, I finally got an EGL implementation up and running on Windows :D

14:27 <karolherbst> yay

14:29 <jenatali> karolherbst: FWIW, what I ended up doing was having a platform-wide lock any time any thread is enqueueing work, since you can have not only cross-queue but also cross-device dependencies

14:30 <karolherbst> although I am wondering if I need this queue + worker thread architecture or if I can just call into the driver and let the driver figure those things out... mhh... but queue + threads do have the advantage, that you could split work across multiple threads for an out of order queue

14:30 <jenatali> I only have one thread for the entire device

14:30 <karolherbst> what the hell is even "CL_QUEUE_SIZE" supposed to be

14:31 <karolherbst> "Specifies the size of the device queue in bytes?!?!?."

14:31 <jenatali> The benefit is that if you have multiple queues and flush them all, the only thing that needs to be synchronized is dumping those tasks into the ready task pool, and then the worker thread will pick up all of them

14:31 <jenatali> karolherbst: I think that's only for the on-device queues

14:31 <karolherbst> yeah...

14:31 <karolherbst> I guess there is some opaque size

14:31 <karolherbst> but..

14:32 <karolherbst> how can the application even make any sense out of it?

14:32 <karolherbst> so what if the on device queue is 5MiB big

14:34 <jenatali> Yeah I dunno how the app's supposed to figure out what takes up that space

14:35 heat has quit [Ping timeout: 480 seconds]

14:38 <jenatali> Guess it's just supposed to be captured variables or kernel args? I dunno

14:38 <karolherbst> I think I'd use a worker thread per queue actually as our architecture would allow this already

14:38 <karolherbst> but CPU overhead is not really a big concern with CL

14:38 <karolherbst> or is it?

14:38 <jenatali> Yeah

14:39 <jenatali> I thought about doing that, but it got complicated trying to think about cross-queue sync IIRC

14:39 <karolherbst> currently I have the PipeScreen on the cl_device_id object and I'd use a PipeContext per cl_command_queue.. I think that makes the most sense

14:39 <karolherbst> and cl_context is just this.. weird collection of devices

14:40 <jenatali> Yeah...

14:40 <jenatali> I don't remember the exact reason, but I essentially have a context per device as well, instead of a context per queue

14:41 <karolherbst> mhhh

14:41 <karolherbst> yeah well.. that makes threading impossible I guess :p

14:41 <karolherbst> but maybe a d3d12 context is more than I think it is

14:42 <karolherbst> guess it's easier if you have one worker per device also

14:42 <karolherbst> so no need for another context

14:42 <jenatali> D3D12 doesn't have contexts, but I'm using a helper library that does have contexts

14:42 <karolherbst> ahh

14:43 <karolherbst> clover has this tendency to do those copies on the CPU :/ it's very annoying if you just want to check out how clover implemented things :D

14:44 <jenatali> Those copies?

14:44 <karolherbst> writeBuffer e.g.

14:44 <karolherbst> thta's done on the CPU

14:44 <jenatali> Ahh

14:44 <karolherbst> we do use hw copies, but for actual device to device copies

14:45 <karolherbst> a.k.a. copyBuffer

14:45 <karolherbst> but we do have the gallium interfaces for user data

14:50 <karolherbst> write_buffer -> pipe_context::buffer_subdata I guess

14:51 <karolherbst> which is doing a CPU copy :D

14:52 <karolherbst> but maybe that's fine.. dunno

14:53 <karolherbst> one could create a pipe_resource from user memory and use blit mhhh

14:53 <karolherbst> oh well...

14:53 <karolherbst> I guess I'll experiment with that

14:55 thelounge53 has quit []

14:55 thelounge53 has joined #dri-devel

15:09 rcf has joined #dri-devel

15:10 mlankhorst has joined #dri-devel

15:16 pekkari has quit [Quit: Konversation terminated!]

16:00 bcarvalho__ has joined #dri-devel

16:06 bcarvalho_ has quit [Ping timeout: 480 seconds]

16:22 luckyxxl has joined #dri-devel

16:55 rsalvaterra_ has joined #dri-devel

17:01 rsalvaterra has quit [Ping timeout: 480 seconds]

17:13 rcf has quit [Quit: WeeChat 3.1]

17:14 rcf has joined #dri-devel

17:14 rsalvaterra_ has quit []

17:14 rsalvaterra has joined #dri-devel

17:16 V has joined #dri-devel

17:18 rcf has quit []

17:20 rcf has joined #dri-devel

17:27 tobiasjakobi has joined #dri-devel

17:27 macromorgan is now known as Guest6316

17:27 macromorgan has joined #dri-devel

17:28 Guest6316 has quit [Remote host closed the connection]

17:29 tobiasjakobi has quit [Remote host closed the connection]

18:02 mlankhorst has quit [Ping timeout: 480 seconds]

18:10 yogesh_m1 has joined #dri-devel

18:14 yogesh_mohan has quit [Ping timeout: 480 seconds]

18:49 pnowack has joined #dri-devel

18:50 The_Company has joined #dri-devel

18:52 Ahuj has joined #dri-devel

18:54 Company has quit [Ping timeout: 480 seconds]

19:13 tobiasjakobi has joined #dri-devel

19:13 tobiasjakobi has quit [Remote host closed the connection]

19:21 <clever> https://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Mesa_layers_of_crap_2016.svg/800px-Mesa_layers_of_crap_2016.svg.png

19:21 <clever> how exactly does the translation from GLSL to NIR to vc4 work within mesa? how would i go about writing a util to do offline pre-compiliation of shaders?

19:27 jernej_ is now known as jernej

19:31 <imirkin_> clever: use ARB_get_program_binary

19:31 <robclark> clever: I'd start with adding disk_cache support for vc4.. and then maybe build on that?

19:32 <clever> robclark: the other complication, is that i want to pre-compile vc4 shaders, on a host without vc4

19:33 <clever> imirkin_: reading this... https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_get_program_binary.txt

19:34 <imirkin_> clever: if you don't have a vc4 host, then you're in for a lot more pain

19:34 <robclark> maybe drm-shim could help compiling on other host.. but in a lot of cases you actually have to run the game/whatever to get the actual shader variants used

19:34 <imirkin_> oh yeah, good point, drm-shim should make it possible

19:34 <clever> imirkin_: for more context, i'm driving the vc4 3d core directly, in baremetal

19:34 <imirkin_> how many shaders do you need? like 5 or 5000?

19:35 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L137-L147

19:35 <clever> imirkin_: its more, that i dont want to deal with writing shaders by hand in raw asm, i want to learn a more useful thing like GLSL, and just compile it

19:35 <clever> no need to re-invent all of mesa

19:35 <imirkin_> i understand

19:35 <imirkin_> but

19:35 <imirkin_> is it going to be a fixed set of shaders

19:35 <clever> currently, i just need 2 fixed shaders

19:35 <imirkin_> do you want to be able to run arbitrary glsl things

19:36 <clever> fragment and vertex

19:36 <imirkin_> so then you could probably just dump the binary out "by hand" for them

19:36 <imirkin_> from mesa

19:36 <clever> but having the ability to do arbitrary glsl things would be a useful demo

19:36 <robclark> another idea is write an assembler for vc4?

19:36 <clever> robclark: already got one, but id still need to write the asm first

19:36 <imirkin_> clever: sure, but as long as you're OK with a manual step in between and it's not a ton of shaders...

19:37 <clever> robclark: https://github.com/cleverca22/gl/blob/master/texture.s is something i wrote years ago, to do fragment shading of a texture with alpha blending

19:37 <clever> imirkin_: i can use scripting to heavily automate that

19:38 <clever> i'm thinking i just need to force opengl into loading the vc4 drivers (on x86), then compile a shader, and then GetProgramBinary the compiled shader

19:38 <clever> and skip the hw init step

19:38 <imirkin_> yeah

19:38 <imirkin_> you can use drm-shim to make that happen

19:39 <clever> https://i.imgur.com/2Fxr0U3.jpg

19:39 <clever> this is the current state of my output

19:40 <clever> a single polygon and a single fragment shader, with 3 varyings per vertex, directly treating the varyings as RGB

19:40 <imirkin_> yeah, there's something very wrong

19:40 <imirkin_> oh, unless it's not a flat polygon?

19:40 <clever> its a triangle

19:40 <imirkin_> erm

19:40 <imirkin_> wrong term

19:40 <imirkin_> unless the triangle is angled relative to the viewport plane

19:40 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L330-L381

19:41 <imirkin_> i.e. if it has "depth"

19:41 <clever> it shouldnt be

19:41 <clever> this code is generating the vertex and varying data

19:41 <imirkin_> that looks off then

19:41 <imirkin_> too much blue

19:41 <imirkin_> and the blue is weirdly invading the red along the edge

19:42 <clever> a lot of that banding is artifacts from the phone camera interacting with the phosphor grid in the crt

19:42 <imirkin_> yeah i'm not talking about that

19:42 <clever> it looks much smoother to my quishy eye

19:42 <imirkin_> not about smoothness. it's about proportions

19:42 <imirkin_> too much blue.

19:43 <clever> the coordinates are also not taking aspect ratio into account

19:43 <clever> the code assumes the pixels are square

19:43 <clever> but its a 720x480 canvas, on a 4:3 crt

19:44 <imirkin_> supposed to be an equal amount of red/green/blue

19:44 <imirkin_> i don't see that at all in the image

19:44 <clever> to my naked eye, there is a thin band of yellow in the top-center

19:45 <imirkin_> https://community.khronos.org/t/color-of-fragment-interpolation/76916

19:45 <clever> i need to hook up the lcd tv as well, to get around all of those artifacts

19:45 <imirkin_> this is what the "unit" triangle is supposed to look like

19:45 <clever> or get hdmi init going

19:45 <clever> yeah, thats pretty close to what i have on my tv

19:46 <imirkin_> ok. that's not at all what it looks like in that pic

19:46 <clever> yeah

19:46 <imirkin_> in your pic, like 2/3 of it is blue

19:46 <clever> the cellphone camera made a horid mess

19:46 <clever> i think the AWB in the phone, is trying to subtract green

19:46 <clever> because the picture is too green

19:46 <imirkin_> maybe

19:46 <clever> let me set a black bg, and re-take

19:49 <clever> hmmm, now its blinding the camera, and washed out

19:49 <clever> HDR mode?

19:49 <imirkin_> well, i mean i trust you if it looks right, then it looks right

19:49 <imirkin_> take a look at this program:

19:50 <imirkin_> https://cgit.freedesktop.org/mesa/shader-db/tree/run.c

19:50 <imirkin_> it's designed to work with drm-shim

19:50 <imirkin_> you could easily add get binary support to that

19:50 <clever> ah, that sounds like a good starting place

19:50 <imirkin_> or you could add something in vc4 to dump stuff to stdout or whatever

19:51 <clever> i dug around a bit, and found that vc4_update_compiled_shaders is involved in generating the shader binary

19:51 <imirkin_> it was created for computing compilation stats on various drivers

19:51 <imirkin_> but you can repurpose it for your needs

19:51 <clever> nice

19:51 <imirkin_> yeah, i think vc4 works around various missing hw features by adding shader-based workarounds/implementations

19:51 <clever> also, just for the heck of it, i'm currently running the vc4 3d core, at 1.28mhz

19:52 <clever> currently, the rendering phase takes 98.521ms

19:53 <clever> some rough math, says that this frame then took 126,106 clock cycles to render

19:55 <clever> https://matrix.org/_matrix/media/r0/download/matrix.org/rvuldQFZBKWuNiMMZoWLtNeU/20210904_165049.jpg

19:55 <clever> imirkin_: how does this one look?

19:56 <clever> heh, the red turned into an orange....

19:56 <clever> i should probably just give up on getting color accurate photos of a CRT

19:57 <imirkin_> :)

20:00 <clever> let me get one last image...

20:01 <clever> https://www.youtube.com/watch?v=V6ogpgieJrQ

20:01 <clever> imirkin_: this is the exact same shader code, and nearly identical everything else, running under linux, poking the hardware thru /dev/mem, and rendering to hdmi

20:02 <clever> its also changing the vertex data, because spinning makes it better!

20:02 <imirkin_> looks fine

20:02 <imirkin_> spinning is always better

20:02 <clever> making it spin on baremetal is one of the next goals

20:03 <clever> that will involve defering page-flips until vsync, and handling irq's

20:03 <clever> i also need to confirm if i even have a cos() and sin() implementation, lol

20:03 <clever> i'm currently cheating, -O let gcc call them at compile-time

20:04 gouchi has quit [Remote host closed the connection]

20:04 <imirkin_> cheating is always good.

20:05 <clever> until you change the angle, and it needs to compute a runtime, leading to linker errors

20:05 thelounge53 has quit []

20:05 <imirkin_> cheating can have downsides ;)

20:05 thelounge53 has joined #dri-devel

20:05 <clever> i'm also updating the display list on every vsync irq

20:06 <clever> i should instead be preparing one outside of irq, and only doing page-flip in irq

20:18 <clever> imirkin_: also, my randomly picked goal, is to just make it spin a tea-pot, lol

20:19 <imirkin_> glutTeapot()? :)

20:19 <clever> but on that angle, there are some questions...

20:19 <clever> -rw-r--r-- 1 clever users 461K Jul 2 03:26 'Utah_teapot_(solid).stl'

20:20 <clever> are there simpler ways to generate such a model, more in line with how SVG operates?

20:20 <clever> to compute the model at runtime, rather then adding nearly half a MB of binary data to the program

20:21 <imirkin_> i mean ... sure ... you could come up with a mathematical representation of the teapot and tessellate it?

20:22 <clever> was more asking, is such a sample code, already out in the wild already?

20:22 gouchi has joined #dri-devel

20:22 <imirkin_> i'm not aware of it

20:23 <clever> hmmm, and it looks like i have zero chance of doing full-frame 3d, with the ram off, lol

20:23 <clever> a single 720x480 RGBA8888 frame is 1.3mb

20:24 <imirkin_> RGB565 to the rescue?

20:25 <clever> 675kb then, but i only have 128kb of L2 cache to work with

20:25 <imirkin_> i hear AMD makes video cards

20:25 <bnieuwenhuizen> time for some L3?

20:25 <clever> i'm trying to push the rpi to its limits, with as little code as possible

20:25 <clever> and turning off random hw blocks, just to make the task more fun :P

20:26 <bnieuwenhuizen> if you're good at timing you might see if there is scanline info so you can generate parts of the image on demand?

20:26 <clever> bnieuwenhuizen: there is a current scanline field in the 2d composition hardwre

20:26 <clever> and there is an h-sync irq

20:27 <clever> in theory, i could just configure the composition hardware to have 2 64 pixel high stripes of image data, repeating

20:27 <clever> and then render one stripe of 3d at a time

20:28 <clever> and if i race it fast enough, then i can keep ahead of the read pointer

20:28 <clever> treating it like a ring-buffer, kinda

20:28 <clever> so that would be 720x128 then

20:28 <clever> 180kb, still too big!

20:29 <bnieuwenhuizen> is 32 high not ok?

20:29 <clever> bnieuwenhuizen: the 3d core on the rpi generates image data in 64x64 tiles

20:30 <clever> basically, it can only render a 64x64 tile of the screen, and there is extra support logic, to schedule multiple times, and clip things

20:30 <bnieuwenhuizen> time for 4:2:0 then

20:30 <clever> i think enabling 4x multi-sampling reduces that to a 32x32 tile

20:30 <imirkin_> so maybe don't do that?

20:30 <clever> as-in, it renders 64x64, but then shrinks it down to 32x32 when it writes to ram

20:30 <imirkin_> it renders 64x64 samples probably

20:30 <bnieuwenhuizen> can't you do 32x32 in a compute shader btw?

20:31 <imirkin_> which corresponds to 32x32 pixels once resolved

20:31 <clever> bnieuwenhuizen: ive not even tried compute shaders yet

20:31 <clever> imirkin_: exactly

20:31 <bnieuwenhuizen> I hear software rasterizers in compute shader is all the norm these days

20:31 <imirkin_> with GL_EXT_multisample_render_to_texture or whatever it's called

20:31 rasterman has joined #dri-devel

20:31 <clever> imirkin_: i believe the 64x64 -> 32x32, is a dedicated hw step, when storing that tile back to ram

20:31 <imirkin_> clever: yes, a "resolve"

20:32 <imirkin_> but you can't just do that if you're meant to preserve the samples' info

20:32 <imirkin_> although vc4 might not need to support that given its GL level

20:32 <clever> in theory, i can just re-configure things, so the 32x32 tiles still fill the entire screen

20:32 <clever> but only render 2 strips of them, so 720x64

20:33 <clever> and use those 2 strips as A/B buffers, racing ahead of the scanline

20:33 <clever> that would get me down to 90kb of image data

20:33 <clever> which only leaves 38kb for code, lol

20:34 <clever> bnieuwenhuizen: but you mentioned 4:2:0, let me see what output formats i have...

20:35 <clever> i think the 3d core is limited to rgb only

20:35 <clever> it can either do bgr565 dithered, rgba8888, or bgr565

20:36 <clever> for now, i'll just turn the ram on, lol

20:47 jessica_24 has quit [Quit: Connection closed for inactivity]

20:49 <clever> imirkin_: i'm also entirely in the dark on how vertex shaders work, got any tutorial links handy?

20:49 <clever> all i really know, is that it converts the xyz to xy, and deals with projection, rotation, and translation

20:50 <imirkin_> clever: at the vc4 level (it gets more complicated with higher GL versions), it reads attributes and generates a gl_Position and other varyings to be passed to the fragment shader for interpolation

20:50 <clever> i implemented it all once before in qbasic, but that was a very crude algo, and not really what a gpu expects

20:50 <imirkin_> the gl_Position (+ viewport sttings) drive rasterization details

20:50 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L330-L381

20:50 <imirkin_> (rasterization is the process of determinign which pixels on the pixel grid are covered by the triangle in question)

20:51 <clever> ah, i think vc4 calls that binning

20:51 <imirkin_> binning is the process of determining which tiles are covered

20:51 <clever> this code, is currently generating xyz coords, but with a flat z=1 across the whole process

20:51 <imirkin_> so that not every tile has to process every polygon

20:51 <imirkin_> with tiling architectures, you rerun the whole geometry for each tile

20:52 <clever> i assume that with vertex shaders, these coords are passed into the VS, which then emits xy back out, and that lands ... smewhere in ram

20:52 <imirkin_> but then if you know that some triangle is only on certain tiles, then you can skip rasterizing that triangle for the tiles where it's not

20:52 <clever> yeah, during debug, i had to parse the "tile allocation data" by hand

20:52 <clever> and its just a compressed primitive index list

20:52 <imirkin_> the details of where vertex shader results are stored is very hw-specific

20:52 <clever> for what is in the tile

20:52 <clever> all kinds of fancy tricks, like using 4bit ints, relative to the last polygon

20:52 <imirkin_> (and i know nothing about vc4 specifically)

20:53 <clever> https://docs.broadcom.com/doc/12358545 has the vc4 docs, if your interested in following along

20:54 <clever> page 78, there is "GL Shader state record", "NV Shader state record", and "VG shader state record"

20:55 <clever> currently, i'm using "NV Shader state record", (i believe NV == non-vertex), so it just runs the fragment shader directy on the vertex data i supplied

20:55 <imirkin_> can't say that i am

20:55 <clever> ahh, and right on the next page (81), shaded vertex format in memory!

20:56 <clever> ah yeah, that is what i have to supply, when in non-vertex shading mode

20:56 heat has joined #dri-devel

20:57 gouchi has quit [Remote host closed the connection]

20:57 <clever> and zero mention of non-shaded vertex data!

21:05 <clever> hmmm, when using a "GL Shader state record", you actually supply it with 3 shaders, fragment, vertex, and coordinate!

21:06 <clever> the fragment shader gets some code, a list of uniforms, and a varying count

21:06 <alyssa> imirkin_: lol

21:06 <alyssa> clever: That's normalish for tilers

21:06 <clever> the vertex shader gets some attribute arrays, code, and list of uniforms

21:07 <clever> the coordinate shader gets more attributes, code, and a list of uniforms

21:07 <alyssa> yes that is how coordinate shaders work

21:07 <clever> so each shader, has its own set of uniforms, and the vertex/coordinate shaders can each select a subset of the attributes

21:08 <clever> alyssa: hmmmm, is this maybe going to feed x/y/z in thru attributes, and then generate the fully shaded vertex data i already make (in non-gl software)?

21:12 <clever> alyssa: where could i find out more about coordinate shaders? google isnt giving any good hits

21:20 <alyssa> it uh

21:20 <alyssa> sounds like you already figured everything about them out

21:20 <alyssa> it's not a very complex concept as far as 3D goes :)

21:21 <alyssa> clever: re mesa layers of crap, that's horribly out of date

21:22 <clever> https://github.com/cleverca22/3D/blob/master/3D.BAS#L349-L363

21:23 <clever> alyssa: this is the closest ive gotten to "vertex shading", find the x and y difference, divide by z difference to give perspective, no rotation at all, lol

21:23 <alyssa> the rest is just matrix math

21:23 <alyssa> have you taken a linear algebra class?

21:23 <alyssa> (first year uni, typically)

21:23 <clever> i never finished high school

21:24 <clever> but i did expand pythagorean theory from 2d to 3d easily enough

21:24 <alyssa> if you're serious about graphics, 100% recommend reading a book on linear algebra

21:24 <clever> some traces of that are in the code i just linked

21:24 <clever> but it was overflowing the poor ints in qbasic, lol

21:24 <alyssa> I don't think the math is any harder than what's done in high school, there's just more of it

21:25 <clever> and i was always ahead of what school taught, in random areas

21:26 <clever> reading thru https://learnopengl.com/Getting-started/Coordinate-Systems , at the perspective projection part

21:27 <clever> i now see why the vc4 hardware wants 1/w as a float

21:27 <clever> x * (1/w) is cheaper then x/w, if 1/w is pre-computed

21:27 <alyssa> Yep

21:27 <alyssa> Mali wants the same

21:28 <clever> and the vector compute core on the rpi (seperate from shaders) lacks a division opcode

21:28 <alyssa> you're the librepi person? :'o

21:28 <clever> yeah

21:28 <alyssa> do I.. do I still have any code left in your tree? Lol

21:28 <clever> what part did you work on?

21:29 <alyssa> https://github.com/librerpi/rpi-open-firmware/blob/master/arm_chainloader/loader.cc#L147

21:29 <alyssa> this line looks like my doing :-p

21:29 <clever> ah, i see your commits in the git log

21:31 <clever> there are some design limits in rpi-open-firmware, that ive been wanting to get around

21:31 <clever> so ive been re-doing everything under LK

21:31 <clever> https://github.com/librerpi/lk-overlay/blob/master/app/linux-bootloader/loader.c#L121-L124

21:31 <clever> alyssa: that part of the code is now over here

21:31 <clever> but something is wrong with it, and linux never prints a single byte

21:32 <alyssa> 🤷

21:32 <alyssa> I've garbage collected videocore, sorry :(

21:32 <clever> emoji also dont render on this irc client

21:33 <clever> and that bug, is basically outside the VC4 area

21:33 <clever> its all arm->arm

21:33 <clever> i just need to wire jtag up again, and dump it

21:33 <alyssa> what are all these htonl calls?

21:33 <clever> byte order swaps

21:33 <alyssa> wwwwhy?

21:33 <clever> device-tree is big-endian

21:34 <clever> so i must write BE to those fields, or it just wont work

21:34 <alyssa> but... doesn't libfdt take care of that?

21:35 <clever> fdt_setprop just expects a blob of binary data and a size

21:35 <alyssa> right, ok

21:35 <clever> this property, is an array of 32bit values

21:35 <clever> and it can be 64bit values, but those are encoded as 2 32bit values

21:36 <clever> i suspect half of the problem is changes in LK, since arch_chain_load was last used on arm

21:37 <clever> it needs to first disable the MMU and flush the caches

21:37 heat has quit [Remote host closed the connection]

21:37 <alyssa> I also, errr, don't understand why you're driving the v3d in a bootloader?

21:37 <clever> but when you disable the MMU, it starts treating PC as physical, not virtual

21:37 <clever> alyssa: the power gating to v3d has to be flipped on, before linux even has a chance of driving that hw with the existing drivers

21:38 <clever> and i wanted to confirm the 3d is fully working, before i bother testing it in linux

21:38 <clever> there is already a problem in the 2d area, that this would have helped with, kinda

21:38 <clever> if the arm core, writes even a single HVS register (the 2d core), it gets an async external abort

21:39 <clever> but, having a full demo (https://youtu.be/u7DzPvkzEGA) that shows the HVS is fully powered on and working, when driven by the VPU

21:39 <clever> that prooves its not power gating that is at fault

21:39 <alyssa> still seems like reusing mesa would be a lot easier than trying to open code the demo

21:40 <alyssa> (doable? easily, have done it for every gpu i've brought up. useful when you already having a perfectly cromulent open driver? I dunno)

21:40 <clever> alyssa: can mesa run on an entirely new arch, that it has never been cross-compiled to before, without a kernel?

21:40 <clever> also, the code i ported to LK, is the original hackdriver code, from before mesa even had vc4 support

21:40 Duke`` has quit [Ping timeout: 480 seconds]

21:41 <alyssa> hah! nice :-)

21:41 <clever> i now have 4 different graphics demos

21:41 luckyxxl has quit []

21:42 <clever> https://www.youtube.com/watch?v=V6ogpgieJrQ this would be hackdriver, running under linux, poking the v3d via /dev/mem

21:42 <clever> back in 2014!

21:42 <clever> https://www.youtube.com/watch?v=JFmCin3EJIs this would be LK, running as a baremetal arm kernel, poking the HVS config, but it relied on the blobs to init the hdmi hw

21:42 <alyssa> phire's code?

21:43 <clever> a fork of it

21:43 <clever> i made it spin, and put some of it in kernel during the later stages

21:43 <alyssa> what memories!

21:43 <clever> https://www.youtube.com/watch?v=u7DzPvkzEGA this is now baremetal VPU LK, configuring the HVS and VEC (ntsc generator) from scratch, with zero help from the blobs

21:43 <alyssa> (also wild to read the demo code after reading so much beautiful vc4/v3d code in linux+mesa)

21:44 <clever> https://i.imgur.com/2Fxr0U3.jpg

21:44 <clever> alyssa: and then this is the latest demo, hackdriver, now running on the VPU, and doing power gating enable

21:44 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/v3d/v3d.c#L464-L473

21:45 <clever> alyssa: the hardest part of porting it, was this chunk to enable power to the graphical sub-subsystems (hvs isnt graphical??) and figuring out alignments that the docs didnt specify

21:46 <clever> alyssa: and now i want to make the VPU spin a teapot, just because i can :P

21:47 <alyssa> because I can, I can get behind that!

21:47 Ahuj has quit [Ping timeout: 480 seconds]

21:47 <alyssa> Good luck :-)

21:47 <clever> do you remember if the vc4 accepts raster textures?

21:48 <clever> there it is, chapter 4 of https://docs.broadcom.com/doc/12358545 ....

21:49 <clever> looks like it must be LT or T format

21:50 <clever> yeah, looks like that ruins my plans of putting /dev/fb0 on the side of a spinning cube, lol

21:50 <clever> something would need to do a (semi-costly) linear->t-format conversion, on every frame

21:51 <clever> alyssa: how crazy would it have been, to have the "dumb" framebuffer linux is using, wrapped on a teapot, without linux even being aware of that? :P

21:53 <clever> hmmmm, i can kinda picture, how i might use the VPU's vector core, to do that translation....

21:58 Ahuj has joined #dri-devel

22:02 <clever> alyssa: but back to what i was saying, i have repeatedly asked the engineers for some tips, on why the arm core cant touch the HVS

22:02 <clever> alyssa: until that is fixed, none of the existing mesa drivers will work

22:02 <clever> alyssa: they have all claimed its something like power gating, but given that i can drive it to do all of these things, its definitely not gated!

22:04 cleverca22[m] has joined #dri-devel

22:07 Ahuj has quit [Ping timeout: 480 seconds]

22:23 rasterman has quit [Quit: Gettin' stinky!]

22:26 iive has quit []

22:31 <pinchartl> is there a good free software graphics library that provides an image object that can support multi-planar formats (such as NV12) and provide CPU access to pixels ? I'm thinking about the same level of abstraction as pixman_image_t or the OpenCV Mat

22:37 <alyssa> pinchartl: opencv doesn't cut it?

22:39 <pinchartl> not entirely sure. I think it can support conversion from multi-planar formats to the native BGR format

22:39 <pinchartl> I'm asking because I'm trying to figure out how to design an image/surface class in libcamera

22:40 <pinchartl> we have a FrameBuffer class that represents a multi-planar frame buffer with dmabufs, offsets and strides

22:40 <pinchartl> so that's very similar to the DRM model

22:40 <pinchartl> it's all nice, works well, but doesn't provide CPU access

22:40 <pinchartl> and it's not nice to give such a frame buffer to applications and tell them to figure out how to map it

22:41 <clever> pinchartl: the same libcamera used by the rpi?

22:41 <alyssa> oh, I see the problem. right..

22:41 <pinchartl> as there are nasty things to take into account, such as page alignment when calling mmap(), discontiguous planes using different dmabufs vs. contiguous planes using the same dmabufs + offsets, ...

22:41 <alyssa> yeah

22:42 <pinchartl> so I'm trying to implement a good helper for that

22:42 <pinchartl> it's not very difficult

22:42 <pinchartl> but it requires a class to model the "mapped framebuffer"

22:42 <clever> > (*) If you divide your image into 128 column wide strips with both the luma and respective U/V (NV12) interleaved chroma, and then glue these strips together end on end, that's about right.

22:43 <pinchartl> that one is nasty to design, as I can very easily see it turning into a full-fledged image processing API

22:43 <clever> pinchartl: also, the h264 encoder needs an even more weird planar format

22:43 <pinchartl> and I don't want that :-)

22:43 <pinchartl> clever: yes, same libcamera

22:43 <alyssa> Do you need something more sophistciated than `struct { void *plane0, *plane1; unsigned stride0, stride1; } map_cpu(SnazzyNV12Framebuffer)`..?

22:43 <clever> pinchartl: i think what its doing, is its cutting the entire image up, into 128 pixel wide strips, and each strip is then in 3 planes

22:44 <clever> so you have plane-1a, plane-2a, plane-3a, plane-1b, plane-2b, plane-3b

22:44 <clever> where 1/2/3 are the color planes, and a/b/c are the 128 pixel wide strips of the image

22:44 <alyssa> mh

22:44 <alyssa> if the images are nonlinear in memory I'm not sure there is /any/ sane API to do a CPU mapping

22:45 <clever> the the h264 accel, needs it in that whack format

22:45 <pinchartl> alyssa: possibly not. my problem is that I'm trying to draw the line at the right place between a very ad-hoc solution that will not be very flexible, and yet another 2D image processing library that I have no time to develop. I was thus wondering if there was an existing image/surface implementation that I could use as a model

22:45 <clever> and currently, no api can deal with that

22:45 <alyssa> pinchartl: Okay. First question, are the images linear in memory then?

22:45 <clever> so the ISP just converts for you, and lets you use more normal formats

22:45 <clever> https://www.raspberrypi.org/forums/viewtopic.php?p=1854716#p1854716

22:45 <pinchartl> clever: I don't have that issue on the camera side

22:45 <clever> (the source for what i said)

22:45 <pinchartl> NV12 is normal NV12 there

22:46 <alyssa> Then IMO don't fix what isn't broken. Just make a two-plane map() function that hides the dma-buf ugliness and call it a day.

22:46 <pinchartl> alyssa: on all the devices I have to support now, yes. the planes can be contiguous or disjoint, but within a plane, it's linear

22:46 <clever> pinchartl: yeah, i believe the ISP does bayar to normal planar, and if you want to later h264 encode, it passes it thru the ISP again, to shuffle the planar format up

22:46 <pinchartl> I wouldn't fully rule out tiled formats in the future, although that's more common on the display side than the camera side

22:46 <clever> pinchartl: so because your not dealing with that shuffled planar, the ISP has to do an extra copy, if you want to generate h264 streams

22:46 <alyssa> (And if it were tiled or something, IMO better off just blitting from a linear staging resource and letting the user pretend it's linear, unless you have pressing perf issues. )

22:46 <clever> and broadcom wont let RPF release the docs for the ISP

22:47 <alyssa> (This is how it works in Mesa. see `transfer_map()` prototype)

22:47 <alyssa> (which works seamless for linear, tiled or compressed by either directly mapping, staging and CPU tiling, or staging and GPU blitting)

22:47 <alyssa> that's good enough for OpenGL, dunno if it's good enough for cameras

22:47 pendingchaos has quit [Ping timeout: 480 seconds]

22:48 <pinchartl> if I ever have to deal with tiled formats, I'll want to map them through a tiler device to have a linear CPU view. how to do the mapping will be an interesting question, that calls for a system-wide graphics memory management library I think. definitely out of scope for libcamera, but I'll use it once it will exist :-)

22:48 <clever> pinchartl: weird, why does #libcamera require ssl to join?

22:48 <pinchartl> clever: anti-spam measure, we got spam in the beginning from non-ssl connections, so it was a cheap measure to fight against it

22:48 <clever> ah

22:49 <pinchartl> and everybody should encrypt their connection anyway :-)

22:49 <pinchartl> maybe the spammers went away, it was during the freenode to OFTC transition

22:49 <clever> i still havent bothered setting up ssl on this client

22:49 <clever> yeah, that was a huge mess

22:49 <pinchartl> I didn't try to disable it to check if they were still there :-)

22:49 <pinchartl> alyssa: thanks for the advices

22:49 <clever> it does seem to have died off now

22:50 <pinchartl> I really fear that at some point I'll need a 2D image processing library that can work natively on YUV planar formats

22:50 <pinchartl> when that day comes, I'll cry and then do something about it

22:50 <pinchartl> but that's not for today

22:50 <clever> pinchartl: i believe the ISP is capable of transforming to/from tiled formats, and also between yuv and rgb

22:50 <clever> and of course, bayer is a valid input too

22:51 <pinchartl> not sure about the tiling part, for the rest, sure

22:51 <clever> i have been pushing to get more open-source everything on the rpi

22:51 <pinchartl> I'll happily let the RPi developers handle tiling

22:51 <clever> pinchartl: https://i.imgur.com/2Fxr0U3.jpg is my most recent feat, the 3d core is now running, without a single blob involved

22:51 <pinchartl> their camera team is very supportive

22:51 <clever> start.elf is fully open source on this demo

22:52 <pinchartl> and the camera implementation does use a closed-source firmware, but it's really a thin glue layer now, Linux has direct control of the ISP

22:53 <clever> from what i heard last, linux is only in control of the unicam (csi input)

22:53 <clever> the ISP is under control of start(4).elf, and linux is just issuing commands to the blob over an RPC

22:54 <clever> and then depending on which kms overlay you load, different parties are involved in the 2d output

22:54 <pinchartl> Linux has direct access to the unicam hardware. for the ISP, it does go through the firmware, but the implementation is much better today

22:54 <pinchartl> the firmware used to implement full control of the ISP, exposing a high-level API

22:55 <pinchartl> now it's a thin glue layer, the ISP features are exposed directly to Linux, and libcamera controls the ISP

22:55 <clever> at that point, what do the have left as a secret?

22:55 <clever> why not just have proper linux drivers?

22:56 lemonzest has quit [Quit: WeeChat 3.2]

22:56 <pinchartl> I don't have all the details, but I think it's partly due to the control of the graphics side. as the VC4 was traditionally controlled from the firmware for display (this is changing too), power management is implemented in the firmware as far as I understand

22:56 <clever> with the work ive been doing, i can boot linux on a pi2 or pi3, without any blobs involved at any point

22:57 <pinchartl> but feature-wise, the firmware now exposes the ISP features to Linux

22:57 <clever> in theory, getting the unicam to work under that, is just a matter of adding power-management code around the unicam block

22:57 <clever> but if you want the ISP to work, your currently out of options

22:57 <pinchartl> unicam is fully controlled by Linux already as far as I can tell, including PM

22:58 <clever> the real kms drivers, dont cover PM

22:58 <clever> they assume a lot of the hardware has been pre-initialized

22:58 <clever> and there is still a magic handshake missing, that stops the arm from touching the HVS, even when its fully working

22:58 <pinchartl> while it may be nice to avoid the closed firmware to use the ISP, I'm already happy that the full set of ISP features is now exposed to Linux

22:59 <clever> and having that full feature set, does at least make RE'ing the isp simpler

22:59 <pinchartl> there's no closed-source part of the camera algorithms anymore

22:59 <clever> it gives a much better idea of what its doing behind the curtain, and lets you tweak knobs, and see what registers change

22:59 <pinchartl> which is really great

22:59 <pinchartl> RPi has done a really good job open-sourcing the camera stack

22:59 <pinchartl> and when it comes to the thin firmware that is left

23:00 <pinchartl> I don't think they would mind dropping that

23:01 <clever> something i mentioned in this channel earlier, is that there are only 2 hw blocks on a pi2/pi3, that you may want to use, and still lack drivers

23:01 <clever> the h264/mpeg/vc1 accel block

23:01 <clever> and the isp

23:01 <pinchartl> but I don't see it as a priority feature-wise, and it's something that can be done in the background, it's mostly the ISP kernel driver that will need to change, the rest of the stack shouldn't be too affected

23:01 <clever> once those get docs, or a loadable blob, the open firmware could be a viable feature-complete replacement

23:01 <pinchartl> we don't support the pi2 though

23:01 <pinchartl> only the pi3 and pi4

23:02 <clever> are you just ignoring arm32 support?

23:02 <pinchartl> no, it's not about arm32, just about different ISP generations

23:02 <clever> ahh

23:02 <pinchartl> with limited resources, you have to draw the line somewhere :-)

23:02 <clever> i wasnt aware the isp changed between pi2 and pi3

23:03 <pinchartl> I don't know all the details

23:03 <clever> my understanding, was that the pi0-pi3 lineup, is essentially identical, if you dont look at the arm block

23:03 <pinchartl> the RPi camera team open-sourced their stack and work with us to integrate it in libcamera. I'm really grateful for that, they've been very supportive

23:04 <pinchartl> and focussing on pi3 and pi4 makes sense with limited resources

23:04 <clever> yeah

23:04 <clever> in my case, pi2 and pi3 support, was more of an accident, because i got stuck on one bug, and just changed target for a bit

23:04 <clever> the pi3 support was failing hard, because i didnt enable certain arm L2 access permissions

23:05 <pinchartl> my earlier question about an image class was actually related to this, I'm working on fixing a regression in libcamera that RPi has found to break everything :-)

23:05 <clever> so linux couldnt flush the L2 cache, and things literally became an incoherent mess

23:05 <pinchartl> I want to fix it before Monday

23:05 <clever> before i had figured that out, i gave up, and switched to the pi2

23:05 <clever> but that, failed because i hadnt enabled SMP support, and the mutex opcodes where illegal

23:06 <clever> by pure chance, the same control register flag, fixed both of those bugs, lol

23:09 clever has quit [Quit: Changing server]

23:09 clever has joined #dri-devel

23:09 pendingchaos has joined #dri-devel

23:12 pendingchaos has quit []

23:13 pendingchaos has joined #dri-devel

23:49 pcercuei has quit [Quit: dodo]

23:56 tzimmermann__ has joined #dri-devel