austriancoder has quit [Ping timeout: 272 seconds]
smurray has joined #etnaviv
austriancoder has joined #etnaviv
berton has joined #etnaviv
lynxeye has joined #etnaviv
<marex>
lynxeye: hey
<marex>
lynxeye: I have a few fixes for that PGC, but I also noticed one thing about the hang
<marex>
lynxeye: it seems like the hangs are clock related, as if there is some weird clock glitch
<marex>
lynxeye: it almost seems like it might be a particularly bad idea to let the PGC and etnaviv manage the clock at the same time
<lynxeye>
marex: I'm all ears
<marex>
austriancoder: also, MR6454, that was reproducible only after some weeks of runtime (ew)
<marex>
lynxeye: all ears regarding the clock or what ?
<lynxeye>
Yep, I would be very interested to know what you found
<marex>
lynxeye: well, look at what the TFA does, I suspect they enable all the clocks in gpumix for that exact reason
<marex>
lynxeye: because if you keep enabling and disabling the clock, the GPU cluster becomes unstable
<lynxeye>
driving the clocks from both PGC and etnaviv is what we do on i.MX6 and i.MX8M
<marex>
lynxeye: except there you drive the clock for the entire cluster, right ?
<lynxeye>
marex: is it the _bus clock that does funky things for you?
<marex>
lynxeye: I didn't identify the clock just yet
<marex>
lynxeye: btw you should sync the HSK only after you turn the PGC PUP request on
<marex>
lynxeye: and tear the HSK down before PDN
<marex>
my understanding of that HSK is that it disconnects some bus bridge
<lynxeye>
marex: yeah, I thought about splitting power up/down in the gpc driver to better handle this sequencing
<lynxeye>
yep, the HSK drives the amba domain bridges
<marex>
lynxeye: that's likely what needs to be done
<marex>
lynxeye: the current way is awful
<lynxeye>
agreed
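The HSK sequencing marex describes above can be sketched as a small self-contained C model. All function and register names here are invented stand-ins, not the real imx-gpcv2 driver API; the point is purely the ordering: power-up request first, then handshake sync, and on the way down, handshake teardown before the power-down request.

```c
/* Toy model of the sequencing discussed above (all names hypothetical):
 * the ADB handshake (HSK) gates a bus bridge, so it must only be synced
 * once the PGC power-up request is in flight, and must be torn down
 * before the power-down request is asserted. */
#include <assert.h>
#include <string.h>

static const char *trace[4];
static int n;

static void step(const char *what) { trace[n++] = what; }

/* stand-ins for the PGC PUP/PDN request bits and the handshake bits */
static void pgc_pup_req(void)   { step("pup_req"); }
static void hsk_sync(void)      { step("hsk_sync"); }
static void hsk_teardown(void)  { step("hsk_down"); }
static void pgc_pdn_req(void)   { step("pdn_req"); }

static void domain_power_up(void)
{
	pgc_pup_req();   /* 1. request power-up from the PGC */
	hsk_sync();      /* 2. only now reconnect the bus bridge */
}

static void domain_power_down(void)
{
	hsk_teardown();  /* 1. disconnect the bus bridge first */
	pgc_pdn_req();   /* 2. then request power-down */
}
```

Splitting power up/down in the gpc driver, as suggested above, would make this ordering explicit instead of implicit.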
<marex>
lynxeye: but it seems like maybe we should rather do as the imx-scu does, with .start / .stop
<marex>
because with the current setup, the platform_device_add turns the power domain on and off on boot for no good reason
<marex>
but the comment in soc-imx-scu.c explains that all NXP drivers are broken and need to be fixed first :(
<marex>
lemme find that one
<marex>
ah, here drivers/firmware/imx/scu-pd.c
<marex>
lynxeye: I sent you the patches I have now, still needs work
<marex>
back to digging in the clock stuff
<lynxeye>
marex: Re patch 2: shouldn't regmap already guard the register access?
<marex>
lynxeye: I _think_ I've seen concurrent access there, so I just tossed the mutex in to be sure
<marex>
lynxeye: and no, I think regmap only serializes specific accesses, not across the entire function call
<marex>
lynxeye: i.e. regmap_update_bits is called with a lock held, but the entire Pxx function isn't called with a lock
<marex>
lynxeye: I suspect if you get both gpu-2d and gpu-3d access the HSK at the same time, that might mess things up
<marex>
for example
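The distinction being made here, that regmap serializes each individual register access but not the multi-access Pxx sequence, can be modeled with two mutexes. This is a sketch under assumptions, not the real driver: `access_lock` plays the role of regmap's internal lock, `seq_lock` is the mutex marex tossed in, and the register and bit names are invented.

```c
/* Model of per-access vs. whole-sequence locking (names hypothetical).
 * Each single access is serialized, like regmap_update_bits(), but the
 * Pxx power sequence is several accesses, so gpu-2d and gpu-3d threads
 * could interleave between them without the outer mutex. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t access_lock = PTHREAD_MUTEX_INITIALIZER; /* ~regmap internal */
static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;    /* ~added mutex */
static unsigned int hsk_reg;

/* one serialized access, analogous to regmap_update_bits() */
static void reg_update_bits(unsigned int mask, unsigned int val)
{
	pthread_mutex_lock(&access_lock);
	hsk_reg = (hsk_reg & ~mask) | (val & mask);
	pthread_mutex_unlock(&access_lock);
}

/* the whole multi-access sequence is made atomic by seq_lock */
static void *pxx_sequence(void *bit)
{
	unsigned int b = *(unsigned int *)bit;

	pthread_mutex_lock(&seq_lock);
	reg_update_bits(b, b);  /* request the handshake for this domain */
	reg_update_bits(b, 0);  /* clear it again */
	pthread_mutex_unlock(&seq_lock);
	return NULL;
}
```

Without `seq_lock`, nothing stops the two domains' accesses to the shared HSK register from interleaving mid-sequence, which is exactly the concern raised above.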
<lynxeye>
marex: And patch 3 is wrong from my understanding. This bit needs to be set before the power down, otherwise the domain won't power down when you trigger the pdn_req
<marex>
lynxeye: shouldn't the bit then be set unconditionally ?
<marex>
lynxeye: the PCR needs to be set before power up at least, and likely also before power down then
<marex>
lynxeye: surely enable_power_control (which = !on) is wrong too
<lynxeye>
marex: the doc says the bit should be asserted before pdn_req and should not be changed until the domain is completely powered up again.
<lynxeye>
so yep, setting it all the time looks like the right thing to do
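The rule lynxeye quotes from the docs can be captured as a tiny checker. All names here are placeholders, not actual imx-gpcv2 registers: the power-control bit must already be set when pdn_req is asserted and must not be touched until the domain is completely powered up again, which setting it once (e.g. at probe) trivially satisfies.

```c
/* Minimal checker for the documented rule (names hypothetical):
 * PCR set before pdn_req, and unchanged until power-up completes. */
#include <assert.h>

static int pcr_set;
static int rule_ok = 1;

static void set_pcr(void) { pcr_set = 1; }

static void pdn_req(void)
{
	if (!pcr_set)           /* bit must already be set at power-down */
		rule_ok = 0;
}

static void pup_req(void)
{
	if (!pcr_set)           /* ...and still set across power-up */
		rule_ok = 0;
}
```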
<marex>
lynxeye: lemme recheck what NXP does in their TFA too
<lynxeye>
marex: they switch it before powering up, which is inconsistent with the docs
<marex>
lynxeye: yep :)
<marex>
lynxeye: I would expect that is the sensible thing to do though, no?
<marex>
lynxeye: I mean, if that bit enables "power control"
<lynxeye>
marex: I think we should just both enable the CPU mapping and power control on probe of the domain drivers
<lynxeye>
I see no reason to ever switch those two things while the driver is active
<marex>
lynxeye: I suspect it is TFA defensive programming
<marex>
lynxeye: btw do you use NXP TFA fork or upstream one ?
<marex>
lynxeye: because they do different PGC init
<lynxeye>
marex: I'm still using downstream NXP TFA, because I can't get the DRAM to reclock with upstream TFA
<lynxeye>
and too many other fires right now
<marex>
yep, very much the reason I got back to you only nowish
<marex>
lynxeye: I'll keep digging and let you know if I find something
<marex>
lynxeye: let's ask daniel in linux-imx about the PGC
<daniels>
lynxeye: speaking of downstream imx, did you ever get a chance to test the last dcss patchset & push to drm-misc?
<marex>
daniels: dcss is the DSI stuff ?
<daniels>
marex: it's the display controller in imx8mq (and maybe also imx8m for some display types?)
<lynxeye>
daniels: Still on my list. I dropped the ball a bit, as there were still discussions ongoing while I was on vacation and I'm still trying to dig out from the usual after-holidays craziness
<daniels>
fair enough, good luck :)
<lynxeye>
nope, dcss is only on the i.MX8MQ. All the others only have the simple eLCDIF controller
<daniels>
ah, I thought there was one which had elcdif + dcss for different display paths
<daniels>
but maybe that's the only bit about imx8 they made even a little bit simple ;)
<lynxeye>
the MQ has both DCSS and eLCDIF
<daniels>
ah, right
<marex>
lynxeye: wasn't lcdif parallel RGB only?
<lynxeye>
apparently DCSS was a bit too big/power hungry for the scaled down variants
<marex>
lynxeye: ah duh, that pdn_req, it's not asserted only by the PGC I think, there might be others
<marex>
lynxeye: I need to check again
shoragan has quit [Ping timeout: 240 seconds]
shoragan has joined #etnaviv
_daniel_ has joined #etnaviv
JohnnyonFlame has quit [Read error: Connection reset by peer]
<marex>
lynxeye: um, btw, SRC_GPU_RCR is shared between gpu2d and gpu3d, and the reset is only released after both GPU PDs are up
<marex>
lynxeye: maybe that's why you can't turn them on/off separately
<lynxeye>
marex: Yea, I'm not exactly sure about this thing. We don't really use the SRC reset, exactly because it's shared (even on MX6 the GPUs have a shared reset from the SRC), but I'm not sure what the PGC does to those resets
<marex>
lynxeye: well I wouldn't be surprised if it did some de-glitch
<marex>
lynxeye: which could explain the odd behavior of gpu2d
<lynxeye>
marex: yea, maybe we actually need to smash those domains together, like we did with the PCIe domains on 8mq
<lynxeye>
which would be a shame, as now we have 3(!) power domains for the GPUs, but still can't gate them separately
<marex>
lynxeye: how does it not surprise me at all
<marex>
lynxeye: I think maybe the people who implemented the TFA code can tell us more about that reset
_daniel_ has quit [Quit: Leaving.]
<lynxeye>
marex: eLCDIF is just the display controller with parallel output. On i.MX8MM there is a MIPI DSI bridge attached to the controller, on MX8MP they even added the LVDS bridge again.
<marex>
ha
JohnnyonFlame has joined #etnaviv
lynxeye has quit [Quit: Leaving.]
<mntmn>
on i.MX8MQ, either DCSS or LCDIF can drive MIPI-DSI, but only DCSS can drive HDMI/DP.
<marex>
this PGC is total madness
<marex>
turn on VPUMIX, PU indicates GPUMIX PD failed to turn on
<marex>
wt-actual-f
T_UNIX has quit [Quit: Connection closed for inactivity]
berton_ has joined #etnaviv
berton has quit [Ping timeout: 264 seconds]
berton_ has quit [Client Quit]
_daniel_ has joined #etnaviv
<cphealy>
marex: For your MR6454, what i.MX platform was this on?
<mntmn>
but sadly glamor/xorg needs 2 fixes that will never be accepted I guess
<mntmn>
(remove 1 glClear() and add 1 glFinish())
<austriancoder>
cphealy: should not matter as it is a general multi-context problem
<cphealy>
I was asking as we were doing some work with qtwebengine too and I'm curious which cores it effects. GC2000? GC3000? GC7000L?
<austriancoder>
cphealy: all.. as it is a general driver problem
<austriancoder>
not connected to any specific core
<cphealy>
ack
_daniel_ has quit [Quit: Leaving.]
<marex>
austriancoder: thanks for the review, I'll update the patch and then ask for another month of testing (since that's how long the original one was under test)
<austriancoder>
marex: hmm.. I know such tests from my day job and it sucks to not have a faster reproducer
<marex>
austriancoder: it's OK, I have a bugfix locally, upstream will have to wait a month or two
<austriancoder>
marex: we can land the first patch really fast if that's okay for you
<marex>
austriancoder: the first patch (remove whatever function) is useless without the locking, FYI
<austriancoder>
marex: I know - that's why we can land it faster
<marex>
sounds totally pointless to me :)
<austriancoder>
okay
<austriancoder>
so you can drop this patch altogether :)
<marex>
no, because the function has broken locking
<marex>
and I'm not gonna fix it, since it's unused
<austriancoder>
marex: okay.. but if it's not used we can drop it now - or not? I am happy with the change when you (or I) remove the stable marker
<marex>
it's just waiting for someone to use it and break the state, just remove it
<marex>
I was really considering adding the same fixes: tag to it
<austriancoder>
aha .. so shall we drop it now in master or do you want to wait until you get test feedback?
<marex>
it doesn't matter without the locking fix; with the locking fix, the function is broken
<marex>
without it, it was also broken
<marex>
I find applying the useless half of a patchset minus the actual bugfixes kinda pointless
<austriancoder>
marex: puhh.. in master etna_resource_get_status(..) has no callers -> dead code -> can be removed without your other patch. correct?
<marex>
austriancoder: I pushed what I think is harmless update to the bugfix patch, I didn't test it, it should address most of your concerns
<marex>
I'm more concerned about the performance of this enormous convoluted locking than about newlines
<austriancoder>
marex: okay.. but do you want me to wait ~1 month to get your test feedback before landing it? If yes I would love to land "etnaviv: Remove etna_resource_get_status()" in the next 2-3 days.
<marex>
austriancoder: I believe with this update, you can land them both
<marex>
austriancoder: if you need any more involved changes, then you need to wait
<marex>
austriancoder: because putting untested crap upstream isn't good
<austriancoder>
marex: thats why I ask
* austriancoder
hates untested crap
<marex>
look at the changes, they're newlines + that bool, that should be harmless
<marex>
austriancoder: note that the patch also should be easy to backport, so splitting functions isn't what I want to see in a bugfix
<marex>
austriancoder: although I do agree that the function you pointed out is horrific
<austriancoder>
marex: patch is okay.. but you now have too many new lines. I only wanted them after the ifs (as I wrote in the review)
<austriancoder>
/* if resource has no pending ctx's reset its status */
<austriancoder>
if (_mesa_set_next_entry(rsc->pending_ctx, NULL) == NULL)
<austriancoder>
mtx_unlock(&rsc->lock);
<austriancoder>
marex: but let's land it
<marex>
austriancoder: and backport to stable, should still apply to 20.1.y at least
<marex>
I still think the locking should be somehow reworked and it should be possible to make it much simpler ... or ?
<austriancoder>
marex: yep
<marex>
in fact, how come mesa doesn't have something generic ?
<marex>
I mean, this must be a problem with other GPUs too
<austriancoder>
marex: but that needs to wait or we have a fix after another round of testing
<marex>
austriancoder: well obviously
<marex>
austriancoder: I'm just concerned that this convoluted locking will have other even nastier warts
<marex>
austriancoder: and it is just too difficult to reason about the correctness of this
<austriancoder>
marex: me too .. but I don't want to block/delay your work that fixes a real world problem. performance-wise you should have a feeling if it got much worse or not
<marex>
austriancoder: performance-wise it's the same
<marex>
austriancoder: got any ideas about how to go about the locking rework ?
<austriancoder>
marex: I think batching would be the way to look into - like done in other drivers
<marex>
austriancoder: iirc robclark explained to me at some point there's a specific reason for batching in freedreno
<marex>
but I was too green back then to understand it fully (still am)
<marex>
austriancoder: and that it might not be necessary for vivante
<austriancoder>
aha
<marex>
austriancoder: it had to do with being able to interrupt the command stream and flush it at any point I think
<marex>
you should be able to do it on vivante, and not on freedreno
<austriancoder>
I am not sure about this statement .. but for the moment I am fine with it as I want to work on other etnaviv stuff :)
<marex>
right, because on vivante, we generate the entire BO with the command stream and then flush it into the kernel
<marex>
so uh ... the generation would have to be turned into something like ... ummm
<marex>
somehow the generation of the command stream would have to be changed so it's not just adding into a BO
<marex>
or something
<marex>
hmmmm
<marex>
austriancoder: I don't think I want to touch the locking for a bit, it makes my stomach turn
<marex>
austriancoder: but uh, isn't the ^ some form of batching already ?
<austriancoder>
marex: hmm.. na... it is more about describing draws/clears etc. in a batch that describes dependencies on resources etc.
<austriancoder>
but.. I have put it on my long list of todos
<marex>
austriancoder: well right now the draw calls almost directly pipe stuff into the final command stream, so somehow batching them should avoid the locking at the command stream end
<marex>
austriancoder: I think we are talking roughly about the same
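The batching idea being circled here can be roughed out in a few lines. This is a sketch under assumptions, with all names invented: instead of draw calls piping straight into the shared command-stream BO (which forces locking per draw), each context records into a private batch and only the flush touches the shared stream.

```c
/* Rough model of batching draws before touching the shared command
 * stream (all names hypothetical, not the etnaviv gallium driver):
 * the lock is taken once per flush instead of once per draw. */
#include <assert.h>
#include <pthread.h>

#define MAX_CMDS 64

struct cmd_stream {                     /* stands in for the shared BO */
	pthread_mutex_t lock;
	unsigned int cmds[MAX_CMDS];
	int len;
};

struct batch {                          /* private per-context recording */
	unsigned int cmds[MAX_CMDS];
	int len;
};

static void batch_draw(struct batch *b, unsigned int cmd)
{
	b->cmds[b->len++] = cmd;        /* no locking needed here */
}

static void batch_flush(struct batch *b, struct cmd_stream *cs)
{
	pthread_mutex_lock(&cs->lock);  /* one lock per flush, not per draw */
	for (int i = 0; i < b->len; i++)
		cs->cmds[cs->len++] = b->cmds[i];
	b->len = 0;
	pthread_mutex_unlock(&cs->lock);
}
```

As austriancoder notes, real batching (as in freedreno) also tracks dependencies on resources per batch; this sketch only shows the locking angle.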
<mntmn>
austriancoder: what happens after that? :D
<austriancoder>
mntmn: update the change with my r-b, remove the wip and I will do the rest
<marex>
mntmn: add the RB to your commit (git commit --amend ...) and repush
<marex>
mntmn: that's necessary to increase efficiency of the process ... or something :)
<austriancoder>
bed time now
<mntmn>
ok, will do, thanks for explaining!
<mntmn>
marex: correct like this?
<marex>
mntmn: like what ? :)
<marex>
mntmn: ah well, I guess
<mntmn>
ok
<marex>
well where is lynxeye when you need him ...
<marex>
the MX8MM GPCv2 can fail, because if gpu2d exits imx_gpc_pu_pgc_sw_pxx_req() in one thread and is just before pm_runtime_put(), and gpu3d enters imx_gpc_pu_pgc_sw_pxx_req() and is just before pm_runtime_get_sync(), then depending on the order, the gpumix might just be enabled and disabled right away, followed by the gpu3d PU enabling, which obviously fails
<marex>
so PD nesting might need some locking work
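The race described above can be modeled with a refcounted parent domain. Only `imx_gpc_pu_pgc_sw_pxx_req`, `pm_runtime_get_sync` and `pm_runtime_put` come from the log; everything else here is an invented stand-in. The fix sketched is one possible shape: serialize the whole get -> pxx -> put sequence so the parent GPUMIX domain cannot bounce off/on in the middle of a child's power-up.

```c
/* Toy model of the GPUMIX nesting race (most names hypothetical):
 * with the whole sequence under pd_lock, the parent's usage count is
 * guaranteed to be nonzero while a child PU power-up is in flight. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t pd_lock = PTHREAD_MUTEX_INITIALIZER;
static int gpumix_refs;         /* ~runtime PM usage count of gpumix */

static void parent_get(void) { gpumix_refs++; }
static void parent_put(void) { gpumix_refs--; }

/* ~imx_gpc_pu_pgc_sw_pxx_req() with the runtime PM calls inside the lock */
static void *pu_power_up(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&pd_lock);
	parent_get();               /* ~pm_runtime_get_sync(gpumix) */
	/* ...PGC PUP request + handshake would happen here... */
	assert(gpumix_refs >= 1);   /* parent must stay up during the request */
	parent_put();               /* ~pm_runtime_put(gpumix) */
	pthread_mutex_unlock(&pd_lock);
	return NULL;
}
```

Without the lock, one thread's final put can drop the count to zero between another thread's get and its PUP request, which is exactly the enable/disable bounce described above.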