austriancoder has quit [Ping timeout: 272 seconds]
smurray has joined #etnaviv
austriancoder has joined #etnaviv
berton has joined #etnaviv
lynxeye has joined #etnaviv
<marex>
lynxeye: hey
<marex>
lynxeye: I have a few fixes for that PGC, but I also noticed one thing about the hang
<marex>
lynxeye: it seems like the hangs are clock related, as if there is some weird clock glitch
<marex>
lynxeye: it almost seems like it might be a particularly bad idea to let the PGC and etnaviv manage the clock at the same time
<lynxeye>
marex: I'm all ears
<marex>
austriancoder: also, MR6454, that was reproducible only after some weeks of runtime (ew)
<marex>
lynxeye: all ears regarding the clock or what ?
<lynxeye>
Yep, I would be very interested to know what you found
<marex>
lynxeye: well, look at what the TFA does, I suspect they enable all the clocks in gpumix for that exact reason
<marex>
lynxeye: because if you keep enabling and disabling the clock, the GPU cluster becomes unstable
<lynxeye>
driving the clocks from both PGC and etnaviv is what we do on i.MX6 and i.MX8M
<marex>
lynxeye: except there you drive the clock for the entire cluster, right ?
<lynxeye>
marex: is it the _bus clock that does funky things for you?
<marex>
lynxeye: I didn't identify the clock just yet
<marex>
lynxeye: btw you should sync the HSK only after you turn the PGC PUP request on
<marex>
lynxeye: and tear the HSK down before PDN
<marex>
my understanding of that HSK is that it disconnects some bus bridge
<lynxeye>
marex: yeah, I thought about splitting power up/down in the gpc driver to better handle this sequencing
<lynxeye>
yep, the HSK drives the amba domain bridges
<marex>
lynxeye: that's likely what needs to be done
<marex>
lynxeye: the current way is awful
<lynxeye>
agreed
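The HSK sequencing marex describes above can be sketched as a small self-contained C model. All function and register names here are invented stand-ins, not the real imx-gpcv2 driver API; the point is purely the ordering: power-up request first, then handshake sync, and on the way down, handshake teardown before the power-down request.

```c
/* Toy model of the sequencing discussed above (all names hypothetical):
 * the ADB handshake (HSK) gates a bus bridge, so it must only be synced
 * once the PGC power-up request is in flight, and must be torn down
 * before the power-down request is asserted. */
#include <assert.h>
#include <string.h>

static const char *trace[4];
static int n;

static void step(const char *what) { trace[n++] = what; }

/* stand-ins for the PGC PUP/PDN request bits and the handshake bits */
static void pgc_pup_req(void)   { step("pup_req"); }
static void hsk_sync(void)      { step("hsk_sync"); }
static void hsk_teardown(void)  { step("hsk_down"); }
static void pgc_pdn_req(void)   { step("pdn_req"); }

static void domain_power_up(void)
{
	pgc_pup_req();   /* 1. request power-up from the PGC */
	hsk_sync();      /* 2. only now reconnect the bus bridge */
}

static void domain_power_down(void)
{
	hsk_teardown();  /* 1. disconnect the bus bridge first */
	pgc_pdn_req();   /* 2. then request power-down */
}
```

Splitting power up/down in the gpc driver, as suggested above, would make this ordering explicit instead of implicit.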
<marex>
lynxeye: but it seems like maybe we should rather do as the imx-scu does, with .start / .stop
<marex>
because with the current setup, the platform_device_add turns the power domain on and off on boot for no good reason
<marex>
but the comment in soc-imx-scu.c explains that all NXP drivers are broken and need to be fixed first :(
<marex>
lemme find that one
<marex>
ah, here drivers/firmware/imx/scu-pd.c
<marex>
lynxeye: I sent you the patches I have now, still needs work
<marex>
back to digging in the clock stuff
<lynxeye>
marex: Re patch 2: shouldn't regmap already guard the register access?
<marex>
lynxeye: I _think_ I've seen concurrent access there, so I just tossed the mutex in to be sure
<marex>
lynxeye: and no, I think regmap only serializes specific accesses, not across the entire function call
<marex>
lynxeye: i.e. regmap_update_bits is called with a lock held, but the entire Pxx function isn't called with a lock
<marex>
lynxeye: I suspect if you get both gpu-2d and gpu-3d access the HSK at the same time, that might mess things up
<marex>
for example
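The distinction being made here, that regmap serializes each individual register access but not the multi-access Pxx sequence, can be modeled with two mutexes. This is a sketch under assumptions, not the real driver: `access_lock` plays the role of regmap's internal lock, `seq_lock` is the mutex marex tossed in, and the register and bit names are invented.

```c
/* Model of per-access vs. whole-sequence locking (names hypothetical).
 * Each single access is serialized, like regmap_update_bits(), but the
 * Pxx power sequence is several accesses, so gpu-2d and gpu-3d threads
 * could interleave between them without the outer mutex. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t access_lock = PTHREAD_MUTEX_INITIALIZER; /* ~regmap internal */
static pthread_mutex_t seq_lock = PTHREAD_MUTEX_INITIALIZER;    /* ~added mutex */
static unsigned int hsk_reg;

/* one serialized access, analogous to regmap_update_bits() */
static void reg_update_bits(unsigned int mask, unsigned int val)
{
	pthread_mutex_lock(&access_lock);
	hsk_reg = (hsk_reg & ~mask) | (val & mask);
	pthread_mutex_unlock(&access_lock);
}

/* the whole multi-access sequence is made atomic by seq_lock */
static void *pxx_sequence(void *bit)
{
	unsigned int b = *(unsigned int *)bit;

	pthread_mutex_lock(&seq_lock);
	reg_update_bits(b, b);  /* request the handshake for this domain */
	reg_update_bits(b, 0);  /* clear it again */
	pthread_mutex_unlock(&seq_lock);
	return NULL;
}
```

Without `seq_lock`, nothing stops the two domains' accesses to the shared HSK register from interleaving mid-sequence, which is exactly the concern raised above.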
<lynxeye>
marex: And patch 3 is wrong from my understanding. This bit needs to be set before the power down, otherwise the domain won't power down when you trigger the pdn_req
<marex>
lynxeye: shouldn't the bit then be set unconditionally ?
<marex>
lynxeye: the PCR needs to be set before power up at least, and likely also before power down then
<marex>
lynxeye: surely enable_power_control (which = !on) is wrong too
<lynxeye>
marex: the doc says the bit should be asserted before pdn_req and should not be changed until the domain is completely powered up again.
<lynxeye>
so yep, setting it all the time looks like the right thing to do
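The rule lynxeye quotes from the docs can be captured as a tiny checker. All names here are placeholders, not actual imx-gpcv2 registers: the power-control bit must already be set when pdn_req is asserted and must not be touched until the domain is completely powered up again, which setting it once (e.g. at probe) trivially satisfies.

```c
/* Minimal checker for the documented rule (names hypothetical):
 * PCR set before pdn_req, and unchanged until power-up completes. */
#include <assert.h>

static int pcr_set;
static int rule_ok = 1;

static void set_pcr(void) { pcr_set = 1; }

static void pdn_req(void)
{
	if (!pcr_set)           /* bit must already be set at power-down */
		rule_ok = 0;
}

static void pup_req(void)
{
	if (!pcr_set)           /* ...and still set across power-up */
		rule_ok = 0;
}
```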
<marex>
lynxeye: lemme recheck what NXP does in their TFA too
<lynxeye>
marex: they switch it before powering up, which is inconsistent with the docs
<marex>
lynxeye: yep :)
<marex>
lynxeye: I would expect that is the sensible thing to do though, no?
<marex>
lynxeye: I mean, if that bit enables "power control"
<lynxeye>
marex: I think we should just both enable the CPU mapping and power control on probe of the domain drivers
<lynxeye>
I see no reason to ever switch those two things while the driver is active
<marex>
lynxeye: I suspect it is TFA defensive programming
<marex>
lynxeye: btw do you use NXP TFA fork or upstream one ?
<marex>
lynxeye: because they do different PGC init
<lynxeye>
marex: I'm still using downstream NXP TFA, because I can't get the DRAM to reclock with upstream TFA
<lynxeye>
and too many other fires right now
<marex>
yep, very much the reason I got back to you only nowish
<marex>
lynxeye: I'll keep digging and let you know if I find something
<marex>
lynxeye: let's ask daniel in linux-imx about the PGC
<daniels>
lynxeye: speaking of downstream imx, did you ever get a chance to test the last dcss patchset & push to drm-misc?
<marex>
daniels: dcss is the DSI stuff ?
<daniels>
marex: it's the display controller in imx8mq (and maybe also imx8m for some display types?)
<lynxeye>
daniels: Still on my list. I dropped the ball a bit, as there were still discussions ongoing while I was on vacation and I'm still trying to dig out from the usual after-holidays craziness
<daniels>
fair enough, good luck :)
<lynxeye>
nope, dcss is only on the i.MX8MQ. All the others only have the simple eLCDIF controller
<daniels>
ah, I thought there was one which had elcdif + dcss for different display paths
<daniels>
but maybe that's the only bit about imx8 they made even a little bit simple ;)
<lynxeye>
the MQ has both DCSS and eLCDIF
<daniels>
ah, right
<marex>
lynxeye: wasn't lcdif parallel RGB only?
<lynxeye>
apparently DCSS was a bit too big/power hungry for the scaled down variants
<marex>
lynxeye: ah duh, that pdn_req, it's not asserted only by the PGC I think, there might be others
<marex>
lynxeye: I need to check again
shoragan has quit [Ping timeout: 240 seconds]
shoragan has joined #etnaviv
_daniel_ has joined #etnaviv
JohnnyonFlame has quit [Read error: Connection reset by peer]
<marex>
lynxeye: um, btw, SRC_GPU_RCR is shared between gpu2d and gpu3d, and the reset is only released after both GPU PDs are up
<marex>
lynxeye: maybe that's why you can't turn them on/off separately
<lynxeye>
marex: Yea, I'm not exactly sure about this thing. We don't really use the SRC reset, exactly because it's shared (even on MX6 the GPUs have a shared reset from the SRC), but I'm not sure what the PGC does to those resets
<marex>
lynxeye: well I wouldn't be surprised if it did some de-glitch
<marex>
lynxeye: which could explain the odd behavior of gpu2d
<lynxeye>
marex: yea, maybe we actually need to smash those domains together, like we did with the PCIe domains on 8mq
<lynxeye>
which would be a shame, as now we have 3(!) power domains for the GPUs, but still can't gate them separately
<marex>
lynxeye: how does it not surprise me at all
<marex>
lynxeye: I think maybe the people who implemented the TFA code can tell us more about that reset
_daniel_ has quit [Quit: Leaving.]
<lynxeye>
marex: eLCDIF is just the display controller with parallel output. On i.MX8MM there is a MIPI DSI bridge attached to the controller, on MX8MP they even added the LVDS bridge again.
<marex>
ha
JohnnyonFlame has joined #etnaviv
lynxeye has quit [Quit: Leaving.]
<mntmn>
on i.MX8MQ, either DCSS or LCDIF can drive MIPI-DSI, but only DCSS can drive HDMI/DP.
<marex>
this PGC is total madness
<marex>
turn on VPUMIX, PU indicates GPUMIX PD failed to turn on
<marex>
wt-actual-f
T_UNIX has quit [Quit: Connection closed for inactivity]
berton_ has joined #etnaviv
berton has quit [Ping timeout: 264 seconds]
berton_ has quit [Client Quit]
_daniel_ has joined #etnaviv
<cphealy>
marex: For your MR6454, what i.MX platform was this on?
<mntmn>
but sadly glamor/xorg needs 2 fixes that will never be accepted I guess
<mntmn>
(remove 1 glClear() and add 1 glFinish())
<austriancoder>
cphealy: should not matter as it is a general multi-context problem
<cphealy>
I was asking as we were doing some work with qtwebengine too and I'm curious which cores it effects. GC2000? GC3000? GC7000L?
<austriancoder>
cphealy: all.. as it is a general driver problem
<austriancoder>
not connected to any specific core
<cphealy>
ack
_daniel_ has quit [Quit: Leaving.]
<marex>
austriancoder: thanks for the review, I'll update the patch and then ask for another month of testing (since that's how long the original one was under test)
<austriancoder>
marex: hmm.. I know such tests from my day job and it sucks to not have a faster reproducer
<marex>
austriancoder: it's OK, I have a bugfix locally, upstream will have to wait a month or two
<austriancoder>
marex: we can land the first patch really fast if that's okay for you
<marex>
austriancoder: the first patch (remove whatever function) is useless without the locking, FYI
<austriancoder>
marex: I know - that's why we can land it faster
<marex>
sounds totally pointless to me :)
<austriancoder>
okay
<austriancoder>
so you can drop this patch altogether :)
<marex>
no, because the function has broken locking
<marex>
and I'm not gonna fix it, since it's unused
<austriancoder>
marex: okay.. but if it's not used we can drop it now - or not? I am happy with the change when you (or I) remove the stable marker
<marex>
it's just waiting for someone to use it and break the state, just remove it
<marex>
I was really considering adding the same fixes: tag to it
<austriancoder>
aha .. so shall we drop it now in master or do you want to wait until you get test feedback?
<marex>
it doesn't matter without the locking fix; with the locking fix, the function is broken
<marex>
without it, it was also broken
<marex>
I find applying the useless half of a patchset minus the actual bugfixes kinda pointless
<austriancoder>
marex: puhh.. in master etna_resource_get_status(..) has no callers -> dead code -> can be removed without your other patch. correct?
<marex>
austriancoder: I pushed what I think is harmless update to the bugfix patch, I didn't test it, it should address most of your concerns
<marex>
I'm more concerned about the performance of this enormous convoluted locking than about newlines
<austriancoder>
marex: okay.. but do you want me to wait ~1 month to get your test feedback before landing it? If yes I would love to land "etnaviv: Remove etna_resource_get_status()" in the next 2-3 days.
<marex>
austriancoder: I believe with this update, you can land them both
<marex>
austriancoder: if you need any more involved changes, then you need to wait
<marex>
austriancoder: because putting untested crap upstream isn't good
<austriancoder>
marex: thats why I ask
* austriancoder
hates untested crap
<marex>
look at the changes, they're newlines + that bool, that should be harmless
<marex>
austriancoder: note that the patch also should be easy to backport, so splitting functions isn't what I want to see in a bugfix
<marex>
austriancoder: although I do agree that the function you pointed out is horrific
<austriancoder>
marex: patch is okay.. but you now have too many new lines. I only wanted them after the ifs (as I wrote in the review)
<austriancoder>
/* if resource has no pending ctx's reset its status */
<austriancoder>
if (_mesa_set_next_entry(rsc->pending_ctx, NULL) == NULL)
<austriancoder>
mtx_unlock(&rsc->lock);
<austriancoder>
marex: but let's land it
<marex>
austriancoder: and backport to stable, should still apply to 20.1.y at least
<marex>
I still think the locking should be somehow reworked and it should be possible to make it much simpler ... or ?
<austriancoder>
marex: yep
<marex>
in fact, how come mesa doesn't have something generic ?
<marex>
I mean, this must be a problem with other GPUs too
<austriancoder>
marex: but that needs to wait or we have a fix after another round of testing
<marex>
austriancoder: well obviously
<marex>
austriancoder: I'm just concerned that this convoluted locking will have other even nastier warts
<marex>
austriancoder: and it is just too difficult to reason about the correctness of this
<austriancoder>
marex: me too .. but I don't want to block/delay your work that fixes a real world problem. performance-wise you should have a feeling if it got much worse or not
<marex>
austriancoder: performance-wise it's the same
<marex>
austriancoder: got any ideas about how to go about the locking rework ?
<austriancoder>
marex: I think batching would be the way to look into - like done in other drivers
<marex>
austriancoder: iirc robclark explained to me at some point there's a specific reason for batching in freedreno
<marex>
but I was too green back then to understand it fully (still am)
<marex>
austriancoder: and that it might not be necessary for vivante
<austriancoder>
aha
<marex>
austriancoder: it had to do with being able to interrupt the command stream and flush it at any point I think
<marex>
you should be able to do it on vivante, and not on freedreno
<austriancoder>
I am not sure about this statement .. but for the moment I am fine with it as I want to work on other etnaviv stuff :)
<marex>
right, because on vivante, we generate the entire BO with the command stream and then flush it into the kernel
<marex>
so uh ... the generation would have to be turned into something like ... ummm
<marex>
somehow the generation of the command stream would have to be changed so it's not just adding into a BO
<marex>
or something
<marex>
hmmmm
<marex>
austriancoder: I don't think I want to touch the locking for a bit, it makes my stomach turn
<marex>
austriancoder: but uh, isn't the ^ some form of batching already ?
<austriancoder>
marex: hmm.. na... it is more about describing draws/clears etc. in a batch that describes dependencies on resources etc.
<austriancoder>
but.. I have put it on my long list of todos
<marex>
austriancoder: well right now the draw calls almost directly pipe stuff into the final command stream, so somehow batching them should avoid the locking at the command stream end
<marex>
austriancoder: I think we are talking roughly about the same
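The batching idea being circled here can be roughed out in a few lines. This is a sketch under assumptions, with all names invented: instead of draw calls piping straight into the shared command-stream BO (which forces locking per draw), each context records into a private batch and only the flush touches the shared stream.

```c
/* Rough model of batching draws before touching the shared command
 * stream (all names hypothetical, not the etnaviv gallium driver):
 * the lock is taken once per flush instead of once per draw. */
#include <assert.h>
#include <pthread.h>

#define MAX_CMDS 64

struct cmd_stream {                     /* stands in for the shared BO */
	pthread_mutex_t lock;
	unsigned int cmds[MAX_CMDS];
	int len;
};

struct batch {                          /* private per-context recording */
	unsigned int cmds[MAX_CMDS];
	int len;
};

static void batch_draw(struct batch *b, unsigned int cmd)
{
	b->cmds[b->len++] = cmd;        /* no locking needed here */
}

static void batch_flush(struct batch *b, struct cmd_stream *cs)
{
	pthread_mutex_lock(&cs->lock);  /* one lock per flush, not per draw */
	for (int i = 0; i < b->len; i++)
		cs->cmds[cs->len++] = b->cmds[i];
	b->len = 0;
	pthread_mutex_unlock(&cs->lock);
}
```

As austriancoder notes, real batching (as in freedreno) also tracks dependencies on resources per batch; this sketch only shows the locking angle.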
<mntmn>
austriancoder: what happens after that? :D
<austriancoder>
mntmn: update the change with my r-b, remove the wip and I will do the rest
<marex>
mntmn: add the RB to your commit (git commit --amend ...) and repush
<marex>
mntmn: that's necessary to increase efficiency of the process ... or something :)
<austriancoder>
bed time now
<mntmn>
ok, will do, thanks for explaining!
<mntmn>
marex: correct like this?
<marex>
mntmn: like what ? :)
<marex>
mntmn: ah well, I guess
<mntmn>
ok
<marex>
well where is lynxeye when you need him ...
<marex>
the MX8MM GPCv2 can fail, because if gpu2d exits imx_gpc_pu_pgc_sw_pxx_req() in one thread and is just before pm_runtime_put(), and gpu3d enters imx_gpc_pu_pgc_sw_pxx_req() and is just before pm_runtime_get_sync(), then depending on the order, the gpumix might just be enabled and disabled right away, followed by the gpu3d PU enabling, which obviously fails
<marex>
so PD nesting might need some locking work
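The race described above can be modeled with a refcounted parent domain. Only `imx_gpc_pu_pgc_sw_pxx_req`, `pm_runtime_get_sync` and `pm_runtime_put` come from the log; everything else here is an invented stand-in. The fix sketched is one possible shape: serialize the whole get -> pxx -> put sequence so the parent GPUMIX domain cannot bounce off/on in the middle of a child's power-up.

```c
/* Toy model of the GPUMIX nesting race (most names hypothetical):
 * with the whole sequence under pd_lock, the parent's usage count is
 * guaranteed to be nonzero while a child PU power-up is in flight. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t pd_lock = PTHREAD_MUTEX_INITIALIZER;
static int gpumix_refs;         /* ~runtime PM usage count of gpumix */

static void parent_get(void) { gpumix_refs++; }
static void parent_put(void) { gpumix_refs--; }

/* ~imx_gpc_pu_pgc_sw_pxx_req() with the runtime PM calls inside the lock */
static void *pu_power_up(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&pd_lock);
	parent_get();               /* ~pm_runtime_get_sync(gpumix) */
	/* ...PGC PUP request + handshake would happen here... */
	assert(gpumix_refs >= 1);   /* parent must stay up during the request */
	parent_put();               /* ~pm_runtime_put(gpumix) */
	pthread_mutex_unlock(&pd_lock);
	return NULL;
}
```

Without the lock, one thread's final put can drop the count to zero between another thread's get and its PUP request, which is exactly the enable/disable bounce described above.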