ChanServ changed the topic of #etnaviv to: #etnaviv - the home of the reverse-engineered Vivante GPU driver - Logs https://oftc.irclog.whitequark.org/etnaviv
sravn has joined #etnaviv
pH5 has quit [Server closed connection]
pH5 has joined #etnaviv
mwalle has quit [Server closed connection]
mwalle has joined #etnaviv
cockroach has quit [Ping timeout: 480 seconds]
JohnnyonF has joined #etnaviv
JohnnyonFlame has quit [Ping timeout: 480 seconds]
frieder has joined #etnaviv
cengiz_io has quit [Quit: cengiz_io]
cengiz_io has joined #etnaviv
lynxeye has joined #etnaviv
frieder has quit [Quit: Leaving]
frieder has joined #etnaviv
<marex> mwalle: did you get anywhere with the MMUv2 issues?
<marex> mwalle: I was wondering if we could use ftrace to find out exactly which function triggers the problem in the driver, and possibly dump further context of the problem into the trace buffer
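A minimal sketch of the ftrace idea, assuming one instruments the suspected etnaviv code paths by hand: trace_printk() writes into the ftrace ring buffer alongside the function-trace events, so extra context (fault address, MTLB base, and so on) can be correlated with the surrounding calls and runtime-PM activity. The function below and its arguments are placeholders, not actual driver call sites.

```c
#include <linux/kernel.h>	/* trace_printk() */

/*
 * Placeholder function, not an actual etnaviv call site: drop calls
 * like this into the fault handler and the suspend/resume path, then
 * read /sys/kernel/debug/tracing/trace after reproducing the issue.
 */
static void dump_mmu_fault_context(u32 fault_addr, u32 mtlb_base)
{
	/* Ends up in the ftrace ring buffer with a timestamp and the
	 * issuing CPU/task, so it lines up with the function-trace
	 * events recorded around it. */
	trace_printk("MMU fault at 0x%08x, current MTLB base 0x%08x\n",
		     fault_addr, mtlb_base);
}
```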
pcercuei has joined #etnaviv
<mwalle> marex: not yet, I haven't had the time. But the error I am seeing is fairly simple to reproduce. (I still suspect some caching issue, because I get a fault for an address which was used in the previous context, i.e. the previous open().)
<mwalle> marex: I just haven't had time to look closely at (1) how it is supposed to work and (2) what is really happening
<marex> mwalle: ack, I also have the mp1 on my todo
<mwalle> marex: but it has something to do with the suspend/resume path, because as I said, if I disable the pm ops, the problem goes away
<mwalle> and there is still "DMA-API: cacheline tracking EEXIST, overlapping mappings aren't supported" with CONFIG_DMA_API_DEBUG=y
<mwalle> lynxeye: ^
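For reference, a minimal hypothetical (non-etnaviv) sketch of the pattern that particular CONFIG_DMA_API_DEBUG check complains about: two streaming DMA mappings that are active at the same time and cover the same cachelines. The device and buffer below are made up for illustration.

```c
#include <linux/dma-mapping.h>

/* Hypothetical driver fragment, purely to illustrate the warning. */
static void overlapping_map_example(struct device *dev, u8 *buf)
{
	dma_addr_t a, b;

	a = dma_map_single(dev, buf, 4096, DMA_TO_DEVICE);

	/* The second mapping overlaps the first while it is still
	 * active; the DMA debug code tracks mappings per cacheline and
	 * reports "cacheline tracking EEXIST, overlapping mappings
	 * aren't supported" for exactly this situation. */
	b = dma_map_single(dev, buf + 64, 4096, DMA_TO_DEVICE);

	dma_unmap_single(dev, b, 4096, DMA_TO_DEVICE);
	dma_unmap_single(dev, a, 4096, DMA_TO_DEVICE);
}
```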
<lynxeye> mwalle: Any indication which mapping is causing this issue?
frieder has quit [Ping timeout: 480 seconds]
frieder has joined #etnaviv
JohnnyonF has quit [Ping timeout: 480 seconds]
frieder_ has joined #etnaviv
frieder has quit [Ping timeout: 480 seconds]
<mwalle> lynxeye: the cacheline tracking? I don't know, I haven't dug deeper into that yet. Do you have DMA_API_DEBUG enabled? Maybe you see the same message
<lynxeye> mwalle: I can try with different systems later, but as a quick shot I tried enabling DMA_API_DEBUG on a i.MX6Q and haven't got any such messages.
<mwalle> lynxeye: ok, then I'll put that on my ever growing stack ;)
<mwalle> btw what happens if the refcount isn't zero after the _context_put() in https://elixir.bootlin.com/linux/v5.14-rc6/source/drivers/gpu/drm/etnaviv/etnaviv_gpu.c#L1590
<mwalle> this will leak some memory, no?
<mwalle> (a "WARN_ON(kref_read(&gpu->mmu_context->refcount));" just after the _context_put() call will trigger after a "recover hung GPU")
<lynxeye> mwalle: As long as the DRM client owning the context is still there, the context will not be freed at this time. This context put just marks that the GPU is no longer using the context.
<lynxeye> So the reference being non-zero at this point is valid.
<lynxeye> But you raise a good point. If a delay there causes a GPU hang recovery, then there's something wrong with the DRM scheduler. As the GPU is idle, the hang detection timer should have been stopped by the scheduler at this point.
<mwalle> lynxeye: but I should see an etnaviv_iommu_context_free(), which I don't
<lynxeye> mwalle: No, you shouldn't as long as the DRM client is alive. If the GPU is just going idle while the client, e.g. weston is still around, the context must not be freed at that point.
<lynxeye> The context will be freed when the GPU is no longer using it and the client has disappeared. Both of those things can happen in any order.
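A minimal kref sketch (generic, not the actual etnaviv code) of the lifetime rule being described here: the DRM client and the GPU each hold their own reference to the context, and the free callback only runs once both have been dropped, in whichever order that happens.

```c
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Generic stand-in for struct etnaviv_iommu_context. */
struct ctx {
	struct kref refcount;
	/* ... page tables, mappings ... */
};

static void ctx_free(struct kref *kref)
{
	kfree(container_of(kref, struct ctx, refcount));
}

/* Client opens the device: first reference. */
static struct ctx *ctx_create(void)
{
	struct ctx *c = kzalloc(sizeof(*c), GFP_KERNEL);

	if (c)
		kref_init(&c->refcount);	/* count = 1 (client) */
	return c;
}

/* GPU starts executing jobs from this context: second reference. */
static void gpu_begin_using(struct ctx *c)
{
	kref_get(&c->refcount);
}

/* GPU goes idle (e.g. runtime suspend): drops only its own reference,
 * so a non-zero count right after this put is expected, not a leak. */
static void gpu_stop_using(struct ctx *c)
{
	kref_put(&c->refcount, ctx_free);
}

/* Client exits: drops its reference; ctx_free() runs now if the GPU
 * reference is already gone, or later when the GPU drops its own. */
static void client_close(struct ctx *c)
{
	kref_put(&c->refcount, ctx_free);
}
```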
<mwalle> lynxeye: I'm just using glmark2-es2-drm, which has already terminated before I look for the _free call
* mwalle is getting sidetracked.. actually I wanted to look into what is causing the SMMU fault *g
<lynxeye> mwalle: So you don't get a free call at all? Maybe we are messing up the refcounts in the GPU recovery path...
<mwalle> excuse my dirty printks ;)
<lynxeye> mwalle: Yea, looks like messed up refcounts due to GPU recovery. Thanks for the report, I'll take a look.
<mwalle> lynxeye: btw, the SMMU fault is caused by an access to an old base address of the MTLB. I.e. I start glmark2, the MTLB is set up, everything works. I stop and start glmark2 again, and then there is a high chance that the GPU tries to access a page of the old MTLB.
<mwalle> Oddly enough, if I disable the pm_ops the GPU still gets a different address for the MTLB with each start of glmark2, but I don't get the fault
frieder_ has quit [Remote host closed the connection]
<lynxeye> mwalle: This sounds similar to the issue marex described. When the GPU is restarted after the runtime PM cycle, does the register indicate that the MMU is still set up?
<lynxeye> If you do full GPU reset in the runtime resume, does the issue disappear?
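A rough sketch of that second experiment, using placeholder helpers (the *_sketch names below are not the driver's real functions, and the actual etnaviv runtime-resume callback is structured differently): unconditionally reset and reprogram the core on runtime resume, so nothing of the pre-suspend MMU setup can survive into the next glmark2 run.

```c
#include <linux/pm_runtime.h>

struct etnaviv_gpu;

/* Placeholder helpers, not the real etnaviv functions. */
int sketch_clk_enable(struct etnaviv_gpu *gpu);
void sketch_full_hw_reset(struct etnaviv_gpu *gpu);
void sketch_hw_reinit(struct etnaviv_gpu *gpu);

static int sketch_rpm_resume(struct device *dev)
{
	struct etnaviv_gpu *gpu = dev_get_drvdata(dev);
	int ret;

	ret = sketch_clk_enable(gpu);
	if (ret)
		return ret;

	/*
	 * Experiment: always do a full core reset and re-initialisation
	 * here instead of trusting whatever state survived the runtime
	 * suspend.  If the MTLB faults disappear with this in place,
	 * that points at stale MMU state being carried across the
	 * suspend/resume cycle.
	 */
	sketch_full_hw_reset(gpu);
	sketch_hw_reinit(gpu);

	return 0;
}
```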
JohnnyonFlame has joined #etnaviv
lynxeye has quit [Quit: Leaving.]
chewitt has quit [Quit: Zzz..]
chewitt has joined #etnaviv
JohnnyonFlame has quit [Ping timeout: 480 seconds]
chewitt has quit [Remote host closed the connection]
JohnnyonFlame has joined #etnaviv
pcercuei has quit [Quit: brb]
pcercuei has joined #etnaviv
chewitt has joined #etnaviv
JohnnyonF has joined #etnaviv
JohnnyonFlame has quit [Ping timeout: 480 seconds]
pcercuei has quit [Quit: dodo]