ChanServ changed the topic of #etnaviv to: #etnaviv - the home of the reverse-engineered Vivante GPU driver - Logs https://oftc.irclog.whitequark.org/etnaviv
sravn has joined #etnaviv
pH5 has quit [Server closed connection]
pH5 has joined #etnaviv
mwalle has quit [Server closed connection]
mwalle has joined #etnaviv
cockroach has quit [Ping timeout: 480 seconds]
JohnnyonF has joined #etnaviv
JohnnyonFlame has quit [Ping timeout: 480 seconds]
frieder has joined #etnaviv
cengiz_io has quit [Quit: cengiz_io]
cengiz_io has joined #etnaviv
lynxeye has joined #etnaviv
frieder has quit [Quit: Leaving]
frieder has joined #etnaviv
<marex> mwalle: did you get anywhere with the MMUv2 issues?
<marex> mwalle: I was wondering if we could use ftrace to find out exactly which function triggers the problem in the driver, and possibly dump further context of the problem into the trace buffer
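A minimal sketch of the ftrace idea, assuming one instruments the suspected etnaviv code paths by hand: trace_printk() writes into the ftrace ring buffer alongside the function-trace events, so extra context (fault address, MTLB base, and so on) can be correlated with the surrounding calls and runtime-PM activity. The function below and its arguments are placeholders, not actual driver call sites.

```c
#include <linux/kernel.h>	/* trace_printk() */

/*
 * Placeholder function, not an actual etnaviv call site: drop calls
 * like this into the fault handler and the suspend/resume path, then
 * read /sys/kernel/debug/tracing/trace after reproducing the issue.
 */
static void dump_mmu_fault_context(u32 fault_addr, u32 mtlb_base)
{
	/* Ends up in the ftrace ring buffer with a timestamp and the
	 * issuing CPU/task, so it lines up with the function-trace
	 * events recorded around it. */
	trace_printk("MMU fault at 0x%08x, current MTLB base 0x%08x\n",
		     fault_addr, mtlb_base);
}
```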
pcercuei has joined #etnaviv
<mwalle> marex: not yet, I haven't had the time. But the error I am seeing is fairly simple to reproduce. (I still suspect some caching issue, because I get a fault for an address which was used in the previous context, i.e. the previous open().)
<mwalle> marex: I just haven't had time to look closely at (1) how it is supposed to work and (2) what is really happening
<marex> mwalle: ack, I also have the mp1 on my todo
<mwalle> marex: but it has something to do with the suspend/resume path, because as I said, if I disable the pm ops, the problem goes away
<mwalle> and there is still "DMA-API: cacheline tracking EEXIST, overlapping mappings aren't supported" with CONFIG_DMA_API_DEBUG=y
<mwalle> lynxeye: ^
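For reference, a minimal hypothetical (non-etnaviv) sketch of the pattern that particular CONFIG_DMA_API_DEBUG check complains about: two streaming DMA mappings that are active at the same time and cover the same cachelines. The device and buffer below are made up for illustration.

```c
#include <linux/dma-mapping.h>

/* Hypothetical driver fragment, purely to illustrate the warning. */
static void overlapping_map_example(struct device *dev, u8 *buf)
{
	dma_addr_t a, b;

	a = dma_map_single(dev, buf, 4096, DMA_TO_DEVICE);

	/* The second mapping overlaps the first while it is still
	 * active; the DMA debug code tracks mappings per cacheline and
	 * reports "cacheline tracking EEXIST, overlapping mappings
	 * aren't supported" for exactly this situation. */
	b = dma_map_single(dev, buf + 64, 4096, DMA_TO_DEVICE);

	dma_unmap_single(dev, b, 4096, DMA_TO_DEVICE);
	dma_unmap_single(dev, a, 4096, DMA_TO_DEVICE);
}
```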
<lynxeye> mwalle: Any indication which mapping is causing this issue?
frieder has quit [Ping timeout: 480 seconds]
frieder has joined #etnaviv
JohnnyonF has quit [Ping timeout: 480 seconds]
frieder_ has joined #etnaviv
frieder has quit [Ping timeout: 480 seconds]
<mwalle> lynxeye: the cacheline tracking? I don't know, I haven't dug deeper into that yet. Do you have DMA_API_DEBUG enabled? Maybe you see the same message
<lynxeye> mwalle: I can try with different systems later, but as a quick shot I tried enabling DMA_API_DEBUG on a i.MX6Q and haven't got any such messages.
<mwalle> lynxeye: ok, then I'll put that on my ever growing stack ;)
<mwalle> btw what happens if the refcount isn't zero after the _context_put() in https://elixir.bootlin.com/linux/v5.14-rc6/source/drivers/gpu/drm/etnaviv/etnaviv_gpu.c#L1590
<mwalle> this will leak some memory, no?
<mwalle> (a "WARN_ON(kref_read(&gpu->mmu_context->refcount));" just after the _context_put() call will trigger after a "recover hung GPU")
<lynxeye> mwalle: As long as the DRM client owning the context is still there, the context will not be freed at this time. This context put just marks that the GPU is no longer using the context.
<lynxeye> So the reference being non-zero at this point is valid.
<lynxeye> But you raise a good point. If a delay there causes a GPU hang recovery, then there's something wrong with the DRM scheduler. As the GPU is idle, the hang detection timer should have been stopped by the scheduler at this point.
<mwalle> lynxeye: but I should see an etnaviv_iommu_context_free(), which I don't
<lynxeye> mwalle: No, you shouldn't as long as the DRM client is alive. If the GPU is just going idle while the client, e.g. weston is still around, the context must not be freed at that point.
<lynxeye> The context will be freed when the GPU is no longer using it and the client has disappeared. Both of those things can happen in any order.
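A minimal kref sketch (generic, not the actual etnaviv code) of the lifetime rule being described here: the DRM client and the GPU each hold their own reference to the context, and the free callback only runs once both have been dropped, in whichever order that happens.

```c
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Generic stand-in for struct etnaviv_iommu_context. */
struct ctx {
	struct kref refcount;
	/* ... page tables, mappings ... */
};

static void ctx_free(struct kref *kref)
{
	kfree(container_of(kref, struct ctx, refcount));
}

/* Client opens the device: first reference. */
static struct ctx *ctx_create(void)
{
	struct ctx *c = kzalloc(sizeof(*c), GFP_KERNEL);

	if (c)
		kref_init(&c->refcount);	/* count = 1 (client) */
	return c;
}

/* GPU starts executing jobs from this context: second reference. */
static void gpu_begin_using(struct ctx *c)
{
	kref_get(&c->refcount);
}

/* GPU goes idle (e.g. runtime suspend): drops only its own reference,
 * so a non-zero count right after this put is expected, not a leak. */
static void gpu_stop_using(struct ctx *c)
{
	kref_put(&c->refcount, ctx_free);
}

/* Client exits: drops its reference; ctx_free() runs now if the GPU
 * reference is already gone, or later when the GPU drops its own. */
static void client_close(struct ctx *c)
{
	kref_put(&c->refcount, ctx_free);
}
```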
<mwalle> lynxeye: I'm just using glmark2-es2-drm, which has already terminated before I look for the _free call
* mwalle is getting sidetracked.. actually I wanted to look into what is causing the SMMU fault *g
<lynxeye> mwalle: So you don't get a free call at all? Maybe we are messing up the refcounts in the GPU recovery path...
<mwalle> excuse my dirty printks ;)
<lynxeye> mwalle: Yea, looks like messed up refcounts due to GPU recovery. Thanks for the report, I'll take a look.
<mwalle> lynxeye: btw, the SMMU fault is caused by an access to an old base address of the MTLB. I.e. I start glmark2, the MTLB is set up, everything works. I stop and start glmark2 again, and then there is a high chance that the GPU tries to access a page of the old MTLB.
<mwalle> Oddly enough, if I disable the pm_ops the GPU still gets a different address for the MTLB with each start of glmark2, but I don't get the fault
frieder_ has quit [Remote host closed the connection]
<lynxeye> mwalle: This sounds similar to the issue marex described. When the GPU is restarted after the runtime PM cycle, does the register indicate that the MMU is still set up?
<lynxeye> If you do full GPU reset in the runtime resume, does the issue disappear?
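A rough sketch of that second experiment, using placeholder helpers (the *_sketch names below are not the driver's real functions, and the actual etnaviv runtime-resume callback is structured differently): unconditionally reset and reprogram the core on runtime resume, so nothing of the pre-suspend MMU setup can survive into the next glmark2 run.

```c
#include <linux/pm_runtime.h>

struct etnaviv_gpu;

/* Placeholder helpers, not the real etnaviv functions. */
int sketch_clk_enable(struct etnaviv_gpu *gpu);
void sketch_full_hw_reset(struct etnaviv_gpu *gpu);
void sketch_hw_reinit(struct etnaviv_gpu *gpu);

static int sketch_rpm_resume(struct device *dev)
{
	struct etnaviv_gpu *gpu = dev_get_drvdata(dev);
	int ret;

	ret = sketch_clk_enable(gpu);
	if (ret)
		return ret;

	/*
	 * Experiment: always do a full core reset and re-initialisation
	 * here instead of trusting whatever state survived the runtime
	 * suspend.  If the MTLB faults disappear with this in place,
	 * that points at stale MMU state being carried across the
	 * suspend/resume cycle.
	 */
	sketch_full_hw_reset(gpu);
	sketch_hw_reinit(gpu);

	return 0;
}
```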
JohnnyonFlame has joined #etnaviv
lynxeye has quit [Quit: Leaving.]
chewitt has quit [Quit: Zzz..]
chewitt has joined #etnaviv
JohnnyonFlame has quit [Ping timeout: 480 seconds]
chewitt has quit [Remote host closed the connection]
JohnnyonFlame has joined #etnaviv
pcercuei has quit [Quit: brb]
pcercuei has joined #etnaviv
chewitt has joined #etnaviv
JohnnyonF has joined #etnaviv
JohnnyonFlame has quit [Ping timeout: 480 seconds]
pcercuei has quit [Quit: dodo]