karolherbst has quit [Read error: Connection reset by peer]
karolherbst has joined #etnaviv
chewitt has quit [Quit: Adios!]
afaerber has quit [Ping timeout: 260 seconds]
afaerber has joined #etnaviv
Net147 has quit [Quit: Quit]
Net147 has joined #etnaviv
Net147 has quit [Quit: Quit]
Net147 has joined #etnaviv
lynxeye has joined #etnaviv
pcercuei has joined #etnaviv
<marex>
hm, is it possible that the vivante GPU is overwriting memory randomly ?
<marex>
and if so, is there some way to debug that ?
<marex>
austriancoder: lynxeye: ^
<austriancoder>
there is an iommu .. so.. chances are low
<marex>
austriancoder: that iommu is disabled on the MP1 because ... hold on
<marex>
1db012790446 ("drm/etnaviv: move linear window on MC1.0 parts if necessary")
<marex>
this
<marex>
but wait, that only talks about fast clear, I thought there was something about mmuv1 too
<marex>
austriancoder: the issue I am seeing happens once every 3-4 hours under heavy load ... so chances of triggering it are indeed low
<marex>
austriancoder: it is the same issue I have been looking into for the past 2-3 weeks btw
<austriancoder>
do you see 'only' mis-renderings or even crashed apps etc. due to overwriting memory?
<austriancoder>
maybe there is a race condition regarding bo's
<marex>
austriancoder: I see the machine prints BUG about corrupted page
<marex>
austriancoder: and sometimes the machine even reboots
<marex>
austriancoder: I think something like that is happening
<marex>
austriancoder: but how do you debug that ?
<lynxeye>
marex: on MMUv1 you have a 2GB linear window, so you can bypass the MMU with reads/writes through that window
<lynxeye>
MMUv2 has real isolation and all accesses go through the MMU, so on MMUv2 chances for random memory corruption are much lower
<austriancoder>
stm32 should have mmuv2
<lynxeye>
austriancoder: You sure about this? My recollection is that the STM32 is MMUv1 (but MC2.0).
<lynxeye>
marex: You can force buffers to be mapped through the MMU with a BO flag, but that only helps if your issue isn't caused by random state corruption. Also MMUv1 isn't able to trigger exception IRQs, but just retargets accesses to the bad page. So you need to manually check that the bad page magic isn't overwritten in order to find out if some access is going astray.
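lynxeye's suggestion (manually checking that the bad-page magic isn't overwritten) could be sketched roughly as below. This is a hypothetical illustration: the magic value, page size, and helper names are assumptions for the example, not the actual constants from the etnaviv kernel driver.

```python
import struct

# Hypothetical fill pattern for illustration; the real magic used by the
# etnaviv kernel driver for its MMUv1 bad page may differ.
BAD_PAGE_MAGIC = 0xDEAD55AA
PAGE_SIZE = 4096

def make_bad_page() -> bytes:
    """Build a page filled with the repeating 32-bit magic pattern."""
    return struct.pack("<I", BAD_PAGE_MAGIC) * (PAGE_SIZE // 4)

def find_corruption(page: bytes):
    """Return byte offsets of 32-bit words that no longer hold the magic,
    i.e. places where a stray (retargeted) GPU access landed."""
    bad = []
    for off in range(0, len(page), 4):
        (word,) = struct.unpack_from("<I", page, off)
        if word != BAD_PAGE_MAGIC:
            bad.append(off)
    return bad
```

Given a dump of the bad page (e.g. obtained via a debug kernel patch), an empty result from `find_corruption` means no access was retargeted there; any non-empty result means something went astray.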
<austriancoder>
lynxeye: I just booted up a stm32: minor_features1: 0xbe13b219 & chipMinorFeatures1_MMU_VERSION (0x10000000) --> ETNAVIV_IOMMU_V2
<lynxeye>
austriancoder: Okay, thanks, so my memory was wrong.
<austriancoder>
time for a debugfs patch to print the mmu version :)
<marex>
austriancoder: but then why disable TS on MP1 ?
<marex>
but then that might mean there is a random state corruption
<austriancoder>
marex: mmu and mc are two different things
<marex>
austriancoder: from what I understand from the above, MP1 should be MC2 and IOMMU2 ?
<austriancoder>
no: minor_features0 is 0xe1299fff; chipMinorFeatures0_MC20 (0x00400000) would indicate MC2.0, but that bit is not set --> stm32 = MC1 + MMUv2
<lynxeye>
marex: Nope, seems I mixed things up. According to feature bits the GC400 on STM32 is MC1.0, but MMUv2.
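The two feature-bit tests quoted in the conversation can be reproduced with a small sketch. The bit values are the ones austriancoder pasted; the function name and string labels are made up for illustration.

```python
# Feature-bit masks as quoted above (names follow the etnaviv-style
# chipMinorFeatures* macros).
chipMinorFeatures0_MC20 = 0x00400000
chipMinorFeatures1_MMU_VERSION = 0x10000000

def classify(minor_features0: int, minor_features1: int):
    """Derive memory-controller and MMU generation from the feature words."""
    mc = "MC2.0" if minor_features0 & chipMinorFeatures0_MC20 else "MC1.0"
    mmu = "MMUv2" if minor_features1 & chipMinorFeatures1_MMU_VERSION else "MMUv1"
    return mc, mmu

# Values read from the GC400 on the STM32MP1:
print(classify(0xE1299FFF, 0xBE13B219))  # -> ('MC1.0', 'MMUv2')
```

This matches lynxeye's conclusion: the MC2.0 bit is clear but the MMU-version bit is set, so the GC400 on the STM32 is MC1.0 with MMUv2.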
<marex>
meaning my problem is likely state corruption ?
<lynxeye>
marex: Nope, meaning that it is very unlikely that the GPU is writing to unmapped regions (that could be a result of state corruption) as MMUv2 signals bad writes via an exception. So if it's really the GPU corrupting sysmem, the address must be mapped into the MMU address space.
<marex>
lynxeye: I am seeing various MMU errors in the kernel log here and there, could those be related then ?
chewitt has joined #etnaviv
<austriancoder>
marex: if there is an mmu exception for etnaviv you could try to collect a devcoredump
<austriancoder>
marex: that repo contains a udev rule and a devcoredump extractor. sudo make install should work. Then if an etnaviv mmu exception happens you get all the information you need to find the cause.
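The extractor austriancoder describes could look roughly like this, assuming the generic Linux devcoredump sysfs interface (dumps appear under /sys/class/devcoredump/devcdN/data; reading yields the dump, writing anything to the node releases it). The function name and paths are illustrative, not taken from the actual repo.

```python
import glob
import os

def collect_devcoredumps(outdir="/tmp",
                         sysfs_glob="/sys/class/devcoredump/devcd*/data"):
    """Copy pending device coredumps to outdir and release them.

    A udev rule triggering on the devcoredump device would automate
    exactly this step whenever e.g. an etnaviv MMU exception fires.
    """
    saved = []
    for data in sorted(glob.glob(sysfs_glob)):
        name = os.path.basename(os.path.dirname(data))
        dest = os.path.join(outdir, name + ".bin")
        with open(data, "rb") as src, open(dest, "wb") as dst:
            dst.write(src.read())
        # Writing to the data node tells the kernel to free the dump.
        with open(data, "wb") as node:
            node.write(b"1")
        saved.append(dest)
    return saved
```

The saved blob can then be fed to a decoder to inspect the GPU state at the time of the fault.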
<austriancoder>
fyi: during deqp runs in CI I always get a devcoredump
<marex>
austriancoder: I didn't manage to crash the machine when running deqp in a loop for days