#linux-msm on 2021-12-15 — irc logs at oftc.irclog.whitequark.org

2021-07-26 22:56 ChanServ changed the topic of #linux-msm to:

01:10 cxl000 has joined #linux-msm

01:46 <abhinav___> robclark: agreed. slight delta in the sspp status registers (fetch related) but apart from that no significant difference. We can continue debugging on the bug.

01:47 <abhinav___> regarding DSPP, those are actually histogram status bits

01:47 <abhinav___> it should not make a difference for this issue but weird its 0 for one case and not for the other

01:48 <robclark> abhinav___: yeah, they aren't regs driver writes, afaict.. but maybe a hint about what is wrong?

01:49 <robclark> it's weird.. but I compared a bunch of different good vs good and bad vs bad (and good vs bad) to rule out things that are normally different, and that was the only thing that stood out

02:08 <abhinav___> robclark: got it, yes those are read only regs so driver cannot write to them. one immediate suggestion i can think of is to try the DSI test pattern to see if that comes up. if that works, then we can go further up

02:09 <robclark> ok.. not sure if I remember how to get test pattern.. but if you reply on b/ I can try it in the morning

02:09 <abhinav___> https://gitlab.freedesktop.org/drm/msm/-/blob/msm-next/drivers/gpu/drm/msm/dsi/dsi_manager.c#L434

02:09 <robclark> the bridge doesn't seem to be raising any error status bits about dsi signal from SoC, fwiw

02:10 <abhinav___> sure i will update the bug

02:10 <abhinav___> will continue there

02:10 <robclark> sg

02:21 elroo_ has quit [Ping timeout: 480 seconds]

02:40 <sboyd> abhinav___: where should I call msm_dsi_manager_tpg_enable() from?

03:33 <abhinav___> sboyd: this worked for me

03:33 <abhinav___> https://www.irccloud.com/pastebin/mgaylJ4S/

04:52 marvin24_ has joined #linux-msm

04:55 marvin24 has quit [Ping timeout: 480 seconds]

05:20 marvin24 has joined #linux-msm

05:24 marvin24_ has quit [Ping timeout: 480 seconds]

05:27 marvin24_ has joined #linux-msm

05:30 marvin24 has quit [Ping timeout: 480 seconds]

06:18 pevik_ has joined #linux-msm

06:46 <sboyd> abhinav___: yes that works for me. I get the test pattern now.

07:39 pg12 has joined #linux-msm

07:41 pg12_ has quit [Ping timeout: 480 seconds]

12:24 svarbanov has joined #linux-msm

14:56 agross_ has joined #linux-msm

14:57 CosmicPenguin_ has joined #linux-msm

14:57 pundir_ has joined #linux-msm

14:58 thara_ has joined #linux-msm

14:59 thara has quit [Ping timeout: 480 seconds]

14:59 thara_ is now known as thara

15:00 CosmicPenguin has quit [Ping timeout: 480 seconds]

15:01 agross has quit [Ping timeout: 480 seconds]

15:01 agross_ is now known as agross

15:02 pundir has quit [Ping timeout: 480 seconds]

15:02 pundir_ is now known as pundir

15:31 rawouF is now known as rawoul

15:40 bamse_ has joined #linux-msm

15:40 bamse has quit [Read error: Connection reset by peer]

17:02 svarbanov has quit [Ping timeout: 480 seconds]

17:16 <robclark> abhinav___, sboyd: test pattern works for me too.. but without the test pattern, if I plug in a bogus address for scanout buffer I get the expected iova faults.. so I guess whatever is going wrong is between sspp and dsi?

17:36 <lumag_> robclark, could you maybe compare clocks values (debugcc)?

17:37 <robclark> clk_summary dumps more or less matched.. there were some diffs in some qup clocks, which I assume is related to having serial console enabled in the working case

17:37 <lumag_> robclark, no, I was talking about https://github.com/andersson/debugcc

17:38 <lumag_> It was helpful to me in several obscure cases already

17:38 <robclark> haven't tried that.. but maybe that is the testclock thing sboyd tried?

17:38 <lumag_> robclark, dunno

17:38 <robclark> hmm, does that thing even have sc7180 support?

17:39 <lumag_> robclark, no, but you can easily write it if you have downstream debug_cc driver

17:39 <lumag_> Just a huuuge table of all the muxes.

17:43 <sboyd> Looks like testclock

17:43 <lumag_> sboyd, what is testclock? :-)

17:44 <sboyd> it's the script that debugcc is based on

17:44 <robclark> I think this is it? https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/refs/heads/main/chipset-qc7180/dev-util/testclock/files/testclock.py

17:44 <sboyd> lumag_: what clk needs to be measured?

17:45 <sboyd> robclark: yeah that's it

17:46 <lumag_> sboyd, I was asking robclark if there is a clock measurements diff between good and bad tries

17:47 <robclark> I can stash good/bad $debugfs/clk/clk_summary somewhere in a min if that is useful.. and/or try this testclock.py thing

17:47 <lumag_> robclark, the testclock is probably more useful

17:47 pevik_ has quit [Quit: Lost terminal]

17:54 <robclark> hmm, that script looks to have some other py dependencies that I'm missing

18:07 <sboyd> It depends on the mem script

18:12 <robclark> ok, any particular clk you want?

18:12 <lumag_> robclark, no, nothing particular

18:13 CosmicPenguin_ is now known as CosmicPenguin

18:14 <robclark> hmm, doesn't seem to be a convenient way to dump them all

18:16 <lumag_> I usually did `sdm845-debugcc -a | grep disp`

18:24 <robclark> there are some *slight* differences which I guess are rounding errors.. other than that pwrcl_clk is somewhat different (no idea what that is)

18:25 <sboyd> Can you print(clocks)

18:25 <robclark> good:

18:25 <robclark> https://www.irccloud.com/pastebin/TB6i2z2l/

18:25 <robclark> bad:

18:25 <robclark> https://www.irccloud.com/pastebin/MeIFq3lB/

18:25 * robclark just made a wrapper script to dump each clk individually

18:30 <robclark> just dumping pwrcl_clk a few times, it seems to vary wildly so I guess that is unrelated.. I assume the small differences in disp related clks is simply due to how they are sampled?
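
[Editor's note] The good/bad comparison discussed above can be automated so that sampling jitter is ignored and only real rate changes are reported. The following is a minimal sketch, not the actual testclock.py tooling. It assumes a hypothetical dump format of one "<clock name> <rate in Hz>" pair per line (the pastebin contents are not part of this log), and the helper names parse_dump and diff_clocks are made up for illustration.

```python
def parse_dump(text):
    """Parse '<clock name> <rate in Hz>' lines into a dict (assumed format)."""
    rates = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, rate = parts
        try:
            rates[name] = float(rate)
        except ValueError:
            continue  # skip lines whose second field is not a number
    return rates


def diff_clocks(good, bad, rel_tol=0.01):
    """Return {name: (good_rate, bad_rate)} for clocks that differ by more
    than rel_tol (relative), or that appear in only one of the dumps."""
    out = {}
    for name in sorted(set(good) | set(bad)):
        g = good.get(name)
        b = bad.get(name)
        if g is None or b is None:
            out[name] = (g, b)
            continue
        ref = max(abs(g), abs(b))
        if ref and abs(g - b) / ref > rel_tol:
            out[name] = (g, b)
    return out


# Made-up sample values, only to show the behaviour.
good = parse_dump("disp_cc_mdss_mdp_clk 300000000\ndisp_cc_mdss_pclk0_clk 140000000\n")
bad = parse_dump("disp_cc_mdss_mdp_clk 300000123\ndisp_cc_mdss_pclk0_clk 70000000\n")
print(diff_clocks(good, bad))
```

With a 1% tolerance the small measurement noise on the first clock is dropped and only the halved clock is reported. A clock such as pwrcl_clk, which varies from sample to sample, would still show up and has to be judged by hand.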

18:32 <sboyd> pwrcl is the CPU

18:33 <sboyd> I ran testclock -h | grep disp_cc | xargs testclock

18:34 <sboyd> pwrcl stands for power cluster

18:35 <robclark> ahh

18:35 <sboyd> https://www.irccloud.com/pastebin/VFmdbFRf/

18:36 <sboyd> also this works

18:36 <robclark> depending on what $thing you are running on, those might not be comparable to mine.. my dumps were from lazor

18:36 <sboyd> https://www.irccloud.com/pastebin/vxnmu3Ww/

18:36 <sboyd> this is on coachz

18:36 <robclark> but doesn't look to me like any interesting difference between the good and bad dumps

18:44 <sboyd> I would suspect that the test pattern wouldn't work if the clks were off

18:46 <abhinav___> yes for test pattern to work the DSI clocks need to be right. we have to check dpu side

18:50 <robclark> IME the bridge status regs would also have some error bits set if it was unhappy with the dsi signal

19:06 <sboyd> so how do I look at DPU?

19:06 <robclark> sideways?

19:07 <robclark> what are you looking for?

19:11 <sboyd> robclark: heh abhinav___ said we have to check dpu side

19:12 <robclark> you can grab that series I posted that adds debugfs to dump all the register state.. I already attached good/bad dumps to the b/

19:13 <abhinav___> sboyd: we can continue on the bug, please attach the clock states as well to it.

19:19 <robclark> I can do that

19:33 <sboyd> ok if you want to make it slower we can debug in the bug :)

19:33 <sboyd> at least the paper trail is there

19:37 <abhinav___> sboyd: most likely kalyans team will look into this as its a chrome dpu issue. they dont have IRC access, hence its better to continue there

19:38 <sboyd> abhinav___: so the assignee is wrong?

19:38 <abhinav___> Yes, but no worries i can reassign

19:39 <lumag_> sboyd, robclark: regarding atomic_print_state. Would you like for me to send the v2 with const, or just fixing it at the apply time is fine?

19:42 <robclark> lumag_: squashing that fix in when applying is fine

19:43 <robclark> abhinav___: in the mean time, if there are other things you can think of trying, we can iterate on IRC and copy/paste updates into b/

19:43 <lumag_> robclark, ack

19:43 <abhinav___> robclark: sure , will do that too

19:53 <Marijn[m]> lumag_: Not sure if you intended to cc me on the clock-controller patches; send-email only seems to pick my address up as cc from the reviewed-by tags ;)

19:54 <lumag_> Marijn[m], yep, please excuse me. Just getting torn between several patchseries

19:55 <Marijn[m]> No worries, I can download it off of lore and stitch things back together (not subscribed to any list because it is way too noisy, and I haven't found a proper way to deduplicate that with cc'd/to'd mails)

19:59 <lumag_> Marijn[m], ugh. Then I'd double my excuses.

19:59 <Marijn[m]> lumag_: Not your problem, it's something to deal with in general. Suggestions welcome :)

20:26 <Marijn[m]> lumag_: Thanks for commenting on the DSI "ref" patches too, I hope bamse_ and sboyd think it's ready too :)

20:27 <Marijn[m]> lumag_: One little issue remaining for the clock cleanup, almost on the finish line :)

20:27 <Marijn[m]> (Unfortunately 8996 has a bunch other drivers remaining that could use the same conversion :( )

20:31 <lumag_> Marijn[m], yep.

20:35 <lumag_> Marijn[m], I'll probably finish the 8996 conversion at some point. Then I'll probably look onto 8916, 8960/8064 and 8939 (minor fixes).

20:35 <lumag_> If nobody picks them before I have time for them.

20:36 <Marijn[m]> lumag_: I don't want to pick up much of these without having a board/device under my nose to test it on, glad if you can tackle 8996 and all the other SoCs mentioned, none of those are in my possession.

20:37 <Marijn[m]> And of course time 😅

20:38 <lumag_> Marijn[m], time is the worst thing

20:40 <lumag_> always in a shortage

20:40 <Marijn[m]> lumag_: Yeah, especially when doing this as a hobby. Have an almost infinite backlog of patches that still need cleaning and sending :(

20:40 <Marijn[m]> So much progress on device bringup but none makes it to the lists

20:42 <lumag_> Marijn[m], yep. Some of the clocks patches from the set you've been reviewing date back to April

20:43 <Marijn[m]> lumag_: One-upping that one: the 8976 patches that are starting to reach the list date back to summer 2019 :)

20:46 <lumag_> oh, my.

20:47 <lumag_> 2019 was all about crypto, iMX and Zynq for me.

20:51 * vknecht[m] could test 8939 stuff, provided with instructions how to properly check the change is ok

20:53 <vknecht[m]> 8916 too, but minecrell is the expert for that one ;-)

21:06 Daanct12 has joined #linux-msm

21:12 Danct12 has quit [Ping timeout: 480 seconds]

21:14 Danct12 has joined #linux-msm

21:21 Daanct12 has quit [Ping timeout: 480 seconds]

21:25 svarbanov has joined #linux-msm

21:30 <steev> is patchwork broken? i'm not seeing any patches since the 13th - https://patchwork.kernel.org/project/linux-arm-msm/list/

21:36 <Marijn[m]> steev: Yup, I wondered the same, seems to be stuck. I got pointed to https://patches.linaro.org/project/linux-arm-msm/list/, but that doesn't appear to list every patch?

21:39 <steev> they may not mirror stuff that bamse marks as queued/not applicable?

21:40 <steev> but it's almost missing everything from today

21:40 <robclark> abhinav___: one thing that occurred to me.. I guess the CRC/MISR stuff is happening post-mixer.. I guess I could run some igt test which uses CRCs and see if they match in good/bad state?

21:42 <abhinav___> robclark: yes MISR is after mixer. we can try that. however i dont know how to interpret it. if it matches, then great but if it doesnt match it does not necessarily mean its bad. so for example, lets say the image being sent out itself is black ( hypothetical but possible ), then MISR will be different but thats a genuine mismatch and expected

21:43 <robclark> yeah, just need to pick a test that comes up with the same crc each time.. and maybe hack it up to print the crc

21:46 <abhinav___> that actually brings me to the other thing i was planning to do, the black screen you are seeing could be the interface border color ( used when pipes are not fetching anything or pipes are not connected ) OR what if the buffer itself is black

21:46 <abhinav___> we can change the interface border color as one try, I would like to try it first before sharing

21:47 <abhinav___> for the second one, when you do see the blank screen, is there a way to do a buffer dump to see if its actually black? its hypothetical but we have seen things like this before

21:47 <robclark> I hacked in a memset(ptr, 0xff, 4096) to test one of those theories.. I guess I can try hacking the solid-fill color for the other..

21:48 <robclark> also, w/ fbcon you should be able to `cat /dev/urandom > /dev/fb0` and see "snow" (which I don't)

21:49 <robclark> for fill color, I guess intf_timing_params::border_clr ?

21:50 <abhinav___> correct

21:50 <abhinav___> so technically

21:50 <abhinav___> https://www.irccloud.com/pastebin/sxp7zXrq/

21:50 <abhinav___> changing the border_clr here

21:50 <abhinav___> which is black by default

21:51 <abhinav___> to some other color

21:51 <abhinav___> should work

21:51 <abhinav___> I just havent tried it

21:52 svarbanov has quit [Ping timeout: 480 seconds]

21:52 <robclark> yup, that is what I'm trying..

21:52 <robclark> (also changed the underflow color to something different)

21:52 <abhinav___> underflow is blue

21:52 <abhinav___> so if it were underflow you should have seen blue

21:52 <robclark> ok, I wasn't sure if 0xff was alpha channel or not..

21:53 <abhinav___> no its RGB so 0xff is blue for sure
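
[Editor's note] The point about 0xff can be checked by unpacking the value into channels. This is a small sketch that assumes the 24-bit 0xRRGGBB layout described above, with no alpha byte. The helper name unpack_rgb is hypothetical and is not part of the driver.

```python
def unpack_rgb(value):
    """Split a 24-bit 0xRRGGBB value into (red, green, blue) bytes."""
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)


print(unpack_rgb(0x0000FF))  # only the lowest byte is set: blue
print(unpack_rgb(0xFF0000))  # only the highest byte is set: red
```

Under that layout, 0xff sets only the lowest byte, which is the blue channel, so an underflow would be expected to show up as a blue screen rather than a black one.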

21:53 <robclark> k

21:53 svarbanov has joined #linux-msm

21:55 <robclark> nope, still black screen

22:21 <robclark> abhinav___: huh.. so it seems like setting a non-zero fill color => black screen!

22:22 <abhinav___> ok, so i was reading up on the interface border color which we just changed. That one will not affect the result. it has a different meaning. we can try changing two other places

22:22 <abhinav___> static void dpu_hw_lm_setup_border_color(struct dpu_hw_mixer *ctx,

22:22 <abhinav___>         struct dpu_mdss_color *color,

22:22 <abhinav___>         u8 border_en)

22:22 <abhinav___> {

22:22 <abhinav___>     struct dpu_hw_blk_reg_map *c = &ctx->hw;

22:22 <abhinav___>     if (border_en) {

22:22 <abhinav___>         DPU_REG_WRITE(c, LM_BORDER_COLOR_0,

22:22 <abhinav___>             (color->color_0 & 0xFFF) |

22:22 <abhinav___>             ((color->color_1 & 0xFFF) << 0x10));

22:22 <abhinav___>         DPU_REG_WRITE(c, LM_BORDER_COLOR_1,

22:22 <abhinav___>             (color->color_2 & 0xFFF) |

22:22 <abhinav___>             ((color->color_3 & 0xFFF) << 0x10));

22:22 <abhinav___>     }

22:22 <abhinav___> }

22:22 <abhinav___> this one is the mixer color

22:23 <abhinav___> if we call this function and if this color shows up

22:23 <abhinav___> then pipe itself is not staged

22:23 <robclark> ok.. I was wondering what we were changing in intf.. because on mdp5 it was mixer where I needed to set it :-P

22:23 <abhinav___> which is unlikely

22:23 <abhinav___> because from the reg dump pipe is staged

22:23 <abhinav___> but still lets call this
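
[Editor's note] The register packing in the pasted dpu_hw_lm_setup_border_color() can be reproduced outside the kernel to work out which raw values a chosen border color would produce. The sketch below mirrors only the arithmetic from the paste: four 12-bit components, two per 32-bit register, with the second component shifted up by 16 bits. The function name pack_border_color is hypothetical, and which component maps to which color channel is not stated in the log, so no channel names are assumed here.

```python
def pack_border_color(color_0, color_1, color_2, color_3):
    """Return the (LM_BORDER_COLOR_0, LM_BORDER_COLOR_1) values as packed
    by the pasted function: each component is masked to 12 bits, the low
    one sits at bit 0 and the high one at bit 16."""
    reg0 = (color_0 & 0xFFF) | ((color_1 & 0xFFF) << 0x10)
    reg1 = (color_2 & 0xFFF) | ((color_3 & 0xFFF) << 0x10)
    return reg0, reg1


reg0, reg1 = pack_border_color(0xFFF, 0x000, 0x000, 0xFFF)
print(hex(reg0), hex(reg1))
```

Values wider than 12 bits are truncated by the mask, which matters when a component is passed in as an 8-bit or 16-bit quantity from elsewhere.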

22:24 <abhinav___> then there is a second color

22:24 <abhinav___> /*

22:24 <abhinav___>  * These updates have to be done immediately before the plane flush

22:24 <abhinav___>  * timing, and may not be moved to the atomic_update/mode_set functions.

22:24 <abhinav___>  */

22:24 <abhinav___> if (pdpu->is_error)

22:24 <abhinav___>     /* force white frame with 100% alpha pipe output on error */

22:24 <abhinav___>     _dpu_plane_color_fill(pdpu, 0xFFFFFF, 0xFF);

22:24 <abhinav___> this is for the solid color of the plane

22:24 <abhinav___> so if there is an issue with fetch

22:24 <abhinav___> of the plane

22:24 <abhinav___> it should show this

22:24 <abhinav___> I would suggest changing this to two different colors

22:24 <abhinav___> and see which one comes

22:24 <robclark> hmm, is setup_border_color() not actually called anywhere? I guess maybe we haven't tested having non fullscreen planes only?

22:24 <abhinav___> and it will tell us where the issue is

22:25 <abhinav___> correct its not called today

22:26 <abhinav___> regarding interface color which i asked to change earlier, it only takes effect when there is less active pixels on the screen than what interface is programmed for

22:26 <abhinav___> so that has a different use-case

22:26 <abhinav___> i am hoping one of these tell us the clue

22:33 <robclark> abhinav___: even this => black screen

22:33 <robclark> https://www.irccloud.com/pastebin/9yULEDmx/

22:33 <robclark> on mdp5 it was a bit more involved.. we had to set an enable bit, and skip the base stage in the mixer

22:35 <abhinav___> alright, no surprise on this one as the pipe was staged

22:35 <abhinav___> so plane solid color has to be the next

22:36 <robclark> see mdp5_ctl_blend(), fwiw..

22:39 <robclark> abhinav___: I suppose the pdpu->is_error path might not be tested?

22:39 <robclark> https://www.irccloud.com/pastebin/UUOKDLEu/

22:41 svarbanov has quit [Ping timeout: 480 seconds]

22:44 <lumag_> robclark, mea culpa. I don't think I tested it

22:45 <robclark> I guess it is *normally* an error path..

22:46 <robclark> ie. so we don't necessarily have a good way to test it

22:46 <robclark> ok, I'm going back to the igt/crc idea

22:49 <abhinav___> robclark: yes probably not tested. we can work on that in the bug and then try it once its fixed. its an important test

22:50 <robclark> I'd be open to addition of some way to test it.. maybe some debugfs switch and corresponding igt test?

22:51 <abhinav___> yes will figure it out. the crash seems to be from scaler path [ 4.078512] dpu_hw_setup_scaler3+0x524/0x7e4

22:51 <abhinav___> [ 4.078515] _dpu_hw_sspp_setup_scaler3+0x78/0xa8 so lumag_ another fix there?

22:53 <lumag_> I'll try taking a look in one of the next days

22:57 <robclark> for reference, I did add an igt test that uses debugfs to disable hw GPU hang detection so that we had a way to test the sw timer based fallback hang detection.. https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/msm_recovery.c#L133 .. we could use igt_debugfs_write() helpers in a similar way for tests to simulate/force error cases to get better test coverage

23:01 lumag_ has quit [Ping timeout: 480 seconds]

23:02 lumag_ has joined #linux-msm

23:04 svarbanov has joined #linux-msm

23:19 <robclark> abhinav___: ok.. confirmed that crc's change between good and bad state.. with blank screen the CRCs are all zero

23:29 <lumag_> robclark, just out of curiosity. Could you please trace CTL_START/CTL_FLUSH writes?

23:30 <lumag_> and compare timings and values being flushed/written

23:32 <robclark> I was kinda wishing for a debugfs way to trigger CTL_FLUSH write.. I suppose I could do it with devmem (ie. if the theory is we haven't flushed some updates)

23:50 <abhinav___> robclark: this certainly means frame wasnt pushed out. since issue is log sensitive, drm_trace with ctl_flush might be a better option

23:53 <robclark> ok, not really seeing any difference in CTL_FLUSH values.. seems like CTL_START is not written in either case?

23:58 <robclark> abhinav___: fwiw, regular printk is fine.. and even drm.debug enabled, you can still see the issue.. as long as loglevel is low enough, it doesn't slow things down enough to matter much, IME.. it is when you need to pump the debug msgs out the uart that the timing really changes