#aarch64-laptops on 2022-05-25 — irc logs at oftc.irclog.whitequark.org

2022-03-22 11:52 ChanServ changed the topic of #aarch64-laptops to: Linux support for AArch64 Laptops (Asus NovaGo TP370QL - HP Envy x2 - Lenovo Miix 630 - Lenovo Yoga C630)

00:02 systwi_ has joined #aarch64-laptops

00:06 systwi has quit [Ping timeout: 480 seconds]

00:44 <steev> no112: out of curiosity, does your gb2 run W11?

00:49 <qzed> What's an `Internal error: synchronous external abort: 96000010 [#1] SMP`?

00:52 <steev> a bad thing (tm)

00:55 <qzed> heh, that much I gathered by it (somewhat) crashing the system xD

00:55 <qzed> bamse: So I've rebased onto next-20220520 and tried to get the GPU working, following the sc7280 dts

00:56 <qzed> and naturally I'm running into issues again (including the above)

00:56 <qzed> full dmesg log and DT: https://gist.github.com/qzed/2288c651a017fcc016b524d3bac5a43e

00:57 <qzed> so essentially some dpu_kms stuff is failing with EPROBE_DEFERRED and also the panel can't read EDID data for some reason

00:58 <qzed> also sometimes the system locks up during boot

00:58 <qzed> there are also some weird issues when I build the GCC_8180 driver into the kernel vs. when I build it as a module...

00:59 <qzed> when I build it as a module (and include it in the initramfs) things (more or less) work (i.e. boot but problems above still exist)

00:59 <qzed> but when I build it into the kernel things lock up early on

01:02 <qzed> no112: there's some weirdness with grub on the Surface Pro X as well (it refuses to boot with some fedora grub patches applied). Seems a bit different though, but you could try https://github.com/linux-surface/grub-image-aarch64/releases/tag/grub-2.06

01:07 <bamse> qzed: i see the same problem on primus when i rebased today, i think there's something in the new edp probe path that is racy

01:07 <qzed> ah good to know I'm not the only one with issues

01:07 <bamse> qzed: and i think i had an error wrt issues powering up the gpu as well :/

01:08 <qzed> IIRC I read some comment about some panel probe stuff that needs to run synchronously... I'll try to find it, although I'm not sure if it's related

01:10 <qzed> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/gpu/drm/msm/dp/dp_display.c#n1563

01:12 <bamse> i see

01:12 <bamse> btw, in dp_display_unbind() you want to comment out the call to dp_catalog_hpd_config_intr()...

01:13 <bamse> because if the dp_parser probe defers on finding the panel, it will propagate that probe defer and in the cleanup path of the whole mdss it will attempt to unbind, but at that time the dp controller isn't yet clocked...so the register access in hpd_config_intr will fault and crash the system

01:13 <qzed> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/gpu/drm/msm/dp/dp_display.c#n1563

01:14 <qzed> bamse: ahh, thanks for that tip!

01:14 <bamse> don't know if there are any more patches trickling in related to this...so i guess we'll have to test this properly once -rc1 shows up, and fix it if anything remains

01:14 <qzed> yeah

01:19 <qzed> ah, somehow repeated the previous message instead of what I wanted to write...

01:20 <qzed> I'm not sure if that link is the probe/defer issue though as the panel-edp driver isn't marked with the async flag

01:20 <qzed> so if I'm not missing anything it should probe synchronously at that point

01:21 <qzed> and with the dp_catalog_hpd_config_intr() commented out it does get further, nice

01:22 <qzed> it can now even read edid successfully

01:23 <qzed> I guess I can ignore the `Trying to free already-free IRQ 191` for now but I'll try to have a look at that (comes up when msm_dpu is deferred, I guess that's a simple free-but-already-freed/not-yet-set-up bug)

01:32 <bamse> not sure if it's a double free or a free before allocation

01:33 <qzed> yeah haven't looked at it yet, assuming since it's failing some time during probe it's probably a free before allocation

01:34 <qzed> so at the moment it seems to (at minimum) fail eDP link training

01:35 <qzed> but there's also a warning about `rcg didn't update its configuration`

01:35 <qzed> (`disp_cc_mdss_edp_pixel_clk_src`)

01:35 <bamse> so you get through the aux dance of reading out edid etc?

01:35 <qzed> yeah, that seems to work

01:35 <qzed> at least it's spitting out the correct model number and complains about it not having an entry

01:36 <bamse> yeah, i get that warning as well...i think the rate change propagates unexpectedly when we power up the phy

01:36 <bamse> not sure what changed there...

01:37 <bamse> might have something to do with the fact that we do this dance while the display is active

01:37 <qzed> I have no idea on that

01:38 <bamse> so do you get: [drm:_dpu_kms_initialize_displayport:635] [dpu error]modeset_init failed for DP, rc = -517 ?

01:38 <qzed> yeah

01:39 <bamse> qzed: well i wrote the phy driver, so i don't have anyone to blame ;)

01:39 <bamse> qzed: same here...on sc8280xp

01:40 <qzed> I assume it's trying again at some point (the error being EPROBE_DEFER and all)
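
The numeric return codes quoted in this log map to symbolic errno names. A minimal sketch of that mapping follows, covering `rc = -517` above and the two GPU power-up codes (`-13`, `-110`) that appear later in the log. The table is hard-coded on the assumption of the standard Linux values; `EPROBE_DEFER` is kernel-internal, so Python's own `errno` module does not list it. The names `KERNEL_ERRNO` and `decode_rc` are hypothetical helpers for illustration only.

```python
# Linux errno values that show up in this log. EPROBE_DEFER (517) is a
# kernel-internal code and is not exposed to userspace, so it is listed
# by hand rather than looked up in Python's errno module.
KERNEL_ERRNO = {
    13: "EACCES",
    110: "ETIMEDOUT",
    517: "EPROBE_DEFER",
}


def decode_rc(rc: int) -> str:
    """Return the symbolic name for a negative kernel return code."""
    return KERNEL_ERRNO.get(-rc, f"unknown ({rc})")


if __name__ == "__main__":
    for rc in (-517, -13, -110):
        print(rc, decode_rc(rc))
```

A probe that returns `EPROBE_DEFER` is retried later by the driver core, which matches the assumption above that the display probe is attempted again.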

01:40 <qzed> at least it complains about `[drm:dp_ctrl_link_train [msm]] *ERROR* max v_level reached` later

01:41 <bamse> okay, nice...i don't get that far...

01:41 <bamse> i'm stuck in the infinite loop of that error

01:41 <qzed> oh, hmm...

01:42 <qzed> I assume you're also trying the DP via aux-bus?

01:43 <bamse> yes

01:43 <bamse> hmm, but i might have failed to get is_edp set

01:44 <qzed> the log is a bit weird on my end... so it first fails with -517, at the same time it fails to read the EDID

01:44 <qzed> then later it's able to read the EDID

01:45 <bamse> and you only have a single dp instance enabled?

01:46 <qzed> yeah, so far I've only added the edp, I haven't tried anything with the USB DP things yet

01:48 <qzed> I've updated the log at https://gist.github.com/qzed/2288c651a017fcc016b524d3bac5a43e in case it helps

01:48 <bamse> sweet, flagging the dp controller as EDP made is_edp go true and now i only get a single probe defer, presumably because the module isn't loaded yet...and on the next attempt the panel showed up and the edid read passed

01:48 <qzed> nice!

01:48 <qzed> sounds like the same thing I can see

01:49 <bamse> then i ran into clock issues, but that might be expected ;)

01:49 <qzed> the `rcg didn't update its configuration`?

01:50 <qzed> also does Arch not have the qcom/a640_gmu.bin firmware?

01:52 <qzed> oh huh, that's not in linux-firmware either

02:09 <qzed> okay, so with that firmware from some oneplus repo I now get `msm_dpu ae01000.mdp: [drm:adreno_load_gpu [msm]] *ERROR* Couldn't power up the GPU: -13`

02:09 <bamse> that's what i saw on the primus as well, so something must have changed in the last couple of weeks

02:09 <qzed> ah okay

02:10 <qzed> btw. is there some good way to get the qcom firmware (other than googling for some github firmware dump of some semi-related phone)?

02:14 <bamse> no, so i need to talk to qualcomm about that

02:18 <bamse> i thought we had support for some other a640 device out there already

02:20 <qzed> IIRC it looked like it based on your commit for the 680 gpu

02:20 <bamse> yeah, but i think the a640 firmware is shared with sm8150 or something

02:20 <bamse> i mean, the a680 uses a640, and that's also used in sm8150

02:21 <qzed> ah

02:22 <qzed> I'm kinda wondering how different the supposed a690 is compared to the a680...

02:22 <qzed> at least if you look in the windows device manager some strings read a680 so there really can't be that much of a difference

02:22 <qzed> (and so far using the a680 definitions for that seems to work)

02:23 <qzed> so the "couldn't power up" error comes from https://elixir.bootlin.com/linux/v5.18/source/drivers/base/power/runtime.c#L773

02:24 <qzed> meaning `dev->power.disable_depth > 0` and it's not active

02:24 <bamse> you got a a690?

02:24 <qzed> yeah, the microsoft "SQ2" chip has one

02:25 <bamse> according to the downstream sources, the a690 uses the a660 firmware...

02:25 <bamse> so it seems to be different from a680

02:25 <qzed> ah

02:25 <qzed> hmm interesting

02:26 <qzed> so it should use `a660_sqe.fw` and `a660_gmu.bin`?

02:27 <bamse> and i presume we're in the same boat then, because the sc8280xp has a690 as well

02:27 <qzed> oh neat

02:27 <bamse> yes

03:12 hexdump0815 has joined #aarch64-laptops

03:13 hexdump01 has quit [Ping timeout: 480 seconds]

05:38 jhovold has joined #aarch64-laptops

05:47 <steev> shouldn't the surface use the signed firmware from windows?

07:18 exit70 has quit [Quit: ZNC 1.8.2 - https://znc.in]

07:19 exit70 has joined #aarch64-laptops

07:28 samueldr_ has quit [Remote host closed the connection]

07:29 samueldr has joined #aarch64-laptops

07:36 <Dylanger> <qzed> "okay, so with that firmware from..." <- I was getting an error like this on the Duet 5 too, it didn't actually do anything tho

07:52 iivanov has joined #aarch64-laptops

08:34 matthias_bgg has joined #aarch64-laptops

08:54 <no112> steev: It is eligible but isn't currently upgradeable to W11 (Regional timing thing, as I am in Australia)

08:54 <no112> qzed: Huh interesting. Will have a look at that Grub. Thanks!

09:18 <qzed> steev: I can get most of it from windows (mostly the man files) but haven't found any gpu firmware :/

09:18 <qzed> *mbn, not man

09:19 <qzed> or is that packaged up into some other file somehow?

10:26 <steev> qzed: on older machines it was called something along the lines of qcdxkmsucXXXX.mbn

10:26 <steev> e.g. on my c630 it's qcdxkmsuc850.mbn

10:45 <qzed> yeah, I have that under the "zap-shader" node

10:47 <qzed> but I couldn't find the others (aXXX_sqe.fw and aXXX_gmu.bin)

13:26 <bamse> steev: the zap is signed, the sqe and gmu are not signed

13:26 <steev> ah

13:27 <bamse> steev: but unfortunately the sqe and gmu are not available in windows...so i guess they are baked into the driver or something

13:33 <steev> that's quite possible

14:16 <no112> For what it's worth regarding GPU FW, I've been using the copies from Linaro (linked on the aa64 laptops wiki)

14:20 <robclark> qzed: my expectation is that a690 is pretty close to a660 and "7c3"... ie. a6xx subgen 4.. haven't looked too much on the kernel side, which is where most of the work should be, although probably shouldn't be so bad.. userspace would need something like: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/common/freedreno_devices.py#L344 .. I assume num_sp_cores/num_ccu will be 4, but a cmdstream trace

14:20 <robclark> would confirm that and the couple other params.. maybe if we are lucky we can find an android blob driver that supports a690 and spoof the gpu-id, IIRC that is how we did it for a680

14:24 matthias_bgg has quit [Ping timeout: 480 seconds]

14:29 <bamse> robclark: for a680 we just sprinkled the fairy dust in the kernel and added a section in mesa and it happened to work

14:29 <no112> the classic story of ARM development :-P

14:33 matthias_bgg has joined #aarch64-laptops

14:57 <robclark> bamse: on the userspace side there are a couple magic params we needed to figure out.. pretty sure I managed to get an a680 cmdstream trace from somewhere for those. (But once you have a trace it isn't too hard to figure out and only a few line patch in mesa to add the device ;-))

15:36 <qzed> robclark: thanks! I'll see what I can do once we have the kernel errors fixed

15:37 <qzed> basic tty should work without mesa, right?

15:38 <robclark> yup

15:51 matthias_bgg has quit []

15:58 no112 has quit [Quit: Leaving]

18:27 jhovold has quit [Ping timeout: 480 seconds]

20:43 <qzed> bamse: I think I know what causes the `adreno 2c00000.gpu: [drm:adreno_load_gpu [msm]] *ERROR* Couldn't power up the GPU: -13`

20:44 <qzed> I think it's related to the EPROBE_DEFER

20:44 <qzed> at least it goes through a driver remove cycle

20:44 <qzed> at some point it reaches `adreno_unbind()`

20:45 <qzed> that then calls first `pm_runtime_force_suspend()` and then `gpu->funcs->destroy()`

20:45 <qzed> however both call `pm_runtime_disable()` on the GPU device

20:46 <qzed> the latter via `a6xx_destroy()` and `adreno_gpu_cleanup()`

20:47 <qzed> so we start out with `dev->power.disable_depth = 1` before the whole probe/remove cycle but end it with `dev->power.disable_depth = 2`

20:48 <qzed> so the next call that should enable the GPU just decreases `dev->power.disable_depth` by one, which means it's still disabled
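
The imbalance described in the last few messages can be followed with a toy counter. This is a sketch in Python, not kernel code: `ToyDevice`, `runtime_resume` and `simulate` are hypothetical stand-ins, and the resume check is reduced to the single `disable_depth > 0` condition mentioned earlier in the log. It only illustrates why two disables per probe/remove cycle leave the device disabled after the next enable.

```python
EACCES = 13  # Linux errno value behind the "-13" in the log


class ToyDevice:
    """Toy stand-in for the runtime-PM state of a device."""

    def __init__(self):
        self.disable_depth = 1  # runtime PM starts out disabled


def pm_runtime_enable(dev):
    # Each enable undoes exactly one earlier disable.
    if dev.disable_depth > 0:
        dev.disable_depth -= 1


def pm_runtime_disable(dev):
    dev.disable_depth += 1


def runtime_resume(dev):
    # Simplified: a device that is still disabled refuses to resume.
    return -EACCES if dev.disable_depth > 0 else 0


def simulate(double_disable):
    """One deferred probe/remove cycle followed by a retry."""
    dev = ToyDevice()
    pm_runtime_enable(dev)       # first probe attempt enables runtime PM
    pm_runtime_disable(dev)      # unbind path disables it once
    if double_disable:
        pm_runtime_disable(dev)  # destroy/cleanup path disables it again
    pm_runtime_enable(dev)       # retry after the probe deferral
    return dev.disable_depth, runtime_resume(dev)


if __name__ == "__main__":
    print(simulate(double_disable=True))   # (1, -13)
    print(simulate(double_disable=False))  # (0, 0)
```

With both disables in place the retry ends at depth 1 and the resume fails with `-13`; with only one, the counter returns to 0 and the resume succeeds.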

20:54 <qzed> So revert this: https://github.com/torvalds/linux/commit/17e822f7591fb66162aca07685dc0b01468e5480 ?

20:55 <bamse> sounds like either destroy or cleanup should be the "inverse" of "init"...but not both

20:56 <qzed> yeah

20:56 <bamse> and it seems that with the recent refactoring of mdss/dpu and edp changes we're probe deferring more

20:56 <qzed> or well, it's rather unbind and cleanup that are the two conflicts

20:57 <qzed> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/msm/adreno/adreno_device.c#L537 at one point and then the call directly below that

20:57 <qzed> ultimately calls https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/msm/adreno/adreno_gpu.c#L1002

21:05 <qzed> okay, now I get further, to `platform 2c6a000.gmu: [drm:a6xx_gmu_resume [msm]] *ERROR* GMU firmware initialization timed out`

21:05 <qzed> followed by

21:06 <qzed> `platform 2c6a000.gmu: [drm:a6xx_rpmh_stop [msm]] *ERROR* Unable to power off the GPU RSC`

21:06 <bamse> on your a690?

21:06 <qzed> `adreno 2c00000.gpu: [drm:adreno_load_gpu [msm]] *ERROR* Couldn't power up the GPU: -110`

21:06 <qzed> yeah

21:06 <bamse> okay, that's how far i got as well

21:06 <qzed> neat okay

21:06 <bamse> then i took a break to poke at dp instead

21:07 <bamse> suspect that i was missing something in my kernel patch

21:08 <qzed> I assume that's not an issue with the firmware? right now I'm loading qcom/a660_sqe.fw and qcom/a660_gmu.bin

21:08 <qzed> both seem to load fine based on the two lines before

21:09 <bamse> and the zap.mbn from windows?

21:10 <bamse> qcdxkmsuc8280.mbn or so

21:10 <qzed> you mean `qcdxkmsuc8180.mbn` under zap-shader? yeah

21:11 <qzed> although I do not get a log message for that one... so I guess it's not loaded yet?

21:11 <bamse> isn't it silent if you don't specify it?

21:13 <qzed> what do you mean? if it's not specifically in the list in adreno_device.c?

21:13 <bamse> on some devices the zap shader isn't used, so i think it won't tell you if you omitted it in dt

21:14 <bamse> been a while since i looked at that code though

21:14 <qzed> I'm looking through it right now, I think it doesn't use the `adreno_request_fw()` function which prints the fwname if it's loading the signed firmware

21:14 <qzed> I'll add a debug output and try

21:15 <qzed> here it seems to use `request_firmware_direct()`:

21:15 <qzed> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/msm/adreno/adreno_gpu.c#L81

21:16 <robclark> for windows (and android) devices, you defn need zap shader.. it will be insta-reboot if you don't have it, but possibly you aren't getting far enough for those fireworks

21:17 <robclark> I'm not sure if there is anything that isn't chromebook which doesn't need zap.. possibly android-auto things?

21:18 <robclark> hmm, but those are probably still using qcom's tz, so I think they would also need zap to take gpu out of secure mode at boot

21:18 <bamse> robclark: no, that's correct...everything !chrome uses that today

21:24 <CosmicPenguin> Not super following the whole thread, but the GMU implementation will be very tightly linked to the hardware

21:25 <bamse> CosmicPenguin: you don't happen to know where the gmu firmware lives in windows?

21:25 <qzed> doesn't seem like the zap shader is being loaded, so I guess it doesn't even get that far yet

21:25 <CosmicPenguin> oh, that's a great question. The other firmware images are directly built into the driver

21:26 <bamse> CosmicPenguin: okay, i figured...and they probably put the zap as a separate file to be able to rely on the pil loader with its certificate checks

21:27 <CosmicPenguin> yep - absolutely

21:27 <CosmicPenguin> The sqe actually ships internally as a header, and the linux/android build has to convert it to a binary

21:27 <CosmicPenguin> but the gmu ships internally as a binary, and I never asked what Windows did, but I have to assume they're converting it into a header - it would be super easy to do
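
Turning a binary blob into a C header, as speculated above, is a small transformation. The sketch below shows one generic way to do it, in the style of `xxd -i` output. How the Windows driver actually embeds the firmware is not known from this log; the function name `blob_to_c_header`, the symbol name and the array layout are all assumptions made for illustration.

```python
def blob_to_c_header(blob: bytes, symbol: str, per_line: int = 12) -> str:
    """Render a binary blob as a C array definition plus a length constant."""
    lines = [f"static const unsigned char {symbol}[] = {{"]
    for i in range(0, len(blob), per_line):
        chunk = blob[i:i + per_line]
        lines.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"static const unsigned int {symbol}_len = {len(blob)};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Four placeholder bytes, not real firmware content.
    print(blob_to_c_header(bytes([0xDE, 0xAD, 0xBE, 0xEF]), "example_blob"))
```

A blob embedded this way has no file name or container around it, so locating it inside a driver binary would depend on recognising the firmware's own contents.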

21:30 <bamse> ahh, yeah both make sense

21:30 <bamse> when you don't have gpl to deal with

21:31 <robclark> It is not impossible that a690 uses a660_gmu.bin .. it's possible it doesn't but worth a try

21:33 <CosmicPenguin> There is a handshake between the kernel and the GMU during init - usually it's just agreeing on a version number (which is why the gpu list has a GMU major minor in it)

21:34 <CosmicPenguin> But as the GMU transitions further toward doing scheduling on its own the initialization process will become increasingly more complex

21:37 <robclark> but that's tomorrow's problem ;-)

21:38 <qzed> so re zap: I don't think it even gets there yet, I've tried to add some debug prints but they don't show up

21:38 <qzed> re embedded firmware: are there any headers or magic constants to look out for?

21:40 iivanov has quit [Remote host closed the connection]

21:40 iivanov has joined #aarch64-laptops

21:48 fleebs has joined #aarch64-laptops

21:48 iivanov has quit [Ping timeout: 480 seconds]

21:48 <qzed> bamse: my attempt at quick-fixing the power-up issue: https://github.com/linux-surface/kernel/commit/0d2387fe750e0a442f639163429443d784e99dc0

21:49 <robclark> try a660_gmu.bin .. nearly all the a6xx use same gmu fw as other a6xx within same subgen, which in case of a690 would be a660

21:50 <qzed> bamse: I'll need to revisit this at some point as it seems that if we drop that call there, we could run into situations where it's unbalanced the other way around if the init function fails...

21:50 <qzed> robclark: already tried that

21:51 <qzed> qcom/a660_gmu.bin and qcom/a660_sqe.fw

21:53 <qzed> here's a full dmesg log from the current setup: https://gist.github.com/qzed/2288c651a017fcc016b524d3bac5a43e#file-dmesg-log

21:56 <robclark> hmm, ok.. I assume you added a690 to adreno_is_a660_family()?

21:56 <qzed> yeah

21:57 <robclark> k

21:59 <qzed> this is what I added: https://github.com/linux-surface/kernel/commit/9ce982ca721b796d7c9133c815935c46305f6708

21:59 <qzed> current tree that I'm using: https://github.com/linux-surface/kernel/commits/spx/next-20220520

22:02 <robclark> possibly there is a downstream kernel kgsl somewhere w/ a690 support? That would certainly be helpful..

22:04 <qzed> kgsl?

22:05 <robclark> downstream android kernel driver

22:06 <qzed> I'll try to look around

22:07 <qzed> is there a good resource for which devices use the a690? so far I only know of the Surface Pro X

22:14 <HdkR> Looks like Snapdragon 8cx G2 SoC has it as well

22:14 <HdkR> Which the HP Elite Folio uses

22:16 <robclark> in the past, it seemed like they did early bringup of the windows SoC's on android.. I don't think there is any actual android device w/ a690, but there also wasn't any a680 android device

22:22 <qzed> as far as I can tell the Surface Pro X SQ2 also has the 8cx Gen 2 chip (which would make sense then with the GPU)

22:36 fleebs has quit [Remote host closed the connection]

22:38 fleebs has joined #aarch64-laptops

22:41 <fleebs> Has anyone got the ASUS NovaGo builtin keyboard to work?

23:18 alpernebbi has quit [Ping timeout: 480 seconds]

23:31 alpernebbi has joined #aarch64-laptops

23:31 iivanov has joined #aarch64-laptops

23:39 iivanov has quit [Ping timeout: 480 seconds]