<digitalcircuit>
History so far: r17390-9baca41064-bisect-good, r17491-c98ddf0f01-bisect-good, r17457-25cb37bc00-bisect-good, r17508-a88b32bf6e-bisect-bad
<digitalcircuit>
slh: NBG6817 eMMC initialization error update - I'm at "Bisecting: 8 revisions left to test after this (roughly 3 steps)", but if I had to guess, it feels like it's closing in on the Linux 5.4 -> 5.10 kernel change. I'll do the full bisect though because it might be something else.
<shibboleth>
PaulFertser, mangix: aaaand it turns out the emmc isn't "bad" or write-protected at all, it's more like a logic-bomb of the embedded environment. the embedded squashfs only has drivers for ro, it depends on stuff in jffs, ext parts for rw
<shibboleth>
and if these parts happen to have fs errors? well, what could go wrong, eh?
<shibboleth>
can't mount due to fs errors, can't fsck because the squashfs can only do ro
<shibboleth>
good job there, dell
<slh>
digitalcircuit: I really hope for the best, at least that issue should be easier to identify and ultimately fix
<slh>
I just hope it's not gcc-10 --> gcc-11
<digitalcircuit>
slh: Oof, that's a good point too. Though the 5.4 -> 5.10 migration is pretty huge, and it seems like nobody else has encountered this issue so I don't know what's wrong with my setup in particular. I've effectively factory reset everything handled by dualboot.
<digitalcircuit>
slh: No toolchain changes appear to be between 25cb37bc00 (good) and a88b32bf6e (bad). I'm not sure how to show the differences as a list of commits, but https://github.com/openwrt/openwrt/compare/25cb37bc00..a88b32bf6e appears to show all impacted files.
<digitalcircuit>
(Locally I'm using gitk which shows the current git bisect good/bad markers)
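For a plain list of commits between the two bisect points mentioned above, `git log` with a two-dot range is one option. This is a minimal sketch, assuming it is run inside the OpenWrt git checkout; the function name `list_commits_between` is made up for illustration.

```shell
# List commits reachable from BAD but not from GOOD, one line each.
# Hypothetical helper; the name is not part of git or OpenWrt.
list_commits_between() {
    good="$1"
    bad="$2"
    git log --oneline --no-merges "${good}..${bad}"
}

# Example, using the revisions from the discussion:
# list_commits_between 25cb37bc00 a88b32bf6e
```

The two-dot range gives the same set of commits that the GitHub compare URL covers, just as one-line summaries instead of a file diff.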
floof58 has quit [Ping timeout: 480 seconds]
floof58 has joined #openwrt-devel
goliath has quit [Quit: SIGSEGV]
victhor has quit [Ping timeout: 480 seconds]
<slh>
I've been running kernel v5.10 for roughly half a year on my nbg6817 and later the ASRock g10, so that 'shouldn't' be the issue - but at the same time, my binaries 'should' have worked on yours
shibboleth has quit [Quit: shibboleth]
<digitalcircuit>
slh: That makes sense. The only thing that comes to mind so far is if there's any persistent changes to the eMMC that I've missed (e.g. do I need to update the ZyXEL u-boot bootloader?) or such. Regardless of 5.10 (after git bisect, I could test that by enabling 5.10 on an older commit), your build should have worked given sysupgrade did not persist config.
<digitalcircuit>
(Inversely, I only got my NBG6817 in 2019, so maybe I have a slightly newer/changed firmware/hardware revision of something.)
* digitalcircuit
will continue the git bisect either way; this is all speculating until the exact commit is found.
<slh>
possibly, but I hope not. I don't see how u-boot would be updated (aside from factory updates). yes, there are provisions for that, but I don't think any OEM update actually touched the 4 MB spi-nor flash
<digitalcircuit>
(I wanted to make sure I could recreate the issue using an up-to-date snapshot before I emailed Ansuel on the mailing list. Un/fortunately, I found a new regression in the snapshot.. kind of. Still hunting down what went wrong.)
<digitalcircuit>
(It's looking like the switch to Linux kernel 5.10 may have broken boot for my NBG6817. Still 2 revisions left to compile and test though.)
mig has joined #openwrt-devel
mig has quit []
zadr has joined #openwrt-devel
<zadr>
Where can I find the table of rates for the fixed_rate_idx knob?
<rsalvaterra>
stintel: Probably CPU revisions that didn't make it into production, or haven't been used in the systems we support.
tohojo has quit [Ping timeout: 480 seconds]
<hauke>
stintel: I thought that this is clear when I list the CPU cores they are for
<hauke>
The ARM errata are for Cortex A76 and N1
<hauke>
I do not think we have any target using them
<hauke>
armvirt could use them, so they are still active there
dangole has joined #openwrt-devel
hurricos has quit [Quit: WeeChat 2.8]
pmelange has left #openwrt-devel [#openwrt-devel]
danitool has joined #openwrt-devel
tohojo has joined #openwrt-devel
<slh>
russell--: I was specifically referring to the firmware images (and their internal structure) of the ZyXEL NBG6817, I know that many other vendors/devices do occasionally update u-boot (e.g. ath79/TP-Link). but from what I've seen with the nbg6817, I don't think it has been done (as part of a regular vendor update) there, yet
<mangix>
Nope. I set up a macOS VM yesterday. Works fine.
<mangix>
The CI seems to timeout. Strange.
<aparcar[m]>
neoraider: could you please backport https://git.openwrt.org/?p=project/opkg-lede.git;a=commit;h=5936c4f9660248284e8a9b040ea3153d3ea888de to 19.07?
<aparcar[m]>
mangix: not even a mac CI needs 6 hours to do the magic
<aparcar[m]>
i'll do another run with V=s
<mangix>
On my installation, I have to compile with gmake instead of make. No idea if related.
Tapper1 has quit [Ping timeout: 480 seconds]
<digitalcircuit>
slh: Bad news (?), NBG6817 MMC initialization failures have been tracked down to... 0470159552641c2b11ccc1b0fcfcb4ea08f2c6ab is the first bad commit ("ipq806x: switch to kernel 5.10"). I'm not sure how to "git bisect" from Linux kernel 5.4 to Linux kernel 5.10, especially considering all the patches on top of stock Linux kernel...
<digitalcircuit>
slh: I'm guessing this is when I'd file a new bug report and/or mailing list post?
* digitalcircuit
still doesn't understand why his router has trouble, but others' NBG6817 (and IPQ8065) devices seem to handle 5.10 just fine.
<digitalcircuit>
slh: Noted! Since that PR is already merged, should I file a new bug report, or is commenting on that PR acceptable?
<slh>
digitalcircuit: I'd start by adding a comment to the closed PR
<slh>
that should notify Ansuel, but maybe address @Ansuel as well
<digitalcircuit>
slh: Makes sense! Might be simple enough of a fix for Ansuel to be able to skip some formalities. I'll offer to file a bug report in my PR comment though. @mention is also a good idea!
<slh>
very, very weird, as it works for me (and has been, for over half a year now)
jlsalvador has quit [Quit: jlsalvador]
<digitalcircuit>
Looking at the log, I just now noticed: "[ 2.902209] mmci-pl18x 12400000.sdcc: card claims to support voltages below defined range"
<digitalcircuit>
So I wonder if ZyXEL changed the MMC part..?
<slh>
but different eMMCs in a new production batch sounds pretty reasonable
<digitalcircuit>
slh: Yeah... Your log doesn't have the "support voltages below defined range", so that sounds like it. May I mention my comparison with your log in my comment?
<slh>
sure, but it will expire in 1h
<slh>
so better copy'n'paste
<digitalcircuit>
slh: Noted!
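Since the comparison hinges on one kernel message, checking a saved log for it can be scripted. A minimal sketch, assuming the log is fed on standard input; the helper name `has_mmc_voltage_warning` is hypothetical.

```shell
# Succeeds (exit status 0) if the kernel log on stdin contains the
# mmci-pl18x voltage warning quoted above; fails otherwise.
# Hypothetical helper name, for illustration only.
has_mmc_voltage_warning() {
    grep -q 'card claims to support voltages below defined range'
}

# Example on a running device:
# dmesg | has_mmc_voltage_warning && echo "voltage warning present"
```

Running it against both boot logs gives a quick yes/no per device before the paste expires.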
<slh>
what do cat /sys/block/mmcblk0/device/cid and cat /sys/block/mmcblk0/device/date say?
<digitalcircuit>
slh: So "cd /sys/block/mmcblk0/device/ && tail -v cid date name manfid fwrev hwrev oemid rev" should print all your details, right? (Just finding an easier output for the GitHub comment, I can reformat what you've provided me.)
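An alternative to `tail -v` that prints one `attribute: value` line per file, which may paste more cleanly into a GitHub comment. This is a sketch only; the function name `dump_mmc_ids` is made up, and the default path assumes the eMMC shows up as `mmcblk0` as in the commands above.

```shell
# Print selected identification attributes of an MMC device from sysfs.
# Attributes that are missing or unreadable are skipped silently.
# Hypothetical helper; the directory argument defaults to the path used above.
dump_mmc_ids() {
    dev_dir="${1:-/sys/block/mmcblk0/device}"
    for attr in cid date name manfid fwrev hwrev oemid rev; do
        if [ -r "$dev_dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$dev_dir/$attr")"
        fi
    done
}

# Example:
# dump_mmc_ids
```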
<digitalcircuit>
slh: Sounds good! I'll post my comment now, unless there's anything else you'd like me to look for first.
<slh>
nope, just fishing in the dark and looking for potential differences
<slh>
iirc you have two nbg6817? if you can, please check the other one as well
<slh>
if you're 'lucky', one works, one doesn't ;)
<digitalcircuit>
slh: Good point! I'd check the other one.. but unfortunately, it's at someone else's house and OpenVPN appears to be broken at the moment (alongside other issues). It's still running 19.07.6 and next time I get over there I think I'll just flash it to 21.02.0 and reduce max CPU clock to 1.0 GHz to work around the crash bug.
<digitalcircuit>
(I'm working on a simple, robust init.d service to apply that which will persist across flashes so I can just have it stable while I'm tinkering at this place.)
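The core of such a workaround could look roughly like the following. This is a sketch under assumptions, not the script referred to above: it assumes the kernel exposes the standard cpufreq sysfs interface under `/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq` (values in kHz) and that 1000000 kHz is a frequency the platform accepts. The function name `cap_cpu_freq` is hypothetical.

```shell
# Cap the maximum CPU frequency for every cpufreq policy.
#   $1: limit in kHz (1000000 = 1.0 GHz)
#   $2: cpufreq sysfs root (optional; assumed default shown below)
# Policies without a writable scaling_max_freq are skipped.
cap_cpu_freq() {
    max_khz="$1"
    cpufreq_root="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$cpufreq_root"/policy*; do
        [ -w "$policy/scaling_max_freq" ] || continue
        echo "$max_khz" > "$policy/scaling_max_freq"
    done
}

# Example:
# cap_cpu_freq 1000000
```

The setting does not survive a reboot on its own, which is why it would need to be called from a boot-time service.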
<slh>
I'm very happy with wireguard on mine
<digitalcircuit>
I don't know (other person isn't technical), but I think the router itself is partially locked up - I ran into this issue at this place as well, the WiFi driver crashes/stops responding to LAN/etc. That was fixed for me with 21.02 (but that introduced CPU crashes instead).
<slh>
make sure to write the 'good' image to /dev/mmcblk0p8, as push-button tftp recovery always overwrites /dev/mmcblk0p5, so /dev/mmcblk0p8 remains safe and untouched
<digitalcircuit>
(I asked them to reboot the router, but they're in a lot of physical pain/etc so it's not feasible for now. I'm heading over this upcoming Friday.)
<slh>
not that it matters (I think) for this issue
pmelange has joined #openwrt-devel
pmelange has left #openwrt-devel [#openwrt-devel]
<Slimey>
hmm
<digitalcircuit>
slh: Noted as well. It's at least a sign that ZyXEL did change U-Boot over time, though even my U-Boot date is older than the MMC date.
<digitalcircuit>
mrkiko: I've created a workaround script for the CPU crash bug: https://github.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset#how-to-workaround-this-issue If you haven't bought an NBG6817 though, still wait - this reduces performance! I only created it as I'm managing an NBG6817 at a remote location too, and I wanted a way to help ensure it stays stable when I'm not testing things.
<stintel>
hauke: maybe it was my wishful thinking but I thought we had cortex a76
<digitalcircuit>
I'm not sure who to ping regarding the Let's Encrypt certificate troubles - OpenWrt 21.02.0 has trouble communicating with https://sysupgrade.openwrt.org/ (the attended-sysupgrade client "auc" says "Connection error: Invalid SSL certificate")
<digitalcircuit>
(The download repos had a workaround applied, which probably needs to be applied to sysupgrade as well)
cp- has quit [Quit: Disappeared in a puff of smoke]
cp- has joined #openwrt-devel
<digitalcircuit>
jow: ^ I think you had talked about the Let's Encrypt cert workaround for OpenWrt 21.02 by removing the cross-signed ISRG Root X1 from the chain? This probably should be done for sysupgrade.openwrt.org as well (for auc & luci-app-attended-sysupgrade).
<digitalcircuit>
Ack, I was mistaken - auc is affected, luci-app-attended-sysupgrade appears to not be affected (perhaps due to using browser's HTTPS stack?).