<fda>
hello, i installed RC3 to avm 1200. i've attached serial console. the flash from openwrt 19 to rc3 was ok. the device has eth0 + br-lan with the configured br-lan
<fda>
BUT it is not reachable over lan and it can't ping other devices
<fda>
so i manually deleted br-lan and assigned an ip to eth0. but still no ping to/from lan
<digitalcircuit>
...and nevermind that, stress-ng runs into errors with 1.4 GHz L2 disabled too. Might still work to trigger the reboot, will continue tinkering and testing.
<mangix>
digitalcircuit: i am amazed someone runs stress-ng
<mangix>
that package was added to openwrt by accident
<digitalcircuit>
mangix: Noted! It was suggested for me to try stress-ng back when I was trying to simplify my test case. The way I discovered the issue was by Deja Dup (duplicity) uploading 203 GB in 25 MB chunks over OpenSSH to a USB drive connected to my NBG6817.
<digitalcircuit>
(I hope I'm not annoying anyone here either; I've been trying to figure this issue out for months. I know the 1.4 GHz L2 cache frequency is what allows the hard reboot to happen, but I haven't figured out any fix.)
<mangix>
digitalcircuit: so before and after the patchset it fails?
lmore377_ has joined #openwrt-devel
lmore377 has quit [Ping timeout: 480 seconds]
<mangix>
backported latest stress-ng to 21.02
<digitalcircuit>
mangix: Err, referring to the SFTP test failing (causing a hard reboot), or stress-ng? If the latter, I haven't figured out the right stress-ng parameters yet to recreate the SFTP test (without needing SSH/etc).
<digitalcircuit>
And stress-ng on OpenWRT master failed to build when I last tried in the past week or so (it was still building on 21.02).
<mangix>
:)
<mangix>
wonder why
lmore377_ has quit [Ping timeout: 480 seconds]
lmore377 has joined #openwrt-devel
<digitalcircuit>
I'm not sure - I tracked it down to something with the Makefile referencing the Linux headers for IO parameters resulting in an invalid definition due to double [#define]s ( https://github.com/ColinIanKing/stress-ng/blob/master/Makefile#L431-L438 ) but wasn't certain on how to fix that.
<mangix>
hmmm maybe a missing liburing dependency
<digitalcircuit>
Not sure... The resulting "io-uring.h" file got created, just with invalid syntax (I'd need to retry to check to give a firm answer).
<digitalcircuit>
Given my difficulty in recreating the issue with stress-ng, I had postponed that for a while, but even scripting an SFTP upload using GNOME's GIO stack in Python 3 hasn't yet recreated the issue either - I think I need to upload multiple random files, not a single one.
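A minimal sketch of the "multiple random files" idea, assuming Python 3 on the client. The 25 MB size comes from the Deja Dup chunk size mentioned earlier; the function name, file naming scheme and file count are hypothetical, and the SFTP upload step itself is left out.

```python
import os
import tempfile
from pathlib import Path

CHUNK_BYTES = 25 * 1024 * 1024  # 25 MB, matching the chunk size mentioned above


def make_random_files(directory, count, size_bytes=CHUNK_BYTES):
    """Create `count` files of random (incompressible) data and return their paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index in range(count):
        path = directory / f"chunk-{index:04d}.bin"  # hypothetical naming scheme
        with open(path, "wb") as handle:
            remaining = size_bytes
            while remaining > 0:
                # Write in blocks of at most 1 MiB to keep memory use small.
                block = os.urandom(min(remaining, 1024 * 1024))
                handle.write(block)
                remaining -= len(block)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Small demo size so the example runs quickly; use CHUNK_BYTES for a real test.
    with tempfile.TemporaryDirectory() as tmp:
        files = make_random_files(tmp, count=4, size_bytes=1024 * 1024)
        print(f"created {len(files)} files in {tmp}")
```

Each file gets fresh random data, so the upload cannot benefit from compression or deduplication along the way.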
<digitalcircuit>
Alternatively, I may need to run stress-ng in 2-4 second bursts, to mimic the SFTP upload being in chunks, exercising the CPU governor. I know that the patchset to enable 1.4 GHz L2 cache mentions potential issues with clocking and modified the cpufreq driver to try to prevent those situations, and I'm wondering if I'm somehow bypassing that protection.
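The burst idea could be sketched roughly as below, assuming Python 3 is available on the router and that stress-ng's `--cpu` and `--timeout` options are enough to produce the load. The worker count, burst length and idle gap are guesses rather than values known to trigger the reboot, and the function names are hypothetical.

```python
import subprocess
import time


def burst_schedule(total_seconds, busy_seconds, idle_seconds):
    """Return a list of (busy, idle) pairs covering roughly total_seconds."""
    schedule = []
    elapsed = 0
    while elapsed < total_seconds:
        schedule.append((busy_seconds, idle_seconds))
        elapsed += busy_seconds + idle_seconds
    return schedule


def run_bursts(schedule, workers=2, run=subprocess.run, sleep=time.sleep):
    """Alternate short stress-ng runs with idle gaps so the governor keeps switching."""
    for busy, idle in schedule:
        # One CPU stressor burst; check=False because stress-ng may report errors.
        run(["stress-ng", "--cpu", str(workers), "--timeout", f"{busy}s"], check=False)
        # Idle gap: gives the CPU governor a chance to drop the frequency again.
        sleep(idle)


if __name__ == "__main__":
    # About 10 minutes of 3-second bursts separated by 2-second pauses.
    run_bursts(burst_schedule(600, busy_seconds=3, idle_seconds=2))
```

`run` and `sleep` are passed in as parameters only so the loop can be dry-run without stress-ng installed.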
<digitalcircuit>
(I feel I'm way over my head, trying to learn as I go, pardon all the uncertainty!)
<mangix>
well, it's a sort of abandoned platform by Qualcomm
<mangix>
Ansuel's doing most of the work
<mangix>
that ethernet latency commit told me I wasn't crazy :)
<digitalcircuit>
Ah, that's unfortunate of Qualcomm. And I do appreciate Ansuel's work - I want to be clear that I'm not trying to put them or anyone else's efforts down! Ultimately if I can't figure out what's going wrong, if I just need to add runtime /sys/ parameter or otherwise disable 1.4 GHz L2 cache frequency just for me, I'm fine with that.
<digitalcircuit>
mangix: Noted, thanks! I've actually had that in mind because in theory I shouldn't be hitting 384 MHz anyways, yet the cpufreq stats/trans_table (?) shows enter/exit for 384 MHz. That plus the warnings in https://github.com/openwrt/openwrt/commit/3efbfe5465e0d3cbc52c37a2b80e8f4f2d4b35da makes me wonder if 384 MHz CPU + 1.4 GHz CPU isn't prevented in all cases.
<digitalcircuit>
(Err, 1.4 GHz L2 cache)
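One way to check whether 384 MHz is actually being entered is to read the `time_in_state` file that sits next to `trans_table` in the cpufreq stats directory. This is a sketch, assuming the usual sysfs path `/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state` and its "frequency in kHz, time" line format; the helper names are hypothetical.

```python
from pathlib import Path

# Assumed location; adjust the CPU number as needed.
TIME_IN_STATE = Path("/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state")


def parse_time_in_state(text):
    """Map each frequency (kHz) to the time spent there (usually units of 10 ms)."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            states[int(fields[0])] = int(fields[1])
    return states


def visited(states, freq_khz):
    """True if the CPU spent any time at freq_khz."""
    return states.get(freq_khz, 0) > 0


if __name__ == "__main__":
    states = parse_time_in_state(TIME_IN_STATE.read_text())
    print("384 MHz entered:", visited(states, 384000))
```

Taking one snapshot before and one after a backup run would show whether the 384 MHz counter moved during the run.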
<digitalcircuit>
Nothing shows in the router's hardwired serial console about timings, so I must've not enabled the right kernel debugging settings or that logging isn't being hit.
* digitalcircuit
realizes the problem he's chasing (trying to use a USB drive + router instead of having a separate NAS) and the resulting nondeterministic crash that only he seems to be getting consistently (majority of SFTP backup runs) might sound crazy :)
<Monkeh>
Have you considered just hitting it with a hammer and moving on? :P
<digitalcircuit>
Monkeh: Heh, the thought certainly has crossed my mind :D The less-destructive variant of "just live on custom builds forevermore" might be what I end up doing. I just stubbornly WANT to fix this. It'd bother me to give up (especially since I'm trying to apply for some Quality Engineering positions).
<digitalcircuit>
(Having a NBG6817 in two different households adds to the motivation - I thought it'd make it easier for me to support both, which it does, but it means issues affect both too.)
Tapper has joined #openwrt-devel
<slh>
just to re-iterate, I've been seeing unexplained crashes on lantiq (and freezes/ hangs on ath79) as well, so it might not all be ipq806x specific
<Monkeh>
I have in the past seen some odd crashes on lantiq which I never did get to the bottom of
<slh>
in my case without modem involvement (using an external vigor 130; bthub5 just used as router terminating the PPPoE session)
<Monkeh>
These were all in modem use
<Monkeh>
Although I have an HH5A which is more than a little crash happy
<Monkeh>
But that one's definitely faulty
<slh>
the bthub5 was the next best thing (after tl-wdr4300 and tl-wdr3600 failed, by freezing up hard under high throughput conditions) - and it worked 'better' (sudden/ unexplained reboots, but no eternal hangs)
lmore377_ has joined #openwrt-devel
lmore377 has quit [Ping timeout: 480 seconds]
lmore377 has joined #openwrt-devel
lmore377_ has quit [Ping timeout: 480 seconds]
Rentong has joined #openwrt-devel
valku has quit [Quit: valku]
<digitalcircuit>
slh: Noted, thanks for chiming in! Hmm. Maybe there's larger changes that happened between 19.07 and 21.02 that are made worse (for me at least) by the 1.4 GHz L2 cache change that's specific to ipq8065.
<slh>
digitalcircuit: not sure, is /etc/init.d/cpufreq present in 21.02.0? (only if it was backported, no idea if it was)
<slh>
damned, that (avoid 384 MHz) would have been such an easy thing to backport ;)
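A runtime way to avoid 384 MHz, sketched under assumptions: it raises `scaling_min_freq` to the next available frequency using the standard cpufreq sysfs files. This is not necessarily what `/etc/init.d/cpufreq` does, and the function names are hypothetical.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def pick_min_freq(available_text, avoid_khz=384000):
    """Return the lowest available frequency (kHz) strictly above avoid_khz."""
    freqs = sorted(int(value) for value in available_text.split())
    above = [freq for freq in freqs if freq > avoid_khz]
    if not above:
        raise ValueError("no frequency above the one to avoid")
    return above[0]


def raise_min_freq(root=CPU_ROOT, avoid_khz=384000):
    """Set scaling_min_freq on every CPU so avoid_khz is never selected."""
    changed = {}
    for policy in sorted(root.glob("cpu[0-9]*/cpufreq")):
        available = (policy / "scaling_available_frequencies").read_text()
        target = pick_min_freq(available, avoid_khz)
        (policy / "scaling_min_freq").write_text(str(target))
        changed[policy.parent.name] = target
    return changed


if __name__ == "__main__":
    # Needs root on the router; prints the new minimum per CPU.
    print(raise_min_freq())
```

The setting does not survive a reboot, so it would have to be reapplied at boot.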
<digitalcircuit>
slh: I know... I was so excited when I thought it worked, only for the frustratingly nondeterministic nature of this sudden/unexplained reboot to strike :)
<digitalcircuit>
I'm starting to wonder if hooking up GDB over serial console might not actually be that crazy of an idea.
<digitalcircuit>
(I don't recall specifics, just that there's a guide on OpenWRT kernel debugging via serial console.)
<digitalcircuit>
I've been splitting my focus between recreating this more reliably (without needing to run 0.5-20 hours of full system backups), and trying to determine exactly what's going wrong, too.
Borromini has joined #openwrt-devel
Rentong has joined #openwrt-devel
Rentong has quit [Ping timeout: 480 seconds]
rejoicetreat has joined #openwrt-devel
f5 has joined #openwrt-devel
Tapper has quit [Ping timeout: 480 seconds]
Tapper has joined #openwrt-devel
danitool has joined #openwrt-devel
Rentong has joined #openwrt-devel
<blocktrron_>
fda: there's the possibility the PHY delays are configured incorrectly now, as the behavior changed with kernel 5.4
<blocktrron_>
try to change phy-mode from rgmii-rxid to rgmii-id
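For context: with `rgmii-rxid` only the receive-side internal delay is added by the PHY, while `rgmii-id` adds it on both receive and transmit. The suggested change could be applied to a device tree source with a sketch like this, assuming the property is written as `phy-mode = "rgmii-rxid";`. The function name and the sample node are hypothetical.

```python
import re

# Matches: phy-mode = "rgmii-rxid"  (whitespace around "=" may vary)
PHY_MODE = re.compile(r'(phy-mode\s*=\s*")rgmii-rxid(")')


def switch_phy_mode(dts_text):
    """Replace rgmii-rxid with rgmii-id; return (new_text, number_of_replacements)."""
    return PHY_MODE.subn(r"\g<1>rgmii-id\g<2>", dts_text)


if __name__ == "__main__":
    sample = '&eth0 {\n\tphy-mode = "rgmii-rxid";\n};\n'  # made-up node for illustration
    new_text, count = switch_phy_mode(sample)
    print(count, "replacement(s)")
    print(new_text)
```

The image has to be rebuilt and reflashed afterwards for the new device tree to take effect.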
Rentong has quit [Ping timeout: 480 seconds]
Borromini has quit [Quit: Lost terminal]
rejoicetreat has quit [Ping timeout: 481 seconds]
goliath has joined #openwrt-devel
danitool has quit [Quit: Cubum autem in duos cubos, aut quadratoquadratum in duos quadratoquadratos]
rmilecki has joined #openwrt-devel
dedeckeh has joined #openwrt-devel
_lore_ has quit [Ping timeout: 480 seconds]
_lore_ has joined #openwrt-devel
Rentong has joined #openwrt-devel
Tapper has quit [Ping timeout: 480 seconds]
Rentong has quit [Ping timeout: 480 seconds]
Tapper has joined #openwrt-devel
f5 has quit [Ping timeout: 480 seconds]
<fda>
blocktrron_: i tested more. the ips are set correctly; i pinged other devices and other devices pinged it. 0 pings succeed, but the devices do show up in the ARP table!
Tapper has quit [Remote host closed the connection]
Tapper has joined #openwrt-devel
Tapper has quit [Remote host closed the connection]
Tapper has joined #openwrt-devel
danitool has joined #openwrt-devel
Rentong has quit [Remote host closed the connection]
Rentong has joined #openwrt-devel
goliath has joined #openwrt-devel
Rentong has quit [Ping timeout: 480 seconds]
Rentong has joined #openwrt-devel
shibboleth has joined #openwrt-devel
Rentong has quit [Ping timeout: 480 seconds]
Rentong has joined #openwrt-devel
aleasto has quit [Remote host closed the connection]
Rentong has quit [Ping timeout: 480 seconds]
philipp64 has quit [Quit: philipp64]
Rentong has joined #openwrt-devel
philipp64 has joined #openwrt-devel
philipp64 has quit []
philipp64|work has quit [Quit: philipp64|work]
Rentong has quit [Ping timeout: 480 seconds]
rsalvaterra_ has quit []
rsalvaterra has joined #openwrt-devel
<rsalvaterra>
Heh… elfutils hate being compiled with gcc 11… :)
<stintel>
I think they hate being compiled in general :P
<rsalvaterra>
stintel: Well, GCC is complaining about mismatched pointer types in arrays (-Warray-parameter), and rightly so, judging from the ones I fixed…
<rsalvaterra>
But since I'm in an "oh, gcc 11, let's break routers!" mood… :P
<stintel>
yeah I never even finished switching to gcc10 as default
<rsalvaterra>
Yeah! Elfutils are building now. I think the next client is BusyBox…
<hauke>
stintel: there were still 2 bugs with gcc 10
<hauke>
I think you also looked into them
<hauke>
did you fix one of them?
<stintel>
I fixed one, another is in my staging tree
<stintel>
for busybox I think
<stintel>
feel free to check
jlsalvador2 has joined #openwrt-devel
<stintel>
I'm driving back to Bulgaria this weekend but hardware problems are delaying my departure
<stintel>
won't be able to look into gcc anytime soon
<hauke>
stintel: ok
<hauke>
stintel: ok you fixed the problem in mdnsd
<hauke>
then there is still umbim
<hauke>
I think this needs some bigger changes or we ignore the warning
jlsalvador has quit [Ping timeout: 480 seconds]
jlsalvador2 is now known as jlsalvador
<rsalvaterra>
stintel: My Omnia is working fine with a gcc 11-compiled image (for my config, of course). And it's smaller too. I'm happy. :P
<shibboleth>
are any of the current COTS tplink/qca/intel 11ax devices supported?
<shibboleth>
i don't really care about ax, waiting for 6e, but the cpus/socs are nice
<hauke>
shibboleth: mt7915 is supported
<shibboleth>
mediatek
<shibboleth>
no offense to nbd, but no way
<hauke>
Intel client wifi with AX should also work
<shibboleth>
a while back there was some back and forth between devs testing ath11 devices?
<shibboleth>
hauke, yeah, but routers/devices.
<slh>
xiaomi ax3600 and ax9000 are being worked on, the former is further ahead (pretty much fully working, apart from the big caveat at the end), the latter is significantly better hardware - but the elephant in the room is ath11k leaking memory like a sieve
<slh>
(ax9000 includes a third qcn9074 radio, which is hard to get working so far, PCIe not behaving, very, very fresh ath11k support for this revision, etc.)
<slh>
there are also plenty of other ipq807x devices on the market, but the xiaomi ones are by far the cheapest (and therefore the first victims)
Rentong has joined #openwrt-devel
dedeckeh has quit [Remote host closed the connection]