<hurricos> but
<hurricos> probably not before the next LTS
<hurricos> anyways
<Umeaboy> I wish that future openwrt version would have the smaller version of LuCi in it.
<hurricos> You can confirm Grommish that the memleak goes away when you unload / load the driver, as well?
<hurricos> s/memleak/unfreed memory allocation/
<Grommish> hurricos: When I kmod'
<Grommish> err when I kmod'd the octeon-ethernet.ko it seemed to stop when I unloaded it
<hurricos> stop, or ... wait, just stop? not return to your free memory?
<Grommish> Did not return that I could tell
<Grommish> But I'd have to recheck
<Grommish> Let me re-modularize octeon-ethernet and test
<hurricos> I doubt you'll get it back, reading your posts you were pretty certain about it
<hurricos> you also reference kfree_skb, I don't actually know how these drivers interact but I'd bet with how little the driver's been worked on recently, there's potential for not freeing memory then used by other drivers or non-modularizable bits, like the networking stack
<Grommish> thats what it was.. kfree_skb()
<hurricos> yeah.
<Grommish> I was close :)
<Grommish> hurricos: https://gist.github.com/Grommish/93b4b3452fb5254bae835738c2fe4cd0 if you feel like trying it locally
<Grommish> Just don't try to reload it after you unload it or it panics
<Grommish> But neggles said that was to be expected
<hurricos> No, no. I will break down and cry if I put a snic10e back in a box again
<hurricos> I appreciate it though :^)
<hurricos> Lemme checkout right after the 5.6 re-inclusion and build.
<Grommish> Ok.. and if you have an initramfs to test, toss me the link and it shall be tested
<hurricos> :thumbsup:
ptudor has quit [Read error: Connection reset by peer]
<Grommish> I've got all the services turned off and an hour of wait time @ 5 min intervals to see what might be leaking.. I'm turning them on one at a time and repeating
ptudor has joined #openwrt-devel
Umeaboy has quit [Quit: Leaving]
ptudor has quit [Quit: Strict-Transport-Security: max-age=48211200; preload]
ptudor has joined #openwrt-devel
goliath has quit [Quit: SIGSEGV]
shoragan has quit [Ping timeout: 480 seconds]
minimal has quit [Quit: Leaving]
<neggles> hurricos: hello sir
<neggles> I am a masochist and therefore have access to a plethora of octeon
<neggles> one thing that's consistent for myself and grommish is that restarting dnsmasq will add 1-2mb to used memory
<neggles> however if I instruct dnsmasq to bind to a specific IP address rather than wildcard, this does *not* happen
<neggles> my current theory is "wildcard UDP socket receive buffers are going missing"
<neggles> but I do not really know much of anything about how the internals of the kernel networking stack do things
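For readers following along: the difference between a wildcard bind and a specific-address bind can be sketched in a few lines of Python. This is only an illustration of the two socket setups neggles describes, not a reproduction of what dnsmasq does internally; the helper names `bind_udp` and `is_wildcard` are made up for this note.

```python
import socket


def bind_udp(address: str) -> socket.socket:
    """Open an IPv4 UDP socket bound to `address` on an ephemeral port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((address, 0))  # port 0 lets the kernel pick a free port
    return sock


def is_wildcard(sock: socket.socket) -> bool:
    """Return True if the socket is bound to the IPv4 wildcard address."""
    return sock.getsockname()[0] == "0.0.0.0"


if __name__ == "__main__":
    wildcard = bind_udp("0.0.0.0")    # receives on every local address
    specific = bind_udp("127.0.0.1")  # receives on loopback only
    print("wildcard:", wildcard.getsockname(), is_wildcard(wildcard))
    print("specific:", specific.getsockname(), is_wildcard(specific))
    wildcard.close()
    specific.close()
```

Opening and closing the wildcard variant in a loop on an affected kernel might be a lighter reproducer than restarting a whole service, though that is an untested assumption.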
<neggles> i've got an snic10e set up, an srx300 i've not gotten around to making work yet, a USG-XG-8 which I can probably make work, a couple USG-3s, an ERLite-3... and hey I don't want Octeon to die I want octeonplus to die :P octeon ii and octeon iii can stay... for now...
<mangix> I like ax
valku has quit [Quit: valku]
<neggles> hurricos: dumped the memory of an snic10e which was just-booted (40MiB), restarted dnsmasq repeatedly until it hit 300MiB, dumped the memory again... it's all this? https://i.imgur.com/ZcalJzR.png
<Slimey> cant sleep
<russell--> neggles: are you dropping cache? (e.g. echo 3 > /proc/sys/vm/drop_caches). Stuff isn't going to be evicted from memory until there is pressure from somewhere
<neggles> russell--: yes, it's not caches
<neggles> echo 3 > /proc/sys/vm/drop_caches does exactly nothing
<russell--> diff /proc/meminfo before and after?
<neggles> russell--: nothing changes other than reduction in memfree and memavailable
<neggles> same with /proc/vmallocinfo
<neggles> dma buffer allocations go up a lot but that's what we'd expect
<neggles> and that's just a count of the number of times one's been allocated, not whether they still are
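The before/after comparison of /proc/meminfo that russell-- suggests could be scripted roughly as follows. This is a sketch that assumes the usual `Name:   value kB` layout of /proc/meminfo; the function names are hypothetical and the numbers in the example are invented.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def diff_meminfo(before: str, after: str) -> dict[str, int]:
    """Return {field: after - before} for every field whose value changed."""
    b, a = parse_meminfo(before), parse_meminfo(after)
    return {k: a[k] - b[k] for k in b if k in a and a[k] != b[k]}


if __name__ == "__main__":
    # Invented sample snapshots, for illustration only.
    before = "MemTotal: 965972 kB\nMemFree: 900000 kB\nSlab: 20000 kB\n"
    after = "MemTotal: 965972 kB\nMemFree: 640000 kB\nSlab: 20000 kB\n"
    print(diff_meminfo(before, after))
```

On a live system the two snapshots would come from reading /proc/meminfo before and after the restart loop. As neggles notes, a drop in MemFree with no matching growth in any other field is itself the interesting result.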
danitool has quit [Quit: Cubum autem in duos cubos, aut quadratoquadratum in duos quadratoquadratos]
<neggles> russell--: [ 455.141050] DMA32 free:471544kB min:16384kB low:20480kB high:24576kB reserved_highatomic:0KB active_anon:28kB inactive_anon:284kB active_file:600kB inactive_file:572kB unevictable:0kB writepending:0kB present:1011448kB managed:965972kB mlocked:0kB bounce:0kB free_pcp:52kB local_pcp:0kB free_cma:0kB
<neggles> welp
<neggles> based on a bunch of behavioural tests I and Grommish have just re-verified, it does seem like it's something to do with wildcard-bound UDP sockets
<owrt-snap-builds> Build [#504](https://buildbot.openwrt.org/master/images/#builders/29/builds/504) of `pistachio/generic` completed successfully.
Borromini has joined #openwrt-devel
rua has quit [Quit: Leaving.]
cbeznea has joined #openwrt-devel
<aiyion> Is this a copy paste error?
<aiyion> It does not match the pattern.
<aiyion> Nice thanks.
<Borromini> git blame is neat :)
<aiyion> it is
alex_ has joined #openwrt-devel
alex_ has quit []
robimarko has joined #openwrt-devel
srslypascal is now known as Guest1011
srslypascal has joined #openwrt-devel
jlsalvador2 has joined #openwrt-devel
jlsalvador has quit [Read error: Connection reset by peer]
jlsalvador2 is now known as jlsalvador
Guest1011 has quit [Ping timeout: 480 seconds]
rua has joined #openwrt-devel
c0sm1cSlug has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
c0sm1cSlug has joined #openwrt-devel
dlg_ has joined #openwrt-devel
dlg has quit [Read error: Connection reset by peer]
srslypascal is now known as Guest1016
srslypascal has joined #openwrt-devel
mattytap_ is now known as mattytap
Guest1016 has quit [Ping timeout: 480 seconds]
Slimey_ has joined #openwrt-devel
Slimey has quit [Read error: Connection reset by peer]
Slimey_ is now known as Slimey
Slimey_ has joined #openwrt-devel
Slimey has quit [Read error: Connection reset by peer]
Slimey_ is now known as Slimey
rmilecki has quit [Ping timeout: 480 seconds]
rmilecki has joined #openwrt-devel
Borromini has quit [Quit: Lost terminal]
cbeznea has quit [Quit: Leaving.]
pepe2k has joined #openwrt-devel
mattytap has quit [Ping timeout: 480 seconds]
pepe2k has quit [Read error: Connection reset by peer]
Misanthropos has quit [Ping timeout: 480 seconds]
mattytap has joined #openwrt-devel
minimal has joined #openwrt-devel
goliath has joined #openwrt-devel
jwmullally has joined #openwrt-devel
Misanthropos has joined #openwrt-devel
csharper2005 has joined #openwrt-devel
bluew has joined #openwrt-devel
Borromini has joined #openwrt-devel
<hurricos> neggles: no. NO. THAT'S CURSED. CURSED
<hurricos> I'm bisecting now.
<hurricos> if it's wildcard-bound UDP sockets it's going to be easyish to find in whatever cursed, huge range
<hurricos> ... I end up with
ekathva has joined #openwrt-devel
ekathva has quit [Remote host closed the connection]
<stintel> why do we use libressl in tools/ instead of openssl ?
<jow> because we drank the cool-aid when it was new
<jow> might've been a tad simpler to build on osx
<jow> or faster/smaller
<jow> but given that libressl appears to have lost quite some momentum we maybe should go back to openssl
<stintel> we build kea/host against libressl and then kea against openssl
<stintel> isn't that asking for trouble anyway?
<jow> what is kea and why does it need a host build?
<stintel> kea is a dhcp server, dunno why it needs a host build, wasn't mentioned in the commit that introduced it
<jow> apparently for some "kea-msg-compiler" executable
<stintel> that seems optional to me
<stintel> will try and rip it out completely then
<Habbie> can I get a wiki account please?
<stintel> Habbie: /q me your email ?
<Habbie> done, thanks :)
<Habbie> this router does 125k serial during uboot and 115k2 after and i really want to save the next person these two hours I spent :D
<stintel> pffft
<Habbie> for extra fun, the CH340G can't do that
<stintel> seriously who comes up with that shit
<Habbie> tp-link
<Habbie> :)
danitool has joined #openwrt-devel
<Slimey> lol
<hauke> jow: does openssl support static linking? I think that was one of the reasons for libressl
floof58 has quit [Ping timeout: 480 seconds]
<Borromini> Habbie: which TP-Link?
<hurricos> grommish: Unfortunately the patch between 5.4.96 and something mid-5.8 doesn't apply :(
<Grommish> hurricos: I think the last 5.4 I tested was .175
<Grommish> Which was fine.. that was the last 5.4 bump the kernel had before the switch to 5.10
<Grommish> I also managed to stop the leak by turning everything off but networking
<Grommish> So, the UDP issue neggles is suspecting is looking more and more likely
<stintel> have we been going at it all wrong? is it a mips64 issue rather than an octeon issue? :P
<Grommish> stintel: Nah. I suspect the networking changes that happened upstream is an issue, but not the driver itself, or at least, not exclusively
<Habbie> Borromini, tl-wr841nd v11
<Borromini> Habbie: ok
<Borromini> what version is that model at now? v14?
<Grommish> stintel: You can see my last test on the thread, but I can maintain networking as long as I kill everything else, including ipv6 and dns/dhcp
* Borromini has a v7
<Grommish> and it's stable as anything
floof58 has joined #openwrt-devel
csharper2005 has quit [Ping timeout: 480 seconds]
csharper2005 has joined #openwrt-devel
<mangix> stintel: libressl uses cmake. Good enough reason to keep.
mrkiko has quit [Remote host closed the connection]
csharper2005 has quit [Read error: Connection reset by peer]
floof58_ has joined #openwrt-devel
floof58_ has quit []
floof58_ has joined #openwrt-devel
<stintel> that alone is not a reason to keep it
<mangix> I disagree
floof58 has quit [Ping timeout: 480 seconds]
floof58_ has quit []
floof58 has joined #openwrt-devel
<Habbie> Borromini, v14 is what i'm aware of at least
<stintel> having libressl for host build and openssl for target build is a recipe for confusion and potentially hard to debug issues
<stintel> the fact that it uses cmake is irrelevant
<mangix> I disagree. Libressl is only used for a select few packages.
<mangix> OpenSSL is used by many packages.
csharper2005 has joined #openwrt-devel
cbeznea has joined #openwrt-devel
csharper2005 has quit [Read error: Connection reset by peer]
cbeznea has quit [Quit: Leaving.]
<hurricos> Grommish: So ... what I could do is manually repack a newer linux into the xzcat'ed tarball
<hurricos> OpenWrt itself has a TON of patches most of which will likely not patch nicely with this method
<hurricos> but the method of storing the diff as a patch isn't going to cut it
<hurricos> which is a pain, really.
csharper2005 has joined #openwrt-devel
<Grommish> hurricos: It's all beyond my knowledge, unfortunately.. I can certainly test it, but there isn't much I'm going to be able to add to the "lets try this" stuff. I have full access to initramfs or even just flashing, the device is nearly unbrickable unless the emmc chip dies, so I'm not worried about things going wrong
<hurricos> I learned recently from stintel that you can have a package point to a source directory
<hurricos> I'm trying to recall where that's documented; I need to do that with the kernel
<hurricos> CONFIG_SRC_TREE_OVERRIDE
<hurricos> src-link
<hurricos> thank you weechat
<hurricos> but that won't keep our patches. Hmm. I could try fooling the Makefile instead.
<stintel> that doesn't work for the kernel
<hurricos> Oh no
<stintel> afaik
<hurricos> and either way
<hurricos> it's not a problem with the patch
<hurricos> I somehow included user_headers in the patch
<hurricos> that's the problem
<slh> ls
<hurricos> hurricos!
<hurricos> :^)
<stintel> you can do CONFIG_EXTERNAL_KERNEL_TREE
<stintel> if it boots, should allow you to do a normal git bisect in the directory that points to
<stintel> that's how I found 4ecf8346c074ff80101a17d39086010f8f4b23b8
<stintel> I wanted to do that for octeon too but I have yet to find a reproducer
<hurricos> <a reproducer> that is, a fast method to jiggle the key and reproduce?
<stintel> yes
<hurricos> Thank you. Makes sense. Sounds like anything that listens on udp wildcard sockets can reproduce it
<hurricos> so repeatedly restarting dnsmasq
slh64 has quit [Quit: gone]
<hurricos> Grommish has the setup
<stintel> bisect between 5.4 and 5.10 was ~15 steps or so
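As a rough sanity check on the "~15 steps" figure: a bisect halves the candidate range each round, so the step count grows with the base-2 logarithm of the number of commits. The helper below is a back-of-the-envelope sketch, not how git itself computes its estimate, and it ignores skipped commits and merge topology.

```python
def bisect_steps(n_commits: int) -> int:
    """Worst-case number of halving steps to isolate one commit out of n."""
    if n_commits < 1:
        raise ValueError("need at least one commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float error.
    return (n_commits - 1).bit_length()


if __name__ == "__main__":
    for n in (1_000, 32_768, 100_000):
        print(n, "commits ->", bisect_steps(n), "steps")
```

Fifteen steps corresponds to a range of up to 2^15 = 32768 commits; commits that fail to build and have to be skipped add to that.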
<hurricos> I'm on it now. The other thing to note is WHAT the space fills up with
<hurricos> see neggles' https://i.imgur.com/ZcalJzR.png
<hurricos> we'll probably see it once we're on the last bisect step
<hurricos> I say that but it's likely the last bisect step will be somewhere nasty
<hurricos> not sure exactly how main git kernel history looks but I know octeon-ethernet was dropped in 5.6
<hurricos> it might have only been in an rc that never got merged back up
<stintel> good luck bisecting, I've hit many commits that didn't compile
<hurricos> barf
<hurricos> I'll just git bisect skip
<hurricos> it probably won't work.
slh has quit [Remote host closed the connection]
<hurricos> but if one or two commits compile and I get a good and bad, then I can start taking a range and looking for UDP-related things
<hurricos> or specifically, socket related things
<Grommish> hurricos: I should be able to load a standard linux kernel, I wonder
<hurricos> It'll want our patches
<hurricos> I have a way to apply them just have to drop user_headers from the tree
slh has joined #openwrt-devel
<hurricos> I mistakenly included it
<hurricos> everything else applies
<hurricos> except the binary diffs which I can ignore since they're just tools
<hurricos> yeah, user_headers was generated by the build. Silly, really
slh64 has joined #openwrt-devel
csharper2005 has quit [Read error: Connection reset by peer]
<Grommish> hurricos: Well, i mean, my uboot has: linux_mmc=fatload mmc 1 $(loadaddr) vmlinux.64;bootoctlinux $(loadaddr) mem=0 numcores=2
<hurricos> I'd rather use OpenWrt's tree than find out what happens if I don't :^)
<hurricos> Oh what
<hurricos> patch failed
<hurricos> but it gives me no .rej files or other entries
<Grommish> hurricos: --leave-rej on quilt
<Grommish> if you're using that
<hurricos> Ah no
<hurricos> I don't use quilt
<hurricos> I just commit what I want from the tree and then git format-patch
<hurricos> then add that to patches-5.4/thing.patch
<hurricos> having subsumed the existing patches in that directory
<stintel> I don't see an udp wildcard socket for dnsmasq :/
<hurricos> but the issue is binary patching actually
<Grommish> stintel: IPv6/odhcp6 listens and can't be removed
<hurricos> so dhcp, not dnsmasq
<hurricos> sounds like
<Grommish> when I physically deleted it, the dhcpv6.sh went mental and bootlooped the system due to ram limits in about 4 minutes
<hurricos> hmm
<stintel> but this seems to increase memory usage fast: while true; do /etc/init.d/dnsmasq restart; sleep 1; done
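The restart loop above can be wrapped so that it also records memory after each iteration, which makes runs on different kernels easier to compare. A minimal sketch, assuming a Linux /proc/meminfo with a `MemAvailable` line; `run_reproducer` and `read_mem_available_kb` are hypothetical names, and the sampler is injectable so the harness can be exercised off-target.

```python
import subprocess
import time


def read_mem_available_kb(path: str = "/proc/meminfo") -> int:
    """Return the MemAvailable value (in kB) from a meminfo-style file."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found in " + path)


def run_reproducer(command, iterations, delay=1.0, sample=read_mem_available_kb):
    """Run `command` repeatedly; return memory samples (one before, one per run)."""
    samples = [sample()]
    for _ in range(iterations):
        subprocess.run(command, check=False)  # a failing restart still counts
        time.sleep(delay)
        samples.append(sample())
    return samples
```

On the device this might be called as `run_reproducer(["/etc/init.d/dnsmasq", "restart"], 100)`; a steadily falling series of samples would match what stintel reports.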
<stintel> pffft
<Grommish> less than 500 seconds till it was borked
<Grommish> hurricos: Well. the sysntpd also listens on wildcards
<Habbie> fun, on pi serial i still need to pick 115200 for the uboot, but it is slightly corrupted
<Habbie> i wonder if my 125000 was caused by my pulseview/sigrok sampling rate being too low
<hurricos> stintel: I'd honestly want to continue the way neggles was going
<hurricos> reading system memory and seeing *what* is in there, then potentially using octeon-top / octeon-perf to watch and see who is writing to those regions of memory
<stintel> good luck tracking down where 0x82013000 comes from :P
<hurricos> yeah.
<hurricos> memory mapping isn't fun
<hurricos> err
<hurricos> also no, that'd not be offset for code, it'd be storage for sure
<hurricos> I say that but have no clue. I just don't think it's likely it'll be executing those pages, just using them as
<hurricos> heap
<hurricos> wait, I wonder
<hurricos> kernel memory management is also not my strong suit
robimarko has quit [Quit: Leaving]
<hurricos> page_owner=on
<hurricos> @grommish reboot with that in your command line with a buggy kernel if you can
<hurricos> see if you get (you should) /sys/kernel/debug/page_owner in your sysfs
<hurricos> the other thing is there may be an attached CONFIG_ symbol
<hurricos> Yes, I don't know if we default enable CONFIG_PAGE_OWNER
<Grommish> hurricos: Let me set the bootargs for it and check
<hurricos> Thank you
<Grommish> CONFIG_PAGE_OWNER in kernel config?
<stintel> PAGE_OWNER and page_owner=on cmdline
csharper2005 has joined #openwrt-devel
<Habbie> stintel, Borromini, i was wrong, uboot is 115200 too
<hurricos> Habbie: Was 125K ever working?
<hurricos> I was telling a friend in my workspace about that, he was concerned about how messed up a baud rate like 125K is lol
<Habbie> hurricos, no - i got 125k from sigrok/pulseview but i see now that at my sampling rate, no better number would have come out
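A logic analyzer can only measure a bit period as a whole number of samples, so at a low sampling rate the decoded baud rate snaps to sample_rate / n. The sketch below shows which rates a given capture could report; the 1 MHz sample rate in the example is an assumption for illustration, since the log does not say which rate was used.

```python
import math


def apparent_baud_rates(sample_rate_hz: float, true_baud: float) -> list[float]:
    """Baud rates a capture could report when bit time is a whole sample count."""
    samples_per_bit = sample_rate_hz / true_baud
    counts = {math.floor(samples_per_bit), math.ceil(samples_per_bit)}
    # Highest candidate rate first; ignore a zero-sample bit period.
    return sorted((sample_rate_hz / n for n in counts if n > 0), reverse=True)


if __name__ == "__main__":
    # Assumed 1 MHz capture of a 115200 baud line: about 8.68 samples per bit.
    print(apparent_baud_rates(1_000_000, 115_200))
```

With those assumed numbers the nearest whole sample counts are 8 and 9, and 1 MHz / 8 is exactly 125000, which would be consistent with the reading Habbie describes.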
<Habbie> hurricos, i don't understand why uboot was garbage on the ch340g
<hurricos> Ah! Easy mistake. BTW, what do you use for a logic analyzer?
<hurricos> probably voltages :^)
<Habbie> hurricos, when i use pi onboard serial, 115200 works for uboot and kernel but has a lot of corruption - perhaps i do need a pullup
<Habbie> hurricos, different voltages between uboot and kernel?!
<hurricos> No no!!!
<hurricos> I mean your ch340g might not have the right voltage for -- oh
<hurricos> Misread, totally misread
<Habbie> my analyzer is a (knockoff?) saleae
<Habbie> ack
<hurricos> I read ch340g as ch341a operating in SPI
<Habbie> ah, no :)
<hurricos> sounded like "oh I chip clipped it"
<hurricos> no
<Habbie> also, i can use a higher sampling rate, but i did not try that
<Habbie> it's a very educational evening ;)
<hurricos> RE: saleae logic clone: how are these so common :o
<Habbie> probably because they're cheap
<Grommish> stintel: PAGE_OWNER goes on the cmdline as well? Sorry, I just don't want to mis-understand :D
<Grommish> I already put the page_owner=on
<stintel> Grommish: PAGE_OWNER is CONFIG_PAGE_OWNER
<Grommish> stintel: Gotcha
<Grommish> Thanks
<stintel> they're written without CONFIG_ in Kconfig
<stintel> git grep in linux.git :)
<stintel> soooo ... I hit ~1GB mem usage and hit oom-killer
<stintel> while the SNIC has 2GB
csharper2005 has quit [Read error: Connection reset by peer]
csharper2005 has joined #openwrt-devel
<stintel> tbh I can't really parse the oom-killer output
c0sm1cSlug has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
<Grommish> stintel: I hit about 500k and it oom on a 1Gb device, so that tracks
<Grommish> stintel: but only in these situations.. on 5.4 when i run suricata, it uses 570mb ram without issue
<hurricos> sounds like the kernel wants twice as much free memory as it is currently using
<hurricos> err
<hurricos> as much*
<hurricos> not including dirty pages or anything like that
<hurricos> (for fs)
<Grommish> err 500mb not 500k :D
c0sm1cSlug has joined #openwrt-devel
<Slimey> hurricos this look right for the 203x? https://paste.centos.org/view/91633134
<hurricos> Oh Slimey I forgot to respond to your request for cleaning up partitions
<hurricos> As long as the bootloader doesn't care about a given partition you can just pave it
<hurricos> and you probably don't because reverting back to stock is something most easily done by overwriting the storage in an initramfs boot
Borromini has quit [Quit: leaving]
<hurricos> which can usu. be done (on NOR) by sysupgrade -F'ing an initramfs :^)
<hurricos> if you don't have serial
<hurricos> so yeah
<hurricos> let's pave all the partitions tbh
<hurricos> leave art and senao 0, we don't need A/B booting do we?
<hurricos> is cert actually needed?
<Slimey> yeah i tried it on the 1925 and u-boot defaults it back to bootcmd=bootm 0x9f300000 on reboot
<hurricos> what is it used for? If it's empty will the board still boot openwrt?
<Slimey> nope
<hurricos> on reboot?
<Slimey> certs for oem stuff
<hurricos> so it reads from a hard-coded region?
<Slimey> yeah
<hurricos> on every reboot? Let me test my bsap2030 it's right behind me
<Slimey> dont need A/B kernels
<hurricos> remind me of the baud rate / how to get in?
<hurricos> stintel: you're right, entire months go by with it in an unbuildable state
<hurricos> I'm going to try to build a reversion patch for every commit which mentions the token "udp"
<hanetzer> http://netjsonconfig.openwisp.org/en/latest/backends/openwrt.html#programmable-switch-settings assuming this kind of switch config is going away with DSA?
<Slimey> 115200
<Slimey> when you see adtran bootloader ... b100:
<hurricos> hanetzer: Yes. Already is gone without dsa actually. You point to the switch as a whole device of type bridge and then you create a separate config bridge-vlan which points to the device and gives which ports of that device it should apply to, and which vlan tag
<Slimey> but really just need "Bootloader" "Bootloader environment" "Kernel A" "Rootfs A" "ART" "SENA0"
<hurricos> for example, from my realtek switch here: https://paste.c-net.org/LengthyMeself
<hurricos> I say "gone without DSA", I mean to say that the way of doing configs has changed as well from what I'm seeing on that openwisp page
<hurricos> or maybe not? I guess my mx60 still has swconfig
<hanetzer> hurricos: yeah. I finally got the kinks worked out in deploying the controller and registering a device
<hurricos> Nice! Do you do much more with openwisp?
<hurricos> I have a hankering to deploy it somewhere using something really recent but I have feared testing and getting over the new switch configs (
<hanetzer> hurricos: at one point I cobbled together a freeradius capable setup, yeah, but that's so out of date/long ago its no longer applicable.
<hurricos> More than I've done. :P
<hanetzer> hurricos: don't try to use the 'current' one. its busted. go down to https://github.com/openwisp/ansible-openwisp2#deploying-the-upcoming-release-of-openwisp and use that to play with :P
<hanetzer> note: shared secret has some max length in the controller ui, not sure what, so make sure you're not getting truncated.
ptudor has quit [Read error: Connection reset by peer]
<hurricos> Much appreciated and bookmarked. I have been procrastinating and now am glad I waited long enough to hear this :^)
<stintel> maybe PROC_PAGE_MONITOR also helps
<hurricos> stintel: that will be easier to debug. Where do you keep your most recent snic10e patches? :^)
<hurricos> I want to grab and compile in time for neggles to wake up.
<stintel> codeberg
<stintel> afair
<stintel> https://gist.github.com/2d202363d64889be400ec1b6b45b76a7 to enable PAGE_OWNER in the kernel
<stintel> https://gist.github.com/stintel/bf233c89e93a6589ac014a0dc94ef653 if you don't want to edit the cmdline
<hurricos> free patches!
<Grommish> hurricos: For whatever reason, it's rebuilding the toolchain, so gimme a few and I'll have it built out and loaded
<hurricos> oh you already got it :D
<Grommish> hurricos: Yah, I can build from source easily enough and appending the cmdline is easy, I just need to set the kernel symbol.. This is only 5.15 though.. do you want me to drop back to 5.10?
<hurricos> How are you compiling 5.15?
<hurricos> shouldn't matter. But you should be able to publicly post for neggles, that's my only concern
<hurricos> oh wait, you're compiling for your itus aren't you?
<Grommish> Yes
<hurricos> yeah, the snic10e has a PCIe interface. The whole idea is to read the RAM off it, find the problematic bits with the repetitive ff ff ff ff 80 01 20 00 in it
<hurricos> and then find who owns those pages
<hurricos> my guess is that it's not the FPA hw. But only a guess
csharper2005 has quit [Ping timeout: 480 seconds]
csharper2005 has joined #openwrt-devel
csharper2005 has quit [Read error: Connection reset by peer]
<hanetzer> question. I'm assuming even hardware switches will be using dsa in the future?
<hurricos> hanetzer: Hardware switches ARE DSA :^)
<hurricos> OpenWrt will essentially only support a switch if it has DSA support in-kernel
<hanetzer> well, I'm using dsa on what appears to lack an actual hardware switch (meraki mr24)
<hurricos> right, that's just "DSA config" really
<hurricos> that's just a format for the configuration.
<hurricos> MR24 has no hardware switch, in fact only one port :P
<stintel> ow confuse much
<hurricos> I think 2 GMACs on the apm821xx? Maybe?
<hurricos> As the MX60 has an eth and non-eth port through its AR8327
<hurricos> though I might be wrong about that, too
<hurricos> and MX60 doesn't have its DSA port :grimacing: I should finish that while I'm at the Lab
<hurricos> stintel: boards with switches not supporting DSA will get dropped soon, yes?
<stintel> mostly confused about "using DSA on something w/o hw switch"
<hurricos> Yeah, no, I think people are really thinking about the "DSA-focused configuration formatting change under /etc/config/network"
<hurricos> which itself is a pain, and the loss of swconfig also means the nice little luci config tab is gone :(
<hurricos> (the one for VLANS)
<hurricos> s/eth and non-eth/wan and not-wan/
<stintel> Error loading shared library libubus.so.20210809: No such file or directory (needed by /usr/sbin/dnsmasq)
<stintel> grrrrrr
<stintel> lol even if dnsmasq doesn't start due to that, there is still memory usage increase over restart attempts
<hurricos> you can strace it and see if it opens sockets before it can talk to libubus :^)
<hurricos> s/to/via/
jwmullally has left #openwrt-devel [#openwrt-devel]
<Grommish> stintel: Yes.. Even if you have dnsmasq stopped and disabled, and then call a restart, it bumps the used
<hurricos> sounds like the loader is not even letting it start tho?
<stintel> but /sys/kernel/debug/page_owner doesn't seem useful at all
<hurricos> stintel: are you on a pcie host?
<hurricos> if you are, can you read memory off and see if you reproduce the same pattern?
<stintel> how ?
<hurricos> oct-remote --- ah, oct-remote-oneofthose.
<hurricos> Let me find the right one
<hurricos> oct-remote-memory
<stintel> arguments?
<hurricos> sorry. Sorry. oct-remote-save.
<hurricos> physical memory addresses.
<hurricos> bleh. Well, trigger dnsmasq a few hundred times and do a diff of <(xxd old.bin) <(xxd new.bin)
<stintel> need faster storage :P
<hurricos> :D
<hurricos> another thing you could do is ...
<hurricos> write back the old pages in those regions ;^)
<stintel> also the pattern isn't always the same
<hurricos> Oh?
<stintel> +1 somewhere
<hanetzer> hurricos: ah, looks like someone is already working on it.
<hurricos> hanetzer: very nice!
<hurricos> stintel: the other thing is, if you write back the old zeroed-out pages to wherever SLUB (or whoever here is responsible) is allocating them, you will probably trigger a panic
<hurricos> *in the thread of the responsible process*
<hurricos> so what you can do is ...
<hurricos> you can disable SMP in your build
<hurricos> cause the panic
<hurricos> then oct-remote-boot
<hurricos> and read from the kernel's log area
<hurricos> too much effort likely
<hurricos> wait, you don't even need to oct-remote-boot I don't believe
<hurricos> it can just be hung, I don't know if the card has to be live for you to grab or write memory
<hurricos> maybe you do just to reset some crap. not sure
<hurricos> stintel: feed me a pubkey, I'd appreciate the dump of the old vs new RAM if I can get it from you
csharper2005 has joined #openwrt-devel
<stintel> hmmmz, I might not have enough ram for diffing 2 2GB files :P
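Comparing two multi-gigabyte dumps does not need both in RAM, nor a full textual diff of xxd output. A page-by-page streaming comparison is one option; the sketch below assumes both dumps are raw memory images covering the same physical range and that 4096-byte pages are a useful granularity. `changed_pages` is a name made up for this note.

```python
def changed_pages(old_path: str, new_path: str, page_size: int = 4096):
    """Yield the byte offset of every page that differs between two dumps."""
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        offset = 0
        while True:
            a = old.read(page_size)
            b = new.read(page_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                yield offset
            offset += page_size


if __name__ == "__main__":
    import sys

    # Usage: python changed_pages.py old.bin new.bin
    for off in changed_pages(sys.argv[1], sys.argv[2]):
        print(f"{off:#010x}")
```

Only two pages are held in memory at a time, so the limit becomes storage read speed rather than RAM.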
csharper2005 has quit [Read error: Connection reset by peer]
<stintel> +255daff0: ffff ffff 836b 2000 ffff ffff 836b 2000 .....k ......k .
<stintel> +255db000: ffff ffff 836b 3000 ffff ffff 836b 3000 .....k0......k0.
<stintel> not identical but similar
<stintel> gut feeling says something timestamp based
<hurricos> do you happen to have serial?
<stintel> yes
<stintel> hmm no it alternates
<hurricos> do you have a copy of your System.map?
<stintel> yes
<stintel> aha
<stintel> ffffffff836b2000 B invalid_pte_table
<stintel> ffffffff836b3000 B invalid_pmd_table
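The lookup stintel did by hand, turning a repeated 8-byte value from the dump into a System.map symbol, can be scripted. A sketch assuming the usual three-column `address type name` System.map layout; the two sample lines are the ones quoted above and the function names are hypothetical.

```python
def parse_system_map(text: str) -> dict[int, list[tuple[str, str]]]:
    """Map each address to the (type, name) pairs listed for it."""
    symbols: dict[int, list[tuple[str, str]]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "address type name"
        addr, kind, name = parts
        symbols.setdefault(int(addr, 16), []).append((kind, name))
    return symbols


def xxd_word_to_address(word: str) -> int:
    """Convert an xxd-style grouping such as 'ffff ffff 836b 2000' to an int."""
    return int(word.replace(" ", ""), 16)


if __name__ == "__main__":
    system_map = (
        "ffffffff836b2000 B invalid_pte_table\n"
        "ffffffff836b3000 B invalid_pmd_table\n"
    )
    symbols = parse_system_map(system_map)
    addr = xxd_word_to_address("ffff ffff 836b 2000")
    print(hex(addr), symbols.get(addr, []))
```

An exact-address match only shows that a symbol starts there; it does not by itself prove the value in the dump is a pointer to that symbol.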
<hurricos> ah n
<hurricos> I doubt that's it
<hurricos> oh no
<hurricos> it is, isn't it?
<hurricos> we have to ask neggles whether those symbols are at those offsets in their page table
<hurricos> like
<hurricos> err.
<stintel> it seems to be very much arch specific
<hurricos> ffs. In their system.map
<hurricos> at ffff ffff 8201 2000. That is a small enough amount to be very believable as a difference between two people compiling
<hurricos> try compiling with different options
<hurricos> such that system.map changes
<hurricos> and verify
<hurricos> I forgot all I learned about system.map when I was debugging the ap3825i
<hurricos> so forgive me, I have to brush up, but B is ... what, variable storage?
<stintel> I haven't the slightest clue
ptudor has joined #openwrt-devel
<schmars[m]> nice debugging adventure you got going there. we have a bunch of octeon devices that will be rescued by your efforts :-)
<hurricos> schmars: please, no
<hurricos> no :(
<schmars[m]> :)
<hurricos> OK. B and b are empty buffers preallocated by the kernel for stuff
<stintel> I'm gonna have an attempt at bisecting the leak
<stintel> well, at building with external kernel really
<stintel> really great that kmemleak and page_owner are 100% useless
<hurricos> ida_alloc_range ...
csharper2005 has joined #openwrt-devel
<hurricos> stintel: so few of the things build in between that it's likely not going to be easy :\
csharper2005 has quit [Read error: Connection reset by peer]
<hurricos> You were definitely right. It was left broken for a while
<hurricos> first two compile attempts failed, I just assumed git bisect bad the first time so they weren't even nearby
<stintel> computers were a mistake :)
<Habbie> definitely
<stintel> what bothers me most is that this massive leak has gone unfixed for so long
<stintel> can we conclude that nobody uses octeon, or even mips?
<Habbie> are you saying linux mips64 just leaks?
<Habbie> (i only paid partial attention)
<stintel> well it's only guesswork because none of us was able to track anything down yet
<Habbie> ack
<hanetzer> nack
<hurricos> we don't have any other serious mips64 targets other than loongson, is my understanding
<stintel> at least I can run "/etc/init.d/oct-remote.0 restart" to reboot the first snic :P
<neggles> hurricos / stintel: hi
<hurricos> are you telling me the host you're booting this from is
<neggles> what am I checking
<hurricos> neggles: system.map
<neggles> ok 2 tics
<stintel> did we try restarting dnsmasq in a loop on !octeon ?
<hurricos> grep for 82 01 30 00
<stintel> neggles: morning ;)
<stintel> [ 10.334774] libphy: mdio_octeon: probed
<stintel> [ 10.346551] Unhandled kernel unaligned access[#1]:
<stintel> panic :P
<hurricos> rip
<stintel> snic no likey unpatched 5.10.0
<stintel> but I can just disable everything network related, can do the restart loop via serial
<hurricos> neggles: more precisely, sorry, look for what symbol is at ffffffff82013000
<hurricos> just because there's a symbol there doesn't mean it's the right one.
<hurricos> but I have a worry. I only asked about system.map because I wanted to see your buffer stintel
<hurricos> (specifically the __log_buf)
<hurricos> so you could trigger that panic. But in all likelihood it'd not be the calling thread oopsing, I don't think. Unless you kicked the invalid memory back in RIGHT as it started working
<hurricos> in which case yes, it'd execute 0x0 and implode and the kernel would drop the oops at the end of __log_buf or wherever its extended storage is
<hurricos> I'm gonna go get a coffee.
<stintel> did we actually try restarting dnsmasq in a loop on !octeon? :P
<hurricos> stintel: fiiine
<hurricos> wait, no! but this isn't userspace!
<hurricos> dnsmasq isn't the problem
<hurricos> if it were we'd see it everywhere
<stintel> I know lol
<hurricos> and, and, the issue isn't present on 5.4.
<hurricos> You're still scaring me fwiw
<stintel> but restarting dnsmasq seems to trigger at least one of the leaks
<neggles> hmm my system.map has ffffffff81bfbba0 d __irf_start, ffffffff8212ebfa d __irf_end
<hurricos> no 82013000?
<neggles> nope
<hurricos> RIP
<neggles> it appears those two mark a whole region
<neggles> what is irf
<hurricos> no, no, it's OK. It's just garbage
<hurricos> this is the same system.map that ran this, yes? png
<hurricos> png
<hurricos> ...
<neggles> Yes I haven’t done a rebuild since
<neggles> but maybe I grabbed from the wrong spot lemme just make sure
<hurricos> Inter Reference Frequency. It's memory management
<hurricos> ... but so is much of the kernel
<neggles> yep definitely the right map
<hurricos> OK. I'm letting the server chew through and get a frequency of these diffs, per line
<neggles> i can give you the ram dumps if you like they compress to like 15mb
<stintel> [ 446.558430] jffs2: compression type 0x08 not available
<stintel> grmbl
<neggles> also sysntpd causes a small leak too - but every time it polls a server it briefly opens a UDP socket bound to 0.0.0.0
<neggles> listening socket
csharper2005 has joined #openwrt-devel
hanetzer has quit [Ping timeout: 480 seconds]
<stintel> +#define JFFS2_COMPR_LZMA 0x08
<stintel> bisecting kernel in OpenWrt sucks
<hurricos> it does :\
<hurricos> OpenWrt is very nice but some of the patches need to be upstreamed badly
<neggles> dnsmasq of course opens several; also odhcp6c restarts trigger the leak, it opens some more 0.0.0.0/::: sockets
Grommish_ has joined #openwrt-devel
<neggles> when grommish uninstalled odhcp6c without removing the ipv6 config from network._, he OOMed in 570s
<hurricos> heh
<hurricos> Waiting on sqlite to finish up so I can get a summary of diffs
<hurricos> cmp could also do it, actually
Grommish has quit [Ping timeout: 480 seconds]
csharper2005 has quit [Read error: Connection reset by peer]
<neggles> hurricos: __irf_start is initramfs
<hurricos> Yes, of course, that's understood
<hurricos> I'm doing a summary of the diffs to find these big ones but I'm beginning to doubt that these are ... memory offsets
<hurricos> well it's hard to doubt. It's literally the right format, 64bit with a 2GB offset (0x80000000)
<hurricos> or sorry, ffffffff80000000
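The correction from 0x80000000 to ffffffff80000000 is a sign extension: a 32-bit address with the top bit set, widened to 64 bits, gets its upper half filled with ones. A small sketch of that arithmetic follows; whether the values in the dump really are such addresses is still only the working guess from the discussion.

```python
def sign_extend_32(value: int) -> int:
    """Sign-extend a 32-bit value to an unsigned 64-bit representation."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:           # top bit set: negative as a 32-bit int
        value |= 0xFFFFFFFF00000000  # fill the upper 32 bits with ones
    return value


if __name__ == "__main__":
    for v in (0x80000000, 0x82013000, 0x00001000):
        print(f"{v:#010x} -> {sign_extend_32(v):#018x}")
```

This is why the 8-byte pattern in the dumps starts with ff ff ff ff: the low word carries the 2GB-offset address and the high word is the extension.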
<hurricos> emmory
<hurricos> memory offsets *of existing structures* which point to some sort of functionality, is what I was trying to say.
<neggles> i did not know that :P yeah I’m wondering if they’re an identifier for an fpa buffer start point or something
<hurricos> that yeah. I'm worried about that, I didn't want it to be the FPA
<hurricos> if the hardware just feels like eternally collecting memory it's going to be difficult to debug.
<hurricos> if it's something assisting the hardware with that, which ... well, it's not, like, running a firmware. It's not smart hardware. The driver just manages it. So almost certainly what it is. But I'm saying, I really want to end up finding the code that does this and not end up in some obscure spaghetti
<neggles> i don’t think it’s that, there were a bunch of fiddly little changes between 5.4 and 5.10 in how it handles passing skbuffs to/from the fpa
<hurricos> right
<hurricos> right!
<hurricos> anyways, let sqlite wrap up and then let me index some stuff and find a list of diffs by frequency
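A frequency table of 64-bit words does not strictly need a database; for a dump (or just its changed pages) that fits in memory, a `Counter` is enough. The sketch below assumes big-endian 8-byte words, matching the byte order seen in the xxd output earlier, and is an alternative to the sqlite approach hurricos mentions, not a description of it.

```python
import struct
from collections import Counter


def word_frequencies(data: bytes, top: int = 5) -> list[tuple[int, int]]:
    """Return the `top` most common big-endian 64-bit words as (word, count)."""
    usable = len(data) - len(data) % 8  # ignore a trailing partial word
    words = struct.unpack(f">{usable // 8}Q", data[:usable])
    return Counter(words).most_common(top)


if __name__ == "__main__":
    # Tiny synthetic buffer built from the two values quoted in the log.
    sample = bytes.fromhex("ffffffff836b2000") * 3 + bytes.fromhex("ffffffff836b3000")
    for word, count in word_frequencies(sample):
        print(f"{word:#018x} x{count}")
```

Feeding it only the pages that changed between the two dumps keeps memory use modest and makes the dominant values stand out.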
hanetzer has joined #openwrt-devel
csharper2005 has joined #openwrt-devel
csharper2005 has quit [Read error: Connection reset by peer]
<stintel> [ 169.308293] jffs2: error: (735) jffs2_build_inode_fragtree: Add node to tree failed -22
<stintel> grmbl :P
hanetzer1 has joined #openwrt-devel