<Umeaboy>
I wish that future OpenWrt versions would have the smaller version of LuCI in them.
<hurricos>
You can confirm Grommish that the memleak goes away when you unload / load the driver, as well?
<hurricos>
s/memleak/unfreed memory allocation/
<Grommish>
hurricos: When I kmod'
<Grommish>
err when I kmod'd the octeon-ethernet.ko it seemed to stop when I unloaded it
<hurricos>
stop, or ... wait, just stop? not return to your free memory?
<Grommish>
Did not return that I could tell
<Grommish>
But I'd have to recheck
<Grommish>
Let me re-modularize octeon-ethernet and test
<hurricos>
I doubt you'll get it back, reading your posts you were pretty certain about it
<hurricos>
you also reference kfree_skb, I don't actually know how these drivers interact but I'd bet with how little the driver's been worked on recently, there's potential for not freeing memory then used by other drivers or non-modularizable bits, like the networking stack
<Grommish>
Just don't try to reload it after you unload it or it panics
<Grommish>
But neggles said that was to be expected
<hurricos>
No, no. I will break down and cry if I put a snic10e back in a box again
<hurricos>
I appreciate it though :^)
<hurricos>
Lemme checkout right after the 5.6 re-inclusion and build.
<Grommish>
Ok.. and if you have an initramfs to test, toss me the link and it shall be tested
<hurricos>
:thumbsup:
ptudor has quit [Read error: Connection reset by peer]
<Grommish>
I've got all the services turned off and an hour of wait time @ 5 min intervals to see what might be leaking.. I'm turning them on one at a time and repeating
ptudor has joined #openwrt-devel
Umeaboy has quit [Quit: Leaving]
ptudor has quit [Quit: Strict-Transport-Security: max-age=48211200; preload]
ptudor has joined #openwrt-devel
goliath has quit [Quit: SIGSEGV]
shoragan has quit [Ping timeout: 480 seconds]
minimal has quit [Quit: Leaving]
<neggles>
hurricos: hello sir
<neggles>
I am a masochist and therefore have access to a plethora of octeon
<neggles>
one thing that's consistent for myself and grommish is that restarting dnsmasq will add 1-2mb to used memory
<neggles>
however if I instruct dnsmasq to bind to a specific IP address rather than wildcard, this does *not* happen
<neggles>
my current theory is "wildcard UDP socket receive buffers are going missing"
<neggles>
but I do not really know much of anything about how the internals of the kernel networking stack do things
<neggles>
i've got an snic10e set up, an srx300 i've not gotten around to making work yet, a USG-XG-8 which I can probably make work, a couple USG-3s, an ERLite-3... and hey I don't want Octeon to die I want octeonplus to die :P octeon ii and octeon iii can stay... for now...
<mangix>
I like ax
valku has quit [Quit: valku]
<neggles>
hurricos: dumped the memory of an snic10e which was just-booted (40MiB), restarted dnsmasq repeatedly until it hit 300MiB, dumped the memory again... it's all this? https://i.imgur.com/ZcalJzR.png
<Slimey>
cant sleep
<russell-->
neggles: are you dropping cache? (e.g. echo 3 > /proc/sys/vm/drop_caches). Stuff isn't going to be evicted from memory until there is pressure from somewhere
<neggles>
russell--: yes, it's not caches
<neggles>
echo 3 > /proc/sys/vm/drop_caches does exactly nothing
<neggles>
based on a bunch of behavioural tests I and Grommish have just re-verified, it does seem like it's something to do with wildcard-bound UDP sockets
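A minimal sketch of how the wildcard-UDP theory could be poked at directly, without dnsmasq in the picture. It assumes Python 3 is available on the device (it is not part of a default OpenWrt image), and the helper names `parse_meminfo` and `churn_wildcard_udp` are made up for illustration. Whether a plain bind-and-close cycle is enough to trigger the leak is not established anywhere in this discussion.

```python
import socket


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer value} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # Most fields are in kB; the unit suffix is ignored here.
            fields[name.strip()] = int(parts[0])
    return fields


def churn_wildcard_udp(count, address="0.0.0.0"):
    """Open and close `count` UDP sockets bound to `address` on a kernel-chosen port."""
    done = 0
    for _ in range(count):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.bind((address, 0))  # port 0: let the kernel pick a free port
            done += 1
    return done
```

To use it, read `/proc/meminfo` through `parse_meminfo` before and after a call such as `churn_wildcard_udp(1000)` and compare `MemFree`. Repeating the run with `address` set to one specific interface IP would mirror the dnsmasq observation above.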
<stintel>
we build kea/host against libressl and then kea against openssl
<stintel>
isn't that asking for trouble anyway?
<jow>
what is kea and why does it need a host build?
<stintel>
kea is a dhcp server, dunno why it needs a host build, wasn't mentioned in the commit that introduced it
<jow>
apparently for some "kea-msg-compiler" executable
<stintel>
that seems optional to me
<stintel>
will try and rip it out completely then
<Habbie>
can I get a wiki account please?
<stintel>
Habbie: /q me your email ?
<Habbie>
done, thanks :)
<Habbie>
this router does 125k serial during uboot and 115k2 after and i really want to save the next person these two hours I spent :D
<stintel>
pffft
<Habbie>
for extra fun, the CH340G can't do that
<stintel>
seriously who comes up with that shit
<Habbie>
tp-link
<Habbie>
:)
danitool has joined #openwrt-devel
<Slimey>
lol
<hauke>
jow: does openssl support static linking? I think that was one of the reasons for libressl
floof58 has quit [Ping timeout: 480 seconds]
<Borromini>
Habbie: which TP-Link?
<hurricos>
grommish: Unfortunately the patch between 5.4.96 and something mid-5.8 doesn't apply :(
<Grommish>
hurricos: I think the last 5.4 I tested was .175
<Grommish>
Which was fine.. that was the last 5.4 bump the kernel had before the switch to 5.10
<Grommish>
I also managed to stop the leak by turning everything off but networking
<Grommish>
So, the UDP issue neggles is suspecting is looking more and more likely
<stintel>
have we been going at it all wrong? is it a mips64 issue rather than an octeon issue? :P
<Grommish>
stintel: Nah. I suspect the networking changes that happened upstream are an issue, but not the driver itself, or at least, not exclusively
<Habbie>
Borromini, tl-wr841nd v11
<Borromini>
Habbie: ok
<Borromini>
what version is that model at now? v14?
<Grommish>
stintel: You can see my last test on the thread, but I can maintain networking as long as I kill everything else, including ipv6 and dns/dhcp
* Borromini
has a v7
<Grommish>
and it's stable as anything
floof58 has joined #openwrt-devel
csharper2005 has quit [Ping timeout: 480 seconds]
csharper2005 has joined #openwrt-devel
<mangix>
stintel: libressl uses cmake. Good enough reason to keep.
mrkiko has quit [Remote host closed the connection]
csharper2005 has quit [Read error: Connection reset by peer]
floof58_ has joined #openwrt-devel
floof58_ has quit []
floof58_ has joined #openwrt-devel
<stintel>
that alone is not a reason to keep it
<mangix>
I disagree
floof58 has quit [Ping timeout: 480 seconds]
floof58_ has quit []
floof58 has joined #openwrt-devel
<Habbie>
Borromini, v14 is what i'm aware of at least
<stintel>
having libressl for host build and openssl for target build is a recipe for confusion and potentially hard to debug issues
<stintel>
the fact that it uses cmake is irrelevant
<mangix>
I disagree. Libressl is only used for a select few packages.
<mangix>
OpenSSL is used by many packages.
csharper2005 has joined #openwrt-devel
cbeznea has joined #openwrt-devel
csharper2005 has quit [Read error: Connection reset by peer]
cbeznea has quit [Quit: Leaving.]
<hurricos>
Grommish: So ... what I could do is manually repack a newer linux into the xzcat'ed tarball
<hurricos>
OpenWrt itself has a TON of patches most of which will likely not patch nicely with this method
<hurricos>
but the method of storing the diff as a patch isn't going to cut it
<hurricos>
which is a pain, really.
csharper2005 has joined #openwrt-devel
<Grommish>
hurricos: It's all beyond my knowledge, unfortunately.. I can certainly test it, but there isn't much I'm going to be able to add to the "let's try this" stuff. I have full access to initramfs or even just flashing, the device is nearly unbrickable unless the emmc chip dies, so I'm not worried about things going wrong
<hurricos>
I learned recently from stintel that you can have a package point to a source directory
<hurricos>
I'm trying to recall where that's documented; I need to do that with the kernel
<hurricos>
CONFIG_SRC_TREE_OVERRIDE
<hurricos>
src-link
<hurricos>
thank you weechat
<hurricos>
but that won't keep our patches. Hmm. I could try fooling the Makefile instead.
<stintel>
that doesn't work for the kernel
<hurricos>
Oh no
<stintel>
afaik
<hurricos>
and either way
<hurricos>
it's not a problem with the patch
<hurricos>
I somehow included user_headers in the patch
<hurricos>
that's the problem
<slh>
ls
<hurricos>
hurricos!
<hurricos>
:^)
<stintel>
you can do CONFIG_EXTERNAL_KERNEL_TREE
<stintel>
if it boots, should allow you to do a normal git bisect in the directory that points to
<stintel>
that's how I found 4ecf8346c074ff80101a17d39086010f8f4b23b8
<stintel>
I wanted to do that for octeon too but I have yet to find a reproducer
<hurricos>
<a reproducer> that is, a fast method to jiggle the key and reproduce?
<stintel>
yes
<hurricos>
Thank you. Makes sense. Sounds like anything that listens on udp wildcard sockets can reproduce it
<hurricos>
so repeatedly restarting dnsmasq
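The reproducer described here could be sketched roughly as follows: restart dnsmasq in a loop, drop caches, and record `MemFree` after each round. This assumes root on the device and Python 3 being installed; `restart_loop` and its parameters are hypothetical names, and the callables are injectable so the loop can be exercised without touching a real system.

```python
import subprocess


def restart_dnsmasq():
    # OpenWrt init script; needs root on the device.
    subprocess.run(["/etc/init.d/dnsmasq", "restart"], check=True)


def read_mem_free_kb():
    # Return the MemFree value from /proc/meminfo, in kB.
    with open("/proc/meminfo") as handle:
        for line in handle:
            if line.startswith("MemFree:"):
                return int(line.split()[1])
    raise RuntimeError("MemFree not found in /proc/meminfo")


def drop_caches():
    # Same effect as: echo 3 > /proc/sys/vm/drop_caches
    with open("/proc/sys/vm/drop_caches", "w") as handle:
        handle.write("3\n")


def restart_loop(iterations, restart=restart_dnsmasq,
                 sample=read_mem_free_kb, settle=drop_caches):
    """Return MemFree samples: one baseline, then one after each restart."""
    settle()
    samples = [sample()]
    for _ in range(iterations):
        restart()
        settle()
        samples.append(sample())
    return samples
```

A steadily falling series from `restart_loop(50)` on an affected kernel, and a flat one on 5.4, would make this usable as the pass/fail check for a bisect.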
slh64 has quit [Quit: gone]
<hurricos>
Grommish has the setup
<stintel>
bisect between 5.4 and 5.10 was ~15 steps or so
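For a sense of scale, the step count of a binary search grows with the base-2 logarithm of the number of candidates. The sketch below is plain arithmetic, not git's own estimate (git works on a commit graph and its count can differ slightly); `bisect_steps` is a hypothetical helper.

```python
def bisect_steps(candidates):
    """Worst-case halvings needed to isolate one item out of `candidates`."""
    if candidates < 1:
        raise ValueError("need at least one candidate")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (candidates - 1).bit_length()
```

By this measure 15 steps covers up to 32768 candidates, so "~15 steps or so" is in line with a range of tens of thousands of commits.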
<hurricos>
I'm on it now. The other thing to note is WHAT the space fills up with
<Habbie>
i wonder if my 125000 was caused by my pulseview/sigrok sampling rate being too low
<hurricos>
stintel: I'd honestly want to continue the way neggles was going
<hurricos>
reading system memory and seeing *what* is in there, then potentially using octeon-top / octeon-perf to watch and see who is writing to those regions of memory
<stintel>
good luck tracking down where 0x82013000 comes from :P
<hurricos>
yeah.
<hurricos>
memory mapping isn't fun
<hurricos>
err
<hurricos>
also no, that'd not be offset for code, it'd be storage for sure
<hurricos>
I say that but have no clue. I just don't think it's likely it'll be executing those pages, just using them as
<hurricos>
heap
<hurricos>
wait, I wonder
<hurricos>
kernel memory management is also not my strong suit
robimarko has quit [Quit: Leaving]
<hurricos>
page_owner=on
<hurricos>
@grommish reboot with that in your command line with a buggy kernel if you can
<hurricos>
hanetzer: Yes. Already is gone without dsa actually. You point to the switch as a whole device of type bridge and then you create a separate config bridge-vlan which points to the device and gives which ports of that device it should apply to, and which vlan tag
<Slimey>
but really just need "Bootloader" "Bootloader environment" "Kernel A" "Rootfs A" "ART" "SENA0"
<Grommish>
hurricos: For whatever reason, it's rebuilding the toolchain, so gimme a few and I'll have it built out and loaded
<hurricos>
oh you already got it :D
<Grommish>
hurricos: Yah, I can build from source easily enough and appending the cmdline is easy, I just need to set the kernel symbol.. This is only 5.15 though.. do you want me to drop back to 5.10?
<hurricos>
shouldn't matter. But you should be able to publicly post for neggles, that's my only concern
<hurricos>
oh wait, you're compiling for your itus aren't you?
<Grommish>
Yes
<hurricos>
yeah, the snic10e has a PCIe interface. The whole idea is to read the RAM off it, find the problematic bits with the repetitive ff ff ff ff 80 01 20 00 in it
<hurricos>
and then find who owns those pages
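A rough sketch of the "find the problematic bits" step: scan a RAM dump for the repeating byte sequence and report which pages contain it. The 4096-byte page size is an assumption (MIPS kernels can be built with other page sizes), and `find_pattern_pages` is a made-up name.

```python
PATTERN = bytes.fromhex("ffffffff80012000")  # the sequence quoted above


def find_pattern_pages(dump, pattern=PATTERN, page_size=4096):
    """Return {page_index: hit_count} for every page where `pattern` starts."""
    hits = {}
    start = dump.find(pattern)
    while start != -1:
        page = start // page_size
        hits[page] = hits.get(page, 0) + 1
        # Continue after this match so hits do not overlap.
        start = dump.find(pattern, start + len(pattern))
    return hits
```

The function only needs an object with a bytes-style `find`, so a multi-gigabyte dump can be passed as an `mmap.mmap` of the file instead of being read into memory. The page indices it returns are offsets into the dump, which still have to be related to physical page frame numbers before comparing them with page_owner output.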
<hurricos>
my guess is that it's not the FPA hw. But only a guess
csharper2005 has quit [Ping timeout: 480 seconds]
csharper2005 has joined #openwrt-devel
csharper2005 has quit [Read error: Connection reset by peer]
<hanetzer>
question. I'm assuming even hardware switches will be using dsa in the future?
<hurricos>
hanetzer: Hardware switches ARE DSA :^)
<hurricos>
stintel: the other thing is, if you write back the old zeroed-out pages to wherever SLUB (or whoever here is responsible) is allocating them, you will probably trigger a panic
<hurricos>
*in the thread of the responsible process*
<hurricos>
so what you can do is ...
<hurricos>
you can disable SMP in your build
<hurricos>
cause the panic
<hurricos>
then oct-remote-boot
<hurricos>
and read from the kernel's log area
<hurricos>
too much effort likely
<hurricos>
wait, you don't even need to oct-remote-boot I don't believe
<hurricos>
it can just be hung, I don't know if the card has to be live for you to grab or write memory
<hurricos>
maybe you do just to reset some crap. not sure
<hurricos>
stintel: feed me a pubkey, I'd appreciate the dump of the old vs new RAM if I can get it from you
csharper2005 has joined #openwrt-devel
<stintel>
hmmmz, I might not have enough ram for diffing 2 2GB files :P
csharper2005 has quit [Read error: Connection reset by peer]
<stintel>
but I can just disable everything network related, can do the restart loop via serial
<hurricos>
neggles: more precisely, sorry, look for what symbol is at ffffffff82013000
<hurricos>
just because there's a symbol there doesn't mean it's the right one.
<hurricos>
but I have a worry. I only asked about system.map because I wanted to see your buffer stintel
<hurricos>
(specifically the __log_buf)
<hurricos>
so you could trigger that panic. But in all likelihood it'd not be the calling thread oopsing, I don't think. Unless you kicked the invalid memory back in RIGHT as it started working
<hurricos>
in which case yes, it'd execute 0x0 and implode and the kernel would drop the oops at the end of __log_buf or wherever its extended storage is
<hurricos>
I'm gonna go get a coffee.
<stintel>
did we actually try restarting dnsmasq in a loop on !octeon? :P
<hurricos>
stintel: fiiine
<hurricos>
wait, no! but this isn't userspace!
<hurricos>
dnsmasq isn't the problem
<hurricos>
if it were we'd see it everywhere
<stintel>
I know lol
<hurricos>
and, and, the issue isn't present on 5.4.
<hurricos>
You're still scaring me fwiw
<stintel>
but restarting dnsmasq seems to trigger at least one of the leaks
<neggles>
hmm my system.map has ffffffff81bfbba0 d __irf_start, ffffffff8212ebfa d __irf_end
<hurricos>
no 82013000?
<neggles>
nope
<hurricos>
RIP
<neggles>
it appears those two mark a whole region
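The lookup being done by hand here, finding which symbol an address falls under, can be sketched as a nearest-symbol-at-or-below search over System.map. The function names are hypothetical, and the input is assumed to be the usual three-column `address type name` layout.

```python
import bisect


def parse_system_map(text):
    """Parse System.map text into a sorted list of (address, name) tuples."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def symbol_at_or_below(symbols, address):
    """Return (name, offset) of the last symbol starting at or below `address`."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address is below the first symbol
    addr, name = symbols[index]
    return name, address - addr
```

Fed the two lines quoted above, a lookup of `0xffffffff82013000` lands inside the `__irf_start` to `__irf_end` range. As noted earlier in the discussion, a symbol covering the address does not prove it is the right one.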
<neggles>
what is irf
<hurricos>
no, no, it's OK. It's just garbage
<hurricos>
this is the same system.map that ran this, yes? png
<neggles>
but maybe I grabbed from the wrong spot lemme just make sure
<hurricos>
Inter Reference Frequency. It's memory management
<hurricos>
... but so is much of the kernel
<neggles>
yep definitely the right map
<hurricos>
OK. I'm letting the server chew through and get a frequency of these diffs, per line
<neggles>
i can give you the ram dumps if you like they compress to like 15mb
<stintel>
[ 446.558430] jffs2: compression type 0x08 not available
<stintel>
grmbl
<neggles>
also sysntpd causes a small leak too - but every time it polls a server it briefly opens a UDP socket bound to 0.0.0.0
<neggles>
listening socket
csharper2005 has joined #openwrt-devel
hanetzer has quit [Ping timeout: 480 seconds]
<stintel>
+#define JFFS2_COMPR_LZMA 0x08
<stintel>
bisecting kernel in OpenWrt sucks
<hurricos>
it does :\
<hurricos>
OpenWrt is very nice but some of the patches need to be upstreamed badly
<neggles>
dnsmasq of course opens several; also odhcp6c restarts trigger the leak, it opens some more 0.0.0.0/::: sockets
Grommish_ has joined #openwrt-devel
<neggles>
when grommish uninstalled odhcp6c without removing the ipv6 config from network._, he OOMed in 570s
<hurricos>
heh
<hurricos>
Waiting on sqlite to finish up so I can get a summary of diffs
<hurricos>
cmp could also do it, actually
Grommish has quit [Ping timeout: 480 seconds]
csharper2005 has quit [Read error: Connection reset by peer]
<neggles>
hurricos: _irf_start is initramfs
<hurricos>
Yes, of course, that's understood
<hurricos>
I'm doing a summary of the diffs to find these big ones but I'm beginning to doubt that these are ... memory offsets
<hurricos>
well it's hard to doubt. It's literally the right format, 64bit with a 2GB offset (0x80000000)
<hurricos>
or sorry, ffffffff80000000
<hurricos>
emmory
<hurricos>
memory offsets *of existing structures* which point to some sort of functionality, is what I was trying to say.
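The "right format" reasoning can be made concrete: on MIPS64 a 32-bit kernel-segment address such as 0x80000000 appears sign-extended to 64 bits, which is where the leading `ffffffff` comes from. The sketch below assumes the dump is big-endian, as the byte order of the quoted sequence suggests; both helper names are made up.

```python
def sign_extend_32(value):
    """Sign-extend a 32-bit value to its 64-bit representation."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value


def decode_be64(raw):
    """Interpret 8 raw bytes from the dump as a big-endian 64-bit integer."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return int.from_bytes(raw, "big")
```

Under these assumptions the repeating `ff ff ff ff 80 01 20 00` decodes to the same value as `sign_extend_32(0x80012000)`, which is why it reads like a kernel address rather than random data.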
<neggles>
i did not know that :P yeah I’m wondering if they’re an identifier for an fpa buffer start point or something
<hurricos>
that yeah. I'm worried about that, I didn't want it to be the FPA
<hurricos>
if the hardware just feels like eternally collecting memory it's going to be difficult to debug.
<hurricos>
if it's something assisting the hardware with that, which ... well, it's not, like, running a firmware. It's not smart hardware. The driver just manages it. So almost certainly what it is. But I'm saying, I really want to end up finding the code that does this and not end up in some obscure spaghetti
<neggles>
i don’t think it’s that, there were a bunch of fiddly little changes between 5.4 and 5.10 in how it handles passing skbuffs to/from the fpa
<hurricos>
right
<hurricos>
right!
<hurricos>
anyways, let sqlite wrap up and then let me index some stuff and find a list of diffs by frequency
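A sketch of the "list of diffs by frequency" idea, using an in-memory counter rather than the sqlite approach mentioned here: compare the before and after dumps word by word and count which new values show up most often. The 8-byte word size and the name `changed_word_frequencies` are assumptions for illustration.

```python
from collections import Counter


def changed_word_frequencies(old, new, word_size=8):
    """Count the new values of every aligned word that differs between two dumps."""
    counts = Counter()
    limit = min(len(old), len(new))
    limit -= limit % word_size  # ignore a trailing partial word
    for offset in range(0, limit, word_size):
        new_word = new[offset:offset + word_size]
        if new_word != old[offset:offset + word_size]:
            counts[bytes(new_word)] += 1
    return counts
```

Calling `.most_common(10)` on the result gives the top changed values. A pure-Python loop like this is slow on 2GB dumps, so it is better suited to checking a region of interest than to a full pass.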
hanetzer has joined #openwrt-devel
csharper2005 has joined #openwrt-devel
csharper2005 has quit [Read error: Connection reset by peer]
<stintel>
[ 169.308293] jffs2: error: (735) jffs2_build_inode_fragtree: Add node to tree failed -22