nipos has left #haiku [Disconnected: Replaced by new connection]
nipos has joined #haiku
avanspector[m] has joined #haiku
<avanspector[m]>
waddlesplash: hi, have there been any considerations amongst devs on adding futexes to user API?
_-Caleb-_ has left #haiku [#haiku]
_-Caleb-_ has joined #haiku
OrangeBomb has quit [Remote host closed the connection]
OrangeBomb has joined #haiku
<waddlesplash>
avanspector[m]: no
<waddlesplash>
the pthread locks already use an API internally that has many of the advantages of a futex API
<waddlesplash>
I don't see much if any reason to expose anything else. the only thing that might be worthwhile is the API for having a mutex in just an int32 rather than the whole pthread data structure, but even this has very limited usecases
<waddlesplash>
why do you ask?
* OscarL
wonders if using "hlt" instead of "pause" while on KDL's kgetc() would work (and if that would allow VBox to stop eating 100% of a core while the debug console is active).
* OscarL
reads some driver code and wishes he had a better brain.
<OscarL>
that's from this side, erysdren. how about you? :-)
<erysdren>
just working on my code projects... custom 3D game engine, Rise of the Triad sourceport, Quake engine stuff....
<erysdren>
yknow, the usual :P
<erysdren>
i really wanna make the memory usage and management in the ROTT sourceport (Taradino) better, so it might have a better chance of running on other platforms
<erysdren>
it only barely runs on Haiku as-is.
<OscarL>
heh
<erysdren>
i think Linux/Windows/etc just have better hardening against applications with shoddy memory management, crazy array accesses, etc.
<erysdren>
the game is coded pretty poorly.
<Skipp_OSX>
ActionRetro has Evolve III Maestro 11.6" Laptop Computer - Micro Center... Intel Celeron N3450 1.1GHz Processor
<Skipp_OSX>
Celeron N4120, N3450... bout the same, need patch :)
<OscarL>
I guess I mixed up with some other machine. I thought the one he got for <100 USD had an N4120.
<OscarL>
The N3450 has a different device ID: "5A85"
<OscarL>
while the N4x20 sits right between KabyLake and CoffeeLake.
<OscarL>
(in regards to graphics)
<Skipp_OSX>
die shrunk I believe on mobile
<Skipp_OSX>
14nm part
<OscarL>
right. Intel code names give me a headache.
<Skipp_OSX>
take a Tylenol, we need it!
<OscarL>
feel free to add the IDs on https://review.haiku-os.org/c/haiku/+/8083. I can't test hardware I don't have, and I'd rather avoid possible KDLs/black screens at boot :-)
<OscarL>
that little netbook sounds nice, till you read about the erratum on those Goldmont CPUs: "eMMC should have a maximum of 33% active time and should be set to D3 device low power state by the operating system when not in use."
<OscarL>
paraphrasing: "also... don't use USB ports, LPC bus, SD Card, or RTC too much if you want it to last" :-D
<Skipp_OSX>
come on you're the one that knows the IDs :)
<OscarL>
but we need more than just the "UHD Graphics 500" id... we need at least two more (PCH and LPC bus one)... and also to make sure to which "group" it actually equates.
<OscarL>
Apollo Lake seems Gen9, so I would assume SkyLake group. But I assumed GeminiLake was KabyLake... turns out it is, just in parts, in others, it behaves like CoffeeLake :-D
<OscarL>
Skipp_OSX: on my GeminiLake... I looked in listdev output for the following: "Host bridge" (that id goes along with the device ID for the Graphics, in intel_gart.cpp).
talos has quit [Ping timeout: 480 seconds]
<Skipp_OSX>
I'll email him real quick
<Skipp_OSX>
(just kidding)
<OscarL>
and then for... "ISA bridge" (usually mentions LPC controller)... that ID might be needed in intel_extreme/driver.cpp's "detect_intel_pch()".
<OscarL>
and that's more or less the total extent of my understanding of things :-P
* OscarL
rebuilds 8083 on top of beta5 branch, to see if that fixes his boot issues.
<OscarL>
Sadly, can't do much with that :-). At least he can boot currently, would be bad to change that for the worse for "yolo"ing the IDs... (remember radeon :-D)
<OscarL>
"If you rename the system folder or its content"... that warning drives me nuts. I'm trying to rename a file under /var/log... stop pestering me Tracker!!!
<OscarL>
Also... that dialog is a nightmare to use with keyboard only.
<Skipp_OSX>
yeah I hate that warning too
* OscarL
is really happy with idualwifi7260 finally working *most* of the time (after 6 months where it only worked on 2 separate days :-D)
mmu_man has quit [Ping timeout: 480 seconds]
<OscarL>
yes! beta5 with working intel_extreme graphics (brightness control working).
smalltalkman__ has quit []
<Skipp_OSX>
yay
diver has joined #haiku
* phschafft
can clearly think about a few people he would like to have brightness control on...
<OscarL>
would be cool if the screen turned off when closing the lid, but... oh well.
<OscarL>
phschafft: radiant people are the worst ;-P
<phschafft>
hm....
<OscarL>
(black body radiation doesn't count)
<zdykstra>
evenin' OscarL
<OscarL>
hello zdykstra!
<zdykstra>
how are you this fine evening?
<zdykstra>
and by fine I mean hot and humid
<OscarL>
midnight already down here... less cold than the last few days so... can't complain much :-)
<botifico>
[haikuports/haikuports] Begasus 3e37060 - extra_cmake_modules, bump version (#10922)
talos has joined #haiku
<oanderso[m]>
x512 waddlesplash I'd really appreciate eyes on this change. Not enough stuff is working to really test it yet, but I'd really like to get this foundational piece solid before layering more users on top of it: https://review.haiku-os.org/c/haiku/+/8139
Forza has joined #haiku
yann64 has joined #haiku
ChaiTRex has quit [Remote host closed the connection]
<botifico>
[haikuports/haikuports] Begasus 407ea03 - karchive6, bump version (#10931)
<Begasus>
;)
<Begasus>
keeps passing by here when "booting" is mentioned :D
arraybolt3 has quit [Quit: WeeChat 4.1.1]
<phschafft>
over here we sometimes refer to reboot as 'umschuhen' (to change shoes).
<Begasus>
funny too (ps, I do know a bit of German, Aachen only about 50Km from here) :)
arraybolt3 has joined #haiku
<Begasus>
my French is more rusty though
* phschafft
nods.
<phschafft>
I know, but this is an international channel after all :)
<Begasus>
+1 :)
<gordonjcp>
I may be the only Gaelic speaker in the channel though
<phschafft>
and I think it's always kind to keep those in mind who just read. Being myself in that position most of the time.
<Begasus>
Gaelic as in?
<phschafft>
gordonjcp: then you need to install a language pack on someone else!
<dovsienko>
Begasus: the native language of Scotland (after Pictish, that is)
<gordonjcp>
phschafft: my son has mostly settled on speaking English but when he was about 2 he would just use whichever English, German, or Gaelic word he thought of first
<gordonjcp>
phschafft: poor wee guy had a hard time on holiday because the other kids he was playing with couldn't understand him speaking English but he could understand them speaking English or German
<Begasus>
ah!
<Begasus>
I still talk our local dialect also to the kids and grandchildren, sadly they didn't pick it up
<botifico>
[haikuports/haikuports] Begasus 39e4dbf - kwidgetsaddons6, bump version (#10932)
<phschafft>
gordonjcp: my parents had a fight over which my second language should be, so they decided that I would not grow up with three languages but one. made my life way more difficult in the end.
<phschafft>
(for any definition of language as by my parents)
<phschafft>
I think it is very good to /offer/ additional languages to young children as it helps their brains get used to there being more than one.
<phschafft>
what they make of that is clearly their way.
<Begasus>
I'll second that
<phschafft>
also I very much understand the problem with interacting with other people. I find myself now, years later, in that situation sometimes.
<phschafft>
it can be hard, especially for a young child. but it can also be a path to growth. for both sides. (but may also require a bit of support from the parents of both sides).
<botifico>
[haikuports/haikuports] Begasus c61b7de - kguiaddons6, bump version (#10934)
<Begasus>
need to focus a bit on those KF6 recipes, they need to be built in the right order, so not doing many side-tracks today :)
diver1 has joined #haiku
diver has quit [Read error: Connection reset by peer]
jhj has joined #haiku
<jhj>
Hi there. Somewhat recently I've seen a huge performance regression on Haiku, around 90% slowdown, with our application. (https://dps8m.gitlab.io/)
<jhj>
We support building a threaded and a non-threaded version.
_-Caleb-_ has left #haiku [#haiku]
<jhj>
The non-threaded version performs fine, but the threaded version doing work in a single thread is 80-90% slower, and this wasn't always the case.
<jhj>
It's just Haiku - the threaded version is only 7-9% slower on all other tested platforms we support (macOS, Linux, Solaris, AIX, Windows)
<botifico>
[haikuports/haikuports] Begasus e3f4034 - kcolorscheme6, bump version (#10935)
<jhj>
Also, the "pigz" package is currently broken on Haiku, it exits with an error about library mismatch.
<avanspector[m]>
<waddlesplash> "why do you ask?" <- Every other OS has them, and in that sense they are really powerful because I can reproduce most of synchronization primitives just using thin abstraction on top of futexes
<avanspector[m]>
in my libraries that I use across OSs
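For reference, a minimal sketch of the kind of thin abstraction avanspector[m] is describing: a three-state mutex built directly on the Linux futex(2) syscall, following the well-known pattern from Drepper's "Futexes Are Tricky". This is Linux-specific illustration code, not a Haiku API.

```c
/* Minimal futex-backed mutex sketch (Linux only; illustration, not a Haiku API).
 * Word states: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters. */
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stddef.h>

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void mutex_lock(atomic_int *m)
{
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                                      /* fast path: 0 -> 1, uncontended */
    do {
        int old = 1;
        atomic_compare_exchange_strong(m, &old, 2);  /* try to mark contended */
        if (c == 2 || old != 0)
            futex(m, FUTEX_WAIT, 2);                 /* sleep while the word equals 2 */
        c = 0;
    } while (!atomic_compare_exchange_strong(m, &c, 2));
}

static void mutex_unlock(atomic_int *m)
{
    if (atomic_exchange(m, 0) == 2)                  /* there may be sleepers */
        futex(m, FUTEX_WAKE, 1);
}
```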
<jhj>
Linux though, can't test on Haiku right now.
<dovsienko>
the web site looks broken, but let's say there is a difference that cannot be explained. I tested on Linux too.
jmairboeck has joined #haiku
<gordonjcp>
phschafft: yeah
<gordonjcp>
phschafft: I only started learning German relatively recently in my mid-40s
<gordonjcp>
phschafft: it's not that difficult but I find myself sometimes trying to explain the German word I can't remember in Gaelic, which is no bloody use to anyone
OrngBomb has joined #haiku
OrangeBomb is now known as Guest1615
OrngBomb is now known as OrangeBomb
Guest1615 has quit [Ping timeout: 480 seconds]
illwieckz has joined #haiku
<Coldfirex>
jhj: It might be worthwhile to open a bug report and see if you can find around which hrev the issue started
<jhj>
Coldfirex: I think I should. Is there any snazzy full system profiler for Haiku I should know about?
* phschafft
nods to gordonjcp.
<Coldfirex>
I think we have a profiler, but that is out of my wheelhouse
<jhj>
Coldfirex: I never benchmarked it before scientifically, but I was just testing again recently and said "wow this is SLOW".
<jhj>
And the fact that the threaded version is 80% slower than the single-threaded version on Haiku, but only the expected 7-9% everywhere else, makes me think, it's a Haiku problem.
<jhj>
Also, we only distribute threaded binary builds, so that sucks for Haiku.
<Coldfirex>
Its possible. Maybe try on a fresh beta 4, then b4+updates, then a nightly?
<jhj>
not that we are a major project or anything, but still :)
<jhj>
Actually, it's 76% slower. But still, that's terrible.
<jhj>
That's with nightly from today.
<jhj>
Coldfirex: I recently broke my leg, so I'm stuck downstairs and can't easily use my office computer, I'm stuck with a crappy laptop downstairs. Maybe next time I get up there I setup VNC.
<jhj>
makes testing harder
<dovsienko>
jhj: maybe just send someone who is in shape to fetch your stuff to you
<jhj>
I don't trust anyone to move my desktop around. I can get upstairs but it's a pain, I have to scoot up sitting on each stair and haul the walker up.
<Coldfirex>
or virtualize it :)
<jhj>
My laptop has like 8GB of RAM and a small disk, so VNC will be better. I test on Haiku by SSH'ing into it now.
<jhj>
I would highly recommend not breaking your legs to everyone.
<Coldfirex>
good advice hah
* coolcoder613_
nods
coolcoder613_ is now known as coolcoder613
<coolcoder613>
My laptop has 8GB of RAM and a small disk, and it's the best machine I have
<coolcoder613>
(256GB)
<phschafft>
$in_the_good_old_times...
<coolcoder613>
It is an Apple M1 MacBook Air, so the performance *is* decent
<coolcoder613>
Or at least it is to me
* coolcoder613
is currently instaling BeOS on his Deskpro
<phschafft>
the more powerful machines became on avg the more I throttle them.
yann64 has quit [Quit: Vision[]: i've been blurred!]
<coolcoder613>
I got a PS/2 mouse finally
<coolcoder613>
phschafft: throttle them?
yann64 has joined #haiku
<phschafft>
and keep that in mind: 20 years ago it would be very uncommon to have a desktop running at 1% CPU while doing basically anything.
PetePete has joined #haiku
<phschafft>
coolcoder613: e.g. my laptop can go up to 4x 3.8GHz, but I set the limits for the CPU to 4x 1.6GHz. keeps it cooler when under stress, but hardly noticeable to me.
<phschafft>
I mean if you do something heavy for a second or two everything spins up, throttling them keeps it down to normal at the cost of adding like a second or two to your job.
<phschafft>
but waiting for an extra second is way less annoying than having it spin up and down the fan, it running hotter, and in the end breaking down earlier.
<jhj>
Oh, does anyone know if Haiku's setjmp/longjmp save the signal mask? On BSD, there are _setjmp/_longjmp that don't. Glibc before 2.19 saved by default, but now does not.
* phschafft
wonders why jhj needs those calls.
<jhj>
Haiku has all three of setjmp/_setjmp/sigsetjmp.
<jhj>
phschafft: We use it to implement faults and interrupts in our virtual machine.
<jhj>
Essentially our own exceptions.
<phschafft>
the sig*() variants are in POSIX.
<phschafft>
so they sound like good candidates.
<phschafft>
jhj: 'we'?
<jhj>
POSIX doesn't specify signal mask behaviors, which is why some OS's invented the variant.
<jhj>
"It is unspecified whether longjmp( ) restores the signal mask, leaves the signal mask unchanged, or restores it to its value at the time setjmp( ) was called."
<dovsienko>
jhj: that's a good point, I didn't know the signal mask needs attention
<jhj>
We don't care about saving and restoring it, so we'd prefer to use the fastest variant.
<jhj>
That doesn't address the 75% speed difference between the threaded and non-threaded builds on Haiku, just an unrelated possible performance optimization.
<jhj>
That's just for the small list of OS's where it's faster to use the underscore variants.
<jhj>
I guess I'll just need to benchmark it on Haiku or read the source.
<phschafft>
they're all on the same lineage. fun.
<jhj>
Glibc <2.19 also has the slower signal mask preserving longjmp by default, but that's old enough now I don't think I need to bother special-casing it.
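For completeness, the portable way to get the "don't save the signal mask" behaviour is sigsetjmp()/siglongjmp() with a zero savesigs argument, which POSIX does specify; a small sketch of the fault-dispatch idea jhj describes (the structure and names are illustrative, not the actual dps8m code):

```c
/* Emulated fault dispatch via sigsetjmp/siglongjmp.  Passing 0 as the second
 * argument tells sigsetjmp NOT to save the signal mask, so siglongjmp will not
 * restore it either -- the cheap variant, but with POSIX-specified behaviour. */
#include <setjmp.h>
#include <stdio.h>

enum { FAULT_NONE = 0, FAULT_ILLEGAL_OP };

static sigjmp_buf fault_env;                 /* jump target for emulated faults */

static void raise_fault(int fault)
{
    siglongjmp(fault_env, fault);            /* unwind back to the dispatch loop */
}

static void execute_one_instruction(int opcode)
{
    if (opcode < 0)
        raise_fault(FAULT_ILLEGAL_OP);       /* emulated exception */
    /* ... normal execution ... */
}

int main(void)
{
    int fault = sigsetjmp(fault_env, 0);     /* 0 => don't touch the signal mask */
    if (fault != FAULT_NONE) {
        printf("handled fault %d\n", fault);
        return 0;                            /* stop after demonstrating one fault */
    }
    execute_one_instruction(-1);             /* triggers the fault path */
    return 0;
}
```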
<botifico>
[haikuports/haikuports] jmairboeck 7e43a9c - libtool: bump version (#10938)
walkingdisaster has joined #haiku
SLema has quit [Quit: Vision[]: i've been blurred!]
SLema has joined #haiku
<waddlesplash>
jhj: knowing when the regression started could be useful
<waddlesplash>
I am surprised your program isn't much faster multi threaded even on Linux though
freddietilley has quit [Quit: WeeChat 4.2.2]
<jhj>
waddlesplash: The 7-9% threading slowdown is synchronization stuff. The benchmark in both cases is testing just a single thread.
<waddlesplash>
ah
<jhj>
So in the multithreaded version, you can run, say, 12 emulated CPUs, and each could run that benchmark.
nipos has left #haiku [Error from remote client]
<waddlesplash>
but the multi threaded version with a single thread on Haiku is 75% slower, is that what you're saying?
<jhj>
If you only have one CPU, or you want to run on embedded hardware and don't care about emulating a multiprocessing system, then you could just build the single threaded version.
<jhj>
And yes, the multithreaded version with 1 CPU thread runs our benchmark 75% slower than the single threaded build.
nipos has joined #haiku
<waddlesplash>
odd
<waddlesplash>
pretty much all the multi threading primitives on Haiku should not even call the kernel if used with only one thread
<waddlesplash>
so I don't know what this could be coming from
<waddlesplash>
how long does a build take? could I spin it up here?
<jhj>
waddlesplash: The exact numbers on my host machine just now were 14.0234 MIPS on the single threaded version, and 13.0755 on the multithreaded version.
<waddlesplash>
well it's using 100% CPU on one core here
<waddlesplash>
means we aren't getting lost in lock contention at least
PetePete has quit [Ping timeout: 480 seconds]
PetePete has joined #haiku
<waddlesplash>
jhj: how long should a run take?
<jhj>
I don't understand how it can be 75% slower on Haiku. But I'm positive it wasn't always that bad or I would have noticed it when I ported it originally.
<jhj>
It depends. It takes 20-some seconds here on Linux.
<waddlesplash>
it's been going over a minute here
<jhj>
That sounds "normal" for Haiku.
<jhj>
It'll take 75-80% longer.
<jhj>
So 2 or 3 minutes, depending on your hardware.
<jhj>
On a POWER10 with AIX 7.3, it takes 17 seconds / 18 seconds
<jhj>
Much quicker after profiling.
<waddlesplash>
I would suspect there are things you can optimize in the code itself if PGO makes that much difference
<jhj>
There's not much we can really do; there is a huge number of CPU instructions.
<waddlesplash>
anyway, appears there's a profiler bug here, if I kill the program while it's in progress I don't get results from any other thread
<waddlesplash>
using the whole-system profiler works
<waddlesplash>
guess I should look into that...
<jhj>
And optimizing QEMU with PGO gives the same 18-20% performance improvement with TCG.
<waddlesplash>
it's spending most of its time getting TLS addresses
<jhj>
Yes, we make extensive use of thread local storage.
<waddlesplash>
right, but it looks like you must be fetching it VERY often
<waddlesplash>
instead of fetching it just once at the top of a function
<waddlesplash>
that will probably speed things up everywhere if you fix that actually
<jhj>
Every cycle we have to fetch it more than once, because of the way the appending unit works.
<jhj>
That's by design.
<jhj>
Remember, only Haiku is slow.
<waddlesplash>
?
<waddlesplash>
why can't you fetch it once per cycle?
<jhj>
It's not a regular von Neumann machine design
<waddlesplash>
jhj: if you are fetching it multiple times per cycle on every OS then it's almost certainly a performance issue there too, those OSes are just faster at the fetch.
<waddlesplash>
what does that have to do with anything?
<jhj>
Some instructions can take hundreds of cycles, or thousands as well
<waddlesplash>
you should only need to fetch the TLS address *once* per function call here
<waddlesplash>
no, but the compiler generates many implicit calls to it
<waddlesplash>
if you access a _thread_local variable
yann64 has quit [Quit: Vision[]: i've been blurred!]
<waddlesplash>
so if you instead access the _thread_local variable just once and cache the result, massively reduces implicit calls
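To make that concrete, here is a small sketch (names hypothetical) of what caching the thread-local fetch looks like; under the dynamic TLS model each reference to a `_Thread_local` variable can turn into an implicit call such as `__tls_get_addr`, and after an opaque call the compiler generally has to reload the value:

```c
/* Caching a _Thread_local pointer in a local (hypothetical names). */
struct cpu_state { unsigned long pc, regs[8]; };

_Thread_local struct cpu_state *cpup;        /* set once per thread at startup */

__attribute__((noinline)) static void trace_step(void)
{
    /* stand-in for a call the optimizer cannot see into */
}

/* Every 'cpup' reference may need a fresh thread-local lookup, and the value
 * must be reloaded after the opaque call (the compiler can't prove it didn't
 * change): */
static void step_slow(void)
{
    cpup->pc += 1;
    trace_step();
    cpup->regs[0] = cpup->regs[1];
}

/* Fetching the thread-local once into an ordinary local keeps the pointer in
 * a register for the whole function.  (Valid only because we know the callee
 * never changes cpup -- knowledge the compiler doesn't have.) */
static void step_fast(void)
{
    struct cpu_state *c = cpup;              /* one thread-local lookup */
    c->pc += 1;
    trace_step();
    c->regs[0] = c->regs[1];
}

int main(void)
{
    struct cpu_state state = { 0 };
    cpup = &state;                           /* per-thread initialization */
    step_slow();
    step_fast();
    return (int)state.pc;                    /* 2 */
}
```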
<jhj>
on macos for example, the TLS lookup on their profiler comes in at 1.7%
<waddlesplash>
keep in mind that there is another big difference on Haiku here which is that every executable is built as relocatable
<waddlesplash>
this means we cannot use the "static" TLS model
<waddlesplash>
and the "dynamic" TLS model needs a lot more function calls on every OS, that's just how it works
<waddlesplash>
so, if you are building a non-PIC executable on Linux and macOS, you may get the static TLS model and then it's faster because of that
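For reference, those TLS models can be selected globally with `-ftls-model=` or per variable with a GCC/Clang attribute; a short sketch (variable names hypothetical):

```c
/* The four ELF TLS access models, roughly from most to least general:
 *   global-dynamic : always valid, even for dlopen'ed objects;
 *                    accesses may go through __tls_get_addr()
 *   local-dynamic  : like global-dynamic, but for symbols local to one module
 *   initial-exec   : assumes the module is present at program start;
 *                    access becomes a thread-pointer-relative load via the GOT
 *   local-exec     : main executable only; a fixed offset from the thread pointer
 *
 * Globally:        gcc -ftls-model=initial-exec ...
 * Per variable (GCC/Clang extension): */
extern _Thread_local int gd_var __attribute__((tls_model("global-dynamic")));
extern _Thread_local int ie_var __attribute__((tls_model("initial-exec")));
static _Thread_local int le_var __attribute__((tls_model("local-exec")));

int read_le(void) { return le_var; }   /* keeps le_var referenced */
```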
<jhj>
so I don't think it's my code, really. And even if it was, Haiku is the huge outlier of the 17 OSs we support (AIX, OpenBSD, NetBSD, FreeBSD, DragonFly, Android, Windows, etc.)
<jhj>
also, we enforce the global-dynamic model on AIX
<jhj>
and we build -fPIC on Linux as well
<waddlesplash>
hm
<waddlesplash>
well, I don't know why it calls get_tls_address so often then
<waddlesplash>
maybe rebuilding with Clang might be interesting...
<jhj>
In fact, on AIX where we enforce global-dynamic for various reasons, it's one of the fastest operating systems we run on (per clock cycle)
mmu_man has quit [Ping timeout: 480 seconds]
<waddlesplash>
that's surprising and probably means you could optimize a lot on Linux more.
<jhj>
I don't think clang was different but I can check pretty quick
<jhj>
We originally wrote this on AIX and FreeBSD
<waddlesplash>
jhj: ok, so this isn't helped by the fact that I'm running a debug build of runtime_loader, which is where those TLS addresses come from
<waddlesplash>
but seeing as we have over a thousand hits of TLSBlock::IsInvalid(), this indicates that we are calling these methods a really ridiculous number of times
<waddlesplash>
because that method is literally *just a null pointer check* and nothing else
<waddlesplash>
so for us to get over 1000 hits of it when *only sampling every 1 ms*, indicates just how many times it's invoked (a really ridiculously high number)
<jhj>
I just built with Clang on Haiku, let me run the benchmark.
<phschafft>
waddlesplash: how is the static model different?
<jhj>
my $89 android tablet beats my desktop i7 running Haiku
<jhj>
multithreaded builds on both
<phschafft>
also, the function that returns the TLS address back to the compiler: might it be marked with some attributes that help the compiler? e.g. that the result is constant per thread?
<waddlesplash>
jhj: there may be some TLS caching scheme there, I don't know. but if you adjust the code just a bit to call TLS methods less, it will surely be faster on all OSes and not just Haiku.
<waddlesplash>
phschafft: I am not familiar with the precise differences tbh
<phschafft>
waddlesplash: ok. thank you.
<waddlesplash>
phschafft: I doubt it, because it's not constant per thread
<waddlesplash>
this is ELF TLS, so we get a different result based on what offset is specified
<jhj>
waddlesplash: even without optimization problems, this is definitely a Haiku issue
<phschafft>
the address of a given variable changes per thread at runtime?
<jhj>
on Android for example, the difference between threaded and non-threaded is less than 3%
<waddlesplash>
jhj: I don't know about that. It may be a difference in how GCC generates code for -fPIC -shared
<waddlesplash>
I don't really see how these methods could be optimized much more
<jhj>
waddlesplash: Well well.
<waddlesplash>
we could add a most-recently-accessed cache
<waddlesplash>
but that's about it
<jhj>
Clang build is 5 MIPS for threaded.
<jhj>
GCC is 1
<waddlesplash>
and what's Linux on the same machine?
<jhj>
Clang builds this code significantly slower as well.
<waddlesplash>
so, Clang is much better about not calling the TLS routines probably then
<jhj>
waddlesplash: Linux Clang gets about 7 without profiling
<waddlesplash>
ok
<jhj>
Linux GCC is about 10
<phschafft>
I would maybe hint that it is not a problem in the software xor Haiku but most likely both could help. ;)
<waddlesplash>
jhj: the differences really should not be so big. this really indicates you can optimize your code more
<jhj>
But this wasn't always the case.
<jhj>
I'm quite sure it wasn't this slow before; I would have noticed.
<waddlesplash>
possibly a change in GCC 13, no idea
<jhj>
And our code is very optimized now, at least the parts that do memory addressing. We've been keeping track of it over time.
<waddlesplash>
the fact that Clang gives 8.5 and GCC gives 10.3 indicates that there are probably more things you could do
<waddlesplash>
or it's worth digging into and reporting to Clang upstream
<jhj>
This is very similar to building QEMU on the same machine.
<jhj>
Clang builds are about 15-20% slower with TCG on the same machine.
<waddlesplash>
jhj: there really isn't any way to optimize the get_tls routines here, they are pretty simple. Adding a thread-local cache of the last-accessed DTV is about all I can think of. But that wouldn't speed things up by much
<jhj>
Doing a PGO build improves clang much more than it improves GCC also.
<phschafft>
can you isolate the performance problem to a small section of code?
<waddlesplash>
Not when there are tens of thousands of calls to this method
<phschafft>
maybe you could diff the generated code from gcc and clang.
<waddlesplash>
it's almost certainly just down to how often it does TLS fetches
<waddlesplash>
let's see what the code is actually doing
<waddlesplash>
jhj: actually I don't think we can really cache this. We have to recheck the "Generation" every time
<jhj>
OK, so, with Clang, the difference between threaded and non-threaded versions is about 9%
<waddlesplash>
this is for Android but it appears to detail a lot of what's going on
<jhj>
Android only gives us a 3% difference or so between threaded and non-threaded, but we use Clang from the NDK there.
<jhj>
On essentially every other system except macOS, GCC is faster, but the difference between the threaded and non-threaded benchmarks is constant.
<waddlesplash>
it really shouldn't be
<waddlesplash>
probably fixing the thread local accesses will make the threaded-but-one-thread benchmark just as fast as the regular benchmark on all OSes
<jhj>
Even so, I've done this test on z/OS USS, AIX, Solaris, illumos, OpenBSD, NetBSD, DragonFly, FreeBSD, Windows, macOS, Android, and even oddballs like SerenityOS and GNU/Hurd, and Haiku is so far the only one that has had any weird performance issue.
_-Caleb-_ has left #haiku [#haiku]
<jhj>
And I'm also extremely confident it wasn't always so, because I would have immediately noticed it when I initially did the port.
_-Caleb-_ has joined #haiku
<waddlesplash>
yes, but I guarantee this is something that can be fixed on your end
<waddlesplash>
#define cpu (* cpup)
<waddlesplash>
this is the source of the problem, I bet
<jhj>
That's the CPU state, which we do want to be local to each running thread.
<waddlesplash>
yes
<waddlesplash>
but this means you are doing a TLS access *EVERY TIME* you access the CPU structure
<waddlesplash>
you should instead read it once at the beginning
<waddlesplash>
and pass it through to all functions
<waddlesplash>
probably Clang is just better at caching this variable, but GCC is more aware that it could change
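A minimal sketch of the restructuring being suggested, keeping the `#define cpu (*cpup)` macro quoted above for existing call sites while the hot path passes the pointer explicitly (type and field names hypothetical and heavily simplified):

```c
/* Hot-path restructuring: fetch the thread-local pointer once, pass it down. */
struct cpu_state { unsigned long PPR, TPR; };     /* hypothetical, simplified */

_Thread_local struct cpu_state *cpup;
#define cpu (*cpup)            /* existing pattern: every 'cpu.' use goes
                                  through the thread-local pointer */

/* Before: each 'cpu' reference in a helper may redo the thread-local lookup. */
static void append_cycle_tls(void)
{
    cpu.TPR = cpu.PPR;
}

/* After: callees take the pointer as an argument and never touch TLS. */
static void append_cycle_arg(struct cpu_state *restrict c)
{
    c->TPR = c->PPR;                              /* plain register-based access */
}

static void run_cycles(long n)
{
    struct cpu_state *restrict c = cpup;          /* one lookup per outer call */
    while (n-- > 0)
        append_cycle_arg(c);
}

int main(void)
{
    struct cpu_state state = { .PPR = 1 };
    cpup = &state;
    append_cycle_tls();
    run_cycles(4);
    return (int)state.TPR;                        /* 1 */
}
```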
<jhj>
waddlesplash: On Linux with VTune, TLS access is <2% of the difference, and mutexes and pthread cond wait makes up the rest.
<jhj>
with both gcc and clang
<waddlesplash>
if you are using non-PIC then that may be the case
<waddlesplash>
but I don't see how this could really be much faster on PIC
<waddlesplash>
getting the compiler to cache it across function calls is about it
<waddlesplash>
jhj: let's put it this way, even if we optimize things in Haiku more, you are still doing a function call every single time this variable is accessed, in the general case.
<waddlesplash>
avoiding that will surely be faster everywhere
<phschafft>
can you maybe replace *cpup with something like *get_cpup(); static inline get_cpup(void) __attribute__ ((pure)); static inline get_cpup(void) { return cpup; }
<phschafft>
just to tell the compiler that it is safe to cache the value. and just as a test if that makes any difference.
<waddlesplash>
how would that tell the compiler it can be cached?
<jhj>
No, I just checked and built with an explicit $ env LDFLAGS=-fPIC CFLAGS=-fPIC gmake
<phschafft>
not fully sure if pure is the correct attribute.
<waddlesplash>
jhj: do you link with -shared?
<jhj>
on Linux, and the results are exactly the same
<waddlesplash>
if you don't link with -shared, then the linker may collapse all the TLS
<phschafft>
waddlesplash: it would tell the compiler that the result of get_cpup() is to be cached.
<phschafft>
e.g. in loops.
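A compilable version of phschafft's getter idea (attribute placement and the missing return type filled in; the type name is hypothetical, and whether `pure` is actually safe here is exactly the open question):

```c
struct cpu_state;                                /* opaque; hypothetical name */

extern _Thread_local struct cpu_state *cpup;

/* Promise the compiler that calls to the getter have no side effects and may
 * be merged/hoisted (e.g. out of loops).  This is only a hint for testing:
 * it leans on the fact that cpup never changes within a thread. */
static inline struct cpu_state *get_cpup(void) __attribute__((pure));

static inline struct cpu_state *get_cpup(void)
{
    return cpup;
}

#define cpu (*get_cpup())
```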
<milek7>
accessing var is surely also pure
<waddlesplash>
yes
<waddlesplash>
well.
<waddlesplash>
I'm not sure what TLS accesses count as actually
<phschafft>
milek7: I'm suspecting that gcc for some reason thinks it's not.
<jhj>
waddlesplash: linking what with -shared?
<waddlesplash>
the whole binary
<waddlesplash>
jhj: yeah, you access this variable across multiple contexts in separate files. no wonder this is slow
<waddlesplash>
if you passed it through as an argument, definitely will be way faster
<waddlesplash>
jhj: also, your incremental builds are even slower than Haiku's incremental builds. if they're even incremental at all?
<jhj>
We aren't building any shared libraries. You can't link a whole program with -shared
<waddlesplash>
which is impressive
<waddlesplash>
jhj: you can. Every program on Haiku is.
<waddlesplash>
that's probably a large part of the difference
<waddlesplash>
like I said, *every* application on Haiku is built with -shared
<waddlesplash>
it's implicit in the compiler specs
<jhj>
I can do it on AIX, Solaris, USS, etc.
<jhj>
All the BSDs
<jhj>
and can prove it if you don't believe me :)
<milek7>
I think it's just entry point problem
<jhj>
Even the docs say that's for building objects or libraries and doesn't make an executable.
<jhj>
You mean -pie?
<waddlesplash>
it makes one on Haiku
<waddlesplash>
no, again, on Haiku applications are linked with -shared too
<milek7>
$ /lib/libc.so.6
<milek7>
GNU C Library (GNU libc) stable release version 2.38.
<phschafft>
milek7++
<phschafft>
that was the example I was thinking about as well.
<waddlesplash>
yeah
<jhj>
Well, anyway, on Linux, a PIE build is also no different in speed. I built just now explicitly, without -fPIC or -pie, and the average benchmark was less than 1% faster
<jhj>
So I think this is kind of a red herring.
<waddlesplash>
PIE is still not -shared
<waddlesplash>
Haiku doesn't even support PIE, iirc
<phschafft>
I think FreeBSD was the only ELF-based system I tested that enforced ELFs to be an executable *and* have exactly the correct ABI set.
<phschafft>
while all other systems just loaded it and started INIT.
<milek7>
you could remove thread_local from that var and see if that changes anything
<waddlesplash>
milek7: it does, it's way faster
<waddlesplash>
that's what the "Single threaded" builds do
<waddlesplash>
you don't need to change it everywhere
<jhj>
Especially since, like I mentioned, Haiku is the single outlier OS out of more than a dozen.
<waddlesplash>
keep the macro as-is and pass the argument through instead of fetching it
<waddlesplash>
okay, well, it would be interesting to test with -shared on Linux
<waddlesplash>
apparently there is some way to make that work as milek7 showed above
<jhj>
Still, I can't imagine we are the only affected application either.
<waddlesplash>
I can't think of any other application I have encountered that uses thread-local variables like this
<waddlesplash>
it's not what they're designed for
<waddlesplash>
fetching them will always be slow, it's an implicit function call, unless the compiler manages to cache it
<waddlesplash>
but it *CAN'T* cache it if you call across modules
nosycat has quit [Quit: Leaving]
<waddlesplash>
so, every function call to a function that the compiler can't see into, it MUST refetch
<waddlesplash>
after that
<waddlesplash>
so no matter what this will be inefficient on all OSes, maybe the others have inline asm to cache the results and thus make this faster
<waddlesplash>
but it's still not great
<jhj>
waddlesplash: We link with LTO, so I don't understand why the compiler couldn't see things.
<jhj>
We used to have some crazy concatenation system that we stole from Chrome or something.
<waddlesplash>
webkit also did something like that yes
<waddlesplash>
still
<jhj>
Also, Clang on Haiku isn't using LTO, and it's about 5x faster than the GCC build.
<waddlesplash>
you are then depending on the compiler to do value propagation in a massive way
<jhj>
So something isn't working like it should here.
<waddlesplash>
that's not optimal
<jhj>
waddlesplash: A debug build on webkit is easily 400% slower than the release builds :)
<waddlesplash>
sure
<jhj>
Our debug builds are always massively slower, but we do all sorts of additional things in them, so it's not exactly equivalent.
frkazoid333 has quit [Ping timeout: 480 seconds]
walkingdisaster has quit [Quit: Vision[]: i've been blurred!]
<jhj>
We aren't calling any functions indirectly. We're just doing the equivalent of 'modify_dsptw(cpup->TPR.TSR);' and such, where cpup is thread local, and never changes within a thread.
<jhj>
I'm not sure how Haiku is the only OS where this doesn't work.
<waddlesplash>
a thread local variable access is an indirect function call
<waddlesplash>
that's how it works
<waddlesplash>
the compiler can sometimes cache the result of this indirect function call to avoid calling it multiple times
<waddlesplash>
but it still has to call it
<jhj>
I wonder if we can see why Clang is working fine and GCC isn't.
<waddlesplash>
probably Clang manages to cache the result more
<waddlesplash>
that's it
<jhj>
But that isn't the case, because we use GCC on every other platform as well.
<jhj>
In fact, we recommend using GCC, because it usually is somewhat faster than the alternatives.
<waddlesplash>
again, there are different TLS models here
<waddlesplash>
if you are using a static or PIE executable, the compiler can optimize more
<jhj>
waddlesplash: Even when I explicitly do not, the results don't change.
<waddlesplash>
probably because e.g. it knows this won't be unloaded like a shared library could be, I think
<jhj>
And this is with the global-dynamic model.
<waddlesplash>
but you didn't test on Linux with -shared
<waddlesplash>
there are 4 different TLS models
<waddlesplash>
well, it's possible our compiler is somehow configured differently but I don't know how. clearly Clang manages to figure something out
<jhj>
I'm aware, we explicitly set -ftls-model=global-dynamic on Linux.
<jhj>
At least, I did when I just tested.
<jhj>
I even tested the same version of GCC (13.3.0) on Linux and Haiku.
<waddlesplash>
do you set that on Haiku?
<jhj>
I didn't, but I can. Let me see.
<phschafft>
also GCC's support *for* Haiku might be less good than for Linux. So maybe it can do some more magic on Linux.
<jhj>
phschafft: Well, not just Linux, we recommend GCC everywhere except for macOS and AIX.
yann64 has joined #haiku
<jhj>
And on AIX, we recommend IBM's expensive commercial compiler if available, but GCC does just fine.
<jhj>
There is mainline LLVM support for AIX now because IBM ported Rust, I haven't benchmarked mainline Clang there yet tho.
<phschafft>
jhj: and how does that change my statement?
<jhj>
OK, I'm building now with global-dynamic forced.
<jhj>
phschafft: Oh, it doesn't change your statement, I was just generalizing that it isn't a Linux-specific improvement.
<jhj>
waddlesplash: About to test in a moment. I assumed that global-dynamic is the default, but I'm probably wrong because Haiku is weird. :)
<jhj>
It's not done yet, so it's going slowly. I'll see about forcing the other models, just to see what we get.
<jhj>
Once it finishes.
<phschafft>
jhj: My argument was more that you're comparing compiler support for Haiku with support for other systems that have tens of millions more users.
<phschafft>
so likely there is better support in gcc for those systems.
<milek7>
9,242567 MIPS
<milek7>
3,466251 MIPS
<milek7>
that's on linux gcc
<jhj>
My assumptions are that this is tied closer to ELF than anything more OS specific, but my assumptions are probably wrong.
<waddlesplash>
nope, it's all OS specific
<waddlesplash>
ELF specifies where the TLS needs to be stored and it is up to the OS to deal with that
<waddlesplash>
and up to the compiler to call the OS to get the storage location
<milek7>
second one is shared library (and main dlopened from another binary)
<waddlesplash>
jhj: well, look at that!
<waddlesplash>
milek7 proves it's not Haiku :D
<waddlesplash>
Linux is faster than we are, sure, but it's still a massive performance hit
<jhj>
Yeah, interesting. That's still about 60% slower.
PetePete has quit [Ping timeout: 480 seconds]
PetePete has joined #haiku
PetePete has quit [Read error: Connection reset by peer]
<jhj>
milek7: What happens if you build with -ftls-model=initial-exec in that case?
<jhj>
You'd need to link dps8 at link time vs dlopen though.
<jhj>
runtime_loader: Static TLS model is not supported.
<jhj>
waddlesplash: Does that apply even to a statically linked binary?
<waddlesplash>
we don't support statically linked binaries
<waddlesplash>
you must dynamically link at least to libroot.so
<jhj>
Is the local-exec model supported?
<waddlesplash>
I don't know, but I suspect not?
<waddlesplash>
depends on what exactly that implies
gouchi has joined #haiku
_-Caleb-_ has left #haiku [#haiku]
<jhj>
"/boot/system/develop/tools/bin/../lib/gcc/x86_64-unknown-haiku/13.3.0/../../../../x86_64-unknown-haiku/bin/ld: /tmp//cc0yw5HI.ltrans0.ltrans.o: relocation R_X86_64_TPOFF32 against symbol `cpup' can not be used when making a shared object"
<jhj>
Yeah, apparently not.
_-Caleb-_ has joined #haiku
<milek7>
jhj: I think it will work with LD_PRELOAD
hightower2 has joined #haiku
<milek7>
5,709061 MIPS
<jhj>
Ah, faster, but still not as fast as the regular build.
<jhj>
Thanks.
frkazoid333 has joined #haiku
frkzoid has joined #haiku
frkazoid333 has quit [Ping timeout: 480 seconds]
<jhj>
waddlesplash: Thanks, I guess I'll play with some solutions, but maybe I can just not use thread-local storage for cpup at all.
<waddlesplash>
jhj: I'm working on seeing how feasible that is
<jhj>
Well, I could do it like we do for ROUND_ROBIN
<waddlesplash>
but yes, that's what should be done
<waddlesplash>
how's that?
<jhj>
That's a debugging feature we have, it runs all the CPUs in 1 thread.
<waddlesplash>
you just need to pass a state variable all the way through to all CPU functions
<jhj>
It's there to ensure reproducibility, it runs one instruction and then goes to the next CPU.
<gordonjcp>
jhj: Honeywell DPS8?
<jhj>
yessir
<gordonjcp>
jhj: I'm probably one of the youngest to use CP6 at Robert Gordon's University in Aberdeen, it was just getting pulled out the year I started (1991)
<gordonjcp>
jhj: I have somewhere got some printouts of manuals for it, and a mate of mine from school who's about three years older than me wrote a simple mail system for it after the admins shut the built-in one off :-)
<gordonjcp>
jhj: I actually had access to it before I was at uni, although I was very very much not supposed to
<jhj>
gordonjcp: Actually, if you have any CP-6 stuff, feel free to join our Slack (ugh)
<jhj>
and let us know what you have.
<jhj>
We very much would run CP-6, if we had access to it.
<jhj>
CP-6 uses NSA (New System Architecture), which is the VU (virtual memory unit) instead of the AU (appending unit), which is the difference between the DPS-8/C and DPS-8/M spec CPUs.
<jhj>
GCOS-8 also uses VU/NSA.
<jhj>
We have the CP-6 source code, but not any binary tapes, and no access to a PL/6 compiler.
<jhj>
We don't currently support the VU, but we have more than enough documentation that it could be implemented, if we had something to run on it.
<jhj>
Until then, you are stuck using Multics (or GCOS-3) with us.
<jhj>
CP-6 ran on some additional hardware (the DPS-88 and 9000, IIRC) and by adding VU support we could support those CPU types.
<jhj>
Well, the 9000 (and the later systems as well, like the NovaScale, ACOS systems, M9600, etc.) have some additional CPU features like vector instructions.
<jhj>
But CP-6 doesn't use them.
<jhj>
gordonjcp: Anyway, I'm getting quite off the Haiku topic, but, we will have a new blog entry coming soon at https://dps8m.gitlab.io/blog/ where we'll give a status report.
<jhj>
It'll include begging for more CP-6 materials. We've already reached out to like 30 academic sites that ran CP-6 and haven't had any luck so far. :(
<jhj>
waddlesplash: A very dirty removal of thread-local storage changes the 7% speed difference to a 3% speed difference for me.
<jhj>
on Linux.
* phschafft
wonders if one could make a per thread overlay memory map that would map some CoW memory area into the area for TLS.
B2IA has quit [Ping timeout: 480 seconds]
<jhj>
The remaining difference of 3% could be made much faster for the 1-thread case, if we don't take any SCU/IOM locks until we've actually started >1 thread.
<jhj>
waddlesplash: anyway, thanks again.
<jhj>
Unfortunately, I'm not super motivated to immediately fix it for what amounts to a 3.5% speed improvement on all other platforms (but a 75% speed improvement on Haiku).
<waddlesplash>
I bet it will be more than that if the CPU field is passed in a register rather than requiring a global read
Anarchos has joined #haiku
<jhj>
phschafft: something like that could possibly be made to allow supporting the local-exec model.
<jhj>
But I have no idea what the implications are of all programs also being shared objects on that, and it would need someone way smarter than me who actually knows Haiku internals.
<milek7>
waddlesplash: it might be balanced out by spilling more arguments onto stack
<waddlesplash>
maybe, but a lot of this will probably get inlined
<waddlesplash>
so it may not matter
<phschafft>
jhj: oh, my comment wasn't about Haiku at all. just a thought how this could be done.
<Anarchos>
is it possible to compile on haiku x86_64 for haiku x86_gcc2 ?
<jhj>
Optimizations are interesting.
<phschafft>
also such a per-thread overlay could allow for other kinds of nice features. such as protection of memory between threads. I mean in an ideal world memory is only writeable if declared so. and only writeable by other threads if declared so. and only writeable by other processes if declared so, ... ;)
<milek7>
I think "per thread memory map" is also called a "process" :D
<jhj>
In the past, especially with Clang <14 and GCC <12, it made a pretty big difference to build dps8 with -march=native on AVX2 capable machines.
<jhj>
Now, it usually results in a build that's about 1-2% slower, while overall the baseline binaries (even on old checkouts before we made other optimizations) are faster.
<phschafft>
milek7: not really. there are a lot of other things you can share.
<phschafft>
such as the file descriptor table.
<jhj>
It's improved to the point where I'm not going to bother offering any AVX2 or greater optimized builds any longer.
<phschafft>
or basically any other structure the kernel has about you.
<gordonjcp>
jhj: I definitely don't have any CP6 media
<gordonjcp>
jhj: I wonder how hard a PL/6 compiler would be?
<jhj>
gordonjcp: Find someone that has CP-6 tapes hidden away and point them to us! We can recover 9-track tape data.
<jhj>
gordonjcp: A new PL/6 compiler would be non-trivial to say the least, even with our existing 6000-series PL/I backend.
_-Caleb-_ has left #haiku [#haiku]
_-Caleb-_ has joined #haiku
<jhj>
But even if we could recreate the compiler, and build the entire operating system, there isn't any guarantee that the compiler we create would generate the same or similar code as what exists on real media, with all the corresponding quirks.
<jhj>
Like, maybe the real distributed tapes have generated code for instructions we aren't emulating (or aren't emulating correctly).
Anarchos has quit [Quit: Vision[]: i've been blurred!]
B2IA has joined #haiku
Begasus_32 has joined #haiku
B2IA has quit [Quit: Vision[]: i've been blurred!]
B2IA has joined #haiku
Anarchos has joined #haiku
<jhj>
gordonjcp: Oh, for Multics, for example, only changed parts of the system were actually rebuilt, or, if there was a compiler bug that was identified, the identified miscompilations were rebuilt.
<jhj>
gordonjcp: We actually have a specific optimization for some inefficient code generated by old versions of the Multics PL/I compiler in dps8m.
<Skipp_OSX>
Multics is the primary example of why central planning does not work.
<Skipp_OSX>
Multics was a centrally planned OS, Unix the rogue decentralized OS. The centrally planned OS died an ignominious death like centrally planned nation-states, while the decentralized Unix and capitalist states thrive.
<Anarchos>
how to configure for a x86_gcc2 build ?
<Anarchos>
" ../configure --build-cross-tools x86_gcc2 --build-cross-tools x86_64 --cross-tools-source ../../buildtools/ --use-gcc-pipe -j5" did not work
<jhj>
Skipp_OSX: That's not really true. All of the features that were removed from Multics were just added back to Unix later, like volume management and such.
<jhj>
Skipp_OSX: Multics had a very long commercial history, and the 6000-series mainframes are still currently sold.
<jhj>
And it isn't like Multics is completely gone, we are releasing the next version of it probably before the year is done, and we are working on new hardware as well that implements the classic 6000-series CPU.
<Skipp_OSX>
The very name "Unix" is a play on words for "eunuchs", meaning a "cut-down" version of Multics.
<Skipp_OSX>
Sure, central planning does not go away, it languishes on, but it is not the source of real economic innovation.
<jhj>
I wouldn't say that it died an ignominious death at all either, as it was successful for its market segment. You have to remember that the PDP-7 that Unix was initially developed for was a machine that cost $72K in 1965. The 6000-series machines that Multics targeted were much larger systems, which were multimillion dollar installations.
<waddlesplash>
jhj: this compiles and runs, and with it I get 5.425542 MIPS for the benchmark on Haiku
<waddlesplash>
doubtless I missed stuff in #ifdefs
<jhj>
For example, one customer was the USGS which ran both Multics and Unix, and the Unix systems were mostly under $1M and the Mutlics systems cost about $40M.
<jhj>
This wasn't just cost for the sake of cost, these were much larger systems.
<Skipp_OSX>
I guess you're right that is too far, it languished into obscurity, but did not die an ignominious death.
<jhj>
waddlesplash: Awesome, I'll look in a bit!
<gordonjcp>
Skipp_OSX: also I'd disagree with the idea that capitalist states thrive
<waddlesplash>
jhj: it will be interesting to see what the performance difference is on Linux, if this improves things at all
<jhj>
Skipp_OSX: The main reason why it languished was management though, because your "central planners" picked another winner.
<Skipp_OSX>
only by accident, the inefficiency of the capitalist system is what allows it to thrive, engineers allowed to work on their own independent of the central planners, like Unix vs. Multics.
<jhj>
Skipp_OSX: If Honeywell would have released it under similar terms as Unix, it likely would have won in the long run.
<Skipp_OSX>
perhaps, but the deal with AT&T had already fallen through
<waddlesplash>
jhj: besides whatever I missed in disabled #ifdefs, you will have to check the logic in the main loop after setCPU goto I think... not sure if I got that right
<waddlesplash>
it works for one CPU anyway
<waddlesplash>
but with multiple CPUs, no idea
<jhj>
Skipp_OSX: The other impediment was that the hardware required to run Multics was protected by various patents or trade secrets. Those patents have expired and we were able to obtain internal documentation from Bull, but only after a very long time.
<jhj>
waddlesplash: I'm going to just benchmark it on Linux as is first, before I start fixing things further.
<gordonjcp>
jhj: Data General were even worse, almost nothing exists of any media or hardware from them
<Skipp_OSX>
Which is not a problem for a couple of rogue engineers with spare hardware and little supervision, and that's my point.
<jhj>
GE -> Honeywell -> Honeywell/Bull -> Bull -> Atos, so thankfully there is clear ownership
<jhj>
Atos still makes 6000-series systems, but they only support NSA/VU and not the AU, so they can't run Multics.
<jhj>
Skipp_OSX: Multics and the 6000 AU were designed together, as a system, not an existing system and then added software.
<Skipp_OSX>
Central planning created the B-2 Bomber and the Space Shuttle, and Multics expensive technology with little mass-market appeal.
<dovsienko>
on this note, OpenVMS is a current product that even runs on x86 (not for free), and the easiest way to try RISC OS is by running it on a Raspberry Pi
<dovsienko>
the latter being a very nice example of what Haiku could do
<jhj>
Skipp_OSX: The intention was never to make it a mass market product, it was intended to be a computing utility service, that is, something similar to telephone service or water or sewage.
<jhj>
It was cloud computing before it was popular.
<Skipp_OSX>
Sure B-2 Bomber and the Space Shuttle and Multics have their place, but I'm happy that we live in a society that allows the freedom for engineers to work outside those areas, even if it is an accident of history.
<Skipp_OSX>
Unix was never intended to be a mass-market product either, but here we are.
<dovsienko>
jhj: what do CP-6 tapes look like physically?
<jhj>
I mean, there is efficiency in having a power grid and centralized power stations, vs. everyone doing their own generation. Even with solar and the like.
<Skipp_OSX>
My point is that more change happened by accident than on purpose, and that is a feature of capitalism that is lost in a centrally planned system.
<jhj>
dovsienko: For CP-6, it is actually possible that the boot tape would have been an image on a CD-ROM at the end of its life.
<Skipp_OSX>
It's more an accident of capitalism than a feature, a bug that the capitalists would fix if they could, but can't.
<jhj>
Source code browser was distributed on CD for example.
<dovsienko>
Skipp_OSX: I have worked for genuine capitalist businesses that wiped their bottoms with valuable accidents on a regular basis, and it seems to be the rule rather than an exemption
<Skipp_OSX>
well there you go
HaikuUser has joined #haiku
HaikuUser has quit []
<dovsienko>
jhj: I will try to remember next time I am next to old hardware
<jhj>
waddlesplash: So, I didn't test multiple CPUs yet, I'm going to go through the patches, but I'm going to benchmark it now on my machine, vs. existing master (Linux)
<waddlesplash>
very good
<waddlesplash>
I expect we will at least get that 3.5% improvement, and possibly a lot more depending
<jhj>
I'll do the full (profiled) release build of each and average 3 runs.
jmairboeck has quit [Quit: Konversation terminated!]
hightower2 has quit [Remote host closed the connection]
<Skipp_OSX>
I am so close to getting these menu fields to display right: https://0x0.st/XtWU.png it is now truncating at the right spot, but why is there no drop down arrow?
<jhj>
waddlesplash: almost done, last run.
Begasus_32 has quit [Quit: Vision[]: Gone to the dogs!]
<Begasus>
k, done for today
<Begasus>
cu peeps!
<waddlesplash>
Skipp_OSX: why doesn't the layout system handle this properly?
Begasus has quit [Quit: Vision[]: i've been blurred!]
<Skipp_OSX>
because we're not using it I suppose, these are regular menu fields
<waddlesplash>
?
<waddlesplash>
the Find panel itself uses layouts
<waddlesplash>
so how or why does the layout manager size things wrongly?
<Skipp_OSX>
yeah it does that's true, I'm using SetExplicitSize() idk
<waddlesplash>
you should not need to
<waddlesplash>
something else has gone wrong if we need SetExplicitSize
<waddlesplash>
try SetExplicitMinimumSize() with something small
<Skipp_OSX>
What is the proper way to limit a menu field width?
<waddlesplash>
it's possible the view just has a too large computed minimum size and we should force a smaller one
<waddlesplash>
... oh, ResizeMenuField() is calling SetExplicitSize already
<waddlesplash>
why does it do that?
<waddlesplash>
can we get rid of that entire function?
<Skipp_OSX>
to limit the width of the menu field
<waddlesplash>
tbh I would either kill this function entirely, or call SetExplicitMaxSize rather than SetExplicitSize
<Skipp_OSX>
We could, and it would display correctly, but the menu field would be much wider than it currently is, it would be wide enough to fit the longest string
<waddlesplash>
then use SetExplicitMaxSize rather than SetExplicitSize
<waddlesplash>
the layout manager will then be able to do the right thing and pick a smaller size
<Skipp_OSX>
but then the menu field is too narrow :/
<waddlesplash>
?
<Skipp_OSX>
and then SetExplicitPreferredSize but that doesn't work either
<waddlesplash>
why not?
<Skipp_OSX>
idk
<Skipp_OSX>
I tried that, same result if I set the preferred size...
<Skipp_OSX>
let me try just setting the max content width and not SetExplicitSize() maybe that will work...
<waddlesplash>
Skipp_OSX: looks like ShowOrHideMimeTypeMenu() may be interfering
deneel has joined #haiku
<Skipp_OSX>
I don't think it is... but maybe let's see
<waddlesplash>
yes, I think it is
<waddlesplash>
because ExplicitMinSize is interfering too
deneel has quit []
<jhj>
waddlesplash: On Linux/GCC/glibc without your patch I get 12.0895 MIPS, and with it I get 12.3870 MIPS. That's a 2.46081% increase.
<waddlesplash>
any difference with Linux/Clang/glibc?
<jhj>
waddlesplash: Let me check.
<waddlesplash>
Skipp_OSX: there is definitely some bug with SetExplicitMaxSize. I see what you are talking about, it's got way too much space around it in that case
<waddlesplash>
Something looks wrong there. Not sure what
<waddlesplash>
whether it's a layout bug or elsewhere, but definitely there's a bug
<Skipp_OSX>
yeah, it's annoying
<waddlesplash>
well, it may be worth debugging that
<Skipp_OSX>
well _BMCMenuBar_ (the class inside menu field that handles the internal menu bar) is handling the insets correctly, but menu item is not once you SetMaxContentWidth
<waddlesplash>
I think the menu field isn't handling explicitly set sizes properly?
<waddlesplash>
either way there is clearly some bug
<Skipp_OSX>
yeah... I've deep dived into code trying to fix but it is elusive
<waddlesplash>
step a debugger through the layout code?
<waddlesplash>
clearly the menufield thinks it is larger than it actually is
<waddlesplash>
and renders something larger than its actual size
<waddlesplash>
so, something is not checking a value somewhere
<waddlesplash>
jhj: if this was a 2.5% performance win by itself, I wouldn't be surprised if more wins are possible with a few more minor changes. Note that there are a few places where I didn't bother passing the cpup, those could be further refactored, but they looked like "cold" areas (fault handler etc.)
<jhj>
waddlesplash: Linux/Clang/glibc without patch: 10.8738, with patch: 12.1210, so +11.4698% increase. But it's still slower than the GCC build -2.14741%.
HaikuUser has joined #haiku
HaikuUser has quit []
Anarchos has quit [Quit: Vision[]: i've been blurred!]
mmu_man has quit [Ping timeout: 480 seconds]
<jhj>
waddlesplash: I bet this will hurt some things that are register starved, but thankfully there aren't many of those platforms.
<jhj>
i586 and such, but performance on 32-bit machines sucks already.
<jhj>
It's no fun emulating a 36-bit/72-bit machine on a 32-bit machine :)
<jhj>
Also for the 32-bit machines, we have to do our own 128-bit math, which is probably slower than what the compiler provides.
<waddlesplash>
doesn't clang at least support int128 on 32bit?
<jhj>
Oh, I'd have to check. We have a define (NEED_128) which uses our code instead of the compiler's, or when it isn't provided.
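For what it's worth, GCC and Clang predefine `__SIZEOF_INT128__` on targets where `__int128` exists (typically 64-bit targets), so the kind of selection jhj describes usually looks something like this (NEED_128 is the project's own define; the rest is an illustrative sketch, not the actual dps8m code):

```c
#include <stdint.h>

/* Pick between the compiler-provided 128-bit type and a hand-rolled fallback. */
#if defined(NEED_128) || !defined(__SIZEOF_INT128__)
typedef struct { uint64_t h, l; } uint128_s;   /* fallback representation */
#  define HAVE_NATIVE_INT128 0
#else
typedef unsigned __int128 uint128_s;
#  define HAVE_NATIVE_INT128 1
#endif
```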
<waddlesplash>
jhj: if Clang has that much of an improvement on Linux it probably at least has something of an improvement elsewhere
<jhj>
I didn't before, so if it does now it must be recent.
<waddlesplash>
but yes, this patch overall should reduce the compiler-dependent variations in code performance
<waddlesplash>
jhj: if you are using LTO then it may not matter much for register starved things, we will just spill onto the stack, which will then be a memory access same as it was before
<waddlesplash>
jhj: interesting that the Clang build is now faster than the GCC one was pre-patch
mmu_man has joined #haiku
<jhj>
waddlesplash: We use LTO everywhere we can, yes.
<jhj>
waddlesplash: We did some benchmarks of the same crappy Rpi ARM SBC in 32-bit vs. 64-bit mode, and it was about half the speed in 32-bit.
<jhj>
Which is sort of expected. :)
<dovsienko>
jhj: the only time I saw an IBM 360 was in a corner of a museum
<waddlesplash>
makes sense
<jhj>
dovsienko: There's a cool book about all the IBM 360's joining together to take over.
<jhj>
> told that it has taken over almost every computer in the US (somewhat dated with 20,000 mainframes with a total of 5,800 MB), and is now fully sentient and able to converse fluently in English.
<jhj>
:)
<phschafft>
jhj: was wondering about having some tape device. just for decoration. but wow are they expensive.
<jhj>
waddlesplash: Obviously we wrote the code first for single threaded and then changed it to be multithreaded long after.
<jhj>
The memory access code is pretty finely tuned, and was hard to get right, done with proper atomics and such.
<jhj>
waddlesplash: I don't know why Clang is generally lower performance everywhere else vs. GCC.
<Skipp_OSX>
"The content area is where the item label is drawn; it excludes the margin on the left where a check mark might be placed and the margin on the right where a shortcut character or a submenu symbol might appear."
<Skipp_OSX>
ok, so that answers that question, the menu item content width should not include padding even if you call SetMaxContentWidth
<jhj>
waddlesplash: thanks for looking at it, I had no idea that Haiku was loading *everything* as a shared object, which is what kind of threw me off and I didn't quite understand what you were talking about at first :)
<waddlesplash>
yep
<jhj>
I didn't even know such a thing was in the realm of possibility :)
<waddlesplash>
we do this for a number of reasons, but one of them is that any application can have replicants
<jhj>
waddlesplash: Yes, that was with LTO+PGO also.
<waddlesplash>
which need to be loaded as shared objects into Deskbar and Tracker
<waddlesplash>
so applications may need to be loaded as shared objects
<waddlesplash>
hence, everything's built as shared
<waddlesplash>
there are other reasons I think, too
<jhj>
It is an interesting choice, but I do have to wonder if there are other applications besides ours that have a performance hit.
<jhj>
and wonder if there is anything that can be done to work around it in a more general sense.
<jhj>
waddlesplash: for our released binaries that we build for Linux on our website, we actually build musl libc with LTO too.
<jhj>
For Linux, we can actually target 2.6.x (next release we are going to target 3.2.x though) by itself and ship fully static binaries, and we get quite a bit of performance improvement by being able to inline libc functions directly into our code.
<jhj>
Especially for an emulator in general.
<jhj>
It's about a 15% improvement.
<milek7>
what are libc functions doing so much in emulator?
<jhj>
We've had a couple people ask about why our binaries are faster than what they build, and we explain we build various toolchains with crosstool-ng with LTO.
<waddlesplash>
jhj: it may be less of one now
<waddlesplash>
libc functions should get out of the hot path now after not using _thread_local everywhere
<jhj>
milek7: Well, libm stuff more so.
<jhj>
Running numeric benchmarks in the simulator for example.
<waddlesplash>
function calls are pretty fast on x86_64
<waddlesplash>
however, without full LTO, the compiler can't know that those functions don't mess with global state
<waddlesplash>
so the performance improvement may have again been _thread_local caching
Manboy has quit [Ping timeout: 480 seconds]
<jhj>
Possibly, but for numeric code we did get a pretty decent improvement bringing in a static libm alone.
deneel has joined #haiku
<waddlesplash>
hm, ok
<jhj>
We used to have builds that used Julia's Openlibm code and static linked it.
<jhj>
But even a 15% or so improvement in the math code wasn't worth the hassle/complexity to the build system.
<jhj>
(for most users)
<jhj>
also there were some things where glibc or musl libm was faster than julia openlibm
<jhj>
waddlesplash: something I need to do is figure out how to run our code through cachegrind or what not on ARM and see what we can do to improve it that way.
<waddlesplash>
yes
<waddlesplash>
well, "perf record" is probably better
<jhj>
We have a special build that's not documented well (PERF_STRIP), that builds ONLY the CPU code alone.
<jhj>
No threading, no I/O, no SCU/IOM, no devices, just the CPU.
<jhj>
Which we have profiled the heck out of.
<jhj>
But it's too slow to do anything non-trivial in valgrind.
<jhj>
waddlesplash: In fact, Multics won't even boot on a system that is considerably slower than a GE-645, because of various timeouts that exist in the code and aren't wrong.
<waddlesplash>
again, "perf record"
<waddlesplash>
doesn't slow down things nearly as much because it's not an emulator
<waddlesplash>
alternatively, "Very Sleepy" for Windows builds is a very nice profiler
<jhj>
waddlesplash: problem is that for an optimized build, a lot of stuff is going to be inlined and constant propagated and cloned tho, so I'm not sure how much it is going to reflect what the code really looks like.
<jhj>
I guess I could just try it :)
<jhj>
I should create a somewhat less optimized build for this task, between our release and testing builds.
<waddlesplash>
perf record should work just fine on ReleaseWithDebugInfo builds
<jhj>
TESTING=1 does stuff like use GCC or Clang's trivial-auto-var-init to set a pattern for all memory, etc.
<jhj>
and it's quite slow because it's built with -fno-inline too
<jhj>
waddlesplash: I already have an issue open about it but, if you have any pull with this project, or with PulkoMandy etc, maybe we can get getconf into the system for R1b5?
gouchi has quit [Remote host closed the connection]
<jhj>
It's needed for POSIX conformance anyway!
<PulkoMandy>
It's available in the depot if you need it, I think?
<jhj>
PulkoMandy: It is, yes!
<PulkoMandy>
But yes, maybe it should be part of the development feature included in the nightlies
<jhj>
But Haiku nicely includes essentially everything to build most anything I've made in the default installation with the optional components installed, sans that.
<jhj>
It becomes a problem because I *extensively* use things like: "env PATH="$(command -p getconf PATH)" grep" or "env PATH="$(command -p getconf PATH)" sed" (and awk)
<jhj>
Because there are still horrible systems out there like Solaris.
<jhj>
Returns: "/usr/xpg4/bin/awk" and "/usr/bin/awk", for example.
<jhj>
And the default awk isn't fully POSIX conforming.
<jhj>
On Solaris (but thankfully mostly not on illumos systems like OpenIndiana), the default awk, grep, find, sed, and sort, in particular, aren't POSIX conforming.
<jhj>
So every makefile I write does something like "AWK?=env PATH=$(command -p getconf PATH) awk".
<jhj>
Or more often, something worse like 'AWK?=env PATH=$(shell command -v gawk 2> /dev/null || env PATH="$(command -p getconf PATH)" awk'
<jhj>
missed a parens there, but you get the idea.
<jhj>
PulkoMandy: I need this for at least Solaris, AIX, z/OS USS, and something else I forget offhand.
<jhj>
thus all that stuff fails to build out of the box on Haiku without installing getconf first :)
<waddlesplash>
why not switch to some other build system? :P
<jhj>
waddlesplash: GNU make (with POSIX shell) is essentially universal and available on every system we need to target.
<jhj>
where cmake or whatever isn't, and other ones require python or what not.
<milek7>
mutexes in advanceG7Faults take around 9%
<milek7>
isn't that somewhat high?
<jhj>
waddlesplash: I've given in a bit, on a couple projects I maintain, I do have cmake support along side GNU make.
<jhj>
milek7: Possibly, but the G7 faults code is complicated and called often and requires locking the SCU
<jhj>
the group 7 fault processing is also tied into the appending unit
<jhj>
the TRO (timer runout) is a 512 kHz timer, presented in a dedicated 27-bit register, and it raises a TRO G7 fault on every rollover
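(Illustration: a toy sketch of the rollover behavior described above, with hypothetical names; the real coupling to the appending unit and SCU is considerably more involved.)

    #include <stdbool.h>
    #include <stdint.h>

    #define TR_MASK 0x07FFFFFFu               /* 27-bit timer register */

    /* Decrement the timer by 'ticks' 512 kHz ticks; a wrap past zero is
       what would raise the timer-runout group 7 fault. */
    static bool tr_tick (uint32_t *tr, uint32_t ticks)
    {
      uint32_t old = *tr;
      *tr = (old - ticks) & TR_MASK;
      return ticks > old;                      /* rolled over */
    }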
<jhj>
milek7: I'll look at that in a moment. I'm still going through a bazillion functions to pass the cpup address through :)
juanjo has left #haiku [Error from remote client]
<jhj>
milek7: We have a full factory test and diagnostic tape that passes (and it took years to get it to pass) which I'll have to verify too once I make sure all the configurations build.
<jhj>
after making changes.
<jhj>
The factory diagnostics, which we don't have the source code to, say things like "CHECK CPU BOARD 7 BACKSIDE" or "CALL FOR FACTORY MODU 7 ENH ERATTA" which aren't always helpful to figure out what they want.
<jhj>
In some cases it was because the documentation we had was from a 1985 revision of the CPU and the diagnostics we have are from 1987, and some fixes had been applied.
<jhj>
The fun of old machines.
<Skipp_OSX>
ok progress I got ExplicitMaxSize instead of ExplicitSize to work but still no pop-up indicator
marzzbar has joined #haiku
<dovsienko>
jhj: what exactly is the problem that using getconf solves? because it will use the default PATH with non-POSIX binaries, as far as I understand
<milek7>
certainly looks very CISCy
<milek7>
what was real-world speed of these machines?
<dovsienko>
also, POSIX comes in several editions, so old Solarises are POSIX-compliant, to a degree, but to the earlier revisions, so you don't get $() and must use ``
<dovsienko>
my usual solution to that is to set optional environment variables such as MAKE_BIN=/path/to/gmake and LEX=/path/to/flex, then the script would default via ": ${LEX:=lex}" and so on
<dovsienko>
also to export PATH as required and/or to have symlinks in the PATH, but I barely remember ever using getconf
<dovsienko>
(I maintain a few scripts that have to work on Linux, BSD, Solaris, AIX and Haiku)
<dovsienko>
waddlesplash: I ran the debug build of Haiku for two or three weeks and the SSH-related KDL never occurred again. today I upgraded to the current nightly to be able to test libpcap before the release (and to use TMPFS)
<waddlesplash>
ok
<waddlesplash>
I guess we'll see what happens
<jhj>
dovsienko: "command -p" runs the POSIX path, which is different than the default path.
<jhj>
dovsienko: And the POSIX path has a different getconf binary. On Solaris, you actually have a couple different getconfs that you can use - XPG4 and XPG7 IIRC.
<waddlesplash>
jhj: btw, why have #ifdef LOCKLESS... etc. everywhere? why not just make the lock_...() functions inline no-ops when thats' defined?
<jhj>
I don't bother with "``" vs "$()" though, because dps8m (and everything else I do) targets POSIX.1-2008 and uses some of those new functions, and anything that has that conformance supports $()
<jhj>
waddlesplash: The goal is to eventually get rid of lockless entirely and completely.
<jhj>
Err, no_lockless that is.
<waddlesplash>
sure, but in the meantime you could clean up the code a lot by just having the functions be defined to nothing and then removing the #ifdefs from everywhere
<jhj>
There are some fundamental differences that would still need an ifdef where the non-multithreading paths take shortcuts knowing there will never be more than 1 CPU.
<jhj>
I can't remember exactly where. Probably not too many places anymore, but I'm sure a few exist still.
<waddlesplash>
yeah, but they are probably small
<waddlesplash>
compared to how many ifdefs there are
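(Illustration: a minimal sketch of the cleanup being suggested, with made-up names; the lock wrappers are defined once and compile to nothing in the single-CPU build, so call sites need no #ifdefs.)

    #include <pthread.h>

    #ifdef NO_LOCKLESS                          /* multithreaded build */
    static pthread_mutex_t scu_lock = PTHREAD_MUTEX_INITIALIZER;
    static inline void lock_scu (void)   { pthread_mutex_lock (&scu_lock); }
    static inline void unlock_scu (void) { pthread_mutex_unlock (&scu_lock); }
    #else                                       /* single-CPU build: no-ops */
    static inline void lock_scu (void)   { }
    static inline void unlock_scu (void) { }
    #endif

    /* Call sites stay ifdef-free either way: */
    static void touch_scu (void)
    {
      lock_scu ();
      /* ... access shared SCU state ... */
      unlock_scu ();
    }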
<waddlesplash>
milek7: doesn't that need to be &&?
<waddlesplash>
but yes, looks like another significant performance win
<milek7>
yes
<waddlesplash>
jhj: CMake build systems are very nice for the reason that incremental builds and IDE integration "just work"
<waddlesplash>
maybe you need Makefiles for some more obscure system types where CMake isn't supported but I think that's rare
<waddlesplash>
I mean, AIX has CMake even
<dovsienko>
waddlesplash: if you try building software on bare NetBSD, your assumption will likely change
<waddlesplash>
oh?
<dovsienko>
it is possible to have CMake, git, Perl, Python, Ruby, Rust etc., but one has to compile these first
<dovsienko>
on cfarm.net AIX hosts CMake is a "freeware" package, for example, and tcpdump/libpcap have CMake problems there
<dovsienko>
similar story on OpenCSW Solaris
<dovsienko>
(of course, one can install the required build dependencies as binary packages if someone else compiled these first)
<dovsienko>
anyway, it is getting late. TTYL!
<jhj>
gnite
<jhj>
AIX provides cmake in the IBM provided toolkit now, but it's not part of bos.
<jhj>
Anyway, I'm not too concerned with a new build system. If you look at https://dps8m.gitlab.io/dps8m/Releases/, we cross-compile builds for 14 operating systems and 21 different platform architectures with what we got.
<Skipp_OSX>
no, nvm, still broken I still don't get it
<jhj>
Making a build go 300% faster might be great, but we're talking about saving 20 seconds total, from 30s to 10s.
<waddlesplash>
it would probably be more than that
<jhj>
waddlesplash: Also I didn't mention it, but if you go into the src/dps8m directory and build from there, you do get incremental builds without most of the checks that the main build performs
<waddlesplash>
I did discover that
<jhj>
That doesn't recompile anything in the other src directories, which change much less often.
<waddlesplash>
an incremental build with nothing changed still takes 25 seconds on Haiku
<waddlesplash>
most of that before the LD step
<waddlesplash>
it appears to spend most of its time spawning hundreds of shell processes
<jhj>
waddlesplash: Yeah, it's all the shell process stuff it does.
<jhj>
Much of it is because we embed the full command line as a definition in every compiled file, although not every file uses it.
<jhj>
Also, we scan all the system headers
<waddlesplash>
sounds expensive
<jhj>
And then we scan all of the source files and keep track of which definitions are set and used, for every file.
<waddlesplash>
surely there's some way to speed that up
<jhj>
We could do that just for some of the main files.
<jhj>
waddlesplash: If you run "dps8 -t
<waddlesplash>
can't you separate into a configure + make stage like most things?
<jhj>
waddlesplash: If you run "dps8 -t" (the -t just avoids creating a rather large state file you don't want or need) and run "SHOW BUILDINFO" you can see some of this information that we save.
<waddlesplash>
sure, but there's better ways to cache this, I'm sure
<waddlesplash>
anyway after adding the check milek7 suggested above, I get 6.433 MIPS
<waddlesplash>
earlier I got 5.4255 MIPS
<jhj>
waddlesplash: There are better ways, though there aren't better ways that are equally as portable and don't need anything beyond what POSIX bare minimums necessitate.
<waddlesplash>
idk, I feel like there's ways to strip this down more. but not my project :)
_-Caleb-_ has left #haiku [#haiku]
<jhj>
waddlesplash: We also have nearly 20 years of legacy in here, and we're very conservative when it comes to changing things like build systems that might break something.
qwebirc42001 has joined #haiku
<waddlesplash>
it makes sense to a degree
<jhj>
Like I forget exactly what it was, but we changed something and some random person complained they were having problems building on some obscure discontinued handheld game system :)
<waddlesplash>
maybe use CMake for a standard build and then have makefiles for a "reduced" build with some of these things not available
<waddlesplash>
and thus simpler makefiles and less to maintain
<jhj>
waddlesplash: Also, because we want to support easy profiling builds, we want to be able to natively build on all these systems, not just cross-compilation.
<waddlesplash>
sure
<jhj>
And some of those, like those crappy game consoles, don't come with C++ toolchains, just C.
qwebirc42001 has quit []
<x512[m]>
Is CMake really significantly better than Makefile?
<waddlesplash>
x512[m]: it is better than these makefiles :P
<jhj>
x512[m]: It can be significantly faster.
<x512[m]>
Meson is better.
<jhj>
If you have to make a TON of checks, it is especially faster because cmake has modules where all those things are done in compiled code instead of individual checks run as autoconf-style shell snippets.
<jhj>
Essentially our build system does a lot of what autoconf does each time you run make, behind the scenes.
<jhj>
So yeah, cmake can be significantly faster.
<jhj>
https://github.com/johnsonjh/duma is a project I maintain now, where we added cmake support. I'm sure the cmake support is a bit buggy.
<arraybolt3>
and Autoconf spits out shell scripts that look like utterances from a Balrog
<jhj>
The makefiles are gross and date back to the 90s.
<x512[m]>
Pure Makefiles are better than Craptools.
<jhj>
Eventually for this project, I plan to do something similar and offer cmake as a new option, which won't replace our GNU makefile based system, but work along side it in case it doesn't work.
<waddlesplash>
that sounds nice
<jhj>
But I don't like the idea of making more work for myself maintaining two build systems.
<waddlesplash>
would be neat to step through some of this in an IDE
<jhj>
And the biggest problem is that I don't really know cmake too well.
<waddlesplash>
I know cmake well enough for something like this. but it doesn't sound like a fun project :P
<jhj>
waddlesplash: If you look at https://github.com/aremmell/libsir/ which is a project I am a co-author on, we have a GNU make based system, and our makefiles are absolutely insane.
<jhj>
waddlesplash: But they have zero legacy garbage and they're very fast.
<jhj>
So, it can be done. :)
<waddlesplash>
oh, I know it can be done
<waddlesplash>
but it's not easy
<jhj>
DPS8M uses libsir (right now, not for much, but the next version will extensively), so the set of libsir-supported systems and compilers is a superset of what dps8 works on.
<jhj>
waddlesplash: libsir also supports msbuild. :/
<jhj>
Also, with libsir, we've extensively performance optimized and cached it as we write it, and have scripts that track performance regressions and such.
<jhj>
A large portion of it is formally verified and proven with ESBMC and CBMC, etc.
<jhj>
There's a lot you can do when you are writing from scratch
<waddlesplash>
jhj: here's the change milek7 was talking about in patch form
<waddlesplash>
around +15% or so
<waddlesplash>
at least for me on Haiku
<jhj>
waddlesplash: I need to stop chatting here, I'm still going through every use of cpup :)
<waddlesplash>
OK, very good
<jhj>
We have a full test suite that takes about 45 minutes to run.
<waddlesplash>
jhj: the above patch will probably be more of a performance improvement on systems that don't have futex-like APIs internally for their pthreads to use
<waddlesplash>
as on those systems a lock is always a syscall
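(Illustration: the general shape of a lock-avoiding change like the one discussed here, with made-up names rather than the actual patch; a cheap atomic check skips the mutex entirely when nothing is pending, which matters most where even an uncontended lock costs a syscall.)

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    static pthread_mutex_t g7_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic bool g7_pending;             /* set by whoever posts a fault */

    static void advance_g7_faults_sketch (void)
    {
      /* Cheap early-out: skip the lock entirely when nothing is pending. */
      if (!atomic_load_explicit (&g7_pending, memory_order_acquire))
        return;

      pthread_mutex_lock (&g7_lock);
      /* ... re-check and process pending group 7 faults under the lock ... */
      atomic_store_explicit (&g7_pending, false, memory_order_release);
      pthread_mutex_unlock (&g7_lock);
    }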
<jhj>
It builds a 5-CPU system from scratch, runs diagnostics, builds and runs a program in every single language we have on Multics (APL, COBOL, PL/I, C, BCPL, FORTRAN, and even more obscure ones), simulates activity with multiple users, etc.
<jhj>
So the real trick is to make sure that works the same after any modifications :)
<waddlesplash>
well, might as well add the above patch before the test, it may speed it up more
<jhj>
Yeah, let me fire up a run of our CI with just that patch applied, I'll see if it breaks anything first :)
<jhj>
I can do that while I do other stuff.
<jhj>
waddlesplash: LTO is a big win for us, we used to have *really* gross scripts that would amalgamate all of the source code into a single file and build it, which required massive memory and CPU and often failed obscurely :)
<waddlesplash>
I can imagine
<jhj>
Like I said we stole that from something webkit did back in the day.
<waddlesplash>
WebKit still does that actually
<jhj>
sqlite still does that, but they do it more for making sqlite easier to embed than for performance, but it still has the same effect.
<waddlesplash>
probably differently than it used to though
<jhj>
waddlesplash: The Portland Group compiler (which is now free as NVIDIA HPC SDK C/C++ Compiler) doesn't support LTO, but a build using it is faster than a Clang non-LTO build.
<jhj>
We support that because of the NVIDIA Nsight tooling and it'll ease eventually doing GPU offload for some functionality if we ever go that way.
<jhj>
But some other guys on the project (not me!) have since taken a different approach and instead of doing GPU or FPGA offload are doing a whole CPU in FPGA instead.
<waddlesplash>
that does make sense
<jhj>
DPS stands for Distributed Processing System. For example, the IO controllers are actually separate 18-bit minicomputers.
<jhj>
We don't run their firmware directly, instead, we reimplemented what they do in C for speed.
<waddlesplash>
yeah, high-level emulation
<jhj>
But we do have an emulator that can run the original firmware/OS for them.
<jhj>
Those systems are essentially half a DPS-8 main CPU, 18-bit/36-bit where the main system is 36-bit/72-bit.
<jhj>
The architecture is otherwise nearly the same.
<jhj>
waddlesplash: On the real hardware, everything is wired together on a bus called DIA, I forget how fast it was... but the original systems didn't gain much of anything in speed going beyond 3 CPUs, and even the 4th CPU wasn't that big of an improvement to a loaded system.
<jhj>
We can run 6 CPUs just fine because our bus is infinitely fast. :)
<waddlesplash>
OK, so in a Linux VM here, I get: 9.18 MIPS with master branch, 10.15 MIPS with the cpu_state_t patch, and 10.95 MIPS with the lock avoiding patch