refi64 has quit [Remote host closed the connection]
refi64 has joined #asahi-dev
phire_ has joined #asahi-dev
phire is now known as Guest12
phire_ is now known as phire
Guest12 has quit [Ping timeout: 480 seconds]
axboe has joined #asahi-dev
<axboe>
kettenis: seriously? like 90s style?
<kettenis>
totally
<axboe>
wow
<axboe>
well I guess any kind of fs perf numbers on openbsd needs to come with that caveat ;)
<kettenis>
I don't think i've ever lost a filesystem though
<axboe>
I mean, all's well unless it crashes or you lose power
<axboe>
I used to use openbsd for a pppoatm router, when that was my inet connection
<milek7_>
are modern ssd really providing any guarantees on power loss? with opaque FTLs and hundreds of megabytes of cache
<axboe>
that's what the flush cache command is for
<axboe>
if previously acked writes aren't power loss stable after that, then the hw is defective
yuyichao has quit [Ping timeout: 480 seconds]
<alyssa>
defective hardware? impossible.
<axboe>
heh I know
<axboe>
but it's one of the 4 core primitives, really should get that wrong for any half-way decent device and above
<axboe>
I'm sure there's tons of shitty ones that just nop it :/
<axboe>
s/wrong/right, obviously
yuyichao has joined #asahi-dev
rkjnsn has quit [Quit: Reconnecting]
rkjnsn has joined #asahi-dev
PhilippvK has joined #asahi-dev
phiologe has quit [Ping timeout: 480 seconds]
axboe has quit [Quit: leaving]
kov has quit [Quit: Coyote finally caught me]
kov has joined #asahi-dev
<marcan>
< sven> we can always file a radar with apple and have it disappear and get no feedback for years! ;)
<marcan>
the question is what does macOS do
<marcan>
if it violates barrier guarantees in a way that can cause proper corruption when you yank the plug on the mac mini, we repro it then file a bug then I blog about it so it hits hacker news ;)
<marcan>
then apple listens :D
<marcan>
milek7_: yes, some SSDs (even some consumer lines) have enough capacitors to flush the cache on plug pulls
<marcan>
it's not that hard with how fast SSDs are
<marcan>
though these days they've mostly given up on that, but Micron/Crucial ones used to be like that
<marcan>
and enterprise lines certainly like to advertise that feature
<marcan>
I think at the lower end they only flush enough to make sure the FTL is consistent, but not necessarily all the data
<marcan>
the consumer line has enough caps to avoid corruption on power yank; the enterprise line has enough to save the entire cache
XeR has quit [Ping timeout: 480 seconds]
<VinDuv>
I may have missed some part of the discussion, but on macOS fsync does not flush the disk cache; it only writes the data to disk. You have to use fcntl(fd, F_FULLFSYNC) to flush the disk cache.
<VinDuv>
I thought fsync on Linux worked similarly but maybe not?
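(For reference, a minimal Python sketch of the F_FULLFSYNC call VinDuv describes; the file name is a placeholder, and fcntl.F_FULLFSYNC is only defined on macOS:)

    import fcntl, os

    fd = os.open("testfile", os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"very important data\n")
    os.fsync(fd)                          # macOS: pushes data to the drive, but does not flush the drive's cache
    fcntl.fcntl(fd, fcntl.F_FULLFSYNC)    # Darwin-only: ask the drive to flush its volatile cache to stable media
    os.close(fd)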
<marcan>
if that's how it works, maybe fio should be changed on macos so it stops lying... :p
<rkjnsn>
Sounds like it matches how kettenis was describing OpenBSD as working, then, which I suppose is not terribly surprising.
<marcan>
ok, can confirm macOS absolutely loses data if you do a plain fsync
<marcan>
simple python script writing to a file (using raw os. calls) and doing fsync() on it, then sleeping
<marcan>
I wait 5 seconds after the write/sync, then reboot via USB-PD command
<marcan>
write is gone
<marcan>
in fact I even managed to trigger an inconsistency in an Apple app by accident: I had GarageBand open, and closed it before the test, not saving data. on reboot, it tried to restore the now-nonexistent file, and threw up an error about corruption/invalid project
<marcan>
so I guess it deleted the "unsaved" project but didn't commit the "last open" state to restore on boot
<marcan>
F_FULLFSYNC indeed works
<marcan>
and indeed with a dumb repeated write test I get ~46 IOPS with that, vs. 40000 with plain fsync
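(A hedged sketch of the kind of dumb repeated-write test described here; the path, write size and iteration count are made up, and it only runs on macOS because of fcntl.F_FULLFSYNC:)

    import fcntl, os, time

    def timed_writes(path, full_fsync, iters=200):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        start = time.monotonic()
        for _ in range(iters):
            os.pwrite(fd, b"x" * 4096, 0)
            os.fsync(fd)                            # plain fsync: data reaches the drive, not necessarily stable media
            if full_fsync:
                fcntl.fcntl(fd, fcntl.F_FULLFSYNC)  # additionally force the drive cache to be flushed
        ops = iters / (time.monotonic() - start)
        os.close(fd)
        return ops                                  # rough ops/s ("IOPS")

    print("plain fsync :", timed_writes("bench.dat", False))
    print("F_FULLFSYNC :", timed_writes("bench.dat", True))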
<sven>
hah
<marcan>
ooookay then, time to make some noise on twitter
<sven>
at least that explains all the weird behavior
<marcan>
yup
<ar>
so, fsync without f_fullfsync is basically a no-op on macos?
<marcan>
it's like fsync on Linux with the write cache set to write-through
<marcan>
(fake set)
<marcan>
so it's not a no-op but it's not good enough
<marcan>
given the default dirty writeback on Linux is 5 seconds and macOS is losing more than 5 seconds worth of data with this, it *effectively* is like doing nothing on Linux, modulo writeback pressure
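(The "write cache hack" mentioned below presumably means overriding the kernel's idea of the drive's cache type via sysfs so the block layer stops issuing flushes; a minimal sketch, assuming Linux, root privileges, and a placeholder device name — doing this deliberately makes fsync() unsafe across power loss, much like the macOS behaviour being discussed:)

    dev = "sda"  # placeholder device name
    with open(f"/sys/block/{dev}/queue/write_cache", "w") as f:
        f.write("write through")   # kernel stops sending flushes; writing "write back" restores the default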
<marcan>
so on a Samsung SSD 860 EVO mSATA on my laptop, I get 10K IOPS with the same write cache hack, and ~330 without (and fsync)
<marcan>
so even this mSATA SSD does better than Apple's NVMe
<marcan>
sigh
<marcan>
on my iMac with a WD SSD, I get more like 2000 IOPS with proper flushes
<marcan>
and 20000 without
the_lanetly_052__ has joined #asahi-dev
<rkjnsn>
Do we know how much of an issue it would be outside of apt and synthetic benchmarks? I'm curious if it would make sense just to make apt less aggressive with its syncs, given that it can still be a significant slowdown (albeit less of one) on other drives. Presumably most software isn't trying to sync more than 50 times a second…
<marcan>
it's a problem for e.g. databases
<rkjnsn>
Ah.
<rkjnsn>
Out of curiosity (since I don't know too much about this topic), if folks do want to ignore/postpone flushes on Linux, does the SSD support any other kind of ordering operation that could ensure data written after a flush doesn't hit the disk before data written prior to it?
<rkjnsn>
I know it could still eat data that was supposed to be safely stored, but it'd be nice to avoid broken internal consistency.
<VinDuv>
I’m pretty sure there isn’t a way to properly order the writes since basically the whole chain is allowed to reorder them
<marcan>
I'm not sure if NVMe itself provides barriers; axboe might know
<marcan>
and this might be a good reason to try to rescue that kind of feature...
<Glanzmann>
marcan: axboe said yesterday that there hasn't been any such thing as 'barriers' for 15 years. And that it is impossible due to multiqueue support IIUC.
<rkjnsn>
I think they were talking about barriers in Linux, as opposed to barriers in the drive, though?
<rkjnsn>
Even without support for write ordering in the kernel, it seems like (IIUC) translating fsyncs to write barriers (assuming the drive supports them) rather than flushes could at least help avoid corruption from the drive reordering things, even if it doesn't protect against writes that were supposedly committed being lost.
<sven>
it's not impossible to do with multiqueues. it's just very challenging.
<sven>
I don't remember seeing anything about write barriers in the nvme spec but I only read that one for the first time a few weeks ago
<marcan>
and yeah, translating fsync()s to write barriers would likely avoid corruption, if the drive can do that
<sven>
there's force unit access for write commands but that would probably only order writes where it's set (if it enforces the ordering at all)
<_jannau_>
marcan: I don't think dcp in its current state is ready to be integrated in asahi. preserved regions need coordinated m1n1/kernel changes with a not-fully-agreed-on dt-binding. My current code to remap in the locked dcp dart apparently only works when iboot doesn't initialize dcp (it works only on the mac mini)
<marcan>
ack, I'll defer then. I was on the fence about that
<marcan>
would it be fair to say that if I make a release within this month I should keep simpledrmfb for now?
<_jannau_>
it might be enough to transition dcp to hibernate in m1n1
<_jannau_>
yes, I think simpledrm should be preferred unless someone has time to work on dcp before that
<_jannau_>
I don't think there is a huge amount of work to be done to make dcp useful but I should concentrate on submitting spi-hid
MajorBiscuit has joined #asahi-dev
<marcan>
repro'd on the Mac Mini pulling the plug; there is definitely no (working) last-gasp mechanism
c10l4 has joined #asahi-dev
c10l has quit [Ping timeout: 480 seconds]
<ar>
interestingly, https://sqlite.org/atomiccommit.html#sect_9_2 > Setting fullfsync on a Mac will guarantee that data really does get pushed out to the disk platter on a flush. But the implementation of fullfsync involves resetting the disk controller. And so not only is it profoundly slow, it also slows down other unrelated disk I/O. So its use is not recommended.
<marcan>
well, this is #10 on hacker news now, so... who knows, maybe Apple will fix it :-)
<ar>
so it's not a new problem
<marcan>
ar: indeed, the database folks knew about this for a while
<marcan>
it's just amazingly nonobvious for the rest of us
<ar>
postgres also seems to have some references to `fcntl(fd, F_FULLFSYNC, 0)`
<dottedmag>
Is there a way to quickly detect power loss on mini, to cobble last-gasp signal purely in sw?
<marcan>
very good question. no idea.
<marcan>
though I imagine if there were Apple would be using it?
<marcan>
let me look at the schematics for a bit...
<marcan>
ah wait, I don't have the Mini ones, derp
<marcan>
though there might still be some info
<marcan>
dottedmag: not seeing anything in SMC, quick look at the MBA schematic (they often have hints as to other variants) doesn't show any place that signal would go
<marcan>
I suspect there is no such mechanism
<dottedmag>
Alas. I was thinking more about some kind of side effect that betrays the power loss.
<marcan>
dottedmag: polling the main primary rail voltage: it updates once a second, but I caught it updating immediately before shutdown after pulling the plug and it wasn't drooping
<marcan>
so I suspect the PSU just keeps it up then immediately kills power...
<marcan>
wonder if there's some PGOOD thing that could still work...
<dottedmag>
OK, so no easy way out.
<marcan>
hard to say without the real schematic
tanty has joined #asahi-dev
<marcan>
well this is interesting
<marcan>
while running the python loop, python says 46 ops/s, powermetrics reports 915 disk ops per second / 4.7MB/s (seems a bit much? APFS write amplification? and 20MB/s of ANS2 memory bandwidth in both directions)
<marcan>
doing the same thing, but with an artificial delay to make it run at the same speed sans the flush, I get 44 disk ops/s / 180 KB/s, and ~200 KB/s of ANS2 memory bandwidth
<marcan>
so that sounds like two problems here... FULLFSYNC does some horrible APFS amplification nonsense, *and* ANS2 blows it up even more
<marcan>
let me try another filesystem...
<marcan>
on FAT32 it's actually slower (34 IOPS), but no serious amplification: 78.37 ops/s 321.02 KBytes/s
<marcan>
and:
<marcan>
ANS2 RD : 3.042 MB/s
<marcan>
ANS2 WR : 8.836 MB/s
<marcan>
so yeah, ANS2 is definitely doing something somewhat dodgy if it's doing 9MB/s of memory write traffic and 3MB/s of memory read traffic to serve 320KB/s of data traffic
<marcan>
same test without the full fsync: write: 93.52 ops/s 397.77 KBytes/s
<marcan>
ANS2 DCS RD : 0.244 MB/s
<marcan>
ANS2 DCS WR : 0.411 MB/s
<marcan>
that's more like it
<marcan>
so yeah, that NVMe sync command is making ANS2 do a lot of work...
<marcan>
I wonder if it, like, linearly scans some huge cache hashtable?
<Dcow[m]>
does the amount of work depend on the storage size?
XeR has joined #asahi-dev
refi64 has quit [Read error: Connection reset by peer]
refi64 has joined #asahi-dev
<sven>
so that just leaves that weird issue where we sometimes miss an interrupt now
<rkjnsn>
I saw a reference to a F_BARRIERFSYNC. Not sure if that's a macOS thing or just an iOS thing, but if it's available on macOS, it might be worth seeing if it works as advertised and results in less performance loss.
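(If someone wants to probe that, a hedged sketch: Python's fcntl module doesn't export F_BARRIERFSYNC, so the raw value from Apple's <sys/fcntl.h> is used here — treat that constant, and the file name, as assumptions to verify against the SDK headers:)

    import fcntl, os

    F_BARRIERFSYNC = 85   # assumed value from Apple's <sys/fcntl.h>; verify before relying on it

    fd = os.open("testfile", os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"ordered write\n")
    try:
        fcntl.fcntl(fd, F_BARRIERFSYNC)   # request an I/O barrier instead of a full cache flush
    except OSError as err:
        print("F_BARRIERFSYNC not supported here:", err)
    os.close(fd)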
<maz>
sven: contrary to what I said, the AIC doesn't seem to have a configuration for edge/level, and "knows" which line in which. and the fasteoi flow doesn't distinguish them either.
<maz>
is*
<sven>
yeah, that's what I thought after spending some time with the interrupt code yesterday as well
<kettenis>
does the NVMe core code implement different code paths for MSI and non-MSI?
<maz>
kettenis: if it does, the latter probably isn't very well tested...
<kettenis>
sven: so on OpenBSD I still set the "number of openings" to 1 to keep the NVMe from locking up
<kettenis>
which effectively means command submission is serialized
<kettenis>
the issue I ran into sounds somewhat similar to what you're seeing
<kettenis>
at some point the completion interrupt for a command never happens
<kettenis>
although in my case the command actually didn't complete as far as I could tell
<kettenis>
not optimal, but still plenty fast and it has been rock solid since I added that hack
<sven>
that sounds very similar actually. some commands never generate the completion interrupt but they do appear in the completion queue and can be polled, but others don't ever seem to be triggered
<sven>
what exactly does "number of openings" do?
<sven>
macos uses INTMS and INTMC in its interrupt handler apparently, which linux doesn't because it relies on the interrupt controller itself
<kettenis>
I think "number of openings" is classic SCSI terminology for the number of commands that can be in flight simultaniously
MajorBiscuit has quit [Ping timeout: 480 seconds]
gladiac has quit [Quit: k thx bye]
gladiac has joined #asahi-dev
MajorBiscuit has joined #asahi-dev
Gaspare has joined #asahi-dev
<marcan>
for shits and giggles (and because HN commenters are tiring): full fsync on a shitty USB3 flash drive on macOS/M1: 223 IOPS. internal NVMe, 58 IOPS. Both FAT32.
<alyssa>
marcan: *blinks*
<marcan>
(of course, the flash drive has no cache at all, so it's equally slow with a vanilla fsync :p)
<marcan>
alyssa: yes, you get better database transaction latencies on a shitty flash drive than on internal NVMe on these machines :-)
<alyssa>
marcan: and if you wear out the flash drive you're not fsck()'d :V
<marcan>
:p
<phire>
wtf?
<phire>
did they just not try for decent fsync performance at all?
<marcan>
I suspect nobody noticed
<marcan>
since it doesn't matter on iOS
<sven>
aaand now I can't reproduce the missing interrupts at all anymore with the original code *sigh*
<povik>
filing bugs with Apple by getting on HN is an interesting method
<sven>
every team needs someone to do the mediawhoring ;)
<marcan>
the only reason I haven't deleted twitter is those 50k followers are *sometimes* useful :p
<as400[m]>
sven: are you suggesting that marcan is a new apple spokesperson?
<alyssa>
marcan: unfollowing everyone and setting all privacy settings to "following only" has made my twitter quiet! :-p
<alyssa>
`-$ echo "very important data" > file.txt` <-- it's right here
<sven>
:D
<j`ey>
marcan: lul nice
<marcan>
that said, I'm not even entirely sure if rsync tries to do a normal fsync() here
<marcan>
so it might just be unsafe on any system :p
<marcan>
but fsync() certainly wouldn't save macOS
yuyichao has joined #asahi-dev
hays has quit []
<milek7_>
rsync doesn't do fsync at all
hays has joined #asahi-dev
hays has quit []
hays has joined #asahi-dev
axboe has joined #asahi-dev
<axboe>
I already had 3 separate people send me marcan's flush rant this morning ;)
<axboe>
good to see some noise on this topic
Gaspare has quit [Ping timeout: 480 seconds]
<Jamie[m]1>
is radar priority decided directly based on HN rank, or is twitter likes the bigger factor? :P
<sven>
axboe: remind me again, is taking the anv->lock around the writel itself enough to prevent those timeouts or did the memcpy also have to be inside the lock?
<sven>
i can't seem to reproduce the issue at all even without the lock right now :/
<axboe>
sven: just around writel is enough, that's what I've been running and it's been rock solid
<axboe>
sven: reproduces trivially for me, was a bit harder with issues serialized, but could still hit it. consistently just doing a make -j8 kernel compile on it
<axboe>
would always hang
<axboe>
I'm running the three nvme patches now and it's all dandy
<axboe>
tcb clear cleanup, lock around writel, and the flush deferral
<sven>
okay, so from looking at the hypervisor logs macos seems to never interleave "ring cq doorbell + nvmmu invalidation" with "write new tag to sq"
<axboe>
ok
<sven>
so maybe if something like "start nvmmu invalidation, start new command, write cq doorbell" happens things break. no idea why though.
<axboe>
so that won't happen with that patch either then
<axboe>
as nvmmu invalidation is already under the anv lock
<sven>
yeah. i'm just trying to understand why we need that lock around the writel
<axboe>
I generally didn't feel comfortable without that lock to begin with, only part I'm a bit puzzled on is why the cpu freq changes makes this so much more likely to trigger
<axboe>
it'd be better if we didn't need to share this lock (particularly since it's irq disabling on the submit path), but at the rates of these drives, and given that it only has a single queue anyway, it's not going to be an iops monster and hence it doesn't really matter
<axboe>
so while that could be improved, it doesn't matter in practice imho
<sven>
makes sense
<sven>
if it's actually a race between nvmmu/cq_db and that writel maybe the higher cpufreq just makes it more likely to lose (or win, depending how we look at it :D) that one
<axboe>
definitely some correlation there, just not sure what!
<axboe>
I ran quite a bit with the cpus at max freq before that
<axboe>
and didn't see it
<axboe>
maybe differences in clock between simultaneous issuers? though I don't see how...
<sven>
quite a few people have been using it on the max/pro before cpufreq as well and never saw it either.
<sven>
yeah, i just hate when a mystery lock solves something that smells like some kind of race
<axboe>
yeah I know
<axboe>
your osx tracing sounds useful though
<axboe>
if it is indeed inval vs doorbell write
<axboe>
if only there was documentation ;)
<sven>
yup :D
<axboe>
I think we just need a good comment around why we _think_ it's needed
<axboe>
in the commit message too, but more importantly in the code
<sven>
yeah, absolutely
<axboe>
irq fix is queued up btw, but I guess you saw that already
<sven>
yup, that was pretty quick :)
<axboe>
one less to worry about :)
<axboe>
oh forgot, also running that "set aq depth to 2" patch
<axboe>
sven: though for the batching, we could optimize the nvmmu_inval by writing each tag, then one readl_relaxed at the end...
<sven>
macOS does a readl_relaxed after each inval fwiw. I’m not sure if that’s required ofc.
<sven>
could be worth a try
<axboe>
probably not worthwhile to pursue, but something to keep in mind
<axboe>
I've got this funky series for io_uring that allows registered buffers to retain dma and iommu mappings, instead of doing them for each io submit and complete
<alyssa>
"if only there was documentation ;)"
<alyssa>
mood
<axboe>
it'd bump the peak 17xK rand iops to substantially more
<sven>
nice!
<axboe>
was planning on adding support for apple nvme just to test out what it can actually do, if I do that, then I can try the nvmmu inval optimization too as it'd likely make a difference at that point
<axboe>
was part of the "lets see how many iops we can do on a core" experiments, 1 of 2 series that hasn't been posted anywhere yet as it's a bit of a hack
<sven>
I saw part of those optimization on twitter. Very impressive how much it improved
<axboe>
it was a fun project
<axboe>
there's no way to test apple-nvme in qemu yet, is there?
<sven>
nope
<axboe>
ah screw it, we'll do it live
<axboe>
let's find out
<sven>
:D
<sven>
what could go wrong ;)
<axboe>
right?
<axboe>
done, let's see if it works...
axboe has quit [Quit: reboot]
the_lanetly_052___ has joined #asahi-dev
skipwich has quit [Quit: DISCONNECT]
skipwich has joined #asahi-dev
<sven>
uh oh
the_lanetly_052__ has quit [Ping timeout: 480 seconds]
axboe has joined #asahi-dev
<axboe>
it worked
<sven>
nice!
<axboe>
not sure why I seem to be hitting the segments != 1 path in apple_nvme_map_data() though
<axboe>
let's debug...
axboe has quit [Quit: Lost terminal]
axboe has joined #asahi-dev
<axboe>
oh, it's the 16k page size
<axboe>
I should align my buffers better and not assume 4k page sizes :)
<j`ey>
:-)
<axboe>
and then apple-nvme needs to support ->queue_rqs too
bpye3 has joined #asahi-dev
al3xtjames2 has joined #asahi-dev
skipwich has quit [Ping timeout: 480 seconds]
<alyssa>
sven: so uh what does this mean for apple-nvme upstreaming?
al3xtjames has quit [Quit: Ping timeout (120 seconds)]
<alyssa>
i guess we're still super blocked on rtk?
al3xtjames2 is now known as al3xtjames
skipwich has joined #asahi-dev
bpye has quit [Ping timeout: 480 seconds]
bpye3 is now known as bpye
<axboe>
sven: ok queue_rqs is hard to support, since the sq doorbell is writing the specific tag...
<axboe>
oh well, guess we can't easily do that
<axboe>
but with the changes, we get around the same perf, but at about half the CPU usage
<axboe>
so guessing we're actually controller limited on iops anyway around that point
<rkjnsn>
marcan, would it be easy to test F_BARRIERFSYNC using your python script to see what it does and how it performs on these machines?
Major_Biscuit has quit [Ping timeout: 480 seconds]
axboe has quit [Quit: Lost terminal]
the_lanetly_052___ has quit [Ping timeout: 480 seconds]
<sven>
axboe: half the cpu usage already sounds great :)
<sven>
alyssa: yeah, I need to take a look at what marcan did to rtkit and we need to discuss how we will upstream that together with smc
<alyssa>
sven: right
axboe has joined #asahi-dev
<axboe>
sven: got queue_rqs set up, and it's just cpu reduction at this point, iops are capped / controller limited