<alyssa>
apple's compiler implements * 9 with iadd lsl
<alyssa>
* 17 too
<alyssa>
imad for * 33
<alyssa>
imad for * 7
<alyssa>
imad for * -7
<alyssa>
oh, this is interesting, it uses iadd+lsl instead of bfi for small shifts
<alyssa>
it does use isub lsl
<alyssa>
but not in lieu of imad
<alyssa>
(and will optimize isub lsl to imad, weirdly)
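
To make the strength reduction above concrete, here is a minimal Python sketch. It assumes the shift applies to the second source, i.e. `iadd` with `lsl` computes `a + (b << shift)` and `isub` with `lsl` computes `a - (b << shift)`, both wrapping at 32 bits. The helper names are hypothetical models, not real ISA mnemonics or compiler output.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit integer wraparound

def iadd_lsl(a, b, shift):
    """Assumed semantics: a + (b << shift), wrapped to 32 bits."""
    return (a + (b << shift)) & MASK32

def isub_lsl(a, b, shift):
    """Assumed semantics: a - (b << shift), wrapped to 32 bits."""
    return (a - (b << shift)) & MASK32

def mul9(x):
    # x * 9 == x + (x << 3)
    return iadd_lsl(x, x, 3)

def mul17(x):
    # x * 17 == x + (x << 4)
    return iadd_lsl(x, x, 4)

def mul_neg7(x):
    # x * -7 == x - (x << 3), expressible with isub + lsl under this model
    return isub_lsl(x, x, 3)
```

Under this model, `* 33` would need `lsl 5`, which is past the lsl 4 limit mentioned in the log for a single iadd, so it is consistent with falling back to imad. `* -7` fits an isub with lsl 3 here, yet the log reports imad for it.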
<alyssa>
up to lsl 4 for a single iadd
<alyssa>
then, weirdly, <<5 is implemented by chaining iadd(lsl 1) with iadd(lsl 4)
<alyssa>
i fail to see why that would be a win
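
The chained shift can be sketched as follows. This is a hypothetical model: it assumes a shift is expressed as `iadd 0, x lsl k` with `k` capped at 4, and that larger shifts are built by feeding one iadd into the next. The chunk order is arbitrary here and is not taken from real compiler output.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit integer wraparound
MAX_LSL = 4          # per the log: a single iadd handles up to lsl 4

def iadd_lsl(a, b, shift):
    """Assumed semantics: a + (b << shift), wrapped to 32 bits."""
    assert 0 <= shift <= MAX_LSL
    return (a + (b << shift)) & MASK32

def shift_left_chained(x, n):
    """Compute x << n with chained iadds; returns (result, iadd count)."""
    value, count = x & MASK32, 0
    while n > 0:
        step = min(n, MAX_LSL)
        value = iadd_lsl(0, value, step)  # 0 + (value << step)
        n -= step
        count += 1
    return value, count
```

For example, `shift_left_chained(x, 5)` uses two iadds (lsl 4, then lsl 1), matching the two-instruction chain described above.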
<alyssa>
although, if there are a lot more SCIB than IC units then it makes sense
<alyssa>
(those are functional units, in the perf counters)
<alyssa>
(different ALUs)
<alyssa>
I haven't done any benchmarking myself, but comparing Philip Turner's notes with dougall's, I guess there are 4 SCIB units per 1 IC unit on Apple M1
<alyssa>
with int32 addition and basic bitwise ops on the SCIB unit, but the variable barrel shifters on the IC unit
<alyssa>
meaning from a raw throughput perspective, using 3 adds (with lsl) to save a real shift would still be a win
<alyssa>
that suggests x << 12 would be implemented with chained iadds, but x << 13 is a toss up, but x << 17 would be real bitwise
<alyssa>
In reality, Apple's compiler chooses to use bfi for 9 and above, but x << 8 is chained iadds
<alyssa>
That is, it will use 2 iadds to save a bfi, but will not use 3 iadds
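
The throughput argument can be written out as a small cost model. It assumes the guessed ratio of 4 SCIB units per IC unit, a cap of lsl 4 per iadd, and that raw throughput is the only factor. The function names are hypothetical, and `observed_choice` only encodes the behaviour reported in the log (at most 2 iadds, otherwise bfi).

```python
import math

MAX_LSL = 4      # per the log: a single iadd handles up to lsl 4
SCIB_PER_IC = 4  # guessed unit ratio from the log, not a measurement

def iadds_needed(n):
    """Number of chained iadds needed to build x << n."""
    return math.ceil(n / MAX_LSL)

def throughput_choice(n):
    """Pick the cheaper option if only raw throughput mattered."""
    k = iadds_needed(n)
    if k < SCIB_PER_IC:
        return "chained iadd"
    if k == SCIB_PER_IC:
        return "toss-up"
    return "bfi"

def observed_choice(n):
    """What the log reports: the compiler spends at most 2 iadds."""
    return "chained iadd" if iadds_needed(n) <= 2 else "bfi"
```

The two functions disagree for shifts of 9 through 12, which is the gap the log attributes to a possible latency or i-cache heuristic.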
<alyssa>
Possibly there are battling latency and i-cache concerns here, this might be a heuristic it has
abd has joined #asahi-gpu
<alyssa>
Performance of shifts by 10 is not of any importance whatsoever to me. But it's a nice test case for reasoning about the uarch.
<alyssa>
I would like to have cycle count estimates in my shader-db stats, but meh, I don't think we know enough about the uarch to do that accurately yet.
<alyssa>
---
<alyssa>
With no render targets, but depth/stencil
<alyssa>
lina: If you revert bf3027c3916 ("mesa/st: Normalize wrap modes for seamless cubes")
<alyssa>
then a bunch of dEQP-GLES3 cases will fail (listed in the commit message)
<alyssa>
if you can find a control register bit that you can toggle to make them pass it would make me very interesting
<alyssa>
interested
<alyssa>
would not change the level of interesting i am
LinuxM1 has quit [Quit: Leaving]
<alyssa>
(which is to say, about 37% on a good day)
c10l has quit [Quit: Bye o/]
c10l has joined #asahi-gpu
stickytoffee has quit [Read error: Connection reset by peer]
stickytoffee has joined #asahi-gpu
Guest12222 has quit [Ping timeout: 480 seconds]
rhysmdnz has quit [Ping timeout: 480 seconds]
<lina>
alyssa: I don't know if this is a good thing or a bad thing, but almost all the unknowns I exposed seem to do nothing to dEQP2/3... ^^;;
<alyssa>
Sounds like a neutral thing ^^
<alyssa>
Also I already had lunch, get some sleep %_%
<lina>
The only interesting things I found are that if large_tib is enabled and the TVB isn't large enough I get a message and the job hangs (means I need to update the min_tvb_size for that case), and that one of the registers has two bits which need to be set or a ton of things fault. I have a feeling it might be related to cache control/flushing?
<lina>
I just ended the stream! I'll have a bit of dinner and sleep ^^
<alyssa>
^^
<lina>
I opened a PR with the UAPI changes anyway ^^
<alyssa>
1. I don't know exactly how eMRT interacts with partial renders. I'm trying not to think about eMRT until I've finished up everything else in gles3.1
<alyssa>
2. Shrug
<lina>
Speaking of, I thought we were passing all of GLES3 but there's still that one test that needs eMRT?
<lina>
(the one with a too large TIB stride)
<alyssa>
eh, yeah, right
<alyssa>
that's in GLES3.0 in your CTS, it's in GLES3.1 in mine because I haven't updated in years :p