<jani>
basically HF-EEODB says: set the extension count in the base EDID block to 1, but the *real* extension count is in the HF-EEODB data block
<jani>
if you only have a struct edid pointer in the kernel, the amount of memory allocated for it depends on whether the allocator was HF-EEODB aware
<jani>
even if we add helpers to determine the EDID size and extension count, I can't see a way around reviewing *all* EDID usage, transfer, allocation, everything, across *all* drivers
<jani>
unless we modify the base block extension count in an HF-EEODB-aware drm_get_edid()... but I fear that might have userspace implications too
<jani>
vsyrjala: ajax: airlied: ^
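To make the override concrete, here is a minimal sketch of the lookup being described. It assumes the CTA-861 layout (extension tag 0x02, data blocks between byte 4 and the DTD offset, HF-EEODB as extended tag 0x78); the helper name `edid_extension_count` is hypothetical, not the actual drm_edid.c function:

```c
#include <stdint.h>

#define EDID_BLOCK_SIZE 128

/* Hypothetical helper, not the actual drm_edid.c code: return the real
 * extension count for an EDID whose first extension may carry an
 * HF-EEODB. The override lives in the first CTA-861 extension block as
 * a data block with tag 7 ("use extended tag") and extended tag 0x78. */
static int edid_extension_count(const uint8_t *edid)
{
    const uint8_t *cta = edid + EDID_BLOCK_SIZE;
    int base_count = edid[126]; /* extension count in the base block */
    int d, i;

    if (base_count < 1 || cta[0] != 0x02) /* no CTA-861 extension */
        return base_count;

    d = cta[2]; /* data blocks live in bytes 4..d-1 */
    for (i = 4; i < d; i += (cta[i] & 0x1f) + 1) {
        int tag = cta[i] >> 5;
        int len = cta[i] & 0x1f;

        if (tag == 7 && len >= 2 && cta[i + 1] == 0x78)
            return cta[i + 2]; /* HF-EEODB override count */
    }
    return base_count;
}
```

This is exactly why the allocation question below matters: an allocator that only reads byte 126 will size the buffer for one extension even when the override says there are more.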
frieder has quit [Ping timeout: 480 seconds]
<emersion>
please don't mutate the EDID blob exposed to user-space
<daniels>
^
gawin has joined #dri-devel
YuGiOhJCJ has quit [Quit: YuGiOhJCJ]
frieder has joined #dri-devel
<jani>
emersion: daniels: ack. (though we kinda do already by throwing out the edid blocks that are invalid... but maybe that's a thing on displays of the distant past mostly)
slattann has quit []
aravind has joined #dri-devel
<jani>
I'm kind of tempted to investigate adding a struct drm_edid that wraps around the raw struct edid to contain this meta info
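A minimal sketch of what such a wrapper could look like. The field and function names are illustrative guesses, not an actual kernel API:

```c
#include <stddef.h>

#define EDID_BLOCK_SIZE 128

struct edid; /* the raw base block, as declared in drm/drm_edid.h */

/* Sketch of the proposed wrapper: carry the allocated size next to the
 * raw data so consumers never have to re-derive it from the (possibly
 * overridden) base block extension count. */
struct drm_edid {
    size_t size;             /* bytes actually allocated */
    const struct edid *edid; /* raw EDID, base block first */
};

/* Size implied by an extension count: base block plus N extensions. */
static size_t drm_edid_size_for(int ext_count)
{
    return (size_t)(1 + ext_count) * EDID_BLOCK_SIZE;
}
```

The point of the wrapper is that the HF-EEODB-aware size is computed once, at allocation time, instead of in every driver that handles the blob.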
itoral_ has joined #dri-devel
itoral has quit [Ping timeout: 480 seconds]
thellstrom has joined #dri-devel
rgallaispou has quit [Read error: Connection reset by peer]
Lynne has quit [Quit: Lynne]
Lynne has joined #dri-devel
rasterman has joined #dri-devel
itoral_ has quit [Remote host closed the connection]
itoral has joined #dri-devel
jkrzyszt_ has joined #dri-devel
jkrzyszt has quit [Remote host closed the connection]
<alyssa>
all of these are gating GLSL IR lowerings
<alyssa>
which we already have NIR lowerings for
<kusma>
Yeah, but we need all drivers to go via NIR first, and I don't think that's the case yet?
<alyssa>
grumble
* alyssa
wonders who's at the intersection of "pure TGSI path" and ES3.1+
<alyssa>
Nouveau, I guess
* alyssa
adds to issue graveyard
<kusma>
Anything using gallivm or tgsi exec seems to prefer TGSI also
<kusma>
LLVMpipe has an explicit switch for this, probably time to remove that one soonish ;)
* alyssa
opened #6196 and added it to the graveyard
<alyssa>
:q
<kusma>
Yeah, !8044 is the MR to unblock this.
tobiasjakobi has quit []
shashanks has joined #dri-devel
alarumbe has joined #dri-devel
* alyssa
sighs
<alyssa>
Someday :)
Thymo_ has joined #dri-devel
Thymo has quit [Ping timeout: 480 seconds]
sdutt has joined #dri-devel
kts has joined #dri-devel
shashank_sharma has joined #dri-devel
Thymo has joined #dri-devel
psi has joined #dri-devel
shashanks has quit [Ping timeout: 480 seconds]
psii has quit [Remote host closed the connection]
Thymo_ has quit [Ping timeout: 480 seconds]
<kusma>
alyssa: And what a glorious day it will be!
<kusma>
At that point, we can even remove all TGSI caps, and make them options for nir_to_tgsi instead!
Thymo has quit [Ping timeout: 480 seconds]
psi has left #dri-devel [#dri-devel]
psii has joined #dri-devel
<alyssa>
:-D
Thymo has joined #dri-devel
shashank_sharma is now known as shashanks
mattrope has joined #dri-devel
iive has joined #dri-devel
<karolherbst>
./build/test_conformance/images/clReadWriteImage/test_cl_read_write_images 1D: PASSED 42 of 42 sub-tests. :3
masush5[m] has joined #dri-devel
<alyssa>
karolherbst: congrats!
<karolherbst>
alyssa: sadly nothing besides 1D images works and I don't know exactly why that is :D
<alyssa>
strides
<alyssa>
tiling
<alyssa>
compression
<karolherbst>
something something
<karolherbst>
I am sure it's strides though
<karolherbst>
or that would be my first guess
<karolherbst>
ahh... I know
<karolherbst>
usize to u16 conversion fails, because clients are not required to init all fields of a struct...
<karolherbst>
ehh wait
<karolherbst>
that's not it
<karolherbst>
and would be u32 anyway
<zmike>
kusma: tgsi will stay in llvmpipe until softpipe is removed
<zmike>
as the draw module also still uses it
<alyssa>
who are we keeping softpipe around for again
<zmike>
alternatively rewrite softpipe to use nir also, I suppose
<alyssa>
oh right, obscure archs that don't have llvmpipe, nvm
<karolherbst>
alyssa: those exist?
<zf>
is there no interest in having a conformant software rasterizer, then?
<zf>
I thought that was the purpose of softpipe
<alyssa>
llvmpipe is conformant IIRC
<karolherbst>
softpipe is more conformant than llvmpipe?
<karolherbst>
since when
<alyssa>
softpipe is llvmpipe but super slow and without LLVM if you like that kinda thing
<zf>
my impression was that llvmpipe intentionally violated the spec in some ways for the sake of performance
<alyssa>
that was swrast, which has been deleted
<karolherbst>
it passes the CTS
<karolherbst>
alyssa: you mean swr
<alyssa>
karolherbst: maybe that one too? :-p
<karolherbst>
or.... ehh.. swrast was classic, no?
<alyssa>
ye
<karolherbst>
anyway, llvmpipe is conformant
<zmike>
llvmpipe is GL 4.5 conformant
<zmike>
not ES
<karolherbst>
in the end it doesn't matter
<zf>
had to look this up, but at least in the past it needed GALLIVM_DEBUG=no_rho_approx,no_brilinear,no_quad_lod
<zf>
to be fully conformant
<zf>
also GALLIVM_PERF=no_filter_hacks
<karolherbst>
zf: hardware is also not perfect
<zf>
well, sure, but isn't perfection kind of the point of a reference rasterizer? :D
<karolherbst>
but I think the perf opts are disabled by default
<karolherbst>
zf: what's the point of software rast if it's unusably slow
alyssa has left #dri-devel [#dri-devel]
<karolherbst>
reference impl is kind of a myth though in GL and Vk
<karolherbst>
nobody needs it and nobody cares
<zf>
it can be quite useful to run tests against it
<karolherbst>
and what do you want to test with it?
<karolherbst>
all games have optimized paths for vendors anyway
<karolherbst>
so testing that against swrast is pointless
<karolherbst>
as the vendor-specific paths are something you'll need to test anyway
<zf>
well, I work on wine, and we try to avoid vendor-specific paths
<karolherbst>
and then you can just skip swrast because that would mean 4 rounds of testing instead of just 3
<karolherbst>
or 2 if you ignore intel
<zf>
there's been plenty of instances where testing against swrast reveals something interesting
<karolherbst>
zf: yeah well.. which isn't true
<karolherbst>
ohh sure, it might reveal something interesting
<daniels>
Mesa CI does tons of conformance testing of and on top of llvmpipe
<zf>
swrast also has the nice feature that it's unlikely to cause a GPU reset and break one's whole desktop
<daniels>
it does need those options, as you say, but it is fine if you have them
<karolherbst>
zf: sure, but then some conformant sw impl is fully enough
<karolherbst>
I specifically meant "reference impl"
<karolherbst>
that is what nobody needs
Haaninjo has joined #dri-devel
<karolherbst>
anyway, "reference implementations" are usually something the spec author(s) write in order to show how to implement things
<zf>
sure
<zf>
fair enough, I'm not arguing for a reference implementation, but having a fully conformant one is still quite useful IME
<karolherbst>
sure, and llvmpipe is fully conformant
<zf>
not reference implementation qua reference, I should say
<karolherbst>
well.. at least so far the CTS checks
rgallaispou has quit [Read error: Connection reset by peer]
mbrost has joined #dri-devel
<karolherbst>
but speaking about llvmpipe.. I would be curious how llvmpipe makes use of vectorized instructions if at all... but I can also see that SSE/AVX/etc.. are lacking instructions to do that properly
gio_ has quit []
fxkamd has joined #dri-devel
camus1 has quit [Ping timeout: 480 seconds]
lkw has quit [Quit: Lost terminal]
mq47 has joined #dri-devel
<ajax>
karolherbst: llvmpipe goes out of its way to rearrange pixels into SoA form so it can vectorize
Lucretia has quit []
<ajax>
except for the "linear" path but that uses sse2 too
mq47 has quit []
<ajax>
it's about as clever a swrast as you could want; the biggest performance challenge with llvmpipe is your cpu's utter lack of bandwidth
<clever>
SoA?
<ajax>
"structure of arrays". four input RGBA pixels get rearranged into RRRR GGGG BBBB AAAA
<clever>
ahhh right
<ajax>
as opposed to the AoS of the input form
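The swizzle ajax describes can be modeled in a few lines of scalar C. This is only an illustration of the data movement, not llvmpipe's actual (JITed, SIMD) implementation:

```c
#include <stdint.h>

/* AoS -> SoA: four RGBA pixels (array-of-structures) become
 * RRRR GGGG BBBB AAAA (structure-of-arrays), so one SIMD instruction
 * can process the same channel of four pixels at once. */
static void rgba_aos_to_soa(const uint8_t in[16], uint8_t out[16])
{
    for (int p = 0; p < 4; p++)     /* pixel index */
        for (int c = 0; c < 4; c++) /* channel: 0=R 1=G 2=B 3=A */
            out[c * 4 + p] = in[p * 4 + c];
}
```

After the transpose, the four red bytes are contiguous, which is what makes the vectorized shader math below worthwhile.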
<clever>
at least on the rpi gpu, it can accept data in either form i believe, the vector part of the 3d core is surprisingly flexible
<clever>
but for other vector parts of the chip, SoA may be better
<clever>
so which is better can vary, depending on the consumer
<DanaG_>
Aside from my amdgpu + ast issues on my x86 server board, on my arm64 machine I've also had some odd lockups with traces that mention CEC (hdmi-cec)... but maybe that's more a kernel bug than a DRI bug? https://dpaste.com//BM8XND5QF
<ajax>
yeah. iirc the SoA form is better for real 3d tasks since you spend more of your time in the meat of the shader, and the linear path stays in AoS because 2.5D compositor kinds of tasks don't do a whole lot of math in the fs so the swizzle in and out of SoA becomes meaningful overhead
rgallaispou has joined #dri-devel
<DanaG_>
I tried amdgpu.dpm=0, but that disables the fan control, annoyingly. But amdgpu.bapm=0 seems to maybe help.
<clever>
ajax: for the 3d core of the rpi, it always loads sets of 16 vertices into the vector core, but it has a wide range of modes that can accept either
<ajax>
(here speaking just of llvmpipe, i don't pretend to know the details of most modern gpus at this level)
<clever>
in text form, referring to a given row+col in the matrix: in 8bit packed mode, it will turn a uint32_t[4] into a uint8_t[16], breaking the bytes up
<clever>
but you can also do 8bit laned, where you give a row and a byte, then it will extract the specified byte (0-3) from all 16 values in an uint32_t[16]
Lucretia has joined #dri-devel
<clever>
so you could have an array of struct { uint8_t a,b,c,d; }, then you load it into the VPM, and use 8bit laned mode to read 16 a's
<ajax>
clever: do you happen to have one handy and would you mind running x11perf on it a couple of times for me? i have a libX11 patch that seems like an obvious win on this machine but it's a 6-core i7 so i kinda want to check the overhead on a less potent machine
<clever>
and 16bit laned mode lets you do the same with struct { uint16_t a; uint8_t b,c; }
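A scalar model of the "8-bit laned" read clever describes, assuming little-endian lane numbering (the function name is made up; the real VPM access is configured through hardware setup registers, not a C call):

```c
#include <stdint.h>

/* Model of an 8-bit laned VPM read: extract byte `lane` (0-3) from
 * each of 16 packed 32-bit words, e.g. pull the sixteen `a` fields out
 * of an array of struct { uint8_t a, b, c, d; } in one operation. */
static void vpm_read_lane8(const uint32_t in[16], int lane, uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(in[i] >> (8 * lane));
}
```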
<clever>
ajax: i do have the entire model range, pi0, pi1, pi2, pi3, pi4, pi400, and pi02!
<ajax>
and then 'Xvfb -ac -noreset -scrn 0 1024x768x24 :77 &'
<clever>
in framebuffer mode? that driver is generally deprecated
<ajax>
and then 'DISPLAY=:77 x11perf -noop -query -dot -rect{1,10} -{putimage,shmput}10', twice, once with that patch and once without
<ajax>
enh, fb mode is always something we have to care about because there's always like the efifb case where i915 doesn't support your gpu yet
<ajax>
if you want to run the same test on a hardware server that's fine too, just let me know which kind you're doing
<ajax>
back story is XInitThreads is optional for a bunch of historical reasons but among them that the locking overhead would be unacceptably high
<clever>
and i'm guessing a lot of the libX11 functions will just not do locking, if you dont XInitThreads() ?
<ajax>
and afaict it's just not measurable anymore, so the tests there (except for -query) are all quite small in terms of work per request so they should show any overhead
<ajax>
right
<clever>
but thats still an overhead of having to check if locking is needed or not
<ajax>
hence the 'weaksauce' comment in the commit message ;)
<ajax>
point is even if i don't try super hard the overhead seems to not be there, and it makes whole classes of bugs die forever, so...
<clever>
this sounds like its mainly a problem within the X11 client
<clever>
and the exact backend used by the server doesnt matter?
<ajax>
nod. as long as the server doesn't just no-op all rendering the numbers should be pretty meaningful
<ajax>
clever: yep. do it on both runs just so you're sure you're comparing same cflags instead of whatever the system libX11 happened to be built with
<clever>
ah yeah, thats also critical
<clever>
i also suspect that the arm core will impact things too
<clever>
the bcm2835 arm1176 core doesnt really have the same mutex opcodes i think, because its single-core
<ajax>
mattst88: i'm leaning no? the logic in that patch still seems to make sense, and !113 seems like a more robust fix
<mattst88>
okay. do you want to press the button?
<ajax>
clever: i will take as much data as you're willing to generate here ;)
<mattst88>
I'll make a new release once that's fixed
<ajax>
ugh, libx11 still makes you insert your own link to the MR in the commit message?
<ajax>
mattst88: !125 too please?
<clever>
ajax: and just to cover all of the bases, how would i check if a compositing window manager is active?
<clever>
build is done, on master 918063298cb893bee98040c9dca45ccdb2864773
<clever>
for some benchmarks, the client is getting up to 20% cpu
<clever>
for others, its much lower
<clever>
`putimage 10x10 square` also looks wrong
<clever>
the lines are just pure chaos
<clever>
is it supposed to be?
<clever>
ajax: gist updated, with libX11 master
<clever>
ajax: and again, with your branch
<clever>
ajax: no-op is far worse looking
aravind has quit [Ping timeout: 480 seconds]
<karolherbst>
ajax: yeah.. I was more thinking about running shader purely with SSE/AVX
<karolherbst>
so not just launching threads executing "GPU threads" but also vectorizing internally, because... that's what we do on CPUs and I think that was the goal of swr?
<clever>
karolherbst: when using SSE/AVX, how many vectors of uint8_t can you load into registers at once? how wide is each vector?
<clever>
from a brief look at vector extensions in aarch64, its surprisingly weak
<karolherbst>
I would hope that llvm knows how to optimize, but auto vectorization is so useless overall :/
<karolherbst>
clever: was more thinking about x86, but yeah, aarch64 and risc-v could benefit as well
<clever>
i'm curious as to how powerful x86 is as well
<karolherbst>
auto vectorization is just doomed to fail, so developers have to be smart about writing code
<karolherbst>
clever: _if_ you can vectorize it's fast
<clever>
ive been using a cpu core that has enough register space to hold 256 vectors of uint8_t[16]
<karolherbst>
and how does that help if code doesn't make use of it?
<clever>
so arm having only room for something like 8 or 16 doubles, seems rather weak
<ajax>
karolherbst: i feel like a lot of llvmpipe's smarts is already about moving things into vec4s before llvm gets to them
<karolherbst>
ajax: probably
<karolherbst>
but
darkapex has joined #dri-devel
<karolherbst>
the issue is, we still execute threads 1:1 afaik
<ajax>
1:1 with what
<clever>
ajax: i think the "everything else" was bottlenecked with Xorg actually drawing to the drm buffers, as instructed?
<karolherbst>
gpu:cpu threads
<karolherbst>
so if you launch 1024 CL kernels you get 1024 "work items" which get executed by llvmpipe
<karolherbst>
sucks for scalar code
<karolherbst>
I might be wrong, but I think that's what is happening
pochu has quit [Quit: leaving]
<ajax>
scalar code sucks, yes.
<karolherbst>
well
<karolherbst>
it doesn't
<karolherbst>
that's the point
<ajax>
that work queue does get distributed over every thread though, and i'm pretty sure lp will do one per core
<karolherbst>
on real GPUs it won't matter as you can just run 10k threads in parallel
<karolherbst>
so GPUs moved to scalar ISAs and everything
<karolherbst>
but on CPUs....
<ajax>
again, your problem isn't the instructions you're retiring, it's the long thin tube connecting your EUs to memory
<ajax>
at least for GL uses of llvmpipe. maybe CL is different.
<karolherbst>
mhhh, yeah maybe it won't matter as long as memory is connected as slowly as it is on x86
<clever>
i did some benchmarking of that as well on the rpi
<clever>
let me find my numbers
<karolherbst>
but we do have AVXed and SSEed memcpys which do speed up things
<clever>
400MHz DDR2 ram has an estimated 25.6 gigabit/sec of bandwidth, assuming you transfer data on every single clock
<clever>
vectorized loads from uncached ram, to the VPU, got 23.5-23.7 gigabit/sec of throughput
<clever>
when using the 4096 byte vector-load opcode
<karolherbst>
yeah well.. my intel CPU has like 76.8 GB/s max
<clever>
91% of the theoretical ram bandwidth, doesn't seem like a thin tube
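Checking clever's arithmetic, assuming a 32-bit DDR2 bus (the bus width is an assumption; the 400 MHz and 23.5 Gbit/s figures are from the discussion):

```c
#include <math.h>

/* Back-of-envelope peak bandwidth for DDR memory: double data rate
 * means two transfers per clock across the bus width. */
static double ddr_peak_bits_per_s(double clock_hz, int bus_bits)
{
    return clock_hz * 2.0 * bus_bits;
}
```

400e6 * 2 * 32 = 25.6 Gbit/s, and 23.5 / 25.6 is about 0.918, which is where the "91%" figure comes from.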
<ajax>
25.6 GB/s, for reference, is a radeon 9800 (yes, the thing from 2003)
<karolherbst>
well
<karolherbst>
yeah
<karolherbst>
x86 has slow memory
<karolherbst>
but it's not _that_ slow
<karolherbst>
I mean if there isn't enough memory bandwidth for SSE and AVX that would make them pointless, no?
<clever>
karolherbst: from what ive seen of arm vector extensions, there are relatively few vector registers, and any code using them is going to be extremely load/store dense
<karolherbst>
sure it would have been better to multiply cores by 8 and not add SSE/AVX but here we are
<clever>
but the VPU vector extensions, are HUGE, and i could load an entire dataset in one shot, and then do an entire FIR filter in ~3 opcodes
<clever>
and there is enough to spare, that i can keep the entire coefficients table in the registers, and load another dataset to FIR against
<karolherbst>
clever: because big vector sizes are a waste of transistors
<clever>
so i can omit an entire load on every loop
<ajax>
karolherbst: i guess i'm trying to say most of the v11n that can usefully be done probably has
<karolherbst>
probably
<clever>
karolherbst: i think its this big, because it was originally a DSP core
<karolherbst>
vectors for highly specific use cases are fine, just not for general purpose stuff
<ajax>
i can't remember if lp stores textures pre-tiled-and-soa'd
<clever>
the vectors are also always 16 wide, that is hard-wired
<ajax>
seems like it should if it can
<clever>
so its not that big of a vector, its more the number of vectors it can hold
rgallaispou has quit [Read error: Connection reset by peer]
<karolherbst>
clever: I prefer 16 times more cores than vectors tbh
<clever>
thats kinda what the 3d core did
<ajax>
remember the other problem with keeping the cpu fed is your input data having any cache locality
<karolherbst>
but that's super hard to do on CPUs actually
<karolherbst>
it's a mess
<karolherbst>
clever: yes and no
<clever>
the 3d subsystem is 12 cores, of 16 wide vector-only compute
<karolherbst>
GPUs can ignore a bunch of problems CPUs have to deal with
<ajax>
and sampling linear memory is about the least cache friendly thing
<clever>
yeah
<clever>
the rpi 3d core is technically turing complete, but with added restrictions
<clever>
conditional branching, is based on if none/some/all of the lanes meet a condition
Duke`` has joined #dri-devel
<karolherbst>
*ugghh* that reminds me of somebody who actually compared CPUs and GPUs like that
<clever>
conditional execution (an opcode maybe being a no-op) has finer control
<karolherbst>
yeah...
<HdkR>
SVE and AVX-512 predication matches pretty well with the GPU model
<clever>
the asm and most docs also treat the 3d system as a scalar core
<karolherbst>
the ISA might look like scalar for most GPUs, but internally it's highly vectorized
<clever>
exactly
<clever>
it may look scalar, but the hw will schedule 16 threads with 16 different inputs, and 1 shader
<clever>
and the illusion only breaks at conditional branching
<karolherbst>
it would be interesting to see if there are shaders lp _could_ emit purely vectorized, acting as a variably sized GPU
<karolherbst>
so sometimes it can launch 4 GPUs threads in one go, sometimes only 1
<clever>
there is also user controlled threading as well
<karolherbst>
mhhh
<karolherbst>
openmp, but not threaded, just simd'd
<karolherbst>
:D
<clever>
if a shader starts a long async task (texture lookup), it can yield the core to another thread
<clever>
the hw will then swap the upper and lower registers, and run a different set of threads
<clever>
until they also yield
<karolherbst>
ahh sure
<ajax>
karolherbst: lp tasks are just a work queue, you can make them as big as you want
<clever>
context switching is avoided by just banning the use of the upper half of the registers, and swapping upper/lower
<karolherbst>
ajax: yeah.. but I meant merging lp tasks into vectorized variants
<ajax>
what do you think is the fundamental unit of work for an lp task
<ajax>
it's not: one pixel
<clever>
karolherbst: another tricky thing, is that its not even a 16x wide vector lane, its only 4x, but the pipeline helps to cheat
<ajax>
they're built to be a vector workload, already
<karolherbst>
ahh okay
<clever>
my rough understanding of how the QPU functions, is that the pipeline is 4 stages long, so a given opcode takes 4 clock cycles to run
<karolherbst>
maybe that doesn't pan out for compute then
<karolherbst>
not sure
<clever>
but to hide latencies where a register can only be used 4 clocks after you write
<clever>
a given "thread" only runs on every 4th clock
<clever>
so the pipeline, is always interleaved with 4 "threads"
<karolherbst>
but I think 1024 CL threads were split into 1024 lp tasks
<ajax>
well i'm describing render tasks, you could probably make compute tasks pick a different microtile size
<clever>
and each of those "threads" is a 4x vector task
<karolherbst>
ajax: yeah.. that's what I am thinking about how feasible that would be.. not that it really matters, just thinking out loud here
<ajax>
probably what i'd do is, if the incoming task has a 1x1 work unit, unroll it to 2x2 and feed that as the task?
<karolherbst>
CL has explicit grid sizes
<karolherbst>
ehh block?
<ajax>
llvm might not vec it up very much but you're at least likely to improve cache hit rate
<karolherbst>
the smaller one
<karolherbst>
yeah...
<karolherbst>
I wouldn't trust compilers here to do the correct thing in terms of vectorization anyway
<karolherbst>
but maybe things could be improved a little somehow
<ajax>
i'm honestly wanting a way out of llvm
<karolherbst>
mhhh
<karolherbst>
as long as we can keep clang
<karolherbst>
nir to x86 would be fun
<ajax>
it can be a process on my system, i'm not super into it being a DSO in every process. clang _or_ llvm.
<karolherbst>
yeah...
<karolherbst>
well I only want clang for CL
<ajax>
and like: gallium is pretty well positioned to make informed layout choices already. i don't need my jit to try super hard for that, i just need it to be able to express every vector operation i might encounter
<karolherbst>
I really don't want to write a C parser which can deal with all the code out there
<ajax>
someone just teach mir about avx2 already
<ajax>
anyway i'm rambling
<zmike>
ADAM FOCUS
<clever>
ajax: want any other combinations tested? Xvfb? fkms? no kms? other pi models?
<ajax>
clever: wimpiest pi model you've got
<clever>
original pi1 it is!
gio has joined #dri-devel
gio has quit []
gio has joined #dri-devel
<clever>
if it will boot...
Thymo has quit [Ping timeout: 480 seconds]
<ajax>
ah, provocative maintenance
<clever>
maybe a zero...
<clever>
same soc, just a different default clock
<clever>
yep, booting
<clever>
You are in emergency mode. After logging in, type "journalctl -xb" to view
<clever>
[ 24.666726] mmc0: read transfer error - HSTS 20
<clever>
ah yeah, /boot wont mount, for unknown reasons
ybogdano has joined #dri-devel
ajax is now known as Guest158
Daaanct12 has joined #dri-devel
lkw has joined #dri-devel
alyssa has joined #dri-devel
<alyssa>
eric_engestrom: dcbaker: Have we selected a date for the branch point?
<alyssa>
I guess April 13 is the 'expected' date but not sure if it'll be sooner/later in practice
<dcbaker>
yes.. April 13th IIRC. I'll send out the calendar update
<alyssa>
ack
<alyssa>
I'd really like to land Valhall support in Panfrost and I'm trying to budget my time (and modulate my perfectionism) accordingly
<alyssa>
given I have gles2 conformant (i.e. enough for an accelerated desktop), it would stink if there's no support in 22.1
<alyssa>
(Aiming for conformant gles31 in 22.2 regardless, but given Linux-capable Valhall devices are already in the wild 3 months is a big difference)
<dcbaker>
okay. We can nudge it, or I can slip things in during the RC for you as long as it's just in panfrostland
<alyssa>
Heh, thanks
<dcbaker>
(or other people are happy to land common stuff)
<alyssa>
We'll see how much I can get in before APril 13
<alyssa>
the code is written just needs to be processed thru Marge ;P
<dcbaker>
my policy is "if driver teams want to bork their own drivers, that's their call" :D
rkanwal has joined #dri-devel
Daanct12 has quit [Ping timeout: 480 seconds]
slattann has joined #dri-devel
MajorBiscuit has quit [Ping timeout: 480 seconds]
<alyssa>
Lol. Fair
<airlied>
karolherbst: when you say CL threads what do you mean?
<karolherbst>
airlied: one item in the local_work_group
<alyssa>
For context of scale, my Valhall bring up branch is about 5kloc diff from main right now
<airlied>
launching 1024 1x1x1? or something saner?
<alyssa>
Trying to land the first 3kloc today/this week
<karolherbst>
airlied: nope, launching 1024x1x1
<airlied>
like it vectorizes that
<karolherbst>
really? mhh
<karolherbst>
maybe I missed that
<alyssa>
3 weeks to clean up and Marge the other 2kloc, sounds doable
<airlied>
yes it would suck otherwise
<karolherbst>
maybe I need to take a closer look
<linkmauve>
karolherbst, on CPU, you can use perf to see what a given function got JIT'd to.
<airlied>
it will launch 1024/8
<alyssa>
(Especially if I hide the PIPE_CAPs for anything newer than gl2)
<alyssa>
(For the new hw only obviously)
<airlied>
karolherbst: it also uses coroutines to do barriers
<clever>
Guest158: ok, wut, i can read the entire boot partition, but i cant mount it!?
<clever>
karolherbst: an HSTS of 0x20, means the SD controller encountered a crc16 error, on the SD bus
<clever>
that would imply problems between the card and soc, not the data itself
<clever>
and yet, i can read the entire partition, twice, and not have any errors
<clever>
fsck is also happy
<airlied>
karolherbst: so it could maybe do better at dispatch if there are no barriers
<airlied>
since if you have a local block size that is nuts it will vectorize, but not thread
<airlied>
it only threads blocks
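The dispatch model airlied sketches, in a few lines. The 8-wide SIMD width is the example number from above, not a fixed property of llvmpipe:

```c
/* A local block of `local_size` work items is vectorized `simd_width`
 * at a time on a single CPU thread; only separate blocks are spread
 * across threads. 1024x1x1 at 8-wide is 128 vector iterations. */
static unsigned lp_vector_iterations(unsigned local_size, unsigned simd_width)
{
    return (local_size + simd_width - 1) / simd_width; /* round up */
}
```

So a single huge block keeps one core busy no matter how many cores exist, which is the "vectorize, but not thread" point.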
gawin has quit [Ping timeout: 480 seconds]
slattann has quit []
lynxeye has quit []
garrison has joined #dri-devel
fxkamd has joined #dri-devel
<karolherbst>
clever: or the driver is buggy
<clever>
karolherbst: it only fails like this on certain models
i-garrison has quit [Ping timeout: 480 seconds]
i-garrison has joined #dri-devel
shankaru has quit [Read error: Connection reset by peer]
shankaru has joined #dri-devel
jkrzyszt_ has quit [Ping timeout: 480 seconds]
garrison has quit [Read error: Connection reset by peer]
heat has joined #dri-devel
DanaG_ has quit [Remote host closed the connection]
gawin has joined #dri-devel
ella-0 has joined #dri-devel
DanaG has joined #dri-devel
ella-0_ has quit [Read error: Connection reset by peer]
frieder has quit [Remote host closed the connection]
janesma has joined #dri-devel
nchery has quit [Ping timeout: 480 seconds]
shankaru has quit [Quit: Leaving.]
<karolherbst>
2D images: PASSED 42 of 42 sub-tests. :)
<karolherbst>
3D as well, yay
<karolherbst>
alyssa: I forgot to use the strides from the pipe_transfer object... :D
<karolherbst>
1Darray and 2Darray are passing as well, nice
idr has joined #dri-devel
Guest158 has left #dri-devel [#dri-devel]
ajax has joined #dri-devel
<karolherbst>
airlied: what's our plan for opaque pointers?
<airlied>
karolherbst: I only found out about them yesterday :-)
<airlied>
yeah at some point we'll have to to move to the new APIs
<karolherbst>
:) I kept ignoring the issue and hoped somebody knowing LLVM would review
<karolherbst>
airlied: or.... ditch clover?
<airlied>
karolherbst: I think llvmpipe will needs fixes as well
<daniels>
karolherbst: saving that message to pull out later when you start reviewing LLVM-adjacent work from others
<airlied>
karolherbst: I don't think clover impacts it that much, and if it does, clc will have same issues
<karolherbst>
airlied: annoying :(
<karolherbst>
daniels: heh :D
Thymo has joined #dri-devel
<Venemo>
kusma: about our talk earlier on your radv docs MR, what do you think would be the best way to also include the ACO docs on the Mesa website in the future? If I simply make a MR to move the ACO readme file to docs/ will that do?
<kusma>
Venemo: You need to also convert it to rst. I *think* pandoc can do that for you...
<Venemo>
kusma: does rst support tables now? Last I checked it didn't
glennk has quit [Remote host closed the connection]
gouchi has joined #dri-devel
glennk has joined #dri-devel
lemonzest has quit [Quit: WeeChat 3.4]
Danct12 has joined #dri-devel
lkw has quit [Quit: leaving]
Haaninjo has quit [Quit: Ex-Chat]
<graphitemaster>
NV discussing using Intel to fab GPUs is the most hellish thing I've read in computer graphics
<graphitemaster>
We are in the bad place
<alyssa>
graphitemaster: but everyone has been telling me this is the good place!
<Lyude>
lmao I am surprised, I didn't expect Intel to be open to anyone using their fab
<alyssa>
Lyude: all companies are open to anything for enough $$$
<graphitemaster>
alyssa, That's exactly what someone who doesn't want you to know we're in the bad place would say!
<Lyude>
alyssa: very true
rcf has quit [Quit: WeeChat 3.2.1]
rcf has joined #dri-devel
Duke`` has quit [Ping timeout: 480 seconds]
<alyssa>
graphitemaster: *blink*
LexSfX has quit []
<graphitemaster>
Intel's fab process is so far behind if NV made a GPU on it it would consume 700w of power, the chip would be the size of a CD jewel case, the GPU would take up four PCIe slots and probably require a transfer of the framebuffer to the CPU because Intel will want to slide their terrible hybrid dGPU + iGPU nonsense into it as part of the deal.
<graphitemaster>
See, bad place.
mvlad has quit [Remote host closed the connection]
rcf has quit [Quit: WeeChat 3.2.1]
rcf has joined #dri-devel
LexSfX has joined #dri-devel
xroumegue has joined #dri-devel
ybogdano has quit [Read error: Connection reset by peer]
ybogdano has joined #dri-devel
<alyssa>
graphitemaster: can't tell if you're forking with me
<icecream95>
alyssa: getpid()?
<alyssa>
that's one way to find out!
<graphitemaster>
alyssa, It's called hyperbole because that is the brand image Intel has right now for their continued failure to deliver improvements on their manufacturing process and strong-arming OEMs into using integrated graphics.
<graphitemaster>
I was embellishing it a little :P
<graphitemaster>
Do you think I'm being too harsh on Intel
<alyssa>
hyperbole, that's the sinh/cosh one right?
<DrNick>
mesa's post-processing isn't suited to changing render target resolution is it
<alyssa>
people use that?
<DrNick>
mesa's postprocessing? no.
<graphitemaster>
alyssa, I think those are called the exaggerated sines and cosine functions, they give similar results to sine and cosine but they're slightly exaggerated.
<DrNick>
I was just contemplating AMD's RSR
<Lyude>
graphitemaster: tbh ADL is finally actually competing so they're starting to do a bit better
<Lyude>
(also I'm looking forward to the DG2 dropping, honestly very much want to use one in my next desktop)
gouchi has quit [Remote host closed the connection]
AndroidC512L has joined #dri-devel
<graphitemaster>
I was looking at the swift shader papers on improved exp/log and sin/cos
<graphitemaster>
Wrote the sin/cos one in regular GLSL with floatBitsToUint and friends to simulate some of the machine-level stuff they do for what is ostensibly a software renderer
<graphitemaster>
Since most of those kernels are not native on GPUs anyways
<graphitemaster>
Curious, does mesa have implementations of sin/cos/exp/log as part of NIR or what ever that are just used in-situ if the GPU lacks native instructions?
<graphitemaster>
Could be interesting replacing them with the swift shader implementations if the ones in mesa are slower
<icecream95>
graphitemaster: Yup, and also IIRC in TGSI.. I think it's in nir_opt_algebraic.py
JohnnyonFlame has quit [Ping timeout: 480 seconds]
<jekstrand>
Most hardware has sin/cos/exp/log
<jekstrand>
Other transcendentals like acos, atan2, etc. aren't
<icecream95>
See also lp_build_log2_approx and i915_sincos_lower...
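[The general shape of `lp_build_log2_approx` is: split the float into mantissa and exponent, then approximate log2 of the mantissa with a polynomial. A minimal Python sketch of that split, using a short atanh-style series as an illustrative stand-in for llvmpipe's actual polynomial:]

```python
import math

def log2_approx(x):
    # Split x into mantissa m in [0.5, 1) and exponent e,
    # so that x == m * 2**e.
    m, e = math.frexp(x)
    # log(m) == 2*atanh(s) with s = (m-1)/(m+1); truncate the series.
    s = (m - 1.0) / (m + 1.0)
    p = s + s**3 / 3.0 + s**5 / 5.0
    # Recombine: log2(x) = e + log2(m).
    return e + 2.0 * p / math.log(2.0)
```

[For m in [0.5, 1) the series argument stays in [-1/3, 0), so even this short truncation is accurate to about 1e-4; real implementations pick minimax coefficients instead.]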
<jekstrand>
I said "most"
<jekstrand>
It might be useful for llvmpipe but no one cares much about i915 perf
<icecream95>
jekstrand: I was pointing out that these (at least the i915 one) seem to duplicate the lowering in opt_algebraic
<alyssa>
graphitemaster: Just read the sin/cos paper, these are all standard tricks AFAICT
<alyssa>
(Less sophisticated than what can be done in hw these days as well)
<jekstrand>
We have lowering in opt_algebraic?
* jekstrand
wasn't paying attention, I guess.
<alyssa>
for lima
<icecream95>
lowered_sincos
columbarius has joined #dri-devel
danvet has quit [Ping timeout: 480 seconds]
<jekstrand>
I see. Yeah, we could probably drop the i915 one then
<icecream95>
"It's suitable for GLES2, but it throws warnings in dEQP GLES3 precision tests."
<jekstrand>
:-/
<jekstrand>
Good enough for i915, probably
gawin has quit [Ping timeout: 480 seconds]
* alyssa
assigns to yeet
<alyssa>
wait that sounds terrible
<alyssa>
let me try again
* alyssa
yeets to Marge
* zmike
watches marge yeet it back
<alyssa>
We'll see
ahajda has quit [Quit: Going offline, see ya! (www.adiirc.com)]
* icecream95
hurriedly looks through the MR for something to NAK
rkanwal has quit [Ping timeout: 480 seconds]
<alyssa>
Apparently I have to assign unreviewed code to Marge in order to get reviews C:
<graphitemaster>
Don't yeet PRs
<icecream95>
alyssa: Do you really have to touch the Bifrost ISA file like this? "<opt>acmpxchg</opt> <!-- For Valhall -->"
<alyssa>
icecream95: Diff is missing context, that's part of a pseudo instruction
<alyssa>
But... yes, I'm now regretting tying the IR to Bifrost's encodings
<daniels>
anholt: given that GCN is now 10 years old, there's probably a case for yeeting r600 to amber tbh
<icecream95>
alyssa: Generally pseudo-instructions don't need "<reserved/>" modifier values?
<alyssa>
And would like to clean that up but I'd rather not block hw enablement on that
<alyssa>
It's a copy paste from an actual valhall instr
<alyssa>
*bifrost
<alyssa>
(the real atomics)
<anholt>
daniels: feels like a weird line -- we have nv30 on this side of amber, which is way worse than r600
<alyssa>
The whole pseudo instruction idea is a mess and I regret it and wish I fixed it a year ago
<anholt>
and crocus is handling a bunch of older hw, too.
<alyssa>
anholt: daniels: IMO the relevant axis is maintenance level, which is only weakly correlated with hw age
<anholt>
daniels: tbh I'd be happy if I could just merge r600 changes, like I'm doing for i915. but gerddie kinda owns it.
<alyssa>
I don't know where nv30/r600 fall there. crocus seems active, though.
<daniels>
anholt: ah, I'd missed that nv30 was still around
<daniels>
worth prodding gerddie in any case
<airlied>
r300 and r500 are still around :)
<anholt>
yeah, we have a new active developer for r300, and it's not me.
<anholt>
(and on that note, do we have a "add a new committer" process written down anywhere?)
<daniels>
anholt: ask on irc/gitlab/list, get approval from one or more people, eventually someone adds them
<anholt>
cool. well, @ondracka has been doing an incredible job fixing regressions and adding new optimizations for r300.
<graphitemaster>
Speaking of math in GLSL. Who here wants to fix integer divisions? Literally unusable in any shader because the rounding direction is undefined. I've seen the same hardware round differently just based on the values. It seems NV turns integer division into floating point so the precision is like 24 bits (might be less). At the very least some warning about using it would be nice in the shader compiler so I can detect bugs. I've
<graphitemaster>
now found and corrected probably 50 such bugs.
<alyssa>
icecream95: ugh. typo.
<alyssa>
good NAK work
<alyssa>
thanks
<alyssa>
graphitemaster: Just don't do integer divisions on the GPU.
<alyssa>
Just, really, don't.
<graphitemaster>
I know but how do I enforce that in a team alyssa
<graphitemaster>
There's no way to enforce it
<alyssa>
jury rigged CI pipeline
<graphitemaster>
How would you detect an integer division
<graphitemaster>
Keep in mind a compiled shader using integer division will get turned into float division
<graphitemaster>
Just running the app won't do that
<graphitemaster>
if integer divisions are so bad they should be removed from the shading language
<graphitemaster>
at the very least there should be an #extension I can require which turns them into errors
<mareko>
AMD has precise integer division, use that
<graphitemaster>
AMD is the one platform (on Windows) that gets integer divisions in GLSL wrong the most :P
<jekstrand>
Intel does too
<jekstrand>
Intel even has a "real" integer divide instruction
<alyssa>
butwhy.gif
<graphitemaster>
how does one get precise integer division
<mareko>
via Mesa
<alyssa>
soul sale
<jekstrand>
Clearly, you need Mesa on Windows. Problem solved. :-P
<alyssa>
that but unironically
<icecream95>
graphitemaster: while (a > 0) { a -= b; ++c; }
<icecream95>
:P
<alyssa>
O(a) perf woot!
gawin has joined #dri-devel
<alyssa>
actually not O(a)
<graphitemaster>
I just need the people creating shading languages and graphics drivers to agree on something as basic as what way integer division rounds because it's totally unusable otherwise, imagine it rounds wrong and that value accesses out of bounds ...
<alyssa>
unbounded and incorrect
<alyssa>
icecream95: Your excellent algorithm has a bug, it hangs if b = 0
<icecream95>
Or you could try subtracting shifted versions of b, like armv7 software division
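[The shift-subtract scheme icecream95 mentions is O(bits) rather than O(a), and it also fixes the b == 0 hang alyssa spotted in the naive loop. A sketch of restoring shift-subtract division, in the spirit of ARMv7 software division routines:]

```python
def udiv(a, b):
    # Unsigned shift-subtract division: returns (quotient, remainder).
    # Unlike the naive repeated-subtraction loop, this refuses b == 0
    # instead of hanging.
    if b == 0:
        raise ZeroDivisionError("division by zero")
    q = 0
    shift = 0
    # Align the divisor's highest set bit with the dividend's.
    while (b << (shift + 1)) <= a:
        shift += 1
    # Subtract shifted copies of b, setting one quotient bit per step.
    for i in range(shift, -1, -1):
        if a >= (b << i):
            a -= b << i
            q |= 1 << i
    return q, a
```

[Usage: `udiv(100, 7)` yields `(14, 2)`. Hardware and fixed-width software versions bound the loop at the word size (32 iterations for a 32-bit divide) instead of scanning for the top bit.]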
<jekstrand>
graphitemaster: Are you sure it isn't specified?
<graphitemaster>
it isn't!
<jekstrand>
graphitemaster: You may just be hitting AMD driver bugs
<graphitemaster>
all the gl specs leave the rounding direction unspecified
<jekstrand>
graphitemaster: Feel free to file a SPIR-V spec bug if you want
<graphitemaster>
jekstrand, Yeah it's literally unusable, also all the translation layers which map Vulkan to DX12, or DX12 to Vulkan or Vulkan to Metal have silent bugs because DX actually _defines_ the rounding mode as does Metal
<jekstrand>
graphitemaster: That's probably more useful than GLSL and will likely end up with people looking at the GLSL spec too
<jekstrand>
graphitemaster: I suspect the intention in GLSL and SPIR-V was round-towards-zero
<jekstrand>
i.e. C integer division
<jekstrand>
But they just didn't bother to think that people would interpret it wrong
<graphitemaster>
jekstrand, The issue is multi-faceted because the rounding direction differs depending on the sign as well and nothing is consistent there. Also C defines % in such a way that it's consistent with division, i.e (a/b)*b+a%b=a
<graphitemaster>
And GLSL definitely does not honor C there at all either
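[For reference, the C rule graphitemaster cites is truncation toward zero, with `%` defined so that `(a/b)*b + a%b == a` always holds. A small Python model of that convention (note Python's own `//` floors instead, which is yet a third behavior):]

```python
def c_div(a, b):
    # C99-style division: truncate toward zero.
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def c_mod(a, b):
    # C99 %: defined via the identity (a/b)*b + a%b == a,
    # so the result takes the sign of the dividend.
    return a - c_div(a, b) * b
```

[Usage: `c_div(-7, 2)` is `-3` and `c_mod(-7, 2)` is `-1`, whereas Python's floor-based `-7 // 2` is `-4` with `-7 % 2 == 1`. Each convention satisfies the identity internally; mixing them silently, as the translation layers do, is what breaks.]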
<jekstrand>
graphitemaster: SPIR-V does have OpSRem and OpSMod which do care about sign, sort-of
<jekstrand>
But division going all sorts of weird ways seems wrong
<jekstrand>
Worth filing an issue. Lets people know that there are devs out there suffering
<graphitemaster>
I worked out after experimenting that AMD's division works differently for positive numbers than it does for negative ones, paste incoming
<jekstrand>
It's ok. I don't mind you complaining here. #dri-devel is mostly for shit-posting, after all.
<jekstrand>
And I could file the bug for you but I think it'll get more of the right attention if it's filed by someone external who's struggling with driver inconsistency in the real world than if I file something about how the spec is unclear.
<graphitemaster>
My assumption is something as basic (and fundamental) as a bug in integer division wouldn't be something that requires that. I mean it shouldn't even be broken, how has this not been caught by tests.
<graphitemaster>
Should need an advocacy group or presentation titled "Making division work in 2022"
<graphitemaster>
s/Should/Shouldn't
<alyssa>
graphitemaster: there are 2^64 test cases for idiv32
<graphitemaster>
You only need two test the edge cases :P
<graphitemaster>
s/two/too
<alyssa>
If an implementation is right on 99% of inputs, it can easily be missed by random sampling (which the CTS does lots of)
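[Exhaustive testing of a 32-bit divide really is 2^64 operand pairs, but the rounding and overflow bugs cluster at sign and magnitude boundaries, so a tiny grid of edge operands already exercises the interesting paths. A hedged sketch modeling C-style truncating int32 division (division by zero is left out, since hardware conventions there genuinely differ):]

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def idiv32(a, b):
    # Model of 32-bit truncating (C-style) signed division with
    # two's-complement wraparound on overflow (INT32_MIN / -1).
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return (q + 2**31) % 2**32 - 2**31  # wrap into int32 range

# Edge-operand grid: far cheaper than 2^64 cases, hits the
# sign-boundary and overflow behavior random sampling tends to miss.
edges = [INT32_MIN, INT32_MIN + 1, -2, -1, 1, 2, INT32_MAX]
```

[Checking an implementation against this grid would catch both the truncate-vs-floor mismatch on mixed-sign inputs and the INT32_MIN / -1 overflow case.]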
<graphitemaster>
But it's not right on 99% of inputs, it's right on 50% of inputs since 50% of pairs of numbers will round differently
thellstrom has quit [Ping timeout: 480 seconds]
nchery is now known as Guest182
Guest182 has quit [Read error: Connection reset by peer]