alanc has quit [Remote host closed the connection]
benjamin1 has joined #dri-devel
alanc has joined #dri-devel
bgs has quit [Remote host closed the connection]
shoragan has joined #dri-devel
fab has joined #dri-devel
sghuge has quit [Remote host closed the connection]
sghuge has joined #dri-devel
<dolphin>
airlied, agd5f: I was grepping for some register values from i915 and xe using "drivers/gpu/drm" as a path and came across "drivers/gpu/drm/amd/pm/powerplay/inc/polaris10_pwrvirus.h". Is there some backstory for the file? It seems pretty much like a binary blob in the source to me. Maybe it should be in the firmware repo instead?
djbw_ has quit [Remote host closed the connection]
<airlied>
dolphin: not sure what it is used for, but it's just a register programming sequence in a table, not a binary
<dolphin>
pwr_virus_section3 too?
<dolphin>
seems like a blob to me for sure
jkrzyszt has joined #dri-devel
benjamin1 has quit [Ping timeout: 480 seconds]
tursulin has joined #dri-devel
lemonzest has quit [Quit: WeeChat 3.6]
bmodem1 has joined #dri-devel
benjamin1 has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
<airlied>
good question on what it is programming into the hw
<airlied>
agd5f: ^
lemonzest has joined #dri-devel
<airlied>
it used to be just a long sequence of reg writes, but maybe it's writing some microcode
karolherbst_ is now known as karolherbst
<MrCooper>
the name seems clear, it's a power virus ;)
sgruszka has joined #dri-devel
benjamin1 has quit [Ping timeout: 480 seconds]
lynxeye has joined #dri-devel
jkrzyszt has quit [Remote host closed the connection]
YuGiOhJCJ has quit [Ping timeout: 480 seconds]
YuGiOhJCJ has joined #dri-devel
Arsen_ has quit []
Arsen has joined #dri-devel
benjamin1 has joined #dri-devel
itoral has quit [Ping timeout: 480 seconds]
itoral has joined #dri-devel
heat has quit [Remote host closed the connection]
heat has joined #dri-devel
<tzimmermann>
section3 unleashed the power virus!
sarahwalker has joined #dri-devel
pochu has joined #dri-devel
Surkow|laptop has quit [Quit: 418 I'm a teapot - NOP NOP NOP]
Surkow|laptop has joined #dri-devel
benjamin1 has quit [Ping timeout: 480 seconds]
benjamin1 has joined #dri-devel
elongbug has joined #dri-devel
FireBurn has quit [Quit: Konversation terminated!]
<enunes>
mupuf: still not great it seems, maybe it's better to disable it for today, and I can also take the downtime to do some pending updates on it
<mupuf>
enunes: yeah, sounds like a good idea
<enunes>
I can send a MR for it if you don't have one ready yet
benjamin1 has quit [Ping timeout: 480 seconds]
f11f12 has quit [Quit: Leaving]
<mupuf>
enunes: please do :)
<alyssa>
stop doing amdgpu
<alyssa>
power viruses were never meant to be programmed
<alyssa>
wanted to amdgpu anyway? we had a tool for that. it was called r200.
lodmoas^ has quit [Remote host closed the connection]
<mupuf>
alyssa: rofl
benjamin1 has joined #dri-devel
heat has quit [Read error: Connection reset by peer]
heat has joined #dri-devel
benjamin1 has quit [Ping timeout: 480 seconds]
Leopold_ has quit [Remote host closed the connection]
Leopold has joined #dri-devel
benjamin1 has joined #dri-devel
MrCooper has quit [Remote host closed the connection]
Dr_Who has joined #dri-devel
fab has quit [Quit: fab]
fab has joined #dri-devel
MrCooper has joined #dri-devel
<alyssa>
haswell is... crocus? or iris? or both?
<alyssa>
seemingly crocus?
<alyssa>
like mostly sure crocus, cool
<alyssa>
unfortunately crocus doesn't build on arm64. errrrg
<alyssa>
that's fine, I didn't want to read Intel assembly anyway
illwieckz has quit [Quit: I'll be back!]
swalker_ has joined #dri-devel
swalker_ is now known as Guest2983
sarahwalker has quit [Remote host closed the connection]
benjamin1 has quit [Ping timeout: 480 seconds]
<kisak>
Haswell is Intel gen 7.5, yes, that's crocus.
illwieckz has joined #dri-devel
benjamin1 has joined #dri-devel
heat has quit [Read error: Connection reset by peer]
heat has joined #dri-devel
Guest2983 has quit [Ping timeout: 480 seconds]
benjamin1 has quit [Ping timeout: 480 seconds]
<q66>
<alyssa> stop doing amdgpu
<q66>
i wish
<q66>
maybe when intel upstreams xe kmd so i can actually use it on non-x86 hardware
rasterman has quit [Quit: Gettin' stinky!]
yuq825 has quit []
<alyssa>
gfxstrand: this is odd.. the shader you sent me is actually *helped* for instruction count on midgard
<alyssa>
and we go deeper into the rabbit hole..
djbw_ has joined #dri-devel
<alyssa>
i do not understand haswell vec4 asm
benjamin1 has joined #dri-devel
<jfalempe>
tzimmermann, did you have a chance to look at my mgag200 DMA v2 patches ?
<tzimmermann>
jfalempe, sorry not yet
<tzimmermann>
it's been busy recently :(
<alyssa>
well, I can reproduce the shaderdb change now. moo.
<jfalempe>
tzimmermann, ok, no problem, let me know if it can still be improved ;)
heat has quit [Read error: Connection reset by peer]
heat has joined #dri-devel
<alyssa>
how is virgl still failing
<agd5f>
dolphin, airlied, it's used to tune the voltage/frequency curve on individual boards. IIRC, it's not firmware. It's some sort of pattern data sent to power validation hardware, which runs tests with the patterns; the results of those tests are then used to tune the curve on each board so it's stable across varying silicon. I don't remember all of the details offhand.
kzd has joined #dri-devel
fdu_ is now known as fdu
fab has quit [Quit: fab]
pochu has quit [Quit: leaving]
mlankhor1t is now known as mlankhorst
<alyssa>
ValueError: could not convert string to float: 'top-down'
<alyssa>
How do I do Intel shader-db reports?
bmodem1 has quit [Ping timeout: 480 seconds]
jewins has joined #dri-devel
<alyssa>
did a grep abomination
<alyssa>
gfxstrand: reworked lower_vec_to_regs, haswell vertex shaders on my shader-db seem happy https://rosenzweig.io/lol.txt
<alyssa>
not actually runtime tested though so could be totally broken, but you know
<alyssa>
nothing else in the marge queue triggers lima jobs so yes that should do the trick
<alyssa>
definitely silly that farm disabling runs full premerge CI on every other farm though..
<alyssa>
mupuf: could the FARM online/offline boolean live in src/vendor/ci/.yml instead, to avoid that silliness?
<alyssa>
It loses the niceness of the bools for every farm being together but, meh?
<alyssa>
DavidHeidelberg[m]: ^^
fab has joined #dri-devel
<jenatali>
Some of them can, but some (e.g. the "Microsoft" farm aka the Windows builder/runner) can't
<alyssa>
jenatali: Hm
<alyssa>
Why not?
<alyssa>
Why can't that be in src/microsoft/?
<jenatali>
I mean, I guess they could, but it wouldn't make sense
<jenatali>
If it was just the test runners then sure, but it's also the build runner
<alyssa>
src/microsoft/ci/gitlab-ci.yml contains the gitlab-ci yaml source for microsoft's CI
<alyssa>
makes perfect sense to me?
<alyssa>
sure it's morally a backronym, but
<jenatali>
I think we're the only one in that situation though where our "farm" is more than just test runners
* alyssa
shrug
<alyssa>
I'm thinking even just from a psychological perspective
<DavidHeidelberg[m]>
alyssa: yes, yes, I was thinking about that last week, moving it into ci-farms.yml in the root
<alyssa>
Days where a farm needs to be disabled are days where people are already stressed (from jobs failing from the farm that needs to be taken down)
<DavidHeidelberg[m]>
to avoid triggering the whole farm's CI when disabling. For enabling, some .gitlab-ci.yml should then be touched to check all the jobs (because they haven't been tested while the farm was off)
<DavidHeidelberg[m]>
I was thinking about how to make it automagical. Off without a pipeline, on with a pipeline.
<jenatali>
That would be excellent
<alyssa>
..adding a 30 min stall in there where nothing gets merged when people are already stressed is, suboptimal
<alyssa>
DavidHeidelberg[m]: eyes
<DavidHeidelberg[m]>
alyssa: haha, yeah....
<alyssa>
("people" includes both the users of CI and the maintainers of it, I imagine)
<DavidHeidelberg[m]>
2x yeah...
<alyssa>
DavidHeidelberg[m]: If you're working in that area, the other question is how the farm disable commit should actually get merged
<alyssa>
In particular, if there are a pile of MRs already assigned to marge and some of them would trigger jobs on the broken farm
<alyssa>
the "unassign everything, assign disable MR, reassign everything" manual dance is silly
<alyssa>
the "just assign to the end" means a day's marge queue is wasted
<DavidHeidelberg[m]>
I was thinking also about doing - include: farms.yml@different-repo
<DavidHeidelberg[m]>
then we could alter it without unassigning marge
<DavidHeidelberg[m]>
but it's inconvenient to go to another repo for that
<alyssa>
and the "push directly, one MR will get shot down but it was going to fail and the next MR will be fine" technique is socially undesired
<enunes>
I wonder if we considered having something at the runner side that the CI scripts would check to see if the job needs to run at all
<alyssa>
How would that cross-repo include work to make sure reenabling has a pipeline?
<enunes>
so an admin with access to the runner could flip a switch there and the job would just skip
<jenatali>
David Heidelberg: What if we had both? Then you could push disables to a separate repo, while enqueueing a secondary disable to mesa (sequenced behind all other MRs). Re-enabling then touches the mesa repo too
<enunes>
so we wouldn't need to merge "set the lab to offline" commits at all
<alyssa>
enunes: That has the usual problem that, when the farm is back and flipped back on, random unrelated pipelines will start failing if anything regressed while offline
<DavidHeidelberg[m]>
yes yes, a full pipeline (or at least all pipelines on the related farm) needs to be run
<DavidHeidelberg[m]>
but before that, the enable phase could do something like "all farms off, except the one that gets enabled"
<alyssa>
DavidHeidelberg[m]: I always kinda wondered if we could have a monotonically increasing integer "death_count" on each farm
tzimmermann has quit [Quit: Leaving]
* DavidHeidelberg[m]
wonders if he should put a CO2 meter badge on our Mesa3D CI farm :D
<alyssa>
The mesa/mesa rules would hardcode a check "if death_count <= 27: run pipeline, else skip"
<alyssa>
When a farm dies, the admin increases the integer on the farm side to 28. So now everything is skipped.
<alyssa>
When the farm is back, the admin needs to MR against mesa/mesa changing the check to "death_count <= 28", going through the regular pipeline
<alyssa>
and then presumably the actual check logic is nicely abstracted in the yaml/bash halls of hell, so the actual mesa/mesa side is just the usual 1 line commit "LIMA_FARM: 27" -> "LIMA_FARM: 28" or whatever
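(A minimal Python sketch of the death_count scheme described above; the function and variable names are hypothetical, and the real check would live in the yaml/bash rules rather than Python:)

    def should_run_farm_jobs(farm_death_count: int, mesa_threshold: int) -> bool:
        """Run a farm's jobs only while the farm-side counter has not
        been bumped past the threshold hardcoded in mesa/mesa."""
        return farm_death_count <= mesa_threshold

    assert should_run_farm_jobs(27, 27)      # farm healthy: jobs run
    assert not should_run_farm_jobs(28, 27)  # farm admin bumped to 28: jobs skipped
    assert should_run_farm_jobs(28, 28)      # re-enable MR bumps mesa/mesa to 28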
<alyssa>
this probably has some weird side effects for running CI on the stable branches
<alyssa>
but that's wholly uncommon so maybe stable branch would just be exempted as part of the branching off a release process
<alyssa>
eric_engestrom: ^^ you'd be affected by that if you want to tell me why I'm being silly and this is a terrible idea actually =D
kts has quit [Remote host closed the connection]
<DavidHeidelberg[m]>
when you release into stable, you SHOULD always wait for all farms to be ready to test
tonyk5 is now known as tonyk
<alyssa>
right, yeah
<DavidHeidelberg[m]>
you don't want to release something which had half of the testing off :P
<alyssa>
=D
<alyssa>
DavidHeidelberg[m]: by the way, it's not clear to me if the CI itself has gotten better lately (vs my habits for using CI having changed, vs my social engineering of myself with the appreciation report working) ... but my perceived CI signal:noise ratio is a LOT higher than it used to be
<alyssa>
so thank you CI team ^^
<DavidHeidelberg[m]>
sergi: gallo koike ^ :)
<DavidHeidelberg[m]>
it got a bit better I think
<jenatali>
+1
<mupuf>
alyssa: it's definitely better
<jenatali>
I merged a nir change yesterday on the first try, that made me happy
<mupuf>
I guess fewer big uprevs too?
<gfxstrand>
\o/
<DavidHeidelberg[m]>
:D we trained developers to be happy even when the stuff merges :D
* DavidHeidelberg[m]
laughing his ass off
* mupuf
ran a stress test of 1000+ jobs and got a failure rate of 0.5%
<DavidHeidelberg[m]>
koike wrote really nice reporting, so at some point, when we added the most offending flakes that kept showing up from time to time, reliability increased. It was a million flakes, each taking a hit only once in a while, but... summed up, it was almost every job
<mupuf>
Ran for 4 days continuously, on three steam decks
<gfxstrand>
mupuf: Is that across the entire CI or one runner?
<mupuf>
Just the steam deck runners
<mupuf>
At my home
<alyssa>
mupuf: that's kinda exactly the problem though
<alyssa>
with 100 jobs with a failure rate of 0.5%, we'd expect 40% of pipelines to fail if retry isn't enabled
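(As a quick sanity check of that arithmetic, a minimal Python sketch; the 0.5% per-job rate comes from mupuf's stress test above, and the 100-job pipeline size is alyssa's assumption:)

    # Probability that at least one of n independent jobs fails.
    p_job_fail = 0.005        # per-job failure rate from the stress test
    jobs_per_pipeline = 100   # assumed pipeline size
    p_pipeline_fail = 1 - (1 - p_job_fail) ** jobs_per_pipeline
    print(f"{p_pipeline_fail:.1%}")  # 39.4%, i.e. roughly the 40% quoted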
<mupuf>
Oh yeah, of course! I wasn't satisfied with it
<mupuf>
There is more work needed
<mupuf>
But distributed test farms over the internet are harder to make super reliable
<alyssa>
(That effect is presumably what the daily reports show every day... 99% of jobs passing but half of pipelines failing)
<mupuf>
Yep... But retries aren't the solution either: we need to retry only on infra failures
<alyssa>
(I appreciate the difficulty of the problem. I am not going to attempt to find solutions because I am chastised every time I try. But I do know arithmetic.)
<mupuf>
That's doable, I guess, but it requires some work on Marge to detect that we got to the point where actual code was tested
* mupuf
will replace gitlab runner soon-ish
<mupuf>
Should speed up the startup sequence, reduce the number of moving parts, and allow me to add more network resiliency
<daniels>
I don't like the retries either, but realistically, given that people just hit marge with a hammer until a merge occurs, it's better to have those retries followed by an automated script that goes around finding what the flakes are and auto-merging them into the expectations
<mupuf>
True
<mupuf>
The most important asset is developer's trust
<mupuf>
Without it, the system is fully useless
gcarlos57 has quit []
gcarlos has joined #dri-devel
alyssa has left #dri-devel [#dri-devel]
Duke`` has quit [Ping timeout: 480 seconds]
Duke`` has joined #dri-devel
kzd has quit [Quit: kzd]
tursulin has quit [Ping timeout: 480 seconds]
kzd has joined #dri-devel
<DavidHeidelberg[m]>
I'm thinking about the farm ON/OFF logic: 1. definition: a .ci-farms/ directory with a .ci-farms/$farm_name file per farm; 2. execution: if .ci-farms/$farm_name changed: always run that farm's jobs; 3. if .ci-farms/ changed: never run; otherwise, if .ci-farms/$farm_name exists: always run
<DavidHeidelberg[m]>
so, if we enable a farm, it gets run (the $farm_name file exists now, so that's a change)
<DavidHeidelberg[m]>
other farms won't run, because of 3.: .ci-farms/ changed
<DavidHeidelberg[m]>
in the normal state (without changes), the last option applies: .ci-farms/$farm_name exists, so it'll run
<DavidHeidelberg[m]>
I'll need to test it a bit, but "could work"
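(The three rules above, modeled as a small Python sketch; the path layout follows the .ci-farms/ proposal and the function name is illustrative:)

    import os

    def farm_jobs_should_run(farm: str, changed_paths: list[str]) -> bool:
        """Decide whether a farm's jobs run under the .ci-farms/ scheme."""
        farm_file = f".ci-farms/{farm}"
        # Rule 2: this farm's own file changed (e.g. it was just re-enabled).
        if farm_file in changed_paths:
            return True
        # Rule 3, first half: some other farm's file changed; skip this one.
        if any(p.startswith(".ci-farms/") for p in changed_paths):
            return False
        # Rule 3, second half: normal state; run iff the farm is enabled.
        return os.path.exists(farm_file)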
<DavidHeidelberg[m]>
for some reason the austriancoder farm jobs pop up even when the farm is not enabled, but I'll look into it
<austriancoder>
funny
DPA has quit [Ping timeout: 480 seconds]
DPA has joined #dri-devel
heat_ has joined #dri-devel
heat has quit [Read error: Connection reset by peer]
<mupuf>
DavidHeidelberg[m]: I would suggest renaming to $farm.disabled
smiles_ has quit [Ping timeout: 480 seconds]
<mupuf>
This way, we don't typo the name when adding it back :D
* mupuf
needs to split valve farm into two: KWS and mupuf
<DavidHeidelberg[m]>
Sure, maybe moving to .ci-farms-disabled?
<mupuf>
Oh, better!
<DavidHeidelberg[m]>
(I think the CI syntax would get more complicated when filtering the .disabled files)
<DavidHeidelberg[m]>
(damn, I wish we could use Matrix or something. I love retroactively fixing my typos)
sgruszka has quit [Remote host closed the connection]
heat__ has joined #dri-devel
heat_ has quit [Read error: Connection reset by peer]
rasterman has joined #dri-devel
alyssa has joined #dri-devel
<alyssa>
gfxstrand: I am questioning whether aggressive vec_to_moves/regs is a good idea
<alyssa>
it eliminates moves, sure
<alyssa>
but it also spikes register demand (-->spilling) since it means random scalars that eventually get collected into a short-lived vector become channels of a long-lived vector
<alyssa>
and none of the vec4 backends can split live ranges in their RA
<alyssa>
at least this is the case with midgard
<alyssa>
maybe intel/vec4 skirts around that somehow
<gfxstrand>
alyssa: I dropped you more shader-db stats for commit messages. I think I'm happy with the impact. It looks like most of the noise from the series as a whole comes from reworking things like PTN and TTN to not depend on writemasks. IMO, that's totally fine.
vliaskov has quit [Remote host closed the connection]
<gfxstrand>
alyssa: vec_to_movs pushed writes up. You weren't doing that before, and that led to a bunch of minor regressions. Now that the new pass is also pushing stuff up, the regressions are gone.
<gfxstrand>
Oh, one other thing RE register pressure. The Intel vec4 back-end doesn't RA per-component. Every NIR def/reg gets a whole vec4 whether it needs it or not (or 2 if it's a dvec3/4).
<gfxstrand>
Intel vec4 is dumb...
<alyssa>
doh
<alyssa>
yeah, that'd do it then
<alyssa>
Midgard allocates registers at byte-granularity, with full per byte liveness tracking
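(To make the register-pressure contrast concrete, a small illustrative Python sketch; the value sizes are made up, and real allocators also deal with alignment and live ranges:)

    # Sizes in bytes of simultaneously live values: three f32 scalars,
    # one vec2, one vec4.
    live_values = [4, 4, 4, 8, 16]
    VEC4_BYTES = 16

    # Intel vec4-style RA: every def occupies a whole vec4 register.
    intel_vec4_regs = len(live_values)

    # Byte-granular RA (Midgard-style): values can share registers, so
    # demand is just total bytes rounded up to whole registers.
    midgard_regs = -(-sum(live_values) // VEC4_BYTES)  # ceil division

    print(intel_vec4_regs, midgard_regs)  # 5 vs 3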
<gfxstrand>
Yeah, Intel vec4 is dumb
<alyssa>
oh the midgard compiler is dumb in a lot of ways
<gfxstrand>
But, hey, it supports tessellation shaders with FP64 so...
<alyssa>
but it makes excellent use of the register file
<alyssa>
(by bruteforce, mostly)
<gfxstrand>
hehe
<alyssa>
I fondly remember you marking up that paper when I was in first year and not being able to finish it :~)
<alyssa>
also a comment to the effect of "too much detail, this isn't homework"
<gfxstrand>
lol
<gfxstrand>
That may have been a thing past me said
<jenatali>
Huh, Intel's Windows Vulkan driver apparently doesn't display anything on monitors that aren't directly connected to it. That's fun
<gfxstrand>
Color me unsurprised.
<zmike>
what color is that
benjaminl has quit [Ping timeout: 480 seconds]
eyearesee has left #dri-devel [#dri-devel]
benjaminl has joined #dri-devel
elongbug has quit [Remote host closed the connection]
elongbug has joined #dri-devel
oneforall2 has quit [Remote host closed the connection]
benjaminl has quit [Ping timeout: 480 seconds]
benjaminl has joined #dri-devel
oneforall2 has joined #dri-devel
alyssa has left #dri-devel [#dri-devel]
benjamin1 has joined #dri-devel
benjaminl has quit [Ping timeout: 480 seconds]
elongbug has quit [Remote host closed the connection]
elongbug has joined #dri-devel
enunes has joined #dri-devel
kzd has quit [Quit: kzd]
benjamin1 has quit [Quit: WeeChat 3.8]
benjaminl has joined #dri-devel
elongbug has quit [Remote host closed the connection]
elongbug has joined #dri-devel
jewins has quit [Remote host closed the connection]
jewins has joined #dri-devel
fab has quit [Quit: fab]
Duke`` has quit [Ping timeout: 480 seconds]
sima has quit [Ping timeout: 480 seconds]
everfree has joined #dri-devel
rasterman has quit [Quit: Gettin' stinky!]
ngcortes has joined #dri-devel
<karolherbst>
jenatali: I'm sure you are the first and only person hitting this issue
kzd has joined #dri-devel
<DavidHeidelberg[m]>
jenatali: did you trigger the Windows build in my MR? :D I'm just not sure if I broke the rules or you just wanted to test that it works
<jenatali>
David Heidelberg: I didn't trigger anything
<DavidHeidelberg[m]>
damn, ok I'll check the pipeline
<jenatali>
David Heidelberg: I think the problem is that the container builds can't use the same rules as the build/test jobs
<jenatali>
The containers are supposed to be auto for Marge / post-merge, manual everywhere else
<DavidHeidelberg[m]>
jenatali: I have the fix
<jenatali>
The build and test jobs are supposed to be auto all the time, and it's just their dependency on the containers that keeps them from running
<jenatali>
Ok cool
<DavidHeidelberg[m]>
jenatali: the trigger container you have depends on the Win farm devices. And you define it, so when I say "if changed, go `always`", it... always goes `always` :D
<DavidHeidelberg[m]>
by changing the reference for the trigger job to farm-manual-..
<jenatali>
That'll break marge
<jenatali>
It can't be always manual
<jenatali>
Well, probably anyway. I don't know enough about all of this :)
<DavidHeidelberg[m]>
thanks, you're right.
<jenatali>
Hence my original comments about our stuff being special because our "farm" being offline isn't just some tests to skip
<DavidHeidelberg[m]>
jenatali: I copy-pasted the container, adjusted it, and added the rest of the MS rules
<DavidHeidelberg[m]>
the offending rules never get executed if we go through the container. I'm just thinking whether I should move .container into .gitlab-ci/test-source-dep.yml to have it in the same file
JohnnyonFlame has quit [Ping timeout: 480 seconds]
<DavidHeidelberg[m]>
just one unintentional change, which I kind of LIKE. ... When you re-enable a farm, it runs ALL the jobs (even the manual ones). I kinda like it, because in these scenarios we had to wait until the nightly runs to see new flakes/fails/successes, but now it gets fixed at the re-enabling phase
JohnnyonFlame has joined #dri-devel
Haaninjo has quit [Quit: Ex-Chat]
adavy has quit [Ping timeout: 480 seconds]
iive has quit [Quit: They came for me...]
kzd has quit [Quit: kzd]
JohnnyonFlame has quit [Ping timeout: 480 seconds]