gouchi has quit [Remote host closed the connection]
kts has joined #dri-devel
kts has quit [Quit: Leaving]
tobiasjakobi has joined #dri-devel
warpme_____ has quit []
paulk has quit [Ping timeout: 480 seconds]
paulk has joined #dri-devel
freemint has joined #dri-devel
<freemint>
I got a PCIe device whose region 0 is supposed to be 64G big but it is only 128M big on the machine according to lspci -vvv. On a machine where the PCIe device works lspci does not mention Resizable BAR. Is: https://pastebin.com/5agyjmtv Should be: https://pastebin.com/us1gsHdn The vendor does not officially support my older processor and some error analysis tool by the vendor suspects a PCIe related error. Do you have any
<freemint>
suggestions how i can begin to narrow down this issue?
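One low-effort starting point is to compare what the kernel actually assigned to each BAR on the failing and the working machine. The sketch below is a minimal example, assuming the standard sysfs `resource` file for a PCI device (one `start end flags` triple in hex per line); the helper names `parse_resource` and `bar_sizes` are made up for illustration and are not part of any tool mentioned here.

```python
from pathlib import Path


def parse_resource(text):
    """Parse the contents of a sysfs PCI 'resource' file.

    Each line holds three hex numbers: start, end, flags.
    Returns a list of (index, start, size_in_bytes, flags) for every
    line that describes a populated region (unused lines are all zero).
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not a start/end/flags triple
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # region not assigned
        regions.append((index, start, end - start + 1, flags))
    return regions


def bar_sizes(bdf):
    """Read the regions of one device, e.g. bdf='0000:01:00.0' (hypothetical)."""
    path = Path("/sys/bus/pci/devices") / bdf / "resource"
    return parse_resource(path.read_text())
```

Calling `bar_sizes()` with the device address that `lspci -D` prints on both machines should show whether region 0 is 128 MiB or 64 GiB, which is the same information `lspci -vvv` reports as `[size=...]`.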
kts has joined #dri-devel
bmodem has quit [Ping timeout: 480 seconds]
warpme_____ has joined #dri-devel
<Mis012[m]>
I'd assume you don't need to resize the BAR if the space dedicated for that device was large enough already, but that seems unlikely
<Mis012[m]>
and presumably the BIOS (or ACPI code) would be the one dealing with that, because that's totally not something that doesn't belong there
<Mis012[m]>
just write a Linux driver for the root bridge instead of letting bios/acpi manage it :P
<freemint>
that double negation confuses. If i am in the wrong channel i welcome pointers to more appropriate channels.
<freemint>
Mis012[m] your advice is not actionable for me as i do not know where to start with that. If this is humor it went over my head.
<Mis012[m]>
it's not necessarily on-topic in this channel, but I'm not sure there's a better place so hopefully someone here can help you
<Mis012[m]>
my point was that sadly, this stuff seems to be handled by bios/acpi, rather than by Linux (which in my personal opinion would be saner)
<Mis012[m]>
worst case you can ask on LKML
kts has quit [Quit: Leaving]
<Mis012[m]>
freemint: it seems confusing to say that a generation of SoCs has a bug if that bug is actually in the ACPI tables the motherboard supplies
<freemint>
The PCIe device in question has an open source kernel driver https://github.com/veos-sxarr-NEC/ve_drv-kmod however the bring up involves a proprietary userland binary. This bring up fails. I hope that Linux kernel development tooling could at least bring some visibility into this problem.
<freemint>
Mis012[m], Sure that was phrased badly. Let me rephrase.
<Mis012[m]>
so I gather the kernel driver is not exactly upstream?
<Mis012[m]>
would probably not fly with hard dependency on magic userspace binary
<freemint>
The kernel driver is out of tree but open source.
<Mis012[m]>
well, you could also call certain downstream drivers open source, but at some point they are not drivers but rather condoms for talking to the userspace code
<freemint>
Mis012[m], I would not either but i just got the card powered up this week and ... i am still far from replacing that magic binary
<Mis012[m]>
so it's certainly important to evaluate if it's a driver or a "driver"
<Mis012[m]>
freemint: do you have some large VRAM GPU that supports having its full VRAM mapped?
<Mis012[m]>
if you could reproduce the issue with that, it would sure be easier to debug
<freemint>
I only have a 960 handy which i do not think qualifies.
<Mis012[m]>
well, how much VRAM it has is presumably documented :P
<Mis012[m]>
whether you can make it map it directly might not be
<freemint>
Resizable BAR and above 4G support are things PCIe 3.0 motherboards should support. Yet many or all motherboards with AMI BIOSes of this and also some later generations come without ReBAR support and with broken above 4G support.
<freemint>
The GTX 960 has 4 GB of RAM which might qualify as large. By default it maps 256M in one of its regions.
epoll has quit [Ping timeout: 480 seconds]
<freemint>
It looks like NVIDIA has to provide an explicit firmware update for Resizable BAR support, which they did not do for a generation this old, AFAIK.
rcf has quit [Ping timeout: 480 seconds]
<Mis012[m]>
>Copied from drivers/pci/pci.c
<Mis012[m]>
wow, that's some quality code right there
<Mis012[m]>
could maybe explain why they thought that was a good idea
* freemint
is confused whether copying and adapting from working things is a bad thing.
<Mis012[m]>
it's a layering violation
<Mis012[m]>
so they should at least explain why they thought it was appropriate
<Mis012[m]>
drivers/pci/pci.c is not a driver for a PCI(E) device
<Mis012[m]>
it's the framework this driver is using to do pci(e) stuff
epoll has joined #dri-devel
* freemint
nods
<freemint>
Is there a way to listen to what dev_err gets called with or if it gets called?
maxzor has joined #dri-devel
<Mis012[m]>
dev_err should print stuff to kernel log
<Mis012[m]>
so it ought to be quite obvious if it was called xD
<Mis012[m]>
it's literally a logging function
<Mis012[m]>
Region 0: Memory at e8000000 (64-bit, prefetchable) [size=128M]
<Mis012[m]>
this is certainly below 4G
<Mis012[m]>
but why would it not be if it fits in the pre-allocated BAR space
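The "below 4G" observation can be checked mechanically from the `lspci -vvv` text. This is a minimal sketch, assuming the `Region N: Memory at ... [size=...]` line format quoted above; `parse_region` is a made-up helper name, and the regular expression only covers memory BARs.

```python
import re

# Matches lines such as:
#   Region 0: Memory at e8000000 (64-bit, prefetchable) [size=128M]
REGION_RE = re.compile(
    r"Region (\d+): Memory at ([0-9a-f]+) "
    r"\((32|64)-bit, (non-)?prefetchable\).*?\[size=(\d+)([KMGT]?)\]"
)

UNIT = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
FOUR_GIB = 2**32


def parse_region(line):
    """Return a dict describing one memory BAR line, or None if no match."""
    match = REGION_RE.search(line)
    if match is None:
        return None
    index, address, width, non_prefetch, number, unit = match.groups()
    base = int(address, 16)
    size = int(number) * UNIT[unit]
    return {
        "index": int(index),
        "base": base,
        "size": size,
        "is_64bit": width == "64",
        "prefetchable": non_prefetch is None,
        # True when the whole window ends at or below the 4 GiB boundary
        "below_4g": base + size <= FOUR_GIB,
    }
```

For the line quoted above this reports a 128 MiB, 64-bit, prefetchable window that ends below 4 GiB. A 64 GiB region could not be placed there at all, so on a working machine the base address would have to be above 4 GiB.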
<freemint>
at the risk of being so dumb that i get kicked from the room: how do i find that in the logs?
<Mis012[m]>
I don't think this room kicks people out for being dumb
<Mis012[m]>
either `dmesg | grep <stuff>` or `dmesg -H` and scroll through manually
<Mis012[m]>
dev_err should prefix the message with the name of the driver
rcf has joined #dri-devel
<Mis012[m]>
actually this is the dev room, there should be a more user-friendly companion room that's supposed to be where people ask for help
<Mis012[m]>
(with non-dev issues)
<Mis012[m]>
presumably development of an upstream-quality driver that doesn't need a proprietary userspace blob would be desirable, but not sure that's dri-adjacent :P
<freemint>
I am a bit worried about the binary too and if i think it is within my grasp i would like to at least document what it does. However i first got to get the card working. In half a month i might have access to a supported machine. ... But that is in half a month.
tobiasjakobi has quit []
thaytan has quit [Ping timeout: 480 seconds]
<freemint>
no dev_err in dmesg.
<Mis012[m]>
🤨
<Mis012[m]>
it doesn't print the name of the logging function
<Mis012[m]>
you can filter by severity if you only want errors
<Mis012[m]>
not like I remember which level is "error" :P
<Mis012[m]>
dmesg -l err
<Mis012[m]>
nice, proper dmesg doesn't need a number
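The same filtering can be scripted. The sketch below is a minimal example, assuming the `/dev/kmsg` record format (`priority,sequence,timestamp_us,flags;message`, where the low three bits of the priority are the log level and 3 means "err"). The function names and the `ve_drv` search string are assumptions for illustration; the actual prefix depends on how the driver registers itself.

```python
import os

LEVEL_ERR = 3  # syslog levels: 0 emerg, 1 alert, 2 crit, 3 err, 4 warning, ...


def parse_kmsg_record(record):
    """Split one /dev/kmsg record into level, facility, seq, time and text."""
    header, _, body = record.partition(";")
    fields = header.split(",")
    priority = int(fields[0])
    lines = body.splitlines()
    return {
        "level": priority & 7,       # low three bits: severity
        "facility": priority >> 3,   # remaining bits: facility
        "seq": int(fields[1]),
        "timestamp_us": int(fields[2]),
        "message": lines[0] if lines else "",  # drop KEY=value continuation lines
    }


def errors_from(records, needle):
    """Keep records at level 'err' or more severe whose text contains needle."""
    parsed = (parse_kmsg_record(record) for record in records)
    return [p for p in parsed if p["level"] <= LEVEL_ERR and needle in p["message"]]


def read_kmsg(path="/dev/kmsg"):
    """Read all currently buffered records without blocking (may need root)."""
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    records = []
    try:
        while True:
            try:
                records.append(os.read(fd, 8192).decode("utf-8", "replace"))
            except BlockingIOError:
                break      # no more buffered records
            except BrokenPipeError:
                continue   # some records were overwritten; keep reading
    finally:
        os.close(fd)
    return records
```

Something like `errors_from(read_kmsg(), "ve_drv")` would then list error-level messages mentioning the driver. Unlike `dmesg -l err`, this sketch also keeps the more severe levels (crit, alert, emerg).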
<freemint>
nope nothing from the ve driver in there.
<Mis012[m]>
I assume it enumerates as 128MB initially, then you run the userspace blob, and if that works, it magically becomes 64GB?
<Mis012[m]>
I'm not sure I saw anywhere in the "driver" that was doing that, but at the same time surely that can't be done fully from userspace
<freemint>
That is a hypothesis compatible with what i know.
<freemint>
There is an ARM chip on the PCIe device which is controlled by the user space binary and is supposedly involved in the initialisation process.
<freemint>
Do you want to stay ignorant why that exists?
caef^ has quit [Remote host closed the connection]
<Mis012[m]>
well, it sounds like they're mapping something to userspace that they probably should not
<Mis012[m]>
but I can't parse that code...
<freemint>
Okay, on these cards there are 8 fancy CPU cores. This CPU ISA has no system ring. So what they do instead is have the PCIe host interrupt the CPU cores over PCIe, reading and writing the processor's state. From a userland perspective running on the card, you are a userland binary and all system calls are handled or emulated by the host CPU. A user process has access to the normal file system and so on. This includes memory
<freemint>
protection between multiple users and i guess that part takes care that the memory protection unit on the PCIe card is in sync with memory protections on the x86_64 host running Linux.
<freemint>
or that is my best idea given my limited knowledge of linux
<Mis012[m]>
well, that's a fun architecture that I would love to exist, but realistically having the cores only connected via PCIE is way too slow for that to actually be viable :P
<Mis012[m]>
it's supposed to be an accelerator I assume?
<Mis012[m]>
won't be very fast if it context switches over PCIE
<freemint>
Actually they manage that by having the scheduling interval being 1 second.
<freemint>
It is an accelerator and that exists.
<freemint>
They do the context switches over PCIe.
<Mis012[m]>
really?
<Mis012[m]>
like really really?
<freemint>
I have it in writing from someone who works at NEC on those cards.
<Mis012[m]>
wow, I'd sure love to implement that in a generic way in mainline :P
<Mis012[m]>
I'd assume you would need to have the two cores be on the same die for this to be anywhere near viable
<freemint>
Awesome. I can give you as much remote access to the card under my desk as you want.
<Mis012[m]>
well, I do prefer to work with docs :P
<mareko>
who knows how shared-glapi works?
<freemint>
I got one more card i could lend out in theory but given how hard they are to get i am a bit cautious with sending it out.
<Mis012[m]>
freemint: schematics of the card would be nice, and fingers crossed there are docs for the SoC they plopped on there
<Mis012[m]>
ideally they wouldn't use secure boot and one could have foss fw on that as well
<freemint>
They have 8 cores with 64 16384-bit registers one can crunch on with vector operations.
<freemint>
Wait that is for the 2.0 gen let me see if i find something 1.0 although 2.0 was mostly a refresh and tweak from 1.0
<mareko>
olv__: is there any reason to have shared-glapi now that we have glvnd and only one *_dri.so file? Can shared-glapi be moved into libgallium_dri.so?
<Mis012[m]>
the "driver" did talk about some firmware
<freemint>
The small ARM chip running firmware is probably on the xiu, peu or dgu. I do not have the legend handy to see what those shorthands mean Mis012[m]
<Mis012[m]>
neither sounded like it should have fw
<Mis012[m]>
freemint: do you have the fw binary
<freemint>
I have some binary files called firmware.
<freemint>
Yeah those cards are not meant for that. I am already what the card should be able to do with that ASM code. Implementing a vector state machine using vector mask registers to parse things is not something you are gonna get the compiler to do for you.
<freemint>
Mis012[m], my looks like files can only be downloaded once. give me a sec
<DemiMarie>
freemint: my point is that CPUs are not good at this kind of code either
* freemint
nods
<freemint>
are the branches performance critical?
<Mis012[m]>
freemint: you know there are file sharing sites that are not such blatant adware that they don't even work in a browser with adblocker type stuff?
<DemiMarie>
freemint: branching is most of what the code I write does, I believe.
<DemiMarie>
I suspect the same is true for e.g. compilers.
<freemint>
Mis012[m] DGU it is then.
Haaninjo has joined #dri-devel
<Mis012[m]>
doesn't sound like it
iive has joined #dri-devel
<freemint>
Mis012[m], I was told by someone from NEC that there is an arm chip on there which is involved with device bring up and monitoring.
<freemint>
I know no more than that.
<Mis012[m]>
sad
<freemint>
Actually maybe also good.
<freemint>
If one can run code on it, the card becomes a bit more powerful. One could run a standalone OS on it and have it no longer need to do context switches over PCIe, or so is my pet theory.
<Mis012[m]>
freemint: can one run code on it thouh
<Mis012[m]>
*tough
<Mis012[m]>
*though
digetx has joined #dri-devel
<freemint>
it is firmware updateable. how much effort it takes to run code there i do not know yet.
<Mis012[m]>
well, it's not about effor
<Mis012[m]>
t
<Mis012[m]>
it's about "do they use cryptographically secure asymmetric cipher for signature verification"
<Mis012[m]>
also known as "wait, that's legal?"
Akari has joined #dri-devel
<freemint>
I mean one could always swap out the chip with one that one controls. Whether there are non hardware modding ways i do not know. I would be shocked if unsigned firmware updates are done in hardware developed in 2017 but i honestly do not know. Whether the firmware can be exploited from the Vector cores, from the x86 host behind PCIe or if a hard mod is needed i do not know.
<freemint>
"Hey, do you sign your firmware images with asymmetric keys?" is not something you get an answer to if you ask nicely i think.
<freemint>
I know symmetric encryption is used somewhere though from binwalking some files.
krushia has joined #dri-devel
<clever>
freemint: on what device?
<Mis012[m]>
freemint: the chip is the full thing
<Mis012[m]>
you can't only swap out the part that you don't like :P
<freemint>
Where-ever "fw_ve_sbus_00.bin" gets installed to.
<Mis012[m]>
you can shoot it with a particle accelerator beam but then it will have a very sad EOL
<freemint>
Mis012[m], I have not taken the card apart but my impression was that the ARM controller was external. I could be mistaken.
<Mis012[m]>
🤨
<Mis012[m]>
how would that work
<Mis012[m]>
"it's basically planned obsolescence, they're forcing us to shoot particles at it which massively reduces its lifespan"
<Mis012[m]>
freemint: that block diagram doesn't show any GPIO interface either
<freemint>
Good point.
<Mis012[m]>
so how does it communicate with the supposed chip to update the fw
<freemint>
maybe in an undocumented way?
<Mis012[m]>
or it has an undocumented arm core
<Mis012[m]>
which is more likely :P
<Mis012[m]>
or the arm core is inside one of those things from the block diagram
<freemint>
or parts of the block diagram are outside the chip
<Mis012[m]>
that's not how block diagrams work
<Mis012[m]>
the fw mentions i2c and spi xD
<Mis012[m]>
they could technically use the smbus interface on PCIE slots
<Mis012[m]>
but I don't think that's mandatory?
<freemint>
Well, block diagrams can be lies for children.
<Mis012[m]>
:doubt:
<Mis012[m]>
there might be stuff missing or it might be slightly incorrect, but for sure it's all on die
<freemint>
Is NEC even an ARM license holder?
<freemint>
If it is not, it can not print ARM cores on silicon at TSMC i think
<Mis012[m]>
freemint: you could try to poke them to put all the registers in here, not just ones they use :P
<Mis012[m]>
I certainly would do that if I had the docs with the register map
<freemint>
That poking should work even if there is no driver.
<Mis012[m]>
well, you can poke...
<freemint>
Would you say, given how unpopular the card is, that this is a better or worse open source and public doc situation compared to other accelerators by Nvidia or AMD?
<freemint>
want login to the machine the card is in so you can poke too?
<Mis012[m]>
AMD doesn't exactly share register maps, but sometimes they leak
<Mis012[m]>
fwiw they do seem to auto-generate the headers from them
<Mis012[m]>
which are public
<Mis012[m]>
so unless they really want to hide something, that should be quite exhaustive
<Mis012[m]>
freemint: open source situation regarding AMD cards is pretty much as good as it's going to get
<Mis012[m]>
do you want to correlate a dump of the unknown register values with LED blinking?
<Mis012[m]>
or whatever it has hooked up to GPIOs
<freemint>
What LEDs are you talking about? There are three externally visible ones.
<Mis012[m]>
amdgpu is somewhat solid (though I'd still love to convince mainline to let me refactor the i2c/gpio/irq stuff into separate drivers), Mesa drivers are nice...
<Mis012[m]>
freemint: the fw mentioned LEDs
<freemint>
I would not want to do potentially destructive stuff to the card. I do not have good screw drivers.
<Mis012[m]>
>implying that a die can be fixed with screwdrivers
<Mis012[m]>
freemint: you wanted to poke, the only thing that I can think of that might be visible just by looking at the mmio dump is if there are bits which change depending on if the LEDs are on
<freemint>
Mis012[m], Implying that GPIO is behind a screw driver
<Mis012[m]>
well, LEDs are presumably connected to GPIOs
<Mis012[m]>
I somewhat doubt a large BGA will have pins labeled on silkscreen