bloodline wrote:PCI Config offset 14 (BAR1) can be configured via registers CF8, CF4, and CF3... Which is oddly cryptic as such "registers" don't appear to be located anywhere...
I took another look at the datasheet and it's in there, section 3.2, but it's not something you can manipulate in software: it's controlled by whether resistors are wired up to specific lines of the memory data bus.
Doh! Yup, when I did that run I did have the -vga std option set.
But still, your pointing out my errors has helped me far more than the documentation did.
bloodline wrote:Anyway, I might give-up with the old Cirrus chip and follow @thewrongchristian 's advice and try to find some documentation for QEMU's virtio display adaptor...
I have been looking at this one; it's perhaps even more impenetrable than the Cirrus document. But just setting -vga virtio as an option in QEMU has afforded a slight speed improvement... so that's the route to go down.
Octocontrabass wrote:Probably for firmware. Actual GPU drivers use DMA to move things around, they usually don't access the framebuffer directly. (Does "memory schedules" refer to a type of DMA?)
Yes. The kernel driver will construct a memory schedule of work to be done, and then the PCIe device will read & write the schedule with DMA (bus mastering).
Anyway, this seems to explain why performance with the LFB is slower than using the GPU interface, something that appears a bit illogical at first. However, BARs never have the same performance as bus mastering. It also explains why the LFB should only be written and not read: when reading from a BAR, the CPU has to wait for the PCIe device to fetch the contents from its local RAM and send them back as a PCIe transaction, whereas when writing, with the correct caching settings and a decently implemented PCIe device, the CPU shouldn't need to wait for the device to handle the request.
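To illustrate the write-only discipline described above, here is a minimal sketch, assuming a 32-bpp linear framebuffer already mapped somewhere and a same-sized shadow buffer in ordinary system RAM; the names and dimensions are made up for the example. All reads (blending, scrolling, and so on) hit the shadow copy, and pixels only ever flow towards the BAR.

[code]
#include <stdint.h>
#include <string.h>

/* Illustrative values only. */
#define FB_WIDTH   1024
#define FB_HEIGHT  768

static uint32_t shadow[FB_WIDTH * FB_HEIGHT];   /* ordinary cached RAM       */
static volatile uint32_t *lfb;                  /* set to the mapped LFB/BAR,
                                                   ideally write-combining   */

/* Draw into the shadow buffer; never touch the LFB for reads. */
static void put_pixel(int x, int y, uint32_t argb)
{
    shadow[y * FB_WIDTH + x] = argb;
}

/* Push a range of finished rows to the device: writes only, sequential. */
static void flush_rows(int first_row, int last_row)
{
    for (int y = first_row; y <= last_row; y++)
        memcpy((void *)&lfb[y * FB_WIDTH],
               &shadow[y * FB_WIDTH],
               FB_WIDTH * sizeof(uint32_t));
}
[/code]

Sequential, write-only streams like this are exactly what write combining can burst efficiently; a single stray read from lfb stalls the CPU for a full round trip across the bus.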
I think you're overthinking this.
This is the CL GD5446 we're talking about, a value PCI graphics chipset from the 1990s. Its "GPU" was a simple blitter, all the VRAM sat on the device side of the PCI bus, and it relied on write combining for burst performance when writing to the framebuffer.
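For reference, write combining for a framebuffer aperture is usually arranged on x86 by covering it with a variable-range MTRR (or, in newer drivers, PAT). Below is a rough sketch of the MTRR route; it assumes MTRR pair 0 is free and the aperture is power-of-two sized and naturally aligned, and it leaves out the full cache-disable/flush update sequence the Intel SDM prescribes.

[code]
#include <stdint.h>

#define IA32_MTRR_PHYSBASE0  0x200
#define IA32_MTRR_PHYSMASK0  0x201
#define MTRR_TYPE_WC         0x01ULL      /* write-combining memory type */
#define MTRR_MASK_VALID      (1ULL << 11) /* valid bit in PHYSMASK       */

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile("wrmsr" :: "c"(msr),
                     "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

/* Mark the framebuffer aperture as write-combining using variable MTRR 0.
 * phys_addr_mask is derived from MAXPHYADDR (CPUID leaf 80000008h); the
 * proper update sequence (CR0.CD, wbinvd, disable/re-enable MTRRs) is
 * omitted to keep the sketch short. */
static void map_lfb_write_combining(uint64_t fb_base, uint64_t fb_size,
                                    uint64_t phys_addr_mask)
{
    wrmsr(IA32_MTRR_PHYSBASE0, (fb_base & ~0xFFFULL) | MTRR_TYPE_WC);
    wrmsr(IA32_MTRR_PHYSMASK0,
          (~(fb_size - 1) & phys_addr_mask) | MTRR_MASK_VALID);
}
[/code]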
Certainly, but I'm wondering why modern Intel graphics chips have such poor performance, and why AMD chips tend to perform a lot better. A poor implementation of the LFB via BARs certainly could explain it: since Intel assumes everybody will use the GPU interface, they didn't bother making the BAR interface fast.
rdos wrote:
Certainly, but I'm wondering why modern Intel graphics chips have such poor performance, and why AMD chips tend to perform a lot better...
Doesn't the Intel GPU operate exclusively via the regular shared system RAM?
So, writing to the framebuffer is just a case of writing to the physical RAM you've told the GPU to pull the framebuffer contents from. As I understand it, BAR2 indicates the physical memory address of this shared framebuffer RAM, but I've not poked it; I'm still mostly QEMU-based at the moment.
The performance (or lack thereof) in the Intel GPU will be a function of the GPU itself. AMD are probably just better at GPUs than Intel.
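As a purely illustrative sketch of that arrangement: with a display controller that scans out of main memory, the driver hands the hardware the physical address of a buffer it allocated itself and then renders with ordinary cached stores. The register offsets and names below are hypothetical and not taken from any Intel programming manual.

[code]
#include <stdint.h>

/* Hypothetical MMIO registers for a controller that scans out of system RAM. */
#define SCANOUT_BASE_REG   0x7000   /* physical address of the framebuffer */
#define SCANOUT_STRIDE_REG 0x7008   /* bytes per scanline                  */

static volatile uint8_t *mmio;      /* register BAR, mapped elsewhere       */
static uint32_t *fb_virt;           /* our mapping of the system-RAM buffer */
static uint32_t  fb_stride_pixels;  /* pixels per scanline                  */

static inline void mmio_write32(uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(mmio + off) = val;
}

/* Point the (hypothetical) scanout engine at our buffer. */
static void set_scanout(uint64_t fb_phys, uint32_t stride_bytes)
{
    mmio_write32(SCANOUT_BASE_REG,   (uint32_t)fb_phys);
    mmio_write32(SCANOUT_STRIDE_REG, stride_bytes);
}

/* Rendering is then just normal cached writes to RAM, no BAR traffic. */
static void draw_pixel(int x, int y, uint32_t argb)
{
    fb_virt[y * fb_stride_pixels + x] = argb;
}
[/code]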
thewrongchristian wrote:
So, writing to the framebuffer is just a case of writing to the physical RAM you've told the GPU to pull the framebuffer contents from. As I understand it, BAR2 indicates the physical memory address of this shared framebuffer RAM, but I've not poked it; I'm still mostly QEMU-based at the moment.
No, when you read or write a BAR area you generate PCIe requests that the GPU has to serve in real time. It will typically map some of its local RAM to the BAR, and it is then the responsibility of the GPU to route between the two. If that routing is a quick job rather than a properly pipelined implementation, it can indeed end up highly inefficient. I know because I have implemented BARs myself, and I decided I needed to use bus-mastering requests to main memory to achieve the throughput I wanted. Letting the CPU read the BAR is too slow, but when the PCIe card uses bus mastering, I can read the data from main memory very fast.
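A minimal sketch of the bus-mastering arrangement being described here (and of the "memory schedule" mentioned earlier in the thread): the driver builds a small descriptor in ordinary RAM, hands the device its physical address through a couple of register writes, and the device then fetches the descriptor and moves the bulk data itself by DMA, so the CPU never reads device memory through a BAR. Every register offset and field name below is hypothetical.

[code]
#include <stdint.h>

/* A work descriptor the device fetches from main memory by bus mastering.
 * The layout is illustrative; a real device defines its own format. */
struct work_desc {
    uint64_t src_phys;   /* physical address of source data in RAM */
    uint64_t dst_phys;   /* physical address of destination in RAM */
    uint32_t length;     /* bytes to transfer                      */
    uint32_t flags;      /* e.g. "interrupt on completion"         */
} __attribute__((aligned(64)));

/* Hypothetical register offsets in a small MMIO BAR. */
#define DESC_ADDR_REG 0x100   /* where the device should fetch the descriptor */
#define DOORBELL_REG  0x108   /* write anything here to kick off the work     */

static volatile uint8_t *regs;   /* register BAR, mapped elsewhere */

static inline void reg_write64(uint32_t off, uint64_t val)
{
    *(volatile uint64_t *)(regs + off) = val;
}

/* Submit one descriptor that the caller has already filled in at desc_phys.
 * The only BAR traffic is two small posted writes; the bulk data moves by
 * device-initiated DMA to and from cached main memory. */
static void submit(uint64_t desc_phys)
{
    /* Make sure the descriptor stores are globally visible before the
     * device is told to go and fetch it. */
    __asm__ volatile("sfence" ::: "memory");
    reg_write64(DESC_ADDR_REG, desc_phys);
    reg_write64(DOORBELL_REG, 1);
}
[/code]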
thewrongchristian wrote:Doesn't the Intel GPU operate exclusively via the regular shared system RAM?
Yes. Recent ones participate in the cache coherency protocol, too.
thewrongchristian wrote:As I understand it, BAR2 indicates the physical memory address of this shared framebuffer RAM, but I've not poked it.
BAR2 provides a window into the GPU's view of RAM according to the GPU's page tables. On a recent GPU with coherent shared memory, I would expect it to be slower than directly accessing the memory from the CPU's view.