rdos wrote: You should avoid Intel platforms if you want high-performance VBE. You generally should also avoid new systems and go with systems that are a bit dated. Although, modern AMD systems work pretty well.

davmac314 wrote: Why is that (avoid Intel platforms)? And also why would "a bit dated" systems be faster?

Ethin wrote: Confused about this as well. Considering that the Linux Intel GPU drivers are plenty fast, there's a high probability that you're doing something wrong if they're slow.

Intel GPUs are plenty fast when you have a proper driver for them, but we're talking about VBE. My guess is that Intel's VBE firmware puts the GPU into a compatibility mode. Since drivers don't use the compatibility mode, there's no reason for Intel to make it perform well.
What graphics card is good at VBE?
Re: What graphics card is good at VBE?
rdos wrote: You should avoid Intel platforms if you want high-performance VBE. You generally should also avoid new systems and go with systems that are a bit dated. Although, modern AMD systems work pretty well.

davmac314 wrote: Why is that (avoid Intel platforms)? And also why would "a bit dated" systems be faster?

Ethin wrote: Confused about this as well. Considering that the Linux Intel GPU drivers are plenty fast, there's a high probability that you're doing something wrong if they're slow. The dated hardware argument only makes sense in the context of "this might be easier to program". But I generally find the Intel GPU manuals incredibly confusing and disorganized, especially since they went "Hey, let's break tradition and organize our GPU registers alphabetically and not by category/address like everybody else". (What's ironic about that is they also wrote the AHCI and HDA specifications and didn't do that.)

The question was about VBE, not writing GPU drivers. It's quite possible that Intel's GPU drivers perform well but that their VBE implementation on top of BARs performs horribly slowly.
The "dated" argument is that, before everything significant used native GPU drivers, manufacturers needed to make VBE fast. Nowadays, nobody cares. Intel doesn't even bother to add all the available resolutions, so on Intel hardware with a wide-screen monitor you cannot set the correct resolution and must use a standard 4:3 resolution that distorts the display. This is true regardless of whether you boot with BIOS or use EFI and GOP, and that problem, too, only shows up on Intel.
Re: What graphics card is good at VBE?
There is no reason to pessimize VBE though. VBE uses the same BAR-mapped-VRAM that is also used for data transfers to the card (e.g., when moving textures into VRAM). So the issue is probably that the VBE driver runs the entire GPU at a lower clock speed (or similar) and not that writing to the BAR is inherently slow.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Re: What graphics card is good at VBE?
Korona wrote: There is no reason to pessimize VBE though. VBE uses the same BAR-mapped-VRAM that is also used for data transfers to the card (e.g., when moving textures into VRAM). So the issue is probably that the VBE driver runs the entire GPU at a lower clock speed (or similar) and not that writing to the BAR is inherently slow.

Not true. Writing to PCIe BARs is always slower than writing normal memory and VRAM. This is because the PCIe card needs to decode the PCIe operations and transfer the data to its internal RAM. If you don't do this in a highly pipelined way, it will be very slow. I'm pretty sure that GPUs use bus mastering, which can be done in large chunks and is much faster.
I think the biggest problem is that the LFB in modern GPUs is implemented with BARs rather than being directly mapped in main memory.
Re: What graphics card is good at VBE?
rdos wrote: Not true. Writing to PCIe BARs is always slower than writing normal memory and VRAM

But VBE only writes to the BARs themselves to set up the memory mapping, presumably.
I think what you mean is that writing to memory mapped via the BARs (BAR-mapped-VRAM) is slower than writing normal memory and VRAM (for the latter assuming there was some way to write VRAM directly). What you say about using bus mastering to have the adapter transfer chunks from system RAM to VRAM at high speed at least sounds plausible, but implies that direct framebuffer access should always be slow and that to get performance an Intel GPU driver always needs to construct bitmaps in system RAM before blitting them via commands to the GPU (involving bus mastering). That seems surprising.
rdos wrote: I think the biggest problem is that the LFB in modern GPUs is implemented with BARs rather than being directly mapped in main memory.

But graphics adapters always had their own on-board memory. They never used to use main memory to back the framebuffer; as far as I'm aware, the Intel integrated GPUs were the first (or among the first) to do that (for x86 PCs). And I don't see any particular reason why mapping the VRAM into the system memory address space is drastically less efficient using BARs than it would be any other way. I'd love to know more.
Re: What graphics card is good at VBE?
While I've implemented a basic modesetting driver for the Intel chipset in my ThinkPad T410, I have historically relied on VESA mode setting to obtain a framebuffer for the device... and I can't say I've ever had any performance issues. I can happily play Quake under my software compositor with excellent framerates and no visible latency to screen refreshes. The trouble I always had with Intel cards and VESA/VBE is their tendency to inexplicably not provide the native panel resolution on a laptop. Further, my modesetting driver never does anything with the framebuffer, which is still the one provided by VESA and still matches the address provided in BAR2 on the card - it only tweaks the source scaling options to match the panel resolution (as VESA on these chipsets already sets up the output for the native resolution but then always seems to set a source resolution that differs - perhaps the hardware scaling that entails is responsible for some performance issues, but even at 1280x800 on a 1440x900 output things are peachy here).
Re: What graphics card is good at VBE?
rdos wrote: Not true. Writing to PCIe BARs is always slower than writing normal memory and VRAM

davmac314 wrote: But VBE only writes to the BARs themselves to set up the memory mapping, presumably. I think what you mean is that writing to memory mapped via the BARs (BAR-mapped-VRAM) is slower than writing normal memory and VRAM (for the latter assuming there was some way to write VRAM directly). What you say about using bus mastering to have the adapter transfer chunks from system RAM to VRAM at high speed at least sounds plausible, but implies that direct framebuffer access should always be slow and that to get performance an Intel GPU driver always needs to construct bitmaps in system RAM before blitting them via commands to the GPU (involving bus mastering). That seems surprising.

Well, I've implemented BARs myself in an FPGA project, and the problem is that when the CPU accesses a BAR, you need to add Verilog code to handle the (short) PCIe transaction. If it is a write, you need to either write the data to local RAM directly or queue it, and ack the packet. If you write it directly, the PCIe bus will be busy until the write is done. If it is a read, you need to do a local RAM read and then create another PCIe transaction with the return data. This is why reading the LFB is a big no-no. So, depending on how the card handles BAR accesses, the performance may differ. Also note that a 32-bit processor can at most create a dword PCIe access. When you do bus mastering from a PCIe card, you can create read & write requests of 128 or even 256 words, and these will run much faster than CPU accesses. So, by creating a schedule in main memory, you can achieve much better graphics performance than with an LFB.
Actually, I could post ADC data with my ADC FPGA close to the maximum throughput of the PCIe hardware by creating 128-word PCIe writes. If I had instead let the CPU read the data from the BAR, performance would have been many times slower.
davmac314 wrote: But graphics adapters always had their own on-board memory. They never used to use main memory to back the framebuffer; as far as I'm aware, the Intel integrated GPUs were the first (or among the first) to do that (for x86 PCs). And I don't see any particular reason why mapping the VRAM into the system memory address space is drastically less efficient using BARs than it would be any other way. I'd love to know more.

I think older designs actually shared memory with the system. I don't know how it worked, but still.
Re: What graphics card is good at VBE?
The two cases that stand out there would be AGP and the PCjr.
AGP acted as a second PCI bus just for the graphics card, with stricter rules regarding transaction size, etc. That plus usual PCI-like bus mastering allowed the card to read system memory at a reasonable rate. I don't know offhand how that affected performance on the CPU side, but with caches on either side, and it usually being for textures rather than the framebuffer, it was probably at an acceptable level.
In the ancient days, while card VRAM was mapped into the address space, accessing it could be rather slow. It wasn't always dual-ported, and if the card itself was busy reading it for scanout, it got priority over the CPU access, which would get waitstates stuffed in as needed. Building your image in memory and then blast-copying it to the card during vblank was not just about avoiding tearing.
The PCjr was its own special little thing. In the interests of cost savings, the video was onboard and had no dedicated RAM of its own. That meant fighting with waitstates everywhere. To be even cheaper about it, they ditched the 8237 DMA controller and used the CRT controller to do DRAM refresh. Turn off the screen for too long (say, during a mode switch) and bad things happen.
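To illustrate the "build the image in memory, then blast-copy it" pattern mentioned above, here is a minimal C sketch. The dimensions, pixel value, framebuffer pointer and pitch are made-up placeholders, not anything taken from a specific card; a real driver would read them from the VBE mode info block.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical values; a real driver reads these from the VBE mode info. */
#define FB_WIDTH   1024
#define FB_HEIGHT  768
#define FB_BPP     4                     /* 32 bits per pixel */

static uint32_t backbuffer[FB_WIDTH * FB_HEIGHT];  /* ordinary cacheable RAM */

/* Draw into the back buffer at full speed (cacheable system memory). */
static void draw_scene(void)
{
    for (int y = 0; y < FB_HEIGHT; y++)
        for (int x = 0; x < FB_WIDTH; x++)
            backbuffer[y * FB_WIDTH + x] = 0x00336699;  /* some pixel value */
}

/* Push the finished frame to the mapped VRAM in one pass, row by row so a
 * pitch larger than the visible width is handled correctly. Calling this
 * during vblank avoids tearing on hardware without page flipping. */
static void present(uint8_t *lfb, unsigned pitch)
{
    for (int y = 0; y < FB_HEIGHT; y++)
        memcpy(lfb + (size_t)y * pitch,
               &backbuffer[y * FB_WIDTH],
               FB_WIDTH * FB_BPP);
}
```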
Re: What graphics card is good at VBE?
rdos wrote: Also note that a 32-bit processor can at most create a dword PCIe access.

A 32-bit processor with write-combining can create PCI accesses as large as a whole cache line, which is usually 64 bytes. This applies to both reads and writes, as speculative reads are allowed from WC memory. Intel has guides on how to accomplish this.
Without WC, most x86 processors can still perform at least qword PCIe access (e.g. using FILD/FISTP).
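For example, a copy loop along these lines plays to write-combining: each group of four 16-byte non-temporal stores fills one 64-byte WC buffer, which the CPU can then flush to the bus as a single full-cache-line transaction. This is just a sketch, not code from any of the posts here; it assumes the LFB is already mapped WC, everything is 16-byte aligned, and the length is a multiple of 64.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Copy 'bytes' (a multiple of 64) from cacheable RAM to a WC-mapped LFB. */
static void copy_to_lfb_nt(void *lfb, const void *src, size_t bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)lfb;

    for (size_t i = 0; i < bytes / 64; i++) {
        /* Load one 64-byte chunk from the (cacheable) source... */
        __m128i a = _mm_load_si128(s++);
        __m128i b = _mm_load_si128(s++);
        __m128i c = _mm_load_si128(s++);
        __m128i e = _mm_load_si128(s++);
        /* ...and stream it out so the four stores combine into one line. */
        _mm_stream_si128(d++, a);
        _mm_stream_si128(d++, b);
        _mm_stream_si128(d++, c);
        _mm_stream_si128(d++, e);
    }
    _mm_sfence();  /* WC stores are weakly ordered; fence before reuse */
}
```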
rdos wrote: I think older designs actually shared memory with the system. I don't know how it worked, but still.

Before 2000 or so, most PC display adapters had their own memory. Before 1995 or so, all PC display adapters had their own memory.
davmac314 wrote: And I don't see any particular reason why mapping the VRAM into the system memory address space is drastically less efficient using BARs than it would be any other way.

That depends on the hardware. An old SiS chipset datasheet specifically mentions that accessing the display memory through PCI MMIO incurs overhead from PCI transactions, but configuring the chipset to map the display memory directly into the CPU's address space does not. Intel's datasheets have similar warnings about how accessing the display memory in particular ways will incur overhead (though I didn't see anything about PCIe MMIO).
Re: What graphics card is good at VBE?
rdos wrote: Also note that a 32-bit processor can at most create a dword PCIe access.

Octocontrabass wrote: A 32-bit processor with write-combining can create PCI accesses as large as a whole cache line, which is usually 64 bytes. This applies to both reads and writes, as speculative reads are allowed from WC memory. Intel has guides on how to accomplish this. Without WC, most x86 processors can still perform at least qword PCIe access (e.g. using FILD/FISTP).

I set the pages that cover the LFB as WC. I never read the LFB, and it is accessed with a rep movsd instruction. Still, many modern Intel systems have worse performance than a 10-20 year old AMD Athlon motherboard. The computer that runs my random graphics test fastest is a 10 year old AMD computer. There has to be some good explanation for this.
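A simplified sketch of that kind of blit, written as GCC inline assembly (function and parameter names are illustrative only; it assumes the destination pages are already mapped WC and the length is a whole number of dwords):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy 'count' dwords from a system-RAM back buffer to the WC-mapped LFB
 * with a single REP MOVSD, letting the CPU's fast-string logic and the WC
 * buffers combine the writes into larger bus transactions. */
static void lfb_blit_movsd(volatile uint32_t *dst, const uint32_t *src,
                           size_t count)
{
    __asm__ volatile ("rep movsd"
                      : "+D"(dst), "+S"(src), "+c"(count)  /* EDI, ESI, ECX */
                      :
                      : "memory");
}
```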
Re: What graphics card is good at VBE?
rdos wrote: Also note that a 32-bit processor can at most create a dword PCIe access.

Octocontrabass wrote: A 32-bit processor with write-combining can create PCI accesses as large as a whole cache line, which is usually 64 bytes. This applies to both reads and writes, as speculative reads are allowed from WC memory. Intel has guides on how to accomplish this. Without WC, most x86 processors can still perform at least qword PCIe access (e.g. using FILD/FISTP).

rdos wrote: I set the pages that cover the LFB as WC. I never read the LFB, and it is accessed with a rep movsd instruction. Still, many modern Intel systems have worse performance than a 10-20 year old AMD Athlon motherboard. The computer that runs my random graphics test fastest is a 10 year old AMD computer. There has to be some good explanation for this.

Why exactly are you using MOVSD? That's for moving data to and from strings, definitely not designed to be used to transfer data to and from an LFB (I think). I mean, okay, it can be used for that, but I don't think you're supposed to do that. Have you thought about either using LOOP/LOOPcc or implementing a loop yourself? It will take more instructions, but it might be faster if you use a normal MOV.
Edit: the Intel manual entry for MOVS/MOVSB/MOVSW/MOVSD/MOVSQ notes that newer processors perform certain optimizations for fast string operations, and refers the reader to Section 7.3.9.3 of Volume 1, which notes that system software should always set IA32_MISC_ENABLE[0] to 1. So if you don't do that and the firmware doesn't do it either, you might also want to do that.
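A minimal ring-0 sketch of checking and setting that bit (IA32_MISC_ENABLE is MSR 0x1A0 and bit 0 is the fast-strings enable; this is illustrative only, and a real kernel would gate it behind the appropriate CPUID checks, which are omitted here):

```c
#include <stdint.h>

#define IA32_MISC_ENABLE      0x1A0
#define MISC_ENABLE_FAST_STR  (1ULL << 0)   /* fast-strings enable */

static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t value)
{
    __asm__ volatile ("wrmsr" : : "c"(msr),
                      "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
}

/* Enable fast string operations if the firmware left them disabled. */
static void enable_fast_strings(void)
{
    uint64_t v = rdmsr(IA32_MISC_ENABLE);
    if (!(v & MISC_ENABLE_FAST_STR))
        wrmsr(IA32_MISC_ENABLE, v | MISC_ENABLE_FAST_STR);
}
```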
Re: What graphics card is good at VBE?
Ethin wrote: Why exactly are you using MOVSD? That's for moving data to and from strings, definitely not designed to be used to transfer data to and from an LFB (I think). I mean, okay, it can be used for that, but I don't think you're supposed to do that. Have you thought about either using LOOP/LOOPcc or implementing a loop yourself? It will take more instructions, but it might be faster if you use a normal MOV. Edit: the Intel manual entry for MOVS/MOVSB/MOVSW/MOVSD/MOVSQ notes that newer processors perform certain optimizations for fast string operations, and refers the reader to Section 7.3.9.3 of Volume 1, which notes that system software should always set IA32_MISC_ENABLE[0] to 1. So if you don't do that and the firmware doesn't do it either, you might also want to do that.

That's more Intel strangeness. For a typical assembler programmer, using a loopless instruction that the CPU manufacturer can optimize to actually create huge PCIe transactions should be optimal. The CPU doesn't need to fetch or decode any instructions, and so could perform the operation at maximum speed. Of course, if it wasn't for the fact that C compilers will not output it. So they implement it in microcode instead. Just like they don't care about the LFB either, because nobody uses it.
I guess that AMD probably handles it better, just like they handle syscalls (with call gates) and segmentation much better than Intel.
The idea that a CPU manufacturer would look at how C compilers output code and which memory models they use, and then optimize their CPU according to this really sucks. It means that better compilers or OSes will not be made, because there is no need since the CPU is adapted to their poor code generation strategies. It's a bit like how the CPU tries to create parallelism because programmers cannot. Particularly not when they use sequential languages like C.
There is a reason why I will not write another OS for long mode or anything else that needs to rely on poor modern toolkits.
Re: What graphics card is good at VBE?
Ethin wrote: Why exactly are you using MOVSD? That's for moving data to and from strings, definitely not designed to be used to transfer data to and from an LFB (I think). I mean, okay, it can be used for that, but I don't think you're supposed to do that. Have you thought about either using LOOP/LOOPcc or implementing a loop yourself? It will take more instructions, but it might be faster if you use a normal MOV. Edit: the Intel manual entry for MOVS/MOVSB/MOVSW/MOVSD/MOVSQ notes that newer processors perform certain optimizations for fast string operations, and refers the reader to Section 7.3.9.3 of Volume 1, which notes that system software should always set IA32_MISC_ENABLE[0] to 1. So if you don't do that and the firmware doesn't do it either, you might also want to do that.

rdos wrote: That's more Intel strangeness. For a typical assembler programmer, using a loopless instruction that the CPU manufacturer can optimize to actually create huge PCIe transactions should be optimal. The CPU doesn't need to fetch or decode any instructions, and so could perform the operation at maximum speed. Of course, if it wasn't for the fact that C compilers will not output it. So they implement it in microcode instead. Just like they don't care about the LFB either, because nobody uses it. I guess that AMD probably handles it better, just like they handle syscalls (with call gates) and segmentation much better than Intel. The idea that a CPU manufacturer would look at how C compilers output code and which memory models they use, and then optimize their CPU according to this really sucks. It means that better compilers or OSes will not be made, because there is no need since the CPU is adapted to their poor code generation strategies. It's a bit like how the CPU tries to create parallelism because programmers cannot. Particularly not when they use sequential languages like C. There is a reason why I will not write another OS for long mode or anything else that needs to rely on poor modern toolkits.

Sorry, but I have to disagree with you very strongly about this. It's actually a very good thing that Intel/AMD optimize their CPUs to be fastest with compiler-generated code, because 99.99999999 percent of programmers don't write assembly. I'm quite positive that less than 30 percent of them even use intrinsic functions. Assembly is only used in very specific circumstances where doing what you want is impossible in a higher-level language, and the reasons to use it become fewer every year. I wouldn't be surprised if in 5-10 years you'll be able to write an OS entirely in a high-level language without knowing any assembly at all, or only have to write a few tiny functions in it (or if you're doing some ABI weirdness).
-
- Member
- Posts: 5563
- Joined: Mon Mar 25, 2013 7:01 pm
Re: What graphics card is good at VBE?
rdos wrote: For a typical assembler programmer, using a loopless instruction that the CPU manufacturer can optimize to actually create huge PCIe transactions should be optimal.

But no matter how much you optimize it, you'll never beat DMA. Since DMA will always be the fastest choice, every PCIe device will support DMA, and the CPU manufacturer has no reason to optimize anything else. (This might explain why older hardware seems to work better here: PCI display adapters were sometimes based on ISA or VLB designs, which didn't use DMA, so it made sense for the CPU and chipset to optimize non-DMA PCI performance.)
rdos wrote: Of course, if it wasn't for the fact that C compilers will not output it.

GCC does. Clang does too.
rdos wrote: So they implement it in microcode instead.

I don't see how this is a bad thing. With microcode, the manufacturer can optimize cache access patterns. The instruction decoder can't dispatch enough uops to do that without microcode.
rdos wrote: It's a bit like how the CPU tries to create parallelism because programmers cannot.

That sounds like a topic for a different thread.
Re: What graphics card is good at VBE?
Ethin wrote: Sorry, but I have to disagree with you very strongly about this. It's actually a very good thing that Intel/AMD optimize their CPUs to be fastest with compiler-generated code, because 99.99999999 percent of programmers don't write assembly. I'm quite positive that less than 30 percent of them even use intrinsic functions. Assembly is only used in very specific circumstances where doing what you want is impossible in a higher-level language, and the reasons to use it become fewer every year. I wouldn't be surprised if in 5-10 years you'll be able to write an OS entirely in a high-level language without knowing any assembly at all, or only have to write a few tiny functions in it (or if you're doing some ABI weirdness).

Sure, but I see no reason to write a new OS that cannot exploit an interesting design idea. With ready-made memory models, compilers, and the Unix "standard", all you can do is write yet another Unix clone, which will probably be poor and uninteresting. I don't much fancy writing that kind of stuff. I can use my time in better ways.
Originally, Intel did great work with the 386 architecture and provided an environment that can be used to isolate software. Then the big OSes decided they wanted a flat memory model, linking the kernel with code & data closely packed, which is a nightmare for software isolation. The biggest reason they did so was that C compilers couldn't handle segmentation in an efficient way. So the Unix standard and C compilers made a great invention regress back to a flat memory model, and then AMD invented the 64-bit flat memory model with no segmentation support.
CPU manufacturers decided they needed to optimize multilevel paging and could do segment loads in microcode.
In fact, the Unix and C standards will crush any new CPU invention aimed at solving the issue of software isolation, simply because it would be incompatible with those standards.
I once had the idea that the upper 16 bits of the long-mode address space could also be used for software isolation, but GCC and the GNU linker couldn't handle it.