Unable to mark a memory region as WC using MTRRs

Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Unable to mark a memory region as WC using MTRRs [solved]

Post by Brendan »

Hi,
vvaltchev wrote:
Brendan wrote:The part that still doesn't make sense to me is that SSE or AVX using non-temporal stores should be using the write-combining buffers and should give exactly the same performance regardless of whether the area is UC or WC in MTRRs or PAT.
That's the same thing I initially thought, but it turns out not to be true on all of my machines.
Is it true on some but not others?

Note: As far as I can tell the UDOO's CPU is a "Braswell" (an Atom derivative). These CPUs are specifically designed for low power embedded systems and cut lots of corners that Intel wouldn't consider for their normal (laptop/desktop/server) chips.
vvaltchev wrote:My current understanding is that stores with a non-temporal hint affect only write-back memory, where they force the CPU to bypass the cache and write directly to main memory. The gain is evident because it stops the CPU from throwing away precious cache data for something that we're not going to read any time soon. But in my case, for UC memory, that hint has no effect since the CPU already bypasses the cache: each store goes directly to RAM and the CPU busy-waits until the store is completed. I got the idea of using non-temporal stores from https://software.intel.com/en-us/articl ... me-buffers but that article explains the opposite: how to READ efficiently from a framebuffer, not how to write to it. I wanted to try the non-temporal stores anyway because I thought the framebuffer was using write-back memory. After I added support for MTRRs, I realized that it was not.
My understanding is:
  • non-temporal loads from WB go directly to L1 cache (bypassing and not polluting L2 and L3 caches)
  • non-temporal loads from UC probably do nothing (no different to normal loads)
  • normal stores to WT or UC go to a "store queue" where CPU can keep going but has to wait for stores to be finished before doing subsequent loads (which is why "single register doing moves" is bad - you want to load 16 registers full of data then store 16 registers full of data, to minimise the "loads have to wait until stores are done" problem).
  • non-temporal stores go to write-combining buffers (regardless of WB, WT, WC or UC) and there is no "loads have to wait until stores are done" problem; and then (when data is evicted or drained from write-combining buffers) the CPU stores up to 32 bytes or 64 bytes at once (because it combined multiple smaller stores), which helps because memory and PCI is more like a network (e.g. each store probably has a ~10 byte header, the data, then a footer, so increasing the size of the write means "more bytes with same overhead" or "same bytes with less overhead", which works out to better bandwidth)
Of course Intel's documentation rarely says the exact behaviour so that Intel can change the behaviour later; and different CPUs may do things differently.
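As a rough illustration of the "load a batch of registers, then store a batch with non-temporal stores" idea from the list above, here is a minimal sketch using the SSE2 intrinsics from <emmintrin.h>. The helper name nt_copy and its alignment and size requirements are assumptions made for this example, not code from the thread, and a kernel would have to enable SSE before using anything like it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/*
 * Copy `bytes` bytes from a buffer in RAM to `dst` using non-temporal stores.
 * Assumptions for this sketch: `dst` is 16-byte aligned and `bytes` is a
 * multiple of 64.
 */
static void nt_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0 && (bytes & 63) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i += 4) {
        /* Batch the loads ahead of the stores (64 bytes per iteration). */
        __m128i a = _mm_loadu_si128(s + i + 0);
        __m128i b = _mm_loadu_si128(s + i + 1);
        __m128i c = _mm_loadu_si128(s + i + 2);
        __m128i e = _mm_loadu_si128(s + i + 3);

        /* Non-temporal stores; the destination must be 16-byte aligned. */
        _mm_stream_si128(d + i + 0, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */
#else
    memcpy(dst, src, bytes);  /* plain fallback when SSE2 is unavailable */
#endif
}
```

Whether this is any faster on UC memory than on WC memory is exactly the open question in this thread, so it needs measuring on the target CPU.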
vvaltchev wrote:
Brendan wrote:Note that 32 bits per pixel means that 25% of the bytes written to frame buffer do nothing more than waste bandwidth. With 24 bits per pixel you should be able to get it 1.333 times faster (about 0.85 cycles per pixel for the WC case).

Of course for rendering pixels it's faster to have each pixel nicely aligned; which means that the fastest method is to do all the rendering with 32 bits per pixel and then convert from 32 bits per pixel to 24 bits per pixel as part of copying data from your buffer in RAM to the frame buffer.
Yeah, I get that, but I no longer have my 2nd buffer because it's faster without it, and I'd rather not re-introduce it just for the 24-bit case.
OK, maybe with a lot of tricks I might gain something even without the 2nd buffer by packing the pixels locally, but that's also additional code and I'm not sure what the odds of a meaningful improvement are.
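For reference, the 32-to-24 bits per pixel conversion discussed above could look roughly like the sketch below. The function name and the assumed layouts (0x00RRGGBB in the source, bytes in B, G, R order in the destination) are illustrative assumptions; a tuned version would combine the bytes into wider stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert one row of 32 bpp pixels (assumed 0x00RRGGBB) to packed 24 bpp
 * (assumed byte order B, G, R) while copying it to the destination.
 */
static void blit_row_32_to_24(uint8_t *dst, const uint32_t *src, size_t pixels)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixels; i++) {
        uint32_t p = src[i];
        dst[3 * i + 0] = (uint8_t)(p);        /* blue  */
        dst[3 * i + 1] = (uint8_t)(p >> 8);   /* green */
        dst[3 * i + 2] = (uint8_t)(p >> 16);  /* red; the top byte is dropped */
    }
}
```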
I'm much more used to graphics that wasn't obsolete decades ago - e.g. things like GUIs with device independence and anti-aliasing and proportional Unicode fonts. I assumed your console was something that would only be used during boot/kernel initialisation, up until you started a GUI or something (and that later on you'd probably hide it under a nice splash screen so that normal people don't claw their own eyes out while the OS boots). I certainly didn't expect that "horrifically retro" was the only thing you would ever want your OS to be used for.
vvaltchev wrote:Also, UEFI uses only 32-bit modes (EFI_GRAPHICS_OUTPUT_BLT_PIXEL is a struct with size = 4) therefore I decided to stick with that.
No; UEFI supports "everything" using a system of pixel masks. For example, for 16 bits per pixel it might say "RedMask = 0x0000001F, GreenMask = 0x000007E0, BlueMask = 0x0000F800, ReservedMask = 0x00000000", and for 24 bits per pixel it might say "RedMask = 0x000000FF, GreenMask = 0x0000FF00, BlueMask = 0x00FF0000, ReservedMask = 0x00000000". However, the UEFI blitting function uses an abstract pixel format (which is a 32 bit per pixel format) and converts this abstract format into whatever the video mode actually wants, so that it's easy for boot code and UEFI applications to support any video mode (with any pixel format) because UEFI does the conversion for you. Of course after you exit boot services you can't use any of that (and I'd recommend doing the same as UEFI does - do all the rendering in an abstract pixel format and convert that to whatever the video mode actually wants while blitting).
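A minimal sketch of converting an abstract 8-bit-per-channel colour into a native pixel described by such masks is shown below. The struct and function names are hypothetical (this is not the UEFI API), and each mask is assumed to be a contiguous run of 1 to 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a native pixel format, built from the masks. */
struct pixel_format {
    uint32_t red_mask, green_mask, blue_mask;
};

/* Position of the lowest set bit in a non-zero mask. */
static unsigned mask_shift(uint32_t mask)
{
    unsigned shift = 0;
    assert(mask != 0);
    while (!(mask & 1)) { mask >>= 1; shift++; }
    return shift;
}

/* Number of set bits in a contiguous mask. */
static unsigned mask_width(uint32_t mask)
{
    unsigned width = 0;
    mask >>= mask_shift(mask);
    while (mask & 1) { mask >>= 1; width++; }
    return width;
}

/* Scale an 8-bit channel down to the mask's width and move it into place. */
static uint32_t pack_channel(uint8_t value, uint32_t mask)
{
    unsigned width = mask_width(mask);
    assert(width >= 1 && width <= 8);
    return ((uint32_t)(value >> (8 - width)) << mask_shift(mask)) & mask;
}

/* Build one native pixel from an abstract (r, g, b) colour. */
static uint32_t to_native(const struct pixel_format *f,
                          uint8_t r, uint8_t g, uint8_t b)
{
    return pack_channel(r, f->red_mask) |
           pack_channel(g, f->green_mask) |
           pack_channel(b, f->blue_mask);
}
```

In real blitting code the shift and width would be computed once per video mode rather than once per pixel.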
vvaltchev wrote:
Brendan wrote:What performance do you get if you scroll up by one pixel (and not by a whole character)?
My console does not support that: it works with rows. The fastest scroll I have now is just a plain redraw of the scrolled screen, using a buffer of characters (not pixels) as the source. I tried an "image scroll" using the framebuffer and it's terribly slow (because reading from the framebuffer is terribly slow).
Reading from the frame buffer is terribly slow; but reading from a buffer in RAM is not. That's how most beginners do things - e.g. scrolling becomes a "memmove()" from one place in the buffer in RAM to another place in the same buffer in RAM, followed by drawing a single row of characters at the top or bottom of the buffer, and then finishing by blitting the buffer in RAM to the frame buffer.
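The memmove() approach described above can be sketched as follows. The backbuf struct and both helpers are hypothetical names for this example; pitches are measured in pixels, and drawing the new row of characters into the cleared area is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical back buffer in ordinary RAM; `pitch` is in pixels. */
struct backbuf {
    uint32_t *pixels;
    size_t width, height, pitch;
};

/* Scroll the back buffer up by `rows` pixel rows and clear the exposed area. */
static void scroll_up(struct backbuf *b, size_t rows, uint32_t bg)
{
    assert(rows <= b->height);
    memmove(b->pixels, b->pixels + rows * b->pitch,
            (b->height - rows) * b->pitch * sizeof(uint32_t));
    for (size_t i = (b->height - rows) * b->pitch; i < b->height * b->pitch; i++)
        b->pixels[i] = bg;
}

/* Copy the back buffer to the frame buffer; it is only read from RAM. */
static void blit(uint32_t *fb, size_t fb_pitch, const struct backbuf *b)
{
    for (size_t y = 0; y < b->height; y++)
        memcpy(fb + y * fb_pitch, b->pixels + y * b->pitch,
               b->width * sizeof(uint32_t));
}
```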
vvaltchev wrote:
Brendan wrote:Note that for this case (assuming that the buffer in RAM is larger than "bare minimum" - e.g. maybe 3200x6000 if the video mode is 3200x1800; and assuming that there's no scroll bar or menus or status bar or...) the scrolling itself should cost a total of literally nothing.
How would you achieve that? A scroll requires both loads and stores. Loads from the framebuffer are insanely slow. With the 2nd buffer in RAM it's much better, but overall it's still slower than a simple redraw. As far as I know, the Linux kernel also uses a full redraw strategy for the console.
This is an old trick. Imagine if you had a 123456 * 123456 buffer in RAM. When copying from the buffer in RAM to the frame buffer you could choose any 3200x1800 area within that larger buffer to copy to the frame buffer; so without redrawing anything (just by doing the "copy from buffer to frame buffer" and nothing else) you could scroll horizontally and vertically wherever you like.
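A small sketch of that idea, assuming 32 bits per pixel and hypothetical names: the scroll position is just the (x0, y0) origin passed to the blit, so scrolling means changing two numbers and copying again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a view_w x view_h window, starting at (x0, y0) inside a larger
 * big_w x big_h buffer in RAM, to the frame buffer. Pitches are in pixels.
 */
static void blit_viewport(uint32_t *fb, size_t fb_pitch,
                          const uint32_t *big, size_t big_w, size_t big_h,
                          size_t x0, size_t y0, size_t view_w, size_t view_h)
{
    assert(x0 + view_w <= big_w && y0 + view_h <= big_h);
    for (size_t y = 0; y < view_h; y++)
        memcpy(fb + y * fb_pitch, big + (y0 + y) * big_w + x0,
               view_w * sizeof(uint32_t));
}
```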

Of course most video cards have built-in support for this trick - if you configure things right you can have (e.g.) an 8192 * 8192 frame buffer and then change a single "display start" register to change which part of the frame buffer is sent to the monitor. Unfortunately you need native drivers to make use of it.

Note: "Linux does it like this" is almost always proof that something is bad. There's probably only one small piece of the entire kernel (their high resolution timers) that doesn't suck.
vvaltchev wrote:
Brendan wrote:When scrolling the screen by a whole character the gaps between lines of characters remain in the same place so you end up with about 10% of pixels that don't change colour because of that (more if a lot of characters are lower case), and with a fixed width font the same happens for gaps between rows of characters, and sometimes you'll get lucky and some characters won't change (e.g. "The" on one line and "There" on the line below means three whole characters remain the same when you scroll), and typically there's lots of white space (at the ends of lines, at the start of lines if there's an indented list, etc). In other words the amount of data you need to change in the frame buffer is probably less than 50% of all pixels; so it should've cost you zero cycles for the scrolling and then ~3.3 million cycles to update the screen; so your code is probably more than twice as slow as it could be.
OK, I got that, but it seems very tricky. You'd also have to handle special characters like the full block, where the whole character area (8x16 or 16x32) is filled with a single color. Are you sure that the overhead of that tricky code won't throw away (most of) the gain from the reduced stores to the framebuffer? I agree that theoretically something like that could be done, but I'm not sure how big the benefit would be in practice. It depends a lot on how many stores you can skip and how much it costs to skip them. Have you written code like that? I'd be very curious to see a console using such strategies in practice.
In practice it depends on too many things (how it's implemented, what size stores you're using, if the video is integrated or discrete, how much else you do while blitting, ...).
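To make the "only write what changed" idea concrete, here is a deliberately naive sketch. It assumes a hypothetical shadow copy in RAM of what is currently on screen, so that the frame buffer itself is never read. Per-pixel compares and 4-byte stores are only for illustration; whether a real version wins depends on the factors listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write only the pixels that differ from `shadow` (a RAM copy of what is
 * already displayed) to the frame buffer. Returns the number of stores made.
 */
static size_t blit_changed(uint32_t *fb, uint32_t *shadow,
                           const uint32_t *next, size_t count)
{
    size_t stores = 0;
    assert(fb != NULL && shadow != NULL && next != NULL);
    for (size_t i = 0; i < count; i++) {
        if (shadow[i] != next[i]) {
            shadow[i] = next[i];  /* keep the shadow in sync with the screen */
            fb[i] = next[i];      /* the only access to the frame buffer */
            stores++;
        }
    }
    return stores;
}
```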
vvaltchev wrote:
Brendan wrote:Oh my... You're planning to use a modern 4K display capable of millions of colours to emulate an 80*25 monochrome terminal from the 1970s?
Aahahhaha :-)
Not exactly. On 3200x1800 using a 16x32 font + my banner, I have 200x54 characters (200x56 without the banner) on screen.
16 colors, VGA style :-)
So; on a relatively average ~400 mm wide screen each character is going to be about 2 mm wide and most people won't be able to read any of the text because you failed to bother with basic resolution independence?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: Unable to mark a memory region as WC using MTRRs [solved]

Post by vvaltchev »

Brendan wrote: Is it true on some but not others?
No, I meant that on all of my machines the behavior is consistent.
Brendan wrote: I'm much more used to graphics that wasn't obsolete decades ago - e.g. things like GUIs with device independence and anti-aliasing and proportional Unicode fonts. I assumed your console was something that would only be used during boot/kernel initialisation, up until you started a GUI or something (and that later on you'd probably hide it under a nice splash screen so that normal people don't claw their own eyes out while the OS boots). I certainly didn't expect that "horrifically retro" was the only thing you would ever want your OS to be used for.
Well, my goal is not to build (or try to build) a desktop OS that everybody could use.
Such a goal would be far too ambitious and unreasonable today. Not even Linux has succeeded in doing that, even if "the world is running on it". What I'm trying to do is create a simple (and as deterministic as possible) kernel able to run Linux applications natively and, in the future, offer some unique extensions. It would be something much closer to the embedded world than to the desktop/server one. That's also why I don't think it will ever support SMP or native video drivers. It is designed for engineers or OS development researchers who might want to try something new (that's why it was called experimentOs before) using a smaller code base than the Linux kernel, while still being able to run mainstream Linux applications on it. Most "hobbyist operating systems" are compatible only with themselves and that, in my opinion, seriously prevents them from spreading and being used by more people.
Brendan wrote:
vvaltchev wrote:Also, UEFI uses only 32-bit modes (EFI_GRAPHICS_OUTPUT_BLT_PIXEL is a struct with size = 4) therefore I decided to stick with that.
No; UEFI supports "everything" using a system of pixel masks.
You're completely right here. My bad (memory). My UEFI bootloader just discards the video modes where PixelFormat != PixelBlueGreenRedReserved8BitPerColor. I just don't wanna support any generic PixelFormat. Not a UEFI limitation.
Brendan wrote: This is an old trick. Imagine if you had a 123456 * 123456 buffer in RAM. When copying from the buffer in RAM to the frame buffer you could choose any 3200x1800 area within that larger buffer to copy to the frame buffer; so without redrawing anything (just by doing the "copy from buffer to frame buffer" and nothing else) you could scroll horizontally and vertically wherever you like.
Actually with my 2nd buffer I was doing exactly this and it was still much slower (~3 times) than redrawing the whole screen from the character buffer. Why? In my opinion, the 2nd buffer is still huge compared to the cache (consider 3200x1800x4 = ~22 MB). Reading it causes a lot of cache misses and completely pollutes the cache. Even if I could avoid the cache pollution by using non-temporal loads, I'd still have to pay the price for reading from that buffer.
Even in the best case, where reading from that buffer is as fast as writing to WC memory, it would still be 2x slower than the simple redraw. The reason is that the "whole overhead" of generating the characters with my implementation is something like 7%-8% of the cycles necessary to redraw the screen. Therefore it is far more convenient to pay 1x for the redraw + 0.1x [let's say 10% overhead] for the character generation, instead of paying 2x for a trivial and "fast" copy. It is common in computer science today to find cases where it is better to regenerate something each time than to cache it in memory, because memory is so slow compared to the CPU.

Brendan wrote: Of course most video cards have built-in support for this trick - if you configure things right you can have (e.g.) an 8192 * 8192 frame buffer and then change a single "display start" register to change which part of the frame buffer is sent to the monitor. Unfortunately you need native drivers to make use of it.
Exactly, I'm avoiding native video drivers. But in theory yes, a real double buffer in the video card would be by far better.
Brendan wrote: Note: "Linux does it like this" is almost always proof that something is bad. There's probably only one small piece of the entire kernel (their high resolution timers) that doesn't suck.
OK, I got it :-) We have a completely different approach towards writing code and OS development in general. I don't believe Linux is perfect, but I think it is an outstanding and rock-solid world-class operating system. Most of the servers in the whole world run it, as well as billions of embedded devices.
Brendan wrote: So; on a relatively average ~400 mm wide screen each character is going to be about 2 mm wide and most people won't be able to read any of the text because you failed to bother with basic resolution independence?
Actually I was thinking of adding support for 32x64 fonts in the future. Today Tilck supports only 8x16 and 16x32 fonts and it automatically chooses the font depending on the resolution. Simple algorithm: start with 8x16. If the screen is going to contain 160 or more characters horizontally, use the 16x32 font. Therefore I could just add another case for even bigger screens, but that's not really the focus of my work. Having a graphical console is something I introduced only because on pure UEFI systems there is no text mode. Later I realized that since I had to support a graphical console, just for fun, it would also be cool to allow user applications to use the framebuffer and be able to show images (run the Linux "fbi" program, for example). That feature is not ready yet, but I plan to implement it, mostly because it is cool.
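The font selection rule described above can be written down in a few lines. This is only a sketch with hypothetical names, not Tilck's actual code, and it assumes that the "160 or more characters" test is made using the 8x16 font.

```c
#include <assert.h>

struct font_size { unsigned width, height; };

/*
 * Start with 8x16; switch to 16x32 when the 8x16 font would put 160 or
 * more characters on one line (assumed reading of the rule).
 */
static struct font_size choose_font(unsigned screen_width)
{
    struct font_size f = { 8, 16 };
    assert(screen_width > 0);
    if (screen_width / f.width >= 160) {
        f.width = 16;
        f.height = 32;
    }
    return f;
}
```

With a 3200x1800 mode this picks 16x32, which gives the 200 columns and 56 rows mentioned earlier in the thread.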

Vlad
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck