Brendan wrote:
The part that still doesn't make sense to me is that SSE or AVX using non-temporal stores should be using the write-combining buffers and should give exactly the same performance regardless of whether the area is UC or WC in MTRRs or PAT.

vvaltchev wrote:
That's the same thing I initially thought, but it turns out not to be true on all of my machines.

Is it true on some but not others?
Note: As far as I can tell the UDOO's CPU is a "Braswell" (an Atom derivative). These CPUs are specifically designed for low power embedded systems and cut lots of corners that Intel wouldn't consider for their normal (laptop/desktop/server) chips.
vvaltchev wrote:
My current understanding is that stores with a non-temporal hint affect only write-back memory and they force the CPU to bypass the cache and just write directly to the main memory. The gain is evident because it keeps the CPU from throwing away precious cache data for something that we're not going to read any time soon. But in my case, for UC memory, that hint has no effect since the CPU already bypasses the cache. Each store goes directly to the RAM and the CPU busy-waits until the store is completed. I got the idea of using non-temporal stores from: https://software.intel.com/en-us/articl ... me-buffers but that article explains the opposite: how to READ from a framebuffer efficiently, not how to write to it. I wanted to try the non-temporal stores anyway because I thought the framebuffer was using write-back memory. After I added support for MTRRs, I realized that it was not.

My understanding is:
- non-temporal loads from WB go directly to L1 cache (bypassing and not polluting L2 and L3 caches)
- non-temporal loads from UC probably do nothing (no different to normal loads)
- normal stores to WT or UC go to a "store queue" where CPU can keep going but has to wait for stores to be finished before doing subsequent loads (which is why "single register doing moves" is bad - you want to load 16 registers full of data then store 16 registers full of data, to minimise the "loads have to wait until stores are done" problem).
- non-temporal stores go to write-combining buffers (regardless of WB, WT, WC or UC) and there is no "loads have to wait until stores are done" problem; and then (when data is evicted or drained from write-combining buffers) the CPU stores up to 32 bytes or 64 bytes at once (because it combined multiple smaller stores), which helps because memory and PCI are more like a network (e.g. each store probably has a ~10 byte header, the data, then a footer, so increasing the size of the write means "more bytes with same overhead" or "same bytes with less overhead", which works out to better bandwidth)
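As a rough illustration of the "load a batch of registers, then store a batch of registers" pattern with non-temporal stores, here is a minimal sketch using the SSE2 intrinsics from `<emmintrin.h>`. The function name `nt_copy` and its alignment/size restrictions are assumptions made to keep the example short, and it falls back to `memcpy()` when SSE2 isn't available. Whether the hint is actually honoured can depend on the CPU and on the memory type of the destination, so measure rather than assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Copy len bytes from src to dst, using non-temporal stores when SSE2 is
 * available. Assumed restrictions (hypothetical, for brevity): dst is
 * 16-byte aligned and len is a multiple of 64. */
void nt_copy(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 63) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i += 4) {
        /* Do all the loads first ... */
        __m128i r0 = _mm_loadu_si128(s + i + 0);
        __m128i r1 = _mm_loadu_si128(s + i + 1);
        __m128i r2 = _mm_loadu_si128(s + i + 2);
        __m128i r3 = _mm_loadu_si128(s + i + 3);
        /* ... then all the stores, so 64 contiguous bytes are written
         * back to back and can be combined into one larger write. */
        _mm_stream_si128(d + i + 0, r0);
        _mm_stream_si128(d + i + 1, r1);
        _mm_stream_si128(d + i + 2, r2);
        _mm_stream_si128(d + i + 3, r3);
    }
    _mm_sfence();  /* order the non-temporal stores before later stores */
#else
    memcpy(dst, src, len);
#endif
}
```

In a kernel you'd also have to make sure the SSE state is usable (and saved/restored) before calling something like this.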
Brendan wrote:
Note that 32 bits per pixel means that 25% of the bytes written to frame buffer do nothing more than waste bandwidth. With 24 bits per pixel you should be able to get it 1.333 times faster (about 0.85 cycles per pixel for the WC case).
Of course for rendering pixels it's faster to have each pixel nicely aligned; which means that the fastest method is to do all the rendering with 32 bits per pixel and then convert from 32 bits per pixel to 24 bits per pixel as part of copying data from your buffer in RAM to the frame buffer.

vvaltchev wrote:
Yeah I get that, but I don't have my 2nd buffer anymore, because it's faster without it; I'd avoid re-introducing it only because of the 24-bit case.
OK, maybe with a lot of tricks I might gain something even without the 2nd buffer, by packing the pixels locally, but that's also additional code and I'm not sure what the odds are of achieving a meaningful improvement.

I'm much more used to graphics that wasn't obsolete decades ago - e.g. things like GUIs with device independence and anti-aliasing and proportional Unicode fonts. I assumed your console was something that would only be used during boot/kernel initialisation, up until you started a GUI or something (and that later on you'd probably hide it under a nice splash screen so that normal people don't claw their own eyes out while the OS boots). I certainly didn't expect that "horrifically retro" was the only thing you would ever want your OS to be used for.
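For the "render at 32 bits per pixel, convert to 24 bits per pixel while copying" idea in this exchange, a minimal sketch of the per-row conversion might look like the following. The function name and the assumed layouts (source pixels as 0x00RRGGBB, destination bytes in blue, green, red order) are illustrative only; a real video mode may want a different order. Each pixel goes from 4 bytes to 3 bytes, which is where the 4/3 = 1.333 figure comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one row of 32 bpp pixels (assumed 0x00RRGGBB) into 24 bpp
 * (assumed byte order: blue, green, red). dst must have room for
 * pixels * 3 bytes. */
void pack_row_32_to_24(uint8_t *dst, const uint32_t *src, size_t pixels)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < pixels; i++) {
        uint32_t p = src[i];
        dst[3 * i + 0] = (uint8_t)(p);        /* blue  */
        dst[3 * i + 1] = (uint8_t)(p >> 8);   /* green */
        dst[3 * i + 2] = (uint8_t)(p >> 16);  /* red   */
    }
}
```

For a real frame buffer you'd probably pack into a small row buffer in RAM first and then copy that row with wide stores, rather than writing single bytes to the frame buffer.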
vvaltchev wrote:
Also, UEFI uses only 32-bit modes (EFI_GRAPHICS_OUTPUT_BLT_PIXEL is a struct with size = 4), therefore I decided to stick with that.

No; UEFI supports "everything" using a system of pixel masks. For example, for 16 bits per pixel it might say "RedMask = 0x0000001F, GreenMask = 0x000007E0, BlueMask = 0x0000F800, ReservedMask = 0x00000000", and for 24 bits per pixel it might say "RedMask = 0x000000FF, GreenMask = 0x0000FF00, BlueMask = 0x00FF0000, ReservedMask = 0x00000000". However, the UEFI blitting function uses an abstract pixel format (which is a 32 bit per pixel format) and converts this abstract format into whatever the video mode actually wants, so that it's easy for boot code and UEFI applications to support any video mode (with any pixel format) because UEFI does the conversion for you. Of course after you exit boot services you can't use any of that (and I'd recommend doing the same as UEFI does - do all the rendering in an abstract pixel format and convert that to whatever the video mode actually wants while blitting).
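To make the "abstract format converted via pixel masks" idea concrete, here is a small sketch. The struct `pixel_masks` and the function names are hypothetical (they only mirror the four masks mentioned above, not any actual UEFI type), the abstract pixel is assumed to be 0x00RRGGBB, and each mask is assumed to be a contiguous run of bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four masks a video mode reports. */
struct pixel_masks {
    uint32_t red, green, blue, reserved;
};

/* Bit position of the lowest set bit in a mask (0 for an empty mask). */
static unsigned mask_shift(uint32_t m)
{
    unsigned s = 0;
    if (m == 0)
        return 0;
    while ((m & 1u) == 0) {
        m >>= 1;
        s++;
    }
    return s;
}

/* Number of set bits in a mask. */
static unsigned mask_bits(uint32_t m)
{
    unsigned n = 0;
    while (m != 0) {
        n += m & 1u;
        m >>= 1;
    }
    return n;
}

/* Scale an 8-bit channel to the width of the mask and move it into place. */
static uint32_t place_channel(uint8_t v, uint32_t mask)
{
    unsigned bits = mask_bits(mask);
    uint32_t c;
    if (bits == 0)
        return 0;
    c = (bits >= 8) ? ((uint32_t)v << (bits - 8)) : ((uint32_t)v >> (8 - bits));
    return (c << mask_shift(mask)) & mask;
}

/* Convert one abstract pixel (assumed 0x00RRGGBB) to the mode's format. */
uint32_t convert_pixel(uint32_t abstract, const struct pixel_masks *m)
{
    assert(m != 0);
    return place_channel((uint8_t)(abstract >> 16), m->red)
         | place_channel((uint8_t)(abstract >> 8), m->green)
         | place_channel((uint8_t)abstract, m->blue);
}
```

In practice you'd precompute the shifts and widths once per video mode instead of recalculating them for every pixel.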
Brendan wrote:
What performance do you get if you scroll up by one pixel (and not by a whole character)?

vvaltchev wrote:
My console does not support that: it works with rows. The fastest scroll I have now is just a plain redraw of the scrolled screen, using as source a buffer of characters (not pixels). I tried an "image scroll" using the framebuffer and it's terribly slow (because reading from the framebuffer is terribly slow).

Reading from the frame buffer is terribly slow; but reading from a buffer in RAM is not. That's how most beginners do things - e.g. scrolling becomes a "memmove()" from one place in the buffer in RAM to another place in the same buffer in RAM, followed by drawing a single row of characters at the top or bottom of the buffer, and then finishing by blitting the buffer in RAM to the frame buffer.
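A minimal sketch of that "memmove() in RAM, then blit" approach, assuming a 32 bits per pixel buffer; the `shadow` struct and both function names are made up for illustration, and filling the freed rows with a background colour stands in for drawing the new row of characters:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer in RAM that mirrors the screen contents. */
struct shadow {
    uint32_t *pixels;   /* width * height pixels, no padding */
    size_t width;
    size_t height;
};

/* Scroll the buffer in RAM up by `rows` pixel rows; the rows freed at the
 * bottom are filled with `bg`, ready for new text to be drawn there. */
void scroll_up(struct shadow *s, size_t rows, uint32_t bg)
{
    assert(rows <= s->height);
    size_t keep = s->height - rows;

    /* All reads and writes here hit normal (cached) RAM. */
    memmove(s->pixels, s->pixels + rows * s->width,
            keep * s->width * sizeof(uint32_t));
    for (size_t i = keep * s->width; i < s->height * s->width; i++)
        s->pixels[i] = bg;
}

/* Copy the buffer in RAM to the frame buffer; the frame buffer is only
 * ever written, never read. fb_pitch is in pixels, not bytes. */
void blit_shadow(uint32_t *fb, size_t fb_pitch, const struct shadow *s)
{
    assert(fb_pitch >= s->width);
    for (size_t y = 0; y < s->height; y++)
        memcpy(fb + y * fb_pitch, s->pixels + y * s->width,
               s->width * sizeof(uint32_t));
}
```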
Brendan wrote:
Note that for this case (assuming that the buffer in RAM is larger than "bare minimum" - e.g. maybe 3200x6000 if the video mode is 3200x1800; and assuming that there's no scroll bar or menus or status bar or...) the scrolling itself should cost a total of literally nothing.

vvaltchev wrote:
How would you achieve that? A scroll requires both loads and stores. Loads from the framebuffer are insanely slow. With the 2nd buffer in RAM, it's much better, but still overall it's slower than a simple redraw. The Linux kernel uses a full redraw strategy as well for the console, as far as I know.

This is an old trick. Imagine if you had a 123456 * 123456 buffer in RAM. When copying from the buffer in RAM to the frame buffer you could choose any 3200x1800 area within that larger buffer to copy to the frame buffer; so without redrawing anything (just by doing the "copy from buffer to frame buffer" and nothing else) you could scroll horizontally and vertically wherever you like.
Of course most video cards have built-in support for this trick - if you configure things right you can have (e.g.) an 8192 * 8192 frame buffer and then change a single "display start" register to change which part of the frame buffer is sent to the monitor. Unfortunately you need native drivers to make use of it.
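A software version of the trick could be sketched like this (the function name and parameters are hypothetical, 32 bits per pixel assumed, pitches in pixels). Scrolling then means changing `x0`/`y0` before the next blit, with no redraw of the larger buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a view_w x view_h window, whose top-left corner is at (x0, y0)
 * inside a larger big_w x big_h buffer in RAM, to the frame buffer. */
void blit_viewport(uint32_t *fb, size_t fb_pitch,
                   size_t view_w, size_t view_h,
                   const uint32_t *big, size_t big_w, size_t big_h,
                   size_t x0, size_t y0)
{
    /* The window must lie completely inside the larger buffer. */
    assert(x0 + view_w <= big_w && y0 + view_h <= big_h);

    for (size_t y = 0; y < view_h; y++)
        memcpy(fb + y * fb_pitch,
               big + (y0 + y) * big_w + x0,
               view_w * sizeof(uint32_t));
}
```

When the window reaches the end of the larger buffer you'd still need to move or redraw content once, so in practice the cost is "nothing most of the time" rather than never.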
Note: "Linux does it like this" is almost always proof that something is bad. There's probably only one small piece of the entire kernel (their high resolution timers) that doesn't suck.
Brendan wrote:
When scrolling the screen by a whole character the gaps between lines of characters remain in the same place so you end up with about 10% of pixels that don't change colour because of that (more if a lot of characters are lower case), and with a fixed width font the same happens for gaps between rows of characters, and sometimes you'll get lucky and some characters won't change (e.g. "The" on one line and "There" on the line below means three whole characters remain the same when you scroll), and typically there's lots of white space (at the ends of lines, at the start of lines if there's an indented list, etc). In other words the amount of data you need to change in the frame buffer is probably less than 50% of all pixels; so it should've cost you zero cycles for the scrolling and then ~3.3 million cycles to update the screen; so your code is probably more than twice as slow as it could be.

vvaltchev wrote:
OK, I got that but it seems very tricky... Also, you'd have to handle the case of special characters like the full blank block, where the whole character area (8x16 or 16x32) is filled with a single color. Are you sure that the overhead of that tricky code won't throw away (most of) the gain from the reduced stores to the framebuffer? I agree that theoretically something like that could be done, but I'm not sure how big the benefit will be in practice. It depends a lot on how many stores you can skip and how much doing that costs. Have you written code like that? I'd be very curious to see a console using such strategies in practice.

In practice it depends on too many things (how it's implemented, what size stores you're using, if the video is integrated or discrete, how much else you do while blitting, ...).
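One simple way to "only write what changed" is to keep a copy in RAM of what the frame buffer currently shows and compare against it while blitting. This is only a sketch under assumptions (the function name is made up, 32 bits per pixel, pixel-granularity runs), not something tuned or measured; the comparison itself costs reads from RAM, which is exactly the trade-off being argued about here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write only the pixels that differ between `prev` (what the frame buffer
 * currently shows, kept in RAM) and `next` (the new frame). Both the frame
 * buffer and `prev` are updated. Returns the number of pixels written. */
size_t blit_changed(uint32_t *fb, uint32_t *prev, const uint32_t *next,
                    size_t count)
{
    size_t written = 0;
    size_t i = 0;

    assert(fb != NULL && prev != NULL && next != NULL);
    while (i < count) {
        if (prev[i] == next[i]) {   /* unchanged: no frame buffer store */
            i++;
            continue;
        }
        /* Find the end of this run of changed pixels ... */
        size_t start = i;
        while (i < count && prev[i] != next[i])
            i++;
        /* ... and write the whole run in one go. */
        size_t n = i - start;
        memcpy(fb + start, next + start, n * sizeof(uint32_t));
        memcpy(prev + start, next + start, n * sizeof(uint32_t));
        written += n;
    }
    return written;
}
```

Comparing in larger units (e.g. whole character cells or 64-byte chunks) would reduce the bookkeeping at the cost of writing some unchanged pixels.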
Brendan wrote:
Oh my... You're planning to use a modern 4K display capable of millions of colours to emulate an 80*25 monochrome terminal from the 1970s?

vvaltchev wrote:
Aahahhaha
Not exactly. On 3200x1800 using a 16x32 font + my banner, I have 200x54 characters (200x56 without the banner) on screen.
16 colors, VGA style

So; on a relatively average ~400 mm wide screen each character is going to be about 2 mm wide and most people won't be able to read any of the text because you failed to bother with basic resolution independence?
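The numbers in this exchange are just integer divisions; as a rough check (the struct and function names are made up for illustration):

```c
#include <assert.h>

/* Hypothetical result type: how many character cells fit on screen. */
struct text_grid {
    unsigned cols;
    unsigned rows;
};

/* Whole character cells that fit in a screen_w x screen_h pixel mode
 * with a fixed font_w x font_h pixel font. */
struct text_grid grid_for(unsigned screen_w, unsigned screen_h,
                          unsigned font_w, unsigned font_h)
{
    struct text_grid g;
    assert(font_w > 0 && font_h > 0);
    g.cols = screen_w / font_w;
    g.rows = screen_h / font_h;
    return g;
}

/* Physical width of one character cell on a screen of the given width. */
double glyph_width_mm(double screen_width_mm, unsigned cols)
{
    assert(cols > 0);
    return screen_width_mm / cols;
}
```

For 3200x1800 with a 16x32 font that gives 200 columns and 56 rows, and 400 mm / 200 columns is 2 mm per character.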
Cheers,
Brendan