Doublebuffering

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.

Post by Jeko »

I'm writing a VESA driver. In your opinion, should I implement double buffering in my driver, or is double buffering something that only applications use?

Post by Pyrofan1 »

I believe double buffering is only used by applications.

Post by hendric »

In my opinion you'd better add double-buffering capability to your driver; whether to use it can be left up to the applications :)
Just Lazy Writing Anything...

Post by pcmattman »

ATM, my VESA driver automatically does all the double buffering it needs to do, for the entire screen. How else can you get the performance you need?

Note: dirty rectangles + a double buffer would probably be better.
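
For illustration, a dirty rectangle scheme can be as simple as tracking one bounding rectangle per frame and only copying those rows from the back buffer to the LFB. This is just a rough sketch (the names, the single-rectangle approach and the 32 bpp assumption are mine, not code from my driver):

Code: Select all

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* One dirty rectangle per frame; drawing code grows it, the per-frame
       flush copies only that region from the back buffer to the LFB. */
    typedef struct { int x0, y0, x1, y1; int valid; } DirtyRect;

    static DirtyRect dirty = { 0, 0, 0, 0, 0 };

    /* Called by drawing code whenever pixels change */
    void mark_dirty(int x0, int y0, int x1, int y1) {
        if (!dirty.valid) {
            dirty.x0 = x0; dirty.y0 = y0; dirty.x1 = x1; dirty.y1 = y1;
            dirty.valid = 1;
        } else {
            if (x0 < dirty.x0) dirty.x0 = x0;
            if (y0 < dirty.y0) dirty.y0 = y0;
            if (x1 > dirty.x1) dirty.x1 = x1;
            if (y1 > dirty.y1) dirty.y1 = y1;
        }
    }

    /* Called once per frame: copy only the rows/columns that changed */
    void flush_dirty(uint32_t *front, const uint32_t *back, int pitchInPixels) {
        if (!dirty.valid)
            return;
        for (int y = dirty.y0; y <= dirty.y1; y++) {
            size_t offset = (size_t)y * pitchInPixels + dirty.x0;
            size_t pixels = (size_t)(dirty.x1 - dirty.x0 + 1);
            memcpy(&front[offset], &back[offset], pixels * sizeof(uint32_t));
        }
        dirty.valid = 0;
    }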

Post by Brendan »

Hi,
pcmattman wrote:ATM, my VESA driver automatically does all the double buffering it needs to do, for the entire screen. How else can you get the performance you need?

Note: dirty rectangles + a double buffer would probably be better.
Dirty rectangles would help.

Page flipping and bus mastering bitblits can't be done using VESA/VBE, so that won't help.

The only other things I can think of are cache management (on the CPU's side) and "transfer management" (on the device side).

For "transfer management" the idea is to minimise the number of transfers across the bus by doing large transfers instead of small ones (e.g. transfer 64-bytes at a time, rather than doing 64 single byte transfers) and improving the speed of each transfer (e.g. increasing/setting the AGP transfer rate).

For information about setting the AGP transfer rate, see this wiki page.

For increasing the size of the transfer across the bus, you could use "write combining" (if supported), and MMX/SSE (if supported) or 32-bit transfers.
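
As a rough illustration of "64 bytes at a time" (not something from the post above - the function name is made up, and it assumes SSE2, 16-byte aligned buffers and a size that's a multiple of 64 bytes), a copy loop using SSE2 streaming stores might look like this:

Code: Select all

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Sketch only: copy 64 bytes per iteration with non-temporal stores, so the
       writes can leave the CPU as full cache-line-sized bursts instead of many
       small transfers. In a kernel you'd also have to manage FPU/SSE state. */
    void copy_to_lfb_sse2(void *lfb, const void *src, size_t bytes) {
        __m128i *dst = (__m128i *)lfb;
        const __m128i *s = (const __m128i *)src;

        for (size_t i = 0; i < bytes / 64; i++) {
            __m128i a = _mm_load_si128(&s[0]);
            __m128i b = _mm_load_si128(&s[1]);
            __m128i c = _mm_load_si128(&s[2]);
            __m128i d = _mm_load_si128(&s[3]);
            _mm_stream_si128(&dst[0], a);   /* non-temporal store - bypasses the cache */
            _mm_stream_si128(&dst[1], b);
            _mm_stream_si128(&dst[2], c);
            _mm_stream_si128(&dst[3], d);
            dst += 4;
            s += 4;
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }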

For cache management there are two things to consider - getting data into the CPU, and then trying to prevent that data from pushing everything else out of the cache.

Getting data into the CPU means arranging the data so that a single cache line holds as much data as possible (you wouldn't want to read the first byte of each cache line and throw the rest away, as this means the CPU needs to load 64-bytes from RAM for each byte you use) and making sure the data is in the CPU before the CPU needs it (otherwise the CPU needs to wait for the data to arrive from RAM). It's likely that the data in your double buffer is already in the best format it can be, which means you only need to worry about getting data into the CPU before it's needed (prefetching).

For prefetching, the CPU has a hardware prefetcher that automatically prefetches data from RAM. However, the hardware prefetcher sucks a bit. For example, if you're reading data from RAM sequentially, the hardware prefetcher usually won't start prefetching until you get 2 cache misses, and it'll stop prefetching when it gets to the end of a 4 KB page. This alone means that (if you're sequentially reading a large amount of data from RAM) you get 2 cache misses per 4 KB of data. The hardware prefetcher may also have bad timing - it can prefetch too late so that the CPU still needs to wait. If you sequentially read half a page, then the hardware prefetcher will also prefetch 2 cache lines (past the end of the data) that are never used.

The "prefetch" instruction can be used to ask the CPU to prefetch some data into it's cache, however prefetch does nothing if there's a TLB miss. To avoid that there's a technique called "preloading". Preloading means using an instruction that accesses RAM and then not using the results of the instruction (to minimize dependancies). For example, you could do something like "test [theAddress],1" which forces the CPU to load the TLB entry for "theAddress" and load the cache line for "theAddress", before you actually need them.

Basically, you'd use preloading for the first address in a page, then use prefetching (or preloading if prefetching isn't supported) for the remaining addresses in a page.
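
For example, the sort of preload()/prefetch() helpers the code further down assumes could be sketched like this in GCC-style C (just an illustration, not the only way to do it):

Code: Select all

    /* Sketch of preload()/prefetch() helpers (GCC-style C) */
    static inline void preload(const void *address) {
        /* A dummy volatile read (like "test [theAddress],1") forces the CPU to
           load the TLB entry and the cache line without creating a dependency
           on the result */
        (void)*(volatile const unsigned char *)address;
    }

    static inline void prefetch(const void *address) {
        /* GCC built-in; emits a prefetch hint (e.g. prefetchnta) on CPUs that
           support one. Prefetch hints are dropped on a TLB miss. */
        __builtin_prefetch(address, 0, 0);
    }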

Doing the prefetching yourself isn't easy though, as it can depend on CPU and RAM timing, cache sizes, and the access pattern. Excessive prefetching can also reduce performance - e.g. you can push data you need out of the cache by prefetching too much, and the data you prefetched can be pushed out of the cache before you need it.

Lastly, you can try to prevent the data you load into the cache from pushing everything else out of the cache. The CPU uses a "most recently used" algorithm, which doesn't work so well if you access data once and won't need that data again. The idea is to load the data into the cache, use the data, then invalidate the cache line when you're finished with the data to free up space in the cache. This prevents things you will need again from being pushed out of the cache and reduces the problems caused by "excessive prefetching". The only way to do this is the "CLFLUSH" instruction (if present).
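
For example, a cacheFlush() helper (as used in the code below) could just wrap CLFLUSH - this is only a sketch, and you should only call it if CPUID reports the CLFSH feature flag:

Code: Select all

    #include <emmintrin.h>   /* _mm_clflush() */

    /* Sketch of the cacheFlush() helper used below */
    static inline void cacheFlush(const void *address) {
        _mm_clflush(address);   /* evict this line so it can't push other data out of the cache */
    }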

In the end you can end up with something like this:

Code: Select all

    /* Prefetch data for the first 4 cache lines in the first page */
    preload(address);   // Preload for TLB miss
    prefetch(address + cacheLineSize);
    prefetch(address + cacheLineSize * 2);
    prefetch(address + cacheLineSize * 3);

    /* Calculate control variables, etc. here while the CPU is getting data from RAM */

    /* Do all pages except the last one */
    for(page = 0; page < lastPage - 1; page++) {
        /* Do all cache lines except the last 4 */
        for(cacheLine = 0; cacheLine < lastCacheLine - 4; cacheLine++) {
            sendDataToVideo(address);
            cacheFlush(address);
            prefetch(address + cacheLineSize * 4);
            address += cacheLineSize;
        }
        /* Do the remaining 4 cache lines in this page */
        sendDataToVideo(address);
        cacheFlush(address);
        preload(address + cacheLineSize * 4);   // Preload for TLB miss (first line of the next page)
        address += cacheLineSize;
        sendDataToVideo(address);
        cacheFlush(address);
        prefetch(address + cacheLineSize * 4);
        address += cacheLineSize;
        sendDataToVideo(address);
        cacheFlush(address);
        prefetch(address + cacheLineSize * 4);
        address += cacheLineSize;
        sendDataToVideo(address);
        cacheFlush(address);
        prefetch(address + cacheLineSize * 4);
        address += cacheLineSize;
    }

    /* Do the last page */
    /* Do all cache lines except the last 4 */
    for(cacheLine = 0; cacheLine < lastCacheLine - 4; cacheLine++) {
        sendDataToVideo(address);
        cacheFlush(address);
        prefetch(address + cacheLineSize * 4);
        address += cacheLineSize;
    }
    /* Do the remaining 4 cache lines without prefetching more data */
    sendDataToVideo(address);
    cacheFlush(address);
    address += cacheLineSize;
    sendDataToVideo(address);
    cacheFlush(address);
    address += cacheLineSize;
    sendDataToVideo(address);
    cacheFlush(address);
    address += cacheLineSize;
    sendDataToVideo(address);
    cacheFlush(address);
Of course you'd probably need several different routines to copy data from the double buffer to display memory - one that only uses preloading and 32-bit transfers for old CPUs, one that uses preloading and MMX, one that uses preloading and prefetching, one that uses preloading/prefetching and cache flushing, etc.
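
For example, selecting the routine at boot from CPUID feature flags might look something like this (the routine names are made up for illustration; the feature bits are from CPUID leaf 1, EDX):

Code: Select all

    #include <cpuid.h>    /* GCC's __get_cpuid() */
    #include <stddef.h>

    typedef void (*blit_fn)(void *dst, const void *src, size_t bytes);

    /* Hypothetical routines implemented elsewhere in the driver */
    extern void blit_32bit_preload(void *dst, const void *src, size_t bytes);
    extern void blit_mmx_preload(void *dst, const void *src, size_t bytes);
    extern void blit_sse_prefetch(void *dst, const void *src, size_t bytes);
    extern void blit_sse2_prefetch_clflush(void *dst, const void *src, size_t bytes);

    blit_fn choose_blit_routine(void) {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return blit_32bit_preload;           /* very old CPU - plain 32-bit copies */

        if ((edx & (1u << 26)) && (edx & (1u << 19)))
            return blit_sse2_prefetch_clflush;   /* bit 26 = SSE2, bit 19 = CLFLUSH */
        if (edx & (1u << 25))
            return blit_sse_prefetch;            /* bit 25 = SSE */
        if (edx & (1u << 23))
            return blit_mmx_preload;             /* bit 23 = MMX */
        return blit_32bit_preload;
    }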


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by Jeko »

Brendan wrote:For "transfer management" the idea is to minimise the number of transfers across the bus by doing large transfers instead of small ones (e.g. transfer 64-bytes at a time, rather than doing 64 single byte transfers) and improving the speed of each transfer (e.g. increasing/setting the AGP transfer rate).

For information about setting the AGP transfer rate, see this wiki page.
For increasing the size of the transfer across the bus, you could use "write combining" (if supported), and MMX/SSE (if supported) or 32-bit transfers.
Thank you! I've already implemented setting the AGP transfer rate, setting Fast Writes and Sideband Addressing, and using MMX/SSE/SSE2 (it's very strange that in Bochs using MMX is faster than using SSE). Now I only need to implement prefetching and write-combining. How can I do this?

Post by Brendan »

Hi,
MarkOS wrote:Thank you! I've already implemented setting the AGP transfer rate, setting Fast Writes and Sideband Addressing, and using MMX/SSE/SSE2 (it's very strange that in Bochs using MMX is faster than using SSE). Now I only need to implement prefetching and write-combining. How can I do this?
For write-combining, first check if the CPU supports it. Pentium Pro and later do support it (I'm not sure when AMD started supporting it though). To enable it you need to find out where display memory is and then set that area to "write-combining" using the MTRRs (typically for LFB it means setting an unused variable range MTRR to cover the area).

There's also an alternative - for Pentium III and later you can use the PAT (Page Attribute Table) to select write-combining at the page level (e.g. in the page table entry), without messing about with MTRRs (which is good if you've got 4 video cards and only 2 free MTRRs). However, there's a bug in Pentium III CPUs where you end up with "uncached" even though the PAT says "write-combining".

For complete details on write-combining, see the "Memory Cache Control" chapter in your favourite Intel manual... ;)

For prefetching, there's not much I can say that wasn't in my previous post. For more information on this see the "IA32 Intel Architecture Optimization Reference Manual" (or AMD's equivalent) for the CPU/s you're optimizing for.


Cheers,

Brendan

Post by Jeko »

Brendan wrote:For write-combining, first check if the CPU supports it. Pentium Pro and later do support it (I'm not sure when AMD started supporting it though). To enable it you need to find out where display memory is and then set that area to "write-combining" using the MTRRs (typically for LFB it means setting an unused variable range MTRR to cover the area).
Do you have some code for setting an area to write-combining using MTRRs?

Post by Brendan »

Hi,
MarkOS wrote:
Brendan wrote:For write-combining, first check if the CPU supports it. Pentium Pro and later do support it (I'm not sure when AMD started supporting it though). To enable it you need to find out where display memory is and then set that area to "write-combining" using the MTRRs (typically for LFB it means setting an unused variable range MTRR to cover the area).
Do you have some code for setting an area to write-combining using MTRRs?
I normally don't give code, as people that do "cut & paste OS development" end up with a patchwork they don't understand and I'd rather help people write their own code (that they do understand). ;)

BTW there's a little information (and a little code) in this post about enabling MTRRs. I'd still recommend reading and understanding the relevant part of Intel's manual though...


Cheers,

Brendan

Post by Jeko »

Brendan wrote:I normally don't give code, as people that do "cut & paste OS development" end up with a patchwork they don't understand and I'd rather help people write their own code (that they do understand). ;)

BTW there's a little information (and a little code) in this post about enabling MTRRs. I'd still recommend reading and understanding the relevant part of Intel's manual though...
Thank you, Brendan!
I don't do "cut and paste OS development", but I think that reading source code is very helpful for understanding.