Page 1 of 1

Page Coloring

Posted: Fri Jan 26, 2018 7:45 pm
by josecm
I've heard about this algorithm page coloring but can't find any good resources that thoroughly explain it or describe an implementation. Can somebody point me to some?

Thanks in advance

Re: Page Coloring

Posted: Fri Jan 26, 2018 11:35 pm
by Brendan
Hi,
josecm wrote:I've heard about this algorithm page coloring but can't find any good resources that thoroughly explain it or describe an implementation. Can somebody point me to some?
The way caches work is that the address is split into 3 pieces - an offset within the cache line (lowest bits), an index (middle bits), and an ID (highest bits).

For a direct mapped cache the index tells you the only place that the cache line could be in the cache and the ID is used to check if it's the cache line you want or not. For example, you can think of it like this:

Code: Select all

    offset = address & 63;       // 64-byte cache lines
    index = (address >> 6) & 1023;    // 1024 entries = 64 KiB total cache
    ID = address >> 16;

    if(cache[index].tag == ID) {
        // cache hit
    } else {
        // cache miss
    }
For an N-way associative cache it's like N direct mapped caches in parallel. For example, you can think of it like this:

Code: Select all

    offset = address & 63;       // 64-byte cache lines
    index = (address >> 6) & 1023;   // 1024 entries = 64 KiB per set
    ID = address >> 14;

    for(N = 0; N < 4; N++) {               // 4-way associative = 256 KiB total cache
        if(cache[N][index].tag == ID) {
            // cache hit
        }
    }
   // cache miss
If the "index bits" happen to be the same for all of the addresses you use, then you can't use all of the cache - for a direct mapped cache you'd only be able to use a single cache line, and for an N-way associative cache you'd only be able to use N cache lines. That would cause lots of cache misses and be very bad for performance.

To make sure you can use all of the cache (to improve cache efficiency) you'd want to make sure that (for all of the addresses you use) the "index bits" are as different as possible. Without paging, this isn't much of a problem - software tends to use nice linear areas, and that causes the "index bits" to be different.

With paging, if caches use physical addresses (like they do on most CPUs) then some of the index bits depend on how the OS allocated physical pages. For the examples above, the index bits are bits 6 to 15, and (assuming 4 KiB pages) bits 12 to the highest bit depend on how the OS allocated physical pages; so bits 12 to 15 are index bits that depend on how the OS allocated physical pages. If the OS allocates pages "randomly" then a process can be unlucky, and bits 12 to 15 (of all physical addresses) might be the same for all addresses software uses, and the efficiency of the cache can be much worse because of this. Worse, because the OS allocates pages "randomly", there's nothing that software can do to avoid this - sometimes an executable will be lucky and get good performance, and sometimes the exact same executable will be unlucky and get bad performance.

"Cache colouring" is a way to fix that problem and ensure that software can always get the maximum cache efficiency. It works by making sure that the bits that matter in physical addresses (bits 12 to 15 for the examples I've been using) depend on the same bits in virtual addresses. More specifically; for the examples I've been using there are 4 bits that matter (bits 12 to 15), so there would be 16 "cache colours"; and when allocating a physical page the OS could do "colour = (virtual_address >> 12) % 16;" and allocate a physical page with that colour.

Of course when there's many tiny processes, it's likely that all of them will use the same area of their virtual address space (e.g. all executables might begin at virtual address 0x04000000), and that would mean that (with "simple cache colouring") all of these tiny processes would fight for the area of the cache (e.g. the first page of code for all processes would have the same virtual address and therefore would have the same index bits). To fix that you can use the process ID to influence the colour - for example (same example I've been using) you could do "colour = ((virtual_address >> 12) ^ process_ID) % 16;".

Note: To calculate the number of cache colours for a specific cache, rather than messing around with "index bits" you can divide the total size of the cache by the size of a page, then divide by the associativity. For example, for a "256 KiB, 4-way associative" cache (and 4 KiB page size) you get "256*1024 / 4*1024 / 4 = 16 colours". Also; it's a good idea to round the number of colours up to the nearest power of 2, so that you can use masking instead of modulo when calculating the colour to use for a page (e.g. "colour = ((virtual_address >> 12) ^ process_ID) & (cache_colours - 1);" instead of "colour = ((virtual_address >> 12) ^ process_ID) % cache_colours;").

There is one type of cache that I haven't mentioned yet - "fully associative" (where any cache line can be anywhere in the cache). A fully associative cache has no index bits, and therefore there's no need for cache colouring (all pages would be the same colour).

For benefits; for direct mapped caches (which are like "1-way associative" caches), for "random allocation" there's a relatively high chance of poor cache efficiency and doing cache colouring is therefore relatively effective (e.g. maybe 5% to 10% better performance on average); and as associativity increases the benefits diminish (e.g. maybe 2.5% to 5% better performance for 2-way associative, 1% to 2.5% better performance for 4-way associative, etc); until you get to fully associative (which is like "infinite-way associative") where there's no benefit at all.

However, it's not quite that simple - modern CPUs have multiple caches. In this case (because "too many colours" isn't much of a problem), you'd figure out the number of colours to use for each cache and then use whatever the highest number of colours was for everything. For example (for a real "Intel Core i7, ivy bridge" CPU):
  • L3 cache is "12 MiB, 16-way associative", so you'd want 256 colours (192 rounded up) for that
  • L2 cache is "256 KiB, 8-way associative", so you'd want 8 colours for that
  • L1 cache is "32 KiB, 8-way associative", so you'd only want 1 colour for that
  • 256 colours is the highest, so (for this CPU) you'd use 256 colours for everything
Of course for that specific CPU (because of the rounding, and because L3 has relatively high associativity anyway), it might be tempting to use 128 colours instead.

Alternatively (because code to obtain the characteristics of all the caches is painful - CPUID has become a horrible mess) you might be tempted to just use 256 colours for all CPUs.


Cheers,

Brendan