OSDev.org

Posted: **Sun Jan 02, 2011 11:56 pm**

Up to now I hadn't given this a lot of thought, just assuming that you have to use WBINVD (writeback invalidate) before saving to disk via the DMA so that all the data sitting in cache memory is written to main memory first, and INVD (invalidate) after loading from disk via the DMA so that when you try to read the new data you don't end up working with old garbage in the cache instead. I also realised that you would need to do WBINVD before loading data from disk too, because otherwise using INVD after the DMA has loaded new data into main memory would destroy any data at any other address that hasn't been written to main memory yet and is only sitting in cache memory.

Now I've been thinking about it a bit more and I've noticed a problem. It's not too bad if you're sending data to disk, because you only need to do WBINVD for that, but whenever you load data in via the DMA (or similar device), any writes to memory by the processor in between the WBINVD and INVD are in danger of being lost by not making it from cache to main memory when the INVD instruction is run, so it looks as if the processor more or less has to stop work throughout the entire process. It looks as if other processors on a multi-core machine will have to stop too for the same reason, so the whole machine grinds to a halt whenever you're loading data from any kind of storage device.

I've now found that there's a CLFLUSH instruction available on some machines (probably most and I would imagine all modern ones - strangely this instruction is missing from the copy of the instruction set which I have always used for reference, although I have seen it before in other documents). It looks as if CLFLUSH can get round the problem by writing back and invalidating specific areas of memory (and it only writes back those bytes that have been modified since being brought into the cache). I imagine that I should be using it before the DMA loads data in and again afterwards - the first time to clean up the cache so that none of it gets loaded in the second time on top of the data just loaded by the DMA. I'd like to know if that is actually the correct way to do things.

So, my questions are:-

(1) What should an OS do on a machine which lacks the CLFLUSH instruction? Perhaps simply not supporting that machine would be the best option.

(2) Should I be using CLFLUSH both before and after the DMA loads data?

(3) Should I be using CLFLUSH before saving data (rather than WBINVD)?

(4) How do I turn "0F AE /7" (CLFLUSH) into actual machine code numbers - the 0F AE part is easy (15 174), but does anyone know what the "/7" part is meant to be? I assume it's something to do with how the instruction knows where to find the address in the address line that it's to flush, the location of that address being held in a register such as EAX, but I can't find complete information on this.

(5) 8 bytes in memory are used to store the address to be flushed, but how does an actual address occupy those 8 bytes? It's only going to need four of them, or two in real mode. I can normally find out how to translate assembly stuff into machine code, but this one's not spelt out sufficiently well and I don't have an assembler to try it out with.

(I think I can see how to get the cache line size from CPUID, so it ought to be easy enough to work out how many times to repeat CLFLUSH to cover the memory range required, so I shouldn't need any help with that.)
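As a rough illustration of the arithmetic mentioned above, here is a minimal sketch in C. It assumes the raw EBX value from CPUID leaf 1 has already been obtained somehow (bits 15:8 hold the CLFLUSH line size in units of 8 bytes). The function names `clflush_line_size` and `clflush_count` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPUID leaf 1 reports the CLFLUSH line size in EBX bits 15:8,
 * in units of 8 bytes. The caller supplies the raw EBX value. */
static unsigned clflush_line_size(uint32_t cpuid1_ebx)
{
    return ((cpuid1_ebx >> 8) & 0xFFu) * 8u;
}

/* Number of CLFLUSH executions needed to cover [start, start + len).
 * A range that is not line-aligned can touch one more line than
 * len / line would suggest, so both ends are rounded down to a line. */
static size_t clflush_count(uintptr_t start, size_t len, unsigned line)
{
    assert(line != 0 && (line & (line - 1)) == 0); /* must be a power of two */
    if (len == 0)
        return 0;
    uintptr_t first = start & ~(uintptr_t)(line - 1);
    uintptr_t last  = (start + len - 1) & ~(uintptr_t)(line - 1);
    return (size_t)((last - first) / line) + 1;
}
```

For example, a 2-byte range starting at 0x103F with 64-byte lines straddles two lines, so it needs two flushes rather than one.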

Posted: **Mon Jan 03, 2011 12:24 pm**

Hi,

80x86 is "cache coherent". Part of what this means is that hardware automatically makes sure that all devices sharing a bus (e.g. the FSB) or a link (e.g. HyperTransport or QuickPath) see a consistent view of RAM; where "devices" includes CPUs (and CPU caches). Basically, there's no reason to INVD caches after the underlying RAM was modified by some other device or WBINVD before the underlying RAM is read by some other device (regardless of whether that "other device" is another CPU, or a legacy device using the legacy DMA controller, or a PCI device doing bus mastering, or anything else).

The WBINVD and INVD instructions mostly exist for when the CPU's caches can be wrong, which only happens in very rare cases (e.g. OS or firmware is reconfiguring MTRRs and/or doing strange things to the chipset's memory controller). The only other reasons to use WBINVD are if you're trying to benchmark RAM bandwidth, or if you need to resort to manual probing to detect the amount of RAM present (which is "never" for any 80x86 computer made since about 1992).

The CLFLUSH instruction exists for an entirely different purpose (performance tuning). The CPU's caches keep the most recently used cache lines (and discard the least recently used cache lines to make space). This causes performance problems in some cases. For example, if you read (or copy) a large amount of data and don't intend to access any of that data soon, then that will fill the cache with more recently used data that isn't going to be needed and push less recently used data that is more likely to be needed soon out of the cache. This is commonly known as "cache pollution". To avoid this problem (and improve performance by reducing cache misses caused by cache pollution) you want to tell the CPU to discard the more recently used data. That is what CLFLUSH is for.
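A minimal sketch of that use in C, assuming a compiler that provides the SSE2 intrinsics `_mm_clflush` and `_mm_mfence` from `<emmintrin.h>`. The function name `evict_range` and the caller-supplied `line_size` are assumptions made for illustration; on a build without SSE2 the sketch compiles to a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#endif

/* Evict every cache line overlapping [buf, buf + len) after a one-off
 * read or copy, so the data does not displace lines that are more
 * likely to be reused. line_size is supplied by the caller (an
 * assumption here; a real kernel would read it from CPUID). */
static void evict_range(const void *buf, size_t len, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (len == 0)
        return;
    /* Start at the beginning of the line that contains buf. */
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(line_size - 1);
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += line_size) {
#if defined(__SSE2__)
        _mm_clflush((const void *)p);  /* write back if dirty, then invalidate */
#endif
    }
#if defined(__SSE2__)
    _mm_mfence();  /* order the flushes against later loads and stores */
#endif
}
```

The contents of memory are unchanged by this; only where the data lives (cache or RAM) is affected, so it is purely a performance hint and never needed for correctness with DMA.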

So, your answers are:

DavidCooper wrote:(1) What should an OS do on a machine which lacks the CLFLUSH instruction? Perhaps simply not supporting that machine would be the best option.

Nothing. If CLFLUSH doesn't exist, then you have to accept any performance problems caused by cache pollution (and doing something like WBINVD instead will just make things worse).
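To decide whether the instruction is there at all, the usual check is the CLFSH feature flag, which CPUID leaf 1 reports in EDX bit 19. A small sketch in C, assuming the raw EDX value has already been read (the names `CPUID1_EDX_CLFSH` and `has_clflush` are made up for this example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, EDX bit 19: CLFSH (CLFLUSH instruction supported). */
#define CPUID1_EDX_CLFSH (UINT32_C(1) << 19)

static_assert(CPUID1_EDX_CLFSH == 0x00080000u, "CLFSH is EDX bit 19");

/* True if the CPU advertises CLFLUSH. When this returns false the
 * optional cache-pollution flush is simply skipped; nothing replaces it. */
static bool has_clflush(uint32_t cpuid1_edx)
{
    return (cpuid1_edx & CPUID1_EDX_CLFSH) != 0;
}
```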

DavidCooper wrote:(2) Should I be using CLFLUSH both before and after the DMA loads data?

No.

DavidCooper wrote:(3) Should I be using CLFLUSH before saving data (rather than WBINVD)?

No - you shouldn't use CLFLUSH or WBINVD.

DavidCooper wrote:(4) How do I turn "0F AE /7" (CLFLUSH) into actual machine code numbers - the 0F AE part is easy (15 174), but does anyone know what the "/7" part is meant to be? I assume it's something to do with how the instruction knows where to find the address in the address line that it's to flush, the location of that address being held in a register such as EAX, but I can't find complete information on this.

If your tools require machine code, find better tools that understand the CLFLUSH instruction. All modern tools (for 80x86) support the CLFLUSH instruction.

DavidCooper wrote:(5) 8 bytes in memory are used to store the address to be flushed, but how does an actual address occupy those 8 bytes?

The instruction's operand is the address to be flushed, not the address of an 8-byte pointer. For example (NASM), "CLFLUSH [eax]" flushes the cache line that corresponds to the virtual address in EAX (and does not cause the CPU to read the address of the cache line to flush from "[eax]").
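For anyone who does want the raw bytes: in the opcode notation "/7" means the 3-bit reg field of the ModRM byte that follows the opcode is set to 7, while the mod and r/m fields select the memory operand. The sketch below (C, with made-up helper names `modrm` and `encode_clflush`) assumes 32-bit code with the default 32-bit address size and a plain register-indirect operand with no displacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a ModRM byte: mod (2 bits), reg (3 bits), r/m (3 bits). */
static uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* Emit CLFLUSH [base] for 32-bit code.
 * base uses the standard register numbers:
 * EAX=0, ECX=1, EDX=2, EBX=3, ESI=6, EDI=7.
 * 4 and 5 are rejected because with mod=00 those r/m values mean
 * "SIB byte follows" and "disp32 only", not [ESP] and [EBP]. */
static size_t encode_clflush(uint8_t out[3], unsigned base)
{
    assert(base < 8 && base != 4 && base != 5);
    out[0] = 0x0F;
    out[1] = 0xAE;
    out[2] = modrm(0, 7, base);   /* "/7": reg field = 7 */
    return 3;
}
```

With EAX as the base register the ModRM byte is 00 111 000 in binary, i.e. 0x38, so "CLFLUSH [eax]" comes out as 0F AE 38 (15 174 56 in decimal). No 8-byte pointer is involved anywhere in the encoding.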

Cheers,

Brendan

Posted: **Mon Jan 03, 2011 9:39 pm**

Brendan wrote:80x86 is "cache coherent". Part of what this means is that hardware automatically makes sure that all devices sharing a bus (e.g. the FSB) or a link (e.g. HyperTransport or QuickPath) see a consistent view of RAM; where "devices" includes CPUs (and CPU caches).

That explains why no one else seems to be discussing the problem anywhere - it doesn't exist, so thanks for flushing all that polluted rubbish out of my cache.

I shouldn't need to worry about CLFLUSH then for now, but it's good to know what it's for.

Brendan wrote:If your tools require machine code, find better tools that understand the CLFLUSH instruction. All modern tools (for 80x86) support the CLFLUSH instruction.

My tools are the machine code numbers themselves, so I need to know what they are - I've tried programming in assembler long ago (for Z80 machines) and don't consider it to be a better tool - I found it hellish to use and that's why I just stuck to using machine code numbers directly instead where what you see is what you get and there's no extra junk to deal with. Anyway, since I don't need to know this instruction for now, not knowing the numbers for it won't matter.

Brendan wrote:The instruction's operand is the address to be flushed, not the address of an 8-byte pointer. For example (NASM), "CLFLUSH [eax]" flushes the cache line that corresponds to the virtual address in EAX (and does not cause the CPU to read the address of the cache line to flush from "[eax]").

I found a program using it which made it look as if it worked some other way, but I couldn't fully understand the mnemonics used.

DMA, WBINVD, INVD and CLFLUSH question.
