DMA, WBINVD, INVD and CLFLUSH question.
Posted: Sun Jan 02, 2011 11:56 pm
Up to now I hadn't given this a lot of thought, just assuming that you have to use WBINVD (writeback invalidate) before saving to disk via the DMA so that all the data sitting in cache memory is written to main memory first, and INVD (invalidate) after loading from disk via the DMA so that when you try to read the new data you don't end up working with old garbage in the cache instead. I also realised that you would need to do WBINVD before loading data from disk too, because otherwise using INVD after the DMA has loaded new data into main memory would destroy any data at any other address that hasn't been written to main memory yet and is only sitting in cache memory.
Now I've been thinking about it a bit more and I've noticed a problem. It's not too bad if you're sending data to disk, because you only need to do WBINVD for that, but whenever you load data in via the DMA (or similar device), any writes to memory by the processor in between the WBINVD and INVD are in danger of being lost, because they may not have made it from cache to main memory by the time the INVD instruction is run. So it looks as if the processor more or less has to stop work throughout the entire process. It looks as if other processors on a multi-core machine will have to stop too for the same reason, so the whole machine grinds to a halt whenever you're loading data from any kind of storage device.
I've now found that there's a CLFLUSH instruction available on some machines (probably most, and I would imagine all modern ones; strangely this instruction is missing from the copy of the instruction set which I have always used for reference, although I have seen it before in other documents). It looks as if CLFLUSH can get round the problem by writing back and invalidating specific areas of memory (and, as far as I can tell, it only writes back a cache line if that line has been modified since being brought into the cache). I imagine that I should be using it before the DMA loads data in and again afterwards - the first time to clean up the cache so that none of it gets written back later on top of the data just loaded by the DMA, and the second time so that no stale copy is left in the cache. I'd like to know if that is actually the correct way to do things.
So, my questions are:-
(1) What should an OS do on a machine which lacks the CLFLUSH instruction? Perhaps simply not supporting that machine would be the best option.
(2) Should I be using CLFLUSH both before and after the DMA loads data?
(3) Should I be using CLFLUSH before saving data (rather than WBINVD)?
(4) How do I turn "0F AE /7" (CLFLUSH) into actual machine code numbers - the 0F AE part is easy (15 174 in decimal), but does anyone know what the "/7" part is meant to be? I assume it's something to do with how the instruction knows where to find the address of the line that it's to flush, that address being held in a register such as EAX, but I can't find complete information on this.
(5) 8 bytes in memory are used to store the address to be flushed, but how does an actual address occupy those 8 bytes? It's only going to need four of them, or two in real mode. I can normally find out how to translate assembly stuff into machine code, but this one's not spelt out sufficiently well and I don't have an assembler to try it out with.
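For reference on questions (4) and (5), here is a minimal sketch of how the "/7" notation is usually read in Intel-style opcode tables: the opcode bytes are followed by a ModRM byte whose middle three bits (the "reg" field) are fixed at 7, while the other two fields select the memory operand. The "m8" operand means a byte in memory, so the instruction takes the address of any byte inside the line to be flushed rather than 8 bytes holding an address. The function names below are made up for illustration, and the sketch assumes 32-bit addressing with the address held in a single register and no displacement; real mode would need different ModRM values or an address-size prefix.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a ModRM byte: mod (2 bits) | reg (3 bits) | rm (3 bits). */
uint8_t modrm(uint8_t mod, uint8_t reg, uint8_t rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/*
 * Hypothetical helper: encode "CLFLUSH [reg32]" for 32-bit addressing.
 * rm selects the register holding the address:
 *   0=EAX, 1=ECX, 2=EDX, 3=EBX, 6=ESI, 7=EDI.
 * With mod=00, rm=4 means "SIB byte follows" and rm=5 means "disp32 only",
 * so both are rejected here to keep the sketch to the simple case.
 * Writes 3 bytes to out and returns 3, or returns 0 if rm is unsupported.
 */
size_t encode_clflush_indirect(uint8_t rm, uint8_t out[3])
{
    if (rm > 7 || rm == 4 || rm == 5)
        return 0;
    out[0] = 0x0F;             /* first opcode byte (15 decimal)   */
    out[1] = 0xAE;             /* second opcode byte (174 decimal) */
    out[2] = modrm(0, 7, rm);  /* "/7": reg field fixed at 7       */
    return 3;
}
```

Under those assumptions, CLFLUSH [EAX] comes out as the three bytes 0F AE 38 (15 174 56 in decimal), and CLFLUSH [EBX] as 0F AE 3B.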
(I think I can see how to get the cache line size from CPUID, so it ought to be easy enough to work out how many times to repeat CLFLUSH to cover the memory range required, so I shouldn't need any help with that.)
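The cache-line arithmetic can be sketched roughly as follows. It assumes the documented CPUID leaf 1 layout (EDX bit 19 reports CLFLUSH support, EBX bits 15:8 give the CLFLUSH line size in units of 8 bytes) and takes the register values as plain arguments so the logic can be checked anywhere. All function names are hypothetical, and the actual flush is passed in as a callback; with GCC or Clang on x86 that callback could wrap `_mm_clflush` from `<emmintrin.h>`.

```c
#include <stddef.h>
#include <stdint.h>

/* CPUID leaf 1, EDX bit 19: CLFLUSH instruction is available. */
int has_clflush(uint32_t cpuid1_edx)
{
    return (int)((cpuid1_edx >> 19) & 1u);
}

/* CPUID leaf 1, EBX bits 15:8: CLFLUSH line size in 8-byte units. */
uint32_t clflush_line_size(uint32_t cpuid1_ebx)
{
    return ((cpuid1_ebx >> 8) & 0xFFu) * 8u;
}

/* Number of cache lines touched by [addr, addr + len).
 * line_size is assumed to be a power of two. */
size_t lines_in_range(uintptr_t addr, size_t len, size_t line_size)
{
    if (len == 0)
        return 0;
    uintptr_t mask  = ~(uintptr_t)(line_size - 1);
    uintptr_t first = addr & mask;              /* line holding first byte */
    uintptr_t last  = (addr + len - 1) & mask;  /* line holding last byte  */
    return (size_t)((last - first) / line_size) + 1;
}

/* Callback that flushes the one cache line containing the given address. */
typedef void (*flush_line_fn)(const void *line);

/* Call flush once per cache line covering the buffer; returns the count. */
size_t flush_range(const void *buf, size_t len, size_t line_size,
                   flush_line_fn flush)
{
    size_t n = lines_in_range((uintptr_t)buf, len, line_size);
    uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(line_size - 1);
    for (size_t i = 0; i < n; i++, p += line_size)
        flush((const void *)p);
    return n;
}
```

Note that a buffer which is not aligned to the line size touches one more line than len / line_size suggests, which is why the count is taken from the first and last line addresses rather than from the length alone.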