Aliased pages & caching

rdos · Post by **rdos** » Tue Mar 28, 2023 3:24 pm

I have a physical page that is mapped at one linear address in kernel space and at another linear address in user space. I designed it like this because I want the pages to have different access rights, with user space only having read access. The kernel linear address also is available in all processes.

However, I'm having transient issues where the kernel space page has been updated (which I can see through logs), but the user space page is not. The update might have happened on a different CPU core than the later read. The next time the page is read a bit later, the contents are updated.

The cache control bits for both pages are set to "normal state".

What could cause this problem? Isn't caching supposed to operate on physical addresses?

Octocontrabass · Post by **Octocontrabass** » Tue Mar 28, 2023 4:00 pm

rdos wrote:The update might have happened on a different CPU core than the later read.

What kind of synchronization are you using to ensure that the read happens after the update?

rdos wrote:The cache control bits for both pages are set to "normal state".

You mean WB?

thewrongchristian · Post by **thewrongchristian** » Tue Mar 28, 2023 5:31 pm

rdos wrote:I have a physical page that is mapped at one linear address in kernel space and at another linear address in user space. I designed it like this because I want the pages to have different access rights, with user space only having read access. The kernel linear address also is available in all processes.

However, I'm having transient issues where the kernel space page has been updated (which I can see through logs), but the user space page is not. The update might have happened on a different CPU core than the later read. The next time the page is read a bit later, the contents are updated.

The cache control bits for both pages are set to "normal state".

What could cause this problem? Isn't caching supposed to operate on physical addresses?

Cache protocols operate only at the cache level.

But CPU cores have store buffers, which might result in writes to memory being written to cache (and thus invalidated in remote caches) in a different order to the order of the writes in the program flow.

The order might be different because an earlier write is waiting for exclusive write access to a cache line, but a later write might be to a cache line the core already has write access to, so it can physically be written to the cache before the earlier write.

This will manifest itself in those writes appearing to happen in the reverse order in the remote cache, and so this is probably what you're seeing in your code.

You need a memory fence (lfence, sfence and mfence instructions on x86) to ensure writes are written to memory in a defined order. So, when updating some memory protected by a spin lock, you'll need a store fence between the last write to the protected data and the write to clear the spin lock. That barrier will ensure the protected data update is visible in other CPUs before the spin lock is unlocked by the latter write.

Octocontrabass · Post by **Octocontrabass** » Tue Mar 28, 2023 6:12 pm

thewrongchristian wrote:But CPU cores have store buffers, which might result in writes to memory being written to cache (and thus invalidated in remote caches) in a different order to the order of the writes in the program flow.

That's only true for string instructions (e.g. REP MOVSB) and non-temporal instructions (e.g. MOVNTI). For ordinary instructions, writes will exit the store buffer and be observed by other hardware threads in program order.

You may still need a memory barrier for those other threads to observe writes in a timely manner.

thewrongchristian · Post by **thewrongchristian** » Tue Mar 28, 2023 6:53 pm

Octocontrabass wrote:
thewrongchristian wrote:But CPU cores have store buffers, which might result in writes to memory being written to cache (and thus invalidated in remote caches) in a different order to the order of the writes in the program flow.
That's only true for string instructions (e.g. REP MOVSB) and non-temporal instructions (e.g. MOVNTI). For ordinary instructions, writes will exit the store buffer and be observed by other hardware threads in program order.

You may still need a memory barrier for those other threads to observe writes in a timely manner.

Right, and rdos didn't mention locking anyway, so it could be that a single write has not yet hit the cache and is perhaps still in a store buffer. So it's a race condition, and probably best resolved with explicit synchronization.

x86 is also a bit of an outlier here as well, most other modern CPUs allow stores to be re-ordered.

rdos · Post by **rdos** » Wed Mar 29, 2023 1:37 am

It's not spinlocks or locking that goes wrong.

I have a sorted array at the beginning of the 4k page. The producer thread will insert a new entry into it, queue a message to the consumer (in another part of the address space), map the new entry into linear memory and queue another message to the consumer. At the consumer side, I can see both the queued messages, but when the code validates the sorted array, it sometimes have an old copy of it, meaning it doesn't consider the new entry and sometimes will queue an overlapping request that causes problems in the cache. It doesn't happen very often, perhaps once in a million times or so.

I might overlook some other obvious problem with the code. The whole thing is a bit complicated involving several processes & threads.

I've tried to reproduce the problem by issuing the same requests again, but without luck. I now can catch the problem by looking for inconsistency in the sorted array, which always is proceeded by the consumer not seeing the correct array entries.

I might simply try to solve it by not sharing the memory area, rather letting the consumer / server use the notification messages to know what is cached. Still, this is a bit annoying, particularly since I've failed to pinpoint the issue in spite of having tried to analyse it for more than a week now.

It doesn't seem to be dependent on CPU. It works the same both on a bit dated 4-core AMD Athlon and a much newer HP Pavilion laptop with a 8-core AMD CPU.

Octocontrabass · Post by **Octocontrabass** » Wed Mar 29, 2023 11:11 am

rdos wrote:It's not spinlocks or locking that goes wrong.

It sounds an awful lot like it is.

rdos wrote:I have a sorted array at the beginning of the 4k page. The producer thread will insert a new entry into it,

How do you ensure the consumer is finished accessing the array before the producer updates it?

rdos wrote:but when the code validates the sorted array, it sometimes have an old copy of it,

Are you sure it's an old copy of the array and not something else? Like, for example, the producer thread inserting a new entry while you're validating it?

rdos · Post by **rdos** » Wed Mar 29, 2023 12:43 pm

There is a critical section that protects the sorted list & entries from being corrupted. These synchronization primitives have been thoroughly tested so I'm 100% certain they work on multicore.

I've now rewritten the code so the server keeps active requests itself, and updates them from producer messages. With this modification, everything seems to work as expected. I still have the cache consistency checking code running, and it doesn't detect any problems. This solution no longer depend on aliasing pages.

I think the most likely problem with the original code is that I have overlooked something in the interaction between producers & consumer.

thewrongchristian · Post by **thewrongchristian** » Wed Mar 29, 2023 3:36 pm

rdos wrote:It's not spinlocks or locking that goes wrong.

I have a sorted array at the beginning of the 4k page. The producer thread will insert a new entry into it, queue a message to the consumer (in another part of the address space), map the new entry into linear memory and queue another message to the consumer. At the consumer side, I can see both the queued messages, but when the code validates the sorted array, it sometimes have an old copy of it, meaning it doesn't consider the new entry and sometimes will queue an overlapping request that causes problems in the cache. It doesn't happen very often, perhaps once in a million times or so.

How do you copy in the entry into the array? If memcpy or memmove is involved, might you be seeing out of order stores, as Octocontrabass indicated that this can happen with the x86 string instructions.

And why two messages? Can you not put the entry into the array, map to the consumer address space, then send a single message to the consumer?

rdos · Post by **rdos** » Thu Mar 30, 2023 2:59 pm

thewrongchristian wrote:
rdos wrote:It's not spinlocks or locking that goes wrong.

I have a sorted array at the beginning of the 4k page. The producer thread will insert a new entry into it, queue a message to the consumer (in another part of the address space), map the new entry into linear memory and queue another message to the consumer. At the consumer side, I can see both the queued messages, but when the code validates the sorted array, it sometimes have an old copy of it, meaning it doesn't consider the new entry and sometimes will queue an overlapping request that causes problems in the cache. It doesn't happen very often, perhaps once in a million times or so.
How do you copy in the entry into the array? If memcpy or memmove is involved, might you be seeing out of order stores, as Octocontrabass indicated that this can happen with the x86 string instructions.

And why two messages? Can you not put the entry into the array, map to the consumer address space, then send a single message to the consumer?

The main problem is to keep track of which parts of a file are mapped (and therefore locked in the disc cache). It's the application that requests a new part to be mapped, and then the file system server will send a request to the disc thread to read it from the disc. Then either the disc server will send a notification that data is available (or the filesystem server might send the notification directly if everything is already cached). This notification will add a new entry to the sorted array. Entries are freed from the application side when they are no longer in use, something that will remove entries from the sorted array. The filesystem server must keep an updated view of what is mapped, which was the original intention of sharing the mapping with user space in the filesystem server. However, now I have events when requests are completed, when they are mapped to user space and when they are freed. These can be used to create a local view of mappings in user space of the filesystem server without using a shared mapping. And it works too.

The mapping is actually even more complicated since each process can have different parts mapped. This is solved by having per-process mappings and a master mapping. The file system server works with the master mapping only which has all entries that are part of the per-process mappings.

OSDev.org

Aliased pages & caching

Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching

Re: Aliased pages & caching