copy-on-write/mapped files without race conditions?

mariuszp · Post by **mariuszp** » Mon Mar 28, 2016 3:30 pm

I have stumbled upon this theoretical problem with mapped files and copy-on-write. So far, my OS was uniprocessor, and I started work on SMP support. I have some practical issues such as 2 CPUs ending up on the same stack, but I noticed a much more difficult theoretical program.

Let's say a certain area of virtual memory has a file mapped to it. Thread A accessed the memory. This causes a page fault, the kernel sees it is a memory-mapped file area and:

1) Marks the page as present and allocates a frame for it.
2) Reads the file data into that page.

But if thread B, which shares memory with thread A, tries to access the page inbetween step 1 and 2, it will end up reading whatever garbage was on that page before the file was read! Disabling interrupts for the duration of the mapping works for uniprocessor systems, but will not work with multiple CPUs, when thread B runs on another CPU at the same time this mapping happens.

How can I prevent this race condition?

gerryg400 · Post by **gerryg400** » Mon Mar 28, 2016 3:46 pm

mariuszp wrote:I have stumbled upon this theoretical problem with mapped files and copy-on-write. So far, my OS was uniprocessor, and I started work on SMP support. I have some practical issues such as 2 CPUs ending up on the same stack, but I noticed a much more difficult theoretical program.

Let's say a certain area of virtual memory has a file mapped to it. Thread A accessed the memory. This causes a page fault, the kernel sees it is a memory-mapped file area and:

1) Marks the page as present and allocates a frame for it.
2) Reads the file data into that page.

But if thread B, which shares memory with thread A, tries to access the page inbetween step 1 and 2, it will end up reading whatever garbage was on that page before the file was read! Disabling interrupts for the duration of the mapping works for uniprocessor systems, but will not work with multiple CPUs, when thread B runs on another CPU at the same time this mapping happens.

How can I prevent this race condition?

The first thing is to swap the 2 steps. So the page will still be marked as not-present if a fault occurs on the second core.

Further from that you will need to give your pages a state variable. In this do the following:

1) Mark page state as WAITING_TO_BE_LOADED - but leave the not-present bit clear
2) Set the thread state as WAITING_FOR_PAGE
3) Allocate a physical page and load the data into it
4) Set the present bit
5) Mark page state as READY
6) Set present bit (and send an IPI if necessary)
7) Wake up waiting pages.

Now if the 2nd thread faults during steps 1 to 3 it will see that the page is WAITING_TO_BE_LOADED and can be put in the WAITING _FOR_PAGE state and be woken up when the page is ready. If it faults during 4 to 7 the handler notice the PRESENT bit is now set and simply return for a retry.

I hope I got this mainly right. It's roughly what I do in my OS.

mariuszp · Post by **mariuszp** » Mon Mar 28, 2016 3:53 pm

So to read the file data, I'd first have to map the physical frame into a temporary page (which i protect with a spinlock), read there, then map it into the real page and mark it present?

As for 2 page faults going off together:

I protect the paging structures (which includes Glidix-specific metadata) with a spinlock. So, could the algorithm be:

1) acquire the paging spinlock
2) if the page is present, release the spinlock and return for a retry
3) otherwise, perform the above algorithm for mapping the page
4) release the spinlock and return for a retry

Does that sound plausible?

gerryg400 · Post by **gerryg400** » Mon Mar 28, 2016 3:58 pm

mariuszp wrote:So to read the file data, I'd first have to map the physical frame into a temporary page (which i protect with a spinlock), read there, then map it into the real page and mark it present?

As for 2 page faults going off together:

I protect the paging structures (which includes Glidix-specific metadata) with a spinlock. So, could the algorithm be:

1) acquire the paging spinlock
2) if the page is present, release the spinlock and return for a retry
3) otherwise, perform the above algorithm for mapping the page
4) release the spinlock and return for a retry

Does that sound plausible?

That sounds correct. The trick is to describe your procedure in a list of atomic steps and then to interrupt it at every step with the other thread to see what happens.

Do you have IPIs yet ?

mariuszp · Post by **mariuszp** » Mon Mar 28, 2016 4:00 pm

Currently I use 2 IPIs: the HALT and SCHED_HINT IPIs.

I send HALT to all CPUs (one-by-one) except the sender when a kernel panic or other such event occurs, and it just causes them to disable interrupts and halt.

The SCHED_HINT is sent from one CPU to another when the sender believes that the receiver must reevaluate scheduling decisions. For example, it is sent if a thread from another CPU is woken up by a thread from another CPU, or when the PIT fires (since a thread may now be woken from timed sleep).

rdos · Post by **rdos** » Fri Apr 01, 2016 12:25 pm

This is a similar problem as demand-loading pages from an executable that is potentially running on multiple cores. I solve it by allowing pages to be "hooked" and served by a driver (in the case of the executable, the PE loader driver). When the pagefault occurs, the driver gets control from the pagefault handler, and will use an ordinary critical section to ensure that only one thread at a time tries to read and relocate the page. Inside the critical section, there is a check if it is already loaded, in which case the pagefaulting code is just reexecuted. This way to solve it really don't need any IPIs, as sending IPIs is a natural part of implementing critical sections / mutexes.

rdos · Post by **rdos** » Fri Apr 01, 2016 2:37 pm

mariuszp wrote:Currently I use 2 IPIs: the HALT and SCHED_HINT IPIs.

I send HALT to all CPUs (one-by-one) except the sender when a kernel panic or other such event occurs, and it just causes them to disable interrupts and halt.

I use NMI for that. It has the advantage that it will be able to HALT cores that have disabled interrupts.

mariuszp wrote: The SCHED_HINT is sent from one CPU to another when the sender believes that the receiver must reevaluate scheduling decisions. For example, it is sent if a thread from another CPU is woken up by a thread from another CPU, or when the PIT fires (since a thread may now be woken from timed sleep).

Looks similar to my solution. I also use this IPI for TLB-invalidation.

I have separate IRQs for timers, and they depend on the available hardware. Some PCs have HPETs, some have old PITs while yet others use the APIC timer.

mariuszp · Post by **mariuszp** » Fri Apr 01, 2016 4:19 pm

rdos wrote:
mariuszp wrote:Currently I use 2 IPIs: the HALT and SCHED_HINT IPIs.

I send HALT to all CPUs (one-by-one) except the sender when a kernel panic or other such event occurs, and it just causes them to disable interrupts and halt.
I use NMI for that. It has the advantage that it will be able to HALT cores that have disabled interrupts.

mariuszp wrote: The SCHED_HINT is sent from one CPU to another when the sender believes that the receiver must reevaluate scheduling decisions. For example, it is sent if a thread from another CPU is woken up by a thread from another CPU, or when the PIT fires (since a thread may now be woken from timed sleep).
Looks similar to my solution. I also use this IPI for TLB-invalidation.

I have separate IRQs for timers, and they depend on the available hardware. Some PCs have HPETs, some have old PITs while yet others use the APIC timer.

I use the APIC timer in single-shot mode for scheduling - that is, after switching task, the scheduler sets the current CPU's APIC timer to a specific value, such that it can be interrupted when it needs to switch task next. If a thread needs less time than its normal slice, it calls kyield() which turns off the APIC timer (by enabling interrupts, setting the timer initial count to 0, 2 NOPs, followed by setting it to 0 again, in case a task switch occurs in the middle), and then calls switchTask() itself.

rdos · Post by **rdos** » Sat Apr 02, 2016 2:02 am

mariuszp wrote: I use the APIC timer in single-shot mode for scheduling - that is, after switching task, the scheduler sets the current CPU's APIC timer to a specific value, such that it can be interrupted when it needs to switch task next. If a thread needs less time than its normal slice, it calls kyield() which turns off the APIC timer (by enabling interrupts, setting the timer initial count to 0, 2 NOPs, followed by setting it to 0 again, in case a task switch occurs in the middle), and then calls switchTask() itself.

I no longer support cooperative multitasking, and threads cannot call "yield". I would definitely not trust a usermode thread to decide it will return control voluntary. I also use the APIC timer for the scheduler when it is present, but it's always loaded with a default value of 1ms so no thread can run longer than that without potentially being preempted by another thread at the same priority-level that is ready to run. Blocking of threads is part of a multithreaded API that has critical sections (mutex), signal/wait-for-signal, IPC (locally, between processes and over networks) and fixed delays. The kernel also has a high-precision timer function that is either mapped to PIT, HPET or APIC timer.

mariuszp · Post by **mariuszp** » Sat Apr 02, 2016 4:35 am

rdos wrote:
mariuszp wrote: I use the APIC timer in single-shot mode for scheduling - that is, after switching task, the scheduler sets the current CPU's APIC timer to a specific value, such that it can be interrupted when it needs to switch task next. If a thread needs less time than its normal slice, it calls kyield() which turns off the APIC timer (by enabling interrupts, setting the timer initial count to 0, 2 NOPs, followed by setting it to 0 again, in case a task switch occurs in the middle), and then calls switchTask() itself.
I no longer support cooperative multitasking, and threads cannot call "yield". I would definitely not trust a usermode thread to decide it will return control voluntary. I also use the APIC timer for the scheduler when it is present, but it's always loaded with a default value of 1ms so no thread can run longer than that without potentially being preempted by another thread at the same priority-level that is ready to run. Blocking of threads is part of a multithreaded API that has critical sections (mutex), signal/wait-for-signal, IPC (locally, between processes and over networks) and fixed delays. The kernel also has a high-precision timer function that is either mapped to PIT, HPET or APIC timer.

I never said userspace calls kyield(). It is called by the kernel when performing blocking operations, so that it doesn't waste the entire timeslice.

Userspace threads can, however, if they want to, call the _glidix_yield() system call. Give me one argument why that is dangerous.

rdos · Post by **rdos** » Sat Apr 02, 2016 7:40 am

mariuszp wrote: Userspace threads can, however, if they want to, call the _glidix_yield() system call. Give me one argument why that is dangerous.

That might not be so dangerous, provided a thread cannot disable the preempt-timer for itself and continue to monopolize the core forever.

However, the presence of "yield" itself promotes lousy programming practices like state machines and busy-polling. It's much harder to do this in a more clean thread environment with no yield.

Think of it like this: You design an OS that is supposed to be well-behaved even when people other than yourself write applications. In order for that to happen, you need to steer programmers away by designing user-interfaces that makes it hard to regress to bad programming practices. I think I've mostly managed to do that, with one exception, but the code was thrown-away because it didn't work properly.

mariuszp · Post by **mariuszp** » Sat Apr 02, 2016 8:00 am

No, indeed they cannot control the timer themselves. They can only DECREASE the timeslice they receive, never increase it.

kzinti · Post by **kzinti** » Sun Apr 03, 2016 12:51 am

rdos wrote:However, the presence of "yield" itself promotes lousy programming practices like state machines and busy-polling. It's much harder to do this in a more clean thread environment with no yield.

Please explain how "yield" promotes "lousy programming practices" and why that is a problem. Given two OS implementations that are identical except the first one has a yield system call and the second one doesn't, why is the first one suffering?

I don't understand how a program communicating to the OS that it doesn't need the CPU any more is a problem.

rdos · Post by **rdos** » Sun Apr 03, 2016 9:51 am

kiznit wrote:
rdos wrote:However, the presence of "yield" itself promotes lousy programming practices like state machines and busy-polling. It's much harder to do this in a more clean thread environment with no yield.
Please explain how "yield" promotes "lousy programming practices" and why that is a problem. Given two OS implementations that are identical except the first one has a yield system call and the second one doesn't, why is the first one suffering?

That's obvious. In RDOS, you cannot do yield, but a thread can ask the OS to wait for a fixed amount of time before returning. That way the programmer is forced to decide when it is a good idea to do the next round in a loop (like when you poll something on a serial port), and not just do it as quickly as possible.

kiznit wrote: I don't understand how a program communicating to the OS that it doesn't need the CPU any more is a problem.

That's easy. If a thread needs to do that it doesn't use typical multithreaded APIs like mutexes and waiting for events. It's busy-polling some device, and wants to do it as much as it can, but it already did it so it asks the OS to give up temporarily with yield. There is no other reason why you would want to use yield.

kzinti · Post by **kzinti** » Sun Apr 03, 2016 4:37 pm

rdos wrote:That's obvious. In RDOS, you cannot do yield, but a thread can ask the OS to wait for a fixed amount of time before returning. That way the programmer is forced to decide when it is a good idea to do the next round in a loop (like when you poll something on a serial port), and not just do it as quickly as possible.

It's not obvious. In fact you are wrong. Yield can and is useful in other contexts then polling a device.

rdos wrote:That's easy. If a thread needs to do that it doesn't use typical multithreaded APIs like mutexes and waiting for events. It's busy-polling some device, and wants to do it as much as it can, but it already did it so it asks the OS to give up temporarily with yield. There is no other reason why you would want to use yield.

There are other reasons to use yield. Strategic use of yield() can improve performance by giving other threads or processes a change to run when (heavily) contended resources (e.g. mutexes) have been released by the caller. yield() is particularly useful when trying to balance a workload between multiple CPU cores. Of course, calling yield() unnecessarily or inappropriately can result in unnecessary/inefficient context switches and degrade system performance. But that is true of any API: misuse it and it's not good.

There is a reason all OSes provide such an API and the reason is that it is quite useful in practice (cooperative scheduling between threads/processes). The examples you give (polling a device) is obviously not a good place to use yield. But I take issue with your "no other reason why you would want to use yield". That is simply not true.

Just to be clear about what I said above: imagine you have a thread holding a mutex for a significant amount of time. Other important threads are waiting on that mutex. You release the mutex and you want the important threads to be scheduled immediately. How do you accomplish this without using a yield() call?

OSDev.org

copy-on-write/mapped files without race conditions?

copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?

Re: copy-on-write/mapped files without race conditions?