Multiprocessing critical regions

bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Post by bewing »

JamesM wrote:That method is actually described in the wiki link Combuster posted.
Oops. I thought Combuster was just posting a link to google or something. :wink:
MarkOS wrote: Or are there faster or safer methods?
As far as "faster" goes -- it depends on if you intend that to be on a multi-CPU system. You can code a spinlock in fewer opcodes. But on a multiCPU system, if a spinlock is accessed often then it will always take several hundred CPU cycles per access, because of cache invalidations. (RFO cache operations.)
Jeko
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

bewing wrote: As far as "faster" goes -- it depends on whether you intend it to run on a multi-CPU system. You can code a spinlock in fewer opcodes. But on a multi-CPU system, if a spinlock is accessed often, then it will always take several hundred CPU cycles per access because of cache invalidations (RFO cache operations).
So there isn't any method faster than spinlocks?
Why are there cache invalidations?

However, can you post code for a faster spinlock?
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Post by Korona »

According to the Intel manuals, the spin-wait loop should be aligned to the cache line size (IIRC that was 32 bytes). You should also use the "pause" instruction to notify the CPU that you're in a spin-wait loop.
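
A sketch of what that looks like, assuming GCC and a 64-byte cache line (the common size on newer parts; the 32-byte figure comes from older manuals):

Code: Select all

/* Spin-wait loop with "pause"; the lock word gets its own cache line
 * so unrelated data doesn't bounce between caches along with it. */
typedef struct {
    volatile int locked;
} __attribute__((aligned(64))) spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE)) {
        /* "pause" hints that this is a spin-wait loop: it avoids a
         * memory-order mis-speculation penalty on loop exit and yields
         * execution resources to a hyperthreaded sibling. */
        __builtin_ia32_pause();
    }
}

static inline void spin_unlock(spinlock_t *l)
{
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}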
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Post by bewing »

Why is there cache invalidation?

Imagine a simple two-CPU system, each CPU with its own cache: CPU1 with Cache1, CPU2 with Cache2. CPU1 runs code that accesses/modifies the value of the spinlock counter. The hardware caches the memory address containing the spinlock counter into Cache1, modifies the value, and marks that part of the cache "dirty". Perhaps CPU1 now releases the lock. The spinlock counter in Cache1 gets modified again, and is still dirty. Now CPU2 wants to access/modify the spinlock. When it tries, Cache1 announces that it owns a dirty copy and needs to write it back out to memory. It does so and clears its copy; only then is Cache2 allowed to read that memory, modify the value, and mark its copy as dirty. Please notice that flushing Cache1 and loading Cache2 took a tremendous amount of time.

Now, imagine the case where they try to access the spinlock simultaneously. CPU2 has to spin, constantly reading the value of the spinlock until CPU1 releases it. But every loop on CPU2 modifies the value. So Cache1 initially held a dirty copy. Then it was written out, Cache2 read it in and dirtied it, and CPU2 spun on the cached value until CPU1 wanted to release the lock. Then Cache2 had to write out the memory line, Cache1 read it back in and modified it; now Cache2 wanted it back to see what the new value was, so Cache1 wrote it back out to RAM and Cache2 read it in again. Each of those points where a cache line has to be passed from one cache to another is called an RFO.

Now, that was only the 2-CPU case. So imagine 3 CPUs, 2 of them spinning on the value of the lock. In the previous case, Cache2 owned the memory the entire time that it was spinning on the lock. But now, CPU2 and CPU3 are both constantly modifying the value of the lock. So on every single spin loop, there is an RFO, transferring a dirtied memory address from Cache2 to Cache3 and back again.

Now imagine an N CPU case. CPU1 owns the lock, but the hardware doesn't know that. You have N-1 CPUs spinning on the lock, and CPU1 wants to release it. Therefore you have N RFO requests, and (on average) half of them will get processed before CPU1 finally gets its RFO request handled, gets control of that memory address, and can finally release the lock. On real hardware, every RFO takes hundreds of CPU cycles to flush cache, and reload the new cache -- during which time at least one, and usually both CPUs are idle. If you are using hyperthreading, then you can recover some of these CPU cycles.

The best case happens when the lock is used so rarely that the cache for one CPU gets flushed before the next CPU ever gets around to asking for that same spinlock address. Then all you get is just an access to uncached memory (which still takes several hundred CPU cycles), instead of an RFO (which takes even more).
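
One common way to cut down that RFO traffic is a "test-and-test-and-set" loop: spin on a plain read, which every waiting CPU can hold in its cache in the Shared state without dirtying anything, and only attempt the atomic exchange once the lock looks free. A sketch, under the same assumptions as the earlier ones:

Code: Select all

typedef struct {
    volatile int locked;
} __attribute__((aligned(64))) spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* Cheap read-only spin -- no cache-line ownership required,
         * so no RFOs while the lock is held. */
        while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
            __builtin_ia32_pause();
        /* The lock looked free: now try the expensive atomic exchange. */
        if (!__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
            return;
    }
}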

See this thread for a detailed explanation of memory functions, caching, and cache coherency (make sure you have several hours available before reading it).
http://www.osdev.org/phpBB2/viewtopic.php?t=16012

This thread has links to a spinlock tutorial (see osdev64's post):
http://www.osdev.org/phpBB2/viewtopic.php?t=14261

Brendan has some spinlock code here:
http://www.osdev.org/phpBB2/viewtopic.php?t=9669

And the last half of this topic was a discussion on spinlocks, the time they take, and some more spinlock code from Brendan:
http://www.osdev.org/phpBB2/viewtopic.p ... 4&start=45
Ready4Dis
Member
Posts: 571
Joined: Sat Nov 18, 2006 9:11 am

Post by Ready4Dis »

My question is: why are you worrying about multiple programs accessing an ATA driver, and why would you assign such a task to a single CPU? The ATA driver should have a queue of pending requests, and a way to notify the program when a request is met (or failed). This completely alleviates the need to synchronize timing between multiple processes/threads, and the driver doesn't care which CPU takes it over; the only time you need to lock/wait is when the buffer is being modified (and if you're smart, you will have 2 locks, one for removing from the buffer and one for adding to it, since the two operations can proceed independently, especially in the case of a circular/ring buffer). This minimizes spin time and makes synchronization a moot point. Each queue entry might hold something like: LBA, sector count, ThreadToNotify, PhysicalAddr (or a memory mapping of sorts), and Operation (read/write); a sketch follows below. Then, once the operation is processed, the driver can wake up the thread (blocking I/O, of course; if not, then another method to alert it, like messages for asynchronous transfers). It would simply process its queue whenever the queue isn't empty, and otherwise lie dormant (you can even put the driver thread to sleep until the list gets modified again, then wake it).
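
As a hypothetical sketch of such a queue entry and the "add" side of the ring buffer (the names -- ata_request_t, ata_submit, struct thread -- are illustrative, not from any real driver):

Code: Select all

#include <stdint.h>

struct thread;                      /* opaque handle for the scheduler */

typedef volatile int spinlock_t;    /* as in the spinlock sketches above */
void spin_lock(spinlock_t *l);
void spin_unlock(spinlock_t *l);

/* One queued ATA request. */
typedef struct {
    uint64_t       lba;             /* starting sector */
    uint32_t       sect_count;      /* number of sectors */
    uintptr_t      phys_addr;       /* physical buffer address */
    int            op;              /* 0 = read, 1 = write */
    struct thread *waiter;          /* thread to notify on completion */
} ata_request_t;

#define QUEUE_SIZE 64
static ata_request_t queue[QUEUE_SIZE];
static volatile unsigned head, tail;        /* remove / add indices */
static spinlock_t add_lock, remove_lock;    /* the two separate locks */

/* Producer side: clients enqueue under add_lock only, so they never
 * contend with the driver thread draining the queue under remove_lock. */
int ata_submit(const ata_request_t *req)
{
    int queued = 0;
    spin_lock(&add_lock);
    if ((tail + 1) % QUEUE_SIZE != head) {  /* queue not full */
        queue[tail] = *req;
        tail = (tail + 1) % QUEUE_SIZE;
        queued = 1;
    }
    spin_unlock(&add_lock);
    return queued;
}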

To the original poster: the only really critical parts are those that can be accessed at the same time by more than one CPU, and the trick is to minimize them. Obviously, adding/removing tasks, modifying the kernel's page tables, etc. can have drastic effects on other running processes. Most drivers' queues will only want one writer to be able to add something at a time, etc.

It also depends on what kind of OS you have. A microkernel is way different from a monolithic kernel in that most CPUs will spend their time in user land, and much less time in kernel land where they can interfere with each other, so far fewer locks are required. It will still be similar in the case of adding/removing requests to drivers -- you can only add one to the list at a time so two writers aren't trying to overwrite the same memory location -- but the good thing here is that this can be done on the kernel's message-passing side and ignored completely in most drivers/software; kind of set it and forget it. The monolithic version is in kernel mode much more often, and each driver must be written with multitasking/multi-CPU in mind (unless processes don't access the drivers directly -- you can still make the kernel the only thing with access to the drivers and keep all the spinlocking on the kernel side).
Jeko
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

This article is very useful: http://www.osdever.net/tutorials/spinlock_part_3.php

Read it! (also parts 1 and 2)

Now I must implement queues of tasks and requests with atomic operations! It's much faster!
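
A minimal sketch of the kind of atomic-operation queue those articles build up to: a lock-free push onto a request list using compare-and-exchange (illustrative names, GCC builtins assumed; a real queue also needs an equally careful pop side):

Code: Select all

struct request {
    struct request *next;
    /* ...request payload... */
};

static struct request *volatile list_head;

void request_push(struct request *r)
{
    struct request *old = __atomic_load_n(&list_head, __ATOMIC_RELAXED);
    do {
        r->next = old;
        /* On failure, "old" is reloaded with the current head value
         * and we simply retry. */
    } while (!__atomic_compare_exchange_n(&list_head, &old, r,
                                          1 /* weak */,
                                          __ATOMIC_RELEASE,
                                          __ATOMIC_RELAXED));
}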