I've been debating some ideas regarding a locking strategy for scheduler and resource accesses. One idea I'm mulling over is essentially using a queue for a given lock and issuing IPIs to communicate lock release to the next task in the list, and possibly issuing them to free up highly contended locks at a threshold.
It would work something like this:
Lock function:
1. Attempt to acquire the lock atomically with a compare-and-exchange
2. If lock acquisition fails, add a queue entry atomically (this consists of the processor's APIC ID and process number). The queue update would inevitably be a small spinlock in itself.
3. Switch to the next task waiting to run on the given CPU
Unlock function:
1. Check the queue; if it's empty, simply free the lock
2. Check whether the destination CPU is servicing another lock-acquisition IPI, and atomically set its lock-acquisition fields if not
3. If the CPU is busy, move down the list
4. If the end of the list is hit, simply free the lock
Unlock IPI handler would essentially (a rough sketch of the whole scheme follows this list):
1. Identify the task and set its TSS EAX value to the current lock value, for the next cmpxchg loop
2. Issue the EOI
3. Switch to the lock-waiting task for its remaining slice of time
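Something like this, in rough C (the fixed-size waiter array, the packed entry format, and the kernel helpers current_apic_id(), current_pid() and switch_to_next_task() are just placeholder names to show the shape of the acquire path, not working code):

```c
#include <stdint.h>

#define MAX_WAITERS     16
#define WAITING_ON_LOCK 1                    /* hypothetical task-state code */

extern uint32_t current_apic_id(void);       /* assumed kernel helpers */
extern uint32_t current_pid(void);
extern void     switch_to_next_task(int state, uintptr_t lock_addr);

typedef struct {
    volatile uint32_t value;                 /* 0 = free, nonzero = held        */
    volatile uint32_t waiters[MAX_WAITERS];  /* packed APIC ID + PID, 0 = empty */
} queued_lock_t;

static void lock_acquire(queued_lock_t *l)
{
    /* 1. Try to take the lock outright. */
    if (__sync_val_compare_and_swap(&l->value, 0, 1) == 0)
        return;

    /* 2. Lock is held: publish ourselves in the first free queue slot.
     *    This is a tiny spin of its own, but only over the array. */
    uint32_t me = (current_apic_id() << 24) | (current_pid() & 0x00FFFFFF);
    for (int i = 0; i < MAX_WAITERS; i++) {
        if (__sync_val_compare_and_swap(&l->waiters[i], 0, me) == 0) {
            /* 3. Give up the CPU.  In the full design the unlock IPI later
             *    patches our saved EAX and hands the lock over directly,
             *    so when we run again the lock is ours. */
            switch_to_next_task(WAITING_ON_LOCK, (uintptr_t)l);
            return;
        }
    }

    /* Queue full: fall back to a plain spinlock. */
    while (__sync_val_compare_and_swap(&l->value, 0, 1) != 0)
        ;
}
```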
I'd imagine this is a ton of overhead and overly complicated for such a problem, and it might at best find use for locks that usually have long wait times. How feasible would such a lock setup be?
Locking question
OK, a few problems you would have to deal with here:
What if two processes attempt to acquire a lock simultaneously that is already locked? They can't both add to the queue at the same time. Will you use another atomic lock for the queue, or just let one spin until the list is updated, then add the next one to the list? How do you get resources for the queue? Are they automatically assigned when you create the lock? Is there a pool, do you use malloc/free, a linked list, etc.? The complexity that you are talking about adding for a simple problem seems a bit overkill, and will cause tons of problems of its own. The trick is to just minimize the chance of locks colliding, rather than try to come up with a bunch of tricks around it. I would be more concerned with 'why' there is so much spinning going on, rather than how to optimize my spinning. My suggestion: queues, but in a different way than what you're talking about. Say an app is calling a driver, for example:
Lock the driver's queue, add an entry, unlock. The entry would be whatever information is required to perform some operation (for example, disk access): you would send the process, address, offset and size. The driver simply checks its queue and, if it has something, processes it. This can be a circular queue (my other suggestion), so that the driver doesn't have to lock the queue when removing information from it, nor does it have to check whether the queue is locked when reading the next entry. The chance of two processes spinning is reduced (since it's only as long as it takes to copy a few bytes), the driver itself doesn't interfere with processes, and if a process does spin, it will be for a very short time period. Now, there may be other times when it's not as easy to do, but the trick is to just minimize either the time spent spinning or the chance that tasks will spin. The driver only needs to be woken if it's sleeping (OK, there is an atomic operation here, but it's VERY short), and any process asking for something could either a) go to sleep, or b) do something else, so you can do asynchronous operations as well (then as a parameter you would set it to give you some sort of notification or message, depending on your IPC system). So, here would be a rundown:
Attempt to read a block from HD
The kernel locks the driver's queue, adds you to its list, then checks if the driver is sleeping (using an atomic operation so the driver doesn't put itself to sleep right after the kernel just saw that it was awake!). The process is put to sleep until the request is finished (either completed or errored). The kernel then puts the driver back to sleep, wakes up the process, and sends it a message telling it success or failure.
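To make the circular-queue part concrete, here's a rough sketch of what such a request ring might look like (the request fields, RING_SIZE, and the function names are just examples; a real version would also need proper memory barriers):

```c
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 32                  /* power of two keeps wrap-around cheap */

typedef struct {
    uint32_t pid;                     /* requesting process      */
    void    *buffer;                  /* destination address     */
    uint64_t offset;                  /* disk offset             */
    uint32_t size;                    /* bytes to transfer       */
} disk_request_t;

typedef struct {
    volatile uint32_t lock;           /* taken by producers only         */
    volatile uint32_t head;           /* next slot the driver reads      */
    volatile uint32_t tail;           /* next slot a producer writes     */
    disk_request_t    slots[RING_SIZE];
} request_ring_t;

/* Producer side: lock, copy a few bytes, unlock - the only spinning window. */
static bool ring_push(request_ring_t *r, const disk_request_t *req)
{
    while (__sync_lock_test_and_set(&r->lock, 1))
        ;                                              /* very short spin */
    bool ok = ((r->tail + 1) % RING_SIZE) != r->head;  /* not full? */
    if (ok) {
        r->slots[r->tail] = *req;
        r->tail = (r->tail + 1) % RING_SIZE;
    }
    __sync_lock_release(&r->lock);
    return ok;
}

/* Consumer side: the driver never takes the lock, it only advances head. */
static bool ring_pop(request_ring_t *r, disk_request_t *out)
{
    if (r->head == r->tail)
        return false;                                  /* queue empty */
    *out = r->slots[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    return true;
}
```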
Thanks for the reply. Queue adds would be done atomically at the first null entry found in the list. Basically there's a small spin loop in the queue update, but it should be minimal on most systems. All lock resources would be allocated at lock creation, never to be relocated, in an array-style format. If a queue fills, the lock does a standard spin as long as the process is active. This wasn't for something like a HD driver; there I'm pretty much planning to do what you're suggesting and just have a formatted request queue that's locked while it's being updated and then read by the HD subsystem's ISR.
I was thinking more of something like memory management, where I could see a lot of things contending for it; the whole idea is just to prevent processes from spending a lot of time spinning while there are other tasks to be done. Initially I had thought monitor/mwait provided similar functionality, but upon reading into them I found out that was not the case. I'm not really at the point where I have a concrete need for an alternative lock system and my processes are actually spending a lot of time spinning. It's pretty much at the idea/theory stage, and I'm just curious whether it's feasible on paper at this point, which it seems to be. Thanks again for the response.
I didn't realize you were so far along and having issues like that already. Well, for my current setup I simply have a SleepTask() call if the check fails, to allow other tasks to do their thing; then when the task is called again it checks, and if it's still busy, it just sleeps again. This works OK for me for now, but when I get a bit further I may need to modify this to be a bit more robust.
An idea that just popped into my head that may be worth thinking about:
Have a status in the task saying that it's waiting on a resource to be freed (all locks are registered and assigned an ID, probably the lock's address for simplicity since it'll be in the kernel). The process can be put to sleep pending the resource, so when the lock is freed the kernel will search the sleeping tasks for one that was waiting on that resource and wake it, and only actually unlock the resource if none are found (that way, the first task in the sleeping list waiting for the resource is always next). How many tasks are sleeping (or possibly sit in a special list) would make the algorithm more or less efficient. Just an idea; it may or may not be worth it.
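Very roughly, something like this (task_t, wake_task() and the task array are made-up names, just to show the shape of it):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_TASKS 64

typedef enum { TASK_RUNNING, TASK_SLEEPING, TASK_WAIT_RESOURCE } task_state_t;

typedef struct {
    task_state_t state;
    uintptr_t    waiting_on;   /* lock address the task is blocked on, 0 if none */
} task_t;

extern task_t tasks[MAX_TASKS];
extern void wake_task(task_t *t);

/* Called on unlock: hand the lock to the first sleeper waiting on it,
 * or really free it if nobody is waiting. */
static void resource_release(volatile uint32_t *lock)
{
    for (size_t i = 0; i < MAX_TASKS; i++) {
        if (tasks[i].state == TASK_WAIT_RESOURCE &&
            tasks[i].waiting_on == (uintptr_t)lock) {
            tasks[i].waiting_on = 0;
            wake_task(&tasks[i]);      /* ownership passes to this task */
            return;
        }
    }
    *lock = 0;                         /* no waiters: actually free the lock */
}
```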
Another Idea:
If you had a process ID stored with the lock, you could relinquish the rest of your time slice to the process that you're waiting on. This would help it finish much faster, and there's a better chance that you would be good to go when your turn came back around (especially if the task that has the lock is an idle or low-priority task with a high-priority one waiting). This keeps the CPU moving, at the POSSIBLE expense of an extra context switch or two (although spinning is reduced to almost nothing).
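Roughly like this (owner_pid, yield_to() and current_pid() are made up just to illustrate the idea):

```c
#include <stdint.h>

typedef struct {
    volatile uint32_t value;      /* 0 = free */
    volatile uint32_t owner_pid;  /* PID of current holder, valid while held */
} owned_lock_t;

extern uint32_t current_pid(void);
extern void     yield_to(uint32_t pid);   /* donate remaining time slice */

static void owned_lock_acquire(owned_lock_t *l)
{
    while (__sync_val_compare_and_swap(&l->value, 0, 1) != 0) {
        /* Instead of spinning, push the holder along so it releases sooner.
         * owner_pid may already be stale here; a real version would handle that. */
        yield_to(l->owner_pid);
    }
    l->owner_pid = current_pid();
}

static void owned_lock_release(owned_lock_t *l)
{
    l->owner_pid = 0;
    __sync_lock_release(&l->value);
}
```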
So, are you going to have variable-sized queues for different locks (depending on the chances of hitting a collision), or what are your plans on this? If a lock is being allocated and your memory manager is locked, it would spin (or be put on the memory manager lock's queue)... what happens if something tries to acquire this lock while it's waiting for memory? Or can this never happen (possibly nobody knows the lock exists until the driver is fully loaded)? It sounds like a viable plan on paper; how much processor overhead you are saving is up in the air. Like I said, I just put the task to sleep and hope it's done on the next cycle.
Just trying to give you some of my brainstorms to see if they can help with your implementation, or perhaps you'll come up with a better method. I would try to figure out what kind of overhead you're looking at, and what the average 'wait' time is for spinning vs. doing the queue thing (or any other method), and see if it's even worth it before proceeding with any implementation.
The way I had it planned, each queue field is a 32-bit value: the upper 8 bits contain the APIC ID of the processor and the lower 24 the process ID. When a task seeks to acquire a lock, it pushes this information into the queue atomically and calls for a task switch, pushing a state code along with the lock address to indicate it's waiting on a lock. The scheduler will receive this, calculate the task's remaining time, and switch to whatever other tasks are available. If none are available it will halt, or use monitor/mwait if available, to put the processor into a low-power state.
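So a queue entry would pack/unpack something like this (macro names are just for illustration):

```c
#include <stdint.h>

/* Upper 8 bits: APIC ID of the waiting processor; lower 24 bits: process ID. */
#define LOCKQ_PACK(apic_id, pid)  ((((uint32_t)(apic_id) & 0xFF) << 24) | \
                                   ((uint32_t)(pid) & 0x00FFFFFF))
#define LOCKQ_APIC(entry)         (((entry) >> 24) & 0xFF)
#define LOCKQ_PID(entry)          ((entry) & 0x00FFFFFF)
```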
There are also going to be two 32-bit fields for every processor to indicate whether it's servicing a lock acquisition. One of these will be the process ID and the other will be the current value of the lock. One of them will be set as a monitored address if the processor supports monitor/mwait, in order to wake it for the IPI that's about to follow. If the primary field isn't null, denoting it's in use, the release mechanism will go down the queue list looking for a processor whose field is free; if it hits the end of the list or a null entry it'll wrap back around until it finds one. The odds of this happening very often shouldn't be too high, as the IPI service time should be low, and on a dual-core system there's really no chance of it ever happening during proper operation. The selected list entry will be removed by successive locked xchg operations from the end of the list, and an IPI will be issued from the information provided in the lock. If the request comes from the same processor, the fields are simply set and an INT opcode is issued to vector to the same ISR.
The ISR for the lock-acquisition IPI would locate the indicated process in its task table and alter its saved EAX to the value provided in the other lock-acquisition field. Once the acquire function runs again, it immediately executes a cmpxchg with what should be the current value of the lock. This should guarantee exclusive lock acquisition without ever freeing the lock globally.
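Very roughly, the release/hand-off side might look something like this in C (the per-CPU mailbox, send_ipi(), the vector number, and the task lookup are placeholders; I've left out the locked-xchg queue compaction and monitor/mwait, and simplified it to fall back to freeing the lock if every CPU is busy instead of wrapping around):

```c
#include <stdint.h>

#define MAX_CPUS            8
#define MAX_WAITERS         16
#define LOCK_HANDOFF_VECTOR 0xA0          /* hypothetical IPI vector */

typedef struct {
    volatile uint32_t value;                 /* current (held) lock value       */
    volatile uint32_t waiters[MAX_WAITERS];  /* packed APIC ID + PID, 0 = empty */
} queued_lock_t;

/* Per-CPU "servicing a lock acquisition" fields: the PID being handed the
 * lock and the value its pending cmpxchg should expect. */
typedef struct {
    volatile uint32_t pid;                   /* 0 = mailbox free */
    volatile uint32_t lock_value;
} lock_handoff_t;

extern lock_handoff_t handoff[MAX_CPUS];     /* assuming APIC ID indexes this */
extern void send_ipi(uint32_t apic_id, uint8_t vector);

static void queued_lock_release(queued_lock_t *l)
{
    for (int i = 0; i < MAX_WAITERS; i++) {
        uint32_t entry = l->waiters[i];
        if (entry == 0)
            break;                           /* null entry = end of the list */
        uint32_t apic = (entry >> 24) & 0xFF;
        /* Claim that CPU's mailbox only if it isn't already servicing one. */
        if (__sync_val_compare_and_swap(&handoff[apic].pid, 0,
                                        entry & 0x00FFFFFF) == 0) {
            handoff[apic].lock_value = l->value;
            l->waiters[i] = 0;               /* real code compacts with xchg */
            send_ipi(apic, LOCK_HANDOFF_VECTOR);   /* hand off, never freed */
            return;
        }
    }
    l->value = 0;                            /* nobody waiting: free the lock */
}

/* On the target CPU, the hand-off ISR would then (in register-level code):
 *   1. look up the task handoff[apic].pid in the task table,
 *   2. set that task's saved EAX to handoff[apic].lock_value,
 *   3. clear handoff[apic].pid, issue the EOI, and switch to the task,
 * so its pending cmpxchg matches and the lock is never freed globally. */
```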
I'd like to use different scales of locks; this one would probably be reserved only for more processor-intensive tasks that require locking. A standard spinlock would do for anything with a shorter execution time that tends to remain static. I don't really know the performance overhead at this point; I'll probably just write the code and test it out.