Re:sync on a smp system (again)
Posted: Mon Nov 21, 2005 4:18 am
I did it completely differently...
Candy wrote: I'm not sure it's a method for your OS design, but it's at least one that I wanted to employ in my own OS. I designed it for massively parallel computing.
I'm not sure I understand your scheduler properly - what happens when a thread blocks?
Candy wrote: I tend to try to separate the scheduler from the actual running of threads. Up to a certain level, it's possible for one CPU to schedule threads for all the others and be able to do that within the set time limit. If you centralize the scheduling you can both experiment with and profit from concurrent scheduling of threads and stuff like that.
Code: Select all
threadA:
    for (;;) {
        sendMessage(threadB);   // Un-block thread B
        wait_for_message();     // Block thread A
        count++;
    }

threadB:
    for (;;) {
        sendMessage(threadA);   // Un-block thread A
        wait_for_message();     // Block thread B
    }
From my own reply which you quoted:
Brendan wrote: I'm not sure I understand your scheduler properly - what happens when a thread blocks?
For example, how would your scheduler handle "IPC ping-pong"? IPC ping-pong is a little stress test I invented - it goes something like:
The "IPC ping-pong" test is a horrid little thing. For my scheduler (assuming that the threads involved have high enough priority) the likely result is an entire CPU doing constant thread switches. The "count" variable increases as fast as possible (taking messaging and thread switch overhead into account), and all other CPUs remain unaffected.
For your scheduler I'm only guessing, but 2 possibilities would be:
A) When a thread blocks, the CPU it was running on spins until the reschedule-handle-processor causes threads to be rescheduled.
B) When a thread blocks, the CPU it was running on sends an IPI to the reschedule-handle-processor which forces threads to be rescheduled on all CPUs.
For case A, the IPC ping-pong test would be handled like my scheduler handles it, except that the "count" variable would increase much slower because the CPU would spend a huge amount of time spinning. This isn't too good.
For case B, the IPC ping-pong test would seriously degrade the performance of all CPUs (but the "count" variable would increase fast). This isn't too good either - it represents serious scalability problems.
The "perfect" result is open to interpretation, but IMHO getting that count variable increasing as fast as possible (i.e. reducing thread switch and IPC overhead) and minimizing the effect on other CPUs is important, especially if the OS has a relatively high number of "event driven" threads.
The emphasis was on "concurrent scheduling of threads". Given a multi-CPU system, the scheduler (which isn't described here; this is the dispatcher in the form I see it now) would try to schedule those two at the same time. Multiple threads in a process that perform the kind of IPC ping-pong you describe are covered in the Modern Operating Systems book under gang scheduling. Code should not block if the code for releasing it is running on a CPU. Yes, this means it'd use two processors, but it would achieve much higher speeds.
Candy wrote: I tend to try to separate the scheduler from the actual running of threads. ... If you centralize the scheduling you can both experiment with and profit from concurrent scheduling of threads and stuff like that.
For some OSs (mine at least) there are natural "thread chains" consisting of separate event-driven threads.
Candy wrote: I fail to see where and/or how your method would be required anywhere. Most situations would allow for a higher-bandwidth, higher-latency transmission, such as a cyclic buffer with semaphores.
IMHO it'd depend on the load characteristics and the style of IPC used. For "client/server" style IPC it'd work well, but for my style of IPC (where there's no return message to wake the blocked thread up) it wouldn't work too well.
Candy wrote: I admit I haven't entirely worked out whether this is the best idea. It's at least one idea that works pretty well, since you avoid the normal spinning in the case of normal threads (mainly calculation threads). In the case of heavy I/O you'd get the usual degraded performance, which wouldn't matter much since the I/O would very probably be the bottleneck there (in my context of real-life x86 computers).
For a large parallel computer you end up with NUMA, where typically there are nodes consisting of 4 or 8 CPUs and a memory bank, and lots of these nodes. To improve performance the OS would try to minimize the traffic between nodes, so you'd probably want a mostly separate scheduler for each node rather than one CPU trying to schedule threads for all (near and far) CPUs.
Candy wrote: Thinking about it, another idea pops up. What if, for a massively parallel computer, you have one CPU continually dedicated to managing what the rest is doing? It could preschedule threads for CPUs, tell them to idle and tell them to gang-schedule easily. You'd need an IPI with content (or a little space per CPU) to transmit the information, but other than that, I think it's an idea.
As was to be expected. Most of the time, when you have a discussion with somebody and you can't figure out why you can't agree, you have implicit assumptions that don't match.
Brendan wrote: Doh - our assumptions don't mix well, and mine weren't stated (and should have been).
Well... from that point of view, the ping-pong situation is quite likely. I'm also now realising that you implicitly used a microkernel, where I implicitly used a monolithic kernel. Microkernels seem to swap threads in kernel-land more often.
Brendan wrote: I assume that all CPUs have something they can do. Mainly because performance and other factors can usually be ignored otherwise, but also because it's true for my OS (where there's a "system idle thread" for each CPU that does "HLT" and other things).
For me, if a thread calls "get_message()" when there are no messages for it to get, then it blocks until a message is received and another thread runs instead. This is what makes the ping-pong test so horrid - it's meant to cause an extremely high number of thread switches.
I tend to consider these not high priority, in that they must be processed directly. I try to see each thread as a separate entity, and I see interlocks between threads as bad design. If you have a thread that sends a message and then busy-waits (or stop-waits) on the answer, it's a pointless call to another thread (since you'll incur two swaps for pretty much no improvement). Threads should be able to process independently from others. So, I'd see the keyboard thread as a thread that receives a keystroke, stores it in the GUI server keyboard input buffer and waits for another keystroke. If there are no more keystrokes, a thread switch occurs.
Brendan wrote: The "thread chain" goes:
kernel->keyboard_driver->UI->app->UI->fonts->UI->video
A similar thing can happen for file I/O:
app->VFS->filesystem->VFS->disk_driver->VFS->filesystem->VFS->app
I'm assuming that too. I'm, however, assuming IPC that doesn't perform as a strict sender/receiver pair with tight coupling, which greatly decreases the importance of thread switches. Thread switch overhead reduction is a good thing, but if you can just reduce the number of thread switches, try that first.
Brendan wrote: IMHO it'd depend on the load characteristics and the style of IPC used. For "client/server" style IPC it'd work well, but for my style of IPC (where there's no return message to wake the blocked thread up) it wouldn't work too well.
I'd also assume that any combination of threads could be running at any given point in time (e.g. some pure calculation threads, some pure I/O threads and many threads that are somewhere in-between). Further, I'd look at the behaviour when the OS is overloaded, as that's where overhead matters most.
And that's where you get out of my line of target. I'm still aiming for consumer systems, home computers, and stuff used like that. I'm aiming to allow user-type functions to be multithreaded as well as possible, so scaling beyond 64 CPUs is out of bounds. Yet I do need an intelligent scheduler that schedules all threads with respect to the others, mainly considering gang scheduling and stuff like that (since with my programming model, that'll speed up the processing more than anything else). It'll need to be able to do the module overhead stuff and signal when there's room for more splits, or when there are too many threads so that some have to be merged. That'll give the best response times and processing speeds for my target.
Brendan wrote: For a large parallel computer you end up with NUMA, where typically there are nodes consisting of 4 or 8 CPUs and a memory bank, and lots of these nodes. To improve performance the OS would try to minimize the traffic between nodes, so you'd probably want a mostly separate scheduler for each node rather than one CPU trying to schedule threads for all (near and far) CPUs.
One method of doing this is to run a mostly independent instance of the OS on each node, so that each node has its own copy of the kernel, its own device drivers (for any hardware controlled from that node), its own free page pool, etc. This isn't much different to a cluster of separate computers, except the "inter-node" bus is used for high-speed networking and very little else.
AFAIK massively parallel systems (e.g. more than 128 CPUs) are almost always clusters of separate computers (usually connected by some form of high-speed networking). For example, 256 4-way computers sharing a LAN. The difference between this and a normal network of computers isn't much - aesthetics (compact motherboards, plug-in blades, rack-mount backplanes) and no keyboard/video/sound/floppy/CD.
This is how my OS does do things, but typically life isn't that simple. Take my UI for example - if I set its priority very low, then it's much more likely to have many messages in its queue when it gets CPU time, and it will process all of these messages until there are none left (before it blocks because it has nothing to do). This would have a huge impact on response time though - if the user presses a key when the computer is under load it'd take ages for the computer to respond to the key, and the user would be disgusted. Instead, if I give the UI a high priority, it's likely there will only be one message in its queue when it gets CPU time. In this case it'd process that one message and then probably block because it has nothing else to do.
Candy wrote: I tend to consider these not high priority, in that they must be processed directly. I try to see each thread as a separate entity, and I see interlocks between threads as bad design. If you have a thread that sends a message and then busy-waits (or stop-waits) on the answer, it's a pointless call to another thread (since you'll incur two swaps for pretty much no improvement). Threads should be able to process independently from others. So, I'd see the keyboard thread as a thread that receives a keystroke, stores it in the GUI server keyboard input buffer and waits for another keystroke. If there are no more keystrokes, a thread switch occurs.
Brendan wrote: The "thread chain" goes:
kernel->keyboard_driver->UI->app->UI->fonts->UI->video
A similar thing can happen for file I/O:
app->VFS->filesystem->VFS->disk_driver->VFS->filesystem->VFS->app
The GUI server runs. It processes all the messages there are, and only when it cannot do any more (not when it could let another thread run!) does it switch out too.
That reduces the context switches to a minimum amount (if it really does work out - I have still got to make it and perform test runs).
If you can make this work without the programmer specifically designing code for it, I'll give you a knighthood!
Candy wrote: My main problem with scheduling is that these processing things aren't constant in the amount of work, and that they're not commonly constructed to run on multiple CPUs. So my main challenge at the moment is making a module interface that allows the modules to be ripped apart at any point into more threads, with the least amount of overhead going into the separation where the separation point is pretty much halfway. That way, a program can assume a module is doing the work specified, each module assumes the data is available and local, and the entire program runs on all CPUs even though there's no explicit threading going on.
I just don't know how to split them up yet, so I'm going to have to:
1. Get the module system running - a test version of zlib is being set up...
2. Figure out how to split them into multiple threads without breaking the program. I expect this to work similarly to the "Proxy" design pattern, where "get work from previous step" is replaced by "read from buffer", and where a new endpoint thread is created that auto-pulls data from the previous step. The endpoint thread will be the proxy for the next step, and the read-buffer "module" will be the proxy for the entire chain before it. There's just this point about deciding where to split... I smell a chance for a PhD
3. Create an efficient scheduler that knows about these threads and gang-schedules them.
For the sake of responsiveness, I'd have separate threads for anything with user interaction, detached from any physical delays and with high priority at all times. So they can't wait for images to be loaded, they can't stop because something isn't behaving as expected, and they're expected to send out all requests to do something at the same time.
Brendan wrote: For my OS, for the sake of responsiveness, all device drivers have a very high priority and things like the UI and VFS have a high priority. This is what leads to those thread chains (in conjunction with my refusal to waste CPU time by spinning until a message arrives when the thread has nothing left to do).
I was hoping for the Nobel prize, of course. If you're going to aim for the impossible, you might as well aim for the top impossible.
Brendan wrote: If you can make this work without the programmer specifically designing code for it, I'll give you a knighthood!
If the programmer does need to design code specifically for your module system, then I don't see why applications couldn't do it with normal threads on any decent OS...
Since you changed the code, it could prove useful to wait for you to upload the new code before we look. Do you have any idea where it might go bad?
FlashBurn wrote: I think that both of you have good methods, but I'd like to stick with mine and make it work.
I have changed the code... now I get a deadlock... could some of you have a look!?