sync on a smp system (again)


Re:sync on a smp system (again)

Post by FlashBurn »

I'm using cdimage from Microsoft, and because of that there are so many "microsoft"s!

Re:sync on a smp system (again)

Post by bubach »

Wow, not bad. They got some copyright **** into an image file!? :o
"Simplicity is the ultimate sophistication."
http://bos.asmhackers.net/ - GitHub

Re:sync on a smp system (again)

Post by FlashBurn »

Ok, I'm back online at home ;D

@brendan

I have a bootsector for floppies, HDDs and CDs, but the code for parsing the FAT filesystem isn't ready. I will work on that when I find the time and am in the mood for it.

@all

At the moment I use a spinlock for every thread; the scheduler takes this spinlock when it dequeues a thread and releases it when it enqueues the thread. This method is not really working :(

I'd like to hear if anyone has a better method for this problem.

Problem:

cpu1 puts a thread in a semaphore queue and releases the spinlock; cpu2 then finishes with the code that the semaphore protects and tries to wake up the thread, which is still running on cpu1.
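
In pseudo-C the window looks something like this (the names are made up for illustration, not my actual code):

Code: Select all

/* cpu1 runs this in sem_wait() for threadX: */
void sem_wait(semaphore_t *sem, thread_t *self)
{
    spin_lock(&sem->queue_lock);
    enqueue(&sem->waiters, self);    /* threadX is now visible to cpu2 */
    spin_unlock(&sem->queue_lock);
    /* <-- the window: cpu2 can dequeue threadX here and try to run it,
       but threadX is still executing this very code on cpu1 */
    schedule();                      /* only now does cpu1 switch away */
}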

Re:sync on a smp system (again)

Post by Candy »

I'm not sure it's a method for your OS design but it's at least one that I wanted to employ in my own OS. I designed it for massively parallel computing.

One processor is designated the reschedule-handle-processor. All runnable threads are in priority order (or any order, in fact, I don't care :)) in the available list. CPUs are numbered in speed order (i.e., using the first N processors should give the fastest working setup). The first CPU waits for the available-list lock and takes it; it then takes the first value off the head and sends an IPI to the next processor. That one takes the first value off the head and sends the IPI on. The last processor releases the spinlock. If a processor can't run a thread, it goes idle and adds itself to the idle processor list.

The scheduling itself is pretty ad hoc: it schedules N processes at once with the least amount of overhead possible. There is no spinlocking, no competition, no real problems to be had.
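
A rough sketch of the dispatch chain in pseudo-C (invented names; I haven't actually written this yet):

Code: Select all

/* Runs on each CPU in turn. CPU 0 acquires the available-list lock
   before starting the chain (e.g. from the timer interrupt). */
void dispatch(int cpu)
{
    thread_t *t = pop_head(&available_list);    /* take one thread */
    if (cpu + 1 < num_cpus)
        send_ipi(cpu + 1, DISPATCH_VECTOR);     /* pass the baton on */
    else
        spin_unlock(&available_list_lock);      /* last CPU releases */
    if (t != NULL)
        switch_to(t);                           /* run the thread */
    else
        add_to_idle_list(cpu);                  /* nothing to run: go idle */
}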

The four limits are of course:
1. The first processor is interrupted again before the last releases the spinlock. If it detects this, it skips the interrupt, or at your option, changes the clock speed or such.
2. The list goes empty because the refill-methods are too slow. You can't make them asynchronous, but you can merge the addition into this algorithm by making each take one thread off the head and inserting whichever it needed to during the last cycle.
3. A processor needs to find another thing to do while the spinlock is taken. It then spins on the lock (yup, there's a reason for it to be a spinlock) until it can do something.
4. Setting up or using gang scheduling isn't accounted for. All processes are considered equal, and the speed of a processor or cache coherency issues are ignored. You could order the processors by logical level (where each group has a likelihood of being put together) and change the available-list to a 2-layer structure, where processes are added with a given priority, with threads in them which can then be run. This increases the chance of them ending up on a higher shared cache level (lower numbered and faster) than otherwise.

On your specific problem: if you're trying to wake up a thread that's not sleeping, store a wakeup for it instead. Use a semaphore and store the number of wakeups in it.
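
In pseudo-C, the idea is roughly this (invented names; the important part is that sleeping and releasing the lock happen atomically):

Code: Select all

void sem_signal(semaphore_t *s)
{
    spin_lock(&s->lock);
    thread_t *t = dequeue(&s->waiters);
    if (t != NULL)
        make_runnable(t);       /* someone really is asleep: wake it */
    else
        s->count++;             /* nobody waiting yet: store the wakeup */
    spin_unlock(&s->lock);
}

void sem_wait(semaphore_t *s, thread_t *self)
{
    spin_lock(&s->lock);
    if (s->count > 0) {
        s->count--;             /* a stored wakeup: don't sleep at all */
        spin_unlock(&s->lock);
        return;
    }
    enqueue(&s->waiters, self);
    sleep_and_unlock(self, &s->lock);   /* atomically block + release */
}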

HTH, Candy

Re:sync on a smp system (again)

Post by Brendan »

Hi,
Candy wrote: I'm not sure it's a method for your OS design but it's at least one that I wanted to employ in my own OS. I designed it for massively parallel computing.
I did it completely differently...

For the last version of my OS, each CPU had its own independent "mini-scheduler", where each mini-scheduler used many "ready to run" queues.

When a thread became ready to run, the OS decided which CPU it should run on (the least loaded CPU that is "enabled" via the thread's CPU affinity). Once this is decided, the correct queue is determined (based on the thread's policy/priority). Then the queue's spinlock is acquired, the thread is inserted into the queue, and the spinlock is freed.

For the mini-scheduler itself (for thread switches), it would insert the currently running thread on a CPU's "ready to run" queue (not necessarily the same CPU's). Then it would find which of its "ready to run" queues has the highest priority thread, acquire the spinlock for that queue, take the first thread off the queue, then release the spinlock and switch threads.

During the thread switch, the scheduler calculates the maximum amount of time the thread should be given (based on its priority, etc) and sets the CPU's local APIC timer accordingly (local APIC in "one-shot" mode).

The chance of one CPU finding a queue's lock held by another CPU is very small because the lock is never held for long, all schedulers/CPUs are doing things at different times, and there are many queues for each CPU/mini-scheduler (many locks).
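
In rough pseudo-C (names invented for this post, not my actual source):

Code: Select all

run_queue_t ready[NUM_CPUS][NUM_QUEUES];      /* one spinlock per queue */

void make_ready(thread_t *t)
{
    int cpu = least_loaded_cpu(t->affinity);  /* pick the target CPU */
    run_queue_t *q = &ready[cpu][queue_for(t->policy, t->priority)];
    spin_lock(&q->lock);                      /* held only for the insert */
    insert_thread(q, t);
    spin_unlock(&q->lock);
}

void mini_schedule(int cpu)                   /* runs on that CPU only */
{
    run_queue_t *q = highest_priority_nonempty(ready[cpu]);
    spin_lock(&q->lock);
    thread_t *next = remove_head(q);
    spin_unlock(&q->lock);
    apic_timer_oneshot(timeslice_for(next));  /* max time for this thread */
    switch_to(next);
}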

CPU load balancing is a tricky thing to describe. At first glance it would appear that this system would take a while to cope with CPU load changes. For example, if 2 CPUs are under heavy load and all of the threads for CPU A decide to block, you'd get an unbalanced load (CPU A without much work to do and CPU B still under heavy load). In practice this doesn't happen, or doesn't matter when it does.

Very high priority threads typically run for very small periods of time before blocking or using all of the allocated timeslice and are always run first, so the load from very high priority threads will become balanced quickly. The load from very low priority threads could take a relatively long time to become balanced, but they don't matter much because they are very low priority and CPU load is likely to change a lot before they'd run anyway.

In general, the higher a thread's priority, the more quickly its load becomes balanced; and the lower its priority, the less quick balancing matters. I hope that makes sense...

The advantages of this scheme are that it's very fast and no single lock is under heavy contention. Apart from the many "ready to run" queue locks there are no other locks, and there are no IPIs.



Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Re:sync on a smp system (again)

Post by Candy »

I tend to try to separate the scheduler from the actual running of threads. Up to a certain level, it's possible for one cpu to schedule threads to all the others and be able to do that within the set time limit. If you centralize the scheduling you can both experiment with and profit from concurrent scheduling of threads and stuff like that.

I mainly prefer centralizing it because I want to be able to auto-thread a program using my module system. In that case I think I need a very complex scheduler, so it's practical to separate it out.

Since it'll also take until 8-way Opteron 6's are commonplace before my OS is going to run properly, I'd better put in decent concurrent scheduling :).

Re:sync on a smp system (again)

Post by Brendan »

Hi,
Candy wrote: I tend to try to separate the scheduler from the actual running of threads. Up to a certain level, it's possible for one cpu to schedule threads to all the others and be able to do that within the set time limit. If you centralize the scheduling you can both experiment with and profit from concurrent scheduling of threads and stuff like that.
I'm not sure I understand your scheduler properly - what happens when a thread blocks?

For example, how would your scheduler handle "IPC ping-pong"? IPC ping-pong is a little stress test I invented - it goes something like:

Code: Select all

threadA:
    for (;;) {
       sendMessage(threadB);    // Un-block thread B
       wait_for_message();      // Block thread A
       count++;
    }

threadB:
    for (;;) {
       sendMessage(threadA);    // Un-block thread A
       wait_for_message();      // Block thread B
    }
The "IPC ping-pong" test is a horrid little thing. For my scheduler (assuming that the threads involved have high enough priority) the likely result is an entire CPU doing constant thread switches. The "count" variable increases as fast as possible (taking messaging and thread switch overhead into account), and all other CPUs remain unaffected.

For your scheduler I'm only guessing, but 2 possibilities would be:

A) When a thread blocks, the CPU it was running on spins until the reschedule-handle-processor causes threads to be rescheduled.

B) When a thread blocks, the CPU it was running on sends an IPI to the reschedule-handle-processor which forces threads to be rescheduled on all CPUs.

For case A, the IPC ping-pong test would be handled like my scheduler handles it, except that the "count" variable would increase much slower because the CPU would spend a huge amount of time spinning. This isn't too good.

For case B, the IPC ping-pong test would seriously degrade the performance of all CPUs (but the "count" variable would increase fast). This isn't too good either - it represents serious scalability problems.

The "perfect" result is open to interpretation, but IMHO getting that count variable increasing as fast as possible (i.e. reducing thread switch and IPC overhead) and minimizing the effect on other CPUs is important, especially if the OS has a relatively high number of "event driven" threads.


Cheers,

Brendan

Re:sync on a smp system (again)

Post by Candy »

Brendan wrote: I'm not sure I understand your scheduler properly - what happens when a thread blocks?

For example, how would your scheduler handle "IPC ping-pong"? IPC ping-pong is a little stress test I invented - it goes something like:

Code: Select all

threadA:
    for (;;) {
       sendMessage(threadB);    // Un-block thread B
       wait_for_message();      // Block thread A
       count++;
    }

threadB:
    for (;;) {
       sendMessage(threadA);    // Un-block thread A
       wait_for_message();      // Block thread B
    }
The "IPC ping-pong" test is a horrid little thing. For my scheduler (assuming that the threads involved have high enough priority) the likely result is an entire CPU doing constant thread switches. The "count" variable increases as fast as possible (taking messaging and thread switch overhead into account), and all other CPUs remain unaffected.

For your scheduler I'm only guessing, but 2 possibilities would be:

A) When a thread blocks, the CPU it was running on spins until the reschedule-handle-processor causes threads to be rescheduled.

B) When a thread blocks, the CPU it was running on sends an IPI to the reschedule-handle-processor which forces threads to be rescheduled on all CPUs.

For case A, the IPC ping-pong test would be handled like my scheduler handles it, except that the "count" variable would increase much slower because the CPU would spend a huge amount of time spinning. This isn't too good.

For case B, the IPC ping-pong test would seriously degrade the performance of all CPUs (but the "count" variable would increase fast). This isn't too good either - it represents serious scalability problems.

The "perfect" result is open to interpretation, but IMHO getting that count variable increasing as fast as possible (i.e. reducing thread switch and IPC overhead) and minimizing the effect on other CPUs is important, especially if the OS has a relatively high number of "event driven" threads.
From my own reply which you quoted:
Candy wrote: I tend to try to separate the scheduler from the actual running of threads. ... If you centralize the scheduling you can both experiment with and profit from concurrent scheduling of threads and stuff like that.
The emphasis was on "concurrent scheduling of threads". Given a multi-CPU system, the scheduler (which isn't described here - what I described is the dispatcher, in the form I see it now) would try to schedule those two at the same time. Multiple threads in a process performing the kind of IPC ping-pong you describe are covered in the Modern Operating Systems book under gang scheduling. Code should not block if the code that releases it is running on another CPU. Yes, this means it'd use two processors, but it would achieve much higher speeds.

I admit I haven't entirely worked out whether this is the best idea. It's at least one idea that works pretty well, since you avoid the normal spinning in the case of normal threads (mainly calculation threads). In the case of heavy I/O you'd get the normal degraded performance, which wouldn't matter much since the I/O would very probably be the bottleneck there (in my context of real-life x86 computers).

Thinking about it, another idea pops up. What if, for a massively parallel computer, you have one CPU continually dedicated to managing what the rest are doing? It could preschedule threads for CPUs, tell them to idle, and tell them to gang-schedule easily. You'd need an IPI with content (or a little space per CPU) to transmit the information, but other than that, I think it's an idea.

But that would be for massively parallel computers (at least 64 processor cores). I expect this method to allow CPUs to reschedule at least often enough.

In the case of the example where only one of the threads would be scheduled: yep, it'd be blocked until the other CPUs have a thread to run. When that happens, it can switch back and forth quickly.

I fail to see where and/or how your method would be required anywhere. Most situations would allow for a higher-bandwidth, higher-latency transmission, such as a cyclic buffer with semaphores.

Re:sync on a smp system (again)

Post by Brendan »

Hi,

Doh - our assumptions don't mix well, and mine weren't stated (and should have been). :)

I assume that all CPUs have something they can do. Mainly because performance and other factors can usually be ignored otherwise, but also because it's true for my OS (where there's a "system idle thread" for each CPU that does "HLT" and other things).

For me, if a thread calls "get_message()" when there are no messages for it to get, then it blocks until a message is received and another thread runs instead. This is what makes the ping-pong test so horrid - it's meant to cause an extremely high number of thread switches.
Candy wrote:I fail to see where and/or how your method would be required anywhere. Most situations would allow for a higher-bandwidth higher-latency transmission, such as a cyclic buffer with semaphores.
For some OSs (mine at least) there are natural "thread chains" consisting of separate event-driven threads.

For example, when a user presses a key, the kernel sends an "IRQ received" message to the keyboard driver; the keyboard driver services the hardware and sends a "key press" message to the user interface. The user interface receives this and sends the "key press" message to the active application. Typically the active application takes a look and sends an "add this character to my video" message back to the user interface. The user interface then sends a message to the font renderer, the font renderer sends a return message containing pixel data, and the user interface adds the pixel data to the application's video buffer and then sends a message to the video driver.

The "thread chain" goes:
kernel->keyboard_driver->UI->app->UI->fonts->UI->video

A similar thing can happen for file I/O:
app->VFS->filesystem->VFS->disk_driver->VFS->filesystem->VFS->app

Most of the steps in the chain involve very little processing, and you end up with frequent short bursts of "ping-pong".
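
Each link in such a chain is just an event loop, something like this (a sketch only):

Code: Select all

/* One link in the chain - e.g. the font renderer. Each hop costs one
   un-block plus one block, i.e. one thread switch. */
for (;;) {
    message = wait_for_message();        /* block until work arrives */
    result = handle(message);            /* usually very little processing */
    sendMessage(next_in_chain, result);  /* un-block the next link */
}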

As you can tell, something like gang scheduling won't help for these thread chains.
Candy wrote:I admit I haven't entirely worked out whether this is the best idea. This is at least one idea which works pretty good, since you avoid the normal spinning in case of normal threads (mainly for calculation threads). In case of heavy-IO, you'd get the normal degraded performance, which wouldn't matter much since the IO would very probably be the bottle neck there (in my context of real-life x86 computers).
IMHO it'd depend on the load characteristics and style of IPC used. For "client/server" style IPC it'd work well, but for my style of IPC (where there's no return message to wake the blocked thread up) it wouldn't work too well.

I'd also assume that any combination of threads could be running at any given point in time (e.g. some pure calculation threads, some pure I/O threads and many threads that are somewhere in-between). Further, I'd look at the behaviour when the OS is overloaded, as that's where overhead matters most.
Candy wrote:Thinking about it, another idea pops up. What if, for a massively parallel computer you have one cpu continually dedicated to managing what the rest is doing? It could preschedule threads for cpu's, tell them to idle and tell them to gang schedule easily. You'd need an IPI with content (or a little space per CPU) to transmit the information, but other than that, I think it's an idea.
For a large parallel computer you end up with NUMA, where typically there are nodes consisting of 4 or 8 CPUs and a memory bank, and lots of these nodes. To improve performance the OS would try to minimize the traffic between nodes, so you'd probably want a mostly separate scheduler for each node rather than one CPU trying to schedule threads for all (near and far) CPUs.

One method of doing this is to run a mostly independent instance of the OS on each node, so that each node has its own copy of the kernel, its own device drivers (for any hardware controlled from that node), its own free page pool, etc. This isn't much different to a cluster of separate computers, except the "inter-node" bus is used for high-speed networking and very little else.

AFAIK massively parallel systems (e.g. more than 128 CPUs) are almost always clusters of separate computers (usually connected by some form of high-speed networking). For example, 256 4-way computers sharing a LAN. The difference between this and a normal network of computers isn't much - aesthetics (compact motherboards, plug-in blades, rack-mount backplanes) and no keyboard/video/sound/floppy/CD.


Cheers,

Brendan

Re:sync on a smp system (again)

Post by Candy »

Brendan wrote: Doh - our assumptions don't mix well, and mine weren't stated (and should have been). :)
As was to be expected. Most of the time, when you have a discussion with somebody and you can't figure out why you can't agree, you have implicit assumptions that don't match.
I assume that all CPUs have something they can do. Mainly because performance and other factors can usually be ignored otherwise, but also because it's true for my OS (where there's a "system idle thread" for each CPU that does "HLT" and other things).

For me, if a thread calls "get_message()" when there are no messages for it to get, then it blocks until a message is received and another thread runs instead. This is what makes the ping-pong test so horrid - it's meant to cause an extremely high number of thread switches.
Well... from that point of view, the ping-pong situation is quite likely. I'm also now realising that you implicitly assumed a microkernel, where I implicitly assumed a monolithic kernel. Microkernels seem to swap threads in kernel-land more often.
The "thread chain" goes:
kernel->keyboard_driver->UI->app->UI->fonts->UI->video

A similar thing can happen for file I/O:
app->VFS->filesystem->VFS->disk_driver->VFS->filesystem->VFS->app
I tend not to consider these so high-priority that they must be processed immediately. I try to see each thread as a separate entity, and I see interlocks between threads as bad design. If you have a thread that sends a message and then busy-waits (or stop-waits) on the answer, it's a pointless call to another thread (since you'll incur two switches for pretty much no improvement). Threads should be able to process independently of each other. So, I'd see the keyboard thread as a thread that receives a keystroke, stores it in the GUI server's keyboard input buffer and waits for another keystroke. If there are no more keystrokes, a thread switch occurs.

The GUI server runs. It processes all messages there are and only when it can not do any more (not when it can let another run!) does it switch out too.

That reduces the context switches to a minimum (if it really does work out; I've still got to make it and perform test runs).
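
The loop would look something like this (a sketch; try_get_message is an invented non-blocking call):

Code: Select all

for (;;) {
    message = wait_for_message();            /* block only when truly idle */
    do {
        handle(message);
    } while (try_get_message(&message));     /* drain the whole backlog first */
}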

This method of programming does benefit from gang scheduling. Since all threads are working until they can't get any more work to do, they'll use the CPU in full. If they can't get any more work, they'll switch to another thread and do something else. Given a number of threads where each next thread has an equal or larger amount of work per byte of input, this chain never stops and processing is done with near-100% utilisation.

My main problem with scheduling is that these processing things aren't constant in the amount of work, and that they're not commonly constructed to run on multiple CPUs. So, my main challenge at the moment is making a module interface that allows modules to be ripped apart at any point into more threads, with the least overhead going into the separation when the separation point is pretty much halfway. That way, a program can assume a module is doing the work specified, each module assumes the data is available and local, and the entire program runs on all CPUs even though there's no explicit threading going on.

I just don't know how to split them up yet, so I'm going to have to:
1. Get the module system running - a test version of zlib is being set up...
2. Figure out how to split them into multiple threads without breaking the program. I expect this to work similarly to the "Proxy" design pattern, where the "get work from previous step" is replaced by a "read from buffer" and where a new endpoint thread is created that auto-pulls data from the previous step. The endpoint thread will be the proxy for the next step and the read-buffer "module" will be the proxy for the entire chain before it (see the sketch after this list). There's just this point about deciding where to split..... I smell a chance for a PhD :)
3. Create an efficient scheduler that knows about these threads and gang-schedules them.
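
For point 2, the split might look something like this (very much a sketch; all names invented):

Code: Select all

/* The endpoint thread auto-pulls from the upstream modules and fills a
   semi-permanent bounded buffer; the read-buffer "module" is the proxy
   for the entire chain before it. */
void endpoint_thread(chain_t *upstream, ringbuf_t *buf)
{
    for (;;) {
        block_t b = chain_pull(upstream);   /* run the upstream half */
        ringbuf_put(buf, b);                /* blocks while the buffer is full */
    }
}

block_t read_buffer_module(ringbuf_t *buf)  /* the downstream half's "source" */
{
    return ringbuf_get(buf);                /* blocks while the buffer is empty */
}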

oh... how about splitting this one again?

Re:sync on a smp system (again)

Post by Candy »

darned message limit :(
IMHO it'd depend on the load characteristics and style of IPC used. For "client/server" style IPC it'd work well, but for my style of IPC (where there's no return message to wake the blocked thread up) it wouldn't work too well.

I'd also assume that any combination of threads could be running at any given point in time (e.g. some pure calculation threads, some pure I/O threads and many threads that are somewhere in-between). Further, I'd look at the behaviour when the OS is overloaded, as that's where overhead matters most.
I'm assuming that too. However, I'm assuming IPC that isn't strictly sender/receiver with tight coupling, which greatly decreases the importance of thread switches. Reducing thread-switch overhead is a good thing, but if you can simply reduce the number of thread switches, try that first.
For a large parallel computer you end up with NUMA, where typically there are nodes consisting of 4 or 8 CPUs and a memory bank, and lots of these nodes. To improve performance the OS would try to minimize the traffic between nodes, so you'd probably want a mostly separate scheduler for each node rather than one CPU trying to schedule threads for all (near and far) CPUs.

One method of doing this is to run a mostly independent instance of the OS on each node, so that each node has its own copy of the kernel, its own device drivers (for any hardware controlled from that node), its own free page pool, etc. This isn't much different to a cluster of separate computers, except the "inter-node" bus is used for high-speed networking and very little else.

AFAIK massively parallel systems (e.g. more than 128 CPUs) are almost always clusters of separate computers (usually connected by some form of high-speed networking). For example, 256 4-way computers sharing a LAN. The difference between this and a normal network of computers isn't much - aesthetics (compact motherboards, plug-in blades, rack-mount backplanes) and no keyboard/video/sound/floppy/CD.
And that's where you get outside my target. I'm still aiming for consumer systems, home computers and things used like that. I'm aiming to allow user-type functions to be multithreaded as well as possible, so scaling beyond 64 CPUs is out of scope. Yet I do need an intelligent scheduler that schedules all threads with respect to the others, mainly considering gang scheduling and things like that (since with my programming model, that'll speed up the processing more than anything else). It'll need to be able to do the module overhead stuff and signal when there's room for more splits, or when there are too many threads so that some have to be merged. That'll give the best response times and processing speeds for my target.

Generally speaking though, up to a reasonably-parallel computer, it's probably preferable to do the scheduling more or less centrally, so that you can exploit the inter-thread relations and manage the moving of threads between CPUs and things like that. For a system that's larger than that, it can be applied at the semi-top layer, the one below the top one. The top one in that case would be the LAN, where the semi-top layer would be the 4-CPU box. In that case it's again advantageous to schedule them all together, so that they can be scheduled as efficiently as possible.

Re:sync on a smp system (again)

Post by Brendan »

Hi,
Candy wrote:
Brendan wrote:The "thread chain" goes:
kernel->keyboard_driver->UI->app->UI->fonts->UI->video

A similar thing can happen for file I/O:
app->VFS->filesystem->VFS->disk_driver->VFS->filesystem->VFS->app
I tend not to consider these so high-priority that they must be processed immediately. I try to see each thread as a separate entity, and I see interlocks between threads as bad design. If you have a thread that sends a message and then busy-waits (or stop-waits) on the answer, it's a pointless call to another thread (since you'll incur two switches for pretty much no improvement). Threads should be able to process independently of each other. So, I'd see the keyboard thread as a thread that receives a keystroke, stores it in the GUI server's keyboard input buffer and waits for another keystroke. If there are no more keystrokes, a thread switch occurs.

The GUI server runs. It processes all messages there are and only when it can not do any more (not when it can let another run!) does it switch out too.

That reduces the context switches to a minimum (if it really does work out; I've still got to make it and perform test runs).
This is how my OS does do things, but typically life isn't that simple. Take my UI for example - if I set its priority to very low then it's much more likely to have many messages in its queue when it gets CPU time, and it will process all of these messages until there are none left (before it blocks because it has nothing to do). This would have a huge impact on response time though - if the user presses a key when the computer is under load it'd take ages for the computer to respond, and the user would be disgusted. Instead, if I give the UI a high priority, it's likely there will only be one message in its queue when it gets CPU time. In this case it'd process that one message and then probably block because it has nothing else to do.

For my OS, for the sake of responsiveness, all device drivers have a very high priority and things like the UI and VFS have a high priority. This is what leads to those thread chains (in conjunction with my refusal to waste CPU time by spinning until a message arrives when the thread has nothing left to do).
Candy wrote:My main problem with scheduling is that these processing things aren't constant in the amount of work, and that they're not commonly constructed to run on multiple CPUs. So, my main challenge at the moment is making a module interface that allows modules to be ripped apart at any point into more threads, with the least overhead going into the separation when the separation point is pretty much halfway. That way, a program can assume a module is doing the work specified, each module assumes the data is available and local, and the entire program runs on all CPUs even though there's no explicit threading going on.

I just don't know how to split them up yet, so I'm going to have to:
1. Get the module system running - a test version of zlib is being set up...
2. Figure out how to split them into multiple threads without breaking the program. I expect this to work similarly to the "Proxy" design pattern, where the "get work from previous step" is replaced by a "read from buffer" and where a new endpoint thread is created that auto-pulls data from the previous step. The endpoint thread will be the proxy for the next step and the read-buffer "module" will be the proxy for the entire chain before it. There's just this point about deciding where to split..... I smell a chance for a PhD :)
3. Create an efficient scheduler that knows about these threads and gang-schedules them.
If you can make this work without the programmer specifically designing code for it, I'll give you a knighthood! :)

If the programmer does need to design code specifically for your module system, then I don't see why applications couldn't do it with normal threads on any decent OS...


Cheers.

Brendan

Re:sync on a smp system (again)

Post by Candy »

Brendan wrote: For my OS, for the sake of responsiveness, all device drivers have a very high priority and things like the UI and VFS have a high priority. This is what leads to those thread chains (in conjunction with my refusal to waste CPU time by spinning until a message arrives when the thread has nothing left to do).
For the sake of responsiveness, I'd have separate threads for anything with user interaction, detached from any physical delays and with high priority at all times. So they can't wait for images to be loaded, they can't stall because something isn't behaving as expected, and they're expected to send out all requests to do something at the same time.

If you separate everything into multiple threads for processing, you'll get a lot of thread switches. My logic still revolves around not switching when you don't have to, and it's partly the choice of a monolithic kernel that allows me not to switch all the time.
If you can make this work without the programmer specifically designing code for it, I'll give you a knighthood! :)

If the programmer does need to design code specifically for your module system, then I don't see why applications couldn't do it with normal threads on any decent OS...
I was hoping for the Nobel prize, of course. If you're going to aim for the impossible, you might as well aim for the top.

Well... he does have to adjust the code for the system... but I think the modification I'd require is a quite acceptable one.

The code for performing a modification on an object of any type, generalized, can be put into a module. The module-handling code in the module library can link modules together for you, given a certain start- and endpoint; you can link more of them together the same way.

The code the user creates is separated into two parts. One is explicit calculation code, separated into modules with specific inputs and outputs - say, one that converts a file-object into an image-object for filetype jpg, one that does the same for filetype png, and one that performs a blur conversion. He then creates a program interface that uses these modules without explicitly referring to them (so it's not like a DLL revisited). He asks the library for a module that can convert "file" to "image", filetype not important. For each image you want to view, it chains the module given by the OS to a blur filter to produce an image. The image output is acquired through a single-shot call to the stream (so it reads a single image from the file and blurs it). It's then assigned to a screen location.
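
In sketch form (all names invented, nothing final):

Code: Select all

typedef struct module {
    const char *input_type;            /* e.g. "file" */
    const char *output_type;           /* e.g. "image" */
    object_t  (*convert)(object_t in); /* the explicit calculation code */
} module_t;

/* The UI asks for a conversion by type, never for a concrete module: */
chain_t *c = find_chain("file", "image");   /* library picks the jpg or png module */
chain_append(c, &blur_module);              /* image -> image */
object_t img = chain_run_once(c, file_obj); /* single shot: read one image, blur it */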

This separates the UI programming, which can be done in any language you care to use, from the module programming, which can again be done in any language. Each module has a single responsibility; the UI provides generic functionality.

Disadvantages are that you can't pre-test everything, you can't be certain that people will use your UI, you can't mix code quickly and it's harder to make hacks.

Advantages are that you don't have to pretest everything, you can use any UI to do a given task even though it wasn't developed for those file formats, you can develop with a sure process and with separation of responsibilities and it helps creating cleaner code.

On the scheduling aspect: if you're requested to create a chain of more than one of these modules, it can be advantageous to split the stream somewhere. Since each step creates discrete blocks of output and each step after it can discretely acquire more of those blocks, it's possible to chop the chain into multiple bits, with semi-permanent buffers in between. That way any application will benefit from multiple processors, even an application as trivial as an image-blurring program. It decreases response times given multiple CPUs, which is IMO a very important thing for consumer computers in the coming decade.

Re:sync on a smp system (again)

Post by FlashBurn »

I think that both of you have good methods, but I'd like to stay with mine and make it work.

I have changed the code, especially the locking of the threads and the scheduler queue. But now I get a deadlock, and I can't find where it could occur. So maybe some of you could have a look!?

Re:sync on a smp system (again)

Post by Candy »

FlashBurn wrote: I think that both of you have good methods, but I'd like to stay with mine and make it work.

I have changed the code ... now I get a deadlock ... some of you could have a look!?
Since you changed the code, it could prove useful to wait for you to upload the new code before we look. Do you have any idea where it might go wrong?