
CPU Scheduling and Memory Management algorithms

Posted: Tue Nov 09, 2004 3:17 pm
by Googles
What algorithms do you use in your CPU scheduler (for deciding which task should run next)? And are you using a job scheduler too?

And my second question: what memory management algorithm (for malloc and so on) are you using?

I've been trying for a long time to find out which algorithms would handle the job fastest and best, but with limited success, so I'm hoping one of you guys can help me out.

TIA!

Re: CPU Scheduling and Memory Management algorithms

Posted: Tue Nov 09, 2004 4:25 pm
by Candy
Googles wrote: What algorithms do you use in your CPU scheduler (for deciding which task should run next)? And are you using a job scheduler too?
My CPU scheduler uses the simplest algorithm there is for most tasks: it takes the highest priority level that has runnable tasks and runs the first task in it.

The exception, coming in the near future, is soft realtime, in which tasks can request that certain threads be prioritized above others, so that they get up to 70% of the CPU (an arbitrary limit) in order to make their deadlines, instead of only their normally entitled slice. This is to make sure that realtime programs (media players, for instance) get the chance to do their realtime processing. The requirements are that such a thread is activated the same way every time, that it does one frame of work and then (semi-)quits, and that it is not killed if a deadline is missed. The thread structure is reused for the same thread the next time, etc.
Googles wrote: And my second question: what memory management algorithm (for malloc and so on) are you using?
At the moment I'm using next fit, but I'm considering doing radix best fit instead (best fit with an index per power of 2). I don't have any time to actually implement it at the moment, so it's just dreaming.

Re: CPU Scheduling and Memory Management algorithms

Posted: Tue Nov 09, 2004 11:00 pm
by Googles
Could you tell me how radix best fit works?

Re: CPU Scheduling and Memory Management algorithms

Posted: Wed Nov 10, 2004 12:39 am
by Candy
Googles wrote: Could you tell me how radix best fit works?
Most of these memory allocators keep a list of blocks with empty space. The list can be stored either in the empty space itself (making each block at least X bytes large) or next to it (occasionally wasting space). Best fit, in general, searches the entire list for a block of size N, where N is the size you want. It ignores all smaller blocks and tries to find the best fit: the one that leaves the least slack (unused) space. An improvement is keeping the list sorted, which costs a little more time when freeing and allocating, but allows you to stop as soon as you find a block that fits.

Best fit has a habit of filling up the space with unusably small blocks. Still, it's quite good, since it does not waste large blocks. Its best cases include C++ environments where most allocations are identical in size; its worst cases include reallocs where you only slightly increase the size.
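
For illustration (this is not Candy's actual code), a minimal best-fit search over a singly linked free list might look like the sketch below, with the header stored in the free space itself; all the names are invented.

Code: Select all

#include <stddef.h>

/* Free-list node stored at the start of each free block. */
struct free_block {
    size_t size;              /* usable bytes in this block */
    struct free_block *next;  /* next free block, or NULL   */
};

/* Unsorted best fit: scan the whole list, remember the block that
 * fits with the least slack, and unlink it. */
void *best_fit_alloc(struct free_block **head, size_t n)
{
    struct free_block **link, **best_link = NULL;
    size_t best_slack = (size_t)-1;

    for (link = head; *link != NULL; link = &(*link)->next) {
        if ((*link)->size >= n && (*link)->size - n < best_slack) {
            best_slack = (*link)->size - n;
            best_link = link;
            if (best_slack == 0)        /* perfect fit: stop early */
                break;
        }
    }
    if (best_link == NULL)
        return NULL;                    /* no block is large enough */

    struct free_block *b = *best_link;
    *best_link = b->next;               /* unlink the winner */
    return b;                           /* caller may split off the slack */
}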

The idea behind the radix variant (not sure if that's the real name; I just thought it up by analogy with radix sort) is that you keep a bunch of pointers in an always-present array, where pointer number entnum points to the first block in the list of size X = 2^(entnum + c). c would be 3 or 4, making the smallest blocks 8 or 16 bytes. The array would stretch to about entnum = 28, covering blocks of up to 4GB.


The advantage of these pointers is that you can immediately skip over all blocks that are far too small, and start searching among blocks that are only slightly too small. It doesn't improve the best-fit best case (all blocks identical), but it does improve the best-fit worst case (all blocks of different sizes). The asymptotic time complexity (how long it takes as a function of the number of entries) doesn't decrease, but the actual time spent does: by about a factor of 28 in the ideal case, and my guess for the average case is 4-8x.
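
A sketch of what that bucket array could look like, reusing struct free_block and best_fit_alloc() from the sketch above, and assuming c = 4 (16-byte minimum blocks); again, the names are made up.

Code: Select all

#define MIN_SHIFT   4    /* c = 4: the smallest block is 16 bytes */
#define NUM_BUCKETS 28

/* buckets[i] heads the list of free blocks whose size falls in
 * [2^(i + MIN_SHIFT), 2^(i + MIN_SHIFT + 1)). */
static struct free_block *buckets[NUM_BUCKETS];

/* First bucket that could possibly hold a block of n bytes. */
static int first_bucket_for(size_t n)
{
    int i = 0;
    while (i < NUM_BUCKETS - 1 && ((size_t)1 << (i + MIN_SHIFT + 1)) <= n)
        i++;
    return i;
}

/* Radix best fit: skip every bucket that is certainly too small, then
 * try each following bucket.  Only the first bucket inspected can hold
 * blocks smaller than n; after that, the first non-empty bucket fits. */
void *radix_best_fit(size_t n)
{
    for (int i = first_bucket_for(n); i < NUM_BUCKETS; i++) {
        void *p = best_fit_alloc(&buckets[i], n);
        if (p != NULL)
            return p;
    }
    return NULL;
}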


In short, don't bother optimizing until you know it's your time eater. I know mine is going to take a long time, so I'm optimizing it (not really well, but I just wanted to :)).

If you haven't written one yet, you should start with plain best fit or plain next fit. Don't do anything fancy.

If you haven't written one and don't really want to, make one that just decrements the amount of available space and has a free function that does nothing. That works quite badly in systems that stay up (you have a permanent memory leak) but works quite nicely for OS tests, say the three minutes before you've exercised every feature of your mini OS.
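
Such a watermark ("bump") allocator fits in a dozen lines; in this sketch, heap_next and heap_end are placeholders for wherever your kernel's free region happens to live.

Code: Select all

#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds of the kernel heap, set up elsewhere at boot. */
static uintptr_t heap_next;   /* current watermark        */
static uintptr_t heap_end;    /* first byte past the heap */

void *kmalloc(size_t n)
{
    uintptr_t p = (heap_next + 15) & ~(uintptr_t)15;  /* 16-byte align */
    if (p > heap_end || n > heap_end - p)
        return NULL;      /* out of memory: the leak caught up with us */
    heap_next = p + n;
    return (void *)p;
}

void kfree(void *p)
{
    (void)p;              /* intentionally does nothing */
}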

Re: CPU Scheduling and Memory Management algorithms

Posted: Wed Nov 10, 2004 3:00 am
by Brendan
Hi,
Googles wrote: What algorithms do you use in your CPU scheduler (for deciding which task should run next)? And are you using a job scheduler too?
I schedule threads only (no job scheduling). Each thread belongs to a scheduling policy, and has a priority within this policy. The first policy (policy 0) is for threads that need very fast response times but don't use much CPU time - IRQ handlers, etc. Policy 0 threads use a co-operative scheduler (highest priority thread that can run does run).

The remaining policies are policy 1 (high), policy 2 (normal) and policy 3 (idle). All of these policies use variable frequency scheduling; for example, a thread with the highest possible priority gets 256 time slices in the same amount of time in which a thread with the lowest possible priority would get only one time slice. The actual length of a time slice depends on the policy - time slices for policy 2 threads are twice as long as time slices for policy 1 threads, and time slices for policy 3 threads are twice as long as time slices for policy 2 threads (or four times longer than policy 1 time slices).
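
Brendan doesn't describe the bookkeeping behind this; one standard way to get slice frequency proportional to priority (the 256:1 ratio above) is stride scheduling, sketched here with invented names.

Code: Select all

#include <stdint.h>

#define STRIDE1 (1u << 20)  /* large constant divided by the priority */

struct thread {
    unsigned priority;      /* 1 (lowest) .. 256 (highest)           */
    uint64_t pass;          /* virtual time this thread has consumed */
    struct thread *next;
};

/* Pick the runnable thread with the smallest pass value and charge it
 * one stride.  A priority-256 thread's pass advances 256 times more
 * slowly than a priority-1 thread's, so it gets picked 256 times as
 * often - variable frequency, constant slice length. */
struct thread *pick_next(struct thread *runnable)
{
    struct thread *t, *min = runnable;

    for (t = runnable; t != NULL; t = t->next)
        if (t->pass < min->pass)
            min = t;
    if (min != NULL)
        min->pass += STRIDE1 / min->priority;
    return min;
}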

The scheduler always runs threads from the most important (numerically lowest) policy. Any thread may pre-empt the currently running thread if the currently running thread is in a less important policy. Any policy 0 thread will also pre-empt a lower priority policy 0 thread, but threads in other policies will not pre-empt threads in the same policy.

To prevent (buggy) policy 0 threads from locking up the entire computer, the time they are allowed is restricted. If a policy 0 thread exceeds this time limit (which is the same amount of time a policy 1 thread would get for its time slices) it is deemed faulty and terminated. Because my OS is designed for everything from the 80486 up, there is a large CPU performance difference between the slowest and fastest computers it could run on. Therefore my OS does not use fixed time slices. Instead (during boot) it measures the actual performance of the CPU(s) (which has nothing to do with CPU operating frequencies) and calculates time slice lengths from that. This reduces the overhead of thread switching on slow computers while improving the "smoothness" on fast computers.

Add to this some thread blocking: waiting for message, sleep, deep sleep, stopped, halted (crashed/debugging) and terminated. Then there's code to allow a thread to change its policy and priority, its name, etc. The OS also keeps track of how much time (in ms) each thread has used, and how much time each process has used in each policy (mostly for performance tuning, but it also allows users to see where all the CPU time is going).

The multi-CPU versions of the kernel also have CPU affinity (which is actually sub-domain affinity, or "physical chip affinity"), and NUMA domain affinity.

My scheduler has evolved over time to avoid common problems. Firstly, the OS is intended for desktop/server use, where real time features are (IMHO) NOT desirable (doing 95% of things very fast and 5% of things slowly is better than doing everything at a guaranteed average speed). Purely co-operative schedulers are severely affected by CPU hogs, variable time slice schedulers perform badly under load (too much time between time slices), and round-robin (no priorities) is always ugly. Variable frequency does have a bit more overhead, though.

It's intended that policy 0 is used for "system processes" (device drivers, file system code, etc). Policy 1 is for anything related to the user (GUI, CLI and all user input and video/sound output code in applications), policy 2 is for normal work (compiling, sorting, converting, general processing), and policy 3 is for anything that can be done during idle time that may improve performance during non-idle time (disk defragging, scan-disk, cleaning/testing memory, spell-checking, pre-processing, etc). This way the computer can be completely overloaded without affecting the "responsiveness" of the GUI (I've found typical users don't understand the difference between overall performance and perceived/user interface performance).


Cheers,

Brendan

Re: CPU Scheduling and Memory Management algorithms

Posted: Wed Nov 10, 2004 4:53 am
by distantvoices
@brendan: You give, for example, the floppy driver a penalty if, against all the rules, it eats up too much CPU time? That's a cool thing to do. I should consider it too.

Currently, I have only one prioritized round-robin queue for user processes, and four system queues which work in a cooperative manner: a driver task or a service process runs as long as it needs to, never mind starving user processes. Sooner or later they get their share to run.

My pick_process() is straightforward: loop through the list of run queues from highest priority to lowest, pick the first runnable process from the queue currently being inspected, and exit the loop.
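
In code, that loop comes out to something like the sketch below (the structures are invented; only the algorithm is from the description above).

Code: Select all

#define NUM_QUEUES 5        /* e.g. 4 system queues + 1 user queue */

struct process {
    int runnable;           /* nonzero if ready to run */
    struct process *next;
};

/* run_queue[0] is the highest priority queue. */
static struct process *run_queue[NUM_QUEUES];

/* Walk the queues from highest priority to lowest and return the
 * first runnable process found; NULL means "run the idle task". */
struct process *pick_process(void)
{
    for (int q = 0; q < NUM_QUEUES; q++)
        for (struct process *p = run_queue[q]; p != NULL; p = p->next)
            if (p->runnable)
                return p;
    return NULL;
}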

Re: CPU Scheduling and Memory Management algorithms

Posted: Wed Nov 10, 2004 7:16 am
by Brendan
Hi,
beyond infinity wrote: @brendan: You give, for example, the floppy driver a penalty if, against all the rules, it eats up too much CPU time? That's a cool thing to do. I should consider it too.
I don't penalize it for using too much time - if a policy 0 thread uses too much time it raises a "critical error". My critical error handler then dumps some information into the kernel log (and future versions will probably pop up some sort of dialog box), and then terminates the process. I guess being wiped off the computer is a penalty, but I don't adjust its priority and let it keep running or anything. This is mainly so that other programmers detect and fix the problem while developing the code (and get bug reports if they don't).

In general I want policy 0 to be perfect for things like IRQ handlers - I don't want some games programmer to think that using policy 0 would be a good way of speeding up a 3D polygon renderer...

Threads in other policies can consume as much CPU time as they like. The variable frequency scheduling means the user still has enough control to kill even the highest priority policy 1 thread (so it's not much of a denial-of-service issue).

Dropping the priority of policy 1 CPU hogs might be a good idea though. I'll put it under consideration - I'm not sure how to tell the difference between a very busy and important service (e.g. a web server running on a dedicated box in the corner of an internet service provider) and a CPU hog. Anyone have any suggestions?


Cheers,

Brendan

Re: CPU Scheduling and Memory Management algorithms

Posted: Wed Nov 10, 2004 10:43 am
by Googles
Therefore my OS does not use fixed time slices. Instead (during boot) it measures the actual performance of the CPU(s) (which has nothing to do with CPU operating frequencies) and calculates time slice lengths from that.
How do you do that?

Candy: Do you know of any other memory management algorithm that would minimize memory leakage and still be very fast?

Re: CPU Scheduling and Memory Management algorithms

Posted: Wed Nov 10, 2004 11:37 am
by Candy
Googles wrote: Candy: Do you know of any other memory management algorithm that would minimize memory leakage and still be very fast?
I know of six memory allocation strategies, all of which have worst cases (obviously) and all of which have best cases. The two that are not in the fit category (best, worst, next, first) are McKusick-Karels and Buddy.

McKusick-Karels takes fixed-size chunks of, say, 64k and divides each into blocks of a chosen size, then allocates those blocks at that fixed size. The good part is that it usually has blocks of any common size available. The bad part is that once a chunk has been divided, it's very hard to get it back together again.
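
A sketch of the dividing step; real McKusick-Karels (the 4.4BSD kernel malloc) also keeps per-page metadata so that free() can find the right bucket, which is omitted here, and the 4 KB chunk size is just an assumption for the sketch.

Code: Select all

#include <stddef.h>

#define CHUNK_BYTES 4096    /* one chunk; the post says 64k, BSD used pages */

/* Freed blocks are threaded onto a per-size free list. */
struct chunk_blk { struct chunk_blk *next; };

/* Carve a fresh chunk into 'size'-byte blocks and push them all onto
 * the free list for that size.  'size' must be at least
 * sizeof(struct chunk_blk), and 'mem' must be CHUNK_BYTES long. */
void add_chunk(struct chunk_blk **free_list, void *mem, size_t size)
{
    char *p = mem;
    for (size_t off = 0; off + size <= CHUNK_BYTES; off += size) {
        struct chunk_blk *b = (struct chunk_blk *)(p + off);
        b->next = *free_list;
        *free_list = b;
    }
}

/* Allocation is then just popping the list for the rounded-up size. */
void *bucket_alloc(struct chunk_blk **free_list)
{
    struct chunk_blk *b = *free_list;
    if (b != NULL)
        *free_list = b->next;
    return b;
}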

Buddy views the entire space as one really big block of data with a power-of-2 size. On request it is divided into two equal blocks, and if one is still too big it is divided further, etc. Freeing happens the other way around: if you free a block and its buddy (the other half of the same pair) is also free, the two are merged, and so on.

The worst case is allocating a very tiny block (log(memsize) block-splitting operations) and then freeing it again (another log(memsize) operations to merge).
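
A compressed buddy sketch, assuming the caller remembers each block's order (a real allocator stores that in a header or a bitmap); at initialization the whole arena would be put on free_list[MAX_ORDER]. All names and sizes are invented.

Code: Select all

#include <stddef.h>
#include <stdint.h>

#define MIN_ORDER 4                  /* smallest block: 2^4  = 16 bytes */
#define MAX_ORDER 20                 /* whole arena:    2^20 = 1 MB     */

struct blk { struct blk *next; };

static uint8_t *arena;                        /* base of the managed region */
static struct blk *free_list[MAX_ORDER + 1];  /* one list per block order   */

/* Allocate a block of 2^order bytes, splitting larger blocks as needed. */
void *buddy_alloc(int order)
{
    int o = order;
    while (o <= MAX_ORDER && free_list[o] == NULL)
        o++;                                  /* find a big-enough block */
    if (o > MAX_ORDER)
        return NULL;

    struct blk *b = free_list[o];
    free_list[o] = b->next;

    while (o > order) {                       /* split down to the right size */
        o--;
        struct blk *half = (struct blk *)((uint8_t *)b + ((size_t)1 << o));
        half->next = free_list[o];            /* upper half becomes free */
        free_list[o] = half;
    }
    return b;
}

/* Free a block of 2^order bytes, merging while its buddy is also free.
 * The buddy's offset differs from ours in exactly bit 'order'. */
void buddy_free(void *p, int order)
{
    uintptr_t off = (uint8_t *)p - arena;

    while (order < MAX_ORDER) {
        uintptr_t buddy = off ^ ((uintptr_t)1 << order);
        struct blk **l = &free_list[order];
        while (*l != NULL && (uint8_t *)*l != arena + buddy)
            l = &(*l)->next;                  /* search for the buddy */
        if (*l == NULL)
            break;                            /* buddy in use: stop merging */
        *l = (*l)->next;                      /* unlink the buddy */
        if (buddy < off)
            off = buddy;                      /* merged block starts lower */
        order++;
    }
    struct blk *b = (struct blk *)(arena + off);
    b->next = free_list[order];
    free_list[order] = b;
}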

There is no perfect algorithm. Take these as guidelines and develop your own mixtures of them: these are six pure forms, each with its own bad cases, and combining them intelligently nearly always gets you a better algorithm.

Re: CPU Scheduling and Memory Management algorithms

Posted: Wed Nov 10, 2004 9:17 pm
by Brendan
Hi,
Googles wrote:
Therefore my OS does not use fixed time slices. Instead (during boot) it measures the actual performance of the CPU(s) (which has nothing to do with CPU operating frequencies) and calculates time slice lengths from that.
How do you do that?
The basic idea is to measure how many times a piece of code can be executed in a fixed amount of time.

I use a loop containing a lot of instructions that don't actually do anything. I "calibrate" the loop by calculating how many cycles one iteration would take on an 80386. Then I keep running the loop until 250 ms have passed, while keeping track of how many times the loop has run. The loop counter is used to calculate the effective speed of the CPU, which I call "VMHz" (Virtual MHz). Due to the calibration, an 8 VMHz CPU is roughly equivalent to an 8 MHz 80386, and a 1000 VMHz CPU is roughly equivalent to a (fictitious) 1 GHz 80386.
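
The structure is roughly the sketch below. Here elapsed_ms() stands in for whatever timer the boot code reads (the PIT, say), and LOOP_386_CYCLES is the hand-counted cost of one loop iteration on an 80386; both names and the cycle count are invented for the sketch.

Code: Select all

#include <stdint.h>

#define CALIBRATE_MS    250   /* length of the measurement window         */
#define LOOP_386_CYCLES 100   /* assumed cycles per iteration on an 80386 */

/* Provided elsewhere by the boot code, e.g. derived from PIT ticks. */
extern uint32_t elapsed_ms(void);

/* The work loop: typical 32-bit integer kernel instructions whose cost
 * on an 80386 has been counted by hand. */
extern void calibration_body(void);

/* Count iterations in a fixed window, then convert:
 * VMHz = (iterations * cycles-per-iteration) / window-in-microseconds. */
uint32_t measure_vmhz(void)
{
    uint32_t start = elapsed_ms();
    uint32_t iterations = 0;

    while (elapsed_ms() - start < CALIBRATE_MS) {
        calibration_body();
        iterations++;
    }
    return (uint32_t)(((uint64_t)iterations * LOOP_386_CYCLES)
                      / (CALIBRATE_MS * 1000));
}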

It's important to select the instructions in the loop with care. Because I use the VMHz rating to determine kernel timing, I use a variety of the instructions most commonly used in the kernel's code (32-bit integer instructions).

While my OS is doing this timing loop it will also measure CPU frequency (if RDTSC is supported) and CPU bus speed (if the local APIC is supported). The CPU frequency is completely useless (users like to see it though), but the CPU bus speed is used to calibrate the local APIC timer (used by the scheduler instead of the PIT/IRQ0 where possible).

For multi-CPU computers I measure each CPU separately, so that if one CPU is twice as fast as another the scheduler will give it twice as much work to do. All CPUs are measured at the same time, so that if there are 64 CPUs the user isn't waiting for 16 seconds (64 * 250 ms). This also means that for hyper-threading all logical CPUs are working while the measurements are done, which makes the results more realistic (if one logical CPU were idle it'd make the other logical CPU(s) faster).


Cheers,

Brendan

Re: CPU Scheduling and Memory Management algorithms

Posted: Thu Nov 11, 2004 3:44 am
by Candy
Brendan:

How do you calibrate the multiple execution paths of a single CPU, so that an SSE program isn't penalized because you have an AMD, or an integer program isn't penalized because you don't have one?

From your description I get the idea that a level 1 calculation thread (say a computer player for a game) will always preempt lower-level things even if it's out of slices. How do you deal with this?

If you use slices as the time-limiting factor, how do you prevent calculation threads from being activated only once in a long while on fully loaded CPUs?

Re: CPU Scheduling and Memory Management algorithms

Posted: Thu Nov 11, 2004 5:16 am
by Brendan
Hi,
Candy wrote: How do you calibrate the multiple execution paths of a single CPU, so that an SSE program isn't penalized because you have an AMD, or an integer program isn't penalized because you don't have one?
I'm not sure I understand this question correctly, but I'm assuming it relates to my time slice length calculations. These calculations are intended to find a good compromise between scheduling "smoothness" (which improves with shorter time slices) and scheduler overhead (which improves with longer time slices).

For example, a CPU that executes my timing loop relatively quickly (i.e. fast 32-bit integer operations and few pipeline stages) will get a higher VMHz rating and shorter time slices, resulting in more thread switches per second. A CPU that executes my timing loop slowly will get a lower VMHz rating, longer time slices and fewer thread switches per second.

Let's compare two CPUs that have equal performance on a very wide range of instructions (integer, FPU, MMX, SSE), where CPU A executes integer instructions faster but SSE slower (and CPU B has slower integer and faster SSE). CPU A would have more thread switches per second, but the overhead of each thread switch would be smaller. CPU B would have fewer thread switches per second, but the overhead of each thread switch would be larger. For example:

Code: Select all

CPU A:
  1111o22o1111o22o1111o22o1111o22o

CPU B:
  11111111oo2222oo11111111oo2222oo

Where,
 1 represents thread 1 running for 1 ms
 2 represents thread 2 running for 1 ms
 o represents 1 ms of thread switch overhead
The amount of time each thread receives and the scheduler overhead remain roughly constant, while the scheduler's "smoothness" changes to compensate. It doesn't make much difference what each thread is actually doing (i.e. integer or SSE)...

The only real problem with this is when FPU/MMX/SSE state needs to be saved (which messes up the above theory). IMHO this is the price you pay for using the FPU/MMX/SSE on any OS.

My OS will delay saving/loading the FPU/MMX/SSE state to avoid as much of this extra overhead as possible (see Intel's manual "12.5.1. Using the TS Flag to Control the Saving of the x87 FPU, MMX, SSE, SSE2 and SSE3 State"). I've also "invented" a method to use this on multi-CPU computers (patent pending ;)).
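
For reference, the single-CPU version of the TS-flag trick looks roughly like this: set CR0.TS on every thread switch, and let the #NM (device-not-available) fault handler swap FPU state only when a thread actually touches the FPU. This sketch assumes GCC-style inline assembly and invents the surrounding structures; fxsave/fxrstor need a 16-byte-aligned 512-byte save area.

Code: Select all

#include <stdint.h>

#define CR0_TS (1u << 3)        /* "task switched" flag in CR0 */

struct thread {
    uint8_t fpu_state[512] __attribute__((aligned(16)));  /* fxsave area */
};

static struct thread *current;    /* thread now running               */
static struct thread *fpu_owner;  /* thread whose state is in the FPU */

/* On every thread switch: don't touch the FPU, just arm the trap. */
void on_thread_switch(struct thread *next)
{
    uint32_t cr0;

    current = next;
    __asm__ volatile ("mov %%cr0, %0" : "=r"(cr0));
    __asm__ volatile ("mov %0, %%cr0" : : "r"(cr0 | CR0_TS));
}

/* #NM handler (vector 7): runs on the first FPU/MMX/SSE instruction
 * after a switch; only then is the state actually saved/restored. */
void device_not_available(void)
{
    __asm__ volatile ("clts");    /* clear CR0.TS so FPU ops work again */
    if (fpu_owner == current)
        return;                   /* the FPU already holds our state */
    if (fpu_owner != NULL)
        __asm__ volatile ("fxsave %0" : "=m"(fpu_owner->fpu_state));
    __asm__ volatile ("fxrstor %0" : : "m"(current->fpu_state));
    fpu_owner = current;
}
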
Candy wrote: From your description I get the idea that a level 1 calculation thread (say a computer player for a game) will always preempt lower-level things even if it's out of slices. How do you deal with this?

If you use slices as the time-limiting factor, how do you prevent calculation threads from being activated only once in a long while on fully loaded CPUs?
Currently I don't do anything to prevent this - in fact, a policy 1 thread can go into an infinite loop and prevent all threads in less important policies from getting any time at all. All calculation should be done in policy 2 (or policy 3) - policy 1 is meant for "event driven" things like user interfaces. If a programmer decides to use policy 1 when they shouldn't, it's up to the user to complain and/or terminate the process.

I've been thinking about ways to prevent this problem. At the moment I like the idea of dropping the thread back to policy 2 if it consumes too much CPU time in policy 1, but I haven't decided on a good definition of "too much time".


Cheers,

Brendan

Re: CPU Scheduling and Memory Management algorithms

Posted: Thu Nov 11, 2004 5:27 am
by Candy
Let's make it a hyperbolic situation so the point is clearer. Take the numbers with a truckload of salt; they're just there to illustrate.

CPU A does 1000 MIPS and 1 MFLOPS; CPU B does 1 MIPS and 1000 MFLOPS. If you calibrate on the MIPS, the MFLOPS rating would be off by a factor of 1,000,000, i.e. on one computer the slice would be a millionth of the slice on the other. If you then consider your thread to be entitled to X time units based on testing on processor B using only floating-point code, you are going to hog CPU A even though you might be granted lots more time units.

I had the problem myself too; I've been trying to get a decent soft realtime system going, but got tangled up with minimum-time and maximum-time in the same arithmetic.

Re: CPU Scheduling and Memory Management algorithms

Posted: Thu Nov 11, 2004 7:00 am
by Brendan
Candy wrote: Let's make it a hyperbolic situation so the point is clearer. Take the numbers with a truckload of salt; they're just there to illustrate.

CPU A does 1000 MIPS and 1 MFLOPS; CPU B does 1 MIPS and 1000 MFLOPS. If you calibrate on the MIPS, the MFLOPS rating would be off by a factor of 1,000,000, i.e. on one computer the slice would be a millionth of the slice on the other. If you then consider your thread to be entitled to X time units based on testing on processor B using only floating-point code, you are going to hog CPU A even though you might be granted lots more time units.

I had the problem myself too; I've been trying to get a decent soft realtime system going, but got tangled up with minimum-time and maximum-time in the same arithmetic.
As my calibration is only intended for scheduler smoothness/overhead balancing, and not any form of real time, the MFLOPS is largely irrelevant in my case. If any thread takes a lot more time to execute than planned (for any reason), then so be it - the OS gives the same real-time guarantees as the CPU and other hardware: none.


Cheers,

Brendan

Re: CPU Scheduling and Memory Management algorithms

Posted: Thu Nov 18, 2004 5:06 pm
by Colonel Kernel
Candy wrote: If you haven't written one and don't really want to, make one that just decrements the amount of available space and has a free function that does nothing. That works quite badly in systems that stay up (you have a permanent memory leak) but works quite nicely for OS tests, say the three minutes before you've exercised every feature of your mini OS.
This got me thinking... In a microkernel, where the execution paths are supposed to be relatively small and predictable, would this be a feasible approach to dynamic memory allocation for data that does not need to persist between context switches? (i.e. -- everything is "freed" every time you return from an interrupt/exception/trap of some kind.) It's a loose idea at this point, so it's hard to think of a good use case....