Hi,
bigbob wrote:I haven't thought about it this way yet; you have good points.
On the other hand, performance includes task-switching too. In the case of a 1000Hz scheduler, PM wastes a lot of time.
Some "back of the envelope" maths...
Let's assume that saving/loading all the registers costs about 200 cycles (it depends on whether you save FPU/MMX, SSE, AVX registers or not, which CPU, etc). If you do this 1000 times per second, then it's going to cost a total of 200000 cycles. A slightly modern 80x86 will do over 1 billion cycles per second, so 200000 cycles of overhead per second is 200 us of overhead per second, or about 0.02% of CPU time.
However, every time you switch from one task to another you've got caches full of the old task's data and none of the data the new task uses is in the cache. Cache misses are expensive (more expensive than saving/loading registers). For most OS's there's also other work you do during task switches, like keeping track of how much time each task has used, updating whatever data structures the scheduler relies on, etc. All of this overhead exists regardless of whether you save/load all registers or not, and probably adds up to maybe 1000 cycles. This means that not saving/loading registers costs about 1000 cycles per task switch (0.1% of CPU time) and saving/loading registers is more like 1200 cycles per task switch (0.12% of CPU time).
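To put rough numbers on that in one place (a quick sanity check using the same assumed figures - 1 GHz CPU, 1000 task switches per second, ~1000 cycles of unavoidable per-switch work, ~200 cycles for registers; none of these are measured values):

/* Back-of-the-envelope task switch overhead. All constants are
 * assumptions from the discussion above, not measurements. */
#include <stdio.h>

int main(void)
{
    const double cpu_hz           = 1000000000.0; /* "slightly modern" 1 GHz CPU   */
    const double switches_per_sec = 1000.0;       /* 1000Hz scheduler, worst case  */
    const double base_cycles      = 1000.0;       /* cache misses, accounting, ... */
    const double regs_cycles      = 200.0;        /* saving/loading all registers  */

    double without = switches_per_sec * base_cycles                 / cpu_hz * 100.0;
    double with    = switches_per_sec * (base_cycles + regs_cycles) / cpu_hz * 100.0;

    printf("without register save/load: %.2f%% of CPU time\n", without); /* 0.10 */
    printf("with register save/load:    %.2f%% of CPU time\n", with);    /* 0.12 */
    return 0;
}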
Of course if you're smart you only do task switches when necessary and don't do them 1000 times per second. For example, if there's only one task that's able to run (all other tasks are waiting for IO or something) you'd do zero task switches per second because there's literally nothing else to switch to.
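As a minimal sketch of that idea (my own illustration with stubbed-out pieces, not code from any real scheduler):

/* Only perform a task switch when a different task is ready to run.
 * If the current task is the only runnable one, do nothing - zero
 * task switches means zero task-switch overhead. */
struct task { int id; };

static struct task *current_task;

/* Stub: a real kernel would scan its run queue here. */
static struct task *pick_next_ready_task(void) { return current_task; }

/* Stub: a real kernel would save/load registers, switch stacks, etc. */
static void switch_to(struct task *next) { current_task = next; }

void maybe_switch(void)
{
    struct task *next = pick_next_ready_task();

    if (next != (struct task *)0 && next != current_task)
        switch_to(next);   /* pay the save/load + cache cost only here */
}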
bigbob wrote:This topic is about CM and you seem to know a lot about FORTH, but there may be people who are interested in how FORTH works.
I don't know that much about Forth (e.g. never used it).
bigbob wrote:Indirect-threading is not as slow as it might seem.
In FORTH we traverse a linked list in the dictionary in order to find a WORD. The definition of a WORD consists of binary code that gets called.
For example, if we enter in the command line: "5 3 +" (i.e. 5+3)
then the INTERPRETER finds out that '5' and '3' are numbers and it pushes them to the parameter stack. Next, it finds '+', which it will find in the dictionary (after following a few pointers in the dictionary), then it will call the binary code of '+' (addition) which pops the two items off the stack and pushes the result back. The speed of traversing the dictionary can be improved by having subdictionaries according to the length of the name of the WORDs in the dictionary (so in the case of '+', only the subdictionary of WORDs with length 1 has to be traversed).
I have 8 subdictionaries (i.e. 8 linked lists) in the dictionary, and the 8th linked list contains WORDs whose names are 8 characters or longer.
In my dictionary there are 342 words, split across the subdictionaries like this:
1: 21
2: 45
3: 32
4: 53
5: 47
6: 45
7: 28
8>=: 71
So to find the entry for (or binary code of) '+' in the dictionary, the INTERPRETER only needs to "jump" a maximum of 21 times in the linked list.
I'm not sure where to start here...
Modern CPUs don't execute one instruction at a time - they have a pipeline with many instructions "in flight" at various stages. When an instruction depends on the results of a previous instruction it has to wait until the previous instruction completes, and (unless the CPU can re-order instructions around it) you end up with a bubble in the pipeline that causes a large performance loss (a pipeline stall).
For linked lists, with a maximum of 21 entries you'd have to go through about 10 entries on average. That will most likely result in 10 pipeline stalls, because the CPU has to fetch "next" before it can think about fetching the next "next". These pipeline stalls will probably cost about 20 cycles each (depending on CPU, etc), so you're looking at 200 cycles just to find the word. Then you need a branch to something capable of performing the operation, which is an unpredictable branch causing another pipeline stall (another ~20 cycles). After that you'd (e.g.) pop 2 values off the stack, do the operation and push the result on the stack; which (with some luck) might only be 3 cycles.
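To make the pointer chasing concrete, here's a rough sketch of that kind of lookup in C (the struct layout, names and per-length subdictionaries are my guesses based on the description above, not anyone's actual code). Each iteration has to load "next" before the following iteration can even start, which is exactly where the stalls come from:

/* Hypothetical Forth-style dictionary lookup with one linked list
 * ("subdictionary") per name length: lengths 1..7, plus one list
 * for names of 8 or more characters. */
#include <string.h>

struct word_entry {
    struct word_entry *next;   /* next entry in this subdictionary     */
    const char *name;          /* word name, e.g. "+"                  */
    void (*code)(void);        /* binary code implementing the word    */
};

/* heads[0] = names of length 1, ..., heads[7] = names of length >= 8 */
struct word_entry *heads[8];

void (*find_word(const char *name))(void)
{
    size_t len = strlen(name);               /* assumes a non-empty name */
    size_t idx = (len >= 8) ? 7 : len - 1;

    /* Each step must load e->next before the next compare can start. */
    for (struct word_entry *e = heads[idx]; e != NULL; e = e->next) {
        if (strcmp(e->name, name) == 0)
            return e->code;
    }
    return NULL;   /* not found: maybe it's a number */
}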
Of course this doesn't include fetching the next word and deciding how large it is. The total cost for a single simple operator (e.g. an addition) is probably going to be about 250 cycles or more.
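Tallying those assumed, ballpark numbers up (again, not measurements):

/* Rough per-operator cost of interpreting a single "+". */
int interpreted_plus_cycles(void)
{
    int lookup   = 10 * 20; /* ~10 dependent loads, ~20 cycle stall each */
    int dispatch = 20;      /* one mispredicted indirect branch          */
    int work     = 3;       /* pop two values, add, push the result      */

    /* Fetching and classifying the next word isn't counted here; with
     * it, the total ends up around 250 cycles or more, versus roughly
     * 1 cycle for a native add. */
    return lookup + dispatch + work;   /* 223 */
}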
For normal/native code, simple operators cost 1 cycle or less, and if there's a dependency problem the CPU can typically re-order instructions around it.
Basically; what you're planning is going to be about 250 times slower than native code.
Now; I assumed that for Forth you can build your own words and insert them in the dictionary (a bit like sub-routines or functions or procedures in other languages). In that case, as your dictionary grows everything gets even slower. For a large application (many user-defined words) it might be 500 times slower than native code.
I guess what I'm saying here is that the "~0.02% of additional CPU time" that pre-emptive scheduling might cause isn't really worth worrying about...
Cheers,
Brendan