Hi,
bigbob wrote:I haven't thought about it this way yet; you have good points.
On the other hand, performance includes task-switching too. In the case of a 1000Hz scheduler, PM wastes a lot of time.
Some "back of the envelope" maths...
Let's assume that saving/loading all the registers costs about 200 cycles (it depends on whether you save FPU/MMX, SSE, AVX registers or not, which CPU, etc). If you do this 1000 times per second, then it's going to cost a total of 200000 cycles. A slightly modern 80x86 will do over 1 billion cycles per second, so 200000 cycles of overhead per second is 200 us of overhead per second, or about 0.02% of CPU time.
However, every time you switch from one task to another you've got caches full of the old task's data and none of the data the new task uses is in the cache. Cache misses are expensive (more expensive than saving/loading registers). For most OS's there's also other work you do during task switches, like keeping track of how much time each task has used, updating whatever data structures the scheduler relies on, etc. All of this overhead exists regardless of whether you save/load all registers or not, and probably adds up to maybe 1000 cycles. This means that not saving/loading registers costs about 1000 cycles per task switch (0.1% of CPU time) and saving/loading registers is more like 1200 cycles per task switch (0.12% of CPU time).
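To put rough numbers on that in one place (a quick sanity check using the same assumed figures - 1 GHz CPU, 1000 task switches per second, ~1000 cycles of unavoidable per-switch work, ~200 cycles for registers; none of these are measured values):

/* Back-of-the-envelope task switch overhead. All constants are
 * assumptions from the discussion above, not measurements. */
#include <stdio.h>

int main(void)
{
    const double cpu_hz           = 1000000000.0; /* "slightly modern" 1 GHz CPU   */
    const double switches_per_sec = 1000.0;       /* 1000Hz scheduler, worst case  */
    const double base_cycles      = 1000.0;       /* cache misses, accounting, ... */
    const double regs_cycles      = 200.0;        /* saving/loading all registers  */

    double without = switches_per_sec * base_cycles                 / cpu_hz * 100.0;
    double with    = switches_per_sec * (base_cycles + regs_cycles) / cpu_hz * 100.0;

    printf("without register save/load: %.2f%% of CPU time\n", without); /* 0.10 */
    printf("with register save/load:    %.2f%% of CPU time\n", with);    /* 0.12 */
    return 0;
}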
Of course if you're smart you only do task switches when necessary and don't do them 1000 times per second. For example, if there's only one task that's able to run (all other tasks are waiting for IO or something) you'd do zero task switches per second because there's literally nothing else to switch to.
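As a minimal sketch of that idea (my own illustration with stubbed-out pieces, not code from any real scheduler):

/* Only perform a task switch when a different task is ready to run.
 * If the current task is the only runnable one, do nothing - zero
 * task switches means zero task-switch overhead. */
struct task { int id; };

static struct task *current_task;

/* Stub: a real kernel would scan its run queue here. */
static struct task *pick_next_ready_task(void) { return current_task; }

/* Stub: a real kernel would save/load registers, switch stacks, etc. */
static void switch_to(struct task *next) { current_task = next; }

void maybe_switch(void)
{
    struct task *next = pick_next_ready_task();

    if (next != (struct task *)0 && next != current_task)
        switch_to(next);   /* pay the save/load + cache cost only here */
}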
bigbob wrote:This topic is about CM and you seem to know a lot about FORTH, but there may be people who are interested in how FORTH works.
I don't know that much about Forth (e.g. never used it).
bigbob wrote:Indirect-threading is not as slow as it might seem.
In FORTH we traverse a linked list in the dictionary in order to find a WORD. The definition of a WORD consists of binary code that gets called.
For example, if we enter in the command line: "5 3 +" (i.e. 5+3)
then the INTERPRETER finds out that '5' and '3' are numbers and it pushes them to the parameter stack. Next, it finds '+', which it will find in the dictionary (after following a few pointers in the dictionary), then it will call the binary code of '+' (addition) which pops the two items off the stack and pushes the result back. The speed of traversing the dictionary can be improved by having subdictionaries according to the length of the name of the WORDs in the dictionary (so in the case of '+', only the subdictionary of WORDs with length 1 has to be traversed).
I have 8 subdictionaries (i.e. 8 linked lists) in the dictionary, and the 8th linked list contains WORDs whose names are 8 characters or longer.
In my dictionary there are 342 words, split across the subdictionaries like this:
1: 21
2: 45
3: 32
4: 53
5: 47
6: 45
7: 28
8>=: 71
So to find the entry for (or binary code of) '+' in the dictionary, the INTERPRETER only needs to "jump" a maximum of 21 times in the linked list.
I'm not sure where to start here...
Modern CPUs don't execute one instruction at a time - they have a pipeline with many instructions "in flight" at various stages. When an instruction depends on the results of a previous instruction it has to wait until the previous instruction completes, and (unless the CPU can re-order instructions around it) you end up with a bubble in the pipeline that causes a large performance loss (a pipeline stall).
For linked lists, with a maximum of 21 entries you'd have to go through about 10 entries on average. That will most likely result in 10 pipeline stalls, because the CPU has to fetch "next" before it can think about fetching the next "next". These pipeline stalls will probably cost about 20 cycles each (depending on CPU, etc), so you're looking at 200 cycles just to find the word. Then you need a branch to something capable of performing the operation, which is an unpredictable branch causing another pipeline stall (another ~20 cycles). After that you'd (e.g.) pop 2 values off the stack, do the operation and push the result on the stack; which (with some luck) might only be 3 cycles.
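To make the pointer chasing concrete, here's a rough sketch of that kind of lookup in C (the struct layout, names and per-length subdictionaries are my guesses based on the description above, not anyone's actual code). Each iteration has to load "next" before the following iteration can even start, which is exactly where the stalls come from:

/* Hypothetical Forth-style dictionary lookup with one linked list
 * ("subdictionary") per name length: lengths 1..7, plus one list
 * for names of 8 or more characters. */
#include <string.h>

struct word_entry {
    struct word_entry *next;   /* next entry in this subdictionary     */
    const char *name;          /* word name, e.g. "+"                  */
    void (*code)(void);        /* binary code implementing the word    */
};

/* heads[0] = names of length 1, ..., heads[7] = names of length >= 8 */
struct word_entry *heads[8];

void (*find_word(const char *name))(void)
{
    size_t len = strlen(name);               /* assumes a non-empty name */
    size_t idx = (len >= 8) ? 7 : len - 1;

    /* Each step must load e->next before the next compare can start. */
    for (struct word_entry *e = heads[idx]; e != NULL; e = e->next) {
        if (strcmp(e->name, name) == 0)
            return e->code;
    }
    return NULL;   /* not found: maybe it's a number */
}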
Of course this doesn't include fetching the next word and deciding how large it is. The total cost for a single simple operator (e.g. an addition) is probably going to be about 250 cycles or more.
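Tallying those assumed, ballpark numbers up (again, not measurements):

/* Rough per-operator cost of interpreting a single "+". */
int interpreted_plus_cycles(void)
{
    int lookup   = 10 * 20; /* ~10 dependent loads, ~20 cycle stall each */
    int dispatch = 20;      /* one mispredicted indirect branch          */
    int work     = 3;       /* pop two values, add, push the result      */

    /* Fetching and classifying the next word isn't counted here; with
     * it, the total ends up around 250 cycles or more, versus roughly
     * 1 cycle for a native add. */
    return lookup + dispatch + work;   /* 223 */
}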
For normal/native code, simple operators cost 1 cycle or less, and if there's a dependency problem the CPU can typically re-order instructions around it.
Basically; what you're planning is going to be about 250 times slower than native code.
Now; I assumed that for Forth you can build your own words and insert them in the dictionary (a bit like sub-routines or functions or procedures in other languages). In that case, as your dictionary grows everything gets even slower. For a large application (many user-defined words) it might be 500 times slower than native code.
I guess what I'm saying here is that the "~0.02% of additional CPU time" that pre-emptive scheduling might cause isn't really worth worrying about...
Cheers,
Brendan