
Re: Designs of microkernels

Posted: Tue Jan 19, 2021 2:17 pm
by Doctor5555
Perhaps I didn't quite make my meaning clear. My fault entirely.

As I said, a stack-based VM will use a single bytecode instruction that implicitly pops the operands and then pushes the result.
The implementation for the instruction (jit or interpreter, it doesn't matter) will, when it encounters that instruction, pop operands, execute, and then push the result.
The article linked does not make it clear that it is talking about implementation and not bytecode when it says
four lines of instructions is needed to carry out an addition operation
Comparing this to the register-based VM, for which the article states that a single line of instruction is required, does suggest to me that the 4 lines in the stack-based version are meant to represent bytecode.
This might simply be me misinterpreting the article, which would again be my fault entirely.
I do agree that the images are correct, though.

I'll also note that the implementation of a register-based VM would have to carry out essentially the same steps.
It has to use the CPU stack or registers to pass arguments and values in the same way.

Code: Select all

; Takes rdi, rsi, rdx as add operand/result register indexes
vm_add:
    mov rax, [reg_struct + 8 * rdi] ; load first operand's value
    mov rcx, [reg_struct + 8 * rsi] ; load second operand's value
    add rax, rcx                    ; two-operand add (x86 has no 3-operand form)
    mov [reg_struct + 8 * rdx], rax ; store the result
    ret
And your vm_add_v2 needs to push eax again after the add instruction ;)

Re: Designs of microkernels

Posted: Tue Jan 19, 2021 4:54 pm
by AndrewAPrice
A little off-topic, but I like stack-based bytecode because it's super easy to write a compiler for (you can print out the bytecode while walking an abstract syntax tree) without doing register allocation.

Although your VM could load the bytecode, perform SSA and turn it into a register-based representation (among doing other optimizations), even if you still interpret the end results.

Re: Designs of microkernels

Posted: Wed Jan 20, 2021 11:12 am
by bzt
Doctor5555 wrote:Perhaps I didn't quite make my meaning clear. My fault entirely.
No worries, that's why I asked; you explained, and now I understand the confusion and can answer :-)
Doctor5555 wrote:The article linked does not make it clear that it is talking about implementation and not bytecode when it says
four lines of instructions is needed to carry out an addition operation
Agreed, he should have made it clear that he meant 4 CPU instructions and not 4 VM instructions. As I said in my previous posts, on some CPUs you could optimize down to fewer CPU instructions, but you'd still be performing the same 4 theoretical steps.
Doctor5555 wrote:And your vm_add_v2 needs to push eax again after the add instruction ;)
Nope, no push needed! It has already added one operand to the other in memory (on top of the stack), meaning the result of the add is placed in memory exactly where push would have stored it had both operands been popped. That's the point of the optimization ;-) (I might have accidentally switched the operands in vmm_add_v1 though; the 2nd is supposed to be the result)
AndrewAPrice wrote:A little off-topic, but I like stack-based bytecode because it's super easy to write a compiler for
Btw, I've created a toy for you to play with :-D :-D :-D (Let me know if you find any bugs, I've quickly put it together in less than a day)
https://gitlab.com/bztsrc/brainfuck

It is a Brainfuck interpreter, compiler and bytecode optimizer, with configurable optimization levels.

Code: Select all

./bf -O0 mandelbrot.bf
Runs the unoptimized bf through an interpreter. It's therefore going to be slow, but still much, much faster than simpleinterp.

Code: Select all

./bf -O3 mandelbrot.bf
Will use all optimizations, and it's a lot faster than optinterp3.

Code: Select all

./bf -O4 mandelbrot.bf -o mandelbrot
./mandelbrot
Will compile the bytecode into native code and save an x86_64 ELF executable. This uses all the optimizations and also removes the run-time boundary checks for a further speed-up. Easily outperforms optasmjit.

You can do performance measurements with the -p flag (valid for both the interpreter and the compiler, for the latter profiling code will be generated into the ELF).

One of the reasons my implementation is faster is that it's written in C and only uses arrays with pointers (in contrast, the article's author uses C++ with the worst-performing std::vector class). My bf also parses the source only once, while the article's author goes over the source, as well as the generated tokens in the std::vector, multiple times in several iterations. My implementation has one drawback though: it has a compile-time configured limit on the maximum nesting allowed in BF scripts (set to 256 ATM).

Fun thing: you can also make this compiler generate ANSI C source, and then optimize that with 'gcc -O3':

Code: Select all

./bf mandelbrot.bf -c mandelbrot.c
gcc -O3 mandelbrot.c -o mandelbrot
It's going to be about 4 or 5 times slower than if you let bf optimize and generate the Assembly itself :-)

Finally, the AST is evil (especially when it's implemented with generated parser code), don't use it! If you want performance, use a hand-crafted parser with a state machine. More difficult to code, granted, but a state machine performs much, much better in a compiler and allows some optimizations that are impossible with an AST. IMHO state-machine-based parsers are much better, but there's a big pile of subjectivity in that sentence; it's not an objective fact.

Cheers,
bzt

Re: Designs of microkernels

Posted: Sat Jan 23, 2021 12:38 pm
by eekee
-late reply-
moonchild wrote:
eekee wrote:@AndrewAPrice: Lisp machines will never die! :mrgreen: I'm saying that because it was Lisp and Lisp machines which developed and promoted garbage collection years ago. Nothing else had it in quite the same way, I don't think. But... with your calculator example, you're talking about something which can easily be handled with manual memory allocation. When a program exits, free all its memory.
The problem is, what is ‘its’ memory? Which memory ‘belongs’ to a process, when memory can be shared? Determining that isn't simple if you want to allow cheap IPC.

(The answer is, you do some sort of GC. And that's not (very) hard, but you still have to do it.)
I see, good point. A simple reference-counting GC would work. This reminds me not to be put off by the mention of garbage collection, because if you can get away with a reference-counting GC, it's really simple.