Doctor5555 wrote:Perhaps I didn't quite make my meaning clear. My fault entirely.
No worries, that's why I asked. You explained, so now I understand your confusion and I can answer.
Doctor5555 wrote:The article linked does not make it clear that it is talking about implementation and not bytecode when it says
four lines of instructions is needed to carry out an addition operation
Agreed, he should have made it clear that he meant 4 CPU instructions and not 4 VM instructions. As I've said in my previous posts, on some CPUs you could optimize for fewer CPU instructions, but that's still performing the same 4 theoretical steps.
Doctor5555 wrote:And your vm_add_v2 needs to push eax again after the add instruction
Nope, no push needed! It has already added one operand to the other in memory (on top of the stack), meaning the result of the add is placed in memory exactly where push would have stored it had both operands been popped. That's the point of the optimization.
(I might have accidentally switched the operands in vm_add_v1 though; the 2nd is supposed to be the result)
AndrewAPrice wrote:A little off-topic, but I like stack-based bytecode because it's super easy to write a compiler for
Btw, I've created a toy for you to play with
(Let me know if you find any bugs, I've quickly put it together in less than a day)
https://gitlab.com/bztsrc/brainfuck
It is a Brainfuck interpreter, compiler and bytecode optimizer, with configurable optimization levels.
Runs the unoptimized bf through an interpreter. Therefore it's going to be slow, but still much, much faster than simpleinterp.
Will use all optimizations, and it's a lot faster than optinterp3.
Code: Select all
./bf -O4 mandelbrot.bf -o mandelbrot
./mandelbrot
Will compile the bytecode into native code and will save an x86_64 ELF executable. This will use all the optimizations plus will remove the run-time boundary checks for a further speed up. Easily outperforms optasmjit.
You can do performance measurements with the -p flag (valid for both the interpreter and the compiler; for the latter, profiling code will be generated into the ELF).
One of the reasons my implementation is faster is that it's written in C and only uses arrays with pointers (in contrast, the article's author uses C++ with the worst container, std::vector). Also, my bf parses the source only once, while the article's author goes through the source, and then the generated tokens in the std::vector, multiple times in several iterations. My implementation has one drawback though: it has a compile-time configured limit on the maximum nesting allowed in BF scripts (set to 256 ATM).
Fun thing: you can also make this compiler generate ANSI C source, and then optimize that with 'gcc -O3':
Code: Select all
./bf mandelbrot.bf -c mandelbrot.c
gcc -O3 mandelbrot.c -o mandelbrot
It's going to be about 4 or 5 times slower than if you let bf optimize and generate the Assembly itself.
Finally, AST is evil (especially when it's implemented with generated parser code), don't use it! If you want performance, use a hand-crafted parser with a state machine. It's more difficult to code, granted, but a state machine performs much, much better in a compiler, and allows some optimizations that are impossible with an AST. IMHO state machine based parsers are much better, but there's a big pile of subjectivity in that sentence, it's not an objective fact.
Cheers,
bzt