Unions, program proofs and the Halting Problem

embryo · Post by **embryo** » Thu Mar 20, 2014 1:36 pm

Pancakes wrote:So Brendan get to working on it! LOL

Generally, I agree. But the amount of work is great. Here is the base - language, compiler, IDE(editor, debugger). Next comes OS. So, the way people have completed for 50 years (starting from assembly) should be replicated - the language should be invented, compiler should be created, tools should be developed. It seems too much for one person.

I think it is better to find some form of collaboration, when the long story will be told by many actors. Then it seems possible to do many things.

What do you think about collaboration? (the question is for all participants)

Antti · Post by **Antti** » Fri Mar 21, 2014 4:40 am

embryo wrote:But the amount of work is great. Here is the base - language, compiler, IDE(editor, debugger). Next comes OS.

The amount of work is not insurmountable. If we think about a compiler, for example, it could be an extremely naive implementation without any optimisations. It would still do the task. Actually, it would probably do the task surprisingly well.

embryo wrote:What do you think about collaboration? (the question is for all participants)

Unfortunately, it is very unlikely that it works if the project is trying to do thing differently and is at the early stage. People who would be competent to contribute are too critical and it results in a controversy. People who would be interested in helping are not necessarily competent enough to give any helpful contribution. It is a one-man journey until the design is stabilized and there are enough evidences for the design being successful.

Brendan · Post by **Brendan** » Fri Mar 21, 2014 8:59 am

H,

Antti wrote:
embryo wrote:But the amount of work is great. Here is the base - language, compiler, IDE(editor, debugger). Next comes OS.
The amount of work is not insurmountable. If we think about a compiler, for example, it could be an extremely naive implementation without any optimisations. It would still do the task. Actually, it would probably do the task surprisingly well.

My roadmap goes a little like this:

Plain text to tokenised source code conversion utility. Source code is supposed to be tokenised by a fancy IDE while you type, but there is no fancy IDE for Linux/Windows and converting plain text seemed easier. This turned out to be a lot harder than I first imagined - e.g. something like "typedef myType as range myFunction(myConstant*9) to myConstant + @(myStructure + myType.min).myMember.size" isn't trivial to tokenise.

Initial "all in one" compiler. This doesn't need to generate good quality code, but because the language was designed with the assumption that there will be an optimiser (e.g. things like dead code elimination instead of pre-processing/conditional code) the resulting binaries will be extremely bad if there's no optimisation, and at least some optimisation (constant folding, dead code elimination, etc) will probably be necessary just to make it bearable. This can't really be started until the "plain text to source code" utility is done.

Boot code, kernel modules, etc. I'm planning to cheat here (no support for UEFI, only support 80486, etc). This can't really be started until the compiler is done.

VFS, device manager, some drivers, some misc. stuff. These can't really be started until the kernel is usable.

Minimal GUI and IDE. These can't really be started until there's some device drivers, etc.

Both native compilers. These can't really be started until there's some device drivers, etc.

After all of this, I'm going to end up with a "not very good" OS with bad hardware support, ugly graphics, minimal IDE, etc; and I'll want to spend at least 6 months making everything less bad before I release it as "version 0.1".

Antti wrote:
embryo wrote:What do you think about collaboration? (the question is for all participants)
Unfortunately, it is very unlikely that it works if the project is trying to do thing differently and is at the early stage. People who would be competent to contribute are too critical and it results in a controversy. People who would be interested in helping are not necessarily competent enough to give any helpful contribution. It is a one-man journey until the design is stabilized and there are enough evidences for the design being successful.

For my project, nobody would be able to do anything until at least the kernel is done, and I'm much more likely to wait until "version 0.1" is released before I start looking for contributors - e.g. after it's self-hosting and not as painful to use, so that people can run the OS inside an emulator and use it to develop software.

Of course eventually (maybe "version 0.2") the features of the OS will start to shine - people will be able to create distributed clusters so that it's like they're working together on the same computer. E.g. 2 or more people using different IDE "front ends" that talk to the same IDE "back end", so that multiple people can be working on the same piece of code at the same time (and can watch each other make changes in real time, etc).

Cheers,

Brendan

embryo · Post by **embryo** » Fri Mar 21, 2014 10:54 am

Antti wrote:If we think about a compiler, for example, it could be an extremely naive implementation without any optimisations. It would still do the task. Actually, it would probably do the task surprisingly well.

Yes, but my timing is 50 times slower than optimized version of the same code. However, if the task is just text editing or a console application - it works very well.

Antti wrote:People who would be competent to contribute are too critical and it results in a controversy. People who would be interested in helping are not necessarily competent enough to give any helpful contribution. It is a one-man journey until the design is stabilized and there are enough evidences for the design being successful.

May be it will be a good idea to make some exchange? One competent developer can help to develop something for another competent developer. And even if the area of competence can be intersecting, but I hope it is possible to find some parts where two expertises can collaborate with profit for both developers - nobody knows everything equally and 'best' parts of experiences can really help each other.

embryo · Post by **embryo** » Fri Mar 21, 2014 11:13 am

Brendan wrote:This turned out to be a lot harder than I first imagined - e.g. something like "typedef myType as range myFunction(myConstant*9) to myConstant + @(myStructure + myType.min).myMember.size" isn't trivial to tokenise.

Is it because of limited expertise in compilers or is there no standard way in the compiler knowledge body ready to use for such situation?

Brendan wrote:Initial "all in one" compiler. This doesn't need to generate good quality code, but because the language was designed with the assumption that there will be an optimiser (e.g. things like dead code elimination instead of pre-processing/conditional code) the resulting binaries will be extremely bad if there's no optimisation, and at least some optimisation (constant folding, dead code elimination, etc) will probably be necessary just to make it bearable.

Is it a bad idea to advice you to compile to Java bytecode? The bytecode can be run in any JVM and it allows ease of testing and debugging until there will be other components. And it even can be run by the pure Java OS

Brendan wrote:Minimal GUI and IDE. These can't really be started until there's some device drivers, etc.

There are many parts of the work, that can be done without kernel or drivers. The IDE is very helpful at such stages. In my experience it was really easy to create debugger at the very early stages. And with the stages passed the debugger evolved too.

Brendan wrote:Both native compilers. These can't really be started until there's some device drivers, etc.

What do you mean by 'both'? What is the second compiler for? And why it is impossible to write compiler without drivers? It is enough to have an IDE and debugger for the compiler development.

Brendan · Post by **Brendan** » Fri Mar 21, 2014 1:06 pm

Hi,

embryo wrote:
Brendan wrote:This turned out to be a lot harder than I first imagined - e.g. something like "typedef myType as range myFunction(myConstant*9) to myConstant + @(myStructure + myType.min).myMember.size" isn't trivial to tokenise.
Is it because of limited expertise in compilers or is there no standard way in the compiler knowledge body ready to use for such situation?

It's a combination of factors:

the language doesn't have declarations and doesn't force you to define things before using them
expressions can contain anything (type names, function calls, symbols, etc)
things that define types and data can contain expressions
the tokenised source file format includes a symbol list and a type list, and tokens refer to entries in these list (e.g. "myConstant" might become a "reference to symbol list entry 7" token).
to find out the correct symbol table entry for a structure member, you have to know which structure that member belongs
to know which type of structure is being used, you have to respect operator precedence rules. E.g. "@(myStructure + myType.min)->myMember.size" is very different to "@myStructure + myType.min->myMember.size"
it's too hard to radically restructure it (e.g. convert it to an abstract syntax tree and back again) as I need to preserve the order of tokens, etc
for normal compilers, one or more of the above is not true (e.g. you can built an abstract syntax tree and/or things are declared/defined before use) so existing solutions don't solve the problem

For a simple example, consider these 2 lines of source code:

Code: Select all

myVariable as const myType = 4
typedef myType = range myVariable - 1 to myVariable + 1

This is perfectly legal. You can't completely tokenise one line of source and then completely tokenise the other.

Another example would be:

Code: Select all

typedef test = [myArray[bar.third.size]->second.third * 3 + 1]u8

myArray[99] as @foo

structdef foo = {
 first as test
 second as bar
}

typedef bar = baz

structdef baz = {
 third as u32
 fourth as bar
}

For this case, everything depends on everything. My code handles this correctly, but it took me a while to figure out how to do it.

embryo wrote:
Brendan wrote:Initial "all in one" compiler. This doesn't need to generate good quality code, but because the language was designed with the assumption that there will be an optimiser (e.g. things like dead code elimination instead of pre-processing/conditional code) the resulting binaries will be extremely bad if there's no optimisation, and at least some optimisation (constant folding, dead code elimination, etc) will probably be necessary just to make it bearable.
Is it a bad idea to advice you to compile to Java bytecode?

Java's byte-code isn't capable of representing things that are necessary for low level programming (mostly because it's designed for a safe language).

embryo wrote:
Brendan wrote:Minimal GUI and IDE. These can't really be started until there's some device drivers, etc.
There are many parts of the work, that can be done without kernel or drivers. The IDE is very helpful at such stages. In my experience it was really easy to create debugger at the very early stages. And with the stages passed the debugger evolved too.

You're only able to do that because you're building on top of existing things (and restricted by existing things). I'm discarding all existing things and therefore I can't build on top of them (and aren't restricted by them).

embryo wrote:
Brendan wrote:Both native compilers. These can't really be started until there's some device drivers, etc.
What do you mean by 'both'? What is the second compiler for? And why it is impossible to write compiler without drivers? It is enough to have an IDE and debugger for the compiler development.

The first compiler is typically used by programmers. It does as much optimisation and sanity checking (static analysis) as possible, and generates a portable executable file. This is a bit like compiling Java into Java byte-code; and allows (e.g.) programmers to create code without knowing what the end-user's CPU is.

The second compiler is run on the end user's computer; and converts the portable executable file into a native executable that the CPU can execute directly. This means that while the code is executing there's no translation/interpreter or virtual machine or other pointless bloat slowing things down and gobbling large amounts of RAM.

Cheers,

Brendan

embryo · Post by **embryo** » Fri Mar 21, 2014 2:16 pm

Brendan wrote:to know which type of structure is being used, you have to respect operator precedence rules. E.g. "@(myStructure + myType.min)->myMember.size" is very different to "@myStructure + myType.min->myMember.size"

It seems the language safeness is sacrificed for direct memory access. It is error prone.

Brendan wrote:For a simple example, consider these 2 lines of source code:
Code: Select all
myVariable as const myType = 4
typedef myType = range myVariable - 1 to myVariable + 1
This is perfectly legal. You can't completely tokenise one line of source and then completely tokenise the other.

This example just requires two passes of the tokeniser.

Brendan wrote:Java's byte-code isn't capable of representing things that are necessary for low level programming (mostly because it's designed for a safe language).

When it is about low level my choice is to use assembly and to translate it into machine code directly. And usually there are just a few low level things while all the other can be compiled in the bytecode representation. Level mix is unsafe and compiler agnostic.

Brendan wrote:I'm discarding all existing things and therefore I can't build on top of them (and aren't restricted by them).

But you have to use your PC and Windows/Linux to create new world. Why not to add to the Windows/Linux one more thing - the IDE? The IDE will live in the standard world and will help you to build anything for your personal world. And it's help is really great - the speed of development will be increased many times.

Brendan wrote:The second compiler is run on the end user's computer; and converts the portable executable file into a native executable that the CPU can execute directly. This means that while the code is executing there's no translation/interpreter or virtual machine or other pointless bloat slowing things down and gobbling large amounts of RAM.

But you have mentioned in some post such thing as a garbage collector - it is one of "pointless bloat slowing things down and gobbling large amounts of RAM"

Brendan · Post by **Brendan** » Fri Mar 21, 2014 3:00 pm

Hi,

embryo wrote:
Brendan wrote:Java's byte-code isn't capable of representing things that are necessary for low level programming (mostly because it's designed for a safe language).
When it is about low level my choice is to use assembly and to translate it into machine code directly. And usually there are just a few low level things while all the other can be compiled in the bytecode representation. Level mix is unsafe and compiler agnostic.

There's 2 reasons for assembly: the compiler can't/doesn't support what you want to do (e.g. CPU mode switches, etc), and optimisation. Kernel development ends up being a mixture of both. A "low-ish" level language like C is low level enough to avoid the need to use assembly for optimisation in more cases.

embryo wrote:
Brendan wrote:I'm discarding all existing things and therefore I can't build on top of them (and aren't restricted by them).
But you have to use your PC and Windows/Linux to create new world. Why not to add to the Windows/Linux one more thing - the IDE? The IDE will live in the standard world and will help you to build anything for your personal world. And it's help is really great - the speed of development will be increased many times.

I looked into writing a portable IDE. It doesn't solve the original problems (e.g. you still need to tokenise lines of text when they're entered) and creates additional problems (the entire user interface on top). It'd definitely be possible, but it'd take longer and I'm planning to throw it in the trash when I've moved to self-hosting anyway.

embryo wrote:
Brendan wrote:The second compiler is run on the end user's computer; and converts the portable executable file into a native executable that the CPU can execute directly. This means that while the code is executing there's no translation/interpreter or virtual machine or other pointless bloat slowing things down and gobbling large amounts of RAM.
But you have mentioned in some post such thing as a garbage collector - it is one of "pointless bloat slowing things down and gobbling large amounts of RAM"

You can implement a garbage collector without any translation/interpreter or virtual machine. For things like scripting languages (where programmer time and training are more important than performance) the extra overhead of garbage collection can be justified; while for low level work it's not justified.

Cheers,

Brendan

Antti · Post by **Antti** » Sat Mar 22, 2014 1:30 am

Brendan wrote:when I've moved to self-hosting anyway

You should have a lot of assembly code from your previous projects, e.g. boot code. Are you able (or are you planning) to use this code "as is" as much as possible, i.e. just converting the assembly syntax to suit your new language?

Last but not least, an important question. Do you sleep at all? We can see you being online at any time of day.

Brendan · Post by **Brendan** » Sat Mar 22, 2014 2:32 am

Hi,

Antti wrote:
Brendan wrote:when I've moved to self-hosting anyway
You should have a lot of assembly code from your previous projects, e.g. boot code. Are you able (or are you planning) to use this code "as is" as much as possible, i.e. just converting the assembly syntax to suit your new language?

For stage 1; for BIOS the early boot stuff will be mostly the same (as I don't plan to support 16-bit code generation for the higher level language); but will have half the code from my "Brendan's BIOS Wrapper" experiment built into it. Stage 1.5 will be converted into higher level code where possible and will be 100% 32-bit. I'm not planning to worry about boot loaders for UEFI until later (mostly, I'm too lazy to support position independent code and PE32+ file format for the initial/disposable compiler).

Stage 2 (and the boot modules it uses) will be converted into higher level code where possible too. There will be some minor changes here (e.g. support for "dual boot image" so I can have an architecture specific boot image containing 80x86 code, plus an architecture independent one containing graphics, docs, etc).

Later stuff (kernel setup modules and the kernel modules themselves) will be rewritten; partly because I want to do scheduling very differently, and partly because I haven't looked at my previous kernel/s for ages.

There's also some utilities (for boot image creation, compression/decompression, floppy and CD-ROM disk image creation, etc) that will be recycled "as is" (they're written in C and run on Linux and probably won't need any/many changes). There's also a bunch of documentation for previous versions that can be reused after a little checking.

Antti wrote:Last but not least, an important question. Do you sleep at all? We can see you being online at any time of day.

Some people walk or talk in their sleep, I check the forums and sometimes write replies in my sleep.

Cheers,

Brendan

embryo · Post by **embryo** » Sat Mar 22, 2014 5:18 am

Brendan wrote:I'm planning to throw it in the trash when I've moved to self-hosting anyway.

For a language self-hosting means IDE developed in the language, even if it is started as plain text. However, for the development it is more important to have debugger. Would it be a part of IDE or standalone application - any way it's better to write it in the target language for not having to do it twice. But your time constraints and habits, of course, will guide you better

bwat · Post by **bwat** » Sat Mar 22, 2014 8:46 am

embryo wrote:However, for the development it is more important to have debugger ... any way it's better to write it in the target language for not having to do it twice.

Depending on the language, it can be a good idea to also have a very low-level debugger so you can debug the high-level debugger. I had to add warnings in the documentation of some of the printout hooks in my Prolog system so to avoid having the debugger's printouts invoking the debugger which would lead to printouts which would invoke the debugger which would lead to printouts invoking ......

Brendan · Post by **Brendan** » Sat Mar 22, 2014 9:32 am

Hi,

embryo wrote:
Brendan wrote:I'm planning to throw it in the trash when I've moved to self-hosting anyway.
For a language self-hosting means IDE developed in the language, even if it is started as plain text. However, for the development it is more important to have debugger. Would it be a part of IDE or standalone application - any way it's better to write it in the target language for not having to do it twice. But your time constraints and habits, of course, will guide you better

Initially I'll just be using the debugger that's built into Bochs.

Eventually (e.g. after the IDE can edit source code and the compilers can compile it), I'll need to figure out how debugging, profiling and unit testing should be done (which will require a lot of research, as I strongly suspect these things shouldn't be done how they've traditionally been done).

Cheers,

Brendan

Rusky · Post by **Rusky** » Sat Mar 22, 2014 10:25 am

For the purpose of having a common definition, I'm going to point out that tokenizing (or lexing) is just splitting the text into words (<keyword typedef> <identifier "myType"> <keyword as> ...). I'm guessing Brendan is referring to what is usually called parsing, since lexing is his examples is trivial but parsing them is less so. You may have figured this out already, but it's still possible to parse that in a single pass because the syntax itself doesn't change based on type declarations, etc. You just need to create an "undefined" symbol table entry and then fill it in when you see its definition- any undefined entries left at the end are errors. You may need to enlighten me as to why you can't use an AST- they can preserve token order just fine, for example. Clang's even stores location information for error reporting, which allows it to be printed back out exactly as it was read in.

Also, for the purposes of an initial compiler, you may want to consider using the LLVM library. That way you don't have to worry about code generation at all, and you get world-class optimization for free. I don't know if LLVM could exist in your brave new world, but it would give you a significant head start on your way there. I've written a full-blown compiler for a scripting language in only a few weeks using a recursive descent parser with pratt parsing for expressions written in C++, using LLVM for code generation. You would also get free support for many CPU architectures and object formats, so things like UEFI and later-than-486s would be trivial.

I like the idea of source code being stored in a non-textual format. Even nicer would be if text (or rather appearing like text) were only one view of the code, to go along with your IDE ideas and/or Subtext style interfaces. (By the way, that First Class Copy and Paste link I flippantly tossed at you earlier is the formal specification of Subtext, which is based on what is essentially an AST).

Rusky · Post by **Rusky** » Sat Mar 22, 2014 10:52 am

Back on the topic of alternate exits to a function vs a branch in the caller, and considering this thread at least started out talking about the Mill, I learned something interesting: http://millcomputing.com/topic/new-call ... /#post-711. The Mill call instruction can return multiple results of arbitrary sizes (because of the belt, as opposed to using registers), so you could return an error code or use an Option<T>. Then, because of phasing, the call op and the branch-on-error op can be in the same instruction even though they're dependent (same feature used in the vectorized strcpy). This brings the cost back down to Brendan's solution even without optimizations like partial inlining and is actually a pretty elegant solution. In fact, Brendan's language could be implemented by the callee returning a code for "which continuation argument did I want to take" which the caller can switch on (again, still in the same instruction).

OSDev.org

Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem

Re: Unions, program proofs and the Halting Problem