Hi,
Kevin wrote:Brendan will disagree because he considers his OS too different for this to make sense, but I don't think you have to wait for your OS to be running before you can throw away your first compiler.
Yes. For my case:
- the languages are different
- the libraries are different
- the source file format is not "plain text"
- the output file format is very different to ELF object files (no object files, no linking)
- the tools aren't used the same way (automated file format conversion, no make, no link)
- the file IO is very different (versioning file system, all files are immutable)
- processes use a different conceptual model (use of thread specific storage and asynchronous communication as the basis for a "shared nothing" concurrency model)
- it's intended as a distributed architecture (e.g. the local computer supervises work done by other/remote computers, including load balancing and "start alternative slave/s on slave failure"; where that work may be a pre-compile phase on one remote computer that feeds multiple back-ends, with the ability to generate N different output executables optimised for N different targets on M different remote computers in parallel)
Kevin wrote:The only reason why you even need a throw-away compiler in the first place is that you're designing a new language, so the first compiler can't be written in its own language. As soon as your preliminary compiler implements a subset of the language that is powerful enough, you can start writing the real compiler. Not running on your own OS, but e.g. Linux. At this point, you can start working on optimisations etc. without having to throw them away later.
Then you use that compiler to start developing your OS, and as soon as it's mature enough, porting your compiler will be easy enough. All you really need to change is a few functions for accessing files and similar things.
If you're able to run the native compiler on Linux, then you should start by splitting your "tools + OS" project into 2 separate projects and then forget about the OS until after the "tools" project is finished (and possibly forget about the OS completely, given that there'd be at least 5 well-established "work-alike" OS projects you could join and/or fork instead).
Kevin wrote:Brendan wrote:If you must add fancy features, add features that help humans avoid mistakes and/or make it easier for people and compilers ensure that code is correct. Don't add features that make things more complex, or more ambiguous, or make it harder to ensure code is correct (e.g. templates/generics, operator overloading, etc).
How are generics a feature that makes it harder to ensure correctness? Aren't they generally used where you had an untyped pointer before, thus introducing type safety, which is one of the most important measures to ensure correctness?
How would you implement a list, for example?
A list of how many items, where each item in the list contains what, and for which use cases?
For a random example, let's assume it's an extremely large list of "name and contact details" structures; where you expect to frequently use the name to search for contact details; but never really expect to use the contact details to search for the name.
The first thing I'd do is split this into 2 structures - one for "hot data" (used frequently for searches) and another for "cold data" (not used for searches). The "hot data" would consist of the name and a reference to the "cold data". I'd create a hash from the name; then I'd have one thread per CPU per NUMA domain per computer, and use "name_hash % number_of_threads" to determine which thread is responsible for which names. Each of these threads would then use "name_hash / number_of_threads" to select one of its own sub-lists.
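In rough C, the shape of this would be something like the following (the thread count, sub-list count and field sizes are made-up numbers, purely for illustration):
Code:
/* Minimal sketch of the hot/cold split and the hash-based sharding
   described above; all constants are made-up. */
#include <stdint.h>

#define TOTAL_THREADS        64     /* assumed: threads across all CPUs/NUMA domains/computers */
#define SUBLISTS_PER_THREAD  1024   /* assumed sub-list count per thread */

struct cold_data {
    char contact_details[256];      /* rarely searched; kept out of the hot path */
};

struct hot_data {
    struct hot_data *next;          /* full pointers for now; shrunk to partial pointers below */
    struct cold_data *cold;
    char name[];                    /* UTF-8 bytes, variable length */
};

/* Which thread owns a name, and which of that thread's sub-lists it goes in */
static uint32_t owner_thread(uint32_t name_hash)
{
    return name_hash % TOTAL_THREADS;
}

static uint32_t sublist_index(uint32_t name_hash)
{
    return (name_hash / TOTAL_THREADS) % SUBLISTS_PER_THREAD;
}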
For cache locality (to minimise TLB misses and cache misses), the "hot data structures" for each thread all need to be close to each other. For this reason (regardless of how you store the sub-lists), you're going to need a special "hot data virtual memory pool". You do not want to do something completely retarded (like just use a generic "malloc()" that's shared by all sub-lists for all threads in a process, and completely destroys cache locality). To be more correct; I'd actually want to have a separate "hot data memory pool" for each individual sub-list for each thread (plus another memory pool for each thread for its "cold data").
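As a very rough sketch (using "mmap()" as a stand-in for whatever the OS actually provides for reserving a virtual memory pool), a per-sub-list bump allocator might look like:
Code:
/* Per-sub-list "hot data" pool: every entry in one sub-list comes from
   the same contiguous region, so entries share pages and cache lines.
   mmap() is only a stand-in for reserving a virtual memory pool. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HOT_POOL_SIZE (128 * 1024)  /* matches the 128 KiB limit derived below */

struct hot_pool {
    uint8_t *base;                  /* start of this sub-list's region */
    size_t   used;                  /* bump offset for the next allocation */
};

static int hot_pool_init(struct hot_pool *p)
{
    p->base = mmap(NULL, HOT_POOL_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p->used = 0;
    return (p->base == MAP_FAILED) ? -1 : 0;
}

/* Allocate "size" bytes, 2-byte aligned (the alignment that the
   "(offset to next entry) >> 1" trick below relies on) */
static void *hot_pool_alloc(struct hot_pool *p, size_t size)
{
    size_t offset = (p->used + 1) & ~(size_t)1;
    if (offset + size > HOT_POOL_SIZE) {
        return NULL;                /* this sub-list's pool is full */
    }
    p->used = offset + size;
    return p->base + offset;
}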
For the sub-lists themselves; I'd probably just have singly linked lists of "hot data structures". This means having some sort of "next" field, some sort of "reference to cold data" field and the "name" field in that structure. Because the name field would be (UTF-8) bytes (no alignment needed), and because I'd be trying to pack as many "hot data structures" as possible into the fewest cache lines; I wouldn't use full (64-bit) pointers for the "next" field or for the "reference to cold data" field (e.g. 2*8 bytes plus up to 7 bytes for alignment). Instead I'd use some form of partial pointers.
For the "next" field; because we're using a separate "hot data memory pool" for each individual sub-list we only need an offset from the start of the virtual memory pool, and if we assume some sort of alignment we don't need to store "always zero due to alignment" bits of the offset. This means that we probably only need a 16-bit "(offset to next entry) >> 1" for the next field. This does limit it to 128 KiB per sub-list (or assuming an average of 16 bytes per entry, about 8000 entries); but that doesn't really matter - with a large enough hash size (and good enough hash function) the number of sub-lists will be large enough to avoid the need to store too many entries per sub-list anyway.
In a similar way; for the "reference to cold data" we only need an offset from the start of that memory pool, but we don't care much about cache locality for the cold data and could maybe align entries on 64-byte boundaries if we want. This means we can use something like a 32-bit "offset of cold data >> 6" as a reference to the cold data. That would mean that our cold data virtual memory pool (that is shared by all of one thread's sub-lists) would be limited to 256 GiB; which is more than we're ever likely to use (maybe some of those bits could be used for something else, like a "male or female" flag or something, or maybe we should align cold data entries on a 32-byte or 16-byte boundary instead). Even though this is a 32-bit field I wouldn't bother aligning the field itself on a 4-byte boundary, as it's only used to find the cold data (which doesn't happen frequently) - the "next" field would be aligned on a 2-byte boundary, and that's good enough alignment for the "32-bit cold data reference".
That means that both the "next" and "cold data reference" fields cost us 6 bytes plus one optional byte of padding (rather than the original "2*8 bytes plus up to 7 bytes for alignment"); which means more entries per cache line and fewer cache misses.
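Putting the pieces together; the entry layout and the partial pointer decoding would look something like this (the "NEXT_NONE" sentinel and the packing pragma are details I've made up for the sketch):
Code:
/* The packed "hot data" entry: a 16-bit partial pointer for "next",
   a 32-bit partial pointer for the cold data, then the name bytes.
   NEXT_NONE is a made-up sentinel meaning "end of sub-list". */
#include <stdint.h>

#define NEXT_NONE 0xFFFF

#pragma pack(push, 1)               /* no padding - the 32-bit field is deliberately
                                       left 2-byte aligned, as described above */
struct hot_entry {
    uint16_t next;                  /* (offset of next entry in this sub-list's pool) >> 1 */
    uint32_t cold_ref;              /* (offset of cold data in this thread's cold pool) >> 6 */
    char     name[];                /* UTF-8 bytes, no alignment needed */
};
#pragma pack(pop)                   /* sizeof(struct hot_entry) == 6 */

/* Follow the "next" partial pointer within one sub-list's hot pool */
static struct hot_entry *next_entry(uint8_t *hot_pool_base, const struct hot_entry *e)
{
    if (e->next == NEXT_NONE) {
        return NULL;
    }
    return (struct hot_entry *)(hot_pool_base + ((uintptr_t)e->next << 1));
}

/* Expand the 32-bit cold reference (64-byte aligned entries, so up to
   256 GiB of cold data per thread) into a real pointer */
static void *cold_data_of(uint8_t *cold_pool_base, const struct hot_entry *e)
{
    return cold_pool_base + ((uint64_t)e->cold_ref << 6);
}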
Now; how would you implement this same list? Would you use a generic "list" template that doesn't scale (either not thread-safe and/or suffering lock contention), that uses a generic memory allocator and has extremely bad cache locality, and that also uses 64-bit pointers just in case it doesn't quite suck badly enough already?
Cheers,
Brendan