RabbieOS

0b1 · Post by **0b1** » Sat Jun 23, 2018 11:43 am

* '''RabbieOS''' - The Reductively Architected Breviloquently Built Information Environment.
** Contact: [email protected]
** URL: https://sourceforge.net/p/rabbieos/wiki/Home/

I'm not sure this qualifies as an OS yet, but that is the goal.

Unlike many projects, it has a specific purpose in mind: to replace web/database servers in virtual hosting environments. It will be written in a mix of assembler and a custom compiler. It (perhaps naively) aims to throw out many tried and tested ways of doing things in the name of efficiency, including:
- Traditional memory protection techniques (including SYSCALLS)
- non-gated SMP approach
- Safe-code-only compiler
- Minimal stack use
- Unique use of the CPU caches
- Synchronization without tying up the CPU
- Memory management
- Interrupt handling
- Designed to build on Windows
- SE principles and conventions: Pragmatic, Agile, Patterns.
- Built-in unit tests
- A gradual move from assembly to the FilthyScript compiler
- Built in TCP stack, HTTP and relational data engine.
- Web-based UI only
- Web-based database and workflow designer
- Custom pre-built web/data "applications"
- Migration tools

I've spent a lot of time working with LAMP, WAMP, and WIN64 stacks (and almost 40 years writing code); my estimation is by ditching some old-and-slow legacy best practices, a 20-fold improvement in performance. That's 20 times as many virtual guests per host - a huge cost saving for hosting providers. It's not a given, however. I am deliberately taking a naive approach to many best practices. Perhaps I will learn they exist for a reason. (Then again, experience has taught me a lot of them exist for legacy reasons only).

Given there's a lot of wheel reinventing, at the present pace I aim to make it to alpha shortly before the 22nd century.

Once the code is a little more mature and stable (and I have a rudimentary TCP/IP stack capable of ARP and DHCP written in assembly) I will upload the sources and build tools.

Constructive comments welcome!

Octocontrabass · Post by **Octocontrabass** » Thu Jun 28, 2018 11:34 am

davidpi wrote:- Unique use of the CPU caches

All right, you've got my attention. How are you using the CPU caches differently from other OSes?

A couple lines of the introduction page you linked also caught my eye.

Yet the stack may or may not live in a cached area of memory.

Got any examples of situations like this? As far as I know, every OS on every x86 PC from the past 20 years configures the caches so that all memory is cached memory.

All multi-CPU environment need most other memory to be current across chips, meaning cacheing gets in the way.

How does it get in the way?

The x86 caches automatically maintain coherence across multiple CPUs unless you explicitly disable them. At worst, you may get heavy bus traffic if two CPUs are trying to write to the same cache line at the same time, and apparent time travel if those CPUs are trying to read each others' writes as well. Both of these are easy to avoid with spacing between data belonging to different threads and explicit synchronization, and the overhead is minimal in most software.
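To make the "spacing between data belonging to different threads" part concrete, here is a minimal sketch in C11. The 64-byte line size is an assumption (common on x86, but check your CPU), and `padded_counter`, `counter_add`, `counter_total` and `MAX_THREADS` are hypothetical names invented for this illustration. Each thread gets its own cache line, so two CPUs never fight over the same one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64  /* assumption: 64-byte cache lines */
#define MAX_THREADS 8   /* hypothetical limit for this sketch */

/* One counter per thread, each padded out to a full cache line. */
typedef struct {
    _Alignas(CACHE_LINE) uint64_t value;
} padded_counter;

_Static_assert(sizeof(padded_counter) == CACHE_LINE,
               "each counter must occupy exactly one cache line");

static padded_counter counters[MAX_THREADS];

/* Called only by the thread that owns slot thread_id. */
void counter_add(size_t thread_id, uint64_t n)
{
    assert(thread_id < MAX_THREADS);
    counters[thread_id].value += n;
}

/* Sums all slots. In real multi-threaded code this needs explicit
 * synchronization (e.g. run it after the worker threads have joined). */
uint64_t counter_total(void)
{
    uint64_t total = 0;
    for (size_t i = 0; i < MAX_THREADS; i++)
        total += counters[i].value;
    return total;
}
```

The cost is some wasted memory per slot, which is usually a good trade against cache lines bouncing between CPUs.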

Preventing caching is possible, but it hurts performance so much that it's not really an option. You also have to ensure that all mappings of a particular range of memory share the same cache setting, which means you can't limit caching to only a specific CPU.

0b1 · Post by **0b1** » Sun Jul 01, 2018 9:15 am

Hi Octocontrabass,

Thank you for your feedback. Challenges like this help me reflect on the design and (hopefully) make it stronger.

I have found that the first and best use for caches is to prevent fetching/writing of frequently used blocks.

Consider two primary use cases of read and writes to RAM, with cacheing enabled.
1. Writing a tiny amount of data, such as an octet.
2. Reading a huge amount of data, such as scanning a disk buffer of database indexes, much larger than the on-chip cache.

If the RAM in question is not already cached, then the CPU must first fetch it. In scenario 1, a block of data is fetched, not just the variable in question, other cached ram is swapped out (requiring a write, if write back is not already enabled). So for one memory area, the existing cache, which may be better remaining cached, is swapped out.

In scenario 2, each part of the block must be read into the cache first in order to be read. The cache doesn't get dirty, so there is no write on swap-out. Yet whatever was in the cache first is swapped out to make room for this sequence of one-time reads. In all likelihood, what was in the cache to start with will have to be read back in.

In both of these scenarios, there is little benefit to caching. The memory must all be read anyway. Any gain, it seems to me, would be lost by the amount of cache changes (up to twice the overhead). (Your comment makes me think I am missing something here, so I will give it some thought, and perhaps you can help me see the flaw).

AFAIK the stack is by far the most frequent use of the same area of RAM, and I haven't found anything to indicate the CPU explicitly caches the stack. Because the stack is CPU specific, it doesn't even need to be written back. Only an individual CPU cares about the stack. (Perhaps a better hardware design would be to dedicate on-chip RAM to the stack; it really doesn't need to be in main memory at all). Even if Intel agreed to the change (unlikely) we'd still be 5 years away from this being in the mainstream. If I find a way to do it, I'd have the CPU keep the stack cached indefinitely, never being swapped out.

So, by caching only the stack (and a few other things) I believe I can minimize the amount of time the CPU has to wait for the MMU reads/writes. (And if I'm wrong, it's one flag change!)

Having all PCI DMA access take place in non-cached areas should mean the PCI bus and CPUs have to synchronize RAM to start and end a DMA transfer, less cycles there as well. With one CPU there the overhead is relatively small, but with multiple CPUs the overhead grows more geometrically than linearly. (Unless the PCI bus knows what is already cached, this must be very similar to spinlock performance losses, if not quite as steep a drop-off.)

My OS will not be allowed to write data to the stack, only return addresses. In addition, because the compiler will enforce low procedural depth and no recursion, the stacks will be relatively tiny (some OSes have stacks of 1M; I expect far less than 64k, maybe even 16k).

OSes have been doing this for the past 20 years [paraphrased]

Since the early 90s, at least! (But in response to OSes that were written with compilers that were not a good match for the job.) C as it stands is great for synchronous code, but is a poor choice for asynchronous code: you either have to write programs that are specifically architected at a high level to be cooperatively multi-tasked, or use preemption, which is standard practice but computationally inefficient. I am proposing a simpler programming language/compiler that compiles to cooperatively multi-tasked code (and what is perhaps a new approach to synchronization that doesn't tie up the processor in wait spins).

Most barriers to progress are resistance to change, particularly 'we have always done it this way, why change?'. I think having done it a certain way for 30 years in what is supposed to be a pioneering field is a reason to look for alternatives

Perhaps I am being naive, or arrogant, or perhaps they are the same thing. Or perhaps this task is just too large for one developer to pull off. But I think it's worth throwing out the rulebook and starting over.

Thanks again, and I will definitely remember your feedback as I make progress.

Korona · Post by **Korona** » Sun Jul 01, 2018 10:10 am

Except that CPUs and cache coherency do not work that way: If you disable caching, your 8-bit wide reads will not get magically faster. The latency of fetching a cache line is the same as fetching a single byte. MMIO writes are performed in uncached memory anyway.

Schol-R-LEA · Post by **Schol-R-LEA** » Sun Jul 01, 2018 10:11 am

davidpi wrote:* '''RabbieOS'''

So, based on the name, it's a highly reflective system, I take it?

O wad some Power the giftie gie us;
To see oursels as ithers see us!

Octocontrabass · Post by **Octocontrabass** » Sun Jul 01, 2018 3:36 pm

davidpi wrote:Consider two primary use cases of read and writes to RAM, with cacheing enabled.
1. Writing a tiny amount of data, such as an octet.
2. Reading a huge amount of data, such as scanning a disk buffer of database indexes, much larger than the on-chip cache.

In scenario 1, you probably want to use non-temporal stores. Non-temporal stores are for situations where you're going to write but not read memory, and don't need the strong memory ordering or coherency enforced by disabling the cache. They typically have the biggest benefit for sparse writes that would otherwise incur a large penalty from cache line eviction/fill/write-back.

In scenario 2, you probably want to use non-temporal loads or non-temporal prefetches. These are for situations where you want to read some memory exactly once and prevent it from polluting the cache. They typically have the biggest benefit in this exact scenario.
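As a rough illustration of both cases, here is a sketch using the SSE/SSE2 intrinsics `_mm_stream_si128`, `_mm_sfence` and `_mm_prefetch` with `_MM_HINT_NTA`. The function names `fill_nontemporal` and `sum_once`, the 256-byte prefetch distance and the 64-byte line size are assumptions picked for the example, not tuned values. When SSE2 is not available at compile time, the sketch falls back to ordinary cached accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#endif

/* Scenario 1: write memory that will not be read back soon.
 * dst must be 16-byte aligned and n a multiple of 16. */
void fill_nontemporal(uint8_t *dst, uint8_t value, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && (n % 16) == 0);
#if defined(__SSE2__)
    const __m128i v = _mm_set1_epi8((char)value);
    for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), v);  /* non-temporal store */
    _mm_sfence();  /* order the streaming stores before later stores */
#else
    memset(dst, value, n);  /* fallback: ordinary cached stores */
#endif
}

/* Scenario 2: read a large buffer exactly once, hinting that the data
 * should not displace what is already in the cache. */
uint64_t sum_once(const uint8_t *src, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        /* Assumed tuning: once per 64-byte line, prefetch 256 bytes ahead. */
        if ((i % 64) == 0 && i + 256 < n)
            _mm_prefetch((const char *)(src + i + 256), _MM_HINT_NTA);
#endif
        sum += src[i];
    }
    return sum;
}
```

Whether either function actually beats the plain cached version depends heavily on the CPU and the access pattern.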

No matter what, you must benchmark your implementations on a variety of CPUs. You may even find it necessary to come up with different implementations for different CPU models.

davidpi wrote:(Your comment makes me think I am missing something here, so I will give it some thought, and perhaps you can help me see the flaw).

The point of the caches is to hide memory latency, by combining multiple reads and writes into one cache line read and one cache line write. With the help of prefetching, data can arrive in the cache before you need it, and stay in the cache until you're done. When you disable the cache, you no longer get any of those benefits: every read or write becomes a bus cycle, and those bus cycles are forced to occur in the order your program does them. (Bus cycles can transfer many bytes of data at once - perhaps up to 16 bytes at a time. This isn't too bad for cache lines, but very bad when you're moving a single byte at a time!)
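A back-of-the-envelope way to see the difference is to count memory transactions. The sketch below assumes an aligned buffer and 64-byte cache lines, and ignores prefetching and write combining. The name `memory_transactions` is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of transactions needed to move total_bytes when each transaction
 * carries bytes_per_transaction bytes (rounded up). Assumes the buffer is
 * aligned to the transaction size. */
size_t memory_transactions(size_t total_bytes, size_t bytes_per_transaction)
{
    assert(bytes_per_transaction != 0);
    return (total_bytes + bytes_per_transaction - 1) / bytes_per_transaction;
}
```

Walking a 4096-byte buffer one byte at a time gives `memory_transactions(4096, 1)`, which is 4096 separate bus cycles with caching disabled. With caching enabled and an assumed 64-byte line, `memory_transactions(4096, 64)` gives 64 line fills, and the remaining accesses hit the cache.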

davidpi wrote:AFAIK the stack is by far the most frequent use of the same area of RAM, and I haven't found anything to indicate the CPU explicitly caches the stack.

The stack is ordinary memory, subject to the same cache settings as any other memory: write-back, unless explicitly specified otherwise.
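For reference, this is roughly how a 4 KiB page table entry selects its memory type on x86 when paging and the PAT are in use. The sketch assumes the PAT still holds its power-up defaults and ignores the MTRRs, which also influence the effective type. The names `mem_type`, `pte_pat_index` and `pte_default_mem_type` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Cache-control bits in a 4 KiB page table entry. */
#define PTE_PWT (UINT64_C(1) << 3)
#define PTE_PCD (UINT64_C(1) << 4)
#define PTE_PAT (UINT64_C(1) << 7)  /* bit 7 is PAT only for 4 KiB pages */

typedef enum { MT_WB, MT_WT, MT_UC_MINUS, MT_UC } mem_type;

/* PAT contents after power-up or reset; an OS may reprogram these
 * (for example, to make a write-combining entry available). */
static const mem_type default_pat[8] = {
    MT_WB, MT_WT, MT_UC_MINUS, MT_UC,
    MT_WB, MT_WT, MT_UC_MINUS, MT_UC
};

/* PAT index = PAT:PCD:PWT taken as a 3-bit number. */
unsigned pte_pat_index(uint64_t pte)
{
    return ((pte & PTE_PAT) ? 4u : 0u)
         | ((pte & PTE_PCD) ? 2u : 0u)
         | ((pte & PTE_PWT) ? 1u : 0u);
}

/* Memory type selected by the PTE, assuming the default PAT. */
mem_type pte_default_mem_type(uint64_t pte)
{
    unsigned idx = pte_pat_index(pte);
    assert(idx < 8);
    return default_pat[idx];
}
```

A stack page mapped with none of these bits set lands on index 0, which is write-back, the same as any other ordinary page.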

davidpi wrote:So, by caching only the stack (and a few other things) I believe I can minimize the amount of time the CPU has to wait for the MMU reads/writes. (And if I'm wrong, it's one flag change!)

Disabling the caches will have the opposite effect: you're maximizing time spent waiting for bus cycles to complete!

davidpi wrote:Having all PCI DMA access take place in non-cached areas should mean the PCI bus and CPUs have to synchronize RAM to start and end a DMA transfer, less cycles there as well. With one CPU there the overhead is relatively small, but with multiple CPUs the overhead grows more geometrically than linearly.

On x86, cache coherency is enforced by hardware unless otherwise specified. That includes PCI DMA. You'll get all the same cache snooping traffic whether the cache is enabled or disabled, so there's no real benefit to disabling it.

However, it seems like PCIe has a way to disable that cache snooping traffic. When you do that, you have to carefully manage the caches to ensure they're properly flushed before you can begin a transfer. That may include temporarily disabling the cache for some areas; I'm not really familiar with the details here.

davidpi wrote:Since the early 90s, at least!

If you go that far back, you'll find hardware that isn't able to cache all of the installed memory.

davidpi wrote:I am proposing a simpler programming language/compiler that compiles to cooperatively multi-tasked code (and what is perhaps a new approach to synchronization that doesn't tie up the processor in wait spins).

Windows, back when it was just a DOS shell, used cooperative multitasking. Things got ugly whenever a bug prevented a program from cooperating.

davidpi wrote:Most barriers to progress are resistance to change, particularly 'we have always done it this way, why change?'. I think having done it a certain way for 30 years in what is supposed to be a pioneering field is a reason to look for alternatives

You should also ask "why have we always done it this way?" You may find some of your ideas have already been tried and didn't work out very well.

0b1 · Post by **0b1** » Wed Jul 04, 2018 7:03 am

Schol-R-LEA, yes it's indirectly named after Robert Burns. It's wee and sleeket, but not cowrin' or tim'rous!

0b1 · Post by **0b1** » Wed Jul 04, 2018 7:44 am

Octocontrabass, thanks for your detailed response.

Do you have any examples or links regarding temporal/non-temporal memory? That's new terminology to me.

(AFAIK, CPU caches now or in the 90s aren't able to cache all of installed memory, typically they only cache < 1MB nowadays, and only a handful of kb in the 90s)

My goal is to speed up the CPU by pointing the stack address to the only area of cached RAM (cached without write-back). Then the stack, in theory, will be in the CPU's 'fast' RAM, not the system's 'slow' RAM. Whether that will work, or will benefit, in the way I surmise remains to be seen, but I believe it is worth a shot. If not, maybe I will have learned something, or maybe I will figure out a different way, or both (or neither, of course).

Cooperative multitasking did not work well with traditional compilers and APIs, because uncontrolled code could do what it liked -- it did not have to yield control, so a bug (or malicious code) could keep control of the CPU. A closed API, coupled with a compiler that compiles code into blocks designed to yield control will (imo) prevent this problem and remove the overhead of pre-emption, which is quite high! If it looks like it might be necessary there could be a safety interrupt, designed to kill processes that were not yielding control. If the compiler has sufficient control over the code structure, I suspect that would be superfluous ... or at least only necessary in the test build.

The challenge with that is timing. Out-of-order execution and different clock speeds make it very challenging (if not impossible) to compile code that runs in a particular time slice. I may have to use a different metric such as the number of nano instructions, or a relative 'cost' of a particular instruction. Not to mention an effective algorithm for unrolling non predictable loops. Even then, I don't see how I can predict a time slice.

RabbieOS is not designed to be an infinitely extensible platform such as Linux or Windows, but a highly efficient web and relational data 'appliance'. Whether I/O limitations will allow the 20x more appliances per host I hope to achieve also remains to be seen. But it's worth a shot and I am learning a lot on the journey.

Had Gates, Torvalds, Gosling, Jobs, Turing, Napier and their peers done things the way they'd always been done our smartphones would be in a truck hauling 500V thermionic valve technology and a steampunk clockwork display the size of the dashboard. Doing things the way they've always been done gets you started, but then it's time to push the envelope. IMNSHO 'Don't reinvent the wheel' is good for profit and stability, but not for innovation.

Thanks again,
DP

Korona · Post by **Korona** » Wed Jul 04, 2018 9:49 am

The top of the stack is already cached even if you do not disable caching for anything else. Disabling caching will make everything slower without making stack access faster. That is simply not how caching works. Read up on how MESI and cache associativity actually works so that you know what you are actually talking about.

You can only "reinvent the wheel" successfully if you actually know what you're doing. Not only that, you have to know it better than your predecessors - so better study their work first. Right now, you are just reinventing some mistakes.

PS: You also seem to have no idea how much caching actually impacts performance. I recently implemented WC caching in my OS. During the implementation I accidentally disabled caching for main memory once. An -O3 screen update with caching disabled goes from a few microseconds to "you can see the pixels appearing on screen" for an Intel i7 7700K.

DavidCooper · Post by **DavidCooper** » Wed Jul 04, 2018 10:58 am

davidpi wrote:If it looks like it might be necessary there could be a safety interrupt, designed to kill processes that were not yielding control. If the compiler has sufficient control over the code structure, I suspect that would be superfluous ... or at least only necessary in the test build.

All you need's a timer. The timer's interrupt routine sets a variable to tell the app to return control the next time the app reads it. All apps would occasionally check that variable (from multiple places in their code, such as inside outer loops which the program may be stuck in for long lengths of time) to see if they should return control to the OS, preferably at times when there is minimal information in registers needing to be saved. If an app fails to hand back control before the next timer tick, the OS can take back control by force, and if it needs to do this repeatedly it can ask the user if the app might need to be shut down. You can still combine this with a pre-emptive system to remove unwanted delays for apps which can't tolerate little delays, but most apps don't need such precision.

davidpi wrote:The challenge with that is timing. Out-of-order execution and different clock speeds make it very challenging (if not impossible) to compile code that runs in a particular time slice. I may have to use a different metric such as the number of nano instructions, or a relative 'cost' of a particular instruction. Not to mention an effective algorithm for unrolling non predictable loops. Even then, I don't see how I can predict a time slice.

Don't try to predict it - the processor speed may be changed as its temperature goes up and down. The length of a time slice before returning control may also need to change depending on how many other apps are running, so it's better to let the OS decide when control should be returned, but to remove the need for it to do so in a disruptive way. You certainly don't want apps to hand back control a hundred times more often than necessary, but apps may need to be able to hand control back that quickly when there's a lot going on in the machine, so it's best if they check first to see if they should return or keep going for a bit longer, and it is sufficient to read a variable, make a comparison and perform a conditional jump (which will only occasionally lead to control being returned to the OS).
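A minimal sketch of that check in C11 follows. The timer interrupt is simulated here by an ordinary function, `timer_tick`, and the names `sum_task`, `sum_task_run` and `CHECK_INTERVAL` are hypothetical. The task keeps all of its state in a struct, so handing control back means nothing more than returning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CHECK_INTERVAL 4  /* hypothetical: poll the flag every 4 items */

/* Set by the timer, cleared by whichever task notices it and yields. */
static atomic_bool yield_requested;

/* Stands in for the timer interrupt routine. */
void timer_tick(void)
{
    atomic_store(&yield_requested, true);
}

/* A resumable task: sums an array, remembering where it stopped. */
typedef struct {
    const int *data;
    size_t     n;
    size_t     next;  /* index of the next item to process */
    long       sum;
} sum_task;

/* Runs until the work is done (returns true) or until a yield has been
 * requested at a check point (returns false; call again to resume). */
bool sum_task_run(sum_task *t)
{
    assert(t != NULL);
    while (t->next < t->n) {
        t->sum += t->data[t->next++];
        /* Read a variable, compare, conditionally return. */
        if ((t->next % CHECK_INTERVAL) == 0 &&
            atomic_exchange(&yield_requested, false))
            return false;
    }
    return true;
}
```

The OS side would call `sum_task_run` again later to resume. Taking control back by force from an app that never reaches a check point is not shown.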

Octocontrabass · Post by **Octocontrabass** » Wed Jul 04, 2018 11:22 am

davidpi wrote:Do you have any examples or links regarding temporal/non-temporal memory? That's new terminology to me.

I don't have anything specific, aside from Intel's and AMD's architecture manuals. As far as I know, the first non-temporal instructions were introduced with SSE, so that might help you find more information.

davidpi wrote:(AFAIK, CPU caches now or in the 90s aren't able to cache all of installed memory, typically they only cache < 1MB nowadays, and only a handful of kb in the 90s)

I'm not referring to the size of the cache, but the range of memory that can be made cacheable. Some obsolete hardware had limitations on that. For example, you might install 20MB of RAM, but caching is only enabled for the first 16MB. (Too bad for you if the OS loaded your program somewhere in that uncacheable 4MB!)

davidpi wrote:My goal is to speed up the CPU by pointing the stack address to the only area of cached RAM (cached without write-back). Then the stack, in theory, will be in the CPU's 'fast' RAM, not the system's 'slow' RAM. Whether that will work, or will benefit, in the way I surmise remains to be seen, but I believe it is worth a shot. If not, maybe I will have learned something, or maybe I will figure out a different way, or both (or neither, of course).

With nothing but the stack in the cache, performance will be terribly slow since each instruction must be fetched from slow RAM every time it executes. If you decide to do it anyway, I'd like to see the benchmarks.

davidpi wrote:If it looks like it might be necessary there could be a safety interrupt, designed to kill processes that were not yielding control.

And then you have this interrupt force a yield instead of killing the process... whoops, you've just invented preemption.

0b1 · Post by **0b1** » Fri Jul 06, 2018 12:37 pm

Hi Korona,

Thanks for your response. I will read up on the MESI protocol, even if I do feel a bit trolled.

Makes me want to wax philosophical

The top of the stack is already cached even if you do not disable caching for anything else.

I can't find a reference in the Intel manuals. Do you have a page number? How much of the top of the stack? Does that use up Lx cache space or its own dedicated chip space?

I disagree on several fronts, including:

- Analysis Paralysis: Too much up front analysis before actually getting something written is one of the biggest barriers to project success I've seen. Over several decades. See Ken Schwaber, et al. I'd rather get something written THEN get it right with some fine tuning.
- Agile versus rigid. The assumption it must be right to start with front-loads a project with risk and tends the solution toward rigidity: it works right (if you are very careful and very lucky) but breaks when you try to change it. Software written quickly and subject to many changes is less rigid, more agile (if you control entropy with effective Patterns). Changeability begets changeability.
- Inclusivity versus snobbery. The presumption you have to know everything before you can begin is usually elitism, which is based on insecurity, like any other 'them and us' frame of mind. It creates an illusion of a high barrier to entry, discourages newbies (I am not one, except to this forum) and creates an in-crowd mentality. More about feeling good about how great one's own knowledge is than about creating great software.
- Pioneering versus Tradition. Not entirely pioneering, as OSDev is the 'shoulders of giants' on which I will be standing if I succeed. Regardless, I want to go into this deliberately naive (as much as possible and still be able to proceed), throwing assumptions and traditions out of the window. I am not afraid of mistakes; they are merely learning opportunities with an undeserved bad rep. I want to assume I can do it better even if I learn otherwise along the way because the opposite is a self-fulfilling prophecy. Even if I turn out to be right only 1% of the time it's worth it. Also, what if there IS a simpler solution? Experience tells me that dogmatically following in others' footsteps is a sure way to pass it by unnoticed: their assumptions become yours.

Anyway. SMP is next on my list. Back to cacheing when I next run into trouble in that area.

Korona · Post by **Korona** » Sat Jul 07, 2018 1:59 am

Regarding elitism: I do not mind if anyone tries out new things, sees if they work, keeps ideas that work and discards ideas that don't work. However, that is not what you did. You posted a big announcement claiming things like

[...] my estimation is by ditching some old-and-slow legacy best practices, a 20-fold improvement in performance. That's 20 times as many virtual guests per host - a huge cost saving for hosting providers.

even though you clearly are not an expert in the area that you're talking about. If you make big claims you need the data or the theoretical justification to back them up - not just some anecdotal argument. I'm just refuting your baseless claims. If anyone came to this thread with an elitist attitude, it is the OP: You claim that you are able to see a brand new and revolutionary idea that thousands of others were unable to discover.

Regarding agility: In contrast to many others on this forum, you have not demonstrated a running example of your ideas. All you have is a claim that "I'm going to make the best OS ever". I actually have implemented cache control in my OS which everyone can actually download and run from Github. I'm telling you that your ideas do not work and you still won't listen.

I'm not trying to be rude here (nor do I want to discourage you from doing OS dev or posting on this forum) but be careful with your claims and also with your meta discussion ("I'm doing something revolutionary; everyone else just copies their predecessors").

Octocontrabass · Post by **Octocontrabass** » Sat Jul 07, 2018 5:26 am

davidpi wrote:
Korona wrote:The top of the stack is already cached even if you do not disable caching for anything else.
I can't find a reference in the Intel manuals. Do you have a page number? How much of the top of the stack? Does that use up Lx cache space or its own dedicated chip space?

Architecturally, special treatment for the stack is implementation-dependent.

As far as implementations go, all x86 CPUs that I know of treat the stack the same as any other data. Since the stack is ordinary data, it shares the caches with other data, and the cache will fill and evict parts of the stack using the same criteria as any other data. Since the top of the stack is used frequently, it will stay in the cache. How much depends on which parts you use frequently, and whether you use those parts more frequently than other non-stack data. If you use the top of the stack as often as you claim, then the only thing that can cause it to be evicted is cache pollution - something you already should be trying to avoid anyway.
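If it helps to see why frequent use keeps a line resident, here is a toy model of a single cache set with true LRU replacement. Real x86 caches are set-associative with implementation-specific (often pseudo-LRU) policies, so treat this purely as an illustration. `lru_set`, `lru_access` and the 4-way size are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define WAYS 4  /* hypothetical: one 4-way set */

typedef struct {
    unsigned long tag[WAYS];       /* which line each way holds */
    unsigned long last_use[WAYS];  /* logical time of the last access */
    int           valid[WAYS];
    unsigned long clock;           /* incremented on every access */
} lru_set;

/* Access one line; returns 1 on a hit, 0 on a miss (after filling it in). */
int lru_access(lru_set *s, unsigned long line)
{
    assert(s != NULL);
    s->clock++;

    for (size_t i = 0; i < WAYS; i++) {
        if (s->valid[i] && s->tag[i] == line) {
            s->last_use[i] = s->clock;
            return 1;
        }
    }

    /* Miss: take the first empty way, else evict the least recently used. */
    size_t victim = 0;
    for (size_t i = 0; i < WAYS; i++) {
        if (!s->valid[i]) { victim = i; break; }
        if (s->last_use[i] < s->last_use[victim]) victim = i;
    }
    s->valid[victim]    = 1;
    s->tag[victim]      = line;
    s->last_use[victim] = s->clock;
    return 0;
}
```

If you touch one "stack" line between every access of a long one-time scan, the stack line misses once and then hits every time, because it is never the least recently used entry. If you touch it once and then stream four or more other lines through the same set before coming back, it has been evicted. That second case is the cache pollution mentioned above.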

Korona · Post by **Korona** » Sat Jul 07, 2018 5:46 am

There is another thing that one should be aware of on x86: Accessing uncached memory disables speculation (see table 11-2 in the Intel manuals). All memory accesses will become serializing, effectively disabling half of the microarchitectural advances of the last 25 years or so.

OSDev.org
