Solar wrote:I do this with one laughing and one crying eye, as I have a strong déjà vu regarding my own, long-since defunct OS project, but I'll ask you the killer question:
How does your system of streams-in-streams interact with the outside world, which would still be using files in folders?
You said this would be for the computing machine, which gets tossed tasks from your generic desktop. Your desktop doesn't speak "streams in streams", and your computing machine doesn't speak "files in folders". Where do you make the transition, how high will the cost of the transition be, and why go to all the trouble?
Firstly, I didn't bring up file systems. File systems only came into the discussion because people can't seem to comprehend anybody wanting to write a kernel that doesn't involve users, files, GUIs, and all the associated crap. I am not interested in userland, and I never said that what I'm working on would support any sort of file system whatsoever. Is that clear? As for compatibility, you're quite right, which is why I said "if I could travel back in time and change the computing world ..." I can't, and we're stuck with the file system metaphor as it stands, which is fine by me.
Got it? Good. Right, we can move on.
Solar wrote:Yes, that's part of the exokernel design.
Fair enough; I was under the impression that an exokernel protected client kernels from one another. If that isn't the case, then I guess this project is an exokernel of sorts. It doesn't make any difference ... tomato, tomahto.
Solar wrote:Also part of the exokernel design. 1:1 mapping of HD and RAM is not necessary to achieve this.
Why is there a 1:1 mapping to RAM? Where did I say this exactly? The idea is to construct a mapping to the virtual address space, not physical RAM. One simple persistent address space for everything.
One address space to rule them
One address space to find them
One address space to bring them all and in the darkness bind them!
Solar wrote:Excuse me, but you don't seem to be very consistent in what you're saying.
I'm being perfectly consistent. What I'm describing is:
kernel (layer 0)
+ (layer 0.0) CPU architecture (page tables, context switching, etc.)
+ (layer 0.1) CPU <-> machine bridge (IRQs, bus enumerators, etc.)
+ (layer 0.2) CPU <-> CPU bridge (IPI, TLB shootdown, synchronisation, etc.)
+ (layer 0.3) Device layer (IDE, Ethernet, etc.)
+ (layer 0.4) Virtual address space <-> Device layer translation
+ (layer 0.5) Machine monitor
application (layer 1)
+ (layer 1.0) ... ?
+ (layer 1.1) profit!
Okay, now we have reference points. Layers 0.0 to 0.3 are pretty standard. The machine monitor (layer 0.5) allows a coordinating node to control a compute node: threads can be started and stopped, and the virtual address space can be read and written. So far, not particularly interesting.
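For illustration, a minimal sketch of what the layer 0.5 interface might boil down to (all of these names are hypothetical, not from the actual project):

Code: Select all
#include <stdint.h>
#include <stddef.h>

typedef uint32_t thread_id_t;

/* Coordinating node -> compute node control surface (hypothetical names). */
int monitor_thread_start(thread_id_t tid, uint64_t entry);          /* begin executing at entry */
int monitor_thread_stop(thread_id_t tid);                           /* suspend the thread */
int monitor_vas_read(uint64_t vaddr, void *buf, size_t len);        /* peek the address space */
int monitor_vas_write(uint64_t vaddr, const void *buf, size_t len); /* poke the address space */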
What I'm interested in is what goes on at layer 0.4, where there is a lot of choice. What I want to figure out is "what is the minimum amount of complexity required to allow for persistence and possibly consistency in the case of a power failure?" And beyond that, "how can I make this go fast?"
Note that consistency and persistence are not one and the same. It would be perfectly valid to construct a system that does not ensure consistency in the event of a power failure, but is persistent in the case of a graceful shutdown (where any disk blocks cached in RAM can be flushed to disk.) This is achievable with a simple 1:1 mapping from disk blocks to the virtual address space, so this is where I plan to start.
A concrete example would be a 1TB disk mapped at 0x10000000000 - 0x1FFFFFFFFFF. When a page fault occurs within this range ((address - 0x10000000000) shr 9) tells us where to find the appropriate sectors on the disk. The read is then queued and the thread is blocked. It's not exactly rocket science, which is precisely the point: it's simple, so it can be made to go very fast.
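To make that concrete, here is a minimal sketch of the fault path, assuming 512-byte sectors, 4 KiB pages, and hypothetical queue_disk_read()/block_current_thread() helpers:

Code: Select all
#include <stdint.h>

#define DISK_BASE 0x10000000000ULL  /* start of the 1TB disk window */
#define DISK_END  0x20000000000ULL  /* exclusive end: DISK_BASE + 1TB */

void queue_disk_read(uint64_t lba, unsigned sectors, uint64_t vaddr); /* assumed */
void block_current_thread(void);                                      /* assumed */

void page_fault_handler(uint64_t addr)
{
    if (addr >= DISK_BASE && addr < DISK_END)
    {
        uint64_t off = (addr - DISK_BASE) & ~0xFFFULL; /* align to the 4 KiB page */
        uint64_t lba = off >> 9;    /* first of the 8 sectors backing this page */
        queue_disk_read(lba, 8, DISK_BASE + off);
        block_current_thread();     /* resume once the read completes */
    }
}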
The first advantage is that everything happens at ring 0, so the processor doesn't **** around switching privileges when a page fault occurs (+1). Then, several instructions later, we know which disk block we are interested in (+2). The algorithm I'm looking at for queueing I/O is very simple: we just keep a binary tree of queued reads/writes indexed by the virtual address of the page that caused the fault.
The advantage here is that the I/O layer can walk left to right and then right to left through the I/O request tree, picking off reads/writes in order, which means the corresponding reads/writes to the disk will happen in block order as well (because the mapping is linear.) Having the disk head traverse in this fashion is a Good Thing if we want to maximise throughput (+3).
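A minimal sketch of that sweep, assuming an ordered tree of pending requests keyed on vaddr, with hypothetical tree_min()/tree_max()/tree_next()/tree_prev()/tree_remove() operations:

Code: Select all
#include <stdint.h>
#include <stddef.h>

struct io_request { uint64_t vaddr; /* ... buffer, sector count, etc. */ };

/* Hypothetical ordered-tree operations, keyed on vaddr. */
struct io_request *tree_min(void);
struct io_request *tree_max(void);
struct io_request *tree_next(struct io_request *r); /* in-order successor */
struct io_request *tree_prev(struct io_request *r); /* in-order predecessor */
void tree_remove(struct io_request *r);
void submit_to_disk(struct io_request *r);

void io_sweep(void)
{
    int ascending = 1;
    struct io_request *req = tree_min();

    while (req)
    {
        /* Find our neighbour before removing the current node. */
        struct io_request *next = ascending ? tree_next(req) : tree_prev(req);
        submit_to_disk(req);    /* vaddr order == block order, so the head sweeps */
        tree_remove(req);
        if (!next)              /* reached one end of the tree: reverse direction */
        {
            ascending = !ascending;
            next = ascending ? tree_min() : tree_max();
        }
        req = next;
    }
}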
Improving performance will also require a good read-ahead prediction algorithm. One approach is to set aside a region of the disk to keep read-ahead hints. We can track the order in which page faults occur and collect statistics. Building a small Markov (well, Markov-ish) model for each disk block allows us to predict (hopefully) the order in which future reads will occur. For each disk block we store something analogous to:
Code: Select all
#include <stdint.h>

typedef uint64_t block_addr_t;  /* assumed width for a disk block address */

#define MODEL_SIZE 3

struct predict_t
{
    block_addr_t prev, next;    /* preceding block and predicted next block */
    unsigned int count;         /* confidence counter for this prediction */
};

struct model_t
{
    struct predict_t predict[MODEL_SIZE];
};
The system keeps track of the block addresses for the last two page faults that have occurred: prev0 (the most recent) and prev1 (the second most recent). Whenever a page fault occurs, we can search the model for block[prev0] and determine whether we had a prediction hit (for some n < MODEL_SIZE: predict[n].next == current && predict[n].prev == prev1). If a hit occurs, we increment the appropriate count. If a hit does not occur, we evict the predict[] entry with the lowest count and replace it with a prediction matching what has occurred, with a count of 1. We also need to make the counts decay for any predictions where prev == prev1 but next != current (simply dividing them by two is one approach); this stops a prediction miss from being repeated too often.
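A minimal sketch of that update step, building on the model_t definition above and assuming a hypothetical model_for_block() lookup that returns the persistent model for a given block:

Code: Select all
static block_addr_t prev0, prev1;   /* the last two faulting blocks */

struct model_t *model_for_block(block_addr_t b);   /* assumed lookup */

void model_update(block_addr_t current)
{
    struct model_t *m = model_for_block(prev0);
    int hit = -1, victim = 0;

    for (int n = 0; n < MODEL_SIZE; n++)
    {
        struct predict_t *p = &m->predict[n];
        if (p->prev == prev1 && p->next == current)
            hit = n;                        /* prediction confirmed */
        else if (p->prev == prev1)
            p->count /= 2;                  /* decay the competing prediction */
        if (p->count < m->predict[victim].count)
            victim = n;                     /* remember the weakest entry */
    }

    if (hit >= 0)
        m->predict[hit].count++;
    else                                    /* miss: evict the weakest entry */
    {
        m->predict[victim].prev  = prev1;
        m->predict[victim].next  = current;
        m->predict[victim].count = 1;
    }

    prev1 = prev0;
    prev0 = current;
}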
We can then follow a prediction chain from prev1 and prev0, looking for the blocks most likely to be read in the near future. I strongly suspect that even very small values of MODEL_SIZE (<= 3) would give dramatic performance improvements for frequently read data. The best part is that the model is persistent, so the performance benefit survives a restart.
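Following the chain might look something like this (again a sketch; depth and the out[] buffer are illustrative, and model_for_block() is the same assumed lookup as above):

Code: Select all
/* Walk the most-confident predictions forward from (p1, p0); returns how
 * many speculative block addresses were written to out[]. */
int predict_chain(block_addr_t p1, block_addr_t p0,
                  block_addr_t *out, int depth)
{
    int n = 0;
    while (n < depth)
    {
        struct model_t *m = model_for_block(p0);
        struct predict_t *best = NULL;

        for (int i = 0; i < MODEL_SIZE; i++)
            if (m->predict[i].prev == p1 &&
                (best == NULL || m->predict[i].count > best->count))
                best = &m->predict[i];

        if (best == NULL || best->count == 0)
            break;                          /* no confident prediction left */
        out[n++] = best->next;
        p1 = p0;
        p0 = best->next;
    }
    return n;
}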
We can also use prev0 and prev1 (and possibly prev2, prev3, etc.) to predict that (prev0 + (prev0 - prev1)) is likely to be accessed in the future. Combining this with the current location of the head (the current read/write position and direction in the I/O tree) allows us to construct a cost estimate and decide whether it's worth attempting a speculative read-ahead.
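As a sketch, the stride guess and a crude seek-cost gate might look like this (READAHEAD_SEEK_BUDGET is an assumed tunable, not a real constant from the project):

Code: Select all
#define READAHEAD_SEEK_BUDGET 2048  /* assumed: max blocks the head may detour */

/* Returns nonzero if the linear-stride guess looks cheap enough to prefetch. */
int worth_readahead(block_addr_t p1, block_addr_t p0,
                    block_addr_t head, block_addr_t *guess)
{
    *guess = p0 + (p0 - p1);        /* prev0 + (prev0 - prev1) */
    block_addr_t seek = (*guess > head) ? *guess - head : head - *guess;
    return seek < READAHEAD_SEEK_BUDGET;
}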
Nothing about this is revolutionary; my assertion is simply that this can be made to go very fast if one wishes to write a small kernel that does nothing else. I have thought about this quite a lot ... there are many, many other possible optimisations.
-~ Beware the Turing tarpit, in which everything is possible but nothing of interest is easy ~-