
Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 8:13 am
by childOfTechno
bluemoon wrote:You may have overlooked gerryg400's comment.
What's wrong with mmap? What you describe is an extreme case of mapping a huge file on disk which itself occupies the whole disk space.
It is a subset of traditional OS functionality and thus I think you have reduced flexibility.
Well there's nothing wrong with mmap(), it's just nothing that I'm interested in. Arranging things the way I'm describing has one potentially major advantage, sheer bandwidth (there's only one layer between the address space mapping and the disk.)

The traditional directory/file hierarchy is an inflexibility! I'm of the view that we really need to break out of the file mentality. Files-as-a-stream-of-character-data was originally designed as an abstraction like a library of tapes, hence seek(). Once we make the switch to solid state drives, everything becomes truly "random access" (or very nearly so) and the file-as-a-tape metaphor starts to look very dated.

It's not like I'm trying to create a new OS that will take over the world, just playing around to see if something works, and then see how well it can be optimised.

Really, why does everybody on OSDEV want to reinvent the wheel? We have enough operating systems that implement UNIX semantics already :)

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 8:15 am
by Solar
childOfTechno wrote:There aren't files, just address space(s). You wouldn't need to store anything as a file, just have a structure stored somewhere:

Code:

struct block_t
{
    void* base_pointer;
    size_t length;
};
You earlier talked about power-loss restartability. That would require that program state - "cached" in RAM - be written to disk at regular, well-defined intervals.

Again, I emphasize: This has been done before. It's tricky, it requires careful calibration between losing the last X ticks of data and killing performance because you are writing transient (essentially useless) information to disk, and it doesn't come for free.

I don't say it's a bad thing to do, because it's hard to say such things for purpose-built systems without knowing the exact purpose. I just want you to be clear that your idea is in no way novel, and that there is previous effort that should be studied.

As for the line that "files are so yesterday", I've heard many wanky ideas of what to replace them with, but none of them really made immediate, intuitive sense...

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 8:19 am
by JAAman
childOfTechno wrote:In a sense, traditional file management leaves part of the control over swapping data into and out of RAM up to the application. With entire disks mapped into the virtual address space only the kernel gets to decide what to swap and when.
no, while programs can choose to swap manually, they don't have to, and ordinarily the OS alone has final say on what is or is not swapped
childOfTechno wrote:Physical RAM limits the working set of a program running on a traditional OS anyway, and if a program exceeds this the excess gets swapped out. This means that a program operating on a HD movie with a very large working set (larger than physical RAM) ends up loading data into RAM, swapping it back and forth and then finally writing the result out to disk.
the "large HD movie with a very large working set (larger than physical RAM)" could be loaded by the application using MMF, and the OS will swap what is needed as it's needed, without any action on the part of the application (however, the application can, optionally, change the viewport, which is basically a fancy way of giving windows a hint as to what data is, or is not, needed, which would improve performance significantly if the file(s) is/are larger than physical memory)

the way i see it, you are trying to implement MMF, make it mandatory (it can be useful, but it isn't always), and make it apply to the entire disk without any way for programs to tell the OS which parts are needed... (which of course would be really bad for performance)

the thing is, the OS cannot know what data is most important to the application, nor can it ever know what the application is going to do next, this is especially important with your example of a movie editor -- most of the time the OS will use an MRU list to determine what to swap out, but a movie editor (or player) is only going to use each piece once as it passes through the movie, the OS will require lots of paging, and significantly slow down, since it doesn't know the application is done with the old data (and not going to use it again, and therefore it should be discarded) and doesn't know that it will need the new data until it is accessed -- this will automatically result in absolute-worst-case performance, and cannot be resolved properly unless the application is allowed to tell the OS what it needs/doesn't need in some way (which of course violates your "swapping should be determined only by the OS without interference from the application")
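
for what it's worth, on a POSIX system this kind of hinting is usually done with madvise() -- a minimal sketch (the file name and chunk size are made up, it's just to show the idea):

Code:

/* sketch: stream through a large mapped file while telling the OS which
   parts we are done with -- file name and sizes are made up */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (64ULL * 1024 * 1024)   /* work through the mapping 64 MiB at a time */

int main(void)
{
    int fd = open("movie.raw", O_RDONLY);            /* hypothetical input file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    madvise(map, st.st_size, MADV_SEQUENTIAL);       /* hint: front-to-back access */

    for (off_t off = 0; off < st.st_size; off += CHUNK) {
        size_t len = (st.st_size - off < (off_t)CHUNK) ? (size_t)(st.st_size - off) : CHUNK;
        /* ... process map[off .. off + len) ... */
        madvise(map + off, len, MADV_DONTNEED);      /* hint: done with this chunk, drop it */
    }

    munmap(map, st.st_size);
    close(fd);
    return 0;
}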


WOW, in the time i took to write this post, there were 5 other posts...

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 8:44 am
by Brendan
Hi,
childOfTechno wrote:Some good points have been raised. I guess I need to emphasise that I have little interest in implementing a traditional OS (files, processes, IPC, etc) because then I may as well just use an existing traditional OS. The purpose of mapping an entire drive into virtual address space is to reduce everything to "one big Von Neumann box," then run a program on it. Programs don't allocate memory, they just use it, supported by the backing store (a hard disk.)
bubach wrote:Fail to see the point. You would still be restricted to physical memory so only a fraction of your 500gb-1tb hd would be mapped at a time. Lots of page faults and disk access involved either way, especially when working with larger files such as full hd movies.

or did I miss something?
There aren't files, just address space(s). You wouldn't need to store anything as a file, just have a structure stored somewhere:

Code:

struct block_t
{
    void* base_pointer;
    size_t length;
};
And to find various structures, a program has to do a linear search of the entire drive?

To speed that up you could give each structure some sort of key, and create some sort of index so you can search for the structure corresponding to a specific key quickly. If you're clever you could even have keys in some sort of human readable format (e.g. use a text string as the key).

Of course the keys would have to be unique, and even with text strings as keys end users will have problems remembering what the structure they want is called (imagine a user with 1000 "mp3 structures"). To fix that nightmare (and to simplify searching for the key) you'd probably want a hierarchy of keys. That way related structures could be given related keys, and you could even search for a list of all keys that are related.

For example, all structures that belong to a specific user called "Fred" could be given a key that has a "/home/Fred" prefix, and that user could search for all their "mp3 structures" by doing something like "list /home/Fred/mp3/*".

Yeah - that'd be a lot better than a file system... ;)


Cheers,

Brendan

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 8:47 am
by Solar
Next thing coming up: having all data in a searchable database instead of a hierarchical structure. 8)

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 9:12 am
by Brendan
Hi,
childOfTechno wrote:The traditional directory/file hierarchy is an inflexibility! I'm of the view that we really need to break out of the file mentality. Files-as-a-stream-of-character-data was originally designed as an abstraction like a library of tapes, hence seek(). Once we make the switch to solid state drives, everything becomes truly "random access" (or very nearly so) and the file-as-a-tape metaphor starts to look very dated.
Files have never been a stream of character data. They've always been a "random access array of bytes". For convenience, for some/most languages there's a library or something that makes the "random access array of bytes" look like a stream of characters, but that's only an illusion. If you know C, then you should know the difference between "open()" and "fopen()", between "read()" and "fread()", etc.
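
For example (just a sketch, with a made-up file name and offset) - both of these fetch the same 16 bytes; only the second goes through stdio's buffering and file-position machinery:

Code:

/* sketch: the same random access, once through the raw "array of bytes"
   interface and once through stdio's stream-like wrapper (file name made up) */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[16];

    /* raw: grab 16 bytes at offset 4096, no notion of a "current position" */
    int fd = open("data.bin", O_RDONLY);
    pread(fd, buf, sizeof buf, 4096);
    close(fd);

    /* stdio: a file position and a user-space buffer make it *feel* like a tape */
    FILE *fp = fopen("data.bin", "rb");
    fseek(fp, 4096, SEEK_SET);
    fread(buf, 1, sizeof buf, fp);
    fclose(fp);

    return 0;
}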

Solar wrote:Again, I emphasize: This has been done before. It's tricky, it requires careful calibration between losing the last X ticks of data and killing performance because you are writing transient (essentially useless) information to disk, and it doesn't come for free.
According to the wikipedia page about "single-level store", the idea dates back to the late 1970s.

You'd need all applications to have some sort of "checkpoints", where the OS writes modified data to disk at the checkpoint to prevent inconsistent disk state (which would mean that you'd be constantly writing to disk when the OS is under load, which would completely trash performance and probably destroy an SSD drive within a week due to excessive writes).

Despite this, it's the only way to ensure that if the OS crashes everything is correctly restored to the point just before the crash, so that users can witness the awesomeness of an endless "crash -> reboot -> restore" loop. ;)


Cheers,

Brendan

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 9:27 am
by childOfTechno
Solar wrote:You earlier talked about power-loss restartability. That would require that program state - "cached" in RAM - be written to disk at regular, well-defined intervals.

Again, I emphasize: This has been done before. It's tricky, it requires careful calibration between losing the last X ticks of data and killing performance because you are writing transient (essentially useless) information to disk, and it doesn't come for free.
Alas, nothing in life is free :) You raise an interesting point though, keeping the disk consistent is indeed tricky. I've thought about various ways to do it (as have other people I'm sure) and decided that I'm just going to get that basics working first. Worrying about it now (before I have a decent IDE driver) is premature :)

One optimisation did spring to mind: a fast memory zero function. Whenever it is determined that an entire page has been zeroed, it can simply be marked as zero in the page table and in a "zeroed page" bitmap on the disk; the physical RAM can then be returned to the page pool. A well written application would zero pages that it knows won't be needed again, so the kernel can avoid sending unnecessary data to the disk when the system checkpoints.
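
Something like this is what I have in mind (purely a sketch -- the flat arrays standing in for the page table, the on-disk bitmap and the frame pool are placeholders, not the real structures):

Code:

/* sketch of the "zeroed page" bookkeeping -- sizes, names and the flat arrays
   are placeholders for whatever the real kernel structures end up being */
#include <stdint.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 1024u                      /* toy address space: 1024 pages */

static uint64_t page_table[NUM_PAGES];       /* 0 = not resident */
static uint8_t  zero_bitmap[NUM_PAGES / 8];  /* mirrors the on-disk "zeroed page" bitmap */
static uint64_t free_frames[NUM_PAGES];      /* stack of frames returned to the pool */
static unsigned free_top;

static int page_is_zero(const void *page)
{
    const uint64_t *p = page;
    for (unsigned i = 0; i < PAGE_SIZE / sizeof *p; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* called when we suspect page `vpn`, currently backed by `frame`, has been zeroed */
void try_reclaim_zero_page(unsigned vpn, void *frame)
{
    if (!page_is_zero(frame))
        return;

    page_table[vpn] = 0;                                  /* unmap: the next fault sees a zero page */
    zero_bitmap[vpn / 8] |= (uint8_t)(1u << (vpn % 8));   /* remember on "disk" that it is zero */
    free_frames[free_top++] = (uint64_t)(uintptr_t)frame; /* frame goes back to the pool */
    /* at checkpoint time, pages marked in zero_bitmap are skipped entirely */
}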
Solar wrote:I don't say it's a bad thing to do, because it's hard to say such things for purpose-built systems without knowing the exact purpose. I just want you to be clear that your idea is in no way novel, and that there is previous effort that should be studied.
... as it is for most things. At least I'm not trying to reimplement UNIX as many people on OSDEV seem to want to do :) My major aim is to have a spare box (that currently does nothing) set up as a compute server so I can send it a slab of x64 code from my workstation and say "here, compute this and call back when you're done."
Solar wrote:As for the line that "files are so yesterday", I've heard many wanky ideas of what to replace them with, but none of them really made immediate, intuitive sense...
Agreed, my comment was just an observation that, because the file/folder metaphor is pervasive, people have a hard time conceiving of a system that works any other way (as evidenced by the number of file related comments on this thread, even though the system I'm talking about has utterly nothing to do with files.) It's important not to get trapped by our metaphors!

I'd like to see a filesystem that eliminates the (in my view unnecessary) distinction between files and folders. I'd just have named streams, which can contain named sub streams which can contain named sub sub streams, etc. The equivalent of a folder would just be a stream that is zero bytes long and contains sub streams. (unfortunately, traditional FS compatibility would be a PITA.) (but this is a different topic entirely.)
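
Just to sketch the idea (field names invented, nothing more than an illustration):

Code:

/* rough sketch of "streams all the way down" -- every stream has its own bytes
   AND can hold named sub-streams; a folder is just the zero-length special case */
#include <stddef.h>

struct stream
{
    char           *name;          /* name within the parent stream */
    void           *data;          /* this stream's own bytes (may be empty) */
    size_t          length;        /* zero length + children == what a folder is today */
    struct stream **children;      /* named sub-streams, which may have their own sub-streams */
    size_t          num_children;
};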
JAAman wrote:the way i see it, you are trying to implement MMF, make it mandatory (it can be useful, but it isn't always), and make it apply to the entire disk without any way for programs to tell the OS which parts are needed... (which of course would be really bad for performance)
That's exactly what I'm doing, and it isn't necessarily bad for performance.
JAAman wrote:the thing is, the OS cannot know what data is most important to the application, nor can it ever know what the application is going to do next, this is especially important with your example of a movie editor -- most of the time the OS will use MRU list to determine what to swap out, but a movie editor (or player) is only going to use each piece once as it passes through the movie, the OS will require lots of paging, and significantly slow down, since it doesn't know the application is done with the old data (and not going to use it again, and therefore it should be discarded) and doesn't know that it will need the new data until it is accessed -- the will automatically result in absolute-worst-case performance, and cannot be resolved properly unless the application is allowed to tell the OS what it needs/doesn't need in some way (which of course violates your "swapping should be determined only by the OS without interference from the application")
The movie example is rather limiting, because the data is only ever read (it never gets swapped out; the kernel can just discard the data when it falls out of the bottom of the MFU/MRU list.) Predicting which disk blocks will be needed in advance is a similar problem to branch prediction, just on a larger scale.

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 9:45 am
by childOfTechno
Brendan wrote:And to find various structures, a program has to do a linear search of the entire drive?

To speed that up you could give each structure some sort of key, and create some sort of index so you can search for the structure corresponding to a specific key quickly. If you're clever you could even have keys in some sort of human readable format (e.g. use a text string as the key).

Of course the keys would have to be unique, and even with text strings as keys end users will have problems remembering what the structure they want is called (imagine a user with 1000 "mp3 structures"). To fix that nightmare (and to simplify searching for the key) you'd probably want a hierarchy of keys. That way related structures could be given related keys, and you could even search for a list of all keys that are related.

For example, all structures that belong to a specific user called "Fred" could be given a key that has a "/home/Fred" prefix, and that user could search for all their "mp3 structures" by doing something like "list /home/Fred/mp3/*".

Yeah - that'd be a lot better than a file system...
Yes, I'm going to implement a high performance compute platform and then resort to a linear search to find something ...

Seriously, have you read anything posted so far?

I don't care how the application deals with the address space provided, I'm just interested in providing it. There are no users, there are no folder/file/database abstractions provided by the kernel, just one flat unprotected address space.
Brendan wrote:Files have never been a stream of character data. They've always been a "random access array of bytes". For convenience, for some/most languages there's a library or something that makes the "random access array of bytes" look like a stream of characters, but that's only an illusion. If you know C, then you should know the difference between "open()" and "fopen()", between "read()" and "fread()", etc.
fopen(), fread(), fwrite(), fseek() buffer IO at the process level, whereas open(), read(), write(), seek() do not ... the metaphor is that of a tape in any case (even if seeks are near instant.)
Brendan wrote:You'd need all applications to have some sort of "checkpoints", where the OS writes modified data to disk at the checkpoint to prevent inconsistent disk state (which would mean that you'd be constantly writing to disk when the OS is under load, which would completely trash performance and probably destroy an SSD drive within a week due to excessive writes).
NO? really? How often you checkpoint depends on how much data you are prepared to lose ... you can do it once a week provided you don't run out of physical memory (or potentially a journal partition.)

As for SSD, have you heard of wear levelling?

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 10:13 am
by Solar
*sigh* :)
childOfTechno wrote:I've thought about various ways to do it (as have other people I'm sure) and decided that I'm just going to get that basics working first. Worrying about it now (before I have a decent IDE driver) is premature :)
Just make sure that you don't commit to this "whole disk mapping" of yours on the driver level before you have really thought it through.
A well written application would zero pages that it knows won't be needed again...
Hell yea. Kill CPU caches in favor of checkpoint optimization. :D You're trying to approach the horse from the wrong side, I am afraid. Beware, you might get hurt. ;-)
At least I'm not trying to reimplement UNIX as many people on OSDEV seem to want to do :) My major aim is to have a spare box (that currently does nothing) set up as a compute server so I can send it a slab of x64 code from my workstation and say "here, compute this and call back when you're done."
Do you know Beowulf clustering?

Did you know you could pipe in and out of a SSH connection? 8)
Agreed, my comment was just an observation that, because the file/folder metaphor is pervasive, people have a hard time conceiving of a system that works any other way (as evidenced by the number of file related comments on this thread, even though the system I'm talking about has utterly nothing to do with files.) It's important not to get trapped by our metaphors!
The problem is that you haven't provided a convincing alternative yet, so we lack the alternative terminology. 8)
I'd just have named streams, which can contain named sub streams which can contain named sub sub streams, etc.
The equivalent of a folder would just be a stream that is zero bytes long and contains sub streams.
So, you'll have files / streams, and folders / zero-length streams that contain the files / streams... I'm missing the crucial distinction.

But then again, that might be me. I've been talked to about the cool new ways to organize your data for the last ten or twelve years, and never "got" what should be so cool about it. 8)

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 10:18 am
by Solar
childOfTechno wrote:I don't care how the application deals with the address space provided, I'm just interested in providing it. There are no users, there are no folder/file/database abstractions provided by the kernel, just one flat unprotected address space.
That's called an exokernel, something that has been discussed here repeatedly. We know what you're talking about. We're just frowning at your idea of mandatory 1:1 coupling of HD and RAM.
fopen(), fread(), fwrite(), fseek() buffer IO at the process level, whereas open(), read(), write(), seek() do not ...
Wrong. Check setvbuf() and _IONBF.
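
For example (a sketch, with a made-up file name):

Code:

/* sketch: turn stdio's buffering off entirely for one stream */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("data.bin", "rb");    /* made-up file name */
    if (!fp)
        return 1;

    setvbuf(fp, NULL, _IONBF, 0);          /* unbuffered: fread()/fwrite() go straight through */
    /* ... use fread()/fwrite() as usual, with no user-space buffer in between ... */

    fclose(fp);
    return 0;
}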
...the metaphor is that of a tape in any case (even if seeks are near instant.)
Please elaborate the difference between a sequentially-read file and a stream.
NO? really? How often you checkpoint depends on how much data you are prepared to lose ... you can do it once a week provided you don't run out of physical memory (or potentially a journal partition.)
That's called hibernation. If you want to protect against crashing - as you said earlier - your checkpoints have to be no less than every couple of seconds.
As for SSD, have you heard of wear levelling?
There's a difference between leveling some hundreds of write accesses, and leveling near-constant write accesses...

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 11:16 am
by childOfTechno
Solar wrote:That's called an exokernel, something that has been discussed here repeatedly. We know what you're talking about. We're just frowning at your idea of mandatory 1:1 coupling of HD and RAM.
I know what an exokernel is. The difference here is that everything just runs in ring 0 and can do whatever it likes. I'm just writing a low level driver to run a single task (with multiple threads) on a dedicated machine. Eventually it would be nice if the system could have graceful restart capability, but that's down the line.

The basic premises of my argument are:

1. If everything runs in ring 0 we don't pay a context switching penalty to handle an exception or interrupt.
2. If we just map a disk directly we don't have a layer of filesystem code slowing things down; the virtual address space subsystem is tightly integrated with the IO subsystem.

In my view, these two things will reduce latency significantly. Having a single unified address space will simplify application design, leading to increases in end-to-end performance.
Solar wrote:Wrong. Check setvbuf() and _IONBF.
I'm not sure what you're getting at here. I'm well aware that you can tweak the buffering behaviour of a libc FILE; it doesn't change the fact that libc is providing the buffering (and is in turn calling read()/write() to deliver the data to the kernel, which has its own buffers.) The kernel is oblivious to any buffering that a process does.
Solar wrote:That's called hibernation. If you want to protect against crashing - as you said earlier - your checkpoints have to be no less than every couple of seconds.
For certain compute intensive problems (the ones I'm interested in) it doesn't really matter if checkpoints don't occur very often.

If your application doesn't modify a page you don't need to flush it to disk. If the total accumulated changes don't exceed the amount of available physical RAM you don't need to flush anything. Of course if a crash happens all the accumulated changes are lost, but you can always redo the computation from the last checkpoint.
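
Roughly (a sketch only -- the flat pte[] array and write_block() are stand-ins for the real kernel structures, and a real implementation would also need a TLB shootdown after clearing the dirty bits):

Code:

/* checkpoint sketch: only pages the application actually dirtied get flushed */
#include <stdint.h>

#define PTE_PRESENT (1ull << 0)
#define PTE_DIRTY   (1ull << 6)       /* set by the CPU on the first write to the page */
#define FRAME_MASK  0x000FFFFFFFFFF000ull
#define NUM_PAGES   1024u

extern uint64_t pte[NUM_PAGES];                       /* toy page table */
extern void write_block(unsigned vpn, uint64_t phys); /* stand-in for the disk write path */

void checkpoint(void)
{
    for (unsigned vpn = 0; vpn < NUM_PAGES; vpn++) {
        if ((pte[vpn] & (PTE_PRESENT | PTE_DIRTY)) != (PTE_PRESENT | PTE_DIRTY))
            continue;                                 /* clean or not resident: nothing to flush */

        write_block(vpn, pte[vpn] & FRAME_MASK);      /* flush the modified page */
        pte[vpn] &= ~PTE_DIRTY;                       /* clean again until the next write */
    }
}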

I think my approach would work very well for problems where:

The input data sets are potentially very large.
The working set fits in physical RAM.
The output data also fits in physical RAM.
The process is not dependent on external inputs.

Such a task can run for a long time before a checkpoint is required.

Another class of problem that would be solved well by this approach would be server applications with large data sets that aren't modified very much. A good example would be a search engine: the data set is read much more often than it is written.
Solar wrote:Please elaborate the difference between a sequentially-read file and a stream.
None. Different word, same meaning; I was just using the word "stream" because the word "file" implies the usual behaviour of a filesystem: files and folders in folders. My ideal storage system would make no such distinction. (interestingly, HTTP works the way I'm describing, you can GET a collection and it will return a stream of bytes (usually HTML))
Solar wrote:So, you'll have files / streams, and folders / zero-length streams that contain the files / streams... I'm missing the crucial distinction.

But then again, that might be me. I've been talked to about the cool new ways to organize your data for the last ten or twelve years, and never "got" what should be so cool about it.
The distinction is that *any* stream can contain a substream, not just streams that are zero bytes long.

EDIT: The point I'm making is that, for the most part, the hierarchical storage metaphor is a good one; I would just make one tiny tweak. The advantage, IMO, is that you could handle metadata with substreams and a naming convention. For example:

friday.mp3
+--- info.txt
+--- image.png
some other song.mp3
+--- info.txt
+--- image.png -> /users/foo/music/albums/some album/image.png
+--- album -> /users/foo/music/albums/some album

or

my program.exe
+--- program data.dat

It's just slightly more flexible than a standard file system ... If I could go back in time and change the computing world ...

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 12:10 pm
by Solar
childOfTechno wrote:I know what an exokernel is. The difference here is that everything just runs in ring 0 and can do whatever it likes.
Yes, that's part of the exokernel design.
2. If we just map a disk directly we don't have a layer of filesystem code slowing things down...
Also part of the exokernel design. 1:1 mapping of HD and RAM is not necessary to achieve this.
Having a single unified address space will simplify application design, leading to increases in end-to-end performance.
End-to-end performance is not exactly part of the exokernel design, if you get the sarcasm tag. 8)
I'm well aware that you can tweak the buffering behaviour of a libc FILE; it doesn't change the fact that libc is providing the buffering (and is in turn calling read()/write() to deliver the data to the kernel, which has its own buffers.)
If the kernel does any buffering on read() / write() beyond what's necessary to make the driver work, it deserves to be shot (or rather, formatted off-disk).
For certain compute intensive problems (the ones I'm interested in) it doesn't really matter if checkpoints don't occur very often.
So you don't couple HD and RAM, you simply provide the opportunity to save the whole process space to disk now and then?

Excuse me, but you don't seem to be very consistent in what you're saying.
I think my approach would work very well for problems where:

The input data sets are potentially very large.
The working set fits in physical RAM.
The output data also fits in physical RAM.
The process is not dependent on external inputs.

Such a task can run for a long time before a checkpoint is required.
Might very well be. I just don't see where your approach would work better than others. If both working set and process data fit into physical RAM, whether you do a traditional load / save or do your 1:1 mapping wouldn't make much of a difference...
Solar wrote:Please elaborate the difference between a sequentially-read file and a stream.
None. Different word, same meaning; I was just using the word "stream" because the word "file" implies the usual behaviour of a filesystem: files and folders in folders. My ideal storage system would make no such distinction.
But yes, it does. Whether you have a zero-length stream containing other streams or a folder containing files is effectively the same thing.
The distinction is that *any* stream can contain a substream, not just streams that are zero bytes long.
So your directory could hold data as well... still don't see the advantage.
It's just slightly more flexible than a standard file system ... If I could go back in time and change the computing world ...
I do this with one laughing and one crying eye, as I have a strong deja-vu regarding my own, long-since defunct OS project, but I'll ask you the killer question:

How does your system of streams-in-streams interact with the outside world, which would still be using files in folders?

You said this would be for the computing machine, which is tossed tasks from your generic desktop. Your desktop doesn't speak "streams in streams", and your computing machine doesn't speak "files in folders". Where do you make the transition, how high will be the cost of the transition, and why go to all the trouble?

"I mean, what's the angle?" -- Marvin, R.E.D.

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 2:35 pm
by childOfTechno
Solar wrote:I do this with one laughing and one crying eye, as I have a strong deja-vu regarding my own, long-since defunct OS project, but I'll ask you the killer question:

How does your system of streams-in-streams interact with the outside world, which would still be using files in folders?

You said this would be for the computing machine, which is tossed tasks from your generic desktop. Your desktop doesn't speak "streams in streams", and your computing machine doesn't speak "files in folders". Where do you make the transition, how high will be the cost of the transition, and why go to all the trouble?
Firstly, I didn't bring up file systems. File systems only came into the discussion because people can't seem to comprehend anybody wanting to write a kernel that doesn't involve users, files, GUIs, and all the associated crap. I am not interested in userland, and never said that what I'm working on would support any sort of file system whatsoever. Is that clear? As for compatibility, you're quite right, which is why I said "if I could travel back in time and change the computing world ..." I can't, and we're stuck with the file system metaphor as it stands, which is fine by me.

Got it? Good. Right, we can move on.
Solar wrote:Yes, that's part of the exokernel design.
Fair enough, I was under the impression that an exokernel protected client kernels from one another. If that isn't the case, then I guess this project is an exokernel of sorts. It doesn't make any difference ... tomato tomato
Solar wrote:Also part of the exokernel design. 1:1 mapping of HD and RAM is not necessary to achieve this.
Why is there a 1:1 mapping to RAM? Where did I say this exactly? The idea is to construct a mapping to the virtual address space, not physical RAM. One simple persistent address space for everything.

One address space to rule them
One address space to find them
One address space to bring them all and in the darkness bind them!

Solar wrote:Excuse me, but you don't seem to be very consistent in what you're saying.
I'm being perfectly consistent. What I'm describing is:

kernel (layer 0)
+ (layer 0.0) CPU architecture (page tables, context switching, etc.)
+ (layer 0.1) CPU <-> machine bridge (IRQs, bus enumerators, etc.)
+ (layer 0.2) CPU <-> CPU bridge (IPI, TLB shootdown, synchronisation, etc.)
+ (layer 0.3) Device layer (IDE, Ethernet, etc)
+ (layer 0.4) Virtual address space <-> Device layer translation
+ (layer 0.5) Machine monitor
application (layer 1)
+ (layer 1.0) ... ?
+ (layer 1.1) profit!

Okay, now we have reference points. Layers 0.0 to 0.3 are pretty standard. The machine monitor (layer 0.5) allows a coordinating node to control a compute node, threads can be started/stopped and the virtual address space can be read/written, so far not particularly interesting.

What I'm interested in is what goes on at layer 0.4, where there is a lot of choice. What I want to figure out is "what is the minimum amount of complexity required to allow for persistence and possibly consistency in the case of a power failure?" And beyond that, "how can I make this go fast?"

Note that consistency and persistence are not one and the same. It would be perfectly valid to construct a system that does not ensure consistency in the event of a power failure, but is persistent in the case of a graceful shutdown (where any disk blocks cached in RAM can be flushed to disk.) This is achievable with a simple 1:1 mapping from disk blocks to the virtual address space, so this is where I plan to start.

A concrete example would be a 1TB disk mapped at 0x10000000000 - 0x1FFFFFFFFFF. When a page fault occurs within this range ((address - 0x10000000000) shr 9) tells us where to find the appropriate sectors on the disk. The read is then queued and the thread is blocked. It's not exactly rocket science, which is precisely the point: it's simple, so it can be made to go very fast.
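
In C the core of that fault path is just this (queue_disk_read() and block_current_thread() are placeholders for the real kernel hooks; the address arithmetic is the point):

Code:

/* sketch of the fault-to-sector arithmetic for a 1TB disk mapped at 0x10000000000 */
#include <stdint.h>

#define DISK_BASE    0x10000000000ull            /* window covers DISK_BASE .. DISK_BASE + 1TB - 1 */
#define DISK_SIZE    (1ull << 40)                /* 1TB */
#define SECTOR_SHIFT 9                           /* 512-byte sectors */
#define PAGE_SHIFT   12                          /* 4KiB pages -> 8 sectors per page */

/* hypothetical hooks into the rest of the kernel */
extern void queue_disk_read(uint64_t lba, unsigned sectors, uint64_t page_va);
extern void block_current_thread(void);

void disk_window_fault(uint64_t fault_va)
{
    if (fault_va < DISK_BASE || fault_va >= DISK_BASE + DISK_SIZE)
        return;                                          /* not in the disk window (sketch: ignore) */

    uint64_t page_va = fault_va & ~((1ull << PAGE_SHIFT) - 1);  /* align down to the page */
    uint64_t lba     = (page_va - DISK_BASE) >> SECTOR_SHIFT;   /* first sector of that page */

    queue_disk_read(lba, 1u << (PAGE_SHIFT - SECTOR_SHIFT), page_va);  /* 8 sectors */
    block_current_thread();                              /* resume once the read completes */
}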

The first advantage is that everything happens at ring 0, so the processor doesn't **** around switching privileges when a page fault occurs (+1). Then, several instructions later we know what disk block we are interested in (+2). The algorithm I'm looking at for queueing I/O is very simple: we just keep a binary tree of queued reads/writes indexed by the virtual address of the page that caused the fault.

The advantage here is that the I/O layer can walk left to right and then right to left through the I/O request tree picking off reads/writes in order, which means the corresponding reads/writes to the disk will happen in block order as well (because the mapping is linear.) Having the disk head traverse in this fashion is a Good Thing if we want to maximise throughput (+3)
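
A bare-bones version of the request tree (unbalanced, single-threaded, names made up -- the real thing obviously needs locking and a balanced structure):

Code:

/* sketch: I/O requests kept in a BST keyed on the faulting page address, so an
   in-order walk services them in ascending block order (the mapping is linear) */
#include <stdint.h>
#include <stdlib.h>

struct io_req
{
    uint64_t       va;          /* faulting page address == sort key == disk order */
    int            is_write;
    struct io_req *left, *right;
};

static struct io_req *io_tree;

void io_queue(struct io_req *req)
{
    struct io_req **link = &io_tree;
    while (*link)
        link = (req->va < (*link)->va) ? &(*link)->left : &(*link)->right;
    req->left = req->right = NULL;
    *link = req;                /* insertion keeps the tree ordered by address */
}

/* one upward sweep: service everything queued so far, lowest address first */
void io_sweep_up(struct io_req *node)
{
    if (!node)
        return;
    io_sweep_up(node->left);
    /* ... issue node's read or write to the disk here ... */
    struct io_req *right = node->right;
    free(node);                 /* sketch: requests are heap-allocated by the caller */
    io_sweep_up(right);
}

(the downward sweep is just the mirror image, right subtree first; the caller resets io_tree to NULL after a sweep and alternates directions)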

Improving performance will also require a good read-ahead prediction algorithm. One approach is to set aside a region of the disk to keep read-ahead hints. We can track the order in which page faults occur and collect statistics. Building a small Markov (well, Markov-ish) model for each disk block allows us to predict (hopefully) in which order future reads will occur. For each disk block we store something analogous to:

Code:

struct model_t
{
    struct predict_t
    {
        block_addr_t prev, next;  /* context block and the block predicted to follow */
        unsigned int count;       /* confidence: bumped on hits, decayed on misses */
    } predict[MODEL_SIZE];        /* MODEL_SIZE and block_addr_t defined elsewhere */
};
The system keeps track of the block addresses for the last two page faults that have occurred: prev0 (the most recent) and prev1 (the second most recent). Whenever a page fault occurs we can search the model for block[prev0] and determine if we had a prediction hit (for some n < MODEL_SIZE : predict[n].next == current && predict[n].prev == prev1). If a hit occurs we increment the appropriate count. If a hit does not occur we evict the predict[] entry with the lowest count and replace it with a prediction matching what has occurred, with a count of 1. We also need to make the counts decay for any predictions where prev == prev1 but next != current (simply dividing them by two is one approach); this stops a prediction miss from being repeated too often.
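
The update step, roughly (model_for() -- fetching a block's model from the reserved region -- and the block_addr_t typedef are assumptions; the rest follows the description above):

Code:

/* sketch of the per-fault model update; struct model_t as above */
#include <stdint.h>

typedef uint64_t block_addr_t;    /* assumed: a disk block address */
#define MODEL_SIZE 3

struct model_t
{
    struct predict_t
    {
        block_addr_t prev, next;
        unsigned int count;
    } predict[MODEL_SIZE];
};

extern struct model_t *model_for(block_addr_t block);   /* hypothetical lookup */

void model_update(block_addr_t prev1, block_addr_t prev0, block_addr_t current)
{
    struct model_t *m = model_for(prev0);
    struct predict_t *hit = NULL;
    struct predict_t *victim = &m->predict[0];

    for (int i = 0; i < MODEL_SIZE; i++) {
        struct predict_t *p = &m->predict[i];

        if (p->prev == prev1 && p->next == current)
            hit = p;                       /* this entry predicted what just happened */
        else if (p->prev == prev1)
            p->count /= 2;                 /* same context, wrong prediction: decay */

        if (p->count < victim->count)
            victim = p;                    /* remember the weakest entry for eviction */
    }

    if (hit) {
        hit->count++;                      /* hit: reinforce */
    } else {
        victim->prev  = prev1;             /* miss: evict the weakest entry and */
        victim->next  = current;           /* record what actually happened */
        victim->count = 1;
    }
}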

We can then follow a prediction chain from prev1 and prev0 looking for the blocks most likely to be read in the near future. I strongly suspect that even very small values for MODEL_SIZE (<= 3) would give dramatic performance improvements for frequently read data. The best part is, the model is persistent so the performance benefit persists after a restart.

We can also use prev0 and prev1 (and possibly prev2, prev3 etc) to predict that (prev0 + (prev0 - prev1)) is likely to be accessed in the future. Combining this with the current location of the head (the current read/write position and direction in the I/O tree) allows us to construct a cost estimate and decide if it's worth trying a speculative read ahead.
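
Something like this (head_position() and issue_readahead() are placeholders, and the cutoff is plucked from thin air):

Code:

/* sketch: stride guess plus a crude "is the detour worth it" check */
#include <stdint.h>

typedef uint64_t block_addr_t;

extern block_addr_t head_position(void);                /* hypothetical: current sweep position */
extern void         issue_readahead(block_addr_t lba);  /* hypothetical: queue a speculative read */

void maybe_readahead(block_addr_t prev1, block_addr_t prev0)
{
    int64_t stride     = (int64_t)prev0 - (int64_t)prev1;
    block_addr_t guess = prev0 + (block_addr_t)stride;   /* assume the access pattern continues */

    int64_t detour = (int64_t)guess - (int64_t)head_position();
    if (detour < 0)
        detour = -detour;

    if (detour < 1024)                                   /* arbitrary "cheap enough" cutoff */
        issue_readahead(guess);
}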

Nothing about this is revolutionary, my assertion is simply that this can be made to go very fast if one wishes to write a small kernel that does nothing else. I have thought about this quite a lot ... there are many, many other possible optimisations.

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Feb 29, 2012 3:45 pm
by gravaera
childOfTechno wrote:NO? really? How often you checkpoint depends on how much data you are prepared to lose ... you can do it once a week provided you don't run out of physical memory (or potentially a journal partition.)
This checkpoint stuff is legit. I swear, I use it all the time on Windows and Debian almost every day already.

I press ^S :wink:

Re: Page tables, long mode and other thing (looking for idea

Posted: Wed Dec 12, 2012 1:52 am
by TudorOS1
I see what you are doing: you are turning RAM into another cache. A fully associative cache, might I add.

Now about your "streamsystem": I know the FAT and NTFS filesystems, and NTFS does allow folders with data. FAT doesn't, but with a proprietary extension (one flag you set on directories), lo and behold, first file = data for folder.

Quite simply, your ideas are new when cats fly.

Update: Discovered edit button