
Blocking with pages

Posted: Sun Aug 22, 2010 2:08 am
by ds
I've been thinking about system call interfaces a lot lately. My current thinking runs along the following lines:

1. All IO buffers must be page aligned, which allows for nifty paging tricks.
2. IO blocks at the page level rather than the call level (a nifty paging trick).

For example, a call to the disk driver requesting that a range of pages be filled with data from the disk would execute as follows:

1. The pages specified in the call are marked not-present+blocked in the page tables.
2. The read operations to be performed are entered into the disk driver's IO queue.
3. The call returns immediately.

As the data is brought in from disk the blocked pages are marked as present+unblocked. A process blocks if it touches any page marked as blocked. Simple enough.
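
In code, the kernel side of such a read call might look roughly like this (kernel32_read_pages, page_set_blocked and disk_queue_read are just placeholder names to make the three steps concrete, not a real API):

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE        4096
#define SECTORS_PER_PAGE (PAGE_SIZE / 512)

/* Hypothetical kernel helpers -- placeholders, not an existing API. */
void page_set_blocked(int pid, void *vaddr);
void disk_queue_read(uint64_t lba, int pid, void *vaddr);

/* Queue a read of `num` pages starting at `vaddr`, then return without
 * waiting for the data to arrive. */
int kernel32_read_pages(int pid, uint64_t lba, void *vaddr, size_t num)
{
    if ((uintptr_t)vaddr & (PAGE_SIZE - 1))
        return -EINVAL;                       /* buffers must be page aligned */

    for (size_t i = 0; i < num; i++) {
        void *page = (char *)vaddr + i * PAGE_SIZE;
        page_set_blocked(pid, page);          /* 1. mark not present + blocked */
        disk_queue_read(lba + i * SECTORS_PER_PAGE, pid, page);  /* 2. queue it */
    }
    return 0;                                 /* 3. return immediately */
}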

The advantage of this approach is that a well-written application can request a slab of disk IO and continue processing as long as it doesn't touch any blocked pages. From the user process's point of view, the call still *looks* like an ordinary blocking call; if any attempt is made to read from a blocked page, the process blocks and waits for the data to arrive.

Efficiently utilizing this interface would mean requesting IO in large batches (allowing the disk driver to reorder IO as much as possible to minimize seek times).
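
From the application's side, usage might look something like this; read_pages_async, do_unrelated_work and the 1 MiB batch size are made up for the sake of the example:

#include <stddef.h>

#define BATCH (1024 * 1024)                 /* 1 MiB batch = 256 pages */

extern int  read_pages_async(int fd, void *buf, size_t len);  /* hypothetical call */
extern void do_unrelated_work(void);                          /* placeholder */

unsigned long process_file(int fd, unsigned char *buf /* page aligned */)
{
    read_pages_async(fd, buf, BATCH);       /* returns at once, pages blocked  */

    do_unrelated_work();                    /* overlaps with the disk transfer */

    /* The first touch of a still-blocked page is where we actually block. */
    unsigned long sum = 0;
    for (size_t i = 0; i < BATCH; i++)
        sum += buf[i];
    return sum;
}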

There is *one* caveat I can see: what happens after a disk error? Since the IO call returns before the IO actually happens, there is no way to inform the process in the usual way (a return value from the call). Some kind of asynchronous callback or trap would be required.

This approach need not apply just to disk IO; I'm thinking that IPC could also work in the same fashion.

Thoughts?

Re: Blocking with pages

Posted: Sun Aug 22, 2010 4:15 am
by TylerH
It'll add more complexity to what will end up just being blocking IO. Why would an app read from the disk if it didn't plan to access it soon?

It's a good idea to implement in your non-blocking IO, but it wouldn't help as much when trying to spoof blocking IO.

You could leave it up to the driver to do all that, just having an interface allowing drivers to tell the kernel whether x page should be blocked/present. :idea: Actually, that could simplify your end quite a bit. You could have the callback go to the driver that blocked/presented the page, who would properly handle the app that caused the callback.

Re: Blocking with pages

Posted: Sun Aug 22, 2010 8:27 am
by ds
TylerAnon wrote:It'll add more complexity to what will end up just being blocking IO. Why would an app read from the disk if it didn't plan to access it soon?
True. However, the idea is that an application written with this API in mind knows that issuing an IO call and deferring access to the data for as long as possible is a Good Thing.

I'm not sure that it would be all that complex either. Since IO must be page aligned, the pages referred to by the call are simply mapped out and marked as blocked. A kernel API like this would suffice:

kernel32_page_block(pid, page, num)

When the process touches a blocked page it just gets marked as stalled and the page it blocked on is recorded. During the next trip around the scheduler the kernel can check whether the page is still blocked, and if it is, it can skip the process and move on to the next one. The example disk driver simply maps in pages as it reads them from disk; it doesn't need to notify the scheduler directly that the IO has completed, as the scheduler just takes note that the page is now unblocked and allows the process to run.
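
Roughly, the two kernel-side pieces (the fault handler and the scheduler check) could be as simple as this; the struct layout and names are invented for the example:

/* Invented bookkeeping for the example. */
struct process {
    int   runnable;           /* cleared while the process is stalled    */
    void *stalled_on;         /* the blocked page that it tried to touch */
};

int page_is_blocked(struct process *p, void *vaddr);   /* hypothetical query */

/* Page-fault handler: touching a blocked page just records the fact. */
void on_blocked_page_fault(struct process *p, void *fault_addr)
{
    p->runnable   = 0;
    p->stalled_on = fault_addr;
}

/* Scheduler check: skip stalled processes until their page is mapped in.
 * The disk driver never talks to the scheduler directly; it only maps
 * pages back in as each read completes. */
int can_run(struct process *p)
{
    if (p->runnable)
        return 1;
    if (!page_is_blocked(p, p->stalled_on)) {           /* driver mapped it in */
        p->runnable = 1;
        return 1;
    }
    return 0;
}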
TylerAnon wrote:It's a good idea to implement in your non-blocking IO, but it wouldn't help as much when trying to spoof blocking IO.
Well, the idea here is subtly different from traditional blocking IO. All IO calls are inherently non-blocking; it's the *pages* that block. Something like a select() call that waits on a set of pages rather than descriptors would be very useful :)
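
Something along these lines, as a purely hypothetical signature:

#include <stddef.h>

/* Hypothetical: block until at least one of the given pages is unblocked
 * (or the timeout expires); returns the number of pages that are ready. */
int page_select(void *pages[], size_t npages, int timeout_ms);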
TylerAnon wrote:You could leave it up to the driver to do all that, just having an interface allowing drivers to tell the kernel whether x page should be blocked/present. :idea: Actually, that could simplify your end quite a bit. You could have the callback go to the driver that blocked/presented the page, who would properly handle the app that caused the callback.
Hmm. That's an interesting point. The major thrust of the idea was to allow the driver a maximum amount of latitude when reordering IO, but that has the obvious drawback that it may leave processes starved for pages while the driver ditzes around loading data that won't be needed till later. There would need to be a way for the driver to get a hint that "this process *really* wants this data right now".

Re: Blocking with pages

Posted: Sun Aug 22, 2010 9:06 am
by TylerH
ds wrote:
TylerAnon wrote:It'll add more complexity to what will end up just being blocking IO. Why would an app read from the disk if it didn't plan to access it soon?
True. However, the idea is that an application written with this API in mind knows that issuing an IO call and deferring access to the data for as long as possible is a Good Thing.

I'm not sure that it would be all that complex either. Since IO must be page aligned, the pages referred to by the call are simply mapped out and marked as blocked. A kernel API like this would suffice:

kernel32_page_block(pid, page, num)

When the process touches a blocked page it just gets marked as stalled and the page it blocked on is recorded. During the next trip around the scheduler the kernel can check whether the page is still blocked, and if it is, it can skip the process and move on to the next one. The example disk driver simply maps in pages as it reads them from disk; it doesn't need to notify the scheduler directly that the IO has completed, as the scheduler just takes note that the page is now unblocked and allows the process to run.
My point was that it would be extra overhead for what would end up behaving like blocking IO. Even though the call returns before the read is done (like non-blocking IO), when the thread/process accesses what it thinks is ready-to-read data, it goes into a blocked state. That would be like scanf returning before the string is entered, but blocking if the app tries to read the string before it has been entered. The app wouldn't ask for the string if it didn't need it right away, or it would have explicitly used what it knows to be an asynchronous IO API (in other words, as long as you explicitly market this as being for async IO, it's a great idea).
ds wrote:
TylerAnon wrote:It's a good idea to implement in your non-blocking IO, but it wouldn't help as much when trying to spoof blocking IO.
Well, the idea here is subtly different from traditional blocking IO. All IO calls are inherently non-blocking; it's the *pages* that block. Something like a select() call that waits on a set of pages rather than descriptors would be very useful :)
I got that backwards; I meant that it would be helpful to block pages when the app means to use non-blocking IO (to catch early reads). But you should have a callback to tell it when the read is done. Like an IRQ for an app.
ds wrote:
TylerAnon wrote:You could leave it up to the driver to do all that, just having an interface allowing drivers to tell the kernel whether x page should be blocked/present. :idea: Actually, that could simplify your end quite a bit. You could have the callback go to the driver that blocked/presented the page, who would properly handle the app that caused the callback.
Hmm. That's an interesting point. The major thrust of the idea was to allow the driver a maximum amount of latitude when reordering IO, but that has the obvious drawback that it may leave processes starved for pages while the driver ditzes around loading data that won't be needed till later. There would need to be a way for the driver to get a hint that "this process *really* wants this data right now".
Yeah, the callback (from the process accessing blocked pages) would be the way to tell the driver to hurry up. When a driver registers blocked pages, it must register a callback for when/if a process accesses those pages. But we've already established that this method wouldn't be for apps that need speed ("The major thrust of the idea was to allow the driver a maximum amount of latitude when reordering IO").
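
Something like this, maybe (all names invented): the driver blocks the pages and registers a callback that the kernel fires if a process faults on one of them, which doubles as the "hurry up" hint.

#include <stddef.h>

/* Fired by the kernel when `pid` faults on a page the driver has blocked;
 * the driver can then bump that request to the front of its queue. */
typedef void (*page_urgent_cb)(int pid, void *vaddr, void *driver_ctx);

/* Hypothetical kernel interface for drivers. */
int kernel_block_pages(int pid, void *vaddr, size_t num_pages,
                       page_urgent_cb on_fault, void *driver_ctx);
int kernel_unblock_page(int pid, void *vaddr);   /* called as each read completes */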

Re: Blocking with pages

Posted: Sun Aug 22, 2010 12:23 pm
by bewing
As soon as you talk about blocking vs. non-blocking system calls, you are talking about an asynchronous API. The whole question about error "return" values is exactly the same one as you face for any other asynchronous function. It is a problem that needs to be solved in general for any async system call -- and you don't need to waste brain cells trying to solve it again for this specific case. Just research asynchronous APIs and you will find (generally ugly) answers for how to do it.

Also, the whole "make the system call, have it return, have the app wait as long as possible, then attempt to grab the data, and block if it isn't there yet" stuff is again pure asynchronous methodology.

So, I'd say that the main questions are really:
Are you going to have any non-IO system calls that are asynchronous? If so, do you truly want an entire separate asynchronous blocking system just for IO API calls? Or, if you are already planning one, wouldn't it be simpler to use the same async blocking system for async IO system calls that you use for non-IO async calls? Which async blocking system is faster and more flexible (or whatever your favorite criterion is)?

If only IO calls are async, do you want to give the app the ability to peek, and see if the data grabbing will block? And how will you implement that efficiently? The app will generally not care about which "pages" will block -- it will only care about the entire block of data as a unit.

Re: Blocking with pages

Posted: Mon Aug 23, 2010 8:25 am
by ds
Hmm, interesting thoughts.
TylerAnon wrote:The app wouldn't ask for the string if it didn't need it right away, or it would have explicitly used what it knows to be an asynchronous IO API (in other words, as long as you explicitly market this as being for async IO, it's a great idea).
Well, true, but my thinking is very much centered around async IO. I'm not interested in writing yet another desktop GUI. I'm much more interested in various high performance computing problems. One thing I've spent a while musing over is maximizing disk-to-network performance.

One part of this equation is eliminating as much buffering as possible. Ideally, the disk driver can just DMA page-sized chunks of disk directly to pages which it maps straight into the user process. This is of course subject to copy-on-write trickery to implement disk caches, etc. etc.

Likewise, the NIC driver can just DMA packets to/from page aligned addresses straight to the network card (as much as this is possible, depending on the NIC.)

Hence the idea that everything IO-related should be page aligned; it allows the kernel maximum flexibility to map things rather than copy them.

One good example would be IPC. My current thinking about IPC is that a process should be able to pass a single page to another process. Rather than copying message data, the kernel can just map the message page out of one process and into another.
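
As a rough sketch (unmap_page and map_page are stand-ins for whatever the real page-table code ends up being), the whole send is little more than:

#include <stdint.h>

struct process;                            /* opaque here; the real layout doesn't matter */

/* Stand-ins for the real page-table code -- hypothetical names. */
uintptr_t unmap_page(struct process *p, void *vaddr);              /* returns the frame */
void      map_page(struct process *p, void *vaddr, uintptr_t frame);

/* Move one page-sized message from sender to receiver without copying. */
int ipc_send_page(struct process *src, void *src_vaddr,
                  struct process *dst, void *dst_vaddr)
{
    uintptr_t frame = unmap_page(src, src_vaddr);   /* pull the page out of the sender */
    if (frame == 0)
        return -1;                                  /* not mapped or not page aligned  */
    map_page(dst, dst_vaddr, frame);                /* hand the same frame to receiver */
    return 0;
}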
TylerAnon wrote:But we've already established that this method wouldn't be for apps that need speed
Well it depends on what you mean by speed :) High bandwidth over low latency; hammer that IO device, *hard*!
bewing wrote:As soon as you talk about blocking vs. non-blocking system calls, you are talking about an asynchronous API. The whole question about error "return" values is exactly the same one as you face for any other asynchronous function. It is a problem that needs to be solved in general for any async system call -- and you don't need to waste brain cells trying to solve it again for this specific case. Just research asynchronous APIs and you will find (generally ugly) answers for how to do it.
The observation that IO calls may fail at some point after they are made was more about the notion that this system (if I ever build it) may not have anything *but* page-async IO. Apps will always have to deal with the possibility that the kernel may interrupt them and declare "****, you know that IO call you made a few msec ago? Well, it didn't work out, and here's the page that faulted."

Personally I think we've been locked into a "stream" IO mindset for too long. My background in games programming has taught me one thing: "load a big slab 'o data in one go (or better yet, mmap it), then parse." I've not touched (f)read/(f)write/(f)seek for anything performance-critical in a *long* time. Hence my current focus on pages. Pages pages pages!
bewing wrote:Are you going to have any non-IO system calls that are asynchronous?
What else is there aside from I/O ? ;)
bewing wrote:If only IO calls are async, do you want to give the app the ability to peek, and see if the data grabbing will block? And how will you implement that efficiently? The app will generally not care about which "pages" will block -- it will only care about the entire block of data as a unit.
That's a good point. Really I only mooted the idea because I was staring at a pile of ia32 assembly code and it occurred to me that blocking this way (with the page tables) would be easy to implement and possibly fast. How it affects writing a user space process is what I'm trying to work out ...

How fast it can go is another topic I'm currently thinking about. How much page table thrashing is there going to be? My instincts tell me it may well not be worth it, but hey, there's not much point re-implementing Unix, is there? If you're going to experiment with O/S design, you may as well experiment with something radical.

Re: Blocking with pages

Posted: Mon Aug 23, 2010 8:56 am
by Candy
I was going for a readfile that returned an IO buffer (as you describe), and a writefile that did the opposite. I'm not going to now:

- For 95% of the cases (read only) it's equal to what mmap would produce, but platform-specific.
- For those other 5% of the cases, a regular open/read/close won't add much to the complexity.

I think I'll be reverting them again.

Re: Blocking with pages

Posted: Mon Aug 23, 2010 10:12 am
by Owen
Doing page-size transfers when you only have small quantities of data to move is going to be very expensive.

My core IPC design looks like this (and is consequently how all I/O is done - microkernel):
  • Kernel transfers 64-byte messages, of which 16 bytes are header (the same structure can also be used for asynchronous system calls)
  • These are placed in an outgoing buffer and processed lazily by the kernel; likewise, the kernel places incoming messages in a buffer for lazy handling by the application
  • The only major synchronous calls are FMK_WaitMessage(queue) and FMK_FlushMessages
For general IPC, libFusion implements a layer on top of this:
  • At connection establishment, an 8kB ringbuffer is built in each direction
  • Short messages (<32 bytes or so) go directly via the kernel IPC system
  • Most messages go into the ringbuffer
  • For messages which contain one or more contiguous pages, the unaligned header goes in the ringbuffer, the page-aligned middle chunks are COW mapped, and the unaligned footer also goes in the ringbuffer
  • All messages contain information about how far the sender has gotten through processing its own incoming ringbuffer
File IO and such is generally done by memory mapping, for performance reasons; these mechanisms mainly apply to stuff like network I/O.
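
Roughly, a message looks like this (simplified; the field names here are illustrative rather than the exact header layout):

#include <stdint.h>

/* 16-byte header + 48-byte payload = 64 bytes total. */
struct fmk_message {
    uint32_t src_port;      /* sending endpoint                           */
    uint32_t dst_port;      /* receiving endpoint                         */
    uint32_t type;          /* message type / async system call selector  */
    uint32_t ring_ack;      /* how far the sender has processed its own
                               incoming ringbuffer (per the last bullet)  */
    uint8_t  payload[48];   /* short message body                         */
};

_Static_assert(sizeof(struct fmk_message) == 64, "messages must be 64 bytes");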