OSDev.org

Posted: **Mon Feb 27, 2012 5:00 pm**

I just kicked Bochs into long mode (hooray.) I'm currently grappling with the sheer enormity of the address space! Thinking about this has led me to several conclusions:

There are lots of ways to abuse the scale of the address space. One thing I have wanted to do for a long time is write a very lightweight HPC kernel that effectively maps entire disks into the address space. This is not possible (without using lots of segments) on a 32-bit machine. On x64 there's room enough for even the largest hard drive to be directly addressable.

The idea is very simple: just treat the RAM as cache for the disk. Notionally, the disk is the storage space from which code executes (you literally can branch to any address on the drive.) One obvious advantage to this is that after a reboot the machine picks up exactly where it left off (providing one has code to reliably checkpoint disk and task state.) One obvious downside is that the disk must be mapped to the same virtual address, because any code it contains will be linked assuming a fixed linear address (unless one wishes to make everything position independent.)

Also, one can execute:

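    mov rdi, START_OF_MAPPED_DRIVE      ; destination: first byte of the mapping
    mov rcx, SIZE_OF_DRIVE_IN_QWORDS    ; number of 8-byte stores to perform
    mov rax, 0xfeeddeadfeeddead         ; fill pattern
    cld                                 ; make stosq advance rdi upwards
    rep stosq                           ; overwrite the whole mapped drive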

... and goodbye disk (without task isolation)

Which brings me to my second point: I'd like to do away with task isolation.

For a single purpose box (a database for instance) this situation is fine. One layer 0 kernel (that performs the disk to address space mapping) and one layer 1 task (the database) that runs in ring 0 but relies on the mapping provided by layer 0. If you have a bug in your database engine, you (and your database) are probably borked anyway (even on a task isolated machine.)

I'm not sure what the best way to accomplish the disk to address space mapping is. Simpler is better, the aim is to maximise throughput and minimise latency. The simplest solution is a 1:1 mapping, from the master boot record right through to the last sector of the disk. Pressing the reset button with such a scheme would obviously leave the disk in an undefined state; what I'm wondering at the moment is whether it is possible to deal with this at level 1.

I'm also thinking about how to arrange the network interface. I really don't want to have any system calls to level 0, just traps and exceptions. One idea is to arrange a ring buffer in a predefined location and use page faults to trigger transmission of packets. The interface would be very simple: write a packet to the next page-aligned address in the ring buffer and then cause a page fault (a read would suffice) at the page after that. Once a packet has entered the send queue, its pages would be marked in the page table, and if the level 1 code wraps around the transmit buffer and causes a page fault on a queued packet, the process is blocked until the packet has been transmitted. A similar scheme would work for packet reception.
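
A rough user-space model of the transmit ring described above, as a sketch only. The slot count, the state names and the helpers `tx_submit` and `tx_complete` are hypothetical. In a real kernel the submit step would be the page fault taken on the page after the packet, and "would block" would mean descheduling the level 1 task rather than returning a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_SLOTS 4   /* hypothetical: one page-sized slot per packet */

enum slot_state { SLOT_FREE, SLOT_QUEUED };

struct tx_ring {
    enum slot_state state[TX_SLOTS];
    size_t head;     /* next slot level 1 will write a packet into */
    size_t tail;     /* oldest slot still waiting to be transmitted */
};

/* Models level 1 writing a packet and faulting on the following page.
 * Returns false when the writer has wrapped onto a packet that is still
 * queued, i.e. the case where the task would be blocked. */
static bool tx_submit(struct tx_ring *r)
{
    assert(r != NULL);
    if (r->state[r->head] == SLOT_QUEUED)
        return false;
    r->state[r->head] = SLOT_QUEUED;   /* level 0 marks the page as in flight */
    r->head = (r->head + 1) % TX_SLOTS;
    return true;
}

/* Models the NIC finishing the oldest queued packet and freeing its slot.
 * Returns false when nothing is pending. */
static bool tx_complete(struct tx_ring *r)
{
    assert(r != NULL);
    if (r->state[r->tail] != SLOT_QUEUED)
        return false;
    r->state[r->tail] = SLOT_FREE;
    r->tail = (r->tail + 1) % TX_SLOTS;
    return true;
}
```

Starting from a zeroed `struct tx_ring`, four submits succeed, a fifth reports that the writer would block, and one `tx_complete` frees exactly one slot for the next packet.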

Thoughts?

Posted: **Mon Feb 27, 2012 5:45 pm**

I believe this is very similar to what Multics did (though Multics did it at the level of files)

Posted: **Mon Feb 27, 2012 6:35 pm**

Why not use mmap?

Posted: **Mon Feb 27, 2012 8:09 pm**

Fail to see the point. You would still be restricted to physical memory, so only a fraction of your 500 GB-1 TB HD would be mapped at a time. Lots of page faults and disk access would be involved either way, especially when working with larger files such as full HD movies.

or did I miss something?

Posted: **Tue Feb 28, 2012 6:00 am**

If you're mapping most of the RAM as disk cache, aren't you severely limiting the space that can be used for a running program's heap and stack?

Posted: **Tue Feb 28, 2012 6:26 am**

childOfTechno wrote:One idea is to arrange a ring buffer in a predefined location and use page faults to trigger transmission of packets.

Using a page fault to generate a syscall is generally slow. You take the fault, you still need to map that page, and then the application needs to retry the write.

Posted: **Tue Feb 28, 2012 6:36 am**

One obvious advantage to this is that after a reboot the machine picks up exactly where it left off

If that is Windows, you see a blue screen, reboot, and it presents you with another blue screen?

Posted: **Tue Feb 28, 2012 6:48 am**

childOfTechno wrote:The idea is very simple: just treat the RAM as cache for the disk.

Yep; the idea is so simple that swap files and swap partitions have been around for decades. They might not map the whole disk into the address space at once, but the idea has been there all along.

And while it might sound like an awesome thing to do, remember that, in every given session, vast amounts of what is stored on the disk are never accessed at all. I.e., you'd go to great lengths to map the whole disk (which does involve quite some metadata handling), but then use only a tiny weensy fraction of it.

(you literally can branch to any address on the drive.)

Yaj - one branch gone wrong, and you find yourself executing inode data. Or better even, a single null pointer and you write your chat log to the MBR!

Seriously, though. I think it's much better to leave the drive handling to the VFS, instead of implementing the VFS on top of the memory mapper...

One obvious advantage to this is that after a reboot the machine picks up exactly where it left off (providing one has code to reliably checkpoint disk and task state.)

Also been done before. I remember seeing QNX Neutrino doing this as a "party trick" at a computer show once (without mapping the whole hard drive).

I think you're trying to reinvent the wheel, only different.

Posted: **Tue Feb 28, 2012 4:19 pm**

Hi,

childOfTechno wrote:On x64 there's room enough for even the largest hard drive to be directly addressable.

Virtual address space size is 2**48 bytes (48-bit addresses), but it's split into 2 contiguous areas with a hole in the middle, so usually you limit "user space size" to 2**47 bytes. If you use all of that space for mapping one logical disk you'd have nowhere for the process' code and data, so for sanity you may as well round down to 2**46 bytes. That mostly means that in practice the largest hard drive you could map into user space without losing sanity is 64 TiB. Note: for a monolithic kernel (where it'd be "kernel space" not "user space") the limit would be the same, except it'd be "about 64 TiB for all logical drives combined" rather than just for one logical drive.
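
The arithmetic behind the 64 TiB figure can be checked with a couple of C helpers. This is only a sketch; `bytes_for_bits` and `bytes_to_tib` are names made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes addressable with `bits` address bits (bits must be < 64). */
static uint64_t bytes_for_bits(unsigned bits)
{
    assert(bits < 64);
    return UINT64_C(1) << bits;
}

/* Convert a byte count to whole TiB (1 TiB = 2^40 bytes). */
static uint64_t bytes_to_tib(uint64_t bytes)
{
    return bytes >> 40;
}
```

With these, 48 bits gives 256 TiB for the whole address space, 47 bits gives 128 TiB for one half, and 46 bits gives the 64 TiB quoted above.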

How large is the largest hard drive? At the moment most hard drive manufacturers are making 3 TB hard drives. You could get 12 of them and use RAID0 (striping) to make those 12 hard drives look like one logical drive that's 36 TB (about 32.7 TiB).

If hard disk sizes are increasing at a rate of "twice as big every 5 years" and the OS takes 10 years to implement and become useful before its first public release; then you could expect that when the OS is released 12 TB physical drives will be available. After the OS is released, how many years do you get until the "64 TiB logical disk size" limit starts being a problem for large RAID arrays?
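
One way to put numbers on that question is a coarse step model in C. It assumes the post's "twice as big every 5 years" rate and only counts whole 5-year doublings; `years_until_limit` and the `TB`/`TIB` macros are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define TB  UINT64_C(1000000000000)   /* decimal terabyte, as drive vendors count */
#define TIB (UINT64_C(1) << 40)       /* binary tebibyte */

/* Years, in 5-year doubling steps, until `drives` drives that each start
 * at `drive_bytes` together exceed `limit_bytes`. */
static unsigned years_until_limit(uint64_t drive_bytes, unsigned drives,
                                  uint64_t limit_bytes)
{
    assert(drive_bytes > 0 && drives > 0);
    assert(limit_bytes < (UINT64_C(1) << 63));   /* keeps the doubling from overflowing */
    unsigned years = 0;
    uint64_t total = drive_bytes * drives;
    while (total <= limit_bytes) {
        total *= 2;      /* one capacity doubling */
        years += 5;      /* assumed to take five years */
    }
    return years;
}
```

Under these assumptions, `years_until_limit(3 * TB, 12, 64 * TIB)` returns 5: a 12-drive array of today's 3 TB disks crosses 64 TiB at the first doubling, before the hypothetical 10-year release date.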

childOfTechno wrote:For a single purpose box (a database for instance) this situation is fine.

Imagine you've got your disk mapped into the virtual address space and everything is working fine - you've got a file system somewhere with maybe 3 GiB of modified data in RAM that belongs to 1000 different files that is waiting to be written to disk. Then the kernel gets a read error from the underlying hard drive. How does the kernel's page fault handler tell the file system there's been a read error, so that the file system can write that 3 GiB of modified data to disk and avoid corrupting 1000 different files?

Cheers,

Brendan

Posted: **Wed Feb 29, 2012 3:29 am**

berkus wrote:
Brendan wrote:If hard disk sizes are increasing at a rate of "twice as big every 5 years" and the OS takes 10 years to implement and become useful before its first public release; then you could expect that when the OS is released 12 TB physical drives will be available. After the OS is released, how many years do you get until the "64 TiB logical disk size" limit starts being a problem for large RAID arrays?
And how long until "Virtual address space size is 2**48 bytes" stops being a limitation? VA can always be bumped up to 56 or even 64 bits.

There's only really 3 ways to increase virtual address space size:

- Add a fifth table level (e.g. PML5) to get 57-bit virtual addresses. This would suck - four levels of page tables is already too many (TLB miss costs).
- Completely redesign paging. For example, "64 KiB pages, 64 KiB page tables, 64 KiB page directories and 64 KiB page directory pointer tables" would give you 55-bit virtual addresses with only 3 levels of page tables. Unfortunately there's been a lot of assumptions about 4 KiB page sizes in both software and hardware (e.g. allowed alignment for memory mapped IO in PCI specs), so while I think this would be the best option in the long run, it's not an easy thing to do in the short term.
- Remove page tables and add a "fifth" (new fourth) level of page tables. This wastes lots of RAM (2 MiB pages are too big) especially for the "lots of small processes" case, and also runs into the same problems as completely redesigning paging (those "4 KiB pages" assumptions in software and hardware).

In all of these cases, it's going to be a mess.
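
The address widths quoted for the three options follow from one formula: page-offset bits plus the index bits contributed by each table level. A minimal C sketch, assuming 8-byte table entries as on current x86-64; `va_bits` is a hypothetical helper name.

```c
#include <assert.h>

/* Virtual address width = page-offset bits + levels * index bits per level.
 * page_shift  = log2(page size in bytes)
 * table_shift = log2(table size in bytes); with 8-byte entries a table
 *               holds 2^(table_shift - 3) entries. */
static unsigned va_bits(unsigned page_shift, unsigned table_shift,
                        unsigned levels)
{
    assert(table_shift >= 3);
    unsigned index_bits = table_shift - 3;   /* log2(entries per table) */
    return page_shift + levels * index_bits;
}
```

`va_bits(12, 12, 4)` gives today's 48 bits, `va_bits(12, 12, 5)` gives 57 for a fifth level, `va_bits(16, 16, 3)` gives 55 for the 64 KiB redesign, and `va_bits(21, 12, 4)` gives 57 when 2 MiB becomes the smallest page.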

At the moment, if someone wants true 64-bit addressing then Intel can sell them Itanium and AMD can't sell them anything at all, so I have a feeling that Intel won't bother until/unless they really have to.

AMD has a reason to do it, but unless they have Microsoft's backing it's too much hassle for a feature that Windows won't support.

Basically what I'm saying is that I have no idea when it might happen; but maybe it will only happen when Microsoft needs it, and Microsoft don't waste massive amounts of virtual address space mapping whole disk drives and therefore won't need it until long after it has become a large problem for childOfTechno.

Cheers,

Brendan

Posted: **Wed Feb 29, 2012 4:54 am**

Brendan wrote:This would suck - four levels of page tables is already too many (TLB miss costs).

Implement TLBs as a piece of cache with cache-coherency enabled as the important change (i.e. including MOSI and the shebang). So you can implement tagged TLBs and minimal-cost address space changes without code change and even get rid of the TLB shootdown IPIs for systems that are aware of the feature.

Posted: **Wed Feb 29, 2012 7:40 am**

Some good points have been raised. I guess I need to emphasise that I have little interest in implementing a traditional OS (files, processes, IPC, etc) because then I may as well just use an existing traditional OS. The purpose of mapping an entire drive into virtual address space is to reduce everything to "one big Von Neumann box," then run a program on it. Programs don't allocate memory, they just use it, supported by the backing store (a hard disk.)

bubach wrote:Fail to see the point. You would still be restricted to physical memory, so only a fraction of your 500 GB-1 TB HD would be mapped at a time. Lots of page faults and disk access would be involved either way, especially when working with larger files such as full HD movies.

or did I miss something?

There aren't files, just address space(s). You wouldn't need to store anything as a file, just have a structure stored somewhere:

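    #include <stddef.h>   /* size_t */

    /* Describes one region of the disk-backed address space. */
    struct block_t
    {
        void* base_pointer;   /* start of the region within the mapped disk */
        size_t length;        /* size of the region, presumably in bytes */
    };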

Physical RAM limits the working set of a program running on a traditional OS anyway, and if a program exceeds this the excess gets swapped out. This means that a program operating on an HD movie with a very large working set (larger than physical RAM) ends up loading data into RAM, swapping it back and forth and then finally writing the result out to disk.

In a sense, traditional file management leaves part of the control over swapping data in to and out of RAM up to the application. With entire disks mapped in to the virtual address space only the kernel gets to decide what to swap and when.

brain wrote:If you're mapping most of the RAM as disk cache, aren't you severely limiting the space that can be used for a running program's heap and stack?

There is no run-time heap separate to persistent storage, it's all just stored on the disk (even the stack.) Everything persists. RAM just acts to cache disk blocks.

As an interesting aside, some of the earliest computers used rotating magnetic drum storage as main memory; this is somewhat similar (except that we have RAM to speed everything up.)

bluemoon wrote:Using a page fault to generate a syscall is generally slow. You take the fault, you still need to map that page, and then the application needs to retry the write.

It is possible to implement such an interface with an overhead of one fault per transaction (for fixed size transaction blocks.)

Solar wrote:And while it might sound like an awesome thing to do, remember that, in every given session, vast amounts of what is stored on the disk are never accessed at all. I.e., you'd go to great lengths to map the whole disk (which does involve quite some metadata handling), but then use only a tiny weensy fraction of it.

Implementing a simple 1:1 linear mapping to the disk would be outrageously simple.

If the whole disk is mapped then when a page fault occurs you can work out which sector(s) you need to load with a subtract and a couple of shifts!
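
The "subtract and a couple of shifts" can be sketched in C as follows. This assumes 4 KiB pages, 512-byte sectors and a page-aligned mapping base; `fault_to_lba` and both macros are hypothetical names, not anything from an existing kernel.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12   /* 4 KiB pages */
#define SECTOR_SHIFT 9    /* assumes 512-byte sectors */

/* First LBA backing the page that contains fault_addr, for a 1:1 mapping
 * of the disk that starts at the page-aligned address map_base. */
static uint64_t fault_to_lba(uint64_t fault_addr, uint64_t map_base)
{
    assert(fault_addr >= map_base);
    uint64_t page = (fault_addr - map_base) >> PAGE_SHIFT;  /* page index in the mapping */
    return page << (PAGE_SHIFT - SECTOR_SHIFT);             /* 8 sectors per page */
}
```

The handler would then read `1 << (PAGE_SHIFT - SECTOR_SHIFT)` sectors starting at that LBA into a free frame and map the frame at the faulting page.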

Solar wrote:Yaj - one branch gone wrong, and you find yourself executing inode data. Or better even, a single null pointer and you write your chat log to the MBR!

Seriously, though. I think it's much better to leave the drive handling to the VFS, instead of implementing the VFS on top of the memory mapper...

What VFS? It's all just memory, one big long linear array of bytes. If an application wants to it can create some kind of heap/filesystem/garbage collected arena.

Indeed, one branch gone wrong and everything does get toasted.

But seriously, for a single purpose compute cluster it really doesn't matter.

Brendan wrote:Virtual address space size is 2**48 bytes, but it's split into 2 contiguous areas with a hole in the middle, so usually you limit "user space size" to 2**47 bytes. If you use all of that space for mapping one logical disk you'd have nowhere for the process' code and data

You'd have plenty of space for the program, ON THE DISK. You just branch to the address where it is stored. There is no separate process memory, from the application's point of view there is only one address space, the disk itself.

Brendan wrote:Imagine you've got your disk mapped into the virtual address space and everything is working fine - you've got a file system somewhere with maybe 3 GiB of modified data in RAM that belongs to 1000 different files that is waiting to be written to disk.

What files? Everything is just one big linear address space running a single compute intensive application, which is free to allocate and deallocate memory as it sees fit.

Posted: **Wed Feb 29, 2012 7:51 am**

You may have overlooked gerryg400's comment.
What's wrong with mmap? What you describe is an extreme case of mapping a huge file on disk, one which itself occupies the whole disk space.
It is a subset of traditional OS functionality, so I think you have reduced flexibility.

Posted: **Wed Feb 29, 2012 7:53 am**

So at some point during loading the system you overwrite all RAM with the disk blocks to boot the system... what happens to the loader that you overwrite with the disk blocks while it's still executing, etc.?

Sounds to me like something like this would have to be supported by specialist hardware. Older systems with drum memory worked because this was how the hardware worked, and remember it was the extreme slowness of using persistent storage as ram that meant people binned this idea and started using real ram...

Even if you have some form of copy-on-write mechanism, you'll probably find that having to write every changed sector back to disk will be prohibitive, and your disk will be battered to death in short order. Maybe this would work better with a huge solid state drive, though, if you could find/afford one.

Posted: **Wed Feb 29, 2012 8:06 am**

brain wrote:So at some point during loading the system you overwrite all RAM with the disk blocks to boot the system... what happens to the loader that you overwrite with the disk blocks while it's still executing, etc.?

Sounds to me like something like this would have to be supported by specialist hardware. Older systems with drum memory worked because this was how the hardware worked, and remember it was the extreme slowness of using persistent storage as ram that meant people binned this idea and started using real ram...

Even if you have some form of copy-on-write mechanism, you'll probably find that having to write every changed sector back to disk will be prohibitive, and your disk will be battered to death in short order. Maybe this would work better with a huge solid state drive, though, if you could find/afford one.

Try to think of a giant mmap() ... parts of the disk get loaded when a page fault occurs.
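
For comparison, this is roughly what the "giant mmap()" looks like on an existing POSIX system, with a regular file standing in for the raw disk. It is a sketch under that assumption; `map_whole_file` is a hypothetical helper, and a real block device would need its size queried some other way than `fstat`.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read/write and shared, so that stores through the
 * returned pointer eventually reach the backing store. Pages are only
 * read in when first touched. Returns NULL on failure. */
static void *map_whole_file(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping keeps the file referenced */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

A caller treats the result as an ordinary byte array, calls `msync` when it needs the data to be on disk, and `munmap` when done. The reset-button problem raised earlier in the thread applies here too: anything not yet synced is in an undefined state after a crash.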

Page tables, long mode and other thing (looking for ideas.)
