
Re: Multi-core CPUS

Posted: Sun May 03, 2009 1:48 am
by Benk
Brendan wrote:
Benk wrote:
CPU intensive benchmarks are meaningless unless they use OS calls (which the papers show are MUCH cheaper with Singularity).

Wrong. CPU intensive code running under software isolation (with no API calls) suffers additional overhead (caused by things like array bounds checking, etc). The paragraph from the Singularity paper that you quoted mentions this ("the runtime overhead for safe code is under 5%").

Are you sure I'm wrong? It depends: if I'm running a tight 1-line CPU benchmark, how can this checking happen? Or even multiple loops, but with all the code in 1 memory page? It only happens on larger benchmarks and when accessing new data structures (pages), hence it really is a memory test, not a CPU one.

Brendan wrote:
For CPU intensive workloads software isolation has more overhead, and for "IPC ping-pong" workloads hardware isolation has more overhead. I'm just saying it'd make sense to do an "IPC ping-pong" benchmark *and* a "CPU intensive" benchmark, so that people don't get the wrong idea and think that software isolation always gives better performance.

Of course you'd think people reading research papers would be smart enough to read between the lines; but obviously this isn't the case...

That's only if you read the one paper. The others have some of these benchmarks, e.g. the Bartok compiler one, and compilation is a decent benchmark. A smart compiler can reduce the software-isolation overhead significantly, since it can work out the range of memory a loop covers. Once the compiler is proven and checks this at compile time, why do they even need to verify it at run time?
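To illustrate the point, here's a rough C sketch (my own made-up example, not Bartok output) of the per-access check a safe-language compiler conceptually inserts, and the hoisted form it can generate once it has proven the loop's index range:

Code:
#include <stddef.h>

/* What the compiler conceptually emits for "for i in 0..n: sum += a[i]"
 * in a safe language: a bounds check on every element access.
 * Returning -1 stands in for throwing an index-out-of-range error. */
long sum_checked(const int *a, size_t len, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i >= len)
            return -1;          /* per-iteration bounds check */
        sum += a[i];
    }
    return sum;
}

/* What it can emit instead once it proves i < n <= len for the whole loop:
 * one check up front, none inside the loop. */
long sum_hoisted(const int *a, size_t len, size_t n)
{
    if (n > len)
        return -1;              /* single range check, hoisted out of the loop */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

In a tight loop like this the second form is what you'd hope for, which is why a small CPU benchmark may show almost none of the isolation overhead.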
Brendan wrote:
Benk wrote:
Memory benchmarks are probably useful. There are also charts comparing the cost of API calls, window creation, etc. against Windows, Linux, etc. There are many other benchmarks in the other 6 papers and 30 design documents released with the distribution, showing memory usage etc.

I think I've seen 2 papers and a few videos (for e.g.), and none of the design documents.

I didn't believe some of the claims either, so I downloaded it and wanted to have a closer look. The design documents are better than the papers but are more speculative. I'm sure some results are not available, and a number of design documents are missing (the last one is number 87).

If you download it there are:
2 Technical reports
9 Papers (only from 2005-2007)
4 Getting started docs which cover some of the ideas
38 Design docs

In the later video they mentioned they had 3 days to get the benchmarks for the paper deadline, and no time for optimization etc.

Brendan wrote:
Benk wrote:
With regard to paging I think:
1) 4K pages are too much overhead for most applications these days. Singularity in the test uses 4k pages.

The problem with large pages is there's wastage - if you're using 4 KiB pages and only need 1 KiB of RAM then you have to waste the other 3 KiB of RAM; and if you're using 1 GiB pages the OS will run out of RAM before it can display a login screen (and then it'll need to swap entire 1 GiB chunks to disk).

This would only affect really small embedded OSs. How many apps use less than 4k, or even 500K? Take a 401K app: with 4K pages it will use 101 pages (404k); with 16K pages it will use 26 pages (416K). Now you're losing 12k = 3%, BUT your page table is 1/4 of the size, giving you a lot of memory back. So you probably end up losing 1% (so out of 1 Gig of user apps you lose 10M). You also gain performance, as you will have fewer TLB misses and less page table data in the CPU caches.

Considering memory is so big and cheap, I don't think it's an issue.

In a managed environment like Singularity, since you don't use pages for security, you could run a single garbage collector plus a big-page memory manager and a small memory manager. The small memory manager can just take big pages from the big MM and hand them out as smaller allocations, but underneath everything works on big 1M pages.
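As a rough sketch of what I mean (the names and the 1M block size are made up for illustration), the small memory manager only ever asks the big-page manager for whole blocks and carves them up itself:

Code:
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BIG_PAGE_SIZE (1u << 20)   /* the big MM only deals in 1M blocks */

/* Stand-in for the big-page memory manager; the real one would hand out
 * an actual 1M page, malloc() is used here just so the sketch runs. */
static void *big_mm_alloc_page(void)
{
    return malloc(BIG_PAGE_SIZE);
}

/* Trivial small memory manager: a bump allocator that carves small
 * allocations out of the current 1M block and asks the big MM for a
 * fresh block when the current one is exhausted. */
static uint8_t *current_block = NULL;
static size_t   used = BIG_PAGE_SIZE;   /* forces a block fetch on first call */

void *small_mm_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;   /* keep allocations 16-byte aligned */
    if (size > BIG_PAGE_SIZE)
        return NULL;                    /* huge requests go to the big MM directly */
    if (used + size > BIG_PAGE_SIZE) {
        current_block = big_mm_alloc_page();
        if (current_block == NULL)
            return NULL;
        used = 0;
    }
    void *p = current_block + used;
    used += size;
    return p;
}

A real version obviously needs freeing (per-block free lists, or letting the garbage collector reclaim whole blocks), but the point is the hardware mappings underneath never need to be smaller than 1M.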
Brendan wrote:
Benk wrote:
2) Virtual memory is terrible in practice. It's an order of magnitude quicker to restart most applications than to restart them after they've had memory stolen away from them, and besides, memory is cheap - by the time these OSs hit the street we'll all be looking at 8 Gig+ systems.

It's not an order of magnitude quicker to restart most applications - are you sure you know how virtual memory works?

By the time we're looking at 8 GiB+ systems we'll also be looking at 8 GiB+ applications (or at least 2 GiB applications running on 3 GiB OSs with 2 GiB of file cache and 1 GiB of video data). At the moment DDR2 is cheap, but DDR3 isn't, and newer computers need DDR3.

My bad - I should have been clearer: I was referring to virtual memory being paged to disk; paging it back takes forever. DDR3 is only expensive because it's new - DDR2 was very expensive compared to DDR/SDRAM when it was released.
Brendan wrote:
Benk wrote:
Bad blocks or disk drivers can have a really nasty impact on a machine, and furthermore you need tight coupling between the disk driver and the MM.

If the OS runs out of RAM, then what do you suggest should happen:

* Do a kernel panic or "blue screen of death" and shut down the OS
* Terminate processes to free up RAM
* Let processes fail because they can't get the RAM they need, and let the process terminate itself
* Everyone doubles the amount of RAM installed every time they run out of RAM (to avoid random denial of service problems caused by the first 3 options), even if they only run out of RAM once in 5 years
* Allow data to be paged to/from disk

What happens now? If the machine is really out of memory (and not just apparently out because memory is full of cached disk blocks) and pages heavily for longer than a short period, it often becomes unusable and it's time for the reset switch. In Singularity there is no dynamic loading of content, so these things can't happen - the only thing that can happen is that a new app fails to load, which also happens when you run out of swap.

With servers, people plan to make sure there is no paging anyway. I think paging is something that was important in the early 80s and 90s, not now. For most Windows machines I use, I turn it off and don't have much of a problem (in fact it's much better for my usage pattern).
Brendan wrote:
Benk wrote:
3) There are many ways of dealing with memory fragmentation.

Sure, and they all suck (but paging sucks less). ;)

Agreed for nearly all OS models; this is why Singularity intrigues me, as it does not necessarily need it. The indirection means an intermediate MM can remap memory and hence no fragmentation. See other post.
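One classic way to get that kind of indirection is a handle table (double indirection) - a rough sketch below, purely to show the idea (I'm not claiming this is how Singularity or .NET actually do it); the cost is an extra load on every access:

Code:
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A handle is an index into a table of current block addresses. Clients
 * always go through the table, so the memory manager can move ("repack")
 * a block at any time as long as it updates the table entry. */

#define MAX_HANDLES 1024

typedef size_t handle_t;

static void  *handle_table[MAX_HANDLES];
static size_t handle_size[MAX_HANDLES];

handle_t mm_alloc(size_t size)
{
    for (handle_t h = 0; h < MAX_HANDLES; h++) {
        if (handle_table[h] == NULL) {
            handle_table[h] = malloc(size);   /* stand-in for the real allocator */
            handle_size[h]  = size;
            return h;
        }
    }
    return (handle_t)-1;                      /* table full */
}

void *mm_deref(handle_t h)                    /* the extra load every access pays */
{
    return handle_table[h];
}

/* Called by the memory manager while compacting: clients notice nothing. */
void mm_move(handle_t h, void *new_location)
{
    memcpy(new_location, handle_table[h], handle_size[h]);
    handle_table[h] = new_location;
}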

Re: Multi-core CPUS

Posted: Mon May 04, 2009 12:33 am
by Brendan
Hi,
Benk wrote:
Brendan wrote:
Benk wrote: CPU intensive benchmarks are meaningless unless they use OS calls (which the papers show are MUCH cheaper with Singularity).
Wrong. CPU intensive code running under software isolation (with no API calls) suffers additional overhead (caused by things like array bounds checking, etc). The paragraph from the Singularity paper that you quoted mentions this ("the runtime overhead for safe code is under 5%").
Are you sure I'm wrong? It depends: if I'm running a tight 1-line CPU benchmark, how can this checking happen? Or even multiple loops, but with all the code in 1 memory page? It only happens on larger benchmarks and when accessing new data structures (pages), hence it really is a memory test, not a CPU one.
I'd be more worried about the performance of actual software (not micro-benchmarks). For normal software, the overhead of software isolation would still apply in most/all cases. This extra overhead isn't just bounds checking on arrays (which can be eliminated by the compiler in some cases) - there's also other overhead. For example, later in your post you say "the indirection means an intermediate MM can remap memory and hence no fragmentation" but you assume this indirection is free?
Benk wrote:
Brendan wrote:
Benk wrote: With regard to paging I think:
1) 4K pages are too much overhead for most applications these days. Singularity in the test uses 4k pages.
The problem with large pages is there's wastage - if you're using 4 KiB pages and only need 1 KiB of RAM then you have to waste the other 3 KiB of RAM; and if you're using 1 GiB pages the OS will run out of RAM before it can display a login screen (and then it'll need to swap entire 1 GiB chunks to disk).
This would only affect really small embedded OSs. How many apps use less than 4k, or even 500K?
For (32-bit) Vista, "notepad.exe" is 147 KiB and "calc.exe" is 172 KiB (but that does include executable file headers; and doesn't include uninitialized data sections, any DLLs they use or anything else loaded at run-time; and seems large to me).

For a micro-kernel (where device drivers, etc. are run as processes) there are a lot more small processes - device drivers for things like serial, floppy, keyboard, mouse, joystick.

If you're using paging for protection, then you split data into different areas. Pages used by an application's code might be in a "read, execute" area, pages used by DLLs might also be "read, execute" but in a separate area to the application's code (and potentially a different area for each DLL), pages used for some data are "read only", pages used by most data are "read/write", etc. My (32-bit) Vista machine says "calc.exe" uses 1080 KiB while it's running, but that's probably split into many separate areas with an average of 2048 bytes wasted per area.

Also note that 80x86 doesn't support many page sizes - the choices are 4 KiB, 2 MiB, 4 MiB and 1 GiB (but I don't think any sane OS has ever actually used PSE36 so 4 MiB pages aren't likely). For 2 MiB pages and 4 separate areas you'd end up with an average of 4 MiB wasted (1 MiB per area, regardless of how many pages are used in each area).

Also, you might want to look at how many TLB entries different CPUs have for large pages - for example, there might be 16384 TLB entries for 4 KiB pages and only 16 entries for large pages, and they're shared by the process, any DLLs and the kernel. Because of this, using some large pages can make sense, but only using large pages would cause a large number of TLB misses.
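To put rough numbers on it (using the hypothetical entry counts above, not any specific CPU), total TLB coverage is just entries multiplied by page size:

Code:
#include <stdio.h>

int main(void)
{
    /* Hypothetical TLB sizes from the post: many entries for 4 KiB pages,
     * only a handful for large pages. Coverage = entries * page size. */
    unsigned long long small = 16384ULL * 4096;             /* 64 MiB */
    unsigned long long large = 16ULL * 2 * 1024 * 1024;     /* 32 MiB */

    printf("4 KiB-page TLB coverage: %llu MiB\n", small >> 20);
    printf("2 MiB-page TLB coverage: %llu MiB\n", large >> 20);
    return 0;
}

So even though each large-page entry covers far more memory, a workload that only uses large pages can end up thrashing the handful of large-page TLB entries.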
Benk wrote: Take a 401K app: with 4K pages it will use 101 pages (404k); with 16K pages it will use 26 pages (416K). Now you're losing 12k = 3%, BUT your page table is 1/4 of the size, giving you a lot of memory back. So you probably end up losing 1% (so out of 1 Gig of user apps you lose 10M). You also gain performance, as you will have fewer TLB misses and less page table data in the CPU caches.
For a process using 401 KiB that's split into 4 sections, then for 4 KiB pages you'd probably be using 104 pages (416 KiB with 15 KiB wasted), for imaginary 16 KiB pages you'd probably be using 29 pages (464 KiB with 63 KiB wasted), and for 2 MiB pages you'd probably be using 4 pages (8 MiB with 7791 KiB wasted). In addition, for 4 KiB paging you'd probably also be using 4 page tables and one page directory (an extra 20 KiB) so you'd actually be using a total of 436 KiB, while for 2 MiB pages you'd be using no page tables and at least 2 page directories (an extra 8 KiB) so you'd actually be using a total of 8200 KiB.

Of course out of that 401 KiB some of it may be uninitialized data (the ".bss section") that's never used and some of it may be code that's never used, and with the way virtual memory is implemented by most OSs if an entire page is unused then it costs nothing (e.g. code/data not loaded from disk until it's needed, and uninitialized data that re-uses the same read only page full of zeros until it's modified), so with 4 KiB pages virtual memory might save you some pages of RAM (e.g. instead of actually using 432 KiB you might only be using 400 KiB) but with 2 MiB pages the chance of an entire page being unused is zero (for this case) so you save nothing.

So (in this case) would you rather use 400 KiB (with 4 KiB pages), 8200 KiB (with 2 MiB pages), or 401 KiB plus whatever else you need to keep track of who owns which areas (with no paging)?
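If anyone wants to check the arithmetic, here's the same back-of-envelope calculation as a small C program (the worst-case rounding rule and the 20 KiB / 8 KiB paging-structure overheads are the assumptions described above):

Code:
#include <stdio.h>

/* A process with 401 KiB of code+data split across 4 separate areas.
 * Worst case, each area wastes just under one page, so:
 *   pages used = ceil(total / page_size) + (areas - 1)
 * Paging-structure overhead is taken from the figures above:
 * 20 KiB for 4 KiB paging, 8 KiB for 2 MiB paging. */

static unsigned long long pages_worst(unsigned long long total,
                                      unsigned long long page,
                                      unsigned areas)
{
    return (total + page - 1) / page + (areas - 1);
}

int main(void)
{
    const unsigned long long KiB = 1024, total = 401 * KiB;
    const unsigned areas = 4;

    struct { const char *name; unsigned long long page, tables; } cfg[] = {
        { "4 KiB pages",              4 * KiB,       20 * KiB }, /* 4 page tables + 1 page directory */
        { "16 KiB pages (imaginary)", 16 * KiB,      0 },        /* structure overhead not estimated above */
        { "2 MiB pages",              2 * KiB * KiB, 8 * KiB },  /* 2 page directories */
    };

    for (int i = 0; i < 3; i++) {
        unsigned long long n = pages_worst(total, cfg[i].page, areas);
        unsigned long long used = n * cfg[i].page;
        printf("%-26s %3llu pages, %5llu KiB used (%4llu KiB wasted), +%2llu KiB paging structures\n",
               cfg[i].name, n, used / KiB, (used - total) / KiB, cfg[i].tables / KiB);
    }
    return 0;
}

It prints 104 pages (416 KiB, 15 KiB wasted) for 4 KiB pages, 29 pages (464 KiB, 63 KiB wasted) for 16 KiB pages, and 4 pages (8192 KiB, 7791 KiB wasted) for 2 MiB pages - the same figures as above.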
Benk wrote:
Brendan wrote:
Benk wrote: Bad blocks or disk drivers can have a really nasty impact on a machine, and furthermore you need tight coupling between the disk driver and the MM.
If the OS runs out of RAM, then what do you suggest should happen:
  • Do a kernel panic or "blue screen of death" and shut down the OS
  • Terminate processes to free up RAM
  • Let processes fail because they can't get the RAM they need, and let the process terminate itself
  • Everyone doubles the amount of RAM installed every time they run out of RAM (to avoid random denial of service problems caused by the first 3 options), even if they only run out of RAM once in 5 years
  • Allow data to be paged to/from disk
What happens now? If the machine is really out of memory (and not just apparently out because memory is full of cached disk blocks) and pages heavily for longer than a short period, it often becomes unusable and it's time for the reset switch. In Singularity there is no dynamic loading of content, so these things can't happen - the only thing that can happen is that a new app fails to load, which also happens when you run out of swap.
Rather than sending pages that aren't being used anyway to disk and then using that RAM to improve performance where it is needed, you'd rather have worse performance (e.g. smaller file caches) and have applications that refuse to start?
Benk wrote: With servers, people plan to make sure there is no paging anyway. I think paging is something that was important in the early 80s and 90s, not now. For most Windows machines I use, I turn it off and don't have much of a problem (in fact it's much better for my usage pattern).
For servers, people usually try to make sure there's no paging, but they also use swap space just in case (to cover any unusual usage). This depends on what the server is being used for though; and lately people have been using servers for virtualization, where you might have (for e.g.) a real computer with 32 GiB of RAM running 16 virtual machines with 4 GiB of virtual RAM per virtual machine (with "balloon" drivers on the guests, and swapping inside the guests, and swapping in the host).

DDR2 is cheap until you try to go beyond about 64 GiB. Then it becomes insanely expensive because only special motherboards can handle more RAM (and things like ECC start looking a lot more important too). For example, you might need to use a quad-socket Opteron motherboard just to get enough RAM slots (and hyper-transport links) for sixteen 4 GiB DDR2 modules.


Cheers,

Brendan

Re: Multi-core CPUS

Posted: Mon May 04, 2009 7:51 am
by Colonel Kernel
Benk wrote: Since all references are managed (i.e. indirect) the GC (including the OS one) can just repack the memory. It happens all the time with the Java VM and the .NET runtime. The question is: can you repack it efficiently without using VM? Or can you have this VM without paging? People used to do it in the past.
Compacting GC in .NET (and presumably Singularity as well) does not rely on double-indirect references. The GC knows where all the references are because of metadata emitted by the compiler that describes the layout of all the types.
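Roughly speaking, that metadata amounts to a per-type pointer map. Here's a sketch in C of the idea (the names and layout are invented for illustration - this is not the actual .NET or Singularity format):

Code:
#include <stddef.h>

/* Per-type descriptor emitted by the compiler: the object's size plus
 * the offsets of every reference field, so the collector can find all
 * pointers exactly (no conservative scanning, no double indirection). */
typedef struct TypeDescriptor {
    size_t        size;          /* object size in bytes */
    size_t        num_refs;      /* number of reference fields */
    const size_t *ref_offsets;   /* byte offset of each reference field */
} TypeDescriptor;

/* Every heap object begins with a pointer to its descriptor
 * (a stand-in for a real object header / vtable pointer). */
typedef struct Object {
    const TypeDescriptor *type;
} Object;

/* An example managed type: a tree node with two references. */
typedef struct Node {
    Object       header;
    struct Node *left;           /* reference field */
    struct Node *right;          /* reference field */
    int          value;          /* not a reference: the GC skips it */
} Node;

static const size_t node_refs[] = { offsetof(Node, left), offsetof(Node, right) };
static const TypeDescriptor node_type = { sizeof(Node), 2, node_refs };

/* The marking step a tracing collector runs for one object: it reads the
 * descriptor and visits exactly the fields listed in the pointer map. */
void mark_children(Object *obj, void (*mark)(Object *))
{
    const TypeDescriptor *t = obj->type;
    for (size_t i = 0; i < t->num_refs; i++) {
        Object **field = (Object **)((char *)obj + t->ref_offsets[i]);
        if (*field != NULL)
            mark(*field);
    }
}

Because the offsets are exact, a compacting collector can use the same map to update references after moving an object - no extra level of indirection is needed.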

Compacting is only one strategy for GC, and not always the best one. It's good because it's relatively simple and allocation from a compacted heap is really fast, but the result is that collections take longer and have to pause many threads, making it unsuitable for real-time applications (or the kernel itself). For this reason, the Singularity kernel uses a concurrent non-compacting collector.

I dislike the term "VM" in this context because it's too vague. What does it mean to do GC "without using a VM" or "with a VM"? The GC is either missing information about where references are (forcing it to be a "conservative collector", like the Objective-C GC in OS X 10.5) or it isn't (making it like the .NET and Java GCs).