
A few random questions

Posted: Thu Feb 07, 2013 9:26 pm
by BMW
1) Why does Windows(R) (C) (TM) swap pages to the pagefile even when there is plenty of memory available? I have 2GB RAM, and only about 1-1.5GB is being used (including memory that is swapped out), yet a lot of it is in the pagefile... why? Wouldn't it be better to disable paging?

2) If I disable interrupts (cli) for a bit, will this cause the system time to be incorrect? As it will disable the timer interrupt that fires 1000 times a second (in my OS), and this is how I keep track of time.

Re: A few random questions

Posted: Thu Feb 07, 2013 11:18 pm
by NickJohnson
BMW wrote:1) Why does Windows(R) (C) (TM) swap pages to the pagefile even when there is plenty of memory available? I have 2GB RAM, and only about 1-1.5GB is being used (including memory that is swapped out), yet a lot of it is in the pagefile... why? Wouldn't it be better to disable paging?
Those numbers may be deceptive. A page can be "swapped out", in the sense that it is written to the pagefile, without also being evicted from memory. If it is needed again and is still in memory, the system won't go back to disk, as you'd expect. However, pre-writing it does give the system more flexibility: if memory is suddenly needed, the system doesn't have to stop and write that page to disk first, because a copy is already in the pagefile. This allows lower latency when a large amount of memory is allocated unexpectedly. The system will therefore swap out as much as it can while the disk is otherwise idle.
BMW wrote:2) If I disable interrupts (cli) for a bit, will this cause the system time to be incorrect? As it will disable the timer interrupt that fires 1000 times a second (in my OS), and this is how I keep track of time.
Depends on how long. If it's less than the time between interrupts, you should be fine, because the interrupt controller should hold the pending interrupt that you're blocking until you unblock it, so you'll end up with the same number of interrupts in the same time period, just with one offset by a bit. If you disable interrupts for too long, though, multiple timer interrupts may collide and cause problems.

Re: A few random questions

Posted: Fri Feb 08, 2013 1:40 am
by bluemoon
You may also consider re-syncing the time periodically at a larger interval (from the CMOS clock or a network time server).
But anyway, if your timer interrupt is taking too long, there is probably a design flaw.

Re: A few random questions

Posted: Fri Feb 08, 2013 3:54 am
by iansjack
There's a good article on Windows memory management here: http://blogs.technet.com/b/markrussinov ... 55406.aspx

I would guess that the answer to your question lies with reserved, but uncommitted memory. If the memory is reserved, Windows has to somehow set aside space for it (otherwise it couldn't guarantee availability). But it would be wasteful to reserve physical memory so instead it reserves space in the page file. When it actually commits the memory it can choose the appropriate place to store it - RAM or disk - but at least it knows that the space is available somewhere.
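
To make the reserve/commit distinction a bit more concrete, here's a rough user-mode illustration using the public VirtualAlloc API (this only shows the application-level view, not how the memory manager implements it internally):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Reserve 64 MiB of address space: no physical pages are allocated yet,
       and nothing can be read or written until the range is committed. */
    SIZE_T size = 64 * 1024 * 1024;
    void *p = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    if (!p) return 1;

    /* Commit just the first page: only now does it count against the
       commit limit (RAM + pagefile) and become usable. */
    if (!VirtualAlloc(p, 4096, MEM_COMMIT, PAGE_READWRITE)) return 1;
    ((char *)p)[0] = 42;   /* touching it is what finally uses a physical page */

    printf("reserved %zu bytes at %p, committed one page\n", size, p);
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}

Watching the commit charge in Task Manager (or with testlimit) while playing with MEM_RESERVE vs MEM_COMMIT should make the difference fairly obvious.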

Windows memory management is quite sophisticated, so I may well have this completely wrong, but this seems to be a plausible explanation to me. If you play around with Mark Russinovich's "testlimit" program you may learn more. (BTW, if you are at all interested in Windows internals or troubleshooting you should read Mark's blog - he really knows what he is talking about.)

Re: A few random questions

Posted: Fri Feb 08, 2013 6:45 pm
by BMW
iansjack wrote:There's a good article on Windows memory management here: http://blogs.technet.com/b/markrussinov ... 55406.aspx

I would guess that the answer to your question lies with reserved, but uncommitted memory. If the memory is reserved, Windows has to somehow set aside space for it (otherwise it couldn't guarantee availability). But it would be wasteful to reserve physical memory so instead it reserves space in the page file. When it actually commits the memory it can choose the appropriate place to store it - RAM or disk - but at least it knows that the space is available somewhere.

Windows memory management is quite sophisticated, so I may well have this completely wrong, but this seems to be a plausible explanation to me. If you play around with Mark Russinovich's "testlimit" program you may learn more. (BTW, if you are at all interested in Windows internals or troubleshooting you should read Mark's blog - he really knows what he is talking about.)
Thanks heaps, his blog looks very interesting!

Re: A few random questions

Posted: Sun Feb 10, 2013 11:48 am
by jammmie999
BMW wrote:2) If I disable interrupts (cli) for a bit, will this cause the system time to be incorrect? As it will disable the timer interrupt that fires 1000 times a second (in my OS), and this is how I keep track of time.
Sorry to hijack, but is that the best way to keep track of time? Will the PIT keep timing that accurately, especially when disabling and re-enabling interrupts? I ask the RTC for the time each time the PIT interrupt fires. This doesn't seem to slow anything down, and it would only take a couple of nanoseconds to read from the RTC, right? Or would it be better to artificially increment time using the PIT?

Thanks

Re: A few random questions

Posted: Sun Feb 10, 2013 1:21 pm
by bluemoon
The general idea is that doing nothing is faster than doing something.
As you won't need the accurate time every time the timer fires (that would be insane), it's probably enough to just count the interrupts.

To avoid clock drift, it would be good to re-sync time once in a while (like every hour).

By the way, if you disable interrupts with CLI, it only tells the CPU not to handle the interrupt, and it will stay pending on the controller. However, if a second interrupt occurs while one is pending, the second one will be dropped. Usually this won't happen, as IRQ handlers tend to return quickly; however, if there is SMM stuff going on you may miss a few ticks.
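
As a very rough sketch of the "just count the interrupts and re-sync once in a while" idea (the 1000 Hz rate and the helper read_rtc_as_unix_seconds() are made up for the example):

#include <stdint.h>

#define PIT_HZ 1000ULL

/* Hypothetical helper provided elsewhere in the kernel. */
extern uint64_t read_rtc_as_unix_seconds(void);

static volatile uint64_t ticks;        /* incremented once per PIT IRQ        */
static volatile int      need_resync;  /* set once an hour, handled elsewhere */
static uint64_t          boot_seconds; /* wall clock, read from the RTC once  */

void time_init(void)
{
    boot_seconds = read_rtc_as_unix_seconds();
}

/* PIT IRQ handler: just count; anything slow is deferred. */
void pit_irq_handler(void)
{
    ticks++;
    if (ticks % (PIT_HZ * 60 * 60) == 0)   /* roughly once per hour */
        need_resync = 1;
}

/* Called from a normal (non-IRQ) context, e.g. the idle loop. */
void maybe_resync(void)
{
    if (need_resync) {
        need_resync = 0;
        /* Re-anchor so that boot_seconds + ticks/PIT_HZ matches the RTC again. */
        boot_seconds = read_rtc_as_unix_seconds() - ticks / PIT_HZ;
    }
}

/* Milliseconds since the epoch; the "% PIT_HZ" trick only works because
   the timer happens to run at 1000 Hz in this sketch. */
uint64_t current_time_ms(void)
{
    return (boot_seconds + ticks / PIT_HZ) * 1000 + ticks % PIT_HZ;
}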

Re: A few random questions

Posted: Sun Feb 10, 2013 11:33 pm
by Brendan
Hi,
jammmie999 wrote:Sorry to hijack, but is that the best way to keep track of time? Will the PIT keep timing that accurately, especially when disabling and re-enabling interrupts?
In general, you should never need to disable IRQs for more than about 10 instructions anyway. The only case where this isn't really possible is during software task switches, where it might be as bad as 200 instructions. For a 33 MHz CPU (assuming 3 cycles per instruction) that might work out to a worst case of about 19 us of "jitter" caused by extra IRQ latency (caused by disabling IRQs).

For timing, there's 2 different things - keeping track of real time, and measuring durations (e.g. how long until a sleeping task should wake up, how long until the currently running task has used all the time it was given, how long until a network connection should timeout, etc).

For keeping track of real time, you don't want any IRQ to be involved at all. Instead you want to read a counter (like the CPU's TSC, the HPET main counter or the ACPI counter); and then have something like NTP (Network Time Protocol), and/or maybe the RTC's update IRQ to keep it synchronised. The important thing is getting good precision (e.g. nanoseconds rather than milliseconds) without the overhead of thousands of pointless IRQs constantly interrupting the CPU.
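
As a rough sketch of the counter-based approach, assuming an invariant TSC whose frequency has already been measured (all the variable names here are made up):

#include <stdint.h>

/* Assumed to be set up once during boot. */
static uint64_t tsc_at_boot;   /* TSC value captured at boot                   */
static uint64_t tsc_hz;        /* measured TSC frequency                       */
static uint64_t ns_at_boot;    /* nanoseconds since 1970, seeded from RTC/NTP  */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Wall time in nanoseconds, with no timer IRQ involved at all.
   Split into whole seconds and remainder so the multiply can't overflow. */
uint64_t tsc_ns_since_1970(void)
{
    uint64_t delta = rdtsc() - tsc_at_boot;
    return ns_at_boot + (delta / tsc_hz) * 1000000000ULL
                      + ((delta % tsc_hz) * 1000000000ULL) / tsc_hz;
}

NTP (or the RTC's update IRQ) would then nudge ns_at_boot and/or the measured tsc_hz occasionally to keep this synchronised.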

For measuring durations, you want dynamically programmed delay/s. For example, if the next sleeping task to wake up is meant to wake up in 12345 us, then you'd set the PIT or local APIC timer or HPET to generate an IRQ in 12345 us (or as close to that as you can get). For a 1000 Hz PIT, you'd have 12 pointless IRQs followed by an IRQ that is 900 us too late because you had to round up. Basically you want to use the local APIC timer, HPET or the PIT, and you want it in "one shot" mode and not "fixed frequency mode".
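
For example, a minimal sketch of arming the PIT for a single delay might look like this (outb() stands for whatever port-write helper your kernel has, and I'm assuming IRQ0 is already unmasked at the PIC):

#include <stdint.h>

#define PIT_FREQ_HZ 1193182ULL              /* PIT input clock, ~838 ns per tick */

/* Hypothetical port I/O helper (usually a one-line inline asm wrapper). */
extern void outb(uint16_t port, uint8_t value);

/* Arm channel 0 in mode 0 ("interrupt on terminal count") so a single IRQ0
   fires after roughly 'ns' nanoseconds (clamped to what fits in 16 bits). */
void pit_arm_one_shot(uint64_t ns)
{
    uint64_t count = (ns * PIT_FREQ_HZ) / 1000000000ULL;
    if (count == 0)     count = 1;
    if (count > 0xFFFF) count = 0xFFFF;     /* max is roughly 55 ms per shot */

    outb(0x43, 0x30);                       /* channel 0, lo/hi byte, mode 0  */
    outb(0x40, count & 0xFF);               /* low byte of the reload value   */
    outb(0x40, (count >> 8) & 0xFF);        /* high byte starts the countdown */
}

The IRQ0 handler would then either wake the sleeping task or re-arm the timer for the next pending expiry, rather than blindly ticking at a fixed rate.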

Note: For very accurate delays (e.g. for device drivers, etc), you can set the timer's IRQ to occur just before the delay expires and then poll something like the TSC until the exact time when the delay should expire. This approach can give you a "nano_sleep()" that is accurate to within 1 ns on modern hardware. In this case, the more accurate the first timer's IRQ is the less CPU time you waste polling the TSC (e.g. you don't want to poll for up to 1 ms when you could poll for up to 1 us instead).
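
A rough sketch of that "let the timer IRQ get you close, then poll" approach, with made-up names for the scheduler call and the high-precision clock:

#include <stdint.h>

/* Hypothetical kernel services; names are invented for the example. */
extern uint64_t ns_since_1970(void);                         /* TSC/HPET-backed clock */
extern void block_current_task_until(uint64_t ns_deadline);  /* timer-IRQ based sleep */

/* Sleep until 'deadline': let other tasks run until the timer IRQ wakes us a
   little early, then burn the last short stretch polling the precise clock. */
void nano_sleep_until(uint64_t deadline)
{
    /* Slack covers the timer's granularity plus wake-up/scheduling latency;
       the value here is just a guess for the sketch. */
    const uint64_t slack_ns = 10000;        /* 10 us */

    if (deadline > ns_since_1970() + slack_ns)
        block_current_task_until(deadline - slack_ns);

    while (ns_since_1970() < deadline)
        ;                                   /* short busy-wait right at the end */
}

The more accurate the timer IRQ is, the smaller slack_ns can be and the less CPU time the final busy-wait costs.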

Now; for modern computers you're likely to have a usable TSC, plus local APIC timer and HPET; so getting close to 1 ns precision for everything (with no pointless IRQs/overhead) should be reasonably easy.

For old hardware (if you've only got the PIT and RTC to work with), you can set the PIT to "one shot" mode and get 838 ns precision for delays, and read the PIT's counter to get the current time (where "current time = time when the PIT count was last set + (count that was set last time - current count read from PIT) * 838 ns") with 838 ns precision. The problem with this is that there's a lot of extra overhead reading/writing IO ports; especially for reading the count in the default "low byte then high byte" mode, because the count can decrement past a 256-tick boundary between the two reads (e.g. so instead of reading 0x1200 or 0x11FF for the count you actually read 0x1100), and to avoid that you need to send a "latch" command first (so reading the count becomes an IO port write followed by 2 IO port reads).
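
For example, the "latch, then read low and high byte" sequence and the current-time calculation might look roughly like this (inb()/outb() are assumed port I/O helpers, and the two variables are updated whenever the count is reprogrammed):

#include <stdint.h>

extern void    outb(uint16_t port, uint8_t value);   /* hypothetical helpers */
extern uint8_t inb(uint16_t port);

static uint64_t ns_at_last_reload;    /* clock value when the count was last set */
static uint16_t last_reload_count;    /* the count that was written at that time */

/* Latch channel 0 so the low and high bytes belong to the same instant. */
static uint16_t pit_read_count(void)
{
    outb(0x43, 0x00);                 /* latch command for channel 0 */
    uint8_t lo = inb(0x40);
    uint8_t hi = inb(0x40);
    return (uint16_t)((hi << 8) | lo);
}

/* "current time = time of last reload + elapsed ticks * 838 ns" */
uint64_t pit_current_time_ns(void)
{
    uint16_t now = pit_read_count();
    return ns_at_last_reload + (uint64_t)(uint16_t)(last_reload_count - now) * 838ULL;
}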

To reduce the overhead of setting and getting the count, you can set the PIT to "low byte only" mode or "high byte only" mode. For "low byte only" mode the maximum delay would be 256 * 838 ns ≈ 214 us, which is too little (e.g. a 1234 us delay would become five ~214 us delays followed by a ~164 us delay, and you'd have to set the count and send EOI to the PIC chip 6 times). For "high byte only" mode the maximum delay is about 55 ms (much better) and you'd get ~214 us precision out of it; which is a more reasonable compromise between precision and overhead (especially for old hardware that doesn't have better timers).
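
The only real difference is the access-mode bits in the PIT command byte; something like this (again with an assumed outb() helper, channel 0, mode 0):

#include <stdint.h>

extern void outb(uint16_t port, uint8_t value);   /* hypothetical helper */

void pit_arm_lobyte_only(uint8_t count)           /* 1..255 ticks, max ~214 us     */
{
    outb(0x43, 0x10);                             /* 00 01 000 0 = lobyte only     */
    outb(0x40, count);                            /* one data write arms the timer */
}

void pit_arm_hibyte_only(uint8_t count_high)      /* ~214 us steps, max ~55 ms     */
{
    outb(0x43, 0x20);                             /* 00 10 000 0 = hibyte only     */
    outb(0x40, count_high);                       /* interpreted as count * 256    */
}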

Basically what I'm saying here is that a good OS wouldn't just set the PIT to 1000 Hz and use that for everything; but would determine which timers are available and use what it can get in ways that improve precision and reduce overhead.
jammmie999 wrote:I ask the RTC for the time each time the PIT interrupt fires. This doesn't seem to slow anything down, and it would only take a couple of nanoseconds to read from the RTC, right? Or would it be better to artificially increment time using the PIT?
You should only read the time and date from the RTC once during boot and keep track of the time yourself after that. For example, (for a bad/simple OS) during boot you might read the RTC's time and date and use it to set a "nanoseconds since 1970, UTC" variable, then after that you might use the RTC's "update IRQ" to add 1000000000 to that "nanoseconds since 1970, UTC" variable each second. Of course if you were doing that it'd be trivial to use the RTC's "periodic IRQ" instead, and add 250000000 to your "nanoseconds since" variable four times per second, or add 1953125 to it 512 times per second.
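
A minimal sketch of that "read the RTC once, then let the update IRQ advance a nanoseconds counter" scheme (rtc_read_unix_seconds() and cmos_read() are made-up helpers):

#include <stdint.h>

extern uint64_t rtc_read_unix_seconds(void);   /* hypothetical CMOS/RTC helper    */
extern uint8_t  cmos_read(uint8_t reg);        /* hypothetical CMOS register read */

static volatile uint64_t ns_since_1970;

/* Done exactly once, early during boot. */
void wall_clock_init(void)
{
    ns_since_1970 = rtc_read_unix_seconds() * 1000000000ULL;
}

/* RTC "update ended" IRQ (IRQ8), fires once per second. */
void rtc_update_irq_handler(void)
{
    ns_since_1970 += 1000000000ULL;
    (void)cmos_read(0x0C);   /* read status register C or the RTC won't fire again */
}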

More importantly, later on you'd be able to replace that variable with something much more precise; like the HPET counter (e.g. "nanoseconds since 1970 = nanoseconds_since_1970_from_RTC_at_boot + (HPET_current_count - HPET_count_at_boot) * 1000000000 / HPET_frequency"); without causing problems for any code that asks the OS for the current time, and without forcing applications to have some sort of retarded scaling factor to compensate for poor design. ;)
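
That formula translates fairly directly into code; for instance (all the HPET values here are assumed to have been captured once at boot, and the division is split so the multiply can't overflow 64 bits):

#include <stdint.h>

/* Assumed to be filled in during boot. */
extern uint64_t ns_since_1970_from_RTC_at_boot;
extern uint64_t HPET_count_at_boot;
extern uint64_t HPET_frequency;                  /* main counter ticks per second */
extern uint64_t hpet_read_main_counter(void);    /* MMIO read of the main counter */

uint64_t hpet_ns_since_1970(void)
{
    uint64_t delta = hpet_read_main_counter() - HPET_count_at_boot;
    return ns_since_1970_from_RTC_at_boot
         + (delta / HPET_frequency) * 1000000000ULL
         + ((delta % HPET_frequency) * 1000000000ULL) / HPET_frequency;
}

Code that asks the OS for the current time never notices whether the backing counter is the HPET, the TSC, or (on old hardware) the PIT count trick described above.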

The only other case where it's OK to read the time and date from the RTC is when the computer is coming out of a deep power saving state (e.g. where almost everything except RAM was turned off) and you've lost track of time because your normal timer IRQ/s were disabled to save power.


Cheers,

Brendan

Re: A few random questions

Posted: Mon Feb 11, 2013 2:05 am
by rdos
Brendan wrote: In general, you should never need to disable IRQs for more than about 10 instructions anyway. The only case where this isn't really possible is during software task switches, where it might be as bad as 200 instructions.
It shouldn't take 200 instructions to load a new task state. The only portion of the scheduler that needs to run with interrupts disabled is the part that loads the registers of the new task and transfers control to it.

My worst case is when the scheduler switches between long mode and protected mode (and the reverse). Not that this takes a long time, but it will do a complete TLB flush since paging is turned off and on.

Re: A few random questions

Posted: Mon Feb 11, 2013 6:00 am
by Brendan
Hi,
rdos wrote:
Brendan wrote: In general, you should never need to disable IRQs for more than about 10 instructions anyway. The only case where this isn't really possible is during software task switches, where it might be as bad as 200 instructions.
It shouldn't take 200 instructions to load a new task state. The only portion of the scheduler that needs to run with interrupts disabled is the part that loads the registers of the new task and transfers control to it.
You're having difficulty with comprehension again - "it might be as bad as 200 instructions" is not the same as "it must be as bad as 200 instructions".

Also note that I'd assume your scheduler is lame and has race conditions. For example, if some code decides to switch to a low priority task, then an IRQ occurs that unblocks a high priority task before the task switch starts; then your OS is probably silly enough to continue switching to the low priority task instead of the high priority task.

I'd also be tempted to assume you don't save/load things like debug registers and performance monitoring counter configuration and state (for "per thread debugging/profiling"); or read the TSC/HPET/whatever during task switches and track how much time each thread has consumed; or adjust the local APIC's TPR so that high priority (real time?) tasks don't get interrupted by low priority IRQs; or avoid saving the old task's FPU/MMX/SSE state if it wasn't modified, or...


Cheers,

Brendan

Re: A few random questions

Posted: Mon Feb 11, 2013 4:10 pm
by rdos
Brendan wrote: Also note that I'd assume your scheduler is lame and has race conditions. For example, if some code decides to switch to a low priority task, then an IRQ occurs that unblocks a high priority task before the task switch starts; then your OS is probably silly enough to continue switching to the low priority task instead of the high priority task.
Not likely. The first thing done after disabling interrupts in preparation for loading a new task is checking for threads in the wake-up list (which IRQs trigger). If it is not empty, the scheduler will re-enter the scheduling loop.

I don't think there are any race conditions. The scheduler has a lock which it acquires that makes it unnecessary to disable interrupts while scheduling. This lock must be released before registers for the new task are loaded, and instead interrupts must be disabled until the load is completed.
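
In pseudo-C, the ordering looks roughly like this (all the names are invented purely to illustrate the sequence described above, not taken from the actual code):

struct task;

/* Invented helpers standing in for the real scheduler primitives. */
extern void spin_lock(volatile int *l);
extern void spin_unlock(volatile int *l);
extern int  wakeup_list_empty(void);
extern void move_wakeups_to_ready_queue(void);
extern struct task *pick_next_task(void);
extern void load_task_state_and_jump(struct task *t);   /* never returns */

static volatile int sched_lock;

void schedule(void)
{
    spin_lock(&sched_lock);            /* scheduling itself runs with IRQs enabled */
    for (;;) {
        struct task *next = pick_next_task();

        __asm__ __volatile__("cli");   /* short interrupts-off window starts here  */
        if (wakeup_list_empty()) {
            spin_unlock(&sched_lock);             /* release before the load...    */
            load_task_state_and_jump(next);       /* ...IRQs return via iret/popf  */
        }

        /* An IRQ woke something up while we were choosing: go around again
           (a real scheduler would also put 'next' back on the ready queue). */
        __asm__ __volatile__("sti");
        move_wakeups_to_ready_queue();
    }
}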
Brendan wrote: I'd also be tempted to assume you don't save/load things like debug registers and performance monitoring counter configuration and state (for "per thread debugging/profiling"); or read the TSC/HPET/whatever during task switches and track how much time each thread has consumed; or adjust the local APIC's TPR so that high priority (real time?) tasks don't get interrupted by low priority IRQs; or avoid saving the old task's FPU/MMX/SSE state if it wasn't modified, or...
I support time consumed per thread in 1us increments. Debug registers are loaded when needed (that doesn't need to be done with interrupts disabled, neither does consumed time per thread).

Re: A few random questions

Posted: Mon Feb 18, 2013 7:21 am
by bluemoon
rdos wrote:I don't think there are any race conditions. The scheduler has a lock which it acquires that makes it unnecessary to disable interrupts while scheduling. This lock must be released before registers for the new task are loaded, and instead interrupts must be disabled until the load is completed.
If that is one single big lock that is taken every time the scheduler does its work, and held for a long duration, it is very likely to cause contention problems on a multi-core system.

Did you by any chance mix that up with a re-entrancy problem?

By the way, a context switch should not require any lock. I suppose a lock (or an array of locks) is required only for the task picker.

Re: A few random questions

Posted: Mon Feb 18, 2013 2:18 pm
by rdos
bluemoon wrote: If that is one single big lock that is taken every time the scheduler does its work, and held for a long duration, it is very likely to cause contention problems on a multi-core system.
The lock is per core, not global. That's because each core runs its own instance of the scheduler, and uses a local thread list to pick threads. Because there is no stealing of threads, the scheduler doesn't need a global lock. The scheduler does need to (briefly) acquire a global lock when posting threads to the global scheduler list, and when moving them to the local list, but those are short locks protecting a list.

The lock is basically there to protect the scheduler from wake-ups from IRQs, and to make sure those are activated as soon as possible, even if they occur while the scheduler is running. Part of releasing this lock is checking for threads that were woken up by IRQs. Additionally, the lock makes sure the scheduler cannot be re-entered if a new IRQ occurs that calls the scheduler.
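
In rough outline, the data layout is something like this (names invented just to show the structure: per-core lists that need no shared lock, plus one short-lived global lock around the hand-over list):

#include <stdint.h>

struct thread;
struct thread_list { struct thread *head, *tail; };

/* Invented list/lock helpers. */
extern void spin_lock(volatile int *l);
extern void spin_unlock(volatile int *l);
extern void list_append(struct thread_list *l, struct thread *t);
extern void list_splice(struct thread_list *dst, struct thread_list *src);

struct per_cpu_sched {
    struct thread_list ready;         /* only ever touched by its own core   */
    volatile int       local_lock;    /* guards against this core's own IRQs */
};

static struct thread_list global_posted;   /* threads posted for any core to take */
static volatile int       global_lock;     /* held only while splicing the list   */

/* Another core (or an IRQ) hands a thread over: a short global critical section. */
void post_thread(struct thread *t)
{
    spin_lock(&global_lock);
    list_append(&global_posted, t);
    spin_unlock(&global_lock);
}

/* Each core's scheduler periodically pulls posted threads into its own list. */
void pull_posted_threads(struct per_cpu_sched *cpu)
{
    spin_lock(&global_lock);
    list_splice(&cpu->ready, &global_posted);
    spin_unlock(&global_lock);
}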

Re: A few random questions

Posted: Tue Feb 19, 2013 9:32 am
by Luis
Hi,
Brendan wrote:Note: For very accurate delays (e.g. for device drivers, etc), you can set the timer's IRQ to occur just before the delay expires and then poll something like the TSC until the exact time when the delay should expire. This approach can give you a "nano_sleep()" that is accurate to within 1 ns on modern hardware. In this case, the more accurate the first timer's IRQ is the less CPU time you waste polling the TSC (e.g. you don't want to poll for up to 1 ms when you could poll for up to 1 us instead).
Interesting case. I've never thought about having such precision before. Could you please give me 1 or 2 examples of why a device driver would need under 1 ns precision instead of the (usually) 70 ns from the HPET, and waste up to 69 ns polling?


Cheers,

Luís

Re: A few random questions

Posted: Wed Feb 20, 2013 1:26 am
by Brendan
Hi,
Luis wrote:
Brendan wrote:Note: For very accurate delays (e.g. for device drivers, etc), you can set the timer's IRQ to occur just before the delay expires and then poll something like the TSC until the exact time when the delay should expire. This approach can give you a "nano_sleep()" that is accurate to within 1 ns on modern hardware. In this case, the more accurate the first timer's IRQ is the less CPU time you waste polling the TSC (e.g. you don't want to poll for up to 1 ms when you could poll for up to 1 us instead).
Interesting case. I've never thought about having such precision before. Could you please give me 1 or 2 examples of why a device driver would need under 1 ns precision instead of the (usually) 70 ns from the HPET, and waste up to 69 ns polling?
Lots of devices require small delays for various reasons. For an example of how often these tiny delays are needed; you can do "grep sleep */*.c */*/*.c */*/*/*.c */*/*/*/*.c" in the "linux/drivers" directory of Linux' source code. There's a whole bunch of different delay functions ("msleep()", "msleep_interruptible()", "usleep()", "ssleep()", "ata_msleep()", etc) and if you add them all up you'll find that tiny delays are being used in several thousand places.

The typical scenario is that the datasheet/specification for the device says something like "after doing <foo> you need to wait for <delay> before doing <bar>", so the person writing the driver gets all cautious and doubles the value of <delay> just in case, and the person who implemented the kernel's "nanosleep()" (or "msleep()" or "usleep()" or whatever) had to make sure the delay is "<delay> or greater". The end result is that (e.g.) if the datasheet/specification for the device says "1 us delay" you might end up with a 200 us delay in practice; and all these "extra big" tiny delays add up and make hardware seem slower than it should be.

Worse, if you look into it (e.g. the section about "short delays" here) you'll find that typically "msleep()" just calls "usleep()" and "usleep()" is a busy-wait that just burns 100% of CPU time polling. Now try "grep "msleep(100)" */*.c */*/*.c */*/*/*.c */*/*/*/*.c" and think about those 382 places where drivers are burning 100 ms of CPU time (of course you can do similar searches for "msleep(150)", "msleep(200)", etc; for example you'll probably find 85 occurrences of "msleep(500)" (!) ).

So; basically (using Linux as a guide) there's 4 problems:
  • There's lots of "different" functions for small delays that all do the same thing. Common sense would suggest replacing all of them with a single delay function.
  • All of these functions waste 100% of CPU time, even when idiots use them for excessively huge delays. Common sense would suggest running other tasks where possible, to avoid wasting CPU time for no reason.
  • Device driver programmers can't know how fast the computer is going to be (or how precise the scheduler's timer will be), and can't know when to use something like a "msleep()" and when to use something like "schedule_timeout()". Common sense would suggest that device driver programmers shouldn't need to make assumptions in the first place.
  • Device driver programmers don't have much confidence in the accuracy of the kernel's timing and exaggerate their delays. Common sense would suggest that providing a higher precision interface (even if the scheduler's timer isn't as precise) would reduce the tendency to exaggerate delays (e.g. "3 units of exaggeration" might be 3 ms or 3 ns depending on the units being used).
If you add this all up, you end up with a "nanosleep()" that tells the scheduler to run other tasks for as long as possible (but not longer), followed by a busy wait.

Once you get this far you end up with the situation I described earlier - you want "nanosleep()" to work in nanoseconds (and not some stupid "jit_delay * HZ" mess, and not something like "msleep()" that makes some people want something more precise like "usleep()") regardless of how precise the kernel's/scheduler's timing is; and then you want the scheduler's timer to be as accurate as possible to reduce the time spent busy waiting *and* the total time spent waiting.

Of course you'd also want 2 different versions of this - one that is less accurate and never does the busy waiting (it runs other tasks for "delay rounded up"), and one that is more accurate and does do the busy waiting (it runs other tasks for "delay rounded down"). That way if someone decides to sleep for 123 minutes and doesn't care about precision so much they can call "nanosecond_fast( 123*60*1000*1000*1000 );" and if someone wants to sleep for 123 nanoseconds and does care about precision they can call "nanosecond_precise( 123 );".

Now; I doubt that any specific device actually requires 1 ns precision, because this would be silly - e.g. even if an OS could guarantee 1 ns precision, nothing prevents an IRQ from occurring immediately after the delay has expired and messing it up. What I'm saying is that an OS should support the best timing hardware can support to improve efficiency (e.g. tiny busy-waits instead of huge busy-waits) and reduce hassles/problems (and to avoid ending up with a crappy hack like Linux).


Cheers,

Brendan