
QEMU and Bochs differences

Posted: Tue Jan 31, 2012 10:07 pm
by eryjus
Hi,

I came across something curious while exercising my semaphore code. My test basically prints the characters from ASCII 33 to 126 a few hundred times, then waits and repeats. QEMU and Bochs handle the order of the printed characters differently. QEMU prints the characters in ascending order, as I would expect, and ran for more than 300,000 iterations without my spot checks finding anything out of order. Bochs, however, botches the order, though the order appears consistent within each iteration (it may differ from iteration to iteration).

Here is a snapshot from a Bochs execution (notice the digits, as an easy call-out):
:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+
.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{
|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklm
nopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`
^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQ
RTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABC
DEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/021453
6798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(
')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwv
xyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehi
jklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\
[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLN
OJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?
@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/02
14536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#
"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrs
tuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdf
gehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHIKMLNOJPQRTSVUW
XZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<;=>?@ABCDEFGHI
KMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&#"%$(')*+.,-/0214536798:<
;=>?@ABCDEFGHIKMLNOJPQRTSVUWXZY\[]_`^bacdfgehijklmnopqrstuwvxyz{|~!}&"#%$(')*+.,
-/0214536798:<;=>?@ABCDEFGHIKMLONJPRQTSUVWXYZ\[]_`^bacdfgehijklmnpoqrstuwvxyz{|~
}
iteration complete: 3
Spur: 0 Prc: 1 H:0 N:0 L:0 W:0 F:1023
Later in the same execution I get "1032546789" for the digits.

Is there a config setting I may be missing in Bochs? Or do I really have a bug in my code that I am not seeing and QEMU is masking? I'm looking for a pointer in the right direction... I have been working from the position that QEMU is behaving correctly, but I want to be sure before I continue.

I am using a simple round-robin scheduling algorithm with 3 possible priorities. All the task processes run at the same "normal" priority, while the code that creates them runs at a low priority.

Also, if it helps, my development environment is FC15 and my Bochs version is 2.4.5-4.fc15.

The code I am exercising with is:

Code: Select all

// Exercise the Kernel:
	SemID sem = SemCreate(0);
	
	while (1) {
		uint32 i;

		for (i = 33; i <= 126; i ++) {
			Resume(CreateProcess(TASK_FUNC(PrintLetter), 2, i, sem));
		}
		
		while (1 < processCount) {
			if (0 > SemCount(sem)) Signal(sem);
			else {
				asm volatile("sti");
				asm volatile("hlt");
			}
		}
		
		iteration ++;
		kPutStr("\niteration complete: "); kPutDec(iteration);   
		
		for (i = 0; i < 65000000; i ++) ; // do nothing
	}
My PrintLetter() function is simple:

Code: Select all

#include "kernel.h"

void PrintLetter(char l, SemID sem)
{
	int i;
	
	for (i = 0; i < 490; i ++) {
		Wait(sem);
		kPutCh(l);
	}
}
I'm more than happy to share other code if it helps... Just tell me what you want to see. :)

Thanks for the help!

Re: QEMU and Bochs differences

Posted: Tue Jan 31, 2012 11:14 pm
by Brendan
Hi,
eryjus wrote:Or do I really have a bug in my code that I am not seeing and QEMU is masking it? I'm looking for a point in the right direction.... I have been working from the position that QEMU is behaving correctly, but I want to be sure before I continue.
Qemu will emulate one CPU for about 10 ms, then emulate the next CPU for 10 ms, etc; where 10ms could be several hundred million emulated instructions.

Bochs will emulate "<quantum>" instructions for one CPU, then emulate "<quantum>" instructions for the next CPU, etc; where "<quantum>" is typically about 5 instructions.

On a real computer, different CPUs run at the same time.

Now imagine what happens if you've got a very dodgy spinlock, like this:

Code: Select all

.wait:
    cmp dword [somewhere],0
    jne .wait
    mov dword [somewhere],1
On Qemu, because such a huge number of instructions are emulated by one CPU before it switches to another CPU, the chance of a different CPU accessing "[somewhere]" at the wrong time is very low, and it's likely that you won't see a problem. On Bochs, because far fewer instructions are emulated between CPU switches, it's likely that you will see a problem fairly often. On a real computer there's even more chance of seeing a problem than there is on Bochs.

Now imagine what happens if you change it to this:

Code: Select all

.wait:
    bts dword [somewhere],0     ;Set the first bit and return previous bit's state in carry flag
    jc .wait
Here, because both Qemu and Bochs emulate one instruction at a time, you will never see a problem at all. However, on a real computer, 2 CPUs could read "[somewhere]" at the same time, then both update the value at "[somewhere]" and both think they have the lock. On real CPUs you'd have to use "lock bts dword [somewhere],0" to ensure that only one CPU can do the "bts" instruction at a time.
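
In C terms, the kind of atomic test-and-set acquire/release this implies might look roughly like the sketch below. This is only an illustration using GCC's __sync builtins (which compile to a locked xchg on x86); the spin_lock/spin_unlock names are illustrative and not part of anyone's actual kernel here:

Code: Select all

typedef volatile unsigned int spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
    /* __sync_lock_test_and_set compiles to a locked "xchg" on x86, so it is
       atomic on real SMP hardware, not just when an emulator executes one
       instruction at a time. Spin while the previous value was non-zero. */
    while (__sync_lock_test_and_set(lock, 1))
        ;
}

static inline void spin_unlock(spinlock_t *lock)
{
    /* Store 0 with release semantics to free the lock. */
    __sync_lock_release(lock);
}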

Of course this is just a silly example to demonstrate what can happen when there are race conditions and/or dodgy locking - I'm sure your problem looks very different.


Cheers,

Brendan

Re: QEMU and Bochs differences

Posted: Tue Jan 31, 2012 11:56 pm
by eryjus
Brendan,

Thank you for your reply which is thoughtful as always.

I have not done anything to "turn on" a second CPU, even if more than one is being emulated. Though I have not read the section on activating additional CPUs in the Intel guides, I assumed activating them was an explicit task. I will have to read up on that regardless of the root cause of the problem.

Thanks

Re: QEMU and Bochs differences

Posted: Wed Feb 01, 2012 12:36 am
by Brendan
Hi,
eryjus wrote:I have not done anything to "turn on" a second CPU, even if more than one is being emulated. Though I have not read the section on activating additional CPUs in the intel guides, I assumed activating them was an explicit task. I will have to read up on that regardless of the underlying root of the problem.
Doh - I assumed, and we all know how that works.. :)

You do have to activate other CPUs explicitly, and you can't do that by accident.

That only leaves timing-related problems in the semaphore code, IRQ handlers, or scheduler. I can't think of anything that would fail to show symptoms on Qemu unless it's sensitive to CPU speed (as Bochs is a little slower than Qemu). Maybe on Bochs the process receives the character but (sometimes) doesn't quite get enough time to print it before the scheduler's IRQ causes a task switch.


Cheers,

Brendan

Re: QEMU and Bochs differences

Posted: Sat Feb 04, 2012 5:45 pm
by eryjus
Brendan,

Thank you for your guidance. It turns out I had a couple of things to deal with.

I did find a bug in my semaphore code in picking the right process to reschedule to when a blocked process is released. I remembered your reply to my first post (http://forum.osdev.org/viewtopic.php?f=1&t=24452) and your coaching on how to code Reschedule():
I recommend that people begin with a "Switch_To_Thread(Thread_ID)" function. Your "Reschedule()" would decide which thread to switch to and then call the "Switch_To_Thread(Thread_ID)" function; and other parts of the kernel will call "Reschedule()" or "Switch_To_Thread(Thread_ID)" directly.
Thanks again for that. I did go back and fix my error.

Second, I do have a timing issue. What is happening is that a process unblocks, and then the timer fires and asks for a reschedule before the character can be printed. The process goes from blocked to running to the ready queue before its letter is printed, and that changes the order of the letters until the next occurrence of the timing condition.

My timer is set up to fire 2000 times per second. Though I played with the timer frequency in both QEMU and Bochs, I still experienced similar issues, though not to the same extent. However, if QEMU is faster than Bochs, then I can understand the differences. But my question becomes: what should the frequency be (and is there a way to calculate the optimal value at runtime)?

In the meantime, I have added code to ensure that the current process gets 2 time-slices when it gets control (which has not masked the problem).

Thanks!

Re: QEMU and Bochs differences

Posted: Sat Feb 04, 2012 10:19 pm
by Brendan
Hi,
eryjus wrote:My timer is setup to fire 2000 times per second. Though I played with the frequency of the timer in both QEMU and Bochs, I still experienced similar issues though not to the extent. However, if QEMU is faster than Bochs, then I can understand the differences. But, my question becomes what should it be (and is there a way to calculate the optimal value at runtime)?
There's 2 or 3 different issues here! ;)


The first issue is the locking. You have none (your "semaphore" isn't a semaphore in the normal sense, it's a synchronisation primitive - like pthread_cond_wait() but without the mutex). For example, consider something like:

Code: Select all

myLock displayLock;

void PrintLetter(char l, SemID sem)
{
   int i;
   
   for (i = 0; i < 490; i ++) {
      Wait_and_get_lock(sem, &displayLock);
      kPutCh(l);
      free_lock(&displayLock);
   }
}

The next issue is "is there a way to calculate the optimal value at runtime?". For anything involving fixed frequency timers there's always a compromise between "smoothness"/granularity/precision and overhead. For example, if each timer IRQ costs an average of 123 cycles of overhead and the CPU does 1234567890 cycles per second, then at 1 IRQ per second you get 123 cycles out of 1234567890 cycles used on timer overhead (or 0.00001% of CPU time in overhead and a granularity of 1 second); at 100000 IRQs per second you get 12300000 cycles out of 1234567890 cycles used on timer overhead (or 0.9963% of CPU time in overhead and a granularity of 10 us). This is a formula that looks like this:

overhead = cycles_per_IRQ * IRQs_per_second / cycles_per_second

You can rearrange the formula:

IRQs_per_second = (overhead / cycles_per_IRQ) * cycles_per_second

If each timer IRQ costs an average of 123 cycles of overhead and the CPU does 1234567890 cycles per second and you want to spend 1% of the CPU's time on timer overhead, then "(0.01 / 123) * 1234567890 = 100371.373 Hz". If the CPU is a lot slower (e.g. a 25 MHz 80486) and you still only want to spend 1% of CPU's time on timer overhead, then "(0.01 / 123) * 25000000 = 2032.52 Hz".
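
As a quick sanity check of that arithmetic, the calculation might be wrapped up like this (a sketch only; the figures are the example numbers above, and actually measuring cycles_per_irq and cycles_per_second at boot is assumed rather than shown):

Code: Select all

/* Sketch: choose a timer frequency for a given overhead budget.
   target_overhead is a fraction, e.g. 0.01 for 1% of CPU time. */
unsigned long pick_timer_hz(unsigned long cycles_per_irq,
                            unsigned long cycles_per_second,
                            double target_overhead)
{
    return (unsigned long)((target_overhead / cycles_per_irq) * cycles_per_second);
}

/* pick_timer_hz(123, 1234567890, 0.01) -> ~100371 Hz
   pick_timer_hz(123, 25000000, 0.01)   -> ~2032 Hz   */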

Of course the overhead doesn't have to be a fixed value - you could (for example) use a linear equation to determine what overhead you want based on CPU speed, and then determine timer frequency from that; so that you get 0.1% overhead on 25 MHz CPUs, 0.01% overhead on 4 GHz CPUs, and something in between for CPUs that are in between.


For the last issue, do you do a task switch every time the timer IRQ occurs? In this case, if you do a (voluntary) task switch just before the timer IRQ fires, then a task can get "almost zero" time before the timer IRQ does a (forced) task switch. This would be bad and/or silly. You need to have a minimum amount of time before the timer IRQ forces a task switch.

The first alternative is to do a task switch after the timer IRQ has occurred "X times since the last task switch". For example, if the timer increments a "ticks since boot" variable once every 500 us, then (during task switches) you can do "next task switch time = ticks since boot + (10 ms / 500 us)". In this case you can guarantee that the task will get at least 9.5 ms (if a voluntary task switch occurs immediately before "ticks since boot" is incremented) and won't get more than 10 ms.
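
In code, that first alternative might look something like the sketch below. Reschedule() and Switch_To_Thread() are the routines discussed earlier in the thread; the other names (ticks_since_boot, next_switch_tick, TICKS_PER_SLICE) are illustrative only:

Code: Select all

volatile uint32 ticks_since_boot = 0;
uint32 next_switch_tick = 0;

#define TICKS_PER_SLICE (10000 / 500)   /* 10 ms slice / 500 us per tick = 20 ticks */

void TimerIrqHandler(void)
{
    ticks_since_boot ++;
    if (ticks_since_boot >= next_switch_tick) {
        Reschedule();                    /* forced task switch */
    }
}

void Switch_To_Thread(Thread_ID id)      /* called for every task switch */
{
    /* Reset the deadline on every switch (voluntary or forced) so the
       incoming task always gets close to a full time-slice. */
    next_switch_tick = ticks_since_boot + TICKS_PER_SLICE;
    /* ... actual context switch to "id" ... */
}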

The next alternative is to use the PIT in "periodic mode" (or "mode 0: interrupt on terminal count"). In this case you'd decide how long until the timer IRQ should occur and then set the PIT's count during the task switch. This is like the previous alternative, except that it's far more precise (you get 838.1 ns increments rather than 500 us increments). For example, if (during the task switch) you want the IRQ to occur 10 ms from now then you'd set the count to 11932, and an IRQ would occur after at least 9.9993 ms and no longer than 10.00015 ms. This method also avoids the overhead of unnecessary IRQs (e.g. for 10 ms you get one IRQ when you want it, not 20 IRQs like the previous example), but there are limitations (e.g. you couldn't have a delay longer than about 55 ms as the PIT count is only 16-bit) and there are disadvantages (the extra overhead of setting the PIT count during task switches, and difficulty in using the same timer to track real time).
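
For reference, reprogramming the PIT for such a one-shot delay during a task switch might look roughly like this (a sketch: the PIT ports 0x43/0x40 and the mode 0 command byte are standard, but the function name, the uint16 type, and the outb(port, value) helper are assumptions here):

Code: Select all

#define PIT_COMMAND  0x43
#define PIT_CHANNEL0 0x40

/* Program PIT channel 0 in mode 0 (interrupt on terminal count) so the next
   timer IRQ arrives "ticks" PIT ticks (about 838.1 ns each) from now.
   outb(port, value) is assumed to write "value" to I/O port "port". */
static void PitSetOneShot(uint16 ticks)
{
    outb(PIT_COMMAND, 0x30);                  /* channel 0, lobyte/hibyte, mode 0, binary */
    outb(PIT_CHANNEL0, ticks & 0xFF);         /* low byte of the count */
    outb(PIT_CHANNEL0, (ticks >> 8) & 0xFF);  /* high byte of the count */
}

/* e.g. during a task switch: PitSetOneShot(11932);   roughly 10 ms from now */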


Cheers,

Brendan

Re: QEMU and Bochs differences

Posted: Sat Feb 04, 2012 11:21 pm
by eryjus
Thanks again for your reply. You have given me a lot to think about (again! :D).
Brendan wrote:There's 2 or 3 different issues here! ;)
No real surprise there... Thank you for pointing them out. More to learn!

I see your point on my "semaphore". My reference was http://en.wikipedia.org/wiki/Semaphore_(programming).
Brendan wrote:For anything involving fixed frequency timers there's always a compromise between "smoothness"/granularity/precision and overhead. For example, if each timer IRQ costs an average of 123 cycles of overhead and the CPU does 1234567890 cycles per second, then at 1 IRQ per second you get 123 cycles out of 1234567890 cycles used on timer overhead (or 0.00001% of CPU time in overhead and a granularity of 1 second); at 100000 IRQs per second you get 12300000 cycles out of 1234567890 cycles used on timer overhead (or 0.9963% of CPU time in overhead and a granularity of 10 us). This is a formula that looks like this:

overhead = cycles_per_IRQ * IRQs_per_second / cycles_per_second

You can rearrange the formula:

IRQs_per_second = (overhead / cycles_per_IRQ) * cycles_per_second

If each timer IRQ costs an average of 123 cycles of overhead and the CPU does 1234567890 cycles per second and you want to spend 1% of the CPU's time on timer overhead, then "(0.01 / 123) * 1234567890 = 100371.373 Hz". If the CPU is a lot slower (e.g. a 25 MHz 80486) and you still only want to spend 1% of CPU's time on timer overhead, then "(0.01 / 123) * 25000000 = 2032.52 Hz".
I need to spend some time playing with the formulas to internalize them. But formulas are great!
Brendan wrote:For the last issue, do you do a task switch every time the timer IRQ occurs?
Yes.
Brendan wrote:In this case, if you do a (voluntary) task switch just before the timer IRQ fires, then a task can get "almost zero" time before the timer IRQ does a (forced) task switch.
As I have discovered.
Brendan wrote:This would be bad and/or silly.
So, I need to realign my thinking...
Brendan wrote:You need to have a minimum amount of time before the timer IRQ forces a task switch.
... and my patch isn't so kludgy after all.

Thanks again. I have some more thinking to do.