OSDev.org

Posted: **Fri Oct 19, 2012 11:44 am**

Hello,

I am running our own OS. I want to measure OS overhead using the TSC on an intel i7-2600 multicore processor.

I have the following application:

Code: Select all

int main()
{
    for(int i = 0; i < REP; i++)
        run(i)
}

void run(int test)
{
    create threads
    join threads
    delete threads
    print the collected TSC values
}

I am measuring the overhead as follows:

Code: Select all

void Thread::sleep(..)
{
    //disables interrupts and gets a spin lock
    TSC::Time_Stamp start = TSC::time_stamp();
    //performs some operations
    _wakeup_tsc[_wakeup_counter] = TSC::time_stamp() - start;
    _wakeup_counter++;
   //enables interrupts and releases the spin lock
}

static TSC::Time_Stamp time_stamp() {
	Time_Stamp ts;
	ASMV("rdtsc" : "=A" (ts) : );
	return ts;
}

When I run the application with only 1 repetition, I got very low TSC values, as expected (there are only 15 threads in the system). If I change to 10 repetitions, for example, the TSC values increase a lot (about 10 times more even in the first loop).

I understand that if I change the repetitions, the compiler (g++) will change the memory addresses in the generated system image. Is there a reasonable explanation for this behavior? Maybe, something related to memory banks accesses? I cannot understand how the read TSC values change so much.

Best regards,
Giovani

Posted: **Fri Oct 19, 2012 11:11 pm**

Hi,

giovanig wrote:When I run the application with only 1 repetition, I got very low TSC values, as expected (there are only 15 threads in the system). If I change to 10 repetitions, for example, the TSC values increase a lot (about 10 times more even in the first loop).

When you measure how long it takes to execute "run()" 10 times you found that it takes about 10 times longer than just executing "run()" once? In similar news, scientists have discovered that 10 elephants are roughly 10 times heavier than a single elephant, it takes 10 times longer to walk 10 Km than it takes to walk 1 Km, and 10 pizzas cost about 10 times more than one pizza.

I don't see what the problem is.

Were you expecting that caches would make the second, third, ..., tenth time faster (e.g. so that doing it 10 times only takes 8 times as long as doing it once)? If you were, then maybe a lot of the code/data (e.g. most of the kernel) is already in caches before your test starts.

Were you expecting that doing it 10 times would use more CPUs and be a lot faster than 10 times longer? In this case; the "join threads" part (where you wait for the created threads to stop) would prevent anything extra from happening in parallel.

Cheers,

Brendan

Posted: **Sat Oct 20, 2012 9:55 am**

Hi Brendan, thank you for your answer.

I guess I was not clear enough. I am not measuring how long the program takes to execute. Obviously, the time would be 10x slower. I am measuring how long some OS activities take to execute (time to choose a thread to run for example). These activities do not depend on the loop and they are executed with interrupts disabled and inside a mutual exclusion code protected by a spin lock.

What I would expect is a very similar time independently of the loop size. The problem is when I measure the time running the application with more than 1 iteration (I reset the internal counters before starting a new iteration), I got a TSC value that is extremely higher than executing the same application and the same TSC and Thread code with a loop with 1 iteration. There is no filesystem, all data is kept in memory buffers.

The only explanation I see, is that when I increase the loop size, the code and data of the OS (scheduling lists for example) and application are placed in another memory region and the time to access the data in this new position is higher, for some reason related to the architecture.

Posted: **Sat Oct 20, 2012 10:41 am**

Hi,

giovanig wrote:I guess I was not clear enough. I am not measuring how long the program takes to execute. Obviously, the time would be 10x slower. I am measuring how long some OS activities take to execute (time to choose a thread to run for example). These activities do not depend on the loop and they are executed with interrupts disabled and inside a mutual exclusion code protected by a spin lock.

What I would expect is a very similar time independently of the loop size. The problem is when I measure the time running the application with more than 1 iteration (I reset the internal counters before starting a new iteration), I got a TSC value that is extremely higher than executing the same application and the same TSC and Thread code with a loop with 1 iteration. There is no filesystem, all data is kept in memory buffers.

The only explanation I see, is that when I increase the loop size, the code and data of the OS (scheduling lists for example) and application are placed in another memory region and the time to access the data in this new position is higher, for some reason related to the architecture.

This sounds far too unlikely. More likely is that you're not measuring what you think your measuring; or there's serious flaws in your OS (e.g. code to choose a thread to run that wastes time checking threads that have been terminated), or something else is going on.

I'd suggest trying many different loop counts - e.g. 1, 10, 100, 1000, ..., 10000000. This should give you a better idea of the symptoms.

Cheers,

Brendan

OSDev.org

using TSC to measure time returns different values

using TSC to measure time returns different values

Re: using TSC to measure time returns different values

Re: using TSC to measure time returns different values

Re: using TSC to measure time returns different values