
Posted: Fri Mar 07, 2008 9:30 pm
by inx
Thank you all for the useful advice. I've implemented a few of the suggestions, and everything's happily working. :)

After I got all of that done, my previously-working multitasking code died horribly. It would load the tasks, switch to user mode, and run just fine until the first timer interrupt, at which time it would promptly triple fault with a page fault from inside my task switching code.

I lost some sleep poking and prodding at it from every angle I could think of, even though everything *looked* correct to me. I took a break about an hour ago, then came back and decided to just give up for now, disable task switching, and start implementing my syscall interface. Lo and behold, as soon as I started scrolling down through my isr.S file, I found it...

Yesterday, when I was adding the first ISR stub for syscalls, I apparently hit x in vi by accident and deleted a '$'. So exceptions worked, software interrupts worked, but the IRQ stubs fubar'ed the stack without giving me any compile warnings. (In AT&T syntax, dropping the '$' turns an immediate like push $0 into the memory operand push 0, which pushes whatever happens to live at address 0.) :lol: Just thought you all might find it humorous.

Posted: Sat Mar 08, 2008 4:19 am
by Wave
Thanks for the explanation, bewing.

Posted: Sat Mar 08, 2008 9:51 am
by skyking
bewing wrote:
Wave wrote: Can you explain to me why that is? I always thought counting up would be faster because add is faster than sub, but the x86 has the loop instruction, which subtracts automatically. It is, however, slower than a separate sub/cmp/jmp anyway.
All CPUs have the same basic set of flags to test on, when it comes to arithmetic operations. Carry, Zero, Sign, Overflow. Immediately after an arithmetic opcode, you can sometimes test them, and get a useful result.
Not true. MIPS, for example, does not do it that way.
bewing wrote:
skyking wrote: you don't save that clock cycle on a modern compiler
It is not always wise to count on the compiler to save your @$$ from crappy coding. And in any case, programmers often DO use the index inside the loop, and then the compiler won't touch the code. But you can still code the loop to count up or down.
It is also not wise to look at "how fast some current CPU performs this opcode vs. that one" -- all opcodes are tending toward the limit of 1 opcode per calculation cycle. It is wisest to code for that.
I already assumed that we're running at one clock cycle per instruction. However, if you really care about saving one clock cycle per loop, you should start by not using a crappy compiler, and normally you'd be able to squeeze out some more clock cycles by using assembler.

Btw, I tested your example from C source:

Code: Select all

void loop(int* p, int n, int x)
{
	int j, k;

	for( j=0; j<n; j++ )
	{
		p[5*j] = x;
	}
}
versus:

Code: Select all

void loop(int* p, int n, int x)
{
	int j = n;
	p = p + 5*(n-1);

	do
	{
		*p = x;
		p -= 5;
	} while( --j > 0 );
}
I ran with n being 25*1024*1024 and got 400ms vs 391ms. Your example saved 2% of the execution time. Congratulations, but didn't you promise a lot more?

Posted: Sat Mar 08, 2008 11:37 am
by JamesM
skyking wrote: Your example saved 2% of the execution time.
If I saved 2% of our company's product's execution time, I would get an absolutely massive bonus. No joke.

Posted: Sun Mar 09, 2008 1:10 am
by bewing
*sigh* The particular example I happened to code used a memory store operation inside the loop. As you have now discovered, memory store operations (on average) do not happen in one clock cycle.
So the loops were not really five clock cycles vs. four. In fact, you seem to have determined that, on average, it was 50 vs. 49. So this particular memory store operation seems to take about 45 times as long as a register operation on average.
The savings depend on the tightness of the loop, as I said. If the loop that I had coded had used only register operations, and no memory operations at all, then the saving would have more closely approached 20%.
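To make that concrete, here's the kind of register-only loop I mean (a made-up sketch, not the code I benchmarked):

```c
/* Register-only loop body: one add plus the counted-down loop control.
 * With no memory store to dominate each iteration, eliminating one
 * compare per pass is a much bigger relative saving. */
int sum_regs_down(int n)
{
    int s = 0;
    while (n-- > 0)
        s += n;     /* adds n-1, n-2, ..., 1, 0 */
    return s;
}
```

e.g. sum_regs_down(5) adds 4+3+2+1+0 and returns 10.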

And Wave -- if you are truly convinced that addition is (and always will be) faster than subtraction, you can count UP from -N to 0, of course. You'd have to put a lot of explanations in your code, as to why you were doing such a crazy thing, of course. :wink:
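Something like this, say (a made-up sketch, crazy indexing included):

```c
/* Count UP from -n toward zero: the loop-control arithmetic is an
 * addition, and the end test is still a test against zero. The price
 * is that the index must be re-biased by +n to address the array. */
void loop_neg(int *p, int n, int x)
{
    int j;
    for (j = -n; j < 0; ++j)
        p[5 * (j + n)] = x;   /* writes indices 0, 5, ..., 5*(n-1) */
}
```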

Posted: Sun Mar 09, 2008 1:47 am
by piranha
So.....if you save the time, then why keep arguing over this?

Go with the faster way. Simple. And, again I must ask (as I'm a little confused), who cares? 2% time saved. Woopie. So you have a program that takes 36 hours to complete, and instead it takes (rounded) 35.3 hours. Does it matter? I'd be long gone before it even got to 10 min. If it took long enough for me to notice a difference, I wouldn't run the program.

-JL

Posted: Mon Mar 10, 2008 5:51 am
by bewing
piranha wrote: So.....if you save the time, then why keep arguing over this?
I'm not arguing; just showing the young'uns one tiny simple optimization.
piranha wrote: And, again I must ask (as I'm a little confused), who cares? 2% time saved. Woopie.
Because: This was just one easy small optimization. If I have 24 more of these tricks (and I do), then my code runs 50% faster. Does that begin to interest you?

Posted: Mon Mar 10, 2008 7:14 am
by Solar
skyking wrote: Btw, I tested your example from C source:
I'll swallow any criticism of your method (which I consider inappropriate), and instead show mine.

Code: Select all

#include <stdlib.h>
#include <assert.h>

void loop_init( int * p, int n, int x )
{
    size_t i;
    for ( i = 0; i < n; ++i )
    {
        *p = x;
        p += 5;
    }
}

void loop_up( int * p, int n, int x )
{
    size_t i;
    for ( i = 0; i < n; ++i )
    {
        *p = x;
        p += 5;
    }
}

void loop_down( int * p, int n, int x )
{
    size_t i;
    for ( i = n; i > 0; --i )
    {
        *p = x;
        p += 5;
    }
}

int main()
{
    size_t limit = 25*1024*1024;
    int * p = (int *)malloc( 5 * limit * sizeof( int ) ); /* stride-5 writes reach index 5*(limit-1) */
    assert( p != NULL );
    int i;
    loop_init( p, limit, 42 );
    for ( i = 0; i < 1000; ++i )
        loop_up( p, limit, 42 );
    for ( i = 0; i < 1000; ++i )
        loop_down( p, limit, 42 );
    return 0;
}
Using GCC 4.1.2 on a Linux 2.6.23:

Code: Select all

$ gcc -pg test.c
$ ./a.out
$ gprof a.out | head
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 50.28     81.94   81.94      1000    81.94    81.94  loop_up
 50.10    163.58   81.65      1000    81.65    81.65  loop_down
...
More runs showed that:
  • Optimization settings (-O0, -O2) had no measurable influence on results,
  • loop_down() was faster than loop_up() by anywhere from 0.12% to 0.85%.
That's an efficiency increase well below 1%.

The disadvantages of counting down are the effort you would have to put into persuading your co-developers (on a pretty slim business case to start with), and - I am not making this up! - the two minutes I spent debugging loop_down() while loop_up() worked right from the start (loop_down() means --i, not ++i, dumbass 8) ).

In an environment like PDCLib, where efficiency really counts, I would actually be counting down - but not because of the odd cycle it saves, but because I could count down n directly instead of creating a local variable on the stack... so tastes differ. I don't think either of the two "tricks" makes any difference in the real world.
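i.e. something along these lines (a sketch, not actual PDCLib code):

```c
/* Count the parameter n itself down to zero -- no extra loop variable,
 * and the termination test falls out of the decrement. */
void loop_down_direct(int *p, int n, int x)
{
    while (n-- > 0)
    {
        *p = x;
        p += 5;
    }
}
```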

Posted: Mon Mar 10, 2008 1:51 pm
by skyking
bewing wrote: *sigh* The particular example I happened to code used a memory store operation inside the loop. As you have now discovered, memory store operations (on average) do not happen in one clock cycle.
So the loops were not really five clock cycles vs. four. In fact, you seem to have determined that, on average, it was 50 vs. 49. So this particular memory store operation seems to take about 45 times as long as a register operation on average.
The savings depend on the tightness of the loop, as I said. If the loop that I had coded had used only register operations, and no memory operations at all, then the saving would have more closely approached 20%.
But what meaningful work can you do in a loop that uses only register variables, uses the loop counter, and cannot be replaced by a constant-time computation? The latter is far more meaningful than chopping 1-2% off a loop.
And Wave -- if you are truly convinced that addition is (and always will be) faster than subtraction, you can count UP from -N to 0, of course. You'd have to put a lot of explanations in your code, as to why you were doing such a crazy thing, of course. :wink:
Then you'd need that explanation if you count from N down to 0 as well.

The real problem with the optimization you suggested is that you did not measure the performance. Rule number one should be to measure before you try to optimize, and again after. I've seen a lot of "optimizations" that increase the execution time.