
Posted: Sun Sep 23, 2007 12:18 pm
by M-Saunders
Yes Zacariaz, that is indeed a quicker way to do things! But like Frank says, in the case of MikeOS, I prefer to split things up nicely -- so in this case, the first code chunk sets up the stack, and the second sets up the data segment. It's slightly easier to follow in two parts.

That said, if you do spot something completely redundant in MikeOS, please do let me know!

Cheers,
Mike

Posted: Sun Sep 23, 2007 4:01 pm
by B.E
inflater wrote:My DIV explanation:

To divide 16 by 4:

mov ax,10h
mov bx,4
div bx

; now in AX there should be 4 ;)
Or you could do this (remember, shifting right by one bit divides the number by 2, so shifting right by two divides it by 4).

Code: Select all

mov ax, 16      ; the dividend
shr ax, 2       ; shift right by 2 = divide by 4, so AX is now 4

Posted: Mon Sep 24, 2007 2:42 am
by Zacariaz
Yes, I am familiar with bit shifting ;)

Anyway, I just had another thought about ADD vs. MOV:
which is faster, and which is more space efficient?

Code: Select all

 add ax, 512
;or
mov ax, 09C0h
I've just checked the manual, and it seems, if I'm not mistaken, that mov ax, 09C0h is actually 4 bytes, while add ax, 512 is 3 bytes.
So the question remains: which instruction uses more cycles?
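
If you have NASM handy, one quick way to settle the size question is to assemble both instructions and read the listing (the file names here are just for illustration):

Code: Select all

BITS 16              ; assemble as 16-bit code, like the boot sector
add ax, 512          ; the listing shows the encoded bytes for this line
mov ax, 09C0h        ; ...and for this one
Running "nasm test.asm -l test.lst" puts the generated bytes next to each line in test.lst, so the sizes can be read straight off.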

Posted: Mon Sep 24, 2007 10:34 am
by JAAman
As for cycles used, that probably depends on the internal state of the CPU (and is unpredictable at compile time), the cache, and other things that are difficult or impossible to predict -- which is why Intel stopped releasing cycle timing information.

I wouldn't be surprised if both instructions took about the same number of cycles on average -- remember, all ALU instructions (including DIV) can be executed in a single cycle or less (Prescott can execute 8 of these in a single cycle, and so can Northwood -- with a restriction: only 2 can be MUL/DIV instructions -- I'm not sure what the capabilities of the Core 2 ALU are...)

Posted: Mon Sep 24, 2007 2:30 pm
by Combuster
remember, all ALU instructions (including DIV) can be executed in a single cycle or less (Prescott can execute 8 of these in a single cycle, and so can Northwood -- with a restriction: only 2 can be MUL/DIV instructions
Strange -- I just checked the Intel optimisation manual and it gives a latency of up to 80 cycles for division on the older models, and still some 22 on Conroe (Core 2). On average a div can be pipelined only every 20-40 clocks. Shifts, in comparison, take one clock or less.

Multiplies have been vastly improved on the P4 and later, going from 10 clocks to three to generate the result.

(Intel® 64 and IA-32 Architectures Optimization Reference Manual)

Posted: Wed Sep 26, 2007 1:10 pm
by Zacariaz
What about bitwise instructions? E.g. instead of:
mov ax, 07C0h
you could write:
xor ax, ax
or ax, 07C0h

As far as I know those instructions only use one cycle, so although it would make the code harder to read (and I'm also aware of the flags), wouldn't it be more efficient?

Yes, I know I'm just rambling here, but eh, it is general ramblings, right? ;)

Posted: Wed Sep 26, 2007 1:57 pm
by Zacariaz
According to some docs I found on the net, already on the 486 both MOV and ADD use only a single cycle, so I guess I am just rambling :oops:

Posted: Wed Sep 26, 2007 6:30 pm
by frank
mov ax, 07C0h is the better form. For one, I believe it takes up less space in the executable, and two, it doesn't destroy the flags.
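
Here is a rough sketch of the 16-bit encodings as I read the opcode tables (worth double-checking against the manual):

Code: Select all

xor ax, ax       ; 31 C0      -- 2 bytes, and it writes the flags
or  ax, 07C0h    ; 0D C0 07   -- 3 bytes, also writes the flags (5 bytes for the pair)
mov ax, 07C0h    ; B8 C0 07   -- 3 bytes, flags left alone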

Posted: Thu Sep 27, 2007 2:55 am
by Zacariaz
Actually, I believe the Intel manual states that add and mov are the same size, but I'm probably wrong -- the manual isn't always that easy to understand...

Posted: Thu Sep 27, 2007 5:53 pm
by Zacariaz
I pulled this:

Code: Select all

DIV:

                                 Clocks                 Size
        Operands         808x  286   386   486          Bytes

        reg8             80-90  14    14    16            2
        reg16           144-162 22    22    24            2
        reg32              -    -     38    40            2
        mem8        (86-96)+EA  17    17    16           2-4
        mem16     (150-168)+EA  25    25    24           2-4  (W88=158-176+EA)
        mem32              -    -     41    40           2-4

MUL:

                                 Clocks                 Size
        Operands         808x  286   386   486          Bytes

        reg8            70-77   13   9-14  13-18          2
        reg16          118-133  21   9-22  13-26          2
        reg32             -     -    9-38  13-42         2-4
        mem8        (76-83)+EA  16  12-17  13-18         2-4
        mem16     (124-139)+EA  24  12-25  13-26         2-4
        mem32             -     -   12-21  13-42         2-4
off the net, and assuming that this information is correct, I have a question or two.

1. Has the "cycle consumption" improved significantly on newer processors?

2. I am having trouble understanding why these two instructions in general use more cycles on the 486 than on the 386.

3. As for the DIV instruction, if I'm only going to need the remainder (%), is there a more efficient way than using DIV?


Thanks

Posted: Fri Sep 28, 2007 5:10 pm
by JAAman
1. Has the "cycle consumption" improved significantly on newer processors?
Oh yes, a lot.
In general, looking at cycle timings on newer CPUs is irrelevant, since the numbers fluctuate significantly.
2. I am having trouble understanding why these two instructions in general use more cycles on the 486 than on the 386.
There are a lot of things that can influence this... so it's hard to tell, but this chart looks quite simplistic even for a 386...
For one thing, there is no cache-hit/cache-miss information, or state information, or TLB-hit/miss information, or...

So it's hard to tell just off a chart like this... in fact it's become so complex that Intel no longer even provides a chart, simply because there are too many things to consider and it wouldn't really help at all...
3. As for the DIV instruction, if I'm only going to need the remainder (%), is there a more efficient way than using DIV?
Not really...

Posted: Sat Sep 29, 2007 3:19 am
by Zacariaz
OK, the problem is that I'm using % maybe a thousand times a second, give or take, and the div instruction has got to be killing the CPU no matter how efficient it is. Mul is used just as much, but that can be brought down a lot, so I'm not so worried about it.

Anyway, if there isn't a more efficient way than div, then that is the answer I was looking for (not the one I hoped for, though).

Thanks

Posted: Sat Sep 29, 2007 4:41 am
by Combuster
You can avoid divs in several ways:
1: Redesign your algorithm. Consider things like Bresenham line drawing as opposed to solving y = ax+b for each step. In essence, try to see if you can derive each result from the previous one.

2: Rewrite the equations. You can often eliminate a costly operation by applying its inverse somewhere else. If you want to check the distance between two points against a certain value, the common approach is to calculate sqrt(dx²+dy²) and compare it to the distance. Instead, you can compute dx² + dy² and compare it to distance², which trades the sqrt for a mul and saves some 80% of the time.

3: Optimize in assembly (a small sketch follows below):
use shr/sar when dividing by powers of two
use and when computing a modulus with a power of two
use aam when dividing small numbers
use lookup tables
multiply by reciprocals instead of dividing.
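
To make point 3 a bit more concrete, here is a small sketch in 16-bit NASM syntax -- three independent fragments, each assuming the value starts in AX, with the divide-by-10 constant being the usual round-up of 2^19/10 (verify it against your own value range):

Code: Select all

BITS 16

; unsigned divide by 8: shift instead of div
shr ax, 3            ; ax = ax / 8

; unsigned modulus by 8: mask instead of div (8 is a power of two)
and ax, 7            ; ax = ax % 8

; unsigned divide by 10 without div: multiply by a reciprocal
mov cx, 0CCCDh       ; 0CCCDh = round-up of 2^19 / 10
mul cx               ; dx:ax = ax * 0CCCDh
shr dx, 3            ; dx = (product >> 19) = original ax / 10
                     ; remainder, if needed: original ax - dx*10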

Posted: Sat Sep 29, 2007 5:05 am
by Zacariaz
Well, I came up with this:

Code: Select all

#include <vector>
#include <climits>
// Incremental odd-prime generator: Test walks the odd numbers and is
// trial-divided by the primes found so far (stored in Pv).
class Prime {
        std::vector<unsigned short> Pv;   // odd primes found so far (up to USHRT_MAX)
        unsigned long Test;               // current candidate
    public:
        Prime():Test(3) {Pv.push_back(3);}
        unsigned long long next() {
            Test += 2;                                           // next odd candidate
            // only trial-divide by primes whose square does not exceed Test
            for(long n = 0; Test >= (unsigned long)Pv[n] * Pv[n]; n++) {
                if (Test % Pv[n] == 0) {                         // composite: move on
                    Test += 2;
                    n = -1;                                      // restart the scan
                }
            }
            if(Test <= USHRT_MAX) Pv.push_back((unsigned short)Test);
            return Test;
        }
};
when I decided to do something about my somewhat... limited ASM skills.
Now, I know it can be optimized a lot in C++ alone, but I still fail to see a more efficient method, apart from the sieve maybe.

I don't think it should be too hard to write in assembly.
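
For 16-bit candidates, a rough sketch of that inner loop might look like the following (the register assignments and the word-sized prime table are just assumptions for illustration):

Code: Select all

BITS 16

; CX = odd candidate, SI -> table of odd primes (16-bit words)
check_next:
    mov ax, cx
    xor dx, dx            ; DX:AX = candidate (the dividend)
    div word [si]         ; AX = quotient, DX = remainder
    test dx, dx
    jz  not_prime         ; divides evenly -> composite
    cmp ax, [si]
    jb  is_prime          ; quotient < divisor means divisor > sqrt(candidate)
    add si, 2             ; on to the next prime in the table
    jmp check_next
is_prime:
    ; ... record CX as a new prime ...
    ret
not_prime:
    ret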

Posted: Sat Sep 29, 2007 7:13 am
by Combuster
You can eliminate a lot of the checks:
consider all cases where a = b * c (i.e. a is not prime) and require b >= c;
the largest value c can then reach is when b = c, i.e. a = c * c, so c = sqrt(a).

So you only need to check the numbers up to and including sqrt(a) to determine primality.

The overall improvement is pretty nice: for numbers above 100 it saves 90% of the checks, and above 10000 even 99% (that means you only need ~400 divs when your number is just above one million :D)

Hand-optimizing your version in assembly won't even get close to that kind of improvement.