real mode dos assembly trouble
-
- Member
- Posts: 155
- Joined: Fri Oct 27, 2006 5:11 am
- Location: Oberbayern
- Contact:
Yes Zacariaz, that is indeed a quicker way to do things! But like Frank says, in the case of MikeOS, I prefer to split things up nicely -- so in this case, the first code chunk sets up the stack, and the second sets up the data segment. It's slightly easier to follow when in two parts.
That said, if you do spot something completely redundant in MikeOS, please do let me know!
Cheers,
Mike
inflater wrote:
My DIV explanation:
To divide 16 by 4:
mov ax,10h
mov bx,4
div bx
; now in AX there should be 4
Or you could do this (remember, if you shift bits right by one, you're dividing the number by 2):
Code: Select all
mov ax, 16
shr ax, 2
Microsoft: "let everyone run after us. We'll just INNOV~1"
Yes, I am familiar with bit shifting.
Anyway, I just had another thought about ADD vs. MOV: which is fastest and most space-efficient?
Code: Select all
add ax, 512
;or
mov ax, 09C0h
I've just checked the manual, and it seems, if I'm not mistaken, that mov ax, 09C0h is actually 4 bytes and add ax, 512 is 3 bytes.
So the question remains: which instruction uses the most cycles?
As for cycles used, that probably depends on the internal state of the CPU (and is unpredictable at compile time), the cache, and other things that are difficult or impossible to predict -- which is why Intel stopped releasing cycle timing information.
I wouldn't be surprised if both instructions took about the same number of cycles on average -- remember, all ALU instructions (including DIV) can be executed in a single cycle or less (Prescott can execute 8 of these in a single cycle, and so can Northwood -- with a restriction: only 2 can be MUL/DIV instructions -- I'm not sure what the capabilities of the Core 2 ALU are...)
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
remember, all ALU instructions (including DIV) can be executed in a single cycle or less (Prescott can execute 8 of these in a single cycle, and so can Northwood -- with a restriction: only 2 can be MUL/DIV instructions)
Strange, I just checked the Intel optimisation manual and it gave a latency of up to 80 cycles for division on the older models, and still some 22 on the Conroe (Core 2). On average a div can be pipelined every 20-40 clocks. Shifts, in comparison, take one clock or less.
Multiplies have been vastly improved on the P4 and later, going from 10 clocks to three to generate the result.
(Intel® 64 and IA-32 Architectures Optimization Reference Manual)
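Just to illustrate the gap (this sketch is mine, not from the manual or the thread): a rough C++ microbenchmark using std::chrono. It is wall-clock only, not cycle-exact, and the numbers will vary with the CPU and compiler, exactly as said above; the divisor 7 and the iteration count are arbitrary.
Code: Select all
#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    volatile uint32_t divisor = 7;   // runtime value, so the compiler has to emit a real div
    volatile uint32_t sink = 0;      // volatile sink keeps the loops from being optimised away
    const uint32_t iterations = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (uint32_t i = 1; i <= iterations; ++i) sink = i / divisor;   // hardware division
    auto t1 = std::chrono::steady_clock::now();
    for (uint32_t i = 1; i <= iterations; ++i) sink = i >> 3;        // shift (divide by 8)
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "div loop:   " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "shift loop: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    return 0;
}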
What about binary instructions? E.g. instead of:
mov ax, 07C0h
you could write:
xor ax, ax
or ax, 07C0h
As far as I know those instructions only use one cycle, so although it would make for hard-to-read code (and I'm also aware of the flags), wouldn't it be more efficient?
Yes, I know I'm just rambling here, but eh, it is general ramblings, right?
I pulled this off the net (see below), and assuming that this information is correct, I have a question or two.
1. Has the "cycle consumption" improved significantly on newer processors?
2. I am having trouble understanding why these two instructions in general use more cycles on the 486 than on the 386.
3. As for the DIV instruction, if I'm only gonna need the remainder (%), is there a more efficient way than using DIV?
Thanks
Code: Select all
DIV:
                            Clocks                     Size
Operands   808x             286   386   486            Bytes
reg8       80-90            14    14    16             2
reg16      144-162          22    22    24             2
reg32      -                -     38    40             2
mem8       (86-96)+EA       17    17    16             2-4
mem16      (150-168)+EA     25    25    24             2-4   (W88=158-176+EA)
mem32      -                -     41    40             2-4

MUL:
                            Clocks                     Size
Operands   808x             286   386   486            Bytes
reg8       70-77            13    9-14  13-18          2
reg16      118-113          21    9-22  13-26          2
reg32      -                -     9-38  13-42          2-4
mem8       (76-83)+EA       16    12-17 13-18          2-4
mem16      (124-139)+EA     24    12-25 13-26          2-4
mem32      -                -     12-21 13-42          2-4
1. Has the "cycle consumption" improved significantly on newer processors?
Oh yes, a lot. In general, looking at cycle timings on newer CPUs is irrelevant, since the numbers fluctuate significantly.
2. I am having trouble understanding why these two instructions in general use more cycles on the 486 than on the 386.
There are a lot of things that can influence this... so it's hard to tell, but this chart looks quite simplistic even for a 386. For one thing, there is no cache hit/miss information, or state information, or TLB hit/miss information, or...
So it's hard to tell just off a chart like this... in fact, it has gotten so complex that Intel no longer even provides a chart, simply because there are so many things to consider that it wouldn't really help at all.
3. As for the DIV instruction, if I'm only gonna need the remainder (%), is there a more efficient way than using DIV?
Not really...
OK, the problem is that I'm using the % like a thousand times a second, +/- I'm not sure, and anyway the DIV instruction has got to be killing the CPU, no matter how efficient it is. The MUL is used just as much, but it can be brought down a lot, so I'm not so worried about that.
Anyway, if there isn't a more efficient way than DIV, then that is the answer I was looking for (not the one I hoped for, though).
Thanks
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
You can avoid DIVs in several ways:
1: Redesign your algorithm. Consider things like Bresenham line drawing as opposed to solving y = ax + b for each step. In essence, try to see if you can deduce the result from the previous one.
2: Rewrite the equations. You can eliminate costly operations by applying the inverse somewhere. If you'd check the distance between two points against a certain value, the common approach would be to calculate sqrt(dx²+dy²) and compare it to the distance. Instead, you can compute dx² + dy² and compare it to distance² (which trades the sqrt for a mul, saving some 80% of the time).
3: Optimize in assembly (see the sketch after this list):
use shr, sar when dividing by powers of two
use and when computing the modulus with a power of two
use aam when dividing small numbers
use lookup tables
multiply by reciprocals instead of dividing.
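Not from the thread, but here is a rough C++ sketch of the last two tricks on that list: an AND for a power-of-two modulus, and a fixed-point reciprocal multiply for dividing by a constant. The constant 52429/2^19 is the usual 16-bit magic number for dividing by 10 (the assembly equivalent is a MUL followed by a SHR); the brute-force loop only exists to check the results against plain / and %.
Code: Select all
#include <cstdint>
#include <cassert>

// x % 8 for unsigned x: the low three bits are the remainder, so one AND replaces the DIV
static inline uint32_t mod8(uint32_t x) { return x & 7; }

// x / 10 for 16-bit x via a fixed-point reciprocal: 52429 / 2^19 is just above 1/10,
// and the error is small enough that the result is exact for every x in [0, 65535]
static inline uint16_t div10(uint16_t x) {
    return static_cast<uint16_t>((static_cast<uint32_t>(x) * 52429u) >> 19);
}

// the remainder then follows from the quotient with one multiply and one subtract
static inline uint16_t rem10(uint16_t x) {
    return static_cast<uint16_t>(x - div10(x) * 10);
}

int main() {
    for (uint32_t x = 0; x <= 0xFFFF; ++x) {
        assert(div10(static_cast<uint16_t>(x)) == x / 10);
        assert(rem10(static_cast<uint16_t>(x)) == x % 10);
        assert(mod8(x) == x % 8);
    }
    return 0;
}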
Well, I came up with this (below) when I decided to do something about my somewhat... limited ASM skills.
Now, I know it can be optimized a lot in C++ alone, but I still fail to see a more efficient method, apart from the sieve maybe.
I don't think it should be too hard to write in assembly.
Code: Select all
#include <vector>
#include <climits>

// Incremental prime generator: trial division by the odd primes found so far.
// The loop stops as soon as Pv[n]*Pv[n] > Test, so only divisors up to
// sqrt(Test) are ever tried.
class Prime {
    std::vector<unsigned short> Pv;   // odd primes found so far (only those <= USHRT_MAX)
    unsigned long Test;               // last candidate handed out
public:
    Prime() : Test(3) { Pv.push_back(3); }
    unsigned long next() {
        Test += 2;                                     // next odd candidate
        for (long n = 0; Test >= (unsigned long)Pv[n] * Pv[n]; n++) {
            if (Test % Pv[n] == 0) {                   // composite: advance and restart
                Test += 2;
                n = -1;
            }
        }
        if (Test <= USHRT_MAX)                         // only small primes are kept as divisors,
            Pv.push_back((unsigned short)Test);        // which limits correct output to ~USHRT_MAX^2
        return Test;
    }
};
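Not part of the original post, but for reference a minimal usage sketch, assuming the class above is compiled in the same file (note that next() only produces the primes after 3):
Code: Select all
#include <iostream>

int main() {
    Prime p;
    for (int i = 0; i < 10; ++i)      // first ten results: 5 7 11 13 17 19 23 29 31 37
        std::cout << p.next() << ' ';
    std::cout << '\n';
    return 0;
}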
This was supposed to be a cool signature...
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
You can eliminate a lot of the divisions:
consider all cases where a = b * c (i.e., a is not prime) and make b >= c
now, the largest value c can attain is when b = c, i.e. a = c * c -> c = sqrt(a)
so you only need to check the numbers up to and including sqrt(a) to determine primehood.
The overall improvement is pretty nice: for numbers above 100 it saves 90% of the checks, and above 10000 even 99% (that means you only need to do ~400 divs when your number is just above one million).
Hand-optimizing your version to assembly won't even get close to that kind of improvement.
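Not from the thread, but a short C++ sketch of plain trial division with the sqrt(a) cutoff described above. For a number just above one million the loop only walks the odd candidates up to ~1000, i.e. on the order of 500 divisions, the same ballpark as the ~400 quoted.
Code: Select all
#include <cstdint>

// trial division, stopping as soon as d*d > n: only divisors up to sqrt(n) are tested
bool is_prime(uint32_t n) {
    if (n < 2)      return false;
    if (n < 4)      return true;            // 2 and 3
    if (n % 2 == 0) return false;
    for (uint32_t d = 3; static_cast<uint64_t>(d) * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}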