real mode dos assembly trouble

All off topic discussions go here. Everything from the funny thing your cat did to your favorite tv shows. Non-programming computer questions are ok too.
M-Saunders
Member
Member
Posts: 155
Joined: Fri Oct 27, 2006 5:11 am
Location: Oberbayern
Contact:

Post by M-Saunders »

Yes Zacariaz, that is indeed a quicker way to do things! But like Frank says, in the case of MikeOS, I prefer to to split things up nicely -- so in this case, the first code chunk sets up the stack, and the second sets up the data segment. It's slightly easier to follow when in two parts.

That said, if you do spot something completely redundant in MikeOS, please do let me know!

Cheers,
Mike
User avatar
B.E
Member
Member
Posts: 275
Joined: Sat Oct 21, 2006 5:29 pm
Location: Brisbane Australia
Contact:

Post by B.E »

inflater wrote:My DIV explanation:

To divide 16 by 4:

mov ax,10h
mov bx,4
div bx

; now in AX there should be 4 ;)
Or you could do this (remember if you shift bits by one right, your dividing the number by 2).

Code: Select all

mov ax, 16
shr ax, 2
Image
Microsoft: "let everyone run after us. We'll just INNOV~1"
User avatar
Zacariaz
Member
Member
Posts: 1069
Joined: Tue May 22, 2007 2:36 pm
Contact:

Post by Zacariaz »

yes i am familier with bit shifting ;)

anyway, just had another though about ADD vs. MOV.
what is fastest and most space efficient.

Code: Select all

 add ax, 512
;or
mov ax, 09C0h
ive just checked with the manual, and it seems, if im not mistaken, that mov ax, 09C0h actually is 4 bytes, and the add ax, 512 3 bytes.
So the question remains: which instruction uses most cycles
User avatar
JAAman
Member
Member
Posts: 879
Joined: Wed Oct 27, 2004 11:00 pm
Location: WA

Post by JAAman »

as for cycles used, that probably depends on the internal state of the CPU (and is unpredictable at compile-time), cache, and other things that are difficult or impossible to predict -- which is why intel stopped releasing cycle timing information

i wouldnt be surprised if both instructions took about the same number of cycles on average -- remember all ALU instructions (including DIV) can be executed in a single cycle or less (prescott can execute 8 of these in a single cycle, so can northwood -- with a restriction: only 2 can be MUL/DIV instructions -- im not sure what the capabilities of the core2 ALU are...)
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Post by Combuster »

remember all ALU instructions (including DIV) can be executed in a single cycle or less (prescott can execute 8 of these in a single cycle, so can northwood -- with a restriction: only 2 can be MUL/DIV instructions
Strange, I just checked the intel optimisation manual and it gave a latency of up to 80 cycles for division on the older models and still some 22 on the conroe (core 2). On average a div can be pipelined every 20-40 clocks. Shifts in comparison take one clock or less.

Multiplies have been vastly improved on the P4 and later, going from 10 clocks to three to generate the result.

(Intel® 64 and IA-32 Architectures Optimization Reference Manual)
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
Zacariaz
Member
Member
Posts: 1069
Joined: Tue May 22, 2007 2:36 pm
Contact:

Post by Zacariaz »

what about binary instructions fx.
instead of:
mov ax, 07C0h
you could write:
xor ax,ax
or ax,07C0h

as far as i know those instructions only use one cycle so allthough it would be hard to read code (and im allso aware of the flags), wouldnt it be more efficient?

YEs, i know im just rambling here, but eh, it is generel ramblings right? ;)
User avatar
Zacariaz
Member
Member
Posts: 1069
Joined: Tue May 22, 2007 2:36 pm
Contact:

Post by Zacariaz »

Accordingly to some docs i found on the net, allready from the 486 both MOV and ADD uses only one single cycle, so i guess i am just rambling :oops:
frank
Member
Member
Posts: 729
Joined: Sat Dec 30, 2006 2:31 pm
Location: East Coast, USA

Post by frank »

mov ax, 07C0h is the better form. For one I believe it would take up less space in the executable and 2 it doesn't destroy the flags.
User avatar
Zacariaz
Member
Member
Posts: 1069
Joined: Tue May 22, 2007 2:36 pm
Contact:

Post by Zacariaz »

actually i beieve that the intel manual states that add and mov are the same size, but im probably wrong, the manual inst allways that easy to understand...
User avatar
Zacariaz
Member
Member
Posts: 1069
Joined: Tue May 22, 2007 2:36 pm
Contact:

Post by Zacariaz »

I pulled this:

Code: Select all

DIV:

                                 Clocks                 Size
        Operands         808x  286   386   486          Bytes

        reg8             80-90  14    14    16            2
        reg16           144-162 22    22    24            2
        reg32              -    -     38    40            2
        mem8        (86-96)+EA  17    17    16           2-4
        mem16     (150-168)+EA  25    25    24           2-4  (W88=158-176+EA)
        mem32              -    -     41    40           2-4

MUL:

                                 Clocks                 Size
        Operands         808x  286   386   486          Bytes

        reg8            70-77   13   9-14  13-18          2
        reg16          118-113  21   9-22  13-26          2
        reg32             -     -    9-38  13-42         2-4
        mem8        (76-83)+EA  16  12-17  13-18         2-4
        mem16     (124-139)+EA  24  12-25  13-26         2-4
        mem32             -     -   12-21  13-42         2-4
of the net, and assuming that this information is correct, i have a question or two.

1. Have the "cycle consumtion" improvet sicnificantly on newer processors?

2. I am having trouble understanding why these two instructions in generel uses more cycles on the 486 than 386.

3. As for the DIV instruction, if im only gonna need the remainder(%), is there a more efficient way than using DIV?


Thanks
User avatar
JAAman
Member
Member
Posts: 879
Joined: Wed Oct 27, 2004 11:00 pm
Location: WA

Post by JAAman »

1. Have the "cycle consumtion" improvet sicnificantly on newer processors?
oh yes, a lot
in general, looking at cycle timings on newer CPUs is irrelevent, since this number fluctuates significantly
2. I am having trouble understanding why these two instructions in generel uses more cycles on the 486 than 386.
there are a lot of things that can influence this... so its hard to tell, but this chart looks quite simplistic even for a 386...
for one thing, there is no cache-hit/cache-miss information, or state-information, or TLB-hit/miss information, or...

so its hard to tell, just off a chart like this... in fact its gotten so complex, that intel no longer even provides a chart, simply because there are too many things to consider, that it wouldnt really help at all...
3. As for the DIV instruction, if im only gonna need the remainder(%), is there a more efficient way than using DIV?
not really...
User avatar
Zacariaz
Member
Member
Posts: 1069
Joined: Tue May 22, 2007 2:36 pm
Contact:

Post by Zacariaz »

ok, the problem is that im using the % like a 1000 times a second, +/- im not sure, anyway the div instruction, has got to be killing the cpu, nomatter hov efficient it is. them mul is used just as much, but it can be brought down alot, so that im not so worried about.

Anyway, if there isnt a more efficient way than div, then that is the answer i was looking for. (not the once i hoped for though)

thanks
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Post by Combuster »

you can avoid div's in several ways
1: redesign your algorithm. Consider things like bresenham line drawing opposed to solving y = ax+b for each step. In essence, try to see if you can deduce the result based on the previous one.

2: rewrite the equations. you can eliminate costly operations by applying the inverse somewhere. if you'd check the distance between two points for a certain value, the common case would be to calculate sqrt(dx²+dy²) and compare it to the distance. Instead, you can compute dx² + dy² and compare it to distance² (which eliminates a sqrt for a mul, saving some 80% of time)

3: optimize in assembly:
use shr, sar when dividing by powers of two
use and when computing the modulus with power-of-two
use aam when dividing small numbers
use lookup tables
multiply with reciprocals instead of dividing.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
Zacariaz
Member
Member
Posts: 1069
Joined: Tue May 22, 2007 2:36 pm
Contact:

Post by Zacariaz »

Well, i fould out this:

Code: Select all

#include <vector>
#include <climits>
class Prime {
        std::vector<unsigned short> Pv;
        unsigned long Test;
    public:
        Prime():Test(3) {Pv.push_back(3);}
        unsigned long long next() {
            Test += 2;
            for(long n = 0;Test >= Pv[n] * Pv[n]; n++) {
                if (Test % Pv[n] == 0) {
                    Test += 2;
                    n = -1;
                }
            }
            if(Test <= USHRT_MAX) Pv.push_back(Test);
            return(Test);
        }
};
when i descided to do something about my somewhat... limited ASM skills.
Now, i know it can be optimized ALOT alone in c++, but still i fail to see a more efficient methode, appart from the sieve maybe.

I dont think it should be too hard to write in assembly.
This was supposed to be a cool signature...
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Post by Combuster »

You can eliminate a lot of components:
consider all cases where a = b * c (i.e., not prime) and make b >= c
now, the largest value c can attain is b, i.e a = c * c -> c = sqrt(a)

you will only need to check the numbers up to and including sqrt(a) to determine primehood.

The overall improvement is pretty nice: for numbers above 100 it saves 90% of checks, and above 10000 even 99% (that means you only need to do ~400 divs when your number is just above one million :D)

hand optimizing your version to assembly won't even get close to that kind of improvement.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Post Reply