real mode dos assembly trouble
-
- Member
- Posts: 155
- Joined: Fri Oct 27, 2006 5:11 am
- Location: Oberbayern
- Contact:
Yes Zacariaz, that is indeed a quicker way to do things! But like Frank says, in the case of MikeOS, I prefer to split things up nicely -- so in this case, the first code chunk sets up the stack, and the second sets up the data segment. It's slightly easier to follow when in two parts.
That said, if you do spot something completely redundant in MikeOS, please do let me know!
Cheers,
Mike
inflater wrote:
My DIV explanation:
To divide 16 by 4:
mov ax,10h
mov bx,4
div bx
; now in AX there should be 4
Or you could do this (remember, if you shift bits right by one, you're dividing the number by 2):
Code: Select all
mov ax, 16
shr ax, 2
Microsoft: "let everyone run after us. We'll just INNOV~1"
Yes, I am familiar with bit shifting.
Anyway, I just had another thought about ADD vs. MOV: which is fastest and most space-efficient?
Code: Select all
add ax, 512
;or
mov ax, 09C0h
I've just checked the manual, and it seems, if I'm not mistaken, that mov ax, 09C0h is actually 4 bytes and add ax, 512 is 3 bytes.
So the question remains: which instruction uses the most cycles?
As for cycles used, that probably depends on the internal state of the CPU (and is unpredictable at compile time), the cache, and other things that are difficult or impossible to predict -- which is why Intel stopped releasing cycle timing information.
I wouldn't be surprised if both instructions took about the same number of cycles on average -- remember, all ALU instructions (including DIV) can be executed in a single cycle or less (Prescott can execute 8 of these in a single cycle, and so can Northwood -- with a restriction: only 2 can be MUL/DIV instructions -- I'm not sure what the capabilities of the Core 2 ALU are...)
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
remember, all ALU instructions (including DIV) can be executed in a single cycle or less (Prescott can execute 8 of these in a single cycle, and so can Northwood -- with a restriction: only 2 can be MUL/DIV instructions)
Strange, I just checked the Intel optimisation manual and it gave a latency of up to 80 cycles for division on the older models, and still some 22 on the Conroe (Core 2). On average a div can be pipelined every 20-40 clocks. Shifts, in comparison, take one clock or less.
Multiplies have been vastly improved on the P4 and later, going from 10 clocks to three to generate the result.
(Intel® 64 and IA-32 Architectures Optimization Reference Manual)
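Just to illustrate the gap (this sketch is mine, not from the manual or the thread): a rough C++ microbenchmark using std::chrono. It is wall-clock only, not cycle-exact, and the numbers will vary with the CPU and compiler, exactly as said above; the divisor 7 and the iteration count are arbitrary.
Code: Select all
#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    volatile uint32_t divisor = 7;   // runtime value, so the compiler has to emit a real div
    volatile uint32_t sink = 0;      // volatile sink keeps the loops from being optimised away
    const uint32_t iterations = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (uint32_t i = 1; i <= iterations; ++i) sink = i / divisor;   // hardware division
    auto t1 = std::chrono::steady_clock::now();
    for (uint32_t i = 1; i <= iterations; ++i) sink = i >> 3;        // shift (divide by 8)
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "div loop:   " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "shift loop: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    return 0;
}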
What about binary instructions? E.g. instead of:
mov ax, 07C0h
you could write:
xor ax, ax
or ax, 07C0h
As far as I know those instructions only use one cycle, so although it would make for hard-to-read code (and I'm also aware of the flags), wouldn't it be more efficient?
Yes, I know I'm just rambling here, but eh, it is general ramblings, right?
I pulled this off the net (see below), and assuming that this information is correct, I have a question or two.
1. Has the "cycle consumption" improved significantly on newer processors?
2. I am having trouble understanding why these two instructions in general use more cycles on the 486 than on the 386.
3. As for the DIV instruction, if I'm only gonna need the remainder (%), is there a more efficient way than using DIV?
Thanks
Code: Select all
DIV:
                            Clocks                     Size
Operands   808x             286   386   486            Bytes
reg8       80-90            14    14    16             2
reg16      144-162          22    22    24             2
reg32      -                -     38    40             2
mem8       (86-96)+EA       17    17    16             2-4
mem16      (150-168)+EA     25    25    24             2-4   (W88=158-176+EA)
mem32      -                -     41    40             2-4

MUL:
                            Clocks                     Size
Operands   808x             286   386   486            Bytes
reg8       70-77            13    9-14  13-18          2
reg16      118-113          21    9-22  13-26          2
reg32      -                -     9-38  13-42          2-4
mem8       (76-83)+EA       16    12-17 13-18          2-4
mem16      (124-139)+EA     24    12-25 13-26          2-4
mem32      -                -     12-21 13-42          2-4
1. Has the "cycle consumption" improved significantly on newer processors?
Oh yes, a lot. In general, looking at cycle timings on newer CPUs is irrelevant, since the numbers fluctuate significantly.
2. I am having trouble understanding why these two instructions in general use more cycles on the 486 than on the 386.
There are a lot of things that can influence this... so it's hard to tell, but this chart looks quite simplistic even for a 386. For one thing, there is no cache hit/miss information, or state information, or TLB hit/miss information, or...
So it's hard to tell just off a chart like this... in fact, it has gotten so complex that Intel no longer even provides a chart, simply because there are so many things to consider that it wouldn't really help at all.
3. As for the DIV instruction, if I'm only gonna need the remainder (%), is there a more efficient way than using DIV?
Not really...
OK, the problem is that I'm using the % like a thousand times a second, +/- I'm not sure, and anyway the DIV instruction has got to be killing the CPU, no matter how efficient it is. The MUL is used just as much, but it can be brought down a lot, so I'm not so worried about that.
Anyway, if there isn't a more efficient way than DIV, then that is the answer I was looking for (not the one I hoped for, though).
Thanks
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
You can avoid DIVs in several ways:
1: Redesign your algorithm. Consider things like Bresenham line drawing as opposed to solving y = ax + b for each step. In essence, try to see if you can deduce the result from the previous one.
2: Rewrite the equations. You can eliminate costly operations by applying the inverse somewhere. If you'd check the distance between two points against a certain value, the common approach would be to calculate sqrt(dx²+dy²) and compare it to the distance. Instead, you can compute dx² + dy² and compare it to distance² (which trades the sqrt for a mul, saving some 80% of the time).
3: Optimize in assembly (see the sketch after this list):
use shr, sar when dividing by powers of two
use and when computing the modulus with a power of two
use aam when dividing small numbers
use lookup tables
multiply by reciprocals instead of dividing.
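Not from the thread, but here is a rough C++ sketch of the last two tricks on that list: an AND for a power-of-two modulus, and a fixed-point reciprocal multiply for dividing by a constant. The constant 52429/2^19 is the usual 16-bit magic number for dividing by 10 (the assembly equivalent is a MUL followed by a SHR); the brute-force loop only exists to check the results against plain / and %.
Code: Select all
#include <cstdint>
#include <cassert>

// x % 8 for unsigned x: the low three bits are the remainder, so one AND replaces the DIV
static inline uint32_t mod8(uint32_t x) { return x & 7; }

// x / 10 for 16-bit x via a fixed-point reciprocal: 52429 / 2^19 is just above 1/10,
// and the error is small enough that the result is exact for every x in [0, 65535]
static inline uint16_t div10(uint16_t x) {
    return static_cast<uint16_t>((static_cast<uint32_t>(x) * 52429u) >> 19);
}

// the remainder then follows from the quotient with one multiply and one subtract
static inline uint16_t rem10(uint16_t x) {
    return static_cast<uint16_t>(x - div10(x) * 10);
}

int main() {
    for (uint32_t x = 0; x <= 0xFFFF; ++x) {
        assert(div10(static_cast<uint16_t>(x)) == x / 10);
        assert(rem10(static_cast<uint16_t>(x)) == x % 10);
        assert(mod8(x) == x % 8);
    }
    return 0;
}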
Well, I came up with this (below) when I decided to do something about my somewhat... limited ASM skills.
Now, I know it can be optimized a lot in C++ alone, but I still fail to see a more efficient method, apart from the sieve maybe.
I don't think it should be too hard to write in assembly.
Code: Select all
#include <vector>
#include <climits>

// Incremental prime generator: trial division by the odd primes found so far.
// The loop stops as soon as Pv[n]*Pv[n] > Test, so only divisors up to
// sqrt(Test) are ever tried.
class Prime {
    std::vector<unsigned short> Pv;   // odd primes found so far (only those <= USHRT_MAX)
    unsigned long Test;               // last candidate handed out
public:
    Prime() : Test(3) { Pv.push_back(3); }
    unsigned long next() {
        Test += 2;                                     // next odd candidate
        for (long n = 0; Test >= (unsigned long)Pv[n] * Pv[n]; n++) {
            if (Test % Pv[n] == 0) {                   // composite: advance and restart
                Test += 2;
                n = -1;
            }
        }
        if (Test <= USHRT_MAX)                         // only small primes are kept as divisors,
            Pv.push_back((unsigned short)Test);        // which limits correct output to ~USHRT_MAX^2
        return Test;
    }
};
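Not part of the original post, but for reference a minimal usage sketch, assuming the class above is compiled in the same file (note that next() only produces the primes after 3):
Code: Select all
#include <iostream>

int main() {
    Prime p;
    for (int i = 0; i < 10; ++i)      // first ten results: 5 7 11 13 17 19 23 29 31 37
        std::cout << p.next() << ' ';
    std::cout << '\n';
    return 0;
}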
This was supposed to be a cool signature...
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
You can eliminate a lot of the divisions:
consider all cases where a = b * c (i.e., a is not prime) and make b >= c
now, the largest value c can attain is when b = c, i.e. a = c * c -> c = sqrt(a)
so you only need to check the numbers up to and including sqrt(a) to determine primehood.
The overall improvement is pretty nice: for numbers above 100 it saves 90% of the checks, and above 10000 even 99% (that means you only need to do ~400 divs when your number is just above one million).
Hand-optimizing your version to assembly won't even get close to that kind of improvement.
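Not from the thread, but a short C++ sketch of plain trial division with the sqrt(a) cutoff described above. For a number just above one million the loop only walks the odd candidates up to ~1000, i.e. on the order of 500 divisions, the same ballpark as the ~400 quoted.
Code: Select all
#include <cstdint>

// trial division, stopping as soon as d*d > n: only divisors up to sqrt(n) are tested
bool is_prime(uint32_t n) {
    if (n < 2)      return false;
    if (n < 4)      return true;            // 2 and 3
    if (n % 2 == 0) return false;
    for (uint32_t d = 3; static_cast<uint64_t>(d) * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}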