Usefulness of seg:off pointers

a5498828 · Post by **a5498828** » Sun Feb 27, 2011 6:48 pm

This is a question for people who have written, are writing, or plan to write a real mode OS.

Where should I use near pointers, and where far?

Is it common for a caller to require the callee to handle more than 64 KiB?

For example, int 13h uses far pointers. The entire IVT is far-pointer based, which is not surprising.

In protected mode they even added things like RPL to ensure secure segment usage.
That means I'm not supposed to use near pointers.

So what's the use of near pointers? They existed (call/jmp/retn) even on the 8086, when nobody thought of a flat address space.
That means they had/have a valid use. What is it? Relative jumps - OK, loops. But relative calls? Indirect near jumps/calls?

Tell me when I am supposed to use far pointers and when near. In real mode, of course.

Brendan · Post by **Brendan** » Sun Feb 27, 2011 9:00 pm

Hi,

a5498828 wrote:Where should I use near pointers, and where far?

The simple answer is, use near pointers everywhere you can, and use far pointers when you have to. Even in real mode loading a segment register adds some overhead.

The longer answer is.... longer.

Because a segment is limited to 64 KiB, real mode OSs usually have several different "memory models" that affect segment usage. From memory:

- tiny - same segment used for all segment registers; fast (few segment register loads); "code+data+stack" is limited to a maximum of 64 KiB
- small - one segment for code, and one segment used for all data and stack segment registers; fast (few segment register loads); code limited to 64 KiB and "data+stack" is limited to 64 KiB
- medium - multiple segments for code, and one segment used for all data and stack segment registers; slightly slower ("far call" needed a lot but few data/stack segment register loads); code limited to 640 KiB and "data+stack" is limited to 64 KiB
- compact - one segment for code, and multiple segments used for data and stack; even slower ("far pointers" needed for all data accesses); code limited to 64 KiB and "data+stack" is limited to 640 KiB
- large - multiple segments for code and multiple segments for data and stack; even slower ("far call" and "far pointers"); "code+data+stack" limited to 640 KiB
- overlays - similar to "large", but code is split up into modules/overlays and loaded into memory when needed; even slower; "data+stack" limited to 640 KiB less space reserved for "base code + modules/overlays", but there's no limit on the number of overlays so code size is only really limited by disk size (or in practice, disk speed and performance requirements)
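To make the near/far distinction behind these models concrete, here is a minimal Python sketch of real-mode address arithmetic. The helper names (`linear`, `near_add`, `normalize`) are made up for illustration; only the `segment * 16 + offset` rule and the 20-bit wrap of the 8086 come from the architecture. It is a model of the arithmetic, not code you would run in real mode.

```python
# Real-mode address arithmetic, modelled with plain integers.
# Function names are hypothetical; the 8086 rule is linear = segment * 16 + offset.

def linear(segment, offset):
    """Physical address of seg:off on an 8086 (20-bit address bus)."""
    return ((segment << 4) + offset) & 0xFFFFF

def near_add(offset, delta):
    """Advance a near pointer: only the 16-bit offset changes, so it wraps at 64 KiB."""
    return (offset + delta) & 0xFFFF

def normalize(segment, offset):
    """Rewrite seg:off so the offset is below 16 (one canonical pair per address)."""
    addr = linear(segment, offset)
    return addr >> 4, addr & 0xF

if __name__ == "__main__":
    print(hex(linear(0x1234, 0x5678)))        # a far pointer names a full address
    print(hex(near_add(0xFFFF, 1)))           # a near pointer cannot leave its segment
    seg, off = normalize(0x1234, 0x5678)
    print(hex(seg), hex(off))                 # a different pair for the same byte
```

A near pointer is just the offset and relies on whatever segment register is already loaded, which is why the tiny and small models are cheap. A far pointer stores both halves, so it can reach any segment but costs a segment register load on use, and two far pointers with different bit patterns can refer to the same byte.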

On top of this, a few schemes were used to allow a program to access more memory. The first was "bank switching", where pieces of data would be switched in from expanded or extended memory (typically via a special memory manager). The other alternative was "extenders", where protected mode is used by the application to get past the memory usage limitations caused by the OS.

a5498828 wrote:Is it common for a caller to require the callee to handle more than 64 KiB?

For things like the OS's APIs, the OS must use far pointers (to be able to support the more useful memory models, and other reasons), which mostly forces the caller to use far pointers when calling these APIs. For a program calling its own code it depends on which memory model the program uses.

a5498828 wrote:So what's the use of near pointers? They existed (call/jmp/retn) even on the 8086, when nobody thought of a flat address space.
That means they had/have a valid use. What is it? Relative jumps - OK, loops. But relative calls? Indirect near jumps/calls?

Imagine a system with 100 separate pieces of code where each piece of code is less than 64 KiB. In this case you can use near pointers for all calls and jumps within the same piece of code; and only use slower far calls when something in one piece of code has to call something in another piece of code.

A relative call is mostly just a near call with a shorter opcode (e.g. a 1-byte "distance from IP" stored in the instruction rather than a 2-byte "new value of IP" stored in the instruction).

Of course if you have to handle segmentation all over the place, and have to handle all the different memory models, and have to figure out what to do with free memory fragmentation and other problems, and have to write your own compiler/s because the only compilers that support real mode and segmentation also expect DOS; then you'd realise protected mode and paging is a lot easier in the long run (and usually faster and more efficient too). Note: I only mention this because a lot of beginners actually make the mistake of thinking "real mode" is easier, which is only true in the short term.

Cheers,

Brendan

a5498828 · Post by **a5498828** » Sun Feb 27, 2011 9:43 pm

Thank you for the answer. Real mode is easier for me because the same argument (near vs. far) carries over into protected and even long mode. And there you have things like privilege levels, gates, tasks, and so on. Paging is even more complicated. Real mode is just the most basic of them all, and when written correctly it will work on CPUs from the 1970s.

I just wonder how long code that finishes in 1 second on a Core 2 Duo would take on the oldest 8086.

I think I will use near calls as you suggested for internal functions, and far for everything else.

Brendan wrote:A relative call is mostly just a near call with a shorter opcode (e.g. a 1-byte "distance from IP" stored in the instruction rather than a 2-byte "new value of IP" stored in the instruction).

There is a 1-byte relative call?

I'm talking about x86; I forgot to mention that.

Brendan · Post by **Brendan** » Sun Feb 27, 2011 10:17 pm

Hi,

a5498828 wrote:Thank you for the answer. Real mode is easier for me because the same argument (near vs. far) carries over into protected and even long mode. And there you have things like privilege levels, gates, tasks, and so on. Paging is even more complicated. Real mode is just the most basic of them all, and when written correctly it will work on CPUs from the 1970s.

I know it seems like real mode is "easy", and in the beginning it is. With protected mode and paging the protection stuff can be ignored, segmentation can be ignored, all the gates (except "interrupt/trap gates") aren't strictly necessary, you wouldn't need any TSS (until/unless you want to use protection), etc. You also get a clean way of hiding the underlying physical address space, which avoids things like memory fragmentation problems, working around "holes", working around "maximum memory" restrictions, etc. In the short term learning how to use paging is harder than not learning how to use paging; but it avoids a huge amount of hassle in the long run.

a5498828 wrote:I just wonder how long code that finishes in 1 second on a Core 2 Duo would take on the oldest 8086.

If something takes 1 second on a modern OS running on a modern CPU, then it's likely that it'd take days on an 8086 (especially when you have to swap large amounts of code and data to/from disk just to get past the "lack of memory" problems). Even something as simple as decoding a 1024*768 bitmap would be a huge nightmare.

a5498828 wrote:
Brendan wrote:A relative call is mostly just a near call with a shorter opcode (e.g. a 1-byte "distance from IP" stored in the instruction rather than a 2-byte "new value of IP" stored in the instruction).
There is a 1-byte relative call? I'm talking about x86; I forgot to mention that.

Heh - sorry. The "rel8" form only exists for jumps (JMP, Jcc, LOOP and JCXZ); a near relative CALL always carries a 2-byte displacement in 16-bit code.

Makes me wonder how many people write position independent real mode code..
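As a rough illustration of the two encodings being discussed, the Python sketch below builds the bytes for a 16-bit near relative `CALL` (opcode `E8` plus a rel16) and a short `JMP` (opcode `EB` plus a rel8). The function names are hypothetical; the opcodes and the "relative to the next instruction" rule are the standard x86 ones.

```python
# 16-bit encodings: CALL rel16 is E8 lo hi; JMP rel8 is EB disp8.
# Displacements are relative to the address of the *next* instruction.
# Function names are hypothetical helpers, not part of any assembler.

def encode_call_near(ip, target):
    """Encode a 3-byte near relative CALL located at offset ip."""
    rel16 = (target - (ip + 3)) & 0xFFFF  # wraps within the 64 KiB segment
    return bytes([0xE8, rel16 & 0xFF, rel16 >> 8])

def encode_jmp_short(ip, target):
    """Encode a 2-byte short JMP; the target must be within -128..+127 bytes."""
    rel8 = target - (ip + 2)
    if not -128 <= rel8 <= 127:
        raise ValueError("target out of range for rel8")
    return bytes([0xEB, rel8 & 0xFF])

if __name__ == "__main__":
    print(encode_call_near(0x0100, 0x0200).hex())
    print(encode_jmp_short(0x0100, 0x0100).hex())  # a jump to itself
    # Same bytes when caller and target move together:
    print(encode_call_near(0x8100, 0x8200) == encode_call_near(0x0100, 0x0200))
```

Because only the distance is stored, the encoded call is identical wherever the code is loaded, as long as caller and target move together. That is what makes relative near calls usable in position-independent real mode code.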

Cheers,

Brendan

rdos · Post by **rdos** » Mon Feb 28, 2011 8:40 am

Brendan wrote:Of course if you have to handle segmentation all over the place, and have to handle all the different memory models, and have to figure out what to do with free memory fragmentation and other problems, and have to write your own compiler/s because the only compilers that support real mode and segmentation also expect DOS;

Not true. The OpenWatcom compiler supports segmentation without the need for an underlying DOS system. It also supports the flat, small (and recently compact) memory model for 32-bit applications, with no assumption about an underlying DOS API. The compact memory model can also be optimized by always letting DS point to DGROUP, and thus avoiding far pointers on local data.

Brendan wrote:I know it seems like real mode is "easy", and in the beginning it is. With protected mode and paging the protection stuff can be ignored, segmentation can be ignored, all the gates (except "interrupt/trap gates") aren't strictly necessary, you wouldn't need any TSS (until/unless you want to use protection), etc. You also get a clean way of hiding the underlying physical address space, which avoids things like memory fragmentation problems, working around "holes", working around "maximum memory" restrictions, etc. In the short term learning how to use paging is harder than not learning how to use paging; but it avoids a huge amount of hassle in the long run.

The same thing can be said about using a flat memory model vs using a protected memory model with enforced segment isolation. It will take longer to code, and it might be slightly slower, but in the end will contain fewer bugs because of the design choice. There is always a trade-off between speed-effort-protection.

ErikVikinger · Post by **ErikVikinger** » Mon Feb 28, 2011 2:16 pm

Hello,

rdos wrote:The same thing can be said about using a flat memory model vs using a protected memory model with enforced segment isolation. It will take longer to code, and it might be slightly slower, but in the end will contain fewer bugs because of the design choice. There is always a trade-off between speed-effort-protection.

Full ACK. I think the speed loss from segmentation is really small, and you have a chance to gain additional speed through a better design (flexible memory sharing could be faster with segments).

Greetings
Erik

Owen · Post by **Owen** » Wed Mar 09, 2011 5:20 pm

ErikVikinger wrote:
rdos wrote:The same thing can be said about using a flat memory model vs using a protected memory model with enforced segment isolation. It will take longer to code, and it might be slightly slower, but in the end will contain fewer bugs because of the design choice. There is always a trade-off between speed-effort-protection.
Full ACK. I think the speed loss from segmentation is really small, and you have a chance to gain additional speed through a better design (flexible memory sharing could be faster with segments).

Segment loads are vector path operations (i.e. done completely in order), so are themselves quite slow. Segments with non-zero bases combined with SIB addressing adds an extra cycle to instructions (because the AGU can't do four additions in one). Relative branches all have a latency addition of 1 (branches have zero latency with segbase=0), far jumps have a minimum latency of 20 cycles IIRC, and are unpredicted, etc, etc...

Performance wise, segmentation is bad news.

(Source: AMD K8 & K10 optimization guides. Intel are much less helpful)
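For readers counting the terms, here is a small Python sketch of the address calculation in question. It only models the arithmetic of a 32-bit SIB access (segment base, base register, scaled index, displacement); the function name is hypothetical and nothing here models the actual K8/K10 pipeline or its timing.

```python
# Arithmetic of a 32-bit SIB memory operand, e.g. [ebx + esi*8 + 0x10].
# This is only the sum the address generation unit has to produce;
# it says nothing about how many cycles a given CPU needs for it.

MASK32 = 0xFFFFFFFF

def linear_address(seg_base, base, index, scale, disp):
    """Return (effective, linear) for one SIB-addressed access."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("SIB scale must be 1, 2, 4 or 8")
    effective = (base + index * scale + disp) & MASK32   # offset within the segment
    linear = (seg_base + effective) & MASK32             # extra term when base != 0
    return effective, linear

if __name__ == "__main__":
    print([hex(x) for x in linear_address(0x0, 0x1000, 4, 8, 0x10)])
    print([hex(x) for x in linear_address(0x200000, 0x1000, 4, 8, 0x10)])
```

With a zero segment base the last addition contributes nothing, which is the case the guides describe as the fast one.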

rdos · Post by **rdos** » Wed Mar 09, 2011 11:48 pm

Owen wrote:Segment loads are vector path operations (i.e. done completely in order), so are themselves quite slow. Segments with non-zero bases combined with SIB addressing adds an extra cycle to instructions (because the AGU can't do four additions in one). Relative branches all have a latency addition of 1 (branches have zero latency with segbase=0), far jumps have a minimum latency of 20 cycles IIRC, and are unpredicted, etc, etc...

Performance wise, segmentation is bad news.

(Source: AMD K8 & K10 optimization guides. Intel are much less helpful)

No wonder that AMD is losing the race against Intel. Providing an adder in hardware costs a few transistors, and such ICs were among the first digital ICs constructed.

Solar · Post by **Solar** » Thu Mar 10, 2011 12:39 am

Again, a market thing: If "no one" in the market still uses segmented memory (niche OSes notwithstanding), why bother to optimize for it?

I bet you a case of beer that Intel doesn't have those additional transistors, either.

rdos · Post by **rdos** » Thu Mar 10, 2011 1:06 am

But the real issue is that checking if base is 0 takes a similar amount of time as doing an addition, so this is clearly a bad design. What they could mean is that 64-bit long mode (which forces the segment base to 0) is faster because they could discard adding the segment base, and not that a 32-bit OS that sets up non-zero bases for segments will execute slower, but I don't know. Anyway, I provided a fix for this on CPUs that support the VME flag, so on those systems flat memory model applications will use a 0 base.

This is more politics than engineering. Seems like big companies like Microsoft today can get their favorite optimizations into hardware, while leaving competing designs in software. This is their way of boasting about how fast their code is, when in reality they have some really bloated code that should run slow as hell.

Solar · Post by **Solar** » Thu Mar 10, 2011 1:18 am

rdos wrote:This is their way of boasting about how fast their code is...

*cough* *sputter*

Microsoft? Fast code?

Have they ever claimed such a thing?

Combuster · Post by **Combuster** » Thu Mar 10, 2011 3:27 am

Owen wrote:Performance wise, segmentation is bad news.

Performance wise, TLB flushes and pagewalks are bad news. YMMV.

Owen · Post by **Owen** » Thu Mar 10, 2011 8:06 pm

rdos wrote:But the real issue is that checking if base is 0 takes a similar amount of time as doing an addition

You can check if the base is 0 at segment register load time. You have to add for every address generation. Once the register is loaded, you distribute a bit around the sequencer saying whether SIB addresses need an extra segment base add cycle.

rdos · Post by **rdos** » Thu Mar 10, 2011 11:55 pm

Owen wrote:You can check if the base is 0 at segment register load time. You have to add for every address generation. Once the register is loaded, you distribute a bit around the sequencer saying whether SIB addresses need an extra segment base add cycle.

Ever heard of parallel adding? There is no need to do those adds in sequence; rather, keep a block of as many adders as are necessary and perform them in parallel. Unused inputs are set to 0.

Casm · Post by **Casm** » Sat Mar 19, 2011 8:14 pm

a5498828 wrote:I just wonder how long code that finishes in 1 second on a Core 2 Duo would take on the oldest 8086.

In the early eighties the processor in a PC ran at 4.77 MHz. So if you assume that a two-core processor is the equivalent of a single core running at 4 GHz, something which would take the modern processor 1 second would have taken an 8086 about a quarter of an hour. Back in those days I used to hang a computer with:

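```asm
        xor  cx, cx      ; CX = 0, so the outer LOOP runs 65536 times
outer:
        push cx          ; save the outer counter
        xor  cx, cx      ; inner counter = 0 (another 65536 iterations)
        loop $           ; spin on this instruction until CX reaches 0
        pop  cx          ; restore the outer counter
        loop outer       ; next outer pass
```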

I never waited to see how long it took to run before aborting it.
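The arithmetic behind both estimates can be checked in a few lines of Python. The clock figures are the ones from the post; the 17 cycles for a taken `LOOP` is an assumed 8086 figure, and the small outer-loop overhead (`push`, `xor`, `pop`) is ignored, so treat the result as a ballpark only.

```python
# Back-of-the-envelope check of the estimates above.
# Assumptions (not measured): 4 GHz "equivalent" modern clock as in the post,
# 17 cycles per taken LOOP on an 8086, outer-loop overhead ignored.

PC_CLOCK_HZ = 4.77e6
MODERN_CLOCK_HZ = 4e9
LOOP_TAKEN_CYCLES = 17

def clock_ratio():
    """How many times faster the assumed modern clock ticks."""
    return MODERN_CLOCK_HZ / PC_CLOCK_HZ

def nested_loop_hours():
    """Run time of the nested LOOP on a 4.77 MHz 8086, in hours."""
    inner_iterations = 65536 * 65536  # CX = 0 means 65536 passes, twice nested
    seconds = inner_iterations * LOOP_TAKEN_CYCLES / PC_CLOCK_HZ
    return seconds / 3600

if __name__ == "__main__":
    print(f"1 modern second is about {clock_ratio() / 60:.1f} minutes by clock ratio")
    print(f"nested loop runs for about {nested_loop_hours():.1f} hours")
```

Under these assumptions the clock ratio alone gives roughly 14 minutes, which matches the "quarter of an hour" figure, and the nested loop would have run for a bit over four hours. A pure clock ratio ignores that a modern core also retires far more work per cycle, which is why the earlier estimate of "days" for real workloads is not contradicted by it.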
