OSDev.org

Posted: **Tue Jul 06, 2004 8:02 am**

Two questions:
1. I'm in PM using segments... but how do I access a segment in C? In ASM I write segment:offset. But how do I do this in C? Or can I still use the physical address? Like if I have a segment (0x8) that has a base address that's 0x1000... can I use 0x1000 instead of 0x8:0?
2. What happens if I've got two segments, with different privilege levels, that overlap each other?

Posted: **Tue Jul 06, 2004 10:50 pm**

BrandonChris wrote: Two questions:
1. I'm in PM using segments... but how do I access a segment in C?

If you are using GCC, not at all. GCC does not support a segmented memory model. I heard that some less popular compiler (Watcom?) supports it, though; you'd have to check out the documentation of that compiler. In any case, such addressing isn't compiler-portable (as it's not part of the language proper).

Posted: **Wed Jul 07, 2004 7:02 am**

So when I use an address in GCC, it becomes the physical address? E.g. if I write something like
char *pointer = (char *) 0x12BBB;
then it really points to the physical address 0x12BBB, no matter how my segments look?

Posted: **Wed Jul 07, 2004 7:24 am**

It means that if you use GCC to compile, and ds.base is not equal to es.base or to ss.base, the code will simply fail.

Posted: **Wed Jul 07, 2004 8:01 am**

Is that because the compiler doesn't know/care about the segments, so I have to use the same base address for the code/stack/data?

Posted: **Wed Jul 07, 2004 8:13 am**

Is this correct? Because if I set e.g. the base of the Code Segment = 0x10, then the processor will read 0x10 bytes "too far" when trying to read a variable? So if I use a variable that the compiler puts at the address 0x1000, then the processor will look for it at 0x1000 + 0x10? Right?

Posted: **Wed Jul 07, 2004 12:09 pm**

BrandonChris wrote: Is this correct? Because if I set e.g. the base of the Code Segment = 0x10, then the processor will read 0x10 bytes "too far" when trying to read a variable? So if I use a variable that the compiler puts at the address 0x1000, then the processor will look for it at 0x1000 + 0x10? Right?

If you put everything 0x10 bytes too far, only absolute jumps don't work. GCC doesn't make absolute jumps, only references. All absolute things must be relocated, so in a way, if you don't have a BINARY format (which carries no relocation information) it all works even with a base of 10 or 18247.

If you use binary you can't adjust the absolute references so it's no good trying to.

PS: the absolute references don't make too much of a difference, since all references are done to the base of the segment.

PDS (post-di-scriptum): you COULD use some segment base & limit for 32-bit applications that are small, to reduce TLB / cache flush overhead. Makes library calls a living hell though...

Posted: **Wed Jul 07, 2004 4:13 pm**

Alright thanks...
But what are the different segments for? I mean, sometimes the processor uses the base of the code segment and sometimes the data segment and so on... when and why?

Posted: **Wed Jul 07, 2004 5:14 pm**

And if I write something like this:
char *p = (char *) 0x10000;
will p point to the physical address 0x10000, or to 0x10000 + some base address? And if so, which base address? The code segment's or the data segment's?

Posted: **Wed Jul 07, 2004 5:55 pm**

Just to start from a clean base, this is how physical addresses are formed.
Real-Mode

Physical address = (Segment * 0x10) + Offset

Protected-Mode

Linear Address = Segment Base + Offset

With paging turned on, the Physical Address depends on the page table entry that describes the page which the Linear Address (sometimes called Virtual Address) falls in.

Without paging Linear Address = Physical Address
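
The two formulas above can be sketched in C as follows. This is only an illustration: the function names are made up for this example, and paging is assumed to be off, so the linear address is also the physical one.

```c
#include <assert.h>
#include <stdint.h>

/* Real mode: the 16-bit segment value is multiplied by 0x10 (shifted left
 * by 4) and the 16-bit offset is added. */
uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Protected mode: the base comes from the segment's descriptor, not from
 * the selector value itself. The sum wraps around at 32 bits. */
uint32_t linear_address(uint32_t segment_base, uint32_t offset)
{
    return segment_base + offset;
}

/* Values taken from the questions in this thread. */
void check_address_examples(void)
{
    assert(real_mode_address(0x1000, 0x0000) == 0x10000);
    assert(linear_address(0x10, 0x1000) == 0x1010);  /* base 0x10, variable at 0x1000 */
    assert(linear_address(0, 0xB8000) == 0xB8000);   /* flat model: offset == linear */
}
```

With a base of 0 (the flat model) the offset a C pointer holds and the linear address are the same number, which is why the question only matters once a segment base is non-zero.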

***

In Protected-Mode segments are described by a DESCRIPTOR, located in either the GDT or LDT. These descriptors are loaded into SEGMENT registers (CS, DS, ES, FS, GS, SS).

Setting a descriptor's type to CODE indicates that the memory described by the descriptor is executable. In normal circumstances the only register loaded with a CODE type descriptor is CS. CS can only be changed via a far jump, a far call, a far return, an iret, or a task switch (Might have missed one).

Setting a descriptor's type to DATA indicates that the memory described by the descriptor is NON-executable. In normal circumstances every segment register aside from CS will contain a DATA type descriptor (DS, ES, SS, FS, GS). Any instruction that accesses memory (Aside from those that use CS. See above) will use these registers. E.g. mov byte [si], 0x1 uses DS even though it isn't stated explicitly.

The reason for having 2 separate types is to increase protection.

The base and length of a segment can be any value you like. Most people use a flat memory model (All descriptors have base 0, length 4 GB), but there are other possibilities.

The application code doesn't see any of this. As far as it is concerned the application has the entire address space to itself. It is the job of the OS to make sure that if the application uses address 0x0 that address 0x0 actually corresponds to the physical address containing the data/code the application expects.

Because if I set e.g. the base of the Code Segment = 0x10, then the processor will read 0x10 bytes "too far" when trying to read a variable?
So if I use a variable that the compiler puts at the address 0x1000, then the processor will look for it at 0x1000 + 0x10? Right?

Yes. It would be the responsibility of your OS to make sure that the application's data has been loaded starting at the physical address 0x10 (the segment base), so that offset 0x1000 really does end up at 0x1010.

Posted: **Wed Jul 07, 2004 6:52 pm**

Curifir:
So what you mean is that an instruction like mov "starts" at the data segment's base address? So
mov ..., [address] will actually be
mov ..., [address + data segment's base address]?

But it doesn't matter when using the flat memory model?

Posted: **Wed Jul 07, 2004 7:02 pm**

So when I set a direct pointer like

unsigned char *videoMem = (unsigned char *) 0xb8000;

that will actually "point" to the memory location
0xb8000 + [Code Segment's base address]

or something like that at least?

Posted: **Wed Jul 07, 2004 9:16 pm**

Ok. I'll try the really really simplified explanation. Be patient, this might seem dumb.

In real-mode you use a logical address in the form A:B to address memory.

This is translated into a physical address using the equation:

Physical address = (A * 0x10) + B

The registers in pure real-mode are limited to 16 bits for addressing. 16 bits can represent any integer between 0 and 64k - 1 (0xFFFF).

This means that if we set A to be a fixed value and allow B to change we can address a 64k area of memory. This 64k area is called a segment.

A = A 64k segment
B = Offset within the segment

The base address of a segment is the (A * 0x10) portion of the equation I showed.

It should be obvious that segments can overlap.

Eg
The segment 0x1000 has a base address of 0x10000. This segment occupies the physical address range 0x10000 -> 0x1FFFF.

However the segment 0x1010 has a base address of 0x10100. This segment occupies the physical address range 0x10100 -> 0x200FF.

As you can see we could use either segment to reach physical addresses between 0x10100 and 0x1FFFF since the segments overlap.
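
A small C sketch of the overlap example, using the same two segment values. The helper names are invented for this illustration, and address wraparound at 1 MB (the A20 line) is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First and last physical address reachable through a real-mode segment. */
uint32_t segment_first(uint16_t segment)
{
    return (uint32_t)segment << 4;
}

uint32_t segment_last(uint16_t segment)
{
    return ((uint32_t)segment << 4) + 0xFFFF;
}

/* True if the physical address lies inside the segment's 64k window. */
bool segment_contains(uint16_t segment, uint32_t physical)
{
    return physical >= segment_first(segment) && physical <= segment_last(segment);
}

void check_overlap_example(void)
{
    assert(segment_first(0x1000) == 0x10000 && segment_last(0x1000) == 0x1FFFF);
    assert(segment_first(0x1010) == 0x10100 && segment_last(0x1010) == 0x200FF);

    /* 0x1000:0x0100 and 0x1010:0x0000 name the same physical byte. */
    assert(segment_first(0x1000) + 0x0100 == segment_first(0x1010) + 0x0000);

    /* 0x10100 is reachable through both segments, 0x10000 only through one. */
    assert(segment_contains(0x1000, 0x10100) && segment_contains(0x1010, 0x10100));
    assert(!segment_contains(0x1010, 0x10000));
}
```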

The x86 line of processors has 6 segment registers (CS, DS, ES, FS, GS, SS; FS and GS were added with the 386). They are totally independent of one another.

CS = Code Segment
DS = Data Segment
SS = Stack Segment
ES = Extra Segment
FS/GS = General Purpose Segments

CS is the only Segment Register that cannot be directly altered. The only time (I'm sure I'm missing one) CS is altered is when the code switches execution into another segment. The only commands that can do this are:

Far Jump. Here the new value for CS is encoded in the jump instruction. Eg JMP 0x10:0x100 says to load CS with segment 0x10 and IP with 0x100. CS:IP is the logical address of the instruction to be executed.

Far Call. This is exactly the same as a far jump, but the current values of CS/IP are pushed onto the stack before executing at the new position.

INT. The processor reads the new value of CS/IP from the Interrupt Vector Table and then executes what is effectively a far call after pushing FLAGS onto the stack.

Far Return. Here the processor pops the return segment/offset from the stack into CS/IP and switches execution to that address.

IRET. This is exactly the same as a far return apart from the processor popping FLAGS off the stack in addition to CS/IP.

Apart from these cases no instruction alters the value of CS.

DS,ES,FS,GS,SS are used to form addresses when you want to read/write to memory. They don't always have to be explicitly encoded, because some processor operations assume that certain segment registers will be used.

Eg.

MOV [SI], AX will write the word contained in ax to the address DS:SI

MOV ES:[DI], AX will write the word contained in ax to the address es:di

CMPSB will compare the byte at DS:SI to the byte at ES:DI, set the zero flag if they are equal and decrement/increment SI and DI according to the state of the direction flag.

As you can see, often the segment register being used is not contained in the instruction, but there is one being used. EVERY time you form an address on an x86 processor there will be a segment register involved.

***

Ok, that covers real-mode. Hopefully the reasons for me going into that detail will become clear shortly.

Posted: **Wed Jul 07, 2004 9:17 pm**

In protected-mode you use a logical address in the form A:B to address memory.

As in real-mode A is the segment part and B is the offset within that segment.

The registers in protected mode are limited to 32 bits. 32 bits can represent any integer between 0 and 4 GB - 1 (0xFFFFFFFF).

Because B can be any value between 0 and 4 GB - 1 our segments now have a maximum size of 4 GB (Same reasoning as in real-mode).

Now for the difference.

In protected mode A is not an absolute value for the segment. In protected mode A is a selector. A selector represents an offset into a system table called the Global Descriptor Table(GDT). The GDT contains a list of descriptors. Each of these descriptors contains information that describes the characteristics of a segment.
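
To make "offset into the GDT" concrete, here is a sketch of how a selector value splits into its fields. The struct and function names are made up for this example. The table-indicator bit (GDT vs LDT) and the requested privilege level in the low bits are part of the selector format but go beyond what this post covers.

```c
#include <assert.h>
#include <stdint.h>

/* Fields packed into a 16-bit protected-mode selector. */
typedef struct {
    uint16_t index;   /* which descriptor in the table (bits 15..3) */
    unsigned use_ldt; /* table indicator, bit 2: 0 = GDT, 1 = LDT   */
    unsigned rpl;     /* requested privilege level, bits 1..0       */
} selector_fields;

selector_fields decode_selector(uint16_t selector)
{
    selector_fields f;
    f.index   = (uint16_t)(selector >> 3);
    f.use_ldt = (selector >> 2) & 1u;
    f.rpl     = selector & 3u;
    return f;
}

/* Byte offset of the descriptor inside its table (each entry is 8 bytes). */
uint16_t selector_table_offset(uint16_t selector)
{
    return (uint16_t)(selector & ~7u);
}

void check_selector_examples(void)
{
    /* 0x08, the selector from the first question: GDT entry 1, ring 0. */
    selector_fields a = decode_selector(0x08);
    assert(a.index == 1 && a.use_ldt == 0 && a.rpl == 0);
    assert(selector_table_offset(0x08) == 8);

    /* 0x1B: GDT entry 3 with requested privilege level 3. */
    selector_fields b = decode_selector(0x1B);
    assert(b.index == 3 && b.use_ldt == 0 && b.rpl == 3);
    assert(selector_table_offset(0x1B) == 0x18);
}
```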

Each descriptor contains the following information:

The base address of the segment
The default operation size in the segment (16-bit/32-bit)
The privilege level of the descriptor (Ring 0 -> Ring 3)
The granularity (Segment limit is in byte/4 KB units)
The segment limit (The maximum legal offset within the segment)
The segment presence (Is it present or not)
The descriptor type (0 = system; 1 = code/data)
The segment type (Code/Data/Read/Write/Accessed/Conforming/Non-Conforming/Expand-Up/Expand-Down)
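
The fields listed above are scattered over an 8-byte descriptor. The following sketch packs and unpacks one, assuming the standard 32-bit code/data descriptor layout. The function names and the split into an "access" byte and a 4-bit "flags" nibble are conventions chosen for this example, not something from the post.

```c
#include <assert.h>
#include <stdint.h>

/* access: present bit, privilege level, descriptor type, segment type.
 * flags:  granularity (bit 3) and default operation size (bit 2). */
uint64_t make_descriptor(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);             /* limit bits 15..0  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;      /* base bits 23..0   */
    d |= (uint64_t)access << 40;                 /* access byte       */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;  /* limit bits 19..16 */
    d |= (uint64_t)(flags & 0xF) << 52;          /* flags nibble      */
    d |= (uint64_t)((base >> 24) & 0xFF) << 56;  /* base bits 31..24  */
    return d;
}

uint32_t descriptor_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24));
}

uint32_t descriptor_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFF) | (((d >> 48) & 0xF) << 16));
}

void check_descriptor_examples(void)
{
    /* Base 0, limit 0xFFFFF in 4 KB units, ring 0 code, 32-bit. */
    assert(make_descriptor(0, 0xFFFFF, 0x9A, 0xC) == 0x00CF9A000000FFFFULL);

    /* Packing and unpacking an arbitrary base/limit gives the values back. */
    uint64_t d = make_descriptor(0x12345678, 0xABCDE, 0x92, 0x4);
    assert(descriptor_base(d) == 0x12345678);
    assert(descriptor_limit(d) == 0xABCDE);
}
```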

For the purposes of this explanation I'm only interested in 3 things. The base address, the limit and the descriptor type.

I'll deal with the descriptor type first. If this is clear (System type) then the descriptor isn't actually describing a segment, it's describing either one of the special gate mechanisms, where to find an LDT, or a TSS. These have nothing to do with general addressing, so I'll assume a descriptor type of 1 (code/data type) and leave you to read the Intel manuals for the rest.

The segment is described by its base address and limit. Remember in real-mode where the segment was a 64k area in memory? The only difference here is that the size of the segment isn't fixed. The base address supplied by the descriptor is the start of the segment, the limit is the maximum offset the processor will allow before producing an exception.

So the range of addresses covered by our protected mode segment is:

Segment Base -> Segment Base + Segment Limit

Given a logical address A:B (Remember that A is a selector) we can determine the address it translates to using:

Linear address = Segment Base (Found from the descriptor) + B

With paging off, the linear address is the physical address.
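
The base + offset translation, together with the limit check, can be modelled roughly like this. It is a simplified sketch: the type and function names are invented, only ordinary (expand-up) segments and single-byte accesses are considered, and a failed check simply returns false where the real processor would raise an exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t base;          /* start of the segment                   */
    uint32_t limit;         /* 20-bit limit field from the descriptor */
    bool     page_granular; /* limit counted in 4 KB units if set     */
} segment;

/* Highest legal offset in bytes. */
uint32_t effective_limit(const segment *s)
{
    return s->page_granular ? ((s->limit << 12) | 0xFFFu) : s->limit;
}

/* Translates an offset within the segment. Returns false when the offset
 * is past the limit. */
bool translate(const segment *s, uint32_t offset, uint32_t *address)
{
    if (offset > effective_limit(s))
        return false;
    *address = s->base + offset;
    return true;
}

void check_translate_examples(void)
{
    segment small = { 0x1000, 0x0FFF, false }; /* 4 KB segment based at 0x1000 */
    segment flat  = { 0, 0xFFFFF, true };      /* base 0, covers all 4 GB      */
    uint32_t address = 0;

    assert(translate(&small, 0, &address) && address == 0x1000);
    assert(translate(&small, 0x0FFF, &address) && address == 0x1FFF);
    assert(!translate(&small, 0x1000, &address)); /* one byte past the limit */

    assert(effective_limit(&flat) == 0xFFFFFFFFu);
    assert(translate(&flat, 0xB8000, &address) && address == 0xB8000);
}
```

The first assertion is the case from the start of the thread: with a segment based at 0x1000, offset 0 within that segment is address 0x1000.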

All the other rules from real-mode still apply.

Segments can overlap
CS,DS,ES,FS,GS,SS are independent of each other
CS cannot be changed directly

In protected mode CS can also be changed via the TSS or a gate.

***

That covers protected mode segments (Paging is not what you asked for). Hopefully segments now make sense.

***

Finally I'll just run through some things about C.

Most C compilers assume a flat-memory model.

In this model all the segments cover the full address space (Usually 0 -> 4 GB on x86). In essence this means that we completely ignore the A part of our A:B logical address. The reason for this is that most processors don't actually have segmentation (Plus it's a hell of a lot easier for the compiler to optimise).

This leaves you with 2 descriptors per privilege level (Ring 0 and Ring 3 normally), one for code and one for data, which both describe precisely the same segment. The only difference being that the code descriptor is loaded into CS, and the data descriptor is used by all the other segment registers. The reason you need both a code and data descriptor is that the processor will not allow you to load CS with a data descriptor (This is to help with security when using a segmented memory model, and although useless in the flat-memory model it is still required because you can't turn off segmentation).
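
As an illustration, a flat-model GDT with those descriptors could look like the table below. The constants are the commonly used encodings for base 0, limit 0xFFFFF in 4 KB units, 32-bit segments. The entry order and all the names are assumptions made for this sketch. The helper functions just decode the entries again to show that code and data descriptors describe the same range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

const uint64_t flat_gdt[5] = {
    0x0000000000000000ULL, /* mandatory null descriptor */
    0x00CF9A000000FFFFULL, /* ring 0 code               */
    0x00CF92000000FFFFULL, /* ring 0 data               */
    0x00CFFA000000FFFFULL, /* ring 3 code               */
    0x00CFF2000000FFFFULL, /* ring 3 data               */
};

uint32_t entry_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24));
}

uint32_t entry_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFF) | (((d >> 48) & 0xF) << 16));
}

/* Descriptor privilege level, bits 46..45. */
unsigned entry_dpl(uint64_t d)
{
    return (unsigned)((d >> 45) & 3);
}

/* Bit 43 distinguishes code (1) from data (0) descriptors. */
bool entry_is_code(uint64_t d)
{
    return ((d >> 43) & 1) != 0;
}

void check_flat_gdt(void)
{
    for (int i = 1; i < 5; i++) {
        /* Every real entry describes the same range: base 0, maximum limit. */
        assert(entry_base(flat_gdt[i]) == 0);
        assert(entry_limit(flat_gdt[i]) == 0xFFFFF);
    }
    assert(entry_is_code(flat_gdt[1]) && !entry_is_code(flat_gdt[2]));
    assert(entry_is_code(flat_gdt[3]) && !entry_is_code(flat_gdt[4]));
    assert(entry_dpl(flat_gdt[1]) == 0 && entry_dpl(flat_gdt[2]) == 0);
    assert(entry_dpl(flat_gdt[3]) == 3 && entry_dpl(flat_gdt[4]) == 3);
}
```

With this ordering the ring 0 code and data selectors would be 0x08 and 0x10.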

In general if you want to use the segmentation mechanism, by having the different segment registers represent segments with different base addresses, you won't be able to use a modern C compiler, and may very well be restricted to just Assembly.

Basically, since you keep giving C examples, do what the rest of the C world does, which is set up a flat-memory model, use paging, and ignore the fact that segmentation even exists.

***

Hope that clears it up.

Posted: **Wed Jul 07, 2004 11:56 pm**

I took the liberty of copying this into the FAQ: What Segments are About.
