New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 1:35 pm
by babaliaris

Code:

;*********************************************
;	Boot1.asm
;		- A Simple Bootloader
;
;	Operating Systems Development Tutorial
;*********************************************

bits	16							; We are still in 16 bit Real Mode

org		0x7c00						; We are loaded by BIOS at 0x7C00

start:          jmp short loader				; jump over OEM block (2-byte short jump)
                nop						; pad to 3 bytes so the OEM block starts at offset 3

;*************************************************;
;	OEM Parameter block
;*************************************************;

; Error Fix 2 - Removing the ugly TIMES directive -------------------------------------

;;	TIMES 0Bh-$+start DB 0					; The OEM Parameter Block is exactly 3 bytes
								; from where we are loaded at. This fills in those
								; 3 bytes, along with 8 more. Why?

bpbOEM			db "My OS   "				; This member must be exactly 8 bytes. It is just
								; the name of your OS :) Everything else remains the same.

bpbBytesPerSector:  	DW 512
bpbSectorsPerCluster: 	DB 1
bpbReservedSectors: 	DW 1
bpbNumberOfFATs: 	    DB 2
bpbRootEntries: 	    DW 224
bpbTotalSectors: 	    DW 2880
bpbMedia: 	            DB 0xF0
bpbSectorsPerFAT: 	    DW 9
bpbSectorsPerTrack: 	DW 18
bpbHeadsPerCylinder: 	DW 2
bpbHiddenSectors: 	    DD 0
bpbTotalSectorsBig:     DD 0
bsDriveNumber: 	        DB 0
bsUnused: 	            DB 0
bsExtBootSignature: 	DB 0x29
bsSerialNumber:	        DD 0xa0a1a2a3
bsVolumeLabel: 	        DB "MOS FLOPPY "
bsFileSystem: 	        DB "FAT12   "

msg	db	"Welcome to My Operating System!", 0		; the string to print

;***************************************
;	Prints a string
;	DS=>SI: 0 terminated string
;***************************************

Print:
			lodsb					; load next byte from string from SI to AL
			or			al, al		; Does AL=0?
			jz			PrintDone	; Yep, null terminator found-bail out
			mov			ah,	0eh	; Nope-Print the character
			int			10h
			jmp			Print		; Repeat until null terminator found
PrintDone:
			ret					; we are done, so return

;*************************************************;
;	Bootloader Entry Point
;*************************************************;

loader:

	xor	ax, ax		; Setup segments to insure they are 0. Remember that
	mov	ds, ax		; we have ORG 0x7c00. This means all addresses are based
	mov	es, ax		; from 0x7c00:0. Because the data segments are within the same
				; code segment, null em.

	mov	si, msg						; our message to print
	call	Print						; call our print function

	xor	ax, ax						; clear ax
	int	0x12						; get the amount of KB from the BIOS

	cli							; Clear all Interrupts
	hlt							; halt the system
	
times 510 - ($-$$) db 0						; We have to be 512 bytes. Clear the rest of the bytes with 0

dw 0xAA55							; Boot Signature
I understand everything except these instructions:

Code:

loader:

	xor	ax, ax		; Setup segments to insure they are 0. Remember that
	mov	ds, ax		; we have ORG 0x7c00. This means all addresses are based
	mov	es, ax		; from 0x7c00:0. Because the data segments are within the same
				; code segment, null em.
I can not understand why we have to initialize the data segment and extra segment registers to zero.

When I run this bootloader in QEMU, everything works just fine even if I comment out the first three lines under the loader: label. On a real machine it does not: I don't see the output "Welcome to My Operating System!". When I ran the bootloader on a real machine with the two mov instructions that set DS and ES to zero left in place, the program worked as expected.

I probably haven't understood correctly how segments work. I mean, how does the processor use the DS and ES registers to identify the data?

What if I set DS and ES to 0x7c00? I have to try this :p

I understand that my data and instructions are in the code segment; the point is, why must DS and ES be zero? It does not make sense to me. It would make more sense if I were initializing them to 0x7c00, like the code segment.

Why does the author of the code not use segment .text in the assembly code? Is it because when you put everything in the code segment it is self-explanatory, so you don't have to type it explicitly?

Does the org actually initialize the CS register to cs=0x7c00? And how does it know that I'm talking about the code segment (based on my previous question)?

Thank you :P

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 1:47 pm
by dseller
Simple, they are not guaranteed to be 0 if you do not explicitly set them to that value. They can basically be any random value.

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 1:55 pm
by babaliaris
dseller wrote:Simple, they are not guaranteed to be 0 if you do not explicitly set them to that value. They can basically be any random value.
Yes, but why must they be zero?

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 2:07 pm
by glauxosdever
Hi,

babaliaris wrote:I understand everything except these instructions:

Code:

loader:

	xor	ax, ax		; Setup segments to insure they are 0. Remember that
	mov	ds, ax		; we have ORG 0x7c00. This means all addresses are based
	mov	es, ax		; from 0x7c00:0. Because the data segments are within the same
				; code segment, null em.
I can not understand why we have to initialize the data segment and extra segment registers to zero.
The comment in the "mov es, ax" line is wrong; the base segmented address is 0x0000:0x7C00 (the format being segment:offset) and the base linear address is 0x00007C00 (the formula being address = 16*segment + offset).

However, if you had a segmented address 0x07C0:0x0000, the linear address would still be 0x00007C00. Same for 0x0700:0x0C00, 0x00C0:0x7000, 0x0600:0x1C00 and many other combinations. The point here is to use something that is convenient, e.g. it's good to set DS and ES to zero here, as data can be accessed using this segment.

Now, if you want to write to linear address 0x00014000, you may like to set DS=0x1400 and use an offset of 0x0000, or you may like to set DS=0x1000 and use an offset of 0x4000.
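
As a rough illustration of the two choices just described, both stores in the fragment below land on the same byte at linear address 0x00014000. This is only a sketch: the characters 'A' and 'B' are arbitrary placeholders, and it assumes real mode with ordinary RAM at that address.

```nasm
bits 16

; linear address = 16*segment + offset, so both stores hit 0x00014000

        mov     ax, 0x1400
        mov     ds, ax              ; DS = 0x1400 -> segment base 0x14000
        mov     byte [0x0000], 'A'  ; 0x14000 + 0x0000 = 0x14000

        mov     ax, 0x1000
        mov     ds, ax              ; DS = 0x1000 -> segment base 0x10000
        mov     byte [0x4000], 'B'  ; 0x10000 + 0x4000 = 0x14000 (overwrites the 'A')
```

Which pair to use is purely a matter of convenience; the CPU only ever sees the resulting sum.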

As an exercise, how would you set the segment and the offset in order to write to address 0x000B8014 (VGA text memory starting at 0x000B8000)?


Now, let's get more practical. You specify ORG 0x7C00, which means that the assembler should assume the code starts at offset 0x7C00. So, since the code is loaded at linear address 0x00007C00, CS has to be 0x0000. The catch here is that you don't know whether CS:IP is 0x0000:0x7C00 or 0x07C0:0x0000 or something else, so you have to explicitly set it.

As for DS and ES, it's the same, you also have to explicitly set them. DS is implicitly used in some instructions (unless you specify another segment register), and ES in some others (notably string instructions). So, to address something you loaded along with the code starting at 0x00007C00, when using ORG 0x7C00, you have to set DS and ES to 0x0000, for the same reasons you have to set CS to 0x0000.


Regards,
glauxosdever

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 2:29 pm
by babaliaris
I think I get what you're saying but I tried to set CS to zero and it does not work, I don't get the message written on the screen :(
The code is the same as the original post, with the following update:

Code:

loader:

	xor	ax, ax
        mov  cs, ax ; I added this new line here
	mov	ds, ax
	mov	es, ax

	mov	si, msg
	call	Print

	cli
	hlt

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 2:48 pm
by glauxosdever
Hi,


Normally, you set CS and IP together with a far jump, for example:

Code:

jmp 0x0000:label
Hope this helps! :-)
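
For context, here is a minimal sketch of where such a far jump could sit in a boot sector. It assumes ORG 0x7C00 and reuses the loader label from the first post; the start label and the final halt are only there to keep the fragment self-contained, and the BPB is left out.

```nasm
bits 16
org 0x7c00

start:
        jmp     0x0000:loader   ; far jump: CS = 0x0000, IP = offset of loader
                                ; (5 bytes long, so loader = 0x7C05 here)

loader:
        xor     ax, ax
        mov     ds, ax          ; DS = 0, matching ORG 0x7C00
        mov     es, ax          ; ES = 0 as well
        cli
        hlt
```

In a sector that does carry a BPB, the first three bytes are already taken by the short jump and NOP, so a far jump like this would have to come later, for example as the first instruction at loader.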


Regards,
glauxosdever

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 2:54 pm
by JAAman
babaliaris wrote:
I think I get what you're saying but I tried to set CS to zero and it does not work, I don't get the message written on the screen :(
you don't need to set CS at all -- in fact, you should never set CS within the context of a 1st-stage legacy bootloader

you will generally need to use a FAR JMP to the 2nd stage anyway, and that will set CS for you after that (and what CS is within the 1st-stage doesn't matter -- it is highly unlikely that CS:IP is set such that IP can reach the 0x7C00 but will overflow before reaching 0x7E00)

The code is the same as the original post, with the following update:

Code:

loader:

	xor	ax, ax
        mov  cs, ax ; I added this new line here
	mov	ds, ax
	mov	es, ax

	mov	si, msg
	call	Print

	cli
	hlt
bad bad bad...

I don't know what assembler allowed you to do this without lots of warnings, but this is wrong

you must never change CS in this way -- technically the instruction encoding does exist (it executed on the original 8086/8088, but CPUs from the 286 onward reject a MOV to CS with an invalid-opcode exception), and even where it does execute, your code will stop running as intended right after it -- because your code is executing from CS:IP, but you changed CS while IP stays the same -- so now the same CS:IP combination will point to a different address (not the address you want)

there are only 2 safe ways to change CS:
1) use a retf/iret instruction to simultaneously load both CS and IP from the stack
2) use a JMP FAR/CALL FAR instruction to change both CS and IP either from immediate or from an indirect address

both of these will change both CS and IP at the same time, allowing you to control the destination
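
a rough sketch of option 1, for illustration only: it assumes ORG 0x7C00, a target segment of 0, and that the firmware left a usable stack in SS:SP (none of which is guaranteed by the text above); the labels start and reload are made up for the example

```nasm
bits 16
org 0x7c00

start:
        xor     ax, ax
        push    ax              ; pushed first: the value RETF loads into CS (0x0000)
        mov     ax, reload
        push    ax              ; pushed last: the value RETF loads into IP
        retf                    ; pops IP, then CS, in a single instruction
reload:
        cli                     ; execution continues here with CS = 0
        hlt
```

the push order matters: RETF pops IP first and CS second, so the segment has to go onto the stack before the offset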

as I said above, I do not recommend changing CS at all until you leave the 1st-stage, at which point you will probably want to use a FAR JMP anyway to branch into the 2nd stage, and from that point onward you will know what your CS is

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 3:40 pm
by babaliaris
So the IP actually contains an offset from the beginning of the memory where the CS pointer is actually pointing? I thought it contained the actual physical address in the memory.

Also, who is responsible for initializing the CS register? Does the BIOS do that when it loads the MBR sector into memory at 0x7C00?

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 4:24 pm
by JAAman
babaliaris wrote: So the IP actually contains an offset from the beginning of the memory where the CS pointer is actually pointing?
yes
I thought it contained the actual physical address in the memory.
no, it contains the offset into the segment pointed to by the CS register
Also, who is responsible for initializing the CS register?
well, when the CPU first initializes itself the CS register contains the number 0xF000, and the hidden portion points to base address 0xFFFF_0000 so that when combined with IP (which starts at 0xFFF0) it begins executing memory at address 0xFFFF_FFF0...

but on legacy BIOS systems (that is, real computers more than 15 years old, and most emulators if re-configured for legacy operation), the firmware loads the legacy 1st stage to address 0x7C00 and jumps to it -- at this point the combination of CS and IP (usually specified as CS:IP) points to the address 0x0000_7C00 -- however what CS and IP actually contain is undefined and could be anything the firmware wanted to set it to (however for practical reasons CS will usually, but not always, be either 0 or 0x7C0)

all that can be guaranteed is that CS:IP combination points to 0x7C00 -- however, most of the time, whatever CS is, it is capable of pointing to 0x7E00 also (thus can address the entire legacy 1st stage)

Does the BIOS do that when it loads the MBR sector into memory at 0x7C00?
if your 1st stage is loaded by an MBR, generally that MBR will change CS, because it needs to relocate itself (and it cannot be sure if the new location is available under a random CS, so it should use a FAR JMP to jump to its copy), in this case it will usually be 0 (but you shouldn't rely on that)

in either case, the CS is set by the JMP FAR instruction the firmware uses to start either the MBR or the legacy 1st stage

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 4:27 pm
by babaliaris
Just to make things a lot more clear: I can't understand some basic concepts of the x86 architecture. How exactly does the assembler translate labels to memory addresses? Does the assembler translate labels to offsets from the beginning of the file, which are later added to the CS value to calculate the actual memory address of an instruction? What about data memory? Take a look at some questions I made for you, using some drawings too, to make things more clear:

Check This Image

Update:
Based on JAAman's answer (before this post), you can see from the image above that I have mixed everything up and I don't really understand what is going on. It would be awesome if you could explain to me how labels are translated into addresses and things like that. The tutorials that I'm reading do not make these things clear...


By the way, I'm coming from MIPS, where we just had one program counter that pointed to the exact location of the next instruction. x86 is so messed up, I can't get it...

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 4:49 pm
by JAAman
babaliaris wrote:Just to make things a lot more clear: I can't understand some basic concepts of the x86 architecture. How exactly does the assembler translate labels to memory addresses?
that is fairly simple:
the assembler takes the offset of the label from the beginning of the file, and adds whatever you set as ORG to it (so ORG should be the offset within the segment where the file will be loaded)
Does the assembler translate labels to offsets from the beginning of the file, which are later added to the CS value to calculate the actual memory address of an instruction?
no, not exactly... the assembler doesn't generally care at all where the code is located -- when you jump to a label, it doesn't need to know the address of the label, only how far away that label is (because all short and near immediate jumps and calls all use relative offsets -- basically the instruction tells the CPU to jump backwards (or ahead) by x number of bytes) and so the assembler just needs to count how many bytes there are between the instruction and the label
What about data memory?
basically, for data references, it takes the offset of the label from the start of the file, and then adds the ORG value (so you should set your ORG to the offset from DS.base to where the file is loaded in memory)

so if your DS.base == 0 (which would be if you set DS to 0) and your ORG is 0x7C00, then the assembler will count how far the label is from the start of the file and add 0x7C00 (because that is the offset from DS.base to the start of the file)


remember, the assembler doesn't know or care anything about the segment (DS usually, sometimes ES, FS, or GS), that is handled by the CPU itself

while in RMode, the CPU basically just multiplies whatever is in DS by 16, and then that becomes DS.base (actually this multiplication is only done once, when DS is loaded, and the result is stored in the hidden portion of the DS register)

whenever the CPU needs to fetch instructions (or whenever a CS override is applied to a data reference) it adds CS.base to the address to get the actual in-memory address

whenever the CPU needs to fetch data, it uses DS.base (unless there is an override to tell it to use a different register instead, or in certain instructions which require 2 segments, where it uses ES as the second register)
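
to make the split between assembler and CPU concrete, here is a small sketch (the labels start and msg and the string "Hi" are invented for the example; it assumes the file is loaded at 0x7C00 as in this thread)

```nasm
bits 16
org 0x7c00                      ; assembler adds 0x7C00 to every label's file offset

start:
        xor     ax, ax          ; file offset 0, 2 bytes
        mov     ds, ax          ; file offset 2, 2 bytes: DS.base = 0 * 16 = 0
        mov     si, msg         ; file offset 4, 3 bytes: immediate = 0x7C00 + 0x0A = 0x7C0A
        lodsb                   ; file offset 7: CPU reads the byte at DS.base + SI = 0x7C0A
        cli                     ; file offset 8
        hlt                     ; file offset 9
msg     db "Hi", 0              ; file offset 10 (0x0A), so msg = 0x7C0A
```

the assembler only ever produces the 16-bit value 0x7C0A; adding DS.base to it happens in the CPU when lodsb runs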
Take a look at some questions I made for you, using some drawings too, to make things more clear:

Check This Image
I have a personal policy of never clicking any links in posts -- plus that website takes forever to load over my internet connection, so I'm not going to look at that, sorry

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 5:17 pm
by babaliaris
From your answer I've arrived at these hypotheses:

1) A segment does not contain only my code, which in this example is exactly 512 bytes. Until now I thought it did, so I was expecting that CS would point exactly at 0x7C00.

2) When the program starts, IP happens to be exactly 0x7C00, the same as the ORG offset, so that CS:IP = first instruction in my program.

3) When a jump happens, this is how IP gets updated: IP = value_of(ORG) + label_offset, so CS:IP --> the next instruction.

4) Data memory: DS + value_of(ORG) + label_offset. But this still does not make sense. If DS = 0, like I set it in my program, and ORG is the offset from the beginning of the code segment, then DS + value_of(ORG) + label_offset will probably give an address less than the beginning of the code segment itself, since DS is zero.

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 6:09 pm
by JAAman
babaliaris wrote:From your answer I've arrived at these hypotheses:

1) A segment does not contain only my code, which in this example is exactly 512 bytes. Until now I thought it did, so I was expecting that CS would point exactly at 0x7C00.
yes -- a segment is the region of memory from segment.base to segment.base + segment.limit

in RMode, when you update a segment register, the segment.base field is filled in with the value: segment*16
in RMode, segment.limit is generally 64KB

so, when DS = 0, segment.base = 0, segment limit = 64KB, therefore segment region is the area of physical memory from 0 - 0xFFFF (inclusive) -- it is important to note that this area is the same size as what is addressable by a 16-bit offset (such as IP for code, or SP for the stack)
2) When the program starts, IP happens to be exactly 0x7C00, the same as the ORG offset, so that CS:IP = first instruction in my program.
not exactly -- the CS:IP combination equals 0x7C00, you don't know what CS is, and you don't know what IP is, you only know the result of CS.base + IP = 0x7C00

it is possible that CS = 0 and IP = 0x7C00 therefore CS:IP == 0*16+0x7C00 == 0x7C00
it is also possible that CS = 0x7C0 and IP = 0 therefore CS:IP = 0x7C0 * 0x10 + 0 == 0x7C00

other combinations are also possible, but are fairly rare
3) When a jump happens, this is how IP gets updated: IP = value_of(ORG) + label_offset, so CS:IP --> the next instruction.
no

when a normal (immediate mode) jump happens, your assembler does this:
jump_operand = label_offset - offset_of_next_instruction

thus the instructions:

Code:

JMP .label
.label:
results in the assembler producing the code:

Code:

JMP 0
because there are exactly 0 bytes between the jump and the destination address
and the code

Code:

.label:
JMP .label
is turned into:

Code:

JMP -2
because the JMP short instruction is exactly 2 bytes long (so it is jumping backwards by 2 bytes to the beginning of the JMP instruction)

when the CPU executes the JMP instruction (at which point IP already points to the instruction after the JMP), it does this:
IP = IP + Jump_operand
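
the two cases above can be put into one small file to see the encoded bytes (a sketch; the labels back and fwd are invented for the example, and the ORG line is only there to show that it has no effect on these jumps)

```nasm
bits 16
org 0x7c00                      ; changing this value does not change the bytes below

back:
        jmp     short back      ; encoded as EB FE: displacement -2, back to its own start
        jmp     short fwd       ; encoded as EB 00: displacement 0, target is the next instruction
fwd:
        hlt
```

because the displacement is relative to the instruction after the jump, the same bytes work wherever the file is loaded and whatever CS happens to be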
4) Data memory: DS + value_of(ORG) + label_offset .
almost -- it should be DS.base + value_of(ORG) + label_offset... but this is mixing 2 different steps

the assembler does
segment_offset = value_of(ORG) + label_offset

and the CPU itself does
DS.base + segment_offset
But this still does not make sense. If DS = 0, like I set it in my program, and ORG is the offset from the beginning of the code segment, then DS + value_of(ORG) + label_offset will probably give an address less than the beginning of the code segment itself, since DS is zero.
no it won't

DS = 0, therefore DS.base == 0
ORG is the address offset for where the file is located (in this case, it should be 0x7C00)
so the value the assembler calculates is this:
address = label_offset + 0x7C00
and then at runtime the CPU adds 0 (because DS.base == 0)

because the ORG is 0x7C00, the address will always be higher than 0x7C00 -- which is the address you are loading at

the important key is: DS.base + ORG should always equal the physical address where the file is being loaded into memory

in this case, since you are setting DS.base == 0, then ORG should be the address it is loaded to in memory (in this case, that is 0x7C00)
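
for comparison, here is a sketch of the other common pairing, DS = 0x07C0 with ORG 0, which satisfies the same rule (DS.base + ORG = 0x7C00 + 0 = 0x7C00); the labels start and msg and the string "Hi" are invented for the example

```nasm
bits 16
org 0x0000                      ; labels are plain file offsets this time

start:
        mov     ax, 0x07c0      ; 3 bytes
        mov     ds, ax          ; 2 bytes: DS.base = 0x07C0 * 16 = 0x7C00
        mov     si, msg         ; 3 bytes: immediate = 0x000B, not 0x7C0B
        lodsb                   ; 1 byte: CPU reads DS.base + SI = 0x7C00 + 0x000B = 0x7C0B
        cli                     ; 1 byte
        hlt                     ; 1 byte
msg     db "Hi", 0              ; file offset 11 (0x0B)
```

either pairing reaches the same physical byte; what breaks things is mixing them, for example ORG 0x7C00 with DS = 0x07C0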

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 6:33 pm
by babaliaris
I can't thank you enough! I think I get it now! I was confused a lot by the relation between IP and DS, but they don't relate to each other. Then I was misled by the ORG offset. The ORG offset is used for the data memory calculation only, right? So my new hypotheses are:

1) IP and CS together will be set in such a way that at the beginning they point exactly to memory address 0x7C00.

2) I don't mess with CS! Far jumps will handle CS for me.

3) ORG is not used in the calculation of the address of the next instruction; its use is for the data segments only.


Something that tricked me:
If 0x7C00 is the actual physical address where my program gets loaded, how is ORG = 0x7C00 an offset from the beginning of the segment to the head of my program? Doesn't that mean that the segment starts at address 0x0?

Also this is what I have in my mind right now. Is it correct?
Image

Re: New here, trying to understand segments and bootloaders.

Posted: Tue Apr 09, 2019 7:23 pm
by JAAman
babaliaris wrote:I can't thank you enough! I think I get it now! I was confused a lot by the relation between IP and DS, but they don't relate to each other. Then I was misled by the ORG offset. The ORG offset is used for the data memory calculation only, right? So my new hypotheses are:

1) IP and CS together will be set in such a way that at the beginning they point exactly to memory address 0x7C00.
yes!
2) I don't mess with CS! Far jumps will handle CS for me.
pretty much... there is some more advanced stuff as well, but it is not necessary at this point
3) ORG is not used in the calculation of the address of the next instruction; its use is for the data segments only.
mostly... there are some times when it can be used in connection with instructions, but that is unusual, and unless you are planning to do a lot of ASM programming, you aren't likely to run into that
Something that tricked me:
If 0x7C00 is the actual physical address where my program gets loaded, how is ORG = 0x7C00 an offset from the beginning of the segment to the head of my program? Doesn't that mean that the segment starts at address 0x0?
it does if you set the segment register to 0 (like you are doing with DS) -- note though that you can only access the first 64KB of memory without changing DS to something different (in normal RMode, your offset, and thus your ORG also, will always be less than 64K)

because DS.base == 0, your ORG is the same as the address you are loading to, but if your DS.base != 0, then your ORG would not be the same as the loading address
Also this is what I have in my mind right now. Is it correct?
it looks correct to me!