Separate Stack Segment in Protected mode?

16bitPM · Post by **16bitPM** » Fri Aug 19, 2022 1:12 pm

Octocontrabass wrote: The LLDT instruction doesn't take many cycles by itself, but it's a serializing instruction - it causes a bubble in the pipeline that can cost dozens or even hundreds of cycles on modern CPUs.

Very good point, I didn't think about that!

Octocontrabass wrote:
16bitPM wrote:They COULD have added a descriptor cache, but they didn't.
They did! It was a defining feature of the Pentium II. (But the name "descriptor cache" usually refers to the hidden portion of the segment registers and not a cache designed to speed up segment register loads.)


Oh wow, I didn't know about that!! Apparently, there was a 94-entry cache on Pentium and P-II, but I don't know if they kept it on later processors.

nexos · Post by **nexos** » Fri Aug 19, 2022 2:10 pm

16bitPM wrote:Urgh, I know where to find the OW docs. But I said 32-bit segmentation is not very well documented. An example would be nice too.

I thought you meant Open Watcom wasn't well documented in your post. It certainly sounded like that.

... and 32-bit segmentation is well documented. Read the Intel manuals.

16bitPM wrote:OK, I'm not going to discuss personal preference. But I'm sure there are languages that can hide the segmentation better.

I don't know of any.

16bitPM wrote:I thought the segment-descriptor cache was just the hidden part of LDTR, which is included in those timings (except for memory delays of course, and serialization as someone else pointed out).

I highly doubt that LLDT re-loads that until the segment is actually accessed. It's called the principle of lazy evaluation.

16bitPM wrote:I don't think rdos will agree with this. There's a ton of possibilities for segmentation, going from ordinary flat mode to a system with thousands of segments.

None of that can be done only at creation time.

16bitPM wrote:... and has nothing to do with the fact that you got your granularity-argument reversed. Also, I don't get your 128MiB-argument?

In other words, to make use of the granularity, you'll waste a lot of memory on descriptor tables. Paging provides a reasonable granularity that is easy to build off of. Segmentation encourages you to waste tons of memory on tiny segment descriptors.

16bitPM wrote:Yes, let's start a new topic for that

No need. Here's why the TSS is a disaster: it's slower than software context switching and makes your OS non-portable. Enough said.

Tricks like CoW and swapping and demand paging cannot be implemented with segmentation. Those features save tons of memory. With segmentation, it's nearly impossible to do that.

Also, I see you posted a paper about checking data structure bounds with segmentation. That's honestly a nightmare. In a large app, working with hundreds (even thousands) of data structures at once, imagine having to change segment registers every time you access another one, especially if the structures are being used in different modules. That would result in a lot of extra CPU instructions, which would add up quickly.

Modern paging features remove the need for segmentation. Simply make the text area RO and NX and use guard pages for stacks. Using segments for allocated items is unrealistic, as I outlined in the above paragraph.

rdos · Post by **rdos** » Fri Aug 19, 2022 2:35 pm

nexos wrote:No need. Here's why the TSS is a disaster: it's slower than software context switching and makes your OS non-portable. Enough said.

The TSS is not usable for task-switching other than in the double fault and stack fault handler. However, the reason for that is not in portability, but the fact that task-switching is typically a two-step process (save old task and load new task). Hardware task-switching requires these to run in a single step, which is a bad idea.

nexos wrote: Tricks like CoW and swapping and demand paging cannot be implemented with segmentation. Those features save tons of memory. With segmentation, it's nearly impossible to do that.

Fork is a complete disaster with modern multicore & multitasking systems. The only use of CoW is to implement fork. Linux and other POSIX OSes typically have several variants of fork since it is so problematic and SLOW.

Swapping is a complete disaster too, and nobody that has run a computer with too little memory so swapping was activated would agree that it is useful. A modern OS doesn't need to implement swapping to disk, since it is a useless concept.

Both demand paging and demand segmentation work, but loading the whole executable is preferable and faster if most of it is used anyway. Demand segmentation is achieved by setting the present bit to zero, just as with demand paging.
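To illustrate the present-bit mechanism mentioned above, here is a minimal C sketch that packs a legacy 8-byte segment descriptor and then clears its P bit. With P clear, loading that selector into a segment register faults (#NP, or #SS for the stack segment), which gives the OS a hook to bring the segment in. The helper names (`make_descriptor`, `mark_not_present`, `is_present`) are invented for this example; only the bit layout comes from the Intel manuals.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 47 of the descriptor (bit 7 of the access byte) is the Present flag. */
#define DESC_PRESENT (UINT64_C(1) << 47)

/* Hypothetical helper: pack base, 20-bit limit, access byte and the 4-bit
 * flags nibble (AVL, L, D/B, G from low to high) into one descriptor. */
uint64_t make_descriptor(uint32_t base, uint32_t limit20,
                         uint8_t access, uint8_t flags)
{
    assert(limit20 <= 0xFFFFFu);   /* the limit field is only 20 bits wide */

    uint64_t d = 0;
    d |= (uint64_t)(limit20 & 0xFFFFu);             /* limit 15:0        */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;        /* base 23:0         */
    d |= (uint64_t)access << 40;                    /* type, S, DPL, P   */
    d |= (uint64_t)((limit20 >> 16) & 0xFu) << 48;  /* limit 19:16       */
    d |= (uint64_t)(flags & 0xFu) << 52;            /* AVL, L, D/B, G    */
    d |= (uint64_t)(base >> 24) << 56;              /* base 31:24        */
    return d;
}

/* Demand segmentation: leave everything else intact, clear only P. */
uint64_t mark_not_present(uint64_t desc)
{
    return desc & ~DESC_PRESENT;
}

int is_present(uint64_t desc)
{
    return (desc & DESC_PRESENT) != 0;
}
```

For instance, `make_descriptor(0, 0xFFFFF, 0x9A, 0xC)` yields the familiar flat 32-bit code descriptor `0x00CF9A000000FFFF`, and `mark_not_present` turns its access byte from `0x9A` into `0x1A`.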

linguofreak · Post by **linguofreak** » Sat Aug 20, 2022 3:02 am

nexos wrote:
linguofreak wrote:If you're doing things at that granularity, you're heading into iAPX 432 territory.

The point being, I think one of the big reasons it failed is one of the big reasons a lot of capability systems fail: too fine-grained.

nexos · Post by **nexos** » Sat Aug 20, 2022 6:21 am

linguofreak wrote:The point being, I think one of the big reasons it failed is one of the big reasons a lot of capability systems fail: too fine-grained.

Ah, makes sense. My limited research of it made me decide it was a huge waste of Intel's money.

rdos wrote:Fork is a complete disaster with modern multicore & multitasking systems.

Agreed. Fork is a relic from before the days of multithreading.

rdos wrote:The only use of CoW is to implement fork.

I wouldn't say that. CoW is very useful for some types of shared memory. For example, imagine I pass a pointer to some memory block from process A to process B. Process A can't guarantee that that pointer will point to valid memory for a long time. Hence, it marks it as CoW. When either process attempts to write to the blocks, the memory is no longer shared.

This is very useful in passing large memory blocks between servers in a microkernel.
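A user-space sketch of that idea, assuming a reference-counted buffer as a stand-in for a physical page: both sides share one block until one of them writes, at which point the writer gets a private copy. A real kernel would do this in the page-fault handler on a read-only mapping; the names here (`struct block`, `mapping_share`, `mapping_write`, and so on) are hypothetical and only model the bookkeeping.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A reference-counted buffer standing in for a physical page. */
struct block {
    size_t refs;
    size_t len;
    unsigned char *data;
};

/* What each "process" holds: its own mapping of some block. */
struct mapping {
    struct block *blk;
};

static struct block *block_new(const void *src, size_t len)
{
    struct block *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->data = malloc(len);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    memcpy(b->data, src, len);
    b->refs = 1;
    b->len = len;
    return b;
}

int mapping_create(struct mapping *m, const void *src, size_t len)
{
    assert(m != NULL && src != NULL && len > 0);
    m->blk = block_new(src, len);
    return m->blk != NULL ? 0 : -1;
}

/* Process A hands the block to process B: no copy, just one more reference. */
void mapping_share(struct mapping *dst, const struct mapping *src)
{
    dst->blk = src->blk;
    dst->blk->refs++;
}

/* The "write fault": if the block is shared, copy it first, then write. */
int mapping_write(struct mapping *m, size_t off, unsigned char value)
{
    assert(off < m->blk->len);
    if (m->blk->refs > 1) {
        struct block *copy = block_new(m->blk->data, m->blk->len);
        if (copy == NULL)
            return -1;
        m->blk->refs--;      /* the other side keeps the original */
        m->blk = copy;
    }
    m->blk->data[off] = value;
    return 0;
}

void mapping_release(struct mapping *m)
{
    if (--m->blk->refs == 0) {
        free(m->blk->data);
        free(m->blk);
    }
    m->blk = NULL;
}
```

The sender pays nothing up front, and a copy is made only if somebody actually writes, which is the property that makes this attractive for passing large buffers between servers.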

rdos wrote: Swapping is a complete disaster too, and nobody that has run a computer with too little memory so swapping was activated would agree that it is useful.

If you're consistently running primarily off of swap, then it's not very useful. But if you have one process (e.g., a linker) that consumes a lot of memory, and you aren't too worried about that process's performance, then swapping can be very useful. Try building LLVM on an 8GB system, watch it crash, and then add 2GB of swap and watch it run great.

Swapping is also very useful to keep costs down in server environments. E.g., if you have VPSs running with 4GB of memory, and then you get a traffic spike that sets it to 4.5GB usage, then instead of crashing, swapping would temporarily kick in to even out the load. True, performance will temporarily suffer, but the system will recover. Else, you'd spend tons of money on memory. Hard disk space is still much cheaper than DRAM.

Swapping is very useful for evening out memory usage spikes.

rdos wrote:Both demand paging and demand segmentation work

Ok, you got me there. You're right on that.

rdos wrote:but loading the whole executable is preferable and faster if most of it is used anyway.

That's not exactly true. For an executable that is very large with very large shared libraries, an exec()/CreateProcess() would take a noticeably large amount of time. Demand paging evens out the load on the system, with only the extra cost of #PF's. Of course, #PF's can be partially avoided by using anticipatory paging.

Doing this all at once could potentially cause disk cache wiping, which is never a good situation....

eekee · Post by **eekee** » Sat Aug 20, 2022 3:18 pm

Swap is also useful to hold all the junk mega-apps load and barely ever use. It's surprising how well some sauropodian applications run with 3/4 of their memory paged out.

linguofreak · Post by **linguofreak** » Sun Aug 21, 2022 1:28 am

16bitPM wrote:
nexos wrote: The real show-stopper for segmentation is the lack of toolchain support. GCC / Clang / CL, which are by far the three dominant compilers of the day, have no support for segmentation whatsoever. The only production-quality C compiler I can think of that does support segmentation is Open Watcom.
True, but it's not very well documented if you ask me.

nexos wrote: For example; it's much easier to work with a flat address space than a segmented one.
I'm guessing that also has to do with OS design. For me, the logical distinction comes natural (and it did so too when Multics was designed).

nexos wrote: Not to mention that it's much faster too; on modern CPUs with PCID support, switching page mappings is very fast. On segmented system, you have to re-load the LDT every time, which isn't too expensive when segmentation is used lightly, but gets very expensive very quick when you have many segments in your address space.
First of all, there is also overhead in maintaining a paged system even with PCID. Secondly, reloading the LDT is just loading the appropriate LDT selector, putting the data in LDTR and doing some checks. This is independent of the number of selectors. The timings for the LLDT instruction on old CPUs (I only have those readily available atm, but they should be within the same order of magnitude for newer CPUs) are:

80286 : 17-19 cycles
80386 : 20 cycles
80486 : 11 cycles
Pentium : 9 cycles

That's not so bad. Of course, the LDT has to be filled, but that's probably mostly at the start of the process.

nexos wrote: It also is much more granular; you control memory down to the page, which is very useful for swapping, memory protection, and other things. It also provides clean separation of physical memory and the address space, which is very useful to user applications.
You are confusing both. For segments <1MiB, the granularity is 1 byte, and for big (1MiB-4GiB) segments, it's 4096 bytes : the same as paging.
As far as I know, the minimum page size is still 4096 bytes.

The point is, if you have *only* segmentation, without paging, then a segment can only be swapped in or out as a unit. If the segment is 1 kB, that's fine. If the segment is 1 MB, that's probably not too much of a problem, at least on modern systems. If the segment is 1 GB, that's a problem. With paging, a 1 GB chunk of address space can be partially swapped out.
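The granularity figures quoted above fall out of the 20-bit limit field plus the G bit. As a small sketch (the function names are made up for illustration, and only expand-up segments are considered), the effective limit and segment size can be computed like this:

```c
#include <assert.h>
#include <stdint.h>

/* Effective limit (highest valid offset) of an expand-up segment.
 * G = 0: the 20-bit field counts bytes.
 * G = 1: the field counts 4 KiB units and the low 12 bits read as ones. */
uint32_t effective_limit(uint32_t limit20, int g)
{
    assert(limit20 <= 0xFFFFFu);   /* only 20 bits exist in the descriptor */
    return g ? ((limit20 << 12) | 0xFFFu) : limit20;
}

/* Segment size in bytes; 64-bit so that the full 4 GiB case does not overflow. */
uint64_t segment_size(uint32_t limit20, int g)
{
    return (uint64_t)effective_limit(limit20, g) + 1;
}
```

So with G = 0 a segment can be any size from 1 byte to 1 MiB, while with G = 1 sizes run from 4 KiB to 4 GiB in 4 KiB steps. Either way the whole segment is still the unit that is present or not present, which is the point being made about swapping.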

nexos wrote: If you are not convinced that paging is better, look at the mess that C development became in the old days of Win16. Any developer who values their time would not want to mess with that.
That was only a problem because of the 64KiB limit, not of the segmentation concept per se.

Also, if you ask me, paging has become a mess. Just looking at all the features that have been added in the past 20 years...
Many of the performance-related criticisms have also more to do with chip developers putting all their money on paging. They COULD have added a descriptor cache, but they didn't.
They also could have added a TSS cache, but... they didn't.

Given the nature of Intel segmentation specifically, one of the issues is that it adds an addition to every memory access (thus putting more load on the ALU or requiring a dedicated adder in the address logic), and that that addition must precede anything that depends on knowing the logical address of the memory access, whereas paging gives more room for certain types of speculative access.
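In code form, the extra work per access looks roughly like the following. This is a simplified model with hypothetical names (`struct segment`, `seg_translate`): it handles only expand-up segments and single-byte accesses, and ignores access rights.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The part of the cached descriptor that matters for address generation. */
struct segment {
    uint32_t base;    /* linear address of offset 0            */
    uint32_t limit;   /* effective limit: highest valid offset */
};

/* Logical (segment:offset) to linear translation.
 * Returns false where the hardware would raise #GP (or #SS for SS). */
bool seg_translate(const struct segment *seg, uint32_t offset,
                   uint32_t *linear)
{
    assert(seg != NULL && linear != NULL);
    if (offset > seg->limit)
        return false;                 /* limit check                    */
    *linear = seg->base + offset;     /* the addition on every access   */
    return true;
}
```

With a flat segment (base 0, limit 0xFFFFFFFF) both steps become no-ops in effect, which is one reason flat-model code gives the hardware little to do here.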

A lot of things that come automatically with the concept of segmentation, have to be done in software instead : position independent code,

PIC can be enabled by other architectural features (such as IP-relative addressing modes being readily available).

oh yeah, and it's possible to address more than 4GiB on a 32-bit system within 1 process space.

For a general segmented system, yes, it is possible to design the system this way. On Intel, yes, you can do this, but with some significant limitations:

1) The process can have segments totaling more than 4GB, yes, but only 4GB can be accessible at any one time (logical address space, whether paging is on or off, is only 4 GB). If you're copying data from a 3 GB segment to another 3 GB segment, you won't be able to have both present at the same time and any instruction that tries to access both simultaneously will take a fault on whichever segment is currently not present, you'll swap in that segment, but in the process boot out the other segment, so when you restart the instruction it will fault on the other segment, and so forth.
2) If you want to keep all of a program's data in memory (even though it isn't all in the 4GB logical address space) you'll need paging, and in fact, will need PAE paging. If you use segmentation only, without paging, then anything more than 4GB, on either the system or the program level, will require swapping to disk.

Octocontrabass · Post by **Octocontrabass** » Sun Aug 21, 2022 3:16 pm

linguofreak wrote:PIC can be enabled by other architectural features (such as IP-relative addressing modes being readily available).

RIP-relative addressing is only available in 64-bit mode. In other modes, you have to do things like call a function that returns its own return address or use segmentation.
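The "function that returns its own return address" trick can be approximated in C with a GCC/Clang builtin. This is only a sketch: it assumes a GCC-compatible compiler (`__builtin_return_address` and the `noinline` attribute are compiler extensions), and `get_pc` / `pic_address` are names chosen for this example. In real 32-bit PIC code the compiler emits a small thunk for this, e.g. `__x86.get_pc_thunk.bx` in GCC output, and then adds a link-time constant to the result.

```c
#include <assert.h>
#include <stdint.h>

/* Must not be inlined: the trick depends on a real CALL pushing a
 * return address, which is then handed back to the caller. */
__attribute__((noinline)) uintptr_t get_pc(void)
{
    return (uintptr_t)__builtin_return_address(0);
}

/* Turn a displacement known at link time into a run-time address,
 * the way a PIC prologue adds a fixed offset to the thunk's result. */
uintptr_t pic_address(uintptr_t pc, intptr_t displacement)
{
    assert(pc != 0);
    return pc + (uintptr_t)displacement;
}
```

The cost compared with RIP-relative addressing is an extra call and a register tied up holding the base, which is why 32-bit PIC code tends to be slower than its 64-bit counterpart.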

nexos · Post by **nexos** » Sun Aug 21, 2022 6:18 pm

linguofreak wrote:The point is, if you have *only* segmentation, without paging, then a segment can only be swapped in or out as a unit. If the segment is 1 kB, that's fine. If the segment is 1 MB, that's probably not too much of a problem, at least on modern systems. If the segment is 1 GB, that's a problem. With paging, a 1 GB chunk of address space can be partially swapped out.

Yep. Having constant granularity of memory makes a lot of optimizations a lot simpler.

Octocontrabass wrote:RIP-relative addressing is only available in 64-bit mode.

Yeah, that is a stumbling block. But that's just called a failure on Intel's part. It looks like x86 has a lot of those...

... which is why I hope the industry switches to AArch64 (or even RISC-V) pretty soon.

16bitPM · Post by **16bitPM** » Tue Aug 23, 2022 6:08 am

rdos wrote: Hardware task-switching requires these to run in a single step, which is a bad idea.

Can you expand on that?

16bitPM · Post by **16bitPM** » Tue Aug 23, 2022 6:16 am

linguofreak wrote: 1) The process can have segments totaling more than 4GB, yes, but only 4GB can be accessible at any one time (logical address space, whether paging is on or off, is only 4 GB). If you're copying data from a 3 GB segment to another 3 GB segment, you won't be able to have both present at the same time and any instruction that tries to access both simultaneously will take a fault on whichever segment is currently not present, you'll swap in that segment, but in the process boot out the other segment, so when you restart the instruction it will fault on the other segment, and so forth.
2) If you want to keep all of a program's data in memory (even though it isn't all in the 4GB logical address space) you'll need paging, and in fact, will need PAE paging. If you use segmentation only, without paging, then anything more than 4GB, on either the system or the program level, will require swapping to disk.

You would have to use techniques from the 16-bit world and split the program into many smaller modules (but bigger than, say, those of 16-bit Write (written for Windows 3.1)). But yeah, I don't think that's the most useful feature.

rdos · Post by **rdos** » Tue Aug 23, 2022 7:10 am

16bitPM wrote:
linguofreak wrote: 1) The process can have segments totaling more than 4GB, yes, but only 4GB can be accessible at any one time (logical address space, whether paging is on or off, is only 4 GB). If you're copying data from a 3 GB segment to another 3 GB segment, you won't be able to have both present at the same time and any instruction that tries to access both simultaneously will take a fault on whichever segment is currently not present, you'll swap in that segment, but in the process boot out the other segment, so when you restart the instruction it will fault on the other segment, and so forth.
2) If you want to keep all of a program's data in memory (even though it isn't all in the 4GB logical address space) you'll need paging, and in fact, will need PAE paging. If you use segmentation only, without paging, then anything more than 4GB, on either the system or the program level, will require swapping to disk.
You would have to use techniques from the 16-bit world and split the program into many smaller modules (but bigger than, say, those of 16-bit Write (written for Windows 3.1)). But yeah, I don't think that's the most useful feature.

This is easily fixed by paging (memmap). You create a window of, say, 2MB, and then you can map this to any physical address using a syscall. I can access 80GB of sample data in my 32-bit signal processing application. This only becomes problematic if you want truly random access. With sequential access or at least localized access, memmap works just fine and doesn't need a 64-bit address space for accessing GBs of data.
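The bookkeeping behind such a window is simple arithmetic. The sketch below only computes which window-aligned chunk a 64-bit data position falls into and whether the window must be remapped; the actual mapping syscall is OS-specific and deliberately left out. All names (`WINDOW_SIZE`, `locate`, `needs_remap`, `windows_needed`) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_SIZE (UINT64_C(2) << 20)   /* 2 MiB window, as in the post */

struct window_ref {
    uint64_t window_base;   /* position the window must be mapped at */
    uint32_t offset;        /* byte offset inside the mapped window  */
};

/* Split a 64-bit data position into (window base, offset in window). */
struct window_ref locate(uint64_t pos)
{
    /* Masking is only valid for power-of-two window sizes. */
    assert((WINDOW_SIZE & (WINDOW_SIZE - 1)) == 0);

    struct window_ref r;
    r.window_base = pos & ~(WINDOW_SIZE - 1);
    r.offset = (uint32_t)(pos & (WINDOW_SIZE - 1));
    return r;
}

/* True if accessing pos requires moving the window first. */
int needs_remap(uint64_t mapped_base, uint64_t pos)
{
    return locate(pos).window_base != mapped_base;
}

/* Number of remaps for one sequential pass over total_bytes of data. */
uint64_t windows_needed(uint64_t total_bytes)
{
    return (total_bytes + WINDOW_SIZE - 1) / WINDOW_SIZE;
}
```

A sequential pass over 80 GiB then needs 40960 remaps of a 2 MiB window, one per 2 MiB of data, which is cheap. Truly random access could need a remap on nearly every access, which matches the caveat above.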

16bitPM · Post by **16bitPM** » Sun Oct 23, 2022 7:33 am

I have another question, somewhat related to the topic of stacks :

If I understood correctly, it's possible to have a full 64K/4G segment for all segment registers but SS ; for SS you can define a segment as starting from 0 to the top of program memory and define it expand down (with (E)SP pointing to the TOS). This way you can still access all of memory without segment loads, but implicit stack operations can trigger a stack exception this way. Is this something that makes sense?? Or am I missing something (because I don't think ED segments are used often).
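For reference, the limit check differs between expand-up and expand-down segments. The following sketch models it under simplifying assumptions: `limit` is the effective limit, `big` is the B flag (upper bound 0xFFFFFFFF if set, 0xFFFF otherwise), and `access_in_segment` is a name invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Would an access of `size` bytes at `offset` pass the segment limit check?
 * Expand-up:   valid offsets are 0 .. limit.
 * Expand-down: valid offsets are limit+1 .. top, where top depends on B. */
bool access_in_segment(uint32_t limit, bool expand_down, bool big,
                       uint32_t offset, uint32_t size)
{
    assert(size >= 1);

    uint64_t last = (uint64_t)offset + size - 1;   /* last byte touched */
    if (last > 0xFFFFFFFFu)
        return false;                              /* wraps past 4 GiB */

    if (!expand_down)
        return last <= limit;

    uint32_t top = big ? 0xFFFFFFFFu : 0xFFFFu;
    return offset > limit && last <= top;
}
```

In the scheme described above, SS would be expand-down with its limit just below the lowest legal stack address, so a push that runs past the bottom of the stack fails this check and raises a stack fault, while DS and ES stay flat.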

Octocontrabass · Post by **Octocontrabass** » Sun Oct 23, 2022 8:50 am

Compilers that expect a flat address space will use implicit stack addressing (memory operands with EBP as the base) for addresses that are not on the stack.

rdos · Post by **rdos** » Sun Oct 23, 2022 12:33 pm

16bitPM wrote:I have another question, somewhat related to the topic of stacks :

If I understood correctly, it's possible to have a full 64K/4G segment for all segment registers but SS ; for SS you can define a segment as starting from 0 to the top of program memory and define it expand down (with (E)SP pointing to the TOS). This way you can still access all of memory without segment loads, but implicit stack operations can trigger a stack exception this way. Is this something that makes sense?? Or am I missing something (because I don't think ED segments are used often).

I'm not using ED stack segments. In kernel, every thread has its own 4k stack, but it cannot be expanded. If it is exhausted, a double fault will result which switches to a handler task with a known valid stack. However, when running the kernel in long mode, I need the stack to be flat and so I also support setting up a flat kernel stack, which of course will not have proper limit checking.

For applications, I currently only support flat memory model, and so the user mode stack typically is flat and not ED. However, ED is supported by the OS if future segmented applications are implemented.

OSDev.org
