Separate Stack Segment in Protected mode?

linguofreak · Post by **linguofreak** » Thu Aug 18, 2022 4:32 am

nexos wrote: Paging, however, provides a huge number of benefits. For example, it's much easier to work with a flat address space than a segmented one. Not to mention that it's much faster too; on modern CPUs with PCID support, switching page mappings is very fast. On a segmented system, you have to re-load the LDT every time, which isn't too expensive when segmentation is used lightly, but gets very expensive very quickly when you have many segments in your address space.

You could have a segment cache with an LDTID field. It's just an optimization that's not done because few people use x86 segmentation because it doesn't have what it needs to be useful.

nexos wrote: It also is much more granular; you control memory down to the page, which is very useful for swapping, memory protection, and other things. It also provides clean separation of physical memory and the address space, which is very useful to user applications.

As I said, the most useful thing for a hardware designer to do would be to implement a non-flat address space via a paging mechanism. It wouldn't technically be segmentation in the base-offset sense, but the way, at an assembly-language level, that a program would access the non-flat features of the processor would resemble Intel segmentation.

nexos wrote: If you are not convinced that paging is better, look at the mess that C development became in the old days of Win16. Any developer who values their time would not want to mess with that.

Being forced into using a non-flat address space due to address space size limitations is not great. Being able to use a non-flat address space as an organizational tool is a concept with potential.

linguofreak · Post by **linguofreak** » Thu Aug 18, 2022 4:49 am

Schol-R-LEA wrote:@Linguofreak: I agree with Ivan Godard, who said (paraphrasing) "I would love to make a capability-based architecture, I know how to make a capability-based architecture, but I don't know of any way to sell a capability-based architecture." Security comes dead last in the minds of most consumers, and even most professional administrators. It's sad, but true.

Define your ISA. Come up with a new programming language that has capability features that map closely to the capability features of your ISA, with the reference implementation compiling to an interpreted bytecode that just happens to be your ISA. If the language takes off, port Linux to your ISA. Then implement your ISA in silicon and market it along the lines of "this chip runs FooLang programs *faster*". Then you can start implementing an OS in FooLang that makes better use of the capability features than an OS with legacy baggage from the flat-address-space tradition ever could.

As I've said before, hardware capabilities are what rdos really seems to want. They are not what x86 segmentation provides.

The frustrating thing is that they get *so close*. But without something like what I called the "virtual descriptor table" above (or whatever IBM calls the equivalent z/Arch structure; I can't be arsed to grovel through the z/Arch documentation at the moment), you still need software assistance to provide a capability mechanism.

nexos · Post by **nexos** » Thu Aug 18, 2022 6:57 am

rdos wrote:The microkernel still lives in a flat memory model kernel (even if small) that can contain bugs.

Of course. The exact same thing is true in segmented environment. Even if you use fine-grained near pointers, you can still corrupt your address space.

But let's say your kernel does use a bad pointer. What would happen in a microkernel or segmented monolithic kernel? Let's compare:

Microkernel: Kernel address space is contained between 0xE0000000 and 0xF0000000. The rest of the address space is inaccessible (including user-mode pages, using features like SMAP). Hence there is a pretty good chance that a crash will occur and no corruption.

Segmented monolith: Kernel is contained between 0x80000000 and 0xF0000000. Use fine-grained near pointers; more than likely, a bad pointer will be outside of the segment's limit and hence cause a crash. Note that how much better this is than microkernels depends on how fine-grained your segments are.

As we can see, the difference is there, but pretty small when you think about how much less address space microkernels take up.
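To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It assumes a bad pointer is uniformly random over the 32-bit linear address space and that every byte of each kernel range is actually mapped; neither assumption is stated in the post, and the function name is made up for illustration.

```python
ADDRESS_SPACE = 1 << 32  # size of a 32-bit linear address space in bytes


def mapped_fraction(start, end):
    """Fraction of the address space covered by the half-open range [start, end)."""
    return (end - start) / ADDRESS_SPACE


# Ranges taken from the comparison above.
microkernel = mapped_fraction(0xE0000000, 0xF0000000)
monolith = mapped_fraction(0x80000000, 0xF0000000)

print(f"microkernel: {microkernel:.2%} of addresses are kernel memory")
print(f"monolith:    {monolith:.2%} of addresses are kernel memory")
```

Under those assumptions, paging alone already rejects most stray pointers in the microkernel case, while in the monolith case the segment limits have to catch what paging lets through.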

Of course, you could be extremely fine-grained with your pointers in segmented environments, but that becomes very hard to manage very quickly. And that becomes a haven for bugs too.

rdos wrote:So, you should not compare segmentation with paging, rather what additional advantage the use of segmentation has over only using paging.

Using segmentation with paging could quickly turn into a mess... Segmentation does help prevent overruns, but to do this you would have to use boatloads of near pointers which could easily reduce efficiency a lot.

In trying to reduce bugs, you could make it complicated to the point that people begin introducing bugs.

linguofreak wrote:You could have a segment cache with an LDTID field. It's just an optimization that's not done because few people use x86 segmentation because it doesn't have what it needs to be useful.

True, but if you use one segment per malloc'ed block (which I believe rdos has suggested before), those caches will have to be very large.

linguofreak wrote:Being forced into using a non-flat address space due to address space size limitations is not great. Being able to use a non-flat address space as an organizational tool is a concept with potential.

I didn't even touch the size limitation aspect. I was thinking about having to differentiate between near and far pointers, figure out which kind you needed, and then decorate functions / structures with the mess.

rdos · Post by **rdos** » Thu Aug 18, 2022 7:27 am

nexos wrote:
rdos wrote:So, you should not compare segmentation with paging, rather what additional advantage the use of segmentation has over only using paging.
Using segmentation with paging could quickly turn into a mess... Segmentation does help prevent overruns, but to do this you would have to use boatloads of near pointers which could easily reduce efficiency a lot.

Not really. You can think of it as two different "systems". You implement paging the usual way to handle page-level protection and separate address spaces. Then you run segmentation on top of the linear address space where it is motivated. I only use segmentation in kernel & drivers, and applications are still flat. That still keeps the door open for Posix compatibility and using GCC or Clang for applications.

nexos wrote: True, but if you use one segment per malloc'ed block (which I believe rdos has suggested before), those caches will have to be very large.

This is a bit problematic considering the number of available GDT selectors. I mainly map selectors to code & data for drivers, capability objects and a few other things. This keeps most of GDT unused under normal circumstances. I also have a similar size allocator with easy physical address translation, and each of these objects is also mapped to a selector. So, while I can allocate 1000s of small objects within them, they are still protected by the selector.
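For reference, the ceiling on available selectors comes from the selector format itself: a 16-bit selector carries a 13-bit table index, a table-indicator bit and a 2-bit requested privilege level, so a single descriptor table holds at most 8192 entries (and GDT entry 0 is the null descriptor). A minimal sketch of that encoding follows; the function names are illustrative and are not taken from RDOS.

```python
MAX_DESCRIPTORS = 1 << 13  # 13-bit index field, so 8192 entries per table


def make_selector(index, use_ldt=False, rpl=0):
    """Pack an x86 segment selector: index in bits 15..3, TI in bit 2, RPL in bits 1..0."""
    if not 0 <= index < MAX_DESCRIPTORS:
        raise ValueError("descriptor index out of range")
    if not 0 <= rpl <= 3:
        raise ValueError("RPL must be 0..3")
    return (index << 3) | (int(use_ldt) << 2) | rpl


def split_selector(selector):
    """Unpack a selector into (index, use_ldt, rpl)."""
    return selector >> 3, bool(selector & 0x4), selector & 0x3
```

For example, `make_selector(1)` gives 0x08, the value commonly seen for the first usable GDT entry at ring 0.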

nexos · Post by **nexos** » Thu Aug 18, 2022 7:43 am

rdos wrote:This is a bit problematic considering the number of available GDT selectors. I mainly map selectors to code & data for drivers, capability objects and a few other things. This keeps most of GDT unused under normal circumstances. I also have a similar size allocator with easy physical address translation, and each of these objects is also mapped to a selector. So, while I can allocate 1000s of small objects within them, they are still protected by the selector.

But each time you need to change what capabilities are being accessed, you need to re-load the segment register. Which is quite expensive.

rdos wrote: Then you run segmentation on top of the linear address space where it is motivated.

Wouldn't you need two memory management systems? One for segmentation, and then one for paging? Sounds like a lot of duplicated work to me.

rdos · Post by **rdos** » Thu Aug 18, 2022 8:41 am

nexos wrote: But each time you need to change what capabilities are being accessed, you need to re-load the segment register. Which is quite expensive.

Validating the capability is a lot more expensive than loading a segment register.

nexos wrote: Wouldn't you need two memory management systems? One for segmentation, and then one for paging? Sounds like a lot of duplicated work to me.

They are quite different. For paging, I support both protected mode paging and PAE paging, and these go through a "virtual method table" that is set up to use either protected mode or PAE paging. Thus, the actual paging support is isolated to a single module that comes in two variants. The basic functions of the module are things like allocating pages, changing page attributes, and creating & destroying an address space. In kernel, there is a page-aligned memory allocator that uses the paging module to allocate pages and a small memory allocator (malloc-type) that is implemented with links. For selectors, there are functions to allocate & free GDT selectors, and to set up & retrieve their base & limits. There are also functions to allocate & free small memory objects by first using the small memory allocator and then setting up a selector for it. So, the paging & segmentation functions are quite different.
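Functions that set and retrieve a selector's base and limit ultimately pack and unpack 8-byte descriptors. Below is a rough sketch of that packing, following the standard protected-mode descriptor layout. The helper names are hypothetical and this is not RDOS's actual API.

```python
def encode_descriptor(base, limit, access, flags):
    """Pack a 32-bit segment descriptor into a 64-bit integer.

    base:   32-bit linear base address
    limit:  raw 20-bit limit field (before granularity scaling)
    access: access byte (present bit, DPL, type bits)
    flags:  upper nibble of byte 6 (G, D/B, L, AVL)
    """
    if not 0 <= base <= 0xFFFFFFFF:
        raise ValueError("base must fit in 32 bits")
    if not 0 <= limit <= 0xFFFFF:
        raise ValueError("limit field must fit in 20 bits")
    desc = limit & 0xFFFF                    # limit bits 15..0
    desc |= (base & 0xFFFFFF) << 16          # base bits 23..0
    desc |= (access & 0xFF) << 40            # access byte
    desc |= ((limit >> 16) & 0xF) << 48      # limit bits 19..16
    desc |= (flags & 0xF) << 52              # G, D/B, L, AVL
    desc |= ((base >> 24) & 0xFF) << 56      # base bits 31..24
    return desc


def decode_base_limit(desc):
    """Recover (base, raw_limit) from a packed descriptor."""
    base = ((desc >> 16) & 0xFFFFFF) | (((desc >> 56) & 0xFF) << 24)
    limit = (desc & 0xFFFF) | (((desc >> 48) & 0xF) << 16)
    return base, limit
```

The split base and limit fields are a leftover of the 80286 descriptor format, which is why a "retrieve base" helper has to reassemble the value from three places.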

linguofreak · Post by **linguofreak** » Fri Aug 19, 2022 3:23 am

nexos wrote:
linguofreak wrote:You could have a segment cache with an LDTID field. It's just an optimization that's not done because few people use x86 segmentation because it doesn't have what it needs to be useful.
True, but if you use one segment per malloc'ed block (which I believe rdos has suggested before), those caches will have to be very large.

If you're doing things at that granularity, you're heading into iAPX 432 territory.

nexos wrote:
linguofreak wrote:Being forced into using a non-flat address space due to address space size limitations is not great. Being able to use a non-flat address space as an organizational tool is a concept with potential.
I didn't even touch the size limitation aspect. I was thinking about having to differentiate between near and far pointers, figure out which kind you needed, and then decorate functions / structures with the mess.

If you're doing segments for organization rather than working with size limitations, then hopefully you aren't doing a lot with far pointers (at least for data -- for code, a far pointer should generally mean you're calling a library, a near pointer should mean you're calling your own function): you might be doing something like mmaping a file to a segment, loading that segment, and using near pointers to seek around in the file. Or maybe you have a segment as a buffer for communication with untrusted code: the code knows that that segment is its buffer for communicating with your process, so you copy the data you need processed into that segment and call the untrusted code, passing it a near pointer to where the data is in the buffer (meanwhile temporarily dropping access to all your other data segments).

That last bit only really works if your non-flat functionality is a full capability system, so won't really work with Intel (at least not without overhead for doing the capability part in software), but it shows the type of things that a non-flat scheme can help with.
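As a toy model of the buffer idea (plain Python, not real hardware behaviour): a segment is a base and a size, a near pointer is just an offset, and any access past the end faults instead of touching neighbouring memory. All names here are invented for illustration, and the capability part (dropping access to the other segments) is not modelled.

```python
class SegmentFault(Exception):
    """Stands in for the fault a CPU raises on a segment limit violation."""


class Segment:
    def __init__(self, memory, base, size):
        self.memory = memory  # shared bytearray standing in for linear memory
        self.base = base
        self.size = size

    def _check(self, offset, length):
        # Every access is validated against the segment size first.
        if offset < 0 or length < 0 or offset + length > self.size:
            raise SegmentFault(f"offset {offset}+{length} exceeds size {self.size}")

    def read(self, offset, length):
        self._check(offset, length)
        start = self.base + offset
        return bytes(self.memory[start:start + length])

    def write(self, offset, data):
        self._check(offset, len(data))
        start = self.base + offset
        self.memory[start:start + len(data)] = data


memory = bytearray(64)
private = Segment(memory, base=0, size=32)
buffer = Segment(memory, base=32, size=16)  # the segment shared with untrusted code

private.write(0, b"secret")
buffer.write(0, b"request")  # near pointer 0 inside the buffer segment
```

Code that only holds `buffer` can address offsets 0 to 15 and nothing else; an attempt such as `buffer.read(10, 8)` raises `SegmentFault` rather than running off the end of the buffer.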

nexos · Post by **nexos** » Fri Aug 19, 2022 6:28 am

linguofreak wrote:If you're doing things at that granularity, you're heading into iAPX 432 territory.

Hmm, I'll have to research the processor a bit more. Sounds interesting from what I've seen so far.

rdos wrote:Validating the capability is a lot more expensive than loading a segment register.

Wouldn't you want to make that operation as inexpensive as possible? If it is expensive, then removing the segment register load could make it less expensive.

rdos wrote:They are quite different.

A better wording on my part would have been "a lot of extra work".

My whole thing comes down to this point:

nexos wrote:Of course. The exact same thing is true in segmented environment. Even if you use fine-grained near pointers, you can still corrupt your address space.

But let's say your kernel does use a bad pointer. What would happen in a microkernel or segmented monolithic kernel? Let's compare:

Microkernel: Kernel address space is contained between 0xE0000000 and 0xF0000000. The rest of the address space is inaccessible (including user-mode pages, using features like SMAP). Hence there is a pretty good chance that a crash will occur and no corruption.

Segmented monolith: Kernel is contained between 0x80000000 and 0xF0000000. Use fine-grained near pointers; more than likely, a bad pointer will be outside of the segment's limit and hence cause a crash. Note that how much better this is than microkernels depends on how fine-grained your segments are.

As we can see, the difference is there, but pretty small when you think about how much less address space microkernels take up.

Hence microkernels are a good alternative to the advantages of segmented systems.

16bitPM · Post by **16bitPM** » Fri Aug 19, 2022 6:38 am

nexos wrote: The real show-stopper for segmentation is the lack of toolchain support. GCC / Clang / CL, which are by far the three dominant compilers of the day, have no support for segmentation whatsoever. The only production-quality C compiler I can think of that does support segmentation is Open Watcom.

True, but it's not very well documented if you ask me.

nexos wrote: For example, it's much easier to work with a flat address space than a segmented one.

I'm guessing that also has to do with OS design. For me, the logical distinction comes naturally (and it did so too when Multics was designed).

nexos wrote: Not to mention that it's much faster too; on modern CPUs with PCID support, switching page mappings is very fast. On a segmented system, you have to re-load the LDT every time, which isn't too expensive when segmentation is used lightly, but gets very expensive very quickly when you have many segments in your address space.

First of all, there is also overhead in maintaining a paged system even with PCID. Secondly, reloading the LDT is just loading the appropriate LDT selector, putting the data in LDTR and doing some checks. This is independent of the number of selectors. The timings for the LLDT instruction on old CPUs (I only have those readily available atm, but they should be within the same order of magnitude for newer CPUs) are:

80286 : 17-19 cycles
80386 : 20 cycles
80486 : 11 cycles
Pentium : 9 cycles

That's not so bad. Of course, the LDT has to be filled, but that's probably mostly at the start of the process.

nexos wrote: It also is much more granular; you control memory down to the page, which is very useful for swapping, memory protection, and other things. It also provides clean separation of physical memory and the address space, which is very useful to user applications.

You are confusing both. For segments <1MiB, the granularity is 1 byte, and for big (1MiB-4GiB) segments, it's 4096 bytes : the same as paging.
As far as I know, the minimum page size is still 4096 bytes.
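The two granularities come from the G bit in the descriptor, which decides whether the 20-bit limit field counts bytes or 4 KiB units. A small sketch of how the effective limit falls out for an ordinary (expand-up) segment; the function names are illustrative only.

```python
def effective_limit(raw_limit, page_granular):
    """Last valid offset of a segment, given the 20-bit limit field and the G bit."""
    if not 0 <= raw_limit <= 0xFFFFF:
        raise ValueError("limit field is 20 bits")
    if page_granular:
        # G=1: the limit counts 4 KiB units and the low 12 offset bits are not checked.
        return (raw_limit << 12) | 0xFFF
    return raw_limit  # G=0: the limit counts bytes


def segment_size(raw_limit, page_granular):
    """Number of addressable bytes in the segment."""
    return effective_limit(raw_limit, page_granular) + 1
```

With G=0 the largest segment is 1 MiB with byte resolution; with G=1 sizes go from 4 KiB up to 4 GiB in 4 KiB steps.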

nexos wrote: If you are not convinced that paging is better, look at the mess that C development became in the old days of Win16. Any developer who values their time would not want to mess with that.

That was only a problem because of the 64KiB limit, not of the segmentation concept per se.

Also, if you ask me, paging has become a mess. Just looking at all the features that have been added in the past 20 years...
Many of the performance-related criticisms also have more to do with chip developers putting all their money on paging. They COULD have added a descriptor cache, but they didn't.
They also could have added a TSS cache, but... they didn't. A lot of things that come automatically with the concept of segmentation have to be done in software instead: position independent code, limit checking, protection against stack overflow, ... oh yeah, and it's possible to address more than 4GiB on a 32-bit system within 1 process space.

rdos · Post by **rdos** » Fri Aug 19, 2022 6:52 am

nexos wrote: Hence microkernels are a good alternative to the advantages of segmented systems.

Or maybe a good add-on. I'm implementing my FS drivers using the microkernel concept, but I'm not doing anything else that way.

I don't believe in doing things based on fixed design principles. I've combined paging, segmentation, monolithic and microkernel in the same OS.

If I get time & motivation, I could implement a Posix interface too, even if my OS is not native Posix. I can (at least theoretically) run both long mode, 32-bit flat mode, 16-bit segmented & DOS applications at the same time. That's because I have no native executable format, rather a loader per format I support.

nexos · Post by **nexos** » Fri Aug 19, 2022 8:09 am

rdos wrote:I don't believe in doing things based on fixed design principles.

Me neither. I just believe that everything that can be pushed to user-space should be pushed there.

While I understand using segmentation for isolation, that won't do anything to stop malicious drivers / components. They'll still be able to use any segment they want when running in ring 0.

rdos wrote:That's because I have no native executable format, rather a loader per format I support.

That's actually a very good idea. It seems like that could greatly increase compatibility with other OSes.

rdos wrote:I've combined paging, segmentation, monolithic and microkernel in the same OS.

That's great! I love seeing people do OSDev in different ways, even if I don't necessarily agree about everything in their design.

16bitPM · Post by **16bitPM** » Fri Aug 19, 2022 9:19 am

rdos wrote: If I get time & motivation, I could implement a Posix interface too, even if my OS is not native Posix. I can (at least theoretically) run both long mode, 32-bit flat mode, 16-bit segmented & DOS applications at the same time. That's because I have no native executable format, rather a loader per format I support.

Soooo hypothetically speaking, if you found time and motivation, you could make a loader that reads in a 32-bit LE file, and creates separate segments for each including the stack?

nexos · Post by **nexos** » Fri Aug 19, 2022 9:51 am

16bitPM wrote:True, but it's not very well documented if you ask me.

Look here: https://github.com/open-watcom/open-wat ... umentation

16bitPM wrote:I'm guessing that also has to do with OS design. For me, the logical distinction comes naturally (and it did so too when Multics was designed).

No, more or less HLL design. It's much more natural to treat memory as a logical sequence of bytes than as a sequence of segments in HLLs (C, C++, Ada, and so on). It's possible to use segments, but doesn't feel very natural.

16bitPM wrote:First of all, there is also overhead in maintaining a paged system even with PCID. Secondly, reloading the LDT is just loading the appropriate LDT selector, putting the data in LDTR and doing some checks. This is independent of the number of selectors. The timings for the LLDT instruction on old CPUs (I only have those readily available atm, but they should be within the same order of magnitude for newer CPUs) are:

The entire load isn't just at the LLDT instruction. That's one part of the data. You also have to invalidate the segment-descriptor cache, and then reload those for memory accesses. Of course, as I said in my earlier post, how relevant that is depends on how granular your segments are. With PCID, you change tags, and that's it (more or less).
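For context on "you change tags": with CR4.PCIDE set, the low 12 bits of CR3 hold the PCID, and bit 63 of the value written tells the CPU not to invalidate the TLB entries tagged with that PCID. A sketch of composing such a value; the helper is illustrative and not taken from any particular kernel.

```python
PCID_MASK = 0xFFF        # PCID occupies CR3 bits 11..0 when CR4.PCIDE=1
CR3_NOFLUSH = 1 << 63    # bit 63 of the MOV-to-CR3 source operand


def make_cr3(root_table_phys, pcid, no_flush=True):
    """Compose the value to write to CR3 for an address-space switch."""
    if root_table_phys & 0xFFF:
        raise ValueError("page-table root must be 4 KiB aligned")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID is 12 bits")
    value = root_table_phys | pcid
    if no_flush:
        value |= CR3_NOFLUSH
    return value
```

The switch itself is then a single register write, although the kernel still has to manage which PCID belongs to which address space and flush stale tags when it recycles them.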

16bitPM wrote:That's not so bad. Of course, the LDT has to be filled, but that's probably mostly at the start of the process.

But segmentation is pointless if all LDT work is done at creation time. You need some way to dynamically create segments if you want more granularity.

16bitPM wrote:You are confusing both. For segments <1MiB, the granularity is 1 byte, and for big (1MiB-4GiB) segments, it's 4096 bytes : the same as paging.
As far as I know, the minimum page size is still 4096 bytes.

.. and then you waste 128MiB on the tables necessary to do this. Of course, that could be (more or less) Intel's fault.

16bitPM wrote:That was only a problem because of the 64KiB limit, not of the segmentation concept per se.

Segmentation doesn't feel natural in HLLs, no matter what.

16bitPM wrote:Also, if you ask me, paging has become a mess. Just looking at all the features that have been added in the past 20 years...

That's not paging's fault. That's Intel's fault. Intel has made a mess out of x86.

But still, I'd rather work with Intel's messy paging than segmentation. Segmentation's mess permeates everywhere.

16bitPM wrote:They COULD have added a descriptor cache, but they didn't.

They actually did. If they didn't, that would require 2 memory accesses per memory access.

16bitPM wrote:They also could have added a TSS cache, but... they didn't.

TSS? That disaster? Let's not touch that this time.

16bitPM wrote:A lot of things that come automatically with the concept of segmentation,

16bitPM wrote:position independent code

PC-relative addressing is the solution with paging. Which CPU makers should have thought of a long time before they did, IMO.

16bitPM wrote:limit checking

That is the one thing segmentation has over paging. I won't try to argue with that. But still, segmentation is much more difficult to work with, as I have said above multiple times.

16bitPM wrote:protection against stack overflow

Guard pages are the solution. Those are pretty simple, and work great.
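A toy model of the guard-page idea, simulated in Python rather than using real page tables: the page just below the stack is deliberately left unmapped, so running off the bottom of the stack faults instead of silently overwriting whatever sits below. The class and function names are invented for illustration.

```python
PAGE_SIZE = 4096


class PageFault(Exception):
    """Stands in for the page fault raised on access to a non-present page."""


class AddressSpace:
    def __init__(self):
        self.mapped = set()  # page numbers that are present

    def map_range(self, start, size):
        first = start // PAGE_SIZE
        last = (start + size + PAGE_SIZE - 1) // PAGE_SIZE
        for page in range(first, last):
            self.mapped.add(page)

    def access(self, addr):
        if addr // PAGE_SIZE not in self.mapped:
            raise PageFault(hex(addr))


def map_stack(space, guard_base, stack_pages):
    """Leave one unmapped guard page at guard_base and map the stack above it."""
    bottom = guard_base + PAGE_SIZE
    space.map_range(bottom, stack_pages * PAGE_SIZE)
    return bottom, bottom + stack_pages * PAGE_SIZE  # (lowest address, initial top)
```

One caveat: a guard page only catches accesses that land inside it, so a single large stack allocation can step over it unless the compiler emits stack probes.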

16bitPM wrote:oh yeah, and it's possible to address more than 4GiB on a 32-bit system within 1 process space

What about something like Microsoft's AWE? But then again, 64-bit paging makes this a million times simpler.

In summary, segmentation is a pain to work with, and doesn't feel very natural in HLLs. The industry selected what was better for its needs. Paging was better for its needs.

Octocontrabass · Post by **Octocontrabass** » Fri Aug 19, 2022 11:26 am

16bitPM wrote:The timings for the LLDT instruction on old CPUs (I only have those readily available atm, but they should be within the same order of magnitude for newer CPUs) are:

80286 : 17-19 cycles
80386 : 20 cycles
80486 : 11 cycles
Pentium : 9 cycles

That's not so bad.

The LLDT instruction doesn't take many cycles by itself, but it's a serializing instruction - it causes a bubble in the pipeline that can cost dozens or even hundreds of cycles on modern CPUs.

16bitPM wrote:They COULD have added a descriptor cache, but they didn't.

They did! It was a defining feature of the Pentium II. (But the name "descriptor cache" usually refers to the hidden portion of the segment registers and not a cache designed to speed up segment register loads.)

16bitPM · Post by **16bitPM** » Fri Aug 19, 2022 1:07 pm

nexos wrote:Look here: https://github.com/open-watcom/open-wat ... umentation

Urgh, I know where to find the OW docs.

But I said 32-bit segmentation is not very well documented. An example would be nice too.

nexos wrote:No, more or less HLL design. It's much more natural to treat memory as a logical sequence of bytes than as a sequence of segments in HLLs (C, C++, Ada, and so on). It's possible to use segments, but doesn't feel very natural.

OK, I'm not going to discuss personal preference. But I'm sure there are languages that can hide the segmentation better.

nexos wrote:The entire load isn't just at the LLDT instruction. That's one part of the data. You also have to invalidate the segment-descriptor cache, and then reload those for memory accesses. Of course, as I said in my earlier post, how relevant that is depends on how granular your segments are. With PCID, you change tags, and that's it (more or less).

I thought the segment-descriptor cache was just the hidden part of LDTR, which is included in those timings (except for memory delays of course, and serialization as someone else pointed out).

nexos wrote:But segmentation is pointless if all LDT work is done at creation time. You need some way to dynamically create segments if you want more granularity.

I don't think rdos will agree with this. There's a ton of possibilities for segmentation, going from ordinary flat mode to a system with thousands of segments.

nexos wrote:
16bitPM wrote:You are confusing both. For segments <1MiB, the granularity is 1 byte, and for big (1MiB-4GiB) segments, it's 4096 bytes : the same as paging.
As far as I know, the minimum page size is still 4096 bytes.
.. and then you waste 128MiB on the tables necessary to do this. Of course, that could be (more or less) Intel's fault.

... and has nothing to do with the fact that you got your granularity-argument reversed. Also, I don't get your 128MiB-argument?

nexos wrote:
16bitPM wrote:They COULD have added a descriptor cache, but they didn't.
They actually did. If they didn't, that would require 2 memory accesses per memory access

"Descriptor cache" was just the hidden part of the segment register, which doesn't suffice if you need to change registers a lot (also not very good for context switching).
I'm talking about a real cache which can hold tens or hundreds of descriptors. As I found thanks to Octocontrabass and this article, they actually had a 96-entry cache in the Pentium and P-II, but I'm not sure for later processors. There's also a patent filed by AMD, much to my surprise.

nexos wrote:TSS? That disaster? Let's not touch that this time

Yes, let's start a new topic for that.

OSDev.org
