embryo2 wrote:Schol-R-LEA wrote:Paging on the x86 also serves to simplify the sharing of memory (and through that, other resources) between processes and even between paravirtualized operating systems
What can be more simplified than direct access to a memory segment?
Sorry, perhaps 'simplify' was not the best wording I could have used; I was not referring to the simplicity of the software (more on that momentarily) but to the complexity of the computations the L2 cache system would have to make. My understanding of this matter may be incomplete, however. To the best of my knowledge, the OS developer has only limited options for optimizing the L1, L2 and (when present) L3 caches on the x86 chips, mostly flushing a cache or disabling it entirely, but the cache hardware itself will be more effective at controlling its own mapping of segments of memory to cache cells when it can limit the set of memory it needs to map. Page maps, as I understand it, provide some of that information.
If anyone can correct me, or provide additional information in general about this, I would appreciate it.
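For what it's worth, the knobs the architecture actually exposes to the OS are few. A rough summary (the bit positions are from the Intel manuals; the comments are my own gloss, and obviously none of this can run outside kernel mode):

```c
/* The coarse cache controls x86 exposes to an OS; bit positions per the
   Intel SDM, the comments are my summary.  These bits live in CR0 and
   can only be modified at privilege level 0. */
#include <stdint.h>

#define CR0_CD (1u << 30)  /* Cache Disable: stop new cache fills entirely */
#define CR0_NW (1u << 29)  /* Not Write-through: changes the write policy  */

/* Beyond these, there are the WBINVD (write back and invalidate all
   caches) and CLFLUSH (flush one line) instructions, plus the MTRR/PAT
   machinery for setting per-region memory types.  There is no interface
   for directly steering which lines the cache keeps -- which is part of
   why the page tables are the main hint the OS can give the hardware. */
```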
Furthermore, using unpaged memory is not simpler from the perspective of the operating system. Most 'flat' memory modes are really only flat from the perspective of a userland application; any multitasking OS still needs to parcel the memory out into multiple process spaces. The paging hardware makes this easier by allowing non-contiguous memory regions to be mapped to a contiguous virtual space (important when memory is being allocated at runtime), among other things. It also reduces the computational burden for the OS itself, again by automatically restricting the space of memory that the system needs to be concerned with at a given moment. Yes, there are costs to using it, and most of the results could be achieved in other ways, but I am not certain that any of those ways are, in fact, simpler from the perspective of the OS developer, nor am I convinced that simplicity for the operating system developers is really a wise goal. Providing simplicity for the users and the application developers often requires considerable complexity on the part of the OS implementation.
Again, any corroborations, refutations, or clarifications on these points would be useful here.
In any case, it is my (admittedly imperfect) understanding that you cannot enter x86-64 long mode without enabling paging, and I have not yet found anything indicating that it can be disabled for applications once in long mode. Similarly, I have the impression that at least some models of the ARM and other RISC architectures require paging in order to execute code in user mode - the OS, of course, does not need to actually implement meaningful page loading and unloading handlers, but the paging will be in place no matter what AFAICT. I haven't really delved into the matter, however, so I will try to read up on it.
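For reference, these are the architectural bits involved (positions per the AMD64 manuals; the sequence in the comments is the standard one, though this is a summary of the mechanism, not runnable entry code):

```c
/* Why long mode implies paging: long mode activates only when CR0.PG is
   set while EFER.LME=1 and CR4.PAE=1; clearing PG afterwards drops the
   CPU back out of long mode, so there is no way to run 64-bit code
   without page tables in place.  Bit positions per the AMD64 manuals. */
#include <stdint.h>

#define CR4_PAE       (1u << 5)    /* Physical Address Extension          */
#define IA32_EFER_MSR 0xC0000080u  /* the EFER model-specific register    */
#define EFER_LME      (1u << 8)    /* Long Mode Enable (software-set)     */
#define EFER_LMA      (1u << 10)   /* Long Mode Active (hardware status)  */
#define CR0_PG        (1u << 31)   /* Paging enable                       */

/* Conceptual entry order: load CR3 with a PML4, set CR4.PAE, set
   EFER.LME via WRMSR, then set CR0.PG -- paging comes last and is
   mandatory. */
```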
Finally, for my OS design, the details of paging mechanisms, especially those of specific processor models, are not an inherent part of the design; I mean to leave most of that to the hardware-specific code generation in the JIT compiler and the system-level code generator, and separate the rest into hardware-specific modules in the OS itself. Most of the OS itself will be unaware of the specific garbage-collection methods in use at any given time, though a module could interrogate the memory manager if it needs to know something, and could request a GC cycle on its current memory usage (though whether that would be done, or is even applicable to the particular memory usage pattern, would be the call of the memory manager itself).
embryo2 wrote:Schol-R-LEA wrote:if I am not mistaken can be used to make the L2 caching more efficient (since, IIRC, the content-addressable memory lookup works better when it can restrict the possible cache mappings to the pages currently in use, and an OS can, in theory anyway, reschedule to a different process whose data is already in cache when a cache miss occurs, rather than waiting for the cache to refill the memory for the process that experienced the miss... er, I think the cache can both read and be read at the same time, anyway, as long as the consumer isn't reading from the same memory as the producer, maybe I'm just confused).
Currently x86 paging caches information only for x86 paging (like in TLBs). So if we get rid of it then there will be nothing missing in our caches. But may be there are additional mechanisms employed by the cache related hardware. I haven't studied deeply the relations between caching of code/data and paging.
I will admit that I have not done so either; however, you seem to have misunderstood what I am talking about. The L2 cache is a content-addressable memory hardware unit on the processor die that backs the L1 cache, which in turn holds the memory most immediately accessed by the CPU. It is a fairly complex subsystem that is used in some form by nearly every current-generation CPU regardless of architecture, and uses content tagging to map a specific piece of code or data in memory to the copy of it in the cache. It is mostly automatic, but the paging system can (AFAIK) give it additional information about the current state of memory usage, to help it make caching and flushing decisions.
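To make the 'restrict the possible cache mappings' idea concrete: in a set-associative cache, the address is split into tag, set index, and line offset, so the tag comparison only has to search the ways of one set rather than the whole cache. A toy sketch with made-up sizes:

```c
/* How a set-associative cache locates a line: the address is split into
   tag | set index | line offset, and the CAM tag lookup only searches
   one set.  Sizes here are illustrative (64-byte lines, 1024 sets). */
#include <stdint.h>

#define LINE_BYTES 64
#define NUM_SETS   1024

/* Which set an address falls into -- only this set's ways are searched. */
static uint32_t cache_set(uint64_t paddr) {
    return (uint32_t)((paddr / LINE_BYTES) % NUM_SETS);
}

/* The tag stored alongside the line, compared to detect a hit. */
static uint64_t cache_tag(uint64_t paddr) {
    return paddr / ((uint64_t)LINE_BYTES * NUM_SETS);
}
```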
embryo2 wrote:However, would we rely on such mechanisms that change too often then it will be the best distraction from more important goals.
I gather that you intend to write a single high-level garbage collector that would be used unchanged on all hardware, correct? While this is fine for your overall reference design, I don't think you would find it feasible in practice. Real-world systems need to have hardware-specific implementations covering multiple configurations, at least as modules that can be substituted for the general-purpose version. Given that I am borrowing heavily from Synthesis, much of the hardware optimization would be handled programmatically, but I would still need to have templates and specializers for the specific hardware in place beforehand to give the system the information it would need for those optimizations.
embryo2 wrote:I suppose it's much better to concentrate on algorithms like JIT or garbage collection instead of hunting negligible advantages of transient hardware related hacks.
Fine, except that garbage collection itself is difficult to implement efficiently without applying a variety of techniques, some of which are indeed going to be 'transient hardware related hacks'. You seem to be seeing virtual memory as a complication, but I do not agree; using a hardware-unaware garbage collection algorithm may be acceptable for a single application using a conservative library collector (e.g., the popular Boehm collector) in a language that does not normally apply GC, but I can't see any way to get adequate performance for system-wide GC without applying every hardware and software hack you can. Paging can actually make this easier, IMAO, as it provides a hardware tool for doing something you will probably have to do anyway (that is, partition the memory into several memory arenas, with different specializations, generation aging criteria, and collection algorithms) in order to provide acceptable performance.
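A hypothetical sketch of that partitioning (every name here is mine, invented for illustration): each arena carries its own collection policy, and the allocation path invokes whichever policy the arena was configured with when it runs out of room.

```c
/* Hypothetical per-arena GC policies: each arena owns its pages and its
   own collection strategy, chosen to match the usage pattern of the
   data it holds.  All names here are invented for illustration. */
#include <stddef.h>

typedef struct Arena Arena;
struct Arena {
    const char *name;
    size_t used, capacity;           /* bytes                          */
    void (*collect)(Arena *self);    /* policy chosen for this arena   */
};

static int copy_runs, sweep_runs;    /* counters for the demo          */

/* Stand-in policies: pretend each reclaims half the arena. */
static void copy_collect(Arena *a) { copy_runs++;  a->used /= 2; }
static void mark_sweep(Arena *a)   { sweep_runs++; a->used /= 2; }

/* Allocation path: on pressure, trigger this arena's own collector. */
static int arena_alloc(Arena *a, size_t n) {
    if (a->used + n > a->capacity) a->collect(a);
    if (a->used + n > a->capacity) return -1;   /* still full */
    a->used += n;
    return 0;
}
```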
embryo2 wrote:Schol-R-LEA wrote:As for garbage collection, I actually see paging as a benefit for that, not a hindrance. Pages serve as a natural quantum for hierarchical meta-collectors, as well as allowing fairly fine-grained copy collection - I am planning on experimenting with a compacting copier that would run when a dirty page is to be swapped out, copy the live data to an empty page, remap the swap space to the new page (if there is any live data still) and swap it, then free the original page.
Well, I see here a memory copy operation implemented because "there's paging and it should be used somehow". In the end we have the same amount of memory occupied by the same data but marked as a different page. And what is the advantage here? It's simpler just not to copy it.
No, it is copying because that is how copying collectors work - 'garbage collection' is something of a misnomer here, as it is the live data that is actually collected by a copying collector, with the memory containing any remaining dead data then being freed in bulk. The algorithm works by copying the live data to a reserved area of memory in order to both eliminate garbage and compact the heap; when the copying is done, the reserved area is then marked as the active area, and the previous active area is marked as freed. My idea is simply to use paging to trigger the garbage collector, and to use pages to partition the memory for faster collection cycles.
Of course, copy collection is only one family of algorithms for GC, and has performance characteristics that lend it to certain uses and not to others. In my planned design, not all memory arenas need to use copy collection, but since the 4K pages happen to be a convenient size for the individual arena blocks - just large enough to be useful while still small enough that collecting one would be brief regardless of algorithm, reducing immediate latency - it is simply a useful tool for the hardware-specific part of the memory manager.
The same applies to using page misses to trigger collection - the memory manager has to track when memory is being exhausted within a given arena anyway, and the pager happens to be a convenient mechanism for detecting that. In some cases, this combination may even allow a memory arena to avoid paging a given block out entirely - if the page has no live data, then it can simply be marked as freed, and for arenas using a non-copying algorithm such as mark and sweep, it may be able to clear enough heap to let the memory allocation proceed without actually swapping the page. It is 'just' an optimization, but a practical memory manager probably would require such optimizations to perform adequately, and the OS should be able to apply them when they are available.
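As a sketch of that swap-avoidance case (the names here are hypothetical, not from any real system): when the pager wants to evict a page, the memory manager can consult the arena's liveness metadata first, and a page holding no live objects never needs to touch the disk at all.

```c
/* Hypothetical per-page liveness metadata: one bit per 64-byte slot, so
   a 4K page has 64 slots and the map fits in one 64-bit word. */
#include <stdint.h>

typedef struct {
    uint64_t live_bitmap;    /* set bit = slot holds a live object */
} PageMeta;

typedef enum { EVICT_FREE, EVICT_SWAP } EvictAction;

/* On an eviction request from the pager: a page with no live data is
   simply freed; only pages still holding live objects need swapping. */
static EvictAction plan_eviction(const PageMeta *p) {
    return p->live_bitmap == 0 ? EVICT_FREE : EVICT_SWAP;
}
```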
True, for a portable system, the reference design of an OS should not rely on any specific mechanism, but it should surely allow such a mechanism to be used for a particular implementation. At some point, any software implementation has to apply the mechanisms of the hardware it is running on, and regardless of whether the hardware-specific implementation is in the OS code, in a compiler or other code generator (regardless of when it is running, so long as it is producing the final executable image), or in a pseudo-machine interpreting a byte code, it should be able to use the information it has regarding the capabilities and state of the system to use the hardware efficiently. Ignoring that reality is almost certain to lead to an unusable system.
embryo2 wrote:Schol-R-LEA wrote:Collecting per page, or per a group of pages, also makes a mixed-algorithm approach more feasible - you could have different collection methods for different types of data, for example a set of pages exclusively for elements of the same size and type
Why we can't do the same without paging?
Schol-R-LEA wrote:It even would make it feasible to have some parts of memory garbage collected and not other parts of memory while still maintaining a system-wide memory management policy (the memory management would still be automatic, but other techniques such as reference counting or automatic references could be employed depending on the usage).
Why we can't do the same without paging?
You can, of course, just not as efficiently or securely (as a hardware mechanism, paging can, at least in principle, both run faster than the software equivalents and block uncontrolled access to pages from outside the processes those pages are assigned to when running userland processes - and unlike segmentation, paging is fine-grained enough and automatic enough to be practical for my intended design). If the paging hardware is there (and you may need to use it anyway), why eschew it if it makes the process simpler? It is automating things I intend to do anyway, so it would make little sense not to use it for its intended purpose.
Again, you seem to be thinking in terms of writing a single, wholly abstract operating system that does not address the specific hardware in any meaningful way. Doing this would be to ignore a fundamental aspect of OS design, even for a nominally portable OS. It makes far more sense to see the majority of the abstract OS as a reference design to be specialized later, or at least as a set of templates for the OS services. Any working system would do well to provide an API of hooks which can be used to apply hardware-specific code when it is available.
To repeat one more time: if anyone can give me reasons why I am mistaken about any of this, or can provide any additional illumination on these matters, it would be appreciated.