x86 32-bit paging management (Pillar Subject)

~ · Post by ~ » Sat Sep 14, 2019 11:51 am

I've working in paging management, and it seems that the easiest way to achieve it is to keep a physical pages bitmap unique to the whole system.

http://sourceforge.net/projects/x86-32-paging/files/
Look at the end of the post for an attached ZIP/.asm file that sums up the easiest way I've found to map a given physical page to a given virtual page, from detecting memory manually to mapping the page.

Code: Select all

;List of paging functions:
;;
 ;OPCODE__CPU_x86_32_raw_paging_allocate_physical_4KB_page      (should be capable to initialize our kernel page directory all by itself)
 ;OPCODE__CPU_x86_32_remap_physical_page_address (can be left unused)
 ;OPCODE__CPU_x86_32_map_physical_page_address_for_malloc_system__page_disabled
 ;OPCODE__CPU_x86_32_map_physical_pagedir_entry__page_disabled
 ;OPCODE__CPU_x86_32_addr2pagedirent
 ;OPCODE__CPU_x86_32_addr2pagetblent
 ;OPCODE__CPU_Disable_Default_Paging
 ;OPCODE__CPU_Enable_Default_Paging
 ;OPCODE__CPU_load_CR3_paging
 ;OPCODE__CPU_generate_4KB_empty_page_directory
 ;OPCODE__CPU_Read_Default_Paging
 ;OPCODE__CPU_Restore_Default_Paging
 ;OPCODE__CPU_x86_32__raw_paging__mark_page_AVL_value
 ;OPCODE__CPU_x86_32__raw_paging__mark_pages_AVL_value (can be unused)
 ;OPCODE__CPU_x86_32__raw_paging__get_distinct_AVL_value
 ;OPCODE__CPU__x86_32__find_biggest_free_virtual_block
 ;OPCODE__CPU__x86_32__find_04MB_free_virtual_block_page_disabled
 ;OPCODE__CPU_x86_32__raw_paging__read_page_AVL_value
 ;OPCODE__CPU_x86_32__free_empty_page_table_page_disabled
 ;OPCODE__CPU_x86_32_raw_paging_free_physical_16KB__AVL_block_page_disabled
 ;OPCODE__CPU_x86_32__pagetable_is_empty_page_disabled
 ;OPCODE__klibc32__malloc_PG_0
 ;OPCODE__klibc32__free_PG_0
 ;OPCODE__CPU_x86_32___virtual_4KB_page_is_not_free__page_disabled
 ;OPCODE__CPU_x86_32__get_AVL_block_size_page_disabled
 ;OPCODE__CPU_x86_32__check_free_virtual_16KB_blocks_after_AVL_allocation__page_disabled
 ;OPCODE__string__search_DWORD_aligned_DWORD_value
 ;OPCODE__CPU_x86_32_raw_paging_allocate_physical_4KB_page_disabled










 ;OPCODE__data_bitmap__free_physical_memory_16KB_cluster (can be unused)
 ;OPCODE__data_bitmap__get_16KB_physical_memory_cluster_state (can be unused)


;List of raw memory functions:
;;
 ;OPCODE__detect_32bit_memory_by_pages    (must be called before any memory allocation)
 ;kalloc_Append_Kernel_Pages



;List of raw binary data functions:
;;
 ;OPCODE__binary_data__find_first_free_WIDEWORD



;List of bitmap functions:
;;
 ;OPCODE__data_bitmap__get_free_bits_offsets
 ;OPCODE__data_bitmap_offset_to_4K_address
 ;OPCODE__data_bitmap__get_free_physical_memory_address
 ;OPCODE__data_bitmap__free_physical_memory_address
 ;OPCODE__data_bitmap__get_free_nibble_offsets
 ;OPCODE__data_bitmap__get_free_physical_memory_16KB_cluster
 ;OPCODE__data_bitmap__get_physical_4KB_memory_address_state



;List of Binary String Processing Functions:
;;
 ;OPCODE__string__search_DWORD_aligned_DWORD_value




;List of VGA card functions:
;;
 ;VGAVideo_CRTC_setTextCurPos



;List of string functions:
;;
 ;num2str



;List of mode 03h text functions:
;;
 ;console_scroll_update
 ;console_doCRLFeffect
 ;console_kprint_visible
 ;console_kprint_uint32_16



;List of debug macros:
;;
 ;printdbg_str_hexnum String_Name,"String to display: ",0,wideax   (last 2 parameters are string terminator and value to display)
 ;include_debugstr String_Name,"FunctionName"  (to mark function starts in assembled code)
 %define isdbg 1       ;Set flag to 0 to disable producing debugging code

;List of library functions:
;;
 malloc
 calloc
 realloc (expand only right after the end of the previous allocation)
 free

Elements that Don't Need to Turn Off Paging

- Converting a virtual address into a byte offset into a page directory entry.

- Converting a virtual address into a byte offset into a page table entry.

- Converting the passed array index into the byte offset of the specified CR3 array index (to get the actual specified page table base address|possible flags) since it is mapped inside the kernel unlike page directories and page tables.

The steps are (you can see them at the end of the ZIP/.asm code):

- Detect available memory to the highest accessible address.
- Create/initialize the physical bitmap with size adapted to the installed RAM to the last page used by the kernel and the first MB.
- Initialize/enable kernel paging.
- Mark the first sequential physical page available for applications in a variable.
- Cache CR0 for being able to change/restore it while the system runs.
- At this point we can easily map new virtual pages.
- The CR3 value is taken into account for being able to modify any page directory, not just the current one.

I need functions to map manually or automatically any address, to unmap them via virtual pages or via the physical bitmap, to search for contiguous blocks of pages returning type, base address and size in pages (free, used, from the same block -for malloc/realloc/free- using the AVAIL field to relate contiguous ones), I need multitasking to test reusing addresses for applications separated from the kernel more than for switching simultaneous programs. Then I can implement malloc-like functions on top of physical and paging management functions for the rest of programs, that don't really need paging or dynamic memory management as much as a C/C++ compatible kernel.

No tutorial says how to manage things beyond the initial mapping for running with paging. Tutorials for other things show functions built step by step to reason how to implement each detail, but paging tutorials have never been specific on the basic set of functions, variables and tracking algorithms that make up the simplest implementation from which to expand as everyone sees fit.

In the above URL I've included a test kernel written in assembly with a function that is capable of mapping a specific virtual address to a specific physical address and set paging attribute bits. It is capable of reserving a new page if there isn't a page table entry for it. That seems to be the easiest thing to implement for managing paging. That's the most important function I've implemented along with the associated bitmap search, other paging and physical memory detection/reserving/freeing functions to make possible for it to work.

Look at the function called OPCODE__CPU_x86_32_raw_paging_allocate_physical_4KB_page, which is the function that maps a physical page to a virtual page address.

Example of identity-mapping Megabyte number 100:

Code: Select all

;Inputs:
;      WIDESI -- Index in the CR3 array 0 to N.
;      WIDEAX -- Privilege/type flags and target physical address.
;      WIDEDI -- Target virtual address (0 for identity mapping)
;
;
;Outputs:
;       WIDEAX -- New virtual page address
;
;;
 mov widesi,0
 mov wideax,(1048576*100)|(_FLAG_x86_32_pageDirectoryEntry_Bit0_Present_TRUE|_FLAG_x86_32_pageDirectoryEntry_Bit1_ReadWrite|_FLAG_x86_32_pageDirectoryEntry_Bit2_UserLevel)
 mov widedi,1048576*100
 call OPCODE__CPU_x86_32_raw_paging_allocate_physical_4KB_page

The opposite function (unmap via a given virtual address in a directory) should make things like freeing a page working as a fully empty page table with no exception to avoid low-level memory leaks easily. But in the end virtual spaces seem to be handled by virtual allocation data to structure the rest of operations.

=====================================================

The easiest is to keep all program allocations to clusters of 4 physical pages and 16 virtual pages (for the virtual ones, that can be partially allocated, not all of them valid of out of 4 page clusters). Think that for a new page directory you need at least 3 new pages (an actual data page, the page directory and a page table for a 4MB block), so if we don't keep things grouped in 4-page clusters the system is capable of reaching a point where it won't be able to allocate new page directories for a new process, and if we need it we will lock up without knowing fast enough if there's enough free RAM speeding up with clusters.

Also, malloc allocations in a directory are easier if kept with a distance of up to 16 Megabytes between allocated blocks to avoid fragmentation widely, to allow realloc() and free() without interfering too often with other blocks, unless a program uses many variables bigger than 16 Megabytes.

Also, the easiest is to keep paging structures unmapped, disable paging and interrupts for the shortest possible time to access them (just enough to ensure that we are coherently marking them as now reserved or free) and reenable to keep working. The physical page bitmap is mapped in kernel space, so searching it doesn't need to disable paging and it's one of the most time consuming operations anyway versus just setting up some paging structures (although we need to search for the AVAIL field for finding big blocks). It causes our address space to keep all addresses available as paging structures will now never interfere because they are unmapped instead of taking up space (doing so is dramatically easier than deciding how to map paging structures themselves and then actual code/data pages), although we'll need to disable paging for allocating/deallocating anything, but now this allocator becomes so simple that we don't need to rely on exceptions for allocating requested pages, we just allocate them when we want them.

Information of contiguous block types could be left to a very fast SQLite3 or manual DB2 database when we have access to disk and swap, to avoid disabling paging too much just to inspect which blocks are contiguous that could be implemented as a cache accelerator.

The question is: Is disabling/enabling paging comparable in speed to reloading CR3 when the cache is invalidated?

=====================================================

Another functions that are needed to manage pages are functions to automatically choose physical and virtual addresses, or force one of them (we already have a function to choose a specific physical/virtual pair meaning it forces them both and leaves searching appropriate addresses to external code). Functions to choose automatic addresses are the most important after having manual mapping functions because those are the ones that take care of where to place stuff without us (they are some of the most used most of the time for normal allocations).

Also functions to search free or reserved virtual and physical page blocks to choose for currently manual mapping functions aligned or not to the page cluster size, allocating virtual addresses apart up to 16 Megabytes and virtual clusters of 16 4KB pages to minimize fragmentation per program. The functions need to return the size in pages of a current/next page block so that we can then decide if it's big enough or not, to actually allocate/free. It will be necessary for malloc() and realloc(). We can use a given value of the AVAIL field to indicate whether the next/previous contiguous pages are part of the same block. We could probably use 2 or 3 different AVAIL values so that we can differentiate between 2 or 3 contiguous blocks (as long as those pages are actually valid, if not AVAIL and the other fields should be 0 at all times to make page entries to indicate a free NULL pointer, and paging structures can have an AVAIL value of 0 as they only need to be mapped or unmapped by the base kernel with single-page fragmentation). It seems to be the easiest way to keep track of the sizes of malloc blocks purely in hardware with zero extra space or additional structures/logic that could introduce unstability because of unnecessary complexity.

=====================================================

It's better to support at least very basic multitasking with a separate CR3 directory to provide resistance without system-wide lockups that we can stop with a hotkey or the like (a key to switch processes manually or to terminate the current one and return to the kernel), and also to make possible to reuse addresses.

The easiest is to employ the AVAIL field marked with all 1's to indicate which pages come from the system/kernel, and that cannot be deallocated from the physical page bitmap when freeing the secondary process memory for termination (free its own page tables but without unmarking kernel memory in the physical page bitmap).

=====================================================

As you can see, it seems that the whole physical/paging/malloc system can be done in hardware and a software bitmap if we figure out a way of making use of available paging bit fields and come up with a reasonable grouping of pages and middle-sized virtual address ranges.

Things like the physical page bitmap could be reallocated to a higher address if it gets in the way of other necessary addresses, for example if it goes beyond the first 4MB of the virtual address space (if programs start there and we need to remap the whole base kernel into the program's space). There is a function that can reallocate the virtual address of a physical page to any other existing virtual address, and other that can do the same even if no page table currently exists (this one should always be used as part of the page manager automation).

But now minimum multitasking with separate CR3 needs to be implemented as a container, for any other function than manually mapping a specific physical page to a specific virtual one to make practical sense (for example to start testing how well malloc-style calls work in programs and how much we will improve against exceptions, how well we can allocate and free memory from separate processes).

Does it look OK to base malloc-like functions purely in the paging hardware with minimum or absolutely no other control data to keep track of the allocations for a given virtual address space? It seems like a practical application that needs such minimalization to effectively reuse addresses several times transparently and to base basic memory management functions only in the available CPU hardware for all tracking.

FINAL CONCLUSION
Learning paging is just about practicing malloc, thinking that paging is just an optimized hardware array to divide time and space, and we must manage it from the point of view of simply allocating and deallocating, also some mechanism of resizing, be it by leaving enough space between blocks or by defragmenting transparently in some way.

malloc/realloc/free are the starting end result. We can spend as much time as we want practicing those functions and their implementation. There is an infinite number of paging tricks we can add to our program depending on the primitives we want for a given software.

In the end we can rewrite a kernel version that only calls malloc, etc., instead of lower-level functions, for a better-separated code.

Octocontrabass · Post by **Octocontrabass** » Sun Sep 15, 2019 10:13 am

~ wrote:- Detect available memory to the highest accessible address.

You assume memory will be in one continuous block. What happens if there are gaps?
You assume memory will be available. What happens if the firmware is already using it for things like ACPI and SMM?
You assume it's safe to probe addresses that aren't memory. What happens if you hit the memory-mapped registers for your motherboard's voltage controllers?

~ wrote:Is disabling/enabling paging comparable in speed to reloading CR3 when the cache is invalidated?

Reloading CR3 is faster: it doesn't invalidate cached global pages.

For small updates to the page tables, INVLPG is fastest: it only invalidates the parts you want to invalidate.

You can't take advantage of these features (or newer ones like PCID) if you're disabling paging. Modern CPUs are designed for operating systems that never disable paging.

~ wrote:Does it look OK to base malloc-like functions purely in the paging hardware with minimum or absolutely no other control data to keep track of the allocations for a given virtual address space?

That's fine, but in the future you might discover you need that extra control data.

~ · Post by ~ » Thu Oct 17, 2019 6:29 am

Also, shouldn't disabling paging in CR0 be unimportant if interrupts are disabled, which could be common when modifying big blocks of pages to prepare them atomically in a way that will never crash?

Octocontrabass · Post by **Octocontrabass** » Thu Oct 17, 2019 7:47 am

Disabling interrupts causes latency, and latency makes your OS slow. Modern OSes use lock-free algorithms with atomic operations so interrupts can stay enabled even when modifying big blocks of pages. (They sometimes still disable interrupts, but only for very short amounts of time.) Of course, this means paging stays enabled too.

"Lock-free" isn't talking about the LOCK prefix. Many lock-free algorithms rely on x86 instructions with the LOCK prefix.

Do you have answers to any of the questions in my previous post?

0xCC · Post by **0xCC** » Fri Oct 18, 2019 7:04 am

I'm currently trying to implement malloc function to automatically choose a Virtual Address with corresponding virtual size and map it to physical address frame, and here is the simplest algorithm i thought of. (ignore this post if it may seem stupid, it has been 2 weeks since i started OS dev lol).
1- Create a bitmap called VirtualMem_BITMAP which is for available 4kb virtual pages (each bit for a 4kb page; 0 for available pages,1 for used pages) that would be ~130KB size table.
2- Create another bitmap called PhysicalMem_BITMAP which is for available 4kb physical pages same as before, the size depends on physical RAM size.
3- When malloc(size,type,protection) is called first we look in VirtualMem_BITMAP for certain number of consecutive zeros in the bitmap (number of consecutive zeros depends on the "size" argument of malloc) if we found them, the position of first zero represents the address of allocated virtual memory, if we didn't find these zeros then no virtual memory available and functions fails.
4- Now we look in PhysicalMem_BITMAP for zeros (They dont have to be consecutive zeros since we can map contiguous virtual memory pages to non-contiguous physical memory pages) and now we fill page table with physical memory frames represented by zeros in PhysicalMem_BITMAP. if there is no enough zeros then we need to swap out some memory pages.

And to make memory management more efficient we can skip step 4 in malloc and apply it only when the page has been accessed and caused page fault.

iansjack · Post by **iansjack** » Fri Oct 18, 2019 8:00 am

The size of memory allocated by malloc is frequently very much less then 4KB, but this is the smallest amount of virtual memory that you allow to be allocated. So it seems to me that this scheme wastes a lot of RAM. Even if the amount allocated is greater than 4KB you are wasting 2KB, on average, per allocation.

0xCC · Post by **0xCC** » Fri Oct 18, 2019 8:43 am

iansjack wrote:The size of memory allocated by malloc is frequently very much less then 4KB, but this is the smallest amount of virtual memory that you allow to be allocated. So it seems to me that this scheme wastes a lot of RAM. Even if the amount allocated is greater than 4KB you are wasting 2KB, on average, per allocation.

but 4KB is smallest virtual size that x86 paging architecture allows for an individual page.
in Windows for example, VirtualAlloc always allocate 64k aligned addresses.

~ · Post by ~ » Fri Oct 18, 2019 9:43 am

Any sort of page management seems to be just an array data structure optimized for speed and space.

Think about how a single page directory is a structure capable of covering the whole memory simply with indexed entries decoded from a virtual address from which we can reconstruct array indexes and page offset.

It's easy to traverse it in a loop, from CR3 to the end physical page offset in little time, no matter how fragmented it is.

It could probably be implemented in software in C as an optimized way to handle arrays of any data type.

iansjack · Post by **iansjack** » Fri Oct 18, 2019 10:39 am

4KB is the smallest physical page that can be allocated, and this will correspond to a virtual page. But malloc frequently allocates far smaller amounts of memory, so your scheme is very wasteful. Pages are irrelevant.

Typically, several calls to malloc will allocate memory from one mapped page.

~ · Post by ~ » Tue Oct 29, 2019 6:41 am

The question is, is it normal for Windows as new to 7 to become so fragmented in RAM that it's unable to run certain programs, meaning it doesn't take the time to defragment itself on such case to reallocate (it can only free existing big blocks)?

I downloaded the latest Half-Life 1.
I tried to run it with system resources at maximum usage.
hl.exe told me that it could not allocate 40 MB.
It seems that there were no virtual or physical 40-Megabyte blocks in the whole system. I had to close a Firefox window with 32 YouTube tabs that had crashed, and then it worked.

It could mean that Windows won't swap to reallocate for big memory block requests, so the longer it runs, the more fragmented it will become and it won't defragment, it will only free memory used by running programs.

It must be the case that most programs are designed to request for fragmented memory to the size of the individual smallest memory block size, so they only request for the small available pieces individually and manage the fragments themselves to avoid this problem.

------------------------------------------------------------------
I had VLC with SHOUTcast stream, eMule with no downloads, uTorrent and JDownloader with several completed downloads, Agent Ransack, Acrobat Reader DC with a 5-8MB PDF, and these:

12 GB installed in Windows 7 32-bit (2.96GB usable -per process?- maybe for the AMD video card)
Firefox tabs: 32 video tabs, 25, 7, 35
Chrome tabs: 14, 2
Windows Explorer: 24 windows
MS-DOS fullscreen graphics enabled

Octocontrabass · Post by **Octocontrabass** » Tue Oct 29, 2019 8:09 am

~ wrote:12 GB installed in Windows 7 32-bit (2.96GB usable -per process?- maybe for the AMD video card)

That's 2.96GB usable in total, shared by all software. You have 9GB of RAM that you aren't using at all.

Physical memory is shared between all software. Virtual memory is separate for each program.

~ · Post by ~ » Tue Oct 29, 2019 8:30 am

Having that much memory and being able to develop most of an OS with 16-bit era graphics/hardware, VESA, DPMI, emulation (for 64-bit, etc.), old game and demo programs for DOS/Win16, is well worth it.

Maybe that's why Firefox and Chrome open so many instances. They know that they can use more than 4GB of real RAM if there is PAE and if they share that memory among many different processes (IPC, inter-process communication/inter-process call).

But all references say that Windows can enable PAE ("BCDEdit /set PAE ForceEnable" as Administrator).

And I've proven to myself that I can run many more things because I initially had only 2GB in this 6470b ProBook, and I almost always had a slowed down system, often halted except for the mouse and Alt+Tab, when trying to open new tabs and Explorer folders.

But after installing the 12GB of RAM from another identical machine that got broken, now I can open many more things without slowdown.

So 32-bit Windows indeed uses more than 4GB, you only have to enable PAE in the BOOT.INI (BCD Boot Configuration Data running "BCDEdit /set PAE ForceEnable" as administrator) options and elsewhere. It employs additional RAM for each new process, maybe all processes are limited to 4GB, but additional RAM can be used with PAE CPUs.

Look at my current memory usage as I write this. It's under 32-bit, with PAE enabled, and works dramatically better than with 2 or 4GB (Sometimes I reach overall usage of more than 6GB, so it was good to find out whether Windows would defragment itself for a 40MB multimedia game program or it would fail swapping out asking to free big existing blocks):

iansjack · Post by **iansjack** » Tue Oct 29, 2019 9:03 am

Your System Information say "2.96GB" useable.

~ · Post by ~ » Tue Oct 29, 2019 9:07 am

2.96GB per program, physical, each address space of programs/DLLs (for the GPU maybe, accounting it would be only 1 GB of video and 0.04GB of who knows what -probably the kernel core, would be like 40.96MB, what HL.EXE couldn't reserve-).

But also says that has more than 4GB to share between applications:

Converting Addresses to Physical Pages Bitmap Offsets

An address needs to be divided by 4096 to get the page number in the bitmap.

iansjack · Post by **iansjack** » Tue Oct 29, 2019 9:28 am

I'm not sure what posting a screenshot that shows that you can use 1.6GB proves. I think we can all agree that you should be able to use that much RAM.

OSDev.org

x86 32-bit paging management (Pillar Subject)

x86 32-bit paging management (Pillar Subject)

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management

Re: x86 32-bit paging management