Physical Memory - Proper Maximum Limit
Posted: Sun Jun 25, 2017 4:40 pm
by azblue
I was reading
this article and a couple quotes caught my eye:
The maximum 2TB limit of 64-bit Windows Server 2008 Datacenter doesn’t come from any implementation or hardware limitation, but Microsoft will only support configurations they can test.
My first question, and the one referenced in the thread's title, is: Is this reasonable? I understand things get a little weird in the 3GB-4GB range, but beyond that is there any reason not to support as much physical memory as the CPU will allow?
...the Windows team started broadly testing Windows XP on systems with more than 4GB of memory. Windows XP SP2 also enabled Physical Address Extensions (PAE) support by default on hardware that implements no-execute memory because it's required for Data Execution Prevention (DEP), but that also enables support for more than 4GB of memory.
What they found was that many of the systems would crash, hang, or become unbootable because some device drivers, commonly those for video and audio devices that are found typically on clients but not servers, were not programmed to expect physical addresses larger than 4GB. As a result, the drivers truncated such addresses, resulting in memory corruptions and corruption side effects.
I'm really lost here. Doesn't the OS get to dictate where in the address space devices are mapped to? If it maps them to >4GB, even drivers only keeping the lower 32 bits of an address should be fine, right?
Re: Physical Memory - Proper Maximum Limit
Posted: Sun Jun 25, 2017 5:09 pm
by Geri
1. First, this is a licensing limitation: if you need Windows for your very own supercomputer, you pay Microsoft a few hundred thousand USD for it.
2. Second, current CPUs running today's algorithms don't have the performance to effectively find holes in multiple terabytes of RAM. You can read RAM at 10-30 GB/s on a single thread with the CPU. This is not an issue, since as RAM sizes increase, CPU speeds also increase - but it is an issue with current designs.
Re: Physical Memory - Proper Maximum Limit
Posted: Sun Jun 25, 2017 5:32 pm
by Octocontrabass
azblue wrote:My first question, and the one referenced in the thread's title, is: Is this reasonable? I understand things get a little weird in the 3GB-4GB range, but beyond that is there any reason not to support as much physical memory as the CPU will allow?
Microsoft has to support the OS. It's reasonable to expect that they would like to be able to test it themselves before claiming that it will work, especially when third-party drivers are involved.
azblue wrote:I'm really lost here. Doesn't the OS get to dictate where in the address space devices are mapped to? If it maps them to >4GB, even drivers only keeping the lower 32 bits of an address should be fine, right?
The problem is that drivers need to use physical addresses for things like DMA. When the RAM being used for DMA comes from a physical address above 4GB but the driver tells the hardware to use a physical address below 4GB, bad things happen.
Re: Physical Memory - Proper Maximum Limit
Posted: Sun Jun 25, 2017 8:30 pm
by BrightLight
azblue wrote:I'm really lost here. Doesn't the OS get to dictate where in the address space devices are mapped to? If it maps them to >4GB, even drivers only keeping the lower 32 bits of an address should be fine, right?
The device drivers mainly communicate with the device itself using memory-mapped I/O and DMA, both of which use physical addresses, not linear addresses. With PAE, the linear addresses are indeed only 32 bits, but the physical addresses are normally 36 bits. So even if the memory is mapped at, for example, 3 GB, or any other address in the 32-bit linear address space, the actual physical address of that page may be 6 GB, or any other address in the 36-bit physical address space. A driver that is not aware of the existence of physical addresses larger than 32 bits may truncate a 36-bit address, taking only the lowest 32 bits. For example, a buffer at physical address 4 GB (0x100000000) would be sent to the device as address zero by a driver that is not aware of PAE. My instincts tell me that the device would not work properly that way.
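To make the truncation concrete, here is a minimal sketch of the bug (the 32-bit device register is hypothetical, not any real hardware):

Code:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 32-bit DMA address register of some device. */
static volatile uint32_t dma_addr_reg;

/* A driver written before PAE: it assumes physical addresses fit in 32 bits. */
static void old_driver_program_dma(uint32_t phys)
{
    dma_addr_reg = phys;
}

int main(void)
{
    /* Buffer that the OS happened to place at the 4 GB boundary. */
    uint64_t buffer_phys = 0x100000000ULL;

    /* Implicit truncation: the upper 32 bits are silently dropped,
       so the device is told to DMA to physical address 0. */
    old_driver_program_dma((uint32_t)buffer_phys);

    printf("device was programmed with address 0x%08x\n",
           (unsigned)dma_addr_reg);
    return 0;
}

The device then reads or writes whatever happens to be at the truncated address, which is exactly the kind of memory corruption the article describes.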
In general, PAE caused compatibility issues with existing software as you have read, and thus to maintain compatibility, IMHO 32-bit kernels should stick to i386 paging, while 64-bit kernels are the ones that should make use of all available RAM as well as the CPU's full physical address width. You can use CPUID to detect how many bits wide the CPU's physical address bus is.
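For instance, something along these lines (this sketch assumes GCC/Clang and their cpuid.h helper; the field layout of leaf 0x80000008 is from the CPUID documentation):

Code:

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang intrinsic wrapper */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 0x80000008: EAX[7:0]  = physical address width in bits,
                              EAX[15:8] = linear (virtual) address width.  */
    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
        printf("physical address width: %u bits\n", eax & 0xFF);
        printf("linear address width:   %u bits\n", (eax >> 8) & 0xFF);
    } else {
        /* Very old CPU: the extended leaf is not available. */
        puts("CPUID leaf 0x80000008 not supported");
    }
    return 0;
}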
Another workaround that lets you use PAE yet avoid problems with devices that only support 32-bit addressing is to provide functions that let device drivers allocate memory with specific attributes. That is, instead of a traditional malloc, you can implement a similar function that allows device drivers to tell the kernel things like "give me 32 MB of uncacheable, physically contiguous memory in the range of 2 GB to 3 GB, aligned on a 64-byte boundary", and other similar functionality.
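A possible shape for such an interface, purely as a sketch (all names and the constraint structure below are made up for illustration, and the allocator body is a stub):

Code:

#include <stdint.h>
#include <stddef.h>

/* Constraints a driver can pass to the physical allocator. */
struct dma_constraints {
    uint64_t min_phys;     /* lowest acceptable physical address      */
    uint64_t max_phys;     /* highest acceptable physical address     */
    size_t   alignment;    /* required alignment in bytes, e.g. 64    */
    int      contiguous;   /* nonzero = must be physically contiguous */
    int      cacheable;    /* zero = map the memory uncacheable       */
};

/* Stub standing in for the real kernel allocator, which would search its
   free-page structures while honoring the constraints. */
static void *kmem_alloc_constrained(size_t size, const struct dma_constraints *c,
                                    uint64_t *phys_out)
{
    (void)size; (void)c; (void)phys_out;
    return NULL;   /* placeholder */
}

/* Example: a driver for a device that can only address 32 bits. */
static void *alloc_32bit_dma_buffer(size_t size, uint64_t *phys_out)
{
    struct dma_constraints c = {
        .min_phys   = 0,
        .max_phys   = 0xFFFFFFFFULL,   /* stay below 4 GB */
        .alignment  = 64,
        .contiguous = 1,
        .cacheable  = 0,
    };
    return kmem_alloc_constrained(size, &c, phys_out);
}

int main(void)
{
    uint64_t phys;
    void *buf = alloc_32bit_dma_buffer(32u << 20, &phys);  /* 32 MB request */
    return buf == NULL;   /* stub always "fails"; a real kernel would not */
}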
I'm open to other suggestions, if people see flaws or have better solutions.
Re: Physical Memory - Proper Maximum Limit
Posted: Sun Jun 25, 2017 8:48 pm
by simeonz
AFAIK, Windows used to have a first-fit physical page allocator implemented with a memory bitmap. While this is feasible with the officially supported amounts of RAM and modern processors, it is not very scalable. As a first-fit allocator, it can at least gracefully recover from fragmentation over time by pushing free memory towards the end of the address space (which increases the probability of contiguous free chunks).
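For reference, a first-fit scan over a page bitmap looks roughly like this (a generic sketch of the technique, not actual Windows code):

Code:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define TOTAL_PAGES (1u << 20)                /* e.g. 4 GB of 4 KB pages   */
static uint8_t page_bitmap[TOTAL_PAGES / 8];  /* 1 bit per page, 1 = used  */

static int  page_used(size_t i) { return page_bitmap[i / 8] & (1u << (i % 8)); }
static void page_set(size_t i)  { page_bitmap[i / 8] |= (uint8_t)(1u << (i % 8)); }

/* First fit: return the index of the first run of `count` free pages, or -1. */
static long alloc_pages_first_fit(size_t count)
{
    size_t run = 0;
    for (size_t i = 0; i < TOTAL_PAGES; i++) {
        run = page_used(i) ? 0 : run + 1;
        if (run == count) {
            size_t start = i + 1 - count;
            for (size_t j = start; j <= i; j++)
                page_set(j);               /* mark the run as allocated */
            return (long)start;
        }
    }
    return -1;                             /* no contiguous run large enough */
}

int main(void)
{
    printf("first 4-page run at page %ld\n", alloc_pages_first_fit(4));
    printf("next  4-page run at page %ld\n", alloc_pages_first_fit(4));
    return 0;
}

For 2 TB of RAM the bitmap itself is only 64 MB, but a worst-case scan walks most of it, which is the scalability problem mentioned above.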
Edit: Most sources indicate that Windows uses bitmaps for its virtual memory management, not its physical memory management. Although physical memory could be occupied by process pages and the non-paged pool, if RAM is significantly over-provisioned it could still be used for filesystem caching. Considering that Windows keeps free and zeroed pages in pfn lists, and that physical memory fragmentation is not that much of an issue for Windows specifically, since it doesn't use direct mapping like Linux, my original statement above could be assumed incorrect.
I am not sure whether physical memory had similar issues, but the Windows virtual memory layout was at one point limited due to bit-packing in pointers, in connection with lock-free algorithms. Maybe particular drivers were doing something similar with physical memory descriptors, difficult as that may be to imagine.
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 1:47 am
by ~
Many probably haven't thought about what I am about to say, except for the few who have actually managed to implement proper paged memory managers:
Paging is largely intended to prevent memory fragmentation after many loaded programs, memory allocations and deallocations, block resizes, and so on. You might think that paging will help you with physical fragmentation, and it will, but then you get fragmentation inside the virtual address space instead, and you need an increasingly good memory management implementation, one strong enough to plan ahead of time how to lay out and treat memory blocks in order to deal with that fragmentation.
Higher-level languages might take care of this in their library functions and specifications, but languages like C and assembly under an OS would reasonably need different algorithms to link together the chunks of data belonging to one and the same object, to cope with memory fragmentation at the application level. The kernel needs to do the same, so managing memory fragmentation is a persistent task at all levels, and always will be in a rigid memory system - in other words, in all currently existing digital electronics systems.
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 3:53 am
by bluemoon
You need a structure to hold the list of free pages. Even with the simplest "page stack", i.e. 8 bytes per page, you need 2TB/4KB*8 = 4GB just to hold the page stack. This poses a restriction on which logical address region you allocate for this data structure, and for those of us who just put everything in the last 2GB zone, this surely breaks.
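Spelled out as a quick sanity check (not kernel code, just the arithmetic):

Code:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t ram_bytes  = 2ULL << 40;             /* 2 TB of physical RAM      */
    uint64_t page_size  = 4096;                   /* 4 KB pages                */
    uint64_t entry_size = 8;                      /* one 8-byte entry per page */

    uint64_t pages      = ram_bytes / page_size;
    uint64_t stack_size = pages * entry_size;

    printf("pages: %llu, page stack: %llu GiB\n",
           (unsigned long long)pages,
           (unsigned long long)(stack_size >> 30));  /* prints 4 GiB */
    return 0;
}

A 4 GB bookkeeping structure obviously cannot live inside a 2 GB kernel region, so either the layout or the data structure (e.g. a bitmap at 1 bit per page, which is 64 MB for 2 TB) has to change.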
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 4:04 am
by simeonz
~ wrote:Paging is largely intended to prevent memory fragmentation after many loaded programs, memory allocations and deallocations, block resizes, and so on. You might think that paging will help you with physical fragmentation, and it will, but then you get fragmentation inside the virtual address space instead, and you need an increasingly good memory management implementation, one strong enough to plan ahead of time how to lay out and treat memory blocks in order to deal with that fragmentation.
I am not even sure that there is real benefit to keeping data contiguous in physical memory, aside from using large pages where possible to reduce TLB thrashing. This even made me double-check my previous post, where I stated that Windows uses a first-fit physical memory allocator, but that makes no sense, because memory is mapped on demand on Windows. In contrast, Linux uses direct mapping of physical memory, which makes virtual and physical fragmentation coincide. After enough I/O, which is cached at page granularity, the primary kernel allocator gets fragmented. And even though the buddy system helps a bit, big multi-page requests are prone to failure. This is one of the reasons why Linux thread stacks are small - making them big would be a hazard for thread creation. What I am trying to say is that virtual memory fragmentation can indeed have important design implications.
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 6:21 am
by zaval
simeonz wrote:
Edit: Most sources indicate that Windows uses bitmaps for its virtual memory management, not its physical memory management. Although physical memory could be occupied by process pages and the non-paged pool, if RAM is significantly over-provisioned it could still be used for filesystem caching. Considering that Windows keeps free and zeroed pages in pfn lists, and that physical memory fragmentation is not that much of an issue for Windows specifically, since it doesn't use direct mapping like Linux, my original statement above could be assumed incorrect.
I am still plain dumb regarding MM, but I am not sure Windows uses bitmaps for VM allocations.
Windows Internals, Chapter 7: Memory Management, page 449 wrote:
With the lazy-evaluation algorithm, allocating even large blocks of memory is a fast operation. When a thread allocates memory, the memory manager must respond with a range of addresses for the thread to use. To do this, the memory manager maintains another set of data structures to keep track of which virtual addresses have been reserved in the process's address space and which have not. These data structures are known as virtual address descriptors (VADs). For each process, the memory manager maintains a set of VADs that describes the status of the process's address space. VADs are structured as a self-balancing binary tree to make lookups efficient. Windows Server 2003 implements an AVL tree algorithm (named after their inventors, Adelson-Velskii and Landis) that better balances the VAD tree, resulting in, on average, fewer comparisons when searching for a VAD corresponding with a virtual address. A diagram of a VAD tree is shown in Figure 7-28.
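The lookup the book describes boils down to an ordinary range search over a binary tree, roughly like this (a simplified illustration, not the actual Windows structures, and with the AVL balancing left out):

Code:

#include <stdint.h>
#include <stddef.h>

/* Simplified VAD-like node: one reserved range of the address space. */
struct vad {
    uintptr_t   start;   /* first address of the reserved range */
    uintptr_t   end;     /* last address of the reserved range  */
    struct vad *left;    /* ranges entirely below this one      */
    struct vad *right;   /* ranges entirely above this one      */
};

/* Find the descriptor covering `addr`, or NULL if the address is not reserved.
   With a balanced tree this takes O(log n) comparisons. */
static struct vad *vad_lookup(struct vad *root, uintptr_t addr)
{
    while (root) {
        if (addr < root->start)
            root = root->left;
        else if (addr > root->end)
            root = root->right;
        else
            return root;                  /* addr falls inside this range */
    }
    return NULL;
}

int main(void)
{
    struct vad low  = { 0x10000, 0x1FFFF, NULL, NULL };
    struct vad high = { 0x40000, 0x7FFFF, NULL, NULL };
    struct vad root = { 0x20000, 0x3FFFF, &low, &high };

    return vad_lookup(&root, 0x45000) == &high ? 0 : 1;
}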
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 1:52 pm
by LtG
As others have said, MS actually has to _support_ their customers (to varying degrees). If they don't have access to a machine with more than 2TB of RAM, then they couldn't even have tested their software on a single such machine. The reality is that you will quite likely introduce limits into your code (as people have done by using uint32_t, as they did when they used two digits to encode the year, causing Y2K, and as they did with the Linux epoch stuff, causing Y2K38).
So I would say that the only sane thing is to actually test it at least a little, and getting 2TB machines is probably a bit difficult - I'm not sure what kind of servers are available these days.
If drivers lived fully in the virtual address space, and thus had to request DMA (and other such things) from Windows, then this wouldn't be much of an issue driver-wise. However, if drivers deal with "raw" physical addresses and internally use uint32_t (or any pointer type, but compiled as 32-bit code, so the pointers are 32 bits), then they will truncate larger addresses and of course cause weird behaviour until they crash.
As for omarrx024 asking for other suggestions, I'm not sure, but possibly utilizing an IOMMU might be an alternative. I don't know nearly enough of Windows's internals to provide anything more useful =)
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 1:57 pm
by simeonz
zaval wrote:I am still plain dumb regarding MM, but I am not sure Windows uses bitmaps for VM allocations.
I meant that bitmaps are used in kernel virtual memory management, so I wasn't precise. You are right - VADs are used for user-space virtual memory management. The bitmaps for the non-paged and paged pool allocators are mentioned on
CodeMachine. (You can search for "bitmap".) There is also mention in this MS
paper. (Search for "bitmap" again if interested.)
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 1:58 pm
by LtG
Geri wrote:Current CPUs running today's algorithms don't have the performance to effectively find holes in multiple terabytes of RAM. You can read RAM at 10-30 GB/s on a single thread with the CPU. This is not an issue, since as RAM sizes increase, CPU speeds also increase - but it is an issue with current designs.
Find what holes? Why do you want to find holes?
You can "easily" read RAM at 120GB/s (four NUMA domains). Not sure how threads are involved here..?
Also, I thought CPUs weren't increasing in speed, or rather they haven't been in terms of Hz for quite some time. Some performance increases will still come, but the expectation is that future gains will come largely through more cores, not single-core performance. Though even that's somewhat irrelevant: it's not the CPU speed that matters, it's the RAM speed that's the bottleneck.
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 2:52 pm
by zaval
simeonz wrote:
I meant that bitmaps are used in kernel virtual memory management, so I wasn't precise. You are right - VADs are used for user-space virtual memory management. The bitmaps for the non-paged and paged pool allocators are mentioned on
CodeMachine. (You can search for "bitmap".) There is also mention in this MS
paper. (Search for "bitmap" again if interested.)
ah. thanks for the links. it's interesting.
I just remembered that I read in the WDK documentation that the buddy mechanism is used for the pools. Yes, I found it; it's in the WDK Glossary -> pool memory.
The memory manager allocates entities from both pools using a buddy scheme.
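If I understand it correctly, the core trick of a buddy scheme is that a block's buddy is found by flipping a single bit of its offset, which is what makes splitting and coalescing cheap. A rough sketch of that arithmetic (generic, not the actual pool code):

Code:

#include <stdint.h>
#include <stdio.h>

/* Offset of the buddy of the block at `offset` with power-of-two `size`,
   both relative to the start of the managed region. */
static uint64_t buddy_of(uint64_t offset, uint64_t size)
{
    return offset ^ size;
}

int main(void)
{
    /* The 4 KB block at offset 0x3000 pairs with the one at 0x2000;
       when both are free they merge into an 8 KB block at 0x2000.  */
    printf("buddy of 0x3000 (4 KB blocks): 0x%llx\n",
           (unsigned long long)buddy_of(0x3000, 0x1000));
    printf("buddy of 0x2000 (8 KB blocks): 0x%llx\n",
           (unsigned long long)buddy_of(0x2000, 0x2000));
    return 0;
}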
Re: Physical Memory - Proper Maximum Limit
Posted: Mon Jun 26, 2017 5:56 pm
by Geri
LtG wrote:
You can "easily" read RAM at 120GB/s (four NUMA domains). Not sure how threads are involved here..?
You are wrong, and you acknowledged it yourself when you put "easily" in quotes.
You will never get more than a few gigabytes per second per thread in a real algorithm, and that's a fact, not a matter of opinion.
LtG wrote:
Also, I thought CPUs weren't increasing in speed, or rather they haven't been in terms of Hz for quite some time.
Nowadays we have superscalar CPUs capable of executing multiple instructions per clock cycle. Twenty years ago CPUs averaged about 1.5 instructions per cycle; ten years ago about 2.5; now we are at 4-5, and the theoretical limit of current architectures is 3 for ARM, 6 for AMD and 8 for Intel, since that is how many execution units they have per core (although some of them can only do one specific thing, so we will never fully reach these numbers in reality).
Re: Physical Memory - Proper Maximum Limit
Posted: Tue Jun 27, 2017 12:47 am
by LtG
Geri wrote:LtG wrote:
You can "easily" read RAM at 120GB/s (four NUMA domains). Not sure how threads are involved here..?
You are wrong, and you acknowledged it yourself when you put "easily" in quotes.
You will never get more than a few gigabytes per second per thread in a real algorithm, and that's a fact, not a matter of opinion.
LtG wrote:
Also, I thought CPUs weren't increasing in speed, or rather they haven't been in terms of Hz for quite some time.
Nowadays we have superscalar CPUs capable of executing multiple instructions per clock cycle. Twenty years ago CPUs averaged about 1.5 instructions per cycle; ten years ago about 2.5; now we are at 4-5, and the theoretical limit of current architectures is 3 for ARM, 6 for AMD and 8 for Intel, since that is how many execution units they have per core (although some of them can only do one specific thing, so we will never fully reach these numbers in reality).
What are the holes you were talking about?
We were talking about _READING_ RAM (memory bandwidth), not some imaginary "real" algorithm, and the simplest algorithms can probably get up to 30GB/s (per core).
"you can read RAM with 10-30 gbyte/sec on a thread with cpu. this is not an issue, as with ram size incrases, cpu speeds are also incrasing"
I was referring to this earlier; it sounds like you're saying that today's CPUs can read RAM at 10-30GB/s, and that it's not an issue because when RAM sizes increase, CPU speed will increase too.
The issue I have with the above is that the 10-30GB/s is not a limit of the CPU, it's the RAM bandwidth. The CPU can easily consume more, which is why cache is these days the largest single "component" on a CPU: it's the RAM that's too slow, not the CPU.