Bochs PAE support

Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Bochs PAE support

Post by Brendan »

Hi,

After much investigation I've found that Bochs' PAE support does have a bug in it, but it does work.

Bochs uses a variable to "cache" the physical address contained in CR3, and this variable wasn't updated when CR4 was changed. This means that if your code loads CR3, then enables PAE, and then enables paging, Bochs will use "CR3 & 0xFFFFF000" instead of "CR3 & 0xFFFFFFE0", resulting in the wrong address being used for the PDPT.

To fix this you can enable PAE before setting CR3, or you can fix Bochs' code.
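
For example, here's a minimal sketch of the workaround order (a hypothetical helper using GCC inline assembly in 32-bit protected mode, before paging is enabled - the function name and argument are made up):

Code: Select all

/* Hypothetical workaround sketch: set CR4.PAE *before* loading CR3, so the
   cached CR3 mask is computed with PAE already enabled. */
static inline void enable_pae_paging(unsigned long pdpt_phys)
{
  unsigned long cr4, cr0;

  /* 1. Set CR4.PAE (bit 5) first. */
  asm volatile("mov %%cr4, %0" : "=r"(cr4));
  cr4 |= (1ul << 5);
  asm volatile("mov %0, %%cr4" : : "r"(cr4));

  /* 2. Now load CR3 with the 32 byte aligned PDPT address. */
  asm volatile("mov %0, %%cr3" : : "r"(pdpt_phys));

  /* 3. Finally set CR0.PG (bit 31) to enable paging. */
  asm volatile("mov %%cr0, %0" : "=r"(cr0));
  cr0 |= (1ul << 31);
  asm volatile("mov %0, %%cr0" : : "r"(cr0) : "memory");
}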

To fix Bochs' code find the "pagingCR4Changed()" function within the file "cpu/paging.cc" and change it to this:

Code: Select all

void BX_CPP_AttrRegparmN(2)
BX_CPU_C::pagingCR4Changed(Bit32u oldCR4, Bit32u newCR4)
{
  // Modification of PGE, PAE or PSE flushes the TLB cache according to docs.
  if ( (oldCR4 & 0x000000b0) != (newCR4 & 0x000000b0) )
    TLB_flush(1); // 1 = Flush Global entries also.

  if (bx_dbg.paging)
    BX_INFO(("pagingCR4Changed(0x%x -> 0x%x):", oldCR4, newCR4));

  // Fix: if PAE was toggled, recompute the cached CR3 mask so that the
  // PDPT address is taken from "CR3 & 0xFFFFFFE0" when PAE is enabled.
  if ( (oldCR4 & 0x00000020) != (newCR4 & 0x00000020) ) {
#if BX_SupportPAE
    if (BX_CPU_THIS_PTR cr4.get_PAE())
      BX_CPU_THIS_PTR cr3_masked = BX_CPU_THIS_PTR cr3 & 0xffffffe0;
    else
#endif
      BX_CPU_THIS_PTR cr3_masked = BX_CPU_THIS_PTR cr3 & 0xfffff000;
  }
}
Note: I have reported this to the Bochs bug tracker. I'm hoping it will be fixed when Bochs 2.2 is released :).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re:Bochs PAE support

Post by Candy »

:D

thanks! Now lemme get back to testing my PAE code on a new 2.2 build :)

PS: submitted it to [email protected] already?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:Bochs PAE support

Post by Brendan »

Hi,
Candy wrote: PS: submitted it to [email protected] already?
Nope, just the bug tracker.

Last time I dealt with cbothomy it wasn't too beneficial - I was offering full hyper-threading support but couldn't provide "diffs" (I was used to cut & paste, and unfamiliar with *nix tools like diff and CVS). I did have a *.zip containing all of the modified files though. He put a message on a patches message board, where it was ignored thereafter.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re:Bochs PAE support

Post by Candy »

Do you still have those HT files? I'd very much like to do that myself if necessary.

Also, you have experience with OS development on multiprocessor systems. Could you give a short overview of what went wrong and what the common trip-ups are? Also, did you do NUMA memory remapping?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:Bochs PAE support

Post by Brendan »

Hi,
Candy wrote:Do you still have those HT files? I'd very much like to do that myself if necessary.
I still have the original files, and will be "re-hacking" HT support into Bochs 2.2 Beta tomorrow. Part of the trouble is that to support HT you need 4 additional BIOS images, which can't be compiled with 32-bit GCC. I'll get it all up on a web page to make it easy to download (including the extra BIOS images, so you don't need to worry about BCC).
Candy wrote:Also, you have experience with OS development on multiprocessor systems. Could you give a short overview of what went wrong and what the common trip-ups are?
I didn't have too much go wrong, aside from typical bugs that you get implementing anything. I'd suggest splitting "multi-CPU" support into several different areas:

- CPU detection (number of CPUs and features)
- IO APIC and local APIC support
- Starting the other CPUs
- re-entrancy (including IRQ support)
- CPU affinity (which is optional)
- scheduling

I was fairly lucky in that the boot sequence and OS design I had been using were relatively conducive to multi-CPU. Some things that helped were detecting CPU features and starting other CPUs within the real mode code (which made the "trampoline" code easy) and using a micro-kernel design where device drivers run as separate processes (which minimizes re-entrancy concerns).
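
To give a rough idea of the "starting the other CPUs" part, here's a minimal sketch of the usual INIT-SIPI-SIPI sequence sent through the local APIC (LAPIC_BASE, mmio_write32(), delay_us() and start_ap() are hypothetical names, and real code needs the trampoline already copied to a page-aligned address below 1MB):

Code: Select all

/* Hypothetical sketch of waking an application processor with the standard
   INIT-SIPI-SIPI sequence via the memory mapped local APIC. */
#define LAPIC_BASE    0xFEE00000u
#define LAPIC_ICR_LO  0x300u
#define LAPIC_ICR_HI  0x310u

extern void delay_us(unsigned us);   /* hypothetical busy-wait */

static void mmio_write32(unsigned addr, unsigned value)
{
  *(volatile unsigned *)addr = value;
}

static void start_ap(unsigned apic_id, unsigned trampoline_phys)
{
  unsigned vector = (trampoline_phys >> 12) & 0xFF;

  /* INIT IPI to the target CPU. */
  mmio_write32(LAPIC_BASE + LAPIC_ICR_HI, apic_id << 24);
  mmio_write32(LAPIC_BASE + LAPIC_ICR_LO, 0x00004500);
  delay_us(10000);                   /* ~10 ms */

  /* Two STARTUP IPIs; the AP starts executing at "vector * 4KB". */
  for (int i = 0; i < 2; i++) {
    mmio_write32(LAPIC_BASE + LAPIC_ICR_HI, apic_id << 24);
    mmio_write32(LAPIC_BASE + LAPIC_ICR_LO, 0x00004600 | vector);
    delay_us(200);
  }
}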

Detecting the number of present CPUs meant parsing both the ACPI and MP specification tables. Detecting hyper-threading meant detecting CPU features at the same time (the MP specification tables return information about physical CPUs only, while the ACPI tables return logical CPUs only).
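
For instance, counting logical CPUs from the ACPI MADT (the table with the "APIC" signature) boils down to walking variable-length entries, where type 0 entries describe local APICs. A hedged sketch (the table layouts follow the ACPI spec, but find_acpi_table() is a hypothetical helper):

Code: Select all

#include <stdint.h>

/* ACPI system description table header (fixed 36 byte layout). */
struct acpi_header {
  char     signature[4];
  uint32_t length;
  uint8_t  revision, checksum;
  char     oem_id[6], oem_table_id[8];
  uint32_t oem_revision, creator_id, creator_revision;
} __attribute__((packed));

/* MADT entry type 0: processor local APIC. */
struct madt_lapic {
  uint8_t  type, length;
  uint8_t  acpi_processor_id, apic_id;
  uint32_t flags;                    /* bit 0 = processor enabled */
} __attribute__((packed));

extern struct acpi_header *find_acpi_table(const char *sig); /* hypothetical */

static unsigned count_logical_cpus(void)
{
  struct acpi_header *madt = find_acpi_table("APIC");
  if (madt == 0)
    return 1;                        /* fall back to the MP tables */

  /* Entries start after the header, the local APIC address and the flags. */
  uint8_t *p   = (uint8_t *)madt + sizeof(struct acpi_header) + 8;
  uint8_t *end = (uint8_t *)madt + madt->length;
  unsigned cpus = 0;

  while (p + 2 <= end) {
    struct madt_lapic *e = (struct madt_lapic *)p;
    if (e->type == 0 && (e->flags & 1))
      cpus++;                        /* enabled logical CPU */
    p += e->length;
  }
  return cpus;
}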

The scheduler design I had been using needed major reworking to make it perform better (to reduce re-entrancy lock contention). The old scheduler used to select the next thread to run from all available ready-to-run threads, while the multi-CPU version needed separate thread queues (one for each CPU). Placing a ready-to-run thread on one of the CPUs' queues also involves a certain amount of load balancing.
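
A very rough sketch of the "one ready queue per CPU" idea, with naive load balancing on enqueue (the thread type, list helpers and spinlock functions are hypothetical placeholders):

Code: Select all

/* Hypothetical per-CPU ready queues, with a naive "pick the least loaded
   allowed CPU" decision when a thread becomes ready to run. */
#define MAX_CPUS 16

struct thread;                       /* defined elsewhere */

struct run_queue {
  volatile int  lock;                /* spinlock */
  unsigned      count;               /* number of ready threads */
  struct thread *head, *tail;
};

static struct run_queue queues[MAX_CPUS];
static unsigned num_cpus;

extern void spin_lock(volatile int *lock);    /* hypothetical */
extern void spin_unlock(volatile int *lock);
extern void list_append(struct run_queue *q, struct thread *t);

static void make_ready(struct thread *t, unsigned affinity_mask)
{
  /* Pick the allowed CPU with the fewest ready threads (reading the counts
     without a lock is only a heuristic, which is good enough here). */
  unsigned best = 0, best_count = ~0u;
  for (unsigned i = 0; i < num_cpus; i++) {
    if ((affinity_mask & (1u << i)) && queues[i].count < best_count) {
      best = i;
      best_count = queues[i].count;
    }
  }

  /* Only that CPU's queue is locked, so other CPUs can keep scheduling. */
  spin_lock(&queues[best].lock);
  list_append(&queues[best], t);
  queues[best].count++;
  spin_unlock(&queues[best].lock);
}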

For my OS each device driver is implemented as a normal process, where IRQs are initially handled by the kernel and sent to the device driver as a message. This means that the device driver's code gets a message, handles it, gets another message, etc. I guess what I'm saying is that the messaging serializes everything for the device driver, so that the device driver itself doesn't need to worry about re-entrancy. This isn't quite true as I support multi-threaded processes, but it's the same problem that all multi-threaded processes have (and my OS makes this easier because there's "kernel space", "process space" and "thread space", where each thread has exclusive access to its thread space).
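
In other words, the driver side of that model is essentially a single loop that receives one message at a time, so IRQ handling is naturally serialized. A hedged sketch (get_message(), the message layout and the handle_*() functions are all hypothetical):

Code: Select all

/* Hypothetical device driver main loop: the kernel turns IRQs into messages,
   so the driver handles one message at a time and never re-enters itself. */
struct message {
  int type;                          /* e.g. MSG_IRQ, MSG_READ_REQUEST, ... */
  int data;
};

enum { MSG_IRQ = 1, MSG_READ_REQUEST = 2 };

extern void get_message(struct message *msg);  /* hypothetical kernel API */
extern void handle_irq(int irq);
extern void handle_read(int request);

static void driver_main(void)
{
  struct message msg;
  for (;;) {
    get_message(&msg);               /* blocks until a message arrives */
    switch (msg.type) {
      case MSG_IRQ:          handle_irq(msg.data);  break;
      case MSG_READ_REQUEST: handle_read(msg.data); break;
      default:               break;  /* ignore unknown messages */
    }
  }
}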

[continued in next post]
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:Bochs PAE support

Post by Brendan »

[continued from last post]

Then there's re-entrancy within the kernel, which isn't as easy as it sounds. The first problem is avoiding deadlocks, which I do with specially hacked re-entrancy locking code (the basic idea being that if a thread is trying to acquire a lock that it has already acquired, it generates a critical error). The second problem is trying to minimize lock contention. Here I chose to use extremely fine-grained locking, and to use spinlocks only. Spinlocks suck if they are used to lock lengthy operations, so I avoid lengthy operations within the kernel :).

This does mean that some things need careful consideration. For example, the code to spawn a new thread allocates what it needs before creating the thread rather than allocating things while creating the thread (the difference is several small locks used one at a time rather than one lock remaining acquired while other smaller locks are used). An example of very fine-grained locking would be my messaging code, where every message queue has its own lock, or my physical memory management, where there are up to 1024 free page stacks that each have their own lock.
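
A minimal sketch of the "scream if I already hold this lock" idea, using GCC atomic builtins (the spinlock layout, current_thread_id() and panic() are hypothetical, and real code also has to think about interrupts):

Code: Select all

/* Hypothetical spinlock that records its owner, so a recursive acquire can
   be turned into a critical error instead of a silent deadlock. */
struct spinlock {
  volatile int locked;               /* 0 = free, 1 = held */
  volatile int owner;                /* thread ID of the current holder, or -1 */
};

extern int  current_thread_id(void); /* hypothetical */
extern void panic(const char *msg);

static void spin_acquire(struct spinlock *l)
{
  int me = current_thread_id();

  if (l->locked && l->owner == me)
    panic("re-entrancy lock acquired twice by the same thread");

  while (__sync_lock_test_and_set(&l->locked, 1))
    ;                                /* spin until the lock is released */
  l->owner = me;
}

static void spin_release(struct spinlock *l)
{
  l->owner = -1;
  __sync_lock_release(&l->locked);
}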

Multi-CPU (where CPUs can be different, rather than SMP) complicates things a bit too. CPU affinity solves half of it though (when a thread is spawned its CPU affinity is set so that it can only run on CPUs that support the features that the thread needs). Part of the trouble is that I do "time accounting", where the OS keeps track of how much time each thread has used. This requires CPU time to be "weighted". For example, if CPU A is twice as fast as CPU B, then a thread that uses 50% of CPU A's time would consume twice as much CPU time as a thread that consumes 50% of CPU B's time.
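
The weighting can be as simple as multiplying raw ticks by a per-CPU speed rating before charging them to the thread. A tiny sketch (the rating field and the scale of 100 are hypothetical choices):

Code: Select all

/* Hypothetical weighted time accounting: each CPU gets a speed rating
   relative to the slowest CPU (slowest = 100, one twice as fast = 200). */
struct cpu_speed {
  unsigned rating;
};

static unsigned long long weighted_ticks(unsigned long long raw_ticks,
                                         const struct cpu_speed *cpu)
{
  /* A thread using 50% of a CPU rated 200 is charged twice as much time
     as a thread using 50% of a CPU rated 100. */
  return (raw_ticks * cpu->rating) / 100;
}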

I also use "CPU domains" and "CPU sub-domains" to keep track of the difference between NUMA, and hyper-threading and/or multi-core CPUs. Each CPU domain corresponds directly to each NUMA memory domain, while sub-domains correspond to physical CPUs/chips. A computer with 4 NUMA domains and 8 dual core chips would be represented as 4 domains with 8 sub-domains (2 sub-domains per domain), where each sub-domain has 2 logical CPUs (8 logical CPUs per domain and 16 logical CPUs total). For e.g.:

[tt]|__Domain 0
|  |__Sub-domain 0.0
|  |  |__CPU #0
|  |  |__CPU #1
|  |__Sub-domain 0.1
|     |__CPU #2
|     |__CPU #3
|__Domain 1
|  |__Sub-domain 1.0
|  |  |__CPU #4
|  |  |__CPU #5
|  |__Sub-domain 1.1
|     |__CPU #6
|     |__CPU #7
|__Domain 2
|  |__Sub-domain 2.0
|  |  |__CPU #8
|  |  |__CPU #9
|  |__Sub-domain 2.1
|     |__CPU #10
|     |__CPU #11
|__Domain 3
   |__Sub-domain 3.0
   |  |__CPU #12
   |  |__CPU #13
   |__Sub-domain 3.1
      |__CPU #14
      |__CPU #15[/tt]

The main idea behind this is to improve performance, as data in a physical CPU's cache can be used by any logical CPU within that chip, and any CPU within a domain (regardless of sub-domain) uses the same NUMA memory range for memory allocations.
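
The topology above maps fairly directly onto a few small structures, something like this (all of these names and limits are hypothetical):

Code: Select all

/* Hypothetical representation of the domain / sub-domain / logical CPU
   hierarchy described above. */
#define MAX_CPUS        256
#define MAX_SUBDOMAINS  64
#define MAX_DOMAINS     16

struct logical_cpu {
  unsigned apic_id;
  unsigned sub_domain;               /* index into sub_domains[] */
};

struct sub_domain {                  /* one physical chip */
  unsigned domain;                   /* index into domains[] */
  unsigned first_cpu, cpu_count;     /* logical CPUs sharing this chip's cache */
};

struct domain {                      /* one NUMA memory domain */
  unsigned long long mem_base, mem_size;   /* "close" physical memory range */
  unsigned first_sub_domain, sub_domain_count;
};

static struct logical_cpu cpus[MAX_CPUS];
static struct sub_domain  sub_domains[MAX_SUBDOMAINS];
static struct domain      domains[MAX_DOMAINS];

/* The 4 domain, 8 chip, 16 CPU example above would fill in 4 domains,
   2 sub-domains per domain and 2 logical CPUs per sub-domain. */
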
Candy wrote:Also, did you do NUMA memory remapping?
On a NUMA computer my OS will have multiple copies of the kernel's code in physical memory (one copy for each NUMA domain), so that each CPU can use kernel code that is in memory that is "close" to it. This means dynamically choosing where each copy of the kernel resides in physical memory during boot (not much of a problem for me as the kernel is started after paging is enabled anyway). NUMA also affects CPU affinity, as in my OS all threads owned by a process are restricted to CPUs that share the same NUMA domain. I'd recommend that during CPU feature detection you store a set of flags for the features of each CPU (after bug corrections), and then "logically AND" each CPU's feature flags together to get a set of flags representing the features common to all CPUs - it makes kernel code easier.
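
That "common features" suggestion is just a bitwise AND over the per-CPU flags gathered during detection. A hedged sketch (the flag storage and names are hypothetical):

Code: Select all

/* Hypothetical "features common to all CPUs" calculation: AND together the
   (bug-corrected) feature flags detected for each CPU. */
#define MAX_CPUS 16

static unsigned cpu_feature_flags[MAX_CPUS];   /* filled in during detection */
static unsigned num_cpus;
static unsigned common_features;

static void compute_common_features(void)
{
  common_features = ~0u;                       /* start with "everything" */
  for (unsigned i = 0; i < num_cpus; i++)
    common_features &= cpu_feature_flags[i];   /* drop anything a CPU lacks */
}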


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.