Multi-CPU & 32/64 bits

Kemp

Multi-CPU & 32/64 bits

Post by Kemp »

Just a quick question about multi-processor systems. I know you have to specifically set up all the CPUs to run (at least if I read right), but is this required? For instance, if my OS is designed to run on a uni-CPU system and thus doesn't set up the extra processors, will it still run normally on the first CPU? Also, is it different across different types of multi-CPU systems (SMP, hyper-threading, etc)?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:Multi-CPU question

Post by Brendan »

Hi,
Kemp wrote: Just a quick question about multi-processor systems. I know you have to specifically set up all the CPUs to run (at least if I read right), but is this required? For instance, if my OS is designed to run on a uni-CPU system and thus doesn't set up the extra processors, will it still run normally on the first CPU? Also, is it different across different types of multi-CPU systems (SMP, hyper-threading, etc)?
Any multi-CPU computer (with any type or combination of CPUs) will happily run any single-CPU OS without any problems (you just don't get the extra CPU power from the other CPUs - they remain in a "halt" state).

Starting the additional CPUs is relatively easy, and it's the same for all forms of multi-CPU. Detecting them does differ.

For dual/multi-core CPUs and the more traditional multi-chip/socket systems, there are two methods of detection - the MP Specification tables (which have been around since the 80486) and the (newer) ACPI tables. Both are relatively straightforward (although the ACPI standard is hard to read because it covers a lot of unrelated stuff). For hyper-threading, you can use the ACPI tables or CPUID.

The hard part of multi-CPU support has more to do with IO APICs, local APICs, IPIs, scheduling and re-entrancy/synchronization (which is all the same regardless of the type or combination of CPUs).

The only differences in any of that is with hyper-threading, where it's recommended (but not necessary) to schedule threads on physical CPUs before scheduling threads on logical CPUs, as the performance of one logical CPU depends on the work the other logical CPU is doing.

Another thing that can have an effect is NUMA (which is what you end up with if you support multi-chip/socket for AMD CPUs). NUMA has more to do with memory management, but also affects "CPU affinity". The idea here is to make sure threads are always run on the same CPU (or a CPU in the same memory domain, if you consider dual-core chips) and to allocate physical memory accordingly. This isn't strictly necessary either, but does improve performance.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Kemp

Re:Multi-CPU question

Post by Kemp »

Ok thanks, that's cleared up a lot of stuff I had floating round in my head. At the moment my OS will only support single-CPU systems (or multi-CPU systems with only one CPU running) as that is all I have for testing right now, though I'm hoping to put together a dual-core AMD system soon but there's no real ETA on that.

Now back to trying to force my boot sector to work, I almost wish I hadn't put development of the actual OS on hold until I had that nailed. Though I suppose it has given me (a lot of) time to refine how it's going to work before I start actual coding.

Re:Multi-CPU question

Post by Brendan »

Hi,
Kemp wrote:At the moment my OS will only support single-CPU systems (or multi-CPU systems with only one CPU running) as that is all I have for testing right now, though I'm hoping to put together a dual-core AMD system soon but there's no real ETA on that.
Often adding multi-CPU support to an existing OS is harder than starting again from scratch. In the hope of making it easier for you (and others) to avoid a full rewrite, I'll describe some of the pitfalls...

First, when the AP CPUs are started they start in real mode, so if your OS is booted from GRUB (i.e. 32-bit code running above 1 MB) this can require a work-around. Typically a small piece of "trampoline" code is copied or loaded into memory below 1 MB, which prepares the AP CPUs (e.g. switches them to protected mode and makes them jump to the 32-bit AP entry point). For assembly, "incbin" can be an easy way to get the trampoline code into the 32-bit binary, ready for copying below 1 MB. If you're writing in C it's a bigger problem - perhaps get GRUB to load the trampoline code as a module. If you don't use GRUB it's easy to build this trampoline code directly into the OS's real-mode boot/initialization code.

Next, during boot there are things which should happen in the same order on all CPUs. Typically the BSP is used for configuring everything and the AP CPUs are synchronized with it. An example here would be the GDT and IDT - the BSP would create them and then tell the AP CPUs that they're ready for use. Synchronization is used to prevent the AP CPUs from loading the GDT and IDT before they are ready. Other things that need to be synchronized can include memory management, caching (MTRR changes), initializing the scheduler/s, etc. Also, if you've got a multi-stage boot you'd use it to prevent the AP CPUs from trying to use code that hasn't been loaded into memory yet.

For this it's best to split the boot/initialization code into sections, and to have a checkpoint between sections. For example:

Code: Select all

BSP_init_code:
   call create_GDT
   call create_IDT
   call load_GDT_and_IDT

   mov dword [CPU_count],1
   mov dword [boot_level],1
   mov eax,[total_CPUs]
.wait1:
   cmp [CPU_count],eax
   jb .wait1

   call create_paging_tables
   call enable_paging

   mov dword [CPU_count],1
   mov dword [boot_level],2
   mov eax,[total_CPUs]
.wait2:
   cmp [CPU_count],eax
   jb .wait2
And for the AP CPUs:

Code: Select all

AP_init_code:

;Wait for GDT and IDT to become ready

.wait1:
   cmp dword [boot_level],1
   jb .wait1
   lock inc dword [CPU_count]

   call load_GDT_and_IDT

;Wait for paging data to become ready

.wait2:
   cmp dword [boot_level],2
   jb .wait2
   lock inc dword [CPU_count]

   call enable_paging
Where you need these synchronization checkpoints is going to depend on your OS. You might get by with only a few of them, but you might end up with lots of them. Something like changing the MTRRs and PAT table entries will take 2 checkpoints on its own, as all CPUs should always be using the same caching.

[continued next post]

Re:Multi-CPU question

Post by Brendan »

[continued from previous post]

You must not rely on the speed of the CPUs for anything. For example, you couldn't use delay loops instead of the checkpoints above and assume that CPU X will be ready by the time CPU Y completes the delay loop. The reasons for this are:

a) for hyper-threading, the speed of one logical CPU depends on what the other logical CPU/s are doing.
b) for separate chips, it's possible for one CPU to be using "thermal throttling", where it's running at a reduced rate because it got too hot.
c) it's technically possible (but very rare) for different CPUs to be present. For example, a 33 MHz 80486 and a 66 MHz 80486 can work together, as the 66 MHz CPU does internal clock doubling (and uses 33 MHz externally). No OS I know of supports this (except mine), but considering (a) and (b) there isn't any real reason not to support it.

This brings me to CPU features. It's also possible for the CPUs to support slightly different features. An example here would be two Pentium CPUs where one supports MMX and the other doesn't. A common response to this is to get the feature flags from CPUID, logically AND them together and store the result. Most OSes expect all CPUs to be entirely identical, but Intel's MP Specification suggests this approach in any case.

Then there's IDT initialization. Most single-CPU OSes will use static data for this, but this approach fails when you start using IO APICs (for single-CPU or multi-CPU). I'd strongly suggest that you have "IRQ numbers" and "interrupt numbers" with no direct relationship between them. With PIC chips it's often "IRQ number + K = interrupt number", but for IO APICs an IRQ's priority depends on the highest 4 bits of its interrupt number, and higher interrupt numbers have higher priority. This makes it the reverse of PICs - IRQ 0 could be interrupt 0xF0 while IRQ 15 could be interrupt 0x28.

Then there's re-entrancy protection. In general, any memory location that may be modified by one CPU while being read by another CPU must be protected by a lock. Disabling interrupts doesn't work, as it doesn't affect other CPUs. Using re-entrancy locks introduces the problems of deadlocks (where the same CPU tries to acquire the same lock twice) and lock contention (a performance problem where other CPUs spend ages waiting for a lock to be freed by another CPU). Getting the re-entrancy locks right is where the majority of problems are. I've heard people suggest that a single-CPU OS should use a macro that disables interrupts, so that the macro can be rewritten later. To be honest it's a load of BS, and doesn't work in practice. The only good thing about this idea is that you can remove the macro and get a pile of errors (a list of where you need changes). In general, replacing the "interrupt disable" macro with a single re-entrancy lock macro causes deadlocks and high lock contention.

A specific area of concern for deadlocks is any code that may be used by an IRQ handler. In this case you can't just wait for the lock to become free (the same CPU may have already acquired the lock). Often OS developers prevent this by disabling interrupts while any lock is held, which causes high interrupt latency instead. There's no "perfect" solution to this. My OS uses 2 different types of locks: one which doesn't affect IRQs, and another that causes any received IRQs to be placed into a buffer and handled when the lock is freed. This still affects interrupt latency, but only when a CPU is using code that an IRQ handler needs to use (not for all locks).

Lock contention is something that also needs careful consideration. Two CPUs don't give twice the performance of one CPU because of lock contention (and a few other things, like CPU cache thrashing, use of the "lock" instruction prefix, etc). How good your OS is at avoiding lock contention is the most important factor in determining how well it will perform for multi-CPU - if you can imagine a computer with 64 CPUs where 63 of the CPUs are waiting for the other CPU to release a lock, then you can imagine how bad things can be. This sounds like an absurd example, but if you use a single/global kernel lock it may be what you end up with. By increasing the number of locks you reduce the chance of lock contention. With millions of locks, lock contention would be negligible, but with this many locks different kernel functions would need to acquire many different locks to get something done, and the overhead of locking/unlocking can become significant. Therefore the number of locks used should depend on the overhead of each lock and the chance of lock contention (i.e. the number of CPUs).

[continued next post]

Re:Multi-CPU question

Post by Brendan »

[continued from previous post]

In addition, there are "lockless algorithms" that can be devised for certain situations. A common one is used for linked list management where many CPUs can be reading and only one CPU modifying. By using lockless algorithms it's possible to remove lock contention and the overhead of the locks at the same time. Lockless algorithms should be used wherever possible, but can mean completely redesigning how things work.

For example, my OS's "system timer tick" uses a 64-bit value to keep track of milliseconds since the start of the year 2000. The problem is that the timer tick can be read after the low dword is updated but before the high dword is updated, giving a completely incorrect reading. Using a lock to protect against this would involve a relatively large overhead (and deadlock avoidance problems, as it's modified during an IRQ). To solve this, the IRQ itself only modifies the lowest 32 bits (which can be done atomically with a "lock add") and doesn't keep track of the high 32 bits. Each CPU keeps track of what the low 32 bits were when the counter was last read, and its own version of the high dword. When a CPU reads the counter it checks if the new value is lower than the old value, and updates its version of the high dword if an overflow was detected. This is completely lockless, but also means you can't just read the system timer tick. Instead there's a "getTimerTick" routine that must be used, which goes something like:

Code: Select all

getTimerTick:
   pushfd
   cli         ;Disable interrupts to prevent thread switches on this CPU only
   mov ebx,[SIBtickLow]
   cmp [gs:CPULISTentryStruct.lastTickLow],ebx
   ja .overflowed
   mov esi,[gs:CPULISTentryStruct.lastTickHigh]    ;esi:ebx = 64 bit timer tick
   mov [gs:CPULISTentryStruct.lastTickLow],ebx
   popfd
   ret

.overflowed:
   inc dword [gs:CPULISTentryStruct.lastTickHigh]
   mov [gs:CPULISTentryStruct.lastTickLow],ebx
   mov esi,[gs:CPULISTentryStruct.lastTickHigh]    ;esi:ebx = 64 bit timer tick
   popfd
   ret
This means that every CPU must call this "getTimerTick" code at least every 2^31 milliseconds to make sure the high dword remains accurate. Each CPU also needs its own data area that isn't used by other CPUs (you need this for the scheduler and things anyway). For my OS I have a different GDT descriptor for each CPU's GS to make accessing this area faster (each CPU uses a different value for GS).

Hopefully you can see the benefits of lockless algorithms, and can imagine how complicated they can be for something that isn't as simple as adding a constant to a 64-bit value. Because lockless algorithms often work completely differently, retro-fitting them into existing OS code can be a major undertaking - for the previous (single-CPU only) version of my OS, reading the timer tick was as simple as 2 "mov" instructions, without any additional requirements.


Cheers,

Brendan
Kemp

Re:Multi-CPU question

Post by Kemp »

Ok, I see your (quite extensive) point about recoding for multi-CPU support later rather than straight off. Problem at the moment is no multi-CPU hardware to test on (other people's systems are fine, but no good for the early stages where I tend to put out a new test version every 15 mins, lol). I could use Bochs I suppose, but I don't expect code that has only been tested on it to work on real machines. I'll see if I can draw up a sensible plan for having stuff in place to support multi-CPU systems but not fully implemented until I get round to it.

From the length of your answer I'm guessing that supporting multiple cpus is a task that could probably be outlawed under several international treaties on inhumane treatment :P Better get started on the reading...

For booting I have a custom one (two-stage boot) so I can do whatever I want in it, including beating multiple processors into shape.

Edit:
I read through your post in more detail and noticed something. What do BSP and AP mean in relation to the cpus?

Also, thanks for taking the time to write all that, it really is helpful :)
pini

Re:Multi-CPU question

Post by pini »

BSP stands for Boot Strap Processor, and AP stands for Application Processor.
It's only a way to distinguish the first processor (BSP), which runs uni-processor OSes, from the others (APs).
Kemp

Re:Multi-CPU question

Post by Kemp »

OK, that makes sense. Basically I'm imagining two separate second stages for my boot loader, where the BSP runs the first one and does the setting up as you suggested, while the other processors (when they get activated) are told "go over here and execute that one", so they pick up the stuff that the BSP sets up. In the case of a uni-processor system there would obviously be no APs, so the second stage would either be unused or just not loaded at all, and the checkpoints would be automatically passed due to the processor count being 1 and the starting count for the number of processors that are ready also being 1.

This is also a good example of what I was thinking for building in support for multi-cpu systems but not actually having code to use them. In this example the second set of code doesn't even need to be written, but when it is it'll interface with the first set nicely.

Re:Multi-CPU question

Post by Brendan »

Hi,
Kemp wrote: OK, that makes sense. Basically I'm imagining two separate second stages for my boot loader, where the BSP runs the first one and does the setting up as you suggested, while the other processors (when they get activated) are told "go over here and execute that one", so they pick up the stuff that the BSP sets up. In the case of a uni-processor system there would obviously be no APs, so the second stage would either be unused or just not loaded at all, and the checkpoints would be automatically passed due to the processor count being 1 and the starting count for the number of processors that are ready also being 1.

This is also a good example of what I was thinking for building in support for multi-cpu systems but not actually having code to use them. In this example the second set of code doesn't even need to be written, but when it is it'll interface with the first set nicely.


For just about everything (except the trampoline code), the BSP creates <something> and then both the BSP and the AP CPUs use it. The same code to use <something> can be used by both the BSP and the AP CPUs, so having a second set of "AP only" routines isn't too useful. For an example, see my previous post(s): the "call load_GDT_and_IDT" and "call enable_paging" routines are used by all CPUs, and the only things that aren't used by the BSP are the "wait for <something> to become ready" parts (which could be done by a small routine) and the general flow control for AP CPUs:

Code: Select all

waitForBSP:
  cmp dword [boot_level],eax
  jb waitForBSP
  lock inc dword [CPU_count]
  ret
And the general AP flow:

Code: Select all

AP_init_code:
   mov eax,1
   call waitForBSP

   call load_GDT_and_IDT

   mov eax,2
   call waitForBSP

   call enable_paging
The amount of additional code used only by the AP CPUs is negligible.

For the trampoline code, because you're not stuck with GRUB you can get the AP CPUs to jump to "AP_init_code" directly without any additional messing about.

This is (sort of) how my OS does things. The boot code works with the BSP only and loads the second stage. The second stage detects the other CPUs and starts them. From there on everything relies on the checkpoints while going through the remaining boot stages (up until the scheduler takes over). I use the same boot code for BSP and AP CPUs, but (for performance reasons) I do use different kernels (all kernels generated from the same source using conditional assembly). This is mainly because for a single-CPU kernel some things can be relaxed (less need for re-entrancy locking, no need for IPIs or "Inter-Processor Interrupts", no need for CPU affinity), and for the multi-CPU kernel some things can be omitted (PIC chip support).
Kemp wrote:Ok, I see your (quite extensive) point about recoding for multi-CPU support later rather than straight off. Problem at the moment is no multi-CPU hardware to test on (other people's systems are fine, but no good for the early stages where I tend to put out a new test version every 15 mins, lol). I could use Bochs I suppose, but I don't expect code that has only been tested on it to work on real machines. I'll see if I can draw up a sensible plan for having stuff in place to support multi-CPU systems but not fully implemented until I get round to it.
Bochs is good enough to allow you to implement the first 4 months or more of a multi-CPU OS without much problem, especially if you've also got a modern single-CPU computer to test with (things like the MP Specification table and APIC IRQ handling are mostly the same on single-CPU). Aside from this, there are people on the OS testing forum who'd be willing to test multi-CPU OSes (me), and it's possible to get older multi-CPU servers for surprising prices if you don't mind a small gamble (my dual 400 MHz Pentium II Compaq server was about $200 via eBay).
Kemp wrote:From the length of your answer I'm guessing that supporting multiple cpus is a task that could probably be outlawed under several international treaties on inhumane treatment :P Better get started on the reading...
I have my own reasons for wanting OS developers to acquire a good working knowledge of the details of multi-CPU support. IMHO it's a "win-win" situation, especially considering the direction CPU manufacturers are heading - in three years time, when single-CPU is obsolete, I don't want to be the only person still here ;).


Cheers,

Brendan
Kemp

Re:Multi-CPU question

Post by Kemp »

I don't see how the BSP and APs could use the same boot code when the BSP needs to do extra things. There are similarities in the two sets of code you posted, but they are still two different pieces. I'll probably end up going with two different 2nd stages for both ease of reading and my own sanity.

Also, vaguely related, I read that there's something against two processors executing code from the same place in memory (even if no stuff is being written), this would mean each processor would have to have its own copy of the code (including kernel etc). Is that true?

Re:Multi-CPU question

Post by Brendan »

Hi,
Kemp wrote:I don't see how the BSP and APs could use the same boot code when the BSP needs to do extra things. There are similarities in the two sets of code you posted, but they are still two different pieces.
Different entry points and different "flow control" code, but both using the same sub-routines that do all of the real work. Perhaps a more complete example:

Code: Select all

   org 0x1000

   jmp startBSP
   align 16
   jmp startAP


startBSP:
  call create_GDT
  call create_IDT
  call load_GDT_and_IDT

  mov eax,1
  call setCheckpoint

  call create_paging_tables
  call enable_paging
  
  mov eax,2
  call setCheckpoint

.die:
   jmp .die

   
startAP:
  mov eax,1
  call waitForBSP
  call load_GDT_and_IDT

  mov eax,2
  call waitForBSP
  call enable_paging

.die:
   jmp .die


setCheckpoint:
  mov dword [CPU_count],1
  mov [boot_level],eax
  mov eax,[total_CPUs]
.wait:
  cmp [CPU_count],eax
  jb .wait
  ret

  
waitForBSP:
  cmp dword [boot_level],eax
  jb waitForBSP
  lock inc dword [CPU_count]
  ret


create_GDT:
  ???
  ret

 
create_IDT:
  ???
  ret

 
load_GDT_and_IDT:
  ???
  ret

 
create_paging_tables:
  ???
  ret

 
enable_paging:
  ???
  ret
To make separate binaries (one for BSP only and the other for AP only) you'd need to duplicate the "load_GDT_and_IDT" and "enable_paging" code in both binaries. For this little example (where these routines would be simple) it'd make things easier, but for more complex/complete code it creates a lot of code duplication. For example, imagine you add a routine to detect CPU information (family, brand name and features, including correcting all errata, generating a static brand name when CPUID isn't supported, detecting CPU cache information, etc) that consists of 2 KB of code and 3 KB of data - you wouldn't want it duplicated. Duplicating the code also makes it more difficult to maintain, as most changes would need to be made to both copies.
Kemp wrote:Also, vaguely related, I read that there's something against two processors executing code from the same place in memory (even if no stuff is being written), this would mean each processor would have to have its own copy of the code (including kernel etc). Is that true?


Executing code (and reading data that is never changed), does not create any problems.

The only thing where it makes any difference at all is if you support NUMA, where it can be better for performance to have separate copies of the code for different memory domains. This is because a CPU can read from its own local memory faster than it can read from remote memory. It's not worth implementing this sort of thing for the boot code, as the performance improvement doesn't justify the headaches (you'd need to detect which memory domain each CPU uses first, and then dynamically create a copy of the code for each memory domain at whatever physical address it happens to be). I use this sort of thing for my kernel's code on NUMA computers (but not for non-NUMA) for a (possibly small) performance improvement...


Cheers,

Brendan
Kemp

Re:Multi-CPU question

Post by Kemp »

Ah, I see what you mean now with that code, you're quite the teacher :) I can't see too many problems with implementing that sort of scheme (that is, problems with me implementing it, obviously if it didn't work at all you wouldn't have suggested it). Preliminary idea for my boot sequence (stuff related to the processors only) in this case will now be:

1st Stage (512 bytes)
Load second stage

2nd Stage (as big as I want)
Create things that all processors need to know (IDT table, any random shared stuff, etc)
Detect other processors that might exist
If there are any other processors, start them up, send them to the other entry point and increment the processor count

If I create the IDT etc. before the other processors are even started, then that should mean no need for checkpoints around those items, right? The processors can just load them straight away because they're already there.


Edit:
While I'm on the subject of old techniques on new hardware... Would moving 32-bit -> 64-bit later on need an extensive rewrite? Not looking for details on this one, just a yes/no/maybe and a very brief example if you're in the mood ;)

100th Post! :D

Re:Multi-CPU question

Post by Brendan »

Hi,
Kemp wrote:If I create the IDT etc. before the other processors are even started, then that should mean no need for checkpoints around those items, right? The processors can just load them straight away because they're already there.
Yes :).

In fact, depending on how you initialize, what you initialize and how complex your OS is, you could end up with no checkpoints in stage 2 at all.

Things that would require checkpoints are:


a) if you mess with the MTRRs. Messing with MTRRs is optional (you can usually trust the BIOS to do it right).

b) if you measure CPU speeds. Due to hyper-threading it's best to speed test all CPUs at the same time, otherwise logical CPUs will seem faster.

c) if you detect CPU features during stage 2 and use the results to do certain things differently. For example, my OS detects whether all CPUs support "long mode" and tries to load a 64-bit kernel if they do, and checks the CPU cache size/configuration to determine how many page colours the memory manager should support. For me these things need to be done within stage 2, which means the AP CPUs need to run some code and then wait for the BSP to act on the results.

d) transitions to subsequent boot stages. I've got a 4 stage boot and need to make AP CPUs wait until the next stage is ready.

Also, it's going to depend on how much work you get the AP CPUs to do. For example, there are always some "per CPU" data structures that need to be initialized (e.g. those used to keep track of the first tasks). You can make the BSP create all of them, but I make the AP CPUs create their own (it shaves half a millisecond off the time it takes to boot, and I was quite keen to get something happening on all CPUs at the same time when I wrote it :) ).

Your OS will be different - you might find other things that need checkpoints...

Kemp wrote:Edit:
While I'm on the subject of old techniques on new hardware... Would moving 32-bit -> 64-bit later on need an extensive rewrite? Not looking for details on this one, just a yes/no/maybe and a very brief example if you're in the mood ;)

100th Post! :D
For 64-bit it's best to rewrite everything after (and including) your paging initialization code. If you're tricky you can recycle the earlier boot code though.

This is part of the reason I've got a 4-stage boot - only the last 2 stages (paging setup and kernel initialization) and the kernel itself would need to be different for 64-bit, and all the previous stuff (detecting/starting APs, physical memory initialization, CPU feature detection, scanning MP and ACPI tables, etc) can use the same binary code regardless.

This also allows me to create a single boot disk which detects which kernel to use automatically - much nicer than expecting people to pay for a completely different 64-bit version of your OS, like a certain commercial company.

Congrats on the 100th too!



Cheers,

Brendan
Kemp

Re:Multi-CPU question

Post by Kemp »

Ok, so basically once you start messing around with addressing you need different code, noted.

On a side-note, the floppy version of my boot sector has stopped giving me error 01h on INT 13h (Invalid Parameter) [or in fact panicking in Bochs before getting that far] and has been upgraded to error 20h (Controller Fault) [and still panicking in Bochs before getting that far]. I decided to completely reorganise the code and found so many issues that I can only conclude the old version happened to be working by coincidence alone. Also, I'm going to abandon Bochs and stick with Virtual PC; despite screwing me over with hard disks not appearing in BIOS calls, it actually works.

When I get my new system I'll be moving from single-core 32-bit to dual-core (AMD) 64-bit. That's gonna be quite a jump code-wise :o


Edit:
Ok, this is really weird. From what I can tell from stepping through in Bochs, the problem actually occurs during an INT 13h call, not in my code. I get "RIP > CS.limit". Shifting this out to a separate thread as it has nothing to do with this one.