IOAPIC Troubles
Hi all, for the past 2 weeks or so I've been trying to fix this problem on and off.
My implementation of the IOAPIC/LAPIC appears to be faulty, although Bochs reports no problems in its log.
The problem is that the system encounters a hardware hang (verified on both Bochs and VBox). Using the Bochs debugger, if I continue execution as normal, then break and attempt to single step, it hangs on the same RIP (mov rax, <address of kernel function>).
I believe the problem is that the redirection table is not configured correctly. I read somewhere that this can cause this kind of hardware hang (the APIC tries to deliver the interrupt but does not know how, so the machine enters a deadlock state). I also verified that the interrupt handler (PIT IRQ) never gets called; the hang occurs before any interrupt is executed.
Here's my implementation:
xapic.c http://pastebin.com/njL1xHsW
xapic.h http://pastebin.com/qsGgb5Jy
os/driver.h http://pastebin.com/7piSxS5S
misc/mp.h http://pastebin.com/Zjvuf5eG
OS Output: http://i.imgur.com/RrkW2.png
I'm completely at a loss; I have no idea why this is happening. I've tried at least 3 different redirection table configurations at this point, and none of the other code seems to be invalid. I also know that the redirection table's values are being set to the expected values (I read them back after writing them).
Cheers.
- gravaera
- Member
- Posts: 737
- Joined: Tue Jun 02, 2009 4:35 pm
- Location: Supporting the cause: Use \tabs to indent code. NOT \x20 spaces.
Re: IOAPIC Troubles
Yo,
You don't use paging? A cursory skim showed that you are using the lapic and io-apic base addresses raw, as physical addresses. Nothing else really jumped out at me.
--Peace out,
gravaera
17:56 < sortie> Paging is called paging because you need to draw it on pages in your notebook to succeed at it.
Re: IOAPIC Troubles
Yeah, I'm using long mode so paging is enabled. However I've identity mapped the first 16GB (for now, once I have an MM written I'll obviously go back and map the memory to these addresses the "proper" way). So it shouldn't be a problem.
Cheers.
- gravaera
- Member
- Posts: 737
- Joined: Tue Jun 02, 2009 4:35 pm
- Location: Supporting the cause: Use \tabs to indent code. NOT \x20 spaces.
Re: IOAPIC Troubles
Yo:
Re-read your source:
Lines 38, 43, 49 and 53: When you cast the value to (uint32*), you lose the volatile qualifier that you tried to put in at the function signature. Additionally, why is base declared as a pointer to a const value in the signature when you know you will write to it?
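To illustrate the point about the cast, here is a minimal sketch. The original xapic.c is only available through the pastebin link, so `reg_write32` and `reg_read32` are hypothetical helpers, not the poster's functions. The idea is simply that the cast target repeats `volatile`, and that the pointer used for writing is not `const`.

```c
#include <stdint.h>

/* Hypothetical helpers, not taken from the original xapic.c.
 * The base pointer is volatile and non-const because we store through it. */
static inline void reg_write32(volatile void *base, uint32_t offset, uint32_t value)
{
    /* Casting to plain (uint32_t *) here would silently drop volatile;
     * the cast target has to repeat the qualifier. */
    volatile uint32_t *reg =
        (volatile uint32_t *)((volatile uint8_t *)base + offset);
    *reg = value;
}

/* Reads may take a pointer to const volatile: the hardware can change the
 * value, but this function never writes through the pointer. */
static inline uint32_t reg_read32(const volatile void *base, uint32_t offset)
{
    const volatile uint32_t *reg =
        (const volatile uint32_t *)((const volatile uint8_t *)base + offset);
    return *reg;
}
```

With the qualifier kept all the way down to the dereference, the compiler has to emit every load and store it is asked for.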
In general:
1. Volatile should not be considered a memory barrier. You should use volatile to indicate that the memory area being accessed is likely to be modified outside the scope of the current execution block, beyond the compiler's knowledge.
2. When you want to issue an explicit memory barrier to the compiler, you should use something like:
    asm volatile ("": : :"memory");
3. When you want to issue an explicit memory barrier to the CPU, you should use one of the x86 serializing instructions, all of which are outlined in the manuals. As a rule, there are few cases where you will need only one serialization mechanism and not the other -- generally, serialization is done both as a compiler directive, so the compiler knows not to re-order, and as an instruction to the CPU, so it knows to fence memory at the current instruction. Tbh, I can't remember what the case for needing only one of them is, but it's something really technical and blurry anyway.
Putting it all together:
Any access to IO-REGSEL is write-only; however, IO-WIN may be read or written afterward, and the write to IO-REGSEL must be executed before the read/write to IO-WIN. The value read from IO-WIN depends on the value written to IO-REGSEL, and the register accessed on a write to IO-WIN likewise depends on the value written to IO-REGSEL. Therefore you must serialize after every write to IO-REGSEL to ensure that it comes before any access to IO-WIN. You need both a compiler serialization directive and a CPU-level runtime serialization instruction for this case.
Additionally, IO-WIN is a read-write register whose value changes outside the scope of the current execution stream. A read from IO-WIN may return 5 now, and then if you write to IO-REGSEL and read IO-WIN again, you may get 10 this time. The compiler, however, cannot see any reason for the value in IO-WIN to be different on the second read, so it may optimize out the second read. The compiler must be told, using the "volatile" keyword, that the value of IO-WIN must not be assumed.
Do you have to add "volatile" to writes to IO-REGSEL as well? No. IO-REGSEL is write-only, so any assumptions the compiler makes about its value will be correct, given that it can see all changes being made. The only exception is if another CPU, A, writes to the registers of the IO-APIC while you are using it on CPU B. This should not happen if you use proper locking to lock off the registers to other CPUs.
Do you have to add a memory barrier after reads/writes to IO-WIN? No. Volatile already causes the compiler to drop all assumptions about the value in IO-WIN, and accesses to IO-WIN are "singular" in nature. There are no multi-word command sequences for the IO-APIC, so no ordering needs to be preserved.
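The select-then-access sequence described above can be sketched roughly as follows. The register offsets (0x00 for IO-REGSEL, 0x10 for IO-WIN) come from the 82093AA datasheet. The names `ioapic_read`, `ioapic_write` and `ioapic_fence` are hypothetical, and using the GCC/Clang builtin `__sync_synchronize()` as the CPU-level fence is an assumption of this sketch; the post only asks for "one of the x86 serializing instructions".

```c
#include <stdint.h>

#define IOAPIC_REGSEL 0x00u  /* index register offset (82093AA datasheet) */
#define IOAPIC_WIN    0x10u  /* data window offset (82093AA datasheet) */

/* Compiler barrier from the post plus a full hardware fence. The builtin
 * used for the fence is an assumption, not something the thread prescribes. */
static inline void ioapic_fence(void)
{
    __asm__ volatile ("" ::: "memory");
    __sync_synchronize();
}

/* base: virtual address the IO-APIC registers are mapped at (hypothetical). */
static inline uint32_t ioapic_read(volatile uint32_t *base, uint8_t reg)
{
    base[IOAPIC_REGSEL / 4] = reg;   /* select the register first */
    ioapic_fence();                  /* keep the select ahead of the access */
    return base[IOAPIC_WIN / 4];
}

static inline void ioapic_write(volatile uint32_t *base, uint8_t reg, uint32_t value)
{
    base[IOAPIC_REGSEL / 4] = reg;
    ioapic_fence();
    base[IOAPIC_WIN / 4] = value;
}
```

On SMP the select and the window access form a pair, so a caller would hold a lock around each call, as the post says.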
It's pretty likely that you've been losing reads/writes multiple times in your code, so your LAPIC/IO-APIC setup is probably only half-done.
EDIT: And you've said that you have identity-mapped all of memory -- have you mapped all memory as "write-through" or "uncached"? It's most likely all mapped as "write-back" to enable the CPU to write only to its cache and then buffer all writes for periodic write-back to RAM.
Memory-mapped IO must be mapped as write-through or uncached. Linux maps the IO-APICs as uncached. I personally map them (and all other IO) as write-through and see no problems, but the Linux devs know more than I do, and I'm likely to see problems on specific hardware where the caches are implemented strangely.
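As a rough illustration of the uncached option, the sketch below builds a 4 KiB page-table entry with both PWT and PCD set. The bit positions are the architectural x86 ones; `make_mmio_pte` is a hypothetical helper, and the sketch assumes the PAT is still at its power-on default, where PWT=1 plus PCD=1 selects the uncacheable type.

```c
#include <stdint.h>

/* x86-64 page-table entry flag bits. */
#define PTE_PRESENT  (1ull << 0)
#define PTE_WRITABLE (1ull << 1)
#define PTE_PWT      (1ull << 3)  /* page-level write-through */
#define PTE_PCD      (1ull << 4)  /* page-level cache disable */

/* Build a 4 KiB leaf entry for an MMIO page. Hypothetical helper; assumes
 * the PAT has not been reprogrammed. Low bits of phys are masked off. */
static inline uint64_t make_mmio_pte(uint64_t phys)
{
    return (phys & 0x000FFFFFFFFFF000ull)
         | PTE_PRESENT | PTE_WRITABLE | PTE_PWT | PTE_PCD;
}
```

Setting only `PTE_PWT` would give the write-through mapping mentioned above instead.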
http://wiki.osdev.org/Memory_mapped_reg ... _C/C%2B%2B New wiki article dedicated to memory and port-mapped IO
--Peace out,
gravaera
Re: IOAPIC Troubles
gravaera wrote: Any access to IO-REGSEL is write-only
I do not think so. The 82093AA I/O ADVANCED PROGRAMMABLE INTERRUPT CONTROLLER (IOAPIC) specification says that the IOREGSEL register is Read/Write. Bits [7:0] are the APIC Register Address and bits [31:8] are reserved.
I do not touch the reserved bits, so I first read IOREGSEL and then modify only the APIC Register Address bits. After that, all 32 bits are written back.
However, I am not absolutely sure if this is the right way. Would it be safer just to put zeros to those reserved bits?
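The two options being weighed here can be put side by side. This is only a sketch with hypothetical function names: variant A is the read-modify-write described above, variant B writes the reserved bits as zero.

```c
#include <stdint.h>

#define IOREGSEL_ADDR_MASK 0x000000FFu  /* bits [7:0]: APIC register address */

/* Variant A: read-modify-write, leaving reserved bits [31:8] as found. */
static inline void ioregsel_select_preserve(volatile uint32_t *ioregsel, uint8_t reg)
{
    uint32_t v = *ioregsel;
    v = (v & ~IOREGSEL_ADDR_MASK) | reg;
    *ioregsel = v;
}

/* Variant B: plain write, reserved bits [31:8] written as zero. */
static inline void ioregsel_select_zero(volatile uint32_t *ioregsel, uint8_t reg)
{
    *ioregsel = reg;
}
```

Variant A costs one extra register read per access and trusts whatever the reserved bits currently hold; variant B does neither.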
-
- Member
- Posts: 2566
- Joined: Sun Jan 14, 2007 9:15 pm
- Libera.chat IRC: miselin
- Location: Sydney, Australia (I come from a land down under!)
- Contact:
Re: IOAPIC Troubles
Purely out of interest, are you ever actually installing ISRs for the vectors you're setting up in the redirection table? I'd be curious to see your IRQ handler as well.
Re: IOAPIC Troubles
Hi,
AUsername wrote: I'm completely at a loss, I have no idea why this is happening. I've tried at least 3 different redirection table configurations at this point, none of the other code seems to be invalid, I also know that the redirection table's values are being set to the expected value (read the values back after being written).
Did you try the correct redirection table (the one that is described by ACPI's MADT/APIC table)?
Before configuring the IO APIC you should create a table (or an array of entries) with one entry for each IO APIC input. Each entry should indicate what the interrupt source is, what the interrupt vector is, the delivery mode and target CPU/s, whether the interrupt is edge triggered or level triggered, and whether it is active high or active low. Initially this table would be configured such that the first 16 legacy/ISA IRQs are mapped to the first 16 IO APIC inputs (excluding "cascade" - IO APIC input #2 would just be connected to nothing, the same as IO APIC inputs #16 and higher). Then you'd parse the ACPI MADT/APIC table and find "interrupt source override" entries, which tell you which IO APIC inputs are different from this initial assumption. For example, there may be an "interrupt source override" entry that says that ISA IRQ #0 is connected to IO APIC input #33 and is level triggered (or anything else).
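A table like this could be sketched as below. The struct layout, the names and the fixed size of 48 inputs are all assumptions made for illustration; real code would size the table from the IO APICs it finds and would also carry the delivery mode and target CPU/s.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_IOAPIC_INPUTS 48    /* example size only */
#define ISA_IRQ_NONE      0xFF  /* marker: no ISA IRQ wired to this input */

struct ioapic_input {
    uint8_t isa_irq;         /* ISA IRQ wired here, or ISA_IRQ_NONE */
    uint8_t vector;          /* 0 = not assigned yet */
    bool    level_triggered; /* false = edge triggered */
    bool    active_low;      /* false = active high */
};

static struct ioapic_input input_table[MAX_IOAPIC_INPUTS];

/* Initial assumption: ISA IRQ n -> input n, edge triggered, active high;
 * input 2 (the cascade) and inputs 16 and up are connected to nothing. */
static void input_table_init(void)
{
    for (unsigned i = 0; i < MAX_IOAPIC_INPUTS; i++) {
        bool isa = (i < 16 && i != 2);
        input_table[i] = (struct ioapic_input){
            .isa_irq = isa ? (uint8_t)i : ISA_IRQ_NONE,
        };
    }
}

/* Apply one "interrupt source override": ISA IRQ `irq` is really wired to
 * global input `gsi`, with the given trigger mode and polarity. */
static void input_table_override(uint8_t irq, unsigned gsi, bool level, bool low)
{
    if (irq >= 16 || gsi >= MAX_IOAPIC_INPUTS)
        return;
    if (input_table[irq].isa_irq == irq)
        input_table[irq].isa_irq = ISA_IRQ_NONE;  /* default slot is now free */
    input_table[gsi] = (struct ioapic_input){
        .isa_irq = irq, .level_triggered = level, .active_low = low,
    };
}
```

Calling `input_table_init()` once and then `input_table_override()` for each override entry found in the MADT leaves the table describing the actual wiring.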
Note that you may have multiple IO APICs. For example, there might be a total of 48 IO APIC inputs (and your table might have 48 entries), where the first 24 IO APIC inputs are on the first IO APIC and the remaining 24 IO APIC inputs are on the second IO APIC. If the ACPI MADT/APIC table says that ISA IRQ #0 is connected to IO APIC input #33, then that would actually be IO APIC input #9 on the second IO APIC.
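The arithmetic in that example (global input #33 becoming input #9 on the second IO APIC) can be sketched as a lookup. `struct ioapic_desc` and `ioapic_resolve` are hypothetical names; the base value would come from each IO APIC's entry in the MADT.

```c
#include <stddef.h>
#include <stdint.h>

struct ioapic_desc {
    uint32_t gsi_base;   /* first global input handled by this IO APIC */
    uint32_t num_inputs; /* number of inputs on this IO APIC */
};

/* Returns the index of the IO APIC that owns global input `gsi` and stores
 * the local input number in *pin, or returns -1 if nothing covers it. */
static int ioapic_resolve(const struct ioapic_desc *apics, size_t count,
                          uint32_t gsi, uint32_t *pin)
{
    for (size_t i = 0; i < count; i++) {
        if (gsi >= apics[i].gsi_base &&
            gsi - apics[i].gsi_base < apics[i].num_inputs) {
            *pin = gsi - apics[i].gsi_base;
            return (int)i;
        }
    }
    return -1;
}
```

With two IO APICs of 24 inputs each, resolving global input 33 gives the second IO APIC (index 1) and local input 9.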
Also, APICs have a different interrupt priority scheme than the PIC - the interrupt priority depends on the interrupt vector. The priority class is the upper four bits of the vector, and a numerically higher vector has a higher priority. High priority IRQs (like ISA IRQ #0) should therefore have a high interrupt vector (e.g. vector 0xE0) while low priority IRQs (like ISA IRQ #6) should have a low interrupt vector (e.g. vector 0x30). The local APIC's spurious interrupt is conventionally given the highest interrupt vector (e.g. vector 0xFF). Interrupt vectors used by the OS for IPIs tend to be very high priority and would therefore have very high interrupt vectors (for example, you might reserve interrupt vectors 0xF0 to 0xF7 for the OS's IPIs). Of course interrupt vectors 0x00 to 0x1F are reserved for exception handlers and should never be used for anything else.
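For reference, Intel's manual defines the priority class as bits 7:4 of the vector, with class 15 the highest, so a numerically larger vector outranks a smaller one. A small sketch with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Priority class = upper four bits of the vector. */
static inline uint8_t vector_priority_class(uint8_t vector)
{
    return (uint8_t)(vector >> 4);
}

/* True if pending vector `a` is picked ahead of pending vector `b`:
 * higher class first, then the higher low nibble within a class. */
static inline bool vector_outranks(uint8_t a, uint8_t b)
{
    return a > b;
}
```

So vector 0xE0 is in class 14 and vector 0x30 is in class 3, and the local APIC favours the former.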
This also means that you probably want/need an "IDT entry manager", which is used to allocate interrupt vectors. For example, you might want an interrupt vector for ISA IRQ #0, so you call your IDT entry manager and ask it to allocate a relatively high priority interrupt vector for ISA IRQ #0 to use, and it will search for a free interrupt vector that is closest to the requested/desired priority. This "IDT entry manager" is important later on, for things like configuring PCI IRQs (and MSI).
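One possible shape for such an allocator is sketched below: it hands out the free vector nearest to the one requested. Every name here is hypothetical, and keeping 0xFF out of the pool for the spurious vector is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIRST_USABLE_VECTOR 0x20  /* 0x00-0x1F are reserved for exceptions */
#define LAST_USABLE_VECTOR  0xFE  /* assumption: 0xFF kept for spurious */

static bool vector_in_use[256];

/* Allocate the free vector nearest to `wanted` (ties go to the higher
 * vector); returns -1 when no vector is left. */
static int vector_alloc_near(uint8_t wanted)
{
    for (int dist = 0; dist < 256; dist++) {
        int candidates[2] = { wanted + dist, wanted - dist };
        for (int i = 0; i < 2; i++) {
            int v = candidates[i];
            if (v < FIRST_USABLE_VECTOR || v > LAST_USABLE_VECTOR)
                continue;
            if (!vector_in_use[v]) {
                vector_in_use[v] = true;
                return v;
            }
        }
    }
    return -1;
}

/* Return a vector to the pool; out-of-range values are ignored. */
static void vector_free(int v)
{
    if (v >= FIRST_USABLE_VECTOR && v <= LAST_USABLE_VECTOR)
        vector_in_use[v] = false;
}
```

A driver for ISA IRQ #0 would ask for a vector near its desired priority and store the result in its table entry; PCI and MSI setup could later reuse the same allocator.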
When configuring the IO APIC/s themselves; you'd just do a "for each entry in my table { ... }" loop where each IO APIC input (of each IO APIC) is configured however the entry in your table says it should be. Alternatively (depending on the nature of your OS/kernel), you might just disable/mask every IO APIC input and worry about doing most of the work when a driver that uses the IO APIC input is actually started.
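The body of that loop mostly comes down to encoding each table entry as a 64-bit redirection entry and writing its two halves. The bit positions and register indices below follow the 82093AA datasheet; the function names are hypothetical, and the sketch only covers fixed delivery to a physical destination.

```c
#include <stdbool.h>
#include <stdint.h>

/* Redirection-table entry bits (82093AA datasheet). */
#define REDIR_ACTIVE_LOW (1ull << 13)  /* polarity: 1 = active low */
#define REDIR_LEVEL      (1ull << 15)  /* trigger mode: 1 = level */
#define REDIR_MASKED     (1ull << 16)  /* 1 = input masked */
#define IOREDTBL_BASE    0x10u         /* entry n: registers 0x10+2n, 0x11+2n */

/* Encode a fixed-delivery, physical-destination entry. */
static uint64_t redir_encode(uint8_t vector, uint8_t dest_apic_id,
                             bool level, bool active_low, bool masked)
{
    uint64_t e = vector;                /* bits 7:0; delivery mode 000 = fixed */
    if (active_low) e |= REDIR_ACTIVE_LOW;
    if (level)      e |= REDIR_LEVEL;
    if (masked)     e |= REDIR_MASKED;
    e |= (uint64_t)dest_apic_id << 56;  /* destination field, top byte */
    return e;
}

/* Register indices for the low and high 32-bit halves of entry `pin`. */
static inline uint8_t redir_reg_low(uint8_t pin)
{
    return (uint8_t)(IOREDTBL_BASE + 2u * pin);
}

static inline uint8_t redir_reg_high(uint8_t pin)
{
    return (uint8_t)(IOREDTBL_BASE + 2u * pin + 1u);
}
```

For the "mask everything first" alternative, the same encoder would be called with `masked` set to true for every input, and a driver would clear the mask bit later.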
Mostly what I'm trying to say is that the "ISA IRQs 0 to 15 are mapped to IO APIC inputs 0 to 15" idea is broken and wrong, and that the "IO APIC inputs 0 to 15 are mapped to interrupt vectors 0x20 to 0x2F" idea is also broken and wrong. I think you're trying to emulate the behaviour of the PIC chips and doing "ISA IRQs 0 to 15 are mapped to IO APIC inputs 0 to 15 which are mapped to interrupt vectors 0x20 to 0x2F", which ends up being broken and wrong, and broken and wrong.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- gravaera
- Member
- Posts: 737
- Joined: Tue Jun 02, 2009 4:35 pm
- Location: Supporting the cause: Use \tabs to indent code. NOT \x20 spaces.
Re: IOAPIC Troubles
Yo:
@Antti: Very good point -- always read reserved bits and re-write their previous value back to the register. Thanks for catching that :O
EDIT: Small note after cross checking Linux source:
Linux, io_apic_32.c, Line 116: The Linux devs just write directly without reading the previous value: writel() is a macro that can be traced to __writel() -- usually when they do something in a certain way to get around buggy hardware, they leave a source comment though, so this is most likely just negligence (?).
--Peace out,
gravaera
Re: IOAPIC Troubles
gravaera wrote: [...]
This is actually quite interesting. If the reserved bits are used in the future (unlikely?), it must be taken into account that currently almost everyone zeros them. Maybe I will end up zeroing them as well. Otherwise, I will probably just create bugs if the hardware does not strictly follow the standard. It may also be that zero is safer even if some of those reserved bits come into play. I know that this is against "the right way to handle reserved bits." Or is it? This does not just apply to the I/O APIC.
It is sure that the hardware must be designed so that it is compatible with the current mainstream operating systems. If it does not work, it is useless. This aspect must be very troublesome for the hardware developers.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: IOAPIC Troubles
Antti wrote: It is sure that the hardware must be designed so that it is compatible with the current mainstream operating systems. If it does not work, it is useless. This aspect must be very troublesome for the hardware developers.
I do know some manuals are rather explicit about the actual behaviour and label reserved bits as "MBZ" (must be zero) or "preserve". I wonder if the manual in question includes a definition of "reserved" somewhere.
Guessing from the circumstances, treating the remaining bits as MBZ on an index register sounds more appropriate, in case some register numbered 256 or above is added in a later version and your bit-preserving code suddenly decides to redirect all writes to register+0x100 for no apparent reason.