Machine check exception on LAPIC read when using PAT

qookie
Member
Member
Posts: 72
Joined: Sun Apr 30, 2017 12:16 pm
Libera.chat IRC: qookie
Location: Poland

Machine check exception on LAPIC read when using PAT

Post by qookie »

Hi!

I recently added PAT support to my kernel, and framebuffer scrolling is now nice and fast.
Except, as the title says, when running on real hardware (an Intel i5-3210M; it also happens on an i5-3320M, but not on my Athlon 64 X2 PC), I get a machine check exception.
When I read the machine check MSRs, MCG_STATUS says an MCE is in progress and that the restart IP is valid, bank 1 MCi_STATUS says that it is valid and that MCi_ADDR is valid, and bank 1 MCi_ADDR points to the LAPIC (although it points to a register I never access, either directly or indirectly).

I can't see why it fails now, or if it's a problem unrelated to PAT, why it didn't fail before.

Thanks in advance for any help!

Relevant code:

Code that maps the LAPIC:

Code: Select all

arch_mm_map_kernel((void *)lapic_base, (void *)lapic_base, 4, // 4 pages
	ARCH_MM_FLAG_R | ARCH_MM_FLAG_W, ARCH_MM_CACHE_UC);
Defines for ARCH_MM_xxx:

Code: Select all

#define ARCH_MM_FLAG_R		0x01 /* page is readable */
#define ARCH_MM_FLAG_W		0x02 /* page is writable */
#define ARCH_MM_FLAG_E		0x04 /* page is executable */
#define ARCH_MM_FLAG_U		0x08 /* page is accessible in user mode */

#define ARCH_MM_CACHE_WB	0 /* write-back */
#define ARCH_MM_CACHE_WT	1 /* write-through */
#define ARCH_MM_CACHE_WC	2 /* write-combining */
#define ARCH_MM_CACHE_WP	3 /* write-protect */
#define ARCH_MM_CACHE_UC	4 /* uncacheable */
#define ARCH_MM_CACHE_DEFAULT ARCH_MM_CACHE_WB
Relevant bits of VMM code:

Code: Select all

// PAT setup
cpu_set_msr(0x277, 0x0000000005010406);

// flags
#define VMM_FLAG_WRITE		(1<<1)
#define VMM_FLAG_USER		(1<<2)
#define VMM_FLAG_PAT0		(1<<3)
#define VMM_FLAG_PAT1		(1<<4)
#define VMM_FLAG_PAT2		(1<<4)
#define VMM_FLAG_NX		(1ull<<63)

// these flags only apply to the lowest-level table entries
// higher tables (PML4, PDP, PD) only get (arch_flags & (WRITE | USER))
// the present bit is set automatically for each entry
int vmm_arch_to_vmm_flags(int flags, int cache) {
	int arch_flags = 0;

	if (flags & ARCH_MM_FLAG_W) arch_flags |= VMM_FLAG_WRITE;
	if (flags & ARCH_MM_FLAG_U) arch_flags |= VMM_FLAG_USER;
	if (!(flags & ARCH_MM_FLAG_E)) arch_flags |= VMM_FLAG_NX;

	if (cache & (1 << 0)) arch_flags |= VMM_FLAG_PAT0;
	if (cache & (1 << 1)) arch_flags |= VMM_FLAG_PAT1;
	if (cache & (1 << 2)) arch_flags |= VMM_FLAG_PAT2;

	return arch_flags;
}
Working on managarm.
Octocontrabass
Member
Member
Posts: 5578
Joined: Mon Mar 25, 2013 7:01 pm

Re: Machine check exception on LAPIC read when using PAT

Post by Octocontrabass »

qookie wrote: When reading the MCE status MSRs, MCG_STATUS says an MCE is in progress, and that restart IP is valid, bank 1 MCi_STATUS says that it's valid and that MCi_ADDR is valid, and bank 1 MCi_ADDR points to the LAPIC (although it points to a register I never access, either directly or indirectly).
What are the actual values of these registers? There's more information in them that might help us identify the cause (or at least narrow down the possibilities).

Re: Machine check exception on LAPIC read when using PAT

Post by qookie »

Octocontrabass wrote: What are the actual values of these registers? There's more information in them that might help us identify the cause (or at least narrow down the possibilities).
The values are as follows:

Code: Select all

MCG_CAP = 0x0000000000000C07
MCG_STATUS = 0x0000000000000005

MCi_STATUS (for bank 1) = 0xBF80000000200001
MCi_ADDR (for bank 1) = 0x00000000FEE00340
MCi_MISC (for bank 1) = 0x0000000000000086

MCi_STATUS (for banks 0, 2, 3, 4) = 0x0000000000000000
MCi_STATUS (for banks 5, 6) = 0x0020000000000000
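For anyone reading along, here is a minimal sketch of how the architectural flag bits in these values can be pulled apart. The struct and function names are made up for illustration, and only the fields I am sure are architectural are decoded; everything else in MCi_STATUS is left alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural flag bits of IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ull << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ull << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ull << 2) /* machine check in progress */

/* Architectural flag bits of IA32_MCi_STATUS. */
#define MCI_STATUS_VAL   (1ull << 63) /* register contents are valid */
#define MCI_STATUS_OVER  (1ull << 62) /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ull << 61) /* error was not corrected */
#define MCI_STATUS_EN    (1ull << 60) /* error reporting was enabled */
#define MCI_STATUS_MISCV (1ull << 59) /* MCi_MISC is valid */
#define MCI_STATUS_ADDRV (1ull << 58) /* MCi_ADDR is valid */
#define MCI_STATUS_PCC   (1ull << 57) /* processor context corrupt */

/* Hypothetical container for the decoded fields (not from any real kernel). */
struct mci_status_fields {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t mca_code;   /* bits 15:0, MCA error code */
	uint16_t model_code; /* bits 31:16, model-specific error code */
};

/* Splits a raw MCi_STATUS value into the fields above. */
struct mci_status_fields mci_status_decode(uint64_t status) {
	struct mci_status_fields f = {
		.val   = (status & MCI_STATUS_VAL) != 0,
		.over  = (status & MCI_STATUS_OVER) != 0,
		.uc    = (status & MCI_STATUS_UC) != 0,
		.en    = (status & MCI_STATUS_EN) != 0,
		.miscv = (status & MCI_STATUS_MISCV) != 0,
		.addrv = (status & MCI_STATUS_ADDRV) != 0,
		.pcc   = (status & MCI_STATUS_PCC) != 0,
		.mca_code   = (uint16_t)(status & 0xFFFF),
		.model_code = (uint16_t)((status >> 16) & 0xFFFF),
	};
	return f;
}
```

Fed the bank 1 value above, this reports a valid, uncorrected error with both MCi_ADDR and MCi_MISC valid, and MCG_STATUS = 5 has RIPV and MCIP set with EIPV clear.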
stlw
Member
Member
Posts: 357
Joined: Fri Apr 04, 2008 6:43 am

Re: Machine check exception on LAPIC read when using PAT

Post by stlw »

When a core accesses the local APIC, it first checks the memory access for consistency. If the access is found to be invalid, #MC is raised.
The following things are checked:
- The access must use the UC memory type
- The access must be 1, 2, or 4 bytes wide
- The access must be aligned to its data size

I would guess that your APIC page is mapped with the WB memory type and that this is causing the machine check, especially since the problem started when you introduced PAT.
qookie wrote: although it points to a register I never access, either directly or indirectly
If the APIC page is not mapped as UC, it might even be accessed speculatively on a wrong path, and this will certainly cause #MC.
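The three checks listed in this post can be written down as a small predicate. This is only a sketch of the rules as stated here, not of what the hardware really does internally; the enum and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-type tag; only "UC or not" matters for this check. */
enum memtype { MEMTYPE_UC, MEMTYPE_OTHER };

/* Returns true if an access passes the three checks listed in this post. */
bool lapic_access_ok(enum memtype type, uint64_t addr, size_t size) {
	/* 1. The access must use the UC memory type. */
	if (type != MEMTYPE_UC)
		return false;
	/* 2. The access must be 1, 2, or 4 bytes wide. */
	if (size != 1 && size != 2 && size != 4)
		return false;
	/* 3. The access must be aligned to its own size. */
	return (addr & (size - 1)) == 0;
}
```

By these rules, a 4-byte UC access at 0xFEE00020 passes, while the same access through a non-UC mapping fails no matter how well it is aligned.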
Last edited by stlw on Fri Nov 22, 2019 5:06 am, edited 1 time in total.

Re: Machine check exception on LAPIC read when using PAT

Post by stlw »

I guess I see your problem:

Code: Select all

#define ARCH_MM_CACHE_WB   0 /* write-back */
#define ARCH_MM_CACHE_WT   1 /* write-through */
#define ARCH_MM_CACHE_WC   2 /* write-combining */
#define ARCH_MM_CACHE_WP   3 /* write-protect */
#define ARCH_MM_CACHE_UC   4 /* uncacheable */
while in real life:

Code: Select all

enum {
  BX_MEMTYPE_UC = 0,
  BX_MEMTYPE_WC = 1,
  BX_MEMTYPE_RESERVED2 = 2,
  BX_MEMTYPE_RESERVED3 = 3,
  BX_MEMTYPE_WT = 4,
  BX_MEMTYPE_WP = 5,
  BX_MEMTYPE_WB = 6,
  BX_MEMTYPE_UC_WEAK = 7, // PAT only
};
With memory type == 4, you map your APIC page as write-through, and this causes the #MC.

Re: Machine check exception on LAPIC read when using PAT

Post by qookie »

stlw wrote:I guess I see your problem:

Code: Select all

#define ARCH_MM_CACHE_WB   0 /* write-back */
#define ARCH_MM_CACHE_WT   1 /* write-through */
#define ARCH_MM_CACHE_WC   2 /* write-combining */
#define ARCH_MM_CACHE_WP   3 /* write-protect */
#define ARCH_MM_CACHE_UC   4 /* uncacheable */
while in real life:

Code: Select all

enum {
  BX_MEMTYPE_UC = 0,
  BX_MEMTYPE_WC = 1,
  BX_MEMTYPE_RESERVED2 = 2,
  BX_MEMTYPE_RESERVED3 = 3,
  BX_MEMTYPE_WT = 4,
  BX_MEMTYPE_WP = 5,
  BX_MEMTYPE_WB = 6,
  BX_MEMTYPE_UC_WEAK = 7, // PAT only
};
With memory type == 4, you map your APIC page as write-through, and this causes the #MC.
I might be mistaken, but I reprogram the PAT with 0x0000000005010406 (apologies if this was hard to see in the VMM snippets!), which should match my caching mode constants. I haven't noticed anything in Intel or AMD manuals about reserved entries either.
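To make the claim checkable, here is a small sketch that extracts individual entries from a PAT MSR value. The enum and function names are mine, chosen for illustration; the type encodings are the same ones stlw listed (0 = UC, 1 = WC, 4 = WT, 5 = WP, 6 = WB, 7 = UC-).

```c
#include <assert.h>
#include <stdint.h>

/* Memory type encodings as stored in IA32_PAT entries (2 and 3 are reserved). */
enum pat_type {
	PAT_UC       = 0x00,
	PAT_WC       = 0x01,
	PAT_WT       = 0x04,
	PAT_WP       = 0x05,
	PAT_WB       = 0x06,
	PAT_UC_MINUS = 0x07,
};

/* Returns the memory type held in PAT entry `index` (0..7).
 * Each entry occupies one byte of the MSR; the type is in its low 3 bits. */
uint8_t pat_entry(uint64_t pat_msr, unsigned index) {
	assert(index < 8);
	return (uint8_t)((pat_msr >> (index * 8)) & 0x07);
}
```

With 0x0000000005010406, entries 0 through 4 come out as WB, WT, WC, WP, UC, which is the same order as ARCH_MM_CACHE_WB through ARCH_MM_CACHE_UC. Entries 5 to 7 are zero, so they are UC as well.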

Re: Machine check exception on LAPIC read when using PAT

Post by Octocontrabass »

qookie wrote:

Code: Select all

MCi_ADDR (for bank 1) = 0x00000000FEE00340
MCi_MISC (for bank 1) = 0x0000000000000086
Going by this, the fault address could be anywhere from 0xFEE00340 to 0xFEE0037F (physical). Unfortunately I couldn't figure out how to interpret the other error registers; Intel doesn't seem to have it documented for these CPUs. If it's caused by a misplaced read or write in your code, you should be able to catch it by setting a breakpoint using the debug registers.
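The range above follows from reading MCi_MISC with the architectural layout: bits 5:0 give the least significant valid bit of MCi_ADDR, and bits 8:6 give the address mode (2 = physical). Whether these CPUs really follow that layout is an assumption; the sketch below just shows the arithmetic, with hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: inclusive range that contains the fault address. */
struct mc_addr_range {
	uint64_t first;
	uint64_t last;
};

/* Address mode from MCi_MISC bits 8:6 (2 means physical address). */
unsigned mci_misc_addr_mode(uint64_t misc) {
	return (unsigned)((misc >> 6) & 0x7);
}

/* Computes the address range implied by MCi_ADDR and MCi_MISC. */
struct mc_addr_range mci_addr_range(uint64_t addr, uint64_t misc) {
	/* Bits 5:0: position of the least significant valid address bit. */
	unsigned lsb = (unsigned)(misc & 0x3F);
	/* All address bits below `lsb` are unknown. */
	uint64_t low_mask = (1ull << lsb) - 1;
	struct mc_addr_range r = {
		.first = addr & ~low_mask,
		.last  = addr | low_mask,
	};
	return r;
}
```

For MCi_MISC = 0x86 the LSB field is 6, so the low six bits of the address are unknown and the granularity is 64 bytes.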

Re: Machine check exception on LAPIC read when using PAT

Post by stlw »

As a self-check, you could define an MTRR that covers the APIC physical address and has the UC memory type.
An MTRR with the UC memory type takes precedence over the PAT memory type in every case except a PAT type of WC, which stays WC.
If your problem goes away, your APIC page is configured with a memory type other than UC.
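A rough sketch of the values such a variable-range MTRR would need follows. It assumes a 36-bit physical address width (the real width should come from CPUID leaf 0x80000008), and the struct and function names are hypothetical. Actually writing the pair to the MSRs (0x200 and 0x201 for the first variable-range MTRR), along with the cache flush sequence the manuals require around MTRR updates, is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MTRR_TYPE_UC        0x00ull      /* memory type field, PHYSBASE bits 7:0 */
#define MTRR_PHYSMASK_VALID (1ull << 11) /* valid bit in PHYSMASK */

/* Hypothetical holder for one PHYSBASE/PHYSMASK register pair. */
struct mtrr_pair {
	uint64_t physbase;
	uint64_t physmask;
};

/* Builds a UC variable-range MTRR for [base, base + size).
 * size must be a power of two of at least 4 KiB, and base must be
 * aligned to size; phys_bits is the physical address width. */
struct mtrr_pair mtrr_uc_range(uint64_t base, uint64_t size, unsigned phys_bits) {
	assert(size >= 0x1000 && (size & (size - 1)) == 0);
	assert((base & (size - 1)) == 0);
	assert(phys_bits >= 32 && phys_bits <= 52);

	/* Address bits that exist in the registers: phys_bits-1 down to 12. */
	uint64_t addr_mask = ((1ull << phys_bits) - 1) & ~0xFFFull;

	struct mtrr_pair p = {
		.physbase = (base & addr_mask) | MTRR_TYPE_UC,
		.physmask = (~(size - 1) & addr_mask) | MTRR_PHYSMASK_VALID,
	};
	return p;
}
```

For a single 4 KiB page at 0xFEE00000 with 36 physical address bits, this gives PHYSBASE = 0xFEE00000 and PHYSMASK = 0xFFFFFF800.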

Re: Machine check exception on LAPIC read when using PAT

Post by qookie »

Apologies for not getting back to you all earlier.

It seems I am very stupid, and as such made a very stupid mistake. My page table bit defines were wrong. :oops:

The PAT bit was defined like this:

Code: Select all

#define VMM_FLAG_PAT2      (1<<4)
while it should've been:

Code: Select all

#define VMM_FLAG_PAT2      (1<<7)
You can even see this mistake in the original post.
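For completeness, a minimal sketch of why the wrong bit matters. In a 4 KiB page table entry the PAT index is formed from PWT (bit 3), PCD (bit 4) and PAT (bit 7). The macro and function names below are invented for illustration and are not the kernel's real code. The helper returns uint64_t because flags such as NX live in bit 63 and would not fit in an int.

```c
#include <assert.h>
#include <stdint.h>

/* Memory-type selector bits in a 4 KiB page table entry. */
#define PTE_PWT (1ull << 3) /* PAT index bit 0 */
#define PTE_PCD (1ull << 4) /* PAT index bit 1 */
#define PTE_PAT (1ull << 7) /* PAT index bit 2 (4 KiB pages; large pages use bit 12) */

/* Converts a PAT index (0..7) into the matching PTE bits. */
uint64_t pat_index_to_pte_bits(unsigned index) {
	assert(index < 8);
	uint64_t bits = 0;
	if (index & 1) bits |= PTE_PWT;
	if (index & 2) bits |= PTE_PCD;
	if (index & 4) bits |= PTE_PAT;
	return bits;
}

/* Recovers the PAT index the CPU derives from a 4 KiB PTE. */
unsigned pte_bits_to_pat_index(uint64_t pte) {
	unsigned index = 0;
	if (pte & PTE_PWT) index |= 1;
	if (pte & PTE_PCD) index |= 2;
	if (pte & PTE_PAT) index |= 4;
	return index;
}
```

With the wrong define, asking for index 4 (UC) set bit 4 instead of bit 7. Bit 4 is PCD, so the CPU saw index 2, which is WC under the PAT value 0x0000000005010406 from the original post.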

Thanks for all the help. I apologise for not noticing the issue earlier.