Problem with on-demand page mapping on x86_64 (Solved: red zone issue)

max
Member
Posts: 629
Joined: Mon Mar 05, 2012 11:23 am
Libera.chat IRC: maxdev
Location: Germany

Post by max »

So I've been working on a port of my OS from x86 to x86_64 and tasking is basically functional again, but now I'm a bit stuck with my on-demand page mapping.

When loading ELF binaries/shared objects I don't load the actual content but only keep track of the area and where to load it from. Then, when a page fault occurs, I map that page, load the content and let the code continue. This worked fine on x86 previously.
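
To make the scheme concrete, here is a minimal user-space sketch of that kind of on-demand mapping. It is not the Ghost kernel's code: `OnDemandRegion`, `AddressSpace`, `handlePageFault` and `pageBase` are hypothetical names, and a `std::map` stands in for the real page tables. It only shows the bookkeeping: record the region up front, then fill a page when it is first touched.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <utility>
#include <vector>

// Hypothetical names throughout; a model of the idea, not kernel code.
constexpr uint64_t PAGE_SIZE = 0x1000;

// Rounds an address down to the start of its page.
constexpr uint64_t pageBase(uint64_t address) {
    return address & ~(PAGE_SIZE - 1);
}

// A lazily loaded area: where it lives and where its bytes come from.
struct OnDemandRegion {
    uint64_t start;                // first virtual address of the region
    std::vector<uint8_t> backing;  // stands in for the file contents
};

struct AddressSpace {
    std::vector<OnDemandRegion> regions;
    // Stand-in for the page tables: page base -> page contents.
    std::map<uint64_t, std::vector<uint8_t>> mappedPages;

    // Returns true if the fault was resolved by mapping and filling a page.
    bool handlePageFault(uint64_t faultAddress) {
        const uint64_t page = pageBase(faultAddress);
        for (const auto& region : regions) {
            const uint64_t end = region.start + region.backing.size();
            if (faultAddress < region.start || faultAddress >= end) continue;

            // Copy only the part of the backing data that overlaps this page.
            const uint64_t copyStart = page > region.start ? page : region.start;
            const uint64_t copyEnd = (page + PAGE_SIZE < end) ? page + PAGE_SIZE : end;
            assert(copyEnd > copyStart);

            std::vector<uint8_t> contents(PAGE_SIZE, 0);
            std::memcpy(contents.data() + (copyStart - page),
                        region.backing.data() + (copyStart - region.start),
                        copyEnd - copyStart);
            mappedPages[page] = std::move(contents);
            return true;
        }
        return false;  // not a tracked region: a genuine fault
    }
};
```

With a region registered at 0xa63000, a fault at 0xa63abc resolves to the page at 0xa63000, which is the same rounding that shows up in the log further down.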

Now I'm getting some very odd behaviour. This innocuous-looking example code:

Code: Select all

auto dependency = (g_elf_dependency*) heapAllocate(sizeof(g_elf_dependency));
const char* libName = (const char*) (object->dynamicStringTable + it->d_un.d_val);

logInfo("%! now trying to read string length of %x for first time...", "debug", libName);
logInfo("%# stringLength first time: %i", stringLength(libName));
logInfo("%# stringLength second time: %i", stringLength(libName));
logHexDump8((void*) libName, 0, 2);
This causes the following interesting log output:

Code: Select all

debug: now trying to read string length of 0x0000000000a63abc for first time... // task starts and wants the length of "libc.so"
pagefault: accessed 0x0000000000a63abc from 0xffffffff80221997                  // task faults because page is not there
pagefault: ondemand mapped: 0x0000000000a63000                                 // page is mapped in exception handler
0x0000000000a63abc: 0x6c 0x69 0x62 0x63 0x2e 0x73 0x6f 0x0 libc.so           // exception handler dumps memory after mapping the page
0x0000000000a63ac4: 0x6c 0x69 0x62 0x67 0x63 0x63 0x5f 0x73 libgcc_s
0x0000000000a63acc: 0x2e 0x73 0x6f 0x2e 0x31 0x0 0x47 0x43 .so.1GC
// now IRETQing back into the thread that caused the page fault
stringLength first time: 3             // First string length wrong
stringLength second time: 7            // Second time, it's correct
0x0000000000a63abc: 0x6c 0x69 0x62 0x63 0x2e 0x73 0x6f 0x0 libc.so      // dumping the memory again in the faulted task
0x0000000000a63ac4: 0x6c 0x69 0x62 0x67 0x63 0x63 0x5f 0x73 libgcc_s
0x0000000000a63acc: 0x2e 0x73 0x6f 0x2e 0x31 0x0 0x47 0x43 .so.1GC
Looking at my stringLength function, it disassembles to this:

Code: Select all

ffffffff80221976 <_Z12stringLengthPKc>:
ffffffff80221976:	55                   	push   %rbp
ffffffff80221977:	48 89 e5             	mov    %rsp,%rbp
ffffffff8022197a:	48 89 7d e8          	mov    %rdi,-0x18(%rbp)
ffffffff8022197e:	c7 45 fc 00 00 00 00 	movl   $0x0,-0x4(%rbp)
ffffffff80221985:	eb 04                	jmp    ffffffff8022198b <_Z12stringLengthPKc+0x15>
ffffffff80221987:	83 45 fc 01          	addl   $0x1,-0x4(%rbp)
ffffffff8022198b:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
ffffffff8022198f:	48 8d 50 01          	lea    0x1(%rax),%rdx
ffffffff80221993:	48 89 55 e8          	mov    %rdx,-0x18(%rbp)
ffffffff80221997:	0f b6 00             	movzbl (%rax),%eax
ffffffff8022199a:	84 c0                	test   %al,%al
ffffffff8022199c:	0f 95 c0             	setne  %al
ffffffff8022199f:	84 c0                	test   %al,%al
ffffffff802219a1:	75 e4                	jne    ffffffff80221987 <_Z12stringLengthPKc+0x11>
ffffffff802219a3:	8b 45 fc             	mov    -0x4(%rbp),%eax
ffffffff802219a6:	5d                   	pop    %rbp
ffffffff802219a7:	c3                   	ret
The page fault occurs exactly at the movzbl instruction at ffffffff80221997.

It's quite strange to me. Could it be that the movzbl instruction is partially completed, possibly with the source pointer already incremented or something like that? Or is it rather due to register corruption at some point?

Yes, I'm doing invlpg and I also tried a memory barrier, but all to no avail...
Last edited by max on Fri Mar 21, 2025 2:16 am, edited 1 time in total.
MichaelPetch
Member
Posts: 829
Joined: Fri Aug 26, 2016 1:41 pm
Libera.chat IRC: mpetch

Re: Problem with on-demand page mapping on x86_64; instruction restart after page fault?

Post by MichaelPetch »

Given that you get different output on two different invocations I thought maybe something might be corrupting the stack. I noticed this:

Code: Select all

ffffffff80221976 <_Z12stringLengthPKc>:
ffffffff80221976:	55                   	push   %rbp
ffffffff80221977:	48 89 e5             	mov    %rsp,%rbp
ffffffff8022197a:	48 89 7d e8          	mov    %rdi,-0x18(%rbp)
ffffffff8022197e:	c7 45 fc 00 00 00 00 	movl   $0x0,-0x4(%rbp)
In particular, `mov %rsp,%rbp` followed by `mov %rdi,-0x18(%rbp)`, with no stack space allocated before writing below RSP, suggests to me that this code is being built with the red zone enabled (see https://en.wikipedia.org/wiki/Red_zone_(computing) ). Code running in the kernel has to be built with the red zone off, because an interrupt or exception taken on the current stack pushes its frame directly below RSP and overwrites whatever is stored there. GCC enables the red zone by default as it is part of the x86_64 Linux System V ABI. The red zone applies to x86_64 code running in user space, but not to x86 (it is not part of the i386 System V ABI). In build.sh add `-mno-red-zone` to CFLAGS, and add the same flag to the CFLAGS in any other build scripts for code that runs inside the kernel. I didn't check the other build directories you have.
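
To illustrate why this matters, here is a toy model in plain C++ rather than real kernel code. It assumes the fault is taken at the same privilege level without an IST stack, so the CPU pushes its 48-byte frame (SS, RSP, RFLAGS, CS, RIP, error code) at the interrupted RSP. `ToyStack` and both functions are hypothetical; the value 3 and the 0x20-byte reservation are arbitrary, and the 0xEE fill merely marks the clobbered bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy model of a downward-growing stack; all names are hypothetical.
struct ToyStack {
    uint8_t memory[256] = {};
    uint64_t rsp = 128;  // index of the current stack top, 16-byte aligned

    void store32(uint64_t addr, uint32_t value) {
        std::memcpy(&memory[addr], &value, sizeof value);
    }
    uint32_t load32(uint64_t addr) const {
        uint32_t value;
        std::memcpy(&value, &memory[addr], sizeof value);
        return value;
    }

    // Models an exception taken on the current stack: in 64-bit mode the CPU
    // aligns RSP down to 16 bytes, then pushes 6 x 8 bytes below it.
    // IRETQ later restores RSP, so rsp itself is left unchanged here.
    void exceptionAndReturn() {
        const uint64_t frameTop = rsp & ~uint64_t{15};
        std::memset(&memory[frameTop - 48], 0xEE, 48);
    }
};

// Leaf function built with the red zone: the local lives below RSP.
uint32_t counterWithRedZone(ToyStack& s) {
    const uint64_t rbp = s.rsp;   // mov %rsp,%rbp with no sub from %rsp
    s.store32(rbp - 4, 3);        // counter at -0x4(%rbp)
    s.exceptionAndReturn();       // fault frame lands on top of the local
    return s.load32(rbp - 4);
}

// Same function as -mno-red-zone would shape it: space is reserved first.
uint32_t counterWithoutRedZone(ToyStack& s) {
    const uint64_t rbp = s.rsp;
    s.rsp -= 0x20;                // sub from %rsp before touching locals
    s.store32(rbp - 4, 3);        // local is now above RSP
    s.exceptionAndReturn();       // fault frame lands below the local
    const uint32_t value = s.load32(rbp - 4);
    s.rsp = rbp;                  // release the reservation
    return value;
}
```

In this model `counterWithRedZone` reads back 0xEEEEEEEE instead of 3, while `counterWithoutRedZone` still reads 3. In user space the same fault would switch to a kernel stack, which is why the red zone is safe there.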

As an example in kernel/build.sh https://github.com/maxdev1/ghost/blob/8 ... ild.sh#L26 you have:

Code: Select all

CFLAGS="-std=c++11 -D_GHOST_KERNEL_=1 -Wall -Wno-unused-but-set-variable -ffreestanding -fno-exceptions -fno-rtti"
Change to

Code: Select all

CFLAGS="-std=c++11 -D_GHOST_KERNEL_=1 -Wall -Wno-unused-but-set-variable -ffreestanding -fno-exceptions -fno-rtti -mno-red-zone"
I haven't even attempted to build your kernel but this is something major to fix first as it could be causing you issues. I don't know if it is related to your current problem or not.

Re: Problem with on-demand page mapping on x86_64; instruction restart after page fault?

Post by max »

Wow Michael, that was quick. Thanks so much for the fast response. I'll try it first thing tomorrow morning, but I've got to sleep now, it's 2am here lol. I wasn't aware of the red zone issue since I was previously only working on x86. Porting to x86_64 still has a few new pitfalls!

Re: Problem with on-demand page mapping on x86_64 (Solved: red zone issue)

Post by max »

That solved it. Thank you so much Michael. I should post here more often instead of debugging for hours :mrgreen: but I learned something new indeed. Thanks a lot.

Code: Select all

     pagefault  accessed 0x0000000000a63abc from 0xffffffff80221bc3
     pagefault  ondemand mapped: 0x0000000000a63000
                  0x0000000000a63abc: 0x6c 0x69 0x62 0x63 0x2e 0x73 0x6f 0x0 libc.so 
                  0x0000000000a63ac4: 0x6c 0x69 0x62 0x67 0x63 0x63 0x5f 0x73 libgcc_s
                  0x0000000000a63acc: 0x2e 0x73 0x6f 0x2e 0x31 0x0 0x47 0x43 .so.1 GC
                  stringLength first time: 7
                  stringLength second time: 7
                  0x0000000000a63abc: 0x6c 0x69 0x62 0x63 0x2e 0x73 0x6f 0x0 libc.so 
                  0x0000000000a63ac4: 0x6c 0x69 0x62 0x67 0x63 0x63 0x5f 0x73 libgcc_s
                  0x0000000000a63acc: 0x2e 0x73 0x6f 0x2e 0x31 0x0 0x47 0x43 .so.1 GC
           elf    created TLS master: 0xa000a000, size: 0x1000, uTO: 0x0000000000000000
sebihepp
Member
Posts: 210
Joined: Tue Aug 26, 2008 11:24 am
GitHub: https://github.com/sebihepp

Re: Problem with on-demand page mapping on x86_64 (Solved: red zone issue)

Post by sebihepp »

Just a reminder: It isn't enough to just compile your kernel with -mno-red-zone. If you use libgcc for example, you need to build it with -mno-red-zone too.

Re: Problem with on-demand page mapping on x86_64 (Solved: red zone issue)

Post by max »

sebihepp wrote: Fri Mar 21, 2025 2:50 pm Just a reminder: It isn't enough to just compile your kernel with -mno-red-zone. If you use libgcc for example, you need to build it with -mno-red-zone too.
Great tip, thanks sebihepp. I'll take it into account in my toolchain patches.