OSDev.org

Posted: **Sun Dec 28, 2014 10:04 am**

Hi,

I have a problem with my code when calling a specific method of a class.

This happens when creating a new process. Here is the function:


int create_process (routine _startRoutine)
{
    asm volatile("cli");
    uint64_t _rip = read_rip();
// The child comes back here later and jumps to the else section
    uint8_t core_id = Utils::getAPICId();
    if ( coreManager->getCurrentKernelPageManager(core_id)->getParentKernelPageManager() == NULL)
    {
        // Process initialization
        asm volatile("sti");
        return 0 ;
    }
    else
    {
         routine _threadRoutine = coreManager->getCurrentKernelPageManager(core_id)->getStartRoutine();

//  When I use routine _threadRoutine = coreManager->getCurrentKernelPageManager()->getStartRoutine() everything works fine.
        _threadRoutine();
        return 100;

    }
}

Okay, the process initialization goes well and exits back to the main program.

The problem occurs in the else section when the scheduler runs the child process. Specifically, the first function call, shown below, triggers a page fault:

routine _threadRoutine = coreManager->getCurrentKernelPageManager(core_id)->getStartRoutine();

At the beginning I could not identify the cause of the problem (and I still can't), but after trying different things I realized that the problem occurs when I call this method with a parameter.

So I added another method. Below are the two methods in CoreManager that I switch between:

KernelPageManager * CoreManager::getCurrentKernelPageManager (uint16_t core_id)
{
    if ( core_id < core_count)
        return cores[core_id]->getCurrentKernelPageManager();
    return NULL;

}
KernelPageManager * CoreManager::getCurrentKernelPageManager ()
{
    uint16_t core_id = Utils::getAPICId();
    if ( core_id < core_count)
        return cores[core_id]->getCurrentKernelPageManager();
    return NULL;

}

So they basically do the same thing: each calls a method on another object, stored in an array and selected by the core_id of the processor running this piece of code.
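One way to make that equivalence explicit is to let the parameterless overload delegate to the parameterized one. The following is only a minimal sketch: `KernelPageManager`, `Core`, and `Utils::getAPICId` are hypothetical stand-ins stubbed out so the example is self-contained, and the fixed `core_count` of 2 is an assumption, not the real kernel's layout.

```cpp
#include <cassert>
#include <cstdint>

struct KernelPageManager {};  // hypothetical stand-in for the real class

struct Core {                 // hypothetical stand-in for the real class
    KernelPageManager manager;
    KernelPageManager *getCurrentKernelPageManager() { return &manager; }
};

namespace Utils {
// Stub: a real kernel would read the local APIC ID of the running core.
inline std::uint8_t getAPICId() { return 1; }
}

class CoreManager {
public:
    // Parameterized lookup with a bounds check, as in the post.
    KernelPageManager *getCurrentKernelPageManager(std::uint16_t core_id) {
        if (core_id < core_count) {
            assert(cores[core_id] != nullptr);
            return cores[core_id]->getCurrentKernelPageManager();
        }
        return nullptr;
    }

    // Parameterless lookup: same logic, applied to the current core.
    KernelPageManager *getCurrentKernelPageManager() {
        return getCurrentKernelPageManager(Utils::getAPICId());
    }

private:
    static constexpr std::uint16_t core_count = 2;  // assumed for the sketch
    Core storage[core_count];
    Core *cores[core_count] = {&storage[0], &storage[1]};
};
```

With a single code path, any difference in behavior between the two calls has to come from the call site or its surrounding state, not from the method bodies.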

When I call the first method (the one with the parameter) I get the page fault, while if I use the second one, it goes through. I also tried emptying the first method completely, so that it just returns NULL as below, and the problem still occurs at the call.

KernelPageManager * CoreManager::getCurrentKernelPageManager (uint16_t core_id)
{
    return NULL;
}

I suspect that it has to do with the stack or rbp. I use a clean stack before jumping to the code location of the child, and set rsp and rbp to the start of the stack (the higher base address). The child never returns, as it enters an infinite loop, so I don't need to copy the parent's stack contents (maybe I will in the future), but I don't think I have reached a point where this could be the problem.

Here is the objdump output for the faulty part:

    }
    else
    {
         routine _threadRoutine = coreManager->getCurrentKernelPageManager(core_id)->getStartRoutine();
   17c9b:       89 de                   mov    %ebx,%esi
   17c9d:       48 b8 20 20 02 00 00    movabs $0x22020,%rax
   17ca4:       00 00 00 
   17ca7:       48 8b 38                mov    (%rax),%rdi
   17caa:       ff d5                   callq  *%rbp
   17cac:       48 89 c7                mov    %rax,%rdi
   17caf:       48 b8 90 37 01 00 00    movabs $0x13790,%rax
   17cb6:       00 00 00 
   17cb9:       ff d0                   callq  *%rax
        _threadRoutine();
   17cbb:       ff d0                   callq  *%rax
        return 100;
   17cbd:       b8 64 00 00 00          mov    $0x64,%eax

    }
}

Thanks,
Karim.

Posted: **Sun Dec 28, 2014 10:48 am**

kemosparc wrote:
    uint64_t _rip = read_rip();

Oh no, no! This is pretty bad! You should never do this! You don't know what state the stack and registers are in. The compiler can easily generate code that becomes invalid once the new process pops into the mix. There are various workarounds for this. One of them is a GCC-specific extension, the unary && operator, which takes the address of a label:

someLabel:
  // ...
  uint64_t _rip = (uint64_t)&&someLabel;
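
As a rough illustration of that extension, here is a minimal sketch, assuming GCC or Clang (the `&&label` operator and `goto *ptr` are not standard C++). The function name `jump_via_label_address` and the label `resume_point` are made up for the example. The GCC manual only supports jumping to such an address from within the same function, so this shows the operator itself rather than a complete way to resume a new process.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical demo: takes a label's address and jumps to it within the
// same function. Returns 1 when the label was reached via the computed goto.
int jump_via_label_address() {
    int reached = 0;
    void *target = &&resume_point;  // GNU extension: address of a label

    // The same address viewed as an integer, as in the snippet above.
    std::uintptr_t rip = reinterpret_cast<std::uintptr_t>(target);
    assert(rip != 0);

    goto *target;    // GNU extension: computed goto
    reached = -100;  // skipped by the jump
resume_point:
    reached += 1;
    return reached;
}
```

Because the label is resolved by the compiler and linker, its address always falls on an instruction boundary, unlike a value read back at run time.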

Also, WTF is this?

kemosparc wrote:
   17caa:       ff d5                   callq  *%rbp

I know it's possible (and sometimes improves performance), but it makes debugging and stack tracing harder. What happens? You're using %rbp as a normal register! There was a flag for doing this...

Posted: **Sun Dec 28, 2014 11:04 am**

Thanks a lot for the quick reply.

Okay, I will fix the read_rip, although I doubt it is the cause.

But I am more interested in your last comment.

I did not write this code by hand; it comes from the objdump output. So it is not my doing, it is the compiler's.

You said

There was a flag for doing this ...

Can you please elaborate on that?

Thanks,
Karim.

Posted: **Sun Dec 28, 2014 11:17 am**

kemosparc wrote:
You said

KemyLand wrote:
There was a flag for doing this ...

Can you please elaborate on that?

I think you misunderstood me. What I meant is that there is a flag for treating %rbp as a GPR. A quick Google search suggests it is -fomit-frame-pointer. Are you using it?

Posted: **Sun Dec 28, 2014 11:25 am**

No, I am not using this flag.

Here are the flags I am using in my Makefile:


ELF_FLAGS= -Ttext 0x10000  -T linker.ld -ffreestanding -O2 -nostdlib
GCC_FLAGS= -ffreestanding -O2 -Wall -Wextra  -fno-exceptions -fno-rtti -fno-builtin -fno-stack-protector -mcmodel=large -mno-red-zone -mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-3dnow -Wno-type-limits

Anything that I should reconsider?

Thanks
Karim.

Posted: **Sun Dec 28, 2014 11:37 am**

kemosparc wrote:
-fno-stack-protector -Wno-type-limits -mno-red-zone

Why do you use them? Stack protection is great for debugging, and the red zone is mandated by the x86-64 ABI! The type-limits warnings can appear if you don't do the appropriate casting, which can lead to bugs. NEVER disable any warning! You also need -Werror. This will help you write correct code, as you won't be tempted to ignore warnings!

You're using -Ttext 0x10000 and -T linker.ld. They don't mix!

Posted: **Sun Dec 28, 2014 11:38 am**

Obviously it was set by default.

I have added -fno-omit-frame-pointer, and objdump now shows the following:

    }
    else
    {
        routine _threadRoutine = coreManager->getCurrentKernelPageManager(core_id)->getStartRoutine();
   1821d:       89 de                   mov    %ebx,%esi
   1821f:       48 b8 20 20 02 00 00    movabs $0x22020,%rax
   18226:       00 00 00 
   18229:       48 8b 38                mov    (%rax),%rdi
   1822c:       41 ff d4                callq  *%r12
   1822f:       48 89 c7                mov    %rax,%rdi
   18232:       48 b8 20 3b 01 00 00    movabs $0x13b20,%rax
   18239:       00 00 00 
   1823c:       ff d0                   callq  *%rax
        _threadRoutine();
   1823e:       ff d0                   callq  *%rax
        return 100;
   18240:       b8 64 00 00 00          mov    $0x64,%eax

    }

One very important thing that I have reported incorrectly: THE EXCEPTION IS "INVALID OPCODE", not page fault.

Thanks
Karim.

Posted: **Sun Dec 28, 2014 11:45 am**

Okay,

I removed the switches and it still behaves the same.

I don't understand why it fails at this specific location. The same method is called in the if condition preceding it and causes no problems.

Thanks
Karim.

Posted: **Sun Dec 28, 2014 12:27 pm**

kemosparc wrote:
One very important thing that I have reported incorrectly: THE EXCEPTION IS "INVALID OPCODE", not page fault.

That's right; that was a really important detail you were missing! A #UD (Invalid Opcode Exception) is even worse than a #PF in this case. It probably means that a misalignment occurred while retrieving RIP, and thus you're executing at a position that is not the start of an instruction, but rather the middle of one! For example, take your code:

   1821d:       89 de                   mov    %ebx,%esi
   1821f:       48 b8 20 20 02 00 00    movabs $0x22020,%rax
   18226:       00 00 00
   18229:       48 8b 38                mov    (%rax),%rdi
   1822c:       41 ff d4                callq  *%r12
   1822f:       48 89 c7                mov    %rax,%rdi
   18232:       48 b8 20 3b 01 00 00    movabs $0x13b20,%rax
   18239:       00 00 00
   1823c:       ff d0                   callq  *%rax
        _threadRoutine();
   1823e:       ff d0                   callq  *%rax
        return 100;
   18240:       b8 64 00 00 00          mov    $0x64,%eax

If the loaded RIP is at 1821e (I know it's not, but it is an example), the processor will decode garbage and can hit an invalid opcode, because that address is not the start of any instruction but the inside of one. Please debug and tell us the RIP that you get when using &&someLabel, and then the code that's at that address. BTW, are you using a bootloader other than GRUB? You're below the first MiB!
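
To make the boundary argument concrete, here is a small sketch that checks whether an address is the start of an instruction in the listing above. The addresses are copied from that objdump output; the names `kInstructionStarts` and `is_instruction_start` are made up for this example, and a real debugging session would compare against the full disassembly instead.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Instruction start addresses taken from the objdump listing (ascending).
constexpr std::array<std::uint64_t, 9> kInstructionStarts = {
    0x1821d, 0x1821f, 0x18229, 0x1822c, 0x1822f,
    0x18232, 0x1823c, 0x1823e, 0x18240};

// Returns true only if rip points at the first byte of a listed instruction.
bool is_instruction_start(std::uint64_t rip) {
    // binary_search requires a sorted range.
    assert(std::is_sorted(kInstructionStarts.begin(), kInstructionStarts.end()));
    return std::binary_search(kInstructionStarts.begin(),
                              kInstructionStarts.end(), rip);
}
```

For instance, `is_instruction_start(0x1821f)` is true, while `is_instruction_start(0x1821e)` is false because 1821e is the second byte of the `mov %ebx,%esi` at 1821d.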

Anyway, the -fomit-frame-pointer stuff was just a hint. It won't solve the problem.

Posted: **Sun Dec 28, 2014 12:47 pm**

Okay,

Just for the sake of anyone who has the same problem or reads this post:

The problem was caused by the read_rip call, which KemyLand pointed out can create bugs.

I have removed it and replaced it as KemyLand suggested, and the problem is fixed.

KemyLand, you are a star. I have been stuck on this for the past 2 days.

Thanks a lot for your help.

I will now scrap my OS and start a clean, well-organized one.

Thanks a lot again.
Karim.

Posted: **Sun Dec 28, 2014 12:56 pm**

Sorry I replied before seeing your reply.

Yes, most probably the read_rip causes this problem.

Answering your question about the bootloader: yes, I use my own, and I am not loading my kernel in the higher half.

Thanks,
Karim.

Posted: **Thu Jan 01, 2015 3:40 am**

Okay, I have to chime in to prevent serious problems caused by KemyLand's own misunderstandings.

-fno-stack-protector

Perfectly fine to get started with initially, as the stack protector requires runtime support (and is enabled by default on the latest GCCs). You will probably want to change this to -fstack-protector once you have at least some basics running, which you seem to have already. There are good examples on the wiki on getting this set up.

-mno-red-zone

Contrary to what KemyLand claims, this is MANDATORY in kernel land. The red zone occupies stack space under the stack pointer and gets overwritten whenever an interrupt writes to the same stack. This doesn't happen in userland because an interrupt there also causes a stack switch, which is not the case by default in ring 0.
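
The overwrite can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: `SimStack` and both helper functions are made up for the example, the pushed values are placeholders, and the real CPU's stack alignment on interrupt entry is ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of a downward-growing stack. Each slot is one 8-byte word;
// a higher index means a higher address.
struct SimStack {
    std::array<std::uint64_t, 32> slots{};
    std::size_t rsp = 16;  // index of the slot the stack pointer refers to

    // Models a same-privilege interrupt: the CPU pushes SS, RSP, RFLAGS,
    // CS and RIP onto the current stack. rsp itself is left unchanged,
    // as it would be once the handler has returned.
    void interrupt() {
        assert(rsp >= 5);
        const std::uint64_t frame[5] = {0x10, 0xAAAA, 0x202, 0x08, 0xBBBB};
        std::size_t sp = rsp;
        for (std::uint64_t word : frame) {
            slots[--sp] = word;
        }
    }
};

// With the red zone, a leaf function may keep a local BELOW rsp
// without moving rsp first.
bool red_zone_local_survives() {
    SimStack s;
    s.slots[s.rsp - 1] = 0x1234;  // local stored in the red zone
    s.interrupt();
    return s.slots[s.rsp - 1] == 0x1234;
}

// With -mno-red-zone, the function lowers rsp before storing the local.
bool reserved_local_survives() {
    SimStack s;
    s.rsp -= 1;                   // models: sub $8, %rsp
    s.slots[s.rsp] = 0x1234;
    s.interrupt();
    return s.slots[s.rsp] == 0x1234;
}
```

In this model the red-zone local is clobbered by the first pushed word, while the local placed after lowering rsp stays intact because the interrupt frame lands below it.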

-------

That said, there are a few things I don't trust in the first place. I don't see which compiler you have been using and this

Google wrote:In the standard/stock GCC, stack protector is off by default. However, some Linux distributions have patched GCC to turn it on by default

suggests that you violated a very important OS development tradition (see Posting Checklist).

In your previous thread you mentioned 64k limit issues, and you also just indicated that you use your own bootloader. Considering the range of symptoms, it sounds like your bootloader has serious issues that cause problems later down the line. Please do grab a copy of Bochs and debug what actually happens on which instruction.

Posted: **Thu Jan 01, 2015 3:08 pm**

Thanks a lot,

I will get Bochs and debug this further, even though my problem is solved. I do have a problem with the RAM file settings, though: although I enabled the flag in the configuration before compiling Bochs, it still does not accept memory sizes bigger than 2048 MB. I will dig into this more.

Regarding your comment about a violation, if you mean the cross compiler, I do use a cross compiler and I followed the instructions on osdev closely to build it.

Anyway, thanks for the reply and the help.

Happy new year.

Karim.

Posted: **Thu Jan 01, 2015 4:44 pm**

Combuster wrote:
-fno-stack-protector
Perfectly fine to get started with initially, as the stack protector requires runtime support (and is enabled by default on the latest GCCs). You will probably want to change this to -fstack-protector once you have at least some basics running, which you seem to have already. There are good examples on the wiki on getting this set up.

If he may already have support for stack-smash protection, why would you advise him to use -fno-stack-protector? Stack-smash protection is an excellent feature that enables us to debug more easily.

Combuster wrote:
-mno-red-zone
Contrary to what KemyLand claims, this is MANDATORY in kernel land. The red zone occupies stack space under the stack pointer and gets overwritten whenever an interrupt writes to the same stack. This doesn't happen in userland because an interrupt there also causes a stack switch, which is not the case by default in ring 0.

Seeing your points, I must accept I was wrong here.

Problem with method call that has a parameter
