OSDev.org

Posted: **Mon Feb 08, 2021 6:15 am**

Hi!

I am experiencing a strange issue in which adding a no-op, or changing a return type definition, makes the issue disappear. I originally suspected a race condition, but changing the function signature's return type leads me to think something else is going on.

I have the following syscall interface that device drivers use:

Code:

// Block until an event is received
// An event will be either an interrupt that must be serviced, or an IPC message
// Returns true if the call returned due to an interrupt needing servicing,
// or false if the call returned due to an IPC message arriving
bool adi_event_await(uint32_t irq) {
    // ...
    tasking_block_task(driver->task, IRQ_WAIT | AMC_AWAIT_MESSAGE);
    // We're now unblocked

    task_state_t unblock_reason = task->blocked_info.unblock_reason;
    // Make sure this was an event we're expecting
    assert(unblock_reason == IRQ_AWAIT || unblock_reason == AMC_AWAIT_MESSAGE, "ADI driver awoke for unknown reason");
    return unblock_reason == IRQ_AWAIT;
}

When the issue triggers, this function returns the inverse of the correct value: "unblock_reason" indicates AMC_AWAIT_MESSAGE instead of IRQ_AWAIT.

However, if I make very slight changes to the code, the issue disappears:

* If I print "unblock_reason" before returning, the issue disappears
* If I check the PID of the running process and do a no-op, the issue disappears
* If I change the return value from "bool" to "uint32_t", the issue disappears
* If I run in a debugger, the issue disappears
* If I change the code to explicitly return true or false based on the value, instead of taking the result of the equality, the issue disappears

To be sure, I checked the assembly generated for the return statement, and it looks perfectly sane:

Code:

cmp        dword [ss:ebp+var_14], 0x100
sete       al

I know this is probably some strange interaction in my system, but wanted to post here in case anyone might have an inkling of what could be going on. Thanks in advance!

Posted: **Mon Feb 08, 2021 7:03 am**

Are you sure you are not causing stack corruption somewhere? These kinds of unexplained bugs often happen because of undetected stack corruption or overflow.

Posted: **Mon Feb 08, 2021 10:04 am**

pvc wrote: Are you sure you are not causing stack corruption somewhere? These kinds of unexplained bugs often happen because of undetected stack corruption or overflow.

Thanks! I am not sure at all -- that sounds like an awfully plausible possibility to me, but I'm not sure of an effective strategy for tracking it down.

I'll dig into this further, and wait and see if anyone has a tip on how to track something like that down.

Posted: **Mon Feb 08, 2021 10:22 am**

I enabled gcc's stack canary, and have so far audited the three obvious source locations (driver interface, IPC, scheduler) for stack allocations -- but haven't found any that are causing the issue.
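One complementary check, offered as a sketch rather than a known fix: gcc's canary catches overruns of local buffers, but not a kernel stack that simply grows past its allocation. Painting each task stack with a known pattern at creation and measuring how much of the pattern survives can show whether a stack is running out. The names below (`STACK_FILL_PATTERN`, `stack_paint`, `stack_untouched_words`, `stack_overflowed`) are hypothetical and not part of the code posted above. The sketch assumes a downward-growing stack, as on x86, and uses the standard `assert` where a kernel would use its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical marker value; any pattern unlikely to appear naturally will do.
#define STACK_FILL_PATTERN 0xC0DEC0DEu

// Fill a freshly allocated stack with the marker before the task first runs.
// stack_base is the LOWEST address of the stack region.
static void stack_paint(uint32_t *stack_base, size_t word_count) {
    assert(stack_base != NULL);
    for (size_t i = 0; i < word_count; i++) {
        stack_base[i] = STACK_FILL_PATTERN;
    }
}

// Count marker words that are still intact, scanning up from the lowest
// address. The stack grows downward, so these are the words never reached.
static size_t stack_untouched_words(const uint32_t *stack_base, size_t word_count) {
    size_t untouched = 0;
    while (untouched < word_count && stack_base[untouched] == STACK_FILL_PATTERN) {
        untouched++;
    }
    return untouched;
}

// If even the lowest word has been overwritten, the stack has likely been
// exhausted and may have spilled into whatever sits below it.
static bool stack_overflowed(const uint32_t *stack_base, size_t word_count) {
    return stack_untouched_words(stack_base, word_count) == 0;
}
```

Calling `stack_overflowed` from the scheduler on every context switch, for both the outgoing and incoming task, would be one place to hook this in. A small untouched count on a stack that has not yet overflowed is also a useful warning sign.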

Posted: **Mon Feb 08, 2021 10:33 am**

I don't have a solution either. Just some thoughts:

1. Maybe parentheses around the returned expression? Probably not helping, but anyway...

2. Have you looked at the generated assembly code ("gcc -S" or something similar), and at the differences between the working and non-working variants? EDIT: You posted the assembly, but does it look the same for all "versions"/variants of your code? And what about the return instruction(s)?

3. Did you use an optimization flag with gcc?
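On point 2, one detail that may be worth checking (an assumption, not a confirmed diagnosis): `sete al` writes only the low byte of EAX and leaves the upper 24 bits as they were. If whatever reads the syscall result treats all of EAX as the value, for example through a function pointer typed as returning a 32-bit integer, stale upper bits could make a false result look non-zero. The thread does not show the dispatcher, so this is only a possibility. The small model below uses hypothetical helper names to illustrate the effect in plain C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Model of `sete al`: store 1 or 0 in the low byte of EAX and leave the
// upper 24 bits exactly as they were before the instruction.
static uint32_t model_sete_al(uint32_t eax_before, bool zero_flag) {
    return (eax_before & 0xFFFFFF00u) | (zero_flag ? 1u : 0u);
}

// A reader that looks only at AL, as a `bool` return value intends.
static bool read_low_byte(uint32_t eax) {
    return (eax & 0xFFu) != 0;
}

// A hypothetical reader that treats all of EAX as the result.
static bool read_full_register(uint32_t eax) {
    return eax != 0;
}

// The comparison was false, but EAX held leftover data from earlier code.
static void demo_stale_upper_bits(void) {
    uint32_t eax = model_sete_al(0x12345600u, false);
    assert(read_low_byte(eax) == false);      // AL alone says "false"
    assert(read_full_register(eax) == true);  // the full register looks "true"
}
```

If this were the cause, it would fit the observation that changing the return type to `uint32_t` or adding unrelated code changes the behavior, since both alter what ends up in the rest of EAX. Comparing the full function epilogue and the call site across variants would confirm or rule it out.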

Intermittent bit flip in driver syscall
