Intermittent bit flip in driver syscall

codyd51
Member
Posts: 77
Joined: Fri May 20, 2016 2:29 pm
Location: London, UK
GitHub: https://github.com/codyd51

Intermittent bit flip in driver syscall

Post by codyd51 »

Hi!

I am experiencing a strange issue in which adding a no-op, or changing a return type definition, makes the issue disappear. I originally suspected a race condition, but changing the function signature's return type leads me to think something else is going on.

I have the following syscall interface that device drivers use:

Code: Select all

// Block until an event is received
// An event will be either an interrupt that must be serviced, or an IPC message
// Returns true if the call returned due to an interrupt needing servicing,
// or false if the call returned due to an IPC message arriving
bool adi_event_await(uint32_t irq) {
    // ...
    tasking_block_task(driver->task, IRQ_WAIT | AMC_AWAIT_MESSAGE);
    // We're now unblocked

    task_state_t unblock_reason = task->blocked_info.unblock_reason;
    // Make sure this was an event we're expecting
    assert(unblock_reason == IRQ_AWAIT || unblock_reason == AMC_AWAIT_MESSAGE, "ADI driver awoke for unknown reason");
    return unblock_reason == IRQ_AWAIT;
}

When the issue triggers, this function returns the inverse of the correct value: "unblock_reason" indicates AMC_AWAIT_MESSAGE instead of IRQ_AWAIT.

However, if I make very slight changes to the code, the issue disappears:

* If I print "unblock_reason" before returning, the issue disappears
* If I check the PID of the running process and do a no-op, the issue disappears
* If I change the return value from "bool" to "uint32_t", the issue disappears
* If I run in a debugger, the issue disappears
* If I change the code to explicitly return true or false based on the value, instead of taking the result of the equality, the issue disappears

To be sure, I checked the assembly generated for the return statement, and it looks perfectly sane:

Code: Select all

cmp        dword [ss:ebp+var_14], 0x100
sete       al

I know this is probably some strange interaction in my system, but wanted to post here in case anyone might have an inkling of what could be going on. Thanks in advance!
pvc
Member
Posts: 201
Joined: Mon Jan 15, 2018 2:27 pm

Re: Intermittent bit flip in driver syscall

Post by pvc »

Are you sure you are not causing stack corruption somewhere? These kinds of unexplained bugs often happen because of undetected stack corruption or overflow.
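
If it is an overflow, one cheap way to catch it is a guard word at the low end of each task's kernel stack, checked on every context switch. A minimal sketch, assuming a fixed-size per-task stack; `kstack_t`, `STACK_GUARD_MAGIC` and the function names are made up for illustration and are not from your code:

```c
#include <assert.h>   // hosted stand-in for the kernel's own assert
#include <stdbool.h>
#include <stdint.h>

// Hypothetical sentinel value; any unlikely bit pattern works.
#define STACK_GUARD_MAGIC 0xDEADC0DEu

// Hypothetical per-task kernel stack. x86 stacks grow downward, so an
// overflow runs off the low end: words[0] is the first word to be hit.
typedef struct {
    uint32_t words[1024];
} kstack_t;

// Plant the guard word when the task's stack is created.
void kstack_init(kstack_t *stack) {
    stack->words[0] = STACK_GUARD_MAGIC;
}

// True while nothing has written over the guard word.
bool kstack_guard_intact(const kstack_t *stack) {
    return stack->words[0] == STACK_GUARD_MAGIC;
}

// Intended to be called from the scheduler before switching away from a task.
void kstack_check(const kstack_t *stack) {
    assert(kstack_guard_intact(stack) && "kernel stack overflowed into guard word");
}
```

Calling `kstack_check` just before switching away from a task narrows an overflow down to the time slice in which it happened. It only catches writes that reach the very bottom of the stack, so it will not notice a single corrupted frame in the middle.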

Re: Intermittent bit flip in driver syscall

Post by codyd51 »

pvc wrote: Are you sure you are not causing stack corruption somewhere? These kinds of unexplained bugs often happen because of undetected stack corruption or overflow.
Thanks! I am not sure at all -- that sounds like an awfully plausible possibility to me, but I'm not sure of an effective strategy for tracking it down.

I'll dig into this further, and wait and see if anyone has a tip on how to track something like that down.

Re: Intermittent bit flip in driver syscall

Post by codyd51 »

I enabled gcc's stack canary, and have so far audited the three obvious source locations (driver interface, IPC, scheduler) for stack allocations -- but haven't found any that are causing the issue.
PeterX
Member
Posts: 590
Joined: Fri Nov 22, 2019 5:46 am

Re: Intermittent bit flip in driver syscall

Post by PeterX »

I don't have a solution either. Just some thoughts:

1. Maybe put parentheses around the value to return? Probably won't help, but anyway...

2. Have you looked at the generated assembly code ("gcc -S" or something similar), and at the differences between the working and non-working variants? EDIT: You posted the assembly, but does it look the same for all "versions"/variants of your code? And what about the return instruction(s)?

3. Did you use an optimization flag with gcc?
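
On point 2, one thing that might be worth ruling out. This is only a guess, nothing in this thread confirms it: "sete al" writes just the low byte of EAX and leaves the upper 24 bits as they were. If whatever hands the syscall result back to the driver reads the full 32-bit register, stale upper bits could turn a "false" into a non-zero value. The sketch below only models that effect in plain C; `model_sete_al`, `read_low_byte` and `read_full_word` are hypothetical helpers, not anything from your kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Model of "sete al": only the low 8 bits of EAX are replaced,
// the upper 24 bits keep whatever was there before.
uint32_t model_sete_al(uint32_t eax_before, bool condition) {
    return (eax_before & 0xFFFFFF00u) | (condition ? 1u : 0u);
}

// A consumer that treats the result as a bool-sized value looks at AL only.
bool read_low_byte(uint32_t eax) {
    return (uint8_t)eax != 0;
}

// A consumer that treats the result as a 32-bit word sees the stale bits too.
bool read_full_word(uint32_t eax) {
    return eax != 0;
}

// The comparison is false, but EAX held unrelated data beforehand.
void stale_upper_bits_demo(void) {
    uint32_t eax = model_sete_al(0x12345600u, false);
    assert(read_low_byte(eax) == false);  // AL alone gives the right answer
    assert(read_full_word(eax) == true);  // the full register reads as "true"
}
```

If that is what is happening, it would fit the list of changes that hide the bug: a "uint32_t" return type or an explicit "return true"/"return false" would typically make the compiler write all of EAX, and a print call before the return changes what the register held beforehand. Comparing the instructions right after the "sete" in each variant should show whether EAX gets zero-extended.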