Man...this is annoying

DarylD

Post by DarylD »

I have now spent two damn weeks trying to iron out a fatal bug...the only thing is, I have no idea how it got there!!

I get all sorts of exceptions, mainly #GPF and #PF...very strange, I think it must be because of my multitasking but I have stripped my code down to only one thread again, still no joy...hmmm.

Just thought I would let you all know why I have been so quiet lately!!! (basically making no damn progress)

Daryl.
Pype.Clicker
Member
Member
Posts: 5964
Joined: Wed Oct 18, 2006 2:31 am
Location: In a galaxy, far, far away
Contact:

Re:Man...this is annoying

Post by Pype.Clicker »

Sporadic exceptions (in my own experience) were often due to mis-initialized pointers that lead to random code/data overwriting. One of the worst I ever met was a display buffer overflow (it took me about 2 weeks full-time to find it ...). I suggest you check that your memory management is correct. Enforce "= NULL" initialization of all your uninitialized pointers. Also try to have an "execution history" displayed so that you can trace what's happening when the exception arises.
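
A minimal sketch of what such an "execution history" could look like, assuming a small fixed-size ring buffer of checkpoint tags that an exception handler dumps. The names trace_log, trace_get and TRACE_SLOTS are made up for illustration and are not from anyone's kernel in this thread.

```c
#include <stddef.h>

/* Hypothetical names throughout: trace_log, trace_get, TRACE_SLOTS. */
#define TRACE_SLOTS 16

const char *trace_buf[TRACE_SLOTS];  /* the last TRACE_SLOTS tags */
unsigned trace_count;                /* total number of tags ever recorded */

/* Record a checkpoint. Only the pointer is stored, so pass string literals. */
void trace_log(const char *tag)
{
    trace_buf[trace_count % TRACE_SLOTS] = tag;
    trace_count++;
}

/* Return the tag recorded `age` steps ago (0 = most recent), or NULL if
   that entry was never written or has already been overwritten. */
const char *trace_get(unsigned age)
{
    if (age >= TRACE_SLOTS || age >= trace_count)
        return NULL;
    return trace_buf[(trace_count - 1 - age) % TRACE_SLOTS];
}
```

Calling trace_log("irq0 enter") and the like at interesting points, then walking trace_get(0), trace_get(1), ... from the #GPF/#PF handler, shows the last few steps before the fault without formatting anything on the hot path.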

The most malicious case is the stack overflow (the stack content is damaged by writing to a mis-initialized pointer %-@ ). Sometimes just adding/removing a 'print("hello");' may be sufficient to make those errors appear or disappear.

The generic advice I will give you is to find a way to reproduce the bug and then track it until you find its cause (a paper copy of the incriminated code will be your friend).

May the Source Be with You ...
Slasher

Re:Man...this is annoying

Post by Slasher »

It might be caused by pointers being updated, but the update not being completed, when a task switch occurs!
Or it could be the paging that has a bug. Have you tried turning paging off and then debugging what's left?
DarylD

Re:Man...this is annoying

Post by DarylD »

Yes, I am almost positive my stack is being trashed somehow. I *do* know it's not to do with multi-tasking swapping the stacks, as I have disabled this feature.

I *do* know it only happens when interrupts are enabled, so maybe it's a sync problem somewhere.

Very difficult to track down. At least now I can make it happen and make it not happen, but it's just not logical.

Daryl.
Slasher

Re:Man...this is annoying

Post by Slasher »

During which interrupt call does it happen? Timer, Keyboard, system call?
DarylD

Re:Man...this is annoying

Post by DarylD »

Ah..I have made an interesting discovery.

It appears my KPrint code is messing with the stack, probably something to do with va_args and all.

Serves me right for blindly implementing other people's code.

Anybody got any good full-featured vsnprintf code??

Daryl.

UPDATE: I think a lot of my called routines are mashing my passed stack....ummmm :/
DarylD

Re:Man...this is annoying

Post by DarylD »

God knows what is happening now...

Back to drawing board on a few things I think :)
Slasher

Re:Man...this is annoying

Post by Slasher »

men,
For my printf _va_list stuff I use my own system.
It will only work on 32-bit aligned stacks, which is fine for GCC, so it's okay.

What I did was, in the printf, I scan the format string (the first argument), printing the characters until I get to a '%' symbol. Then I increment a position counter, multiply it by 4 (32-bit aligned stack in GCC), and use this product as an offset from the _va_list _arg (void *) pointer to get the address on the stack of that argument of printf.

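printf(char *format,...)
{
int pos=0,off;
_va_list _arg;
_arg=&format; /* address of the first argument on the stack */
...
...
...
pos+=1; /* one more '%' seen */
off=pos*4; /* each argument takes 4 bytes on the stack */
.....
.....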

Then I check the type character that follows the '%' in a switch statement and do what is expected:

case 'd':
itoa(*(int *)(_arg+off)); /* read the int stored at that stack offset */

Well, something like that! (Nothing complicated is done to the stack, so it's easy to debug. But I must admit I'm looking for ways to improve it.)
Well, I could send you the source code; maybe you could improve it for ALL of US!
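
For comparison, here is a minimal sketch of the same scan-the-format-string idea built on the standard <stdarg.h> macros instead of hand-computed stack offsets, so the compiler decides where each argument lives. It is not the code described above. The names mini_vsnprintf and mini_snprintf are hypothetical, and only %s, %d, %X and %% are handled (no width, precision or length modifiers).

```c
#include <stdarg.h>
#include <stddef.h>

/* Store c if it fits (one byte is kept for the terminator); always count it. */
static void put_ch(char *buf, size_t size, size_t *len, char c)
{
    if (*len + 1 < size)
        buf[*len] = c;
    (*len)++;
}

/* Emit v in base 10 or 16 (upper-case hex digits). */
static void put_uint(char *buf, size_t size, size_t *len, unsigned v, unsigned base)
{
    char tmp[16];
    int n = 0;
    do {
        tmp[n++] = "0123456789ABCDEF"[v % base];
        v /= base;
    } while (v != 0);
    while (n > 0)
        put_ch(buf, size, len, tmp[--n]);
}

/* Returns the length the full output would have, like vsnprintf does. */
int mini_vsnprintf(char *buf, size_t size, const char *fmt, va_list ap)
{
    size_t len = 0;
    for (; *fmt; fmt++) {
        if (*fmt != '%') {
            put_ch(buf, size, &len, *fmt);
            continue;
        }
        fmt++;                                  /* look at the conversion character */
        switch (*fmt) {
        case 's': {
            const char *s = va_arg(ap, const char *);
            if (!s)
                s = "(null)";
            while (*s)
                put_ch(buf, size, &len, *s++);
            break;
        }
        case 'd': {
            int d = va_arg(ap, int);
            unsigned u = (unsigned)d;
            if (d < 0) {
                put_ch(buf, size, &len, '-');
                u = 0u - u;                     /* magnitude, safe for INT_MIN */
            }
            put_uint(buf, size, &len, u, 10);
            break;
        }
        case 'X':
            put_uint(buf, size, &len, va_arg(ap, unsigned), 16);
            break;
        case '%':
            put_ch(buf, size, &len, '%');
            break;
        case '\0':
            fmt--;                              /* lone '%' at the end: drop it and stop */
            break;
        default:                                /* unknown conversion: echo it */
            put_ch(buf, size, &len, '%');
            put_ch(buf, size, &len, *fmt);
            break;
        }
    }
    if (size > 0)
        buf[len < size ? len : size - 1] = '\0';
    return (int)len;
}

int mini_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = mini_vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}
```

Because it only writes into the caller's buffer, a routine like this can be exercised in a normal user-space program first and moved into the kernel afterwards.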
DarylD

Re:Man...this is annoying

Post by DarylD »

Well...as you can see from the following code snippet and attached resulting output, something is wrong!

The code is as follows:

KPrint( "MODULES", "[%s]-[%s] Call found = 0x%X\n", pszModule, pszCall, g_sKernel.m_psCallStack->eax );
KPrint( "MODULES", "[%s]-[%s] Call found = 0x%X\n", pszModule, pszCall, g_sKernel.m_psCallStack->eax );
KPrint( "MODULES", "[%s]-[%s] Call found = 0x%X\n", pszModule, pszCall, g_sKernel.m_psCallStack->eax );
This should print three identical rows, but that's not the case. Specifically, the last number is changing, which is the call stack (i.e. the stack used during task switching etc.; it should point to all the saved registers on the interrupt call).

Also notice that NULL appears; this suggests the call stack is getting trashed somewhere. Also, of the three lines for each type, the first line has the correct call code and the other two are wrong.

I can only assume some function is buggering things up, maybe the KPrint function.

Daryl.

PS. The call stack obviously exists on the stack if you were wondering!!

[attachment deleted by admin]
DarylD

Re:Man...this is annoying

Post by DarylD »

Incidentally, if I have an error in my printf routine, isn't this a bit like that Schrodinger's (sp?) cat thing, you know, where trying to observe the results affects the results themselves! After all, this is what could be happening!

Or maybe I was asleep for all those years in Physics class :)

Daryl.
Pype.Clicker

Re:Man...this is annoying

Post by Pype.Clicker »

In case you have a bug in your printing component, what I suggest is to debug it under Linux using DDD ... because printing is all you have for debugging within your own kernel. I've been facing that kind of bug too (trying to print out the display buffer's content ... how foolish I was :)
DarylD

Re:Man...this is annoying

Post by DarylD »

I just noticed that 0x41 and 0x4E are ascii codes for letters (upper case specifically)...this may be a clue!

I can't use DDD, Pype. I normally use the Bochs debugger for things like this...it's helpful.

Daryl.
Pype.Clicker

Re:Man...this is annoying

Post by Pype.Clicker »

What is g_sKernel.m_psCallStack->eax?
Could you possibly have mis-initialized it so that it now points to some side-effect area?
DarylD

Re:Man...this is annoying

Post by DarylD »

No, I am damn sure that g_sKernel.m_psCallStack points to the right place. I tried using a line such as follows:

s_callstack_t *psCallStack = g_sKernel.m_psCallStack;

Then referencing eax as psCallStack->eax, but this actually didn't seem to make much difference; in fact it gave different results, but I think that is related to the same bug I have now.

I am coming to the conclusion that somehow data is being written to the stack by a function, hence the 0x41/0x43/0x4E and 0x51 codes, which are capital letters.
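
One rough way to flag that kind of value automatically is a small helper that checks whether a 32-bit word consists only of printable ASCII bytes and zero padding. This is just a heuristic sketch; looks_like_text is a made-up name, and real pointers or counters can occasionally pass the test too.

```c
#include <stdint.h>

/* Heuristic (hypothetical helper): returns 1 if every non-zero byte of v is
   printable ASCII (0x20..0x7E) and at least one byte is non-zero, else 0. */
int looks_like_text(uint32_t v)
{
    int seen = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(v >> (8 * i));
        if (b == 0)
            continue;               /* zero padding is allowed */
        if (b < 0x20 || b > 0x7E)
            return 0;               /* control or high byte: not text */
        seen = 1;
    }
    return seen;
}
```

A saved register that should hold a call code or a kernel address but passes this check (0x41 is 'A', 0x4E is 'N') is a hint that some string routine wrote over it.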

These sorts of problems are certainly interesting trying to solve!

Daryl.
Slasher

Re:Man...this is annoying

Post by Slasher »

The only way to be sure is to trace it :'( You could disable every single process/task that is running and then enable them one at a time and look at what each does to the stack.
It could be that the functions are fine and the code is clean, but the sequence in which they are called is what is causing the problem. I had functions insert_ready_task, delete_ready_task, switch_task and schedule_task which on their own worked fine (hand tested them WELL), but each depended on the running_task pointer. I had placed delete_ready_task before schedule_task in a function, but schedule_task needs the running_task pointer to point to the currently running task in order to get at the next task to run. As I had already deleted the running_task from the ready_list when I called delete_ready_task, schedule_task was getting a NULL pointer.
Things like this are HARD to catch!
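
A stripped-down sketch of that ordering problem, assuming the ready list is a circular singly linked list. The struct layout and retire_running_task are invented for illustration; only running_task, schedule_task and delete_ready_task are names from the description above, and their bodies here are guesses.

```c
#include <stddef.h>

struct task {
    struct task *next;   /* next task in the circular ready list */
    int id;
};

struct task *running_task;

/* Pick the successor of the running task; NULL if it is no longer linked. */
struct task *schedule_task(void)
{
    return running_task ? running_task->next : NULL;
}

/* Unlink t from the circular ready list and clear its link.
   Precondition: t is currently on the list. */
void delete_ready_task(struct task *t)
{
    struct task *p = t;
    while (p->next != t)            /* walk round to the predecessor */
        p = p->next;
    p->next = t->next;
    t->next = NULL;
}

/* Retire the running task: choose the successor BEFORE unlinking it. */
struct task *retire_running_task(void)
{
    struct task *next = schedule_task();
    if (next == running_task)
        next = NULL;                /* it was the only ready task */
    delete_ready_task(running_task);
    running_task = next;
    return next;
}
```

Swapping the two calls, so that delete_ready_task runs first, makes schedule_task read a link that has already been cleared and return NULL, even though each function is correct on its own.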