Figured out a major issue (Never give up I guess?)
Posted: Wed Oct 23, 2019 7:19 pm
So I'm not 100% sure this belongs in OS Development, but since it was a kernel bug I decided to post it here. This is a thread about never giving up when debugging, and about the stupid stuff that happens when you have complete control of the system and don't check your code very well. Almost two years ago I started rewriting the initialization section of my kernel because I was CONVINCED I was not setting up my block allocator and paging correctly. As I started adding functionality back into the new kernel, I revisited some of the old code to bring it over. And... I FOUND THE BUG.
It was the weirdest thing. When I ran the kernel on QEMU with 128MB of RAM, it would page fault while allocating the stack space for a new process. I was requesting a single page from the block allocator for the process's stack space. When I increased the memory in the system to 512MB, no page fault. In the end, I convinced myself that it was down to the way the kernel was linked and that my calculations for the kernel size were failing somewhere, so I set off on my two-year adventure to write a new linker map and kernel entry routine that would set up paging and the block allocator correctly. Today, while pulling pieces of the old code across, I decided to step through it and take another crack at the page fault issue, and that's when I found this little gem: block_alloc_first_free_n, listed in full further down.
When I wrote the value of startBit to the screen, it looked... odd. Yes, I know this function is hideous; I still plan to replace it. Spoiler alert: the error is on the line int startBit = (i*BLOCKS_PER_UNIT)+bit. Yeah... that SHOULD read (i*BLOCKS_PER_UNIT)+j. bit is the mask 1 << j, not the position of the bit, so startBit lands on the wrong block (for j = 5 it adds 32 instead of 5)... smh. All that work to rewrite the kernel initialization and this is what I find.
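For anyone who wants to see the mix-up in isolation, here is a minimal sketch. The two helper names are made up for illustration, and the value 32 for BLOCKS_PER_UNIT is an assumption (the listing never shows its definition); the only point is the difference between adding the mask and adding the bit position.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 32 blocks tracked per bitmap word; stands in for the
 * kernel's BLOCKS_PER_UNIT, whose real value is not shown in the post. */
#define BLOCKS_PER_UNIT 32u

/* Hypothetical helper: what the buggy line computed, i.e. the word's
 * base index plus the bit MASK (1 << j). */
static unsigned int start_from_mask(size_t i, size_t j) {
    assert(j < BLOCKS_PER_UNIT);
    unsigned int bit = 1u << j;
    return (unsigned int)(i * BLOCKS_PER_UNIT) + bit;
}

/* Hypothetical helper: what was intended, i.e. the word's base index
 * plus the bit POSITION j. */
static unsigned int start_from_index(size_t i, size_t j) {
    assert(j < BLOCKS_PER_UNIT);
    return (unsigned int)(i * BLOCKS_PER_UNIT + j);
}
```

Under that assumption, i = 2 and j = 5 should give block 69, while the mask version gives 96. The gap grows quickly with j: at j = 31 the mask version adds 0x80000000 where 31 was intended.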
So I think there are two important lessons out of this. The first is already stated very plainly in the wiki's "How to ask questions" section: don't assume that you know which section of your code is causing the problem. I spent MONTHS dumping page tables, examining memory, and pounding my head trying to figure this out because I was looking in the wrong section (I was convinced it was in the setup routines and that the issue came from improper configuration, when it was a simple logic error). The second is that sometimes, when you are stuck, taking a break from what you are working on for a while (two stinking years in my case) can help you come at the problem with fresh eyes. Not sure if this post will help anyone, but I am excited to have found my issue after all this time.
Code:
// block_alloc_first_free_n (): Locate the first free n blocks in the bitmap
// inputs: nunits - number of blocks to allocate, returns - 0 if no free blocks, nonzero otherwise
static unsigned int block_alloc_first_free_n (size_t nunits) {
    if (nunits == 0)
        return (unsigned int) NULL;
    else if (nunits == 1)
        return block_alloc_first_free ();
    printf("got here... block_alloc_first_free_n\n");
    for (size_t i = 0; i < bmp_sz; i++) {
        printf("looping...\n");
        if (m_bmp[i] != BLOCK_UNIT_FULL) {
            printf("not full: %d\n",i);
            for (size_t j = 0; j < BLOCKS_PER_UNIT; j++) {
                printf("looping 2...\n");
                int bit = 1 << j;
                if (!(m_bmp[i] & bit)) {
                    printf("i: %d, j: %d\n", i, j);
                    printf("free: 0x%x\n", bit);
                    printf("free abs: 0x%x\n", (unsigned int)((i*BLOCKS_PER_UNIT)+j));
                    int startBit = (i*BLOCKS_PER_UNIT)+bit;
                    size_t free = 0;
                    for (size_t count = 0; count <= nunits; count++) {
                        printf("looping 3... testing 0x%x\n", (startBit+count));
                        if (memory_bitmap_test(startBit+count) == 0) {
                            printf("free++\n");
                            free++;
                        }
                        if (free == nunits) {
                            return (unsigned int)((i*BLOCKS_PER_UNIT)+j);
                        }
                    }
                }
            }
        }
    }
    return 0;
}
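Since the function above is due for replacement anyway, here is one possible shape for a first-fit search, offered as a sketch and not as the actual fix that went into the kernel. Every name in it (bitmap_test, first_free_run, NO_FREE_RUN) is hypothetical, and it assumes 32-bit bitmap words where a set bit means "used", which matches how the listing tests m_bmp. As far as I can tell from the listing, the inner loop there also tests nunits+1 blocks (count <= nunits) and counts free blocks without requiring them to be adjacent; the sketch sidesteps both by tracking the length of the current free run over a flat block index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one uint32_t word tracks 32 blocks, set bit = used. */
#define BLOCKS_PER_UNIT 32u
/* Hypothetical sentinel so that block 0 remains a valid result. */
#define NO_FREE_RUN ((size_t)-1)

/* Return nonzero if block `idx` is marked used in the bitmap. */
static int bitmap_test(const uint32_t *bmp, size_t idx) {
    return (bmp[idx / BLOCKS_PER_UNIT] >> (idx % BLOCKS_PER_UNIT)) & 1u;
}

/* First-fit search: index of the first run of n consecutive free blocks,
 * or NO_FREE_RUN if there is none. total_blocks bounds the scan so the
 * search never reads past the end of the bitmap. */
static size_t first_free_run(const uint32_t *bmp, size_t total_blocks, size_t n) {
    assert(bmp != NULL);
    if (n == 0 || n > total_blocks)
        return NO_FREE_RUN;

    size_t run = 0; /* length of the free run ending at the current block */
    for (size_t idx = 0; idx < total_blocks; idx++) {
        if (bitmap_test(bmp, idx)) {
            run = 0;            /* a used block breaks the run */
        } else if (++run == n) {
            return idx + 1 - n; /* first block of the run */
        }
    }
    return NO_FREE_RUN;
}
```

Working on a flat block index means there is no separate word index and bit mask to confuse, and a run that straddles two bitmap words is found without any special casing. The cost is testing one bit per iteration; skipping words that are completely full, as the original does with BLOCK_UNIT_FULL, would be an easy optimization to add back.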