OSDev.org

Posted: **Fri Dec 03, 2010 4:28 pm**

Come to think of it, what Bochs reports as "code selector load with 0" are in fact inter-device-driver calls. These are coded as call far 0001:0000nnnn, call far 0002:0000nnnn and call far 0003:0000nnnn, where nnnn is the gate number. They generate protection faults, and the fault handler patches the code with the correct call. Bochs shouldn't react to those, as the selector is 1-3, not 0 (well, it is actually only the RPL field that differs, but still).
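To make the RPL remark concrete, here is a minimal sketch of how a selector splits into its fields. It is not RDOS or Bochs code; `decode_selector`, `is_null_selector` and `selector_demo` are names made up for this illustration. On x86, the selector values 0000 through 0003 all refer to GDT entry 0 and are all treated as null selectors, which would explain why the emulator logs a null CS for selectors 1-3.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 protected-mode segment selector. */
struct selector_fields {
    uint16_t index; /* bits 15..3: descriptor table index */
    uint8_t  ti;    /* bit 2: 0 = GDT, 1 = LDT */
    uint8_t  rpl;   /* bits 1..0: requested privilege level */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = (uint16_t)(sel >> 3);
    f.ti    = (uint8_t)((sel >> 2) & 1u);
    f.rpl   = (uint8_t)(sel & 3u);
    return f;
}

/* A selector is null when it names GDT entry 0, whatever its RPL. */
int is_null_selector(uint16_t sel)
{
    return (sel & 0xFFFCu) == 0;
}

/* Selectors 1-3 differ from 0 only in the RPL bits, so all are null. */
void selector_demo(void)
{
    uint16_t sel;
    for (sel = 1; sel <= 3; sel++) {
        assert(is_null_selector(sel));
        assert(decode_selector(sel).rpl == sel);
    }
    assert(!is_null_selector(0x0008));
}
```

So a far call through 0001, 0002 or 0003 faults just like one through 0000, and only the RPL bits tell the three cases apart.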

Posted: **Fri Dec 03, 2010 4:37 pm**

If you know what the exact issue with Bochs is, why not write a patch to fix it? It seems like a simple change, and would help both you and the Bochs devs.

Posted: **Fri Dec 03, 2010 4:38 pm**

Brendan wrote: Hi,

I just can't help wondering if the number of bugs created while implementing a multi-core monitor/debugger will be more or less than the number of bugs you hope to fix with the multi-core monitor/debugger.

Since it is a standalone application that does nothing until invoked, it shouldn't add any bugs.

Brendan wrote: Then comes the question of debugging the bugs in the multi-core monitor/debugger - won't you need a debugger debugger for that, and will it result in infinite recursion (e.g. a debugger debugger debugger .... debugger)?

The chicken-and-egg problem? No, it won't. I will develop and debug the monitor as a normal thread in the (already existing) kernel debugger. I just need to make sure that the monitor does not use any of the multitasking APIs, either directly or indirectly. This means I need to add a new PS/2 keyboard driver that does not use a signal to wake up a thread waiting for keyboard input. Instead, the monitor should just loop, checking for new input, and the keyboard ISR will just update a keyboard buffer, much like how the BIOS version works.
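The polling arrangement can be sketched as a small single-producer, single-consumer ring buffer. This is only an illustration under assumptions, not the RDOS driver: `kbd_buffer`, `kbd_init`, `kbd_put`, `kbd_poll` and the buffer size are hypothetical, and reading the scan code from the keyboard controller is left out. The ISR only ever calls `kbd_put`, the monitor loop only ever calls `kbd_poll`, and neither side blocks or touches the scheduler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define KBD_BUF_SIZE 16u /* hypothetical size; a power of two so wrap is a mask */
static_assert((KBD_BUF_SIZE & (KBD_BUF_SIZE - 1u)) == 0,
              "KBD_BUF_SIZE must be a power of two");

struct kbd_buffer {
    uint8_t     data[KBD_BUF_SIZE];
    atomic_uint head; /* advanced only by the ISR (producer) */
    atomic_uint tail; /* advanced only by the monitor loop (consumer) */
};

void kbd_init(struct kbd_buffer *b)
{
    atomic_init(&b->head, 0u);
    atomic_init(&b->tail, 0u);
}

/* ISR side: store one scan code. Returns 0 and drops it if the buffer is full. */
int kbd_put(struct kbd_buffer *b, uint8_t scancode)
{
    unsigned head = atomic_load_explicit(&b->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&b->tail, memory_order_acquire);
    if (head - tail == KBD_BUF_SIZE)
        return 0;
    b->data[head & (KBD_BUF_SIZE - 1u)] = scancode;
    atomic_store_explicit(&b->head, head + 1u, memory_order_release);
    return 1;
}

/* Monitor side: fetch one scan code without waiting. Returns 0 if none is pending. */
int kbd_poll(struct kbd_buffer *b, uint8_t *scancode)
{
    unsigned tail = atomic_load_explicit(&b->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&b->head, memory_order_acquire);
    if (head == tail)
        return 0;
    *scancode = b->data[tail & (KBD_BUF_SIZE - 1u)];
    atomic_store_explicit(&b->tail, tail + 1u, memory_order_release);
    return 1;
}
```

The monitor's main loop would then simply call `kbd_poll` repeatedly and act on whatever scan codes come back.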

And the kernel debugger was originally developed partly with the help of a processor simulator.

EDIT: And then the code in the processor simulator was translated to an RDOS device-driver so that faulting instructions could be emulated in RDOS (a feature needed in the "DOS-days" when the first linear page needed to be emulated), and it is still used when running V86 code for mode switching the video card.

Posted: **Fri Dec 03, 2010 4:40 pm**

Regarding the selector 0 bug:

Bochs complains because in almost every other case, the very thing you are doing is wrong - it does not mean it can not continue past it. If you don't abort the simulation but rather tell it to continue past it, it will handle the exception like it should.

Posted: **Fri Dec 03, 2010 4:48 pm**

Combuster wrote: Regarding the selector 0 bug:

Bochs complains because in almost every other case, the very thing you are doing is wrong - it does not mean it can not continue past it. If you don't abort the simulation but rather tell it to continue past it, it will handle the exception like it should.

Well, it doesn't abort the simulation with the standard settings at least. It is just that I wondered why it logged all those things, but now I know why.

The reason for this construction has to do with debuggers. These things are also inserted into flat user applications, and in order for a debugger to set the breakpoint for pacing at the correct location, this was a good design. The debugger knows where a far call ends, but it doesn't know where an invalid instruction ends (which is what I used before). The debugger also knows that it can either trace or pace a call, which it doesn't know about invalid instructions either.

Posted: **Fri Dec 03, 2010 4:52 pm**

Hi,

Looks like it'd be extremely easy to change Bochs to remove the "call_protected: CS selector null" error message. See line 43 in "bochs/cpu/call_far.cc".

Cheers,

Brendan

Posted: **Fri Dec 03, 2010 4:53 pm**

NickJohnson wrote: If you know what the exact issue with Bochs is, why not write a patch to fix it? It seems like a simple change, and would help both you and the Bochs devs.

I think I know what the issue is (Bochs does not return the correct current count in PIT channel 2), but I don't know the Bochs source code, so I cannot write a patch for it. If I might guess, Bochs does not do this because channel 2 of the PIT is normally the speaker, but RDOS turns off the speaker and configures channel 2 as a free-running counter, and then uses it to keep real time. If an APIC timer and a timestamp counter are available, it will use those to keep real time and implement timers instead.

To fix this, Bochs could provide emulation support on channel 2 if the channel is running and the speaker is off.
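To show why the current count matters, here is a rough sketch of keeping time from a free-running 16-bit down counter. It is not the RDOS code. It assumes the channel runs in a mode that decrements by one per input clock (mode 2, for example), that the count is sampled at least once per 65536-tick wrap, and that the nominal PIT input clock of 1193182 Hz applies. The names `pit_clock`, `pit_clock_update`, `pit_clock_usec` and `pit_clock_demo` are hypothetical, and the port I/O that latches and reads the count is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_HZ 1193182u /* nominal PIT input clock in Hz */

struct pit_clock {
    uint16_t last_count; /* previous value read from the counter */
    uint64_t ticks;      /* total PIT ticks accumulated so far */
};

/* Fold a freshly read count into the running total. The counter counts
   down, so elapsed = last - now, taken modulo 65536 to handle the wrap. */
void pit_clock_update(struct pit_clock *c, uint16_t now_count)
{
    uint16_t elapsed = (uint16_t)(c->last_count - now_count);
    c->ticks += elapsed;
    c->last_count = now_count;
}

/* Convert accumulated ticks to microseconds; split to avoid overflow. */
uint64_t pit_clock_usec(const struct pit_clock *c)
{
    uint64_t whole = c->ticks / PIT_HZ;
    uint64_t rest  = c->ticks % PIT_HZ;
    return whole * 1000000u + rest * 1000000u / PIT_HZ;
}

void pit_clock_demo(void)
{
    struct pit_clock c = { 1000u, 0u };
    pit_clock_update(&c, 400u);   /* 600 ticks elapsed */
    pit_clock_update(&c, 65000u); /* wrapped through 0: 936 more ticks */
    assert(c.ticks == 1536u);
    pit_clock_update(&c, 65000u); /* a count that never moves adds no time */
    assert(c.ticks == 1536u);
}
```

The last step of the demo is the failure mode described above: if the emulator keeps returning the same count, the clock stops advancing.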

Posted: **Fri Dec 03, 2010 8:51 pm**

Bochs device models are a horrible mess, and very hard to figure out -- which is one reason why I am writing rebochs. Rebochs has complete modeling of PIT channel 2, but does not have full SMP support yet -- so it cannot quite help either, in this case.

rdos wrote: So Bochs cannot help me since Bochs knows nothing about the internal workings of RDOS. It doesn't know when the SMP state is incorrect. It doesn't know that something is spinning forever in order to acquire a spinlock because some other core forgot to release it (or otherwise corrupted it). It doesn't know that some core has entered sleep state in order to conserve energy in waiting for some resource to become available, but the owner never wakes it up.

I think you are selling the emulators a little short here -- there are some nice tricks (especially magic breakpoints) that can interact with your kernel if you code it in. But it is definitely a case of finding new and clever debugging techniques. Just an emulator by itself will not solve anything, of course.

Posted: **Sat Dec 04, 2010 3:01 pm**

The basics of the monitor are now in place.

First I added the PS/2 keyboard decoding code (including the ISR in the IOAPIC) to the device-driver, and checked it out in a kernel-thread. When it seemed to work, I just let the second core run the thread code instead, and after some fixes this also works. I can tell that it works by being able to toggle the Num lock and Caps lock. I can also reset the PC with CTRL-ALT-DEL, which is quite handy.

OK, so now I need to do some serious things with the keyboard codes I get. The first thing might be to just present them on the screen.

Posted: **Tue Dec 07, 2010 3:34 am**

More progress. Now I can output basically the same register log that Bochs prints on a fatal error. In addition to what Bochs shows, I'll also show the base & size of GDT and IDT, debug register settings, TR and LDT. In the next step I'll add stack-dumping, switching between cores, and a general (interactive) function to show any memory address using seg:offset32. I'll also add disassembly of the current (faulting) instruction. Another useful feature would be to show settings in the processor core block (for instance, nesting level, current thread), and global spinlock status.

Posted: **Mon Dec 20, 2010 2:43 am**

It seems fair to me that emulators have problems; software is never perfect. If it seems perfect to you, it may not be perfect to someone else. The only thing I have to add is: if you can't trust emulators, what can you trust? I mean, it's an important and easy testing platform for most of us hobby OSDevers, and it saves a lot of time. Real hardware certainly isn't perfect either: machines all seem to handle things differently. The differences may be minor, but they may be sufficient to require additional fixes in your kernel or OS. We only have to look at the mess of ATA/IDE and compliance in that region, and there are probably many other examples.

Posted: **Wed Dec 22, 2010 8:15 am**

The main issue is that rdos is a pretty mature OS, and that Bochs currently cannot run it (because of bugs in Bochs). If rdos had been developed with Bochs it would work under Bochs, but the development of rdos mostly predates the development of Bochs. In essence, because so many OS projects use Bochs for development, they also tend to compensate for problems in Bochs.

A secondary issue is that some hardware platforms (that are not SMP) have problems with "kernel panics". These are related to real hardware (two RTL8139 ethernet controllers). It is not possible to find these problems with an emulator like Bochs. Also, the problem is not that the kernel hits fatal exceptions, but rather that the internal software state in the scheduler becomes corrupt. The plan is to enter the monitor once something in the scheduler goes wrong, regardless of whether it is a multicore system or not. It would replace the simple panic device that just dumps thread states with something interactive. This could be used for debugging on real hardware.

Posted: **Thu Jan 27, 2011 4:45 am**

I've continued development on this after a break. The monitor now can handle many cores, and can switch between active cores. On my AMD with 4 cores, I use core 4 for the monitor, and the monitor can list states of the 3 other cores.

I've now added the "abort" function. It is activated by pressing "A" on the monitor keyboard, and it will send an NMI IPI to all active cores. The NMI handler will then check for reentrancy (this can happen when NMIs hit fatal errors), so that the first time the NMI handler for a core is entered, it will save the complete register state, and enter an infinite loop with interrupts disabled. This function works well in freezing the current state of cores, and inspecting what they do. Especially useful when the OS hangs on spinlocks or similar.
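The reentrancy check can be sketched roughly like this. It is an illustration only, not the RDOS handler: `core_state`, `nmi_enter` and `MAX_CORES` are hypothetical names, two registers stand in for the complete register state, and the final spin with interrupts disabled is left as a comment so the code can be tried in user space.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CORES 8 /* hypothetical upper bound on cores */

struct core_state {
    atomic_int frozen;    /* becomes 1 once this core has saved its state */
    uint32_t   saved_eip; /* stand-ins for the full register frame */
    uint32_t   saved_esp;
};

/* Static storage, so every frozen flag starts out as 0. */
static struct core_state cores[MAX_CORES];

/* Called at the top of the NMI handler for the given core.
   Returns 1 on the first entry (state saved), 0 on a re-entry, in which
   case the state captured the first time is left untouched. */
int nmi_enter(unsigned core, uint32_t eip, uint32_t esp)
{
    struct core_state *c;

    assert(core < MAX_CORES);
    c = &cores[core];

    /* The exchange returns the previous value: non-zero means re-entry. */
    if (atomic_exchange(&c->frozen, 1) != 0)
        return 0;

    c->saved_eip = eip;
    c->saved_esp = esp;
    return 1;
    /* A real handler would now spin forever with interrupts disabled. */
}
```

Because only the first entry writes the saved registers, a second NMI (or a fault inside the handler) cannot overwrite the state that the monitor will inspect.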

I've also added the "opposite" function. It will be called by an active core, and will save the current state of the core, send NMIs to all other cores, and finally enter the infinite loop. This is planted in the scheduler at known error conditions, and when fatal exceptions occur in the scheduler.

With these functions I now know a little better why the scheduler crashes on the 4 core AMD system. The AP cores seem to fail before they are scheduled to execute their first task, since TR is 0. It also seems like they query the current thread before this point, which is probably why they fault.

Posted: **Tue Mar 01, 2011 4:01 pm**

More progress. I can now view any memory region in a core (by changing selector and offset fields interactively). Disassembly of the current instruction has been added. I've also stubbed-out all the exception handlers, and replaced them with default handlers that will panic (show a screen with registers, and halt the system).

On the functionality side, I've now dropped the requirement that one core runs the debugger. Instead, the first core that faults will become the debugger, which means I now can use all cores in the CPU. I enter the monitor from the standard keyboard (with CTRL-ALT-ESC). I've also added code to disable all interrupts in the system (a requirement so the scheduler and other hardware cannot run while the monitor is active). I can even use the standard keyboard to enter the monitor which uses the standard keyboard (after some fixes), so I now only need one keyboard. I'll test this on a single-core processor tomorrow, where the monitor would come in handy.
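Picking "the first core that faults" needs some atomic election, since several cores may fault at almost the same time. A minimal sketch of one way to do it, assuming a single shared owner variable and a compare-and-swap; this is not the RDOS implementation, and `monitor_owner`, `monitor_try_claim`, `monitor_current_owner` and `NO_OWNER` are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>

#define NO_OWNER (-1)

/* Id of the core running the monitor, or NO_OWNER while nobody has claimed it. */
static atomic_int monitor_owner = NO_OWNER;

/* Returns 1 if the calling core won the race and should run the monitor,
   0 if another core already owns it and this core should just freeze. */
int monitor_try_claim(int core_id)
{
    int expected = NO_OWNER;

    assert(core_id >= 0);
    /* Succeeds only if the owner is still NO_OWNER at this instant. */
    return atomic_compare_exchange_strong(&monitor_owner, &expected, core_id);
}

int monitor_current_owner(void)
{
    return atomic_load(&monitor_owner);
}
```

Every faulting core would call `monitor_try_claim` with its own id; exactly one call returns 1, and that core becomes the debugger while the others save their state and halt.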

Viewing the call-stack for a core solved a number of issues with the monitor itself, when cores faulted before they could save their state.

I'll focus on another major issue soon: switching the video card back to a known text mode. This is kind of tricky since I cannot use the V86 monitor for this (it uses the scheduler). I actually plan to switch back to real mode, use the BIOS call to set text mode, and then switch back to the saved environment. This will be a challenge, but I think it is possible to solve. There is a need for this function as many crashes will happen from a VESA video mode, and not from text mode.

Posted: **Thu Mar 03, 2011 8:10 am**

I solved a major issue with the monitor now. On one of my PPCs, I sometimes get panics in the scheduler, especially when I run the remote debugger. With the monitor fully functional, I can now conclude that it is a double-fault in the scheduler. The reason for the double-fault is that the kernel stack pointer is 0. Double-faults are handled with a TSS; otherwise they would triple-fault when kernel stack space runs out. IOW, the kernel is out of stack space. I thought this was kind of strange, so I made a stack-trace, and it looks like this:

ss:0000 0030:C8A6      TryLockSingle
ss:0004 0030:CBD1      start_timer
ss:0024 0030:CFB1      ReloadTimer (RemoveTimer callback)
ss:002C 0A30:06CB      ehci_timer
ss:0030 0030:BFC0      ReloadTimer (LocalRemoveTimer)
ss:003A 0030:CB8C      timer_int
ss:0062 0030:636A      free_small_mem
ss:006E 0030:6854      FreeLinear
ss:0082 0630:1358      remove_buf
ss:0092 0630:2592      unlock_sector
ss:00A0 0420:264E      read_file_block
ss:00C8 05B0:2974      ReadFileListEntry
ss:00DE 05B0:2BE3      ReadFile
ss:010E 07A0:10CE      LoadPage
ss:013C 07A0:1226      load_object
ss:0176 0030:7A21      page_fault_user
ss:017E 0030:7B5A      trap_14
ss:019E 07A0:1484      Preload

ss:01DC 01B3:0050742B  Application

It starts with a page-fault in the application, then a demand load of the page, which invokes the file system, which reads sectors and frees memory. In the middle of this there is a timer interrupt, which reloads a few timers and tries to lock the scheduler, at which point the stack runs out.

I think the solution is to increase the kernel-stack as I cannot see anything wrong in this scenario.

Constructing a multi-core monitor / debugger