Constructing a multi-core monitor / debugger

rdos · Post by **rdos** » Fri Mar 04, 2011 3:09 am

I modified the monitor's PS/2 keyboard handler so it no longer uses interrupts. On a few platforms the keyboard would stop working once the monitor was invoked. I thought I had compensated for every problem that can come from breaking in with active ISRs, but obviously not. I first disable all IRQs, send EOI until there are no active ISRs (recording the original ISRs for later display in the monitor), then enable the keyboard IRQ and call the handler in case there is some problem, but this doesn't seem to solve the issue. Anyway, it is more robust not to use IRQs, especially since the monitor polls for keys anyway and has nothing else to do. However, if somebody has code or a method that can re-enable the keyboard from any context (including with active ISRs and similar), I might look at this again.
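For readers who want to see what interrupt-free keyboard input looks like, here is a minimal sketch of polling the PS/2 controller. It is not the RDOS code. The ports (0x60 data, 0x64 status) and the "output buffer full" status bit are the standard PS/2 controller interface; `port_read_fn`, `ps2_poll_scancode`, `fake_inb` and `fake_push` are hypothetical names introduced here so the logic can be exercised on a host machine without real port I/O.

```c
#include <stdbool.h>
#include <stdint.h>

#define PS2_DATA_PORT          0x60  /* scancodes are read from here */
#define PS2_STATUS_PORT        0x64  /* controller status register */
#define PS2_STATUS_OUTPUT_FULL 0x01  /* bit 0: a byte is waiting at 0x60 */

/* Hypothetical hook: in a kernel this would wrap the x86 `in` instruction. */
typedef uint8_t (*port_read_fn)(uint16_t port);

/* Non-blocking poll: returns true and stores the scancode if one is
 * waiting, false otherwise. No IRQ needs to be enabled for this to work. */
static bool ps2_poll_scancode(port_read_fn inb, uint8_t *scancode)
{
    if ((inb(PS2_STATUS_PORT) & PS2_STATUS_OUTPUT_FULL) == 0)
        return false;
    *scancode = inb(PS2_DATA_PORT);
    return true;
}

/* Host-side stand-in for the controller (assumption, for testing only):
 * holds at most one pending byte. */
static uint8_t fake_pending;
static bool fake_has_byte;

static void fake_push(uint8_t b)
{
    fake_pending = b;
    fake_has_byte = true;
}

static uint8_t fake_inb(uint16_t port)
{
    if (port == PS2_STATUS_PORT)
        return fake_has_byte ? PS2_STATUS_OUTPUT_FULL : 0;
    if (port == PS2_DATA_PORT && fake_has_byte) {
        fake_has_byte = false;   /* reading the data port empties the buffer */
        return fake_pending;
    }
    return 0;
}
```

A monitor loop would simply call `ps2_poll_scancode` repeatedly, which fits the situation described above where the monitor has nothing else to do while waiting for a key.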

rdos · Post by **rdos** » Sun Mar 06, 2011 3:40 am

Because of problems with code patching (see the other thread on this issue), I redesigned the startup process so that AP cores do not use gates before they are initialized with the monitor. With this setup I hope to be able to fix the code patching issue. Currently, it doesn't always work when 3 cores start executing the same gate at the same time. Sometimes one or more cores will protection fault on the gate, while sometimes all cores pass the test.

I also have an issue with the Intel Atom processor (hyperthreading), where it seems like the NMI IPI does not work. On the 4-core AMD Athlon it works perfectly well.

I also saw the 4-core AMD running two cores in the scheduler for the first time.

rdos · Post by **rdos** » Sun Mar 06, 2011 3:15 pm

A major breakthrough. Now both the 4-core AMD and the Intel Atom (with hyperthreading) work with the monitor. It seems the Intel Atom couldn't handle NMI IPIs, but when I modified the code to first send an int 2 IPI, I can now see the state of the second core. The only drawback is that I cannot force the other core of the Atom processor into break if it has disabled interrupts.

I also have a usable trace when the gate translation fails on the AMD processor that I'll look into tomorrow.

It is a major task to handle all imaginable faults from a real environment, but it is necessary since emulators are no good for finding bugs on real hardware.

rdos · Post by **rdos** » Mon Mar 07, 2011 9:53 am

Now I have added code to reset the video-mode to standard text mode so the monitor works even if RDOS runs in graphics mode when a fault is hit. I do this by switching the processor to real mode, calling the video mode service routine, and then switching back to paged, protected mode.

Czernobyl · Post by **Czernobyl** » Mon Mar 07, 2011 1:27 pm

Hullo RDos... It may not be timely, but I happened to see your comments re. a supposed Bochs "bug", you wrote :

What Bochs calls for "code selector load with 0" in fact are inter-device driver calls. These are coded as call far 0001:0000nnnn, call far 0002:0000nnnn and call far 0003:0000nnnn, where n is the gate number. These generate protection faults that patch the code with the correct call. Bochs shouldn't react on those as the selector is 1-3, not 0 .

You may be misunderstanding, and Bochs is exactly right : the NULL selector is defined to be any of 0, 1, 2 or 3, i.e., that would point to the first GDT entry, which is reserved (unused, in fact) for exactly that purpose : having one invalid selector value for each RPL.
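To make the point about selector values concrete, here is a small sketch that decodes an x86 segment selector into its fields and tests for the null selector. The field layout (RPL in bits 0-1, table indicator in bit 2, index in bits 3-15) is the architectural one; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Requested privilege level: bits 0-1. */
static unsigned selector_rpl(uint16_t sel)   { return sel & 0x3u; }

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
static bool selector_is_ldt(uint16_t sel)    { return (sel & 0x4u) != 0; }

/* Descriptor index: bits 3-15. */
static unsigned selector_index(uint16_t sel) { return sel >> 3; }

/* A selector is null when it names GDT entry 0, whatever its RPL.
 * That covers exactly the values 0, 1, 2 and 3. */
static bool selector_is_null(uint16_t sel)
{
    return (sel & 0xFFFCu) == 0;
}
```

With this decoding, the selectors 0001, 0002 and 0003 used in the `call far` encodings quoted above all have index 0 in the GDT and differ only in RPL, so they all count as null. Selector 0004 is not null, because its table indicator bit points at the LDT.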

Bochs 2.4.6 has been out for a few days BTW. I just briefly tried your boot disk in it, still hangs

But as you yourself have noticed, the problems are more probably with the emulation of the platform devices than of the processor(s). The accuracy of Bochs's processor model has become impressive and it's extremely difficult to catch it out.

Cheers

--
Czerno

rdos · Post by **rdos** » Mon Mar 07, 2011 4:07 pm

Czernobyl wrote:Bochs 2.4.6 has been out for a few days BTW. I just briefly tried your boot disk in it, still hangs
But as you yourself have noticed, the problems are more probably with the emulation of the platform devices than of the processor(s). The accuracy of Bochs's processor model has become impressive and it's extremely difficult to catch it out.

Yes. Provided you tested it with a processor without APIC and timestamp counter, it is probably PIT channel 2 that is incorrectly emulated. That is why it seemingly hangs: timers never expire.

Combuster · Post by **Combuster** » Tue Mar 08, 2011 4:56 am

PIT 2 is the PC speaker - how do you intend to use that as a timer?

rdos · Post by **rdos** » Tue Mar 08, 2011 4:59 am

Combuster wrote:PIT 2 is the PC speaker - how do you intend to use that as a timer?

Simple. You turn off the speaker (and disallow its use), and set up PIT timer 2 to accumulate elapsed time. When PIT timer 2 never changes value between read-outs, system time in RDOS will not advance, and since timers use system time, they never expire.

Combuster · Post by **Combuster** » Tue Mar 08, 2011 5:04 am

It's still not a timer, it's a counter, and one that might get skewed each time you try to read it.

rdos · Post by **rdos** » Tue Mar 08, 2011 5:10 am

Combuster wrote:It's still not a timer, it's a counter, and one that might get skewed each time you try to read it.

It is used as a counter of elapsed time. Because I read it by latching the count, and never reload it, it won't skew time. This is precisely why the event timer (PIT timer 0) cannot be used to record elapsed time: every reload of a new expire count will skew time.
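A sketch of the bookkeeping this implies, not the RDOS code: each latched read-out of a free-running 16-bit down-counter is compared with the previous one, and the difference is added to a running total. The sketch assumes the counter is never reloaded and wraps over the full 16-bit range (reload value 0, i.e. 65536), and that it is read at least once per wrap. `pit_clock`, `pit_clock_init` and `pit_clock_update` are hypothetical names; obtaining the latched value from the hardware is left out.

```c
#include <stdint.h>

typedef struct {
    uint16_t last_count;   /* latched count seen at the previous read-out */
    uint64_t total_ticks;  /* PIT input ticks accumulated so far */
} pit_clock;

static void pit_clock_init(pit_clock *c, uint16_t first_latched)
{
    c->last_count = first_latched;
    c->total_ticks = 0;
}

/* Feed in a freshly latched count; returns the accumulated tick total.
 * The counter counts DOWN, so elapsed = previous - current, taken
 * modulo 2^16 so that a wrap from 0 back to 65535 is handled. */
static uint64_t pit_clock_update(pit_clock *c, uint16_t latched)
{
    uint16_t elapsed = (uint16_t)(c->last_count - latched);
    c->last_count = latched;
    c->total_ticks += elapsed;
    return c->total_ticks;
}
```

If the counter value never changes between read-outs, as in the suspected emulation problem, `elapsed` is always 0 and the total never advances, which matches the symptom of timers that never expire.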

Combuster · Post by **Combuster** » Tue Mar 08, 2011 11:17 am

The wiki wrote:While the latch command should not affect the current count, on some (old/dodgy) motherboards sending the latch command can cause a cycle of the input signal to be occasionally missed, which would cause the current count to be decremented 0.8381ms later than it should be. If you're sending the latch command often this could cause accuracy problems (but if you need to send the latch command often you may wish to consider redesigning your code anyway).

rdos · Post by **rdos** » Tue Mar 08, 2011 12:08 pm

Combuster wrote:
The wiki wrote:While the latch command should not affect the current count, on some (old/dodgy) motherboards sending the latch command can cause a cycle of the input signal to be occasionally missed, which would cause the current count to be decremented 0.8381ms later than it should be. If you're sending the latch command often this could cause accuracy problems (but if you need to send the latch command often you may wish to consider redesigning your code anyway).

OK, but what is the alternative with no timestamp counter? Besides, the CMOS clock is the ultimate source of the clock, so if a few ticks are missed on the PIT, this will be compensated for once the CMOS IRQ has occurred a few times. And if it is only a few buggy motherboards, I don't see the problem. It is a bigger problem with timestamp counters that slow down on some CPUs as they adjust clock frequency.

rdos · Post by **rdos** » Wed Apr 13, 2011 12:46 pm

A new breakthrough. After providing SMP-safe patching of gates, RDOS can now run with 2 cores on my 4-core AMD Athlon processor. It is not fully stable, but at least it works for a while. I've also made some fixes to the monitor so it can handle 3 cores hitting the same fault at the same time.

However, when I start all 3 application cores with the scheduler, the whole system just locks up, and I cannot enter the monitor.

EDIT: It actually works for a brief period of time even with 3 cores. However, when it hangs, the monitor also hangs, which is a big problem. Maybe the system is hanging on some critical spinlock that is also affecting the keyboard ISR? If I cannot figure this out, I'm back with one core running the keyboard again.

Update: It was pretty easy to change the code so core 4 runs the keyboard before the monitor is entered, and thus the keyboard can be used to break things even if ISRs are blocked. Having done this modification, I can now enter the monitor, and it is evident that two cores are spinning on the same spinlock when the system hangs. Now I just need to figure out why this happens.

rdos · Post by **rdos** » Thu Apr 14, 2011 2:53 pm

I know why the system locks up with 2 or more cores (and why this is more frequent with more cores). It is the core unblock function that goes into a deadlock. The stack traces of the 3 cores show that all cores have an unblock int-frame, and one of the cores has the frame twice. In the end, all cores are waiting for the scheduler spinlock to be released, but this will never happen, as the core with two unblock frames owns the spinlock and cannot release it while it is trying to retake it.
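The failure mode described here, a core re-entering a path that takes a non-recursive spinlock it already holds, can be illustrated with a lock that records its owner. This is only a diagnostic sketch using C11 atomics, not the RDOS scheduler lock and not the redesign mentioned below; all names (`owned_spinlock`, `owned_spinlock_try`, and so on) are hypothetical.

```c
#include <stdatomic.h>

#define NO_OWNER (-1)

typedef struct {
    atomic_int owner;   /* core id of the holder, or NO_OWNER when free */
} owned_spinlock;

typedef enum {
    LOCK_ACQUIRED,       /* the calling core now holds the lock */
    LOCK_BUSY,           /* another core holds it; spinning may succeed */
    LOCK_SELF_DEADLOCK   /* the caller already holds it; spinning never ends */
} lock_result;

static void owned_spinlock_init(owned_spinlock *l)
{
    atomic_init(&l->owner, NO_OWNER);
}

/* One acquisition attempt. On failure, `expected` is overwritten with the
 * current owner, which lets the caller tell ordinary contention apart from
 * the case where it would be waiting on itself. */
static lock_result owned_spinlock_try(owned_spinlock *l, int core)
{
    int expected = NO_OWNER;
    if (atomic_compare_exchange_strong(&l->owner, &expected, core))
        return LOCK_ACQUIRED;
    return (expected == core) ? LOCK_SELF_DEADLOCK : LOCK_BUSY;
}

static void owned_spinlock_release(owned_spinlock *l)
{
    atomic_store(&l->owner, NO_OWNER);
}
```

A plain spinlock cannot distinguish `LOCK_BUSY` from `LOCK_SELF_DEADLOCK`, so the core with two unblock frames spins forever and every other core queues up behind it. Recording the owner at least makes that case detectable from a monitor.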

The unblock logic will need a major redesign.

rdos · Post by **rdos** » Fri Apr 15, 2011 2:15 pm

After updating the logic for getting the current core private data and current thread (via the current core) by letting each core have its own copy of the GDT with a special entry for core private data, the system now performs much better and crashes less with 3 cores than before. I suspect this has to do with much shorter sequences within spinlocks, which makes the unblock error happen less frequently.

A benefit of the monitor is that I can verify that GDTR is indeed different between cores, and that the core private data selector aliases to the current core's private data (I debugged the new GDT aliasing partly with the monitor).
