Constructing a multi-core monitor / debugger


Post by rdos »

I still have some considerable problems with my SMP implementation that are really hard to find. I have several "assertions" in the scheduler that track problems, but they usually fire after things have already gone wrong and do not really point to the actual cause. Also, most of the problems seem to be related to IRQs firing at particular places in the code.

I now have a 4-core AMD processor to test on. I have a BSP that handles the important hardware (PS/2 keyboard, video, IDE-disc, floppy-disc) so I can run RDOS on it. However, starting another core does not work: the kernel panics very quickly and leaves no useful trace of what went wrong. Starting another core does work on my portable eMachine mini-PC with an Atom processor, so something goes wrong when the system runs on real cores rather than hyperthreads.

Anyway, I think it might be a good idea to write a small monitor that uses one of the cores in the AMD processor. I've read about multi-core processors running different OSes, so why not let one core run a special monitor to track down problems in an SMP OS? I could let the SMP OS use a USB keyboard (I have a driver that works on another platform) and dedicate the PS/2 keyboard to the monitor. That way I could enable the monitor at any time with a special key-press. The monitor could then define an NMI vector for its own use and send an NMI IPI to all other cores in order to freeze their execution. This would work even when a core has disabled interrupts (at least it should). Then I could write a simple program that shows the execution state of all cores, memory state, stacks and so on. It should even be possible to single-step cores by temporarily taking over the trap vectors. This would provide an ideal way to find errors in the SMP implementation and to analyse system crashes as they happen.
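
As a rough illustration of the "NMI IPI to all other cores" step, here is a minimal C sketch, assuming the local APIC runs in xAPIC (memory-mapped) mode. The register offsets and ICR bit positions follow the Intel/AMD manuals; the helper names (`nmi_ipi_all_but_self`, `monitor_freeze_other_cores`) and the idea of passing in the mapped APIC base are assumptions made for this example, not RDOS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local APIC register offsets in bytes from the APIC base (xAPIC mode). */
#define APIC_ICR_LOW   0x300u
#define APIC_ICR_HIGH  0x310u

/* Interrupt Command Register fields used here. */
#define ICR_DELIVERY_NMI      (4u << 8)   /* delivery mode 100b = NMI */
#define ICR_DELIVERY_PENDING  (1u << 12)  /* read-only delivery status bit */
#define ICR_LEVEL_ASSERT      (1u << 14)
#define ICR_DEST_ALL_BUT_SELF (3u << 18)  /* destination shorthand 11b */

/* Value for the low ICR dword that sends an NMI to every core except the
   sender. The vector field is ignored for NMI delivery, so it stays 0. */
static uint32_t nmi_ipi_all_but_self(void)
{
    return ICR_DEST_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_NMI;
}

/* Hypothetical helper: 'apic' points at the mapped local APIC registers.
   Where that mapping lives is up to the kernel. */
static void monitor_freeze_other_cores(volatile uint32_t *apic)
{
    assert(apic != NULL);
    /* Wait until any previously issued IPI has been accepted. */
    while (apic[APIC_ICR_LOW / 4] & ICR_DELIVERY_PENDING)
        ;
    apic[APIC_ICR_HIGH / 4] = 0;                      /* unused with a shorthand */
    apic[APIC_ICR_LOW / 4] = nmi_ipi_all_but_self();  /* this write sends the IPI */
}
```

The monitor core would call `monitor_freeze_other_cores()` from its PS/2 hot-key handler. Because NMIs are not masked by `cli`, the other cores should take the interrupt even inside interrupts-off sections, which is the property the paragraph above relies on.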

Has anybody done this? Or do you rely on error-free code or some other method?

Post by bewing »

Well, this is exactly what Bochs with 8 cores, or ReBochs with 65 thousand cores, is supposed to do for you: make race conditions more noticeable, more common, and easier to debug with a nice god's-eye view of the "inside" of the machine. Trying to debug such a thing on real hardware is going to be a major PITA, no matter how you do it.

Post by rdos »

No, Bochs cannot even run RDOS because it fails to emulate certain hardware correctly. Besides, it all works on a simple platform (Athlon), so I suspect Bochs would not find these real-hardware problems.

Post by gerryg400 »

rdos, I feel your pain. I do all my SMP stuff on real hardware too. The only method I've found is to stare at the code until I find the bug. Well, that's not entirely true. Sometimes I add extra locking, disable interrupts on one or more cores, or modify the scheduler until the problem disappears. Then I stare at the code until I find the bug.
If a trainstation is where trains stop, what is a workstation ?

Post by Combuster »

I can't think of any functionality that you expect of each and every platform, but not Bochs. If you require such arcane magic, chances are you'll break some real hardware as well. And maybe that's exactly what's happening.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie

Post by bewing »

And just in general, I'm very interested in any code that exposes failures in emulators. Can I get a disk image file of your OS?

Post by rdos »

bewing wrote:And just in general, I'm very interested in any code that exposes failures in emulators. Can I get a disk image file of your OS?
Yes, here: http://www.rdos.net/rdos/floppy.img

I haven't tested Bochs for a couple of years, but the last time I did, it failed because of some failure to emulate the TLB. If I remember correctly, it also failed to handle the PIT timer correctly. Today, RDOS will use the APIC timer if it is available.

If you get a green screen, something has failed. If you get the command prompt, but it won't work, chances are the PIT/APIC timer is not working.

Post by rdos »

Combuster wrote:I can't think of any functionality that you expect of each and any platform, but not bochs. If you require such arcane magic, chances are you'll break some real hardware as well. And maybe that's exactly what's happening.
Not really arcane, but RDOS makes extensive use of the PIT or APIC timer, which includes programming the counters to generate arbitrary timeouts. RDOS does not use it as some kind of 18Hz interval timer like DOS does, but aggressively reprograms it. It also expects to be able to read out the count at any time, and to get the associated IRQs at the right time. This is in fact one of the troubles with the SMP implementation.
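
To make "programming the counters to generate arbitrary timeouts" concrete, here is a small C sketch that converts a timeout in microseconds into a 16-bit PIT reload value. It assumes the nominal PIT input clock of about 1.193182 MHz; the function names and the clamping policy are assumptions for illustration, and the lobyte/hibyte split is only relevant if the channel is programmed in that access mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIT_HZ 1193182u   /* nominal PIT input clock in Hz */

/* Convert a timeout in microseconds to a 16-bit PIT reload value.
   The hardware treats a reload of 0 as 65536 ticks, so the result is
   clamped to 1..65535; longer timeouts must be split by the caller. */
static uint16_t pit_count_for_usec(uint32_t usec)
{
    uint64_t ticks = ((uint64_t)usec * PIT_HZ) / 1000000u;
    if (ticks < 1)
        ticks = 1;
    if (ticks > 65535)
        ticks = 65535;
    return (uint16_t)ticks;
}

/* Split a reload value into the low and high bytes that would be written
   to the channel 0 data port (0x40) in lobyte/hibyte access mode. */
static void pit_count_bytes(uint16_t count, uint8_t *lo, uint8_t *hi)
{
    assert(lo != NULL && hi != NULL);
    *lo = (uint8_t)(count & 0xFFu);
    *hi = (uint8_t)(count >> 8);
}
```

With these assumptions, a 126 µs timeout maps to a count of 150, and anything much above 54 ms saturates at 65535, which is why a kernel that uses the PIT this way has to chain several short timeouts for long delays.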

As for functionality, I expect something similar to the integrated kernel debugger I already have, but for CPU cores when fatal conditions occur. Today, I have a shutdown device that just dumps the faulted thread state after a fatal condition in the kernel (for instance, a fault in the null thread). I also have some sanity checks in the SMP scheduler, but these are just converted to register dumps for the faulting thread. If I could make these enter a monitor with a known valid state that does not depend on the scheduler, I could inspect call stacks and memory content and thus much more easily find the reason why the SMP context is compromised. Bochs won't help me here, as no faults will occur. It is the internal SMP state that has been detected to be wrong. In some cases there will be hangs, and then I need some method to switch to the monitor even if everything seems to be dead. There are actually a few known bugs in process creation as well that make the kernel panic even on a single core, which I'd like to track down.
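
One possible shape for a monitor entry that "does not depend on the scheduler" is a fixed per-core snapshot table plus a parked-core bitmask, touched only with atomic operations. The following C sketch is purely illustrative: the `core_snapshot` layout, the four-core limit, and all function names are assumptions, and a real NMI handler would save the full interrupted frame in assembly before calling anything like this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CORES 4

/* Hypothetical register snapshot; a real one would hold the whole frame. */
struct core_snapshot {
    uint32_t eip, esp, eflags;
    uint16_t cs, ss;
};

static struct core_snapshot snapshots[MAX_CORES];
static atomic_uint frozen_mask;     /* bit n set: core n has parked */
static atomic_bool monitor_active;  /* cores stay parked while this is true */

/* Called from a core's NMI handler: publish the interrupted state. */
static void core_report_parked(unsigned core, const struct core_snapshot *state)
{
    assert(core < MAX_CORES);
    snapshots[core] = *state;
    atomic_fetch_or(&frozen_mask, 1u << core);
}

/* Spin until the monitor lets go. No locks or kernel structures are used. */
static void core_wait_for_release(unsigned core)
{
    assert(core < MAX_CORES);
    while (atomic_load(&monitor_active))
        ;
    atomic_fetch_and(&frozen_mask, ~(1u << core));
}

/* Monitor side: true once every core in 'expected' has saved its state. */
static bool all_cores_parked(unsigned expected)
{
    return (atomic_load(&frozen_mask) & expected) == expected;
}
```

The monitor would set `monitor_active`, freeze the other cores, poll `all_cores_parked()` with a timeout, and then read `snapshots[]`. A core that never shows up in the mask is itself a useful clue.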

So Bochs cannot help me, since Bochs knows nothing about the internal workings of RDOS. It doesn't know when the SMP state is incorrect. It doesn't know that something is spinning forever trying to acquire a spinlock because some other core forgot to release it (or otherwise corrupted it). It doesn't know that some core has entered a sleep state to conserve energy while waiting for some resource to become available, but the owner never wakes it up.
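
The "spinning forever on a spinlock nobody releases" case can be made visible from inside the kernel with a debug spinlock that records its owner and gives up after a bounded number of attempts. This is a minimal C11 sketch under assumed names (`dbg_spinlock`, `dbg_spin_acquire`, and so on); it is not how RDOS implements its locks, and a real x86 version would add a `pause` in the loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Lock word is 0 when free, otherwise (owner core + 1). */
struct dbg_spinlock {
    atomic_uint owner;
};

static void dbg_spin_init(struct dbg_spinlock *lock)
{
    atomic_init(&lock->owner, 0u);
}

/* Try to take the lock, giving up after 'max_spins' failed attempts.
   On failure, *holder receives the core that appears to hold the lock,
   or -1 if the lock happened to be free when it was sampled. */
static bool dbg_spin_acquire(struct dbg_spinlock *lock, unsigned core,
                             uint32_t max_spins, int *holder)
{
    assert(lock != NULL && holder != NULL);
    for (uint32_t i = 0; i < max_spins; i++) {
        unsigned expected = 0u;
        if (atomic_compare_exchange_strong(&lock->owner, &expected, core + 1u))
            return true;
    }
    /* Timed out: sample the owner so a monitor can name the culprit. */
    unsigned seen = atomic_load(&lock->owner);
    *holder = seen ? (int)(seen - 1u) : -1;
    return false;
}

static void dbg_spin_release(struct dbg_spinlock *lock, unsigned core)
{
    /* Releasing a lock this core does not hold is itself a bug worth trapping. */
    assert(atomic_load(&lock->owner) == core + 1u);
    atomic_store(&lock->owner, 0u);
}
```

When `dbg_spin_acquire()` returns false, the caller could enter the monitor with the waiting core and the reported holder, which turns a silent hang into a state that can be inspected.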

Bochs is quite good at getting a new kernel up, but it is not as good for more mature systems that have left this stage.

Post by Owen »

rdos wrote:I haven't tested Bochs for a couple of years, but last I did it failed because of some failure to emulate the TLB
I seriously wonder how you can rely on behaviour of the TLB in a way which is portable, but which Bochs does not emulate correctly.

Post by rdos »

gerryg400 wrote: rdos, I feel your pain. I do all my SMP stuff on real hardware too. The only method I've found is to stare at the code until I find the bug. Well that's not entirely true. Sometimes I add extra locking, disable interrupts on one/some cores or modify the scheduler until the problem disappears. Then stare at the code until I find the bug.
Yes, I know what you mean. I'm a little envious of people who actually code for SMP from the beginning, as this is the correct way to do it. It's a little like multitasking. If you don't code for multitasking from the beginning, chances are you will have lots of problems converting the OS to a multitasking OS.

I coded for multitasking from the beginning, but not for SMP, as widely available SMP systems did not exist at that time. They do now, so I have to fit it into a design that was not made for SMP. First I had to change from hardware task switching to software task switching, which was a lot of pain, but this works pretty well now. Then I put the "hooks" into the code to be able to extend the design to multicore, and this is where I am now.

And staring at a 6,000 line task-management module coded in assembler, some parts as old as 20 years, is not that easy. :cry:

Post by rdos »

Just for fun, I downloaded Bochs-2.4.5 in order to check if RDOS works. It doesn't. The floppy image I linked to above (http://www.rdos.net/rdos/floppy.img) faults on a single step (!!). I also downloaded a fresh image (http://www.rdos.net/rdos/floppy-new.img), and it doesn't fault but rather hangs somewhere, with Bochs reporting CS being loaded with NULL. This image at least works on my PPC-L61 computer here, and it certainly does not load CS with 0.

Post by Combuster »

Bochs reporting CS being loaded with NULL. This image at least works on my PPC-L61 computer here, and it certainly does not load CS with 0.
Doesn't that make an excellent bug to file at your address? :wink:

Post by rdos »

Combuster wrote:
Bochs reporting CS being loaded with NULL. This image at least works on my PPC-L61 computer here, and it certainly does not load CS with 0.
Doesn't that make an excellent bug to file at your address? :wink:
No. Both of these images boot on real hardware and on Microsoft VirtualPC. It is a problem in Bochs. For instance, I've never seen a "Single Step" error in the boot process on any real hardware.

More info: In the case where CS is loaded with zero, Bochs outputs an address to the loop in the null task that looks like this:


null_loop:
    hlt
    jmp null_loop

Bochs log:


00278558793d[PIT  ] entering timer handler
00278558793d[PIT81] clock_all:  cycles=91
00278558793d[IOAP ] set_irq_level(): INTIN0: level=1
00278558793d[IOAP ] IOAPIC: servicing
00278558793d[IOAP ] service_ioapic(): INTIN0 is masked
00278558793d[IOAP ] service_ioapic(): INTIN1 is masked
00278558793d[IOAP ] service_ioapic(): INTIN4 is masked
00278558793d[IOAP ] service_ioapic(): INTIN6 is masked
00278558793d[IOAP ] service_ioapic(): INTIN8 is masked
00278558793d[PIC  ] IRQ line 0 now high
00278558793d[PIC  ] signalling IRQ(0)
00278558793d[PIT  ] RESETting timer
00278558793d[PIT  ] deactivated timer
00278558793d[PIT  ] s.last_usec=69639698
00278558793d[PIT  ] s.timer_id=1
00278558793d[PIT  ] s.timer.get_next_event_time=0
00278558793d[PIT  ] s.last_next_event_time=0
00278558793d[CPU0 ] interrupt(): vector = 28, TYPE = 0, EXT = 1
00278558793d[CPU0 ] interrupt(): INTERRUPT TO SAME PRIVILEGE
00278558797d[PIC  ] IO read from 0021
00278558797d[PIC  ] read master IMR = 80
00278558799d[PIC  ] IO write to 0021 = 81
00278558799d[PIC  ] setting master pic IMR to 81
00278558801d[PIC  ] IO write to 0020 = 20
00278558811d[CPU0 ] return_protected: return to SAME PRIVILEGE LEVEL
00278558960d[CPU0 ] page walk for address 0x00000000e000c670
00278559062d[CPU0 ] page walk for address 0x00000000ff00c64a
00278559115d[PIT81] clock_all:  cycles=95
00278559115d[PIT  ] write to port 0x0040, value = 0x96
00278559115d[PIT81] Write Initial Count: counter=0, count=150
00278559115d[IOAP ] set_irq_level(): INTIN0: level=0
00278559115d[PIC  ] IRQ line 0 now low
00278559115d[PIT  ] RESETting timer
00278559115d[PIT  ] deactivated timer
00278559115d[PIT  ] activated timer
00278559115d[PIT  ] s.last_usec=69639778
00278559115d[PIT  ] s.timer_id=1
00278559115d[PIT  ] s.timer.get_next_event_time=1
00278559115d[PIT  ] s.last_next_event_time=1
00278559118d[PIT  ] entering timer handler
00278559118d[PIT81] clock_all:  cycles=1
00278559118d[PIT81] clock_all:  cycles=1
00278559118d[PIT  ] RESETting timer
00278559118d[PIT  ] deactivated timer
00278559118d[PIT  ] s.last_usec=69639779
00278559118d[PIT  ] s.timer_id=1
00278559118d[PIT  ] s.timer.get_next_event_time=0
00278559118d[PIT  ] s.last_next_event_time=0
00278559118d[PIT  ] write to port 0x0040, value = 0x04
00278559118d[PIT81] Write Initial Count: counter=0, count=4
00278559118d[PIT  ] RESETting timer
00278559118d[PIT  ] deactivated timer
00278559118d[PIT  ] activated timer
00278559118d[PIT  ] s.last_usec=69639779
00278559118d[PIT  ] s.timer_id=1
00278559118d[PIT  ] s.timer.get_next_event_time=1
00278559118d[PIT  ] s.last_next_event_time=1
00278559119d[PIC  ] IO read from 0021
00278559119d[PIC  ] read master IMR = 81
00278559121d[PIC  ] IO write to 0021 = 80
00278559121d[PIC  ] setting master pic IMR to 80
00278559122d[PIT  ] entering timer handler
00278559122d[PIT81] clock_all:  cycles=1
00278559122d[PIT  ] RESETting timer
00278559122d[PIT  ] deactivated timer
00278559122d[PIT  ] activated timer
00278559122d[PIT  ] s.last_usec=69639780
00278559122d[PIT  ] s.timer_id=1
00278559122d[PIT  ] s.timer.get_next_event_time=496
00278559122d[PIT  ] s.last_next_event_time=1174
00278559122d[CPU0 ] return_protected: return to SAME PRIVILEGE LEVEL
00278559204d[CPU0 ] VERR: null selector
00278560000p[WGUI ] >>PANIC<< Window closed, exiting!
00278560000i[CPU0 ] CPU is in protected mode (halted)
00278560000i[CPU0 ] CS.d_b = 16 bit
00278560000i[CPU0 ] SS.d_b = 16 bit
00278560000i[CPU0 ] EFER   = 0x00000000
00278560000i[CPU0 ] | RAX=000000008d15ee90  RBX=00000000000001a1
00278560000i[CPU0 ] | RCX=0000000000000000  RDX=00000000010cfa67
00278560000i[CPU0 ] | RSP=0000000000000200  RBP=0000000000000000
00278560000i[CPU0 ] | RSI=00000000000001ea  RDI=00000000000001e2
00278560000i[CPU0 ] |  R8=0000000000000000   R9=0000000000000000
00278560000i[CPU0 ] | R10=0000000000000000  R11=0000000000000000
00278560000i[CPU0 ] | R12=0000000000000000  R13=0000000000000000
00278560000i[CPU0 ] | R14=0000000000000000  R15=0000000000000000
00278560000i[CPU0 ] | IOPL=0 id vip vif ac vm rf nt of df IF tf SF zf AF pf CF
00278560000i[CPU0 ] | SEG selector     base    limit G D
00278560000i[CPU0 ] | SEG sltr(index|ti|rpl)     base    limit G D
00278560000i[CPU0 ] |  CS:0030( 0006| 0|  0) ff001baa 0000fd0d 0 0
00278560000i[CPU0 ] |  DS:00f0( 001e| 0|  0) fe0388c2 0000029b 0 0
00278560000i[CPU0 ] |  SS:ee60( 1dcc| 0|  0) e00038c4 000001ff 0 0
00278560000i[CPU0 ] |  ES:ee90( 1dd2| 0|  0) e000293c 00000085 0 0
00278560000i[CPU0 ] |  FS:eff8( 1dff| 0|  0) e0000020 0000122d 0 0
00278560000i[CPU0 ] |  GS:0000( 0000| 0|  0) 00000000 00000000 0 0
00278560000i[CPU0 ] |  MSR_FS_BASE:00000000e0000020
00278560000i[CPU0 ] |  MSR_GS_BASE:0000000000000000
00278560000i[CPU0 ] | RIP=000000000000c5fb (000000000000c5fb)
00278560000i[CPU0 ] | CR0=0xe0000019 CR2=0x0000000000004804
00278560000i[CPU0 ] | CR3=0x00033007 CR4=0x00000000
00278560000i[CPU0 ] 0x000000000000c5fb>> jmp .-3 (0xff00e1a4) : EBFD
The display is cleared, and nothing more happens. It seems like Bochs is executing the null task all the time and no IRQs ever happen (in particular, there are no IRQs from the PIT or APIC timer that could schedule another thread so RDOS could start loading). It seems the old problem with reprogramming the PIT as an interval timer is still not solved.

EDIT: What probably happens is that the PIT reports that no time has passed, and thus timers in RDOS never expire.
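
For context on how a "no time has passed" reading could starve the timers, here is a hedged C sketch of deriving elapsed ticks from a latched PIT count. It assumes mode 0 (interrupt on terminal count), where the counter counts down from the reload value and, per the 8254 datasheet, wraps to 0xFFFF and keeps counting after reaching zero. Which mode RDOS actually uses is not stated in the thread, and the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed since a reload, given the value latched from the counter.
   Assumes mode 0 and at most one wrap past zero; the one-tick load delay
   after writing the count is ignored. */
static uint32_t pit_elapsed_ticks(uint16_t reload, uint16_t latched)
{
    assert(reload != 0);   /* a reload of 0 means 65536 and is not handled here */
    if (latched <= reload)
        return (uint32_t)reload - latched;      /* still counting down */
    /* Counter passed zero and wrapped to 0xFFFF: the timeout has expired. */
    return (uint32_t)reload + (0x10000u - latched);
}

/* Non-zero once the programmed timeout has fully elapsed. */
static int pit_timeout_expired(uint16_t reload, uint16_t latched)
{
    return pit_elapsed_ticks(reload, latched) >= reload;
}
```

If an emulator keeps returning the reload value itself on every read, `pit_elapsed_ticks()` stays at 0 and nothing that depends on it ever expires, which would match the symptom described in the EDIT above.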
Last edited by quok on Tue Dec 07, 2010 12:45 pm, edited 1 time in total.
Reason: Changed the bochs log from quote tags to code tags

Post by rdos »

OK, so the OHCI driver (+ HID-driver) works on my 4-core AMD board. That means I only need to get an extra USB keyboard, remove the PS/2 keyboard driver, and then I can start developing the monitor. :D

Post by Brendan »

Hi,

I just can't help wondering if the number of bugs created while implementing a multi-core monitor/debugger will be more or less than the number of bugs you hope to fix with the multi-core monitor/debugger.

Then comes the question of debugging the bugs in the multi-core monitor/debugger - won't you need a debugger debugger for that, and will it result in infinite recursion (e.g. a debugger debugger debugger .... debugger)? 8)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.