Here's how my printk() function works in my kernel. I've got no problems with it on SMP.
I use two locks: a spinlock called logbuf_lock and a semaphore called console_lock. Everything is buffer based, and a few housekeeping variables keep track of where in the buffer you are: log_start, log_end, and con_start.
The first thing printk does is grab the logbuf_lock. Once it has that, it spews into a rather large (32K) buffer using vsnprintk (my vsnprintf variant for the kernel). It then attempts to grab the console_lock semaphore (using semaphore_tryacquire). If it cannot, it releases the logbuf_lock spinlock and exits.
If printk was able to grab the console_lock semaphore, it immediately calls a utility function release_console_lock(). This function is just a huge for loop. As soon as you enter the for loop, release_console_lock grabs the logbuf_lock spinlock again. It makes sure there is something in the log to print by seeing if con_start is equal to log_end. If that is true, the loop exits (still holding the spinlock). Otherwise, I set some local variables, _start and _end to con_start and log_end, respectively. I also set con_start to log_end, as those two being the same is the only way to exit the loop. Then I release the logbuf_lock spinlock. Next up, I call the console printing routine, passing _start and _end.
The console printing routine loops through every registered console driver, printing whatever is in the log buffer between _start and _end. In this way, my kernel messages can go to serial output and be displayed on the screen quite easily.
After the console printing routine exits, we jump back to the top of that for loop in release_console_lock. We grab the logbuf_lock again, and if we're lucky, nobody else printed to the buffer while we were printing to screen, and we can exit. Otherwise it's lather, rinse, repeat. Once out of the for loop, we release the console semaphore, and then release the logbuf_lock spinlock. Don't think that's backwards; remember, the console lock was grabbed way up in printk().
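To make the flow above concrete, here is a rough user-space sketch of it, not my actual kernel code. The lock helpers are stand-ins built on C11 atomic_flag, the temporary formatting buffer, the power-of-two masking, and the single capturing "console" in call_console_drivers() are all assumptions made so the example runs on its own, and standard vsnprintf takes the place of vsnprintk.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define LOG_BUF_LEN 32768u              /* 32K, the size mentioned above */
#define LOG_MASK    (LOG_BUF_LEN - 1u)  /* assumes a power-of-two size */

static char log_buf[LOG_BUF_LEN];
static unsigned log_end, con_start;     /* free-running; masked on access */

/* Stand-ins for the kernel primitives (assumptions, user space only). */
static atomic_flag logbuf_lock  = ATOMIC_FLAG_INIT;  /* plays the spinlock */
static atomic_flag console_lock = ATOMIC_FLAG_INIT;  /* plays the semaphore */

static void spin_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) { } }
static void spin_unlock(atomic_flag *l) { atomic_flag_clear(l); }
static bool semaphore_tryacquire(atomic_flag *s) { return !atomic_flag_test_and_set(s); }
static void semaphore_release(atomic_flag *s)    { atomic_flag_clear(s); }

/* Hypothetical console routine: one fake driver that captures the text. */
static char console_out[256];
static size_t console_len;

static void call_console_drivers(unsigned start, unsigned end)
{
    for (unsigned i = start; i != end; i++)
        if (console_len < sizeof console_out - 1)
            console_out[console_len++] = log_buf[i & LOG_MASK];
}

static void release_console_lock(void)
{
    for (;;) {
        spin_lock(&logbuf_lock);
        if (con_start == log_end)
            break;                      /* drained; leave still holding the spinlock */
        unsigned _start = con_start, _end = log_end;
        con_start = log_end;            /* claim everything logged so far */
        spin_unlock(&logbuf_lock);
        call_console_drivers(_start, _end);  /* slow part, spinlock not held */
    }
    semaphore_release(&console_lock);   /* acquired earlier by the caller */
    spin_unlock(&logbuf_lock);
}

int printk(const char *fmt, ...)
{
    static char tmp[1024];              /* only touched under logbuf_lock */
    va_list ap;

    spin_lock(&logbuf_lock);
    va_start(ap, fmt);
    int n = vsnprintf(tmp, sizeof tmp, fmt, ap);
    va_end(ap);
    if (n < 0)
        n = 0;
    if (n >= (int)sizeof tmp)
        n = (int)sizeof tmp - 1;        /* output was truncated */
    for (int i = 0; i < n; i++)
        log_buf[log_end++ & LOG_MASK] = tmp[i];
    bool drain = semaphore_tryacquire(&console_lock);
    spin_unlock(&logbuf_lock);
    if (drain)
        release_console_lock();         /* this caller drains for everyone */
    return n;
}
```

Holding console_lock before calling printk() makes the message sit in log_buf until somebody calls release_console_lock(), which is the same mechanism that lets messages queue up before any console driver exists.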
Anyway, doing things this way allows most processors to keep spewing to the printk buffer, and then happily go on their way to something else. Only one processor at a time ever handles the task of "draining" the printk buffer by calling the console printing routines.
Of course, there's all kinds of other housekeeping here that I didn't cover. For instance, setting log_start and log_end appropriately, and how to handle the situation where log_end wraps back to the beginning of the buffer, but log_start is back at the end somewhere. And what happens when you are in a kernel panic and want to dump the buffer to screen, but some other processor has the locks, and can't release them because panic() sends an IPI to halt all processors? Well, in that situation, I have an "in_panic" global that gets set, and if that is set, printk zaps all the locks so it can happily grab them and dump the buffer.
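One common way to deal with the wraparound case, shown here only as an illustrative sketch and not necessarily what my code does, is to let the indices run freely and mask them on every buffer access. This assumes the buffer size is a power of two. The helper names emit_log_char() and copy_unread() are hypothetical, and the buffer is shrunk to 8 bytes so the wrap is easy to see.

```c
#define LOG_BUF_LEN 8u                  /* tiny on purpose; must be a power of two */
#define LOG_MASK    (LOG_BUF_LEN - 1u)

static char log_buf[LOG_BUF_LEN];
static unsigned log_start, con_start, log_end;  /* never reset, only masked */

/* Append one character; call with logbuf_lock held in a real kernel. */
static void emit_log_char(char c)
{
    log_buf[log_end & LOG_MASK] = c;
    log_end++;
    /* If the writer lapped a reader, drop the oldest data for that reader. */
    if (log_end - log_start > LOG_BUF_LEN)
        log_start = log_end - LOG_BUF_LEN;
    if (log_end - con_start > LOG_BUF_LEN)
        con_start = log_end - LOG_BUF_LEN;
}

/* Copy whatever the console has not seen yet; returns the byte count. */
static unsigned copy_unread(char *dst, unsigned cap)
{
    unsigned n = 0;
    while (con_start != log_end && n < cap)
        dst[n++] = log_buf[con_start++ & LOG_MASK];
    return n;
}
```

Because the indices are unsigned and only ever compared by subtraction, the arithmetic keeps working when they overflow. Writing ten characters into the 8-byte buffer leaves only the last eight readable.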
The only other problem I really know of here is that it is possible to lose messages, or get "corrupt" messages if you're printing to the buffer faster than you can drain it. If that happens, simply increasing the buffer size will help greatly.
My printk also uses syslog-style message numbers, so I can filter what gets displayed (on each console) by setting the syslog_level variable in the console driver. Usually I'll have all debug messages go to a serial port, while only the more "important" stuff gets displayed on screen. I also provide some boot variables that can change this so you don't need to recompile all the time, and of course there will be a system call to handle it as well.
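Per-console filtering could look roughly like the sketch below. The "<n>" prefix format, the default level of 4, the struct layout, and the console_emit() name are assumptions for illustration; only the syslog_level field name comes from my description. Each fake console just counts the bytes it would have printed.

```c
#include <stddef.h>

#define DEFAULT_MESSAGE_LEVEL 4         /* assumed level for untagged messages */

struct console {
    const char *name;
    int syslog_level;                   /* print messages whose level is below this */
    size_t written;                     /* bytes "printed"; stands in for real output */
};

static struct console consoles[] = {
    { "serial", 8, 0 },                 /* everything, including debug (7) */
    { "screen", 5, 0 },                 /* levels 0..4 only */
};

/* Parse an optional "<n>" prefix and step *msg past it. */
static int message_level(const char **msg)
{
    const char *m = *msg;
    if (m[0] == '<' && m[1] >= '0' && m[1] <= '7' && m[2] == '>') {
        *msg = m + 3;
        return m[1] - '0';
    }
    return DEFAULT_MESSAGE_LEVEL;
}

/* Hand one message to every console whose filter lets it through. */
static void console_emit(const char *msg, size_t len)
{
    const char *body = msg;
    int level = message_level(&body);

    len -= (size_t)(body - msg);        /* do not print the "<n>" tag itself */
    for (size_t i = 0; i < sizeof consoles / sizeof consoles[0]; i++)
        if (level < consoles[i].syslog_level)
            consoles[i].written += len;
}
```

With these settings a "<7>" debug message reaches only the serial console, while a "<3>" error or an untagged message reaches both.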
And finally, one of the first things I do in my kernel is grab the console_lock semaphore, which is a statically allocated variable. Doing that means I can easily use printk long before the console drivers are set up, and then flush the buffer without losing any messages after the console drivers are available.