What do you think? Linux uses softirq, tasklet

sounds · Post by **sounds** » Sun May 27, 2012 11:53 pm

Hello osdev!

I'm doing the SMP design for my kernel, and I found this page while researching. I think it's a useful intro / history of what Linux does.

http://people.netfilter.org/rusty/unrel ... tirqs.html

Is it really that good of an idea?

I'll try to sum up what I learned:

The interrupt handler tries to be as small as possible (called the "hardware interrupt handler" by Linux)
Most of the work is done in a softirq
The split between hardware interrupt and softirq is generally a queue of things that need processing - that's what the interrupt signalled anyway
The softirq is also where data can be sent off to other parts of the kernel, especially if the CPU ends up waiting for locks for a really long time in other parts of the kernel
The softirq is a kernel thread so only one CPU is running it at a time
The hardware interrupt handler is available on all CPUs, so several can be running all at once if IRQ balancing is set that way
Small hardware interrupt handlers are better for the scheduler because the scheduler can decide when to do the bulk of the work in the softirq
Small hardware interrupt handlers are also less likely to cause crashes at a level where recovery may be impossible
The softirq can still do a lot of damage but there are more recovery options if it does crash & burn

Now that I understand better what Linux is doing, I can respect this design. But I'm not convinced - it seems inefficient to design a driver in two pieces.

Brendan · Post by **Brendan** » Mon May 28, 2012 1:05 am

Hi,

sounds wrote:But I'm not convinced - it seems inefficient to design a driver in two pieces.

It is inefficient, but the alternative is probably worse...

If a lock is used by a "hardware interrupt handler" and also used by the kernel itself; then you have to disable IRQs when you acquire that lock to avoid deadlocks (e.g. IRQ handler waiting for a lock that can't be released until after the IRQ handler returns).

Imagine if the device driver handled the IRQ during the "hardware interrupt handler", and one device driver wanted to allocate some physical memory. Now every time the kernel wants to allocate physical memory it has to disable IRQs (to avoid causing a deadlock if a physical memory manager lock is in use at the time).

Imagine if the device driver handled the IRQ during the "hardware interrupt handler", and the kernel developers never had a standard device driver API and couldn't know which functions a device driver's interrupt handler might use. Now every function in the kernel that uses any kind of lock has to disable IRQs, just in case a device driver's IRQ handler feels like using that function and causes a deadlock. Most of the kernel probably uses some sort of locks; and having IRQs disabled for "most of the kernel" would completely destroy the kernel's ability to handle IRQs quickly (interrupt latency would be severe).

If the device driver didn't handle the IRQ during the "hardware interrupt handler" (e.g. the "hardware interrupt handler" does nothing more than arrange for the device driver to handle the interrupt later, when the kernel isn't holding any locks) then the kernel would only need to disable IRQs for locks that are used to make these arrangements (and not for every lock in the kernel). Of course every device driver could implement its own "postpone this until later" code; but it makes much more sense for the kernel to provide it so that it's not duplicated (badly) in different ways in different drivers.

Now, you might be wondering if they could fix the problem by defining a standard driver API. My guess is that the number of functions that they'd need in the driver API (or needed by functions that are needed by function that are needed in the driver API) would add up to "most of the kernel" anyway. It might have helped a little, but still might not help enough.

Cheers,

Brendan

sounds · Post by **sounds** » Mon May 28, 2012 10:39 am

Thanks, Brendan!

Most hardware interrupt handlers don't need interrupts disabled - except maybe a short section at the entry & exit of the handler (for example, to update the TSS - kernel stack bookkeeping - of course the NMI handler is an exception to this).

Only a small number of drivers even have IRQs:

Example #1:

PCI disk controller receives IRQ, submits DMA buffer pointer to ATA disk driver queue
ATA disk driver doesn't have any IRQs, it just uses the queues

Example #2:

PCI USB controller receives IRQ, submits DMA buffer pointer to the Generic USB driver
USB drivers don't have any IRQs, they just use the queues

I realize I'm cherry-picking the examples that show my point, but the PCI disk / PCI USB drivers that actually handle IRQs can run with interrupts enabled, and don't allocate memory in the IRQ handler. So most drivers shouldn't have to handle IRQs; the few that do must be very careful not to deadlock. But it seems to me the IRQ handler can avoid deadlock - even if the IRQ hits while the lock is held to read the queue, by designing a deadlock-free queue.

Even with the CPU IF flag set (interrupts enabled), lower priority IRQs will be buffered in the APIC until the hardware interrupt handler returns. Keep handlers as short as possible for the benefit of low priority interrupt latency. It's not a huge concern but not totally unimportant.

Of course, since most IRQ handlers will be for PCI devices they will all start to look very similar. I'm not sure if that means the driver should be forced to use a kernel-provided handler or if each driver re-implements it (possibly badly). I can see why Linux implements them only once. But personally I'd rather have speed than conservativism!

Brendan · Post by **Brendan** » Mon May 28, 2012 1:11 pm

Hi,

sounds wrote:Only a small number of drivers even have IRQs:

Here's a list of devices that use IRQs in some way:

Video cards (IRQ for vertical refresh and/or 2D/3D accelerator commands and/or bus mastering data to/from display memory)
- Monitor/s (relies on a device that has an IRQ)
Sound cards (IRQs for transferring data to speakers or from microphone)
- Speakers (relies on a device that has an IRQ)
- Microphones (relies on a device that has an IRQ)
Hard disk controllers (IRQ to signal data transfer completion for both DMA/bus mastering and PIO)
- Hard disks (relies on a device that has an IRQ)
- CD-ROM (relies on a device that has an IRQ)
Floppy disk controllers (IRQ to signal data transfer completion for both DMA/bus mastering and PIO)
- Floppy disk drives (relies on a device that has an IRQ)
ISA DMA controller chips (rely on some other device to generate IRQs when transfers complete - e.g. floppy controller)
Network cards (IRQs for transferring data to/from the network)
USB controllers (IRQs for data transfers)
- USB keyboard (relies on a device that has an IRQ)
- USB mouse (relies on a device that has an IRQ)
- USB flash (relies on a device that has an IRQ)
- USB printers (relies on a device that has an IRQ)
- Other USB devices (relies on a device that has an IRQ)
PS/2 controller (IRQs for data transfers)
- PS/2 keyboard (relies on a device that has an IRQ)
- PS/2 mouse (relies on a device that has an IRQ)
- Other PS/2 devices (relies on a device that has an IRQ)
Serial port controller (IRQs for data transfers)
- Serial mouse (relies on a device that has an IRQ)
- Serial printers (relies on a device that has an IRQ)
- Other serial devices (relies on a device that has an IRQ)
Parallel port controller (IRQs for data transfers)
- Parallel printers (relies on a device that has an IRQ)
- Other parallel devices (relies on a device that has an IRQ)
UPS (relies on serial port IRQ, USB controller IRQ or ethernet card IRQ to signal "events"/changes)
HPET, PIT, RTC (IRQs for signalling time expiry)
CPUs (Inter-Processor Interrupts, various IRQs in the local APIC - timer, performance monitoring, etc)
ACPI embedded controller (IRQ used to signal "events" that are meant to be handled by AML, like power button pressed, laptop lid opened/closed, etc)
PIC and IO APICs (spurious IRQs)
RAM and memory controller (uses machine check or NMI to signal faults)
IO Hubs/PCI host controller/LPC bridge (forward IRQs from other devices and may cause machine check and/or NMI)

And here's a list of devices that don't use any IRQs:

CMOS (excluding RTC)
Firmware ROM/flash
Power supply (except for laptops where there's "ACPI events" for connecting/disconnecting mains power and "low battery" warnings)

Now let's look at just one device: the serial port. When it receives a byte the IRQ occurs and the IRQ handler has to do something with that byte. Normally it'd send the byte (using IPC) to a process. IPC involves buffers. Sending IPC may also unblock a task (e.g. a task that was blocked waiting for data from the serial port). All by itself, receiving one byte from a serial port may involve any locks that are used by the kernel's physical memory management, (kernel-space) virtual memory management, IPC and the scheduler. To avoid deadlocks, all of those locks would have to disable IRQs (unless you have some sort of "process it later" support, like "soft IRQs").

That's just one device though (well, half of one device). Most device drivers that receive data (keyboard, mouse, ethernet, sound card/microphone, etc) are similar - they have to do something with the data. For some devices (hard disk drivers) buffers are pre-allocated before data is read/received from the device, so the need to allocate buffer space in the IRQ handler is gone (but you really don't want to be calling functions in a file system layer directly from the disk controller's IRQ handler so you still want some sort of "process it later" support like "soft IRQs").

For some devices (anything that supports "hot-plug" in any way - e.g. USB controller) an IRQ can occur because a device was plugged in. In this case you really need some sort of "process it later" support (like "soft IRQs") because you don't want to be starting new device drivers inside an IRQ handler.

Of course this is all stuff that has to happen. There's optional things too, like logging (e.g. errors that are reported by IRQs) and tracking statistics (e.g. network load).

You might be wondering if a micro-kernel avoids the need for some sort of "process it later" support (like "soft IRQs"), because the kernel's IRQ handlers only send IPC to the device driver in user space. It doesn't. The act of sending IPC to a process from within the kernel's IRQ handler is just a different way of doing "process it later" support.

Cheers,

Brendan

sounds · Post by **sounds** » Mon May 28, 2012 4:39 pm

Hi Brendan,

I'm glad you pointed out the potential queue deadlock (priority inversion). I overlooked it at first; some kernels go with "process it later" code to solve this problem. Is a lock-free queue workable here?

Thanks as well for providing a more complete list of which drivers need IRQs. For the purpose of writing interrupt handlers, are we agreed that e.g. the USB Keyboard driver doesn't have an IRQ handler because the USB controller has one?

The USB controller is a great place to focus in and hash out the essential questions:

Can the interrupt handler submit the DMA buffer pointer to the appropriate destination without taking forever? [specifically, without deadlocking or calling sections of the code not intended to happen inside an interrupt]
A corollary to #1: when will the destination wake up? Typically the scheduler is invoked at the end of all device interrupt handlers, so blocked threads do run after the interrupt handler releases the right lock, but not inside the interrupt handler.
Can buffer pointers be allocated/freed outside the interrupt handler - so no memory management is needed during the IRQ?
Data moving the other way may also block waiting for the interrupt to indicate the device's buffers are not full. Is it sufficient to let the scheduler deal with this as another thread, scheduled when the interrupt handler releases the lock that blocked it?
(a repeat of the first question for completeness) Does the queue implementation guarantee the interrupt handler can always submit buffer pointers to the queue, potentially causing the destination to have to back off / retry its access? [the destination would be using a kernel-provided library function to pull data off the queue, so the back off / retry behavior is transparent to the driver writer]

I guess if the answers are all yes, then it might be possible to do away with softirqs in my kernel.

Brendan · Post by **Brendan** » Mon May 28, 2012 9:52 pm

Hi,

You're getting confused (and trying too hard to avoid "process it later"). Let me be perfectly clear, there are only 2 cases:

IRQ handlers use some sort of "process it later" somewhere; or
Everything happens inside an IRQ handler

There are no other choices.

For example, imagine there's a task that is blocked waiting for "read()" to complete, and the disk driver's IRQ occurs:

You can use "process it later" somewhere - either:
- The disk driver's IRQ handler could use some sort of "process it later" (and all the rest can happen later); or
- The disk driver's IRQ handler could complete that transfer and arrange for the next transfer, then use some sort of "process it later" (and all the rest can happen later); or
- The disk driver's IRQ handler could complete that transfer and arrange for the next transfer and then call the file system's code directly, and the file system's code could use some sort of "process it later" (and all the rest can happen later); or
- The disk driver's IRQ handler could complete that transfer and arrange for the next transfer and then call the file system's code directly, which could call the VFS layer directly, which could use some sort of "process it later" (and all the rest can happen later)
Or:
The disk driver's IRQ handler could complete that transfer and arrange for the next transfer and then it could call the file system's code directly, which could call the VFS layer directly, which could complete the "read()" and return to the file system's code, which could return to the disk driver's IRQ handler, which could do IRET

The second option is entirely silly, therefore you have to use some sort of "process it later". This could involve IPC (messages, pipes, whatever), or some sort of queues, or "soft IRQ", or something "signal-like", or polling, or any other way of arranging for code to be executed outside of the IRQ handler. It doesn't matter much which method you use (as long as it's not polling because polling sucks, although signals are ugly so I'd avoid them too). It does matter where it is (obviously you'd want to keep the IRQ handler's short/fast, so "as soon as possible" is better). It also matters if it's a generic/standard thing or if every device driver has to implement it's own hack.

sounds wrote:I'm glad you pointed out the potential queue deadlock (priority inversion). I overlooked it at first; some kernels go with "process it later" code to solve this problem. Is a lock-free queue workable here?

For data being sent from software to a device, a lock-free queue could work (sort of), assuming the device generates some sort of "ready for more" IRQ that the driver's IRQ handler can use to take data from the lock-free queue and send it to the device. However there'd be race conditions involved when the queue is empty (and no "ready for more" IRQ is expected).

A lock-free queue wouldn't work well on its own as a form of "process it later" for data being sent from the device to software, as software wouldn't know when to check the lock-free queue (you'd have to poll the queue).

Also, in general lock-free algorithms do solve deadlock problems, but don't solve starvation problems. If the IRQ handler has to get something from a lock-free queue or put something on a lock-free queue; then it could repeatedly fail and retry for an unlimited amount of time. Note: "lock free" means that someone (not necessarily the IRQ handler) makes forward progress, while "block free" means that everyone (including your IRQ handler) makes forward progress.

sounds wrote:Thanks as well for providing a more complete list of which drivers need IRQs. For the purpose of writing interrupt handlers, are we agreed that e.g. the USB Keyboard driver doesn't have an IRQ handler because the USB controller has one?

Ok.

sounds wrote:The USB controller is a great place to focus in and hash out the essential questions:
Can the interrupt handler submit the DMA buffer pointer to the appropriate destination without taking forever? [specifically, without deadlocking or calling sections of the code not intended to happen inside an interrupt]

For sending data, the USB controller driver's interrupt handler should probably be able to submit the pointer to the DMA buffer to the USB controller (the destination) relatively quickly. For receiving data, the USB controller driver's interrupt handler may or may not be able to submit the pointer to the DMA buffer to the "whatever" (the destination); depending on what the "whatever" is, and depending on what "submit" means, and depending on how "submit" works.

sounds wrote:
A corollary to #1: when will the destination wake up? Typically the scheduler is invoked at the end of all device interrupt handlers, so blocked threads do run after the interrupt handler releases the right lock, but not inside the interrupt handler.

If "destination" is a task (and not another driver in the kernel or something) and if "wake up" means "unblocked" (e.g. data being received by a task that was blocked waiting for IO from a USB device) then the destination/task will wake up/unblock when "something" tells the scheduler to unblock the task; where "something" might be the VFS or GUI or shell or network stack or kernel or some other process or whatever else (depending on what the device was, what the received data is, etc).

sounds wrote:
Can buffer pointers be allocated/freed outside the interrupt handler - so no memory management is needed during the IRQ?

Yes - everything can happen outside the IRQ handler (except for arranging some sort of "process it later").

sounds wrote:
Data moving the other way may also block waiting for the interrupt to indicate the device's buffers are not full. Is it sufficient to let the scheduler deal with this as another thread, scheduled when the interrupt handler releases the lock that blocked it?

Yes, maybe.

sounds wrote:
(a repeat of the first question for completeness) Does the queue implementation guarantee the interrupt handler can always submit buffer pointers to the queue, potentially causing the destination to have to back off / retry its access? [the destination would be using a kernel-provided library function to pull data off the queue, so the back off / retry behavior is transparent to the driver writer]

Polling sucks (even if it's "polling with back off/retry"). You poll, it's not there, you back off a little; you poll again, it's still not there, you back off a little more. Six months later (after preventing the CPU from going into any sleep state with your wasteful polling), you poll again, it's still not there, you back off a little more and immediately after that it happens, but you backed off so far that you don't even notice for 3 days.

Then there's the issue of how you're planning to measure the time delays for the back off - e.g. are you going to poll a time counter constantly, or are you going to poll the queue from within a timer IRQ handler?

Now it's my turn for questions:

Does your OS have some sort of IPC that normal processes can use to send data to other normal processes?
Is the IPC you already have (or the IPC you're planning to have) sane? For example, can tasks block waiting for IPC, and be woken up when the IPC occurs?
Is the IPC you already have (or the IPC you're planning to have) useful? For example, can a process send data to a driver or receive data from a driver using exactly the same IPC code without knowing/caring that it's talking to a driver and not another process?

Cheers,

Brendan

sounds · Post by **sounds** » Tue May 29, 2012 12:01 am

Brendan wrote:Hi,

You're getting confused (and trying too hard to avoid "process it later"). Let me be perfectly clear, there are only 2 cases:
IRQ handlers use some sort of "process it later" somewhere; or

Everything happens inside an IRQ handler
There are no other choices.

Ok, put it that way and yes, my OS will "process it later." I just don't like a mandated one-size-fits-all hardware interrupt handler and all drivers have to "process it later."

Brendan wrote:as long as it's not polling because polling sucks...race conditions involved when the queue is empty (and no "ready for more" IRQ is expected)...Polling sucks (even if it's "polling with back off/retry"). You poll, it's not there, you back off a little; you poll again, it's still not there, you back off a little more. Six months later (after preventing the CPU from going into any sleep state with your wasteful polling), you poll again, it's still not there, you back off a little more and immediately after that it happens, but you backed off so far that you don't even notice for 3 days.

Then there's the issue of how you're planning to measure the time delays for the back off - e.g. are you going to poll a time counter constantly, or are you going to poll the queue from within a timer IRQ handler?

Now it's my turn for questions:
Does your OS have some sort of IPC that normal processes can use to send data to other normal processes?

Is the IPC you already have (or the IPC you're planning to have) sane? For example, can tasks block waiting for IPC, and be woken up when the IPC occurs?

Is the IPC you already have (or the IPC you're planning to have) useful? For example, can a process send data to a driver or receive data from a driver using exactly the same IPC code without knowing/caring that it's talking to a driver and not another process?

Yes to all three, and those are good questions.

My OS will support a posix compatibility layer at some point; right now all I have is semaphores and shared memory a la System V, and character devices including anonymous pipes. Userspace processes can create sysv semaphores, shared memory, and anonymous pipes. In kernel space all that I need are locks (a.k.a. semaphores). Unlike linux, I don't need spin_lock_irqsave(), spin_lock_irq(), spin_lock_bh(), and spin_lock() too. Just spin_lock().

The lock free FIFO I've written only makes a guarantee of progress in one direction. It's not 100% lock-free, since only one side can operate freely, while the other side must carefully lock and retry on failure.

Example: the USB controller receiving URBs and such.

The interrupt handler fires when the incoming DMA is done. It calls the kernel code to append incoming buffers into the FIFO, which will:

Increment the FIFO head counter by 1 ("LOCK XADD")
"LOCK MOV" the DMA buffer pointer into the index returned by XADD (call it the "old head counter")
Read the FIFO tail counter with "LOCK MOV" and compare it to the old head counter; if a buffer overrun occurs, a pointer-to-string saying "USB URB buffer overrun" is submitted to the console log FIFO (same algorithm)
If the FIFO wait queue is not empty and the FIFO lock (see below) is not held, remove the first thread on the wait queue (call it thread1). Acquire the FIFO lock in the name of thread1. Then change thread1's status to awake. Assume that the interrupt handler will run the scheduler just before it exits, so thread1 will eventually run with the FIFO lock in its possession.

Notice no spinlock or anything guarding the append operation. Also, the interrupt handler runs with the CPU IF=1. The FIFO lock only guards the threads that consume URBs.

When the USB subsystem opens the FIFO, it acquires the FIFO lock, then calls select() on the FIFO.

select() will: clear IF=0, put this thread on the FIFO's wait queue, release the FIFO lock, read the head counter with "LOCK MOV," read the tail counter with "LOCK MOV." If empty, sleep by yielding to the scheduler. The thread will awaken with the lock held, ready to pull URBs out of the FIFO. If not empty, set IF=1, remove the thread from the FIFO's wait queue, attempt to acquire the FIFO lock but do not block. If the FIFO lock cannot be acquired, check if the lock was already acquired on behalf of this thread by the interrupt handler. If the lock really is held by another thread, clear IF=0 and jump up to the sleep() call. Once the lock is held by this thread, proceed to pull URBs out of the FIFO.

The kernel code to pull URBs out of the FIFO will:

Read the head counter with "LOCK MOV"
Read the tail counter with "LOCK MOV" (guarded by the FIFO lock being currently held)
Subtract to find how many URBs are available
If 0 are available, return with an error like "try again later"
"LOCK MOV" the DMA buffer pointer from the index at the tail counter.
The pointer may be null, which means the append operation has not written the DMA buffer pointer yet - return with an error like "try again later"
"LOCK MOV" null into the index at the tail counter, which removes the DMA buffer pointer from the FIFO
Incrementally read DMA buffer pointers from the FIFO until some error happens (i.e. 0 available or a null pointer)
Save the updated tail counter back to the FIFO with "LOCK MOV"
Return the collected pointers
Note: the caller, i.e. the USB subsystem, must call select() again to release the FIFO lock or URBs may be in the FIFO with no one awake to fetch them. An argument to select() does a non-blocking select() to release the FIFO lock if there really are no more URBs.

At any point while consuming data out of the FIFO, the head counter may change. The DMA buffer pointer is written into the FIFO after updating the head counter - so multiple writers have non-blocking access. The reader must check both the head counter and that the DMA buffer pointer is not null.

Mea culpa for referring to "back off." There are no time delays here.

A stack implementation with one-sided non-blocking behavior is similar, using atomic ops on a "stack pointer."

rdos · Post by **rdos** » Tue May 29, 2012 6:47 am

Alternative solution:

Instead of soft-IRQs, use above-normal priority kernel threads. To avoid locks in the IRQ, implement a lock-free solution that the IRQ handler can use to wakeup the server thread. This solution can also be used outside of IRQ-handlers. If the IRQ can run with interrupts enabled, it must first call some scheduler function to lock the current thread from scheduling, as the scheduler should not be able to switch thread in the middle of the IRQ.

I've implemented lock-free Signal / WaitForSignal. Signal will set a flag in the thread that is signalled. Wait for signal will sleep if the flag is cleared, and continue (and clear the flag) if it is set. It is garanteed that whenever the Signal occurs, the thread doing WaitForSignal will always run once, even if Signal is issued by an IRQ. Signal is the only synchronization primitive that is allowed to be used by IRQs, appart from spinlocks.

sounds · Post by **sounds** » Tue May 29, 2012 11:41 am

Hi rdos,

Good point about being sure the scheduler will never do a context switch in the middle of an IRQ. I do think I've avoided that bug but it would be a serious one, so I'll write some unit tests to keep my eye on it.

Thanks for the helpful discussion. Like Brendan said,

Brendan wrote:The second option is entirely silly, therefore you have to use some sort of "process it later".

OSDev.org

What do you think? Linux uses softirq, tasklet

What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet

Re: What do you think? Linux uses softirq, tasklet