Sorted/ ordered list using red-black b-tree

dushara · Post by **dushara** » Sat Jul 07, 2007 7:48 pm

Hi,

Is an ordered list the best mechanism to implement a timer list? I've been reading up a bit about red-black b-trees and it seems an interesting proposition...

What I find attractive about it is, according to wikipedia, it seems to have well defined worst case times.

Inserting items into a large sorted list can be a fairly expensive (and non-deterministic) operation.

I'm curous what your opinion is on this matter.

Does anybody have any links to articles etc on efficient timer queue designs?

Dushara

Colonel Kernel · Post by **Colonel Kernel** » Sat Jul 07, 2007 9:14 pm

I don't have a link handy, but I've heard of a technique where you have a sorted list of timers. You sort by a relative time instead of an absolute time, so that the head of the list is always the next timer and whenever its count reaches zero it's time is up.

I think Brendan has implemented this scheme before... maybe he can describe it more coherently (if he's reading this

).

jhawthorn · Post by **jhawthorn** » Sat Jul 07, 2007 9:31 pm

This is described briefly with a little example on the wiki in the article Blocking Process. As Colonel Kernel said, Brendan always has something interesting to share regarding scheduling, so you might want to search through his posts.

dushara · Post by **dushara** » Sat Jul 07, 2007 11:55 pm

Thanks for the replies...

I don't have a link handy, but I've heard of a technique where you have a sorted list of timers

This is what I meant when I said "ordered list" sorry if I didn't express myself clearly.

From my understanding, the problem with this scheme is, you have to traverse the whole list if you need to insert an item that belongs at the end (which is quite expensive).

I think the wiki article is also talking about the same thing.

As Colonel Kernel said, Brendan always has something interesting to share regarding scheduling

Thanks I'll search his posts...

Dushara

jhawthorn · Post by **jhawthorn** » Sun Jul 08, 2007 12:39 am

If you are worried about the time it takes to insert an item, consider using a minimum heap, which does insertion in logarithmic time. Checking the minimum element can be done in constant time, but removal is logarithmic.

Colonel Kernel · Post by **Colonel Kernel** » Sun Jul 08, 2007 1:50 am

dushara wrote:This is what I meant when I said "ordered list" sorry if I didn't express myself clearly.

From my understanding, the problem with this scheme is, you have to traverse the whole list if you need to insert an item that belongs at the end (which is quite expensive).

Ah, I thought you were worried about the time it takes to check when a timer should fire. I guess you're more worried about the scalability of adding new timers (since insertion in a simple list is linear)? How many timers do you expect will be in use typically? If you're optimizing something like this, surely the rest of your OS must be covered in polished chrome by now.

dushara · Post by **dushara** » Sun Jul 08, 2007 5:34 am

If you are worried about the time it takes to insert an item, consider using a minimum heap

Thanks I'll see if that helps me...

Ah, I thought you were worried about the time it takes to check when a timer should fire. I guess you're more worried about the scalability of adding new timers (since insertion in a simple list is linear)?

Yes insertion is the issue.

If you're optimizing something like this, surely the rest of your OS must be covered in polished chrome by now.

What I've implemented is an RTOS kernel and I ignored most of the issues you guys take into consideration (MMU, networking, file systems, etc). I won't call it chrome, but I'm quite happy with it still... (You would have seen my announcement no doubt).

Once again thanks for the advice.

Dushara

JamesM · Post by **JamesM** » Wed Jul 11, 2007 1:56 am

If you're considering implementing red-black trees, I would also point to you http://en.wikipedia.org/wiki/AVL_Tree. AVL trees are very similar in most respects, in that the insertion, deletion and lookup times are the same. The difference is the algorithm used to optimise the tree (stopping it becoming a long linked list in the worst case), and I personally feel the AVL algorithm is easier to understand/implement.

JamesM[/url]

dushara · Post by **dushara** » Wed Jul 11, 2007 3:12 am

Thanks James...

Yes I'm looking into that as well... but from what I can gather, the AVL trees result in more rotations which can potentially be slower?

Don't know yet.

Dushara

Brendan · Post by **Brendan** » Wed Jul 11, 2007 7:49 pm

Hi,

dushara wrote:Is an ordered list the best mechanism to implement a timer list? I've been reading up a bit about red-black b-trees and it seems an interesting proposition...

No....

If "removeEntry" is only used in the timer IRQ handler with interrupts disabled, and "insertEntry" is only used outside of IRQ handlers with interrupts enabled, then it's likely that the time "removeEntry" takes will be much more important than the time "insertEntry" takes.

Next, don't be fooled by "number of operations" (or "big O" notation). For example, an "O(n)" algorithm with poor cache locality might be much worse than an "O(n^2)" algorithm with good cache locality for all possible values of "n".

Then there's re-entrancy - if you're in the middle of inserting a new entry in the list and get a timer IRQ what are you going to do? For single CPU systems you could disable IRQs for all code that touches the list, but that won't work for SMP. To make it work with SMP you can't use a lock alone (the IRQ handler would cause deadlocks) and you can't use any kind of lock that causes task switches inside an IRQ handler. You could disable interrupts and use some kind of spinlock. The problem with that is there's no way to tell how long the IRQ handler would need to wait before it acquires the lock - for e.g. you could have 15 CPUs trying to get the lock to insert something and a timer IRQ at the same time, and the IRQ handler could be the last one to acquire the lock.

Then there's the quantizing effect of the timer. If the timer IRQ occurs every 10 ms, then you could have a list like "6 ms, 1 ms, 8 ms, 15 ms, 18 ms, 12 ms" and it won't matter. Threads don't need to be in strict order - you only need to sort threads according to the time quantum they wake up in.

Lastly, your kernel should guarantee that a thread will not wake up before it's time expires. This means that (for e.g.) if the timer IRQ occurs every 10 ms and a thread wants to sleep for 1 ms, then you don't know if a timer IRQ will occur 2 us after you put the thread on the list, and need to make it wake up after 11 ms to guarantee that it doesn't wake up sooner than 1 ms.

Based on all of the above, it makes sense to me to have a seperate unsorted list for each time quantum. The nice thing with this (assuming that your kernel does guarantee a thread won't wake up too early) is that the timer IRQ can check the "soonest" list without caring about new threads being added to the list. The problem here is you'd need a huge number of lists (a total of "maximumSleepTime / timeQuantum" lists), but we can "cheat" by using modulo and leaving things on the list if they aren't meant to wake up yet.

For an example (using 65536 singly linked lists):

Code: Select all

;Insert a new thread on a list of sleeping threads
;
;Input
; ebx    timer tick to wake up
; edi    address of thread's data area (TDA)

insertThreadOnList:
    push *stuff*

    *disable task switches*

    mov [edi+TDA.wakeTime],ebx              ;Set the actual time this thread should wake
    movzx ebx,bx                            ;ebx = list number (or "wakeTime % 65536")

    mov eax,[timerListTable+ebx*4]          ;eax = current "last thread on list"
.retry:
    mov [edi+TDA.nextThreadTDA],eax         ;Link the previous thread's TDA to the new thread's TDA
    lock cmpxchg [timerListTable+ebx*4],edi ;Try to set a new "last thread on list"
    je .retry                               ;Retry if "last thread on list" changed

    *enable task switches*

    pop *stuff*
    ret

Code: Select all

;The timer IRQ handler
; Note: I assume this uses an interrupt gate (disable tasks switches if it doesn't)

timerIRQ:
    push *stuff*

;Adjust the current time

    inc dword [currentTime],eax          ;Increase the time counter
    mov ebp,[currentTime]                ;ebp = the current time

;Get the next list of sleeping threads

    xor edx,edx
    mov ebx,[nextTimerListNumber]        ;ebx = next timer list number
    lock xchg [timerListTable+ebx*4],edx ;Get the old thread (if any) and clear the list
    xor ecx,ecx                          ;ecx = address for last thread on the "still sleeping" list (none)

;Start processing the list

    jmp .doneThread

;Do the next thread on the list

.nextThread:
    mov edx,[eax+TDA.nextThreadTDA]      ;edx = address for next thread on the list
    mov [eax+TDA.wakeTime],ebp           ;Should this thread wake up yet?
    jbe .wakeThread                      ; yes
    mov [edi+TDA.nextThreadTDA],eax      ; no, put the thread on the "still sleeping" list
    mov ecx,eax                          ;     and set a new "end of sleeping list"
    jmp .doneThread

.wakeThread:
    call wakeThread                      ;Tell the scheduler to wake up this thread

;Check if all threads are done

.doneThread:
    mov eax,edx                          ;eax = address of next thread in list
    test eax,eax                         ;Is everything done?
    jne .nextThread                      ; no, keep going

;Check if there's any threads that are still sleeping on this list

    test ecx,ecx                         ;Are there any threads still sleeping?
    je .done                             ; no, all done

;Merge the remaining sleeping threads with the current list

    mov eax,ecx                          ;eax = last thread on the list
    lock xchg [timerListTable+ebx*4],eax ;Get the old thread (if any) and set a new thread
    mov [ecx+TDA.nextThreadTDA],eax      ;Link the previous thread's TDA to the new thread's TDA

;Update the next list number

.done:
    inc bx                               ;ebx = next timer list number (wraps around at 65536)
    mov [nextTimerListNumber],ebx        ;Set the new list number for next IRQ

;Everything done - do an EOI and return

    *send EOI*
    pop *stuff*
    iretd

If I didn't stuff something up (it's untested) this is entirely multi-CPU safe. Inserting an entry is "lock free" (but not "block free").

The timer IRQ handler is "block free". Worst case is many threads sleeping for a very long time and repeatedly being checked, but to be honest I wouldn't worry much about that. With a 10 ms timer IRQ, any thread that sleeps for less than 10.9 minutes (655.36 seconds) is only checked once when it needs to wake up.

Cheers,

Brendan

Colonel Kernel · Post by **Colonel Kernel** » Wed Jul 11, 2007 9:08 pm

Brendan wrote:If "removeEntry" is only used in the timer IRQ handler with interrupts disabled, and "insertEntry" is only used outside of IRQ handlers with interrupts enabled, then it's likely that the time "removeEntry" takes will be much more important than the time "insertEntry" takes.

With a sorted list of timers, wouldn't removeEntry be trivial? You'd always be removing the head of the list...

Next, don't be fooled by "number of operations" (or "big O" notation). For example, an "O(n)" algorithm with poor cache locality might be much worse than an "O(n^2)" algorithm with good cache locality for all possible values of "n".

I was with you until you said "all possible values of 'n'".

It may be true for "today's practical values of 'n'", but those values change over time. Also, the difference between n and n^2 as n increases is a whopper... comparing n with n log n might be more appropriate.

MrZebra · Post by **MrZebra** » Wed Aug 01, 2007 2:06 am

Next, don't be fooled by "number of operations" (or "big O" notation). For example, an "O(n)" algorithm with poor cache locality might be much worse than an "O(n^2)" algorithm with good cache locality for all possible values of "n".

This is not true. Yes, an O(n^2) algorithm might perform better then a O(n) for some values [0, X[ but since cache misses are constant time the above statement does not hold for n > X.

Change possible to reasonable.

[Edit]
Sorry, missed Colonel Kernel's reply...

OSDev.org

Sorted/ ordered list using red-black b-tree

Sorted/ ordered list using red-black b-tree

Re: Sorted/ ordered list using red-black b-tree

Re: Sorted/ ordered list using red-black b-tree