Some aggregating variants of common sorting algorithms
Posted: Tue Feb 28, 2017 4:20 pm
After watching some videos visualizing the performance of various sorting algorithms, I got curious about the idea of aggregation or accretion of repeated data points, something I didn't know of any examples of. I did some reading, and my impression is that while clustering repeated values is a fairly common 'optimization trick', the only algorithms I found that do it, or anything quite like it, are Timsort (a hybrid of merge sort and insertion sort, mainly) and the various counting sorts (which tally the number of occurrences of each value rather than actually rearranging the original data). I have not found any formal literature on what I am calling 'aggregating' or 'accretive' sorting algorithms in general. Most of the related discussion I did find was about Timsort specifically, which has caught on recently for this very property: it performs better than most other algorithms on real-world data, even though its worst-case and average-case bounds are the same as those of the much less complex merge sort it is partly based on.
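For anyone who hasn't run into the counting sorts, the core idea in its simplest form looks something like this. This is just a minimal sketch of my own (the name is mine, and it assumes small non-negative integer keys), not any particular published version:

Code:
def countingSortSketch(data):
    """Minimal counting-sort sketch: tally how many times each value
    occurs, then rebuild the list in key order. Repeats are handled
    as a single count, never compared or moved individually."""
    if not data:
        return data
    counts = [0] * (max(data) + 1)    # one slot per possible key
    for value in data:
        counts[value] += 1
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result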
Out of curiosity, I decided to try devising aggregating variants of two of the simpler sorts: the short-circuiting version of Bubble Sort, and Gnome Sort in both its basic and 'teleporting' forms. They aren't really all that interesting in and of themselves, but I wanted to see how the change altered best-case performance. I figure that, if nothing else, they can serve as a baseline for the effect this sort of modification has.
I thought I would share the ones I have done so far before going any further. If anyone is curious, I can post the instrumented test versions of these (and the base algorithms for comparison) along with a log of some trials as an attachment.
Code:
def musterSort(data):
    """muster sort - an accretive (or aggregating) variant of the
    optimized bubblesort.
    I chose the name because repeated elements aggregate into a
    'troop' or 'squadron' which moves together from then on."""
    endpoint = len(data)
    for i in range(endpoint):
        is_sorted = True
        tagged = 0                  # front of the current run of equal values
        for pos in range(1, endpoint - i):
            diff = data[tagged] - data[pos]
            if diff > 0:
                # out of order: a single swap jumps the smaller element
                # over the whole run of equal values
                data[pos], data[tagged] = data[tagged], data[pos]
                is_sorted = False
                tagged = pos
            elif diff < 0:
                tagged = pos        # a larger value starts a new run
            # diff == 0: the run grows and tagged stays at its front
        if is_sorted:               # short-circuit: a pass with no
            break                   # swaps means everything is in order
    return data
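For example, once the 2s have mustered into a troop, the 1 hops over all three of them with a single swap on the third pass:

Code:
>>> musterSort([4, 2, 2, 2, 7, 1])
[1, 2, 2, 2, 4, 7]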
Code:
def smurfSort(data):
    """ an aggregating variant of a Gnome Sort
    Basically, gnomes that congregate in a village..."""
    endpoint = len(data)
    tagged, pos = 0, 1      # tagged marks the current run of equal values
    direction = 1           # +1 walking forward, -1 walking back
    while pos < endpoint:
        if pos < 1:
            # back at the front (a swap at index 0 can push pos to -1,
            # so test for < 1 rather than == 0): turn around and walk
            # forward again
            direction = 1
            tagged, pos = 0, 1
        diff = (data[tagged] - data[pos]) * direction
        if diff < 0:
            tagged = pos    # in order: the new value starts a new run
        elif diff > 0:
            # out of order: one swap jumps the offending element over
            # the whole run, then the walk changes direction
            data[tagged], data[pos] = data[pos], data[tagged]
            tagged, pos = pos, tagged
            direction = -direction
        # diff == 0: the run grows and tagged stays put
        pos += direction
    return data
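Same idea here: each time a smaller element turns up behind the run of 3s, a single swap carries it over the entire run, so this input takes only two swaps in total:

Code:
>>> smurfSort([3, 3, 3, 1, 2])
[1, 2, 3, 3, 3]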
Code:
def bamfSort(data):
    """ an aggregating version of the 'optimized' Gnome Sort
    Probably only X-Men fans of old will get the name."""
    def innerBamfSort(top):
        # walk data[top] back through the sorted prefix; equal values
        # are stepped over rather than swapped, so a run of repeats
        # holds together and costs no extra exchanges
        pos, tagged = top - 1, top
        while pos >= 0 and data[pos] >= data[tagged]:
            if data[tagged] != data[pos]:
                data[tagged], data[pos] = data[pos], data[tagged]
                tagged = pos
            pos -= 1
    for pos in range(len(data)):
        innerBamfSort(pos)
    return data
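Until I attach the instrumented versions, a quick sanity check along these lines will exercise all three against Python's built-in sorted(). The trial count and value span are arbitrary choices of mine; the small span just forces plenty of repeats:

Code:
from random import randint

def quickCheck(trials=200, size=30, span=8):
    """Compare each sort against sorted() on random lists that are
    dense with repeated values (small span forces repeats)."""
    for sortFunc in (musterSort, smurfSort, bamfSort):
        for _ in range(trials):
            sample = [randint(0, span) for _ in range(size)]
            assert sortFunc(sample[:]) == sorted(sample), sortFunc.__name__
    print("all", trials * 3, "trials passed")

quickCheck()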