Nearest neighbour
Slightly random question, but can anyone recommend a fast algorithm for nearest-neighbor search in four(ish) dimensions? Heuristic/probabilistic solutions also welcome.
Any resemblance of my madness to methods, useful or flawed, is purely coincidental.
- AndrewAPrice
- Member
- Posts: 2299
- Joined: Mon Jun 05, 2006 11:00 pm
- Location: USA (and Australia)
Re: Nearest neighbour
What units are you using?
If your dataset is a grid (w,x,y,z) and you have a decimal number, just round it to the nearest whole number.
If it's freeform data, you can loop through each item and keep the nearest one (O(n) complexity). The algorithm for calculating the distance between two 4D points is:
Code:
#include <math.h> /* for sqrtf */

float dist(float w1, float x1, float y1, float z1,
           float w2, float x2, float y2, float z2) {
    /* component-wise differences */
    float dw = w1 - w2;
    float dx = x1 - x2;
    float dy = y1 - y2;
    float dz = z1 - z2;
    /* squared Euclidean distance */
    float mag2 = dw * dw + dx * dx + dy * dy + dz * dz;
    return sqrtf(mag2);
}
However, since you're just comparing distances, you can use the distance squared (for non-negative values, d1 < d2 implies d1 * d1 < d2 * d2) - that saves you a computationally expensive square root. Return mag2 instead of sqrtf(mag2). You can take the square root later, once you've found the closest one, if you need the true distance.
If you have a large dataset (millions of points!) and want reasonable or realtime performance, you'd probably want to spatially divide the dataset.
If the data is fairly evenly distributed across a closed area, you could make a 4D grid of 'buckets' and just scan your own bucket. If nothing is found, scan the buckets at +1 and -1 in every direction around you; if still nothing, scan at +2 and -2, and so on. (There's a sketch of this further down.)
If the data is unevenly distributed or in an open world, you could use a 4D variation of an octree (a tree with one child per combination of halves along each axis - 2^4 = 16 children in 4D). Walk down the tree from the top, scanning the child closest to your point first, and only scan the other children if they could contain something closer than your nearest point found so far. A well-balanced 4D octree may get you O(log n) complexity.
A k-d tree is similar to an octree except that, instead of having 16 children (in 4D), each node is split into 2 children at a point 'k' along an axis 'd'. Searching a k-d tree starts with walking down it: find which axis the node is split along, test whether you're before or after the split, and scan that child first - scan the other side only if the split plane is closer (along that axis) than the nearest point found so far. (A sketch of this search is also below.)
A k-d tree needs less logic but is generally treated as a static dataset once constructed, because building a balanced k-d tree requires some computation and sorting up front. A 4D octree takes a little more code, but stays efficient when dynamically updated.
Feel free to ask any questions if none of that made sense.
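Here's a minimal sketch of that expanding bucket search, to make it concrete. This is my own illustration rather than anything from the thread: the grid size, cell size, and all names (Point, cell, nearest) are invented, coordinates are assumed to be non-negative integers, and an exact search would scan one extra ring after the first hit.
Code:
/* Sketch: 4D grid of buckets with an expanding search. */
#define GRID_N 16   /* buckets per axis (arbitrary) */
#define CELL   64   /* coordinate units per bucket  */

typedef struct Point { int w, x, y, z; struct Point *next; } Point;

static Point *grid[GRID_N][GRID_N][GRID_N][GRID_N]; /* bucket heads */

static int cell(int v) {  /* coordinate -> clamped bucket index */
    int i = v / CELL;
    return i < 0 ? 0 : (i >= GRID_N ? GRID_N - 1 : i);
}

static long d2(const Point *p, int w, int x, int y, int z) {
    long dw = p->w - w, dx = p->x - x, dy = p->y - y, dz = p->z - z;
    return dw * dw + dx * dx + dy * dy + dz * dz; /* squared: no sqrt */
}

Point *nearest(int w, int x, int y, int z) {
    /* Scan a growing box of buckets around the query's bucket until at
       least one point turns up.  Clamping at the grid edges makes some
       buckets get rescanned; harmless for a sketch. */
    for (int r = 0; r < GRID_N; r++) {
        Point *best = NULL;
        long best_d2 = 0;
        for (int i = -r; i <= r; i++)
        for (int j = -r; j <= r; j++)
        for (int k = -r; k <= r; k++)
        for (int l = -r; l <= r; l++) {
            Point *p = grid[cell(w + i * CELL)][cell(x + j * CELL)]
                           [cell(y + k * CELL)][cell(z + l * CELL)];
            for (; p; p = p->next) {
                long d = d2(p, w, x, y, z);
                if (!best || d < best_d2) { best = p; best_d2 = d; }
            }
        }
        if (best) return best;
    }
    return NULL;
}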
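And a matching sketch of the k-d tree search described above, again with invented names and assuming the tree has already been built and balanced. Call it with best = NULL and best_d2 = FLT_MAX (from <float.h>).
Code:
/* Sketch: nearest-neighbour descent of a 4D k-d tree. */
#include <stddef.h>

typedef struct KdNode {
    float pt[4];             /* point stored at this node         */
    int axis;                /* splitting axis: 0..3              */
    struct KdNode *lo, *hi;  /* children before / after the split */
} KdNode;

static float dist2(const float a[4], const float b[4]) {
    float s = 0;
    for (int i = 0; i < 4; i++) { float d = a[i] - b[i]; s += d * d; }
    return s;                /* squared distance: comparisons only */
}

void kd_nearest(const KdNode *n, const float q[4],
                const KdNode **best, float *best_d2) {
    if (!n) return;
    float d = dist2(n->pt, q);
    if (d < *best_d2) { *best_d2 = d; *best = n; }

    float delta = q[n->axis] - n->pt[n->axis];  /* signed gap to split plane */
    const KdNode *near_side = delta < 0 ? n->lo : n->hi;
    const KdNode *far_side  = delta < 0 ? n->hi : n->lo;

    kd_nearest(near_side, q, best, best_d2);    /* nearer child first        */
    if (delta * delta < *best_d2)               /* other side only if the    */
        kd_nearest(far_side, q, best, best_d2); /* plane could hide a closer */
}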
My OS is Perception.
Re: Nearest neighbour
MessiahAndrw wrote: What units are you using?
It's all integers. What I'm trying to construct is really a sort of spatial queue: when you enqueue an object it gets inserted into the space, and when you dequeue from a particular point in the space it gives you back the nearest object.
I'm actually toying with this as a design for a process scheduler. I was looking at several different properties for deciding what thread to go with next: priority, (inverse of) time since last scheduled, NUMA distance, proportion of scheduled time used, and so on. It occurred to me that I could use those properties of a thread as coordinates in a metric space, and then pick the closest one for each CPU. I vaguely remembered having read something about metric trees and I was wondering if I could adapt one to be a sort of queue in the same way that a heap can be a priority queue, but some googling later I just got mired in different kinds of trees.
One way seems to be to divide it into min-heaps based on distance-from-CPU at insertion time, with a heuristic like "if a heap is empty, search the neighbouring ones". That's similar to what you said about a grid of buckets. It doesn't seem right though, or maybe it's just disappointingly simple...
Any resemblance of my madness to methods, useful or flawed, is purely coincidental.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: Nearest neighbour
Having an index per CPU is pretty simple: you insert every object into all heaps and remove it from all heaps, but one heap dictates which object actually gets removed. The time per operation does increase linearly with each additional processor though, and other algorithms may prove more efficient in such cases.
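A rough outline of that scheme, in case it helps. The heap helpers (heap_push, heap_pop_min, heap_remove) and distance2_to_cpu are hypothetical names standing in for whatever binary-heap code you already have, not a real API.
Code:
/* Sketch: one distance-ordered min-heap per CPU. */
#define NCPUS 8

typedef struct Task Task;
typedef struct Heap Heap;   /* min-heap keyed by squared distance */

extern void  heap_push(Heap *h, Task *t, long key);
extern Task *heap_pop_min(Heap *h);
extern void  heap_remove(Heap *h, Task *t);
extern long  distance2_to_cpu(const Task *t, int cpu);

static Heap *queues[NCPUS];

void enqueue(Task *t) {
    /* Insert into every CPU's heap, keyed by that CPU's distance. */
    for (int c = 0; c < NCPUS; c++)
        heap_push(queues[c], t, distance2_to_cpu(t, c));
}

Task *dequeue(int cpu) {
    /* This CPU's heap dictates the pick... */
    Task *t = heap_pop_min(queues[cpu]);
    if (!t) return t;
    /* ...and the task must then leave every other heap too, which is
       where the linear cost per additional processor comes from. */
    for (int c = 0; c < NCPUS; c++)
        if (c != cpu)
            heap_remove(queues[c], t);
    return t;
}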
Re: Nearest neighbour
The other approach is to use spheres, a common technique for handling objects in an unbounded map in games.
You then define:
- How many objects live within each sphere
- The conditions for creating, destroying, merging, splitting, and resizing a sphere
To pick a nearby object you just get the sphere of the reference object and loop through the small number of objects within that sphere.
Since everything is calculated directly from the distance metric, it is independent of map resolution and dimension, in contrast to the grid approach.
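To illustrate the query side of that (my own sketch, all names invented; the create/merge/split bookkeeping is omitted): each object keeps a pointer to its sphere, and picking a nearby object is a linear scan of that sphere's member list.
Code:
/* Sketch: sphere-based grouping, query side only. */
#define MAX_PER_SPHERE 32   /* split threshold, chosen arbitrarily */

typedef struct Object Object;

typedef struct Sphere {
    float   center[4], radius;
    Object *members[MAX_PER_SPHERE];
    int     count;
} Sphere;

struct Object { float pos[4]; Sphere *sphere; };

static float dist2(const float a[4], const float b[4]) {
    float s = 0;
    for (int i = 0; i < 4; i++) { float d = a[i] - b[i]; s += d * d; }
    return s;
}

/* Nearest neighbour of `ref` among the objects sharing its sphere. */
Object *nearest_in_sphere(const Object *ref) {
    Sphere *s = ref->sphere;
    Object *best = NULL;
    float best_d2 = 0;
    for (int i = 0; i < s->count; i++) {
        Object *o = s->members[i];
        if (o == ref) continue;
        float d = dist2(o->pos, ref->pos);
        if (!best || d < best_d2) { best = o; best_d2 = d; }
    }
    return best;   /* NULL if the sphere holds no other objects */
}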
Re: Nearest neighbour
Hi,
Don't forget that to find the nearest neighbour you don't need the actual distance between any points at all; which means that instead of doing something like "if( sqrt(a) < sqrt(b) )" you only need to do "if( a < b )" and can avoid the (relatively expensive) "sqrt()".
Now, if one of the dimensions has only a small number of possible values (e.g. "NUMA distance" probably has only 256 possible values) then you can keep one list of points per possible value (e.g. one list for each of the 256 possible NUMA distances). This means you don't need to store that dimension as part of the point, and it allows you to pre-compute part of the formula (e.g. do "dz_squared = (z1 - z2) * (z1 - z2)") in an outer loop and avoid repeatedly re-computing it in the inner-most loop. Basically:
Code:
float dist_squared(float w1, float x1, float y1,
                   float w2, float x2, float y2, float dz_squared) {
    float dw = w1 - w2;
    float dx = x1 - x2;
    float dy = y1 - y2;
    /* dz_squared was pre-computed once for the whole list */
    return dw * dw + dx * dx + dy * dy + dz_squared;
}
Also, if the pre-computed "dz_squared" for a list is greater than your "current least distance squared found so far", then you can skip that entire list of points. By checking lists in order of "most likely to contain a close point" you might avoid checking many entire lists.
Finally; it seems to me that you mostly need "nearest point to (0,0,0,0)". In that case, instead of storing a point's "w, x and y" co-ords you could store "w*w, x*x and y*y", which means something more like this:
Code:
float dist_squared(float w_squared, float x_squared, float y_squared, float z_squared) {
    /* co-ords are stored pre-squared; distance from the origin is a sum */
    return w_squared + x_squared + y_squared + z_squared;
}
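To make the list-skipping concrete, here's a hypothetical driver loop over the per-NUMA-distance lists. The list layout and names are mine, not Brendan's; dz2[i] is assumed to hold the pre-computed squared z-difference between the query and list i.
Code:
/* Sketch: search per-NUMA-distance lists with whole-list pruning. */
#include <float.h>

#define NUM_LISTS 256   /* one list per possible NUMA distance */

typedef struct Pt { float w, x, y; struct Pt *next; } Pt; /* z not stored */

extern float dist_squared(float w1, float x1, float y1,
                          float w2, float x2, float y2, float dz_squared);

Pt *nearest(Pt *lists[NUM_LISTS], const float dz2[NUM_LISTS],
            float qw, float qx, float qy) {
    float best = FLT_MAX;
    Pt *best_pt = NULL;
    /* Ideally visit lists in order of increasing dz2 so pruning bites early. */
    for (int i = 0; i < NUM_LISTS; i++) {
        if (dz2[i] >= best)
            continue;   /* the whole list can't beat the best found so far */
        for (Pt *p = lists[i]; p; p = p->next) {
            float d = dist_squared(p->w, p->x, p->y, qw, qx, qy, dz2[i]);
            if (d < best) { best = d; best_pt = p; }
        }
    }
    return best_pt;
}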
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- AndrewAPrice
- Member
- Posts: 2299
- Joined: Mon Jun 05, 2006 11:00 pm
- Location: USA (and Australia)
Re: Nearest neighbour
We've shown you how you can calculate the distance between two 4D points. Worst case complexity for finding the nearest neighbour is O(n) - loop through every other point and keep track of the nearest one.
Sorting a queue - you can use any sorting algorithm you want. Most general-purpose sorting algorithms run in O(n log n) time.
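For completeness, the brute-force O(n) scan looks like this, using the squared-distance trick from earlier. The array layout is my own assumption.
Code:
/* Brute-force nearest neighbour over n points: O(n), no sqrt needed. */
typedef struct { float w, x, y, z; } Pt4;

int nearest_index(const Pt4 *pts, int n, Pt4 q) {
    int best = -1;
    float best_d2 = 0;
    for (int i = 0; i < n; i++) {
        float dw = pts[i].w - q.w, dx = pts[i].x - q.x,
              dy = pts[i].y - q.y, dz = pts[i].z - q.z;
        float d2 = dw * dw + dx * dx + dy * dy + dz * dz;
        if (best < 0 || d2 < best_d2) { best = i; best_d2 = d2; }
    }
    return best;   /* -1 if n == 0 */
}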
My OS is Perception.
Re: Nearest neighbour
The day I can't find distances in arbitrary metric spaces is the day I hand back my drywipe pens (I am a mathematician, I'm just not very computation oriented).
Anyway, it's NUMA distance that complicates everything, because it basically rules out things like octrees and k-d trees. A metric tree like Bluemoon suggested is looking like the way to go. I found a paper on an improved version of the same technique (http://www.cs.cmu.edu/~christos/PUBLICA ... imTree.pdf) which might give good enough performance for the scheduler.
I had another idea though, which I'm trying to do the math for at the moment. I can treat the pool of waiting tasks as a graph with the edge weights corresponding to the distances. The length of a path through the graph between two nodes is an upper bound on the actual distance, so with dummy nodes to represent the CPUs I can use an algorithm like A* to do approximate nearest-neighbour search, with the destination node being the task that has just been pre-empted. If I give every node only a fixed number of edges (say, the five shortest I can cheaply find) then I get O(n) space complexity; worst-case time should be similar to the tree if I queue up a few tasks each time, and I do better than the tree on average if I have a cunning heuristic in the insert algorithm. At least that's what back-of-envelope calculations suggest, but I haven't proved it. It also means there will be stochastic behaviour, which guards against starvation. Anyway, if I figure it out I'll share it.
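In case it's useful, here's a minimal sketch of the bounded-degree graph walk - a greedy descent rather than full A*, which would add a priority queue and a heuristic. All names are invented, and this is only my reading of the idea above.
Code:
/* Sketch: greedy walk on a bounded-degree task graph. */
#include <stddef.h>

#define DEGREE 5   /* fixed number of edges per node */

typedef struct Node {
    float pos[4];                /* task's coordinates in the metric space */
    struct Node *edges[DEGREE];  /* a few cheaply-found near neighbours    */
} Node;

static float dist2(const float a[4], const float b[4]) {
    float s = 0;
    for (int i = 0; i < 4; i++) { float d = a[i] - b[i]; s += d * d; }
    return s;
}

/* Start anywhere (e.g. a dummy node for the CPU) and repeatedly hop to
   the closest neighbour; stop at a local minimum.  Approximate only -
   the walk can get stuck, which is the price of the fixed degree. */
Node *greedy_nearest(Node *start, const float q[4]) {
    Node *cur = start;
    for (;;) {
        Node *best = cur;
        float best_d = dist2(cur->pos, q);
        for (int i = 0; i < DEGREE; i++) {
            Node *n = cur->edges[i];
            if (n == NULL) continue;
            float d = dist2(n->pos, q);
            if (d < best_d) { best = n; best_d = d; }
        }
        if (best == cur) return cur;   /* no neighbour is closer */
        cur = best;
    }
}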
Any resemblance of my madness to methods, useful or flawed, is purely coincidental.
Re: Nearest neighbour
Instead of using the Euclidean metric (weights 1,1,1,1) to calculate distance, you may also adjust your metric tensor to weight the dimensions differently for scheduling.
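For a diagonal metric tensor that just turns into per-axis weights in the squared distance. A trivial sketch; the example weights are placeholders, not a recommendation.
Code:
/* Weighted (diagonal metric tensor) squared distance in 4D. */
float wdist2(const float a[4], const float b[4], const float weight[4]) {
    float s = 0;
    for (int i = 0; i < 4; i++) {
        float d = a[i] - b[i];
        s += weight[i] * d * d;   /* e.g. weight = {4,1,1,2} to emphasise priority */
    }
    return s;
}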