# Partition an array of numbers into sets by proximity

Let's say we have an array like

[37, 20, 16, 8, 5, 5, 3, 0]

What algorithm can I use so that I can specify the number of partitions and have the array broken into them.

For 2 partitions, it should be

[37] and [20, 16, 8, 5, 5, 3, 0]

For 3, it should be

[37],[20, 16] and [8, 5, 5, 3, 0]

I am able to break them down by proximity by simply subtracting the element with right and left numbers but that doesn't ensure the correct number of partitions. Any ideas?

My code is in ruby but any language/algo/pseudo-code will suffice.

Here's the ruby code by Vikram's algorithm

def partition(arr,clusters) # Return same array if clusters are less than zero or more than array size return arr if (clusters >= arr.size) || (clusters < 0) edges = {} # Get weights of edges arr.each_with_index do |a,i| break if i == (arr.length-1) edges[i] = a - arr[i+1] end # Sort edge weights in ascending order sorted_edges = edges.sort_by{|k,v| v}.collect{|k| k.first} # Maintain counter for joins happening. prev_edge = arr.size+1 joins = 0 sorted_edges.each do |edge| # If join is on right of previous, subtract the number of previous joins that happened on left if (edge > prev_edge) edge -= joins end joins += 1 # Join the elements on the sides of edge. arr[edge] = arr[edge,2].flatten arr.delete_at(edge+1) prev_edge = edge # Get out when right clusters are done break if arr.size == clusters end end

## Answers

I think your problem can be solved using k-clustering using kruskal's algorithm . Kruskal algorithm is used to find the clusters such that there is maximum spacing between them.

Algorithm : -

Construct path graph from your data set like following : -

[37, 20, 16, 8, 5, 5, 3, 0]

path graph: - 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7

then weight for each edge will be difference between their values

edge(0,1) = abs(37-20) = 17 edge(1,2) = abs(20-16) = 4 edge(2,3) = abs(16-8) = 8 edge(3,4) = abs(8-5) = 3 edge(4,5) = abs(5-5) = 0 edge(5,6) = abs(5-3) = 2 edge(6,7) = abs(3-0) = 3

Use kruskal on this graph till there are only k clusters remaining : -

Sort the edges first according to weights in ascending order:-

(4,5),(5,6),(6,7),(3,4),(1,2),(2,3),(0,1)

Use krushkal on it find exactly k = 3 clusters : -

iteration 1 : join (4,5) clusters = 7 clusters: [37,20,16,8,(5,5),3,0] iteration 2 : join (5,6) clusters = 6 clusters: [37,20,16,8,(5,5,3),0] iteration 3 : join (6,7) clusters = 5 clusters: [37,20,16,8,(5,5,3,0)] iteration 4 : join (3,4) clusters = 4 clusters: [37,20,16,(8,5,5,3,0)] iteration 5 : join (1,2) clusters = 3 clusters: [37,(20,16),(8,5,5,3,0)] stop as clusters = 3

reconstrusted solution : [(37), (20, 16), (8, 5, 5, 3, 0)] is what u desired

(assuming the array is sorted in descending order)

37, 20, 16, 8, 5, 5, 3, 0

Calculate the differences between adjacent numbers:

17, 4, 8, 3, 0, 2, 3

Then sort them in descending order:

17, 8, 4, 3, 3, 2, 0

Then take the first few numbers. For example, for 4 partitions, take 3 numbers:

17, 8, 4

Now look at the original array and find the elements with these given differences (you should attach the index in the original array to each element in the difference array to make this most easy).

17 - difference between 37 and 20 8 - difference between 16 and 8 4 - difference between 20 and 16

Now print the stuff:

37 | 20 | 16 | 8, 5, 5, 3, 0

While @anatolyg's solution may be fine, you should also look at k-means clustering. It's usually done in higher dimensions, but ought to work fine in 1d.

You pick k; your examples are k=2 and k=3. The algorithm seeks to put the inputs into k sets that minimize the sum of distances squared from the set's elements to the centroid (mean position) of the set. This adds a bit of rigor to your rather fuzzy definition of the right result.

While getting an optimal result is NP hard, there is a simple greedy solution.

It's an iteration. Take a guess to get started. Either pick k elements at random to be the initial means or put all the elements randomly into k sets and compute their means. Some care is needed here because each of the k sets must have at least one element.

Additionally, because your integer sets can have repeats, you'll have to ensure the initial k means are distinct. This is easy enough. Just pick from a set that has been "unqualified."

Now iterate. For each element find its closest mean. If it's already in the set corresponding to that mean, leave it there. Else move it. After all elements have been considered, recompute the means. Repeat until no elements need to move.

The Wikipedia page on this is pretty good.