Algorithm for finding the busiest period?

I have some data like this:

1: 2 - 10
2: 3 - 15
3: 4 - 9
4: 8 - 14
5: 7 - 13
6: 5 - 10
7: 11 - 15

I will attempt a representation to make it clearer:

        1     2     3     4     5     6     7     8     9     10     11     12     13     14     15
1             |--------------------------------------X---------|
2                   |--------------------------------X--------------------------------------------|
3                         |--------------------------X---|
4                                                  |-X-------------------------------------|
5                                           |--------X------------------------------|
6                               |--------------------X----------|
7                                                                     |---------------------------|

So in the example case, 8-9 is the critical period if the second scheme is used because all the points are active. What is a fast and good way to solving this problem in python? I am thinking of using dynamic programming but are there other approaches that are suggested?

My approach until now:

I was thinking more from a real-time perspective. So, whenever I get a new point, I do this: Assume I already got 2-10 and I get 3-15 then I pick the max of start and min of end so this case it is 3-10 and increment this interval's count to 2. Then the third point comes in 4-9, pick the max which is 4 and the min is 9 and update the value 3-10 to 4-9 and update count to 3. Now when 8-14 comes in, I pick the start of this interval is greater than 4-9 and the end of this interval is less than 4-9. In this case, it is not true so I will create a new bucket 8-14 and I put the count to 1. This is not the entire algorithm but should give a high-level idea of what I am doing here. I will see if I can sketch the pseudo-code.

Answers


        1     2     3     4     5     6     7     8     9     10     11     12     13     14     15
1             |--------------------------------------X---------|
2                   |--------------------------------X--------------------------------------------|
3                         |--------------------------X---|
4                                                  |-X-------------------------------------|
5                                           |--------X------------------------------|
6                               |--------------------X----------|
7                                                                     |---------------------------|

             +1    +1     +1   +1           +1     +1    -1    -2     +1           -1     -1     -2
              1     2     3     4           5       6    5      3     4             3      2      0
                                                     ^^^^

Get it?

So you need to transform this:

1: 2 - 10
2: 3 - 15
3: 4 - 9
4: 8 - 14
5: 7 - 13
6: 5 - 10
7: 11 - 15

into:

[(2,+), (3,+), (4,+), (5,+), (7,+), (8,+), (9,-), (10,-), (10,-), (11,+), (13,-), (14,-), (15,-), (15,-)]

and then you simply iterate through, counting up when you see a + and counting down on -. The busiest interval will be when the count is maximum.

So in code:

intervals = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10), (11, 15)]
intqueue = sorted([(x[0], +1) for x in intervals] + [(x[1], -1) for x in intervals])
rsum = [(0,0)]
for x in intqueue: 
    rsum.append((x[0], rsum[-1][1] + x[1]))
busiest_start = max(rsum, key=lambda x: x[1])
# busiest_end = the next element in rsum after busiest_start 

# instead of using lambda, alternatively you can do:
#     def second_element(x):
#         return x[1]
#     busiest_start = max(rsum, key=second_element)
# or:
#     import operator
#     busiest_start = max(rsum, key=operator.itemgetter(1))

runtime complexity is (n+n)*log(n+n)+n+n or O(n*log(n))

It is also possible to convert this idea into an online algorithm if you don't have the complete list of intervals at the start of the program but is guaranteed that incoming intervals will never be scheduled for a past point. Instead of sorting you will use a priority queue, each time an interval comes, you push in two items, the start point and the end point, each with a +1 and -1 respectively. And then you pop off and count and keep track of the peak hour.


I would start by thinking of the busy-ness of a point x as the number of activations to the left of x, minus the number of deactivations to the left of x. I would sort the activations and deactivations by the time at which they occur (in O(nlog(n)) time). Then you can traverse the list, tracking the number active (y), incrementing and decrementing that number with activations and deactivations passed. The busiest period will be the points at which y is at its maximum. I can't think of a solution off the top of my head that is better than O(nlog(n)). The brute force would be O(n^2).


I thought you could perhaps use a set() for this, and it would work if your assured that all periods intersect at at least one point.

However, this does not work as soon as a period does not intersect. You may be able to add additional logic to cover this, so I'll post what I was thinking:

>>> periods = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10),]
>>> intersected = None
>>> for first, second in periods:
...     if not intersected:
...         intersected = set(range(first, second + 1))
...     else:
...         intersected = intersected.intersection(set(range(first, second + 1)))
...
>>> intersected
set([8, 9])

Note: this does not include the 11-15 period. Your probably best off just creating bin pairs as mentioned by R.K.


Here's what I was thinking for the bin based approach, and adapted to handle adds dynamically, basically what R.K. was saying I believe.

from collections import defaultdict
from operator import itemgetter

class BusyHour(object):
    def __init__(self):
        self.pairs = defaultdict(int)
    def add_period(self, period):
        start, end = period
        for current_period in range(start, end):
            pair_key = (current_period, current_period + 1) 
            self.pairs[pair_key] += 1
    def get_max(self):
        # sort, defaults to smallest to largest
        # --> items() returns (key, value) pairs
        # --> itemgetter gets the given index of the first argument given to sorted
        return max(self.pairs.items(), key=itemgetter(1))


if __name__ == '__main__':
    periods = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10), (11, 15)]
    bh = BusyHour()
    for period in periods:
        bh.add_period(period)
    print bh.get_max()

Updated: Only sort on call to get_max, and use defaultdict(int).


Not sure if i understand your question. If you are trying to find the most common "interval", you can sum them up per interval. This way you have 12 buckets for the above example. For each use, you would add 1 to each of the buckets used in that particular use, and at the end, find the maximum value in all the buckets. Here, that would be 6 for the 8-9 interval.


Need Your Help

Getting the floor value of a number in SQLite?

sql sqlite floor

I have a SQLite database with a table containing the scores of some league players in a bowling center. I'm trying to get the average of the Score column for each player ID. The problem with this i...

What's the easiest way to print output from parallel operations in Ruby without jumbling up the output?

ruby multithreading

Say I forked of a bunch of Threads and wants to print progress output from each one to STDERR. How can I do so in a way that ensures that the output retains line-atomicity, i.e. doesn't jumble up o...