# Top of the most frequent substrings in file of sequences

I have a file with many sequences (all of different lengths), like these:

5-2-6
5-6-6-6-1-8
4
1-7-6-6-2-6-6-1
18-24-2-6-6-1-8

I need to find the most frequent substrings across all sequences and get a result like this:

6-6       5 times
2-6       3 times
6-6-1     3 times
6-1       3 times
6-1-8     2 times
1-8       2 times
2-6-6-1   2 times
2-6-6     2 times

Substrings can be of any length, starting from 2 numbers. The code should let me change the minimum substring length, for example to find only substrings longer than 3 numbers.

I know I should use a suffix tree, but I can't find a working example for my case.

Language: R or Python.

You don't need a suffix tree. Just split each sequence and use a sliding-window iterator to produce the substrings; a loop from a minimum size up to the number of elements on the line lets you produce windows of every length.

A `collections.Counter()` object can then keep track of the counts:

```python
from collections import Counter
from itertools import islice

def window(seq, n=2):
    """Return a sliding window (of width n) over data from the iterable.

       s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., sn), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result
```
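As a quick sanity check, the same windows can be produced with `zip`; the `windows` helper below is purely illustrative (the name is mine, not part of the answer):

```python
from itertools import islice

# Illustrative equivalent: width-n windows built by zipping n staggered
# slices of the same list. Only a sanity check, not the answer's code.
def windows(parts, n):
    return list(zip(*(islice(parts, i, None) for i in range(n))))

parts = '5-6-6-6-1-8'.split('-')
print(windows(parts, 2))
# [('5', '6'), ('6', '6'), ('6', '6'), ('6', '1'), ('1', '8')]
```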

```python
minimum_size = 2
counts = Counter()

with open(filename) as infh:
    for line in infh:
        parts = line.strip().split('-')
        # count substrings of all sizes from minimum_size up
        counts.update('-'.join(sp)
                      for size in range(minimum_size, len(parts) + 1)
                      for sp in window(parts, size))
```
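To check the counting logic without a file, the same idea can be run over an in-memory list of the sample lines from the question (here the windows are taken with plain list slicing, so the snippet is self-contained):

```python
from collections import Counter

# In-memory check of the counting approach, using the question's sample
# lines instead of a file; windows are taken with plain list slicing.
lines = ['5-2-6', '5-6-6-6-1-8', '4', '1-7-6-6-2-6-6-1', '18-24-2-6-6-1-8']
minimum_size = 2
counts = Counter()
for line in lines:
    parts = line.split('-')
    for size in range(minimum_size, len(parts) + 1):
        counts.update('-'.join(parts[i:i + size])
                      for i in range(len(parts) - size + 1))

print(counts.most_common(1))
# [('6-6', 5)]
```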

The `Counter.most_common()` method then gives you a list of substrings sorted by frequency:

```python
for substring, count in counts.most_common():
    print('{:<20} {:>3d} times'.format(substring, count))
```

For your sample data, the top 10 entries are:

```
6-6                    5 times
6-6-1                  3 times
2-6                    3 times
6-1                    3 times
2-6-6                  2 times
1-8                    2 times
6-1-8                  2 times
6-6-1-8                2 times
2-6-6-1                2 times
18-24-2-6-6            1 times
```

If you want to explore different minimum substring lengths, you don't have to re-count the whole file. You can instead filter the `Counter` object by counting the `-` separators in the keys:

```python
for substr, freq in counts.most_common():
    if substr.count('-') < 2:
        # skip substrings shorter than 3 elements
        continue
    print('{:<20} {:>3d} times'.format(substr, freq))
```
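For example, a filtered view can also be built as a new `Counter`; the `long_counts` name is illustrative, and the sample counts below are taken from the output shown earlier:

```python
from collections import Counter

# Illustrative sample counts, taken from the output shown above.
counts = Counter({'6-6': 5, '2-6': 3, '6-6-1': 3, '6-1': 3,
                  '6-1-8': 2, '1-8': 2})
# Keep only substrings with at least 3 elements (>= 2 '-' separators).
long_counts = Counter({s: c for s, c in counts.items()
                       if s.count('-') >= 2})
for substring, count in long_counts.most_common():
    print('{:<20} {:>3d} times'.format(substring, count))
```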

R isn't the best language for this, because tree structures are not (to my knowledge) well implemented there. If you're looking for a quick-and-dirty solution, the following will work, though it is not efficient in any way:

```r
# Create some data: 1000 sequences of random length (1 to 8)
set.seed(1)
data <- replicate(1000, sample(1:10, round(runif(1, 1, 8)), replace = TRUE))
# A function that returns all substrings of a sequence
all.substrings <- function(s) unlist(sapply(seq_along(s),
    function(y) sapply(y:length(s), function(x) s[y:x])), recursive = FALSE)
# Grab all substrings of all sequences
substrings <- unlist(sapply(data, all.substrings), recursive = FALSE)
# Strip substrings of length 1
substrings <- substrings[lapply(substrings, length) > 1]
# Paste them together so you can easily aggregate
substrings.pasted <- sapply(substrings, paste, collapse = ' ')
# Aggregate and sort
sorted.results <- sort(table(substrings.pasted), decreasing = TRUE)
# Display the top entries
head(sorted.results)
# 10 3 4 10  8 1  9 4  9 2 8 10
#   52   50   48   48   47   45
```

So the sequence `10 3` was present 52 times, the most of any substring.