Counting multiple letter groups in a string

I've been trying to adapt my Python function to count groups of letters instead of single letters, and I'm having a bit of trouble. Here's the code I have for counting individual letters:

my_seq = "CTAAAGTCAACCTTCGGTTGACCTTGAAAGGGCCTTGGGAACCTTCGGTTGACCTTGAGGGTTCCCTAAGGGTT"

def count_letters(seq):
    # Parameter renamed from str so it no longer shadows the built-in type.
    counts = {}
    for c in seq:
        if c in counts:
            counts[c] += 1
        else:
            counts[c] = 1
    return counts

counts = count_letters(my_seq)
print(counts)

The function currently returns a count for each individual letter, so it prints this:

{'C': 23, 'T': 30, 'G': 30, 'A': 20}

Ideally, I'd like it to print something like this:

{'CTA': 2, 'TAG': 3, 'CGC': 1, 'GAG': 2 ... }

I'm very new to Python and this is proving to be difficult.

Thank you!

Answers


This can be done pretty quickly using collections.Counter.

from collections import Counter

s = "CTAACAAC"

def chunk_string(s, n):
    return [s[i:i+n] for i in range(len(s)-n+1)]

counter = Counter(chunk_string(s, 3))
# Counter({'AAC': 2, 'ACA': 1, 'CAA': 1, 'CTA': 1, 'TAA': 1})

Edit: To elaborate on chunk_string:

It takes a string s and a chunk size n as arguments. Each s[i:i+n] is a slice of the string that is n characters long. The loop iterates over every index at which such a slice fits, from 0 to len(s)-n inclusive, and the list comprehension collects the slices into a list. An equivalent version with an explicit loop is:

def chunk_string(s, n):
    chunks = []
    last_index = len(s) - n
    for i in range(0, last_index + 1):
        chunks.append(s[i:i+n])
    return chunks
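
As a quick check of the explanation above, here is a minimal sketch that prints the slices for the short example string. The step parameter is an assumption added for illustration and is not part of the answer: step=1 reproduces the overlapping windows described above, while step=n would give back-to-back groups, in case the question meant reading the sequence in fixed frames.

```python
from collections import Counter

def chunk_string(s, n, step=1):
    # step is a hypothetical extra parameter, not in the original answer:
    # step=1 -> overlapping windows, step=n -> non-overlapping groups.
    return [s[i:i+n] for i in range(0, len(s) - n + 1, step)]

s = "CTAACAAC"
overlapping = chunk_string(s, 3)        # windows starting at indices 0..5
in_frame = chunk_string(s, 3, step=3)   # windows starting at indices 0 and 3

print(overlapping)  # ['CTA', 'TAA', 'AAC', 'ACA', 'CAA', 'AAC']
print(in_frame)     # ['CTA', 'ACA']
print(Counter(overlapping).most_common(1))  # [('AAC', 2)]
```

Note that with step=3 the trailing "AC" is dropped, because a full 3-letter slice no longer fits there.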

This is basically the same as the first answer, posted by Jared Goguen, but in reply to the OP's comment, here is a possible way to do it without importing a module:

>>> m
'CTAAAGTCAACCTTCGGTTGACCTTGAGGGTTCCCTAAGGGTTGGGGATGACCCTTGGGTCTAAAGTCAACCTTCGGTTGACCTTGAGGGTTCCCTAAGGGTT'
>>> l = [m[i:i+3] for i in range(len(m)-2)]
>>> 
>>> d = {}
>>> 
>>> for k in set(l):
        d[k] = l.count(k)


>>> d
{'AAG': 4, 'GGA': 1, 'AAA': 2, 'TAA': 4, 'AGG': 4, 'AGT': 2, 'GGG': 7, 'ACC': 5, 'CGG': 2, 'GGT': 7, 'TCC': 2, 'TGA': 5, 'CAA': 2, 'TGG': 2, 'GTC': 3, 'AAC': 2, 'ATG': 1, 'CTT': 5, 'TCA': 2, 'CCT': 7, 'CCC': 3, 'GTT': 6, 'TTG': 6, 'GAT': 1, 'GAC': 3, 'TCG': 2, 'GAG': 2, 'CTA': 4, 'TTC': 4, 'TCT': 1}

Or, if you are a fan of one-liners:

>>> d = {k:l.count(k) for k in set(l)}
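
One caveat: l.count(k) rescans the whole list once for every distinct group, which can get slow on long sequences. A minimal sketch of a single-pass alternative, still without imports, assuming overlapping groups as above (the name count_groups is made up for illustration):

```python
def count_groups(seq, n=3):
    # Hypothetical helper: counts every overlapping n-letter group in one pass.
    counts = {}
    for i in range(len(seq) - n + 1):
        group = seq[i:i+n]
        # dict.get returns 0 the first time a group is seen
        counts[group] = counts.get(group, 0) + 1
    return counts

d = count_groups("CTAACAAC")
print(d)  # {'CTA': 1, 'TAA': 1, 'AAC': 2, 'ACA': 1, 'CAA': 1}
```

This is the same pattern as the count_letters function in the question, with the single character replaced by a slice.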
