Counting different letter k-mers with scikit-learn

I'm working on extracting the frequencies of different amino acid letters from protein sequences. I'm also working on different "reduced" representations of the alphabet (i.e., instead of 20 letters, I want some letters to be treated as equivalent: [K],[R] -> [KR], etc.).

1) What is an efficient way to extract the frequency counts of different k-mers (i.e., overlapping counts of letter runs of length 1, 2, 3) from a protein sequence, preferably using the built-in scikit-learn tools (such as CountVectorizer and the like)? I can generate the possible combinations myself and count them in the string, but this is quite inefficient, and I want to use scikit-learn's tools in my pipeline. Those tools seem to be designed for words, though, not for runs of letters inside a single long word.

2) Is there an efficient way to get k-mer letter counts/frequencies, using scikit-learn's CountVectorizer or the like, for different alphabets? That is, I would like to feed the translation table to the method and get the 2-mer frequencies of the reduced alphabet directly, rather than inefficiently recalculating the possible combinations and their frequencies myself for each sequence.

Preserving order is also important, since I need the feature "names" at the end as well (to use as the names of the feature columns in the output). Thank you very much!


You want a list/dict with every "word" in your proteins. Let's suppose you have the following proteins:

prot_1 = "mklfgsmhee"
prot_2 = "heelyiggis"

You want a function that returns all the overlapping words of length n, like so:

>>> words_prot_1 = wording(prot_1, 3)
>>> print(words_prot_1)
['mkl', 'klf', 'lfg', 'fgs', 'gsm', 'smh', 'mhe', 'hee']
>>> words_prot_2 = wording(prot_2, 3)
>>> print(words_prot_2)
['hee', 'eel', 'ely', 'lyi', 'yig', 'igg', 'ggi', 'gis']
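The wording function is not part of any library; it is just a name used here for illustration. A minimal sketch of one way to write it, assuming plain Python strings as input, is a sliding window over the sequence:

```python
def wording(sequence, n):
    """Return every overlapping substring of length n, in order of appearance."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # The last window starts at len(sequence) - n; shorter sequences give [].
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


prot_1 = "mklfgsmhee"
prot_2 = "heelyiggis"

words_prot_1 = wording(prot_1, 3)
words_prot_2 = wording(prot_2, 3)
```

A sequence shorter than n simply yields an empty list, which keeps the later steps from failing on short fragments.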

Cycle through your proteins to create either a dict or a list like this:

kmers_3 = ["mkl", "klf", "lfg", "fgs", "gsm", "smh", "mhe", "hee",
           "eel", "ely", "lyi", "yig", "igg", "ggi", "gis"]
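One possible way to build that list, sketched under the assumption that wording is the sliding-window helper described above (redefined here so the snippet runs on its own). Since the column order matters for the feature names, this uses dict.fromkeys, which drops duplicates but keeps first-seen order on Python 3.7+; a plain set() would not guarantee any order.

```python
def wording(sequence, n):
    """Hypothetical helper: all overlapping substrings of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


proteins = ["mklfgsmhee", "heelyiggis"]

# dict keys are unique and keep insertion order, so each k-mer appears once,
# at the position where it was first seen.
kmers_3 = list(dict.fromkeys(
    kmer for prot in proteins for kmer in wording(prot, 3)
))
```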

(Note that the repeated term "hee" appears only once; this is easily done with set(list) or by inserting the words as keys in a dict.) Now the same words_prot_1 can be fed to CountVectorizer, faking the text with a str.join(). Something like:

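from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(vocabulary=kmers_3)
# fit_transform expects an iterable of documents, so wrap the joined string in a list
cv.fit_transform([" ".join(words_prot_1)]).toarray()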
