Counting different letter k-mers with scikit-learn
I'm working on extracting the frequencies of different amino acid letters from protein sequences. I'm also working on different "reduced" representations of the alphabet (i.e., instead of 20 letters, I want some letters to be treated as equivalent: [K],[R] -> [KR], etc.).
What is an efficient way to:
1) Extract the frequency counts of different k-mers (i.e., overlapping counts of letter runs of length 1, 2, 3) from a protein sequence, preferably using the built-in scikit-learn tools (such as CountVectorizer and the like)? I can generate the possible combinations myself and count them in the string, but this is quite inefficient, and I want to use scikit-learn's tools in my pipeline. Those tools are built for words, though, not for multiple letters inside a single long word.
2) Get k-mer letter counts/frequencies, using scikit-learn's CountVectorizer or the like, for different alphabets? That is, feed the translation table to the method and get the 2-mer frequencies of the reduced alphabet directly, rather than inefficiently recalculating the possible combinations and their frequencies myself for each sequence.
Preserving order is also important, since I need the feature "names" at the end as well (to use as the column names of the output features). Thank you very much!
You want a list/dict with every "word" in your proteins. Let's suppose you have the following proteins:
prot_1 = "mklfgsmhee"
prot_2 = "heelyiggis"
You want a function that returns all the words of length n, like so:
>>> words_prot_1 = wording(prot_1, 3)
>>> print(words_prot_1)
['mkl', 'klf', 'lfg', 'fgs', 'gsm', 'smh', 'mhe', 'hee']
>>> words_prot_2 = wording(prot_2, 3)
>>> print(words_prot_2)
['hee', 'eel', 'ely', 'lyi', 'yig', 'igg', 'ggi', 'gis']
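The wording function itself is not spelled out here. A minimal sketch, assuming it simply slides a window of length n over the sequence one letter at a time (the name and signature are just the ones used in the session above), could look like this:

```python
def wording(sequence, n):
    """Return every overlapping substring of length n, in order of appearance."""
    # A sequence of length L yields L - n + 1 windows (none if L < n).
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


prot_1 = "mklfgsmhee"
prot_2 = "heelyiggis"

words_prot_1 = wording(prot_1, 3)
words_prot_2 = wording(prot_2, 3)
```

Because the window moves one position at a time, the k-mers overlap, which is what the question asks for.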
Cycle through your proteins to create either a dict or a list like this:
kmers_3 = ["mkl", "klf", "lfg", "fgs", "gsm", "smh", "mhe", "hee", "eel", "ely", "lyi", "yig", "igg", "ggi", "gis"]
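One possible way to build that list while keeping the k-mers in first-seen order is sketched below. It assumes a sliding-window helper named wording (hypothetical, defined inline here) and relies on dicts preserving insertion order, which is guaranteed from Python 3.7 on:

```python
def wording(sequence, n):
    """Return every overlapping substring of length n, in order of appearance."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


proteins = ["mklfgsmhee", "heelyiggis"]

# dict.fromkeys drops duplicates but keeps the first occurrence of each k-mer,
# so the column order of the resulting features is reproducible.
kmers_3 = list(dict.fromkeys(
    kmer for protein in proteins for kmer in wording(protein, 3)
))
```

A plain set would also remove the duplicates, but it does not guarantee any particular order, which matters if the list is later used as feature names.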
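(Note that the repeated term "hee" appears only once; this is easily done with set(list) or by inserting the k-mers as keys in a dict. If the order of the feature names matters, prefer the dict, since a set does not preserve insertion order.) Now the same words_prot_1 can be fed to CountVectorizer, faking the text with str.join(). Something like: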
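from sklearn.feature_extraction.text import CountVectorizer
# fit_transform expects an iterable of documents, so wrap the joined string in a list
cv = CountVectorizer(vocabulary=kmers_3)
cv.fit_transform([" ".join(words_prot_1)]).toarray()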