Text content analyzer in Python
I created a text content analyzer in Python that reads input from a file and outputs:
- The total word count
- The count of unique words
- The number of sentences
Here is the code:
    import re
    import string
    import sys


    def function(s):
        # Lowercase the word and strip all punctuation characters.
        return re.sub("[%s]" % re.escape(string.punctuation), '', s.lower())


    def main():
        words_list = []
        # Read the file named by the first command-line argument.
        with open(sys.argv[1], "r") as f:
            for line in f:
                words_list.extend(line.split())
        print("Total word count:", len(words_list))

        new_words = map(function, words_list)
        print("Unique words:", len(set(new_words)))

        # Count the words that contain sentence-ending punctuation.
        nb_sentence = 0
        for word in words_list:
            if re.search(r'[.!?][' "'" '"]*', word):
                nb_sentence += 1
        print("Sentences:", nb_sentence)


    if __name__ == "__main__":
        main()
I am now trying to calculate the average sentence length in words, find frequently used phrases (a phrase of 3 or more words used over 3 times), and make a list of the words used, in order of descending frequency. Could anyone help?
Here are some approaches that could help:
For the average sentence length in words, you could split the text on periods (and, to match your sentence counter, on exclamation and question marks too) to get an array of sentences, then split each sentence in that array on spaces to get an array of the words in each sentence. You could then take the length of each words array and average those lengths.
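The sentence-splitting idea above could be sketched roughly as follows. The function name `average_sentence_length` is made up for illustration, and the sketch assumes that any run of `.`, `!` or `?` ends a sentence, so abbreviations such as "e.g." would be miscounted.

```python
import re


def average_sentence_length(text):
    """Return the mean number of words per sentence (hypothetical helper)."""
    # Assumption: any run of '.', '!' or '?' marks the end of a sentence.
    sentences = re.split(r"[.!?]+", text)
    # Count the words in each sentence, skipping empty fragments.
    lengths = [len(s.split()) for s in sentences if s.split()]
    if not lengths:
        return 0.0
    return sum(lengths) / float(len(lengths))
```

For example, `average_sentence_length("One two three. Four five.")` averages sentence lengths of 3 and 2 words.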
To make a list of the words used in order of descending frequency, you could split the text on spaces, iterate over each word, and store the counts in a dictionary where the key is a word and the value is its number of occurrences. You could then iterate over the keys in that dictionary, create tuples of the words and counts, and sort those tuples by count to find the most common words. Here is a solution to a related problem, solving for common characters in a string: https://gist.github.com/geoff616/6df5320a1f720411a180
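One possible sketch of that dictionary approach is shown here. The name `word_frequencies` is hypothetical, and the punctuation stripping mirrors the `function` helper from the question. Ties are broken alphabetically, which is an arbitrary choice.

```python
import re
import string


def word_frequencies(text):
    """Return (word, count) tuples sorted by descending count (hypothetical helper)."""
    strip_punct = re.compile("[%s]" % re.escape(string.punctuation))
    counts = {}
    for raw in text.split():
        # Normalize the same way the question's code does.
        word = strip_punct.sub("", raw.lower())
        if word:  # skip tokens that were punctuation only
            counts[word] = counts.get(word, 0) + 1
    # Highest count first; alphabetical order breaks ties.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

The standard library's `collections.Counter` does the same counting, and its `most_common()` method returns the pairs already sorted by count.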
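For frequently used phrases (a phrase of 3 words used over 3 times), you could do the same calculation as above, but group the text into three-word chunks (for example, by splitting on every third space with a regex) instead of analyzing each word individually, then filter out anything that occurs 3 times or fewer. Note that fixed chunks only catch phrases that happen to start on a chunk boundary, so counting every run of three consecutive words is more thorough. Calculating common phrases of 3 or more words is trickier, but if you solve all of the previous problems, the answer might become more apparent.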
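As a rough sketch of the phrase counting, the code below uses a sliding window over consecutive words rather than splitting on every third space, so that phrases starting at any position are counted. The name `common_phrases` and its parameters `size` and `min_count` are invented for illustration. It assumes "used over 3 times" means at least 4 occurrences, and it lets phrases run across sentence boundaries.

```python
import re
import string


def common_phrases(text, size=3, min_count=4):
    """Return (phrase, count) tuples for phrases of `size` words (hypothetical helper)."""
    strip_punct = re.compile("[%s]" % re.escape(string.punctuation))
    words = [strip_punct.sub("", w.lower()) for w in text.split()]
    words = [w for w in words if w]  # drop punctuation-only tokens
    counts = {}
    # Slide a window of `size` words across the text, one word at a time.
    for i in range(len(words) - size + 1):
        phrase = " ".join(words[i:i + size])
        counts[phrase] = counts.get(phrase, 0) + 1
    # Keep only phrases seen at least `min_count` times.
    frequent = [(p, c) for p, c in counts.items() if c >= min_count]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))
```

To look for longer phrases, one option is to call the function again with `size=4`, `size=5`, and so on, stopping once a call returns an empty list.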