Text content analyzer in Python

I created a text content analyzer in Python that reads input from a file and outputs:

  1. The total word count
  2. The count of unique words
  3. The number of sentences

Here is the code:

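import re
import string
import sys

def normalize(word):
    # Lowercase the word and strip all punctuation from it.
    return re.sub("[%s]" % re.escape(string.punctuation), "", word.lower())

def main():
    words_list = []

    # Collect every whitespace-separated token in the file.
    with open(sys.argv[1], "r") as f:
        for line in f:
            words_list.extend(line.split())

    print("Total word count:", len(words_list))

    # Normalize before deduplicating; drop tokens that were only punctuation.
    new_words = [w for w in map(normalize, words_list) if w]

    print("Unique words:", len(set(new_words)))

    # Count tokens that end in '.', '!' or '?', optionally followed by quotes.
    nb_sentence = 0
    for word in words_list:
        if re.search(r"[.!?]['\"]*$", word):
            nb_sentence += 1

    print("Sentences:", nb_sentence)

if __name__ == "__main__":
    main()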

I am now trying to calculate the average sentence length in words, find frequently used phrases (a phrase of 3 or more words used more than 3 times), and list the words used in order of descending frequency. Could anyone help?

Answers


Here are some approaches that could help:

  • For the average sentence length in words, you could split on periods to get an array of sentences, then split each sentence in that array on spaces to get an array of words in each sentence. You could then calculate the length of each words array in the sentences array and average those lengths.

  • To make a list of words used in order of descending frequency, you could split the text on spaces, iterate over each word, and store the count in a dictionary where the key is a word and the value is the number of occurrences. You could then iterate over the keys in that dictionary, create tuples of the words and counts, and sort those tuples by count to find the most common words. Here is a solution to a related problem, solving for common characters in a string: https://gist.github.com/geoff616/6df5320a1f720411a180

  • For frequently used phrases (a phrase of 3 words used more than 3 times), you could do the same calculation as above but split on every third space (with a regex) instead of analyzing each word individually, and filter out anything with a count of 3 or less. Calculating common phrases of 3 or more words is trickier, but if you solve all of the previous problems, the answer might become more apparent.
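
The three approaches above could be sketched roughly as follows. This is a minimal Python 3 sketch, not a definitive implementation. The function names (split_sentences, tokenize, average_sentence_length, word_frequencies, common_phrases) are hypothetical, and it makes a few assumptions that go beyond the description: sentences are split on '.', '!' and '?' rather than on periods only, collections.Counter stands in for the hand-built dictionary, and phrases are counted with an overlapping sliding window within each sentence rather than by splitting on every third space, so that a phrase is found regardless of where it starts.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' (a rough heuristic)."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())


def average_sentence_length(text):
    """Mean number of words per sentence; 0.0 if there are no sentences."""
    lengths = [len(tokenize(s)) for s in split_sentences(text)]
    lengths = [n for n in lengths if n > 0]
    return sum(lengths) / len(lengths) if lengths else 0.0


def word_frequencies(text):
    """List of (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def common_phrases(text, size=3, min_count=4):
    """Phrases of `size` consecutive words seen at least `min_count` times.

    Assumption: an overlapping sliding window inside each sentence,
    with min_count=4 standing for "used more than 3 times".
    """
    counts = Counter()
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        for i in range(len(words) - size + 1):
            counts[" ".join(words[i:i + size])] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]
```

To cover phrases of 3 or more words, one option is to call common_phrases repeatedly with size=3, size=4, and so on, stopping once a size returns no results. Abbreviations such as "e.g." or decimal numbers will still be split as sentence boundaries by this heuristic.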

