Determining Word Frequency of Specific Terms

I'm a non-computer-science student doing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting these frequencies over time to identify changes and trends. While I have figured out how to determine word frequencies for a given text file, I am dealing with a (relatively, for me) large number of files (>100), and for consistency's sake I would like to limit the words included in the frequency count to a specific set of terms (sort of like the opposite of a "stop list").

This should be kept very simple. In the end, all I need is the frequencies of the specific words for each text file I process, preferably in spreadsheet format (a tab-delimited file), so that I can then create graphs and visualizations from that data.

I use Linux day-to-day, am comfortable using the command line, and would love an open-source solution (or something I could run with WINE). That is not a requirement, however.

I see two ways to solve this problem:

  1. Find a way to strip out all the words in a text file EXCEPT for those on the pre-defined list and then do the frequency count from there, or:
  2. Find a way to do a frequency count using just the terms from the pre-defined list.

Any ideas?


I would go with the second idea. Here is a simple Perl program that reads a list of words from the first file given on the command line and prints, in tab-separated format, how many times each of those words occurs in the second file. The word list in the first file should contain one word per line.


use strict;
use warnings;

my $word_list_file = shift;
my $process_file = shift;

my %word_counts;

# Open the word list file, read a line at a time, remove the newline,
# add it to the hash of words to track, initialize the count to zero
open(my $words_fh, '<', $word_list_file) or die "Failed to open list file: $!\n";
while (<$words_fh>) {
  chomp;
  next unless length;  # Skip blank lines in the word list
  # Store words in lowercase for case-insensitive match
  $word_counts{lc($_)} = 0;
}
close($words_fh);

# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), iterate through each word incrementing
# the word count in the word hash if the word is in the hash
open(my $text_fh, '<', $process_file) or die "Failed to open process file: $!\n";

while (<$text_fh>) {
  chomp;
  while ( s/-$// ) {
    # The line ended in a hyphen, which the substitution just removed;
    # append the next line and check again until a line has no trailing hyphen
    my $next_line = <$text_fh>;
    last unless defined $next_line;
    chomp $next_line;
    $_ .= $next_line;
  }

  my @words = split /\b/, lc; # Split the lower-cased version of the string
  foreach my $word (@words) {
    $word_counts{$word}++ if exists $word_counts{$word};
  }
}
close($text_fh);

# Print each word in the hash in alphabetical order along with the
# number of times encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts) {
  print "$word\t$word_counts{$word}\n";
}
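
Since the question involves more than 100 files, the same counting idea could be wrapped in a loop that writes one table for a spreadsheet. The following is only a sketch under some assumptions: the sub names `count_words_in_file` and `frequency_table` and the script name `count_many.pl` are made up for illustration, the layout (one row per file, one column per word) is just one reasonable choice, and the end-of-line hyphen handling is left out to keep it short.

```perl
use strict;
use warnings;

# Count occurrences of the tracked words in one file (case-insensitive).
# Returns a hash reference mapping each lower-cased word to its count.
sub count_words_in_file {
  my ($path, $tracked) = @_;
  my %counts = map { ( lc($_) => 0 ) } @$tracked;

  open(my $fh, '<', $path) or die "Failed to open $path: $!\n";
  while (my $line = <$fh>) {
    foreach my $word (split /\b/, lc $line) {
      $counts{$word}++ if exists $counts{$word};
    }
  }
  close($fh);
  return \%counts;
}

# Build a tab-separated table: a header row, then one row per file.
sub frequency_table {
  my ($tracked, @files) = @_;
  my %seen = map { ( lc($_) => 1 ) } @$tracked;
  my @columns = sort keys %seen;

  my @rows = ( join("\t", 'file', @columns) );
  foreach my $file (@files) {
    my $counts = count_words_in_file($file, $tracked);
    push @rows, join("\t", $file, map { $counts->{$_} } @columns);
  }
  return join("\n", @rows) . "\n";
}

# Command-line use (hypothetical script name):
#   perl count_many.pl words.txt text1.txt text2.txt > counts.tsv
if (@ARGV) {
  my $word_list_file = shift @ARGV;
  open(my $list_fh, '<', $word_list_file) or die "Failed to open list file: $!\n";
  chomp(my @tracked = <$list_fh>);
  close($list_fh);
  print frequency_table([ grep { length } @tracked ], @ARGV);
}
```

Redirecting the output to a file such as counts.tsv gives something a spreadsheet program can open directly, with the file names in the first column.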

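If the file words.txt contains:

frequencies
linux
science
words

And the file text.txt contains the text of your post, the following command (assuming the program above was saved as word_count.pl; the name is arbitrary):

perl word_count.pl words.txt text.txt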

will print:

frequencies     3
linux   1
science 1
words   3

Note that breaking on word boundaries using \b may not work the way you want in all cases. For example, if your text files contain words that are hyphenated across lines, you will need to do something a little more intelligent to match them. In that case you could check whether the last character in a line is a hyphen and, if it is, remove the hyphen and read another line before splitting the line into words.
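
To see what splitting on \b actually produces, here is a small sketch. The helper name `tokens` is made up for illustration; it only repeats the lower-case-and-split step from the program above.

```perl
use strict;
use warnings;

# Split a lower-cased copy of the text on word boundaries (\b).
sub tokens {
  my ($text) = @_;
  return split /\b/, lc $text;
}

# A hyphenated word comes out as separate pieces, and the hyphens
# and spaces between words become tokens of their own.
print join('|', tokens("State-of-the-art history")), "\n";
# prints: state|-|of|-|the|-|art| |history
```

Because every token is looked up in the hash on its own, a word list entry such as "state-of-the-art" would never be counted, while "state" and "art" would.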

Edit: Updated version that handles words case-insensitively and handles hyphenated words across lines.

Note that if some hyphenated words are broken across lines and others are not, this won't find them all, because it only removes hyphens at the end of a line. In that case you may want to remove all hyphens and match the words once the hyphens are gone (the entries in your word list would then need to be written without hyphens as well). You can do this by adding the following line right before the split function:

s/-//g;