Find most repeated phrase on huge text

I have huge text data. My entire database is text format in UTF-8

I need to have list of most repeated phrase on my whole text data.

For example my desire output something like this:

  'a': 423412341,
  'this': 423412341,
  'is': 322472341,
  'this is': 222472341,
  'this is a': 122472341,
  'this is a my': 5235634

Process and store each phrase take huge size of database. For example store in MySQL or MongoDB. Question is is there any more efficient database or alghorithm for find this result ? Solr, Elasticsearch or etc ...

I think i have max 10 words in each phrase can be good for me.


I'd suggest combining ideas from two fields, here: Streaming Algorithms, and the Apriori Algorithm From Market-Basket Analysis.

  1. Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams]), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).

  2. Now to the problem of k most frequent phrases (of possibly multiple phrases). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.

A Short Description of The Sampling Algorithm

This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.

In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.

In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.

Output the top k of the items resulting from the second stage.

Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.

If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:

Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:

Have you considered using MapReduce?

Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens up to 10 words. I don't think that's a big deal. The outcome from the MR job will be token -> frequency pairs, which you can pass to another job to sort them on the frequencies (one option). I would suggest to read up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediary outputs.

Original paper on MapReduce by Google.

tokenize it by 1 to 10 words and insert into 10 SQL tables by token lengths. Make sure to use hash index on the column with string tokens. Then just call SELECT token,COUNT(*) FROM tablename GROUP BY token on each table and dump results somewhere and wait.

EDIT: that would be infeasible for large datasets, just for each N-gram update the count by +1 or insert new row into table (in MYSQL would be useful query INSERT...ON DUPLICATE KEY UPDATE). You should definitely still use hash indexes, though.

After that just sort by number of occurences and merge data from these 10 tables (you could do that in single step, but that would put more strain on memory).

Be wary of heuristic methods like suggested by Ami Tavory, if you select wrong parameters, you can get wrong results (flaw of sampling algorithm can be seen on some classic terms or phrases - e.g. "habeas corpus" - neither habeas nor corpus will be selected as frequent by itself, but as a 2 word phrase it may very well rank higher than some phrases you get by appending/prepending to common word). There is surely no need to use them for tokens of lesser length, you could use them only when classic methods fail (take too much time or memory).

The top answer by Amy Tavori states:

Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.

While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the frequency of 2-grams are bounded by the frequency of 1-grams. To illustrate, consider the following corpus (constructed specifically to illustrate this point):

Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.

Looking at the most frequent single words, we get:

1-Gram  Frequency
------  ---------
a       12
will    6
corpus  5
tricksy 4
or      3
from    2
it      2
the     2
very    2
you     2

The method suggested by Ami Tavori would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:

2-Gram          Frequency
------          ---------
corpus will     5
tricksy corpus  4
or a            3
a very          2

And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:

3-Gram                Frequency
------                ---------
tricksy corpus will   4

To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.

This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use the PHP regex function to extract matches. Keep the total in a global variable. Write the results to another file. That's it.

Need Your Help

jQuery Selectors: Select element with specific class and 'title' attribute

jquery class jquery-selectors attributes title

This should be an easy one, but I can't find a good example anywhere. I'm trying to select a particular element on my page that has a class of 'statuslight' and a title attribute of one of three op...

Reason for globals() in Python?

python global-variables global

What is the reason of having globals() function in Python? It only returns dictionary of global variables, which are already global, so they can be used anywhere... I'm asking only out of curiosity,