C++ Counting Map

Recently I was dealing with what I am sure is a very common problem, which essentially boils down into the following:

Given a long text, calculate the frequency of each word occurring in the text.

I was able to solve this problem using std::unordered_map. This, however, turned quite ugly, as for every word in the text, if that's already been encountered I had to do a find, erase, and then a re-insert into the map with the value incremented.

I realise there are other ways of doing this, such as using a hashing function on top of a vanilla array/vector and increment value there, but I was wondering if there was a more elegant way of solving this problem, like an STL component, or function. that would have a similar interface to Pythons Counter collections.

I know C++ being C++ I can't really expect such high level concepts to always be implemented for me, but was just wondering if you guys new about anything (or at least your Googling skills are superior to mine) which could make my code a little nicer.

Answers


I'm not quite sure why an std::unordered_map (or just std::map) would involve much complexity. I'd write the code something like this:

std::unordered_map<std::string, int> words;

std::string word;
while (word = getword(input))
   ++words[word];

There's no need for any kind of find/erase/reinsert.

In case it's not clear how/why this works: operator[] will create an entry for a value if none exists yet in the map. The associated value will be a value-initialized object of the specified type, which will be zero in the case of an int (or similar). We then increment that every time we encounter the word.


An alternative solution:

std::multiset<std::string> m;
for (auto w: words) m.insert(w);

m.count("some word");

The advantage is that you don't have to rely on the 'trick' with operator[], making the code more readable.

EDIT: As Kerrek pointed out in the comments, this solution is slower. multiset stores all the elements you insert, even if they are deemed equal (they might still differ in some aspect that operator== does not check). This causes a significant overhead compared to unordered_map<std::string, int>, which only has to store each word once.

(As a side note, processing the complete works of William Shakespeare using the map solution takes about 0.33s on my machine, as opposed to 0.78s for the multiset solution.)


Need Your Help

Calculating work hours based on rolling days and intervals

vba excel-2010 worksheet-function

I've trying to calculate the intervals for a workday, where the hours have different categories/earnings, and the standard hours are different depending on the days. I am using Excel 2010 for this,...

How to configuring apache CAMEL in KURA framework

java eclipse apache-camel osgi rhiot

I am working in Apache camel which has to be included in KURA framework for throttling and some other purposes,so i have followed this link.