C++ Counting Map
Recently I was dealing with what I am sure is a very common problem, which essentially boils down into the following:
Given a long text, calculate the frequency of each word occurring in the text.
I was able to solve this problem using std::unordered_map. This, however, turned quite ugly, as for every word in the text, if that's already been encountered I had to do a find, erase, and then a re-insert into the map with the value incremented.
I realise there are other ways of doing this, such as using a hashing function on top of a vanilla array/vector and increment value there, but I was wondering if there was a more elegant way of solving this problem, like an STL component, or function. that would have a similar interface to Pythons Counter collections.
I know C++ being C++ I can't really expect such high level concepts to always be implemented for me, but was just wondering if you guys new about anything (or at least your Googling skills are superior to mine) which could make my code a little nicer.
I'm not quite sure why an std::unordered_map (or just std::map) would involve much complexity. I'd write the code something like this:
std::unordered_map<std::string, int> words; std::string word; while (word = getword(input)) ++words[word];
There's no need for any kind of find/erase/reinsert.
In case it's not clear how/why this works: operator will create an entry for a value if none exists yet in the map. The associated value will be a value-initialized object of the specified type, which will be zero in the case of an int (or similar). We then increment that every time we encounter the word.
An alternative solution:
std::multiset<std::string> m; for (auto w: words) m.insert(w); m.count("some word");
The advantage is that you don't have to rely on the 'trick' with operator, making the code more readable.
EDIT: As Kerrek pointed out in the comments, this solution is slower. multiset stores all the elements you insert, even if they are deemed equal (they might still differ in some aspect that operator== does not check). This causes a significant overhead compared to unordered_map<std::string, int>, which only has to store each word once.
(As a side note, processing the complete works of William Shakespeare using the map solution takes about 0.33s on my machine, as opposed to 0.78s for the multiset solution.)