Searching for a fast/efficient histogram algorithm (with pre-specified bins)

I don't do much coding outside of Matlab, but I have a need to export my Matlab code to another language, most likely C. My Matlab code includes a histogram function, histc(), that places my input data (which is double-precision, not integer) into a specified array of bins, to form a histogram.

I'm sure I can piece together a couple nested loops to generate a histogram function, but I need this function to be fast and memory-light, as it will be accessed repeatedly and often.

To avoid re-inventing the wheel, anyone know if C language has any existing histogram function(s) available for use, or whether people needing such a thing generally create it themselves?

Anyone know an efficient algorithm for creating a histogram? Pseudo-code is fine.

Thanks in advance.

Answers


GSL (GNU Scientific Library) contains a histogram implementation.

Here is the documentation: http://www.gnu.org/software/gsl/manual/html_node/Histograms.html.

And here is an example use: http://www.gnu.org/software/gsl/manual/html_node/Example-programs-for-histograms.html.


The "ideal" histogram algorithm will depend upon the range you expect to capture. Generally any histogram algorithm will look like this:

const int NSAMPLES = whatever;
double samples[NSAMPLES] = { 1.0, 3.93, 1e30, ... }; // your data set
const int NBUCKETS = 10; // or whatever
int counts[NBUCKETS] = { 0 };
for (int i = 0; i != NSAMPLES; ++i) {
    counts[TRANSFER(samples[i])]++;
}

where TRANSFER() is some function that maps your inputs to a bin (0th or Nth bin mapping to "out of range" of applicable).

The exact implementation of TRANSFER() depends a lot on the expected distribution of your sample and where you are interested in detail. Some common approaches I have seen:

  • uniform distribution in range [a,b] (requires linear transform)
  • logarithmic distribution of unsigned integer values (best when combined with some bit twiddling hacks to quickly determine the nearest power-of-two or similar).

If you don't know the distribution up-front, then you really can't have an efficient mechanism to bin them effectively: you'll either have to guess (biased or uninformative results) or store everything and sort it at the end, binning into equal-sized buckets (poor performance).


I've written my own histogram code in C, as it's simple enough that I didn't even think to look for a library. Normally you just need to create an array to contain the number of bins that you want [num_bins = (int)(val_max - val_min + 1);], and as you encounter each sample you can divide by the number of bins [bin_idx = (int)((value - val_min) / bin_width);] (where bin_width = (max-min)/num_bins) to find where it belongs and then increment the bin counter. This is an easy, fast, single pass through the data. Do check my arithmetic above for edge cases.

The problem you might encounter is that the domain of your input might not be known. Having 100 bins over the whole range of double isn't going to be much good if all your data is within only a small fraction of that. The solution is to make a first pass over the data to find the min/max of your range. There's really no quick fix to this and most libraries will ask for min/max up front.


Need Your Help

What is the etymology of <*> from Applicative in Haskell?

haskell notation applicative nomenclature

Where did the name &lt;*&gt; first begin to appear in literature or code, and did it come with any explanation for the choice of symbol?

Really simple way to deal with XML in Python?

python xml

Musing over a recently asked question, I started to wonder if there is a really simple way to deal with XML documents in Python. A pythonic way, if you will.