Group by In HBase
I know almost nothing about HBase, so sorry for the basic questions.
Imagine I have a table of 100 billion rows with 10 int, one datetime, and one string column.
- Does HBase allow querying this table and grouping the results by key (even a composite key)?
- If so, does it have to run a map/reduce job to do it?
- How do you feed it the query?
- Can HBase in general perform real-time like queries on a table?
Data aggregation in HBase overlaps with the need for "real-time analytics". HBase was not built for this kind of functionality, but there is a lot of demand for it, so a number of ways to do it exist or are being developed.
1) Register the HBase table as an external table in Hive and do the aggregations there. The data is accessed through the HBase API, which is not that efficient. The discussion "Configuring Hive with Hbase" covers how it can be done. This is the most powerful way to group by over HBase data. It does imply running MR jobs, but they are run by Hive, not by HBase.
2) You can write your own MR job that works on the HBase data sitting in HFiles in HDFS. This is the most efficient way, because the data is not transferred through the HBase API; it is read straight from HDFS in a sequential manner. It is not simple, though, and the data you process will be somewhat stale.
3) The next version of HBase will contain coprocessors, which will be able to do aggregations inside specific regions. You can think of them as a kind of stored procedure from the RDBMS world.
4) An in-memory, inter-region MR job that is parallelized within one node is also planned for future HBase releases. It will enable somewhat more advanced analytical processing than coprocessors.
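To make the "aggregate inside regions" idea a bit more concrete, here is a minimal plain-Java sketch of the pattern: each region sums its own rows per group, and the client merges the partial results. It does not use any HBase classes. The sorted maps merely stand in for regions, and the row key layout "<groupKey>|<date>" as well as all class and method names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for region-side aggregation. No HBase API is used;
// every name here is hypothetical.
class RegionGroupBySketch {

    // Assumed composite row key layout: "<groupKey>|<date>".
    static String groupKeyOf(String rowKey) {
        return rowKey.substring(0, rowKey.indexOf('|'));
    }

    // Step 1: one "region" sums its own rows per group key.
    static Map<String, Long> aggregateRegion(Map<String, Long> regionRows) {
        Map<String, Long> partial = new TreeMap<>();
        for (Map.Entry<String, Long> row : regionRows.entrySet()) {
            partial.merge(groupKeyOf(row.getKey()), row.getValue(), Long::sum);
        }
        return partial;
    }

    // Step 2: the client merges the per-region partial sums.
    static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((group, sum) -> total.merge(group, sum, Long::sum));
        }
        return total;
    }

    // Two made-up regions; note that group "fr" spans both of them.
    static Map<String, Long> demo() {
        Map<String, Long> region1 = new TreeMap<>();
        region1.put("de|2011-01-01", 5L);
        region1.put("de|2011-01-02", 7L);
        region1.put("fr|2011-01-01", 1L);

        Map<String, Long> region2 = new TreeMap<>();
        region2.put("fr|2011-01-02", 4L);
        region2.put("us|2011-01-01", 10L);

        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(aggregateRegion(region1));
        partials.add(aggregateRegion(region2));
        return mergePartials(partials);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {de=12, fr=5, us=10}
    }
}
```

The merge step matters because a group can straddle a region boundary, so no single region sees all of its rows. Only the small partial maps travel to the client, not the raw rows.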
FAST RANDOM READS = PRE-PREPARED data sitting in HBase! Use HBase for what it is...
1. A place to store a lot of data.
2. A place from which you can do super fast reads.
3. A place where SQL is not gonna do you any good (use Java).
Although you can read data from HBase and do all sorts of aggregates right in Java data structures before you return your aggregated result, it's best to leave the computation to MapReduce. From your questions, it seems as if you want the source data for the computation to sit in HBase. If this is the case, the route you want to take is to use HBase as the source for a MapReduce job, do the computations there, and return the aggregated data. But then again, why would you read from HBase to run a MapReduce job? Just leave the data sitting in HDFS/Hive tables, run the MapReduce jobs on them, and THEN load the results into HBase tables "pre-prepared" so that you can do super fast random reads on them.
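As a rough illustration of the "pre-prepare, then read" flow, here is a small plain-Java sketch. A batch step rolls raw records up into one value per composite key, and a serving step does a single-key lookup. The HashMap is only a stand-in for the HBase table, the loop is only a stand-in for the MapReduce or Hive job, and the Event class, the "country|day" key format, and the method names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for "aggregate in batch, then serve random reads".
// No HBase or Hadoop API is used; every name here is hypothetical.
class PreAggregateSketch {

    // Hypothetical raw record, as it might sit in HDFS or a Hive table.
    static final class Event {
        final String country;
        final String day;
        final long amount;

        Event(String country, String day, long amount) {
            this.country = country;
            this.day = day;
            this.amount = amount;
        }
    }

    // Assumed composite key format for the serving table.
    static String rowKey(String country, String day) {
        return country + "|" + day;
    }

    // Batch step (stand-in for the MapReduce/Hive job):
    // collapse the raw events into one total per (country, day).
    static Map<String, Long> buildServingTable(List<Event> raw) {
        Map<String, Long> table = new HashMap<>();
        for (Event e : raw) {
            table.merge(rowKey(e.country, e.day), e.amount, Long::sum);
        }
        return table;
    }

    // Serving step: one lookup by key, analogous to a random read by row key.
    // A missing key is reported as 0 in this sketch.
    static long lookup(Map<String, Long> table, String country, String day) {
        return table.getOrDefault(rowKey(country, day), 0L);
    }

    public static void main(String[] args) {
        List<Event> raw = Arrays.asList(
                new Event("de", "2011-01-01", 5),
                new Event("de", "2011-01-01", 7),
                new Event("fr", "2011-01-01", 1));
        Map<String, Long> table = buildServingTable(raw);
        System.out.println(lookup(table, "de", "2011-01-01")); // prints 12
    }
}
```

The point of the split is that the expensive grouping happens once, offline, while the online path never scans or aggregates anything. It only fetches an answer that was computed ahead of time.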
Once you have the preaggregated data in HBase, you can use Crux http://github.com/sonalgoyal/crux to further drill, slice and dice your HBase data. Crux supports composite and simple keys, with advanced filters and group by.