How to handle massive data queries and keep the time within 1 sec?
I am thinking through a problem: suppose I have a table, and the data in it keeps growing, to thousands, millions, billions of rows. One day, I think even a simple query will need several seconds to run. So is there any means by which we can keep the time within 1 second, or some other reasonable time?
Partitioning. The fastest I/Os you can do are the ones you don't need to do.
Indexing. As appropriate, not for every column. You can't make every query run at memory speed, so you have to pick and choose.
Realism. You're not processing a billion I/Os through a relational engine in a single second.
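To make the indexing point above concrete, here's a minimal sketch using SQLite (the table, column, and index names are invented for illustration). It shows the query plan switching from a full table scan to an index lookup once an index exists on the filtered column:

```python
import sqlite3

# Sketch: watch the query plan change when an index is added.
# Table/column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events (user_id, payload) VALUES (?, ?)",
                 [(i % 1000, "x") for i in range(100_000)])

def plan_for(sql):
    # The last column of EXPLAIN QUERY PLAN output describes the access strategy.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan_for("SELECT * FROM events WHERE user_id = 42")   # full table scan
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = plan_for("SELECT * FROM events WHERE user_id = 42")    # index search
print(before)
print(after)
```

The same "pick and choose" caveat applies: every index you add speeds up reads that match it, but slows down every insert and update that has to maintain it.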
Sure, spread it out.
You can use something like Hive ( http://wiki.apache.org/hadoop/Hive ) for SQL queries.
It will take a few minutes per query, whether you have 100 thousand rows or 100 billion rows. Your data will live on many different computers, and through the magic of hadoop, your query will go out to where the data lives, run against that part, and come back with the results.
Or for faster queries with more limitations, look at Hbase ( http://hbase.apache.org/#Overview ). It also sits on top of hadoop, and is a little faster at the tradeoff of less SQL like.
Think you should community wiki this, as there won't be a single correct answer (or you should get a lot more specific in your question).
Firstly, expanding on Tim's indexing point. Btree indexes are like an upside-down pyramid. Your root/'level 0' block may point to a hundred 'level 1' blocks. They each point to a hundred 'level 2' blocks, and those each point to a hundred 'level 3' blocks. That's a million 'level 3' blocks, which can point to a hundred million data rows. That's five reads to get to any row in that dataset (and probably all but the last two are cached in memory). One more level lifts your dataset by two orders of magnitude. Indexes scale REALLY well, so if your application's use case is playing with small data volumes in a very large dataset, you're fine.
Partitioning can be seen as an alternative form of indexing, where you want to quickly exclude a significant part of the work.
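A crude way to picture that exclusion is manual range partitioning: one table per month, so a query for a known month never touches the others. This is just a sketch with invented table names; real databases do this declaratively (e.g. `PARTITION BY RANGE`) and prune partitions for you:

```python
import sqlite3

# Sketch of manual range partitioning: one table per month.
# A query scoped to one month reads only that table's data.
conn = sqlite3.connect(":memory:")
for month in ("2023_01", "2023_02", "2023_03"):
    conn.execute(f"CREATE TABLE sales_{month} (day TEXT, amount REAL)")

conn.execute("INSERT INTO sales_2023_02 VALUES ('2023-02-14', 99.0)")

def query_month(conn, month):
    # Route the query to the single partition that can hold the data;
    # the other two tables are never opened, scanned, or cached.
    return conn.execute(f"SELECT SUM(amount) FROM sales_{month}").fetchone()[0]

print(query_month(conn, "2023_02"))
```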
Datawarehouse appliances are a second solution, when you expect to be dealing with large datasets within even larger datasets. Generally the solution is to throw disks at a problem, with or without CPUs/memory dedicated to those disks to split the problem.
Distributed databases are mostly solving a different form of scalability, that of large numbers of concurrent users. There's only so much memory a CPU can address, and therefore only so many users a CPU can cope with before they start fighting over memory. Replication worked to a degree, especially with older-style read-heavy applications. The problem the newer NoSQL databases are addressing is to do that and get consistent results, including managing backups and recoveries to restore consistency. They've generally done that by going for 'eventual consistency', accepting transient inconsistencies as the tradeoff for scalability.
I'd venture to say that there are few NoSQL databases where the data volume alone has precluded an RDBMS solution. Rather, it's been the user/transaction/write volume that has pushed people to distributed databases.
Solid State storage will also play a part. The problem with brown spinning disks recently has been less to do with capacity than with rotation speed. They can't spin fast enough to quickly access all the data you can store on them. Flash drives/cards/memory/cache basically take out the 'seek' time that is holding everything up.
Indexing will solve 90% of your problems. Finding one unique element out of a million in a balanced binary tree requires traversing only about 20 nodes (0.002% of the total number of records).
Depending on the data, you could build aggregation tables. So if you were recording a statistic sampled every 5 minutes, you could roll the data up into tables where each row represents an average reading over an hour, a day, and so on.
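As a minimal sketch of that rollup (the schema and timestamps are invented for illustration), here is an hourly aggregation table built from 5-minute samples in SQLite; reporting queries then hit the small summary table instead of the raw one:

```python
import sqlite3

# Sketch: roll 5-minute samples up into one hourly summary row.
# Schema and timestamps are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ts TEXT, value REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(f"2023-06-01 10:{m:02d}:00", float(m)) for m in range(0, 60, 5)])

# substr(ts, 1, 13) truncates '2023-06-01 10:05:00' to the hour '2023-06-01 10'.
conn.execute("""
    CREATE TABLE samples_hourly AS
    SELECT substr(ts, 1, 13) AS hour, AVG(value) AS avg_value, COUNT(*) AS n
    FROM samples GROUP BY hour
""")
hour, avg_value, n = conn.execute("SELECT * FROM samples_hourly").fetchone()
print(hour, avg_value, n)   # one row summarising 12 raw samples
```

In practice you'd refresh the summary table on a schedule (or maintain it with triggers) and keep the raw table only as long as you need the fine-grained history.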