Storing arrays in databases

What is the most efficient way to store large arrays (10000x100) in a database, say, hsqldb? I need to do this for a certain math program that I'm writing in java. Please help. The whole array will be retrieved and stored often (not so much individual elements). Also, some meta-data about the array needs to be stored about the array.

Answers


Great question.

Unless you want to translate your arrays into a set of normalized tables, which it sounds like you don't, you might want to contemplate serialization.

Serialization is a fancy word for turning objects into some format that you can save to disk or a database. The two major formats for serialization are binary and XML, and I'm betting that Java has some support for it.

Depending on what data types you're using, you should be able to turn your array into XML or binary and then save that to a single field in the database. You could get started with this technique in Java by checking out http://java.sun.com/developer/technicalArticles/Programming/serialization/. I know that it's built into .NET.

Hope that this helps. Let me know if I can give you any more direction.


How about storing the data as a BLOB and using Java to decode the BLOB into an actual Java array? It would be much more efficient for storing and retrieving the whole array in one gulp, but would be terrible for twiddling individual elements.


Come up with an internal representation -- be it XML, JSON, some binary file you come up with yourself, or any other form of serialization.

Store it in a table using the "blob" datatype. Store any metadata associated with the matrix in additional columns.

I strongly disagree that the way to do it is to create a table with the same number of rows and columns as your matrix -- that is a very high price to pay for functionality you don't use.

Prepare your insert/select statements beforehand, and use bind variables to change what matrix you are working with -- don't make the db reparse every request.


If its is only 1 array, why not use a binary file?


As allready suggested: Don't use a RDBMS if you don't need the features. Instead of Serialization though you might want to concider a low level API like JDBM that provides some database like features like managing an on-disk index.


If your data is densely packed (the values histogram is close to flat line), your best choice is blob and serialization using Object[Output/Input]Stream.

Otherwise, you might find it more efficient to use sparse arrays and variation of Entity-Attribute-Value schema. Here is an example:

 Name | IndexKey  | Value
------+-----------+-------
 foo  | 'default' | 39        
 foo  | 0:0:0     | 23
 foo  | 0:0:1     | 34
 foo  | 1:5:0     | 12
 ...
 bar  | 1:3:8     | 20
 bar  | 1:3:8     | 23
 bar  | 1:1:1     | 24
 bar  | 3:0:6     | 54
 ...

This also allows you fast updates to parts of the table and selecting slices using SQL 'like' operator.

If the number of your dimensions is fixed to break down the key column to separate int columns for each dimension in order to improve the index efficiency and have more flexible selection criteria (you can use first index 'null' for metadata like the default value).

In any case, it is a good idea to create a clustered index on Name,IndexKey columns.


Define a table with the data your array holds and insert the array values into a table.

This is very simple data access/storage. Will your array dimensions always be the same?


  • Do it in one big explicit transaction. Don't force the database system to create a new implicit transaction for every insert.
  • Use a prepared statement.

PostgreSQL has built-in support for arrays.

http://www.postgresql.org/docs/8.0/interactive/arrays.html


Java Serialization to a Byte Array stored as a BLOB will be your best bet. Java will serialized a large array quite efficiently. Use the rest of the rows columns for anything you're interested in querying upon or displaying readily. It can also be a good idea to keep the BLOBs in their own table and have the "regular" rows point to the "BLOB" rows, if you query and report on the non-BLOB data much (though this can vary by database implementation).


HSQLDB 2.0 supports one dimensional arrays stored as a column of the table. So each row of the table will correspond to one row of the 2D array.

But if you want to retreive a 2D array as a whole, BLOB is the best solution.


Need Your Help

Parts of a Shell Script in Parallel

shell parallel-processing

I have a shell script that contains 4 diferents parts. Each part do not depends on one another. How can I run these 4 parts in parallel?

OpenCV NDK-build not found

android opencv android-ndk environment-variables adt

So ive been wresting with this for like 3 hours so decided to ask a question. I must have read 10 different articles and still cant get it to work.