SQL Server Hash Indexes
When using the CHECKSUM column type to artificially create a hash index, is the lookup actually O(1) or is it still O(lg n) like it is for a clustered index? I have a table from which I will select based on its ID column and I need the lookup to be as fast as possible, so is the clustered index the fastest possible option? I am looking for something that will provide O(1) performance.
Okay, 2 points. The SQL CHECKSUM function does not produce a hash value. It actually calculates a CRC value. It is not a very good candidate to base a hash check on becuase there will be a relativly large number of collisions. You should check the hash_bytes function if you want a hash function. Secondly, you are not actually creating a hash index. You are creating a normal b-tree on a hash value so the lookup time will be exactly the same as for any other b-tree index on a similar sized data type. There is a chance that you could gain a little performance by using a CRC or hash of a long varchar value to allow comparisons of a smaller number of bytes, but string comparison only checks as many bytes as it needs to, which is as far as the first character that doesn't match, and if you do match on the hashed value, you then need to double check the actual value anyway. So unless you have a lot of very similar strings you will probably end up comparing MORE bytes by using the hash (or CRC). In short, I don't think this is a sensible plan, but as with all optimisations you should test it in your specific case and then decide. I would be interested to see your results if you would care to post them. And I don't believe that there is any faster way to locate a row in SQL server than by using a clustered index. In case you care, Ingres (by CA) can create hash indexes which would then achive O(1). there may be other RDBM's out there that also support true hash indexes.
I don't think that SQL server natively has a hash table based index. The BOL documentation is talking about building a standard (tree) index on a calculated value. This is not the same thing as a Linear Hash Table, which is an index structure available on some DBMS platforms, but not SQL Server (AFAIK).
You may get some benefit from using the technique described in this blog post to hash large string values such as URL's for faster lookup. However, the underlying index is still a tree structure and is O(Log N).
You can try to set things up to use a hash join, you can look at the execution plan to verify a hash join is actually used. When hash joins are used, SQL Server will still build the hash table first as part of executing the individual query. I believe indexes are never stored as a hash, only as trees.
In general I would not create an artificial hash column unless you're doing exact matches against potentially large strings or binary blobs (as pipTheGeek mentions). I just wanted to add that sometimes this is necessary as strings might be too large to fit in an index key. There is a limit to the size of index keys of I think 2k for SQL Server.
Of course, in your join you need to include the hash column and the source column to resolve any ambiguities that result from the hash.
There's no advantage to searching an indexed CHECKSUM over a clustered index on the ID field if the ID field is an int since both will do a clustered index seek. Also, a CHECKSUM of an int column always returns the same value as the column (i.e. CHECKSUM(535) = 535). However, a CHECKSUM lookup will generally perform better if the ID is a long character column.