Produce and resolve short file path

I want a collision-free hash value that is as short as possible, and I want to use it as a compact directory path to a file.

  • I want to build a directory tree in which the path to any file has almost the same length.
  • The files have binary content.
  • Two files of identical content should produce identical file paths (I think that is what the hash should provide).
  • Hash length should be minimal.
  • Hash calculation time is NOT the top priority since the hash calculation is done once for each file.

My current solution:

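import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.digest.DigestUtils;

String shortHash() {
  // Fix the charset so the bytes do not depend on the platform default
  byte[] content = "sample".getBytes(StandardCharsets.UTF_8);
  // md5() returns the raw 16-byte digest, not a hex string
  byte[] digest = DigestUtils.md5(content);
  // URL-safe Base64 without padding: 16 bytes become 22 characters
  return Base64.encodeBase64URLSafeString(digest);
}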

It produces the hash value 5e8ff9bf55ba3508199d22e984129be6 (in hex) and the file name Xo_5v1W6NQgZnSLphBKb5g.

To store many files in a directory tree, I simply split the file name to produce a file path like this:

<basedir>/Xo/_5/v1W6NQgZnSLphBKb5g
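
As a minimal sketch of that split (the class and method names are hypothetical, and the two-character directory levels are simply taken from the example above):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ShardedPath {
    // Hypothetical helper: two 2-character directory levels, then the rest as the file name.
    static Path toShardedPath(String baseDir, String name) {
        if (name.length() < 5) {
            throw new IllegalArgumentException("name too short to shard: " + name);
        }
        return Paths.get(baseDir,
                name.substring(0, 2),   // first directory level
                name.substring(2, 4),   // second directory level
                name.substring(4));     // remaining characters form the file name
    }

    public static void main(String[] args) {
        System.out.println(toShardedPath("basedir", "Xo_5v1W6NQgZnSLphBKb5g"));
    }
}
```

Because every encoded MD5 value has the same length, every path built this way has the same depth and length below the base directory.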

How can I produce a shorter file path?

Answers


I want a collision-free hash value

A hash is never collision-free, but you can choose a hash algorithm for which collisions are extremely unlikely, as Jon Skeet explained.
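
To put a rough number on "extremely unlikely", the usual birthday approximation can be used. It assumes the hash behaves like a uniformly random b-bit value and that the resulting probability is small; n is the number of distinct files. The figure of 10^9 files is only an illustrative assumption.

```latex
p_{\text{collision}} \approx \frac{n(n-1)}{2^{\,b+1}} \approx \frac{n^{2}}{2^{\,b+1}}

% b = 128 (MD5-sized), n = 10^9:  p \approx 10^{18} / 2^{129} \approx 1.5 \times 10^{-21}
% b = 64,              n = 10^9:  p \approx 10^{18} / 2^{65}  \approx 2.7 \times 10^{-2}
```

So how far the hash can be shortened depends on how many files you expect to store: each bit you drop doubles the collision probability for the same number of files.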

How can I produce a shorter file path?

You need to distinguish two responsibilities.

  1. Detect duplicates: the hash value of the file gives you fast collision detection.
  2. Produce a short file path: take a look at alphabet conversion theory and the Java UrlShortener implementation.

To handle #2, follow these steps:

a) Convert

  1. Save the real file path in a database.
  2. The database gives you a unique row ID for it.
  3. Convert the row ID to a short string with encode().
  4. Use the short string as your short file path.

b) Resolve

  1. Decode the short file path to an integer ID with decode().
  2. Look up the real file path in the database for that ID.
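
The encode() and decode() steps above can be sketched as a plain base-62 conversion. This is only an illustration under assumptions: the class name, the alphabet and its ordering are hypothetical and are not taken from the UrlShortener implementation mentioned above.

```java
class Base62 {
    // Assumed alphabet: digits, then upper case, then lower case (62 symbols).
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int BASE = ALPHABET.length();

    // Converts a non-negative row ID into a short base-62 string.
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return String.valueOf(ALPHABET.charAt(0));
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % BASE))); // least significant digit first
            id /= BASE;
        }
        return sb.reverse().toString();
    }

    // Converts a base-62 string back into the row ID.
    static long decode(String s) {
        long id = 0;
        for (char c : s.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            id = id * BASE + digit;
        }
        return id;
    }
}
```

With this alphabet, a row ID below 62^5 (about 916 million) fits in five characters. Note that a mixed-case alphabet can clash on case-insensitive file systems; a lower-case-only alphabet avoids that at the cost of slightly longer names.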

You can use a cyclic redundancy check (CRC), which computes a checksum from the bytes of your content. This is the one I use in Java; it returns a 64-bit long:

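// CRC64_Table is a long[] lookup table with one entry per byte value (not shown here)
public static long crc64(byte[] data) {
    long crc = 0xffffffffffffffffL; // initial value: all bits set
    for (byte b : data) {
        // Table index: top byte of the running CRC XOR the next input byte
        int index = (int) (((crc >>> 56) & 0xFF) ^ (b & 0xFF));
        // Shift the top byte out and fold in the table entry
        crc = (crc << 8) ^ CRC64_Table[index];
    }
    return crc;
}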

The CRC64_Table is too big to post here, so I've uploaded it to pastebin.

EDIT: You can also use a 32-bit version, like this one: http://introcs.cs.princeton.edu/java/51data/CRC32.java.html
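
If you would rather not carry your own table, the JDK ships a 32-bit CRC in java.util.zip.CRC32. The following is a minimal sketch of turning it into an 8-character name; the class and method names are hypothetical. Keep in mind that a CRC is a checksum rather than a cryptographic hash, and with only 32 bits you should expect collisions once you store on the order of tens of thousands of files.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class Crc32Name {
    // Returns the CRC-32 of the data as 8 lower-case hex characters.
    static String crc32Hex(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        // getValue() returns the unsigned 32-bit checksum in a long
        return String.format("%08x", crc.getValue());
    }

    public static void main(String[] args) {
        byte[] content = "sample".getBytes(StandardCharsets.UTF_8);
        System.out.println(crc32Hex(content));
    }
}
```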

