How do I obfuscate a column in a Hive view?
I have created a view for a table as:
CREATE VIEW anonymous_table AS SELECT id, value FROM sensitive_table
and would like the id field of sensitive table to be obfuscated somehow, like an MD5 hash or something similar so that people querying the view can't see the actual id. What is a good way to do this in Hive?
Don't include ID in your view at all:
CREATE VIEW something AS SELECT "HIDDEN ID", value from sensitive_table;
If you still need there to be a distinct key available for each record, you could write a UDF to do whatever transformation you like:
ADD JAR mycode.jar; CREATE TEMPORARY FUNCTION hash as 'com.example.MyUDF'; CREATE VIEW something as SELECT hash(id), value from sensitive_table;
BONUS: Seeing as your users can just look at the sensitive table anyway, you could hash the IDs before they arrives in hive? This is probably the best option honestly.
Either way, if you're processing the ID's, having a stable hashing function would be what you need if people still need to rely on the ID's for joining / aggregation, etc.