Should I convert overlong UTF-8 strings to their shortest normal form?
I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form.
My question is quite simply "is this a bad idea"?
A number of sources (including this RFC) suggest that any over-long UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.
Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have, it simply converts them into a normalised form.
Am I missing something. Is there a hidden danger I haven't considered?
Yes, this is a bad idea.
Maybe some of the data in one of these messy data files was checked to see that it didn't contain a dangerous sequence of ASCII characters.
The canonical example that caused many problems: '\xC0\xBCscript>'. ‘Fix’ the overlong sequence to plain ASCII < and you have accidentally created a security hole.
No tool has ever generated overlongs for any legitimate purpose. If you're trying to repair mixed encoding files, you should consider encountering one as a sign that you have mis-guessed the encoding.
I don't think this is a bad idea from a security or usability perspective.
From security perspective you should be sanitizing user input before use. So you can run your clean up routines, and then make sure the data doesn't contain greater-than/less-than symbols <> before it is printed out. You should also make sure you call mysql_real_escape_string() before inserting it into the database. Keep in mind that language encoding issues such as GBK vs Latin1 can lead to sql injection when you aren't using mysql_real_escape_string(). (This function name should be pretty similar regardless of your platform specific mysql library bindings)
Sanitizing all user input is generally a terrible idea because you don't know how the specific variable will be used. For instance sql injection and xss have very different control characters involved and the same sensitization for both often leads to vulnerabilities.
I don't know if it is a bad idea in your scenario, however, as this kind of change is not bijective, it may lead to data loss.
If you incorrectly detected the encoding of your data, you may interpret data as being legitimate UTF-8 overlongs and change them in the shortest normal form. There will be no way to later retrieve the original data.
As a personal experience, I know that when such things may happen, they WILL and you will potentially not notice the error before it is too late...