Checklist for going the Unicode way with Perl

I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.

Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something that would only show up on rare occasions.

This is what I have done so far (forgive me for only including "summary" code examples):

  1. Made sure files are always read and written in UTF-8:

    use open ':utf8';
    
  2. Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):

    s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg;    # Kept from existing code
    s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg;     # Added
    utf8::decode $_;
    
  3. Made sure text is printed as UTF-8:

    binmode STDOUT, ':utf8';
    
  4. Made sure browsers interpret my content as UTF-8:

    Content-Type: text/html; charset=UTF-8
    <meta http-equiv="content-type" content="text/html;charset=UTF-8">
    
  5. Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):

    accept-charset="UTF-8"
    
  6. Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:

    use utf8;
    

Does this look reasonable, or am I missing something?

EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
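
A minimal sketch of what such a one-time batch could look like, assuming every existing data file really is ISO-8859-1 and is small enough to read into memory. The helper name convert_latin1_to_utf8 and the ".utf8.tmp" suffix are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite one file from ISO-8859-1 to UTF-8.
sub convert_latin1_to_utf8 {
    my ($path) = @_;

    # Decode the old bytes as ISO-8859-1 while reading.
    open my $in, '<:encoding(ISO-8859-1)', $path
        or die "Cannot read $path: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;

    # Write to a temporary file first so a failure cannot truncate the original.
    my $tmp = "$path.utf8.tmp";
    open my $out, '>:encoding(UTF-8)', $tmp
        or die "Cannot write $tmp: $!";
    print {$out} $text;
    close $out or die "Cannot close $tmp: $!";

    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";
    return;
}

# Convert every file named on the command line.
convert_latin1_to_utf8($_) for @ARGV;
```

Run it exactly once per file: a second pass would treat the new UTF-8 bytes as ISO-8859-1 and double-encode them.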

Answers


  • The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).

  • For the same reason, always Encode::decode('UTF-8', …), not Encode::decode_utf8(…).

  • Make decoding fail hard with an exception. Compare:

    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
    perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
    
  • You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Step 2 is written correctly as:

    use Encode qw(decode);
    use URI::Escape::XS qw(decodeURIComponent);
    $_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
    
  • Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.

  • Add the utf8 pragma anyway in every module. It cannot hurt, and it future-proofs the code in case someone later adds non-ASCII string literals. See also the Perl::Critic policy CodeLayout::RequireUseUTF8.

  • Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether the upgrade is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that you should have had a decoding step earlier. The documents from http://p3rl.org/UNI give the advice to decode immediately after receiving the data from the source. Go over the places where the code is reading/writing data and verify you have a decoding/encoding step, either explicitly (decode('UTF-8', …)) or implicitly through a layer (use open pragma, binmode, three-argument form of open).

  • For debugging: If you are not sure which string is in a variable, or in which representation, at a certain time, you cannot just print it. Use the tools Devel::StringInfo and Devel::Peek instead.
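
To make the strict-decoding advice above concrete, here is a minimal sketch that wraps Encode::decode with Encode::FB_CROAK in an eval, so malformed input is rejected instead of silently repaired. The helper name decode_utf8_strict is hypothetical; only the core Encode module is assumed:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strictly decode a byte string as UTF-8.
# Returns the decoded character string, or undef if the bytes are malformed.
sub decode_utf8_strict {
    my ($bytes) = @_;

    # With a CHECK argument such as FB_CROAK, decode() may modify its
    # input in place, so hand it a copy.
    my $copy  = $bytes;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $chars;    # undef when decode() died
}

my $ok  = decode_utf8_strict("caf\xC3\xA9");    # valid UTF-8 bytes
my $bad = decode_utf8_strict("caf\xC0");        # 0xC0 never occurs in valid UTF-8

binmode STDOUT, ':encoding(UTF-8)';
print defined $ok  ? "accepted: $ok\n" : "rejected\n";
print defined $bad ? "accepted: $bad\n" : "rejected\n";
```

Returning undef lets the caller decide how to respond to bad input, for example by sending an error page rather than storing the text.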


You're always missing something. The problem is usually the unknown unknowns, though. :)

Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover, though, was everything you have to do to ensure your database server and web server do the right thing.

Some other things you'll need to do:

  • Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.

  • Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode replacement character, U+FFFD (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �

  • Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).

  • Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book).
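
The check for the replacement character suggested above could be sketched roughly as follows. This assumes the page or file content is already available as a byte string; fetching it is left out, and the helper name lines_with_replacement_char is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: given raw bytes that are supposed to be UTF-8,
# return the 1-based numbers of the lines that contain U+FFFD.
sub lines_with_replacement_char {
    my ($bytes) = @_;

    # Lax decoding on purpose: without a CHECK argument, decode()
    # substitutes U+FFFD for malformed bytes, so leftover ISO-8859-1
    # data is flagged along with U+FFFD already stored in the text.
    my $text = decode('UTF-8', $bytes);

    my @hits;
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        push @hits, $line_no if $line =~ /\x{FFFD}/;
    }
    return @hits;
}

# The second line holds a stray ISO-8859-1 byte (0xE9).
my @suspect = lines_with_replacement_char("ok\ncaf\xE9\nfine\n");
print "Suspicious lines: @suspect\n";
```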

If you find anything else, please let us know by answering your own question with whatever we left out. ;)

