A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
- despite the perl doc advice of 'only where necessary', put 'use utf8;' in every source file - it gives you consistency.
- convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
- use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
- use the Encode module and avoid perl's built-in utf8 functions unless you know exactly what data you're dealing with
- when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
- on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
- in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
- remember to convert your input - eg. in a web app, incoming form data may need decoding
- ensure all dev staff know which way around the terms encode/decode go - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
- handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=Encode::decode_utf8($r->uri()))
- one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
- I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
- make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
- remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
- where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack until it works; make sure you fully understand what's happening in a given en/decoding situation!
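To make the encode/decode direction and the bytes-vs-chars point concrete, here is a minimal sketch. It is not our production code: the sample bytes and variable names are made up for illustration, and it only assumes the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input: pretend these bytes just arrived from a form post.
my $raw_bytes = "caf\xC3\xA9";               # "cafe" with e-acute, as UTF-8 bytes

# /de/-code once at the input boundary: bytes -> perl character string.
my $text = decode('UTF-8', $raw_bytes);

# Do string work on characters, not bytes.
my $upper = uc $text;

# /en/-code once at the output boundary: characters -> bytes.
my $out_bytes = encode('UTF-8', $upper);

my $char_count = length $text;               # counts characters
my $byte_count = length $out_bytes;          # counts bytes, which is what Content-Length needs

print "chars=$char_count bytes=$byte_count\n";
```

The two counts differ because the accented character is one character but two bytes in UTF-8, which is exactly the kind of mismatch that bites you in headers and packet code.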
I'm sure you already had most of these sorted out, but hopefully all that helps someone out there avoid the many hours of debugging we went through.
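For the URL point in the list above, the two-step order (characters to bytes, then bytes to %xx, and the reverse on the way in) can be sketched roughly like this. The helper names escape_segment and unescape_segment are hypothetical, written by hand here so the example only needs the core Encode module; in real code you would more likely reach for an existing URI-escaping module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: perl character string -> ASCII-only URL path segment.
sub escape_segment {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);     # step 1: /en/-code characters into bytes
    # step 2: %XX-escape every byte outside the unreserved set
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Hypothetical helper: the reverse direction, for incoming URLs.
sub unescape_segment {
    my ($escaped) = @_;
    # step 1: turn %XX escapes back into raw bytes
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return decode('UTF-8', $bytes);          # step 2: /de/-code bytes into characters
}

my $name    = "caf\x{E9}";                   # character string, e-acute is one character
my $escaped = escape_segment($name);
my $back    = unescape_segment($escaped);

print "$escaped\n";
```

Doing the steps in the other order (escaping characters before encoding them) is the usual source of mangled URLs, so it is worth keeping both directions in one place.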