perl - What problems arise when migrating legacy Perl code to UTF-8?

Question

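So far, the project I work on has used only ASCII in its source code. Because of several upcoming changes in the I18N area, and because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, while using the utf8 pragma (use utf8;).

Since the code is ASCII now, I don't expect any problems with the code itself. However, I'm not sure what side effects we might run into, and given our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with the FreeTDS driver) I think it is quite likely that we will see some.

If you have done such a migration in the past: what problems can I expect? How can I manage them?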

score 11 · Accepted Answer

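The utf8 pragma merely tells Perl that your source code is encoded as UTF-8. If you have only used ASCII in your source, Perl won't have any problem understanding it. You might want to create a branch in source control just to be safe. :)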
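To illustrate what the pragma changes, here is a minimal sketch, assuming the script file itself is saved as UTF-8 (the variable names are made up for the example). The same literal is measured once with the pragma in effect and once without it:

```perl
use strict;
use warnings;

my ($chars, $bytes);
{
    use utf8;    # literals in this block are decoded from UTF-8 into characters
    $chars = length "café";
}
{
    no utf8;     # the same literal is now taken as the raw bytes in the file
    $bytes = length "café";
}

# Only ASCII is printed, so no output layer is needed here.
print "with use utf8:    $chars characters\n";
print "without use utf8: $bytes bytes\n";
```

An ASCII-only source file gives the same count in both blocks, which is why turning the pragma on should not change how existing ASCII code is parsed.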

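If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encoding on your filehandles and encode your data in the form the outside world expects. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.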
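A rough sketch of both approaches, using a temporary file and sample text invented for the example: an :encoding layer on the filehandle does the conversion for you, while a :raw handle hands back bytes that you decode yourself with the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Sample character string (7 characters), written with escapes so that
# this script itself stays pure ASCII.
my $text = "na\x{EF}ve \x{2603}";

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Writing: the layer turns characters into UTF-8 bytes on the way out.
open(my $out, '>:encoding(UTF-8)', $path) or die "open $path: $!";
print {$out} $text;
close $out or die "close $path: $!";

# Reading through the same layer gives characters back.
open(my $in, '<:encoding(UTF-8)', $path) or die "open $path: $!";
my $decoded = do { local $/; <$in> };
close $in;

# Reading raw gives the bytes as stored on disk; decode them by hand.
open(my $raw, '<:raw', $path) or die "open $path: $!";
my $octets = do { local $/; <$raw> };
close $raw;
my $manual = decode('UTF-8', $octets);

printf "characters: %d, bytes on disk: %d\n", length $decoded, length $octets;
```

Either route should end with the same character string; the layer is simply less to remember on every read and write.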

See the Perl documentation that introduces Unicode.

Also see Juerd's Perl Unicode Advice.

score 4 · Accepted Answer

A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:

despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=Encode::decode('UTF-8',$r->uri()))
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
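
Several of the points above (decode input, encode output, count bytes for Content-Length, encode before %xx escaping) can be sketched together. This is only an illustration under assumptions: the sample data and the two *_sketch helper names are invented here. In real code the uri_escape_utf8 function from the CPAN module URI::Escape covers the escaping side.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Incoming data (form fields, files, sockets) arrives as bytes: decode it.
my $octets_in = "Z\xC3\xBCrich";               # 7 UTF-8 bytes as received
my $city      = decode('UTF-8', $octets_in);   # character string, 6 characters

# String functions operate on characters once the data is decoded.
my $upper = uc $city;

# Outgoing data has to be bytes again: encode it.
my $body = encode('UTF-8', "City: $upper\n");

# Byte-oriented values such as Content-Length are taken from the encoded form.
my $content_length = length $body;

# URLs: encode to UTF-8 bytes first, then %XX-escape every byte outside
# the unreserved set.
sub uri_escape_utf8_sketch {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# The reverse: undo the %XX escaping to get bytes, then decode to characters.
sub uri_unescape_utf8_sketch {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

printf "Content-Length: %d\n", $content_length;
printf "escaped: %s\n", uri_escape_utf8_sketch($city);
```

The habit this is meant to show is keeping text decoded everywhere inside the program and converting only at the edges.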

One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!
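
In that spirit, a small debugging sketch can help show what a scalar really holds at a given point. describe_string is a hypothetical helper written for this example, not part of any module. It reports the length and the ordinal of each element, plus Perl's internal UTF-8 flag, which is an implementation detail and should be read only as a hint.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: describe a scalar without printing its contents raw.
sub describe_string {
    my ($s) = @_;
    my $flag = utf8::is_utf8($s) ? 'flagged' : 'unflagged';
    my $ords = join ' ', map { sprintf('%02X', ord($_)) } split(//, $s);
    return sprintf('%s len=%d [%s]', $flag, length($s), $ords);
}

my $bytes  = "\xC3\xA9";                 # UTF-8 encoding of U+00E9
my $chars  = decode('UTF-8', $bytes);    # one character
my $double = encode('UTF-8', $bytes);    # mistake: encoding already-encoded bytes

print "bytes:  ", describe_string($bytes),  "\n";
print "chars:  ", describe_string($chars),  "\n";
print "double: ", describe_string($double), "\n";
```

A length that is larger than expected, as in the double-encoded case, usually means an encode step ran twice or a decode step was skipped.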

I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.
