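So far, the project I work on uses only ASCII in its source code. Because of several upcoming changes in the I18N area, and because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, using the utf8 pragma (use utf8;).

Since the code is ASCII now, I don't expect any problems with the code itself. However, I'm not sure what side effects we might run into, and given our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with the FreeTDS driver) I think it's quite likely that there will be some.

If you have done this kind of migration in the past: what problems can I expect, and how can I manage them?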

2 Answers

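The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, Perl won't have any problems understanding it. You might want to make a branch in your source control just to be safe. :)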
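
As a minimal sketch of what the pragma changes, the snippet below compares the length of the same string literal with and without it. It assumes the file itself is saved as UTF-8 and that the "é" is the single precomposed character U+00E9; the variable names are only for illustration.

```perl
use strict;
use warnings;

my ($with, $without);

{
    use utf8;                  # literals in this block are read as UTF-8 characters
    $with = length("é");       # counted as one character
}
{
    no utf8;                   # literals in this block are read as raw bytes
    $without = length("é");    # counted as the two bytes of the UTF-8 encoding
}

print "with utf8: $with, without utf8: $without\n";
```

A source file that contains only ASCII produces the same result either way, which is why adding the pragma to existing ASCII code should not change its behaviour.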

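If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encoding on your filehandles and encode your data the way the external side expects it. See, for instance, "With a utf8-encoded Perl script, can it open a filename encoded as GB2312?".

Check out the Perl documentation that covers Unicode.

Also see Juerd's Perl Unicode Advice.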
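
A rough sketch of setting the encoding on filehandles, assuming a scratch file created with File::Temp (the file and variable names are illustrative only): the text is written through an `:encoding(UTF-8)` layer, read back through the same layer, and the character count is compared with the byte size on disk.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Six characters: c, a, f, e-acute (U+00E9), space, smiley (U+263A)
my $text = "caf\x{e9} \x{263A}";

# Scratch file, removed automatically when the program exits
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer encodes characters into UTF-8 bytes
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Reading: the layer decodes UTF-8 bytes back into characters
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
my $read_back = do { local $/; <$in> };
close $in;

my $chars = length $read_back;   # characters in the Perl string
my $bytes = -s $path;            # bytes actually stored on disk

print "characters: $chars, bytes on disk: $bytes\n";
```

The two counts differ because the non-ASCII characters take more than one byte each in UTF-8; without the layer on the input handle, Perl would hand back the raw bytes instead of characters.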

answered 2009-11-25T13:42:50.513

A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:

  • despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
  • convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
  • use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
  • use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
  • when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
  • on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
  • in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
  • remember to convert your input - eg. in a web app, incoming form data may need decoding
  • ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
  • handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=Encode::decode_utf8($r->uri()))
  • one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
  • I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
  • make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
  • remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
  • where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
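
A few of the points above (decoding input, the encode/decode direction, URL escaping, and counting bytes rather than chars) can be sketched with the Encode module roughly as follows. This is only an illustration under assumptions: `$raw_input` stands in for whatever bytes a form field or request would deliver, and the hand-rolled escaping regex is a stand-in for whatever URL-escaping helper your platform already uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical incoming data: the raw UTF-8 bytes for "cafe" with an e-acute
my $raw_input = "caf\xC3\xA9";

# /de/-code: raw bytes -> Perl character string
my $chars = decode('UTF-8', $raw_input);

# /en/-code: Perl character string -> raw bytes
my $bytes = encode('UTF-8', $chars);

my $char_count = length $chars;   # number of characters
my $byte_count = length $bytes;   # number of bytes, which is what Content-Length needs

# URL form: encode to bytes first, then %xx-escape every byte that is not unreserved
(my $escaped = $bytes) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;

# Reading it back: unescape to bytes, then decode to characters
(my $unescaped = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex $1)/ge;
my $round_trip = decode('UTF-8', $unescaped);

print "chars: $char_count, bytes: $byte_count, url: $escaped\n";
```

The order matters in both directions: escaping a character string before encoding it, or decoding before unescaping, is how double-encoded or mangled URLs tend to appear.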

One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!

I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.

answered 2009-11-27T23:49:19.897