perl - Testing query string Unicode handling in Perl

Question

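I was trying to write an example of testing query string parsing when I got stumped by a Unicode issue. In short, the letter "Omega" (Ω) does not seem to be decoded correctly.

Unicode: U+2126
3-byte sequence: \xe2\x84\xa6
URI-encoded: %E2%84%A6

So I wrote this test program to verify that I can "decode" Unicode query strings with URI::Encode.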

use strict;
use warnings;
use utf8::all;    # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;

sub parse_query_string {
    my $query_string = shift;
    my @pairs = split /[&;]/ => $query_string;

    my %values_for;
    foreach my $pair (@pairs) {
        my ( $key, $value ) = split( /=/, $pair );
        $_ = uri_decode($_) for $key, $value;
        $values_for{$key} ||= [];
        push @{ $values_for{$key} } => $value;
    }
    return \%values_for;
}

my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');
is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';

diag $omega;
diag $query->{alpha}[0];

done_testing;

And the output of the test:

query.t .. 
not ok 1 - Unicode should decode correctly
#   Failed test 'Unicode should decode correctly'
#   at query.t line 23.
#     Structures begin differing at:
#          $got->{alpha}[0] = 'â¦'
#     $expected->{alpha}[0] = 'Ω'
# Ω
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=1,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.05 cusr  0.00 csys =  0.09 CPU)
Result: FAIL

It looks to me like URI::Encode may be broken here, but switching to URI::Escape and using the uri_unescape function reports the same error. What am I missing?

score 7 · Accepted Answer

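The URI-encoded characters simply represent a UTF-8 sequence. URI::Encode and URI::Escape merely turn them back into a UTF-8 byte string; neither of them decodes that byte string from UTF-8 into characters (which is the correct behavior for a generic URI decoding library).

In other words, your code is basically doing this: is "\N{U+2126}", "\xe2\x84\xa6". That fails because, for the comparison, perl upgrades the latter to a latin-1 string that is 3 characters long.

You have to decode the input values manually with Encode::decode_utf8 after uri_decode, or else compare against the encoded UTF-8 byte sequence.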
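To make the two options concrete, here is a minimal sketch using only core Perl. The percent_decode helper is a hypothetical stand-in for uri_decode / uri_unescape, assumed here only so the example runs without CPAN modules; like those functions as described above, it yields raw octets.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical stand-in for uri_decode / uri_unescape (core Perl only):
# it turns each %XX escape into one raw octet and does nothing else.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $omega = "\N{U+2126}";                  # one character
my $bytes = percent_decode('%E2%84%A6');   # three octets: E2 84 A6

# Comparing the octets directly with the character string fails.
print "lengths: ", length($bytes), " vs ", length($omega), "\n";
print "raw comparison: ", ( $bytes eq $omega ? "equal" : "different" ), "\n";

# Fix 1: decode the octets into characters, then compare characters.
my $chars = decode_utf8($bytes);
print "after decode_utf8: ", ( $chars eq $omega ? "equal" : "different" ), "\n";

# Fix 2: encode the expected value, then compare octets.
my $expected_bytes = encode_utf8($omega);
print "as octets: ", ( $expected_bytes eq $bytes ? "equal" : "different" ), "\n";
```

Fix 1 is usually the one you want inside a parser, since the rest of the program can then work with characters. Fix 2 only changes what the test compares against.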

score 5 · Answer

URI escapes represent octets and know nothing about character encodings, so you have to decode from UTF-8 octets to characters yourself, for example:

use Encode qw(decode_utf8);    # core module that provides decode_utf8
$_ = decode_utf8(uri_decode($_)) for $key, $value;
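For context, this is roughly how that line might sit inside the question's parse_query_string. It is a sketch, not a drop-in replacement: percent_decode is a hypothetical core-only stand-in for uri_decode (swap the real call back in if you have URI::Encode installed), and the split limit, the empty-pair check, and the empty-string default are small additions of mine that the question did not have.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Assumption: a core-only stand-in for uri_decode / uri_unescape,
# so the sketch runs without CPAN modules. It returns raw octets.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

sub parse_query_string {
    my ($query_string) = @_;
    my %values_for;
    for my $pair ( split /[&;]/, $query_string ) {
        next unless length $pair;    # skip empty pairs such as "a=1&&b=2"
        # A limit of 2 keeps any '=' inside the value intact.
        my ( $key, $value ) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        # Undo both layers: percent-encoding first, then UTF-8.
        $_ = decode_utf8( percent_decode($_) ) for $key, $value;
        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

my $query = parse_query_string('alpha=%E2%84%A6;beta=1&beta=2');
print "alpha decoded to one character\n"
    if $query->{alpha}[0] eq "\N{U+2126}";
print "beta has ", scalar @{ $query->{beta} }, " values\n";
```

With this change the is_deeply check from the question should compare characters against characters.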

score 4 · Answer

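The problem can be seen in the imprecise details of your own explanation of it. What you are actually dealing with is:

Unicode code point: U+2126
UTF-8 encoding of the code point: \xe2\x84\xa6
URI encoding of the UTF-8 encoding of the code point: %E2%84%A6

The problem is that you only undid one of the two encodings.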
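The layering listed above can be walked through in both directions. This is a simplified sketch in core Perl: the regex-based escaping percent-encodes every octet, which is fine for this one character but is not how a real URI encoder treats unreserved ASCII, and the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $char = "\N{U+2126}";    # the code point, as a one-character string

# Layer 1: character -> UTF-8 octets.
my $octets = encode_utf8($char);

# Layer 2: octets -> percent-encoded text (simplified: escapes every octet).
( my $escaped = $octets ) =~ s/(.)/sprintf '%%%02X', ord $1/ges;
print "$escaped\n";

# Undo layer 2: percent-encoded text -> octets.
( my $undo2 = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

# Undo layer 1: octets -> character. This is the step the question skipped.
my $undo1 = decode_utf8($undo2);

printf "lengths: char=%d octets=%d escaped=%d undo2=%d undo1=%d\n",
    length $char, length $octets, length $escaped,
    length $undo2, length $undo1;
```

Stopping after the "undo layer 2" step leaves you with three octets rather than one character, which is what the failing test was comparing.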

Solutions have already been offered. I just wanted to present another way of explaining it.

score 0 · Answer

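I suggest you take a look at "Why does modern Perl avoid UTF-8 by default?" for a thorough discussion of this topic.

I would add to the discussion there:

You will notice a lot of strange glyphs on that page. This is intentional on the author's part.
I have tried the Symbola font recommended in that thread, and it looks awful on Win 7. YMMV.
Reading "Why does modern Perl avoid UTF-8 by default?" too often may lead to depression and lingering doubts about your life choices.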
