regex - Converting a string with Unicode characters to lowercase

Question

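Problem statement: I am processing some data files. In that data dump I have strings that contain the Unicode values of characters. The characters may be upper or lower case. I need to process each string as follows.

1. If any of - , _ ) ( } { ] [ ' " are present, remove them. All of these characters appear in the string in escaped form, as $ followed by four hex digits.

2. All uppercase characters need to be converted to lowercase, including the various Unicode characters ('Φ' -> 'φ', 'Ω' -> 'ω', 'Ž' -> 'ž').

3. Later I will use the final string to match against different user inputs.

Detailed description: I have strings such as Buna$002C_Texas, Zamboanga_$0028province$0029 and so on.

Here $002C, $0028 and $0029 are Unicode values, and I am converting them to their character representation with one of the following.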

$str =~s/\$(....)/chr(hex($1))/eg;

or

$str =~s/\$(....)/pack 'U4', $1/eg;

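Next I replace the characters according to my requirements. Then I decode the string as UTF-8 so that every character, including the Unicode ones, gets lowercased, as shown below, since lc does not seem to support Unicode characters directly.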

$str =~ s/(^\-|\-$|^\_|\_$)//g;
$str =~ s/[\-\_,]/ /g;
$str =~ s/[\(\)\"\'\.]|ʻ|’|‘//g;
$str =~ s/^\s+|\s+$//g;
$str =~ s/\s+/ /g;
$str = decode('utf-8',$str);
$str = lc($str);
$str = encode('utf-8',$str);

But when Perl tries to decode the string, I get this error:

Cannot decode string with wide characters at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 173

The error is expected, as explained here: http://www.perlmonks.org/?node_id=569402
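For illustration, here is a minimal sketch of how this failure can arise. The sample input 'Ni$017E' is made up for the example. Once chr(hex(...)) has produced a character above 0xFF, the string is a character string rather than a byte string, so decode has nothing valid to work on. The exact wording of the error differs between Encode versions, so the sketch only checks that decode died.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample input in the "$XXXX" escape format from the question.
my $str = 'Ni$017E';

# chr(hex(...)) yields characters; U+017E is above 0xFF, so the string
# now holds a "wide" character and is no longer a plain byte string.
$str =~ s/\$([0-9A-Fa-f]{4})/chr(hex($1))/eg;

# decode() expects bytes, so it dies here. eval traps the exception.
my $ok = eval { decode('UTF-8', $str); 1 };
print "decode died\n" unless $ok;

# lc already applies Unicode case mapping to such a character string.
my $lower = lc "\x{017D}";          # 'Ž' (U+017D)
printf "U+%04X\n", ord $lower;      # prints U+017E, i.e. 'ž'
```

In other words, the decode step is the part that does not fit: by this point the data is already made of characters.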

So I changed my logic according to the URL above. I now use the following to convert the Unicode values to their character representation.

$str =~s/\$(..)(..)/chr(hex($1)).chr(hex($2))/eg;

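But now I do not get the character representation at all; I get some unprintable characters. So how should I handle this when I do not know how many different Unicode representations there will be?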
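A small sketch of why the two-halves substitution produces unprintable output, using the $002C escape from the examples above. Splitting the four hex digits turns one code point into two unrelated characters, the first of which is often a control character. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $code = '002C';   # the escape for ',' from the question's example

# Splitting into two 2-digit halves yields two separate characters:
# chr(0x00) (NUL, unprintable) followed by chr(0x2C).
my ($hi, $lo) = $code =~ /^(..)(..)$/;
my $split = chr(hex($hi)) . chr(hex($lo));

# Treating all four digits as one code point yields a single character.
my $whole = chr(hex($code));

printf "split: %d chars, whole: %d char\n", length $split, length $whole;
# prints "split: 2 chars, whole: 1 char"
```

The two halves are the UTF-16BE bytes of the code point, not its UTF-8 bytes, so decoding them as UTF-8 afterwards would not recover the intended character either.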

score 5 · Accepted Answer

Answered 2013-08-29T11:10:50.673
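The following is a minimal sketch of one way to put the question's steps together; it is not necessarily the approach taken in the accepted answer. It assumes the input arrives as UTF-8 bytes, decodes once at the start, does all substitutions and the lc call on characters, and encodes once at the end. normalize_name is a hypothetical helper name. As in the question's own code, '-', '_' and ',' become spaces rather than being deleted.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# normalize_name is a hypothetical helper, not from the original post.
# It takes UTF-8 bytes and returns UTF-8 bytes.
sub normalize_name {
    my ($bytes) = @_;

    # Decode first, so everything afterwards works on characters.
    my $str = decode('UTF-8', $bytes);

    # Expand "$XXXX" escapes (exactly four hex digits) into characters.
    $str =~ s/\$([0-9A-Fa-f]{4})/chr(hex($1))/eg;

    $str =~ s/[-_,]/ /g;       # separators become spaces
    # Drop brackets, quotes, dots, and U+02BB, U+2019, U+2018.
    $str =~ s/[()\[\]{}"'.\x{02BB}\x{2019}\x{2018}]//g;
    $str =~ s/^\s+|\s+$//g;    # trim both ends
    $str =~ s/\s+/ /g;         # collapse runs of whitespace

    # lc applies Unicode case mapping to the decoded character string.
    $str = lc $str;

    return encode('UTF-8', $str);   # bytes again, for output or storage
}

print normalize_name('Buna$002C_Texas'), "\n";               # buna texas
print normalize_name('Zamboanga_$0028province$0029'), "\n";  # zamboanga province
```

Keeping decode and encode at the boundaries means the number of distinct Unicode escapes in the data does not matter, because every escape is handled by the same four-digit substitution.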
