
What is a good way to create a Perl string that has the UTF8 flag set but contains an invalid UTF-8 byte sequence?

Is there a way to set the UTF8 flag on a Perl string without performing the native-encoding-to-UTF-X conversion (which is what happens, for example, when you call utf8::upgrade)?

I need this to track down a possible bug in a DBI driver.
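To make the problem concrete, here is a minimal sketch of why utf8::upgrade is not enough on its own. It assumes a stock Perl with the core utf8 and bytes modules; the variable names are only illustrative. The upgrade re-encodes the buffer, so the result is flagged but still well formed:

```perl
use strict;
use warnings;
use bytes ();   # load the bytes:: functions without enabling byte semantics

my $s = "\xC0";        # one character (U+00C0), initially stored as a single byte
utf8::upgrade($s);     # re-encodes the internal buffer as UTF-8 and sets the flag

my $chars    = length($s);                 # character count, unchanged by the upgrade
my $internal = bytes::length($s);          # size of the internal buffer in bytes
my $flagged  = utf8::is_utf8($s) ? 1 : 0;  # is the UTF8 flag set?
my $valid    = utf8::valid($s)   ? 1 : 0;  # is the buffer well-formed UTF-8?

print "chars=$chars internal_bytes=$internal flag=$flagged valid=$valid\n";
```

The single byte 0xC0 becomes a two-byte sequence internally, which is exactly the conversion I want to avoid.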


2 Answers


You can hack at the contents of the string to put an arbitrary byte sequence in it while still leaving the UTF8 flag set.

use Inline C;
use Devel::Peek;
utf8::upgrade( my $str = "" );   # empty string with the UTF8 flag on
Dump($str);
twiddle($str, "\x{BD}\x{BE}\x{BF}\x{C0}\x{C1}\x{C2}");   # append raw bytes; the flag stays set
Dump($str);
__DATA__
__C__
/** append arbitrary bytes to a Perl scalar **/
void twiddle(SV *s, const char *t)
{
  sv_catpv(s, t);
}

Typical output:

SV = PV(0x80029bb0) at 0x80072008
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x80155098 ""\0 [UTF8 ""]
  CUR = 0
  LEN = 12
SV = PV(0x80029bb0) at 0x80072008
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x80155098 "\275\276\277\300\301\302"\0Malformed UTF-8 character (unexpected continuation byte 0xbd, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected continuation byte 0xbe, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected continuation byte 0xbf, with no preceding start byte) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected non-continuation byte 0xc1, immediately after start byte 0xc0) in subroutine entry at ./invalidUTF.pl line 6.
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc2) in subroutine entry at ./invalidUTF.pl line 6.
 [UTF8 "\x{0}\x{0}\x{0}\x{0}\x{0}"]
  CUR = 6
  LEN = 12
answered 2013-05-09T18:10:10.390

That is exactly what Encode's _utf8_on does.

use Encode qw( _utf8_on );

my $s = "abc\xC0def";  # String to use as raw buffer content.
utf8::downgrade($s);   # Make sure each char is stored as a byte.
_utf8_on($s);          # Set UTF8 flag.

(Never use _utf8_on unless you want to produce a bad scalar.)

You can view the damage using

use Devel::Peek qw( Dump );
Dump($s);

Output:

SV = PV(0x24899c) at 0x4a9294
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x24ab04 "abc\300def"\0Malformed UTF-8 character (unexpected non-continuation byte 0x64, immediately after start byte 0xc0) in subroutine entry at script.pl line 9.
 [UTF8 "abc\x{0}ef"]
  CUR = 7
  LEN = 12
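If a full dump is more detail than you need, the same condition can probably be checked from plain Perl. This is a rough sketch, assuming the core utf8::is_utf8 and utf8::valid functions behave as documented; the variable names are just for illustration. It avoids printing $s itself, since using the malformed string can trigger the warnings seen above.

```perl
use strict;
use warnings;
use Encode qw( _utf8_on );

my $s = "abc\xC0def";   # raw buffer content; \xC0 is a start byte with no continuation byte
utf8::downgrade($s);    # make sure each char is stored as a byte
_utf8_on($s);           # set the UTF8 flag without touching the buffer

my $flagged = utf8::is_utf8($s) ? 1 : 0;  # is the UTF8 flag set?
my $valid   = utf8::valid($s)   ? 1 : 0;  # is the flagged buffer well-formed UTF-8?

print "flag=$flagged valid=$valid\n";
```

A scalar that reports the flag as set but fails the validity check is the kind of broken value the question asks for.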
answered 2013-05-09T22:37:42.423