c - How to process a string char by char in XS code

Question

Suppose there is a piece of code like this:

  my $str = 'some text';
  my $result = my_subroutine($str);

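and my_subroutine() should be implemented as Perl XS code. For example, it could return the sum of the bytes of a (Unicode) string.

In XS code, how do I process a string (a) char by char, as a general approach, and (b) byte by byte, if the string consists only of the ASCII subset (is there a built-in function to convert from the native string data structure to char[])?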

score 3 · Accepted Answer

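At the XS layer, you get either a byte string or a UTF-8 string. In the general case, your code will likely hold a char * that points at the next item in the string and increment it as it goes. For a useful set of UTF-8 support functions to use in XS, read perlapi.

One example of mine, from http://cpansearch.perl.org/src/PEVANS/Tickit-0.15/lib/Tickit/Utils.xs: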

int textwidth(str)
    SV *str
  INIT:
    STRLEN len;
    const char *s, *e;

  CODE:
    RETVAL = 0;

    if(!SvUTF8(str)) {
      str = sv_mortalcopy(str);
      sv_utf8_upgrade(str);
    }

    s = SvPV_const(str, len);
    e = s + len;

    while(s < e) {
      UV ord = utf8n_to_uvchr(s, e-s, &len, (UTF8_DISALLOW_SURROGATE
                                               |UTF8_WARN_SURROGATE
                                               |UTF8_DISALLOW_FE_FF
                                               |UTF8_WARN_FE_FF
                                               |UTF8_WARN_NONCHAR));
      int width = wcwidth(ord);
      if(width == -1)
        XSRETURN_UNDEF;

      s += len;
      RETVAL += width;
    }

  OUTPUT:
    RETVAL

In short, this function iterates over the given string one Unicode character at a time, accumulating the width reported by wcwidth() for each one.

score 3 · Accepted Answer

If you're expecting bytes:

STRLEN len;
char* buf = SvPVbyte(sv, len);

while (len--) {
   char byte = *(buf++);

   /* ... do something with byte ... */
}
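
To make the byte loop concrete for the question's example (summing the bytes of a string), here is a minimal sketch in plain C. It deliberately leaves out the Perl API so that it compiles on its own; `sum_bytes` is a hypothetical helper name, and in real XS code `buf` and `len` would come from `SvPVbyte(sv, len)` as in the snippet above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum the bytes of a buffer of known length.
 * In XS, buf and len would come from SvPVbyte(sv, len). */
unsigned long sum_bytes(const char *buf, size_t len)
{
    unsigned long total = 0;

    assert(buf != NULL || len == 0);

    while (len--) {
        /* Go through unsigned char so bytes >= 0x80 are not
         * sign-extended on platforms where char is signed. */
        total += (unsigned char)*buf++;
    }
    return total;
}
```

The loop relies on the explicit length rather than a terminating NUL, since a Perl string may legitimately contain zero bytes.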

If you're expecting text or any non-byte characters:

STRLEN len;
U8* buf = (U8*)SvPVutf8(sv, len);   /* SvPVutf8 yields a char*, so cast */

while (len) {
   STRLEN ch_len;
   UV ch = utf8n_to_uvchr(buf, len, &ch_len, 0);
   buf += ch_len;
   len -= ch_len;

   /* ... do something with ch ... */
}
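
The shape of the character loop can also be tried outside Perl. The following is only a sketch in plain C: `decode_utf8` is a hypothetical, simplified stand-in for `utf8n_to_uvchr` (it does not reject overlong forms or surrogates), and `sum_code_points` is likewise an illustrative name. Real XS code should keep using the perlapi functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for utf8n_to_uvchr(): decode one code point from
 * at most len bytes at s, storing the number of bytes consumed in *used.
 * On truncated or malformed input it returns 0 and sets *used to 0.
 * Simplified: overlong forms and surrogates are not rejected. */
unsigned long decode_utf8(const unsigned char *s, size_t len, size_t *used)
{
    unsigned long cp;
    size_t need, i;

    *used = 0;
    if (len == 0) return 0;

    if (s[0] < 0x80)                { cp = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (need > len) return 0;       /* sequence runs past the buffer */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *used = need;
    return cp;
}

/* Walk the buffer one character at a time, summing the code points.
 * Returns -1 if the input cannot be decoded. */
long sum_code_points(const unsigned char *buf, size_t len)
{
    long total = 0;

    assert(buf != NULL || len == 0);

    while (len) {
        size_t ch_len;
        unsigned long ch = decode_utf8(buf, len, &ch_len);
        if (ch_len == 0) return -1;   /* stop instead of looping on bad input */
        buf += ch_len;
        len -= ch_len;
        total += (long)ch;
    }
    return total;
}
```

The point of the design is that the pointer advances by the decoded character's length rather than by one, and that the loop bails out when no bytes were consumed, so malformed input cannot stall it.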
