c++ - How to test a u32string for letters only (using locale)

Question

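I'm writing a compiler (for my own programming language) and I want to let users define identifiers with any character in the Unicode letter categories (modern languages such as Go already allow this). I've read a lot about character encodings in C++11, and based on everything I found, UTF-32 looks like a good choice: it is fast to iterate over in a lexer, and it has better support in C++ than UTF-8.

C++ has the isalpha function. How can I test whether a char32_t is a letter (a Unicode code point classified as a "letter" in any language)?

Is it even possible?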

score 1 · Accepted Answer

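Use ICU to iterate over the string and check that each code point has the appropriate Unicode property. Here is an example in C that checks whether a UTF-8 command-line argument is a valid identifier: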

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/uchar.h>
#include <unicode/utf8.h>

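int main(int argc, char **argv) {
  if (argc != 2) return EXIT_FAILURE;
  const char *const str = argv[1];
  int32_t off = 0;  // byte offset into str, advanced by U8_NEXT
  // Pass an explicit length: U8_NEXT has a bug causing length < 0
  // to not work for characters in [U+0080, U+07FF]
  const size_t actual_len = strlen(str);
  if (actual_len > INT32_MAX) return EXIT_FAILURE;
  const int32_t len = (int32_t)actual_len;
  if (!len) return EXIT_FAILURE;  // an empty string is not an identifier
  UChar32 ch = -1;
  // The first code point must be allowed to start an identifier
  U8_NEXT(str, off, len, ch);
  if (ch < 0 || !u_isIDStart(ch)) return EXIT_FAILURE;
  // Every later code point must be allowed inside an identifier;
  // U8_NEXT sets ch < 0 on malformed UTF-8
  while (off < len) {
    U8_NEXT(str, off, len, ch);
    if (ch < 0 || !u_isIDPart(ch)) return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}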

Note that ICU uses the Java definitions here, which differ slightly from the ones in UAX #31. In a real application you may also want to normalize to NFC beforehand.
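Since the question works with UTF-32, the same check gets simpler there, because every char32_t in a u32string is already one whole code point and no decoding step is needed. The following is only a rough sketch under some assumptions: `is_identifier`, `ascii_id_start` and `ascii_id_part` are hypothetical names, and the two ASCII-only predicates are placeholders so the snippet compiles without ICU. In a real build you would presumably pass thin wrappers around ICU's `u_isIDStart` and `u_isIDPart` instead.

```cpp
#include <cstddef>
#include <string>

// Predicate shape: takes one code point, says whether it is acceptable.
// With ICU, a wrapper around u_isIDStart / u_isIDPart would fit here
// (casting char32_t to UChar32); that wiring is not shown.
using CodePointPredicate = bool (*)(char32_t);

// Hypothetical helper: true if `s` is non-empty, its first code point
// satisfies `is_start`, and every following code point satisfies `is_part`.
bool is_identifier(const std::u32string& s,
                   CodePointPredicate is_start,
                   CodePointPredicate is_part) {
    if (s.empty() || !is_start(s.front())) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (!is_part(s[i])) return false;
    }
    return true;
}

// Placeholder predicates (ASCII only, NOT a Unicode-aware definition).
bool ascii_id_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

bool ascii_id_part(char32_t c) {
    return ascii_id_start(c) || (c >= U'0' && c <= U'9') || c == U'_';
}
```

A call would look like `is_identifier(U"abc_1", ascii_id_start, ascii_id_part)`. Keeping the predicates as parameters means the lexer loop does not change if you later switch between the ICU (Java-style) rules and a UAX #31 based definition.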

score 0 · Accepted Answer

The ICU project has an isalpha. I think you can use it.

answered 2013-04-07T14:12:54.613
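To make that suggestion a bit more concrete for a u32string, here is a small sketch of a "letters only" test. It assumes the hypothetical names `all_letters` and `demo_is_letter`. The demo predicate covers only a few hard-coded ranges so the snippet builds without ICU, and it is nowhere near a full Unicode letter test. With ICU available, the predicate could instead wrap `u_isalpha` (general category "L") or `u_isUAlphabetic` (the Alphabetic property) from `unicode/uchar.h`.

```cpp
#include <algorithm>
#include <string>

// Deliberately incomplete stand-in for a real Unicode letter test:
// only a handful of ranges, chosen for illustration.
bool demo_is_letter(char32_t c) {
    return (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z') ||
           (c >= 0x03B1 && c <= 0x03C9) ||  // Greek small alpha..omega
           (c >= 0x0410 && c <= 0x044F);    // Cyrillic capital A..small ya
}

// Hypothetical helper: true if `s` is non-empty and every code point
// satisfies `is_letter`. With ICU one might pass something like
//   [](char32_t c) { return u_isalpha(static_cast<UChar32>(c)) != 0; }
// (an assumption about how you would wire it up, not tested here).
template <typename Pred>
bool all_letters(const std::u32string& s, Pred is_letter) {
    return !s.empty() && std::all_of(s.begin(), s.end(), is_letter);
}
```

Treating the empty string as "not all letters" is a choice made for this sketch; drop the `!s.empty()` check if you prefer the usual vacuous-truth behaviour of `std::all_of`.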
