c++ - Using boost::locale/ICU boundary analysis with Chinese

Question

Using the sample code from the boost::locale documentation, I cannot get Chinese text to tokenize correctly:

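#include <boost/locale.hpp>
#include <iostream>
#include <string>

using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text = "中華人民共和國";
// Build a word-boundary index over the UTF-8 text using the zh_CN locale.
ssegment_index map(word, text.begin(), text.end(), gen("zh_CN.UTF-8"));
for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
    std::cout << "\"" << *it << "\", ";
std::cout << std::endl;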

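This splits 中華人民共和國 into seven separate characters, 中/華/人/民/共/和/國, rather than the expected 中華/人民/共和國. The documentation for ICU, which Boost is compiled against, claims that Chinese should work out of the box and use a dictionary-based tokenizer to split phrases correctly. Using the sample Japanese test phrase "生きるか死ぬか、それが問題だ。" in the code above with the "ja_JP.UTF-8" locale does work, but that tokenization does not depend on a dictionary, only on kanji/kana boundaries.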

I have also tried the same thing directly in ICU, as suggested here, but the result is the same.

UnicodeString text = "中華人民共和國";
UErrorCode status = U_ZERO_ERROR;
BreakIterator* bi = BreakIterator::createWordInstance(Locale::getChinese(), status);
bi->setText(text);
int32_t p = bi->first();
while (p != BreakIterator::DONE) {
    printf("Boundary at position %d\n", p);
    p = bi->next();
}
delete bi;

Any idea what I am doing wrong?

score 1 · Accepted Answer

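You are most likely using an ICU version older than 5.0. ICU 5.0 is the first release that supports dictionary-based Chinese word segmentation.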

Also note that Boost uses ICU as its default localization backend, which is why the results mirror each other.
