c++ - How to count the Unicode characters in a text using C++

Question

I wrote a simple program that counts how many times each distinct character occurs in a text. Here is the code:

#include <iostream>
#include <fstream>
#include <map>
using namespace std;
const char* filename="text.txt";
int main()
{
    map<char,int> dict;
    fstream f(filename);
    char ch;
    while (f.get(ch))
    {
        if(!f.eof())
            cout<<ch;
        if (!dict[ch])
            dict[ch]=0;
        dict[ch]++;
    }
    f.close();
    cout<<endl;
    for (auto it=dict.begin();it!=dict.end();it++)
    {
        cout<<(*it).first<<":\t"<<(*it).second<<endl;
    }
    system("pause");
}

The program counts ASCII characters correctly, but it does not work for Unicode characters such as Chinese characters. How can I make it work with Unicode characters?

score 2 · Accepted Answer

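First, what do you want to count? Unicode code points or grapheme clusters, i.e. characters in the encoding sense, or characters as perceived by a reader? Also keep in mind that "wide characters" (16-bit characters) are not Unicode characters: UTF-16 is a variable-length encoding, just like UTF-8.

In either case, use a library such as ICU to do the actual code point or cluster iteration. For the counting itself, replace the char key type of your map with an appropriate type: a 32-bit unsigned int for code points, or a normalized string for grapheme clusters. Normalization should, again, be handled by the library.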

ICU: http://icu-project.org

Grapheme clusters: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Normalization: http://unicode.org/reports/tr15/

score 1 · Accepted Answer

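You need a Unicode library to handle Unicode characters. Decoding, say, UTF-8 yourself would be an arduous task, and it would be reinventing the wheel.

A good one is mentioned in this Q/A from SO, and you will find further suggestions in the other answers there.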

score 0 · Accepted Answer

There are wide-char versions of everything, but if you want to do something very similar to what you have now and use a 16-bit version of Unicode:

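map<unsigned short,int> dict;
// Binary mode, so that no byte is translated before we see it.
fstream f(filename, ios::in | ios::binary);
char lo, hi;
// Read two bytes per character; stop at end of file or on a dangling last byte.
while (f.get(lo) && f.get(hi))
{
    // Beware endian issues here - should work either way for char counting though.
    // Go through unsigned char so that bytes >= 0x80 are not sign-extended.
    unsigned short val = static_cast<unsigned char>(lo)
                       | (static_cast<unsigned char>(hi) << 8);

    cout<<val<<' ';   // prints the numeric value of the 16-bit unit
    dict[val]++;      // operator[] value-initializes a missing entry to 0
}
f.close();
cout<<endl;
for (auto it=dict.begin();it!=dict.end();it++)
{
    cout<<(*it).first<<":\t"<<(*it).second<<endl;
}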

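The code above makes a lot of assumptions (every character is 16 bits, the file has an even number of bytes, and so on), but it should do what you want, or at least give you a quick idea of how this could work with wide characters.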

score 0 · Accepted Answer

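If you can compromise and count only code points, it is fairly simple to do this directly in UTF-8. Your dictionary will have to be a std::map<std::string, int>, however. Once you have read the first byte of a UTF-8 sequence: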

while ( f.get( ch ) ) {
    static size_t const charLen[] = 
    {
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
          4,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  6,  6,  0,  0,
    } ;
    int chLen = charLen[ static_cast<unsigned char>( ch ) ];
    if ( chLen <= 0 ) {
        //  error: impossible first character for UTF-8
        //  (placeholder policy: skip the byte)
        continue;
    }
    std::string codepoint( 1, ch );
    -- chLen;
    while ( chLen > 0 ) {
        if ( !f.get( ch ) ) {
            //  error: file ends in middle of a UTF-8 code point.
            break;
        } else if ( (ch & 0xC0) != 0x80 ) {
            //  error: illegal following character in UTF-8
            break;
        } else {
            codepoint += ch;
            -- chLen;   //  one continuation byte consumed
        }
    }
    ++ dict[codepoint];
}

You will notice that most of the code is concerned with error handling.
