c++ - Counting the occurrences of letters in a file

Question

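I'm trying to count the number of times each letter appears in a file. When I run the code below, it counts "Z" twice. Can anyone explain why?

The test data is: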

abcdefghijklmnopqrstuvwxyz

ABCDEFGHIJKLMNOPQRSTUVWXYZ

#include <iostream>                 //Required if your program does any I/O
#include <iomanip>                  //Required for output formatting
#include <fstream>                  //Required for file I/O
#include <string>                   //Required if your program uses C++ strings
#include <cmath>                    //Required for complex math functions
#include <cctype>                   //Required for letter case conversion

using namespace std;                //Required for ANSI C++ 1998 standard.

int main ()
{
string reply;
string inputFileName;
ifstream inputFile;
char character;
int letterCount[127] = {};

cout << "Input file name: ";
getline(cin, inputFileName);

// Open the input file.
inputFile.open(inputFileName.c_str());      // Need .c_str() to convert a C++ string to a C-style string
// Check the file opened successfully.
if ( ! inputFile.is_open())
{
    cout << "Unable to open input file." << endl;
    cout << "Press enter to continue...";
    getline(cin, reply);
    exit(1);
}

while ( inputFile.peek() != EOF )
{
      inputFile >> character;
      //toupper(character);

      letterCount[static_cast<int>(character)]++;
}

for (int iteration = 0; iteration <= 127; iteration++)
{
    if ( letterCount[iteration] > 0 )
    {
         cout << static_cast<char>(iteration) << " " << letterCount[iteration] << endl;
    }
}

system("pause");
exit(0);
}

score 4 · Accepted Answer

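As others have pointed out, you have two Q's in your input. The reason you get two Z's is that the last

inputFile >> character;

(probably when only a newline is left in the stream, so peek() does not yet return EOF) fails to extract anything, leaving the 'Z' from the previous iteration in the variable character. Try checking inputFile.fail() afterwards to see this: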

while (inputFile.peek() != EOF)
{
    inputFile >> character;

    if (!inputFile.fail())
    {
        letterCount[static_cast<int>(character)]++;
    }
}

The idiomatic way to write the loop, which also fixes your 'Z' problem, is:

while (inputFile >> character)
{
      letterCount[static_cast<int>(character)]++;
}
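To compare the two loops without needing a file, here is a minimal sketch that runs both over a std::istringstream. The helper names countWithPeek and countWithExtraction are hypothetical, and the 256-slot table indexed by unsigned char is an illustrative choice, not the question's layout.

```cpp
#include <array>
#include <sstream>
#include <string>

// Hypothetical helper: the peek()-based loop from the question.
// When only '\n' is left, peek() does not return EOF, so the loop body
// runs once more; the extraction skips the newline, fails at end of
// input, and leaves `character` holding the previous value.
std::array<int, 256> countWithPeek(const std::string& text)
{
    std::istringstream input(text);
    std::array<int, 256> counts{};
    char character = '\0';
    while (input.peek() != std::char_traits<char>::eof())
    {
        input >> character;
        ++counts[static_cast<unsigned char>(character)];
    }
    return counts;
}

// Hypothetical helper: the idiomatic loop. The extraction itself is the
// loop condition, so a failed read never reaches the counting statement.
std::array<int, 256> countWithExtraction(const std::string& text)
{
    std::istringstream input(text);
    std::array<int, 256> counts{};
    char character = '\0';
    while (input >> character)
    {
        ++counts[static_cast<unsigned char>(character)];
    }
    return counts;
}
```

Calling both helpers on the same text that ends with a newline, and again on text without one, should show that only the peek() version is sensitive to the trailing newline.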

score 2 · Accepted Answer

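There are two Q's in the uppercase string. I believe the reason you get a count of two for Z is that you should check for EOF after reading a character, not before, but I'm not sure.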

score 2 · Accepted Answer

Well, others have already pointed out the errors in your code.

But here is an elegant way to read the file and count the letters in it:

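#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <vector>

// A ctype facet that classifies every character except A-Z and a-z as
// whitespace, so formatted extraction (>>) skips everything but letters.
struct letter_only: std::ctype<char>
{
    letter_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);

        // Fill the two ranges separately: in ASCII, the characters
        // between 'Z' and 'a' are punctuation, not letters.
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};

struct Counter
{
    std::map<char, int> letterCount;
    void operator()(char item)
    {
        // remove tolower if you want a case-sensitive solution!
        ++letterCount[static_cast<char>(std::tolower(static_cast<unsigned char>(item)))];
    }
    operator std::map<char, int>() { return letterCount; }
};

int main()
{
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // enable reading letters only!
    input.open("filename.txt");
    std::istream_iterator<char> start(input);
    std::istream_iterator<char> end;
    std::map<char, int> letterCount = std::for_each(start, end, Counter());
    for (std::map<char, int>::iterator it = letterCount.begin(); it != letterCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}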

This is a modified (untested) version of this solution:

Elegant ways to count the frequency of words in a file

score 1 · Accepted Answer

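For one thing, you do have two Q's in your input.

As for Z, @Jeremiah is probably right: because it is the last character and your code does not detect EOF correctly, it gets counted twice. This can easily be verified by, for example, changing the order of the characters in your input.

As a side note, here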

for (int iteration = 0; iteration <= 127; iteration++)

your index goes out of bounds; either the loop condition should be iteration < 127, or your array should be declared as int letterCount[128].

score 1 · Accepted Answer

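Given that you apparently only want to count English letters, it seems like you should be able to simplify your code considerably: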

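#include <cctype>
#include <fstream>
#include <iostream>

int main(int argc, char **argv) {
    if (argc < 2) {
        std::cerr << "Usage: pass the input file name as the first argument\n";
        return 1;
    }
    std::ifstream infile(argv[1]);

    char ch;
    static int counts[26];   // static storage, so zero-initialized

    while (infile >> ch)
        if (std::isalpha(static_cast<unsigned char>(ch)))
            ++counts[std::tolower(static_cast<unsigned char>(ch)) - 'a'];

    for (int i = 0; i < 26; i++)
        // 'A' + i is an int; cast back to char so a letter is printed
        std::cout << static_cast<char>('A' + i) << ": " << counts[i] << "\n";
    return 0;
}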

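Of course, there are more possibilities. Compared to @Nawaz's code (for example), this is obviously shorter and simpler, but it is also more limited (e.g., as it stands, it only works with unaccented English characters). It is pretty much restricted to the basic ASCII letters; EBCDIC encoding, ISO 8859-x, or Unicode will break it completely.

His also makes it easy to apply the "letters only" filtering to any file. Choosing between them depends on whether you want/need/can use that flexibility. If you only care about the letters mentioned in the question, and only on typical machines that use some superset of ASCII, this code will handle the job more simply, but if you need more than that, it is not suitable at all.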
