c++ - Counting the number of unique words and the occurrences of each word

Question

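CSCI-15 Assignment #2, String processing. (60 points) Due September 23, 2013

You may NOT use C++ string objects in this program.

Write a C++ program that reads lines of text from a file using the ifstream getline() method, tokenizes the lines into words ("tokens") using strtok(), and keeps statistics on the data in the file. Your input and output file names will be supplied to your program on the command line, which you will access using argc and argv[].

You need to count the total number of words, the number of unique words, the count of each individual word, and the number of lines. Also, remember and print the longest and shortest words in the file. If there is a tie for longest or shortest word, you may resolve the tie in any consistent manner (e.g., use either the first one or the last one found, but use the same method for both longest and shortest). You may assume the lines consist of words (contiguous lower-case letters [a-z]) separated by spaces and terminated with a period. You may ignore the possibility of other punctuation marks, including possessives or contractions, as in "Jim's house". Lines before the last one in the file will have a newline ('\n') after the period. In your data files, omit the '\n' on the last line.

Read the lines from the input file and echo them to the output file. After reaching end-of-file on the input file (or reading a line of length zero, which you should treat as the end of the input data), print the words with their occurrence counts, one word/count pair per line, followed by the collected statistics, to the output file. You will also need to create other test files of your own. In addition, your program must work correctly with an EMPTY input file, which has no statistics.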

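The test file looks like this (exactly 4 lines, with NO NEWLINE on the last line):

the quick brown fox jumps over the lazy dog.
now is the time for all good men to come to the aid of their party.
all i want for christmas is my two front teeth.
the quick brown fox jumps over a lazy dog.

Copy and paste this into a small file for one of your tests.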

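Hints:

Use a 2-dimensional array of char, 100 rows by 16 columns (why not 15?), to hold the unique words, and a 1-dimensional array of int with 100 elements to hold the associated counts. For each word, scan through the occupied rows in the array for a match (use strcmp()); if you find a match, increment the associated count, otherwise (you got past the last word) add the word to the table and set its count to 1.

The separate longest word and shortest word need to be saved in their own C-strings. (Why can't you just keep pointers to them in the tokenized data?)

Remember - put NO NEWLINE at the end of the last line, or your end-of-file test may not work correctly. (This may cause the program to read a zero-length line before seeing end-of-file.)

This is not a long program: no more than 2 pages of code.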
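
The table lookup described in the first hint could be sketched roughly as follows. This is only an illustration, not the instructor's solution: the names WordTable, addWord, MAX_WORDS and MAX_LEN are hypothetical, and wrapping the two arrays in a struct is an assumption made to keep the example short.

```cpp
#include <cstring>

const int MAX_WORDS = 100;  // rows in the table, as suggested by the hint
const int MAX_LEN = 16;     // room for 15 letters plus the terminating '\0'

// Hypothetical wrapper around the two arrays from the hint.
struct WordTable {
    char words[MAX_WORDS][MAX_LEN];  // the unique words seen so far
    int counts[MAX_WORDS];           // counts[i] belongs to words[i]
    int used;                        // number of occupied rows
};

// Records one occurrence of word and returns its row index.
// Returns -1 if the word does not fit in a row or the table is full.
int addWord(WordTable &table, const char *word)
{
    // Scan only the occupied rows for a match.
    for (int i = 0; i < table.used; ++i) {
        if (std::strcmp(table.words[i], word) == 0) {
            ++table.counts[i];
            return i;
        }
    }
    // Got past the last word: append it, if there is room.
    if (table.used == MAX_WORDS || std::strlen(word) >= MAX_LEN) {
        return -1;
    }
    std::strcpy(table.words[table.used], word);
    table.counts[table.used] = 1;
    return table.used++;
}
```

A caller would start from an empty table (used set to 0) and call addWord() once per token; afterwards used is the number of unique words.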

Here is what I have so far:

#include<iostream>
#include<iomanip>
#include<fstream>
#include<string>
#include<cstring>
using namespace std;

void totalwordCount(ifstream &inputFile)
{
    char words[100][16]; // Holds the unique words.
    char *token;
    int totalCount = 0; // Counts the total number of words.
    // Read every word in the file.
    while(inputFile >> words[99])
    {
        totalCount++; // Increment the total number of words.
        // Tokenize each word and remove spaces, periods, and newlines.
        token = strtok(words[99], " .\n"); 
        while(token != NULL)
        {
            token = strtok(NULL, " .\n");
        }
    } 
    cout << "Total number of words in file: " << totalCount << endl;
}

void uniquewordCount(ifstream &inputFile)
{
    char words[100][16]; // Holds the unique words
    int counter[100];
    char *tok = "0";
    int uniqueCount = 0; // Counts the total number of unique words
    while(!inputFile.eof())
    {
        uniqueCount++;
        tok = strtok(words[99], " .\n");
        while(tok != NULL)
        {
            tok = strtok(NULL, " .\n");
            inputFile >> words[99];
            if(strcmp(tok, words[99]) == 0)
            {
                counter[99]++;
            }
            else
            {
                words[99][15] += 1;
            }
            uniqueCount++;
        }
    }
    cout << counter[99] << endl;
}

int main(int argc, char *argv[])
{
    ifstream inputFile;
    char inFile[12] = "string1.txt";
    char outFile[16] = "word result.txt";

    // Get the name of the file from the user.
    cout << "Enter the name of the file: ";
    cin >> inFile;

    // Open the input file.
    inputFile.open(inFile);

    // If successfully opened, process the data.
    if(inputFile)
    {
        while(!inputFile.eof())
        {        
            totalwordCount(inputFile);
            uniquewordCount(inputFile);
        }
    }
    return 0;
}

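I have worked out how to count the total number of words in the file in the totalwordCount() function, but in the uniquewordCount() function I cannot work out how to count the number of unique words and the occurrences of each word. Is there anything I need to change in the uniquewordCount() function?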

score 4 · Accepted Answer

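The program contains several things which are considered harmful! To prevent bad software from being created based on the entirely nonsensical assignment above, here are a few hints:

Always test that reading from a stream was successful after reading. Using in.eof() to determine whether the stream is in a good state does not work! One of the problems is that a stream can go bad for reasons other than end of file, e.g., failing to parse a value (which sets std::ios_base::failbit but does not set std::ios_base::eofbit).
Reading into a fixed-size array of char without setting a limit on the number of characters to be read, i.e., in >> a, is the C++ way of spelling gets()! If you really think using in >> a is the right approach (see the next item), you absolutely need to set the width to the size of the array, e.g., using in >> std::setw(sizeof(a)) >> a. Of course, you still need to check that this extraction was successful.
By the looks of it, your teacher wants you to actually use std::istream::getline() to read into an array, e.g., using in.getline(a, sizeof(a)) (which, of course, needs to be checked for success).
Note that formatted input, i.e., in >> a, already tokenizes the stream by whitespace! There is no need to bother with strtok() after that.
Once you have consumed a stream, it is consumed. If the characters come not from a file but from something like standard input, you cannot rewind the stream to read it again either. I would think you want to tokenize the values once and use them for both purposes.
This is more of a side note: once a stream is created, its nature should be entirely irrelevant for the processing of the stream's content (although, e.g., for string streams you may want to collect the result using the str() member at the end): implement your stream-processing functions in terms of std::istream rather than std::ifstream!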
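
To make the points above about checked reads concrete, here is a minimal sketch of both styles of input. The function names countWordsFormatted and countLines and the buffer sizes are assumptions chosen for illustration; both functions take a std::istream so they work with files, string streams, or standard input alike.

```cpp
#include <iomanip>
#include <istream>

// Counts whitespace-separated words using formatted input.
// std::setw limits each extraction to the size of the array (including
// the terminating '\0'), so a longer word is split rather than overflowing.
int countWordsFormatted(std::istream &in)
{
    char word[16];
    int total = 0;
    // The extraction itself is the loop condition: the body only runs
    // when a word was actually read.
    while (in >> std::setw(sizeof(word)) >> word) {
        ++total;
    }
    return total;
}

// Counts lines using unformatted input, as the assignment asks.
// The loop ends at end of file, or when a line does not fit in the
// buffer (getline() sets failbit in that case).
int countLines(std::istream &in)
{
    char line[100];
    int lines = 0;
    while (in.getline(line, sizeof(line))) {
        ++lines;
    }
    return lines;
}
```

Neither loop calls eof(): the result of the read operation decides whether the data is used.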

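Since you have a concrete question ("Is there anything I need to change in the uniquewordCount() function?"): Yes, everything! Throw the function away entirely and reconsider what you need to do. Basically, the structure of the function should follow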

char buffer[100];
while (in.getline(buffer, sizeof(buffer))) {
    // tokenize buffer into words
    // for each word check if it already exists
    // if the word does not exist, append it to the array of known words and set count to 1
    // if the word exists, increment the count
    // determine if the word is shorter or longer than the shortest or longest word so far
    // if it is the case, remember the word's index or a pointer to it
}
