c - Reading from a text file and parsing lines into words in C

Question

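I am a beginner in C and systems programming. For a homework assignment, I need to write a program that reads input from standard input, parses the lines into words, and sends the words to sorting sub-processes using System V message queues (e.g., to count words). I am stuck on the input part. I am trying to process the input, remove non-alphabetic characters, convert all alphabetic words to lowercase, and finally split each line into individual words. So far I can print all the alphabetic words in lowercase, but there are blank lines between some of the words, which I believe is incorrect. Can someone take a look and give me some suggestions?

Sample from the text file: The Project Gutenberg EBook of The Iliad of Homer, by Homer

I think the correct output should be: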

the
project
gutenberg
ebook
of
the
iliad
of
homer
by
homer

But my output is as follows:

project
gutenberg
ebook
of
the
iliad
of
homer
                         <------There is a line there
by
homer

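I think the blank line is caused by the space between "," and "by". I tried something like "if isspace(c), do nothing", but it does not work. My code is below. Any help or suggestion is appreciated.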

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>


//Main Function
int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;

    input_file = fopen(input, "r");

    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file\n");
        exit(-1);
    }
    else
    {        
        while ((c =fgetc(input_file)) != EOF )
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                c = tolower(c);
                putchar(c);
            }
            else if (isspace(c))
            {
                ;   //do nothing
            }
            else
            {
                c = '\n';
                putchar(c);
            }
        }
    }

    fclose(input_file);

    printf("\n");

    return 0;
}

Edit

I edited my code and finally got the correct output:

int main (int argc, char **argv)
{
    int c;
    char *input = argv[1];
    FILE *input_file;

    input_file = fopen(input, "r");

    if (input_file == 0)
    {
        //fopen returns 0, the NULL pointer, on failure
        perror("Cannot open input file\n");
        exit(-1);
    }
    else
    {
        int found_word = 0;

        while ((c =fgetc(input_file)) != EOF )
        {
            //if it's an alpha, convert it to lower case
            if (isalpha(c))
            {
                found_word = 1;
                c = tolower(c);
                putchar(c);
            }
            else {
                if (found_word) {
                    putchar('\n');
                    found_word=0;
                }
            }

        }
    }

    fclose(input_file);

    printf("\n");

    return 0;
}

score 6 · Accepted Answer

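I think you just need to ignore any non-alphabetic character (!isalpha(c)) and otherwise convert to lowercase. In that case you also need to keep track of whether you have found a word.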

int found_word = 0;

while ((c =fgetc(input_file)) != EOF )
{
    if (!isalpha(c))
    {
        if (found_word) {
            putchar('\n');
            found_word = 0;
        }
    }
    else {
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}
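
For experimenting without a file, the same found_word logic can be sketched as a function that works on a string. This is only an illustration: the name split_words, the output buffer, and the trailing newline after the last word are assumptions that do not appear in the original answer.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): applies the found_word
 * logic to a NUL-terminated string and writes one lowercase word per line
 * into out. Returns the number of words started. Output is truncated if
 * out is too small. */
size_t split_words(const char *text, char *out, size_t out_size)
{
    size_t pos = 0;      /* next free index in out */
    size_t words = 0;    /* number of words seen so far */
    int found_word = 0;  /* 1 while we are inside a word */

    if (out_size == 0)
        return 0;

    for (; *text != '\0'; text++) {
        /* the ctype functions expect unsigned char values (or EOF) */
        unsigned char c = (unsigned char)*text;

        if (isalpha(c)) {
            if (pos + 2 >= out_size)   /* keep room for '\n' and '\0' */
                break;
            if (!found_word)
                words++;
            found_word = 1;
            out[pos++] = (char)tolower(c);
        } else if (found_word) {
            /* first non-letter after a word ends that word */
            out[pos++] = '\n';
            found_word = 0;
        }
    }
    if (found_word)                    /* assumption: terminate the last word too */
        out[pos++] = '\n';
    out[pos] = '\0';
    return words;
}
```

Called on the sample line from the question, this should produce the eleven words, one per line, with no blank line between "homer" and "by".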

If you need to handle apostrophes in words such as "isn't", then this should do it:

int found_word = 0;
int found_apostrophe = 0;

while ((c =fgetc(input_file)) != EOF )
{
    if (!isalpha(c))
    {
        if (found_word) {
            if (!found_apostrophe && c=='\'') {
                //remember the apostrophe; it is printed only if a letter follows
                found_apostrophe = 1;
            }
            else {
                found_apostrophe = 0;
                putchar('\n');
                found_word = 0;
            }
        }
    }
    else {
        if (found_apostrophe) {
            putchar('\'');
            found_apostrophe = 0;
        }
        found_word = 1;
        c = tolower(c);
        putchar(c);
    }
}

score 1 · Accepted Answer

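I suspect you really want to treat all non-alphabetic characters as separators, rather than treating only whitespace as a separator and ignoring the other non-alphabetic characters. Otherwise, foo--bar would show up as the single word foobar, right? The good news is that this makes things easier: you can remove the isspace clause and just use the else clause.

Meanwhile, whether or not you treat punctuation specially, you have a problem: you print a newline for every separator, whitespace included. So a line that ends with \r\n or \n, or even a sentence that ends with ., will print a blank line. The obvious way around this is to keep track of the last character, or a flag, so that you only print a newline if you have previously printed a letter.

For example: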

int last_c = 0;

while ((c = fgetc(input_file)) != EOF )
{
    //if it's an alpha, convert it to lower case
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    //otherwise end the word, but only if the previous character was a letter
    else if (isalpha(last_c))
    {
        putchar('\n');
    }
    last_c = c;
}

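But do you really want to treat all punctuation the same? The problem statement implies that you do, but in real life that is a bit odd. For example, foo--bar should probably show up as the separate words foo and bar, but should it's really show up as the separate words it and s? For that matter, using isalpha as your rule for "word characters" also means that, e.g., 2nd will show up as nd.

So, if isalpha is not the appropriate rule for your use case for distinguishing word characters from separators, you will have to write your own function that makes the right distinction. You could easily express such a rule as logic (e.g., isalnum(c) || c == '\'') or as a table (just an array of 128 ints, so the function is c >= 0 && c < 128 && word_char_table[c]). Doing it this way has the added benefit that you can later extend your code to handle Latin-1 or Unicode, or to handle program text (which has different word characters than English text), or ...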
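
The table idea above can be sketched as follows. The names init_word_char_table and is_word_char are hypothetical, and the choice of letters, digits, and the apostrophe as word characters is just one possible rule, assuming the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical lookup table: a nonzero entry marks a "word character". */
static unsigned char word_char_table[128];

/* Fill the table once at startup. Assumed rule: ASCII letters, digits
 * and the apostrophe belong to words; everything else is a separator. */
void init_word_char_table(void)
{
    for (int c = 0; c < 128; c++)
        word_char_table[c] = (unsigned char)(isalnum(c) != 0);
    word_char_table['\''] = 1;
}

/* Returns 1 for word characters, 0 for separators and for values
 * outside 0..127 (including a negative EOF). */
int is_word_char(int c)
{
    return c >= 0 && c < 128 && word_char_table[c];
}
```

With this in place, the isalpha(c) tests in the loops above could be replaced by is_word_char(c), and changing the rule later only means changing how the table is filled.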

score 0 · Accepted Answer

It looks like you are separating words by whitespace, so I think

while ((c =fgetc(input_file)) != EOF )
{
    if (isalpha(c))
    {
        c = tolower(c);
        putchar(c);
    }
    else if (isspace(c))
    {
       putchar('\n');
    }
}

would also work, provided there is no more than one whitespace character between words in your input text.
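
If the input may contain runs of whitespace (several spaces, or \r\n line endings), one way to lift that restriction is to remember that a separator was seen and print the newline only when the next word starts. The sketch below is an assumption-laden variation, not part of the original answer: the function name print_space_separated and the FILE parameters are made up for illustration. Note that it keeps this answer's rule that only whitespace separates words, so foo--bar still comes out as foobar.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: copies lowercase words from in to out, one per line.
 * Only whitespace separates words; other non-letters are dropped.
 * Consecutive whitespace characters yield a single newline. */
void print_space_separated(FILE *in, FILE *out)
{
    int c;
    int pending_newline = 0;  /* whitespace seen since the last letter */
    int printed_any = 0;      /* at least one letter has been printed */

    while ((c = fgetc(in)) != EOF) {
        if (isalpha(c)) {
            /* emit the delayed newline only between two words */
            if (pending_newline && printed_any)
                fputc('\n', out);
            pending_newline = 0;
            fputc(tolower(c), out);
            printed_any = 1;
        } else if (isspace(c)) {
            pending_newline = 1;
        }
    }
    if (printed_any)          /* assumption: end the last word with a newline */
        fputc('\n', out);
}
```

Calling print_space_separated(input_file, stdout) would take the place of the while loop above.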
