c++ - Identifying the end of a sentence

Question

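I am trying to read a text file and load it, one string at a time, into a vector of strings. I need the input to stop at the end of each sentence so that I can then pick out the keywords in that sentence. I know how to find the keywords, but I don't know how to make it stop reading strings at the end of a sentence. I am using a while loop to check each line, and I am thinking of using a series of if statements such as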

if(std::vector<string>::iterator i == ".") i == "\0"

So far, my code for filling the vector is:

std::string c;
ifstream infile;
infile.open("example.txt");
while(infile >> c){
    a.push_back(c);
}

OK, so I have worked out a way to load each word of the text file into tokens, treating " " as a delimiter, and I have a list of special-case words:

    const int MAX_PER_LINE = 512;
    const int MAX_TOK = 20;
    const char* const DELIMETER = " -";
    const char* const SPECIAL ="!?.";
    const char* const ignore[]  = {"Mr.", "Ms.","Mrs.","sr.", "Ave.", "Rd."};

Then

             if(!file.good()){
         return 1;
     }
     //parsing algorithm paraphrased from cs.dvc.edu/HowTo_Parse.html
     while(!file.eof()){
     char line[MAX_PER_LINE];

     file.getline(line, MAX_PER_LINE);
     int n = 0;
     const char* token[MAX_TOK] = {};
     token[0] = strtok(line, DELIMETER);
     if(token[0]){
         for(n = 1; n < MAX_TOK; ++n){
             token[n] = strtok(0, DELIMETER);
             if(!token[n]) break;
         }
     }
     //for(int i = 0; i < n; ++i){
     for(int i = 0; i < n; ++i){
         cout << "Token[" << i << "] =" << token[i] << endl;
         cout << endl; 
     }
     }

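Now I am looking for what to put in the if statement so that it checks each token for the special cases, or, if a token follows one that has a special case, loads it into a new set of tokens. I mostly know the pseudocode, but I don't know what syntax to use for: if (token[i] contains a special case, or token[i] has nothing before it (for the first token), or it is capitalized and follows a token with a special case), load it into a new token.

Any help would be greatly appreciated.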

score 2 · Accepted Answer

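For small projects, or projects without internationalization, writing your own sentence splitter is fine. For an advanced text solution based on text boundaries, I would recommend ICU's BreakIterator. It is based on the unicode.org standards and provides character, word, line-break, and sentence boundaries. ICU has open-source libraries for C++ (and, I believe, Java). Refer to this page, which has a link to the library download page.

This avoids reinventing the wheel and heads off potential problems later. Most leading publishing software products, such as QuarkXPress, use this library.

Edit: I tried to find a quick tutorial on using ICU's BreakIterator for sentence boundaries, but I only found a word-boundary example (sentence-boundary computation is very similar; you probably just need to replace createWordInstance with createSentenceInstance below):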

void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    if (U_FAILURE(status)) {
        return;  /* creation failed, so bi must not be used */
    }

    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}

score 0 · Accepted Answer

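Finding words that end with a period is simple: just check if word.back() == '.'. You also need to check word.empty() first, because back() has undefined behavior if the string is empty. If your compiler doesn't support C++11, you can use word[word.size() - 1] == '.' instead.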
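As a minimal sketch of that check in C++11, the two helpers below guard the call to back() with empty(). The names ends_with_period and ends_with_terminator are hypothetical, and the default terminator set ".!?" is an assumption borrowed from the SPECIAL constant in the question rather than something this answer prescribes.

```cpp
#include <string>

// Hypothetical helper: true when `word` is non-empty and ends with '.'.
// empty() is tested first because back() on an empty string is undefined.
bool ends_with_period(const std::string& word) {
    return !word.empty() && word.back() == '.';
}

// Hypothetical generalization: true when the last character of `word`
// is any of the characters in `terminators` (assumed default: ".!?").
bool ends_with_terminator(const std::string& word,
                          const std::string& terminators = ".!?") {
    return !word.empty()
        && terminators.find(word.back()) != std::string::npos;
}
```

Because && short-circuits, back() is never evaluated for an empty string.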

Here is a basic example that naively splits sentences at any word ending with ".":

#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
            << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* use word.back() == '.' for C++11 */
        if (!word.empty() && word[word.size() - 1] == '.') {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}

Run it like this:

$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test. 
And a second sentence. 
So we meet again Mr. 
Bond.

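Notice how it treats "Mr." as the end of a sentence.

I'm not sure of a clever way to handle this, but one (fragile) option is to make a list of words that do not end a sentence, and then check whether the word is in that list, like this: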

#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = {
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> ABBREVIATIONS(tmp, tmp + sizeof(tmp) / sizeof(tmp[0]));

bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}

bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);

    /* Check if the word is an abbreviation. */
    return ABBREVIATIONS.find(word) != ABBREVIATIONS.end();
}

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
            << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        if (has_period(word) && !is_abbreviation(word)) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}

In C++11, you can make this more efficient by using unordered_set, and simpler by using std::string::back and the easier initialization ( std::set<std::string> PERIOD_WORDS = { "dr.", "mr.", "mrs." /*etc.*/ }).
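A rough sketch of how those C++11 changes might look for the two helper functions is below. The abbreviation list is the same one used in the program above. The lambda that casts through unsigned char is an addition not mentioned in the answer; it is there because passing a negative char value to std::tolower is undefined.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Brace initialization replaces the temporary array and the sizeof arithmetic.
const std::unordered_set<std::string> PERIOD_WORDS = {
    "dr.", "mr.", "mrs.", "ms.", "rd.", "st."
};

bool has_period(const std::string& word) {
    // back() is only reached when the string is non-empty.
    return !word.empty() && word.back() == '.';
}

bool is_abbreviation(std::string word) {
    // Lowercase a copy so "Mr." and "MR." both match "mr.".
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    // Average constant-time lookup instead of the O(log n) of std::set.
    return PERIOD_WORDS.count(word) != 0;
}
```

For a list this short the lookup speed hardly matters; the simpler initialization is the more noticeable gain.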

Running this version:

$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test. 
And a second sentence. 
So we meet again Mr. Bond.

But, of course, it still doesn't catch any case we haven't explicitly programmed in:

$ ./a.out Example Ave. is just north of here.
Example Ave. 
is just north of here.

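Even if we add that one, it is hard to detect cases where a sentence ends with an abbreviation, such as "I live on Example Ave." I hope this helps as a starting point.

Edit: I just read the Wikipedia article on sentence breaking that was linked in the comments, and one of its rules is relatively easy to incorporate:

(c) If the next token is capitalized, then it ends a sentence.

Something like: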

#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = {
    "ave.",
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> PERIOD_WORDS(tmp, tmp + sizeof(tmp) / sizeof(tmp[0]));

bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}

bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);

    /* Check if the word is a word that ends with a period. */
    return PERIOD_WORDS.find(word) != PERIOD_WORDS.end();
}

bool is_capitalized(const std::string& word) {
    return !word.empty() && std::isupper(word[0]);
}

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
            << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        std::string next_word(i + 1 < argc ? argv[i + 1] : "");
        current_sentence.append(word);
        current_sentence.push_back(' ');
        if (next_word.empty()
            || has_period(word)
            && (!is_abbreviation(word) || is_capitalized(next_word))) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}

Then even cases like this work:

$ ./a.out Example Ave. is just north of here. I live on Example Ave. Test test test.
Example Ave. is just north of here. 
I live on Example Ave. 
Test test test.

But it still fails to handle some cases:

$ ./a.out Mr. Adams lives on Example Ave. Example Ave. is just north of here. I live on Example Ave. Test test test.
Mr. 
Adams lives on Example Ave. 
Example Ave. is just north of here. 
I live on Example Ave. 
Test test test.
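Since the question reads words from a file with operator>> rather than from argv, the same heuristics can be applied to any std::istream. The following is only a sketch under assumptions: split_sentences is a hypothetical name, the code uses C++11, it treats '!' and '?' as terminators in addition to '.' (taken from the SPECIAL constant in the question), and it shares every limitation shown above, including the "Mr. Adams" failure.

```cpp
#include <algorithm>
#include <cctype>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated words from `in` and
// groups them into sentences with the same rules as the program above.
std::vector<std::string> split_sentences(std::istream& in) {
    static const std::set<std::string> abbreviations = {
        "ave.", "dr.", "mr.", "mrs.", "ms.", "rd.", "st."
    };
    std::vector<std::string> sentences;
    std::string current, word, next;
    bool have_word = static_cast<bool>(in >> word);
    while (have_word) {
        // Look one word ahead so the capitalization rule can be applied.
        bool have_next = static_cast<bool>(in >> next);
        if (!current.empty()) current.push_back(' ');
        current += word;

        std::string lower = word;
        std::transform(lower.begin(), lower.end(), lower.begin(),
                       [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                       });
        // `word` is never empty here because the extraction succeeded.
        bool terminated = std::string(".!?").find(word.back()) != std::string::npos;
        bool abbreviation = abbreviations.count(lower) != 0;
        bool next_capitalized =
            have_next && std::isupper(static_cast<unsigned char>(next[0]));

        // End the sentence at the last word, or at a terminator that is
        // either not an abbreviation or is followed by a capitalized word.
        if (!have_next || (terminated && (!abbreviation || next_capitalized))) {
            sentences.push_back(current);
            current.clear();
        }
        word = next;
        have_word = have_next;
    }
    return sentences;
}
```

With the question's setup, this could be called as split_sentences(infile) after infile.open("example.txt"), storing the result in the vector a. Unlike the argv version, it does not leave a trailing space at the end of each sentence.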
