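Finding words that end with a period is simple: just check if (word.back() == '.'). You also need to check word.empty() first, because back() has undefined behavior if the string is empty. If your compiler doesn't support C++11, you can use word[word.size() - 1] == '.' instead.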
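As a minimal sketch of that check, the empty-string guard and the comparison can be wrapped in one small helper. The name ends_with_period is only an illustration (it is not a standard function), and word.back() assumes a C++11 compiler.

```cpp
#include <string>

// Hypothetical helper name, for illustration only.
// Returns true if `word` is non-empty and its last character is '.'.
// The empty() check must come first: calling back() on an empty
// string is undefined behavior.
bool ends_with_period(const std::string& word) {
    return !word.empty() && word.back() == '.';
}
```

Because && short-circuits, back() is never evaluated for an empty string.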
Here's a basic example that naively splits sentences at any word ending in ".":
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* use word.back() == '.' for C++11 */
        if (!word.empty() && word[word.size() - 1] == '.') {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    /* Keep any trailing words that were not terminated by a period. */
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
Run it like this:
$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test.
And a second sentence.
So we meet again Mr.
Bond.
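Notice how it treats "Mr." as the end of a sentence.
I'm not sure of a clever way to handle this, but one (fragile) option is to make a list of words that don't end a sentence, and then check whether the word is in that list, like this: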
#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = {
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> ABBREVIATIONS(tmp, tmp + sizeof(tmp) / sizeof(tmp[0]));

bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}

/* std::tolower requires a value representable as unsigned char,
 * so cast before calling it. */
char to_lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), to_lower_char);
    /* Check if the word is an abbreviation. */
    return ABBREVIATIONS.find(word) != ABBREVIATIONS.end();
}

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* A period ends the sentence unless the word is a known abbreviation. */
        if (has_period(word) && !is_abbreviation(word)) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }
    if (!current_sentence.empty()) {
        sentences.push_back(current_sentence);
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
In C++11, you can make this more efficient by using unordered_set, and simpler by using std::string::back and the easier initialization (std::set<std::string> ABBREVIATIONS = { "dr.", "mr.", "mrs." /*etc.*/ }).
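As a rough sketch of what those C++11 changes might look like, the set and the two helpers could be rewritten as follows. This only illustrates the three suggestions (unordered_set, back(), brace initialization); the lambda used for lowercasing is an extra convenience assumed here, not something the answer requires.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Brace initialization replaces the temporary-array trick, and
// unordered_set gives average constant-time lookups.
const std::unordered_set<std::string> ABBREVIATIONS = {
    "dr.", "mr.", "mrs.", "ms.", "rd.", "st."
};

bool has_period(const std::string& word) {
    // back() is safe here because empty() is checked first.
    return !word.empty() && word.back() == '.';
}

bool is_abbreviation(std::string word) {
    // Lowercase a copy of the word; taking unsigned char keeps
    // std::tolower well-defined for every input byte.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return ABBREVIATIONS.count(word) != 0;
}
```

The main function from the listing above should work unchanged with these definitions, and the output should be the same.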
Running this version:
$ g++ test.cpp
$ ./a.out This is a test. And a second sentence. So we meet again Mr. Bond.
This is a test.
And a second sentence.
So we meet again Mr. Bond.
But, of course, it still doesn't catch any cases that we haven't explicitly programmed in:
$ ./a.out Example Ave. is just north of here.
Example Ave.
is just north of here.
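Even if we add that one to the list, it would be hard to detect cases where a sentence ends with an abbreviation, like "I live on Example Ave." I hope this helps as a start.
Edit: I just read the Wikipedia article on sentence breaking that was linked in the comments, and it's relatively easy to incorporate this rule:
(c) If the next token is capitalized, then it ends a sentence.
Something like: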
#include <algorithm>
#include <cctype>
#include <iostream>
#include <set>
#include <string>
#include <vector>

const std::string tmp[] = {
    "ave.",
    "dr.",
    "mr.",
    "mrs.",
    "ms.",
    "rd.",
    "st."
};
const std::set<std::string> PERIOD_WORDS(tmp, tmp + sizeof(tmp) / sizeof(tmp[0]));

bool has_period(const std::string& word) {
    return !word.empty() && word[word.size() - 1] == '.';
}

/* std::tolower requires a value representable as unsigned char,
 * so cast before calling it. */
char to_lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

bool is_abbreviation(std::string word) {
    /* Convert to lowercase, so we don't need to check every possible
     * variation of each word. Remove this (and update the set initialization)
     * if you don't care about handling poor grammar. */
    std::transform(word.begin(), word.end(), word.begin(), to_lower_char);
    /* Check if the word is a word that ends with a period. */
    return PERIOD_WORDS.find(word) != PERIOD_WORDS.end();
}

bool is_capitalized(const std::string& word) {
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}

int main(int argc, char** argv) {
    if (argc == 1) {
        std::cerr << "Usage: " << argv[0] << " [text to split]\n"
                  << "Splits the input text into one sentence per line." << std::endl;
        return 1;
    }

    std::vector<std::string> sentences;
    std::string current_sentence;
    for (int i = 1; i < argc; ++i) {
        std::string word(argv[i]);
        /* Empty when `word` is the last argument. */
        std::string next_word(i + 1 < argc ? argv[i + 1] : "");
        current_sentence.append(word);
        current_sentence.push_back(' ');
        /* End the sentence at the last word, or at a period that is either
         * not an abbreviation or is followed by a capitalized word. */
        if (next_word.empty()
            || (has_period(word)
                && (!is_abbreviation(word) || is_capitalized(next_word)))) {
            sentences.push_back(current_sentence);
            current_sentence.clear();
        }
    }

    for (size_t i = 0; i < sentences.size(); ++i) {
        std::cout << sentences[i] << std::endl;
    }
    return 0;
}
Then even cases like this work:
$ ./a.out Example Ave. is just north of here. I live on Example Ave. Test test test.
Example Ave. is just north of here.
I live on Example Ave.
Test test test.
But it still can't handle some cases:
$ ./a.out Mr. Adams lives on Example Ave. Example Ave. is just north of here. I live on Example Ave. Test test test.
Mr.
Adams lives on Example Ave.
Example Ave. is just north of here.
I live on Example Ave.
Test test test.
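One possible (and equally fragile) way to reduce that particular failure is to treat personal titles separately from other abbreviations, on the assumption that a title such as "Mr." is always followed by a name and so never ends a sentence. The sketch below is only an illustration of that idea: the names TITLES, OTHER_ABBREVIATIONS, and ends_sentence are hypothetical, the split between the two lists is a guess, and it assumes a C++11 compiler.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Assumption: these titles are always followed by a name, so they
// never end a sentence, even before a capitalized word.
const std::set<std::string> TITLES = {"mr.", "mrs.", "ms."};
// Assumption: these may end a sentence if the next word is capitalized.
const std::set<std::string> OTHER_ABBREVIATIONS = {"ave.", "dr.", "rd.", "st."};

std::string to_lower(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return word;
}

bool is_capitalized(const std::string& word) {
    return !word.empty() && std::isupper(static_cast<unsigned char>(word[0]));
}

// Decides whether `word` ends a sentence, given the word that follows it
// (an empty `next_word` means `word` is the last word of the input).
bool ends_sentence(const std::string& word, const std::string& next_word) {
    if (next_word.empty()) return true;                     // end of input
    if (word.empty() || word.back() != '.') return false;   // no period
    const std::string lower = to_lower(word);
    if (TITLES.count(lower) != 0) return false;             // "Mr. Adams"
    if (OTHER_ABBREVIATIONS.count(lower) != 0) {
        return is_capitalized(next_word);                   // rule (c)
    }
    return true;                                            // ordinary period
}
```

With this split, "Mr." followed by "Adams" no longer ends a sentence, while "Ave." followed by "Example" still does. Ambiguous abbreviations remain a problem: "dr." is left in the second list here because it can mean "Drive", so "Dr. Smith" would still be split.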