c++ - Ignoring a few different words.. c++?

Question

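I'm reading in several documents and indexing the words I read. However, I want to ignore common words (a, an, the, and, is, or, are, etc.).

Is there a shortcut for this, beyond just doing...

if(word=="and" || word=="is" || etc etc....) ignore word;

For example, can I put them into a const string somehow and have it check against that string? Not sure... thank you!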

score 5 · Accepted Answer

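Create a set<string> with the words you want to exclude, and use mySet.count(word) to determine whether the word is in the set. If it is, the count will be 1; otherwise it will be 0.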

#include <iostream>
#include <set>
#include <string>
using namespace std;

int main() {
    const char *words[] = {"a", "an", "the"};
    set<string> wordSet(words, words+3);
    cerr << wordSet.count("the") << endl;
    cerr << wordSet.count("quick") << endl;
    return 0;
}
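To connect the set lookup back to the indexing task in the question, here is a minimal sketch of a word counter that skips stop words. The helper names toLower and indexWords are hypothetical, the code assumes C++11, and it splits on whitespace only (punctuation handling is left out). Lowercasing before the lookup is an added assumption so that "The" and "the" are treated the same.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: lowercases a word so "The" and "the" compare equal.
std::string toLower(std::string word) {
    for (char &c : word) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}

// Hypothetical helper: counts each word in text, skipping anything in stopWords.
std::map<std::string, int> indexWords(const std::string &text,
                                      const std::set<std::string> &stopWords) {
    std::map<std::string, int> index;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {              // splits on whitespace only
        word = toLower(word);
        if (stopWords.count(word)) {  // count() is 1 for a stop word, 0 otherwise
            continue;
        }
        ++index[word];
    }
    return index;
}
```

For example, indexing "The quick fox and the lazy fox" with the stop words a, an, the would leave four entries: quick, fox (counted twice), and, lazy.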

score 1 · Accepted Answer

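You can use an array of strings, looping through and matching against each one, or use a more efficient data structure such as a set or a trie.

Here is an example of how to do it with a plain array: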

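#include <cstring> // strcmp

const char *commonWords[] = {"and", "is" /* ... */};
// number of words in the array, computed so it stays in sync with the list
const int commonWordsLength = sizeof(commonWords) / sizeof(commonWords[0]);

// Returns true if word matches one of the common words (linear search).
bool isCommonWord(const char *word)
{
    for (int i = 0; i < commonWordsLength; ++i)
    {
        if (!std::strcmp(word, commonWords[i]))
        {
            return true; // ignore word
        }
    }
    return false;
}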

Note that this example doesn't use the C++ STL, but you should.
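As a rough illustration of what an STL version of the same array idea could look like, the sketch below uses std::find for a linear search and std::binary_search for a sorted list. The function names and the word list are illustrative assumptions, not part of the answer above, and the code assumes C++11.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns an illustrative word list, sorted so that
// std::binary_search can be used on it.
std::vector<std::string> makeCommonWords() {
    std::vector<std::string> words = {"and", "is", "or", "are"};
    std::sort(words.begin(), words.end());
    return words;
}

// Linear search: the direct STL counterpart of the strcmp loop.
bool containsWord(const std::vector<std::string> &words, const std::string &word) {
    return std::find(words.begin(), words.end(), word) != words.end();
}

// Logarithmic search: only valid when the list is sorted.
bool containsWordSorted(const std::vector<std::string> &sortedWords,
                        const std::string &word) {
    return std::binary_search(sortedWords.begin(), sortedWords.end(), word);
}
```

The linear version works on any list; the sorted version trades a one-time sort for faster lookups as the list grows.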

score 0 · Accepted Answer

If you want to maximize performance, you should create a trie....

http://en.wikipedia.org/wiki/Trie

...of the stop words....

http://en.wikipedia.org/wiki/Stop_words

There is no standard C++ trie data structure, but see this question for third-party implementations...

Trie implementation
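For a sense of what a trie involves, here is a minimal sketch, assuming C++11. The class name StopWordTrie and its members are illustrative and not taken from any third-party implementation. It stores children in a std::map for brevity; whether a structure like this actually outperforms a hash set depends on the word list and would need measuring.

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal trie sketch; the class and member names are illustrative.
class StopWordTrie {
private:
    struct Node {
        bool isWord = false;                             // true if a word ends here
        std::map<char, std::unique_ptr<Node>> children;  // one edge per character
    };
    Node root_;

public:
    // Adds word to the trie, creating nodes along the path as needed.
    void insert(const std::string &word) {
        Node *node = &root_;
        for (char c : word) {
            std::unique_ptr<Node> &child = node->children[c];
            if (!child) {
                child.reset(new Node());
            }
            node = child.get();
        }
        node->isWord = true;
    }

    // Returns true only if the exact word was inserted (prefixes do not match).
    bool contains(const std::string &word) const {
        const Node *node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) {
                return false;
            }
            node = it->second.get();
        }
        return node->isWord;
    }
};
```

After inserting "a", "an" and "the", contains("an") is true while contains("th") is false, because "th" is only a prefix of a stored word.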

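If you can't be bothered with that and want to use a standard container, the best one to use is unordered_set<string>, which puts the stop words in a hash table.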

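#include <string>
#include <unordered_set>
using namespace std;

// Returns true if word should be kept, i.e. it is not a stop word.
bool filter(const string& word)
{
    static const unordered_set<string> stopwords({"a", "an", "the"});
    return !stopwords.count(word);
}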
