c++ - Performance issue when retrieving all words with a common prefix from a prefix tree (trie)

Question

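I have a prefix tree that stores a large collection of words. To find all words with the common prefix "a", I first retrieve the node that holds "a" and then exhaustively search its children in depth-first order. The idea looks naive and simple, but it is actually very slow when the number of words sharing the prefix is large (> 20K). Is there another way to efficiently retrieve all words that start with a common prefix, or should I use a different data structure? Thanks in advance.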

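EDIT1 Basically, I build each complete word by visiting every node and appending one character at a time. All words are then stored in a vector container. And yes, my implementation is recursive.

EDIT2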

vector<int> getNonEmptyEdgeIndices(Node* parent) {
    vector<int> indices;
    for(int i=0; i<EDGE; i++) {
        if (parent->edges[i] != NULL) {
            indices.push_back(i);
        }
    }
    return indices; 
}

vector<string> getSubsequentStrings(vector<string> wordsStartingWith, Node* node, string prefix) {
    vector<int> indices = getNonEmptyEdgeIndices(node);

    // push the word to the container if node is a leaf 
    if (indices.empty()) {
        wordsStartingWith.push_back(prefix);
        return wordsStartingWith;
    }

    // if frequency is set in node, push the word but still continue recursion
    if (node->frequency != 0) {
        wordsStartingWith.push_back(prefix);
    }

    // look all the children of the node
    for(unsigned int i=0; i<indices.size(); i++) {
        string newPrefix = prefix + getNodeChar(indices[i]);
        Node* child = node->edges[indices[i]];

        // recursively get the prefix for all children
        wordsStartingWith = getSubsequentStrings(wordsStartingWith, child, newPrefix);  
    }

    return wordsStartingWith;
}

vector<string> Trie::getWordsStartingWith(string prefix) {
    vector<string> wordsStartingWith;
    Node* lastNode = getLastNode(prefix);

    if (lastNode != NULL) {
        wordsStartingWith = getSubsequentStrings(wordsStartingWith, lastNode, prefix);
    }
    return wordsStartingWith;
}

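EDIT3 Solved!!! There was indeed a problem in my implementation: I was passing the huge vector of strings by value through every recursive call, so it was copied again and again, and that was the actual bottleneck. Thank you all for your kind suggestions.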
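For readers who hit the same problem, the fix described in EDIT3 can be sketched roughly as follows: the result vector is passed by reference, so the recursion never copies it. This is only a sketch under assumptions. The question does not show `Node`, `EDGE`, or `getNodeChar`, so the definitions below (lowercase a-z, `std::unique_ptr` children) are stand-ins, and `edgeIndex`, `insertWord`, and `collectWords` are hypothetical helpers. Unlike the code above, it records a word only where `frequency` is non-zero, which gives the same result whenever every leaf ends a stored word.

```cpp
#include <array>
#include <memory>
#include <string>
#include <vector>

// Assumed layout: Node, EDGE and getNodeChar are not shown in the question,
// so these are illustrative stand-ins that handle lowercase a-z only.
constexpr int EDGE = 26;

struct Node {
    std::array<std::unique_ptr<Node>, EDGE> edges;
    int frequency = 0;  // non-zero marks the end of a stored word
};

inline char getNodeChar(int index) { return static_cast<char>('a' + index); }

// Hypothetical helper: maps a character to an edge index, or -1 if unsupported.
inline int edgeIndex(char c) { return (c >= 'a' && c <= 'z') ? c - 'a' : -1; }

// Hypothetical helper for building a trie; words with other characters are ignored.
void insertWord(Node* root, const std::string& word) {
    for (char c : word) {
        if (edgeIndex(c) < 0) return;
    }
    Node* cur = root;
    for (char c : word) {
        std::unique_ptr<Node>& next = cur->edges[edgeIndex(c)];
        if (!next) next = std::make_unique<Node>();
        cur = next.get();
    }
    ++cur->frequency;
}

// Depth-first collection. `out` is taken by reference, so no vector is copied,
// and `prefix` is a single buffer that grows and shrinks with the recursion.
void collectWords(const Node* node, std::string& prefix, std::vector<std::string>& out) {
    if (node->frequency != 0) out.push_back(prefix);
    for (int i = 0; i < EDGE; ++i) {
        if (node->edges[i]) {
            prefix.push_back(getNodeChar(i));
            collectWords(node->edges[i].get(), prefix, out);
            prefix.pop_back();
        }
    }
}

std::vector<std::string> getWordsStartingWith(const Node* root, const std::string& prefix) {
    std::vector<std::string> out;
    const Node* cur = root;
    for (char c : prefix) {
        const int i = edgeIndex(c);
        if (i < 0 || !cur->edges[i]) return out;  // no stored word has this prefix
        cur = cur->edges[i].get();
    }
    std::string buffer = prefix;
    collectWords(cur, buffer, out);
    return out;
}
```

Because the children are visited in index order, the words come back in alphabetical order. Each result string is copied exactly once, when it is pushed into `out`.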

score 0 · Accepted Answer

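Actually, trie + DFT (depth-first traversal) is already a good enough solution for your case. Its time complexity is O(M + B^M), where M is the maximum word length and B is the constant number of possible letters (usually B = 26). Although that is exponential, in practice it can be much faster than you might expect, because the trie is very sparse and M is small.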

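A simpler (though not necessarily better) solution is to sort all the words into an array. You can then binary-search the array for the first and the last word with the target prefix, just as you would look something up in an English dictionary. Sorting takes O(N log N) and each search takes O(M log N), where N is the number of words. That is polynomial.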
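A minimal sketch of the sorted-array idea, assuming the words live in a `std::vector<std::string>` that has already been sorted with the default string ordering. The function name `prefixRange` and the alias `WordIt` are made up for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using WordIt = std::vector<std::string>::const_iterator;

// Returns the half-open range [first, last) of words that start with `prefix`.
// Precondition: `sortedWords` is sorted with the default std::string ordering.
std::pair<WordIt, WordIt> prefixRange(const std::vector<std::string>& sortedWords,
                                      const std::string& prefix) {
    // First word that is not less than the prefix.
    WordIt first = std::lower_bound(sortedWords.begin(), sortedWords.end(), prefix);
    // First word at or after `first` whose leading characters compare greater
    // than the prefix; every word before it still starts with the prefix.
    WordIt last = std::upper_bound(
        first, sortedWords.end(), prefix,
        [](const std::string& p, const std::string& w) {
            return w.compare(0, p.size(), p) > 0;
        });
    return {first, last};
}
```

The matches are simply the contiguous slice between the two iterators, so nothing has to be copied unless the caller wants its own vector. The trade-off is that inserting a new word into the sorted array costs O(N).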

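If you really have a need for speed, you can almost always pay for it with memory. In this case, while building the trie you can link each word, through a pointer, to every one of its prefix nodes. That reduces the time complexity to O(M + N), which is very fast. On the other hand, the space complexity becomes O(NM). Say you have one million words with 5 letters on average: that is about five million pointers, or roughly 20 MB with 4-byte pointers and 40 MB with 8-byte pointers.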
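One way this linking could look, as a rough sketch rather than a definitive implementation: each node keeps the ids of all words that pass through it, so a query walks the M prefix characters and then reads the stored list. The class `PrefixIndex`, its members, and the use of `std::map` for children are all assumptions made for this example, and indices into a word list stand in for the pointers mentioned above.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

class PrefixIndex {
public:
    // Registers the word at every node on its path: O(M) extra links per word.
    // Inserting the same word twice would store it twice.
    void insert(const std::string& word) {
        const std::size_t id = words_.size();
        words_.push_back(word);
        Node* cur = &root_;
        cur->wordIds.push_back(id);  // the empty prefix matches every word
        for (char c : word) {
            std::unique_ptr<Node>& next = cur->children[c];
            if (!next) next = std::make_unique<Node>();
            cur = next.get();
            cur->wordIds.push_back(id);
        }
    }

    // Walks the prefix, then copies out the words linked to the final node.
    std::vector<std::string> wordsStartingWith(const std::string& prefix) const {
        const Node* cur = &root_;
        for (char c : prefix) {
            auto it = cur->children.find(c);
            if (it == cur->children.end()) return {};
            cur = it->second.get();
        }
        std::vector<std::string> out;
        out.reserve(cur->wordIds.size());
        for (std::size_t id : cur->wordIds) out.push_back(words_[id]);
        return out;
    }

private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        std::vector<std::size_t> wordIds;  // ids of all words below this node
    };
    Node root_;
    std::vector<std::string> words_;
};
```

Results come back in insertion order, not alphabetical order. The memory cost is visible directly in `wordIds`: a word of length M is recorded M + 1 times, which is where the O(NM) space comes from.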
