algorithm - How can we optimise the creation of a trie if we know the input is in alphabetical order?

Question

I am implementing a prefix tree, with a standard insertion mechanism. If we know we will be given a list of words in alphabetical order, is there any way we can change the insertion to skip a few steps? I am coding in Java, although I'm not looking for code in any particular language. I have considered adding the Nodes for each word to a queue, then hopping backwards through it until we're at a prefix of the next word, but this may be circumventing the whole point of the prefix tree!

Any thoughts on something like this? I'm finding it hard to come up with an implementation that's of any use unless the input is many many very similar words ("aaaaaaaaaab", "aaaaaaaaaac", "aaaaaaaaaad", ...) or something. But even then doing a string comparison on the prefixes is probably a similar cost to just using the prefix tree normally.

score 1 · Accepted Answer

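There's no way to avoid looking at every character of the input strings you're using to build the trie. If there were, I could make your algorithm incorrect. In particular, suppose there's a word w and you don't look at one of its characters (say, the k-th one). When your algorithm runs and places the word somewhere in the trie, it has to do so without knowing all of its characters. So if I change the k-th character of the word to something else, your algorithm will put it in exactly the same place as before, which is wrong, because the path to that place spells the original word rather than the modified one.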

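Since the normal algorithm for building a trie already takes time proportional to the number of characters in the input, you won't be able to beat it asymptotically without resorting to some crazy tricks, such as parallelizing the construction code or packing characters into machine words and hitting them with your Hammer of Bit Hacking.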
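For reference, here is a minimal sketch of the "normal" insertion being discussed. The class name `SimpleTrie`, the map-based children, and the `charsExamined` counter are assumptions made purely for illustration; the counter only exists to make the one-lookup-per-character cost visible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie, for illustration only.
class SimpleTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    final Node root = new Node();
    long charsExamined = 0; // instrumentation, not something a real trie needs

    // Standard insertion: one child lookup (one pointer followed) per character.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            charsExamined++;
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Walks the same path as insert, but never creates nodes.
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node != null && node.isWord;
    }
}
```

Inserting "car", "cart" and "dog" with this sketch examines 3 + 4 + 3 = 10 characters, i.e. the total length of the input.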

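However, you can potentially get a constant-factor speedup. Following lots of pointers in a linked structure can be slow because of cache performance, so you might be able to speed the algorithm up by minimizing the number of pointers you have to follow. One thing you could do is keep track of the position where the last string you inserted ended, along with a list of nodes (preferably a dynamic array) tracing the path from there back up to the root. To insert a new string, you would then do the following: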

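1. Find the longest prefix of the new string that matches the last string you inserted.
2. Jump to the pointer in the array marking where that prefix takes you.
3. Follow the rest of the path as normal, adding every node you visit to the array and overwriting the previous pointers.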

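This way, if you insert a lot of words that share a reasonably long common prefix, you avoid a bunch of pointer-chasing through the shared part of the structure, which might give you a performance boost. It's not asymptotically better than before (and in fact it uses more memory), but the savings from not following pointers could add up. I haven't tested this, but it seems like it might work.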
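The three steps above could be sketched roughly as follows. This is an untested-in-practice illustration rather than a tuned implementation: the class name `PathCachingTrie`, the `path` list, and the `pointerHops` counter are all hypothetical names introduced here, and a `HashMap` per node is just one possible child representation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "remember the last path" idea.
class PathCachingTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    final Node root = new Node();
    // path.get(d) is the node reached after the first d characters of `last`.
    private final List<Node> path = new ArrayList<>();
    private String last = "";
    long pointerHops = 0; // instrumentation only: child links actually followed

    PathCachingTrie() {
        path.add(root);
    }

    void insert(String word) {
        // Step 1: longest common prefix with the previously inserted word.
        // Every shared character is still read here, so the lower bound holds.
        int lcp = 0;
        int limit = Math.min(word.length(), last.length());
        while (lcp < limit && word.charAt(lcp) == last.charAt(lcp)) {
            lcp++;
        }

        // Step 2: jump straight to the cached node for that prefix
        // and discard the cached entries below it.
        Node node = path.get(lcp);
        path.subList(lcp + 1, path.size()).clear();

        // Step 3: walk (or create) the rest of the path, recording each node.
        for (int i = lcp; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
            path.add(node);
            pointerHops++;
        }
        node.isWord = true;
        last = word;
    }

    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node != null && node.isWord;
    }
}
```

Inserting "car", "card", "care", "cat", "dog" in that order follows 9 child links instead of the 17 a plain insertion would follow. The sketch stays correct for unsorted input too, since the cached path is always a valid path in the trie; sorted input simply makes long shared prefixes between consecutive words more likely. Whether the saved hops outweigh the extra prefix comparison and bookkeeping would have to be measured.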

Hope this helps!
