java - 实现一个简单的 Trie 以实现高效的 Levenshtein 距离计算 - Java

Question

更新 3

完毕。下面是最终通过了我所有测试的代码。同样，这是模仿穆里洛·瓦斯康塞洛对史蒂夫·汉诺夫算法的修改版本。感谢所有帮助！

/**
 * Computes the minimum Levenshtein Distance between the given word (represented as an array of Characters) and the
 * words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
 * distance using a Trie" and Murilo Vasconcelo's revised version in C++.
 * 
 * http://stevehanov.ca/blog/index.php?id=114
 * http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
 * 
 * @param ArrayList<Character> word - the characters of an input word as an array representation
 * @return int - the minimum Levenshtein Distance
 */
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int iWordLength = word.size();
    int[] currentRow = new int[iWordLength + 1];

    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }

    for (int i = 0; i < iWordLength; i++) {
        traverseTrie(theTrie.root, word.get(i), word, currentRow);
    }
    return theTrie.minLevDist;
}

/**
 * Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
 * 
 * @param TrieNode node - the current TrieNode
 * @param char letter - the current character of the current word we're working with
 * @param ArrayList<Character> word - an array representation of the current word
 * @param int[] previousRow - a row in the Levenshtein Distance matrix
 */
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);

        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minimumElement < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}

更新 2

最后，我设法使它适用于我的大多数测试用例。我的实现实际上是从Murilo 的 C++ 版本的Steve Hanov's algorithm的直接翻译。那么我应该如何重构这个算法和/或进行优化呢？下面是代码...

public int search(String word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
    return theTrie.minLevDist;
}
private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.charAt(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minElement(currentRow) < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            searchRec(node.children.get(c), c, word, currentRow);

        }
    }
}

感谢所有为这个问题做出贡献的人。我试图让 Levenshtein Automata 工作，但我无法实现。

因此，我正在寻找有关上述代码的重构和/或优化建议。如果有任何混淆，请告诉我。与往常一样，我可以根据需要提供其余的源代码。

更新 1

所以我实现了一个简单的 Trie 数据结构，并且我一直在尝试按照 Steve Hanov 的 python 教程来计算 Levenshtein 距离。实际上，我对计算给定单词和 Trie 中的单词之间的最小Levenshtein 距离感兴趣，因此我一直在关注Murilo Vasconcelos 的 Steve Hanov 算法版本。它工作得不是很好，但这是我的 Trie 课程：

public class Trie {

    public TrieNode root;
    public int minLevDist;

    public Trie() {
        this.root = new TrieNode(' ');
    }

    public void insert(String word) {

        int length = word.length();
        TrieNode current = this.root;

        if (length == 0) {
            current.isWord = true;
        }
        for (int index = 0; index < length; index++) {

            char letter = word.charAt(index);
            TrieNode child = current.getChild(letter);

            if (child != null) {
                current = child;
            } else {
                current.children.put(letter, new TrieNode(letter));
                current = current.getChild(letter);
            }
            if (index == length - 1) {
                current.isWord = true;
            }
        }
    }
}

...和 TrieNode 类：

public class TrieNode {

    public final int ALPHABET = 26;

    public char letter;
    public boolean isWord;
    public Map<Character, TrieNode> children;

    public TrieNode(char letter) {
        this.isWord = false;
        this.letter = letter;
        children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    public TrieNode getChild(char letter) {

        if (children != null) {
            if (children.containsKey(letter)) {
                return children.get(letter); 
            }
        }
        return null;
    }
}

现在，我尝试按照Murilo Vasconcelos的方式实现搜索，但是有些东西出了问题，我需要一些帮助来调试它。请就如何重构提出建议和/或指出错误在哪里。我想重构的第一件事是“minCost”全局变量，但这是最小的事情。无论如何，这是代码......

public void search(String word) {

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int replace, insertCost, deleteCost;

    for (int i = 1; i < size; i++) {

        char c = word.charAt(i - 1);

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);

        currentRow[i] = minimum(insertCost, deleteCost, replace);
    }

    if (currentRow[size - 1] < minCost && !node.isWord) {
        minCost = currentRow[size - 1];
    }
    Integer minElement = minElement(currentRow);
    if (minElement < minCost) {

        for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
            searchRec(node, entry.getKey(), word, currentRow);
        }
    }
}

我为缺乏评论表示歉意。那么我做错了什么？

初始帖子

我一直在阅读一篇文章Fast and Easy Levenshtein distance using a Trie，希望找到一种有效的方法来计算两个字符串之间的Levenshtein 距离。我的主要目标是，给定大量单词，能够找到输入单词和这组单词之间的最小 Levenshtein 距离。

在我的简单实现中，我为每个输入词计算输入词和词集之间的 Levenshtein 距离，并返回最小值。它有效，但效率不高......

我一直在寻找 Java 中 Trie 的实现，并且遇到了两个看似不错的资源：

Koders.com 版本
code.google.com 版本（编辑：这似乎已移至github.com/rkapsi）

但是，对于我正在尝试做的事情，这些实现似乎太复杂了。当我一直在阅读它们以了解它们如何工作以及 Trie 数据结构通常如何工作时，我只会变得更加困惑。

那么如何在 Java 中实现一个简单的 Trie 数据结构呢？我的直觉告诉我，每个 TrieNode 都应该存储它所代表的字符串以及对字母表字母的引用，不一定是所有字母。我的直觉正确吗？

一旦实现，下一个任务就是计算 Levenshtein 距离。我通读了上面文章中的 Python 代码示例，但我不会说 Python，一旦我点击递归搜索，我的 Java 实现就会耗尽堆内存。那么如何使用 Trie 数据结构计算 Levenshtein 距离？我有一个简单的实现，仿照这个源代码，但它不使用 Trie ......它效率低下。

除了您的评论和建议之外，看到一些代码真的很高兴。毕竟，这对我来说是一个学习过程……我从未实施过 Trie……所以我可以从这次经历中学到很多东西。

谢谢。

ps 如果需要，我可以提供任何源代码。另外，我已经按照Nick Johnson 的博客中的建议通读并尝试使用 BK-Tree ，但它的效率不如我想象的那么高……或者我的实现可能是错误的。

score 11 · Accepted Answer

据我所知，您不需要提高 Levenshtein Distance 的效率，您需要将字符串存储在一个结构中，这样您就不需要多次运行距离计算，即通过修剪搜索空间。

由于 Levenshtein 距离是一个度量，因此您可以使用任何利用三角不等式的度量空间索引 - 您提到了 BK-Trees，但还有其他的，例如。有利点树、固定查询树、平分线树、空间逼近树。以下是他们的描述：

Burkhard-Keller 树

节点按如下方式插入树中：对于根节点，从空间中选择一个任意元素；添加唯一的边缘标记的子元素，使得每个边缘的值是从枢轴到该元素的距离；递归应用，当边缘已经存在时选择子节点作为枢轴。

固定查询树

与 BKT 一样，除了：元素存储在叶子中；每个叶子都有多个元素；对于树的每个级别，都使用相同的枢轴。

平分线树

每个节点包含两个具有覆盖半径的枢轴元素（中心元素与其任何子树元素之间的最大距离）；将最接近第一个枢轴的元素和最接近第二个枢轴的元素过滤成两个集合，并从这些集合中递归地构建两个子树。

空间逼近树

最初所有元素都在一个包中；选择任意元素作为枢轴；在枢轴范围内建立最近邻的集合；将每个剩余元素放入刚刚构建的集合中离它最近的元素的包中；从这个集合的每个元素递归地形成一个子树。

有利点树

从集合中任意选择一个枢轴；计算此枢轴与剩余集合的每个元素之间的中值距离；将集合中的元素过滤成左右递归子树，使得距离小于或等于中位数的元素形成左侧，而距离较大的形成右侧。

score 9 · Accepted Answer

我已经在 C++ 中实现了“使用 Trie 的快速和简单的 Levenshtein 距离”文章中描述的算法，它真的很快。如果您愿意（比 Python 更了解 C++），我可以将代码粘贴到某个地方。

编辑： 我把它贴在我的博客上。

score 3 · Accepted Answer

这是Java 中的 Levenshtein Automata的示例（编辑：移至github）。这些可能也会有所帮助：

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/ http://svn.apache.org/repos/asf/ lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/

编辑：以上链接似乎已移至 github：

https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/util/automaton https://github.com/apache/lucene-solr/tree /master/lucene/core/src/test/org/apache/lucene/util/automaton

看起来实验性的 Lucene 代码是基于dk.brics.automaton包的。

用法似乎类似于以下内容：

LevenshteinAutomata builder = new LevenshteinAutomata(s);
Automaton automata = builder.toAutomaton(n);
boolean result1 = BasicOperations.run(automata, "foo");
boolean result2 = BasicOperations.run(automata, "bar");

score 2 · Accepted Answer

在许多方面，Steve Hanov 的算法（在问题中链接的第一篇文章中介绍，使用 Trie 的快速和简单的 Levenshtein 距离），Murilo 和你（OP）制作的算法的端口，以及很可能涉及到的每个相关算法Trie 或类似结构，功能很像 Levenshtein Automaton（这里已经多次提到）：

Given:
       dict is a dictionary represented as a DFA (ex. trie or dawg)
       dictState is a state in dict
       dictStartState is the start state in dict
       dictAcceptState is a dictState arrived at after following the transitions defined by a word in dict
       editDistance is an edit distance
       laWord is a word
       la is a Levenshtein Automaton defined for laWord and editDistance
       laState is a state in la
       laStartState is the start state in la
       laAcceptState is a laState arrived at after following the transitions defined by a word that is within editDistance of laWord
       charSequence is a sequence of chars
       traversalDataStack is a stack of (dictState, laState, charSequence) tuples

Define dictState as dictStartState
Define laState as laStartState
Push (dictState, laState, "") on to traversalDataStack
While traversalDataStack is not empty
    Define currentTraversalDataTuple as the the product of a pop of traversalDataStack
    Define currentDictState as the dictState in currentTraversalDataTuple
    Define currentLAState as the laState in currentTraversalDataTuple
    Define currentCharSequence as the charSequence in currentTraversalDataTuple
    For each char in alphabet
        Check if currentDictState has outgoing transition labeled by char
        Check if currentLAState has outgoing transition labeled by char
        If both currentDictState and currentLAState have outgoing transitions labeled by char
            Define newDictState as the state arrived at after following the outgoing transition of dictState labeled by char
            Define newLAState as the state arrived at after following the outgoing transition of laState labeled by char
            Define newCharSequence as concatenation of currentCharSequence and char
            Push (newDictState, newLAState, newCharSequence) on to currentTraversalDataTuple
            If newDictState is a dictAcceptState, and if newLAState is a laAcceptState
                Add newCharSequence to resultSet
            endIf
        endIf
    endFor
endWhile

Steve Hanov 的算法及其前面提到的导数显然使用 Levenshtein 距离计算矩阵代替了正式的 Levenshtein Automaton。相当快，但是正式的 Levenshtein Automaton 可以生成其参数状态（描述自动机具体状态的抽象状态）并用于遍历，从而绕过任何与编辑距离相关的运行时计算。因此，它应该比上述算法运行得更快。

如果您（或其他任何人）对正式的 Levenshtein Automaton 解决方案感兴趣，请查看LevenshteinAutomaton。它实现了上述基于参数状态的算法，以及纯基于具体状态遍历的算法（如上所述）和基于动态规划的算法（用于编辑距离和邻居确定）。它是由你真正维护的:)。

score 1 · Accepted Answer

函数 walk 接受一个 testitem（例如一个可索引的字符串，或一个字符数组）和一个 trie。trie 可以是具有两个插槽的对象。一个指定 trie 的节点，另一个指定该节点的子节点。孩子们也在尝试。在python中它会是这样的：

class Trie(object):
    def __init__(self, node=None, children=[]):
        self.node = node
        self.children = children

或者在 Lisp 中...

(defstruct trie (node nil) (children nil))

现在 trie 看起来像这样：

(trie #node None
      #children ((trie #node f
                       #children ((trie #node o
                                        #children ((trie #node o
                                                         #children None)))
                                  (trie #node u
                                        #children ((trie #node n
                                                         #children None)))))))

现在内部函数（您也可以单独编写）获取 testitem，树的根节点的子节点（其节点值为 None 或其他），以及设置为 0 的初始距离。

然后我们只是递归地遍历树的两个分支，从左到右。

score 1 · Accepted Answer

我将把它留在这里，以防有人正在寻找另一种解决这个问题的方法：

http://code.google.com/p/oracleofwoodyallen/wiki/ApproximateStringMatching

score 1 · Accepted Answer

我的直觉告诉我，每个 TrieNode 都应该存储它所代表的字符串以及对字母表字母的引用，不一定是所有字母。我的直觉正确吗？

不，trie 不代表字符串，它代表一组字符串（及其所有前缀）。一个 trie 节点将一个输入字符映射到另一个 trie 节点。所以它应该包含一个字符数组和一个对应的 TrieNode 引用数组。（也许不是那么精确的表示，取决于您特定使用它的效率。）

score 1 · Accepted Answer

如我所见，您想遍历 trie 的所有分支。使用递归函数并不难。我在我的 k 最近邻算法中也使用了 trie，使用了相同的函数。我不知道Java，但是这里有一些伪代码：

function walk (testitem trie)
   make an empty array results
   function compare (testitem children distance)
     if testitem = None
        place the distance and children into results
     else compare(testitem from second position, 
                  the sub-children of the first child in children,
                  if the first item of testitem is equal to that 
                  of the node of the first child of children 
                  add one to the distance (! non-destructive)
                  else just the distance)
        when there are any children left
             compare (testitem, the children without the first item,
                      distance)
    compare(testitem, children of root-node in trie, distance set to 0)
    return the results

希望能帮助到你。

score 1 · Accepted Answer

我正在查看您的最新更新 3，该算法似乎对我不起作用。

让我们看看您有以下测试用例：

    Trie dict = new Trie();
    dict.insert("arb");
    dict.insert("area");

    ArrayList<Character> word = new ArrayList<Character>();
    word.add('a');
    word.add('r');
    word.add('c');

在这种情况下，与 dict 之间的最小编辑距离"arc"应为 1，即与之间的编辑距离"arc"，"arb"但您的算法将返回 2。

我浏览了以下代码：

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

至少对于第一个循环，字母是单词中的一个字符，但是相反，您应该比较 trie 中的节点，所以会有一行与单词中的第一个字符重复，对吗？每个 DP 矩阵都有第一行作为副本。我执行了与您在解决方案中添加的完全相同的代码。

score 0 · Accepted Answer

好吧，这就是我很久以前的做法。我将字典存储为 trie，它只是一个受限于树形式的有限状态机。您可以通过不进行限制来增强它。例如，公共后缀可以简单地是共享子树。您甚至可以有循环，以捕获诸如“国家”、“国家”、“国有化”、“国有化”等内容。

使尝试尽可能简单。不要往里面塞绳子。

请记住，您这样做不是为了找到两个给定字符串之间的距离。您可以使用它来查找字典中最接近给定字符串的字符串。所花费的时间取决于您可以忍受多少 levenshtein 距离。对于零距离，它只是 O(n)，其中 n 是字长。对于任意距离，它是 O(N)，其中 N 是字典中的单词数。

score 0 · Accepted Answer

如果我错了，请纠正我，但我相信您的 update3 有一个额外的循环，这是不必要的，并且会使程序变慢：

for (int i = 0; i < iWordLength; i++) {
    traverseTrie(theTrie.root, word.get(i), word, currentRow);
}

你应该只调用一次 traverseTrie 因为在 traverseTrie 中你已经在循环整个单词。代码应仅如下所示：

traverseTrie(theTrie.root, ' ', word, currentRow);

java - 实现一个简单的 Trie 以实现高效的 Levenshtein 距离计算 - Java

更新 3

更新 2

更新 1

初始帖子

11 回答 11

Related

Reference