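Update 3
Done. Below is the final code, which passes all of my tests. Again, it is modeled after Murilo Vasconcelos's modified version of Steve Hanov's algorithm. Thanks to everyone who helped!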
/**
 * Computes the minimum Levenshtein distance between the given word (represented as a list of Characters) and the
 * words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
 * distance using a Trie" and Murilo Vasconcelos's revised version in C++.
 *
 * http://stevehanov.ca/blog/index.php?id=114
 * http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
 *
 * @param word the characters of the input word
 * @return the minimum Levenshtein distance
 */
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {
    theTrie.minLevDist = Integer.MAX_VALUE;
    int iWordLength = word.size();
    // First row of the distance matrix: distance from the empty prefix.
    int[] currentRow = new int[iWordLength + 1];
    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < iWordLength; i++) {
        traverseTrie(theTrie.root, word.get(i), word, currentRow);
    }
    return theTrie.minLevDist;
}

/**
 * Recursive helper function. Traverses theTrie in search of the minimum Levenshtein distance.
 *
 * @param node the current TrieNode
 * @param letter the current character of the current word we're working with
 * @param word a list representation of the current word
 * @param previousRow a row in the Levenshtein distance matrix
 */
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;
    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;
    for (int i = 1; i < size; i++) {
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }
    // The last cell is the distance to the word ending at this node.
    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }
    // Only descend if some cell in this row can still beat the best distance.
    if (minimumElement < theTrie.minLevDist) {
        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}
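Update 2
I finally managed to get this working for most of my test cases. My implementation is practically a direct translation of Murilo's C++ version of Steve Hanov's algorithm. So how should I refactor and/or optimize this algorithm? Here is the code...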
public int search(String word) {
    theTrie.minLevDist = Integer.MAX_VALUE;
    int size = word.length();
    int[] currentRow = new int[size + 1];
    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
    return theTrie.minLevDist;
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;
    int insertCost, deleteCost, replaceCost;
    for (int i = 1; i < size; i++) {
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        if (word.charAt(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
    }
    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }
    if (minElement(currentRow) < theTrie.minLevDist) {
        for (Character c : node.children.keySet()) {
            searchRec(node.children.get(c), c, word, currentRow);
        }
    }
}
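For reference, the helpers minimum and minElement called in the code above are not shown in this post. A minimal sketch of what they are assumed to do follows; the enclosing class name RowHelpers is hypothetical, and the real helpers may differ:

```java
// Hypothetical helper class: the post calls minimum(...) and minElement(...)
// without showing them, so these bodies are assumptions.
final class RowHelpers {
    private RowHelpers() {}

    /** Returns the smallest of three edit costs. */
    static int minimum(int a, int b, int c) {
        return Math.min(a, Math.min(b, c));
    }

    /** Returns the smallest value in a non-empty row of the distance matrix. */
    static int minElement(int[] row) {
        int min = row[0];
        for (int i = 1; i < row.length; i++) {
            if (row[i] < min) {
                min = row[i];
            }
        }
        return min;
    }
}
```

Tracking the row minimum inside the cost loop would avoid the second pass over the row that minElement makes.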
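Thanks to everyone who contributed to this question. I tried to get Levenshtein Automata working, but I couldn't make it happen.
So I'm looking for suggestions on refactoring and/or optimizing the code above. Please let me know if anything is confusing. As always, I can provide the rest of the source code if needed.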
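To make the refactoring question concrete, here is a rough sketch of one possible direction, not a definitive implementation: the recursion returns the best distance found so far instead of writing to the shared minLevDist field. The names TrieDistanceSketch, Node, minDistance and search are hypothetical. Note also that the top-level loop visits every child of the root, as in Steve Hanov's article, which is an assumption that differs from the loop over the input letters used above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a self-contained trie whose search returns the minimum
// distance rather than storing it in a mutable field.
final class TrieDistanceSketch {
    /** Minimal trie node, loosely mirroring the TrieNode class in this post. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    final Node root = new Node();

    void insert(String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            current = current.children.computeIfAbsent(word.charAt(i), k -> new Node());
        }
        current.isWord = true;
    }

    /** Smallest distance from word to any stored word, or Integer.MAX_VALUE if the trie is empty. */
    int minDistance(String word) {
        int[] firstRow = new int[word.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        // If the empty string is stored, its distance is the length of the input.
        int best = root.isWord ? word.length() : Integer.MAX_VALUE;
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            best = search(e.getValue(), e.getKey(), word, firstRow, best);
        }
        return best;
    }

    private static int search(Node node, char letter, String word, int[] previousRow, int best) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];
        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, currentRow[i]);
        }
        if (node.isWord) {
            best = Math.min(best, currentRow[size - 1]);
        }
        // Row minima never decrease while descending, so prune when no cell can improve on best.
        if (rowMin < best) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                best = search(e.getValue(), e.getKey(), word, currentRow, best);
            }
        }
        return best;
    }
}
```

Passing best through the recursion keeps the pruning behaviour of the shared field while making the search free of side effects, so the same trie could be queried from several threads once it is built.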
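Update 1
So I've implemented a simple Trie data structure, and I've been trying to follow Steve Hanov's Python tutorial to compute the Levenshtein distance. Actually, I'm interested in computing the minimum Levenshtein distance between a given word and the words in the Trie, so I've been following Murilo Vasconcelos's version of Steve Hanov's algorithm. It's not working very well, but here is my Trie class: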
public class Trie {
    public TrieNode root;
    public int minLevDist;

    public Trie() {
        this.root = new TrieNode(' ');
    }

    public void insert(String word) {
        int length = word.length();
        TrieNode current = this.root;
        if (length == 0) {
            current.isWord = true;
        }
        for (int index = 0; index < length; index++) {
            char letter = word.charAt(index);
            TrieNode child = current.getChild(letter);
            if (child != null) {
                current = child;
            } else {
                current.children.put(letter, new TrieNode(letter));
                current = current.getChild(letter);
            }
            if (index == length - 1) {
                current.isWord = true;
            }
        }
    }
}
...and the TrieNode class:
public class TrieNode {
    public final int ALPHABET = 26;
    public char letter;
    public boolean isWord;
    public Map<Character, TrieNode> children;

    public TrieNode(char letter) {
        this.isWord = false;
        this.letter = letter;
        children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    public TrieNode getChild(char letter) {
        if (children != null) {
            if (children.containsKey(letter)) {
                return children.get(letter);
            }
        }
        return null;
    }
}
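Now, I've tried to implement the search the way Murilo Vasconcelos does, but something is off and I need some help debugging it. Please give suggestions on how to refactor this and/or point out where the bugs are. The very first thing I'd like to refactor is the "minCost" global variable, but that's the smallest of things. Anyway, here is the code...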
public void search(String word) {
    int size = word.length();
    int[] currentRow = new int[size + 1];
    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {
    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;
    int replace, insertCost, deleteCost;
    for (int i = 1; i < size; i++) {
        char c = word.charAt(i - 1);
        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);
        currentRow[i] = minimum(insertCost, deleteCost, replace);
    }
    if (currentRow[size - 1] < minCost && !node.isWord) {
        minCost = currentRow[size - 1];
    }
    Integer minElement = minElement(currentRow);
    if (minElement < minCost) {
        for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
            searchRec(node, entry.getKey(), word, currentRow);
        }
    }
}
I apologize for the lack of comments. So what am I doing wrong?
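Initial post
I've been reading the article Fast and Easy Levenshtein distance using a Trie, in hopes of figuring out an efficient way to compute the Levenshtein distance between two Strings. My main goal is, given a large set of words, to be able to find the minimum Levenshtein distance between an input word and this set of words.
In my trivial implementation, I compute the Levenshtein distance between the input word and every word in the set, and return the minimum. It works, but it is not efficient...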
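To be concrete about what the trivial approach looks like, here is a minimal sketch of it. The class and method names (NaiveMinDistance, levenshtein, minDistance) are made up for illustration and are not my actual code; the distance is the standard two-row dynamic-programming formulation.

```java
import java.util.List;

// Hypothetical baseline: compare the input against every word independently.
final class NaiveMinDistance {
    private NaiveMinDistance() {}

    /** Standard Levenshtein distance, keeping only two rows of the matrix. */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(substitution, Math.min(previous[j] + 1, current[j - 1] + 1));
            }
            previous = current;
        }
        return previous[b.length()];
    }

    /** Minimum distance from input to any word, or Integer.MAX_VALUE for an empty list. */
    static int minDistance(String input, List<String> words) {
        int best = Integer.MAX_VALUE;
        for (String word : words) {
            best = Math.min(best, levenshtein(input, word));
        }
        return best;
    }
}
```

Every word is processed from scratch here, so words that share a prefix repeat the same matrix rows; that repeated work is what a Trie-based search is supposed to avoid.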
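I've been looking for implementations of a Trie in Java, and I've come across two seemingly good sources:
- Koders.com version
- code.google.com version (Edit: this appears to have moved to github.com/rkapsi)
However, these implementations seem too complicated for what I'm trying to do. As I've been reading through them to understand how they work and how Trie data structures work in general, I've only become more confused.
So how would I implement a simple Trie data structure in Java? My intuition tells me that each TrieNode should store the String it represents as well as references to letters of the alphabet, though not necessarily all of them. Is my intuition correct?
Once that is implemented, the next task is to compute the Levenshtein distance. I read through the Python code example in the article above, but I don't speak Python, and my Java implementation runs out of heap memory once it hits the recursive search. So how would I compute the Levenshtein distance using the Trie data structure? I have a trivial implementation, modeled after this source code, but it doesn't use a Trie... it is inefficient.
It would be really nice to see some code in addition to your comments and suggestions. After all, this is a learning process for me... I've never implemented a Trie... so I have plenty to learn from this experience.
Thanks.
P.S. I can provide any source code if needed. Also, I have already read through and tried using a BK-Tree as suggested on Nick Johnson's blog, but it's not as efficient as I thought it would be... or maybe my implementation is wrong.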