java - Java：如何在字符串列表中找到最可能的字符串？

Question

我有一个 Java 字符串列表，其中包含一个拼写不同的人的名字（并非完全不同）。例如，John 可能拼写为 Jon、Jawn、Jaun 等。我应该如何检索此列表中最合适的字符串。如果有人能建议一种在这种情况下如何使用 Soundex 的方法，那将有很大帮助。

score 4 · Accepted Answer

有一个用于匹配近似字符串的 jar 文件。

通过链接并下载 frej.jar

http://sourceforge.net/projects/frej/files/

这个jar文件中有一种方法

Fuzzy.equals("jon","john");

它将在这种类型的近似字符串中返回 true。

score 4 · Accepted Answer

您已经使用了近似字符串匹配算法，有几种策略可以实现这一点。Blur 是基于 Levenshtein 词距离的近似字符串匹配的基于 Trie 的 Java 实现。

还有另一种策略来实现其称为boyer-moore 近似字符串匹配算法。

使用该算法和 Levenshtein 词距离解决这些问题的常用方法是将输入与可能的输出进行比较，然后选择与所需输出的距离最小的那个。

score 1 · Accepted Answer

本文提供了关于近似字符串匹配的基于 Trie 的 Java 实现的详细解释和完整代码： Fast and Easy Levenshtein distance using a Trie。

搜索函数返回小于给定的所有单词的列表

与目标词的最大距离

def 搜索（字，最大成本）：

# build first row
currentRow = range( len(word) + 1 )

results = []

# recursively search each branch of the trie
for letter in trie.children:
    searchRecursive( trie.children[letter], letter, word, currentRow, 
        results, maxCost )

return results

这个递归助手由上面的搜索功能使用。它假设

previousRow 已经填好了。

def searchRecursive（节点，字母，单词，previousRow，结果，maxCost）：

columns = len( word ) + 1
currentRow = [ previousRow[0] + 1 ]

# Build one row for the letter, with a column for each letter in the target
# word, plus one for the empty string at column 0
for column in xrange( 1, columns ):

    insertCost = currentRow[column - 1] + 1
    deleteCost = previousRow[column] + 1

    if word[column - 1] != letter:
        replaceCost = previousRow[ column - 1 ] + 1
    else:                
        replaceCost = previousRow[ column - 1 ]

    currentRow.append( min( insertCost, deleteCost, replaceCost ) )

# if the last entry in the row indicates the optimal cost is less than the
# maximum cost, and there is a word in this trie node, then add it.
if currentRow[-1] <= maxCost and node.word != None:
    results.append( (node.word, currentRow[-1] ) )

# if any entries in the row are less than the maximum cost, then 
# recursively search each branch of the trie
if min( currentRow ) <= maxCost:
    for letter in node.children:
        searchRecursive( node.children[letter], letter, word, currentRow, 
            results, maxCost )

score 1 · Accepted Answer

如果您在索引文本时使用语音过滤器工厂， Solr可以做到这一点。

搜索是solr的专长。并搜索发音相似的单词。但是，如果您只想要这个，而不想要 solr 提供的其他功能，那么您可以使用此处提供的源代码。

score 1 · Accepted Answer

有很多理论和方法来估计2 个字符串的匹配

给出一个直截了当的真/假结果似乎很奇怪，因为“jon”真的不等于“john”，它很接近但不匹配

实现了相当多的估计方法的一项伟大的学术工作被称为“SecondString.jar” -站点链接

大多数实现的方法都会给匹配打一些分数，这个分数取决于使用的方法

示例：让我们将“编辑距离”定义为 str1 中到达 str2 所需的字符更改数，在这种情况下，“jon”->“john”自然需要 1 个字符，对于这种方法，分数越低越好

java - Java：如何在字符串列表中找到最可能的字符串？

5 回答 5

搜索函数返回小于给定的所有单词的列表

与目标词的最大距离

这个递归助手由上面的搜索功能使用。它假设

previousRow 已经填好了。

Related

Reference