text - Finding the substring of a text that is most similar to a given keyword

Question

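Suppose I have text = "I love apples, kiwis, oranges and bananas", searchString = "kiwis and bananas", and a similarity algorithm, say the Jaccard index. How can I efficiently find the substring of text that is most similar to searchString?

Basically, I am trying to find the parts of a text that match a list of keywords I have (the text contains a lot of errors, misspellings, extra symbols and spaces).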

score 5 · Accepted Answer

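The Jaccard index is a "lucky" similarity algorithm, because you can update its value for each new symbol without recomputing everything that came before. So you can treat text as a stream of differences in the resulting index value. After that, the problem can be reduced to the maximum subarray problem: https://en.wikipedia.org/wiki/Maximum_subarray_problem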
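One way to make the incremental update concrete is sketched below. It is a word-level illustration and not necessarily the exact reduction this answer has in mind: for every start position it extends the window one word at a time and updates the intersection count in O(1) instead of recomputing the sets. The names tokenize and best_jaccard_window and the optional max_len cap are hypothetical.

```python
import re


def tokenize(s):
    """Lowercase the string and split it into word tokens (assumed tokenization)."""
    return re.findall(r"\w+", s.lower())


def best_jaccard_window(text, search, max_len=None):
    """Return (score, words) for the word window with the highest Jaccard index."""
    words = tokenize(text)
    query = set(tokenize(search))
    best = (0.0, 0, 0)  # (score, start, end) with end exclusive
    for start in range(len(words)):
        seen = set()   # distinct words in the current window
        inter = 0      # distinct window words that also occur in the query
        stop = len(words) if max_len is None else min(len(words), start + max_len)
        for end in range(start, stop):
            w = words[end]
            if w not in seen:
                # Incremental update: only the newly added word is examined.
                seen.add(w)
                if w in query:
                    inter += 1
            union = len(query) + len(seen) - inter
            score = inter / union if union else 0.0
            if score > best[0]:
                best = (score, start, end + 1)
    score, lo, hi = best
    return score, words[lo:hi]


score, window = best_jaccard_window(
    "I love apples, kiwis, oranges and bananas", "kiwis and bananas"
)
print(score, window)
```

This still visits O(n^2) windows in the worst case; passing max_len bounds the window length when the search string is short compared with the text.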

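As for your second paragraph: if you are doing something like NLP research, I would suggest cleaning your data before any further processing (remove those extra symbols and spaces wherever possible). Correcting the misspellings is known as "spelling correction", and there are plenty of different algorithms and libraries for it. Choosing a suitable one requires additional information about your domain.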
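A minimal cleanup pass might look like the following sketch. It assumes that punctuation carries no meaning in your domain and that case can be ignored; the function name clean is hypothetical.

```python
import re


def clean(text):
    """Lowercase, replace non-word symbols with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # drop extra symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


print(clean("  I   love apples,,  kiwis !! "))
```

Spelling correction itself is a separate step that would run on the cleaned text.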

score 2 · Accepted Answer

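Have a look at the shingling technique and try to find similarities with it. You can follow this link: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html

For example, use 9-shingles and compare each subset with your particular keyword.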
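As a rough sketch of that idea, the code below builds character k-shingles and slides a window of the keyword's length over the text, scoring each window by the Jaccard index of the two shingle sets. It defaults to k=3 only because the example strings are short; the answer suggests 9 for real text. The function names are hypothetical.

```python
def shingles(s, k):
    """Set of all character substrings of length k (empty if s is shorter than k)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_window(text, keyword, k=3):
    """Return (score, substring) for the keyword-sized window most similar to keyword."""
    target = shingles(keyword, k)
    width = len(keyword)
    best_score, best_sub = 0.0, ""
    for start in range(max(1, len(text) - width + 1)):
        sub = text[start:start + width]
        score = jaccard(shingles(sub, k), target)
        if score > best_score:
            best_score, best_sub = score, sub
    return best_score, best_sub


print(best_window("I love apples, kiwis, oranges and bananas", "oranges"))
print(best_window("kiwis and bannanas", "bananas"))
```

Because shingles overlap, a misspelled word such as "bannanas" still shares several shingles with "bananas", so it scores above zero without matching exactly.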

score 1 · Accepted Answer

This demo searches all wiki titles; try the "show search terms" option to see the Levenshtein distance and the error correction algorithm in action.

score 0 · Accepted Answer

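Each query term is checked against a dictionary. If a term is not found in the dictionary, the dictionary words that are most similar to that query term are shown as spelling suggestions.

Similarity / edit distance: the Damerau-Levenshtein distance is commonly used as the similarity measure between two words: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance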
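The dictionary check and the distance measure can be sketched together as follows. Note the assumptions: this implements the restricted variant of Damerau-Levenshtein (optimal string alignment), which can differ from the unrestricted distance on some inputs, and the names osa_distance and suggest as well as the max_distance threshold are hypothetical, not taken from the demo.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(term, dictionary, max_distance=2):
    """Return dictionary words within max_distance of an unknown term, closest first."""
    if term in dictionary:
        return []  # known word, nothing to suggest
    scored = sorted((osa_distance(term, w), w) for w in dictionary)
    return [w for dist, w in scored if dist <= max_distance]


words = {"apples", "kiwis", "oranges", "bananas"}
print(suggest("kiwsi", words))
```

Scanning the whole dictionary for every unknown term is linear in the dictionary size, so this only illustrates the idea; a large dictionary would need an index of some kind.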

A few other references
