python - Generating a list of distinct (distant, by edit distance) words by filtering

Question

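I have a long (> 1000 items) list of words, from which I would like to remove words that are "too similar" to other words, until the remaining words are all "significantly different". For example, so that no two words are within an edit distance D of each other.

I do not need a unique solution, and it does not have to be exactly optimal, but it should be reasonably quick (in Python) and not discard too many entries.

How can I achieve this? Thanks.

Edit: to be clear, I can google for a Python routine that measures edit distance. The problem is how to do this efficiently, and perhaps in some way that finds a "natural" value of D. Maybe by constructing some kind of trie from all the words and then pruning?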

score 3 · Accepted Answer

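You can use a bk-tree, and before adding each item check that it is not within distance D of any other item (thanks to @DietrichEpp in the comments for this idea).

You can use this recipe for a bk-tree (though any similar recipe is easily modified). Just make two changes: change the line: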

def __init__(self, items, distance, usegc=False):

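to

def __init__(self, items, distance, threshold=0, usegc=False):

and change the line

        if el not in self.nodes: # do not add duplicates

to

        if (el not in self.nodes and
            (threshold is None or len(self.find(el, threshold)) == 0)):

This makes sure no duplicates (items within the threshold distance of an item already in the tree) are added. The code to remove duplicates from a list is then simple: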

from Levenshtein import distance
from bktree import BKtree
def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return list(tr.nodes.keys())  # a real list, also on Python 3

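Note that this relies on the python-Levenshtein package for its distance function, which is much faster than the one provided with the bk-tree recipe. python-Levenshtein has C-compiled components, but it is worth the installation.

Finally, I set up a performance test using an increasing number of words (sampled randomly from /usr/share/dict/words) and plotted the time each run took: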
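If the recipe is not at hand, the same idea can be sketched without it. The following is a minimal, self-contained sketch and not the recipe's code: `BKTree`, `has_within`, `filter_distant`, and the pure-Python `levenshtein` are hypothetical names introduced here for illustration. It assumes the same meaning of the threshold as above: a word is kept only if no word already kept is within `threshold` edits of it.

```python
def levenshtein(a, b):
    """Pure-Python edit distance (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


class BKTree:
    """Hypothetical minimal BK-tree; a node is (word, {edge_distance: child})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def has_within(self, word, threshold):
        """True if some stored word is at distance <= threshold from word."""
        if self.root is None:
            return False
        stack = [self.root]
        while stack:
            stored, children = stack.pop()
            d = self.distance(word, stored)
            if d <= threshold:
                return True
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - threshold, d + threshold] can contain a match.
            for edge, child in children.items():
                if d - threshold <= edge <= d + threshold:
                    stack.append(child)
        return False


def filter_distant(words, threshold, distance=levenshtein):
    """Greedily keep words farther than `threshold` from every kept word."""
    tree = BKTree(distance)
    kept = []
    for word in words:
        if not tree.has_within(word, threshold):
            tree.add(word)
            kept.append(word)
    return kept
```

Passing `Levenshtein.distance` as the `distance` argument should be considerably faster than the pure-Python fallback, which is included only to keep the sketch self-contained. The result depends on the input order, since earlier words win over later ones.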

import random
import time
from Levenshtein import distance
from bktree import BKtree

with open("/usr/share/dict/words") as inf:
    word_list = [line.rstrip("\n") for line in inf]  # strip only the newline

def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return list(tr.nodes.keys())  # a real list, also on Python 3

def time_remove_duplicates(n, threshold):
    """Test using n words"""
    nwords = random.sample(word_list, n)
    t = time.time()
    newlst = remove_duplicates(nwords, threshold)
    return len(newlst), time.time() - t

ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)

from matplotlib import pyplot as plt

plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")

[Figure: run time in seconds plotted against the number of strings]

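Without confirming it mathematically, I don't think it is quadratic, and I think it might actually be n log n, which would make sense if inserting into a bk-tree is a logarithmic-time operation. Most notably, it runs pretty quickly with fewer than 5000 strings, which is hopefully the OP's goal (and it is still reasonable with 15000 strings, which a traditional for-loop solution would not be).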
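One rough way to check the growth rate from the measurements, rather than by eye, is to fit a straight line to log(time) against log(n). This is only a sketch: `growth_exponent` is a hypothetical helper, not part of the benchmark above, and with so few data points the estimate is noisy. A slope near 2 would suggest quadratic growth, while n log n shows up as a slope only slightly above 1.

```python
import math


def growth_exponent(ns, timings):
    """Least-squares slope of log(time) against log(n).

    Assumes all values in `ns` and `timings` are positive.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

With the benchmark's own variables, the call would be `growth_exponent(ns, timings)`.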

score 2 · Accepted Answer

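Tries will not be helpful, nor will hash maps. They are simply not useful for spatial, high-dimensional problems like this one.

But the real problem here is the ill-specified requirement of "efficient". How fast is "efficient"?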

import Levenshtein

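def simple(corpus, distance):
    """Greedily pick words that are pairwise at least `distance` apart.

    `distance` must be at least 1, otherwise the loop never terminates.
    """
    words = []
    while corpus:
        # Keep the first remaining word, then drop every word too close to it
        # (including the word itself, which is at distance 0).
        center = corpus[0]
        words.append(center)
        corpus = [word for word in corpus
                  if Levenshtein.distance(center, word) >= distance]
    return words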

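I selected 10,000 words uniformly at random from the "American English" dictionary on my hard disk, looking for a set with a minimum distance of 5, which yielded about 2,000 entries.

real    0m2.558s
user    0m2.404s
sys     0m0.012s

So the question is, "How efficient is efficient enough?" Since you did not specify your requirements, it is hard for me to know whether this algorithm works for you.

The rabbit hole

If you want something faster, here is how I would do it.

Create a VP tree, BK tree, or other suitable spatial index. For each word in the corpus, insert that word into the tree if it has a suitable minimum distance from every word already in the index. Spatial indexes are designed specifically to support this kind of query.

At the end, you will have a tree containing nodes with the desired minimum distance.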
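Whichever index is used, the resulting property is easy to verify by brute force on a small result set. The sketch below is an illustration only: `min_pairwise_distance` is a hypothetical helper, it takes the distance function as a parameter, and it compares every pair, so it is quadratic and meant for testing rather than for the filtering itself.

```python
from itertools import combinations


def min_pairwise_distance(words, distance):
    """Smallest distance over all pairs, or None for fewer than two words."""
    best = None
    for a, b in combinations(words, 2):
        d = distance(a, b)
        if best is None or d < best:
            best = d
    return best
```

For the greedy function above, `min_pairwise_distance(simple(corpus, 5), Levenshtein.distance)` should come out as at least 5 whenever two or more words are kept.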

score 0 · Accepted Answer

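Your trie idea is definitely an interesting one. This page has a great setup for fast edit-distance calculations in a trie, and it would definitely be efficient if you needed to expand your word list to millions rather than a thousand, which is pretty small in the corpus-linguistics business.

Good luck, it sounds like a fun problem!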
