I have a collection of 40,000 strings and want to compare their pairwise similarity using fuzz.token_set_ratio(), but my brain just isn't wired to do this efficiently, even after reading up on vectorization.
Here is an example:
from fuzzywuzzy import fuzz

s = ["fuzzy was a strong bear",
     "fuzzy was a large bear",
     "fuzzy was the strongest bear you could ever imagine"]

similarities = []
l = len(s)
for i in range(l):
    similarities.append([])
    for j in range(l):
        similarities[i].append(fuzz.token_set_ratio(s[i], s[j]))
similarities
Now, this code obviously has at least two shortcomings. First, it uses inefficient for loops. Second, although the resulting similarities matrix is symmetric (that is not always true, but let's ignore it for now), so I only need to compute its upper or lower triangle, it computes every element. The latter is probably something I could code up myself, but I'm looking for the fastest way to compute similarities in Python.
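One way to exploit the symmetry mentioned above is to fill only the upper triangle and mirror it into the lower one, halving the number of scorer calls (though it still loops in Python). A minimal sketch, using the stdlib difflib.SequenceMatcher as a stand-in scorer so it runs without fuzzywuzzy installed; swap in fuzz.token_set_ratio for the real thing:

```python
from difflib import SequenceMatcher

def scorer(a, b):
    # Stand-in for fuzz.token_set_ratio: 0-100 similarity score.
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

def pairwise_upper(strings):
    """Score only the upper triangle, then mirror (assumes symmetry)."""
    n = len(strings)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = scorer(strings[i], strings[i])  # self-similarity, 100
        for j in range(i + 1, n):
            score = scorer(strings[i], strings[j])
            sim[i][j] = score
            sim[j][i] = score  # mirror into the lower triangle
    return sim

s = ["fuzzy was a strong bear",
     "fuzzy was a large bear",
     "fuzzy was the strongest bear you could ever imagine"]
m = pairwise_upper(s)
```

This computes n*(n+1)/2 scores instead of n*n, so for 40,000 strings it saves just under half the work, but the per-call overhead of the scorer still dominates.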
Edit: Here is another piece of information that may be useful. I tried to speed things up with pdist, which seems to perform well on some similar tasks. However, in this case it is, for some reason, slower than my inefficient for loop.
Here is the code:
from fuzzywuzzy import fuzz
from scipy.spatial.distance import pdist, squareform
import numpy as np

def pwd(string1, string2):
    return fuzz.token_set_ratio(string1, string2)

s = []
for i in range(100):
    s.append("fuzzy was a strong bear")
    s.append("fuzzy was a large bear")
    s.append("fuzzy was the strongest bear you could ever imagine")

def pwd_loops():
    similarities = []
    l = len(s)
    for i in range(l):
        similarities.append([])
        for j in range(l):
            similarities[i].append(fuzz.token_set_ratio(s[i], s[j]))

a = np.array(s).reshape(-1, 1)

def pwd_pdist():
    dm = squareform(pdist(a, pwd))

%time pwd_loops()
#Wall time: 2.39 s
%time pwd_pdist()
#Wall time: 3.73 s
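Note that pdist with a Python callable still invokes the scorer once per pair in pure Python, so there is no vectorization gain; the observed slowdown is consistent with pdist's extra per-call overhead plus the squareform expansion. You can get pdist's condensed layout yourself over itertools.combinations, which scores each unordered pair exactly once, and let squareform expand it into the symmetric matrix. A sketch, again with stdlib difflib.SequenceMatcher standing in for fuzz.token_set_ratio:

```python
import numpy as np
from itertools import combinations
from difflib import SequenceMatcher
from scipy.spatial.distance import squareform

def scorer(a, b):
    # Stand-in for fuzz.token_set_ratio: 0-100 similarity score.
    return round(SequenceMatcher(None, a, b).ratio() * 100)

s = ["fuzzy was a strong bear",
     "fuzzy was a large bear",
     "fuzzy was the strongest bear you could ever imagine"]

# Condensed vector: one score per unordered pair, same layout as pdist's output.
condensed = np.array([scorer(a, b) for a, b in combinations(s, 2)],
                     dtype=float)
sim = squareform(condensed)   # symmetric n x n matrix, zeros on the diagonal
np.fill_diagonal(sim, 100)    # squareform puts 0 on the diagonal (distance
                              # convention); for similarity we want 100
```

This keeps the scorer-call count at n*(n-1)/2 without the callback overhead pdist adds, but for a real speedup at 40,000 strings the bottleneck remains the 800 million scorer calls themselves, not the loop structure.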