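I want to build a pairwise distance matrix where the "distance" is the similarity score between two strings, as implemented here. I was thinking of using scikit-learn's pairwise_distances method for this, since I have used it before for other calculations and its easy parallelization is a nice bonus.

Here is the relevant code: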

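from sklearn.metrics import pairwise_distances

# simhash comes from the similarity-hash implementation referenced above


def hashdistance(str1, str2):
    hash1 = simhash(str1)
    hash2 = simhash(str2)

    # Turn the similarity score into a distance
    distance = 1 - hash1.similarity(hash2)

    return distance


strings = [d['string'] for d in data]
distance_matrix = pairwise_distances(strings, metric=lambda u, v: hashdistance(u, v))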

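strings looks like ['foo', 'bar', 'baz']

When I try this, it raises ValueError: could not convert string to float. This may be a really silly mistake on my part, but I don't understand why a conversion is needed here or why that error is raised: the anonymous function passed as metric can take strings and return a float. Why does the input need to be floats, and how can I create this pairwise distance matrix from the simhash "distance"?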

1 Answer

According to the documentation, only metrics from scipy.spatial.distance are allowed, or a callable from:

In [26]: sklearn.metrics.pairwise.pairwise_distance_functions
Out[26]:
{'cityblock': <function sklearn.metrics.pairwise.manhattan_distances>,
 'euclidean': <function sklearn.metrics.pairwise.euclidean_distances>,
 'l1': <function sklearn.metrics.pairwise.manhattan_distances>,
 'l2': <function sklearn.metrics.pairwise.euclidean_distances>,
 'manhattan': <function sklearn.metrics.pairwise.manhattan_distances>}

The problem is that when metric is a callable, sklearn.metrics.pairwise.check_pairwise_arrays first tries to convert the input to float, hence your error. (scipy.spatial.distance.pdist does something similar, so you're out of luck there too.)
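To see where the message comes from, here is a minimal sketch that mimics the float conversion in plain Python. It is only an illustration of the failure mode, not the actual scikit-learn code path:

```python
strings = ['foo', 'bar', 'baz']

try:
    # Roughly what converting the input to float amounts to, element by element.
    converted = [float(s) for s in strings]
except ValueError as exc:
    # The conversion fails on the first string, before any metric is called.
    message = str(exc)
    print(message)
```

The metric callable is never reached, which is why it does not matter that it could handle strings.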

Even if you could pass a callable it wouldn't scale very well, since the loop in pairwise_distances is pure Python. It looks like you'll have to just write the loop yourself. I would suggest reading the source code of pdist and/or pairwise_distances for hints as to how to do this.
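As a rough sketch of what writing the loop yourself could look like: the helper name pairwise_string_distances is made up for this example, and string_distance is a stand-in built on difflib so that the snippet runs without the simhash library. It assumes the metric is symmetric and that the distance from a string to itself is 0.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_distance(a, b):
    # Stand-in for hashdistance from the question: 1 - similarity, in [0, 1].
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def pairwise_string_distances(items, metric):
    """Return a symmetric n x n distance matrix as a list of lists."""
    n = len(items)
    # The diagonal stays 0.0 (assumes metric(x, x) == 0).
    matrix = [[0.0] * n for _ in range(n)]
    # Visit each unordered pair once and mirror the result.
    for i, j in combinations(range(n), 2):
        d = metric(items[i], items[j])
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix


strings = ['foo', 'bar', 'baz']
distance_matrix = pairwise_string_distances(strings, string_distance)
```

Passing hashdistance in place of string_distance should give the simhash-based matrix, and numpy.asarray(distance_matrix) turns the result into an array if later steps expect one. The loop is still pure Python, so it calls the metric n(n-1)/2 times.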

answered 2013-08-30T01:05:03.563