python - 计算 250k 列表的成对相似度的最有效方法

Question

我有 250,000 个列表，每个列表平均包含 100 个字符串，存储在 10 个字典中。我需要计算所有列表的成对相似度（相似度度量在这里不相关；但简而言之，它涉及取两个列表的交集并通过某个常数对结果进行归一化）。

我为成对比较提出的代码非常简单。我只是使用 itertools.product 将每个列表与其他列表进行比较。问题在于以一种省时的方式对 250,000 个列表执行这些计算。对于处理过类似问题的任何人：根据以下标准，哪种常用选项（scipy、PyTables）最适合此问题：

支持python数据类型
巧妙地存储一个非常稀疏的矩阵（大约 80% 的值将是 0）
高效（可以在 10 小时内完成计算）

score 10 · Accepted Answer

您只是想要最有效的方法来确定数据中任意两点之间的距离吗？

或者您是否真的需要这个mxm距离矩阵来存储数据中所有行的所有成对相似度值？

通常，使用为快速检索而优化的数据结构将数据保存在某个度量空间中比预先计算成对相似度值并仅查找它们要高效得多。不用说，距离矩阵选项的规模非常大——n 个数据点需要一个 nxn 距离矩阵来存储成对的相似度分数。

kd-tree是小维度数据的首选技术（这里的“小”意味着特征数量少于大约 20）；对于更高维度的数据，通常首选Voronoi 细分。

最近，球树已被用作两者的优越替代品——它具有kd-tree的性能，但在高维上没有退化。

scikit-learn具有出色的实现，其中包括单元测试。它有据可查，目前正在积极开发中。

scikit-learn建立在NumPy和SciPy 之上，因此两者都是依赖项。站点上提供了scikit-learn的各种安装选项。

Ball Trees 最常见的用例是在k-Nearest Neighbors中；但它本身会很好地工作，例如，在像 OP 中描述的情况下。

您可以像这样使用scikit-learn Ball Tree实现：

>>> # create some fake data--a 2D NumPy array having 10,000 rows and 10 columns
>>> D = NP.random.randn(10000 * 10).reshape(10000, 10)

>>> # import the BallTree class (here bound to a local variable of same name)
>>> from sklearn.neighbors import BallTree as BallTree

>>> # call the constructor, passing in the data array and a 'leaf size'
>>> # the ball tree is instantiated and populated in the single step below:

>>> BT = BallTree(D, leaf_size=5, p=2)

>>> # 'leaf size' specifies the data (number of points) at which 
>>> # point brute force search is triggered
>>> # 'p' specifies the distance metric, p=2 (the default) for Euclidean;
>>> # setting p equal to 1, sets Manhattan (aka 'taxi cab' or 'checkerboard' dist)

>>> type(BT)
    <type 'sklearn.neighbors.ball_tree.BallTree'>

实例化和填充球树非常快 （使用 Corey Goldberg 的计时器类计时）：

>>> with Timer() as t:
        BT = BallTree(D, leaf_size=5)

>>> "ball tree instantiated & populated in {0:2f} milliseconds".format(t.elapsed)
        'ball tree instantiated & populated in 13.90 milliseconds'

查询球树也很快：

示例查询：提供最接近数据点行索引 500 的三个数据点；并且对于他们每个人，返回他们的索引和他们在 D[500,:] 处与这个参考点的距离

>>> # ball tree has an instance method, 'query' which returns pair-wise distance
>>> # and an index; one distance and index is returned per 'pair' of data points

>>> dx, idx = BT.query(D[500,:], k=3)

>>> dx    # distance
    array([[ 0.   ,  1.206,  1.58 ]])

>>> idx    # index
    array([[500, 556, 373]], dtype=int32)

>>> with Timer() as t:
    dx, idx = BT.query(D[500,:], k=3)


>>> "query results returned in {0:2f} milliseconds".format(t.elapsed)
        'query results returned in 15.85 milliseconds'

scikit-learn Ball Tree 实现中的默认距离度量是Minkowski，它只是欧几里得和曼哈顿的泛化（即，在 Minkowski 表达式中，有一个参数 p，当设置为 2 时，它会塌陷为欧几里得，而曼哈顿，对于 p = 1。

score 0 · Accepted Answer

如果您定义适当的距离（相似性）函数，那么scipy.spatial.distance 中的一些函数可能会有所帮助

python - 计算 250k 列表的成对相似度的最有效方法

2 回答 2

Related

Reference