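I'm a numpy newbie and I'm looking at using numpy.vectorize() to compute a distance matrix. I think a key part of this is the signature parameter, but when I run the code below I get an error: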

import numpy as np
from scipy.spatial.distance import jaccard

#find jaccard dissimilarities for a constant 1 row * m columns array vs each array in an n rows * m columns nested array, outputting a 1 row * n columns array of dissimilarities   
vectorised_compute_jac = np.vectorize(jaccard, signature = '(m),(n,m)->(n)')

array_list = [[1, 2, 3], #arrA
              [2, 3, 4], #arrB
              [4, 5, 6]] #arrC

distance_matrix = np.array([])
for target_array in array_list:
    print (target_array)
    print (array_list)
    #row should be an array of jac distances between target_array and each array in array_list
    row = vectorised_compute_jac(target_array , array_list)
    print (row, '\n\n') 
    #np.vectorise() functions return an array of objects of type specified by otype param, based on docs
    np.append(distance_matrix, row)

Output + error:

[1, 2, 3]
[[1, 2, 3], [2, 3, 4], [4, 5, 6]]
Traceback (most recent call last):

  File "C:\Users\u03132tk\.spyder-py3\ModuleMapper\untitled1.py", line 21, in <module>
    row = vectorised_compute_jac(array, array_list)

  File "C:\ANACONDA3\lib\site-packages\numpy\lib\function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)

  File "C:\ANACONDA3\lib\site-packages\numpy\lib\function_base.py", line 2237, in _vectorize_call
    res = self._vectorize_call_with_signature(func, args)

  File "C:\ANACONDA3\lib\site-packages\numpy\lib\function_base.py", line 2277, in _vectorize_call_with_signature
    results = func(*(arg[index] for arg in args))

  File "C:\ANACONDA3\lib\site-packages\scipy\spatial\distance.py", line 893, in jaccard
    v = _validate_vector(v)

  File "C:\ANACONDA3\lib\site-packages\scipy\spatial\distance.py", line 340, in _validate_vector
    raise ValueError("Input vector should be 1-D.")

ValueError: Input vector should be 1-D.

What I want, where square brackets denote numpy arrays rather than lists, based on the array output type discussed in the code comment above:

  #arrA    #arrB   #arrC
[[JD(AA), JD(AB), JD(AC)],   #arrA
 [JD(BA), JD(BB), JD(BC)],   #arrB
 [JD(CA), JD(CB), JD(CC)]]   #arrC

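Can someone explain how the signature parameter works, and whether it is what's causing my trouble? I suspect the problem is the (n,m) in my signature, since it is the only multidimensional part, hence the question :(

Cheers! Tim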

1 Answer

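I was going to run your code as is, but then saw that you are misusing np.append. So I'll skip your iteration and try to recreate the calculation with a straightforward list comprehension.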
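To illustrate the np.append issue: it returns a new array rather than modifying its argument, and without an axis it flattens its inputs. A minimal sketch, using made-up row values rather than real Jaccard results, with rows collected in a list instead:

```python
import numpy as np

distance_matrix = np.array([])
row = np.array([0.0, 1.0, 1.0])  # made-up values, for illustration only

# np.append returns a new array; the original is left untouched.
np.append(distance_matrix, row)
print(distance_matrix.shape)  # (0,)

# Even when the result is assigned, the rows are flattened unless axis is given.
flat = np.append(np.append(distance_matrix, row), row)
print(flat.shape)  # (6,)

# Collecting rows in a list and converting once keeps the 2d layout.
rows = [row, row, row]
matrix = np.array(rows)
print(matrix.shape)  # (3, 3)
```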

It looks like jaccard takes two 1d arrays and returns a scalar, and you evidently want to compute it for every pair in the list of arrays.

In [5]: arr = np.array(array_list)
In [6]: [jaccard(arr[0],b) for b in arr]
Out[6]: [0.0, 1.0, 1.0]
In [7]: [[jaccard(a,b) for b in arr] for a in arr]
Out[7]: [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
In [9]: np.array(_)
Out[9]: 
array([[0., 1., 1.],
       [1., 0., 1.],
       [1., 1., 0.]])

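Since jaccard is symmetric and the diagonal is 0, it should be possible to reduce the number of calls with a more selective iteration. But I'll leave that to others.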
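One possible way to do that selective iteration is sketched below. To keep it self-contained it uses pair_dist, a hypothetical stand-in that assumes the "differing positions over nonzero positions" definition; the real scipy jaccard may behave differently depending on version, and could be swapped in directly.

```python
import numpy as np
from itertools import combinations

def pair_dist(u, v):
    """Hypothetical stand-in for jaccard: fraction of positions that differ,
    counted over positions where at least one input is nonzero."""
    u, v = np.asarray(u), np.asarray(v)
    nonzero = (u != 0) | (v != 0)
    differ = (u != v) & nonzero
    total = nonzero.sum()
    return differ.sum() / total if total else 0.0

arr = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
n = len(arr)
dist = np.zeros((n, n))             # diagonal stays 0: distance of a row to itself
for i, j in combinations(range(n), 2):
    d = pair_dist(arr[i], arr[j])   # one call per unordered pair
    dist[i, j] = dist[j, i] = d     # mirror the value across the diagonal
print(dist)
```

For n rows this makes n*(n-1)/2 calls instead of n*n, so 3 calls rather than 9 for the example data.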

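With your signature, you are telling vectorize to pass a 1d array and a 2d array to jaccard, and to expect a 1d array back. That's not right, because jaccard only accepts 1d inputs.

Here is what I think is the correct use of vectorize: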
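To see what that signature actually hands to the wrapped function, here is a small sketch with shape_probe, a hypothetical helper (not part of numpy or scipy) that records the shapes it receives:

```python
import numpy as np

seen = []  # records the shapes the wrapped function actually receives

def shape_probe(u, v):
    # Hypothetical helper: remembers input shapes, returns one value per row of v.
    seen.append((u.shape, v.shape))
    return np.zeros(v.shape[0])

probe = np.vectorize(shape_probe, signature='(m),(n,m)->(n)')
arr = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
out = probe(arr[0], arr)
print(seen)       # [((3,), (3, 3))]
print(out.shape)  # (3,)
```

The second argument arrives as the whole 2d array, which is exactly what triggers "Input vector should be 1-D." when the wrapped function is jaccard.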

In [12]: vectorised_compute_jac = np.vectorize(jaccard, signature='(m),(m)->()')
In [13]: vectorised_compute_jac(arr[None,:,:],arr[:,None,:])
Out[13]: 
array([[0., 1., 1.],
       [1., 0., 1.],
       [1., 1., 0.]])
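The indexing with None is what produces the all-pairs result: the last axis is consumed as the core dimension m, and the remaining axes broadcast against each other. A sketch of the shapes involved, using count_diff as a hypothetical stand-in for jaccard (any function of two 1d arrays returning a scalar would do):

```python
import numpy as np

def count_diff(u, v):
    # Hypothetical stand-in for jaccard: takes two 1d arrays, returns a scalar.
    return (u != v).sum()

vec = np.vectorize(count_diff, signature='(m),(m)->()')
arr = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])

a = arr[None, :, :]   # shape (1, 3, 3): loop dims (1, 3), core dim m=3
b = arr[:, None, :]   # shape (3, 1, 3): loop dims (3, 1), core dim m=3
out = vec(a, b)       # loop dims broadcast to (3, 3); out[i, j] pairs arr[j] with arr[i]
print(a.shape, b.shape, out.shape)
print(out)
```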

Comparing its timing with the nested comprehension:

In [14]: timeit vectorised_compute_jac(arr[None,:,:],arr[:,None,:])
384 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: timeit np.array([[jaccard(a,b) for b in arr] for a in arr])
203 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [15], it is the jaccard calls that dominate the time, not the iteration mechanism. So exploiting the symmetry would be worthwhile.

In [17]: timeit jaccard(arr[0],arr[1])
21.2 µs ± 79.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
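Since the per-call cost dominates, another option is to avoid per-pair Python calls altogether and compute everything with array operations. This is only a sketch: it assumes the "differing positions over nonzero positions" definition of the dissimilarity, which may not match what scipy's jaccard does in every version, so check the result against jaccard on your own data.

```python
import numpy as np

arr = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
u = arr[:, None, :]   # shape (3, 1, 3)
v = arr[None, :, :]   # shape (1, 3, 3)

# Assumed definition: share of differing positions among positions where
# at least one of the two vectors is nonzero.
nonzero = (u != 0) | (v != 0)      # shape (3, 3, 3)
differ = (u != v) & nonzero
counts = nonzero.sum(axis=-1)      # shape (3, 3)

# Divide only where the denominator is nonzero; other entries stay 0.
dist = np.divide(differ.sum(axis=-1), counts,
                 out=np.zeros(counts.shape), where=counts != 0)
print(dist)
```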
Answered 2021-12-28T17:51:46.587