python - 如何解释 numpy.correlate 和 numpy.corrcoef 返回的值？

Question

我有两个一维数组，我想看看它们的相互关系。我应该在 numpy 中使用什么程序？我正在使用numpy.corrcoef(arrayA, arrayB)并且numpy.correlate(arrayA, arrayB)两者都给出了一些我无法理解或理解的结果。

有人可以阐明如何理解和解释这些数字结果（最好使用示例）？

score 19 · Accepted Answer

numpy.correlate简单地返回两个向量的互相关。

如果您需要了解互相关，请从http://en.wikipedia.org/wiki/Cross-correlation开始。

通过查看自相关函数（与自身互相关的向量）可以看到一个很好的例子：

import numpy as np

# create a vector
vector = np.random.normal(0,1,size=1000) 

# insert a signal into vector
vector[::50]+=10

# perform cross-correlation for all data points
output = np.correlate(vector,vector,mode='full')

代码图

当两个数据集重叠时，这将返回一个具有最大值的 comb/shah 函数。由于这是一个自相关，因此两个输入信号之间不会有“滞后”。因此相关性的最大值是vector.size-1。

如果您只想要重叠数据的相关值，您可以使用mode='valid'.

score 12 · Accepted Answer

我暂时只能评论numpy.correlate。这是一个强大的工具。我将它用于两个目的。首先是在另一个模式中找到一个模式：

import numpy as np
import matplotlib.pyplot as plt

some_data = np.random.uniform(0,1,size=100)
subset = some_data[42:50]

mean = np.mean(some_data)
some_data_normalised = some_data - mean
subset_normalised = subset - mean

correlated = np.correlate(some_data_normalised, subset_normalised)
max_index = np.argmax(correlated)  # 42 !

我使用它的第二个用途（以及如何解释结果）是用于频率检测：

hz_a = np.cos(np.linspace(0,np.pi*6,100))
hz_b = np.cos(np.linspace(0,np.pi*4,100))

f, axarr = plt.subplots(2, sharex=True)

axarr[0].plot(hz_a)
axarr[0].plot(hz_b)
axarr[0].grid(True)

hz_a_autocorrelation = np.correlate(hz_a,hz_a,'same')[round(len(hz_a)/2):]
hz_b_autocorrelation = np.correlate(hz_b,hz_b,'same')[round(len(hz_b)/2):]

axarr[1].plot(hz_a_autocorrelation)
axarr[1].plot(hz_b_autocorrelation)
axarr[1].grid(True)

plt.show()

找到第二个峰的索引。从这里你可以回过头来找到频率。

first_min_index = np.argmin(hz_a_autocorrelation)
second_max_index = np.argmax(hz_a_autocorrelation[first_min_index:])
frequency = 1/second_max_index

score 9 · Accepted Answer

在阅读了所有教科书的定义和公式之后，对于初学者来说，看看如何从另一个推导出来可能会很有用。首先关注两个向量之间仅成对相关的简单情况。

import numpy as np

arrayA = [ .1, .2, .4 ]
arrayB = [ .3, .1, .3 ]

np.corrcoef( arrayA, arrayB )[0,1] #see Homework bellow why we are using just one cell
>>> 0.18898223650461365

def my_corrcoef( x, y ):    
    mean_x = np.mean( x )
    mean_y = np.mean( y )
    std_x  = np.std ( x )
    std_y  = np.std ( y )
    n      = len    ( x )
    return np.correlate( x - mean_x, y - mean_y, mode = 'valid' )[0] / n / ( std_x * std_y )

my_corrcoef( arrayA, arrayB )
>>> 0.1889822365046136

作业：

将示例扩展到两个以上的向量，这就是 corrcoef 返回矩阵的原因。
查看 np.correlate 对不同于“有效”的模式的作用
看看scipy.stats.pearsonrover (arrayA, arrayB)

另一个提示：请注意，在此输入上处于“有效”模式的 np.correlate 只是一个点积（与上面 my_corrcoef 的最后一行比较）：

def my_corrcoef1( x, y ):    
    mean_x = np.mean( x )
    mean_y = np.mean( y )
    std_x  = np.std ( x )
    std_y  = np.std ( y )
    n      = len    ( x )
    return (( x - mean_x ) * ( y - mean_y )).sum() / n / ( std_x * std_y )

my_corrcoef1( arrayA, arrayB )
>>> 0.1889822365046136

score 2 · Accepted Answer

如果您对int向量的结果感到困惑，则可能是由于溢出：

>>> a = np.array([4,3,2,0,0,10000,0,0], dtype='int16')
>>> np.correlate(a,a[:3], mode='valid')
array([    29,     18,      8,  20000,  30000, -25536], dtype=int16)

怎么会？

29 = 4*4 + 3*3 + 2*2
18 = 4*3 + 3*2 + 2*0
 8 = 4*2 + 3*0 + 2*0
...
40000 = 4*10000 + 3*0 + 2*0 shows up as 40000 - 2**16 = -25536

python - 如何解释 numpy.correlate 和 numpy.corrcoef 返回的值？

4 回答 4

Related

Reference