我需要对两个一维样本应用数十万次Anderson-Darling 检验。中的实现scipy
是anderson_ksamp,它运行良好,但它占用了相当多的时间。我想提高它的性能,因为我知道:
- 我总是会比较 2 个样本
- 我只需要 Anderson-Darling 检验统计量,即不需要临界值或 p 值
scipy
从测试的原始实现中删除了非必要的检查,我设法将性能提高了近 30%。
这可以进一步改善吗?
import numpy as np
from scipy.stats import anderson_ksamp
import time as t
def main():
"""
Test scipy's osiginal vs the simplified Anderson-Dalring tests.
"""
t1, t2 = 0., 0.
AD_all = []
for _ in range(1000):
N = np.random.randint(10, 200)
aa = np.random.uniform(0., 1., N)
bb = np.random.uniform(0., 1., N)
s = t.time()
AD = anderson_ksamp([aa, bb])[0]
t1 += t.time() - s
s = t.time()
AD2 = anderson_ksamp_new([aa, bb])
t2 += t.time() - s
# Check that both values are equal
AD_all.append([AD, AD2])
AD_all = np.array(AD_all).T
print((AD_all[0] == AD_all[1]).all())
print("Improvement: {:.1f}%".format(100. - 100. * t2 / t1))
def anderson_ksamp_new(samples):
"""
A2akN: equation 7 of Scholz and Stephens.
samples : sequence of 1-D array_like
Array of sample arrays.
Z : array_like
Sorted array of all observations.
Zstar : array_like
Sorted array of unique observations.
k : int
Number of samples.
n : array_like
Number of observations in each sample.
N : int
Total number of observations.
Returns
-------
A2aKN : float
The A2aKN statistics of Scholz and Stephens 1987.
"""
k = 2 # This will always be 2
Z = np.sort(np.hstack(samples))
N = Z.size
Zstar = np.unique(Z)
n = np.array([sample.size for sample in samples])
A2kN = 0.
Z_ssorted_left = Z.searchsorted(Zstar, 'left')
if N == Zstar.size:
lj = 1.
else:
lj = Z.searchsorted(Zstar, 'right') - Z_ssorted_left
Bj = Z_ssorted_left + lj / 2.
for i in np.arange(0, k):
s = np.sort(samples[i])
s_ssorted_right = s.searchsorted(Zstar, side='right')
Mij = s_ssorted_right.astype(float)
fij = s_ssorted_right - s.searchsorted(Zstar, 'left')
Mij -= fij / 2.
inner = lj / float(N) * (N*Mij - Bj*n[i])**2 / (Bj*(N - Bj) - N*lj/4.)
A2kN += inner.sum() / n[i]
A2kN *= (N - 1.) / N
H = (1. / n).sum()
hs_cs = (1. / np.arange(N - 1, 1, -1)).cumsum()
h = hs_cs[-1] + 1
g = (hs_cs / np.arange(2, N)).sum()
a = (4*g - 6) * (k - 1) + (10 - 6*g)*H
b = (2*g - 4)*k**2 + 8*h*k + (2*g - 14*h - 4)*H - 8*h + 4*g - 6
c = (6*h + 2*g - 2)*k**2 + (4*h - 4*g + 6)*k + (2*h - 6)*H + 4*h
d = (2*h + 6)*k**2 - 4*h*k
sigmasq = (a*N**3 + b*N**2 + c*N + d) / ((N - 1.) * (N - 2.) * (N - 3.))
m = k - 1
A2 = (A2kN - m) / np.sqrt(sigmasq)
return A2