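The cdf parameter of kstest can be a callable that implements the cumulative distribution function (CDF) of the distribution you want to test your data against. To use it, you have to implement the CDF of your bimodal distribution. You want the distribution to be a mixture of two normal distributions, so you can implement its CDF as the weighted sum of the CDFs of the two normal distributions that make up the mixture.
Here is a script that shows how to do this. To demonstrate how kstest is used, the script runs kstest twice. First it uses a sample that is not from the mixture distribution; as expected, kstest computes a very small p-value for this sample. Then it generates a sample that is drawn from the mixture. For this sample, the p-value is not small.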
import numpy as np
from scipy import stats
def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))
# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4
n = 200
# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)
# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample1 p-value =", pvalue1)
# Create a sample from the bimodal distribution. This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution. The number of samples to take from the
# first distribution is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                          stats.norm.rvs(mean2, stdv2, size=n - n1)))
# Most of the time, the p-value returned by kstest with sample2 will not
# be small. We expect the value to be uniformly distributed in the interval
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)
Typical output (the numbers will be different each time the script is run):
sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403
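The comment in the script says that the p-value for a sample that really comes from the mixture should be roughly uniform on [0, 1]. A quick way to check that claim is to repeat the second test many times and look at how often the p-value falls below 0.05. The sketch below does that; the helper mixture_sample, the seed, and the number of repetitions are my own choices for illustration, not part of the script above.

```python
import numpy as np
from scipy import stats

def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """CDF of a mixture of two normal distributions."""
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))

def mixture_sample(rng, n, weight1, mean1, stdv1, mean2, stdv2):
    """Hypothetical helper: draw n values from the two-component mixture."""
    n1 = rng.binomial(n, weight1)
    return np.concatenate((rng.normal(mean1, stdv1, size=n1),
                           rng.normal(mean2, stdv2, size=n - n1)))

# Same parameters as in the script: weight1, mean1, stdv1, mean2, stdv2.
params = (0.6, 0.036, 0.52, 1.25, 0.4)
rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability

# Repeat the "sample really is from the mixture" test 500 times.
pvalues = np.array([
    stats.kstest(mixture_sample(rng, 200, *params),
                 bimodal_cdf, args=params).pvalue
    for _ in range(500)
])

frac_small = np.mean(pvalues < 0.05)
print("fraction of p-values below 0.05:", frac_small)
print("mean p-value:", pvalues.mean())
```

If the p-values are close to uniform, the fraction below 0.05 should be near 0.05 and the mean should be near 0.5.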
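You might find that this test doesn't work well for your problem. You have 4800 samples, but in your code, the numerical values of your parameters have only one or two significant digits. Unless you have a good reason to believe that your sample was drawn from a distribution with exactly those parameters, it is likely that kstest will return a very small p-value.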
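To illustrate that last point, here is a rough sketch of how sample size interacts with slightly wrong parameters. It draws samples from the mixture used in the script, then tests them against a CDF whose second mean is off by 0.15. That offset, the seed, and the helper mixture_sample are assumptions I made up for the demonstration; they are not taken from your data.

```python
import numpy as np
from scipy import stats

def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """CDF of a mixture of two normal distributions."""
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))

def mixture_sample(rng, n, weight1, mean1, stdv1, mean2, stdv2):
    """Hypothetical helper: draw n values from the two-component mixture."""
    n1 = rng.binomial(n, weight1)
    return np.concatenate((rng.normal(mean1, stdv1, size=n1),
                           rng.normal(mean2, stdv2, size=n - n1)))

# Parameters the data are actually drawn from (same as in the script).
true_params = (0.6, 0.036, 0.52, 1.25, 0.4)
# Slightly wrong parameters used for the test (mean2 is off by 0.15).
# This offset is an arbitrary choice for illustration.
test_params = (0.6, 0.036, 0.52, 1.40, 0.4)

rng = np.random.default_rng(2024)  # arbitrary seed, for repeatability
small = mixture_sample(rng, 200, *true_params)
large = mixture_sample(rng, 4800, *true_params)

p_small = stats.kstest(small, bimodal_cdf, args=test_params).pvalue
p_large = stats.kstest(large, bimodal_cdf, args=test_params).pvalue
print("n = 200   p-value =", p_small)
print("n = 4800  p-value =", p_large)
```

With 4800 points the test has enough power to detect a mismatch of this size, so the p-value for the large sample should come out far smaller than the one for the small sample, even though both samples come from the same distribution.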