
I am interested in checking whether a sample, say A (n=25), is uniformly distributed. Here is how I'd check for that in Python:

import scipy.stats as ss
A=[9,9,9,4,9,6,7,8,9,4,5,2,4,9,6,7,3,4,2,4,5,6,8,9,9]
ss.kstest(A,'uniform', args=(min(A),max(A)), N=25)

This returns (0.22222222222222221, 0.14499771178796239); that is, with a p-value of ~0.15 the test can't reject that the sample A comes from a uniform distribution.

And here is how I calculate the same in R:

A=c(9,9,9,4,9,6,7,8,9,4,5,2,4,9,6,7,3,4,2,4,5,6,8,9,9)
ks.test(A,punif,min(A),max(A))

The result: D = 0.32, p-value = 0.01195. With R one should reject the null hypothesis at the usual significance level of 0.05 (!!!)

If I read the documentation correctly, both functions perform a two-sided test by default. I also understand that the KS test is mainly intended for continuous variables, but can this explain the contrasting results produced by Python and R? Or am I making some flagrant mistake in the syntax?


1 Answer


The arguments to any cdf in scipy.stats are location and scale. For the uniform distribution, loc is the minimum x value, i.e. the left end of the interval where the uniform density is nonzero, and scale is the width of that interval. Using args = (min(A), max(A) - min(A)) in Python will give the D value that R gives.
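For instance, a minimal sketch of the corrected call, reusing the sample A from the question:

import scipy.stats as ss

A = [9, 9, 9, 4, 9, 6, 7, 8, 9, 4, 5, 2, 4, 9, 6, 7, 3, 4, 2, 4, 5, 6, 8, 9, 9]

# scipy parametrizes the uniform distribution as [loc, loc + scale],
# so pass the interval width, not the maximum, as the second parameter.
D, p = ss.kstest(A, 'uniform', args=(min(A), max(A) - min(A)))
print(D, p)  # D should now match the 0.32 reported by R's ks.test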

The p-values will still differ, because the KS test is not robust to ties. It is intended for continuous distributions and expects no repeated values. When ties are present, a different algorithm is used to try to estimate p. If you rerun the code on another data sample with no ties, with args set to loc and scale, you should get the same p-value in R and Python.
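A quick sketch of that check; the sample B below is hypothetical, generated with numpy purely for illustration:

import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)   # hypothetical seed, just for illustration
B = rng.uniform(0, 10, size=25)  # continuous draws, so no tied values

# With loc/scale passed correctly and no ties in the data,
# this should agree with R's ks.test(B, punif, min(B), max(B)).
print(ss.kstest(B, 'uniform', args=(min(B), max(B) - min(B))))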

Answered 2017-09-19T17:19:11.463