python - Kolmogorov-Smirnov test in Spark (Python) not working?

Question

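I ran a normality test in Python spark-ml and found what I believe is a bug.

Here is the setup: I have a normalized dataset (range -1 to 1).

When I build a histogram, I can clearly see that the data are not normally distributed: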

>>> prices_norm.histogram(10)

([-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
 [226, 269, 119, 95, 52, 26, 8, 2, 2, 5])

When I run the Kolmogorov-Smirnov test, I get the following result:

>>> from pyspark.mllib.stat import Statistics
>>> testResults = Statistics.kolmogorovSmirnovTest(prices_norm, "norm")
>>> print(testResults)

Kolmogorov-Smirnov test summary:
degrees of freedom = 0 
statistic = 0.46231145770077375 
pValue = 1.742039845709087E-11 
Very strong presumption against null hypothesis: Sample follows theoretical distribution.

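The Kolmogorov-Smirnov test defines the null hypothesis (H0) as: the data follow the specified distribution ( http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm ).

In this case the p-value is very low, so we should reject the null hypothesis. That makes sense, because the data are clearly not normal.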
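
As a rough cross-check of the reported statistic, the histogram counts above can be turned into an approximate empirical CDF and compared with the normal CDF. This is only a sketch: it assumes that "norm" with no extra parameters means the standard normal N(0, 1), it can only look at the ten bin edges rather than every data point, and `approximate_ks_statistic` is a hypothetical helper, not part of Spark.

```python
import math

# Bin edges and counts copied from the prices_norm.histogram(10) output above.
edges = [-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
counts = [226, 269, 119, 95, 52, 26, 8, 2, 2, 5]


def standard_normal_cdf(x):
    """CDF of N(0, 1), written with the error function (assumed reference)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approximate_ks_statistic(edges, counts):
    """Largest |empirical CDF - normal CDF| at the right edge of each bin.

    Hypothetical helper for illustration; returns (distance, edge).
    """
    total = sum(counts)
    cumulative = 0
    best_distance, best_edge = 0.0, edges[0]
    for right_edge, count in zip(edges[1:], counts):
        cumulative += count  # number of points at or below this edge
        distance = abs(cumulative / total - standard_normal_cdf(right_edge))
        if distance > best_distance:
            best_distance, best_edge = distance, right_edge
    return best_distance, best_edge


distance, edge = approximate_ks_statistic(edges, counts)
print("approximate D = %.3f at x = %.1f" % (distance, edge))
```

The largest gap shows up around x = -0.2 and comes out near 0.46, which is in line with the statistic of 0.4623 that Spark reports, so the numbers themselves look plausible.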

So why does it say:

Sample follows theoretical distribution

Isn't that wrong? Shouldn't it say that the sample does NOT follow the theoretical distribution? Am I missing something?

score 3 · Accepted Answer

This was driving me crazy, so I went straight to the source code:

git://git.apache.org/spark.git
spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala

The code is correct. The null hypothesis is defined as:

object NullHypothesis extends Enumeration {
  type NullHypothesis = Value
  val OneSampleTwoSided = Value("Sample follows theoretical distribution")
}

The message string simply restates the null hypothesis after the colon; it is not a conclusion about the sample:

Very strong presumption against null hypothesis: Sample follows theoretical distribution.
                                                 ________________________________________
                                                                    H0

Arguably the wording is confusing, because it can be read in two ways. But it is in fact correct.
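
To make the structure of that message concrete, here is a minimal Python sketch of how such a summary line could be assembled. It is not Spark's code: `explain_p_value` is a hypothetical name, and the cut-offs 0.01, 0.05 and 0.1 are assumptions chosen for illustration. The point is only that the text after the colon is always the H0 statement, whatever the p-value is.

```python
# The H0 text, copied from the NullHypothesis enumeration shown above.
NULL_HYPOTHESIS = "Sample follows theoretical distribution"


def explain_p_value(p_value, null_hypothesis=NULL_HYPOTHESIS):
    """Build a summary line: '<strength> presumption against null hypothesis: <H0>.'

    Hypothetical helper; the thresholds below are assumed, not taken from the source.
    """
    if p_value <= 0.01:
        strength = "Very strong"
    elif p_value <= 0.05:
        strength = "Strong"
    elif p_value <= 0.1:
        strength = "Low"
    else:
        strength = "No"
    # The H0 statement is appended unchanged; only the strength prefix varies.
    return "%s presumption against null hypothesis: %s." % (strength, null_hypothesis)


# p-value from the question: strong evidence against H0, so H0 is rejected.
print(explain_p_value(1.742039845709087e-11))
# A large p-value: same H0 text, but no evidence against it.
print(explain_p_value(0.5))
```

Read this way, "Very strong presumption against null hypothesis: X" means "there is very strong evidence against X", which matches the conclusion that the sample is not normal.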
