python - Comparing two multi-column data frames for statistical significance

Question

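I have 2 data frames. Each data frame contains 64 columns, and each column contains 256 values. I need to compare these two data frames for statistical significance.

I only know the basics of statistics. What I did was compute a p-value for every column of each data frame. I then compared the p-value of each column of the first data frame with the p-value of the matching column of the second data frame. Example: the p-value of the first column of the first data frame against the p-value of the first column of the second data frame.

Then I report which columns differ significantly between the 2 data frames.

Is there a better way to do this? I am using Python.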

score 1 · Accepted Answer

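Honestly, the way you are doing this is not how it is meant to be done. Let me highlight a few points you should always keep in mind when running this kind of analysis: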

1.) Hypothesize first

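I strongly recommend against testing everything against everything. This kind of exploratory data analysis may yield some significant results, but it can also lead to the multiple comparisons problem. In short: you run so many tests that the chance of seeing something "significant" that is not actually significant increases greatly (see also Type I and Type II errors).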
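To make the multiple comparisons point concrete, here is a minimal sketch. It assumes the columns of the two data frames correspond one-to-one, that the samples are independent, and that a Welch t-test is a suitable test, none of which the question confirms. The data are simulated, and `benjamini_hochberg` is a hypothetical helper written for this example. It runs 64 tests on data that contain no real difference, then adjusts the p-values for the number of tests.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


rng = np.random.default_rng(0)
# Simulated stand-ins for the two data frames: 256 rows, 64 columns,
# both drawn from the same distribution (so no true differences exist).
a = rng.normal(size=(256, 64))
b = rng.normal(size=(256, 64))

# One Welch t-test per column pair (assumption: independent samples).
raw_p = stats.ttest_ind(a, b, axis=0, equal_var=False).pvalue
adj_p = benjamini_hochberg(raw_p)

n_raw = int((raw_p < 0.05).sum())
n_adj = int((adj_p < 0.05).sum())
print(f"columns with raw p < 0.05:      {n_raw}")
print(f"columns with adjusted p < 0.05: {n_adj}")
```

With 64 tests at a 0.05 threshold, roughly three false positives are expected from chance alone, which is why the adjusted count can never exceed the raw count. If the rows of the two data frames are paired, a paired test such as `scipy.stats.ttest_rel` would be the more natural choice.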

2.) The p-value is not a magic bullet

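Saying that you computed p-values for all columns does not tell us which test you used. The p-value is just a "tool" from mathematical statistics that is used by many tests (e.g. correlation, t-test, ANOVA, regression etc.). A significant p-value indicates that the difference or relationship you observed is unlikely to be due to chance alone (i.e. it points to a systematic rather than a random effect).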
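As an illustration of why the choice of test matters, the sketch below applies three different tests to the same pair of samples. The data are simulated and the choice of tests is only an assumption for the example: both samples have the same mean but different spread, so tests that look at different properties of the data can return very different p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated samples: same mean, different standard deviation.
x = rng.normal(loc=0.0, scale=1.0, size=256)
y = rng.normal(loc=0.0, scale=3.0, size=256)

p_values = {
    # Compares means without assuming equal variances.
    "welch_t": stats.ttest_ind(x, y, equal_var=False).pvalue,
    # Rank-based comparison of the two samples.
    "mann_whitney": stats.mannwhitneyu(x, y, alternative="two-sided").pvalue,
    # Compares the full empirical distributions.
    "kolmogorov_smirnov": stats.ks_2samp(x, y).pvalue,
}

for name, p in p_values.items():
    print(f"{name:>20}: p = {p:.4g}")
```

A p-value is therefore only interpretable together with the test that produced it and the hypothesis that test addresses.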

3.) Consider sample size and effect size

Depending on which test you are using, the p-value is sensitive to your sample size. The larger the sample, the more likely you are to find a significant effect. For instance, if you compare two groups with 1 million observations each, even the smallest differences (which might also be random artifacts) can be significant. It is therefore important to also look at the effect size, which tells you how large the observed effect really is (e.g. r for correlations, Cohen's d for t-tests, partial eta squared for ANOVAs etc.).
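The following sketch shows the large-sample effect on simulated data. The true mean difference of 0.01 standard deviations is an assumption chosen for the example, and `cohens_d` is a hypothetical helper that uses the pooled standard deviation. With 1 million observations per group the t-test comes out significant, while Cohen's d stays tiny.

```python
import numpy as np
from scipy import stats


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = x.size, y.size
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


rng = np.random.default_rng(2)
n = 1_000_000
# Simulated groups whose true means differ by only 0.01 standard deviations.
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.01, scale=1.0, size=n)

p_value = stats.ttest_ind(group_a, group_b).pvalue
d = cohens_d(group_a, group_b)

print(f"p-value:   {p_value:.3g}")
print(f"Cohen's d: {d:.4f}")
```

By the common rule of thumb, a d of about 0.2 already counts as a small effect, so a value near 0.01 is negligible in practice even though the test flags it as significant.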

SUMMARY

So, if you want to get some real help here, I suggest posting some code and specifying more concretely (1) what your research question is, (2) which tests you used, and (3) what your code and your output look like.
