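I have a mysterious problem that I can't figure out.
I'm using Statistics::TTest to calculate p-values for thousands of pairs of number distributions.
I use these p-values to build a volcano plot, and when I plotted them I noticed a strange artifact: many of the points get exactly the same p-value.
After some investigation, I can reproduce the behavior with the four pairs of distributions in the code below.
When I calculate the p-values for these pairs in Excel, the values are all very different (they differ by several orders of magnitude), but with Statistics::TTest I get exactly the same value for every pair.
The p-value is very small (about 1.6e-12), so I wonder whether this is some kind of precision problem, but I can't work it out.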
If you run the code below, it prints four identical p-values (the t-test probability, t_prob), even though the true p-values range from 1.7e-19 to 2.8e-29.
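I also tried using Statistics::Distributions directly and ran into the same problem, which is not surprising, since I believe Statistics::TTest relies on Statistics::Distributions for these calculations.
I couldn't find any other module that performs this calculation.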
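I should note that the vast majority (99%) of the paired distributions get the correct p-value. Only a few collide on this strange, wrong value.
Does anyone know what is going wrong?
Any help is much appreciated, thanks!
Here is the code: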
#!/usr/bin/perl -w
use strict;
use Statistics::TTest;
my %datasets = ();
@{$datasets{a}[0]} = (0.722466024,0.925999419,1,1.049630768,1.056583528,1.10433666,1.13093087,1.150559677,1.220329955,1.316145742,1.333423734,1.63691458,1.691534165,0.713695815,0.815575429,0.918386234,0.925999419,0.941106311,0.948600847,0.970853654,0.98550043,0.98550043,1,1.028569152,1.042644337,1.10433666,1.117695043,1.269033146,1.286881148,1.298658316,1.575312331);
@{$datasets{a}[1]} = (-0.49410907,-0.358453971,-0.321928095,-0.286304185,-0.200912694,-0.200912694,-0.168122759,-0.120294234,-0.120294234,-0.104697379,-0.104697379,-0.074000581,-0.058893689,-0.577766999,-0.514573173,-0.514573173,-0.49410907,-0.358453971,-0.358453971,-0.304006187,-0.251538767,-0.184424571);
@{$datasets{b}[0]} = (-0.434402824,-0.286304185,-0.251538767,-0.058893689,-0.043943348,-0.043943348,0.084064265,0.163498732,0.23878686,0.23878686,0.310340121,0.839959587,0.879705766,-0.556393349,-0.268816758,-0.251538767,-0.152003093,-0.104697379,-0.089267338,-0.029146346,-0.029146346,0,0.070389328,0.084064265,0.097610797,0.124328135,0.137503524,0.189033824,0.189033824,0.214124805,0.214124805,0.214124805,0.321928095,0.333423734,0.367371066,0.40053793,0.411426246,0.443606651,0.516015147,0.669026766,0.713695815);
@{$datasets{b}[1]} = (0.782408565,0.799087306,0.82374936,0.887525271,0.925999419,0.933572638,0.956056652,0.97819563,0.98550043,1.021479727,1.084064265,1.097610797,1.13093087,1.150559677,1.15704371,1.176322773,1.182692298,1.22650853,1.286881148,1.292781749,1.310340121,1.459431619,1.485426827,1.521050737,1.59454855,1.695993813,1.713695815,1.726831217,0.40053793,0.411426246,0.59454855,0.925999419,0.941106311,0.948600847,0.98550043,1.028569152,1.070389328,1.117695043,1.124328135,1.220329955,1.316145742,1.744161096);
@{$datasets{c}[0]} = (-0.043943348,-0.029146346,-0.01449957,0.028569152,0.097610797,0.124328135,0.176322773,0.201633861,0.263034406,-0.862496476,-0.104697379,0.084064265,0.084064265,0.084064265,0.124328135,0.124328135,0.163498732,0.263034406,0.275007047,0.286881148,0.321928095,0.333423734);
@{$datasets{c}[1]} = (-2.64385619,-2.556393349,-2.473931188,-2.395928676,-2.395928676,-2.395928676,-2.321928095,-2.321928095,-2.321928095,-2.251538767,-2.251538767,-2.184424571,-2.120294234,-2,-0.535331733,-1.64385619,-1.556393349,-1.514573173,-1.514573173,-1.473931188,-1.434402824,-1.434402824,-1.395928676,-1.395928676,-1.395928676,-1.395928676,-1.358453971,-1.358453971,-1.358453971,-1.358453971,-1.358453971,-1.321928095,-1.286304185,-1.286304185,-1.286304185,-1.251538767,-1.217591435,-1.120294234,-1);
@{$datasets{d}[0]} = (0.933572638261024,0.948600847493356,0.948600847493356,0.970853654340483,0.978195629681652,1.111031312388740,1.150559676575380,1.416839741912830,0.731183241572200,0.790772037862000,0.815575428862573,0.855989697308481,0.871843648509318,0.895302621333307,0.933572638261024,0.941106310946431,0.948600847493356,0.956056652412403,0.970853654340483,0.992768430768924,1.000000000000000,1.063502942306160,1.226508529808680,1.269033146455240,1.298658315564520,1.704871964456350);
@{$datasets{d}[1]} = (-0.473931188332412,0.028569152196771,0.042644337408494,0.056583528366368,0.070389327891398,0.084064264788475,0.097610796626422,0.111031312388744,0.454175893185802,0.454175893185802,-0.514573172829758,-0.268816758427800,-0.168122758808327,-0.136061549576028,-0.043943347587597,0.014355292977070,0.111031312388744,0.124328135002202,0.137503523749935,0.176322772640463,0.238786859587116,0.250961573533219,0.344828496997441);
# Run a t-test on each pair and print the probability the module reports.
foreach my $dataset (sort keys %datasets) {
    my $ttest = Statistics::TTest->new;
    $ttest->load_data($datasets{$dataset}[0], $datasets{$dataset}[1]);
    print "$dataset - t_prob:\t$ttest->{t_prob}\n\n";
}
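To illustrate the precision concern mentioned above: the following is only a sketch of how very different tiny p-values could collapse to the same number, not a confirmed description of what Statistics::TTest or Statistics::Distributions does internally. The helper `round_trip_through_complement` is hypothetical (it is not part of either module). It assumes a build of Perl that uses ordinary 64-bit doubles, where anything much smaller than about 1.1e-16 disappears when it is subtracted from 1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Statistics::TTest: converts a tail
# probability to its complement and back, the way a "1 - CDF" style
# computation would.
sub round_trip_through_complement {
    my ($p) = @_;
    my $cdf = 1 - $p;    # stored as a double very close to 1
    return 1 - $cdf;     # whatever was rounded away cannot be recovered
}

# The two extremes of the true p-values quoted in the question.
for my $true_p (1.7e-19, 2.8e-29) {
    my $recovered = round_trip_through_complement($true_p);
    printf "true p = %.1e, recovered p = %.1e\n", $true_p, $recovered;
}
```

This particular round trip collapses both values to zero rather than to 1.6e-12, so it only shows the general mechanism. Whatever produces the 1.6e-12 floor in the modules would have to be some other rounding or cutoff step.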
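Edit: I still haven't figured out what is going on with Statistics::TTest or Statistics::Distributions, but I found another module that works correctly. I'm posting it here in case anyone else runs into this problem. As I understand it, a one-way ANOVA on a pair of distributions is equivalent to Student's t-test. So I tried Statistics::ANOVA, and it worked. Using the definition of %datasets from the code above, the following loop calculates the correct p-values (matching the values Excel gives):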
use Statistics::ANOVA;

foreach my $dataset (sort keys %datasets) {
    my $aov = Statistics::ANOVA->new();
    # Group names only need to be distinct within each analysis.
    $aov->load("${dataset}1", $datasets{$dataset}[0]);
    $aov->add("${dataset}2", $datasets{$dataset}[1]);
    my $str = $aov->anova(independent => 1, parametric => 1, ordinal => 0);
    print $str->{_stat}->{p_value} . "\n";
}
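As a sanity check on the ANOVA/t-test equivalence, here is a minimal pure-Perl sketch that computes both statistics by hand. It assumes the equal-variance (pooled) form of Student's t-test, for which the two-group one-way ANOVA F statistic equals t squared. All helper names and the two small arrays are made up for illustration and do not come from any of the modules above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean { my ($x) = @_; return sum(@$x) / scalar(@$x); }

# Sum of squared deviations of a sample from its own mean.
sub sum_sq_dev {
    my ($x) = @_;
    my $m = mean($x);
    return sum(map { ($_ - $m) ** 2 } @$x);
}

# Pooled-variance (equal-variance) two-sample t statistic.
sub pooled_t {
    my ($x, $y) = @_;
    my ($n1, $n2) = (scalar @$x, scalar @$y);
    my $pooled_var = (sum_sq_dev($x) + sum_sq_dev($y)) / ($n1 + $n2 - 2);
    return (mean($x) - mean($y)) / sqrt($pooled_var * (1 / $n1 + 1 / $n2));
}

# One-way ANOVA F statistic for exactly two groups.
sub two_group_f {
    my ($x, $y) = @_;
    my ($n1, $n2) = (scalar @$x, scalar @$y);
    my $grand = (sum(@$x) + sum(@$y)) / ($n1 + $n2);
    my $ss_between = $n1 * (mean($x) - $grand) ** 2
                   + $n2 * (mean($y) - $grand) ** 2;
    my $ss_within  = sum_sq_dev($x) + sum_sq_dev($y);
    # 1 between-group degree of freedom, n1 + n2 - 2 within-group.
    return $ss_between / ($ss_within / ($n1 + $n2 - 2));
}

# Made-up illustration data, not taken from %datasets.
my @x = (1.0, 1.2, 0.9, 1.1);
my @y = (0.1, -0.2, 0.0, 0.3, -0.1);

my $t = pooled_t(\@x, \@y);
my $f = two_group_f(\@x, \@y);
printf "t^2 = %.6f, F = %.6f\n", $t ** 2, $f;
```

The two printed numbers should agree up to floating-point rounding. The equivalence does not carry over to the unequal-variance (Welch) form of the t-test, so p-values from that variant can differ from the ANOVA result.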