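I'm new to R and I'm using the Shapiro-Wilk test to check a set of data for normality. My problem is not with running the test but with producing a results table that identifies the rows whose p-value is greater than 0.05. To illustrate the problem I've used the golub dataset, which gives a series of gene expression values from "ALL" and "AML" patients.
Here is what I did: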
library(multtest)
data(golub)
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
# the golub dataset has the expression values for 3051 genes so I've decided to use only the first 10 genes from the dataset to make it easier to work with
ALL10 <- golub[1:10, gol.fac=="ALL"]
# calculate Shapiro-Wilk test for normality
sh10 <- apply(ALL10, 1, function(x) shapiro.test(x)$p.value)
# get the names of the first 10 genes from the golub.gnames matrix
ALL10names <- golub.gnames[1:10,2]
# combine gene names with normality p-value scores
list10 <- cbind(ALL10names,sh10)
# find those that have normal distribution
normdist<- list10[,2]>0.05
# print a list of those with normal distribution
list10[which(normdist),]
The result I get is:
ALL10names sh10
[1,] "AFFX-HUMISGF3A/M97935_MA_at (endogenous control)" "2.97359627770755e-07"
[2,] "AFFX-HUMISGF3A/M97935_3_at (endogenous control)" "0.299103621399385"
[3,] "AFFX-HUMGAPDH/M33197_5_at (endogenous control)" "6.60564216346286e-07"
[4,] "AFFX-HUMGAPDH/M33197_M_at (endogenous control)" "6.81945800629973e-07"
[5,] "AFFX-HSAC07/X00351_5_at (endogenous control)" "3.3088559810058e-06"
[6,] "AFFX-HSAC07/X00351_M_at (endogenous control)" "1.30227973255158e-08"
As you can see, this is wrong! Several of the values are actually < 0.05 and only one is actually > 0.05 (which is the one I want).
If I run:
which(normdist)
[1] 1 3 7 8 9 10
but
which (sh10 > 0.05)
[1] 3
So the error clearly occurs at
normdist<- list10[,2]>0.05
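My question is: why does this happen? I want everything in column 2 of list10 whose value is greater than 0.05. It looks correct, but I get the wrong result. As I said, I'm learning R, so I'd like to understand what went wrong so that I don't repeat the mistake. Thanks in advance!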
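A minimal sketch that reproduces the discrepancy without the multtest package. It assumes the likely cause is that cbind() on a character vector and a numeric vector returns a character matrix, so the p-values are compared as strings rather than numbers. The names gene_names, p_values, m and df are hypothetical stand-ins, not objects from the code above.

```r
# Hypothetical stand-ins for the gene names and p-values in the question
gene_names <- c("geneA", "geneB", "geneC")
p_values   <- c(2.97e-07, 0.299, 6.6e-07)

# cbind() of a character vector and a numeric vector gives a character matrix,
# so the p-values are stored as strings such as "2.97e-07"
m <- cbind(gene_names, p_values)
is.character(m)

# Comparing a string with a number converts the number to a string ("0.05"),
# so this comparison is made character by character, not numerically
m[, 2] > 0.05

# Converting the column back to numeric restores the intended comparison
which(as.numeric(m[, 2]) > 0.05)

# A data.frame keeps each column's own type, so no conversion is needed
df <- data.frame(gene = gene_names, p = p_values, stringsAsFactors = FALSE)
df[df$p > 0.05, ]
```

If that assumption holds for list10 as well, checking it with str(list10) or is.character(list10) should show a character matrix, which would explain why which(normdist) and which(sh10 > 0.05) disagree.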