4

我想在 R 中为下面的示例数据帧“数据”执行两个样本 Kolmogorov-Smirnov (KS) 测试:

    Protein1    Protein2    Protein3    Protein4    Protein5    Protein6    Protein7    Protein8    Group
Sample1 0.56    1.46    0.64    2.53    1.96    305.29  428.41  113.22  Control
Sample2 0.75    2.29    0.38    4.31    1.72    307.95  492.58  82.75   Control
Sample3 2.05    1.73    2.42    14.75   2.92    523.92  426.94  131.51  Control
Sample4 1.71    5.37    0.68    6.39    2.02    343.05  435.16  123.12  Control
Sample5 13.31   0.94    1.21    3.83    2.83    313.71  327.84  66.8    Control
Sample6 0.36    1.81    0.42    2.25    1.48    335.23  352.55  93.81   Control
Sample7 0.28    3.26    0.49    2.62    1.96    251.49  468.19  80.27   Control
Sample8 1.62    17.01   0.49    2.87    1.7 254.79  402.9   86.8    Control
Sample9 0.59    2.64    0.18    2.93    1.23    388.87  384.53  109.52  Control
Sample10    1.67    3.75    0.43    3.89    1.83    306.28  440.86  97.55   Control
Sample11    15.53   12.02   0.57    1.81    2.31    328.56  352.98  118.18  Control
Sample12    0.94    7.06    35.77   4.98    2.44    389.14  376.18  119.75  Control
Sample13    2.07    1.73    0.38    3.89    2.56    233.81  377.21  72.1    Control
Sample14    3.2 1.38    0.5 4.05    2.51    406.57  538.57  209.16  Patient
Sample15    1.33    2.17    0.46    3.31    1.72    278.04  276.37  79.4    Patient
Sample16    1.48    2.9 0.84    9.27    1.94    332.76  413.66  99.09   Patient
Sample17    2.02    1.96    0.25    4.16    1.96    358.73  383.63  107.46  Patient
Sample18    2.94    2   0.27    3.99    2.55    354.78  493.02  145.36  Patient
Sample19    1.01    8.1 0.35    3.65    1.62    335.18  264.74  145.15  Patient
Sample20    6.95    15.48   2.94    3.64    2.3 307.23  484.38  119.61  Patient
Sample21    0.52    1.38    0.56    3.08    1.86    244.3   304.74  76.87   Patient
Sample22    0.35    2.17    0.38    4.51    2.09    304.68  369.98  151.76  Patient
Sample23    2.26    2.9 0.3 4.44    2.43    302.51  367.51  150.69  Patient
Sample24    3.19    1.96    0.81    2.94    2.15    309.59  362.18  133.49  Patient
Sample25    1.12    2   0.71    3.77    2.42    334.36  358.9   131.35  Patient
Sample26    5.28    8.1 0.81    22.84   2.35    422.18  369.71  76.35   Patient

要在 2 个单独的列之间执行 KS 测试,代码如下:

> ks.test(data$Protein1, data$Protein2, data=data)

    Two-sample Kolmogorov-Smirnov test

data:  data$Protein1 and data$Protein2
D = 0.42308, p-value = 0.01905
alternative hypothesis: two-sided

Warning message:
In ks.test(data$Protein1, data$Protein2, data = data) :
  cannot compute exact p-value with ties

但是,我想按列和按组执行此操作。例如,为 t.test 或 wilcox.test 执行此操作很容易,因为您可以编码为 t.test(y1,y2) 或 t.test(y~x) # 其中 y 是数字,x 是二进制因素 但是,当涉及到二进制因素时,没有 ks.test 的代码。有人能帮忙吗?

最终,我想为所有蛋白质的整个数据框执行此操作,因为我可以成功地进行 t 检验,但想为 ks.test 执行如下操作:

t_tests <- lapply(
  data[, -1], # apply the function to every variable *other than* the first one (group)
  function(x) { t.test(x ~ HealthGroups, data = data) }
)

任何帮助表示赞赏。提前谢谢你。

4

1 回答 1

4

这是一个非常简单的方法。这使用了一个循环,这通常在 R 圈子中不受欢迎。但是,它非常简单且不言自明,这对于新用户来说可能是一个加分项,并且在这种情况下循环太慢也没有问题。(请注意,lapply()如果您愿意,可以使用它,但这仍然是一个循环,它只是在外面看起来不同。)

只需创建两个具有相同变量的新子集数据框。然后循环调用ks.test. 输出对用户不是很友好——它只会说j——所以我添加了一个对?writeLines的调用来打印正在测试的变量的名称。

# I am assuming the original data frame is called d
dc <- d[d$Group=="Control",]
dp <- d[d$Group=="Patient",]
for(j in 1:8){  
  writeLines(names(dc)[j])
  print(ks.test(dc[,j], dp[,j]))  
}
# Protein1
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.30769, p-value = 0.5882
# alternative hypothesis: two-sided
# 
# Protein2
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein3
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein4
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.46154, p-value = 0.1254
# alternative hypothesis: two-sided
# 
# Protein5
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.23077, p-value = 0.8793
# alternative hypothesis: two-sided
# 
# Protein6
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.15385, p-value = 0.9992
# alternative hypothesis: two-sided
# 
# Protein7
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.38462, p-value = 0.2999
# alternative hypothesis: two-sided
# 
# Protein8
# 
#   Two-sample Kolmogorov-Smirnov test
# 
# data:  dc[, j] and dp[, j]
# D = 0.46154, p-value = 0.1265
# alternative hypothesis: two-sided
# 
# Warning messages:
# 1: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 2: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 3: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties
# 4: In ks.test(dc[, j], dp[, j]) : cannot compute exact p-value with ties

这将创建一个数据框来容纳结果:

dk <- data.frame(Protein=character(8), D=numeric(8), p=numeric(8), stringsAsFactors=F)
for(j in 1:8){  
  k <- ks.test(dc[,j], dp[,j])
  dk$Protein[j] <- names(dc)[j]
  dk$D[j]       <- k$statistic
  dk$p[j]       <- k$p.value
}
dk
#    Protein         D         p
# 1 Protein1 0.3076923 0.5881961
# 2 Protein2 0.2307692 0.8793244
# 3 Protein3 0.2307692 0.8793244
# 4 Protein4 0.4615385 0.1253895
# 5 Protein5 0.2307692 0.8793244
# 6 Protein6 0.1538462 0.9992124
# 7 Protein7 0.3846154 0.2999202
# 8 Protein8 0.4615385 0.1264877
于 2019-06-14T15:06:21.447 回答