r - Conducting a t-test across two different data frames with switched rows/columns?

Question

Sorry for the confusing title, this one's a bit hard to describe. Basically, I've got two data tables which look similar to this:

df1 <- data.frame(SNP=c("W", "X", "Y", "Z"),
                  Gene.ID=c("A", "B", "C", "B"), pval=NA)
df2 <- data.frame(W=c(1, 0, 1), X=c(1, 1, 0), Y=c(0, 0, 1), Z=c(1, 0, 1),
                  A=c(3.5, 2.5, 3.5), C=c(4.5, 2.5, 1.5), B=c(1.5, 2.5, 1.5))

So all the entries in df1 correspond to column names in df2. My goal is to fill df1$pval with the p-values from a t-test. For every row in df1, I want to do a t-test comparing the df2 column that matches the value of df1$SNP against the df2 column that matches the value of df1$Gene.ID.

For example, for the first row in df1, I would want to compare df2$W vs. df2$A, and then return the resulting p-value inside of df1[1, 3]. For the second row, I would compare df2$X vs. df2$B and return the p-value in df1[2, 3]. In other words, something like this:

for (i in 1:nrow(df1)){
  test <- t.test(df2[,which(colnames(df2)==df1[i, 1]] ~ df2[,which(colnames(df2)==df1[i, 2]])
  df1[i, 3] <- test$p.value
}

But this does not work because you can only select multiple column names using the colnames function, not just a single column name. Suggestions for how to get around this would be greatly appreciated--or if you have a simpler method in mind, that would be great too.

score 1 · Accepted Answer

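I don't see why you think this wouldn't work; I think your code just has a syntax error. The following code seems to work fine (note the change to sapply, which is slightly more conventional in R):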

df1[, 3] <- sapply(seq_len(nrow(df1)), 
  function(i) {
    test <- t.test(
      df2[, which(colnames(df2) == df1[i, 1])],
      df2[, which(colnames(df2) == df1[i, 2])])
    test$p.value
  })

score 1 · Accepted Answer

Using which(colnames(df2) ...) is probably not the best choice here, since all you want to do is select the columns of df2 whose names are df1[i, 1] or df1[i, 2].

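In R, one way to select a column by name is to use double brackets: for example, df2[["A"]] retrieves column A of df2, which seems to be what you want, and is less cumbersome than df2[, which(colnames(df2) == "A")].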
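As a quick check of the point above, here is a minimal sketch using df2 from the question. The variable names col_name, by_which and by_bracket are only illustrative. It compares the question's which(colnames(...)) selection with double-bracket selection for a single column name held in a variable.

```r
# df2 as defined in the question
df2 <- data.frame(W = c(1, 0, 1), X = c(1, 1, 0), Y = c(0, 0, 1), Z = c(1, 0, 1),
                  A = c(3.5, 2.5, 3.5), C = c(4.5, 2.5, 1.5), B = c(1.5, 2.5, 1.5))

col_name <- "A"  # a single column name stored in a variable (illustrative)

by_which   <- df2[, which(colnames(df2) == col_name)]  # the question's approach
by_bracket <- df2[[col_name]]                          # double-bracket selection

# Both forms return the same plain numeric vector
identical(by_which, by_bracket)
```

So selecting a single column by name works either way; the double-bracket form is simply shorter.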

With this in mind, you could rewrite your code like this:

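for (i in seq_len(nrow(df1))) {
  # as.character() guards against factor columns (the data.frame() default before R 4.0)
  snp  <- as.character(df1[i, 1])
  gene <- as.character(df1[i, 2])
  # numeric gene values on the left, two-level SNP grouping on the right
  test <- t.test(df2[[gene]] ~ df2[[snp]])
  df1[i, 3] <- test$p.value
}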

Note that I swapped df1[i, 1] and df1[i, 2], because the t.test documentation says the binary variable must be on the right-hand side:

a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups
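To make the formula orientation concrete, here is a hedged sketch on hypothetical data; the six-sample df1 and df2 below are made up and are not the question's tables. With the question's three-row df2, every 0/1 split leaves one group with a single observation, which as far as I can tell makes the default (Welch) t.test stop with a "not enough observations" error, so the toy data here gives each SNP group three rows. Using vapply rather than a for loop is my own choice, not something either answer prescribes.

```r
# Hypothetical data: six samples, so each SNP value (0 or 1) covers at least two rows
df1 <- data.frame(SNP = c("W", "X"), Gene.ID = c("A", "B"), pval = NA_real_,
                  stringsAsFactors = FALSE)
df2 <- data.frame(W = c(0, 0, 0, 1, 1, 1),
                  X = c(0, 1, 0, 1, 0, 1),
                  A = c(2.1, 2.5, 2.3, 3.6, 3.4, 3.9),
                  B = c(1.5, 2.5, 1.7, 2.4, 1.4, 2.6))

# One p-value per row of df1: gene values (lhs) grouped by SNP value (rhs)
df1$pval <- vapply(seq_len(nrow(df1)), function(i) {
  snp  <- df1$SNP[i]
  gene <- df1$Gene.ID[i]
  t.test(df2[[gene]] ~ df2[[snp]])$p.value
}, numeric(1))

df1
```

For the first row this is the same test as t.test(A ~ W, data = df2): the numeric gene column is split into two groups by the SNP column, rather than the two columns being treated as two separate samples.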
