r - R: How do I remove values from one table that appear in another table?

Question

My data looks like this:

> head(dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes"'))
      gene_id sample_1 sample_2 status value_1 value_2 log2_fold_change test_stat p_value   q_value significant
1 XLOC_000219       M4       M3     OK 3.85465 0.00000             -Inf        NA   5e-05 0.0075951         yes
2 XLOC_004272       M4       M3     OK 2.06687 0.00000             -Inf        NA   5e-05 0.0075951         yes
3 XLOC_004991       M4       M3     OK 3.29904 0.00000             -Inf        NA   5e-05 0.0075951         yes
4 XLOC_007234       M4       M3     OK 1.28027 0.00000             -Inf        NA   5e-05 0.0075951         yes
5 XLOC_000664       M4       F4     OK 1.46853 0.00000             -Inf        NA   5e-05 0.0075951         yes
6 XLOC_001809       M4       F4     OK 0.00000 1.91743              Inf        NA   5e-05 0.0075951         yes

I generated two subsets with RSQLite:

M4M3 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "M3"')

M4F4 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "F4"')

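I want to remove from M4M3 every row whose gene ID has a match in M4F4. It doesn't matter that I used RSQLite to filter the dataset; a pure R solution would be fine too, but I'm not sure how to compare the tables and remove rows from one based on the other.

Thanks for any suggestions!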

score 3 · Accepted Answer

There are many ways to do this.

Base R subsetting solution (as Balter mentioned above):

M4M3.new <- M4M3[!(M4M3$gene_id %in% M4F4$gene_id),]

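Base R set-difference solution (setdiff() is applied to the gene_id vectors, because calling it on the two data frames would not compare gene IDs):

M4M3.new <- M4M3[M4M3$gene_id %in% setdiff(M4M3$gene_id, M4F4$gene_id),]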

dplyr solution:

M4M3.new <- dplyr::anti_join(M4M3, 
                             M4F4, 
                             by = c("gene_id" = "gene_id"))

Edit: all of these seem to work when tested on the following dataset:

tst1 <- data.frame(gene_id = 1:10, 
                   sample_1 = rep("M4", 10), 
                   sample_2 = c(rep("M3", 6), rep("F4", 4)), 
                   other_values = sample(1:10, 10, replace = TRUE),
                   other_values2 = rep("OK", 10))

M4M3 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2  == "M3",]
M4F4 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2  == "F4",]
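
In the test data above the two subsets share no gene IDs (1 to 6 versus 7 to 10), so nothing is actually removed. Here is a minimal sketch with overlapping IDs, using only base R. The toy data frames and their names (M4M3_toy, M4F4_toy) are made up for illustration and are not from the original data.

```r
# Hypothetical toy data: gene IDs "g2" and "g3" appear in both subsets.
M4M3_toy <- data.frame(gene_id = c("g1", "g2", "g3", "g4"),
                       sample_2 = "M3",
                       stringsAsFactors = FALSE)
M4F4_toy <- data.frame(gene_id = c("g2", "g3", "g5"),
                       sample_2 = "F4",
                       stringsAsFactors = FALSE)

# Keep only the rows of M4M3_toy whose gene_id is absent from M4F4_toy.
kept <- M4M3_toy[!(M4M3_toy$gene_id %in% M4F4_toy$gene_id), ]

# The same set of IDs, computed as a set difference of the ID vectors.
ids_only_in_M4M3 <- setdiff(M4M3_toy$gene_id, M4F4_toy$gene_id)

print(kept$gene_id)        # [1] "g1" "g4"
print(ids_only_in_M4M3)    # [1] "g1" "g4"
```

Both approaches compare only the gene_id column, so the differing sample_2 values in the two subsets do not interfere with the match.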

score 1 · Accepted Answer

If you want the join to run in the database, you can also do it via dbplyr:

library(dplyr)
src <- dbplyr::src_dbi(mydb)  # mydb is the DBI connection from the question
geneExpDiffData <- tbl(src, "geneExpDiffData")

M4M3 <- geneExpDiffData %>%
  filter(significant == "yes" & sample_1 == "M4" & sample_2 == "M3")

M4F3 <- geneExpDiffData %>%
  filter(significant == "yes" & sample_1 == "M4" & sample_2 == "F4")

# Match on gene_id only; without `by`, the join would use every shared column
anti_join(M4M3, M4F3, by = "gene_id")

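The advantage is that you can use the same syntax for most applications, whether your data lives in a database or in a local data frame. In fact, M4M3 and M4F3 are just query objects; the query only runs when requested (for example, when you print the data or run the join). Convert the result to a data frame with collect():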

result_df <- anti_join(M4M3, M4F3, by = "gene_id") %>% collect()

Read more in the dbplyr introduction.
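
To see the lazy behaviour for yourself, here is a rough sketch against a throwaway in-memory SQLite database. It assumes dplyr, dbplyr and RSQLite are installed; the toy table and the names con, toy, m3, f4 and lazy are hypothetical stand-ins for the real data.

```r
library(dplyr)

# Hypothetical in-memory database standing in for the real connection.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

toy <- data.frame(
  gene_id     = c("g1", "g2", "g3", "g2", "g5"),
  sample_1    = "M4",
  sample_2    = c("M3", "M3", "M3", "F4", "F4"),
  significant = "yes",
  stringsAsFactors = FALSE
)
DBI::dbWriteTable(con, "geneExpDiffData", toy)

expr <- tbl(con, "geneExpDiffData")
m3 <- expr %>% filter(significant == "yes", sample_1 == "M4", sample_2 == "M3")
f4 <- expr %>% filter(significant == "yes", sample_1 == "M4", sample_2 == "F4")

# Still a query object: nothing has been fetched from the database yet.
lazy <- anti_join(m3, f4, by = "gene_id")
show_query(lazy)  # prints the SQL that dbplyr generated

# collect() runs the query and returns a local data frame.
result_df <- lazy %>% collect() %>% arrange(gene_id)
DBI::dbDisconnect(con)

print(result_df$gene_id)   # [1] "g1" "g3"
```

Here g2 is dropped because it also appears in the F4 rows, while g1 and g3 survive.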

score 1 · Accepted Answer

You can do this directly in a single SQL statement, like so:

M4M3 <- dbGetQuery(mydb, '
SELECT * 
FROM geneExpDiffData 
WHERE significant = "yes" 
AND sample_1 = "M4" 
AND sample_2 = "M3"
AND gene_id not in (SELECT gene_id 
                    FROM geneExpDiffData 
                    WHERE significant = "yes" 
                    AND sample_1 = "M4" 
                    AND sample_2 = "F4")
')

The code in the inner parentheses returns a table of all the gene_id values in M4F4. So we want every gene_id that is in the first table but not in the second.
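
If you want to try the query without touching the real database, something like the following should work. It is only a sketch: it assumes DBI and RSQLite are installed, and the in-memory connection con and the toy rows are invented for illustration.

```r
library(DBI)

# Hypothetical in-memory database standing in for mydb.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

toy <- data.frame(
  gene_id     = c("g1", "g2", "g3", "g2", "g5"),
  sample_1    = "M4",
  sample_2    = c("M3", "M3", "M3", "F4", "F4"),
  significant = "yes",
  stringsAsFactors = FALSE
)
dbWriteTable(con, "geneExpDiffData", toy)

# Same shape as the query above, with single-quoted string literals
# and an ORDER BY so the row order is deterministic.
res <- dbGetQuery(con, "
  SELECT *
  FROM geneExpDiffData
  WHERE significant = 'yes'
    AND sample_1 = 'M4'
    AND sample_2 = 'M3'
    AND gene_id NOT IN (SELECT gene_id
                        FROM geneExpDiffData
                        WHERE significant = 'yes'
                          AND sample_1 = 'M4'
                          AND sample_2 = 'F4')
  ORDER BY gene_id
")
dbDisconnect(con)

print(res$gene_id)   # [1] "g1" "g3"
```

One thing to watch for: if the subquery can return a NULL gene_id, NOT IN matches no rows at all, so you may want to filter NULLs out of the subquery or use NOT EXISTS instead.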
