r - How to split a data.table by group and subset by occurrences in a column?

Question

I have a large data set, 287046 x 18, that looks like this (partial representation only):

tdf
         geneSymbol     peaks
16         AK056486 Pol2_only
13         AK310751   no_peak
7          BC036251   no_peak
10         DQ575786   no_peak
4          DQ597235   no_peak
5          DQ599768   no_peak
11         DQ599872   no_peak
12         DQ599872   no_peak
2           FAM138F   no_peak
15           FAM41C   no_peak
34116         GAPDH      both
283034        GAPDH Pol2_only
6      LOC100132062   no_peak
9      LOC100133331   no_peak
14     LOC100288069      both
8            M37726   no_peak
3             OR4F5   no_peak
17           SAMD11      both
18           SAMD11      both
19           SAMD11      both
20           SAMD11      both
21           SAMD11      both
22           SAMD11      both
23           SAMD11      both
24           SAMD11      both
25           SAMD11      both
1            WASH7P Pol2_only

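What I want to do is extract (1) the gene symbols that are "Pol2_only" or "both", and then (2) the gene symbols that are only "Pol2_only" and never "both". For example, GAPDH satisfies condition 1 but not condition 2.

I have tried something like this with plyr (there is an additional condition in there, please ignore it):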

## grab genes with both peaks 
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE)

## grab genes pol2 only peaks 
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE)

But it takes a long time and still returns the wrong answer. For example, the answer for (2) is:

pol2.only.peaks
  geneSymbol     peaks
1   AK056486 Pol2_only
2      GAPDH Pol2_only
3     WASH7P Pol2_only

As you can see, GAPDH should not be there. My implementation in data.table (which I would prefer) produces the same result:

filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
   geneSymbol     peaks
1:   AK056486 Pol2_only
2:      GAPDH Pol2_only
3:     WASH7P Pol2_only

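The problem seems to be that the subsetting works row by row, whereas I need it applied to each geneSymbol subgroup as a whole.

Could you please help me with the grouping? A data.table solution would be welcome since it is faster, but plyr (or even base R) is fine. A solution that adds an extra column stating the nature of the peaks would be perfect. This is what I mean: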
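To make the row-wise versus group-wise difference concrete, here is a minimal base R sketch on made-up toy data. The `toy` data frame and the variable names are illustrative assumptions, not part of the data set above.

```r
# Toy data (hypothetical, mirrors the structure of tdf)
toy <- data.frame(
  geneSymbol = c("GAPDH", "GAPDH", "WASH7P", "SAMD11"),
  peaks      = c("both", "Pol2_only", "Pol2_only", "both"),
  stringsAsFactors = FALSE
)

# Row-wise test: TRUE for every row that is "Pol2_only", including GAPDH's
row_wise <- toy$peaks == "Pol2_only"

# Group-wise test: TRUE only when every row of that gene is "Pol2_only"
group_wise <- ave(row_wise, toy$geneSymbol, FUN = all)

unique(toy$geneSymbol[row_wise])    # "GAPDH" "WASH7P"
unique(toy$geneSymbol[group_wise])  # "WASH7P"
```

The row-wise test keeps GAPDH because one of its rows matches; the group-wise test drops it because its other row is "both".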

tdf
         geneSymbol     peaks      newCol
16         AK056486 Pol2_only   Pol2_only
13         AK310751   no_peak     no_peak
7          BC036251   no_peak     no_peak
10         DQ575786   no_peak     no_peak
4          DQ597235   no_peak     no_peak
5          DQ599768   no_peak     no_peak
11         DQ599872   no_peak     no_peak
12         DQ599872   no_peak     no_peak
2           FAM138F   no_peak     no_peak
15           FAM41C   no_peak     no_peak
34116         GAPDH      both        both
283034        GAPDH Pol2_only        both
6      LOC100132062   no_peak     no_peak
9      LOC100133331   no_peak     no_peak
14     LOC100288069      both        both
8            M37726   no_peak     no_peak
3             OR4F5   no_peak     no_peak
17           SAMD11      both        both
18           SAMD11      both        both
19           SAMD11      both        both
20           SAMD11      both        both
21           SAMD11      both        both
22           SAMD11      both        both
23           SAMD11      both        both
24           SAMD11      both        both
25           SAMD11      both        both
1            WASH7P Pol2_only   Pol2_only

Note again that GAPDH is now "both" in both of its rows. Here is the data:

dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251", 
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F", 
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069", 
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", 
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only", 
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", 
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak", 
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both", 
"both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol", 
"peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L, 
2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L, 
20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")

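Thanks!

EDIT: I found a fix for the problem. The selection was being done row by row. All it needed was a hack, namely requiring that all values in the returned logical vector are TRUE. So this is what I did with the plyr function: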

ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE)
  geneSymbol     peaks
1   AK056486 Pol2_only
2     WASH7P Pol2_only

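Note the use of all() around the condition. The result is now as expected, i.e. only the "Pol2_only" (redundancy alert) genes :) What remains is the data.table implementation, which I tried but failed to get working. Any help?

I have not posted this as an answer to my own question, in the hope that someone will provide a better solution in data.table.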

score 3 · Accepted Answer

As you asked for a data.table solution:

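library(data.table)

# key by geneSymbol, then peaks
TDF <- data.table(tdf, key = c('geneSymbol','peaks'))

# use unique() to get the unique combinations, then for each geneSymbol take
# the first match (this assumes the key orders both < Pol2_only < no_peak
# within each geneSymbol; data.table sorts keys in C-locale, where uppercase
# letters come before lowercase, so check the order on your version),
# then exclude the rows with peaks == "no_peak"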

unique(TDF)[.(unique(geneSymbol)), mult = 'first'][!peaks =='no_peak']

#      geneSymbol     peaks
# 1:     AK056486 Pol2_only
# 2:        GAPDH      both
# 3: LOC100288069      both
# 4:       SAMD11      both
# 5:       WASH7P Pol2_only
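
If you would rather not depend on the key's sort order, a possible alternative is to label each gene group-wise and filter on that label. This is only a sketch on made-up toy data: `DT`, `label_gene` and `newCol` are illustrative names, and it assumes that "both" should win over "Pol2_only", which in turn wins over "no_peak".

```r
library(data.table)

# Toy data (hypothetical, same shape as tdf)
DT <- data.table(
  geneSymbol = c("AK310751", "GAPDH", "GAPDH", "SAMD11", "WASH7P"),
  peaks      = c("no_peak", "both", "Pol2_only", "both", "Pol2_only")
)

# Hypothetical helper: one label per gene, decided from all of its rows
label_gene <- function(p) {
  if (any(p == "both")) "both"
  else if (any(p == "Pol2_only")) "Pol2_only"
  else "no_peak"
}

# Add the label by reference, evaluated once per geneSymbol group
DT[, newCol := label_gene(peaks), by = geneSymbol]

# (1) genes with a "Pol2_only" or "both" peak
DT[newCol != "no_peak", unique(geneSymbol)]    # "GAPDH" "SAMD11" "WASH7P"

# (2) genes with a "Pol2_only" peak but no "both" peak
DT[newCol == "Pol2_only", unique(geneSymbol)]  # "WASH7P"
```

Because the label is computed with `by = geneSymbol`, every row of GAPDH gets "both", which matches the extra column requested in the question.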

score 1 · Accepted Answer

You don't need plyr.

a <- tdf$geneSymbol[tdf$peaks %in% c("both", "Pol2_only")]
b <- tdf$geneSymbol[tdf$peaks != "Pol2_only"]
result <- setdiff(a, b)

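And to create a new column in your data frame:

# label genes in `result` as "Pol2_only", other genes in `a` as "both"
tdf$newcol <- with(tdf, ifelse(geneSymbol %in% result, "Pol2_only",
                        ifelse(geneSymbol %in% a, "both", "no_peak")))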
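
To see why the `setdiff()` step works, it may help to trace it on a small made-up example. The `toy` data frame below is an assumption for illustration only.

```r
# Toy data (hypothetical, same shape as tdf)
toy <- data.frame(
  geneSymbol = c("AK310751", "GAPDH", "GAPDH", "SAMD11", "WASH7P"),
  peaks      = c("no_peak", "both", "Pol2_only", "both", "Pol2_only"),
  stringsAsFactors = FALSE
)

# Genes with at least one "both" or "Pol2_only" row
a <- toy$geneSymbol[toy$peaks %in% c("both", "Pol2_only")]
# "GAPDH" "GAPDH" "SAMD11" "WASH7P"

# Genes with at least one row that is NOT "Pol2_only"
b <- toy$geneSymbol[toy$peaks != "Pol2_only"]
# "AK310751" "GAPDH" "SAMD11"

# Genes in a but not in b: every row of the gene is "Pol2_only"
result <- setdiff(a, b)
# "WASH7P"
```

Here `unique(a)` answers condition (1) and `result` answers condition (2). GAPDH is removed by `setdiff()` because it also appears in `b` through its "both" row.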
