I have a large data set, 287046 x 18, that looks like this (only a partial representation):
tdf
geneSymbol peaks
16 AK056486 Pol2_only
13 AK310751 no_peak
7 BC036251 no_peak
10 DQ575786 no_peak
4 DQ597235 no_peak
5 DQ599768 no_peak
11 DQ599872 no_peak
12 DQ599872 no_peak
2 FAM138F no_peak
15 FAM41C no_peak
34116 GAPDH both
283034 GAPDH Pol2_only
6 LOC100132062 no_peak
9 LOC100133331 no_peak
14 LOC100288069 both
8 M37726 no_peak
3 OR4F5 no_peak
17 SAMD11 both
18 SAMD11 both
19 SAMD11 both
20 SAMD11 both
21 SAMD11 both
22 SAMD11 both
23 SAMD11 both
24 SAMD11 both
25 SAMD11 both
1 WASH7P Pol2_only
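What I want to do is extract (1) the gene symbols that are "Pol2_only" or "both", and then (2) the gene symbols that are only "Pol2_only" and never "both". For example, GAPDH satisfies condition 1 but not condition 2.
I have tried something like this with plyr (there is an extra condition in there; please ignore it):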
## grab genes with both peaks
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE)
## grab genes pol2 only peaks
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE)
But it takes a long time and still returns the wrong answer. For example, the result for condition 2 is:
pol2.only.peaks
geneSymbol peaks
1 AK056486 Pol2_only
2 GAPDH Pol2_only
3 WASH7P Pol2_only
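As you can see, GAPDH should not be there. My implementation in data.table (which I would prefer, so a solution there is more welcome) produces the same result: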
filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
geneSymbol peaks
1: AK056486 Pol2_only
2: GAPDH Pol2_only
3: WASH7P Pol2_only
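The problem seems to be that subset works row by row, whereas I need the condition applied to the whole subgroup of each geneSymbol.
Can you help me apply the condition per group? A data.table solution would be most welcome because it is faster, but plyr (or even base R) is fine. A solution that adds an extra column stating the nature of the peaks would be perfect. This is what I mean: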
tdf
geneSymbol peaks newCol
16 AK056486 Pol2_only Pol2_only
13 AK310751 no_peak no_peak
7 BC036251 no_peak no_peak
10 DQ575786 no_peak no_peak
4 DQ597235 no_peak no_peak
5 DQ599768 no_peak no_peak
11 DQ599872 no_peak no_peak
12 DQ599872 no_peak no_peak
2 FAM138F no_peak no_peak
15 FAM41C no_peak no_peak
34116 GAPDH both both
283034 GAPDH Pol2_only both
6 LOC100132062 no_peak no_peak
9 LOC100133331 no_peak no_peak
14 LOC100288069 both both
8 M37726 no_peak no_peak
3 OR4F5 no_peak no_peak
17 SAMD11 both both
18 SAMD11 both both
19 SAMD11 both both
20 SAMD11 both both
21 SAMD11 both both
22 SAMD11 both both
23 SAMD11 both both
24 SAMD11 both both
25 SAMD11 both both
1 WASH7P Pol2_only Pol2_only
Note again that GAPDH is now "both" in both of its rows. Here is the data:
dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251",
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F",
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069",
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11",
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only",
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak",
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak",
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both",
"both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol",
"peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L,
2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L,
20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")
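To make the target concrete, the newCol shown above could be sketched in base R with ave(), which applies a function within each geneSymbol group. This is only an illustration on a small subset of the rows: the names `toy` and `group_label` are made up here, and the priority "both" > "Pol2_only" > "no_peak" is my assumption about how the labels should collapse (CBP20_only is ignored).

```r
# Small subset of the rows above; `toy` is a hypothetical name.
toy <- data.frame(
  geneSymbol = c("AK056486", "AK310751", "GAPDH", "GAPDH", "SAMD11", "WASH7P"),
  peaks      = c("Pol2_only", "no_peak", "both", "Pol2_only", "both", "Pol2_only"),
  stringsAsFactors = FALSE
)

# Collapse all peak labels of one gene into a single label,
# repeated once per row of that gene.
# Assumed priority: "both" > "Pol2_only" > "no_peak".
group_label <- function(p) {
  label <- if (any(p == "both")) {
    "both"
  } else if (any(p == "Pol2_only")) {
    "Pol2_only"
  } else {
    "no_peak"
  }
  rep(label, length(p))
}

# ave() calls group_label once per geneSymbol group and puts the
# results back in the original row order.
toy$newCol <- ave(toy$peaks, toy$geneSymbol, FUN = group_label)
print(toy)
```

With this, both GAPDH rows get newCol "both", and condition 2 reduces to the rows where newCol == "Pol2_only". I have not checked how ave() performs on the full 287046 rows.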
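Thanks!
EDIT: I found a solution to the problem. The selection is done row by row, so all it needs is a hack: require that every value in the logical vector returned for a group is TRUE. Here is what I did with the plyr function: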
ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE)
geneSymbol peaks
1 AK056486 Pol2_only
2 WASH7P Pol2_only
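Note the use of all() around the condition. The result is now as expected, i.e. only the "Pol2_only" genes (redundancy alert) :) What remains is the data.table implementation, which I tried but failed to get working. Any help?
I have not posted this as an answer to my own question, in the hope that someone will provide a better solution in data.table.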
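For what it is worth, here is a minimal sketch of how the same all() idea might carry over to data.table. The missing piece in my attempt above appears to be `by = geneSymbol`, so that the condition is evaluated once per gene rather than over the whole table. The toy table `dt` and the result names `cond1` and `cond2` are made up for illustration, and I have not timed this on the full data.

```r
library(data.table)

# Small subset of the rows above; `dt` is a hypothetical name.
dt <- data.table(
  geneSymbol = c("AK056486", "AK310751", "GAPDH", "GAPDH", "SAMD11", "WASH7P"),
  peaks      = c("Pol2_only", "no_peak", "both", "Pol2_only", "both", "Pol2_only")
)

# Condition 1: keep every gene that has at least one "Pol2_only" or "both" row.
# Returning .SD keeps the group; an if without else returns NULL, which drops it.
cond1 <- dt[, if (any(peaks %in% c("Pol2_only", "both"))) .SD, by = geneSymbol]

# Condition 2: keep only genes whose rows are all "Pol2_only".
cond2 <- dt[, if (all(peaks == "Pol2_only")) .SD, by = geneSymbol]

print(cond1)
print(cond2)
```

On this toy table, cond2 contains only AK056486 and WASH7P, so GAPDH is excluded as intended. Returning `.SD` per group may be slow with many groups, so this is a starting point rather than a tuned solution.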