I want to find the minimum value for each gene:
a <- data.frame(probe=c("probe1","probe2","probe3","probe4"), gene=c("gene1","gene1","gene2","gene1"), value=c(.001,.1,.05,.001))
# probe gene value
# 1 probe1 gene1 0.001
# 2 probe2 gene1 0.100
# 3 probe3 gene2 0.050
# 4 probe4 gene1 0.001
So I did this:
aggregated <- aggregate(value~gene, data=a, FUN=min)
# gene value
# 1 gene1 0.001
# 2 gene2 0.050
b <- merge(aggregated, a)
# gene value probe
# 1 gene1 0.001 probe1
# 2 gene1 0.001 probe4
# 3 gene2 0.050 probe3
But because probe1 and probe4 have the same value, gene1 appears twice, so I need to pick one of the two rows (it doesn't matter which). I can do that like this:
# THIS IS THE OUTPUT THAT I WANT
c <- aggregate(b, by=list(b$gene), function(x) x[1])[,-1]
# gene value probe
# 1 gene1 0.001 probe1
# 2 gene2 0.050 probe3
The problem is that I use this inside a loop, and if I apply it to a data frame that has no duplicates, it fails:
aggregate(c, by=list(b$gene), function(x) x[1])[,-1]
# Error in aggregate.data.frame(c, by = list(b$gene), function(x) x[1]) : arguments must have same length
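I could check whether any gene is duplicated (several probes tied at the minimum) before applying the second aggregate, but I'm sure there is a better way.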
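EDIT: There was a mistake in my code: the failing call grouped `c` (2 rows) by `b$gene` (3 values), hence the length error. This actually works perfectly: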
b <- merge(aggregate(value~gene, data=a, FUN=min), a);
aggregate(b, by=list(b$gene), function(x) x[1])[,-1]
But the question still stands: is there a less roundabout way to do this?
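
One possibly less roundabout option, as a minimal base-R sketch rather than a definitive answer: sort the rows by gene and value, then keep the first row of each gene with `duplicated()`. The names `sorted` and `result` are made up for illustration. This assumes any one of the tied probes is acceptable, as stated above.

```r
# Example data from the question
a <- data.frame(
  probe = c("probe1", "probe2", "probe3", "probe4"),
  gene  = c("gene1", "gene1", "gene2", "gene1"),
  value = c(.001, .1, .05, .001)
)

# Sort so that the smallest value of each gene comes first
sorted <- a[order(a$gene, a$value), ]

# Keep only the first row seen for each gene (one row per gene, ties dropped)
result <- sorted[!duplicated(sorted$gene), ]
result
```

Since this never groups one data frame by a vector taken from another, it should behave the same in a loop whether or not a gene has tied minimum values.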