r - Filtering data in R without loops

Question

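I have a fairly large data frame (several million records).
I need to filter it according to the following rule:
- For each product, find the first record where x > 0, then delete every record that comes before the fifth record after it.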

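So we are only interested in two columns: ID and x. The data frame is sorted by ID.
This is easy to do with a loop, but loops perform poorly on a data frame this large.

How can I do this in a vectorized style?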

Example:
Before filtering:

ID  x  
1 0  
1 0  
1 5  # First record with x>0  
1 0  
1 3  
1 4  
1 0   
1 9   
1 0  # Delete all earlier records of that product  
1 0  
1 6  
2 0  
2 1  # First record with x>0   
2 0  
2 4  
2 5  
2 8  
2 0  # Delete all earlier records of that product  
2 1  
2 3

After filtering:

score 4 · Accepted Answer

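For split-apply-combine problems like this one, I like to use plyr. There are other options if speed becomes an issue, but for most things plyr is easy to understand and use. I wrote a function that implements the logic you describe above, then passed it to ddply() so that it operates on each chunk of data, grouped by ID.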

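fun <- function(x, column, threshold, numplus){
  # Position of the first row in this chunk whose value exceeds the threshold
  first <- which(x[[column]] > threshold)[1]
  start <- first + numplus
  # Return no rows if nothing exceeds the threshold or the chunk is too short
  if (is.na(start) || start > nrow(x)) return(x[0, , drop = FALSE])
  # Keep everything from the numplus-th row after the first match onward
  x[start:nrow(x), , drop = FALSE]
}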

Then feed this to ddply():

library(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
  ID x
1  1 9
2  1 0
3  1 0
4  1 6
5  2 0
6  2 1
7  2 3
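
If you would rather avoid the plyr dependency, the same rule can be sketched in base R with ave(), which computes a per-ID value without an explicit loop. This is only a sketch and not part of the plyr approach above: the helper name filter_after_first is made up for illustration, dat is rebuilt here from the example in the question, and rows where x is NA are assumed to count as "not greater than the threshold". I have not benchmarked it on millions of rows, so it is worth timing against ddply() on your own data.

```r
# Sample data rebuilt from the question's example (already sorted by ID)
dat <- data.frame(
  ID = rep(c(1, 2), times = c(11, 9)),
  x  = c(0, 0, 5, 0, 3, 4, 0, 9, 0, 0, 6,
         0, 1, 0, 4, 5, 8, 0, 1, 3)
)

# Hypothetical helper: for each ID, keep the rows starting numplus rows
# after the first row where x > threshold
filter_after_first <- function(df, threshold = 0, numplus = 5) {
  # Position of every row within its own ID group (1, 2, 3, ...)
  pos <- ave(seq_len(nrow(df)), df$ID, FUN = seq_along)
  # TRUE where x exceeds the threshold; NA values are treated as FALSE
  hit <- !is.na(df$x) & df$x > threshold
  # Within-group position of the first hit, Inf when a group has no hit
  first <- ave(ifelse(hit, pos, Inf), df$ID, FUN = min)
  # Groups without a hit compare against Inf and therefore keep nothing
  df[pos >= first + numplus, , drop = FALSE]
}

# Keeps x = 9, 0, 0, 6 for ID 1 and x = 0, 1, 3 for ID 2
filter_after_first(dat)
```

Note that the result keeps the original row names (8 to 11 and 18 to 20 here) instead of renumbering them the way ddply() does; set rownames(result) <- NULL if that matters.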
