r - Allow a maximum number of entries when certain conditions apply

Question

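I have a dataset with many entries. Each entry belongs to a certain ID (belongID). Entries are unique (each has a uniqID), but several entries can come from the same source (sourceID), and several entries from the same source can also share a belongID. For the research I need to do on this dataset, I have to remove entries of a single sourceID that occur more than 5 times for one belongID. The (at most) 5 entries to keep are the ones with the highest "Time" values.

To illustrate this, I have the following example dataset: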

   belongID   sourceID uniqID   Time     
   1           1001     101       5            
   1           1002     102       5        
   1           1001     103       4        
   1           1001     104       3       
   1           1001     105       3     
   1           1005     106       2        
   1           1001     107       2       
   1           1001     108       2       
   2           1005     109       5                
   2           1006     110       5        
   2           1005     111       5        
   2           1006     112       5        
   2           1005     113       5      
   2           1006     114       4        
   2           1005     115       4        
   2           1006     116       3       
   2           1005     117       3                
   2           1006     118       3       
   2           1005     119       2        
   2           1006     120       2        
   2           1005     121       1      
   2           1007     122       1        
   3           1010     123       5        
   3           1480     124       2
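
For anyone who wants to run code against this example, the table above can be rebuilt as a data frame. This is only a sketch: the names `dat` and `counts` are assumptions (`dat` is chosen to match the name used in some of the answers), and the real dataset has many more columns.

```r
# Rebuild the example table as a data frame.
# `dat` is an assumed name; the real data has many more columns.
dat <- data.frame(
  belongID = rep(c(1, 2, 3), times = c(8, 14, 2)),
  sourceID = c(1001, 1002, 1001, 1001, 1001, 1005, 1001, 1001,
               1005, 1006, 1005, 1006, 1005, 1006, 1005,
               1006, 1005, 1006, 1005, 1006, 1005, 1007,
               1010, 1480),
  uniqID   = 101:124,
  Time     = c(5, 5, 4, 3, 3, 2, 2, 2,
               5, 5, 5, 5, 5, 4, 4, 3, 3, 3, 2, 2, 1, 1,
               5, 2)
)

# Count entries per belongID/sourceID pair; pairs with more than 5 need trimming.
counts <- aggregate(uniqID ~ belongID + sourceID, data = dat, FUN = length)
counts[counts$uniqID > 5, ]
```

In this example only three pairs exceed the limit: sourceID 1001 within belongID 1, and sourceIDs 1005 and 1006 within belongID 2.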

The final sample should look like this:

   belongID   sourceID uniqID   Time     
   1           1001     101       5            
   1           1002     102       5        
   1           1001     103       4        
   1           1001     104       3       
   1           1001     105       3     
   1           1005     106       2        
   1           1001     107       2           
   2           1005     109       5                
   2           1006     110       5        
   2           1005     111       5        
   2           1006     112       5        
   2           1005     113       5      
   2           1006     114       4        
   2           1005     115       4        
   2           1006     116       3       
   2           1005     117       3                
   2           1006     118       3           
   2           1007     122       1        
   3           1010     123       5        
   3           1480     124       2

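The file has more columns containing data, but the selection must be based purely on Time. As the example shows, the 5th and 6th entries of a sourceID with the same belongID may have the same Time. In that case only one of them should be kept, because max = 5.

For illustration the dataset here is neatly ordered by belongID and Time, but that is not the case in the real dataset. Any idea how to tackle this? I have not come across anything similar yet.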

score 1 · Accepted Answer

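Assuming your data is in df, the following produces output ordered by uniqID:

# Count entries for every belongID/sourceID combination (NA where none exist)
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for(i in bIDs)
{
    if(all(is.na(tab[bIDs == i, ])))next
    # sourceIDs with more than 5 entries for this belongID
    ids <- na.omit(sIDs[tab[i, sIDs] > 5])
    for(j in ids)
    {
        cond <- df$belongID == i & df$sourceID == j
        old <- df[cond,]
        # positions of the 5 rows with the highest Time
        id5 <- order(old$Time, decreasing = TRUE)[1:5]
        new <- old[id5,]
        # replace the whole group with its top 5 rows
        df <- df[!cond,]
        df <- rbind(df, new)
    }
}
df[order(df$uniqID), ]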

score 1 · Accepted Answer

A two-line solution using the plyr package:

library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]

Result:

   belongID sourceID uniqID Time
5         1     1001    101    5
6         1     1002    102    5
4         1     1001    103    4
2         1     1001    104    3
3         1     1001    105    3
7         1     1005    106    2
1         1     1001    108    2
10        2     1005    109    5
16        2     1006    110    5
11        2     1005    111    5
17        2     1006    112    5
12        2     1005    113    5
15        2     1006    114    4
9         2     1005    115    4
13        2     1006    116    3
8         2     1005    117    3
14        2     1006    118    3
18        2     1007    122    1
19        3     1010    123    5
20        3     1480    124    2

score 1 · Accepted Answer

If dat is your data frame:

do.call(rbind, 
        by(dat, INDICES=list(dat$belongID, dat$sourceID), 
           FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
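
Whichever approach is used, the result can be sanity-checked by confirming that no belongID/sourceID pair keeps more than 5 rows and that only the lowest Time values were dropped. The sketch below uses made-up toy data, and the names `toy`, `res` and `dropped` are assumptions for illustration.

```r
# Toy data (made up for illustration): one pair with 7 rows, one with 2.
toy <- data.frame(
  belongID = c(rep(1, 7), rep(2, 2)),
  sourceID = c(rep(1001, 7), rep(1002, 2)),
  uniqID   = 1:9,
  Time     = c(1, 5, 3, 2, 4, 5, 1, 2, 3)
)

# Keep at most 5 rows per pair, preferring the highest Time values.
res <- do.call(rbind,
               by(toy, INDICES = list(toy$belongID, toy$sourceID),
                  FUN = function(x) head(x[order(x$Time, decreasing = TRUE), ], 5)))

# Check 1: no pair has more than 5 rows left.
max(table(res$belongID, res$sourceID))

# Check 2: the smallest Time kept in the trimmed pair is not below any dropped Time.
dropped <- toy[!toy$uniqID %in% res$uniqID, ]
min(res$Time[res$sourceID == 1001]) >= max(dropped$Time)
```

Pairs that start with 5 or fewer rows, such as the second pair here, pass through unchanged.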

score 0 · Accepted Answer

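The dataset this method will be used on has more than 170,000 entries and almost 30 columns.

Benchmarking each of the three solutions provided by danas.zuokas, mplourde and Andrie on my dataset gave the following results:

danas.zuokas's solution: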

   User     System  Elapsed 
   2829.569   0     2827.86

mplourde's solution:

   User     System  Elapsed 
   765.628  0.000   763.908

Andrie's solution:

   User     System  Elapsed 
   984.989  0.000   984.010

So I will go with mplourde's solution. Thanks, everyone!
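
The User/System/Elapsed columns above have the shape of `system.time()` output, although the post does not say how the timings were taken. A minimal sketch of such a measurement follows, run on synthetic data. The sizes and the names `n`, `sim`, `trim_by`, `out` and `timing` are assumptions for illustration, and the function wraps the by()-based approach from the answers.

```r
set.seed(1)
n <- 2000  # assumed size, far smaller than the real dataset
sim <- data.frame(
  belongID = sample(1:50, n, replace = TRUE),
  sourceID = sample(1001:1020, n, replace = TRUE),
  uniqID   = seq_len(n),
  Time     = sample(1:5, n, replace = TRUE)
)

# The by()-based approach from the answers, wrapped in a function.
trim_by <- function(d) {
  do.call(rbind,
          by(d, INDICES = list(d$belongID, d$sourceID),
             FUN = function(x) head(x[order(x$Time, decreasing = TRUE), ], 5)))
}

# system.time() reports user, system and elapsed seconds for one run.
timing <- system.time(out <- trim_by(sim))
timing[c("user.self", "sys.self", "elapsed")]
```

Timings from a single run on small synthetic data say little about behaviour on 170,000 rows, so any comparison is best repeated on the real data.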

score 0 · Accepted Answer

This should be faster, using data.table:

library(data.table)
DT = as.data.table(dat)

DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]

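Aside: I suggest counting how many times the same variable names are repeated in the various answers to this question. Do you ever have a lot of long or similar object names?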
