r - Get unique records from a data frame in R based on a condition on a secondary field

Question

Updated and simplified

I have a very large table (about 7 million records) with the following structure.

temp <- read.table(header = TRUE, stringsAsFactors=FALSE,
                   text = "Website Datetime    Rating
A 2007-12-06T14:53:07Z        1
A 2006-07-28T03:52:26Z        4
B 2006-11-02T11:06:25Z        2
C 2007-06-19T06:56:08Z        5
C 2009-11-28T22:27:58Z        2
C 2009-11-28T22:28:13Z        2")

What I want to retrieve is each unique website together with its highest rating:

Website    Rating
A    4
B    2
C    5

I tried using a for loop, but it is too slow. Is there another way to achieve this?

score 3 · Accepted Answer

do.call(rbind, lapply(split(temp, temp$Website),
                      function(d) d[which.max(d$Rating), ]))
  Website             Datetime Rating
A       A 2006-07-28T03:52:26Z      4
B       B 2006-11-02T11:06:25Z      2
C       C 2007-06-19T06:56:08Z      5

Since your "Datetime" variable is not actually a date or date-time object yet, you should probably convert it to one first.
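One possible way to do that conversion is sketched below. It assumes the trailing "Z" in the timestamps means UTC and that every value follows the same ISO 8601 layout as the sample; neither is confirmed by the question.

temp <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Website Datetime    Rating
A 2007-12-06T14:53:07Z        1
A 2006-07-28T03:52:26Z        4
B 2006-11-02T11:06:25Z        2
C 2007-06-19T06:56:08Z        5
C 2009-11-28T22:27:58Z        2
C 2009-11-28T22:28:13Z        2")

# Assumption: "Z" denotes UTC, so the time zone is set explicitly.
# The literal "T" and "Z" in the format string must match the input exactly.
temp$Datetime <- as.POSIXct(temp$Datetime,
                            format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Values that do not match the format become NA, so it is worth checking.
stopifnot(!anyNA(temp$Datetime))
str(temp$Datetime)

Once the column is POSIXct, comparisons such as max(temp$Datetime) work on actual times rather than on character strings.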

which.max selects the first element that attains the maximum.

>  which.max(c(1,1,2,2))
[1] 3

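So Ananda's warning may not be correct in that regard. A data.table approach would certainly be faster, and may also succeed on a machine with modest memory. The method above may make several copies of the data along the way, whereas data.table functions do not need to do as much copying.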
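To make the tie behaviour concrete, here is a small sketch on made-up data (the `tied` data frame and its `Visit` column are hypothetical, not from the question). It contrasts which.max(), which keeps one row per group, with a comparison against max(), which keeps every tied row.

# Hypothetical data: website A has two rows tied at the maximum rating.
tied <- data.frame(Website = c("A", "A", "B"),
                   Rating  = c(3, 3, 1),
                   Visit   = c("first", "second", "only"),
                   stringsAsFactors = FALSE)

# which.max() returns the index of the first maximum,
# so exactly one row per website survives.
first_max <- do.call(rbind, lapply(split(tied, tied$Website),
                                   function(d) d[which.max(d$Rating), ]))

# Comparing against max() keeps all rows tied at the maximum instead.
all_max <- do.call(rbind, lapply(split(tied, tied$Website),
                                 function(d) d[d$Rating == max(d$Rating), ]))

nrow(first_max)  # one row per website
nrow(all_max)    # includes both tied rows for A

Which of the two is appropriate depends on whether duplicated maxima should be reported or collapsed.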

score 2 · Accepted Answer

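I would probably explore the data.table package, but without more details, the following example solution is most likely not what you need. I mention this because, in particular, there might be more than one "Rating" record per group that matches the max; how would you want to deal with those cases?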

library(data.table)
temp <- read.table(header = TRUE, stringsAsFactors=FALSE,
                text = "Website Datetime    Rating
                        A       2012-10-9   10
                        A       2012-11-10  12
                        B       2011-10-9   5")
DT <- data.table(temp, key="Website")
DT
#    Website   Datetime Rating
# 1:       A  2012-10-9     10
# 2:       A 2012-11-10     12
# 3:       B  2011-10-9      5
DT[, list(Datetime = Datetime[which.max(Rating)], 
          Rating = max(Rating)), by = key(DT)]
#    Website   Datetime Rating
# 1:       A 2012-11-10     12
# 2:       B  2011-10-9      5

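To get better answers, I would suggest including information such as how your Datetime variable should factor into the aggregation, or whether there can be more than one "max" value per group.

If you want all the rows that match the max, the fix is easy: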

DT[, list(Datetime = Datetime[Rating == max(Rating)], 
          Rating = max(Rating)), by = key(DT)]

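If you really just want the Rating column, there are many ways to go about this. Following the same steps as above to convert the question's data to a data.table, try:

DT[, list(Rating = max(Rating)), by = key(DT)]
#    Website Rating
# 1:       A      4
# 2:       B      2
# 3:       C      5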

Or, keeping the original "temp" data.frame, try aggregate():

aggregate(Rating ~ Website, temp, max)
#   Website Rating
# 1       A      4
# 2       B      2
# 3       C      5

Another approach, using ave:

temp[with(temp, Rating == ave(Rating, Website, FUN=max)), ]
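As a rough illustration of how the ave() filter works, the sketch below rebuilds the question's sample data and shows the intermediate per-group maximum. The variable names `group_max` and `result` are introduced here only for illustration.

temp <- read.table(header = TRUE, stringsAsFactors = FALSE,
                   text = "Website Datetime    Rating
A 2007-12-06T14:53:07Z        1
A 2006-07-28T03:52:26Z        4
B 2006-11-02T11:06:25Z        2
C 2007-06-19T06:56:08Z        5
C 2009-11-28T22:27:58Z        2
C 2009-11-28T22:28:13Z        2")

# ave() returns a vector as long as the input, where each element is
# replaced by the maximum Rating of its own Website group.
group_max <- with(temp, ave(Rating, Website, FUN = max))
group_max

# Keeping the rows whose Rating equals their group maximum retains the
# full rows (including Datetime). Tied maxima would all be kept.
result <- temp[temp$Rating == group_max, ]
result

Unlike the which.max() approach, this keeps the original row order and row names, and it returns more than one row for a website whenever the maximum rating is shared.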
