r - Add a column to one data frame with the count of matches in another data frame

Question

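I want to add a column to one data frame that holds the number of matching rows in another data frame. It seems simple, but I can't get it to work. Example: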

smaller_df$CountOfMatches <- nrow(subset(larger_df, Date == smaller_df$Date))

This gives me the error:

In `==.default`(Date, smaller_df$Date) :
  longer object length is not a multiple of shorter object length

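I realise that the data frames have different lengths, and I'm not asking to merge them. For every row/date in smaller_df (valid Date objects), I only need to count how many matches there are in larger_df.

I'm very new to R, so I must be missing something basic and trivial here.

Thanks in advance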

score 4 · Accepted Answer

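The simplest approach is to build a summary table and then merge it with your original (smaller) data. A reproducible example would have helped, so here is some reproducible data: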

smaller_df <- data.frame(Date=seq(as.Date("2000-01-01"), 
                                  as.Date("2000-01-10"), by="1 day"))
set.seed(5)
larger_df <- data.frame(Date=sample(seq(as.Date("2000-01-01"), 
                                        as.Date("2000-01-20"), by="1 day"),
                                    80, replace=TRUE))

Create a table of the dates (counts) in larger_df:

tbl <- table(larger_df$Date)

Convert it to a data.frame suitable for merging:

counts <- data.frame(Date=as.Date(names(tbl)), CountOfMatches=as.vector(tbl))

Then merge on Date. Note that if a date appears in smaller_df but not in larger_df, CountOfMatches will be NA rather than 0.

merge(smaller_df, counts, all.x=TRUE)

For this sample data, you get

> merge(smaller_df, counts, all.x=TRUE)
         Date CountOfMatches
1  2000-01-01              4
2  2000-01-02              2
3  2000-01-03              5
4  2000-01-04              4
5  2000-01-05              5
6  2000-01-06              6
7  2000-01-07              2
8  2000-01-08              5
9  2000-01-09              3
10 2000-01-10              3
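If zeros are preferred over NA for dates that never occur in larger_df, the NA entries can be replaced after the merge. The following is a minimal sketch on a small made-up data set (the dates below are illustrative, not the sample data above); it follows the same table-then-merge steps and then fills the gaps with 0.

```r
# Illustrative data: 2000-01-20 is not in smaller_df,
# and several smaller_df dates are not in larger_df.
smaller_df <- data.frame(Date = seq(as.Date("2000-01-01"),
                                    as.Date("2000-01-05"), by = "1 day"))
larger_df <- data.frame(Date = as.Date(c("2000-01-01", "2000-01-01",
                                         "2000-01-03", "2000-01-20")))

# Count occurrences of each date in larger_df
tbl <- table(larger_df$Date)
counts <- data.frame(Date = as.Date(names(tbl)),
                     CountOfMatches = as.vector(tbl))

# Left join: keep every row of smaller_df
result <- merge(smaller_df, counts, all.x = TRUE)

# Dates with no match come back as NA; turn them into 0
result$CountOfMatches[is.na(result$CountOfMatches)] <- 0L
result
```

Using 0L rather than 0 keeps the column as integer.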

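Edit:

A more concise version, which uses a package (one that provides convenience functions to get rid of some of the conversion details), is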

library("plyr")
merge(smaller_df, 
      ddply(larger_df, .(Date), summarise, CountOfMatches=length(Date)),
      all.x = TRUE)

Same result and, effectively, the same logic. The same caveat applies to dates that do not appear in larger_df.

score 4 · Accepted Answer

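There is a way to do this with the data.table package. It is a package for handling large datasets efficiently in memory, and it allows SQL-like or SAS data-step-like operations, but the square brackets [] behave differently than they do for data.frame objects. You can put data.table joins, expressions and aggregations inside []. Read the data.table manual to learn more.

First, convert your two frames to data.table objects and set the key column to Date. The data.table objects will be sorted by Date and can then be joined.

Using the same sample data as above: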

library(data.table)
smaller_df <- data.table(data.frame(Date=seq(as.Date("2000-01-01"), 
    as.Date("2000-01-10"), by="1 day")))
set.seed(5)
larger_df <- data.table(data.frame(Date=sample(seq(as.Date("2000-01-01"), 
    as.Date("2000-01-20"), by="1 day"), 80, replace=TRUE)))

Set the key column to Date:

setkey(smaller_df, Date)
setkey(larger_df, Date)

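You can use the by-without-by syntax and take advantage of the fact that both tables are keyed by Date. .N returns the number of rows in the subset (that is, the number of rows whose date matches). In data.table 1.9.4 and later, this per-row grouping has to be requested explicitly with by = .EACHI; without it, .N is the row count of the whole join.

larger_df[smaller_df, .N, by = .EACHI]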
##         Date   N
##  1: 2000-01-01 4
##  2: 2000-01-02 2
##  3: 2000-01-03 5
##  4: 2000-01-04 4
##  5: 2000-01-05 5
##  6: 2000-01-06 6
##  7: 2000-01-07 2
##  8: 2000-01-08 5
##  9: 2000-01-09 3
## 10: 2000-01-10 3
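Since the question asks for the count as a new column on the smaller table, the join result can be assigned back by reference. This is a rough sketch, assuming a data.table version that supports on = and by = .EACHI (1.9.6 or later); the table names smaller_dt and larger_dt and the dates are made up for illustration.

```r
library(data.table)

# Illustrative data, not the sample data above
smaller_dt <- data.table(Date = seq(as.Date("2000-01-01"),
                                    as.Date("2000-01-05"), by = "1 day"))
larger_dt <- data.table(Date = as.Date(c("2000-01-01", "2000-01-01",
                                         "2000-01-03", "2000-01-20")))

# One result row per row of smaller_dt, in smaller_dt's order;
# .N is evaluated separately for each row of smaller_dt
counts <- larger_dt[smaller_dt, .N, on = "Date", by = .EACHI]

# Add the counts to smaller_dt by reference
smaller_dt[, CountOfMatches := counts$N]
smaller_dt
```

With on = the keys do not have to be set beforehand, so the setkey() calls are optional in this variant.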

score 4 · Accepted Answer

This seems fairly straightforward:

smaller_df$bigDfCount <-sapply( smaller_df$Date,  
                        FUN=function(x) length(larger_df[larger_df$Date==x, "Date"] ) )
smaller_df

         Date bigDfCount
1  2000-01-01          4
2  2000-01-02          2
3  2000-01-03          5
4  2000-01-04          4
5  2000-01-05          5
6  2000-01-06          6
7  2000-01-07          2
8  2000-01-08          5
9  2000-01-09          3
10 2000-01-10          3
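The sapply() call above scans larger_df once per row of smaller_df. A vectorised alternative in base R, sketched here on made-up data and assuming the dates in smaller_df are unique, is to look up each larger_df date's position in smaller_df with match() and then count the positions with tabulate(). Dates without a match get 0.

```r
# Illustrative data: 2000-01-20 has no counterpart in smaller_df
smaller_df <- data.frame(Date = seq(as.Date("2000-01-01"),
                                    as.Date("2000-01-05"), by = "1 day"))
larger_df <- data.frame(Date = as.Date(c("2000-01-01", "2000-01-01",
                                         "2000-01-03", "2000-01-20")))

# Position in smaller_df of each larger_df date (NA when absent)
idx <- match(larger_df$Date, smaller_df$Date)

# Count how often each position occurs, one bin per row of smaller_df
smaller_df$bigDfCount <- tabulate(idx[!is.na(idx)],
                                  nbins = nrow(smaller_df))
smaller_df
```

Because match() returns only the first matching position, a duplicated date in smaller_df would get the full count on its first occurrence and 0 on the others, hence the uniqueness assumption.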
