r - How to match two data.frames with an inexact matching identifier (one identifier has to be in the range of the other)

Question

I have the following matching problem: I have two data.frames, one with an observation every month (per company ID), and one with an observation every quarter (per company ID; note that quarter means fiscal quarter; therefore 1Q = Jan, Feb, Mar is not necessarily correct and also, a fiscal quarter is not necessarily 3 month long).

For every month and company, I want to get the correct value of that quarter. Consequently, several months have the same value for one quarter. As an example see the code below:

monthlyData <- data.frame(ID = rep(c("A", "B"), each = 5),
                  Month = rep(1:5, times = 2),
                  MonValue = 1:10)
monthlyData
   ID Month MonValue
1   A     1        1
2   A     2        2
3   A     3        3
4   A     4        4
5   A     5        5
6   B     1        6
7   B     2        7
8   B     3        8
9   B     4        9
10  B     5       10

#Quarterly data, i.e. the value of every quarter has to be matched to several months in d1
#However, I want to match fiscal quarters, which means that one quarter is not necessarily 3 month long
qtrData <- data.frame(ID = rep(c("A", "B"), each = 2),
                  startMonth = c(1, 4, 1, 3),
                  endMonth   = c(3, 5, 2, 5),
                  QTRValue   = 1:4)
qtrData
  ID startMonth endMonth QTRValue
1  A          1        3        1
2  A          4        5        2
3  B          1        2        3
4  B          3        5        4

#Desired output
   ID Month MonValue QTRValue
1   A     1        1        1
2   A     2        2        1
3   A     3        3        1
4   A     4        4        2
5   A     5        5        2
6   B     1        6        3
7   B     2        7        3
8   B     3        8        4
9   B     4        9        4
10  B     5       10        4

Note: This question was posted on R-help months ago, but I didn't get any answer then and found a solution myself (see R-help). Now, however, I posted a question on stackoverflow where I have a question regarding the data.table where this problem was mentioned as well and there, Andrie asked me to post this question again because he apparently has a good solution for it (see Question on SO)

UPDATE: See Matthew Dowle's comment: how does the real data look?

This data is a more realistic one. I added a few rows, but the only main part that changed is column endMonth in qtrData. More precisely, the startMonth is not necessarily the endMonth of the previous quarter plus one month anymore. Therefore, using the roll option, I think that you need another line of code (if not, you get 20 rows back, but with Andrie's solution, which is the desired one, you get 17 rows back). Then there is no performance difference anymore, if I don't miss anything here.

monthlyData_new <- data.table(ID = rep(c("A", "B"), each = 10),
                  Month = rep(1:10, times = 2),
                  MonValue = 1:20)

qtrData_new <- data.table(ID = rep(c("A", "B"), each = 3),
                  startMonth = c(1, 4, 7, 1, 3, 8),
                  endMonth   = c(3, 5, 10, 2, 5, 10),
                  QTRValue   = 1:6)

setkey(qtrData_new, ID)
setkey(monthlyData_new, ID)

qtrData1 <- qtrData_new
setkey(qtrData1, ID, startMonth)
monthlyData1 <- monthlyData_new
setkey(monthlyData1, ID, Month)

withTable1 <- function(){
  xx <- qtrData1[monthlyData1, roll=TRUE]
  xx <- xx[startMonth <= endMonth]

}

withTable2 <- function(){
  yy <- monthlyData_new[qtrData_new][Month >= startMonth & Month <= endMonth]

}

benchmark(withTable1, withTable2, replications=1e6)
        test replications elapsed relative user.self sys.self user.child sys.child
1 withTable1      1000000   4.244 1.028599     4.232    0.008          0         0
2 withTable2      1000000   4.126 1.000000     4.096    0.028          0         0

score 5 · Accepted Answer

尝试这个：

mD = data.table(monthlyData, key="ID,Month")
qD = data.table(qtrData,key="ID,startMonth")
qD[mD,roll=TRUE]
      ID startMonth endMonth QTRValue MonValue
 [1,]  A          1        3        1        1
 [2,]  A          2        3        1        2
 [3,]  A          3        3        1        3
 [4,]  A          4        5        2        4
 [5,]  A          5        5        2        5
 [6,]  B          1        2        3        6
 [7,]  B          2        2        3        7
 [8,]  B          3        5        4        8
 [9,]  B          4        5        4        9
[10,]  B          5        5        4       10

那应该快得多。

编辑：回答有问题的后续编辑。一种方法是使用 NA 来存储缺失月份的位置。我发现查看一个时间序列列（不规则的间隙和 NA）比查看一系列范围的两个更容易。

> mD <- data.table(ID = rep(c("A", "B"), each = 10),
+                  Month = rep(1:10, times = 2),
+                  MonValue = 1:20,  key="ID,Month")
>                  
> qD <- data.table(ID = rep(c("A", "B"), each = 4),
+                   Month = c(1,4,6,7, 1,3,6,8),
+                   QtrValue = c(1,2,NA,3, 4,5,NA,6),
+                   key="ID,Month")
>                   
> mD
      ID Month MonValue
 [1,]  A     1        1
 [2,]  A     2        2
 [3,]  A     3        3
 [4,]  A     4        4
 [5,]  A     5        5
 [6,]  A     6        6
 [7,]  A     7        7
 [8,]  A     8        8
 [9,]  A     9        9
[10,]  A    10       10
[11,]  B     1       11
[12,]  B     2       12
[13,]  B     3       13
[14,]  B     4       14
[15,]  B     5       15
[16,]  B     6       16
[17,]  B     7       17
[18,]  B     8       18
[19,]  B     9       19
[20,]  B    10       20
> qD
     ID Month QtrValue
[1,]  A     1        1
[2,]  A     4        2
[3,]  A     6       NA     # missing for 1 month  (6)
[4,]  A     7        3
[5,]  B     1        4
[6,]  B     3        5
[7,]  B     6       NA     # missing for 2 months (6 and 7)
[8,]  B     8        6
> qD[mD,roll=TRUE]
      ID Month QtrValue MonValue
 [1,]  A     1        1        1
 [2,]  A     2        1        2
 [3,]  A     3        1        3
 [4,]  A     4        2        4
 [5,]  A     5        2        5
 [6,]  A     6       NA        6
 [7,]  A     7        3        7
 [8,]  A     8        3        8
 [9,]  A     9        3        9
[10,]  A    10        3       10
[11,]  B     1        4       11
[12,]  B     2        4       12
[13,]  B     3        5       13
[14,]  B     4        5       14
[15,]  B     5        5       15
[16,]  B     6       NA       16
[17,]  B     7       NA       17
[18,]  B     8        6       18
[19,]  B     9        6       19
[20,]  B    10        6       20
> qD[mD,roll=TRUE][!is.na(QtrValue)]
      ID Month QtrValue MonValue
 [1,]  A     1        1        1
 [2,]  A     2        1        2
 [3,]  A     3        1        3
 [4,]  A     4        2        4
 [5,]  A     5        2        5
 [6,]  A     7        3        7
 [7,]  A     8        3        8
 [8,]  A     9        3        9
 [9,]  A    10        3       10
[10,]  B     1        4       11
[11,]  B     2        4       12
[12,]  B     3        5       13
[13,]  B     4        5       14
[14,]  B     5        5       15
[15,]  B     8        6       18
[16,]  B     9        6       19
[17,]  B    10        6       20

score 3 · Accepted Answer

这里有两个解决方案，使用 Base R 和data.table. 由于该data.table解决方案比基础 R 快约 30%，而且更易于阅读，因此我建议使用data.table它。

碱基R

既然您表示希望有这种效率，我使用vapply：

matchData <- function(id, month, data=d2){
  vapply(seq_along(id), 
      function(i)which(
            id[i]==data$ID & 
                month[i] >= data$startMonth & 
                month[i] <= data$endMonth),
      FUN.VALUE=1,
      USE.NAMES=FALSE
      )
}


within(monthlyData, 
    Value <- qtrData$QTRValue[matchData(
               monthlyData$ID, monthlyData$Month, qtrData)]
)

   ID Month MonValue Value
1   A     1        1     1
2   A     2        2     1
3   A     3        3     1
4   A     4        4     2
5   A     5        5     2
6   B     1        6     3
7   B     2        7     3
8   B     3        8     4
9   B     4        9     4
10  B     5       10     4

数据表

并且还演示了如何使用data.table：

mD <- data.table(monthlyData, key="ID")
qD <- data.table(qtrData, key="ID")
mD[qD][Month>=startMonth & Month<=endMonth]


      ID Month MonValue startMonth endMonth QTRValue
 [1,]  A     1        1          1        3        1
 [2,]  A     2        2          1        3        1
 [3,]  A     3        3          1        3        1
 [4,]  A     4        4          4        5        2
 [5,]  A     5        5          4        5        2
 [6,]  B     1        6          1        2        3
 [7,]  B     2        7          1        2        3
 [8,]  B     3        8          3        5        4
 [9,]  B     4        9          3        5        4
[10,]  B     5       10          3        5        4

基准

我很好奇这两种方法的比较：

library(rbenchmark)

withBase <- function(){
  xx <- within(monthlyData, 
      Value <- qtrData$QTRValue[matchData(monthlyData$ID, monthlyData$Month, qtrData)])
  
}

withTable <- function(){
  yy <- mD[qD][Month>=startMonth & Month<=endMonth]
  
}

benchmark(withBase, withTable, replications=1e6)

       test replications elapsed relative user.self sys.self user.child
1  withBase      1000000   10.09 1.296915      7.65     0.21         NA
2 withTable      1000000    7.78 1.000000      6.38     0.16         NA

r - How to match two data.frames with an inexact matching identifier (one identifier has to be in the range of the other)

2 回答 2

碱基R

数据表

基准

Related

Reference