I need to filter some trade data, but I am unsure how to go about it. Here is a simple example of my data:
library(data.table)
set.seed(1)
start.date <- as.POSIXct("2011-01-01 09:30:01", tz = "GMT")
dates <- seq(start.date, length = 10, by = "days")
tr_dt <- as.integer(gsub("-", "", as.Date(dates)))
DT <- data.table(TM_STMP = dates, PR = format(rlnorm(10, 2), digits = 2), VOL = rpois(10, 200), TRD_EXCTN_DT = tr_dt, TRD_RPT_DT = tr_dt, ASOF_CD = "")
DT[5] <- DT[2]
DT[6] <- DT[2]
DT[7] <- DT[2]
DT[8] <- DT[2]
DT$TRD_RPT_DT[5] <- 20131109
DT$TRD_RPT_DT[6] <- 20131109
DT$TRD_RPT_DT[7] <- 20131109
DT$TRD_RPT_DT[8] <- 20131109
DT$ASOF_CD[5] <- "R"
DT$ASOF_CD[6] <- "A"
DT$ASOF_CD[7] <- "R"
DT$ASOF_CD[8] <- "A"
DT
                TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
 1: 2011-01-01 09:30:01  3.9 221     20110101   20110101
 2: 2011-01-02 09:30:01  8.9 205     20110102   20110102
 3: 2011-01-03 09:30:01  3.2 191     20110103   20110103
 4: 2011-01-04 09:30:01 36.4 195     20110104   20110104
 5: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R
 6: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
 7: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R
 8: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
 9: 2011-01-09 09:30:01 13.1 208     20110109   20110109
10: 2011-01-10 09:30:01  5.4 212     20110110   20110110
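What I want to do is:
1) Take all rows with ASOF_CD == "R" and match them with rows with ASOF_CD == "" based on TM_STMP, PR, and TRD_EXCTN_DT (for ASOF_CD == "") < TRD_RPT_DT (for ASOF_CD == "R"). Only one "" row may be matched to each "R" row. This should result in: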
               TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
2: 2011-01-02 09:30:01  8.9 205     20110102   20110102
5: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R
2) Remove these matches from the data.table, both the "R" row and the "" row. The data.table then looks like:
               TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
 1: 2011-01-01 09:30:01  3.9 221     20110101   20110101
 2: 2011-01-03 09:30:01  3.2 191     20110103   20110103   
 3: 2011-01-04 09:30:01 36.4 195     20110104   20110104   
 4: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
 5: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R
 6: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
 7: 2011-01-09 09:30:01 13.1 208     20110109   20110109
 8: 2011-01-10 09:30:01  5.4 212     20110110   20110110
3) Take all remaining rows with ASOF_CD == "R" and match them with rows with ASOF_CD == "A" based on TM_STMP, PR, and TRD_EXCTN_DT (for ASOF_CD == "A") <= TRD_RPT_DT (for ASOF_CD == "R"). Only one "A" row may be matched to each "R" row. The match is:
                TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
 4: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
 5: 2011-01-02 09:30:01  8.9 205     20110102   20131109       R
4) Remove these matches from the data.table, both the "R" row and the "A" row. The final data.table looks like this:
                TM_STMP   PR VOL TRD_EXCTN_DT TRD_RPT_DT ASOF_CD
 1: 2011-01-01 09:30:01  3.9 221     20110101   20110101
 2: 2011-01-03 09:30:01  3.2 191     20110103   20110103
 3: 2011-01-04 09:30:01 36.4 195     20110104   20110104
 4: 2011-01-02 09:30:01  8.9 205     20110102   20131109       A
 5: 2011-01-09 09:30:01 13.1 208     20110109   20110109
 6: 2011-01-10 09:30:01  5.4 212     20110110   20110110
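The four steps above can be sketched as a plain greedy loop. This is only one possible reading of the requirements, not a join-based solution: the helper name `drop_pairs` and its `strict` argument are hypothetical, the example rows are typed in by hand from the tables above, and each "R" row is assumed to take the first still-unused candidate in row order.

```r
library(data.table)

# Example rows typed in by hand (assumption: same content as the table above)
DT <- data.table(
  TM_STMP      = as.POSIXct("2011-01-01 09:30:01", tz = "GMT") +
                 86400 * c(0, 1, 2, 3, 1, 1, 1, 1, 8, 9),
  PR           = c("3.9", "8.9", "3.2", "36.4", "8.9", "8.9", "8.9", "8.9", "13.1", "5.4"),
  VOL          = c(221L, 205L, 191L, 195L, 205L, 205L, 205L, 205L, 208L, 212L),
  TRD_EXCTN_DT = c(20110101L, 20110102L, 20110103L, 20110104L,
                   rep(20110102L, 4), 20110109L, 20110110L),
  TRD_RPT_DT   = c(20110101L, 20110102L, 20110103L, 20110104L,
                   rep(20131109L, 4), 20110109L, 20110110L),
  ASOF_CD      = c("", "", "", "", "R", "A", "R", "A", "", "")
)

# Hypothetical helper: pair every "R" row with at most one unused row whose
# ASOF_CD equals `code` and that shares TM_STMP and PR, then drop both rows.
# strict = TRUE uses TRD_EXCTN_DT < TRD_RPT_DT, strict = FALSE uses <=.
drop_pairs <- function(dt, code, strict = TRUE) {
  used <- rep(FALSE, nrow(dt))
  for (r in which(dt$ASOF_CD == "R")) {
    date_ok <- if (strict) {
      dt$TRD_EXCTN_DT < dt$TRD_RPT_DT[r]
    } else {
      dt$TRD_EXCTN_DT <= dt$TRD_RPT_DT[r]
    }
    ok <- !used & dt$ASOF_CD == code &
      dt$TM_STMP == dt$TM_STMP[r] & dt$PR == dt$PR[r] & date_ok
    hit <- which(ok)[1]                      # first unused candidate, if any
    if (!is.na(hit)) used[c(r, hit)] <- TRUE # mark both rows of the pair
  }
  dt[which(!used)]                           # keep only unpaired rows
}

step2 <- drop_pairs(DT, "", strict = TRUE)       # steps 1) and 2)
final <- drop_pairs(step2, "A", strict = FALSE)  # steps 3) and 4)
print(final)
```

The loop is quadratic in the worst case, so it is meant to pin down the intended pairing rule on small data rather than to be fast.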
I started with the first task and tried:
setkey(DT, "TM_STMP", "PR", "TRD_EXCTN_DT")
DT[ASOF_CD == ""][DT[ASOF_CD == "R", list(TM_STMP, PR, TRD_RPT_DT)], roll = Inf, nomatch = 0, mult = "first"]
I used roll = Inf to match TRD_EXCTN_DT < TRD_RPT_DT, and mult = "first" to get only one match in DT[ASOF_CD == ""], but this gives me two matches:
               TM_STMP   PR TRD_EXCTN_DT VOL TRD_RPT_DT ASOF_CD
1: 2011-01-02 09:30:01  8.9     20131109 205   20110102
2: 2011-01-02 09:30:01  8.9     20131109 205   20110102
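One possible explanation for the two rows, shown on toy data: mult chooses among the rows of x that match a single row of i, so two identical rows in i still produce two result rows, both pointing at the same row of x. The tables x and i and their columns are made up for illustration.

```r
library(data.table)

# Toy tables (hypothetical names and values)
x <- data.table(id = c("a", "a"), d = c(1L, 5L), v = c("x1", "x2"),
                key = c("id", "d"))
i <- data.table(id = c("a", "a"), d = c(9L, 9L))  # two identical lookup rows

# Rolling join: each row of i rolls back to the last x row with d <= 9.
# mult = "first" is applied per row of i, so it does not collapse the result.
res <- x[i, roll = Inf, mult = "first", nomatch = 0L]
print(res)
```

Both result rows carry v == "x2", which mirrors the situation above where both "R" rows roll onto the same "" row.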
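Also, for steps 1) and 2), I do not know how to do the matching so that I get both the "R" row and the "" row. Is there an inner-join solution that immediately gives me the first matching pair of "R" and "", so that I can remove them?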