Is there an elegant way to fill gaps for missing time periods in data.table, as timetk::pad_by_time and tsibble::fill_gaps do?

The data might look like this:

library(data.table)
data<-data.table(Date = c("2020-01-01","2020-01-01","2020-01-01","2020-02-01","2020-02-01","2020-03-01","2020-03-01","2020-03-01"),
             Card = c(1,2,3,1,3,1,2,3),
             A = rnorm(8)
)

Here there is an implicitly missing observation for Card 2 on 2020-02-01.

With the tsibble package, you can do the following:

library(tsibble)
library(lubridate)  # provides ymd()
data <- data[, .(Date = yearmonth(ymd(Date)),
                 Card = as.character(Card),
                 A = as.numeric(A))]
data <- as_tsibble(data, key = Card, index = Date)
data <- fill_gaps(data)

With the timetk package, you can do the following:

library(timetk)
library(dplyr)      # provides %>%, group_by(), ungroup()
library(lubridate)  # provides ymd()
data <- data[, .(Date = ymd(Date),
                 Card = as.character(Card),
                 A = as.numeric(A))]
data <- data %>%
  group_by(Card) %>%
  pad_by_time(Date, .by = "month") %>%
  ungroup()

1 Answer

Using just data.table:

If no key is set, then:

data2 <- data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
data2
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1  1.37095845
# 2: 2020-01-01     2 -0.56469817
# 3: 2020-01-01     3  0.36312841
# 4: 2020-02-01     1  0.63286260
# 5: 2020-02-01     2          NA
# 6: 2020-02-01     3  0.40426832
# 7: 2020-03-01     1 -0.10612452
# 8: 2020-03-01     2  1.51152200
# 9: 2020-03-01     3 -0.09465904

(Updated/simplified, thanks to @sindri_baldur!)

If a key is set, you can use @Frank's method:

data2 <- data[ do.call(CJ, c(mget(key(data)), unique = TRUE)), ]
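
To see the keyed variant end to end, here is a minimal sketch on a small hypothetical table (the name dt and the values in A are made up for illustration and are not the rnorm data from the question). It assumes the key is set on exactly the columns whose combinations should be expanded.

```r
library(data.table)

# Hypothetical toy data: Card 2 has no row for 2020-02-01.
dt <- data.table(
  Date = c("2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-03-01"),
  Card = c(1, 2, 1, 1, 2),
  A    = c(10, 20, 11, 12, 22)
)
setkey(dt, Date, Card)

# mget(key(dt)) is evaluated inside i, where the columns of dt are visible,
# so it returns the key columns as a named list; CJ() then builds every
# unique Date x Card combination and the keyed join fills the gaps with NA.
full <- dt[do.call(CJ, c(mget(key(dt)), unique = TRUE))]
full
```

Because the join relies on the key, no on= argument is needed; the trade-off is that the expansion always follows whatever columns are currently keyed.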

From here, you can use nafill as needed, perhaps:

data2[, A := nafill(A, type = "locf"), by = .(Card)]
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1  1.37095845
# 2: 2020-01-01     2 -0.56469817
# 3: 2020-01-01     3  0.36312841
# 4: 2020-02-01     1  0.63286260
# 5: 2020-02-01     2 -0.56469817
# 6: 2020-02-01     3  0.40426832
# 7: 2020-03-01     1 -0.10612452
# 8: 2020-03-01     2  1.51152200
# 9: 2020-03-01     3 -0.09465904

(How you fill depends on what you know about the context of the data; it could just as easily be by=.(Date), or some form of imputation.)
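
As a rough illustration of those alternatives, the sketch below fills the same gap three different ways on a small hypothetical table (dt, the values in A, and the new column names A_const, A_datemean and A_nocb are all made up for this example). Which one is appropriate depends entirely on the data; none of them is presented as the right choice.

```r
library(data.table)

# Hypothetical toy data: Card 2 has no row for 2020-02-01.
dt <- data.table(
  Date = c("2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-03-01"),
  Card = c(1, 2, 1, 1, 2),
  A    = c(10, 20, 11, 12, 22)
)
full <- dt[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]

# Option 1: fill with a constant.
full[, A_const := nafill(A, type = "const", fill = 0)]

# Option 2: fill with the mean of the cards observed on the same Date.
full[, A_datemean := fifelse(is.na(A), mean(A, na.rm = TRUE), A), by = .(Date)]

# Option 3: next observation carried backward within each Card.
full[, A_nocb := nafill(A, type = "nocb"), by = .(Card)]

full
```

Keeping each fill in its own column, as done here, makes it easy to compare strategies side by side before overwriting A.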


Update: the above expands all possible combinations, which may fill outside the span of a particular Card, in which case one might see:

data <- data[-1,]
data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1          NA
# 2: 2020-01-01     2 -0.42225588
# 3: 2020-01-01     3 -0.12235017
# 4: 2020-02-01     1  0.18819303
# 5: 2020-02-01     2          NA
# 6: 2020-02-01     3  0.11916096
# 7: 2020-03-01     1 -0.02509255
# 8: 2020-03-01     2  0.10807273
# 9: 2020-03-01     3 -0.48543524

I think there are two ways to approach this:

  1. Run the code above, then remove the leading (and trailing) NAs per group:

    data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)
      ][, .SD[ !is.na(A) | !seq_len(.N) %in% c(1, .N),], by = Card]
    #     Card       Date           A
    #    <num>     <char>       <num>
    # 1:     1 2020-02-01  0.18819303
    # 2:     1 2020-03-01 -0.02509255
    # 3:     2 2020-01-01 -0.42225588
    # 4:     2 2020-02-01          NA
    # 5:     2 2020-03-01  0.10807273
    # 6:     3 2020-01-01 -0.12235017
    # 7:     3 2020-02-01  0.11916096
    # 8:     3 2020-03-01 -0.48543524
    
  2. A completely different approach (assuming Date class, which was not strictly required above):

    data[,Date := as.Date(Date)]
    data[data[, .(Date = do.call(seq, c(as.list(range(Date)), by = "month"))), 
              by = .(Card)],
         on = .(Date, Card)]
    #          Date  Card           A
    #        <Date> <num>       <num>
    # 1: 2020-01-01     2 -0.42225588
    # 2: 2020-02-01     2          NA
    # 3: 2020-03-01     2  0.10807273
    # 4: 2020-01-01     3 -0.12235017
    # 5: 2020-02-01     3  0.11916096
    # 6: 2020-03-01     3 -0.48543524
    # 7: 2020-02-01     1  0.18819303
    # 8: 2020-03-01     1 -0.02509255
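
The filter in the first approach only inspects the first and last row of each group, so it drops at most one leading and one trailing NA. If a card can be missing for several periods at either end, a variant along these lines may be closer to what is wanted. This is only a sketch on hypothetical data (dt, its values, and the helper variable obs are made up), and it assumes rows are ordered by Date within each Card and that every Card has at least one non-NA value of A.

```r
library(data.table)

# Hypothetical toy data: Card 1 first appears in March, so after the CJ
# expansion it has two leading gaps; Card 2 has one interior gap.
dt <- data.table(
  Date = c("2020-01-01", "2020-02-01", "2020-04-01", "2020-03-01", "2020-04-01"),
  Card = c(2, 2, 2, 1, 1),
  A    = c(20, 21, 23, 12, 13)
)
full <- dt[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]

# Keep only the rows between each card's first and last observed value,
# so interior NAs survive while leading/trailing ones are dropped.
trimmed <- full[, {
  obs <- which(!is.na(A))
  .SD[seq_len(.N) >= min(obs) & seq_len(.N) <= max(obs)]
}, by = Card]

trimmed
```

The second approach avoids this trimming step altogether, since it only ever generates dates inside each card's own range.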
    
Answered 2021-10-28T10:30:39.857