Suppose I have a file containing entries like these:

02/10/11 10:26:35 AM UTC, 0
02/10/11 10:26:38 AM UTC, 1
02/10/11 10:26:42 AM UTC, 0

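Is there a straightforward way in R to turn this information into a full-length binary time series (assuming a sampling interval of one second), with the gaps filled in as zeros and ones?

In this example, the series would be: 0 0 0 1 1 1 1 0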

Edit: Since Dirk and Josh provided distinct solutions, I wanted to see how they compare in processing time:

library(xts)
library(data.table)
library(rbenchmark)

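## Build the test series as globals (via <<-) so josh() and dirk() share the same inputs:
## timestamps every Nby seconds over N seconds, with values alternating 0, 1, 0, ...
doseq <- function(N,Nby){
  base.t <<- Sys.time()
  t.seq <<- base.t + seq.int(from=0, to=N, by=Nby)
  n.t <<- length(t.seq)
  val.seq <<- (1:n.t - 1) %% 2
}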

josh <- function(N,Nby=10){
  doseq(N,Nby)
  dt1 <- data.table(time = t.seq, val=val.seq, key="time")
  dt2 <- data.table(time = with(dt1, seq(min(time), max(time), by=1)), key = "time")
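  ## Rolling join. The original run used rolltolast = TRUE, which data.table
  ## later deprecated in favour of roll / rollends.
  dtf <- dt1[dt2, roll = TRUE]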
  return(dtf)
}

dirk <- function(N,Nby=10){
  doseq(N,Nby)
  xt1 <- xts(val.seq, t.seq)
  secs <- seq(start(xt1), end(xt1), by="1 sec")
  xtf <- zoo::na.locf(merge(xt1, xts(, secs)))
  return(xtf)
}

bm <- benchmark(josh(1e2,10), josh(1e3,10), josh(1e4,10), josh(1e5,10), josh(1e6,10),
  dirk(1e2,10), dirk(1e3,10), dirk(1e4,10), dirk(1e5,10), dirk(1e6,10),
  columns=c("test", "replications","elapsed", "relative"),
  replications=10)

print(bm)

which gives:

              test replications elapsed relative
6    dirk(100, 10)           10   0.024    1.000
7   dirk(1000, 10)           10   0.026    1.083
8  dirk(10000, 10)           10   0.044    1.833
9  dirk(1e+05, 10)           10   0.321   13.375
10 dirk(1e+06, 10)           10   3.342  139.250
1    josh(100, 10)           10   0.034    1.417
2   josh(1000, 10)           10   0.036    1.500
3  josh(10000, 10)           10   0.070    2.917
4  josh(1e+05, 10)           10   0.453   18.875
5  josh(1e+06, 10)           10   5.381  224.208

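So the two do not differ hugely, but the xts approach is somewhat faster than the data.table approach: roughly 1.4 to 1.6 times faster at every size tested here.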

2 Answers

Here is an approach using the data.table package:

library(data.table)

## Some example data
X <- data.table(time = Sys.time() + c(0,3,7), val=c(0,1,0), key = "time")

## A data.table with one row for each second spanned by X
Y <- data.table(time = with(X, seq(min(time), max(time), by=1)), key = "time")

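## Merge them with a rolling join, so each grid second picks up the most recent value
## (originally written as rolltolast = TRUE, which data.table later deprecated
## in favour of roll / rollends)
X[Y, roll = TRUE]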
#                   time val
# 1: 2012-09-13 15:58:53   0
# 2: 2012-09-13 15:58:54   0
# 3: 2012-09-13 15:58:55   0
# 4: 2012-09-13 15:58:56   1
# 5: 2012-09-13 15:58:57   1
# 6: 2012-09-13 15:58:58   1
# 7: 2012-09-13 15:58:59   1
# 8: 2012-09-13 15:59:00   0
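
Conceptually, the rolling join looks up, for each second in Y, the most recent row of X at or before that second. The following is a minimal base R sketch of that lookup using findInterval(), applied to the timestamps from the question. It is only an illustration of the idea, not how data.table implements the join, and the variable names are made up for this example.

```r
## The three observations from the question
pt   <- as.POSIXct(c("2011-10-02 10:26:35", "2011-10-02 10:26:38",
                     "2011-10-02 10:26:42"), tz = "UTC")
vals <- c(0, 1, 0)

## One grid point per second between the first and last observation
secs <- seq(min(pt), max(pt), by = "1 sec")

## For each grid second, findInterval() gives the index of the latest
## observation at or before it; indexing vals with that carries values forward
filled <- vals[findInterval(as.numeric(secs), as.numeric(pt))]
filled
# [1] 0 0 0 1 1 1 1 0
```

This assumes the observation times are already sorted, which findInterval() requires.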
answered 2012-09-13T23:00:29.680

Yes, the xts package can help.

First, create an xts object:

R> pt <- strptime(c("02/10/11 10:26:35 AM", "02/10/11 10:26:38 AM", 
+                    "02/10/11 10:26:42 AM"), "%d/%m/%y %I:%M:%S %p", tz="UTC")
R> vals <- c(0,1,0)
R> x <- xts(vals, pt)
R> x
                    [,1]
2011-10-02 10:26:35    0
2011-10-02 10:26:38    1
2011-10-02 10:26:42    0
Warning message:
timezone of object (UTC) is different than current timezone (). 
R> 

We can ignore the warning; I am in a US timezone.

Now we can create a sequence of seconds running from the start to the end of that object:
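
If the entries live in a file rather than being typed in by hand, something along the following lines could produce the same `pt` and `vals`. This is a sketch under assumptions: the three lines are inlined here instead of being read from a real file, the column names `stamp` and `val` are invented for the example, the dates are read as day/month/year as above, and `%p` is assumed to be parsed in an English or C locale.

```r
## The three entries from the question, inlined so the sketch is self-contained
entries <- c("02/10/11 10:26:35 AM UTC, 0",
             "02/10/11 10:26:38 AM UTC, 1",
             "02/10/11 10:26:42 AM UTC, 0")

## Split on the comma: column 1 is the timestamp text, column 2 the value
raw <- read.csv(text = entries, header = FALSE, strip.white = TRUE,
                col.names = c("stamp", "val"), stringsAsFactors = FALSE)

## Drop the trailing " UTC" label, then parse; %I (12-hour clock) pairs with %p
pt <- as.POSIXct(sub(" UTC$", "", raw$stamp),
                 format = "%d/%m/%y %I:%M:%S %p", tz = "UTC")
vals <- raw$val
```

For a real file, the `text = entries` argument would be replaced by the file path.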

R> secs <- seq(start(x), end(x), by="1 sec")

Now for the magic: by merging our original object with an "empty" object defined on that grid, we expand it onto the grid:

R> x2 <- merge(x, xts(, secs))
R> x2
                     x
2011-10-02 10:26:35  0
2011-10-02 10:26:36 NA
2011-10-02 10:26:37 NA
2011-10-02 10:26:38  1
2011-10-02 10:26:39 NA
2011-10-02 10:26:40 NA
2011-10-02 10:26:41 NA
2011-10-02 10:26:42  0
Warning message:
timezone of object (UTC) is different than current timezone (). 

All that remains is to call na.locf(), which carries the last observation forward over the NAs:

R> x2 <- na.locf(merge(x, xts(, secs)))
R> x2
                    x
2011-10-02 10:26:35 0
2011-10-02 10:26:36 0
2011-10-02 10:26:37 0
2011-10-02 10:26:38 1
2011-10-02 10:26:39 1
2011-10-02 10:26:40 1
2011-10-02 10:26:41 1
2011-10-02 10:26:42 0
Warning message:
timezone of object (UTC) is different than current timezone (). 
R> 
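
To make the "last observation carried forward" step less magical, here is a small base R sketch of the same idea on a plain numeric vector. It is an illustration only, not zoo's actual implementation of na.locf(), and it assumes the first element is not NA (as is the case here, since the grid starts at the first observation).

```r
## The merged series from above as a plain vector, NAs where no value was observed
x <- c(0, NA, NA, 1, NA, NA, NA, 0)

## Position of each observed value, 0 where the value is missing ...
pos <- seq_along(x) * !is.na(x)

## ... then cummax() turns that into "index of the most recent observed value"
idx <- cummax(pos)

## Indexing by idx carries each observation forward over the NAs that follow it
filled <- x[idx]
filled
# [1] 0 0 0 1 1 1 1 0
```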
answered 2012-09-13T23:01:10.403