r - 将序列与缺失值对齐

Question

我使用的语言是 R，但您不一定需要了解 R 来回答问题。

问题： 我有一个序列可以被认为是基本事实，另一个序列是第一个序列的移位版本，有一些缺失值。我想知道如何对齐两者。

设置

我有一个ground.truth基本上是一组时间的序列：

ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
                rep( seq(0,length.out=5,by=4*10+30), each=10 )

想想ground.truth我正在执行以下操作的时间：

{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5

我有第二个序列observations，它ground.truth 移动了 20% 的值缺失：

nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs     <- length(observations)

如果我绘制这些向量，这就是它的样子（请记住，将这些视为时间）：

在此处输入图像描述

我试过的。我想：

计算班次（theLag在我上面的例子中）
计算一个向量idx，使得ground.truth[idx] == observations - theLag

首先，假设我们知道theLag。注意ground.truth[1]不一定observations[1]-theLag。事实上，我们有ground.truth[1] == observations[1+lagI]-theLag一些lagI.

为了计算这个，我想我会使用互相关（ccf函数）。

但是，每当我这样做时，我都会遇到最大的滞后。互相关为 0，意思是ground.truth[1] == observations[1] - theLag。但是我已经在示例中尝试过这个，我明确地确保它observations[1] - theLag不是（即修改以确保它没有 1）。 ground.truth[1]idx_to_keep

这种转变theLag不应该影响互相关（不是ccf(x,y) == ccf(x,y-constant)吗？）所以我打算稍后再解决。

不过，也许我误解了，因为observations它的价值不如ground.truth? 即使在我设置的更简单的情况下theLag==0，互相关函数仍然无法识别正确的滞后，这让我相信我正在考虑这个错误。

有没有人有一个通用的方法来解决这个问题，或者知道一些可以提供帮助的 R 函数/包？

非常感谢。

score 6 · Accepted Answer

对于滞后，您可以计算两组点之间的所有差异（距离）：

diffs <- outer(observations, ground.truth, '-')

您的滞后应该是出现length(observations)次数的值：

which(table(diffs) == length(observations))
# 55.715382960625 
#              86

再检查一遍：

theLag
# [1] 55.71538

一旦你发现你的问题的第二部分很容易theLag：

idx <- which(ground.truth %in% (observations - theLag))

score 3 · Accepted Answer

如果您的时间序列不太长，则以下内容应该有效。

您有两个时间戳向量，第二个是第一个的移位且不完整的副本，您想知道它移位了多少。

# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]

我们可以尝试所有可能的滞后，并通过将每个观察到的时间戳与最接近的“真实”时间戳相匹配，计算出对齐的糟糕程度。

# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g="truth",    value=x)
  d2 <- data.frame(g="observed", value=y+u)
  d <- rbind(d1,d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side, 
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value, 
           ( SELECT MIN(B.value) 
             FROM   d AS B 
             WHERE  B.g='truth' 
             AND    B.value >= A.value
           ) AS next
    FROM   d AS A
    WHERE  A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease 
  # as we approach the lag.
  if(any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm=TRUE) )
  }
}

我们现在可以搜索最佳延迟。

# Look at the loss function
sapply( seq(-2,2,by=.1), f )

# Minimize the loss function.
# Change the interval if it does not converge, 
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time

r - 将序列与缺失值对齐

2 回答 2

Related

Reference