r - Calculating a distance matrix with dtw

Question

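I have two matrices of normalized read counts, one for control and one for treatment, covering a time series from day 1 to day 26. I want to compute a distance matrix with dynamic time warping and then use it for clustering, but it seems too complicated. This is what I tried; can anyone help clarify? Thanks a lot.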

> head(control[,1:4])
               MAST2     WWC2  PHYHIPL   R3HDM2
Control_D1  6.591024 5.695156 3.388652 5.756384
Control_D1  8.043454 5.365221 6.859768 6.936970
Control_D3  7.731590 4.868267 6.919972 6.931073
Control_D4  8.129948 5.105528 6.627016 7.090268
Control_D5  7.690863 4.729501 6.824746 6.904610
Control_D6  8.101723 5.334501 6.868990 7.115883
> 

> head(lead[,1:4])
              MAST2     WWC2  PHYHIPL   R3HDM2
Lead30_D1  6.418423 5.610699 3.734425 5.778046
Lead30_D2  7.918360 4.295191 6.559294 6.780952
Lead30_D3  7.807142 4.294722 6.599187 6.716040
Lead30_D4  7.856720 4.432136 6.572337 6.848483
Lead30_D5  7.827311 4.204738 6.607107 6.784094
Lead30_D6  7.848760 4.458451 6.581216 6.943003
>
> dim(control)
[1]   26 2603
> dim(lead)
[1]   26 2603
library(dtw)

for (i in control) { 
  for (j in lead) { 
    result[i,j] <- dtw( dist(control[,,i],lead[,,j]), distance.only=T )$normalizedDistance 
  }
}

This gives:

Error in lead[, , j] : incorrect number of dimensions

score 5 · Accepted Answer

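There are already similar questions, but the answers are not very detailed. Here is a breakdown of what you need to know, in the specific case of R.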

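Calculating a cross-distance matrix

The proxy package is made specifically for calculating cross-distance matrices. You should check its vignette to see which measures it already implements. An example of its use: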

set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)

suppressPackageStartupMessages(library(proxy))
distance_matrix <- proxy::dist(sample_data, method = "euclidean", 
                               upper = TRUE, diag = TRUE)
print(distance_matrix)
#>          1        2        3        4        5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000

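Note: in the context of time series, proxy treats each row of a matrix as one series. You can confirm this from the fact that sample_data above is a 5x10 matrix and the resulting cross-distance matrix is 5x5.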
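To make the row-per-series convention concrete with two inputs, here is a minimal sketch on made-up data (the matrices `a` and `b` are illustrative and not from the question). When `proxy::dist` receives both `x` and `y`, the result has one row per series in `x` and one column per series in `y`:

```r
library(proxy)

set.seed(42L)
# Hypothetical data: 3 series and 4 series, each of length 10 (one series per row)
a <- matrix(rnorm(30L), nrow = 3L, ncol = 10L)
b <- matrix(rnorm(40L), nrow = 4L, ncol = 10L)

# Cross-distance matrix between the rows of a and the rows of b
cross_distmat <- proxy::dist(a, b, method = "euclidean")
cross_dims <- dim(cross_distmat)
print(cross_dims)  # rows correspond to a, columns correspond to b
```

Since the matrices in the question store days in rows and genes in columns, they would presumably need to be transposed with `t()` before each gene is treated as one series.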

Using the DTW distance

The dtw package implements many variants of DTW, and it also leverages proxy. You can calculate a DTW distance matrix with:

suppressPackageStartupMessages(library(dtw))
dtw_distmat <- proxy::dist(sample_data, method = "dtw", 
                           upper = TRUE, diag = TRUE)
print(dtw_distmat)
# prints the 5x5 DTW distance matrix
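For intuition about what a DTW distance measures, here is a small base-R sketch of the textbook recursion between two numeric vectors. The function name `simple_dtw` is made up for illustration. This is the classic three-way recursion with absolute difference as local cost, not the dtw package's default step pattern, so its values will generally differ from those of `dtw::dtw()`:

```r
# Textbook DTW between two numeric vectors, using absolute difference as the
# local cost and the classic three-way recursion (no window, no normalization).
simple_dtw <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # acc[i + 1, j + 1] holds the cheapest cost of aligning x[1:i] with y[1:j]
  acc <- matrix(Inf, nrow = n + 1L, ncol = m + 1L)
  acc[1L, 1L] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      local_cost <- abs(x[i] - y[j])
      acc[i + 1L, j + 1L] <- local_cost + min(acc[i, j + 1L],  # x advances alone
                                              acc[i + 1L, j],  # y advances alone
                                              acc[i, j])       # both advance
    }
  }
  acc[n + 1L, m + 1L]
}

# The second series only repeats one value, so warping absorbs the difference
stretched <- simple_dtw(c(1, 2, 3), c(1, 2, 2, 3))
# A constant offset cannot be warped away
shifted <- simple_dtw(c(0, 0), c(1, 1))
print(c(stretched = stretched, shifted = shifted))
```

The point of the toy inputs is that series of different lengths can still be compared, which a plain Euclidean distance cannot do.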

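Using a custom distance

One nice thing about proxy is that it gives you the option to register custom functions. You seem to be interested in the normalized version of DTW, so you could do something like this: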

ndtw <- function(x, y = NULL, ...) {
    dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}

pr_DB$set_entry(
  FUN = ndtw,
  names = "ndtw",
  loop = TRUE,
  distance = TRUE
)

ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
                            upper = TRUE, diag = TRUE)
print(ndtw_distmat)
#>           1         2         3         4         5
#> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000

See the documentation of pr_DB for more information.

Other DTW implementations
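As a rough sketch of the registry life cycle, the following registers a toy measure, uses it, and removes it again. The measure `mean_abs_diff` and the matrix `toy` are invented for this example. The existence check is there because, as far as I know, `set_entry()` refuses to register a name that is already taken:

```r
library(proxy)

# Hypothetical custom measure: mean absolute difference between two series
mean_abs_diff <- function(x, y) mean(abs(x - y))

# Only register the entry if the name is not in the registry yet
if (!pr_DB$entry_exists("mean_abs_diff")) {
  pr_DB$set_entry(FUN = mean_abs_diff, names = "mean_abs_diff",
                  loop = TRUE, distance = TRUE)
}

# Two toy series of length 2, one per row
toy <- rbind(c(0, 0), c(1, 3))
toy_dist <- proxy::dist(toy, method = "mean_abs_diff")
print(toy_dist)

# Remove the entry again to leave the registry as it was
pr_DB$delete_entry("mean_abs_diff")
```

With `loop = TRUE`, proxy calls the function once per pair of rows, which is why the function itself only needs to handle two vectors.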

The dtwclust package (which I made) implements a basic but faster version of DTW that can use multi-threading and also leverages proxy:

suppressPackageStartupMessages(library(dtwclust))
dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
print(dtw_basic_distmat)
#>      [,1]      [,2]      [,3]      [,4]      [,5]     
#> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000

The dtw_basic implementation only supports two step patterns and one window type, but it is considerably faster:

suppressPackageStartupMessages(library(microbenchmark))
microbenchmark(
  proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
  proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
)

Unit: microseconds
                                                                                        expr      min       lq     mean
 proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba",      window.size = 5L) 5279.124 5621.742 6070.069
                            proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)  657.966  710.418  776.474
   median       uq       max neval cld
 5802.354 6348.199 10411.000   100   b
  752.282  814.037  1161.626   100  a

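Another multi-threaded implementation is included in the parallelDist package, although I haven't personally tested it.

Multivariate or multi-dimensional time series

A single multivariate series is commonly a matrix where time spans the rows and the variables span the columns. DTW also works for them: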

mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
#> [1] 22.80421
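The same layout also works with the dtw package itself. Below is a hedged sketch on simulated data (the names `series_a` and `series_b` are made up, and the 26x4 shape simply mirrors the subset shown in the question). When `dtw::dtw()` is given two matrices, it computes the local costs between their rows, so both inputs need the same number of columns:

```r
library(dtw)

set.seed(7L)
# Hypothetical multivariate series: 26 time points (rows) by 4 variables (columns)
series_a <- matrix(rnorm(26L * 4L), nrow = 26L, ncol = 4L)
series_b <- matrix(rnorm(26L * 4L), nrow = 26L, ncol = 4L)

# One alignment between the two multivariate series
alignment <- dtw::dtw(series_a, series_b, distance.only = TRUE)
mv_distance <- alignment$distance
mv_normalized <- alignment$normalizedDistance
print(c(distance = mv_distance, normalized = mv_normalized))

# Sanity check: a series aligned with itself should have zero distance
self_distance <- dtw::dtw(series_a, series_a, distance.only = TRUE)$distance
print(self_distance)
```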

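The nice thing about proxy is that it can also calculate distances between objects contained in lists, so you can put several multivariate series in a list of matrices: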

mv_series <- lapply(1L:5L, function(dummy) {
  matrix(rnorm(15L), nrow = 5L, ncol = 3L)
})

mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
print(mv_distmat_dtwclust)
#>      [,1]     [,2]     [,3]     [,4]     [,5]    
#> [1,]  0.00000 27.43599 32.14207 36.42211 31.19279
#> [2,] 27.43599  0.00000 20.88470 23.88436 29.73219
#> [3,] 32.14207 20.88470  0.00000 22.14376 29.99899
#> [4,] 36.42211 23.88436 22.14376  0.00000 28.81111
#> [5,] 31.19279 29.73219 29.99899 28.81111  0.00000

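Your case

Regardless of what you choose, you can probably use proxy to get your result, but since you haven't provided your whole data, I can't give you a more specific example. I believe that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE) would give you the distance between one pair of series, assuming you consider each one to be a multivariate series with 4 variables.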
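If the intent is instead to treat every gene as its own univariate series, which is only one possible reading of the question, the original double loop could be repaired roughly as follows. The data here are simulated stand-ins with the same layout as the question (days in rows, genes in columns), and `ndtw_pair` is a helper name invented for this sketch:

```r
library(dtw)

set.seed(123L)
n_days <- 26L
genes <- c("MAST2", "WWC2", "PHYHIPL", "R3HDM2")
# Simulated stand-ins for the real matrices: days in rows, genes in columns
control <- matrix(rnorm(n_days * length(genes), mean = 6), nrow = n_days,
                  dimnames = list(paste0("Control_D", seq_len(n_days)), genes))
lead <- matrix(rnorm(n_days * length(genes), mean = 6), nrow = n_days,
               dimnames = list(paste0("Lead30_D", seq_len(n_days)), genes))

# Normalized DTW distance between one control series and one lead series
ndtw_pair <- function(x, y) {
  dtw::dtw(x, y, distance.only = TRUE)$normalizedDistance
}

# result[i, j]: gene i under control versus gene j under treatment
result <- matrix(NA_real_, nrow = length(genes), ncol = length(genes),
                 dimnames = list(genes, genes))
for (i in seq_along(genes)) {
  for (j in seq_along(genes)) {
    result[i, j] <- ndtw_pair(control[, i], lead[, j])
  }
}
print(round(result, 3))
```

The loops run over column indices and each call receives two plain vectors. With all 2603 genes this means several million alignments, so a faster implementation would likely be worth considering.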

score 1 · Accepted Answer

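If your question is "why am I getting this error?", the answer is that you are trying to subset a matrix, which is a two-dimensional array, using a third dimension.

Look: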

dim(lead)
# [1] 26 2603
lead[,,6.418423] # yes, that's the value j has the first time through the loop
# This will reproduce your error
lead[,,1]
# This will also reproduce your error

Hopefully you can now see that you have a few problems:

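1. You are trying to subset a matrix using a third dimension.
2. Your i and j values are the values stored in control and lead, respectively. You can either use them as values, or you can generate indices, e.g. for (i in seq_along(control)), if you intend to use them for something other than getting those same values.
3. Taking it to the next step, it is not clear what you are trying to pass to the dist function. dist takes a single matrix and calculates the distances between its rows. You seem to be trying to pass it two values from two different matrices, or perhaps two subsets of two different matrices. It looks like you may need to go back and look at the examples in the documentation xtr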
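The first two points can be reproduced in base R on a tiny stand-in matrix (the values are copied from the top-left corner of the question's control data). Using `seq_len(ncol(control))` for the column indices is my own choice here, not something taken from the question:

```r
# Small stand-in for the question's 26 x 2603 matrix
control <- matrix(c(6.59, 8.04, 5.70, 5.37), nrow = 2L,
                  dimnames = list(c("Control_D1", "Control_D2"),
                                  c("MAST2", "WWC2")))

# Iterating over a matrix yields its cell values, column by column
seen <- c()
for (i in control) seen <- c(seen, i)
print(seen)

# Iterating over column indices instead gives usable subscripts
for (j in seq_len(ncol(control))) {
  cat(colnames(control)[j], ":", control[, j], "\n")
}

# A matrix has two dimensions, so a third subscript raises an error
msg <- tryCatch(control[, , 1], error = function(e) conditionMessage(e))
print(msg)
```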