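I am trying to cluster time series data from a longitudinal survey that includes participants' annually reported income. These income trajectories vary in length, so dynamic time warping seems like an appropriate tool for computing the distance matrix.
Some experimentation showed that the way these trajectories map onto one another depends on the step pattern assigned, so I would like to select the most appropriate one for my dataset. I am not very experienced with dynamic time warping, so I decided to cluster a small sample using distance matrices created with a range of step patterns and see which one gives the best performance metrics.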
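The evaluation idea above could be sketched roughly as follows for a single step pattern. This is only an illustration under several assumptions that are not part of my actual setup: `sample_traj` is made-up data, `dtw_dist_matrix` is a hypothetical helper, and average-linkage hierarchical clustering with `k = 3` scored by mean silhouette width (from the cluster package) stands in for whatever clustering method and performance metric end up being used.

```r
library(dtw)      # dtw() and the step pattern objects
library(cluster)  # silhouette()

set.seed(1)
# Hypothetical small sample: 20 series of slightly varying length
sample_traj <- lapply(sample(8:12, 20, replace = TRUE), function(n) rnorm(n))

# Hypothetical helper: pairwise DTW distances under one step pattern
dtw_dist_matrix <- function(series, step) {
  n <- length(series)
  d <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d[i, j] <- d[j, i] <- dtw(series[[i]], series[[j]],
                                step.pattern = step,
                                distance.only = TRUE)$distance
    }
  }
  as.dist(d)
}

# symmetric2 is the dtw default pattern; swap in any other pattern to compare
d <- dtw_dist_matrix(sample_traj, symmetric2)

# Assumed clustering and metric: average linkage, 3 clusters, mean silhouette
clusters <- cutree(hclust(d, method = "average"), k = 3)
avg_sil <- mean(silhouette(clusters, d)[, "sil_width"])
avg_sil
```

Repeating the last three steps for each candidate step pattern would give one score per pattern to compare.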
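To compute the distances, I used the rabinerJuangStepPattern function from the dtw package, which implements the "comprehensive set of step patterns" outlined in Rabiner and Juang (1993; I have not been able to obtain a copy of this text). I wrote a nested for loop to iterate over all configurations of the Rabiner-Juang set and found that many of them threw the following error: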
Error in dtw(… : No warping path exists that is allowed by costraints
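I have reproduced the problem with a toy version of my data, which only tries to compute distances relative to the first participant in the dataset: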
# Import required libraries
library(tidyverse)
library(dtw)
# Set seed for reproducible results
set.seed(123)
# Generate lengths of sample income trajectories
lengths = sample(8:27,500,replace = T) %>% as.list()
# Use rnorm to generate income trajectories of varying lengths, as defined above
inc_traj = list() %>% .[1:500] %>% map2(lengths, ~ rnorm(.y, 1588.647, 1484.186))
# Create list which allows comparison of all trajectories with that of the first participant
pairs = list() %>% .[1:500] %>% map2(as.list(1:500), ~ inc_traj %>% .[c(1,.y)])
# Pre-allocated list of distances (one slot per step pattern) to populate using for loop below;
# slots for patterns that fail stay NULL
distances = vector("list", 7 * 4 * 2)
# Empty vector of step pattern names to populate using for loop below
name = c()
# Define loop counter
l = 0
# For each Rabiner-Juang family
for (a in 1:7) {
  # For each slope-weighting sub-type
  for (b in 1:4) {
    # For both smoothed and unsmoothed applications
    for (c in 1:2) {
      # Increase loop count by 1
      l = l + 1
      # Use tryCatch to handle errors
      tryCatch({
        # Calculate the distance between the first income trajectory and all other trajectories in the sample
        temp = pairs %>%
          map(~ dtw(.[[1]], .[[2]], keep.internals = TRUE,
                    step.pattern = rabinerJuangStepPattern(a, letters[b], smoothed = (c == 1)))) %>%
          map(~ .$distance) %>% unlist()
        # Assign distances to distances list
        distances[[l]] = temp
      },
      # No additional commands for warnings
      warning = function(war) {},
      # No additional commands for errors; the entry for this pattern is left unassigned (NULL)
      error = function(err) {})
      # Add name to name list
      name[l] = paste0("Rabiner-Juang:", a, ",", letters[b], ",smoothed=", c == 1)
      # Print for loop progress
      cat("\r", paste0("Rabiner-Juang:", a, ",", letters[b], ",smoothed=", c == 1, ". ", l, " of ", 7*4*2, " calculated."))
    }
  }
}
# Assign names to all list objects
distances = distances %>% setNames(name)
# Get names of Rabiner-Juang step patterns that worked correctly
distances %>% map(~ !is.null(.)) %>% unlist() %>% .[. == T] %>% names()
The output of this code shows that the step patterns in families I, V, and VII work correctly, while those in families II, III, IV, and VI produce errors.
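Because the loop above wraps all 500 comparisons for a pattern in a single tryCatch, one failing pair is enough to drop the whole pattern. A rough sketch of how failures could be checked per pair is below. The series are made-up toy data, `has_path` is a hypothetical helper, and family IV with slope weighting "c" is just one arbitrary pattern chosen for illustration.

```r
library(dtw)

set.seed(123)
# Hypothetical toy series: one reference and three comparisons of different lengths
reference <- rnorm(10)
others <- lapply(c(10, 12, 27), function(n) rnorm(n))

# Hypothetical helper: TRUE if dtw() returns without error for this pair and pattern
has_path <- function(x, y, step) {
  tryCatch({
    dtw(x, y, step.pattern = step, distance.only = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

# One arbitrary Rabiner-Juang pattern, picked only as an example
pattern <- rabinerJuangStepPattern(4, "c", smoothed = FALSE)

# Check each comparison separately and tabulate against series length
ok <- vapply(others, function(y) has_path(reference, y, pattern), logical(1))
result <- data.frame(length = lengths(others), ok = ok)
result
```

Tabulating `ok` against series length in the real data might show whether the errors come from every pair or only from some of them.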
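My questions are therefore as follows:
1) Why do some of these families work while others produce errors? Is it because some families are unsuited to this kind of data, or is my implementation wrong?
2) Does anyone know of any theoretical reason why some step patterns might be preferable to others for this use case?
Thank you very much for your time!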
Reference:
Rabiner, L. R., & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall.