r - Reducing the time span in sequence analysis with R

Question

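I have sequences that unfold over a long period of time. I tried 8 different algorithms to classify my sequences (OM, CHI2, ...). Time runs from 1 to 123. I have 110 individuals and 8 events.

My results are not what I expected. First, they are very hard to read. Second, one cluster (group3) contains too many representative sequences. Third, the number of sequences per group is very unbalanced.

This may be because my time variable has a range of 123. I searched for articles on the problems caused by an overly long time span. I read in Sabherwal and Robey (1993) and in Shi and Prescott (2011) that you can standardize "each sequence" by dividing the number of required transformations by the length of the longer sequence. How can I do this in R?

Please find a description of my data below: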

library(TraMineRextras)

# Time-stamped event (TSE) data: one row per (ID, Year, Event)
seq.tse.data <- structure(list(
  ID = c(1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
         4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L,
         6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L),
  Year = c(2008L, 2010L, 2012L, 2007L, 2009L, 2010L, 2012L,
           2013L, 1996L, 1997L, 1999L, 2003L, 2006L, 2008L,
           2012L, 2007L, 2007L, 2008L, 2003L, 2007L, 2007L,
           2009L, 2009L, 2011L, 2014L, 2016L, 2006L, 2009L,
           2011L, 2013L, 2013L, 2015L, 2015L, 2016L),
  Event = c(5L, 4L, 5L, 3L, 1L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 5L, 1L, 5L,
            5L, 5L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 1L, 4L, 1L, 5L)),
  class = "data.frame", row.names = c(NA, -34L)
)

# Convert the events to state sequences (STS), one position per year from 1935 to 2018
seq.sts <- TSE_to_STS(seq.tse.data,
                      id = 1, timestamp = 2, event = 3,
                      stm = NULL, tmin = 1935, tmax = 2018,
                      firstState = "None")

# Compress to state-permanence (SPS) format and build the sequence object
seq.SPS <- seqformat(seq.sts, 1:84, from = "STS", to = "SPS")
seq.obj <- seqdef(seq.SPS)

> head(seq.tse.data)
  ID Year Event
1  1 2008     5
2  2 2010     4
3  2 2012     5
4  3 2007     3
5  3 2009     1
6  3 2010     5
> head(seq.obj)
    Sequence                            
[1] (None,74)-(5,10)-1                  
[2] (None,76)-(4,2)-(5.4,6)-2           
[3] (None,73)-(3,2)-(3.1,1)-(5.3.1,8)-3 
[4] (None,62)-(3,12)-(5.3,4)-(5.3.1,6)-3
[5] (None,73)-(5,11)-1                  
[6] (None,69)-(4,4)-(5.4,11)-2  

> head(alphabet(seq.obj),10)
 [1] "(1,1)"  "(1,10)" "(1,11)" "(1,12)" "(1,14)" "(1,19)" "(1,2)"  "(1,21)" "(1,25)" "(1,3)" 
...
[145] "(5.4.3.1,5)"   "(5.4.3.1,6)"   "(5.4.3.1,7)"   "(5.4.3.1,8)"   "(5.4.3.1.2,9)" "(None,1)"      "(None,11)"     "(None,20)"    
[153] "(None,26)"     "(None,30)"     "(None,38)"     "(None,41)"     "(None,42)"     "(None,44)"     "(None,45)"     "(None,49)"    
[161] "(None,51)"     "(None,53)"     "(None,55)"     "(None,57)"     "(None,58)"     "(None,59)"     "(None,60)"     "(None,61)"    
[169] "(None,62)"     "(None,64)"     "(None,65)"     "(None,66)"     "(None,67)"     "(None,68)"     "(None,69)"     "(None,7)"     
[177] "(None,70)"     "(None,71)"     "(None,72)"     "(None,73)"     "(None,74)"     "(None,75)"     "(None,76)"     "(None,77)"    
[185] "(None,78)"     "(None,79)"

Thanks in advance,

Antonin

score 1 · Accepted Answer

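I guess your question is about normalizing the dissimilarities between sequences. For example, Sabherwal and Robey (1993, p. 557) refer to the distance standardization proposed by Abbott & Hrycak (1990) and do not consider standardizing the sequences themselves at all. In any case, I cannot figure out what a standardization of sequences could be.

The seqdist function of TraMineR has a norm argument that can be used to normalize some of the proposed distance measures. Here is an excerpt from the seqdist help page:

Distances can optionally be normalized by means of the norm argument. If set to "auto", Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to "LCS", "LCP" and "RLCP" distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for "OM", "HAM" and "DHD". Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist" the distance is normalized by its maximum possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007) that preserves the triangle inequality.

Let me warn you that, while normalization makes distances between two short sequences (say, of length 10) more comparable with distances between two long sequences (say, of length 100), it does not solve the problem of comparing sequences of different lengths.

You can find a detailed discussion of the normalization of distances and similarities in sequence analysis in Elzinga & Studer (2016).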
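As a rough illustration of the norm argument described in the excerpt, here is a minimal sketch. It does not use the data from the question: it assumes the mvad example data set shipped with TraMineR, and the choice of the first 20 cases, of constant substitution costs, and of an indel cost of 1 is arbitrary.

```r
library(TraMineR)

# Assumption: the mvad example data from TraMineR stands in for the real data.
# Columns 17 to 86 hold the monthly states; only 20 cases are kept for speed.
data(mvad)
mvad.seq <- seqdef(mvad[1:20, 17:86])

# Raw optimal matching (OM) distances
d.raw <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "CONSTANT")

# Same distances with Abbott's normalization:
# each distance is divided by the length of the longer of the two sequences
d.norm <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "CONSTANT",
                  norm = "maxlength")

# Compare the ranges of the raw and normalized distances
range(d.raw)
range(d.norm)
```

The resulting matrix can be passed to a clustering function in place of the raw distances, for example hclust(as.dist(d.norm), method = "ward.D"). The other normalizations named in the excerpt are selected in the same way through norm.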
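To make the comparability point concrete, here is a small worked example. The costs are an assumption chosen for illustration (indel cost 1, substitution cost 2), and the sequences are taken to be of equal length n and as different as possible, so that the raw OM distance is 2n:

```latex
% Abbott's normalization: raw distance d divided by the length k of the longer sequence
d_{\text{norm}} = \frac{d}{k}

% Two maximally different sequences of length 10 (raw distance 20):
d_{\text{norm}} = \frac{20}{10} = 2

% Two maximally different sequences of length 100 (raw distance 200):
d_{\text{norm}} = \frac{200}{100} = 2
```

The raw distances differ by a factor of 10, while the normalized ones coincide. This only puts pairs of different overall length on the same scale; it says nothing about how to compare a short sequence with a long one.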
