
I am trying to run an optimal matching analysis using TraMineR, but I seem to be running into an issue with the size of the dataset. I have a large dataset of employment spells for European countries: more than 57,000 sequences, each 48 units long and built from 9 distinct states. To give an idea of the data, here is the head of the sequence object employdat.sts:

[1] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[2] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[3] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  
[4] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  
[5] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[6] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  

In a shorter SPS format, this reads as follows:

Sequence               
[1] "(EF,48)"              
[2] "(EF,48)"              
[3] "(ST,48)"              
[4] "(ST,36)-(MS,3)-(EF,9)"
[5] "(EF,48)"              
[6] "(ST,24)-(EF,24)"

After passing this sequence object to the seqdist() function, I get the following error message:

employdat.om <- seqdist(employdat.sts, method="OM", sm="CONSTANT", indel=4)    
[>] creating 9x9 substitution-cost matrix using 2 as constant value  
[>] 57160 sequences with 9 distinct events/states  
[>] 12626 distinct sequences  
[>] min/max sequence length: 48/48  
[>] computing distances using OM metric  
Error in .Call(TMR_cstringdistance, as.integer(dseq), as.integer(dim(dseq)),  : negative length vectors are not allowed

Is this error related to the huge number of distinct, long sequences? I am using an x64 machine with 4 GB of RAM, and I have also tried it on a machine with 8 GB of RAM, which reproduced the error message. Does anyone know a way to tackle this error? Incidentally, running the same syntax separately for each country (indexing by country) worked well and produced meaningful results.


2 Answers


I have never seen this error code before, but it is most likely due to the very large number of sequences (a "negative length vectors" error usually points to an integer overflow when allocating the full 57,160 × 57,160 distance matrix, which would need more than 2^31 cells). There are at least two things you can try:

  • Use the full.matrix=FALSE argument of seqdist (see its help page). It computes only the lower triangle of the distance matrix and returns a "dist" object that can be used directly in the hclust function (see the first sketch after this list).
  • Aggregate identical sequences (you have only 12,626 distinct sequences instead of 57,160), compute the distances between the distinct sequences, cluster them using weights (the number of times each distinct sequence occurs in the dataset), and then map the clustering back onto your original dataset. This is easily done with the WeightedCluster library. The first appendix of the WeightedCluster manual provides a step-by-step guide to this procedure (it is also described on the web page http://mephisto.unige.ch/weightedcluster); a second sketch after this list illustrates the workflow.
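
As a minimal sketch of the first suggestion, reusing the object names from the question (the call to hclust and the Ward method are illustrative assumptions):

## With full.matrix = FALSE, seqdist() returns a "dist" object holding
## only the lower triangle of the distance matrix.
employdat.om <- seqdist(employdat.sts, method = "OM", sm = "CONSTANT",
                        indel = 4, full.matrix = FALSE)

## A "dist" object can be passed directly to hclust()
employdat.clust <- hclust(employdat.om, method = "ward.D")  # "ward" in older R versions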
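
And here is a sketch of the aggregation-based approach, following the steps of the appendix of the WeightedCluster manual; applying wcAggregateCases directly to the state-sequence object, the Ward method, and the four-cluster solution are illustrative assumptions:

library(WeightedCluster)

## Group identical rows: agg$aggIndex gives one representative per distinct
## sequence, agg$aggWeights its frequency, agg$disaggIndex the mapping back.
agg <- wcAggregateCases(employdat.sts)
uniq.sts <- employdat.sts[agg$aggIndex, ]

## Distances between the 12626 distinct sequences only
uniq.om <- seqdist(uniq.sts, method = "OM", sm = "CONSTANT",
                   indel = 4, full.matrix = FALSE)

## Weighted hierarchical clustering: the weights enter through 'members'
uniq.clust <- hclust(uniq.om, method = "ward.D", members = agg$aggWeights)
uniq.cl4 <- cutree(uniq.clust, k = 4)

## Map the cluster membership back to all 57160 original sequences
employdat.cl4 <- uniq.cl4[agg$disaggIndex]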

Hope this helps.

Answered on 2013-04-10T10:54:18.270

A simple solution that usually works well is to analyze only a sample of the data. For example,

employdat.sts <- employdat.sts[sample(nrow(employdat.sts),5000),]

will extract a random sample of 5,000 sequences. Exploring a sample of that size should be sufficient to uncover the characteristics of the sequences, including their diversity.

To improve representativeness, you could even use some form of stratified sampling (e.g., by first or last state, or by some covariate available in your dataset); a sketch is shown below. Since you have the original dataset at hand, you have full control over the random sampling design.
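
For instance, a minimal sketch of stratified sampling by the first state of each sequence (the target size of 5,000, the seed, and the name employdat.sample are illustrative assumptions):

## First observed state of each sequence (first column of the sequence object)
first.state <- as.character(unlist(employdat.sts[, 1]))

set.seed(1)        # illustrative seed, for reproducibility
n.target <- 5000   # illustrative target sample size

## Within each stratum, draw a share proportional to the stratum size
idx <- unlist(lapply(split(seq_along(first.state), first.state), function(i) {
  k <- max(1, round(length(i) * n.target / length(first.state)))
  i[sample.int(length(i), k)]
}))

employdat.sample <- employdat.sts[sort(idx), ]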


Update

If clustering is the goal and you need the cluster membership of every individual sequence, see https://stackoverflow.com/a/63037549/1586731

Answered on 2013-04-10T15:16:41.867