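Because I have a large dataset and limited computational resources, I would like to work with aggregated sequence objects using the R packages TraMineR and WeightedCluster, but I am struggling to find the correct syntax for doing so.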
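In the example code below you will find two discrepancy analyses: the first tree is built from the original dataset, the second from the aggregated data (i.e., only the unique sequences, weighted by their frequencies).
Unfortunately, the results do not match. Do you have any idea why?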
Example code
library(TraMineR)
library(WeightedCluster)
## Load example data and assign labels
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education",
                 "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, 17:86], weights=mvad$weight)
mvad.agg
## Define sequence object
mvad.seq <- seqdef(mvad[, 17:86], alphabet=mvad.alphabet, states=mvad.scodes,
                   labels=mvad.labels, weights=mvad$weight, xtstep=6)
mvad.agg.seq <- seqdef(mvad[mvad.agg$aggIndex, 17:86], alphabet=mvad.alphabet,
                       states=mvad.scodes, labels=mvad.labels,
                       weights=mvad.agg$aggWeights, xtstep=6)
## Computing OM dissimilarities
mvad.dist <- seqdist(mvad.seq, method="OM", indel=1.5, sm="CONSTANT")
mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")
## Discrepancy analysis
tree <- seqtree(mvad.seq ~ gcse5eq + Grammar + funemp,
                data=mvad, diss=mvad.dist, weight.permutation="diss")
seqtreedisplay(tree, type="d", border=NA)
tree.agg <- seqtree(mvad.agg.seq ~ gcse5eq + Grammar + funemp,
                    data=mvad[mvad.agg$aggIndex, ], diss=mvad.agg.dist,
                    weight.permutation="diss")
seqtreedisplay(tree.agg, type="d", border=NA)
This question relates to big data and the computation of sequence distances.
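To make explicit what "aggregated data" means in this question, here is a minimal base-R sketch on a toy data frame. The data, the weights, and the names `toy`, `w`, `key`, `first.idx` and `agg.w` are hypothetical and chosen only for illustration. The sketch mirrors the general idea of keeping one case per distinct sequence and summing the case weights; it is not a reproduction of how `wcAggregateCases` is implemented.

```r
# Toy data: four cases, two distinct sequences (hypothetical values)
toy <- data.frame(t1 = c("A", "A", "B", "A"),
                  t2 = c("B", "B", "B", "B"),
                  stringsAsFactors = FALSE)
w <- c(1, 2, 1.5, 0.5)  # hypothetical case weights

# One key per case, built from its full sequence
key <- do.call(paste, c(toy, sep = "-"))

# Index of the first case carrying each distinct sequence
first.idx <- which(!duplicated(key))

# Summed case weights per distinct sequence, in order of first appearance
agg.w <- as.numeric(tapply(w, factor(key, levels = unique(key)), sum))

toy[first.idx, ]  # one row per distinct sequence
agg.w             # aggregated weight attached to each of those rows
```

Note that in this sketch only the sequence columns define what counts as a duplicate, which is also how the example code above calls `wcAggregateCases`.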