cluster-analysis - SSVD for dimensionality reduction + clustering

Translated from: https://stackoverflow.com/questions/29413296 2015-04-02T12:42:59.057

Viewed 78 times

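I ran SSVD through Mahout to apply LSA (latent semantic analysis). I have text documents, each containing many features (from 100 to 2000 terms). I want to use LSA on the documents to get the top terms or phrases that occur together as "concepts". Does anyone know how I can do that?

So far I have applied preprocessing filters (tokenization, stop-word removal, stemming, ...), created TF-IDF vectors with Mahout, and then run the ssvd command:

bin/mahout ssvd -i termVectors/tfidf-vectors/part-r-00000 -no Outputfolder -c 200 -us true -U false -V false -t 1 -ow -pca true

I used clusterdump in Mahout to parse the results, but all the terms in the results start with the letter "a*" and do not represent any concept. Does anyone have experience using SSVD to reduce features before clustering, or know how to use SSVD to show the concepts in a text corpus?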
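To make concrete what I mean by the top terms of a "concept", here is a minimal sketch in plain Python on a toy matrix. It is not Mahout code: the vocabulary, the document-term matrix, and every helper name are made up for illustration, and simple power iteration with deflation stands in for SSVD. It assumes that, with documents as rows, the term loadings of each concept are the right singular vectors (V) of the document-term matrix.

```python
import math

def matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def gram(doc_term):
    """Return A^T A (term-by-term matrix) for a document-term matrix A."""
    n_terms = len(doc_term[0])
    return [[sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
            for i in range(n_terms)]

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_right_singular_vectors(doc_term, k, iters=100):
    """Approximate the k leading right singular vectors of doc_term.

    Power iteration on A^T A; vectors already found are projected out
    on every step (deflation). A stand-in for SSVD, for illustration only.
    """
    g = gram(doc_term)
    found = []
    for _ in range(k):
        v = [1.0] * len(g)
        for _ in range(iters):
            for u in found:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            v = normalize(matvec(g, v))
        found.append(v)
    return found

def top_terms(vector, vocabulary, n):
    """Return the n terms with the largest absolute loading in one concept."""
    ranked = sorted(zip(vocabulary, vector), key=lambda tw: abs(tw[1]), reverse=True)
    return [term for term, _ in ranked[:n]]

# Hypothetical toy data: rows are documents, columns follow `vocabulary`.
vocabulary = ["cat", "dog", "pet", "stock", "bond", "market"]
doc_term = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 2],
]

concepts = top_right_singular_vectors(doc_term, k=2)
for idx, vec in enumerate(concepts):
    print("concept", idx, top_terms(vec, vocabulary, 3))
```

On this toy data the two concepts separate the finance terms from the pet terms. What I cannot work out is how to get the equivalent per-concept term listing from the Mahout output.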

Thanks


