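Here is my data:
1. 45000 lines (less than 100 words) in a single file.
2. Key: line ID
3. Value: line (String)
I converted these documents to vectors using the standard Mahout CLI (everything ran fine). Parameters:
Number of clusters: 6, Iterations: 10
Result (ClusterDump): only 155 Key:Value entries
Can anyone help me solve this problem?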
Edit:
Sample data:
No. data.
1 The MapReduce implementation of fuzzy k-means looks similar to that of the k-means.
2 Each entry in the sequence file has a key, which is the identifier of the vector.
...
45900 Fuzzy k-means has a parameter, m, called the fuzziness factor
Converted to a sequence file (verified using Seqdumper)
<key:No.> <value:data>
...
45900
Vectorization
mahout-distribution-0.8/bin/mahout seq2sparse -i /user/hadoop/book-seq -o /user/hadoop/book-vector -ow -chunk 100 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 --namedVector
Kmeans clustering
mahout-distribution-0.8/bin/mahout kmeans -i /user/hadoop/book-vector/tfidf-vectors -c /user/hadoop/book-initial-cluster -o /user/hadoop/book-kmeans-cluster -cd 0.1 -k 6 -x 10 -cl -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure
Cluster dump
Directory Structure
clusteredPoints
clusters-0
clusters-1
clusters-2-final
mahout-distribution-0.8/bin/mahout clusterdump -i /user/hadoop/book-kmeans-cluster/clusters-2-final -p /user/hadoop/book-kmeans-cluster/clusteredPoints -of TEXT -o clusterdump.txt -dm org.apache.mahout.common.distance.CosineDistanceMeasure
cat clusterdump.txt
155 Entries
Update:
After vectorization, tfidf-vectors contains only 155 documents instead of ~45000.
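One way to narrow down where the documents disappear is to count the records at each stage and compare the numbers. This is a minimal sketch that assumes seqdumper's text output has one line starting with `Key: ` per record (an assumption about the output format, worth checking against your own dump). `count_records` is a hypothetical helper name, not part of Mahout.

```shell
# Hypothetical helper (not part of Mahout): counts records in
# seqdumper-style text read from stdin.
# Assumption: each record is printed on a line starting with "Key: ";
# header lines such as "Key class: ..." do not match the pattern.
count_records() {
  # grep -c exits non-zero when nothing matches, so "|| true" keeps
  # the function usable under "set -e" while still printing 0.
  grep -c '^Key: ' || true
}

# Usage (paths taken from the commands above):
#   mahout-distribution-0.8/bin/mahout seqdumper -i /user/hadoop/book-seq | count_records
#   mahout-distribution-0.8/bin/mahout seqdumper -i /user/hadoop/book-vector/tfidf-vectors | count_records
```

If the first count is close to the number of input lines and the second is 155, that would suggest the documents are lost during seq2sparse rather than in kmeans or clusterdump.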