Mahout - converting sequence files to vectors

Question

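I'm trying to implement the Naive Bayes algorithm in Mahout to do sentiment analysis on tweets and Facebook data. I have the tweets and Facebook data in text files. I convert these files to sequence files with the command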

bin/mahout seqdirectory -i /user/hadoopUser/sample/input -o /user/hadoopUser/sample/seqoutput

Then I try to convert the sequence files to vectors, so that I can feed them to Mahout as input, with the command

bin/mahout seq2sparse -i /user/hadoopUser/sample/seqoutput -o /user/hadoopUser/vectoroutput -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

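This converts each whole document into a vector, but I want to convert each sentence into a vector instead, because I don't want to classify documents. I want to classify the comments inside the documents. Can anyone help me solve this?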

score 0 · Accepted Answer

You should have a CSV file with the tweet data, right? I'm dealing with exactly the same problem. What I did was write each column of my CSV file into a sequence file using Mahout's SequenceWriter class, then run seq2sparse on that sequence file as usual. I'm not sure whether it worked, because I don't know how to interpret the clustering output; it is just a mess of numbers and words.
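The idea above is to give each comment its own record in the sequence file. A different way to get a similar result without writing Java is to split the input before running seqdirectory, which (as the question shows) turns each input file into one document. The following is only a minimal sketch under the assumption that every comment sits on its own line; the function name `split_comments` and the output file naming are hypothetical, not part of Mahout.

```python
from pathlib import Path
from typing import List


def split_comments(source: Path, out_dir: Path) -> List[Path]:
    """Write each non-empty line of `source` to its own file in `out_dir`.

    Assumption: one comment per line in the source text file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with source.open(encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            text = line.strip()
            if not text:
                continue  # skip blank lines
            # Hypothetical naming scheme: <source name>_<line index>.txt
            target = out_dir / f"{source.stem}_{index:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
            written.append(target)
    return written
```

The resulting directory would then be copied to HDFS and passed to seqdirectory with -i in place of the original input directory. Very large numbers of tiny files can be slow on HDFS, so this is likely to suit small data sets better than large ones.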

score 0 · Accepted Answer

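I'm not 100% sure, but the main problem is that Mahout treats this file as a single key/value pair. You need to add an extra ID to each line, for example an MD5 hash. So the CSV format would be: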

positive    bf9373d6d85959ec755eb8ac5ba0ae77    This movie is a real masterpiece
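Rows in this shape can be generated with a few lines of Python. This is a rough sketch, not the answerer's own method: it assumes the three fields are separated by tabs (the answer does not state the separator) and that the ID is the MD5 of the comment text; the function name `to_labeled_row` is hypothetical.

```python
import hashlib


def to_labeled_row(label: str, text: str) -> str:
    """Return 'label<TAB>md5(text)<TAB>text' for a single comment."""
    # Collapse tabs and newlines so the comment stays on one line
    # and does not collide with the assumed tab separator.
    clean = " ".join(text.split())
    doc_id = hashlib.md5(clean.encode("utf-8")).hexdigest()
    return f"{label}\t{doc_id}\t{clean}"


if __name__ == "__main__":
    print(to_labeled_row("positive", "This movie is a real masterpiece"))
```

Two identical comments get the same ID under this scheme, so if duplicates matter, the line number could be mixed into the hashed string.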
