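I am trying to implement the Naive Bayes algorithm in Mahout for sentiment analysis of tweets and Facebook data. I have the tweets and Facebook data in text files. I convert these files to sequence files with the command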

bin/mahout seqdirectory -i /user/hadoopUser/sample/input -o /user/hadoopUser/sample/seqoutput

Then I try to convert the sequence files to vectors, to use as input for Mahout, with the command

bin/mahout seq2sparse -i /user/hadoopUser/sample/seqoutput -o /user/hadoopUser/vectoroutput -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

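This converts each whole document into a vector, but I want each sentence converted into a vector instead, because I don't want to classify documents. I want to classify the comments inside the documents. Can anyone help me solve this?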

2 Answers

You should have a CSV file with the tweet data, right? I'm dealing with exactly the same problem. What I did was write each column of my CSV file into the sequence file using Mahout's SequenceWriter class, then run seq2sparse as usual on that sequence file. I'm not sure whether it worked, because I don't know how to interpret the clustering output; it's just a mess of numbers and words.
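
If writing the sequence file by hand is more than you need, a different route (not the one described above) is to split the input before running seqdirectory, since seqdirectory emits one key/value record per input file. The sketch below is a minimal Python example of that preprocessing step. It assumes one comment per line in the source text file; the function name `split_comments` and the `<stem>_<index>.txt` naming scheme are made up for illustration.

```python
from pathlib import Path


def split_comments(source: Path, out_dir: Path) -> int:
    """Write each non-empty line of `source` to its own file in `out_dir`.

    Assumes one comment per line. Returns the number of files written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with source.open(encoding="utf-8") as handle:
        for line in handle:
            comment = line.strip()
            if not comment:
                continue  # skip blank lines so no empty documents are created
            # Hypothetical naming scheme: <source stem>_<running index>.txt
            target = out_dir / f"{source.stem}_{count:06d}.txt"
            target.write_text(comment + "\n", encoding="utf-8")
            count += 1
    return count
```

After copying the resulting directory to HDFS, the seqdirectory command from the question can be pointed at it unchanged, and seq2sparse should then produce one vector per comment. Many tiny files are inefficient on HDFS, so this is probably only reasonable for small samples.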

answered 2013-07-04T14:21:17.227

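I'm not 100% sure, but the main problem is that Mahout treats this file as a single key/value pair. You need to add an extra id to each line, for example an md5 hash. So the CSV format would be: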

positive    bf9373d6d85959ec755eb8ac5ba0ae77    This movie is a real masterpiece
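
A row in that layout could be generated roughly as follows. This is only a sketch: the tab separator, the choice to hash the comment text itself, and the helper name `to_labeled_row` are assumptions, not something Mahout prescribes.

```python
import hashlib


def to_labeled_row(label: str, text: str) -> str:
    """Return 'label<TAB>md5(text)<TAB>text' for one comment."""
    # Collapse whitespace: a tab or newline inside the comment would
    # break the one-row-per-line, three-column layout.
    clean = " ".join(text.split())
    # Assumption: the id is the md5 hex digest of the cleaned comment text.
    digest = hashlib.md5(clean.encode("utf-8")).hexdigest()
    return f"{label}\t{digest}\t{clean}"


if __name__ == "__main__":
    print(to_labeled_row("positive", "This movie is a real masterpiece"))
```

Because the id is derived from the text, two identical comments get the same id. If duplicates have to stay distinct, hash the text together with a line number or use a running counter instead.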
answered 2014-01-11T16:28:57.937