machine-learning - Mahout for sentiment analysis

Question

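Using Mahout, I am able to classify the sentiment of my data, but I am stuck on the confusion matrix.

I am using the Mahout 0.7 Naive Bayes algorithm to classify the sentiment of tweets. I used trainnb and testnb to train the classifier and to classify the sentiment of tweets as "positive", "negative", or "neutral".

Sample positive training set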

      'positive','i love my i phone'
      'positive','it's pleasure to have i phone'

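Similarly, I prepared training samples for negative and neutral; it is a huge data set.

The sample test data I provide consists of tweets without a sentiment label.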

  'it is nice model'
  'simply fantastic '

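I am able to run the Mahout classification algorithm, and it reports the classified instances as a confusion matrix.

As a next step, I need to find out which tweets express positive sentiment and which are negative. The expected output of the classification is the text tagged with its sentiment: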

       'negative','very bad btr life time'
      'positive' , 'i phone has excellent design features'

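Which algorithm do I need to use in Mahout to get output in the above format? Or does this require a custom implementation?

To display the data in a "friendly" way, please suggest an algorithm provided by Apache Mahout that is suitable for sentiment analysis of my Twitter data.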

score 3 · Accepted Answer

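In general, to classify some text you run Naive Bayes once per class prior (positive and negative in your case) and then simply pick the class that yields the larger value.

This excerpt from the Mahout book has some examples. See Listing 2: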

Parameters p = new Parameters();
p.set("basePath", modelDir.getCanonicalPath());
Datastore ds = new InMemoryBayesDatastore(p);
Algorithm a = new BayesAlgorithm();
ClassifierContext ctx = new ClassifierContext(a,ds);
ctx.initialize();

....

ClassifierResult result = ctx.classifyDocument(tokens, defaultCategory);

Here, result should contain the "positive" or "negative" label.
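To make the "score every label, keep the largest" idea concrete, here is a minimal sketch in plain Java. It is a toy multinomial Naive Bayes with add-one smoothing, not Mahout's implementation; the class name `TinyNaiveBayes` and the training sentences in `main` are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial Naive Bayes; illustrative only, not Mahout's implementation. */
final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Counts the words of one labeled training text. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** log P(label) + sum of log P(word | label), with add-one smoothing. */
    double logScore(String label, String text) {
        double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        double denominator = wordsPerLabel.get(label) + vocabulary.size();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            score += Math.log((counts.getOrDefault(word, 0) + 1) / denominator);
        }
        return score;
    }

    /** Scores the text under every known label and returns the best one. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = logScore(label, text);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("positive", "i love my i phone");
        nb.train("positive", "simply fantastic design");
        nb.train("negative", "very bad battery life");
        nb.train("negative", "i hate this bad phone");
        System.out.println(nb.classify("fantastic phone"));
    }
}
```

A real classifier would be trained on far more data; the point is only that each label gets its own score and the label with the highest score is the prediction.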

score 1 · Accepted Answer

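I'm not sure I can help you completely, but I hope I can give you some entry points. In general, my advice is to download Mahout's source code and look at how the examples and the target classes are implemented. This is not easy, and you should be prepared for the fact that Mahout has no easy entry door. Once you are in, though, the learning curve is quick.

First of all, it depends on which Mahout version you are using. I use 0.7 myself, so my explanation applies to 0.7.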

public void classify(String modelLocation, RawEntry unclassifiedInstanceRaw) throws IOException {

    Configuration conf = new Configuration();

    NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelLocation), conf);
    AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

    String unclassifiedInstanceFeatures = RawEntry.toNaiveBayesTrainingFormat(unclassifiedInstanceRaw);

    FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder("features");
    vectorEncoder.setProbes(1); // my features vectors are tiny

    // Split once so the same tokens drive both the vector size and the encoding loop
    String[] features = unclassifiedInstanceFeatures.split(" ");

    Vector unclassifiedInstanceVector = new RandomAccessSparseVector(features.length);

    for (String feature : features) {
        vectorEncoder.addToVector(feature, unclassifiedInstanceVector);
    }

    Vector classificationResult = classifier.classifyFull(unclassifiedInstanceVector);

    System.out.println(classificationResult.asFormatString());

}

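What happens here:

1) First, you load the model you obtained from trainnb. The model is saved at the location you specified with the -o parameter when calling trainnb. The model is a .bin file.

2) A StandardNaiveBayesClassifier is created from your model.

3) RawEntry is my custom class; it is just a wrapper around the raw string of my data. toNaiveBayesTrainingFormat takes the string I want to classify, strips the noise from it according to my needs, and simply returns a feature string of the form "word1 word2 word3 word4". In this way my unclassified raw string is converted into a format suitable for classification.

4) The feature string now has to be encoded as a Mahout Vector, because the classifier only accepts vectors as input.

5) Pass the vector to the classifier, and the magic happens.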
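As a rough illustration of step 4, the sketch below hashes each word of a feature string into a slot of a fixed-size array. This only shows the general idea behind hashed feature encoding; it is not how Mahout's AdaptiveWordValueEncoder is implemented (which also applies weighting and probes), and the class name `HashedEncodingSketch`, the method `encode`, and the cardinality of 16 are assumptions made for the example.

```java
import java.util.Arrays;

/** Hashes words into a fixed-size count vector; a sketch, not Mahout's encoder. */
final class HashedEncodingSketch {

    /** Returns a vector of length cardinality holding hashed word counts. */
    static double[] encode(String features, int cardinality) {
        double[] vector = new double[cardinality];
        for (String word : features.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty input string yields one empty token
            }
            // floorMod keeps the index non-negative even for negative hash codes
            int index = Math.floorMod(word.hashCode(), cardinality);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("word1 word2 word3 word4", 16)));
    }
}
```

With any hashing scheme of this kind, the same encoding and vector size have to be used for training and for classification; otherwise the positions in the vector no longer mean the same thing to the model.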

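That was the first part. The classifier now returns a Vector containing the classes (sentiments, in your case) with their probabilities. You want a specific output format. The most straightforward implementation (though I think not the most efficient or elegant one) would be the following:

1) Create a map reduce job that goes through all the data you want to classify.

2) For each instance, call the classify method (don't forget to change it so that a StandardNaiveBayesClassifier is not created for every instance).

3) With the classification result vector in hand, you can output the data in any format from your map reduce job.

4) A useful setting here is jC.set("mapreduce.textoutputformat.separator", " "); where jC is the JobConf. It lets you choose the separator for the output file of the mapreduce job. In your case this is ",".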
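The last mile, turning a score vector into a line such as 'positive','some text', can be sketched in plain Java as follows. The `labels` array is an assumption: in a real run the index-to-label mapping has to come from whatever label index was written during training. The class and method names and the example scores are hypothetical.

```java
/** Picks the best-scoring label and formats one output line; illustrative only. */
final class LabeledOutputSketch {

    /** Index of the largest score; the first one wins on ties. */
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Builds a line in the 'label','text' format used in the question. */
    static String toOutputLine(String[] labels, double[] scores, String text) {
        return "'" + labels[argMax(scores)] + "','" + text + "'";
    }

    public static void main(String[] args) {
        String[] labels = {"negative", "neutral", "positive"}; // assumed label order
        double[] scores = {-12.5, -9.1, -3.2};                 // made-up scores
        System.out.println(
                toOutputLine(labels, scores, "i phone has excellent design features"));
    }
}
```

Inside a mapper, the same two helpers could be applied to each classified instance before the line is written out.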

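Again, all of this applies to Mahout 0.7. There is no guarantee that it will work for you as is. It did work for me, though.

In general, I have never used Mahout from the command line; for me, Mahout from Java is the way to go.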
