twitter - How to train a maxent classifier

Question

[Project stack: Java, Opennlp, Elasticsearch (datastore), twitter4j to read data from twitter]

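I intend to use a maxent classifier to classify tweets. I know that the first step is to train the model. From the documentation I found that there is a GISTrainer-based train method for training the model. I have managed to put together a simple piece of code that uses opennlp's maxent classifier to train the model and predict the result.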

I used two files, positive.txt and negative.txt, to train the model.

Contents of positive.txt

positive    This is good
positive    This is the best
positive    This is fantastic
positive    This is super
positive    This is fine 
positive    This is nice

Contents of negative.txt

negative    This is bad
negative    This is ugly
negative    This is the worst
negative    This is worse
negative    This sucks

The Java methods below generate the result.

@Override
public void trainDataset(String source, String destination) throws Exception {
    File[] inputFiles = FileUtil.buildFileList(new File(source)); // picks up both positive.txt and negative.txt
    File modelFile = new File(destination);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
    int cutoff = 5;
    int iterations = 100;
    BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
    DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff,iterations, bowfg);
    // Close the output stream even if serialization fails.
    try (FileOutputStream out = new FileOutputStream(modelFile)) {
        model.serialize(out);
    }
}

@Override
public void predict(String text, String modelFile) {
    InputStream modelStream = null;
    try{
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(text);
        modelStream = new FileInputStream(modelFile);
        DoccatModel model = new DoccatModel(modelStream);
        BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator(); 
        DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
        double[] probs   = categorizer.categorize(tokens);
        if(null!=probs && probs.length>0){
            for(int i=0;i<probs.length;i++){
                System.out.println("double[] probs index  " + i + " value " + probs[i]);
            }
        }
        String label = categorizer.getBestCategory(probs);
        System.out.println("label " + label);
        int bestIndex = categorizer.getIndex(label);
        System.out.println("bestIndex " + bestIndex);
        double score = probs[bestIndex];
        System.out.println("score " + score);
    }
    catch(Exception e){
        e.printStackTrace();
    }
    finally{
        if(null!=modelStream){
            try {
                modelStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

public static void main(String[] args) {
    try {
        String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
        String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
        MaximunEntropyClassifier me = new MaximunEntropyClassifier();
        me.trainDataset(source, outputModelPath);
        me.predict("This is bad", outputModelPath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

I have the following questions.

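1) How do I train the model iteratively? Also, how do I add new sentences/words to the model? Is there a specific format for the data file? I found that each line of the file needs to have at least two words separated by a tab. Is my understanding valid?
2) Are there any publicly available datasets that I can use to train the model? I found some sources for movie reviews. The project I am working on involves not just movie reviews but also other things such as product reviews, brand sentiment, etc.
3) This helped to some extent. Is there a publicly available working example? I couldn't find the documentation for maxent.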

Please help me out. I am kind of stuck.

score 0 · Accepted Answer

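1) You can store the samples in a database. I used Accumulo for this once. Then, at some interval, rebuild the model and reprocess your data.
2) The format is: categoryname space sample newline. No tabs.
3) It sounds like you want to combine general sentiment with a topic or entity. You could use a name finder, or just a regex, to find the entity. Alternatively, you could add the entity to your class labels for doccat, so that they include the product name etc., but then your samples would have to be very specific.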
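
Points 1 and 2 can be sketched roughly as follows. This is only an illustration, not OpenNLP code: `SampleStore`, `rebuildEvery` and `snapshotForTraining` are made-up names, an in-memory list stands in for the database, and "rebuild after N new samples" is just one assumed policy for "at some interval". The actual training call is left out; the snapshot text would be written to a file and handed to whatever trainer you use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a database of labelled samples. */
class SampleStore {
    private final List<String> lines = new ArrayList<>();
    private final int rebuildEvery;      // assumed policy: rebuild after this many new samples
    private int addedSinceRebuild = 0;

    SampleStore(int rebuildEvery) {
        if (rebuildEvery < 1) {
            throw new IllegalArgumentException("rebuildEvery must be >= 1");
        }
        this.rebuildEvery = rebuildEvery;
    }

    /** Stores one sample as "category<space>text" and returns true when a rebuild is due. */
    boolean add(String category, String text) {
        String cleanCategory = category.trim();
        if (cleanCategory.isEmpty() || cleanCategory.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("category must be a single token");
        }
        // Collapse tabs, newlines and repeated spaces so each sample stays on one line.
        String cleanText = text.trim().replaceAll("\\s+", " ");
        if (cleanText.isEmpty()) {
            throw new IllegalArgumentException("sample text must not be empty");
        }
        lines.add(cleanCategory + " " + cleanText);
        addedSinceRebuild++;
        return addedSinceRebuild >= rebuildEvery;
    }

    /** Returns ALL samples collected so far, one per line, and resets the rebuild counter. */
    String snapshotForTraining() {
        addedSinceRebuild = 0;
        return lines.isEmpty() ? "" : String.join("\n", lines) + "\n";
    }

    public static void main(String[] args) {
        SampleStore store = new SampleStore(2);
        store.add("positive", "This is good");
        if (store.add("negative", "This\tis bad")) {
            // Here the full snapshot would be written out and the model retrained from scratch.
            System.out.print(store.snapshotForTraining());
        }
    }
}
```

Note that the snapshot always contains every stored sample, not just the new ones, because the model is rebuilt from the complete set each time.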

score 0 · Accepted Answer

AFAIK, you have to retrain the MaxEnt model completely if you want to add new training samples. It cannot be done incrementally online.

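The default input format for opennlp maxent is a text file in which each line represents one sample. A sample consists of tokens (features) separated by spaces. During training, the first token represents the outcome.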

Take a look at my minimal working example: Training models using openNLP maxent
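
To make the line format described above concrete, here is a small stand-alone sketch that splits one line into its outcome and features. It does not use OpenNLP; `MaxentLine` is a hypothetical helper, and splitting on any run of whitespace (rather than strictly on single spaces) is an assumption on my part.

```java
import java.util.Arrays;

/** One training sample: the first token is the outcome, the rest are features. */
class MaxentLine {
    final String outcome;     // label to learn, taken from the first token
    final String[] features;  // remaining tokens of the line

    private MaxentLine(String outcome, String[] features) {
        this.outcome = outcome;
        this.features = features;
    }

    /** Parses a single line; rejects lines that have no feature after the outcome. */
    static MaxentLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                    "need an outcome and at least one feature: '" + line + "'");
        }
        return new MaxentLine(tokens[0], Arrays.copyOfRange(tokens, 1, tokens.length));
    }

    public static void main(String[] args) {
        MaxentLine sample = MaxentLine.parse("negative This is bad");
        System.out.println(sample.outcome + " <- " + Arrays.toString(sample.features));
    }
}
```

The length check matches the observation in the question that a usable line needs at least two words: one for the outcome and at least one feature.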
