[Project stack: Java, OpenNLP, Elasticsearch (datastore), twitter4j to read data from Twitter]
I plan to use a maxent classifier to classify tweets. I understand that the first step is to train the model, and from the documentation I found that there is a GISTrainer-based train method for doing so. I have managed to put together a simple piece of code that uses OpenNLP's maxent classifier to train the model and predict the outcome.
I used two files, positive.txt and negative.txt, to train the model.
Contents of positive.txt:
positive This is good
positive This is the best
positive This is fantastic
positive This is super
positive This is fine
positive This is nice
Contents of negative.txt:
negative This is bad
negative This is ugly
negative This is the worst
negative This is worse
negative This sucks
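For reference, the layout in both files is one sample per line: a category label, then whitespace, then the document text. A minimal, self-contained sketch of that split (a hypothetical helper for illustration, not OpenNLP's own parser):

```java
public class SampleLine {
    // Split one training line into {label, text}: the label is everything
    // before the first run of whitespace, the text is the remainder.
    static String[] parse(String line) {
        return line.trim().split("\\s+", 2);
    }

    public static void main(String[] args) {
        String[] s = parse("positive This is good");
        System.out.println(s[0]); // label
        System.out.println(s[1]); // document text
    }
}
```

Whether the separator must be a tab or may be any whitespace is part of what I am asking below.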
The following Java methods produce the result.
@Override
public void trainDataset(String source, String destination) throws Exception {
    // Builds the file list covering both positive.txt and negative.txt
    File[] inputFiles = FileUtil.buildFileList(new File(source));
    File modelFile = new File(destination);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
    int cutoff = 5;
    int iterations = 100;
    BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
    DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff, iterations, bowfg);
    model.serialize(new FileOutputStream(modelFile));
}
@Override
public void predict(String text, String modelFile) {
    InputStream modelStream = null;
    try {
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(text);
        modelStream = new FileInputStream(modelFile);
        DoccatModel model = new DoccatModel(modelStream);
        BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
        DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
        double[] probs = categorizer.categorize(tokens);
        if (probs != null && probs.length > 0) {
            for (int i = 0; i < probs.length; i++) {
                System.out.println("double[] probs index " + i + " value " + probs[i]);
            }
        }
        String label = categorizer.getBestCategory(probs);
        System.out.println("label " + label);
        int bestIndex = categorizer.getIndex(label);
        System.out.println("bestIndex " + bestIndex);
        double score = probs[bestIndex];
        System.out.println("score " + score);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (modelStream != null) {
            try {
                modelStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
public static void main(String[] args) {
    try {
        String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
        String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
        MaximunEntropyClassifier me = new MaximunEntropyClassifier();
        me.trainDataset(source, outputModelPath);
        me.predict("This is bad", outputModelPath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I have the following questions.
1) How do I train the model iteratively? And how do I add new sentences/words to an existing model? Is there a specific format for the data file? I found that each line needs at least two words separated by a tab. Is my understanding valid?
2) Are there any publicly available datasets I can use to train the model? I found some sources for movie reviews, but the project I am working on involves more than movie reviews, for example product reviews and brand sentiment.
3) That would help to some extent. Is there a publicly available working example anywhere? I could not find documentation for maxent.
Please help, I am somewhat stuck here.