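With the OpenNLP doccat API you can create training data and then build a model from that training data. The advantage of this over something like a naive Bayes classifier is that it returns a probability distribution over a set of categories.
So if you create a file in this format: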
customerserviceproblems They did not respond
customerserviceproblems They didn't respond
customerserviceproblems They didn't respond at all
customerserviceproblems They did not respond at all
customerserviceproblems I received no response from the website
customerserviceproblems I did not receive response from the website
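and so on... provide as many samples as you can, and make sure each line ends with a \n newline.
With this approach you can add any phrasing you like that means "customer service problem", and you can add whatever other categories you need, so you do not have to decide too rigidly up front which data falls into which category.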
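If the samples are collected in code rather than typed by hand, the file can be generated. The following is a minimal sketch, assuming only the layout described above (category token, a space, then the text, one sample per line). The class `TrainingFileWriter`, its methods, the output file name, and the `shippingproblems` category are made up for illustration and are not part of OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP
class TrainingFileWriter {

    // Builds one "category text" line per sample. Whitespace runs inside a sample
    // (including line breaks) are collapsed to a single space so that every
    // sample stays on one line; empty samples are skipped.
    static List<String> toTrainingLines(Map<String, List<String>> samplesByCategory) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : samplesByCategory.entrySet()) {
            String category = entry.getKey().trim();
            if (category.isEmpty() || category.matches(".*\\s.*")) {
                throw new IllegalArgumentException(
                        "Category must be a single token: " + entry.getKey());
            }
            for (String sample : entry.getValue()) {
                String text = sample.replaceAll("\\s+", " ").trim();
                if (!text.isEmpty()) {
                    lines.add(category + " " + text);
                }
            }
        }
        return lines;
    }

    // Writes the lines as UTF-8, each one terminated by "\n"
    static void write(Path target, Map<String, List<String>> samplesByCategory)
            throws IOException {
        StringBuilder out = new StringBuilder();
        for (String line : toTrainingLines(samplesByCategory)) {
            out.append(line).append('\n');
        }
        Files.write(target, out.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> samples = new LinkedHashMap<>();
        samples.put("customerserviceproblems", Arrays.asList(
                "They did not respond",
                "I received no response from the website"));
        // Made-up second category, only to show that several can share one file
        samples.put("shippingproblems", Arrays.asList("The package arrived late"));
        write(Paths.get("doccat-train.txt"), samples);
    }
}
```

Rejecting category names that contain whitespace is a choice made here because the first token on each line is read as the category.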
Here is what the Java looks like to build the model:
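    import java.io.*;
    import opennlp.tools.doccat.*;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    // Uses the older OpenNLP 1.x doccat API; later releases changed some of these signatures
    DoccatModel model = null;
    try (InputStream dataIn = new FileInputStream(yourFileOfSamplesLikeAbove)) {
        // One sample per line: category first, then the text
        ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
        model = DocumentCategorizerME.train("en", sampleStream);
        // Close the output stream once the model has been written
        try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
            model.serialize(modelOut);
        }
        System.out.println("Model complete!");
    } catch (IOException e) {
        // Failed to read or parse training data, training failed
        e.printStackTrace();
    }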
Once you have the model, you can use it like this:
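    import java.io.File;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizerME;

    // The class name is only for illustration; any class can hold the categorizer
    public class CategoryScorer {
        private final DocumentCategorizerME documentCategorizerME;

        public CategoryScorer(String pathToModelYouJustMade) throws IOException {
            // Load the serialized model written by the training step
            DoccatModel doccatModel = new DoccatModel(new File(pathToModelYouJustMade));
            documentCategorizerME = new DocumentCategorizerME(doccatModel);
        }

        /**
         * Returns a map from each category to its score for the given text.
         *
         * @param text the text to categorize
         * @return the score of every category in the model
         */
        public Map<String, Double> getScore(String text) {
            Map<String, Double> scoreMap = new HashMap<>();
            // One score per category, in the categorizer's own category order
            double[] categorize = documentCategorizerME.categorize(text);
            int catSize = documentCategorizerME.getNumberOfCategories();
            for (int i = 0; i < catSize; i++) {
                String category = documentCategorizerME.getCategory(i);
                scoreMap.put(category, categorize[documentCategorizerME.getIndex(category)]);
            }
            return scoreMap;
        }
    }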
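The returned map then holds every category you modeled along with a score, and you can use the scores to decide which category the input text belongs to.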
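One way to turn that map into a decision is to take the highest-scoring category and fall back to "uncertain" when even the best score is low. This is a minimal sketch under assumptions: the class `BestCategory`, the method `pick`, the `minScore` cutoff, and the sample scores are all hypothetical and not part of OpenNLP, and a suitable cutoff would depend on your data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of OpenNLP
class BestCategory {

    // Returns the highest-scoring category, or an empty Optional when the map
    // is empty or the best score is below minScore (an assumed confidence cutoff).
    static Optional<String> pick(Map<String, Double> scoreMap, double minScore) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scoreMap.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return (best != null && bestScore >= minScore)
                ? Optional.of(best)
                : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up scores standing in for the map returned by getScore(text)
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("customerserviceproblems", 0.82);
        scores.put("shippingproblems", 0.18);
        System.out.println(pick(scores, 0.5).orElse("uncertain"));
    }
}
```

Because the scores form a distribution over the categories, a cutoff like this lets the caller treat near-ties as undecided instead of forcing a label.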