I need to build a POS tagger in Java and need to know how to get started. Are there code examples or other resources that help illustrate how POS taggers work?
3 Answers
Try Apache OpenNLP. It includes a POS tagger tool. Ready-to-use English models are available for download from the OpenNLP project.
The documentation provides details about how to use it from a Java application. Basically you need the following:
Load the POS model
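import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Declared outside the try block so it is still in scope when the tagger is created.
POSModel model = null;
// try-with-resources closes the stream even if loading fails.
try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
    model = new POSModel(modelIn);
} catch (IOException e) {
    // Model loading failed, handle the error
    e.printStackTrace();
}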
Instantiate the POS tagger
POSTaggerME tagger = new POSTaggerME(model);
Execute it
String sent[] = new String[]{"Most", "large", "cities", "in", "the", "US", "had", "morning", "and", "afternoon", "newspapers", "."};
String tags[] = tagger.tag(sent);
Note that the POS tagger expects a tokenized sentence. Apache OpenNLP also provides tools and models for sentence detection and tokenization.
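To make "tokenized sentence" concrete, here is a minimal sketch of a tokenizer that uses only the JDK. `NaiveTokenizer` is a hypothetical helper, not an OpenNLP class, and its regex is an assumption made for illustration: it treats runs of letters or digits as words and every other non-space character as its own token. It will not match the tokenization the pretrained models were built with in every case (contractions, for instance), so for real use a tokenizer from OpenNLP itself is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class NaiveTokenizer {
    // A token is a run of letters/digits, or a single non-space symbol such as "." or ",".
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] sent = tokenize(
                "Most large cities in the US had morning and afternoon newspapers.");
        // One token per array element, with the final period split off.
        System.out.println(String.join(" | ", sent));
    }
}
```

The resulting array has the same shape as the hand-written `sent` array above, so it could be passed to `tagger.tag(...)` in the same way.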
If you need to train your own model, refer to the OpenNLP documentation.
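For a feel of what training data looks like: as far as I understand the OpenNLP manual, the tagger's default training format is one sentence per line, with each token written as `word_TAG`. The sketch below only parses such a line into parallel token and tag arrays using the JDK. `TaggedLine` is a hypothetical helper name, and the sample sentence is just illustrative data, not something taken from this thread.

```java
import java.util.Arrays;

// Hypothetical helper (not an OpenNLP class): parses one line of word_TAG data.
final class TaggedLine {
    final String[] tokens;
    final String[] tags;

    private TaggedLine(String[] tokens, String[] tags) {
        this.tokens = tokens;
        this.tags = tags;
    }

    static TaggedLine parse(String line) {
        String[] pairs = line.trim().split("\\s+");
        String[] tokens = new String[pairs.length];
        String[] tags = new String[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            // Split on the LAST underscore so a token that contains '_' stays intact.
            int sep = pairs[i].lastIndexOf('_');
            if (sep <= 0 || sep == pairs[i].length() - 1) {
                throw new IllegalArgumentException("Expected word_TAG but got: " + pairs[i]);
            }
            tokens[i] = pairs[i].substring(0, sep);
            tags[i] = pairs[i].substring(sep + 1);
        }
        return new TaggedLine(tokens, tags);
    }

    public static void main(String[] args) {
        TaggedLine line = parse("About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.");
        System.out.println(Arrays.toString(line.tokens));
        System.out.println(Arrays.toString(line.tags));
    }
}
```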
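You can study existing tagger implementations.
For example, see the Stanford POS Tagger in Java (written by Kristina Toutanova). It is available under the GNU General Public License (v2 or later), and the source code is well written and clearly documented: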
http://nlp.stanford.edu/software/tagger.shtml
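A good book on tagging is Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin.
There are a few POS/NER taggers that are widely used.
OpenNLP Maxent POS tagger: using Apache OpenNLP.
OpenNLP is a powerful Java NLP library from Apache. It provides a variety of NLP tools, one of which is a part-of-speech (POS) tagger. A POS tagger is typically used to identify the grammatical structure of a text. You start from a tagged dataset in which every word (each part of a phrase) carries a tag, build an NLP model from that dataset, and then use the model to generate a tag for each word in new text.
Sample code: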
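import java.util.Arrays;
import java.util.List;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public void doTagging(POSModel model, String input) {
    // Naive whitespace tokenization; split once so repeated spaces do not create empty tokens.
    String[] tokens = input.trim().split("\\s+");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several candidate tag sequences, not only the best one.
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " =>" + tags);
    }
}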
A detailed blog post with complete code showing how to use it:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php?s=so
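Since the question also asks how POS taggers work, here is a toy sketch of the "learn from a tagged dataset, then tag new text" idea described above. It is deliberately not what OpenNLP does: `POSTaggerME` uses a statistical sequence model, while this hypothetical `MostFrequentTagTagger` just remembers the most frequent tag seen for each word and falls back to a default tag (assumed here to be `NN`) for unknown words. The training sentences are made-up data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative baseline only; class name and fallback choice are assumptions.
final class MostFrequentTagTagger {
    // word -> (tag -> how often that tag was seen for the word)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final String fallbackTag;

    MostFrequentTagTagger(String fallbackTag) {
        this.fallbackTag = fallbackTag;
    }

    // "Training" is just counting tags per lower-cased word.
    void train(String[] tokens, String[] tags) {
        for (int i = 0; i < tokens.length; i++) {
            counts.computeIfAbsent(tokens[i].toLowerCase(Locale.ROOT), k -> new HashMap<>())
                  .merge(tags[i], 1, Integer::sum);
        }
    }

    // Tagging picks the most frequent tag per word, ignoring context entirely.
    String[] tag(String[] tokens) {
        String[] result = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Map<String, Integer> seen = counts.get(tokens[i].toLowerCase(Locale.ROOT));
            result[i] = (seen == null) ? fallbackTag : mostFrequent(seen);
        }
        return result;
    }

    private static String mostFrequent(Map<String, Integer> tagCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MostFrequentTagTagger tagger = new MostFrequentTagTagger("NN");
        tagger.train(new String[]{"the", "dog", "barks"}, new String[]{"DT", "NN", "VBZ"});
        tagger.train(new String[]{"the", "run", "ended"}, new String[]{"DT", "NN", "VBD"});
        tagger.train(new String[]{"dogs", "run"}, new String[]{"NNS", "VBP"});
        tagger.train(new String[]{"they", "run"}, new String[]{"PRP", "VBP"});
        // "fast" was never seen, so it gets the fallback tag.
        System.out.println(Arrays.toString(tagger.tag(new String[]{"The", "dogs", "run", "fast"})));
        // prints [DT, NNS, VBP, NN]
    }
}
```

A baseline like this cannot disambiguate by context ("run" always gets the same tag), which is exactly the gap that trained sequence models such as the OpenNLP and Stanford taggers are meant to close.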
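Stanford CoreNLP based NER tagger:
Stanford CoreNLP is by far the most battle-tested NLP library. In a way, it is the gold standard of NLP performance today. Among its many features, the library supports named entity recognition (NER), which lets you tag important entities in a piece of text, such as person names and places.
Sample code: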
public void doTagging(CRFClassifier model, String input) {
input = input.trim();
System.out.println(input + "=>" + model.classifyToString(input));
}
A detailed blog post with complete code showing how to use it:
https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so