java - Build a Part-of-Speech Tagger (POS Tagger)

Question

I need to build a POS tagger in Java and need to know how to get started. Are there code examples or other resources that help illustrate how POS taggers work?

score 6 · Accepted Answer

Try Apache OpenNLP. It includes a POS tagger tool, and ready-to-use English models can be downloaded from the OpenNLP website.

The documentation provides details about how to use it from a Java application. Basically you need the following:

Load the POS model

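import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;

// Declared outside the try block so the tagger can use it afterwards
POSModel model = null;

// try-with-resources closes the stream even if loading fails
try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
  model = new POSModel(modelIn);
}
catch (IOException e) {
  // Model loading failed, handle the error
  e.printStackTrace();
}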

Instantiate the POS tagger

POSTaggerME tagger = new POSTaggerME(model);

Execute it

String[] sent = new String[]{"Most", "large", "cities", "in", "the", "US", "had", "morning", "and", "afternoon", "newspapers", "."};
String[] tags = tagger.tag(sent);
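To see which tag went with which word, it helps to print the two arrays side by side. The following is a minimal sketch in plain Java with no OpenNLP dependency. `TagFormatter` and `format` are hypothetical names introduced for illustration, and the sketch assumes the tag array is parallel to the token array (one tag per token, in the same order).

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of OpenNLP.
class TagFormatter {
    // Joins parallel token and tag arrays into space-separated "token/TAG" pairs.
    static String format(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i] + "/" + tags[i]);
        }
        return out.toString();
    }
}
```

With the arrays from the snippet above, this would be called as `System.out.println(TagFormatter.format(sent, tags));`.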

Note that the POS tagger expects a tokenized sentence. Apache OpenNLP also provides tools and models for sentence detection and tokenization.
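To illustrate what "tokenized" means here, the sketch below splits a raw sentence into the kind of `String[]` the tagger takes. It is a rough stand-in written with a plain regular expression, not OpenNLP's tokenizer; `SimpleTokenizer` is a hypothetical name, and the rule it applies (runs of letters or digits are one token, every other non-space character is its own token) is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a real tokenizer, for illustration only.
class SimpleTokenizer {
    // A token is a run of letters/digits, or a single non-space symbol.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }
}
```

A rule this simple breaks contractions and abbreviations apart ("don't" becomes three tokens), which is one reason a trained tokenizer model is usually the better choice for real text.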

If you have to train your own model, refer to the training section of the OpenNLP documentation.

score 5 · Accepted Answer

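You can examine existing tagger implementations.

For example, see the Stanford POS tagger in Java (written by Kristina Toutanova). It is available under the GNU General Public License (v2 or later), and its source code is well written and clearly documented:

http://nlp.stanford.edu/software/tagger.shtml

A good book on tagging is Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin.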

score 2 · Accepted Answer

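Several POS/NER taggers are in wide use.

OpenNLP Maxent POS tagger: uses Apache OpenNLP.

OpenNLP is a powerful Java NLP library from Apache. It provides a variety of NLP tools, one of which is a part-of-speech (POS) tagger. A POS tagger is typically used to uncover the grammatical structure of a text. You start from a tagged dataset in which every word is labeled with its tag, build a model from that dataset, and then use the model to generate a tag for each word in new text.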
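The train-then-tag workflow can be sketched with a deliberately simple baseline: remember the most frequent tag seen for each word, and fall back to the overall most frequent tag for unknown words. This is not the maximum-entropy model OpenNLP uses, only an illustration of the idea. `BaselineTagger` is a hypothetical class, and the `word_TAG` one-sentence-per-line input layout is an assumption made for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical most-frequent-tag baseline, for illustration only.
class BaselineTagger {
    // word -> (tag -> how often that word carried that tag in training)
    private final Map<String, Map<String, Integer>> wordTagCounts = new HashMap<>();
    // tag -> how often the tag occurred overall
    private final Map<String, Integer> tagTotals = new HashMap<>();

    // Assumed format: one sentence per line, whitespace-separated word_TAG pairs.
    void train(List<String> lines) {
        for (String line : lines) {
            for (String pair : line.trim().split("\\s+")) {
                int sep = pair.lastIndexOf('_');
                if (sep <= 0 || sep == pair.length() - 1) {
                    continue; // skip empty or malformed pairs
                }
                String word = pair.substring(0, sep);
                String tag = pair.substring(sep + 1);
                wordTagCounts.computeIfAbsent(word, k -> new HashMap<>())
                             .merge(tag, 1, Integer::sum);
                tagTotals.merge(tag, 1, Integer::sum);
            }
        }
    }

    // Returns one tag per token; entries are null if nothing was trained.
    String[] tag(String[] tokens) {
        String fallback = mostFrequent(tagTotals);
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Map<String, Integer> seen = wordTagCounts.get(tokens[i]);
            tags[i] = (seen != null) ? mostFrequent(seen) : fallback;
        }
        return tags;
    }

    // Picks the key with the highest count; ties are broken arbitrarily.
    private static String mostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

A baseline like this ignores context entirely, so it always gives an ambiguous word such as "run" the same tag. Statistical taggers improve on it by also looking at neighboring words and tags.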

Sample code:

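// Requires java.util.Arrays, java.util.List, opennlp.tools.postag.POSModel,
// opennlp.tools.postag.POSTaggerME and opennlp.tools.util.Sequence.
public void doTagging(POSModel model, String input) {
    // Naive whitespace tokenization; split once and reuse the tokens
    String[] tokens = input.trim().split("\\s+");
    POSTaggerME tagger = new POSTaggerME(model);
    // Each Sequence is one candidate tagging of the whole sentence
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " => " + tags);
    }
}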

A detailed blog post with complete code showing how to use it:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php?s=so

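Stanford CoreNLP-based NER tagger:

Stanford CoreNLP is by far the most battle-tested NLP library. In a way, it is the gold standard of NLP performance today. Among its many other features, the library supports Named Entity Recognition (NER), which lets you tag important entities in a piece of text, such as person names and places.

Sample code: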

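// Requires edu.stanford.nlp.ie.crf.CRFClassifier and edu.stanford.nlp.ling.CoreLabel.
public void doTagging(CRFClassifier<CoreLabel> model, String input) {
  input = input.trim();
  // classifyToString returns the input text annotated with entity labels
  System.out.println(input + " => " + model.classifyToString(input));
}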

A detailed blog post with complete code showing how to use it:

https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so
