3

I need to build a POS tagger in Java and need to know how to get started. Are there code examples or other resources that help illustrate how POS taggers work?

4

3 Answers

6

Try Apache OpenNLP. It includes a POS tagger tool, and you can download ready-to-use English models from here.

The documentation provides details about how to use it from a Java application. Basically you need the following:

Load the POS model

// Requires: java.io.FileInputStream, java.io.IOException, java.io.InputStream,
// opennlp.tools.postag.POSModel
// Declared outside the try block so the tagger below can use it
POSModel model = null;

// try-with-resources closes the stream automatically
try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
  model = new POSModel(modelIn);
}
catch (IOException e) {
  // Model loading failed, handle the error
  e.printStackTrace();
}

Instantiate the POS tagger

POSTaggerME tagger = new POSTaggerME(model);

Execute it

String[] sent = new String[]{"Most", "large", "cities", "in", "the", "US", "had", "morning", "and", "afternoon", "newspapers", "."};
String[] tags = tagger.tag(sent);
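The returned array has one tag per input token, in the same order. A minimal sketch of printing them side by side follows; `TagFormatter` and `pair` are hypothetical helper names (not part of OpenNLP), and the tags in `main` are placeholders for illustration, since the real ones come from `tagger.tag(sent)`.

```java
import java.util.StringJoiner;

final class TagFormatter {
    /** Joins tokens and tags as "token/TAG" pairs separated by spaces. */
    static String pair(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i] + "/" + tags[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] sent = {"Most", "large", "cities"};
        // Placeholder tags for illustration only; real tags come from tagger.tag(sent).
        String[] tags = {"JJS", "JJ", "NNS"};
        System.out.println(pair(sent, tags)); // prints: Most/JJS large/JJ cities/NNS
    }
}
```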

Note that the POS tagger expects a tokenized sentence. Apache OpenNLP also provides tools and models for sentence detection and tokenization.
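To show what "tokenized" means here, the sketch below splits a sentence into words and punctuation using only a regular expression. `NaiveTokenizer` is a hypothetical stand-in, not OpenNLP's tokenizer; a trained tokenizer model will generally handle abbreviations and contractions better, so treat this as an illustration of the expected input shape rather than a replacement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveTokenizer {
    // A token is a run of letters/digits (optionally joined by ' or -),
    // or else any single non-whitespace character such as punctuation.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*|\\S");

    static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] sent = tokenize("Most large cities in the US had morning and afternoon newspapers.");
        System.out.println(String.join(" | ", sent));
    }
}
```

The resulting array can be passed to the tagger in place of the hand-written `sent` array.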

If you have to train your own model, refer to this documentation.

Answered 2011-08-17T10:52:17.830
5

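You can study existing tagger implementations.

For example, see the Stanford POS Tagger in Java (written by Kristina Toutanova). It is available under the GNU General Public License (v2 or later), and the source code is well written and clearly documented: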

http://nlp.stanford.edu/software/tagger.shtml

A good book on tagging is Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin.

Answered 2011-08-17T08:04:43.200
2

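Several POS/NER taggers are widely used.

OpenNLP Maxent POS tagger: using Apache OpenNLP.

OpenNLP is a powerful Java NLP library from Apache. It provides various NLP tools, one of which is a part-of-speech (POS) tagger. A POS tagger is typically used to uncover the grammatical structure of a text. You start from a tagged dataset in which every word carries a label, build a model from that dataset, and then use the model to generate a tag for each word in new text.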
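To make the train-then-tag workflow concrete without any library, here is a minimal sketch of a most-frequent-tag baseline. This is not how OpenNLP's maxent model works; it is a much simpler technique that only remembers which tag each word had most often in training. `BaselineTagger`, the fallback tag, and the toy training sentences are all assumptions made up for the illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BaselineTagger {
    // word -> (tag -> count), gathered from the labelled training data
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final String fallbackTag;

    BaselineTagger(String fallbackTag) {
        this.fallbackTag = fallbackTag;
    }

    /** Training: record how often each word appears with each tag. */
    void train(String[] words, String[] tags) {
        for (int i = 0; i < words.length; i++) {
            counts.computeIfAbsent(words[i].toLowerCase(Locale.ROOT), k -> new HashMap<>())
                  .merge(tags[i], 1, Integer::sum);
        }
    }

    /** Tagging: use each word's most frequent training tag, or the fallback if unseen. */
    String[] tag(String[] words) {
        String[] out = new String[words.length];
        for (int i = 0; i < words.length; i++) {
            out[i] = fallbackTag;
            Map<String, Integer> seen = counts.get(words[i].toLowerCase(Locale.ROOT));
            if (seen == null) continue;
            int best = 0;
            for (Map.Entry<String, Integer> e : seen.entrySet()) {
                if (e.getValue() > best) {
                    best = e.getValue();
                    out[i] = e.getKey();
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BaselineTagger tagger = new BaselineTagger("NN");
        // Toy training data invented for this example
        tagger.train(new String[]{"the", "dog", "barks"}, new String[]{"DT", "NN", "VBZ"});
        tagger.train(new String[]{"the", "cat", "sleeps"}, new String[]{"DT", "NN", "VBZ"});
        System.out.println(String.join(" ", tagger.tag(new String[]{"The", "cat", "barks", "loudly"})));
    }
}
```

A statistical tagger such as a maxent model improves on this by also looking at context (neighbouring words and tags), which is what lets it handle ambiguous and unseen words.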

Sample code:

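// Requires: java.util.Arrays, java.util.List, opennlp.tools.postag.POSModel,
// opennlp.tools.postag.POSTaggerME, opennlp.tools.util.Sequence
public void doTagging(POSModel model, String input) {
    // Naive whitespace tokenization, done once and reused below
    String[] tokens = input.trim().split("\\s+");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several candidate tag sequences for the sentence
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " => " + tags);
    }
}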

A detailed blog post with complete code showing how to use it:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php?s=so

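NER tagger based on Stanford CoreNLP:

Stanford CoreNLP is by far the most battle-tested NLP library. In some ways it is the gold standard for NLP performance today. Among its other features, the library supports named entity recognition (NER), which marks important entities in a piece of text, such as names of people and places.

Sample code: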

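// Requires: edu.stanford.nlp.ie.crf.CRFClassifier, edu.stanford.nlp.ling.CoreLabel
public void doTagging(CRFClassifier<CoreLabel> model, String input) {
  input = input.trim();
  // classifyToString returns the input text with entity labels attached
  System.out.println(input + " => " + model.classifyToString(input));
}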
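If the classifier's string output is in an inline `word/LABEL` form, with `O` marking non-entities, the entities can be pulled out with plain string handling. That output format is an assumption here and should be checked against what `classifyToString` actually prints in your setup. `SlashTagParser` is a hypothetical helper, and the sample string in `main` is hand-written rather than real classifier output.

```java
import java.util.ArrayList;
import java.util.List;

final class SlashTagParser {
    /** Groups consecutive tokens sharing a non-"O" label into "LABEL: text" entries. */
    static List<String> entities(String tagged) {
        List<String> result = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        String current = "O";
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash < 0) continue; // skip anything not in word/LABEL form
            String word = item.substring(0, slash);
            String label = item.substring(slash + 1);
            if (!label.equals(current)) {
                // Label changed: flush the entity collected so far, if any
                if (!current.equals("O")) result.add(current + ": " + text);
                text.setLength(0);
                current = label;
            }
            if (!label.equals("O")) {
                if (text.length() > 0) text.append(' ');
                text.append(word);
            }
        }
        if (!current.equals("O")) result.add(current + ": " + text);
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed format, not real classifier output
        String tagged = "Joe/PERSON Smith/PERSON lives/O in/O California/LOCATION ./O";
        System.out.println(entities(tagged)); // prints: [PERSON: Joe Smith, LOCATION: California]
    }
}
```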

A detailed blog post with complete code showing how to use it:

https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so

Answered 2018-03-23T07:00:12.677