bin - Training a part-of-speech tagger in OpenNLP

Question

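I am trying to train an OpenNLP POS tagger that tags the words in a sentence according to my own vocabulary. For example:

After normal POS tagging:

Sentence: NodeManager/NNP failed/VBD to/TO start/VB the/DT server/NN

After using my POS tagging model: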

Sentence: NodeManager/AGENT failed/OTHER to/OTHER start/OTHER the/OTHER server/OBJECT

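Here AGENT, OTHER, and OBJECT are tags that I defined.

So basically I am defining my own tag dictionary, and I want the POS tagger to use my model.

I checked the Apache documentation on how to do this.

I found the code below: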

POSModel model = null;

InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-pos.train");
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

  model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch(IOException e)
{
   e.printStackTrace();
}
finally {
  if (dataIn != null) {
    try {
      dataIn.close();
    }
    catch (IOException e) {
      // Not an issue, training already finished.
      // The exception should be logged and investigated
      // if part of a production system.
      e.printStackTrace();
    }
  }
}

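Here they open a FileInputStream on en-pos.train. I guess en-pos.train is a .bin file like all the files they used earlier, only customized. Can someone tell me how to get its .bin file?

Or where is en-pos.train? What exactly is it, and how do I create it?

I extracted the bin file they normally use, en-pos-maxent.bin. It contains an XML file in which the tag dictionary is defined, a model file, and a properties file. I have changed them as needed, but my problem is generating the .bin file from those contents.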

score 1 · Accepted Answer

http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.postagger.training.tool

Take a look here. You can create your bin file directly with the opennlp command-line application; the command is given on that page.
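For context, the en-pos.train file in the question's code is not a .bin file. It is plain-text training data: one sentence per line, with each token written as word_tag, which is the format WordTagSampleStream parses. Below is a minimal sketch of producing such a file with the custom tags from the question. The class name TrainingLines and its toLine method are hypothetical helpers for illustration, not part of OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): builds lines in the
// word_tag training format that WordTagSampleStream reads.
final class TrainingLines {

    // Joins parallel token and tag arrays into one training line,
    // separating word and tag with '_' and tokens with a single space.
    static String toLine(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                line.append(' ');
            }
            line.append(tokens[i]).append('_').append(tags[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(toLine(
            new String[] {"NodeManager", "failed", "to", "start", "the", "server"},
            new String[] {"AGENT", "OTHER", "OTHER", "OTHER", "OTHER", "OBJECT"}));

        // One sentence per line, written as UTF-8 to match the encoding
        // passed to PlainTextByLineStream in the question's code.
        Path out = Paths.get("en-pos.train");
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}
```

A single sentence is only enough to show the format; a usable model needs many tagged sentences. The resulting text file is what the training code or the command-line trainer reads, and the .bin model is the output of that training step.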

score 0 · Accepted Answer

It is quite simple to do:

After training your own model, dump it to a file (name it whatever you like):

public void writeToFile(POSModel model, String modelOutpath) {
    try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutpath))) {
        model.serialize(modelOut);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}

Then load the file as shown below:

public POSModel getModel(String modelPath) {
    // try-with-resources closes the stream whether or not loading succeeds.
    try (InputStream modelIn = new FileInputStream(modelPath)) {
        return new POSModel(modelIn);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    // Loading failed; callers must check for null.
    return null;
}

Now you can use the loaded model to do the tagging.

public void doTagging(POSModel model, String input) {
    input = input.trim();
    POSTaggerME tagger = new POSTaggerME(model);
    Sequence[] sequences = tagger.topKSequences(input.split(" "));
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(input.split(" ")) + " =>" + tags);
    }
}
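The method above prints the token list and the tag list side by side. To render a result in the word/TAG form used in the question, the two can be zipped together. This is a minimal sketch; TaggedSentence and format are hypothetical names, not part of OpenNLP, and the tags in main are hard-coded stand-ins for what s.getOutcomes() would return from a model trained on the custom tag set.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): pairs each token with its tag.
final class TaggedSentence {

    // Produces "word/TAG word/TAG ..." from parallel tokens and tags.
    static String format(String[] tokens, List<String> tags) {
        if (tokens.length != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(tokens[i]).append('/').append(tags.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "NodeManager failed to start the server".split(" ");
        // Stand-in tags for illustration; a real run would take them
        // from the tagger's output.
        List<String> tags = Arrays.asList(
            "AGENT", "OTHER", "OTHER", "OTHER", "OTHER", "OBJECT");
        System.out.println(format(tokens, tags));
    }
}
```

Inside doTagging, the same call would be format(input.split(" "), s.getOutcomes()) for each returned sequence.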

Here is a detailed tutorial with complete code on how to train and use your own OpenNLP-based POS tagger:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
