file - Training your own model in OpenNLP

Question

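I am finding it difficult to create my own model for OpenNLP. Can anyone tell me how to build one and how the training should be done?

What should the input be, and where will the output model file be stored?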

score 9 · Accepted Answer

https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html

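This website is very useful: for every model type, such as entity extraction and part-of-speech tagging, it shows how to train both from code and with the OpenNLP command-line application.

I could give you some code examples here, but the page is very clear and easy to follow.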

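In theory:

Essentially, you create a file that lists the things you want to train on.

For example:

Sport [whitespace] this is a page about football, rugby and stuff

Politics [whitespace] this is a page about Tony Blair being Prime Minister.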

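The format is described on the page above (each model type expects a different format). Once you have created this file, you run it through either the API or the opennlp application (via the command line), and it generates a .bin file. Once you have this .bin file, you can load it into a model and start using it (as per the API described on the website above).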
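As a rough illustration of the "category [whitespace] text" layout described above, here is a minimal sketch that writes such a file using only the Java standard library. The class name `CategoryTrainingFile` and the output file name `categories.train` are made up for this example and are not part of OpenNLP; check the manual for the exact format your model type expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: builds a "category text" training file.
final class CategoryTrainingFile {

    // One training line: the category label, one space, then the text on a single line.
    static String formatLine(String category, String text) {
        if (category.isEmpty() || category.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("category must be one non-empty token: " + category);
        }
        // Collapse line breaks and repeated whitespace so each sample stays on one line.
        return category + " " + text.trim().replaceAll("\\s+", " ");
    }

    // Writes one line per {category, text} pair to the given path as UTF-8.
    static void write(Path target, List<String[]> samples) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] sample : samples) {
            lines.add(formatLine(sample[0], sample[1]));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String[]> samples = new ArrayList<>();
        samples.add(new String[] {"Sport", "this is a page about football, rugby and stuff"});
        samples.add(new String[] {"Politics", "this is a page about Tony Blair being Prime Minister."});
        // "categories.train" is an assumed file name; any path works.
        write(Paths.get("categories.train"), samples);
    }
}
```

The resulting file is what you would then hand to the API or the command-line trainer to produce the .bin model.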

score 5 · Accepted Answer

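First, you need training data annotated with the entities you want.

Sentences should be separated by a newline character (\n). Values should be separated from the tags by a space character.
Suppose you want to create a medicine entity model; the data should then look something like this: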

<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that contains two medicines - <START:medicine> amoxicillin trihydrate <END> and <START:medicine> potassium clavulanate <END> .
They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections.

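For example, you can refer to the sample dataset. The training data should contain at least 15000 sentences to get good results.

Further, you can use the OpenNLP TokenNameFinderTrainer. The output file will be in .bin format.

Here is an example: Writing a custom NameFinder model in OpenNLP

For more details, refer to the OpenNLP documentation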
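If your sentences are already tokenized and you know where an entity starts and ends, the tagged line can be generated rather than typed by hand. The following is a minimal sketch using only the Java standard library; the class `NameSampleLine` and its `annotate` method are hypothetical names for this example, and it handles just one entity span per call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: marks one entity span in a tokenized sentence.
final class NameSampleLine {

    // Wraps tokens[start, end) in <START:type> ... <END>, joining everything with single spaces.
    static String annotate(String[] tokens, int start, int end, String type) {
        if (start < 0 || end > tokens.length || start >= end) {
            throw new IllegalArgumentException("invalid span [" + start + ", " + end + ")");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) {
                out.add("<START:" + type + ">");
            }
            out.add(tokens[i]);
            if (i == end - 1) {
                out.add("<END>");
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String[] tokens = "Augmentin-Duo is a penicillin antibiotic .".split(" ");
        // prints: <START:medicine> Augmentin-Duo <END> is a penicillin antibiotic .
        System.out.println(annotate(tokens, 0, 1, "medicine"));
    }
}
```

Joining with single spaces keeps every tag separated from its neighbouring tokens, which is the rule stated above.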

score 2 · Accepted Answer

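Perhaps this article will help you. It describes how to do TokenNameFinder training from data extracted from Wikipedia...

nuxeo - blog - Mining Wikipedia with Hadoop and Pig for Natural Language Processing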

score 1 · Accepted Answer

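Copy the data into data.txt (the training file the code reads) and run the code below to get your own mymodel.bin.

For sample data, you can refer to: https://github.com/mccraigmccraig/opennlp/blob/master/src/test/resources/opennlp/tools/namefind/AnnotatedSentencesWithTypes.txt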

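import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Written against the OpenNLP 1.5.x API; later releases changed these signatures.
public class Training {
    // where the trained model is written
    static String onlpModelPath = "mymodel.bin";
    // training data set
    static String trainingDataFilePath = "data.txt";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        // read the training file line by line, one annotated sentence per line
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainingDataFilePath), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model = null;
        try {
            // alternative call from the original answer (extra arguments 100 and 4):
            // model = NameFinderME.train("en", "drugs", sampleStream, Collections.<String, Object>emptyMap(), 100, 4);
            model = NameFinderME.train("en", "drugs", sampleStream,
                    Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }
        // write the trained model to mymodel.bin
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}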
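
Before running training code like the above, it may be worth scanning data.txt for malformed tags, for instance an `<END>` glued to a following period or a `<START:...>` that is never closed. This is only a sketch of such a check under my own assumptions about what counts as malformed; `TagCheck` and `problemIn` are hypothetical names, and the check uses nothing from OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical pre-training check, not part of OpenNLP.
final class TagCheck {

    // Returns null if the line's tags look well formed, otherwise a short problem description.
    static String problemIn(String line) {
        boolean open = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START")) {
                if (!token.endsWith(">")) {
                    return "tag not separated from text: " + token;
                }
                if (open) {
                    return "nested <START> tag";
                }
                open = true;
            } else if (token.equals("<END>")) {
                if (!open) {
                    return "<END> without <START>";
                }
                open = false;
            } else if (token.contains("<START") || token.contains("<END>")) {
                return "tag not separated from text: " + token;
            }
        }
        return open ? "missing <END>" : null;
    }

    public static void main(String[] args) throws IOException {
        // "data.txt" matches the training file name used in the answer above.
        List<String> lines = Files.readAllLines(Paths.get("data.txt"), StandardCharsets.UTF_8);
        for (int i = 0; i < lines.size(); i++) {
            String problem = problemIn(lines.get(i));
            if (problem != null) {
                System.out.println("line " + (i + 1) + ": " + problem);
            }
        }
    }
}
```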
