hadoop - Sentence detection using OpenNLP on Hadoop

Question

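I want to do sentence detection using OpenNLP and Hadoop. I have already implemented this successfully in plain Java and now want to implement the same thing on the MapReduce platform. Can anyone help me?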

score 1 · Accepted Answer

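I have done this two different ways. One way is to push your sentence detection model out to a standard directory on each node (i.e. /opt/opennlpmodels/), read the serialized model at the class level in your mapper class, and then use it as appropriate in your map or reduce function.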

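Another way is to put the model in a database or the distributed cache (as a blob or something... I have used Accumulo to store document categorization models before). Then, at the class level, make the connection to the database and get the model as a ByteArrayInputStream.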
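A minimal sketch of the "model as bytes" idea, using only the JDK so it runs without a cluster. `LazyModel` and its `Loader` callback are hypothetical names made up for this illustration; how the bytes are fetched (database, cache, anything else) is left out on purpose. The point is only that the bytes are wrapped in a ByteArrayInputStream and turned into a model once, then reused.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Builds a model of type T from raw bytes the first time it is requested, then reuses it. */
final class LazyModel<T> {

  /** Hypothetical callback that turns a stream into a model object. */
  interface Loader<T> {
    T load(InputStream in) throws IOException;
  }

  private final byte[] modelBytes;
  private final Loader<T> loader;
  private T model;

  LazyModel(byte[] modelBytes, Loader<T> loader) {
    this.modelBytes = modelBytes;
    this.loader = loader;
  }

  /** Returns the cached model, deserializing it from the byte array on the first call only. */
  synchronized T get() throws IOException {
    if (model == null) {
      try (InputStream in = new ByteArrayInputStream(modelBytes)) {
        model = loader.load(in);
      }
    }
    return model;
  }
}
```

With OpenNLP on the classpath, the loader would presumably be something like `in -> new SentenceModel(in)`, and the resulting model would be passed to `new SentenceDetectorME(...)` as in the example further down.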

I used Puppet to push out the models, but use whatever you normally use to keep files up to date on your cluster.

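Depending on your Hadoop version, you may be able to sneak the model in as a property during job setup, so that only the master (or wherever you launch jobs from) needs to have the actual model file on it. I have never tried this.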
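An untested sketch of what "model as a property" could look like: Base64-encode the model bytes into a string on the launching machine and decode them again inside the task. `java.util.Properties` stands in for the job configuration here so the snippet runs on its own; in a real job the same strings would presumably go through Hadoop's `Configuration.set(String, String)` and `Configuration.get(String)`. The class name and property key are made up for illustration, and a large model will make the job configuration correspondingly large.

```java
import java.util.Base64;
import java.util.Properties;

/** Round-trips model bytes through a string property (stand-in for a job configuration). */
final class ModelAsProperty {

  // Hypothetical property key, not a Hadoop or OpenNLP setting.
  static final String KEY = "example.sentence.model.base64";

  /** Driver side: encode the model bytes and store them under KEY. */
  static void store(Properties conf, byte[] modelBytes) {
    conf.setProperty(KEY, Base64.getEncoder().encodeToString(modelBytes));
  }

  /** Task side: read the property back and decode it to the original bytes. */
  static byte[] load(Properties conf) {
    String encoded = conf.getProperty(KEY);
    if (encoded == null) {
      throw new IllegalStateException("model property not set: " + KEY);
    }
    return Base64.getDecoder().decode(encoded);
  }
}
```

The decoded bytes can then be wrapped in a ByteArrayInputStream and loaded the same way as a model fetched from a database.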

If you need to know how to actually use the OpenNLP sentence detector, let me know and I will post an example. HTH

import java.io.File;
import java.io.FileInputStream;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetection {

  SentenceDetector sd;

  public Span[] getSentences(String docTextFromMapFunction) throws Exception {

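    // Load the model once and reuse the detector on later calls.
    if (sd == null) {
      // try-with-resources closes the file once the model has been read
      try (FileInputStream modelIn = new FileInputStream(new File("/standardized-on-each-node/path/to/en-sent.zip"))) {
        sd = new SentenceDetectorME(new SentenceModel(modelIn));
      }
    }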
    /**
     * this gives you the actual sentences as a string array
     */
    // String[] sentences = sd.sentDetect(docTextFromMapFunction);
    /**
     * this gives you the spans (the charindexes to the start and end of each
     * sentence in the doc)
     *
     */
    Span[] sentenceSpans = sd.sentPosDetect(docTextFromMapFunction);
    /**
     * you can do this as well to get the actual sentence strings based on the spans
     */
    // String[] spansToStrings = Span.spansToStrings(sentenceSpans, docTextFromMapFunction);
    return sentenceSpans;
  }
}

HTH... just make sure the file is in place. There are more elegant ways to do this, but this works and it is simple.
