java - Calling the StanfordCoreNLP API from a MapReduce job

Question

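I am trying to process a large number of documents with MapReduce. The idea is to split the files into documents in the mapper and apply the Stanford CoreNLP annotators in the reducer phase.

I have a fairly simple (standard) "tokenize, ssplit, pos, lemma, ner" pipeline. The reducer just calls a function that applies these annotators to each value it receives and returns the annotations (as a list of strings), but the generated output is garbage.

I have observed that the job returns the expected output if I call the annotation function from within the mapper, but that defeats the whole point of the parallelism. The job also returns the expected output when I ignore the values received in the reducer and apply the annotators to a dummy string instead.

This probably points to a thread-safety issue somewhere in the process, but I cannot figure out where: my annotation function is synchronized and the pipeline is private final.

Can someone provide some pointers on how to resolve this?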

-Anshu

Edit:

This is what my reducer looks like; I hope it adds some clarity.

public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(key, new Text(se.getExtracts(values.next().toString()).toString()));             
        }
    }
}

Here is the code for getExtracts:

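final StanfordCoreNLP pipeline;
public instantiatePipeline(){
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    // Build the pipeline once; a final field must be assigned during construction
    pipeline = new StanfordCoreNLP(props);
}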


synchronized List<String> getExtracts(String l){
    Annotation document = new Annotation(l);

    ArrayList<String> ret = new ArrayList<String>();

    pipeline.annotate(document);

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    int sid = 0;
    for(CoreMap sentence:sentences){
        sid++;
        for(CoreLabel token: sentence.get(TokensAnnotation.class)){
            String word = token.get(TextAnnotation.class);
            String pos = token.get(PartOfSpeechAnnotation.class);
            String ner = token.get(NamedEntityTagAnnotation.class);
            String lemma = token.get(LemmaAnnotation.class);

            Timex timex = token.get(TimeAnnotations.TimexAnnotation.class);

            String ex = word+","+pos+","+ner+","+lemma;
            if(timex!=null){
                ex = ex+","+timex.tid();
            }
            else{
                ex = ex+",";
            }
            ex = ex+","+sid;
            ret.add(ex);
        }
    }

    return ret;
}

score 0 · Accepted Answer

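I solved the problem. The actual issue was the text encoding of the file I was reading (converting it to Text caused further corruption, I guess), which broke tokenization and made the pipeline spill garbage. I now sanitize the input string and enforce strict UTF-8 encoding, and everything works fine.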
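
For anyone hitting the same symptom, here is a minimal sketch of what "sanitize the input and enforce strict UTF-8" could look like using only the JDK. It is not the code from my job: the class name `Utf8Sanitizer` and the methods `decodeStrict` and `sanitize` are made up for illustration, and which characters to strip is an assumption you should adapt to your data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop or CoreNLP.
final class Utf8Sanitizer {

    // Decodes bytes as UTF-8 and throws instead of silently substituting
    // U+FFFD when a byte sequence is malformed.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Drops U+FFFD replacement characters and control characters,
    // keeping tab, newline and carriage return (an assumed policy).
    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> cp != 0xFFFD)
         .filter(cp -> !Character.isISOControl(cp)
                 || cp == '\t' || cp == '\n' || cp == '\r')
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xE9 is "é" in ISO-8859-1 but an incomplete sequence in UTF-8.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
        try {
            System.out.println(decodeStrict(latin1));
        } catch (CharacterCodingException e) {
            System.out.println("input is not valid UTF-8: " + e);
        }
        System.out.println(sanitize("to\u0000ken\uFFFDs"));
    }
}
```

With a check like this, a wrongly encoded file fails loudly at read time instead of reaching the tokenizer as garbage. If you decode from a Hadoop Text value, keep in mind that only the first getLength() bytes of getBytes() are valid, so you would need to wrap just that range.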
