nlp - 斯坦福命名实体识别器中的多术语命名实体

Question

我正在使用斯坦福命名实体识别器http://nlp.stanford.edu/software/CRF-NER.shtml，它工作正常。这是

    List<List<CoreLabel>> out = classifier.classify(text);
    for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
            if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
                namedEntities.add(word.word().trim());           
            }
        }
    }

然而，我发现的问题是识别姓名和姓氏。如果识别器遇到“Joe Smith”，它会分别返回“Joe”和“Smith”。我真的很希望将“乔·史密斯”作为一个术语返回。

这可以通过识别器通过配置来实现吗？直到现在我还没有在 javadoc 中找到任何东西。

谢谢！

score 20 · Accepted Answer

这是因为您的内部 for 循环正在迭代各个标记（单词）并分别添加它们。您需要更改内容以立即添加全名。

一种方法是用常规 for 循环替换内部 for 循环，其中内部有一个 while 循环，该循环采用同一类的相邻非 O 事物并将它们添加为单个实体。*

另一种方法是使用 CRFClassifier 方法调用：

List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)

这将为您提供整个实体，您可以通过substring在原始输入上使用来提取字符串形式。

*我们分发的模型使用简单的原始 IO 标签方案，其中事物被标记为 PERSON 或 LOCATION，而适当的做法是简单地将具有相同标签的相邻标记合并。许多 NER 系统使用更复杂的标签，例如 IOB 标签，其中 B-PERS 等代码指示人员实体的起始位置。CRFClassifier 类和特征工厂支持此类标签，但它们并未用于我们当前分发的模型中（截至 2012 年）。

score 5 · Accepted Answer

classifyToCharacterOffsets 方法的对应物是（AFAIK）您无法访问实体的标签。

正如 Christopher 所提出的，这里是一个组装“相邻的非 O 事物”的循环示例。此示例还计算出现次数。

public HashMap<String, HashMap<String, Integer>> extractEntities(String text){

    HashMap<String, HashMap<String, Integer>> entities =
            new HashMap<String, HashMap<String, Integer>>();

    for (List<CoreLabel> lcl : classifier.classify(text)) {

        Iterator<CoreLabel> iterator = lcl.iterator();

        if (!iterator.hasNext())
            continue;

        CoreLabel cl = iterator.next();

        while (iterator.hasNext()) {
            String answer =
                    cl.getString(CoreAnnotations.AnswerAnnotation.class);

            if (answer.equals("O")) {
                cl = iterator.next();
                continue;
            }

            if (!entities.containsKey(answer))
                entities.put(answer, new HashMap<String, Integer>());

            String value = cl.getString(CoreAnnotations.ValueAnnotation.class);

            while (iterator.hasNext()) {
                cl = iterator.next();
                if (answer.equals(
                        cl.getString(CoreAnnotations.AnswerAnnotation.class)))
                    value = value + " " +
                           cl.getString(CoreAnnotations.ValueAnnotation.class);
                else {
                    if (!entities.get(answer).containsKey(value))
                        entities.get(answer).put(value, 0);

                    entities.get(answer).put(value,
                            entities.get(answer).get(value) + 1);

                    break;
                }
            }

            if (!iterator.hasNext())
                break;
        }
    }

    return entities;
}

score 3 · Accepted Answer

我也遇到了同样的问题，所以我也查了一下。Christopher Manning 提出的方法是有效的，但微妙的一点是要知道如何决定哪种分离器是合适的。可以说只允许一个空格，例如“John Zorn”>> 一个实体。但是，我可能会找到“J.Zorn”的形式，所以我也应该允许某些标点符号。但是“杰克、詹姆斯和乔”呢？我可能会得到 2 个实体而不是 3 个（“Jack James”和“Joe”）。

通过深入研究斯坦福 NER 课程，我实际上找到了这个想法的正确实现。他们使用它以单个String对象的形式导出实体。例如，在方法PlainTextDocumentReaderAndWriter.printAnswersTokenizedInlineXML中，我们有：

 private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
    final String background = flags.backgroundSymbol;
    String prevTag = background;
    for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
      IN wi = wordIter.next();
      String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));

      String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));

      String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
      if (!tag.equals(prevTag)) {
        if (!prevTag.equals(background) && !tag.equals(background)) {
          out.print("</");
          out.print(prevTag);
          out.print('>');
          out.print(before);
          out.print('<');
          out.print(tag);
          out.print('>');
        } else if (!prevTag.equals(background)) {
          out.print("</");
          out.print(prevTag);
          out.print('>');
          out.print(before);
        } else if (!tag.equals(background)) {
          out.print(before);
          out.print('<');
          out.print(tag);
          out.print('>');
        }
      } else {
        out.print(before);
      }
      out.print(current);
      String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));

      if (!tag.equals(background) && !wordIter.hasNext()) {
        out.print("</");
        out.print(tag);
        out.print('>');
        prevTag = background;
      } else {
        prevTag = tag;
      }
      out.print(afterWS);
    }
  }

如前所述，他们遍历每个单词，检查它是否具有与前一个相同的类（答案）。为此，他们利用被认为不是实体的事实表达式使用所谓的backgroundSymbol（“O”类）进行标记。他们还使用 property BeforeAnnotation，它表示将当前单词与前一个单词分开的字符串。最后一点可以解决我最初提出的关于选择合适分隔符的问题。

score 2 · Accepted Answer

上述代码：

<List> result = classifier.classifyToCharacterOffsets(text);

for (Triple<String, Integer, Integer> triple : result)
{
    System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
}

score 2 · Accepted Answer

List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
    String s = "";
    String prevLabel = null;
    for (CoreLabel word : sentence) {
      if(prevLabel == null  || prevLabel.equals(word.get(CoreAnnotations.AnswerAnnotation.class)) ) {
         s = s + " " + word;
         prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
      }
      else {
        if(!prevLabel.equals("O"))
           System.out.println(s.trim() + '/' + prevLabel + ' ');
        s = " " + word;
        prevLabel = word.get(CoreAnnotations.AnswerAnnotation.class);
      }
    }
    if(!prevLabel.equals("O"))
        System.out.println(s + '/' + prevLabel + ' ');
}

我只是写了一个小逻辑，它工作正常。如果它们相邻，我所做的是将具有相同标签的单词分组。

score 1 · Accepted Answer

使用已经提供给您的分类器。我相信这就是您正在寻找的：

    private static String combineNERSequence(String text) {

    String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";      
    AbstractSequenceClassifier<CoreLabel> classifier = null;
    try {
        classifier = CRFClassifier
                .getClassifier(serializedClassifier);
    } catch (ClassCastException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    System.out.println(classifier.classifyWithInlineXML(text));

    //  FOR TSV FORMAT  //
    //System.out.print(classifier.classifyToString(text, "tsv", false));

    return classifier.classifyWithInlineXML(text);
}

score 0 · Accepted Answer

Here is my full code, I use Stanford core NLP and write algorithm to concatenate Multi Term names.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.log4j.Logger;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/**
 * Created by Chanuka on 8/28/14 AD.
 */
public class FindNameEntityTypeExecutor {

private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);

private StanfordCoreNLP pipeline;

public FindNameEntityTypeExecutor() {
    logger.info("Initializing Annotator pipeline ...");

    Properties props = new Properties();

    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");

    pipeline = new StanfordCoreNLP(props);

    logger.info("Annotator pipeline initialized");
}

List<String> findNameEntityType(String text, String entity) {
    logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    List<String> matches = new ArrayList<String>();

    for (CoreMap sentence : sentences) {

        int previousCount = 0;
        int count = 0;
        // traversing the words in the current sentence
        // a CoreLabel is a CoreMap with additional token-specific methods

        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);

            int previousWordIndex;
            if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
                count++;
                if (previousCount != 0 && (previousCount + 1) == count) {
                    previousWordIndex = matches.size() - 1;
                    String previousWord = matches.get(previousWordIndex);
                    matches.remove(previousWordIndex);
                    previousWord = previousWord.concat(" " + word);
                    matches.add(previousWordIndex, previousWord);

                } else {
                    matches.add(word);
                }
                previousCount = count;
            }
            else
            {
                count=0;
                previousCount=0;
            }


        }

    }
    return matches;
}
}

score 0 · Accepted Answer

另一种处理多词实体的方法。如果它们具有相同的注释并且连续出现，则此代码将多个标记组合在一起。

限制：
如果同一个token有两个不同的注解，最后一个会被保存。

private Document getEntities(String fullText) {

    Document entitiesList = new Document();
    NERClassifierCombiner nerCombClassifier = loadNERClassifiers();

    if (nerCombClassifier != null) {

        List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);

        for (List<CoreLabel> coreLabels : results) {

            String prevLabel = null;
            String prevToken = null;

            for (CoreLabel coreLabel : coreLabels) {

                String word = coreLabel.word();
                String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);

                if (!"O".equals(annotation)) {

                    if (prevLabel == null) {
                        prevLabel = annotation;
                        prevToken = word;
                    } else {

                        if (prevLabel.equals(annotation)) {
                            prevToken += " " + word;
                        } else {
                            prevLabel = annotation;
                            prevToken = word;
                        }
                    }
                } else {

                    if (prevLabel != null) {
                        entitiesList.put(prevToken, prevLabel);
                        prevLabel = null;
                    }
                }
            }
        }
    }

    return entitiesList;
}

进口：

Document: org.bson.Document;
NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;

nlp - 斯坦福命名实体识别器中的多术语命名实体

8 回答 8

Related

Reference