n-gram - Grouping all named entities in a document

Question

I want to group all the named entities in a given document. For example,

**Barack Hussein Obama** II  is the 44th and current President of the United States, and the first African American to hold the office.

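I don't want to use the OpenNLP API, because it may not recognize every named entity. Is there a way to generate such n-grams with another service, or perhaps a way to group all the noun terms together?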

score 4 · Accepted Answer

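If you want to avoid NER, you can use a sentence chunker or a parser, which will extract noun phrases in general. OpenNLP has both a sentence chunker and a parser, but if you are against using OpenNLP for some reason, you can try others. In case you are interested in the OpenNLP chunker, I'll post some code that uses it to extract noun phrases.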

Here is the code. You will need to download the models from SourceForge here:

http://opennlp.sourceforge.net/models-1.5/

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 *
 * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
 * the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

  static final int N = 2;

  public static void main(String[] args) {

    String modelPath = "c:\\temp\\opennlpmodels\\";
    // try-with-resources closes the model streams once the models are loaded
    try (InputStream tokenIn = new FileInputStream(new File(modelPath + "en-token.zip"));
         InputStream posIn = new FileInputStream(new File(modelPath + "en-pos-maxent.zip"));
         InputStream chunkerIn = new FileInputStream(new File(modelPath + "en-chunker.zip"))) {
      TokenizerModel tm = new TokenizerModel(tokenIn);
      TokenizerME wordBreaker = new TokenizerME(tm);
      POSModel pm = new POSModel(posIn);
      POSTaggerME posme = new POSTaggerME(pm);
      ChunkerModel chunkerModel = new ChunkerModel(chunkerIn);
      ChunkerME chunkerME = new ChunkerME(chunkerModel);
      //this is your sentence
      String sentence = "Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.";
      //words is the tokenized sentence
      String[] words = wordBreaker.tokenize(sentence);
      //posTags are the parts of speech of every word in the sentence (The chunker needs this info of course)
      String[] posTags = posme.tag(words);
      //chunks are the start end "spans" indices to the chunks in the words array
      Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
      //chunkStrings are the actual chunks
      String[] chunkStrings = Span.spansToStrings(chunks, words);
      for (int i = 0; i < chunks.length; i++) {
        if (chunks[i].getType().equals("NP")) {
          System.out.println("NP: \n\t" + chunkStrings[i]);
          String[] split = chunkStrings[i].split(" ");

          List<String> ngrams = ngram(Arrays.asList(split), N, " ");
          System.out.println("ngrams:");
          for (String gram : ngrams) {
            System.out.println("\t" + gram);
          }

        }
      }


    } catch (IOException e) {
      // Report model loading failures instead of silently swallowing them
      e.printStackTrace();
    }
  }

  /**
   * Builds the n-grams of a token list, joining tokens with the separator.
   * A list with n or fewer tokens is returned unchanged, so a two-word
   * phrase is printed as two separate tokens when n is 2.
   */
  public static List<String> ngram(List<String> input, int n, String separator) {
    if (input.size() <= n) {
      return input;
    }
    List<String> outGrams = new ArrayList<String>();
    // Slide a window of n tokens across the list
    for (int i = 0; i + n <= input.size(); i++) {
      StringBuilder gram = new StringBuilder(input.get(i));
      for (int x = i + 1; x < i + n; x++) {
        gram.append(separator).append(input.get(x));
      }
      outGrams.add(gram.toString());
    }
    return outGrams;
  }
}

The output I get with your sentence is this (with N set to 2, i.e. bigrams):

NP: 
    Barack Hussein Obama II
ngrams:
    Barack Hussein
    Hussein Obama
    Obama II
NP: 
    the 44th and current President
ngrams:
    the 44th
    44th and
    and current
    current President
NP: 
    the United States
ngrams:
    the United
    United States
NP: 
    the first African American
ngrams:
    the first
    first African
    African American
NP: 
    the office
ngrams:
    the
    office
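
Note that the last noun phrase, "the office", is printed as two separate tokens: ngram returns its input unchanged when the phrase has N or fewer tokens. If you would rather get the whole phrase back as a single gram in that case, a variant might look like the sketch below. The class and method names are made up for illustration, it assumes n is at least 1, and it uses String.join, so it needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WholePhraseNGrams {

  // Hypothetical variant of ngram(): a phrase with n or fewer tokens is
  // returned as one joined string instead of as separate tokens.
  static List<String> ngramsOrWholePhrase(List<String> tokens, int n, String separator) {
    List<String> grams = new ArrayList<String>();
    if (tokens.isEmpty()) {
      return grams;
    }
    if (tokens.size() <= n) {
      grams.add(String.join(separator, tokens));
      return grams;
    }
    // Slide a window of n tokens across the phrase
    for (int i = 0; i + n <= tokens.size(); i++) {
      grams.add(String.join(separator, tokens.subList(i, i + n)));
    }
    return grams;
  }

  public static void main(String[] args) {
    // prints [the office]
    System.out.println(ngramsOrWholePhrase(Arrays.asList("the", "office"), 2, " "));
    // prints [the United, United States]
    System.out.println(ngramsOrWholePhrase(Arrays.asList("the", "United", "States"), 2, " "));
  }
}
```

Longer phrases produce the same bigrams as before; only the short-phrase case changes.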

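This doesn't explicitly handle the case where adjectives fall outside the NP. If that happens, you can get that information from the POS tags and integrate it. What I gave you should point you in the right direction.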
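
As a rough sketch of what using the POS tags could look like: the helper below walks the parallel words and posTags arrays (the same shape the tokenizer and tagger return in the code above) and collects every maximal run of adjectives and nouns that contains at least one noun. It assumes Penn Treebank style tags (JJ* for adjectives, NN* for nouns). The class name, the method name, and the hand-written tags in main are illustrative only; they are not actual tagger output.

```java
import java.util.ArrayList;
import java.util.List;

class PosTagGrouper {

  // Collects maximal runs of adjective (JJ*) and noun (NN*) tokens.
  // A run is kept only if it contains at least one noun.
  static List<String> adjectiveNounRuns(String[] words, String[] posTags) {
    List<String> phrases = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    boolean hasNoun = false;
    for (int i = 0; i < words.length; i++) {
      boolean isNoun = posTags[i].startsWith("NN");
      boolean isAdjective = posTags[i].startsWith("JJ");
      if (isNoun || isAdjective) {
        // Extend the current run with this token
        if (current.length() > 0) {
          current.append(' ');
        }
        current.append(words[i]);
        hasNoun = hasNoun || isNoun;
      } else {
        // Any other tag ends the run; keep it only if it had a noun
        if (hasNoun) {
          phrases.add(current.toString());
        }
        current.setLength(0);
        hasNoun = false;
      }
    }
    if (hasNoun) {
      phrases.add(current.toString());
    }
    return phrases;
  }

  public static void main(String[] args) {
    // Hand-assigned tags for illustration, not real tagger output
    String[] words = {"the", "first", "African", "American", "to", "hold", "the", "office"};
    String[] tags = {"DT", "JJ", "JJ", "NNP", "TO", "VB", "DT", "NN"};
    // prints [first African American, office]
    System.out.println(adjectiveNounRuns(words, tags));
  }
}
```

Unlike the chunker output, this drops determiners such as "the", so you would need to decide which of the two groupings to keep when a run overlaps an NP chunk.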
