java - openNLP java - Multi-word Portuguese NER

Question

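I am using the openNLP API in Java for a project I am working on. The problem is that my program only handles single words and does not join them into multi-word entities. The code: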

String line = input.nextLine();


          InputStream inputStreamTokenizer = new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-token.bin"); 
          TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer); 

          //Instantiating the TokenizerME class 
          TokenizerME tokenizer = new TokenizerME(tokenModel); 
          String tokens[] = tokenizer.tokenize(line);


          InputStream inputStream = new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-sent.bin"); 
          SentenceModel model = new SentenceModel(inputStream); 

          //Instantiating the SentenceDetectorME class 
          SentenceDetectorME detector = new SentenceDetectorME(model);  

          //Detecting the sentence
          String sentences[] = detector.sentDetect(line); 

          //Loading the NER-location model 
          //InputStream inputStreamLocFinder = new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/en-ner-location.bin");       
          //TokenNameFinderModel model = new TokenNameFinderModel(inputStreamLocFinder);

          //Loading the NER-person model 
          InputStream inputStreamNameFinder = new FileInputStream("/home/bruno/TryOllie/data/pt-ner-floresta.bin");       
          TokenNameFinderModel model2 = new TokenNameFinderModel(inputStreamNameFinder);

          //Instantiating the NameFinderME class 
          NameFinderME nameFinder2 = new NameFinderME(model2);

          //Finding the names of a location 
          Span nameSpans2[] = nameFinder2.find(tokens);

          //Printing the spans of the locations in the sentence 
          //for(Span s: nameSpans)        
             //System.out.println(s.toString()+"  "+tokens[s.getStart()]);

          Set<String> x = new HashSet<String>();
          x.add("event");
          x.add("artprod");
          x.add("place");
          x.add("organization");
          x.add("person");
          x.add("numeric");

          SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;  
          Span[] tokenz = simpleTokenizer.tokenizePos(line);
          Set<String> tk = new HashSet<String>();
          for( Span tok : tokenz){
              tk.add(line.substring(tok.getStart(), tok.getEnd()));
          }

          for(Span n: nameSpans2)
          {
              if(x.contains(n.getType()))
                  System.out.println(n.toString()+ " -> " + tokens[n.getStart()]);

          }

The output I get is:

Ficheiro com extensao: file.txt
[1..2) event -> choque
[3..4) event -> cadeia
[6..7) artprod -> viaturas
[13..14) event -> feira
[16..18) place -> Avenida
[20..21) place -> Porto
[24..25) event -> incêndio
[2..3) event -> acidente
[5..6) artprod -> viaturas
[44..45) organization -> JN
[46..47) person -> António
[47..48) place -> Campos
[54..60) organization -> Batalhão
[1..2) event -> acidente
[6..8) numeric -> 9
[11..12) place -> Porto-Matosinhos
[21..22) event -> ocorrência
[29..30) artprod -> .
[4..5) organization -> Sapadores
[7..10) organization -> Bombeiros
[14..15) numeric -> 15

What I want is multi-term NER: for example, Antonio Campos recognized as one person rather than Person -> Antonio and Place -> Campos, or Organization -> Universidade Nova de Lisboa.

score 2 · Accepted Answer

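You are printing the wrong data structure. A Span's getStart and getEnd point to the sequence of tokens that make up the entity, from getStart (inclusive) to getEnd (exclusive). You are printing only the first token.

Also, you are tokenizing before sentence detection.

Try the code below: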
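To make the point about spans concrete, here is a minimal sketch that runs without OpenNLP or any model files. TokenSpan is a hypothetical stand-in for opennlp.tools.util.Span (same half-open [start, end) convention over token indices), and the token array and span are made up for illustration. In real code, Span.spansToStrings(spans, tokens) does this joining for you.

```java
import java.util.Arrays;

public class SpanTextSketch {

    // Hypothetical stand-in for opennlp.tools.util.Span:
    // a half-open range [start, end) of token indices plus an entity type.
    static final class TokenSpan {
        final int start;
        final int end;
        final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Joins every token covered by the span, not just the first one.
    static String coveredText(TokenSpan span, String[] tokens) {
        return String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
    }

    public static void main(String[] args) {
        // Made-up tokens and span, for illustration only.
        String[] tokens = {"O", "comandante", "António", "Campos", "falou", "ao", "JN"};
        TokenSpan person = new TokenSpan(2, 4, "person");

        // What the question's code does: only the first token of the span.
        System.out.println(person.type + " -> " + tokens[person.start]);
        // What was intended: all tokens from start (inclusive) to end (exclusive).
        System.out.println(person.type + " -> " + coveredText(person, tokens));
    }
}
```

The first line printed is "person -> António" and the second is "person -> António Campos". Whether the model actually returns a single two-token span for a given name depends on the model and its training data, which this sketch does not cover.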

// load the models outside your loop
InputStream inputStream =
    new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-sent.bin");
SentenceModel model = new SentenceModel(inputStream);

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);

InputStream inputStreamTokenizer =
    new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-token.bin");
TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer);
//Instantiating the TokenizerME class 
TokenizerME tokenizer = new TokenizerME(tokenModel);


//Loading the NER-person model 
InputStream inputStreamNameFinder = new FileInputStream("/home/bruno/TryOllie/data/pt-ner-floresta.bin");
TokenNameFinderModel model2 = new TokenNameFinderModel(inputStreamNameFinder);

//Instantiating the NameFinderME class 
NameFinderME nameFinder2 = new NameFinderME(model2);

// assumes input is a java.util.Scanner, as the nextLine() call suggests
while (input.hasNextLine()) {
  String line = input.nextLine();

  // first we find sentences
  String sentences[] = detector.sentDetect(line);

  for (String sentence : sentences) {
    // now we find the sentence tokens
    String tokens[] = tokenizer.tokenize(sentence);

    // now we are good to apply NER
    Span[] nameSpans = nameFinder2.find(tokens);

    // now we can print the spans: each entry is the full text of one entity
    System.out.println(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
  }
}
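The ordering used in the loop above (sentences first, then tokens per sentence) can be illustrated without any models. This is only a sketch: splitSentences and tokenize are hypothetical, naive stand-ins for SentenceDetectorME.sentDetect and TokenizerME.tokenize, and the input line is made up. The point is that each sentence gets its own token array, so span indices returned for a sentence refer to that array only.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineOrderSketch {

    // Naive stand-in for a sentence detector: split on whitespace
    // that follows '.', '!' or '?'.
    static String[] splitSentences(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    // Naive stand-in for a tokenizer: split on whitespace.
    static String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    // Sentence detection first, then tokenization of each sentence.
    static List<String[]> tokensPerSentence(String text) {
        List<String[]> result = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            result.add(tokenize(sentence));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input line, for illustration only.
        String line = "O acidente ocorreu no Porto. António Campos falou ao JN.";
        for (String[] tokens : tokensPerSentence(line)) {
            // Token indices restart at 0 for every sentence.
            System.out.println(tokens.length + " tokens: " + String.join(" | ", tokens));
        }
    }
}
```

With the trained models, the name finder would be called once per token array produced this way, as in the loop above.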

score 0 · Accepted Answer

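Stanford NLP only handles single words. Even if you give CoreNLP a sentence, it breaks it into tokens and processes them one by one. And I have never heard of NER working on multi-word terms.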
