machine-learning - Detecting employee designation from text using NER/NLP

Question

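I am very new to NLP, and I am interested in detecting a person's designation/position/role along with their name, email, phone number, and so on. I tried using Stanford NLP to detect names in text, and parsing emails and phone numbers seems fairly straightforward. However, I am unable to detect the designation in the given text.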

For example, here are some text samples:

1) Medical Director, Dr. AB Ahmad,example1@example.com
Name: Dr. AB Ahmad, Email: example1@example.com

2) Vice Dean Academic Prof. S. Antony example2@example.com
Name: Prof. S. Antony, Email: example2@example.com

3) Vice Dean Academic and PG-Cell & Surg. Discipline Resident Trg. Program, Mr. Sandeep
Name: Mr. Sandeep, Email: None

4) Network Director Robert Adams, example3@example.com,9900131213
Name: Robert Adams, Email: example3@example.com, Phone: 9900131213

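I am not interested in any regex-matching approach, because the structure of the text is not predictable. What I would like to know is how to extract the designations from text like the samples above. Any solution beyond Stanford NLP, such as NLTK, LingPipe, etc., is fine. If I stay with Stanford NLP, how do I train a model with a different entity type (such as "POSITION" or "DESIGNATION"), and how do I include that model alongside the other models? (I am running Stanford NLP in server mode.)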

score 0 · Accepted Answer

Try using the following rules file (designation.rules.txt):

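# Case-insensitive matching (2 is java.util.regex.Pattern.CASE_INSENSITIVE)
ENV.defaultStringPatternFlags = 2

# Map short variable names to annotation keys
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

# Single-token keywords that count as (part of) a designation
$Designation = (
  /CFO/|
  /Director/|
  /CEO/|
  /Chief/|
  /Executive/|
  /Officer/|
  /Vice/|
  /President/|
  /Senior/|
  /Financial/
)

# Stage 1: tag every matching token with the NER label DESIGNATION
ENV.defaults["ruleType"] = "tokens"
ENV.defaults["stage"] = 1
{
  pattern: ( $Designation ),
  action: ( Annotate($0, ner, "DESIGNATION"))
}

# Stage 2: match a PERSON token, the literal word "has", then one or more
# DESIGNATION tokens. A sentence such as "John is CEO of ITCookies" has no
# "has", so only the stage 1 rule fires on it.
ENV.defaults["stage"] = 2
{
  ruleType: "tokens",
  pattern: ( ( [ { ner:PERSON } ]) /has/ ([ { ner:DESIGNATION } ]+) ),
  result: Format("hasDesignation(%s,%s)",$1.word, Join(" ",$2.word))
}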
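
To make the stage 1 rule concrete without CoreNLP on the classpath, here is a rough plain-Java imitation of what it does: a case-insensitive lookup of each token in the keyword list. This is only a sketch and not the TokensRegex API; the class name DesignationTagger and the method tag are made up for illustration, and it assumes Java 9+ for List.of and Set.of. Unlike the real pipeline, it returns "O" for every non-keyword instead of keeping tags such as PERSON that the ner annotator already assigned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: plain-Java imitation of the stage 1 rule, not part of CoreNLP. */
final class DesignationTagger {

  // Same keywords as $Designation in designation.rules.txt, lower-cased.
  private static final Set<String> KEYWORDS = Set.of(
      "cfo", "director", "ceo", "chief", "executive",
      "officer", "vice", "president", "senior", "financial");

  /** Returns one tag per token: "DESIGNATION" for a keyword, otherwise "O". */
  static List<String> tag(List<String> tokens) {
    List<String> tags = new ArrayList<>(tokens.size());
    for (String token : tokens) {
      // Lower-casing both sides mirrors ENV.defaultStringPatternFlags = 2.
      boolean hit = KEYWORDS.contains(token.toLowerCase(Locale.ROOT));
      tags.add(hit ? "DESIGNATION" : "O");
    }
    return tags;
  }

  public static void main(String[] args) {
    List<String> tokens = List.of("John", "is", "CEO", "of", "ITCookies");
    System.out.println(tag(tokens)); // prints [O, O, DESIGNATION, O, O]
  }
}
```

Because the lookup works on single tokens, a title such as "Vice President" is tagged as two separate DESIGNATION tokens, which is also how the rules file behaves.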

Then load the rules file with the following Java program:

package org.itcookies.nlpdemo;

import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

/**
 * Demo illustrating how to use TokensRegexAnnotator
 */
public class TokensRegexAnnotatorDemo {

  public static void main(String[] args) throws IOException {
    PrintWriter out;

    String rules;
    if (args.length > 0) {
      rules = args[0];
    } else {
      rules = "org/itcookies/nlp/rules/designation.rules.txt";
    }
    if (args.length > 2) {
      out = new PrintWriter(args[2]);
    } else {
      out = new PrintWriter(System.out);
    }

    Properties properties = new Properties();
    properties.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregexdemo");
    properties.setProperty("customAnnotatorClass.tokensregexdemo", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("tokensregexdemo.rules", rules);
    StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
    Annotation annotation;
    if (args.length > 1) {
      annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[1]));
    } else {
      annotation = new Annotation("John is CEO of ITCookies");
    }

    pipeline.annotate(annotation);

    // An Annotation is a Map and you can get and use the various analyses individually.
    out.println();
    // The toString() method on an Annotation just prints the text of the Annotation
    // But you can see what is in it with other methods like toShorterString()
    out.println("The top level annotation");
    out.println(annotation.toShorterString());
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);

    for (CoreMap sentence : sentences) {
      // NOTE: Depending on what tokensregex rules are specified, there are other annotations
      //       that are of interest other than just the tokens and what we print out here
      for (CoreLabel token:sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        // Print out words, lemma, ne, and normalized ne
        String word = token.get(CoreAnnotations.TextAnnotation.class);
        String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
        String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
        String normalized = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
        // Constant-first comparison avoids a NullPointerException if a token has no NER tag
        if ("DESIGNATION".equals(ne)) {
          out.println("token: word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", ne=" + ne + ", normalized=" + normalized);
        }
      }
    }
    out.flush();
  }

}

The output is shown below:

The top level annotation
[Text=John is CEO of ITCookies Tokens=[John-1, is-2, CEO-3, of-4, ITCookies-5] Sentences=[John is CEO of ITCookies]]
token: word=CEO, lemma=CEO, pos=NNP, ne=DESIGNATION, normalized=null
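
The demo prints one line per DESIGNATION token, so a multi-word title such as "Senior Vice President" would come out as three separate lines. One possible post-processing step is to join each run of consecutive DESIGNATION tokens into a single phrase. The sketch below is plain Java with hypothetical names (DesignationSpans, merge); it assumes the words and tags have already been read from each token's TextAnnotation and NamedEntityTagAnnotation, as in the loop above, and that Java 9+ is available for List.of.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups adjacent DESIGNATION tokens into phrases. */
final class DesignationSpans {

  /** Joins each run of consecutive tokens tagged DESIGNATION into one phrase. */
  static List<String> merge(List<String> words, List<String> tags) {
    if (words.size() != tags.size()) {
      throw new IllegalArgumentException("words and tags must have the same length");
    }
    List<String> spans = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < words.size(); i++) {
      if ("DESIGNATION".equals(tags.get(i))) {
        // Extend the current run, separating words with a single space.
        if (current.length() > 0) {
          current.append(' ');
        }
        current.append(words.get(i));
      } else if (current.length() > 0) {
        // A non-designation token closes the current run.
        spans.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      spans.add(current.toString()); // run that reaches the end of the sentence
    }
    return spans;
  }

  public static void main(String[] args) {
    List<String> words = List.of("Jane", "is", "Senior", "Vice", "President", "and", "CFO");
    List<String> tags = List.of("PERSON", "O", "DESIGNATION", "DESIGNATION", "DESIGNATION", "O", "DESIGNATION");
    System.out.println(merge(words, tags)); // prints [Senior Vice President, CFO]
  }
}
```

The tags in main are written by hand for illustration; they are not output from an actual CoreNLP run.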
