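I am using the Stanford CoreNLP parser to process some text, and it contains date phrases such as "the second Monday in October" and "the past year". The library correctly tags each token as a DATE named entity, but is there a way to get the whole date phrase programmatically? This is not limited to dates: ORGANIZATION named entities behave the same way (for example, "the International Olympic Committee" could be an entity identified in a given piece of text).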
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

String content = "Thanksgiving, or Thanksgiving Day (Canadian French: Jour de"
        + " l'Action de grâce), occurring on the second Monday in October, is"
        + " an annual Canadian holiday which celebrates the harvest and other"
        + " blessings of the past year.";
Properties p = new Properties();
p.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(p);
Annotation document = new Annotation(content);
pipeline.annotate(document);
// Walk every token of every sentence and print the ones tagged DATE.
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        String word = token.get(CoreAnnotations.TextAnnotation.class);
        String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
        // Constant-first comparison avoids a NullPointerException if a token has no tag.
        if ("DATE".equals(ne)) {
            System.out.println("DATE: " + word);
        }
    }
}
After the Stanford annotators and classifiers have loaded, this produces the output:
DATE: Thanksgiving
DATE: Thanksgiving
DATE: the
DATE: second
DATE: Monday
DATE: in
DATE: October
DATE: the
DATE: past
DATE: year
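The only workaround I can think of is to merge adjacent tokens that share a tag by hand. Below is a minimal sketch of that idea, assuming the words and tags have already been copied out of the CoreLabel objects into two parallel lists. The class name EntityPhraseGrouper and the method groupPhrases are my own illustrative names, not part of CoreNLP, and the sketch deliberately uses no CoreNLP calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EntityPhraseGrouper {

    // Merges runs of adjacent tokens that carry the same non-"O" tag.
    // words and tags are parallel lists: tags.get(i) is the tag of words.get(i).
    // A null tag is treated like "O" (no entity).
    public static List<String> groupPhrases(List<String> words, List<String> tags) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i) == null ? "O" : tags.get(i);
            if (!tag.equals(currentTag)) {
                // Tag changed: flush the phrase collected so far, if any.
                if (!currentTag.equals("O")) {
                    phrases.add(currentTag + ": " + current);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // Flush a phrase that runs to the end of the input.
        if (!currentTag.equals("O")) {
            phrases.add(currentTag + ": " + current);
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "on", "the", "second", "Monday", "in", "October", ",", "is");
        List<String> tags = Arrays.asList(
                "O", "DATE", "DATE", "DATE", "DATE", "DATE", "O", "O");
        for (String phrase : groupPhrases(words, tags)) {
            System.out.println(phrase); // prints "DATE: the second Monday in October"
        }
    }
}
```

This feels fragile, though: two distinct entities of the same type that happen to sit next to each other would be merged into one phrase, and it duplicates work the tagger has presumably already done.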
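I feel like the library must be recognizing the phrases and using them for the named entity tagging, so the question is: is that data kept and made available somehow through the API?
Thanks, Kevin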