stanford-nlp - Is it possible to get the set of tokens with a specific named entity tag that make up a phrase?

Question

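I am using the Stanford CoreNLP parser to process some text, and it contains date phrases such as "the second Monday in October" and "the past year". The library correctly tags each token as a DATE named entity, but is there a way to get the whole date phrase programmatically? This is not limited to dates: ORGANIZATION named entities behave the same way (for example, "The International Olympic Committee" could be an entity identified in a given text).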

String content = "Thanksgiving, or Thanksgiving Day (Canadian French: Jour de"
        + " l'Action de grâce), occurring on the second Monday in October, is"
        + " an annual Canadian holiday which celebrates the harvest and other"
        + " blessings of the past year.";

Properties p = new Properties();
p.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(p);

Annotation document = new Annotation(content);
pipeline.annotate(document);

for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {

        String word = token.get(CoreAnnotations.TextAnnotation.class);
        String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

        if ("DATE".equals(ne)) {
            System.out.println("DATE: " + word);
        }

    }
}

After the Stanford annotators and classifiers have loaded, this produces the output:

DATE: Thanksgiving
DATE: Thanksgiving
DATE: the
DATE: second
DATE: Monday
DATE: in
DATE: October
DATE: the
DATE: past
DATE: year

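I assume the library must recognize the phrases in order to tag the tokens as named entities, so the question is: is that data kept and made available through the API somehow?

Thanks, Kevin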

score 1 · Accepted Answer

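After a discussion on the mailing list, I found out that the API does not support this. My solution was to keep the state of the last NE tag and build up a string when necessary. John B. from the nlp mailing list was helpful in answering my question.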
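
A minimal sketch of that approach, assuming the words and their NE tags have already been copied out of the pipeline into two parallel lists. The class name `EntityPhraseGrouper` and the method `group` are made up for illustration and are not part of CoreNLP; untagged tokens are assumed to carry the tag "O".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EntityPhraseGrouper {

    /**
     * Joins adjacent tokens that share the same non-"O" tag into one phrase.
     * Each result is formatted as "TAG: phrase".
     */
    public static List<String> group(List<String> words, List<String> tags) {
        List<String> phrases = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String lastTag = "O"; // state of the last NE tag seen

        for (int i = 0; i < words.size(); i++) {
            // Treat a missing tag like "O" (no entity).
            String tag = tags.get(i) == null ? "O" : tags.get(i);

            if (!tag.equals(lastTag)) {
                // The tag changed: flush the phrase built so far, if any.
                if (!"O".equals(lastTag)) {
                    phrases.add(lastTag + ": " + current);
                }
                current.setLength(0);
            }
            if (!"O".equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
            lastTag = tag;
        }
        // Flush a phrase that runs to the end of the input.
        if (!"O".equals(lastTag)) {
            phrases.add(lastTag + ": " + current);
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "on", "the", "second", "Monday", "in", "October", ",", "is");
        List<String> tags = Arrays.asList(
                "O", "DATE", "DATE", "DATE", "DATE", "DATE", "O", "O");
        for (String phrase : group(words, tags)) {
            System.out.println(phrase);
        }
    }
}
```

One limitation of grouping by tag alone: two separate entities of the same type that sit directly next to each other are merged into a single phrase.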

score 0 · Accepted Answer

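Thanks a lot, I was going to do the same. However, the Stanford NER API supports classifyToCharOffset (or something like that) to get the whole phrase. I don't know, maybe it is just an implementation of your idea :D.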
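
The method meant here is probably `classifyToCharacterOffsets` on `AbstractSequenceClassifier`, which, as far as I know, returns (label, begin offset, end offset) triples. The sketch below does not call CoreNLP at all. It only shows how such triples could be turned back into phrases, and it assumes the end offset is exclusive. `OffsetPhrases` and `Span` are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetPhrases {

    /** Hypothetical stand-in for a (label, begin, end) result; not a CoreNLP type. */
    public static final class Span {
        final String label;
        final int begin; // inclusive character offset
        final int end;   // exclusive character offset (assumption)

        public Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Cuts each span out of the original text, formatted as "LABEL: phrase". */
    public static List<String> phrases(String text, List<Span> spans) {
        List<String> result = new ArrayList<>();
        for (Span s : spans) {
            result.add(s.label + ": " + text.substring(s.begin, s.end));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Held on the second Monday in October.";
        String phrase = "the second Monday in October";
        // In real use the offsets would come from the classifier, not indexOf.
        int begin = text.indexOf(phrase);
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("DATE", begin, begin + phrase.length()));
        for (String p : phrases(text, spans)) {
            System.out.println(p);
        }
    }
}
```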

score 0 · Accepted Answer

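The named entity tagger and the part-of-speech tagger are separate algorithms in the CoreNLP pipeline, and it seems that integrating them is left to the API consumer.

Please forgive my C#, but here is a simple class: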

    public class NamedNounPhrase
    {
        public NamedNounPhrase()
        {
            Phrase = string.Empty;
            Tags = new List<string>();
        }

        public string Phrase { get; set; }

        public IList<string> Tags { get; set; }

    }

And some code to find all top-level noun phrases and their associated named entity tags:

    private void _monkey()
    {

        ...

        var nounPhrases = new List<NamedNounPhrase>();

        foreach (CoreMap sentence in sentences.toArray())
        {
            var tree =
                (Tree)sentence.get(new TreeCoreAnnotations.TreeAnnotation().getClass());

            if (null != tree)
                _walk(tree, nounPhrases);
        }

        foreach (var nounPhrase in nounPhrases)
            Console.WriteLine(
                "{0} ({1})",
                nounPhrase.Phrase,
                string.Join(", ", nounPhrase.Tags)
                );
    }

    private void _walk(Tree tree, IList<NamedNounPhrase> nounPhrases)
    {
        if ("NP" == tree.value())
        {
            var nounPhrase = new NamedNounPhrase();

            foreach (Tree leaf in tree.getLeaves().toArray())
            {
                var label = (CoreLabel) leaf.label();
                nounPhrase.Phrase += (string) label.get(new CoreAnnotations.TextAnnotation().getClass()) + " ";
                nounPhrase.Tags.Add((string) label.get(new CoreAnnotations.NamedEntityTagAnnotation().getClass()));
            }

            nounPhrases.Add(nounPhrase);
        }
        else
        {
            foreach (var child in tree.children())
            {
                _walk(child, nounPhrases);
            }
        }
    }

Hope that helps!
