python - How does spaCy's "matcher" (gazetteer) format work for the nlp.matcher.add method?

Question

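I have started using the spaCy (spacy.io) NLP package and have gone through some of the introductory material and some of the example code.

I'm interested in the spacy.en.English.matcher.add method: what is the format for adding my own entities? The basic format is explained, but there seem to be other features available. Can the entities I add be linked to DBpedia/Wikipedia entries or other external links?

Here is the code from the spaCy matcher example: https://github.com/honnibal/spaCy/blob/master/examples/matcher_example.py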

nlp.matcher.add(
    "GoogleNow", # Entity ID: Not really used at the moment.
    "PRODUCT",   # Entity type: should be one of the types in the NER data
    {"wiki_en": "Google_Now"}, # Arbitrary attributes. Currently unused.
    [  # List of patterns that can be Surface Forms of the entity

        # This Surface Form matches "Google Now", verbatim
        [ # Each Surface Form is a list of Token Specifiers.
            { # This Token Specifier matches tokens whose orth field is "Google"
                ORTH: "Google"
            },
            { # This Token Specifier matches tokens whose orth field is "Now"
                ORTH: "Now"
            }
        ],
        [ # This Surface Form matches "google now", verbatim, and requires
          # "google" to have the NNP tag. This helps prevent the pattern from
          # matching cases like "I will google now to look up the time"
            {
                ORTH: "google",
                TAG: "NNP"
            },
            {
                ORTH: "now"
            }
        ]
    ]
)

Thank you for your time.

score 2 · Accepted Answer

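Yes, you can link them, but as far as I know spaCy does not do this out of the box. You can set up your own category type (for example SINGER instead of PRODUCT; note that this is currently broken, and you may need to use v0.93 for it) and then populate it with DBpedia entries (for example David Bowie instead of Google Now). Once that is done, you can keep a mapping between the entities and their URLs. As this comment suggests, something that performs that last linking step automatically may come along: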

 {"wiki_en": "Google_Now"}, # Arbitrary attributes. Currently unused.
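The entity-to-URL mapping described above could be kept entirely outside spaCy. The following is only a minimal sketch in plain Python: the registry `ENTITY_ATTRS`, the `URL_TEMPLATES` table, and the `links_for` helper are hypothetical names introduced for illustration, and the DavidBowie/SINGER entry is an assumed example in the spirit of the answer, not something spaCy provides.

```python
# Hypothetical registry: entity ID -> (entity type, arbitrary attributes).
# The attribute dict mirrors the {"wiki_en": "Google_Now"} argument in the question.
ENTITY_ATTRS = {
    "GoogleNow": ("PRODUCT", {"wiki_en": "Google_Now"}),
    "DavidBowie": ("SINGER", {"wiki_en": "David_Bowie", "dbpedia": "David_Bowie"}),
}

# Assumed URL schemes for each attribute key.
URL_TEMPLATES = {
    "wiki_en": "https://en.wikipedia.org/wiki/{}",
    "dbpedia": "http://dbpedia.org/resource/{}",
}


def links_for(entity_id):
    """Return a dict of external links for a registered entity ID."""
    _entity_type, attrs = ENTITY_ATTRS[entity_id]
    return {
        key: URL_TEMPLATES[key].format(value)
        for key, value in attrs.items()
        if key in URL_TEMPLATES
    }


google_now_links = links_for("GoogleNow")
bowie_links = links_for("DavidBowie")
print(google_now_links)
print(bowie_links)
```

Keeping the mapping in your own dictionary means it does not depend on whether the matcher ever uses the attributes argument itself.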

score 1 · Accepted Answer

With spaCy >v1, you can now add callback functions to the matcher. I can imagine something like this working for your use case:

from spacy.attrs import ORTH
from spacy.tokens import Doc

# Assumes `matcher` is an existing spaCy v1 Matcher instance.
def getlink(matcher, doc, i, matches):
    spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
    for span in spans:
        pass  # do something here to get the link from Wikipedia

matcher.add_entity('David Bowie', on_match=getlink)
matcher.add_pattern('David Bowie', [{ORTH: 'David'}, {ORTH: 'Bowie'}])
doc = Doc(matcher.vocab, words=[u'David', u'Bowie', u'Space', u'Oddity'])
matcher(doc)
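The step that gets the link is left open in the callback above. One way to fill it in, sketched here without spaCy so it runs on its own, is to turn each matched span into a guessed Wikipedia URL. `wikipedia_url` and `collect_links` are hypothetical helpers, the URL is only a guess derived from the surface text (it is not checked against Wikipedia), and the `matches` tuple uses placeholder ID values in the (ent_id, label, start, end) shape used in the answer.

```python
from urllib.parse import quote


def wikipedia_url(surface_text, lang="en"):
    """Hypothetical helper: guess a Wikipedia URL from the matched text."""
    title = surface_text.strip().replace(" ", "_")
    return "https://{}.wikipedia.org/wiki/{}".format(lang, quote(title))


def collect_links(doc, matches):
    """Work that a callback such as getlink could delegate to.

    `doc` only needs to support slicing (a list of words stands in for a
    spaCy Doc here); `matches` holds (ent_id, label, start, end) tuples.
    """
    links = []
    for _ent_id, _label, start, end in matches:
        text = " ".join(str(token) for token in doc[start:end])
        links.append((text, wikipedia_url(text)))
    return links


# Stand-in data: the word list mirrors the Doc built in the answer, and the
# match covers tokens 0..2 ("David Bowie"); the two IDs are placeholders.
words = ["David", "Bowie", "Space", "Oddity"]
matches = [(0, 0, 0, 2)]
links = collect_links(words, matches)
print(links)
```

Inside a real callback you would call `collect_links(doc, matches)` and store the result wherever your application needs it.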
