https://spacy.io/usage/rule-based-matching#phrasematcher

For this example:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab)
    terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
    # Only run nlp.make_doc to speed things up
    patterns = [nlp.make_doc(text) for text in terms]
    matcher.add("TerminologyList", patterns)

    doc = nlp("He lives in Washington, D.C. and Boston. ")

The docs say:

Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.

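The reason 'Washington, D.C.' can be matched against the text without worrying about tokenization is that 'Washington, D.C.' happens to be tokenized correctly. Suppose the text were tokenized like this instead: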

['in', 'Washington', ',',  'D.', 'C. and', 'Boston', '.']

My question is: if 'C. and' were tokenized as a single token, would the match for 'Washington, D.C.' still succeed?

1 Answer

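As long as the start and end of your phrase fall on token boundaries, it doesn't matter how 'Washington, D.C.' is tokenized internally. In your example it would not match, because 'C. and' is a single token (for some unusual reason?).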

So, likewise, if 'D.C.' is a single token, you can't match 'Washington D.', and you can't match 'D.C' (without the final '.').
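The boundary rule can be illustrated with a small pure-Python sketch. This is a simplified model of matching one token sequence against another, not spaCy's actual implementation; `find_phrase` is a hypothetical helper, and the token lists are written by hand to mirror the tokenizations discussed in the question.

    def find_phrase(doc_tokens, pattern_tokens):
        """Return (start, end) token spans where the pattern occurs in the doc."""
        n = len(pattern_tokens)
        return [
            (i, i + n)
            for i in range(len(doc_tokens) - n + 1)
            if doc_tokens[i:i + n] == pattern_tokens
        ]

    # Hand-written (assumed) tokenizations, for illustration only.
    pattern = ["Washington", ",", "D.", "C."]
    normal_doc = ["He", "lives", "in", "Washington", ",", "D.", "C.", "and", "Boston", "."]
    odd_doc = ["He", "lives", "in", "Washington", ",", "D.", "C. and", "Boston", "."]

    # The phrase ends on a token boundary, so it is found.
    normal_matches = find_phrase(normal_doc, pattern)

    # 'C. and' is one token, so the phrase would have to end inside a token.
    odd_matches = find_phrase(odd_doc, pattern)

    # If 'D.C.' is one token, 'D.C' (no final '.') cannot be matched either.
    partial_matches = find_phrase(["Washington", ",", "D.C."], ["D.C"])

    print(normal_matches, odd_matches, partial_matches)

Here only `normal_matches` is non-empty: in the other two cases the phrase would need to start or end in the middle of a token, which a token-based matcher cannot express.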

Answered 2021-07-26T17:34:41.800