spacy - Does PhraseMatcher in spaCy still work with incorrect tokenization?

Question

https://spacy.io/usage/rule-based-matching#phrasematcher

For this example:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab)
    terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
    # Only run nlp.make_doc to speed things up
    patterns = [nlp.make_doc(text) for text in terms]
    matcher.add("TerminologyList", patterns)

    doc = nlp("He lives in Washington, D.C. and Boston.")

The documentation says:

Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.

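The reason 'Washington, D.C.' matches the text without any concern for tokenization is that 'Washington, D.C.' happens to be tokenized correctly. Suppose the tokenization were instead:

    ['in', 'Washington', ',', 'D.', 'C. and', 'Boston', '.']

My question is: if 'C. and' ends up as a single token, does matching 'Washington, D.C.' still succeed?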

score 1 · Accepted Answer

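As long as the start and end of your phrase fall on token boundaries, it doesn't matter how 'Washington, D.C.' is tokenized internally, because the pattern and the text go through the same tokenizer. In your example it would not match, because 'C. and' is a single token (for some unusual reason?).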

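For the same reason, you can't match 'Washing' or 'ton D.' if 'Washington' is a single token, and you can't match 'D.C' (without the final '.') if 'D.C.' is a single token.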
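To make the token-boundary rule concrete, here is a minimal pure-Python sketch. It is a toy model of whole-token sequence matching, not spaCy's actual implementation: the helper `find_phrase` and the hand-written token lists are hypothetical, and the "good" tokenization is assumed for illustration rather than taken from a real tokenizer run.

```python
def find_phrase(tokens, phrase):
    """Return (start, end) token index pairs where `phrase` occurs
    as a contiguous sequence of whole tokens in `tokens`."""
    n = len(phrase)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase
    ]


# Hypothetical tokenization of the pattern "Washington, D.C."
phrase = ["Washington", ",", "D.", "C."]

# Text tokenized consistently with the pattern: the phrase ends on a token boundary.
good = ["in", "Washington", ",", "D.", "C.", "and", "Boston", "."]

# Tokenization from the question: 'C. and' is one token, so the phrase
# would have to end in the middle of a token.
bad = ["in", "Washington", ",", "D.", "C. and", "Boston", "."]

good_spans = find_phrase(good, phrase)
bad_spans = find_phrase(bad, phrase)

print(good_spans)  # [(1, 5)]
print(bad_spans)   # []
```

The second call finds nothing because comparison happens token by token: 'C.' is not equal to 'C. and', just as 'Washing' is not equal to 'Washington'.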
