https://spacy.io/usage/rule-based-matching#phrasematcher
For this example:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)
doc = nlp("He lives in Washington, D.C. and Boston. ")
The documentation says:
Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.
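As I understand it, 'Washington, D.C.' can be matched against the text without worrying about tokenization only because 'Washington, D.C.' happens to be tokenized correctly inside the text. Suppose the text were instead tokenized like this: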
['in', 'Washington', ',', 'D.', 'C. and', 'Boston', '.']
My question is: if 'C. and' were tokenized as a single token, would the match for 'Washington, D.C.' still succeed?
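One way to check this is to build the two tokenizations by hand and run the matcher over both. This is only a minimal sketch under some assumptions: the Doc objects are constructed manually from word lists instead of coming from the real tokenizer, the pattern is assumed to tokenize as ['Washington', ',', 'D.C.'], and spacy.blank("en") stands in for en_core_web_sm so that no trained model is needed. Since PhraseMatcher compares token sequences (on the verbatim token text by default), I would expect the first text to match and the hypothetical one not to.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

# A blank English pipeline is enough here; no trained model is required.
nlp = spacy.blank("en")

# Assumption: the term is tokenized into these three tokens.
pattern = Doc(nlp.vocab, words=["Washington", ",", "D.C."])
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TerminologyList", [pattern])

# Text tokenized consistently with the pattern.
consistent = Doc(
    nlp.vocab,
    words=["He", "lives", "in", "Washington", ",", "D.C.", "and", "Boston", "."],
)

# Hypothetical tokenization from the question, with "C. and" as one token.
hypothetical = Doc(
    nlp.vocab,
    words=["He", "lives", "in", "Washington", ",", "D.", "C. and", "Boston", "."],
)

# Each match is (match_id, start, end); keep only the token offsets.
consistent_spans = [(start, end) for _, start, end in matcher(consistent)]
hypothetical_spans = [(start, end) for _, start, end in matcher(hypothetical)]

print("consistent tokenization:", consistent_spans)
print("hypothetical tokenization:", hypothetical_spans)
```

If the second list comes back empty, that would suggest the quoted passage only holds when the term is split into the same tokens inside the text as it is in the pattern.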