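In a large text corpus, I want to extract every sentence that contains a specific (verb, noun) or (adjective, noun) pair somewhere in the sentence. I have a long list of pairs, but here is a sample. In my MWE, I am trying to extract sentences that contain "write/wrote/writing/writes" and "book/s". I have about 30 such word pairs.
Here is what I tried, but it fails to catch most of the sentences: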
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')

matcher = Matcher(nlp.vocab)
# Lemma "write", then exactly one token of any text, then lemma "book"
pattern1 = [{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]
# spaCy v2 call signature; spaCy v3 expects matcher.add("testy", [pattern1])
matcher.add("testy", None, pattern1)

for sent in doc.sents:
    # Re-parse the lemmatized sentence and print the original text on a match
    if matcher(nlp(sent.lemma_)):
        print(sent.text)
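Unfortunately, I only get one match:
"While writing this book, he had to fend off aliens and dinosaurs."
However, I also expect to get the "He wrote his first book" sentence. The other write/book sentences have "writer" as a noun, so it is good that they did not match.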
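
To pin down the behavior I am after independently of spaCy, here is a minimal sketch in plain Python. It assumes each sentence has already been lemmatized (the lemma lists below are written by hand for illustration and are not spaCy output), and it assumes the first word of the pair must come before the second, with any number of tokens in between. The names `PAIRS`, `has_pair`, and `matching_sentences` are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical pair list; the real one has about 30 (verb, noun) or (adjective, noun) pairs.
PAIRS: List[Tuple[str, str]] = [("write", "book")]


def has_pair(lemmas: List[str], first: str, second: str) -> bool:
    """Return True if `first` occurs and `second` occurs anywhere after it."""
    if first not in lemmas:
        return False
    # The earliest occurrence of `first` leaves the most room for `second`.
    start = lemmas.index(first)
    return second in lemmas[start + 1:]


def matching_sentences(
    sentences: Iterable[Tuple[str, List[str]]],
    pairs: List[Tuple[str, str]] = PAIRS,
) -> List[str]:
    """Keep the text of every sentence whose lemmas contain at least one pair."""
    return [
        text
        for text, lemmas in sentences
        if any(has_pair(lemmas, a, b) for a, b in pairs)
    ]


# Hand-lemmatized sample sentences (an assumption, not real lemmatizer output).
SENTENCES = [
    ("He wrote his first book when he was a hundred and fifty years old.",
     ["he", "write", "his", "first", "book", "when", "he", "be", "a",
      "hundred", "and", "fifty", "year", "old", "."]),
    ("While writing this book, he had to fend off aliens and dinosaurs.",
     ["while", "write", "this", "book", ",", "he", "have", "to", "fend",
      "off", "alien", "and", "dinosaur", "."]),
    ("Greene's cat in its deathbed testimony alleged that it was the original writer of the book.",
     ["greene", "'s", "cat", "in", "its", "deathbed", "testimony", "allege",
      "that", "it", "be", "the", "original", "writer", "of", "the", "book", "."]),
]

for text in matching_sentences(SENTENCES):
    print(text)
```

Here the first two sentences are kept and the "writer" sentence is dropped, because "writer" is a different lemma from "write". In spaCy terms, this corresponds to allowing zero or more tokens between the two lemmas, which the Matcher can express with an `{"OP": "*"}` wildcard token, whereas the middle token in `pattern1` requires exactly one token.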