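Your dataset appears to be some kind of technical writing with a very regular structure, so part-of-speech tags may be enough for the extraction you want.
I suggest reading the paper "Identifying Relations for Open Information Extraction" to learn about its POS-tag-based patterns.
The snippet below tags the sentence with part-of-speech tags and then looks for sequences that match the pattern known as ReVerb.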
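import nltk
# Note: ADV, AUX, NOUN, ADJ, DET, ADP and PART are universal-tagset names, while VBN and IN
# are Penn Treebank names. nltk.pos_tag returns Penn Treebank tags by default, so only the
# Penn Treebank names in these patterns can match unless the tags are adjusted.
verb = "<ADV>*<AUX>*<VBN><IN|PART>*<ADV>*"
word = "<NOUN|ADJ|ADV|DET|ADP>"
preposition = "<ADP|ADJ>"
# A relation phrase is a verb group, optionally followed by words that end in a preposition.
rel_pattern = "( %s (%s* (%s)+ )? )+ " % (verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern
# Build a chunker that labels every matching tag sequence as REL_PHRASE.
reverb_pattern = nltk.RegexpParser(grammar_long)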
sent = "where the equation caused by the eccentricity is maximum."
# Tag the whitespace-split tokens (the trailing period is dropped so it does not stick to "maximum").
sent_pos_tags = nltk.tag.pos_tag(sent.rstrip(".").split())
# Walk the chunked tree and print every subtree labelled REL_PHRASE.
for x in reverb_pattern.parse(sent_pos_tags):
    if isinstance(x, nltk.Tree) and x.label() == 'REL_PHRASE':
        rel_phrase = " ".join([t[0] for t in x.leaves()])
        print(rel_phrase)
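One piece is still missing: finding the noun phrases closest to the left and right of the pattern, but I leave that as an exercise. I also wrote a blog post with a more detailed example. I hope it helps.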
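For anyone attempting the exercise, here is one possible starting point, not a definitive solution. It assumes the sentence is already tagged with Penn Treebank tags and that the relation phrase is given as a token index range. The names `noun_phrase_spans` and `nearest_arguments` are hypothetical helpers, the tags in `tagged` are written by hand for illustration rather than produced by a tagger, and the noun-phrase definition (a run of determiners, adjectives and nouns containing at least one noun) is a deliberate simplification.

```python
# Hypothetical helpers: a simplified noun-phrase finder over (word, tag) pairs.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
NP_TAGS = NOUN_TAGS | {"DT", "JJ"}  # assumption: an NP is determiners/adjectives/nouns


def noun_phrase_spans(tagged):
    """Return (start, end) index pairs of maximal NP_TAGS runs that contain a noun."""
    spans, start = [], None
    # A sentinel token closes a run that reaches the end of the sentence.
    for i, (_, tag) in enumerate(list(tagged) + [("", "END")]):
        if tag in NP_TAGS:
            if start is None:
                start = i
        else:
            if start is not None and any(t in NOUN_TAGS for _, t in tagged[start:i]):
                spans.append((start, i))
            start = None
    return spans


def nearest_arguments(tagged, rel_start, rel_end):
    """Return the closest noun phrase left and right of tokens [rel_start, rel_end)."""
    spans = noun_phrase_spans(tagged)
    left = [s for s in spans if s[1] <= rel_start]
    right = [s for s in spans if s[0] >= rel_end]

    def text(span):
        return " ".join(word for word, _ in tagged[span[0]:span[1]])

    return (text(left[-1]) if left else None, text(right[0]) if right else None)


# Hand-written tags for illustration only; a real tagger may assign different ones.
tagged = [("where", "WRB"), ("the", "DT"), ("equation", "NN"), ("caused", "VBN"),
          ("by", "IN"), ("the", "DT"), ("eccentricity", "NN"), ("is", "VBZ"),
          ("maximum", "JJ")]

# The relation phrase "caused by" covers token indices 3 and 4.
print(nearest_arguments(tagged, 3, 5))
```

In practice the index range would come from the position of the REL_PHRASE subtree in the chunked sentence, and a second chunk rule for noun phrases in the same grammar may serve better than this hand-rolled scan.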