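I recently solved a very similar problem: I needed to extract subject(s), action, and object(s). I open-sourced my work, so you can check out this library:
https://github.com/krzysiekfonal/textpipeliner
It is based on spaCy (a competitor to NLTK), but it also relies on the sentence tree.
So, for instance, let's take this document, parsed with spaCy, as an example: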
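import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *

# "en" is the model shortcut from older spaCy releases; spaCy 3+ needs a full name such as "en_core_web_sm"
nlp = spacy.load("en")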
doc = nlp(u"The Empire of Japan aimed to dominate Asia and the " \
"Pacific and was already at war with the Republic of China " \
"in 1937, but the world war is generally said to have begun on " \
"1 September 1939 with the invasion of Poland by Germany and " \
"subsequent declarations of war on Germany by France and the United Kingdom. " \
"From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered " \
"or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. " \
"Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and " \
"annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. " \
"The war continued primarily between the European Axis powers and the coalition of the United Kingdom " \
"and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, " \
"the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the " \
"long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion " \
"of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part " \
"of the Axis' military forces into a war of attrition. In December 1941, Japan attacked " \
"the United States and European territories in the Pacific Ocean, and quickly conquered much of " \
"the Western Pacific.")
Now you can create a simple pipes structure (more about pipes in the README of the project):
pipes_structure = [
    # element 0, the subject: named entities under the root verb's nsubj
    SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                  NamedEntityFilterPipe(),
                  NamedEntityExtractorPipe()]),
    # element 1, the action: the root verb itself
    FindTokensPipe("VERB"),
    # element 2, the object: the first of the two alternatives that yields a result
    AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                           AggregatePipe([NamedEntityFilterPipe("GPE"),
                                          NamedEntityFilterPipe("PERSON")]),
                           NamedEntityExtractorPipe()]),
             SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                           AggregatePipe([NamedEntityFilterPipe("LOC"),
                                          NamedEntityFilterPipe("PERSON")]),
                           NamedEntityExtractorPipe()])])]

engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()
As a result you will get:
>>>[([Germany], [conquered], [Europe]),
([Japan], [attacked], [the, United, States])]
Actually, textpipeliner (specifically its finding pipes) is strongly based on another library, grammaregex. You can read about it in this post:
https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc
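To make the path patterns a bit more concrete, here is a minimal toy sketch of the idea: a pattern such as VERB/nsubj/* alternates token labels and dependency labels and is walked starting from the root of the sentence tree. This is not grammaregex's actual implementation. The `Node` class, the `find_tokens` function, and the hand-built tree are hypothetical stand-ins (the tree is not real parser output), and the sketch leaves out features such as `**` and `[a,b]` alternatives.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token in a dependency tree."""
    text: str
    pos: str                      # label matched by the node parts of a pattern
    dep: str = "ROOT"             # dependency label of the edge from the parent
    children: List["Node"] = field(default_factory=list)


def find_tokens(root: Node, pattern: str) -> List[Node]:
    """Walk a node/edge/node/... pattern from the root; return the end nodes."""
    parts = pattern.split("/")
    if len(parts) % 2 == 0:
        raise ValueError("pattern must end with a node label")
    # The first part must match the root token itself.
    if parts[0] not in ("*", root.pos):
        return []
    current = [root]
    # The remaining parts come in (edge label, node label) pairs.
    for edge, label in zip(parts[1::2], parts[2::2]):
        current = [child
                   for node in current
                   for child in node.children
                   if edge in ("*", child.dep) and label in ("*", child.pos)]
    return current


# Hand-built toy tree for "Germany conquered Europe and formed an alliance".
root = Node("conquered", "VERB", "ROOT", [
    Node("Germany", "PROPN", "nsubj"),
    Node("Europe", "PROPN", "dobj"),
    Node("formed", "VERB", "conj", [Node("alliance", "NOUN", "dobj")]),
])

print([n.text for n in find_tokens(root, "VERB/nsubj/*")])    # ['Germany']
print([n.text for n in find_tokens(root, "VERB/conj/VERB")])  # ['formed']
```

Because matching always begins at the root, a longer prefix such as VERB/conj/VERB simply moves the starting point for the rest of the pattern down to the second verb.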
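EDITED
Actually, the example I presented in the README discards adjectives, but all you need to do is adjust the pipes structure passed to the engine according to your needs. For instance, for your sample sentences I can propose the following structure/solution, which gives you a tuple of 3 elements (subject, verb, adjective) per sentence: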
import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *

# doc is assumed to be the spaCy Doc built from your sample sentences
pipes_structure = [
    # element 0, the subject
    SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                  NamedEntityFilterPipe(),
                  NamedEntityExtractorPipe()]),
    # element 1, the verb, together with any xcomp verb and its auxiliaries
    AggregatePipe([FindTokensPipe("VERB"),
                   FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                   FindTokensPipe("VERB/xcomp/VERB")]),
    # element 2, the adjective, attached to the verb or to its object/attribute noun
    AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
             AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                            FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
]

engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()
It will give you the result:
[([Donald, Trump], [is], [the, worst])]
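A small complication is that you have compound sentences, and the library produces one tuple per sentence. I will soon add the possibility (I need it for my project too) to pass a list of pipe structures to the engine, so that it can produce more than one tuple per sentence. For now you can work around it by creating a second engine for compound sentences, whose structure differs only in using VERB/conj/VERB instead of VERB (those regexes always start from ROOT, so VERB/conj/VERB leads you to the second verb of a compound sentence):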
pipes_structure_comp = [
    SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                  NamedEntityFilterPipe(),
                  NamedEntityExtractorPipe()]),
    AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                   FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                   FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
    AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
             AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                            FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
]

engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0,1,2])
Now, after you run both engines, you will get the expected result :)
engine.process()
engine2.process()
[([Donald, Trump], [is], [the, worst])]
[([Hillary], [is], [better])]
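I think this is what you need. Of course, I just quickly created a pipes structure for the given example sentences, and it won't work for every case. Still, I have seen a lot of sentence structures, and it should already cover a decent percentage of them. For the cases that don't work yet, you can add more FindTokensPipe entries and so on, and I'm sure that after a few adjustments you will cover a really good number of possible sentences (English is not too complex, so... :)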
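As a small convenience for the two-engine workaround, the tuples can be gathered into one list. This is only a sketch: `run_engines` and `StubEngine` are hypothetical names that are not part of textpipeliner, and it assumes `process()` returns a list of tuples, as the outputs shown earlier suggest. The stub stands in for real `PipelineEngine` objects so the snippet runs without spaCy.

```python
def run_engines(engines):
    """Run each engine in order and concatenate the tuples they return."""
    results = []
    for engine in engines:
        results.extend(engine.process())
    return results


class StubEngine:
    """Hypothetical stand-in for PipelineEngine; returns canned tuples."""

    def __init__(self, tuples):
        self._tuples = tuples

    def process(self):
        return list(self._tuples)


# Plain strings stand in for the tokens that the real engines return.
main = StubEngine([(["Donald", "Trump"], ["is"], ["the", "worst"])])
compound = StubEngine([(["Hillary"], ["is"], ["better"])])

combined = run_engines([main, compound])
print(len(combined))  # 2
```

With the real objects, the call would presumably be `run_engines([engine, engine2])`.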