python - Subject and object identification in Python

Question

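I want to identify the subject and object of each sentence in a set of sentences. My actual task is to find cause-and-effect relations in a set of review data.

I am using the spaCy package to chunk and parse the data, but it does not get me to my goal. Is there a way to do this?

For example: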

 I thought it was the complete set

Output:

subject  object
I        complete set

score 14 · Accepted Answer

In the simplest way: after importing spaCy and parsing the text, each token's dependency label is available through token.dep_:

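import spacy
# spaCy 3+ no longer accepts the 'en' shortcut; load a named model instead
nlp = spacy.load('en_core_web_sm')
parsed_text = nlp(u"I thought it was the complete set")

# default to None so the prints below do not fail when a label is absent
subject = direct_object = indirect_object = None

# get token dependencies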
for text in parsed_text:
    #subject would be
    if text.dep_ == "nsubj":
        subject = text.orth_
    #iobj for indirect object
    if text.dep_ == "iobj":
        indirect_object = text.orth_
    #dobj for direct object
    if text.dep_ == "dobj":
        direct_object = text.orth_

print(subject)
print(direct_object)
print(indirect_object)
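
A sentence can contain several tokens with the same label, and the loop above keeps only the last match for each one. Below is a minimal sketch that collects every match per label instead. `collect_by_dep` is a hypothetical helper, and the token list is a hand-written approximation of the parse (an assumption, not model output) so that the snippet runs without a model; real spaCy tokens expose the same `text` and `dep_` attributes.

```python
from collections import defaultdict, namedtuple

# Stand-in for spaCy tokens; an assumption so the sketch runs without a model.
Tok = namedtuple("Tok", ["text", "dep_"])


def collect_by_dep(tokens, labels=("nsubj", "dobj", "iobj")):
    """Group token texts by dependency label, keeping every match."""
    found = defaultdict(list)
    for tok in tokens:
        if tok.dep_ in labels:
            found[tok.dep_].append(tok.text)
    return dict(found)


# Hand-written parse of the example sentence (assumed, not model output).
tokens = [
    Tok("I", "nsubj"), Tok("thought", "ROOT"), Tok("it", "nsubj"),
    Tok("was", "ccomp"), Tok("the", "det"), Tok("complete", "amod"),
    Tok("set", "attr"),
]
result = collect_by_dep(tokens)
print(result)
```

With a real parse, the `Doc` can be passed directly, for example `collect_by_dep(nlp(text))`, since iterating over a `Doc` yields its tokens.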

score 1 · Accepted Answer

You can use noun chunks.

Code:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("I thought it was the complete set")
for nc in doc.noun_chunks:
    print(nc.text)

Result:

I
it
the complete set

To select only "I" rather than both "I" and "it", you can first write a test that picks the nsubj to the left of the ROOT.
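
One way such a test might look is sketched below. `root_subject` is a hypothetical helper; it relies only on the `dep_`, `head`, and `i` attributes that spaCy tokens provide. The `Tok` class and the parse are hand-written stand-ins (assumptions) so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Tok:
    """Stand-in for a spaCy token (assumption, for illustration only)."""
    text: str
    dep_: str
    i: int
    head: Optional["Tok"] = None


def root_subject(tokens):
    """Return the nsubj token attached to the ROOT and left of it, or None."""
    for tok in tokens:
        if (tok.dep_ == "nsubj" and tok.head is not None
                and tok.head.dep_ == "ROOT" and tok.i < tok.head.i):
            return tok
    return None


# Hand-written partial parse of "I thought it was the complete set".
thought = Tok("thought", "ROOT", 1)
thought.head = thought  # in spaCy the ROOT token is its own head
was = Tok("was", "ccomp", 3, thought)
tokens = [Tok("I", "nsubj", 0, thought), thought, Tok("it", "nsubj", 2, was), was]

subject = root_subject(tokens)
print(subject.text)
```

Here "it" is skipped because its head is "was", not the ROOT verb "thought".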

score 0 · Accepted Answer

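Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. The modules are built on top of the PyTorch library.

Stanza is a Python natural language analysis package. It contains tools that can be used in a pipeline to convert a string of human-language text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, produce a syntactic dependency parse, and recognize named entities.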

def find_Subject_Object(text):
    # import required packages
    import stanza
    nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse')
    doc = nlp(text)
    clausal_subject = []
    nominal_subject = []
    indirect_object = []
    Object          = []
    for sent in doc.sentences:
        for word in sent.words:
            if word.deprel  == "nsubj":
                nominal_subject.append({word.text:"nominal_subject nsubj"})
            elif word.deprel  == "csubj":
                clausal_subject.append({word.text:"clausal_subject csubj"})
            elif word.deprel  == "iobj":
                indirect_object.append({word.text:"indirect_object iobj"})
            elif word.deprel  == "obj":
                Object.append({word.text:"object obj"})
    return indirect_object, Object, clausal_subject,nominal_subject

text ="""John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City."""

find_Subject_Object(text)
# output #
([], [{'City': 'object obj'}], [], [{'John': 'nominal_subject nsubj'}, {'Airport': 'nominal_subject nsubj'}])
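
For the cause-and-effect use case in the question, it may help to know which subject and object belong to the same verb. The sketch below pairs them through the `head` index that Stanza words carry (1-based `id`, with `head == 0` for the root). `subject_object_pairs` is a hypothetical helper, and the `Word` tuples are a hand-written UD-style parse (an assumption, not Stanza output) so the snippet runs without downloading a model.

```python
from collections import namedtuple

# Stand-in for stanza words; an assumption so the sketch runs without a model.
Word = namedtuple("Word", ["id", "text", "head", "deprel"])


def subject_object_pairs(words):
    """Return (subject, head word, object) for each nsubj/obj sharing a head."""
    by_id = {w.id: w for w in words}
    subjects, objects = {}, {}
    for w in words:
        if w.deprel == "nsubj":
            subjects.setdefault(w.head, []).append(w.text)
        elif w.deprel == "obj":
            objects.setdefault(w.head, []).append(w.text)
    triples = []
    for head_id, subs in subjects.items():
        for s in subs:
            for o in objects.get(head_id, []):
                triples.append((s, by_id[head_id].text, o))
    return triples


# Hand-written parse of "The delay caused a refund" (assumed, not model output).
words = [
    Word(1, "The", 2, "det"), Word(2, "delay", 3, "nsubj"),
    Word(3, "caused", 0, "root"), Word(4, "a", 5, "det"),
    Word(5, "refund", 3, "obj"),
]
triples = subject_object_pairs(words)
print(triples)
```

With a real pipeline, the function would be called once per sentence, for example `subject_object_pairs(sent.words)` inside the `for sent in doc.sentences` loop.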

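Stanza also includes a Python interface to the CoreNLP Java package and inherits additional functionality from it, such as constituency parsing, coreference resolution, and linguistic pattern matching.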

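To summarize, Stanza features:

A native Python implementation that requires minimal setup;
A full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition;
Pretrained neural models supporting 66 (human) languages;
A stable, officially maintained Python interface to CoreNLP.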
