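Using spaCy, I extract aspect-opinion pairs from text based on grammar rules that I defined. The rules rely on POS tags and dependency labels, obtained from token.pos_ and token.dep_. Below is an example of one of these grammar rules. If I pass it the sentence Japan is cool, it returns [('Japan', 'cool', 0.3182)], where the value is the polarity of cool.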

However, I don't know how to make it recognise named entities. For example, if I pass Air France is cool, I want to get [('Air France', 'cool', 0.3182)], but what I currently get is [('France', 'cool', 0.3182)].

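I checked the spaCy online documentation and I know how to extract named entities (doc.ents). What I want to know is what workaround would make my extractor use them. Please note that I don't want a forced measure such as concatenating the strings into AirFrance or Air_France.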

Thank you!

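import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_lg-2.2.5")
sid = SentimentIntensityAnalyzer()  # needs nltk.download('vader_lexicon') once
review_body = "Air France is cool."
doc = nlp(review_body)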

rule3_pairs = []

for token in doc:

    children = token.children
    A = "999999"
    M = "999999"
    add_neg_pfx = False

    for child in children :
        if(child.dep_ == "nsubj" and not child.is_stop): # nsubj is nominal subject
            A = child.text

        if(child.dep_ == "acomp" and not child.is_stop): # acomp is adjectival complement
            M = child.text

        # example - 'this could have been better' -> (this, not better)
        if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
            neg_prefix = "not"
            add_neg_pfx = True

        if(child.dep_ == "neg"): # neg is negation
            neg_prefix = child.text
            add_neg_pfx = True

    if (add_neg_pfx and M != "999999"):
        M = neg_prefix + " " + M

    if(A != "999999" and M != "999999"):
        rule3_pairs.append((A, M, sid.polarity_scores(M)['compound']))

Result

rule3_pairs
>>> [('France', 'cool', 0.3182)]

Expected output

rule3_pairs
>>> [('Air France', 'cool', 0.3182)]
1 Answer

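It is very easy to integrate entities into your extractor. For each child, check whether the "A" child (the nominal subject) is the head of some named entity, and if it is, use the whole entity as your aspect.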
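To make the head lookup concrete without loading a model, here is a minimal plain-Python sketch. The `Token` tuple, the hand-written parse, and the `entity_root` and `subject_text` helpers are all hypothetical stand-ins for what spaCy exposes through `token.dep_`, `doc.ents` and `ent.root`; this is an illustration of the idea, not spaCy's implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: text, dependency label, index of its head.
Token = namedtuple("Token", ["text", "dep", "head"])

# Hand-written parse of "Air France is cool." (assumed, not produced by a model).
tokens = [
    Token("Air", "compound", 1),
    Token("France", "nsubj", 2),
    Token("is", "ROOT", 2),
    Token("cool", "acomp", 2),
    Token(".", "punct", 2),
]
# Entities as (start, end) token offsets, end exclusive (assumed NER output).
entities = [(0, 2)]


def entity_root(tokens, start, end):
    """Return the index of the token in [start, end) whose head lies outside the span."""
    for i in range(start, end):
        head = tokens[i].head
        if head == i or not (start <= head < end):
            return i
    return start  # fallback for a malformed parse with no such token


def subject_text(tokens, entities, subj_index):
    """Expand a subject token to the full entity text if it is that entity's root."""
    roots = {entity_root(tokens, s, e): (s, e) for s, e in entities}
    if subj_index in roots:
        s, e = roots[subj_index]
        return " ".join(t.text for t in tokens[s:e])
    return tokens[subj_index].text


subj = next(i for i, t in enumerate(tokens) if t.dep == "nsubj")
print(subject_text(tokens, entities, subj))  # Air France
```

The parser attaches nsubj to France, not Air, so the extractor only ever sees the last word of the name. Keying the entities by their root token lets the rule swap in the full span exactly when the subject is that root.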

Here is the complete code:

# Notebook cell syntax; in a terminal, run the command without the leading "!"
!python -m spacy download en_core_web_lg
import nltk
nltk.download('vader_lexicon')

import spacy
nlp = spacy.load("en_core_web_lg")

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()


def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if(child.dep_ == "nsubj" and not child.is_stop): # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if(child.dep_ == "acomp" and not child.is_stop): # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if(child.dep_ == "neg"): # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
        if (add_neg_pfx and M != "999999"):
            M = neg_prefix + " " + M
        if(A != "999999" and M != "999999"):
            rule3_pairs.append((A, M, sid.polarity_scores(M)['compound']))
    return rule3_pairs

print(find_sentiment(nlp("Air France is cool.")))
print(find_sentiment(nlp("I think Gabriel García Márquez is not boring.")))
print(find_sentiment(nlp("They say Central African Republic is really great. ")))

The output of this code is what you need:

[('Air France', 'cool', 0.3182)]
[('Gabriel García Márquez', 'not boring', 0.2411)]
[('Central African Republic', 'great', 0.6249)]

Enjoy!

Answered 2020-04-04T22:52:01.073