pandas - Python beginner: preprocessing French text in Python and computing polarity with a lexicon

Question

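I am writing an algorithm in Python that processes a column of sentences and then gives the polarity (positive or negative) of each cell in that column. The script uses the lists of negative and positive words from the NRC Emotion Lexicon (French version). I am having trouble writing the preprocessing function. I have already written the counting function and the polarity function, but because of my difficulties with the preprocessing function I am not sure whether they work.

The positive and negative words are in the same file (the lexicon), but I exported the positive and negative words separately because I don't know how to use the lexicon as it is.

My function that counts positive and negative occurrences does not work, and I don't know why it always returns 0. I added positive words to each sentence, so they should show up in the dataframe:

Output: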


[4 rows x 6 columns]
   id                                           Verbatim      ...       word_positive  word_negative
0  15  Je n'ai pas bien compris si c'était destiné a ...      ...                   0              0
1  44  Moi aérien affable affaire agent de conservati...      ...                   0              0
2  45  Je affectueux affirmative te hais et la Foret ...      ...                   0              0
3  47  Je absurde accidentel accusateur accuser affli...      ...                   0              0

=>  
def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]


    return len(intersection)
csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))

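Here is my csv_data. Rows 44 and 45 contain positive words and row 47 contains more negative words, but the positive and negative word columns are always empty: the function does not return the word count, and the last column is always positive even though the last sentence is negative.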

id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue

Here is the full code:

# -*- coding: UTF-8 -*-
import codecs 
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")

from itertools import islice
from collections import defaultdict, Counter



csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())

stopWords = set(stopwords.words('french'))  
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')     
def process_text(text):
    '''extract lemma and lowerize then removing stopwords.'''

    text_preprocess =[]
    text_without_stopwords= []

    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)


    text_without_stopwords= [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords

csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)


lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())
def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''

    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)


#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)

lexiconneg = open('negative.txt', 'r', encoding='utf-8')

def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)
#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)

def polarity_score(text):   
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text =count_occurences_pos(text, lexiconpos)
    negatives_text =count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text :
        return "positive"
    else : 
        return "negative"
csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)

If you could also check whether the rest of the code is fine, thank you.

score 1 · Accepted Answer

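I found your error! It comes from the Polarity_score function.

It is just a typo: in your if statement you are comparing count_occurences_Pos and count_occurences_Neg, which are functions, instead of comparing the results returned by those functions (count_text_pos and count_text_neg).

Your code should look like this: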

def Polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    # lexiconPos / lexiconNeg: the positive and negative word lists from the question
    count_text_pos = count_occurences_Pos(text, lexiconPos)
    count_text_neg = count_occurences_Neg(text, lexiconNeg)
    # compare the counts returned by the functions, not the functions themselves
    if count_text_pos > count_text_neg:
        return "positive"
    else:
        return "negative"

In future, give your variables meaningful names to avoid this kind of mistake. With clearer variable names, your function would be:

def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    # lexiconpos / lexiconneg are the names used in the question's full code
    positives_text = count_occurences_pos(text, lexiconpos)
    negatives_text = count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text:
        return "positive"
    else:
        return "negative"
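
Separately from the comparison fix, one thing in the question's full code looks worth checking: `lexiconpos` and `lexiconneg` are open file handles rather than lists of words, and `print(lexiconpos.read())` reads the handle to the end. A test such as `w in word_list` on a file handle iterates over the remaining lines (each still carrying its trailing newline), so it is unlikely to match anything, which would explain counts of 0. Below is a minimal sketch of loading a lexicon into a set, assuming one word per line in `positive.txt` and `negative.txt`. The helper names `load_lexicon` and `count_occurrences` are hypothetical, and the demo writes a small temporary file so that it runs on its own.

```python
import os
import tempfile


def load_lexicon(path, encoding="utf-8"):
    """Read a one-word-per-line file into a set of lowercased words."""
    with open(path, "r", encoding=encoding) as handle:
        # strip() drops the trailing newline that would otherwise prevent matches
        return {line.strip().lower() for line in handle if line.strip()}


def count_occurrences(tokens, lexicon):
    """Count how many tokens (repeats included) appear in the lexicon."""
    return sum(1 for token in tokens if token in lexicon)


# Throwaway file standing in for positive.txt (assumed one word per line).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("affable\naffectueux\nadmirateur\n")
    demo_path = tmp.name

lexicon_pos = load_lexicon(demo_path)
os.remove(demo_path)

# Stand-in for the output of process_text on one sentence.
tokens = ["je", "affectueux", "affirmative", "affable"]
positive_count = count_occurrences(tokens, lexicon_pos)
print(positive_count)
```

With the lexicons loaded this way, `lexiconpos` and `lexiconneg` can be passed to the counting functions as ordinary collections of words.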

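Another improvement you can make in the count_occurences_pos and count_occurences_neg functions is to use sets instead of lists. Your text and word_list can both be converted to sets, and you can use set intersection to retrieve the positive words in the text, because membership tests on a set are faster than on a list.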
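
The set suggestion can be sketched as follows. This is an illustration only: `count_distinct_matches` and `count_all_matches` are hypothetical names, and the token list stands in for the output of `process_text`. One caveat: a set intersection counts each distinct word once, so if a repeated word should count several times, testing each token against a set keeps the occurrence count while still avoiding a scan of the whole list.

```python
def count_distinct_matches(tokens, lexicon):
    """Number of distinct tokens found in the lexicon (set intersection)."""
    return len(set(tokens) & set(lexicon))


def count_all_matches(tokens, lexicon):
    """Number of tokens, repeats included, found in the lexicon."""
    lexicon_set = set(lexicon)  # average O(1) membership test per token
    return sum(1 for token in tokens if token in lexicon_set)


# Stand-ins for a processed sentence and a list of negative words.
tokens = ["affreux", "agressif", "affreux", "vie"]
negative_words = ["affreux", "agressif", "absurde"]

distinct = count_distinct_matches(tokens, negative_words)
total = count_all_matches(tokens, negative_words)
print(distinct, total)
```

Which of the two counts is appropriate depends on whether the polarity score should weight repeated words. The original list comprehension counts repeats, so `count_all_matches` is the closer replacement.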
