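I'm writing an algorithm in Python that processes a column of sentences and returns the polarity (positive or negative) of each cell in that column. The script uses the lists of negative and positive words from the NRC Emotion Lexicon (French version). I'm having trouble writing the preprocessing function. I have already written the counting function and the polarity function, but because I struggled with the preprocessing function, I'm not sure whether they work.
The positive and negative words are in the same file (the lexicon), but I exported them into two separate files because I don't know how to use the lexicon as it is.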
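A minimal sketch of reading both word lists from one combined file, assuming it is tab-separated with a `French Word` column and 0/1 `positive` and `negative` columns. The column names, the helper name `load_lexicon`, and the sample rows are assumptions for illustration and may not match the actual lexicon file:

```python
import csv
import io

def load_lexicon(handle, word_col="French Word"):
    """Split one combined lexicon file into positive and negative word sets."""
    positive, negative = set(), set()
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        word = (row.get(word_col) or "").strip().lower()
        if not word:
            continue  # skip rows without a French entry
        if row.get("positive") == "1":
            positive.add(word)
        if row.get("negative") == "1":
            negative.add(word)
    return positive, negative

# Tiny in-memory stand-in for the real file (made-up rows, illustration only).
sample = io.StringIO(
    "French Word\tpositive\tnegative\n"
    "affable\t1\t0\n"
    "affreux\t0\t1\n"
    "agent\t0\t0\n"
)
positive_words, negative_words = load_lexicon(sample)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding='utf-8', newline='')`, and the column names adjusted to whatever the file's header line actually contains.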
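My function that counts the occurrences of positive and negative words doesn't work, and I don't know why it always returns 0. I added positive words to every sentence, so they should show up in the dataframe:
Output: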
[4 rows x 6 columns]
id Verbatim ... word_positive word_negative
0 15 Je n'ai pas bien compris si c'était destiné a ... ... 0 0
1 44 Moi aérien affable affaire agent de conservati... ... 0 0
2 45 Je affectueux affirmative te hais et la Foret ... ... 0 0
3 47 Je absurde accidentel accusateur accuser affli... ... 0 0
=>
def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))
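One possible cause of the constant 0, sketched under the assumption that `positive.txt` holds one word per line: in the full script the lexicon is passed as an open file object rather than a collection of words. `w in handle` then compares each token with raw lines such as `"affable\n"`, and the handle is exhausted after the first pass (or after a `.read()` call), so nothing can match. The temporary file and its two words below are made up for illustration:

```python
import os
import tempfile

# Write a throwaway word list, one word per line (assumed layout of positive.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("affable\naffectueux\n")
    path = tmp.name

tokens = ["je", "affable", "affectueux"]

# Membership against the open file compares tokens with raw lines ("affable\n")
# and consumes the file while doing so.
with open(path, encoding="utf-8") as handle:
    hits_from_handle = len([w for w in tokens if w in handle])

# Loading the words into a set strips the newlines first.
with open(path, encoding="utf-8") as handle:
    lexicon = {line.strip().lower() for line in handle if line.strip()}
hits_from_set = len([w for w in tokens if w in lexicon])

os.remove(path)
print(hits_from_handle, hits_from_set)
```

If this is indeed the issue, passing a set built this way as `word_list` should let the counting function find matches, provided the lemmas produced by `process_text` are spelled the same way as the lexicon entries.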
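Here is my csv_data. Rows 44 and 45 contain positive words and row 47 contains more negative words, but the positive and negative word columns are always empty: the function doesn't return the word counts, and the last column is always positive even though the last sentence is negative.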
id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue
Here is the full code:
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")
from itertools import islice
from collections import defaultdict, Counter
csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())
stopWords = set(stopwords.words('french'))
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
def process_text(text):
    '''extract lemma and lowerize then removing stopwords.'''
    text_preprocess = []
    text_without_stopwords = []
    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)
    text_without_stopwords = [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords
csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)
lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())
def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)
lexiconneg = open('negative.txt', 'r', encoding='utf-8')
def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)
    intersection = [w for w in text_list if w in word_list]
    return len(intersection)
#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)
def polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text = count_occurences_pos(text, lexiconpos)
    negatives_text = count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text:
        return "positive"
    else:
        return "negative"
csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)
If you could also check whether the rest of the code is okay, thank you.
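For checking the counting and polarity logic separately from TreeTagger, here is a minimal sketch that swaps in a plain regex tokenizer (no lemmatization, so this is not equivalent to the `process_text` above). It lowercases before the stop-word test and returns "neutral" on a tie instead of falling through to "negative". `STOP_WORDS`, `POSITIVE`, `NEGATIVE`, `simple_tokens`, and `polarity` are hypothetical names with illustrative entries, not the real NLTK or lexicon contents:

```python
import re

STOP_WORDS = {"je", "et", "la", "de", "te"}         # hypothetical stand-in for NLTK's French list
POSITIVE = {"affable", "affectueux", "admirateur"}  # illustrative entries only
NEGATIVE = {"absurde", "affreux", "agressif"}       # illustrative entries only

def simple_tokens(text):
    """Lowercase first, then drop stop words (no lemmatization here)."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def polarity(text):
    """Compare positive and negative hits; report ties as 'neutral'."""
    tokens = simple_tokens(text)
    pos = sum(1 for w in tokens if w in POSITIVE)
    neg = sum(1 for w in tokens if w in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(polarity("Je affectueux affirmative te hais"))
print(polarity("Je absurde affreux agressif admirateur"))
```

Tokenizing once per sentence and counting against both sets avoids running the tagger twice per row, which the original `polarity_score` does by calling both counting functions.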