我有一个旅行顾问评论的 csv 文件。有四列:
人、标题、评级、评论、review_date。
我希望此代码执行以下操作:
- 在 csv 中,创建一个名为“tarate”的新列。
- 用“pos”、“neg”或“neut”填充“tarate”。它应该读取“评级”中的数值。如果“评级”>=40,则“tarate”== 'pos';“tarate” == 'neut' 如果“rating” == 30;“tarate” == 'neg' 如果“rating”<30。
- 接下来,通过 SentimentIntensityAnalyzer 运行“review”列。
- 将输出记录在名为“scores”的新 csv 列中
- 使用“pos”和“neg”分类为“复合”值创建一个单独的 csv 列
- 运行 sklearn.metrics 工具将旅行顾问评级(“tarate”)与“compound”进行比较。这可以打印。
部分代码基于 [http://akashsenta.com/blog/sentiment-analysis-with-vader-with-python/]
这是我的 csv 文件:[https://github.com/nsusmann/vadersentiment]
我遇到了一些错误。我是一个初学者,我想我被诸如指向特定列和 lambda 函数之类的东西绊倒了。
这是代码:
# open command prompt
# import nltk
# nltk.download()
# pip3 install pandas
# pip3 installs sci-kitlearn
# pip3 install matplotlib
# pip3 install seaborn
# pip3 install vaderSentiment
#pip3 install openpyxl
import pandas
import nltk
nltk.download([
"vader_lexicon",
"stopwords"])
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from collections import Counter
import re
import math
import html
import sklearn
import sklearn.metrics as metrics
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import openpyxl
# open the file to save the review
import csv
outputfile = open('D:\Documents\Archaeology\Projects\Patmos\TextAnalysis\Sentiment\scraped_cln_sent.csv', 'w', newline='')
df = csv.writer(outputfile)
#open Vader Sentiment Analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#make SIA into an object
analyzer = SentimentIntensityAnalyzer()
#create a new column called "tarate"
df['tarate'],
#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, ['tarate'] == 'Pos',
df.loc[df['rating'] == 30, ['tarate'] == 'Neut',
df.loc[df['rating'] <= 20, ['tarate'] == 'Neg',
#use polarity_scores() to get sentiment metrics and write them to new column "scores"
df.head['scores'] == df['review'].apply(lambda review: sid.polarity_scores['review'])
#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])
#using column "compound", determine whether the score is <0> and write new column "score" recording positive or negative
df['score'] = df['compound'].apply(lambda score: 'pos' if score >=0 else 'neg')
ta.file()
#get accuracy metrics. this will compare the trip advisor rating (text version recorded in column "tarate") to the sentiment analysis results in column "score"
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
accuracy_score(df['tarate'],df['score'])
print(classification_report(df['tarate'],df['score'])) ```