python - How to count the number of repeated labels for each row of an Excel file, in Python or Excel?

Question

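I have an Excel file with 10K rows, each containing information about one tweet. The columns include: Tweet, Date of Tweet, User Name, Retweet Count, ..., User Location, Sentiment (the values are Positive, Negative, or Neutral), State (the values are the 50 US states), Abbreviation (the state abbreviations, e.g. CA, NJ, NY, ...), and CountofNegative (currently empty; I want to write the number of negative tweets for each state into this column, so it will hold 50 numbers).

Below you can see a screenshot of this dataset:

Problem: count the number of negative tweets for each state (or its abbreviation) and write it to the CountofNegative column. Here is my code: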

import pandas as pd

file=pd.read_excel("C:/Users/amtol/Desktop/Project/filter.xlsx")
UserLocation= file["User Location"]
Sentiment= file["Sentiment"]
CountofNegative= file["CountofNegative"]
State=file["State"]
Abbreviation= file["Abbreviation"]

for i, (loc,sent) in enumerate(zip(UserLocation, Sentiment)):
    count=0
    for j, (state, abbr) in enumerate(zip(State, Abbreviation)):
        if (loc == state or loc == abbr and sent == "Negative"):
            count=count+1
        file.loc[j+1,"CountofNegative"]=count

print(CountofNegative)

file.to_excel("C:/Users/amtol/Desktop/Project/filter.xlsx")

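There are no errors, but in the output file the first 24 values of the "CountofNegative" column are zero and the rest are 1 (which is not the correct answer). I also tried to check the program with print(CountofNegative), but nothing happens (no output). How can I fix my code?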
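One detail in the loop's condition is worth isolating: in Python, `and` binds more tightly than `or`, so `loc == state or loc == abbr and sent == "Negative"` checks the sentiment only for the abbreviation branch. A minimal sketch with hypothetical values (they are not taken from the spreadsheet):

```python
# Hypothetical values: the location matches a state name, but the tweet is positive.
loc, state, abbr, sent = "Texas", "Texas", "TX", "Positive"

# `and` binds tighter than `or`, so this is read as
# loc == state or (loc == abbr and sent == "Negative").
as_written = loc == state or loc == abbr and sent == "Negative"

# Parenthesizing the location test makes the sentiment check apply to both branches.
grouped = (loc == state or loc == abbr) and sent == "Negative"

print(as_written)  # True: the positive tweet is counted
print(grouped)     # False: the positive tweet is skipped
```

This accounts for only part of the wrong counts; the loop also resets `count` for every tweet and overwrites the same cells on each pass.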

score 1 · Accepted Answer

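OK, so if full state names and abbreviations are not used consistently, first convert the full names to abbreviations using the dictionary in the code below. If some names or abbreviations do not match your data, adjust the dict accordingly.

We only care about the "Negative" count, so convert Negative to 1 and every other response to 0, as follows: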

# Create a sample dataset
import pandas as pd

data = {
    'State': ['New York','New York','New York','New Jersey','New Jersey','New Jersey','California','California','California','NY','NJ','CA'],
    'Sentiment': ['Negative','Positive','Negative','Neutral','Negative','Positive','Positive','Positive','Positive','Negative','Positive','Negative'],
}
df = pd.DataFrame(data, columns=['State', 'Sentiment'])
print(df)

# Dictionary of US state names and their abbreviations
di = {
'Alabama': 'AL',
'Alaska': 'AK',
'American Samoa': 'AS',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Guam': 'GU',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY'
}

# Replace full state names in the State column with their abbreviations
df = df.replace({"State": di})

# Create a function that gives weight only to negative comments
def convert_to_int(word):
    word_dict = {'Negative': 1, 'Positive': 0, 'Neutral': 0, 0: 0}
    return word_dict[word]

# Convert the Sentiment column using the function above
df['Sentiment'] = df['Sentiment'].apply(convert_to_int)

# Finally, sum the negative flags per state and repeat the total on each row
df['negative_sum'] = df['Sentiment'].groupby(df['State']).transform('sum')


# My final output

   State  Sentiment  negative_sum
0     NY          1             3
1     NY          0             3
2     NY          1             3
3     NJ          0             1
4     NJ          1             1
5     NJ          0             1
6     CA          0             1
7     CA          0             1
8     CA          0             1
9     NY          1             3
10    NJ          0             1
11    CA          1             1
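The same per-row total can also be computed without overwriting the Sentiment column, by summing a 0/1 flag derived from it. This is a sketch of a variation, not the method used above; the small frame is hypothetical and assumes the states are already normalized to abbreviations:

```python
import pandas as pd

# Small hypothetical frame with states already normalized to abbreviations.
df = pd.DataFrame({
    "State": ["NY", "NY", "NJ", "CA", "NY", "CA"],
    "Sentiment": ["Negative", "Positive", "Negative", "Positive", "Negative", "Neutral"],
})

# 1 where the tweet is negative, 0 otherwise; the original strings are left untouched.
is_negative = df["Sentiment"].eq("Negative").astype(int)

# Sum the flags within each state and broadcast the total back to every row.
df["negative_sum"] = is_negative.groupby(df["State"]).transform("sum")

print(df)
```

Because the flag lives in its own Series, the Sentiment column keeps its Positive/Negative/Neutral labels.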

Now you can optionally convert the Sentiment column back to strings, since the negative sum is available in the column we need. I hope this serves the purpose.
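The question asks for one number per state in CountofNegative rather than a total repeated on every tweet row. A minimal sketch of that last step follows. It assumes the State and Abbreviation columns list the states in the first rows of the sheet and are empty elsewhere; that layout is inferred from the question, and the miniature frame and the helper names (`lookup`, `name_to_abbr`, `has_state`) are hypothetical:

```python
import pandas as pd

# Hypothetical miniature of the sheet: tweet columns fill every row, while
# State / Abbreviation form a shorter lookup list at the top.
file = pd.DataFrame({
    "User Location": ["New York", "NY", "CA", "Texas", "California", "NY"],
    "Sentiment": ["Negative", "Negative", "Positive", "Negative", "Negative", "Neutral"],
    "State": ["California", "New York", "Texas", "New Jersey", None, None],
    "Abbreviation": ["CA", "NY", "TX", "NJ", None, None],
})

# Build the name -> abbreviation mapping from the sheet's own lookup columns.
lookup = file.dropna(subset=["State", "Abbreviation"])
name_to_abbr = dict(zip(lookup["State"], lookup["Abbreviation"]))

# Normalize locations: full names become abbreviations, other values stay as they are.
location = file["User Location"].replace(name_to_abbr)

# Count negative tweets per normalized location.
negative_counts = location[file["Sentiment"] == "Negative"].value_counts()

# Look up each listed state's count; listed states with no negative tweets get 0,
# and rows without a state stay empty (NaN).
has_state = file["Abbreviation"].notna()
counts = file["Abbreviation"].map(negative_counts)
file["CountofNegative"] = counts.mask(has_state & counts.isna(), 0)

print(file[["State", "Abbreviation", "CountofNegative"]])
```

When saving, `file.to_excel(path, index=False)` writes the frame without adding an extra index column to the sheet.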
