python - How to correctly loop through two files, comparing strings in both

Question

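I'm having trouble running sentiment analysis on tweets (file 1, standard Twitter JSON responses) against a word list (file 2, tab-delimited, two columns) and assigning each tweet a sentiment (positive or negative).

The problem: the outer loop runs only once and then the script ends. I loop over file 1 and, nested inside that loop, I loop over file 2, trying to compare the two and keep a running total of the combined sentiment for each tweet.

So I have: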

def get_sentiments(tweet_file, sentiment_file):


    sent_score = 0
    for line in tweet_file:

        document = json.loads(line)
        tweets = document.get('text')

        if tweets != None:
            tweet = str(tweets.encode('utf-8'))

            #print tweet


            for z in sentiment_file:
                line = z.split('\t')
                word = line[0].strip()
                score = int(line[1].rstrip('\n').strip())

                #print score



                if word in tweet:
                    print "+++++++++++++++++++++++++++++++++++++++"
                    print word, tweet
                    sent_score += score



            print "====", sent_score, "====="

    #PROBLEM, IT'S ONLY DOING THIS FOR THE FIRST TWEET

file1 = open('tweetsfile.txt')
file2 = open('sentimentfile.txt')


get_sentiments(file1, file2)

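I've spent the better part of a day trying to figure out why it prints all the tweets when the nested for loop over file2 is absent, but with it in place it processes only the first tweet and then exits.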

score 3 · Accepted Answer

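The reason it only runs once is that the inner for loop has reached the end of the sentiment file, so it stops because there are no more lines to read.

In other words, the first time the inner loop runs, it steps through the entire sentiment file. After that there are no more lines to read (it has reached the end of the file), so it does not loop again, and only one tweet ends up being processed.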
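A minimal sketch of that behaviour, using `io.StringIO` as a stand-in for a file returned by `open()` (the two sample rows are made up for illustration). A file object can be iterated only once per pass; a second loop over it finds nothing left:

```python
import io

# io.StringIO stands in for a real file object here (an assumption for the
# demo); both yield their lines once and then stay at end-of-file.
sentiment_file = io.StringIO("good\t3\nbad\t-3\n")

first_pass = [line for line in sentiment_file]   # consumes every line
second_pass = [line for line in sentiment_file]  # already at end of file

print(len(first_pass))
print(len(second_pass))
```

The second list is empty, which is what happens to the nested loop for every tweet after the first.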

So one way to solve this is to "rewind" the file, which you can do with the file object's seek method.
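A rough sketch of the rewind approach, assuming a hypothetical helper `score_tweets` and in-memory sample data in place of the real files. It matches whole words rather than substrings, which is a simplification of the original `word in tweet` test:

```python
import io

def score_tweets(tweets, sentiment_file):
    """Return one summed sentiment score per tweet (hypothetical helper)."""
    totals = []
    for tweet in tweets:
        sentiment_file.seek(0)  # rewind so the inner loop sees every row again
        total = 0
        for row in sentiment_file:
            word, score = row.rstrip("\n").split("\t")
            if word.strip() in tweet.split():
                total += int(score)
        totals.append(total)
    return totals

# Sample data, made up for illustration.
sentiment_file = io.StringIO("good\t3\nbad\t-3\n")
totals = score_tweets(["good day", "bad day", "good and bad"], sentiment_file)
print(totals)
```

This works, but it re-reads and re-parses the whole sentiment file for every tweet, so it gets slow as either file grows.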

If your files aren't large, another approach is to read them entirely into a list or similar structure and then loop over that.
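As an illustration of reading the word list into memory once, here is a small sketch with made-up sample rows and `io.StringIO` standing in for the opened file. Unlike a file object, a list can be looped over as many times as needed:

```python
import io

# Stand-in for open('sentimentfile.txt'); the rows are sample data.
sentiment_file = io.StringIO("good\t3\nbad\t-3\n")

# Parse the file a single time into (word, score) pairs.
rows = [line.rstrip("\n").split("\t") for line in sentiment_file]
pairs = [(word.strip(), int(score)) for word, score in rows]

results = []
for tweet in ["good day", "bad day"]:
    # The list is reusable on every pass of the outer loop.
    total = sum(score for word, score in pairs if word in tweet.split())
    results.append((tweet, total))

print(results)
```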

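However, since your sentiment score is a simple lookup, the best approach is to build a dictionary of the sentiment scores and then look up each word in the dictionary to calculate the tweet's overall sentiment: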

import csv
import json

scores = {}  # empty dictionary to store scores for each word

with open('sentimentfile.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scores[row[0].strip()] = int(row[1].strip()) 


with open('tweetsfile.txt') as f:
    for line in f:
        tweet = json.loads(line)
        text = tweet.get('text','').encode('utf-8')
        if text:
            total_sentiment = sum(scores.get(word,0) for word in text.split())
            print("{}: {}".format(text, total_sentiment))

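The with statement closes the file handle automatically. I'm using the csv module to read the file (it works with tab-delimited files too).

This line does the calculation: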

total_sentiment = sum(scores.get(word,0) for word in text.split())

It is a shorter way of writing this loop:

tweet_score = []
for word in text.split():
    if word in scores:
        tweet_score.append(scores[word])

total_score = sum(tweet_score)

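The dictionary's get method takes an optional second argument: a custom value to return when the key is not found. If you omit the second argument, it returns None. In my loop I use it to return 0 when a word has no score.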
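A quick illustration of that behaviour, with a made-up two-word `scores` dictionary:

```python
# Sample scores, invented for this example.
scores = {"good": 3, "bad": -3}

found = scores.get("good")            # key present: returns its value
missing = scores.get("meh")           # key absent, no default: returns None
missing_default = scores.get("meh", 0)  # key absent, default given: returns 0

# Unknown words contribute 0, so the sum never hits a KeyError or a None.
total = sum(scores.get(word, 0) for word in "a good but bad bad day".split())
print(found, missing, missing_default, total)
```

Here the total is 3 - 3 - 3, because "bad" appears twice and each occurrence is counted.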
