python-3.x - 两个看起来应该做同样事情但输出不同结果的python循环？

Question

昨天我试图完成 Udacity 的第 11 课，关于文本的矢量化。我检查了代码，一切似乎都很好——我接收了一些电子邮件，打开它们，删除一些签名词并将每封电子邮件的词干词返回到一个列表中。

这是循环1：

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
#        temp_counter += 1
    if temp_counter < 200:
        path = os.path.join('/xxx', path[:-1])
        email = open(path, "r")

        ### use parseOutText to extract the text from the opened email

        email_stemmed = parseOutText(email)

        ### use str.replace() to remove any instances of the words
        ### ["sara", "shackleton", "chris", "germani"]

        email_stemmed.replace("sara","")
        email_stemmed.replace("shackleton","")
        email_stemmed.replace("chris","")
        email_stemmed.replace("germani","")

    ### append the text to word_data

    word_data.append(email_stemmed.replace('\n', ' ').strip())

    ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
        if from_person == "sara":
            from_data.append(0)
        elif from_person == "chris":
            from_data.append(1)

    email.close()

这是循环2：

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
#        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email
            stemmed_email = parseOutText(email)

            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            signature_words = ["sara", "shackleton", "chris", "germani"]
            for each_word in signature_words:
                stemmed_email = stemmed_email.replace(each_word, '')         #careful here, dont use another variable, I did and broke my head to solve it

            ### append the text to word_data
            word_data.append(stemmed_email)

            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if name == "sara":
                from_data.append(0)
            else: # its chris
                from_data.append(1)


            email.close()

代码的下一部分按预期工作：

print("emails processed")
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("/xxx/your_word_data.pkl", "wb") )
pickle.dump( from_data, open("xxx/your_email_authors.pkl", "wb") )


print("Answer to Lesson 11 quiz 19: ")
print(word_data[152])


### in Part 4, do TfIdf vectorization here

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import stop_words
print("SKLearn has this many Stop Words: ")
print(len(stop_words.ENGLISH_STOP_WORDS))

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
vectorizer.fit_transform(word_data)

feature_names = vectorizer.get_feature_names()

print('Number of different words: ')
print(len(feature_names))

但是当我用循环 1 计算单词总数时，我得到了错误的结果。当我使用循环 2 执行此操作时，我得到了正确的结果。

我看这段代码太久了，我看不出区别——我在循环 1 中做错了什么？

作为记录，我一直得到的错误答案是 38825。正确答案应该是 38757。

非常感谢您的帮助，善良的陌生人！

score 4 · Accepted Answer

这些行不做任何事情：

email_stemmed.replace("sara","")
email_stemmed.replace("shackleton","")
email_stemmed.replace("chris","")
email_stemmed.replace("germani","")

replace返回一个新字符串并且不修改email_stemmed. 相反，您应该将返回值设置为email_stemmed：

email_stemmed = email_stemmed.replace("sara", "")

等等等等。

循环二确实在 for 循环中设置了返回值：

for each_word in signature_words:
    stemmed_email = stemmed_email.replace(each_word, '')

上面的代码片段并不相同，因为在第一个片段的末尾由于使用正确而email_stemmed完全没有变化，而在第二个片段的末尾实际上已经被剥离了每个单词。replacestemmed_email

python-3.x - 两个看起来应该做同样事情但输出不同结果的python循环？

1 回答 1

Related

Reference