I am using Python 3.4. I want to get word frequencies from a CSV file, but my code raises an error. Here is my code. Can anyone help me solve this?

from textblob import TextBlob as tb
import math

words={}
def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(words, bloblist)))

bloblist = open('afterstopwords.csv', 'r').read()

for i, blob in enumerate(bloblist):
     print("Top words in document {}".format(i + 1))
     scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
     sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
     for word, score in sorted_words[:3]:
         print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

The error is:

 Top words in document 1
 Traceback (most recent call last):
 File "D:\Python34\tfidf.py", line 45, in <module>
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
 AttributeError: 'str' object has no attribute 'words'

1 Answer

The bloblist code from http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/ is:

bloblist = [document1, document2, document3]

Don't change it. Also, before it comes the code that creates the documents, for example:

document1 = tb("""blablabla""")
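For context on why this matters: in the question, `open('afterstopwords.csv', 'r').read()` returns one plain string, and looping over a string yields single characters, which is where `'str' object has no attribute 'words'` comes from. Below is a minimal sketch of reading the file into one string per row instead. It assumes each CSV row is meant to be one document, which the question does not state; `read_rows` is a hypothetical helper name, not part of TextBlob or the blog post.

```python
import csv

def read_rows(path):
    """Return one string per non-empty CSV row (hypothetical helper).

    Assumption: each row is one document; its cells are joined with spaces.
    """
    with open(path, newline='', encoding='utf-8') as handle:
        return [' '.join(cells) for cells in csv.reader(handle) if cells]

# Iterating over a plain string yields one-character strings, not documents.
# This is what the loop in the question was doing with the result of read().
text = "alpha beta"
first_item = next(iter(text))  # a single character, which has no .words
```

With TextBlob installed, the list that the blog post expects could then be built as `bloblist = [tb(text) for text in read_rows('afterstopwords.csv')]`. Separately, `idf` in the question passes the module-level dict `words` to `n_containing`, where the parameter `word` was presumably intended.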

Here is what I did... I use a function in my Python script to open the file, where openfile returns the file's contents.

txt = openfile()
document1 = tb(txt)
bloblist = [document1]

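The rest of the original code stays the same. This works, but I can only get it to finish on small files. Larger files take a very long time. It also doesn't seem accurate at all. For word counts, I use https://rmtheis.wordpress.com/2012/09/26/count-word-frequency-with-python/
and for 9999 lines, each 50-75 characters long, it runs very fast. It also seems accurate; the results appear to match the wordcloud results.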
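As a rough illustration of both points, here is a standard-library sketch of plain word counting and of TF-IDF with the document frequencies computed once up front. It is not the code from either linked post. The names `tokenize`, `word_frequencies` and `tfidf_scores` are made up for this example, and lowercase whitespace splitting is an assumed stand-in for TextBlob's tokenizer. The `1 +` in the idf denominator is kept from the code in the question.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace splitting is good enough for this data.
    return text.lower().split()

def word_frequencies(lines):
    """Count every word across all lines in a single pass."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

def tfidf_scores(documents):
    """Return one {word: score} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    # Count, once, how many documents contain each word, instead of
    # rescanning every document for every word.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n_docs = len(token_lists)
    results = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        results.append({
            word: (count / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return results
```

One possible reason the TF-IDF output looked inaccurate with `bloblist = [document1]`: with a single document, every word appears in exactly one of one documents, so the idf factor is log(1 / 2) for every word. That value is negative, so the descending sort ranks the rarest words first. TF-IDF only becomes informative with several documents, for example one per row.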

Answered 2016-07-20T22:12:43.680