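I have 2 books in txt format (more than 6000 lines). I want to use Python to associate each word with its relevance (using the tf-idf algorithm) and sort the words in descending order of relevance. I tried this code: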
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals
import math
from textblob import TextBlob as tb


def tf(word, blob):
    return blob.words.count(word) / len(blob.words)


def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob)


def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))


def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)


document1 = tb("""FULL BOOK1 TEST""")
document2 = tb("""FULL BOOK2 TEST""")

bloblist = [document1, document2]
for i, blob in enumerate(bloblist):
    with open("result.txt", 'w') as textfile:
        print("Top words in document {}".format(i + 1))
        scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words:
            textfile.write("Word: {}, TF-IDF: {}".format(word, round(score, 5)) + "\n")
I found it at https://stevenloria.com/tf-idf/ and made a few changes, but it takes a very long time to run, and after a few minutes it crashes with TypeError: coercing to Unicode: need string or buffer, float found. Why?
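One plausible contributor to the running time, offered as a sketch rather than a diagnosis: blob.words.count(word) scans the whole word list on every call, and the scores dictionary is built by looping over every token in the book, so the work grows roughly with the square of the book length. The snippet below counts all words in a single pass with collections.Counter. It is written for Python 3, and the name term_frequencies and the sample sentence are made up for illustration.

```python
from collections import Counter


def term_frequencies(words):
    """Map each distinct word to (occurrences / total words) in one pass."""
    counts = Counter(words)   # a single scan over the token list
    total = len(words)
    return {word: n / total for word, n in counts.items()}


# Hypothetical sample input; a real run would pass the book's tokens.
words = "the cat saw the other cat".split()
tf_by_word = term_frequencies(words)
print(tf_by_word["cat"])
```

As for the TypeError, that message is the one Python 2 raises when a unicode string is concatenated with a float, for example "TF-IDF: " + score. The write call shown in the question uses format, so the failing line may differ slightly from the one posted; this is a guess.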
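I also tried calling this Java program from Python: https://github.com/mccurdyc/tf-idf/. The program works, but the output is incorrect: many words that should have high relevance are scored as having 0 relevance.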
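A possible explanation for the zeros, hedged because I have not checked which formula the Java program uses: with only two documents, an unsmoothed idf of log(N / df) is exactly 0 for any word that appears in both books, however frequent it is. The idf in the Python code above, log(N / (1 + df)), is 0 for words found in one book and negative for words found in both. The function names idf_plain and idf_plus_one below are made up for this comparison.

```python
import math


def idf_plain(n_docs, n_containing):
    # Textbook form (an assumption about what other implementations use).
    return math.log(n_docs / n_containing)


def idf_plus_one(n_docs, n_containing):
    # Same form as idf() in the question's Python code.
    return math.log(n_docs / (1 + n_containing))


N = 2  # two books
for df in (1, 2):  # word appears in one book, or in both
    print("df={}: plain={:.3f}, plus_one={:.3f}".format(
        df, idf_plain(N, df), idf_plus_one(N, df)))
```

If this is what is happening, the scores are a consequence of the tiny corpus and not necessarily a bug in either program.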
Is there a way to fix that Python code? Or could you suggest another tf-idf implementation that correctly does what I want?
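For reference, here is a minimal standard-library sketch of the kind of alternative I have in mind, under several assumptions: Python 3, a deliberately simple regex tokenizer in place of TextBlob, and a smoothed idf of log((1 + N) / (1 + df)) + 1, which is the form scikit-learn documents for its smooth_idf option and which never reaches zero. The names tokenize, tfidf_scores and write_ranking are hypothetical, and this is not a drop-in replacement for either program above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase runs of letters only,
    # so "don't" becomes "don" and "t".
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_scores(texts):
    """Return one {word: score} dict per input text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(counts)
    # Document frequency: the number of texts in which each word occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    results = []
    for c in counts:
        total = sum(c.values())
        results.append({
            w: (n / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, n in c.items()
        })
    return results


def write_ranking(scores_per_doc, path):
    # Open the file once so every document's ranking ends up in it.
    with open(path, "w", encoding="utf-8") as out:
        for i, scores in enumerate(scores_per_doc, start=1):
            out.write("Top words in document {}\n".format(i))
            ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            for word, score in ranked:
                out.write("Word: {}, TF-IDF: {}\n".format(word, round(score, 5)))
```

Usage would look like write_ranking(tfidf_scores([book1_text, book2_text]), "result.txt"), where book1_text and book2_text are the contents read from the two txt files. With only two documents the idf factor can take just two values, so the ranking is driven mostly by term frequency and common words will still rank near the top unless stop words are filtered out.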