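I have a small Python script that prints the 10 most frequent words in a text document (each word being 2 or more letters), and I need to extend it to also print the 10 least frequent words in the document. The script mostly works, except that the 10 least frequent "words" it prints are numbers (integers and floats) when they should be actual words. How do I iterate over only the words and exclude the numbers? Here is my full script: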
# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10
words = {}

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
    if len(word) >= 2:
        counter[word] += 1

top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
Edit: the end of the script (the part under the "# Least Frequent Words" comment) is the part that needs fixing.
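For illustration, here is a rough sketch of one way the numbers could be kept out of the counts. It is not the only approach, and it makes a few assumptions: it is written for Python 3 rather than the Python 2 used in the script above, it reads from an inline sample string instead of charactermask.txt, and the helper names word_counts, least_frequent and most_frequent are made up for this example. The idea is to drop every token for which str.isalpha() is false before counting, so "42" and "3.14" never reach the counter.

```python
from collections import Counter
from string import punctuation


def word_counts(text, min_length=2):
    """Count purely alphabetic tokens of at least min_length characters."""
    tokens = (raw.strip(punctuation).lower() for raw in text.split())
    # str.isalpha() is False for "42", "3.14" and the empty string,
    # so numeric tokens are filtered out before they are counted.
    return Counter(t for t in tokens if len(t) >= min_length and t.isalpha())


def least_frequent(counts, number=10):
    """Rarest (word, count) pairs; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[:number]


def most_frequent(counts, number=10):
    """Most common (word, count) pairs; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:number]


# Hypothetical sample text standing in for the contents of the input file.
sample = "The cat saw 42 cats; the cat ate 3.14 pies and 42 more."
counts = word_counts(sample)

for word, frequency in least_frequent(counts, 3):
    print("%s: %d" % (word, frequency))
```

One caveat with this filter: isalpha() also rejects tokens that contain an apostrophe or hyphen, such as "don't" or "well-known". If those should count as words, a looser test (for example, rejecting only tokens that contain a digit) may be a better fit.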