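I have a small Python script that I'm working on for a class assignment. The script reads a file and prints the 10 most frequent and 10 least frequent words along with their frequencies. For this assignment, a word is defined as two or more letters. The word-frequency part works fine, but the third part of the assignment is to print the total number of unique words in the document. "Unique words" here means counting every distinct word in the document exactly once, no matter how often it appears.
Without changing my current script too much, how can I count each word in the document only once?
P.S. I am using Python 2.6, so please don't suggest collections.Counter.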
from string import punctuation
from collections import defaultdict
import re
number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)
"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count
"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1
# Most Frequent Words
top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]
print "Most Frequent Words: "
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]
print " "
print "Least Frequent Words: "
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique
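
One possible direction, offered as a sketch rather than a definitive fix: the script's `counter` dictionary already stores one key per distinct word that passed the two-letter filter, so the number of keys should equal the unique-word total. The snippet below assumes that reading of the assignment. The `count_words` helper and the `sample` lines are hypothetical stand-ins for the file loop and for charactermask.txt, and the `print` call is written with parentheses so that it runs on Python 2.6 as well as later versions.

```python
import re
from collections import defaultdict
from string import punctuation

# Same word definition as the script: two or more lowercase letters.
words_only = re.compile(r'^[a-z]{2,}$')


def count_words(lines):
    """Return a dict mapping each valid word to its number of occurrences.

    Hypothetical helper: it wraps the same loop the script runs over the file.
    """
    counter = defaultdict(int)
    for line in lines:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if words_only.match(word):
                counter[word] += 1
    return counter


# Hypothetical sample input standing in for charactermask.txt.
sample = ["The cat saw the dog.", "A dog saw THE cat!"]
counter = count_words(sample)

# Every distinct word is exactly one key, so the key count is the unique total.
total_unique = len(counter)
print("Total Number of Unique Words: %s" % total_unique)
```

In the original script this would amount to a single added line, `total_unique = len(counter)`, placed after the file loop and before the final print, which would also make the unused `count_unique` function unnecessary.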