python - Print the 10 least frequent words in a text document using Python

Question

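I have a small Python script that prints the 10 most frequent words in a text document (each word being 2 or more letters), and I need to extend the script to also print the 10 least frequent words in the document. The script mostly works, except that the 10 least frequent words it prints are numbers (integers and floats) when they should be words. How do I iterate over only the words and exclude the numbers? Here is my full script: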

# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10
words = {}

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
  if len(word) >= 2:
    counter[word] += 1

top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

Edit: The end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.

score 1 · Accepted Answer

You'll need a function, letters_only(), that runs a regex search for [0-9] and returns False if it finds any match. Something like this:

import re

def letters_only(word):
    # True only when the word contains no digit characters
    return re.search(r'[0-9]', word) is None

Then, where you say for word in words, say for word in filter(letters_only, words) instead.
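Here is a minimal sketch of how that filter fits into the counting loop. It is written for Python 3 (the question's script is Python 2), it uses collections.Counter rather than defaultdict, and the sample word list is made up for illustration rather than read from the question's file.

```python
import re
from collections import Counter


def letters_only(word):
    # True only when the word contains no digit characters
    return re.search(r'[0-9]', word) is None


# Hypothetical sample standing in for the words read from the file
words = ["the", "mask", "3.14", "42", "the", "x86", "mask", "the", "bit"]

counter = Counter()
for word in filter(letters_only, words):
    if len(word) >= 2:
        counter[word] += 1

# Sort ascending by count, then alphabetically, as in the question
least_words = sorted(counter.items(), key=lambda item: (item[1], item[0]))
for word, frequency in least_words:
    print("%s: %d" % (word, frequency))
```

Note that this drops any token containing a digit, so mixed tokens such as x86 are excluded as well as pure numbers.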

score 1 · Accepted Answer

You will need a filter. Change the regex to match however you want to define a "word":

import re
alphaonly = re.compile(r"^[a-z]{2,}$")

Now, do you want the word-frequency table to exclude numbers in the first place?

counter = defaultdict(int)

with open("charactermask.txt") as txt_file:
    for line in txt_file:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1

Or do you just want to skip numbers when extracting the least frequent words from the table?

import sys

words_by_freq = sorted(counter.iteritems(),
                       key=lambda(word, count): (count, word))

# Walk the table from rarest to most common, printing only real words
i = 0
for word, frequency in words_by_freq:
    if alphaonly.match(word):
        i += 1
        sys.stdout.write("{}: {}\n".format(word, frequency))
    if i == number: break
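A possible Python 3 variant of this second approach replaces the manual counter with itertools.islice. The function name least_frequent and the sample frequency table are hypothetical, chosen only to show the behavior.

```python
import re
from itertools import islice

alphaonly = re.compile(r"^[a-z]{2,}$")


def least_frequent(counter, number):
    """Return up to `number` (word, count) pairs, rarest first, skipping non-words."""
    by_freq = sorted(counter.items(), key=lambda item: (item[1], item[0]))
    only_words = (pair for pair in by_freq if alphaonly.match(pair[0]))
    # islice stops after `number` matches without a manual counter
    return list(islice(only_words, number))


# Hypothetical frequency table that still contains numeric tokens
counter = {"42": 1, "3.14": 1, "bit": 1, "mask": 2, "the": 5, "255": 2}
for word, frequency in least_frequent(counter, 2):
    print("{}: {}".format(word, frequency))
```

Filtering at extraction time keeps the numeric entries in the table, which may be useful if you want to report them separately later.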
