python - Retrieve the total number of words with 2 or more letters in a document using Python

Question

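I have a small Python script that computes the 10 most frequent words, the 10 least frequent words, and the total number of words in a .txt document. According to the assignment, a word is defined as 2 letters or more. The 10 most frequent and 10 least frequent words print fine, but when I try to print the total number of words in the document, it prints the total of all words, including single-letter words (such as "a"). How can I get the total to count only words with 2 or more letters?

Here is my script: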

from string import *
from collections import defaultdict
from operator import itemgetter
import re

number = 10
words = {}
total_words = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define function to count the total number of words"""
def count_words(s):
    unique_words = split(s)
    return len(unique_words)

"""Define words as 2 letters or more -- no single letter words such as "a" """
for word in words:
    if len(word) >= 2:
        counter[word] += 1


"""Open text document, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    total_words = total_words + count_words(line)
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1


# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)


# Total Unique Words:
print " "
print "Total Number of Words: %s" % total_words

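I'm not a Python expert; this is for a Python class I'm currently taking. The neatness and correct formatting of my code count toward my grade on this assignment, so if possible, could someone also tell me whether the format of this code is considered "good practice"?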

score 3 · Accepted Answer

List comprehension approach:

def countWords(s):
    words = s.split()
    return len([word for word in words if len(word)>=2])

Verbose approach:

def countWords(s):
    words = s.split()
    count = 0
    for word in words:
        if len(word) >= 2:
            count += 1
    return count

By the way, kudos for using defaultdict, but I would go with collections.Counter:

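import collections

# Split each line into words (iterating over line.strip() would yield single characters)
words = collections.Counter(word for line in open(filepath) for word in line.split())
# Keep only words of 2+ letters; rebuild a Counter so most_common() stays available
words = collections.Counter(dict((k, v) for k, v in words.items() if len(k) >= 2))
mostFrequent = [w[0] for w in words.most_common(10)]
leastFrequent = [w[0] for w in words.most_common()[-10:]]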
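
For reference, here is a minimal, self-contained Python 3 sketch of the Counter approach. The sample text and the names `sample_text`, `most_frequent`, and `least_frequent` are illustrative assumptions, not part of the original script; a real run would iterate over the lines of the file instead.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of the .txt file.
sample_text = "a cat and a dog and a bird\nthe cat sat"

# Count whitespace-separated tokens, keeping only those with 2+ characters.
words = Counter(
    word
    for line in sample_text.splitlines()
    for word in line.split()
    if len(word) >= 2
)

# Total occurrences of words with 2+ characters.
total_words = sum(words.values())

# most_common() sorts by descending count, so the tail holds the rarest words.
most_frequent = [w for w, _ in words.most_common(2)]
least_frequent = [w for w, _ in words.most_common()[-2:]]

print(total_words, most_frequent, least_frequent)
```

Because the filter is applied while counting, the total and the frequency lists are derived from the same set of words.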

Hope this helps.

score 1 · Accepted Answer

Sorry, but I seem to have gone a bit overboard with this solution. I mean, I really took your code apart and put it back together the way I would do it:

from collections import defaultdict
from operator import itemgetter
from heapq import nlargest, nsmallest
from itertools import starmap
from textwrap import dedent
import re

class WordCounter(object):
    """
    Count the number of words consisting of two letters or more.
    """

    words_only = re.compile(r'[a-z]{2,}', re.IGNORECASE)

    def __init__(self, filename, number=10):
        self.counter = defaultdict(int)

        # Open text document and find all words
        with open(filename, 'r') as txt_file:
            for word in self.words_only.findall(txt_file.read()):
                self.counter[word.lower()] += 1

        # Get total count
        self.total_words = sum(self.counter.values())

        # Most Frequent Words
        self.top_words = nlargest(
            number, self.counter.items(), itemgetter(1))

        # Least Frequent Words
        self.least_words = nsmallest(
            number, self.counter.items(), itemgetter(1))

    def __str__(self):
        """
        Summary of least and most used words, and total word count.
        """
        template = dedent("""
            Most Frequent Words:
            {0}

            Least Frequent Words:
            {1}

            Total Number of Words: {2}
            """)

        line_template = "{0}: {1}".format
        top_words = "\n".join(starmap(line_template, self.top_words))
        least_words = "\n".join(starmap(line_template, self.least_words))

        return template.format(top_words, least_words, self.total_words)


print WordCounter("charactermask.txt")

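Here is a summary of the changes I made, and why:

Don't do from x import *. Some modules are designed to let you do this safely, but in general it's a bad idea because of namespace pollution. Import only what you need, or import the module under an abbreviated name: import string as st. This leads to less buggy code.
Make it a class. Although writing this as a script is fine for this sort of thing, it's a good habit to always wrap your code in classes or functions, both to organize it better and for when you need it in another project. Then you can just do from wordcounter import WordCounter and you're good to go.
Docstrings moved inside the code blocks. That way they are picked up automatically if you type help(my_class_or_function) in the interactive interpreter.
Comments are usually prefixed with # rather than written as throwaway strings. It's not a big no-no, but it is a fairly common convention.
Use the with statement when opening files. It's a good habit: you don't have to worry about remembering to close them.
.strip().split() is redundant. Just use .split().
Use re.findall. This avoids problems with words like "top-notch", which your method wouldn't count at all. With findall, we count "top" and "notch". It's also faster. We do have to change the regex slightly, though.
The words dictionary was unused. Removed.
Use sum to compute the total word count. This fixes a problem in your code and in inspectorG4dget's code, where the words_only pattern really needs to be applied to each word twice (once for the total, once for the per-word counts) to get consistent results.
Use heapq.nlargest and heapq.nsmallest. When you only need the n smallest or largest results, they are faster and more memory-efficient than a full sort.
Make functions that return strings, which you may or may not want to print. Using the print statement directly is less flexible, though it's great for debugging.
For new code, use the format string method instead of the % operator. The former is intended to improve on and replace the latter.
Use a multi-line string instead of many consecutive prints. It's easier to see what will actually be written, and easier to maintain. If you want to indent the string to the same level as the surrounding code, the textwrap.dedent function helps.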
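
Three of these points (re.findall, sum for the total, and heapq for the top and bottom entries) can be seen together in a short Python 3 sketch. The sample sentence and the names `sample`, `top`, and `bottom` are made up for illustration; the pattern is the one used in the class above.

```python
import re
from collections import defaultdict
from heapq import nlargest, nsmallest
from operator import itemgetter

# Same pattern as in the class: runs of two or more ASCII letters.
words_only = re.compile(r'[a-z]{2,}', re.IGNORECASE)

# Hypothetical sample input.
sample = "A top-notch plan is a plan; I like the top plan."

counter = defaultdict(int)
for word in words_only.findall(sample):
    # "top-notch" yields "top" and "notch"; "A", "a" and "I" are skipped.
    counter[word.lower()] += 1

# Summing the per-word counts keeps the total consistent with the counter.
total_words = sum(counter.values())

# Only the n largest / smallest entries are needed, so no full sort.
top = nlargest(1, counter.items(), key=itemgetter(1))
bottom = nsmallest(2, counter.items(), key=itemgetter(1))

print(dict(counter))
print(total_words, top, bottom)
```

Since the total is computed from the counter itself, there is no second pass over the text that could disagree with the per-word counts.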

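There is also the question of which is more readable: starmap(line_template, self.top_words) or [line_template(*x) for x in self.top_words]. Most people always prefer list comprehensions, and I usually agree with them, but here I like the conciseness of the starmap approach.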
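
A quick way to convince yourself that the two spellings are interchangeable is the following Python 3 sketch. The `pairs` list is a hypothetical stand-in for self.top_words.

```python
from itertools import starmap

# Hypothetical (word, count) pairs, shaped like self.top_words.
pairs = [("the", 12), ("and", 7)]

line_template = "{0}: {1}".format

# starmap unpacks each pair into positional arguments.
via_starmap = "\n".join(starmap(line_template, pairs))

# The list comprehension spells out the same unpacking explicitly.
via_listcomp = "\n".join([line_template(*x) for x in pairs])

assert via_starmap == via_listcomp
print(via_starmap)
```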

That said, I agree with user1552512: your style looks great! Nice, readable code, well commented, very PEP 8 compliant. You'll go far. :)

score 1 · Accepted Answer

count_words just uses split().

You should use the words_only regex here as well:

def count_words(s):
    # Count only the tokens accepted by the words_only pattern
    unique_words = s.split()
    return len([x for x in unique_words if words_only.match(x)])
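
To see why reusing the pattern matters, here is a small Python 3 sketch. It assumes that tokens are normalized the same way as in the question's loop (punctuation stripped, lowercased) before matching; the sample line is made up for illustration.

```python
import re
from string import punctuation

# Pattern from the question: the whole token must be 2+ lowercase letters.
words_only = re.compile(r'^[a-z]{2,}$')

def count_words(s):
    """Count tokens in s that are words of two or more letters."""
    count = 0
    for token in s.split():
        # Normalize exactly as the frequency loop does, then test the pattern.
        word = token.strip(punctuation).lower()
        if words_only.match(word):
            count += 1
    return count

# Hypothetical sample line.
line = "I saw a cat, a dog and 2 birds."
print(count_words(line))
```

Here "I", "a", and "2" are rejected, while "cat," and "birds." are counted once their punctuation is stripped, so the total agrees with what the frequency counter would record.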

Your style looks great :)

score 0 · Accepted Answer

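Personally, I think your code looks fine. I don't know whether it is "standard" Python style, but it is easy to read. I'm fairly new to Python too, but here is my answer.

I assume your count_words(s) function is the one that counts the total number of words. The problem is that it only calls split, so you are just separating the words on whitespace.

You only want to count words of 2+ characters, so write a loop in that function that counts only the words in the unique_words list that have 2+ characters.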
