python - How do I extract the 10 most frequent and the 10 least frequent words in Python?

Question

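After running a few lines of code, the final line gives me an output, vocabulary. It contains 46132 distinct words and tells me how many times each word appears in the document.

I have attached a screenshot of the output below. I am not sure what format vocabulary is in. I need to extract the 10 most frequent words and the 10 least frequent words in the document. I am not sure how to do this, possibly because I do not know whether the output is a str or a tuple.

Can I just use max(vocabulary) to get the most frequent word in the document? Or call sorted(vocabulary) and take the first 10 and the last 10 entries as the 10 most frequent and 10 least frequent words?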

score 0 · Accepted Answer

The collections.Counter class makes it easy to get the k most common words:

>>> vocabulary = { 'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2 }
>>> from collections import Counter
>>> vocabulary = Counter(vocabulary)
>>> vocabulary.most_common(2)
[('apple', 7), ('dog', 6)]
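
On the max(vocabulary) and sorted(vocabulary) idea from the question: assuming vocabulary is a dict-like mapping of word to count (as in the sample above), iterating it yields the keys, so both calls compare the words alphabetically rather than by count. A small sketch of the difference, using the same sample data:

```python
vocabulary = {'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2}

# Iterating a dict yields its keys, so this compares the words themselves.
alphabetical_last = max(vocabulary)

# Passing key=vocabulary.get makes max/min compare the counts instead.
most_frequent = max(vocabulary, key=vocabulary.get)
least_frequent = min(vocabulary, key=vocabulary.get)

print(alphabetical_last, most_frequent, least_frequent)  # elf apple ball
```

This only gives a single word at each end; for the top or bottom 10 a sort or most_common is still needed.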

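Getting the least common words is a bit trickier. The simplest approach is probably to sort the dictionary's key/value pairs by value and then take a slice: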

>>> sorted(vocabulary.items(), key=lambda x: x[1])[:2]
[('ball', 1), ('elf', 2)]
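
If vocabulary is already a Counter, another option is to slice the full most_common() list from the end. This is a sketch of that idiom on the same sample data; n is just an illustrative variable name:

```python
from collections import Counter

vocabulary = Counter({'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2})

n = 2
# most_common() with no argument returns every (word, count) pair,
# highest count first. The reversed slice takes the last n pairs,
# so the result starts with the lowest count.
least_common = vocabulary.most_common()[:-n - 1:-1]

print(least_common)  # [('ball', 1), ('elf', 2)]
```

Like the sorted() version, this sorts the whole vocabulary, so it is mainly a matter of taste which one reads better.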

Since you need both, you might as well sort just once and take two slices; that way you do not need a Counter at all:

>>> sorted_vocabulary = sorted(vocabulary.items(), key=lambda x: x[1])
>>> most_common = sorted_vocabulary[-2:][::-1]
>>> least_common = sorted_vocabulary[:2]
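
For the 10-word case in the question, the same idea can be wrapped in a small helper. The sketch below swaps the full sort for heapq.nlargest and heapq.nsmallest, which avoid sorting all 46132 entries when only a few are needed. The name top_and_bottom is hypothetical, and vocabulary is assumed to be a dict of word to count:

```python
import heapq
from operator import itemgetter

def top_and_bottom(vocabulary, n=10):
    """Return (most_common, least_common) lists of (word, count) pairs."""
    by_count = itemgetter(1)  # compare pairs by their count
    # nlargest returns pairs from highest count down,
    # nsmallest from lowest count up.
    most_common = heapq.nlargest(n, vocabulary.items(), key=by_count)
    least_common = heapq.nsmallest(n, vocabulary.items(), key=by_count)
    return most_common, least_common

# Sample data from the answer; n=2 keeps the output short.
vocabulary = {'apple': 7, 'ball': 1, 'car': 3, 'dog': 6, 'elf': 2}
most, least = top_and_bottom(vocabulary, n=2)

print(most)   # [('apple', 7), ('dog', 6)]
print(least)  # [('ball', 1), ('elf', 2)]
```

Calling top_and_bottom(vocabulary) with the default n=10 would give the two lists the question asks for. With many words sharing a count of 1, which 10 end up in the least common list depends on tie order.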
