python - Efficiently calculate word frequency in a string

Question

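I'm parsing a long string of text in Python and counting how many times each word occurs. I have a function that works, but I'm looking for advice on whether it can be made more efficient (in terms of speed) and whether there is a Python library function that could do this for me, so that I'm not reinventing the wheel.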

Can you suggest a more efficient way to count the most common words in a long string (usually more than 1000 words)?

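Also, what is the best way to sort a dictionary into a list where the first element is the most common word, the second element is the second most common word, and so on?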

test = """abc def-ghi jkl abc
abc"""

def calculate_word_frequency(s):
    # Post: return a list of words ordered from the most
    # frequent to the least frequent

    words = s.split()
    freq  = {}
    for word in words:
        if freq.has_key(word):
            freq[word] += 1
        else:
            freq[word] = 1
    return sort(freq)

def sort(d):
    # Post: sort dictionary d into list of words ordered
    # from highest freq to lowest freq
    # eg: For {"the": 3, "a": 9, "abc": 2} should be
    # sorted into the following list ["a","the","abc"]

    #I have never used lambda's so I'm not sure this is correct
    return d.sort(cmp = lambda x,y: cmp(d[x],d[y]))

print calculate_word_frequency(test)

score 40 · Accepted Answer

Use collections.Counter:

>>> from collections import Counter
>>> test = 'abc def abc def zzz zzz'
>>> Counter(test.split()).most_common()
[('abc', 2), ('zzz', 2), ('def', 2)]
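
To get the plain list of words the question asks for (most frequent first, without the counts), the pairs returned by most_common() can be unpacked. This is a minimal sketch rather than a definitive implementation: the function names words_by_frequency and sort_by_count are hypothetical, and the second function shows the same ordering for an ordinary dict of counts using sorted() with key=freq.get.

```python
from collections import Counter


def words_by_frequency(text):
    """Return the distinct words in text, most frequent first."""
    counts = Counter(text.split())
    # most_common() yields (word, count) pairs; keep only the word.
    return [word for word, _ in counts.most_common()]


def sort_by_count(freq):
    """Order the keys of a {word: count} dict from highest to lowest count."""
    return sorted(freq, key=freq.get, reverse=True)


print(sort_by_count({"the": 3, "a": 9, "abc": 2}))  # ['a', 'the', 'abc']
```

Both functions run in a single pass over the words plus one sort, so they should be comfortable with strings far longer than 1000 words.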

score 6 · Answer

>>> test = """abc def-ghi jkl abc
... abc"""
>>> from collections import Counter
>>> words = Counter()
>>> words.update(test.split()) # Update counter with words
>>> words.most_common()        # Print list with most common to least common
[('abc', 3), ('jkl', 1), ('def-ghi', 1)]
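
Note that str.split() only splits on whitespace, which is why 'def-ghi' is counted as a single word above. If hyphenated or punctuated tokens should be broken up, a regular expression can do the tokenizing instead. This is a sketch under an assumed definition of "word" (a run of letters, digits or underscores, compared case-insensitively); the pattern is an illustration, not part of the original answer.

```python
import re
from collections import Counter

test = """abc def-ghi jkl abc
abc"""

# Assumption: a "word" is a run of letters, digits or underscores,
# compared case-insensitively. Adjust the pattern as needed.
words = re.findall(r"\w+", test.lower())
counts = Counter(words)

print(counts.most_common(1))  # [('abc', 3)]
```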

score 3 · Answer

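You can also use NLTK (the Natural Language Toolkit). It provides very good libraries for studying and processing text. For this example, you can use: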

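from nltk import FreqDist

text = "aa bb cc aa bb"
# FreqDist counts the items of an iterable, so pass a list of words;
# passing the raw string would count individual characters instead.
fdist1 = FreqDist(text.split())

# show the 10 most frequent words in the text
print(fdist1.most_common(10))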

The result will be:

[('aa', 2), ('bb', 2), ('cc', 1)]

score 0 · Answer

If you want to display the common words and their counts rather than a list, here is my code.

from collections import Counter

# named "text" rather than "str" to avoid shadowing the built-in str type
text = 'abc def ghi def abc abc'

arr = Counter(text.split()).most_common()

for word, count in arr:
    print(word, count)

Output:

abc 3
def 2
ghi 1
