python - Python - My frequency function is inefficient

Question

I am writing a function that returns the number of occurrences of the most frequent word in a list of words.

def max_frequency(words):
    """Returns the number of times appeared of the word that
    appeared the most in a list of words."""

    words_set = set(words)
    words_list = words
    word_dict = {}

    for i in words_set:
        count = []
        for j in words_list:
            if i == j:
                count.append(1)
        word_dict[i] = len(count)

    result_num = 0
    for _, value in word_dict.items():
        if value > result_num:
            result_num = value
    return result_num

For example:

words = ["Happy", "Happy", "Happy", "Duck", "Duck"]
answer = max_frequency(words)
print(answer)

3

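However, this function is slow when the list contains many words; for example, a list of 250,000 words takes more than 4 minutes to produce its output. So I'm asking for help tuning it.

I don't want to import anything.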

score 3 · Accepted Answer

To avoid passing over your list once for every unique word, you can simply iterate over it once and update a dictionary value for each count.

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

Output:

>>> print(max(counts.values()))
3
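As a minimal sketch, the same single-pass idea can be wrapped in a function with the question's signature. Returning 0 for an empty list (via `default=0`) is an assumption of this sketch; the question does not say what should happen in that case.

```python
def max_frequency(words):
    """Return the highest occurrence count of any word in `words`."""
    counts = {}
    for word in words:
        # One pass: read the running count (0 if unseen) and add one.
        counts[word] = counts.get(word, 0) + 1
    # default=0 is an assumption about what an empty list should return.
    return max(counts.values(), default=0)


print(max_frequency(["Happy", "Happy", "Happy", "Duck", "Duck"]))  # 3
```

Each word is looked up once, so the work grows with the length of the list rather than with the length times the number of distinct words.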

That said, this could be done better by using a defaultdict instead of get, or by using collections.Counter. Restricting yourself to no imports in Python is never a good idea if you have the choice.
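The defaultdict variant is not spelled out in this answer; one possible sketch, assuming the same sample list as in the question, looks like this:

```python
from collections import defaultdict

words = ["Happy", "Happy", "Happy", "Duck", "Duck"]

# int() returns 0, so a missing key starts at 0 automatically.
counts = defaultdict(int)
for word in words:
    counts[word] += 1

print(max(counts.values()))  # 3
```

The loop body no longer needs get, because the first access to an unseen word creates its entry with the value 0.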

For example, using collections.Counter:

from collections import Counter
counter = Counter(words)
most_common = counter.most_common(1)
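Note that most_common(1) returns a list containing one (word, count) tuple rather than a bare number. A short sketch of unpacking it, assuming a non-empty list (an empty list would make the `[0]` index raise IndexError):

```python
from collections import Counter

words = ["Happy", "Happy", "Happy", "Duck", "Duck"]
counter = Counter(words)

# most_common(1) returns a list holding one (word, count) tuple.
word, count = counter.most_common(1)[0]
print(word, count)  # Happy 3
```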

score 0 · Accepted Answer

Data of similar size to the OP's

Let's start with a list of words

In [55]: print(words)
['oihwf', 'rpowthj', 'trhok', 'rtpokh', 'tqhpork', 'reaokp', 'eahopk', 'qeaopker', 'okp[qrg', 'okehtq', 'pinjjn', 'rq38na', 'aogopire', "apoe'ak", 'apfobo;444', 'jiaegro', '908qymar', 'pe9irmp4', 'p9itoijar', 'oijor8']

and randomly combine these words to form a text

In [56]: from random import choice
In [57]: text = ' '.join(choice(words) for _ in range(250000))

Different approaches are possible

From the text we can get the list of words in it (note that wl is very different from words)

In [58]: wl = text.split()

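From this list we want to extract a dictionary, or a dictionary-like object, that counts the occurrences, and we have many options.

First option: we build a dictionary containing all the distinct words in wl, with the value for each key set to zero, and then we make another loop over the list of words to count the occurrences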

In [59]: def count0(wl):
    wd = dict(zip(wl,[0]*len(wl)))
    for w in wl: wd[w] += 1            
    return wd
   ....:

Second option: we start from an empty dictionary and use the get() method, which allows a default value

In [60]: def count1(wl):
    wd = dict()                   
    for w in wl: wd[w] = wd.get(w, 0)+1
    return wd
   ....:

Third and last option: we use a component of the standard library

In [61]: def count2(wl):
    from collections import Counter
    wc = Counter(wl)
    return wc
   ....:

Is one approach better than the others?

Which is best? The one you like best... anyway, here are the respective timings

In [62]: %timeit count0(wl) # start with a dict with 0 values
10 loops, best of 3: 82 ms per loop

In [63]: %timeit count1(wl) # uses .get(key, 0)
10 loops, best of 3: 92 ms per loop

In [64]: %timeit count2(wl) # uses collections.Counter
10 loops, best of 3: 43.8 ms per loop

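As expected, the fastest procedure is the one that uses collections.Counter, but I was a little surprised to notice that the first option, which makes two passes over the data, is faster than the second... my guess (and I mean: a guess) is that all the tests for new values are done when instantiating the dictionary, possibly in some C code.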
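For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The `vocab` list is a hypothetical stand-in for the 20 words above, and the repeat counts are arbitrary choices; absolute numbers will differ by machine and Python version, and the ordering of the three options may differ too.

```python
import timeit
from collections import Counter
from random import choice

# Hypothetical stand-in vocabulary: 20 short distinct strings.
vocab = ["w%d" % i for i in range(20)]
wl = [choice(vocab) for _ in range(250000)]


def count0(wl):
    # Pre-fill every distinct word with 0, then count in a second pass.
    wd = dict(zip(wl, [0] * len(wl)))
    for w in wl:
        wd[w] += 1
    return wd


def count1(wl):
    # Start empty and use .get() with a default of 0.
    wd = dict()
    for w in wl:
        wd[w] = wd.get(w, 0) + 1
    return wd


def count2(wl):
    # Let the standard library do the counting.
    return Counter(wl)


# The three results must agree before the timings mean anything.
assert count0(wl) == count1(wl) == dict(count2(wl))

for fn in (count0, count1, count2):
    # Best of 3 repeats, each repeat averaging over 3 calls.
    best = min(timeit.repeat(lambda: fn(wl), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {best * 1000:.1f} ms per call")
```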

score 0 · Accepted Answer

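While I totally agree with the comments about your "I don't want to import anything" statement, I found your question amusing, so let's give it a try.

You don't need to build a set. Just go over words directly.

words = ["Happy", "Happy", "Happy", "Duck", "Duck"]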
words_dict = {}

for w in words:
    if w in words_dict:
        words_dict[w] += 1
    else:
        words_dict[w] = 1

result_num = max(words_dict.values())

print(result_num)
# 3

score 0 · Accepted Answer

You can try this code, which is 760% faster.

def max_frequency(words):
    """Returns the number of times appeared of the word that
    appeared the most in a list of words."""

    count_dict = {}
    max_count = 0  # highest count seen so far

    for word in words:
        if word in count_dict:
            current_count = count_dict[word] = count_dict[word] + 1
        else:
            current_count = count_dict[word] = 1

        # Track the running maximum so no second pass is needed.
        if current_count > max_count:
            max_count = current_count

    return max_count
