python - 以最快的方式计算python中的重复单词

Question

我试图在 23 万个单词的列表中计算重复单词。我使用 python 字典来做到这一点。代码如下：

for words in word_list:
    if words in word_dict.keys():
       word_dict[words] += 1
    else:
       word_dict[words] = 1

上面的代码花了 3 分钟！我运行了超过 150 万字的相同代码，它运行了超过 25 分钟，我失去了耐心并终止了。然后我发现我可以从这里使用以下代码（也如下所示）。结果令人惊讶，它在几秒钟内完成！所以我的问题是执行此操作的更快方法是什么？我猜字典创建过程必须花费 O(N) 时间。Counter 方法如何能够在几秒钟内完成这个过程，并创建一个精确的词词典作为键和频率作为它的值？

from collections import Counter
word_dict = Counter(word_list)

score 6 · Accepted Answer

正因为如此：

if words in word_dict.keys():

.keys()返回所有键的列表。列表需要线性时间来扫描，因此您的程序以二次时间运行！

试试这个：

if words in word_dict:

此外，如果您有兴趣，您可以自己查看Counter实现。它是用普通的 Python 编写的。

score 4 · Accepted Answer

您的字典计数方法构造不正确。

您可以defaultdict通过以下方式使用 a：

d = defaultdict(int)

for word in word_list:
    d[word] += 1

但是counter来自 itertools 的方法仍然更快，即使它做几乎相同的事情，因为它是用更有效的实现编写的。但是，使用 counter 方法，您需要将列表传递给它以进行计数，而使用 defaultdict，您可以放置来自不同位置的源并具有更复杂的循环。

最终这是您的偏好。如果计算一个列表，counter是要走的路，如果从多个源迭代，或者你只是想在你的程序中有一个计数器，不希望额外的查找来检查一个项目是否已经被计算。那么defaultdict是你的选择。

score 1 · Accepted Answer

您实际上可以查看 Counter 代码，这是在 init 上调用的更新方法：

（注意它使用了定义的本地定义的性能技巧self.get）

def update(self, iterable=None, **kwds):
    '''Like dict.update() but add counts instead of replacing them.

    Source can be an iterable, a dictionary, or another Counter instance.

    >>> c = Counter('which')
    >>> c.update('witch')           # add elements from another iterable
    >>> d = Counter('watch')
    >>> c.update(d)                 # add elements from another counter
    >>> c['h']                      # four 'h' in which, witch, and watch
    4

    '''
    # The regular dict.update() operation makes no sense here because the
    # replace behavior results in the some of original untouched counts
    # being mixed-in with all of the other counts for a mismash that
    # doesn't have a straight-forward interpretation in most counting
    # contexts.  Instead, we implement straight-addition.  Both the inputs
    # and outputs are allowed to contain zero and negative counts.

    if iterable is not None:
        if isinstance(iterable, Mapping):
            if self:
                self_get = self.get
                for elem, count in iterable.iteritems():
                    self[elem] = self_get(elem, 0) + count
            else:
                super(Counter, self).update(iterable) # fast path when counter is empty
        else:
            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1
    if kwds:
        self.update(kwds)

score 0 · Accepted Answer

您也可以尝试将defaultdict其用作更具竞争力的选择。尝试：

from collections import defaultdict

word_dict = defaultdict(lambda: 0)
for word in word_list:
    word_dict[word] +=1

print word_dict

score 0 · Accepted Answer

与蒙库特提到的类似，最好的方法之一是利用该.get()功能。这要归功于 Charles Severance 和Python For Everyone 课程

供测试用：

# Pretend line is as follow. 
# It can and does contain \n (newline) but not \t (tab).
line = """Your battle is my battle . We fight together . One team . One team . 
Shining sun always comes with the rays of hope . The hope is there . 
Our best days yet to come . Let the hope light the road .""".lower()

他的代码（附上我的笔记）：

# make an empty dictionary
# split `line` into a list. default is to split on a space character
# etc., etc.
# iterate over the LIST of words (made from splitting the string)
counts = dict()
words = line.split() 
for word in words: 
    counts[word] = counts.get(word,0) + 1

你的代码：

for words in word_list:
if words in word_dict.keys():
   word_dict[words] += 1
else:
   word_dict[words] = 1

.get()做这个：

返回与关联的字典中的 VALUE word。
否则（如果该词不是字典中的键，则返回）0。

无论返回什么，我们都会添加1它。因此，它处理了第一次看到这个词的基本情况。我们不能使用字典推导，因为当我们创建该变量时，推导分配给的变量将不存在。意义

this:counts = { word:counts.get(word,0) + 1 for word in words}是不可能的，因为counts（同时被创建和赋值。或者，因为）counts当我们（再次）引用它时，变量还没有完全定义.get()。

输出

>> counts
{'.': 8,
'always': 1,
'battle': 2,
'best': 1,
'come': 1,
'comes': 1,
'days': 1,
'fight': 1,
'hope': 3,
'is': 2,
'let': 1,
'light': 1,
'my': 1,
'of': 1,
'one': 2,
'our': 1,
'rays': 1,
'road': 1,
'shining': 1,
'sun': 1,
'team': 2,
'the': 4,
'there': 1,
'to': 1,
'together': 1,
'we': 1,
'with': 1,
'yet': 1,
'your': 1}

顺便说一句，这是我写的“加载”使用，作为.get()解决经典 FizzBuzz 问题的一种方式。我目前正在为类似情况编写代码，在这种情况下我将使用模数和字典，但将拆分字符串作为输入。

python - 以最快的方式计算python中的重复单词

5 回答 5

Related

Reference