
I have a list of tuples, where each tuple contains a string and a number in the form of:

[(string_1, num_a), (string_2, num_b), ...]

The strings are not unique, and neither are the numbers, e.g. (string_1, num_m) or (string_9, num_b) are likely to exist in the list.

I'm attempting to create a dictionary with the string as the key and a set of all numbers occurring with that string as the value:

dict = {string_1: {num_a, num_m}, string_2: {num_b}, ...}

I've done this somewhat successfully with the following dictionary comprehension with nested set comprehension:

#st_id_list = [(string_1, num_a), ...]
#st_dict = {string_1: {num_a, num_m}, ...} 
st_dict = {
    st[0]: set(
        st_[1]
        for st_ in st_id_list
        if st_[0] == st[0]
    )
    for st in st_id_list
}

There's only one issue: st_id_list is 18,000 items long. This snippet of code takes less than ten seconds to run for a list of 500 tuples, but over twelve minutes to run for the full 18,000 tuples. I have to think this is because I've nested a set comprehension inside a dict comprehension.

Is there a way to avoid this, or a smarter way to do it?


2 Answers

11

You have a double loop, so building your dictionary takes O(N**2) time. For 500 items that is 250,000 steps, and for your 18k items it is 324 million steps.
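To make the arithmetic concrete, here is a rough sketch that counts how often the inner comprehension's `if` test runs. The `sample` list and the `count_nested_comparisons` helper are made up for illustration and are not part of the question's code.

```python
# Hypothetical stand-in for st_id_list: 500 (string, number) tuples.
sample = [("string_%d" % (i % 50), i) for i in range(500)]

def count_nested_comparisons(pairs):
    """Count how many times the inner `if st_[0] == st[0]` test would run."""
    comparisons = 0
    for st in pairs:          # outer loop of the dict comprehension
        for st_ in pairs:     # inner loop of the set comprehension
            comparisons += 1
    return comparisons

nested = count_nested_comparisons(sample)
single_pass = len(sample)     # one step per tuple in a single loop

print(nested, single_pass)    # 250000 500
```

The same calculation for 18,000 tuples gives 18,000 * 18,000 = 324,000,000 comparisons, which is why the run time grows so much faster than the list does.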

Here is an O(N) loop, so it takes 500 steps for the smaller dataset and 18,000 steps for the larger one:

st_dict = {}
for st, id in st_id_list:
    st_dict.setdefault(st, set()).add(id)

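This uses the dict.setdefault() method to ensure that for a given key (your string value) there is at least an empty set available if the key is missing, and then adds the current id value to that set.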
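As a small illustration of that behaviour, the sketch below splits the call into two steps so the return value of setdefault() is visible. The `sample` data and the `bucket` name are assumptions made for this example only.

```python
# Hypothetical sample data standing in for st_id_list.
sample = [("string_1", 1), ("string_2", 2), ("string_1", 3)]

st_dict = {}
for st, num in sample:
    # Missing key: stores the new empty set under st and returns it.
    # Existing key: returns the set already stored and ignores the default.
    bucket = st_dict.setdefault(st, set())
    bucket.add(num)

# Both numbers seen with "string_1" end up in the same set.
assert st_dict == {"string_1": {1, 3}, "string_2": {2}}
```

Note that the `set()` argument is evaluated on every call, even when the key already exists and the default is thrown away. For a cheap object like an empty set this rarely matters.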

You can do the same with a collections.defaultdict() object:

from collections import defaultdict

st_dict = defaultdict(set)
for st, id in st_id_list:
    st_dict[st].add(id)

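defaultdict() uses the factory you pass in to create a default value for missing keys.

The disadvantage of the defaultdict approach is that the object keeps producing default values for missing keys after your loop, which can hide application bugs. Use st_dict.default_factory = None to disable the factory explicitly and prevent that.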
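A minimal sketch of that pitfall and the fix, using made-up sample data and made-up key names ("typo_key", "another_typo") purely for illustration:

```python
from collections import defaultdict

# Hypothetical sample data standing in for st_id_list.
sample = [("string_1", 1), ("string_2", 2), ("string_1", 3)]

st_dict = defaultdict(set)
for st, num in sample:
    st_dict[st].add(num)

# While the factory is active, merely reading a missing key inserts it.
_ = st_dict["typo_key"]
assert "typo_key" in st_dict
del st_dict["typo_key"]

# With the factory disabled, a missing key raises KeyError, as in a plain dict.
st_dict.default_factory = None
try:
    st_dict["another_typo"]
except KeyError:
    missing_raised = True
else:
    missing_raised = False

print(missing_raised)  # True
```

If you would rather hand a plain dictionary to the rest of the program, `dict(st_dict)` makes a shallow copy without the default behaviour.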

Answered 2017-11-13T21:10:37.263
0

Why use two loops when you can do it in a single loop, like this:

list_1=[('string_1', 'num_a'), ('string_2', 'num_b'),('string_1' , 'num_m'),('string_9' , 'num_b')]

string_num={}
for i in list_1:
    if i[0] not in string_num:
        string_num[i[0]]={i[1]}
    else:
        string_num[i[0]].add(i[1])

print(string_num)

Output:

{'string_9': {'num_b'}, 'string_1': {'num_a', 'num_m'}, 'string_2': {'num_b'}}
Answered 2017-11-13T22:42:21.060