
I have a dataset like this (simplified):

foods_dict = {}
foods_dict['fruit'] = ['apple', 'orange', 'plum']
foods_dict['veg'] = ['cabbage', 'potato', 'carrot']

I have a list of items that I want to categorize:

items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

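I want to be able to filter `items` into smaller lists based on their category in `foods_dict`. I think these sub-lists should actually be sets, because I don't want any duplicates in them.

My first pass at the code looked like this: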

fruits = set()
veggies = set()
others = set()
for item in items:
    if item in foods_dict.get('fruit'):
        fruits.add(item)
    elif item in foods_dict.get('veg'):
        veggies.add(item)
    else:
        others.add(item)

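But this seems really inefficient and unnecessarily verbose to me. My question is: how can this code be improved? I'm guessing a list comprehension could be useful here, but I'm not sure, given the number of lists.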


4 Answers

6

For an efficient solution, you want to avoid explicit loops as much as possible:

items = set(items)
fruits = set(foods_dict['fruit']) & items
veggies = set(foods_dict['veg']) & items
others = items - fruits - veggies

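This will almost certainly be faster than using explicit loops. In particular, doing `item in foods_dict['fruit']` is time-consuming if the list of fruits is long, because each membership test on a list scans it element by element.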
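As a minimal sketch of the same idea, the set operations can be wrapped in a small function. The name `classify` is my own invention and not something from the question; the keys 'fruit' and 'veg' are the ones the question uses.

```python
def classify(items, foods_dict):
    """Split items into fruits, veggies and others using set operations.

    `classify` is a hypothetical helper name; 'fruit' and 'veg' are the
    keys used in the question.
    """
    items = set(items)  # duplicates are dropped here
    fruits = items & set(foods_dict['fruit'])
    veggies = items & set(foods_dict['veg'])
    others = items - fruits - veggies
    return fruits, veggies, others


foods_dict = {
    'fruit': ['apple', 'orange', 'plum'],
    'veg': ['cabbage', 'potato', 'carrot'],
}
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

fruits, veggies, others = classify(items, foods_dict)
print(sorted(fruits))   # ['orange', 'plum']
print(sorted(veggies))  # ['cabbage', 'potato']
print(sorted(others))   # ['egg', 'farmer']
```

The intersections rely on hash lookups, so their cost does not grow with the length of each category list the way repeated `in` tests on a list do.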


A very simple benchmark of the solutions posted so far:

In [5]: %%timeit
   ...: items2 = set(items)
   ...: fruits = set(foods_dict['fruit']) & items2
   ...: veggies = set(foods_dict['veg']) & items2
   ...: others = items2 - fruits - veggies
   ...: 
1000000 loops, best of 3: 1.75 us per loop

In [6]: %%timeit
   ...: fruits = set()
   ...: veggies = set()
   ...: others = set()
   ...: for item in items:
   ...:     if item in foods_dict.get('fruit'):
   ...:         fruits.add(item)
   ...:     elif item in foods_dict.get('veg'):
   ...:         veggies.add(item)
   ...:     else:
   ...:         others.add(item)
   ...: 
100000 loops, best of 3: 2.57 us per loop

In [7]: %%timeit
   ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
   ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
   ...: others = set(items) - veggies - fruits
   ...: 
100000 loops, best of 3: 3.34 us per loop

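Of course, you should run some tests with "real input" before choosing. I don't know how many elements your problem has, and the timings may change a lot as the input grows. In any case, my experience tells me that, at least in CPython, explicit loops tend to be slower than using only built-in operations.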


Edit 2: an example with bigger inputs:

In [9]: foods_dict = {}
   ...: foods_dict['fruit'] = list(range(0, 10000, 2))
   ...: foods_dict['veg'] = list(range(1, 10000, 2))

In [10]: items = list(range(5, 10000, 13))  #some odd some even

In [11]: %%timeit
    ...: fruits = set()
    ...: veggies = set()
    ...: others = set()
    ...: for item in items:
    ...:     if item in foods_dict.get('fruit'):
    ...:         fruits.add(item)
    ...:     elif item in foods_dict.get('veg'):
    ...:         veggies.add(item)
    ...:     else:
    ...:         others.add(item)
    ...: 
10 loops, best of 3: 68.8 ms per loop

In [12]: %%timeit
    ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
    ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
    ...: others = set(items) - veggies - fruits
    ...: 
10 loops, best of 3: 99.9 ms per loop

In [13]: %%timeit
    ...: items2 = set(items)
    ...: fruits = set(foods_dict['fruit']) & items2
    ...: veggies = set(foods_dict['veg']) & items2
    ...: others = items2 - fruits - veggies
    ...: 
1000 loops, best of 3: 445 us per loop

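As you can see, using only built-in set operations is roughly 150 times faster than the explicit loop on this input (445 us versus 68.8 ms).

answered 2013-10-03T18:17:22.517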
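If you want to rerun the comparison outside IPython, something along these lines should work with the standard `timeit` module. This is only a sketch: the function names `with_loop` and `with_sets` are mine, and the absolute numbers will depend on your machine and Python version.

```python
import timeit

# Same larger inputs as in the benchmark above.
foods_dict = {
    'fruit': list(range(0, 10000, 2)),
    'veg': list(range(1, 10000, 2)),
}
items = list(range(5, 10000, 13))


def with_loop():
    """Explicit loop with list membership tests (hypothetical name)."""
    fruits, veggies, others = set(), set(), set()
    for item in items:
        if item in foods_dict['fruit']:
            fruits.add(item)
        elif item in foods_dict['veg']:
            veggies.add(item)
        else:
            others.add(item)
    return fruits, veggies, others


def with_sets():
    """Set intersection and difference only (hypothetical name)."""
    items2 = set(items)
    fruits = set(foods_dict['fruit']) & items2
    veggies = set(foods_dict['veg']) & items2
    others = items2 - fruits - veggies
    return fruits, veggies, others


# Both approaches must agree before their timings are worth comparing.
assert with_loop() == with_sets()

for func in (with_loop, with_sets):
    # Best of 3 repeats, 3 calls each.
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```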
1

This might do what you need (taking the veggies case as an example):

veggies = set(elem for elem in items if elem in foods_dict['veg'])

More fully:

veggies = set(elem for elem in items if elem in foods_dict['veg'])
fruits = set(elem for elem in items if elem in foods_dict['fruit'])
others = set(items) - veggies - fruits
answered 2013-10-03T18:12:27.237
1

How about something like this (avoiding list comprehensions by using set operations):

fruits = set(items).intersection(set(foods_dict['fruit']))
veggies = set(items).intersection(set(foods_dict['veg']))
others = set(items).difference(veggies.union(fruits))

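If you can, start with sets so that you avoid the set() conversions.

Hope that helps!

Edit: It seems your concerns are efficiency and verbosity (and being "pythonic"). If you are concerned about efficiency, keep in mind that between the bytecode compiler and the interpreter you don't know which optimizations (if any) are being applied. It is usually hard to optimize things at such a high level. It is possible, but you will need some benchmarks first. If you are concerned about being pythonic, I would try to go as high-level as possible (can I say declarative here? Or are we not there yet :)).

In other words, rather than looping and telling Python how it should decide which item goes where, I would try to be readable, clear, and concise. I think (and I wrote the code above, so I would) that this style tells the reader exactly what you want to do with the list of items.

Hope this helps. All of this is just my opinion and should be taken with a grain of salt.

answered 2013-10-03T18:26:55.050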
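For example, assuming you control how `foods_dict` and `items` are built, a sketch with sets from the start could look like this:

```python
# Store the categories as sets up front so no conversion is needed later.
foods_dict = {
    'fruit': {'apple', 'orange', 'plum'},
    'veg': {'cabbage', 'potato', 'carrot'},
}
items = {'orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg'}

fruits = items.intersection(foods_dict['fruit'])
veggies = items.intersection(foods_dict['veg'])
others = items.difference(veggies.union(fruits))

print(sorted(others))  # ['egg', 'farmer']
```

Note that the method forms (`intersection`, `union`, `difference`) accept any iterable as their argument, so even with list values in `foods_dict` only `items` itself would need to be a set.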
1

If you have more categories, here is a more generic version. (This way there is no separate variable for each category.)

from collections import defaultdict

foods_dict = {}
foods_dict['fruit'] = set(['apple', 'orange', 'plum'])
foods_dict['veg']   = set(['cabbage', 'potato', 'carrot'])

items = set(['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg'])

assignments = defaultdict(set)

# Start from every item, then remove the ones that belong to a known category.
assignments['other'] = items.copy()
for key in foods_dict.keys():
    assignments[key] = foods_dict[key] & items
    assignments['other'] -= foods_dict[key]
answered 2013-10-03T18:37:05.677
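
A variation that is not part of the code above: if every food belongs to at most one category, you could build a reverse lookup once and then sort the items in a single pass. The name `category_of` is my own, and this is only a sketch under that assumption.

```python
from collections import defaultdict

foods_dict = {
    'fruit': {'apple', 'orange', 'plum'},
    'veg': {'cabbage', 'potato', 'carrot'},
}
items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

# Map every known food to its category, e.g. 'apple' -> 'fruit'.
# (`category_of` is a hypothetical name introduced for this sketch.)
category_of = {food: key for key, foods in foods_dict.items() for food in foods}

assignments = defaultdict(set)
for item in items:
    # Unknown items fall back to the 'other' bucket.
    assignments[category_of.get(item, 'other')].add(item)

print(sorted(assignments['fruit']))  # ['orange', 'plum']
print(sorted(assignments['veg']))    # ['cabbage', 'potato']
print(sorted(assignments['other']))  # ['egg', 'farmer']
```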