11

我有一些看起来像这样的数据:

ID1 ID2 ID3  
ID1 ID4 ID5  
ID3 ID5 ID7 ID6  
...  
...  

其中每一行是一个组。

我的目标是为每个 ID 提供一个字典,然后是一组与其共享 >= 1 组的其他 ID。

例如,此数据将返回 {ID1: [ID2, ID3, ID4, ID5], ID2:[ID1, ID3] ... }

我可以为此想到 3 个选项,我想知道哪个(通常)最好:

  1. 在添加之前检查一个 ID 是否已经在列表中
  2. 创建集合而不是列表,并将每个 ID 添加到集合中
  3. 将所有 ID 添加到列表中,然后将所有列表转换为最后的集合。
4

4 回答 4

7

选项 2 对我来说听起来最合乎逻辑,尤其是使用 defaultdict 它应该很容易做到:)

import pprint
import collections

data = '''ID1 ID2 ID3
ID1 ID4 ID5
ID3 ID5 ID7 ID6'''

groups = collections.defaultdict(set)

for row in data.split('\n'):
    cols = row.split()
    for groupcol in cols:
        for col in cols:
            if col is not groupcol:
                groups[groupcol].add(col)

pprint.pprint(dict(groups))

结果:

{'ID1': set(['ID2', 'ID3', 'ID4', 'ID5']),
 'ID2': set(['ID1', 'ID3']),
 'ID3': set(['ID1', 'ID2', 'ID5', 'ID6', 'ID7']),
 'ID4': set(['ID1', 'ID5']),
 'ID5': set(['ID1', 'ID3', 'ID4', 'ID6', 'ID7']),
 'ID6': set(['ID3', 'ID5', 'ID7']),
 'ID7': set(['ID3', 'ID5', 'ID6'])}
于 2013-09-15T00:14:06.403 回答
7

TL;DR:选择选项 2。从一开始就使用集合。

在 Python 中,集合是哈希集,列表是动态数组。插入O(1)适用于两者,但检查元素是否存在O(n)适用于列表和O(1)集合。

所以选项 1 立即退出。如果您正在插入n项目并且每次都需要检查列表,那么整体复杂度将变为O(n^2).

O(n)选项 2 和 3总体上都是最优的。选项 2 在微型基准测试中可能更快,因为您不需要在集合之间移动对象。在实践中,请选择在您的特定情况下更易于阅读和维护的选项。

于 2013-09-15T02:28:22.133 回答
0

我同意之前的分析,即选项 B 是最好的,但是在这些情况下,一个微观基准通常会很有启发性:

import time

class Timer(object):
  def __init__(self, desc):
    self.desc = desc
  def __enter__(self):
    self.start = time.time()
  def __exit__(self, type, value, traceback):
    self.finish = time.time()
    print self.desc, 'took', self.finish - self.start

data = list(range(4000000))

with Timer('option 2'):
  myset = set()
  for x in data: myset.add(x)

with Timer('option 3'):
  mylist = list()
  for x in data: mylist.append(x)
  myset = set(mylist)

结果令我惊讶:

$ python test.py 
option 2 took 0.652163028717
option 3 took 0.748883008957

我本来预计至少有 2 倍的速度差异。

于 2013-09-15T13:15:06.227 回答
0

所以,我计时了几个不同的选项,经过几次迭代,想出了以下策略。我认为 sets2 会是赢家,但 listToSet2 对于每种类型的组都更快。

除了 listFilter 之外的所有功能都在同一个范围内 - listFilter 慢得多。

import random
import collections

small = [[random.randint(1,25) for _ in range(5)] for i in range(100)]
medium = [[random.randint(1,250) for _ in range(5)] for i in range(1000)]
mediumLotsReps = [[random.randint(1,25) for _ in range(5)] for i in range(1000)]
bigGroups = [[random.randint(1,250) for _ in range(75)] for i in range(100)]
huge = [[random.randint(1,2500) for _ in range(5)] for i in range(10000)]

def sets(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                if i is not j:
                    results[i].add(j)
    return results

def listToSet(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            results[j] += group[:i] + group[i:]
    return {k:set(v) for k, v in results.iteritems()}

def listToSet2(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            results[j] += group
    return {k:set(v)-set([k]) for k, v in results.iteritems()}

def sets2(groups):
    results = collections.defaultdict(set)
    for group in groups:
        for i in group:
            results[i] |= set(group)
    return {k:v - set([k]) for k, v in results.iteritems()}

def listFilter(groups):
    results = collections.defaultdict(list)
    for group in groups:
        for i,j in enumerate(group):
            filteredGroup = group[:i] + group[i:]
            results[j] += ([k for k in filteredGroup if k not in results[j]])
    return results
于 2013-09-15T21:42:13.923 回答