
I have a set of unique tuples that looks like the following. The first value is the name, the second value is the ID, and the third value is the type.

('9', '0000022', 'LRA')
('45', '0000016', 'PBM')
('16', '0000048', 'PBL')
('304', '0000042', 'PBL')
('7', '0000014', 'IBL')
('12', '0000051', 'LRA')
('7', '0000014', 'PBL')
('68', '0000002', 'PBM')
('356', '0000049', 'PBL')
('12', '0000051', 'PBL')
('15', '0000015', 'PBL')
('32', '0000046', 'PBL')
('9', '0000022', 'PBL')
('10', '0000007', 'PBM')
('7', '0000014', 'LRA')
('439', '0000005', 'PBL')
('4', '0000029', 'LRA')
('41', '0000064', 'PBL')
('10', '0000007', 'IBL')
('8', '0000006', 'PBL')
('331', '0000040', 'PBL')
('9', '0000022', 'IBL')

This set includes duplicates of the name/ID combination, but they each have a different type. For example:

('9', '0000022', 'LRA')
('9', '0000022', 'PBL')
('9', '0000022', 'IBL')

What I would like to do is process this set of tuples so that I can create a new list where each name/ID combination would only appear once, but include all types. This list should only include the name/ID combos that have more than one type. For example, my output would look like this:

('9', '0000022', 'LRA', 'PBL', 'IBL')
('7', '0000014', 'IBL', 'PBL', 'LRA')

but my output should not include name/ID combos that have only one type:

('45', '0000016', 'PBM')
('16', '0000048', 'PBL')

Any help is appreciated!


3 Answers


itertools.groupby, with a little extra processing of what it yields, does the job:

from itertools import groupby

data = {
    ('9', '0000022', 'LRA'),
    ('45', '0000016', 'PBM'),
    ('16', '0000048', 'PBL'),
    ...
}

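def group_by_name_and_id(s):
    # groupby only merges adjacent items, so sort first; sorting the full
    # tuples also leaves the types inside each group in sorted order.
    grouped = groupby(sorted(s), key=lambda item: item[:2])
    for (name, id_), items in grouped:
        types = tuple(type_ for _, _, type_ in items)
        if len(types) > 1:  # keep only name/ID pairs with more than one type
            yield (name, id_) + types

print('\n'.join(str(x) for x in group_by_name_and_id(data)))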

Output:

('10', '0000007', 'IBL', 'PBM')
('12', '0000051', 'LRA', 'PBL')
('7', '0000014', 'IBL', 'LRA', 'PBL')
('9', '0000022', 'IBL', 'LRA', 'PBL')
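
In case the sorted() call looks optional: groupby only merges items that sit next to each other, so on unsorted input the same name/ID pair can come out as several separate groups. A minimal sketch of that behaviour, using three rows from the question (the names rows and key are just for illustration):

```python
from itertools import groupby

# Three rows from the question, deliberately left out of order.
rows = [
    ('9', '0000022', 'LRA'),
    ('45', '0000016', 'PBM'),
    ('9', '0000022', 'PBL'),
]

# Group on the (name, id) prefix of each tuple.
key = lambda row: row[:2]

# Without sorting, ('9', '0000022') shows up as two separate groups.
unsorted_keys = [k for k, _ in groupby(rows, key=key)]

# After sorting, equal keys are adjacent and each appears exactly once.
sorted_keys = [k for k, _ in groupby(sorted(rows), key=key)]

print(unsorted_keys)
print(sorted_keys)
```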

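PS: I don't really like this design, though. The types could (and really should) be a list held in the third item of the tuple rather than part of the tuple itself. As it stands, the length of the tuple is dynamic, which is ugly; tuples aren't meant to be used that way. So it would be better to replace

        types = tuple(type_ for _, _, type_ in items)
        yield (name, id_) + types

with

        types = [type_ for _, _, type_ in items]
        yield (name, id_, types)

which yields cleaner-looking output: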

('10', '0000007', ['IBL', 'PBM'])
('12', '0000051', ['LRA', 'PBL'])
('7', '0000014', ['IBL', 'LRA', 'PBL'])
('9', '0000022', ['IBL', 'LRA', 'PBL'])

That way you can iterate over the result with, for example, for name, id, types in transformed_data:.
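
Putting the list-based variant together, a self-contained sketch might look like the following. It assumes a small subset of the question's data; group_types is a hypothetical helper name, and transformed_data is simply the name used in the loop above.

```python
from itertools import groupby

# A subset of the rows from the question, kept short for brevity.
data = {
    ('9', '0000022', 'LRA'),
    ('9', '0000022', 'PBL'),
    ('9', '0000022', 'IBL'),
    ('45', '0000016', 'PBM'),
    ('10', '0000007', 'PBM'),
    ('10', '0000007', 'IBL'),
}

def group_types(rows):
    """Yield (name, id_, [types]) for each name/ID pair with more than one type."""
    key = lambda row: row[:2]
    # Sorting makes equal (name, id) pairs adjacent, which groupby requires.
    for (name, id_), items in groupby(sorted(rows), key=key):
        types = [type_ for _, _, type_ in items]
        if len(types) > 1:
            yield (name, id_, types)

transformed_data = list(group_types(data))

# Every result has exactly three fields, so unpacking always works.
for name, id_, types in transformed_data:
    print(name, id_, ', '.join(types))
```

Because every result has the same three fields, the unpacking in the loop never depends on how many types a pair happens to have.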

answered 2013-11-05T22:46:31.400

It's pretty simple to accumulate with a defaultdict and then filter:

from collections import defaultdict

d = defaultdict(list)
for tup in list_of_tuples:
    d[(tup[0],tup[1])].append(tup[2])

d
Out[15]: defaultdict(<class 'list'>, {('16', '0000048'): ['PBL'], ('9', '0000022'): ['LRA', 'PBL', 'IBL'], ('12', '0000051'): ['LRA', 'PBL'], ('304', '0000042'): ['PBL'], ('331', '0000040'): ['PBL'], ('41', '0000064'): ['PBL'], ('356', '0000049'): ['PBL'], ('15', '0000015'): ['PBL'], ('8', '0000006'): ['PBL'], ('4', '0000029'): ['LRA'], ('7', '0000014'): ['IBL', 'PBL', 'LRA'], ('32', '0000046'): ['PBL'], ('68', '0000002'): ['PBM'], ('439', '0000005'): ['PBL'], ('10', '0000007'): ['PBM', 'IBL'], ('45', '0000016'): ['PBM']})

Then filter:

[(key,val) for key,val in d.items() if len(val) > 1]
Out[29]: 
[(('9', '0000022'), ['LRA', 'PBL', 'IBL']),
 (('12', '0000051'), ['LRA', 'PBL']),
 (('7', '0000014'), ['IBL', 'PBL', 'LRA']),
 (('10', '0000007'), ['PBM', 'IBL'])]

If you really want to get it back into the original format:

from itertools import chain

[tuple(chain.from_iterable(tup)) for tup in d.items() if len(tup[1]) > 1]
Out[27]: 
[('9', '0000022', 'LRA', 'PBL', 'IBL'),
 ('12', '0000051', 'LRA', 'PBL'),
 ('7', '0000014', 'IBL', 'PBL', 'LRA'),
 ('10', '0000007', 'PBM', 'IBL')]

That said, I think it makes the most sense to leave it as a dict keyed by (name, id) tuples, as produced in the first step.
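
As a rough sketch of what keeping the dict could look like: the filter can be written as a dict comprehension, after which a name/ID pair is a direct lookup. The sample rows are a subset of the question's data, and types_by_key and multi_type are illustrative names only.

```python
from collections import defaultdict

# A subset of the rows from the question.
list_of_tuples = [
    ('9', '0000022', 'LRA'),
    ('45', '0000016', 'PBM'),
    ('7', '0000014', 'IBL'),
    ('7', '0000014', 'PBL'),
    ('9', '0000022', 'PBL'),
    ('9', '0000022', 'IBL'),
]

# Accumulate the types under their (name, id) key.
types_by_key = defaultdict(list)
for name, id_, type_ in list_of_tuples:
    types_by_key[(name, id_)].append(type_)

# Keep only the keys that collected more than one type.
multi_type = {key: types for key, types in types_by_key.items() if len(types) > 1}

# Looking up a given name/ID pair is now a plain dict access.
print(multi_type[('9', '0000022')])
print(('45', '0000016') in multi_type)
```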

answered 2013-11-05T23:09:28.437

A one-liner, for science (the other answers are more readable and probably more correct):

testlist = [('9', '0000022', 'LRA'),
            ('45', '0000016', 'PBM'),
            ('16', '0000048', 'PBL'),
            ('304', '0000042', 'PBL')]  # etc.


from collections import Counter

# For every (name, id) pair that occurs more than once, collect its types.
new_list = [(a1, b1) + tuple(c for (a, b, c) in testlist if (a, b) == (a1, b1))
            for (a1, b1) in [pair for pair, count in
                             Counter((a, b) for (a, b, c) in testlist).items()
                             if count > 1]]

print(new_list)

Output:

[('9', '0000022', 'LRA', 'PBL', 'IBL'),
 ('12', '0000051', 'LRA', 'PBL'), 
 ('10', '0000007', 'PBM', 'IBL'), 
 ('7', '0000014', 'IBL', 'PBL', 'LRA')]
answered 2013-11-05T23:12:55.027