
I have lines of data made up of 4 fields:

aaaa bbb1 cccc dddd  
aaaa bbb2 cccc dddd  
aaaa bbb3 cccc eeee  
aaaa bbb4 cccc ffff  
aaaa bbb5 cccc gggg  
aaaa bbb6 cccc dddd    

Please bear with me.

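The first and third fields are always the same, but I don't need them. The fourth field may or may not be shared with other lines. The problem is that I only want the 2nd and 4th fields from lines whose fourth field is not shared with any other line. For example, from the data above: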

bbb3 eeee  
bbb4 ffff    
bbb5 gggg    

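I don't mean deduplication, because that would still leave one entry behind. If the fourth field shares a value with another line, I don't want any of the lines that have that value.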

Apologies again for asking what is probably a simple question.


2 Answers


Here you go:

from collections import defaultdict

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

# Count how many lines each unique value of the fourth field appears in.
d_counts = defaultdict(int)
for line in LINES:
    a, b, c, d = line.split()
    d_counts[d] += 1

# Print only those lines with a unique value for the fourth field.
for line in LINES:
    a, b, c, d = line.split()
    if d_counts[d] == 1:
        print(b, d)

# Prints
# bbb3 eeee
# bbb4 ffff
# bbb5 gggg
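
The same counting idea can also be sketched with `collections.Counter`, splitting each line only once. This is a minimal sketch, not a replacement for the code above: the function name `rows_with_unique_fourth_field` is made up for illustration, and it assumes every non-blank line has exactly four whitespace-separated fields.

```python
from collections import Counter


def rows_with_unique_fourth_field(lines):
    """Return (second, fourth) pairs whose fourth field occurs exactly once."""
    # Split each line once and keep the parsed rows; blank lines are skipped.
    # Assumption: every remaining line has exactly four fields.
    rows = [line.split() for line in lines if line.strip()]
    # Count how many rows carry each fourth-field value.
    counts = Counter(row[3] for row in rows)
    # Keep only rows whose fourth field is not shared, in input order.
    return [(row[1], row[3]) for row in rows if counts[row[3]] == 1]


sample = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".splitlines()

result = rows_with_unique_fourth_field(sample)
for b, d in result:
    print(b, d)
```

Because the rows are kept in a list, this still holds the whole input in memory; it only avoids parsing each line twice.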
answered 2009-07-06T22:50:32.127

For your amplified requirement, you can avoid reading the file twice or keeping it in a list:

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

import collections
adict = collections.defaultdict(list)
for line in LINES: # or file ...
    a, b, c, d = line.split()
    adict[d].append(b)

map_b_to_d = dict((blist[0], d) for d, blist in adict.items() if len(blist) == 1)
print(map_b_to_d)

# alternative; saves some memory

xdict = {}
duplicated = object()
for line in LINES: # or file ...
    a, b, c, d = line.split()
    xdict[d] = duplicated if d in xdict else b

map_b_to_d2 = dict((b, d) for d, b in xdict.items() if b is not duplicated)
print(map_b_to_d2)
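
If the data comes from a file rather than an in-memory string, the sentinel approach can be wrapped in a function that accepts any iterable of lines and reads it in a single pass. The following is only a sketch: `unique_by_fourth_field` is a hypothetical name, `io.StringIO` stands in for a real open file, and silently skipping lines that do not have exactly four fields is an assumption you may want to change.

```python
import io

# Sentinel marking a fourth-field value that has been seen more than once.
_DUPLICATED = object()


def unique_by_fourth_field(lines):
    """Return (second, fourth) pairs whose fourth field occurs exactly once."""
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: ignore blank or malformed lines
        _, b, _, d = fields
        # First sighting stores b; any later sighting overwrites it with the sentinel.
        seen[d] = _DUPLICATED if d in seen else b
    return [(b, d) for d, b in seen.items() if b is not _DUPLICATED]


# io.StringIO stands in for a real file object here.
data = io.StringIO(
    "aaaa bbb1 cccc dddd\n"
    "aaaa bbb2 cccc dddd\n"
    "aaaa bbb3 cccc eeee\n"
    "aaaa bbb4 cccc ffff\n"
    "aaaa bbb5 cccc gggg\n"
    "aaaa bbb6 cccc dddd\n"
)
pairs = unique_by_fourth_field(data)
for b, d in pairs:
    print(b, d)
```

Memory use grows with the number of distinct fourth-field values rather than with the number of lines, and on Python 3.7+ the result follows the order in which each value was first seen.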
answered 2009-07-07T05:16:01.000