python - 用“精简”列表替换列表列表，同时保持顺序

Question

我有一个列表，如我附加的代码中所示。如果有任何共同值，我想链接每个子列表。然后我想用一个精简的列表列表替换列表列表。示例： 如果我有[[1,2,3],[3,4]]我想要的列表[1,2,3,4]。如果我有[[4,3],[1,2,3]]我想要[4,3,1,2]的。如果我有[[1,2,3],[a,b],[3,4],[b,c]]我想要的[[1,2,3,4],[a,b,c]]，或者[[a,b,c],[1,2,3,4]]我不在乎哪一个。

我快到了...

我的问题是当我有[[1,2,3],[10,5],[3,8,5]]我想要的情况[1,2,3,10,5,8]但使用我当前的代码时[1,2,3,8,10,5]

这是我的代码：

import itertools

a = [1,2,3]
b = [3,4]
i = [21,22]
c = [88,7,8]
e = [5,4]
d = [3, 50]
f = [8,9]
g=  [9,10]
h = [20,21]

lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000  
#I have a lot of list but not very many different lists

def any_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

def find_uniq(lst):
    ''' return the uniq parts of lst'''
    seen = set()
    seen_add = seen.add
    return [ x for x in lst if x not in seen and not seen_add(x)]

def overlap_inlist(o_lst, lstoflst):
    '''
    Search for overlap, using "any_overlap", of a list( o_lst) in a list of lists (lstoflst).
    If there is overlap add the uniq part of the found list to the search list, and keep 
    track of where that list was found 
    '''
    used_lst =[ ]
    n_lst =[ ]
    for lst_num, each_lst in enumerate(lstoflst):
        if any_overlap(o_lst,each_lst):
            n_lst.extend(each_lst)
            used_lst.append(lst_num)
    n_lst= find_uniq(n_lst)
    return  n_lst, used_lst

def comb_list(lst):
    '''
    For each list in a list of list find all the overlaps using 'ovelap_inlist'.
    Update the list each time to delete the found lists. Return the final combined list. 
    '''
    for updated_lst in lst:
        n_lst, used_lst = overlap_inlist(updated_lst,lst)
        lst[:] = [x for i,x in enumerate(lst) if i not in used_lst]
        lst.insert(0,n_lst)
    return lst
comb_lst = comb_list(lst)
print comb_lst

该脚本的输出是：

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 50, 5], [21, 22, 20]]

我想要它，所以密钥按原始顺序排列，例如：

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 5, 50,], [21, 22, 20]]

5 和 50 在新的lst[2]中切换

我对python有点陌生。我将不胜感激任何问题的解决方案或对我当前代码的评论。我不是计算机科学家，我想可能有某种算法可以快速做到这一点（也许来自集合论？）。如果有这样的算法，请指出我正确的方向。

我可能会让这种方式变得更加复杂......谢谢！

score 14 · Accepted Answer

这是一种蛮力方法（可能更容易理解）：

from itertools import chain

def condense(*lists):
    # remember original positions
    positions = {}
    for pos, item in enumerate(chain(*lists)):
        if item not in positions:
            positions[item] = pos

    # condense disregarding order
    sets = condense_sets(map(set, lists))

    # restore order
    result = [sorted(s, key=positions.get) for s in sets]
    return result if len(result) != 1 else result[0]

def condense_sets(sets):
    result = []
    for candidate in sets:
        for current in result:
            if candidate & current:   # found overlap
                current |= candidate  # combine (merge sets)

                # new items from candidate may create an overlap
                # between current set and the remaining result sets
                result = condense_sets(result) # merge such sets
                break
        else:  # no common elements found (or result is empty)
            result.append(candidate)
    return result

例子

>>> condense([1,2,3], [10,5], [3,8,5])
[1, 2, 3, 10, 5, 8]
>>> a = [1,2,3]
>>> b = [3,4]
>>> i = [21,22]
>>> c = [88,7,8]
>>> e = [5,4]
>>> d = [3, 50]
>>> f = [8,9]
>>> g=  [9,10]
>>> h = [20,21]
>>> condense(*[a,b,c,i,e,d,f,g,h,a,c,i]*1000)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
>>> condense([1], [2, 3, 2])
[[1], [2, 3]]

性能比较

condense_*()功能来自这个问题的答案。lst_OP来自问题的输入列表（不同大小）， -来自@Blckknght 答案lst_BK的测试列表（不同大小）。查看源代码。

测量表明，基于“不相交集”和“无向图的连接组件”概念的解决方案在两种类型的输入上表现相似。

name                       time ratio comment
condense_hynekcer     5.79 msec  1.00 lst_OP
condense_hynekcer2     7.4 msec  1.28 lst_OP
condense_pillmuncher2 11.5 msec  1.99 lst_OP
condense_blckknght    16.8 msec  2.91 lst_OP
condense_jfs            26 msec  4.49 lst_OP
condense_BK           30.5 msec  5.28 lst_OP
condense_blckknght2   30.9 msec  5.34 lst_OP
condense_blckknght3   32.5 msec  5.62 lst_OP


name                       time  ratio comment
condense_blckknght     964 usec   1.00 lst_BK
condense_blckknght2   1.41 msec   1.47 lst_BK
condense_blckknght3   1.46 msec   1.51 lst_BK
condense_hynekcer2    1.95 msec   2.03 lst_BK
condense_pillmuncher2  2.1 msec   2.18 lst_BK
condense_hynekcer     2.12 msec   2.20 lst_BK
condense_BK           3.39 msec   3.52 lst_BK
condense_jfs           544 msec 563.66 lst_BK


name                       time ratio comment
condense_hynekcer     6.86 msec  1.00 lst_rnd
condense_jfs          16.8 msec  2.44 lst_rnd
condense_blckknght    28.6 msec  4.16 lst_rnd
condense_blckknght2   56.1 msec  8.18 lst_rnd
condense_blckknght3   56.3 msec  8.21 lst_rnd
condense_BK           70.2 msec 10.23 lst_rnd
condense_pillmuncher2  324 msec 47.22 lst_rnd
condense_hynekcer2     334 msec 48.61 lst_rnd

重现结果克隆要点并运行measure-performance-condense-lists.py

score 6 · Accepted Answer

这是我的方法。它使用“不相交集”的概念来首先识别哪些子列表相互重叠，然后将它们连接在一起，消除重复。

from collections import OrderedDict
def disjoint_set_find(djs, node): # disjoint set find, with path compression
    if node not in djs: # base case, node is a root of a set
        return node
    djs[node] = disjoint_set_find(djs, djs[node]) # recurse, caching results
    return djs[node]

def disjoint_set_union(djs, first, second):
    first = disjoint_set_find(djs, first)   # find root of first set
    second = disjoint_set_find(djs, second) # and of second set
    if first < second: # make smaller root the root of the new combined set
        djs[second] = first
    elif second < first:
        djs[first] = second
    # deliberately ignore the case where first == second (same set)

def condenseBK(*master_list):
    values = OrderedDict() # maps values to the first sublist containing them
    overlaps = {}  # maps sublist indexes to each other to form a disjoint set
    for i, sublist  in enumerate(master_list):
        for v in sublist:
            if v not in values: # no overlap, so just store value
                values[v] = i
            else: # overlap detected, do a disjoint set union
                disjoint_set_union(overlaps, values[v], i)
    output = [] # results
    output_map = {} # map from original indexes to output indexes
    for v, i, in values.items(): # iterate over values in order
        root = disjoint_set_find(overlaps, i)
        if root not in output_map:
            output_map[i] = len(output)
            output.append([]) # create new output sublists as necessary
        output[output_map[root]].append(v)
    return output

样本输出：

>>> a = [1,2,3]
>>> b = [3,4]
>>> c = [88,7,8]
>>> d = [3, 50]
>>> e = [5,4]
>>> f = [8,9]
>>> g = [9,10]
>>> h = [20,21]
>>> i = [21,22]
>>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
>>> condenseBK(*lst)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]

算法说明：

根据要求，这里解释了我的代码是如何工作的。

前两个函数实现了不相交集find的和union操作。数据结构是用字典映射节点到它们的父节点来实现的。任何不是字典键的节点都是集合的。该操作返回包含给定的集合的根节点。为了提高性能，我实现了“路径压缩”，它减少了随着时间的推移所需的递归步骤的数量。该操作组合了包含其参数的集合和。rootfindnodeunionfirstsecond

主要condense功能有两个部分。首先，它设置了几个数据结构，然后使用它们来构建输出。

values是一个 OrderedDictionary，它从每个值映射到包含它的第一个子列表的索引。每个值添加的顺序用于以正确的顺序生成输出。

overlaps是用于不相交集的字典。它的节点是重叠子列表的整数索引。

第一个循环填充了这两个数据结构。它们循环遍历子列表及其内容。如果一个值以前没有出现过，它就会被添加到values字典中。如果已经看到，则当前子列表与包含该值的先前子列表重叠。

为了解决重叠，代码对包含两个子列表的不相交集进行并集。

输出内置在output列表中。因为输出子列表可能比输入中的少，所以我们需要一个额外的字典来映射旧索引（来自输入）到适用于输出列表的新索引。

为了填满输出列表，我们迭代了这些值，由于使用了 OrderedDict 类，这些值按照它们添加的顺序发生。使用不相交集，它可以正确组合重叠列表。

当有很多不立即重叠的列表要处理时，该算法具有非常好的性能。例如，这组 200 个三元素列表最终全部重叠，但您只是在处理前 100 个之后才开始看到重叠出现：

lst2 = [list(range(4*i,   4*(i+1)-1)) for i in range(100)] + \
       [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]

score 4 · Accepted Answer

我确信有一种更清洁的方法可以做到这一点，但我从一条特定的道路开始，并做了我必须做的事情以使其工作而无需任何重构。

lookup = {}
out = []
index = 0
for grp in lst:
    keys = [lookup.get(num, None) for num in grp]
    keys = [key for key in keys if key is not None]
    if len(keys):
        if len(set(keys)) != 1:
            for num in grp:
                out[keys[0]].append(num)
            seen = set()
            keys = [key for key in keys if key not in seen and not seen.add(key)]
            for key in keys[1:]:
                out[keys[0]].extend(out[key])
                del out[key]
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
        else:
            for num in grp:
                lookup[num] = keys[0]
            out[keys[0]].extend(grp)
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
    else:
        out.append(grp)
        for num in grp:
            lookup[num] = index
        index += 1

print out

感谢@Steven 提供了该集合的列表缩减技术。

输出

[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]

score 4 · Accepted Answer

您的问题本质上是一个图论问题，即连接组件的问题，对每个组件的元素顺序有附加要求。

在您的程序中，所有列表的集合形成一个无向图，其中每个列表都是图中的一个节点。如果两个列表有共同的元素，我们就说它们是直接连接的；如果存在第三个列表，它们直接或间接地连接，我们就说它们是间接连接的。给定例如三个列表 [a, b], [b, c] 和 [c, d]，那么 [a, b] 和 [b, c] 直接连接，以及 [b, c] 和 [c, d]，但是 [a, b] 和 [c, d] 是间接连接的，因为虽然它们不共享公共元素，但它们都使用相同的列表 [b, c] 共享元素。

如果组中的所有节点都连接（直接或间接）并且图中没有其他节点连接到组中的任何节点，则一组节点是一个连接组件。

有一个相当简单的线性时间算法，可以在无向图中生成所有连通分量。使用它，我们可以定义一个函数来生成所有压缩不相交列表的列表，同时保持其元素的顺序：

from itertools import imap, combinations_with_replacement
from collections import defaultdict

def connected_components(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield component(node)

def condense(lists):
    tuples = combinations_with_replacement(enumerate(imap(tuple, lists)), 2)
    overlapping = ((a, b) for a, b in tuples
                          if not set(a[1]).isdisjoint(b[1]))
    seen = set()
    see = seen.add
    for component in connected_components(overlapping):
        yield [item for each in sorted(component)
                    for item in each[1]
                    if not (item in seen or see(item))]

print list(condense([[1, 2, 3], [10, 5], [3, 8, 5], [9]]))
print list(condense([[1, 2, 3], [5, 6], [3, 4], [6, 7]]))

结果：

[[1, 2, 3, 10, 5, 8], [9]]
[[5, 6, 7], [1, 2, 3, 4]]

的时间复杂度condense()是二次的，因为每个列表都必须针对每个其他列表进行测试，以确定它们是否具有共同元素。因此，性能很糟糕。我们能以某种方式改进它吗？是的。

如果两个列表有共同的元素，则它们直接相连。我们可以扭转这种关系：如果两个元素属于同一个列表，则它们是直接连接的，如果存在（直接或间接）连接到它们两者的第三个元素，则它们是间接连接的。例如，给定两个列表 [a, b] 和 [b, c]，则 a 和 b 直接连接，b 和 c 也是如此，因此 a 和 c 是间接连接的。如果我们也采用 JFSebastians 存储每个元素第一次出现的位置的想法，我们可以这样实现它：

def condense(lists):
    neighbors = defaultdict(set)
    positions = {}
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node), key=positions.get)

它仍然使用连接组件算法，但这次我们将元素视为连接，而不是列表。结果和以前一样，但由于时间复杂度现在是线性的，它运行得更快。

score 4 · Accepted Answer

我试图编写一个快速且易读的解决方案。如果我知道的话，它永远不会比其他解决方案慢很多，但有时可能会快得多，因为它针对更长的子列表或作为任何现有组子集的许多子列表进行了额外优化。（这是由“我有很多列表但不是很多不同的列表”问题的文本所激发的。）该代码仅对可能比原始数据少得多的压缩数据使用较少的内存。例如，它可以与从实时过程中收集数据的生成器一起工作。复杂度的估计是 O(n log n)。我认为没有使用排序的算法可以具有线性复杂性。

def condense(lists):
    groups = {}         # items divided into groups {id(the_set): the_set,...}
    members = {}        # mapping from item to group
    positions = {}      # mapping from item to sequential ordering
    iposition = 0       # counter for positions
    for sublist in lists:
        if not sublist or members.get(sublist[0], set()).issuperset(sublist):
            continue    # speed-up condition if all is from one group
        common = set([x for x in sublist if x in members])
        if common:      # any common group can be a destination for merge
            dst_group = members[common.pop()]
            common = common - dst_group   # are more groups common for sublist?
            while common:
                grp = members[common.pop()]
                if len(grp) > len(dst_group):   # merge shorter into longer grp
                    grp, dst_group = dst_group, grp
                dst_group.update(grp)
                for item in grp:
                    members[item] = dst_group
                del groups[id(grp)]
                common = common - dst_group
        else:           # or create a new group if nothing common
            dst_group = set()
            groups[id(dst_group)] = dst_group
        newitems = []
        for item in sublist:    # merge also new items
            if item not in positions:
                positions[item] = iposition
                iposition += 1
                newitems.append(item)
                members[item] = dst_group
        dst_group.update(newitems)
    return [sorted(x, key=positions.get) for x in groups.values()]

对于超过大约 8 个项目的子列表，它比 pillmuncher2 更快，因为它可以一起处理更多项目。对于具有许多相似子列表或作为任何现有组的子集的许多子列表的列表也非常快。对于 lst_OP，它比 pillmuncher2 快 25%，但对于 lst_BK，慢 15%。

具有长子列表的测试数据示例是[list(range(30)) + [-i] for i in range(100)].

我故意写了“common = common - dst_group”而不是使用集合运算符-=或“set.difference_update”，因为如果右侧的集合比左侧的集合大得多，则就地更新无效。

修改了 Pillmuncher 的解决方案，以便于阅读。由于用 list.append 替换了生成器，因此修改比原来的要慢一些。可能是最易读的快速解决方案。

# Modified pillmuncher's solution
from collections import defaultdict

def condense(lists):
    neighbors = defaultdict(set)  # mapping from items to sublists
    positions = {}                # mapping from items to sequential ordering
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    see = seen.add
    for node in neighbors:
        if node not in seen:
            unseen = set([node])      # this is a "todo" set
            next_unseen = unseen.pop  # method alias, not called now
            group = []                # collects the output
            while unseen:
                node = next_unseen()
                see(node)
                unseen |= neighbors[node] - seen
                group.append(node)
            yield sorted(group, key=positions.get)

score 3 · Accepted Answer

class List(list): pass

rv = dict()

def condense_step():
    """find and merge one overlapping pair in rv"""
    for i, iv in rv.items():
        for j, jv in rv.items():
            if i != j and i.intersection(j):
                m = i.union(j)
                del rv[i]
                del rv[j]
                rv.setdefault(m, [])
                rv[m] += iv
                rv[m] += jv
                return True

def unique(l):
    """flatten list-of-lists excluding duplicates"""
    seen = set()
    for i in sum(l, []):
        if i not in seen:
            seen.add(i)
            yield i

def condense(inp):
    rv.clear()
    inp = map(List, inp)

    for i in range(len(inp)):
        inp[i].order = i
        rv.setdefault(frozenset(inp[i]), [])
        rv[frozenset(inp[i])].append(inp[i])

    while condense_step():
        pass

    for v in rv.values():
        v.sort(key=lambda x: x.order)

    return [list(unique(i)) for i in rv.values()]

score 2 · Accepted Answer

此解决方案仅使用有序字典。如果希望原始副本保持不变，则需要
deepcopy() 。

from collections import OrderedDict
from copy import deepcopy

def treat(passed_list):
    L = deepcopy(passed_list)

    dic = OrderedDict()
    for subl in L:
        for x in subl:
            if x not in dic:
                dic[x] =  subl
    print 'dic at start'
    print '\n'.join('%-3s : %s' % (a,dic[a])
                    for a in dic) + '\n'

    for sublist in L:
        short = []
        short.extend(el for el in sublist
                     if el not in short)
        seen  = []
        for k,val in dic.iteritems():
            if val is sublist:
                break
            if k in short:
                if val not in seen:
                    seen.append(val)

        sumseen = []
        for elseen in seen:
            for y in elseen:
                sumseen.append(y)
                dic[y] = sumseen

        if seen:
            for el in sublist:
                if el not in sumseen:
                    sumseen.append(el)
                    dic[el] = sumseen

        sublist[:] = short

    cumul = []
    cumul.extend(lu for lu in dic.itervalues()
                 if lu not in cumul)
    return cumul


plus = [[1,2,3,2,1],[10,5,5,5,10],
        [8,5,3,3,5],[45,50,12,45,40,12]]

lst = [[1,2,3], [10,5], [3,8,5]]

for one_list in (plus,lst):
    print 'one_list before == %r\n' % one_list
    print 'treat(one_list) == %r\n' % treat(one_list)
    print 'one_list after  == %r\n' % one_list
    print '===================================='

结果

one_list before == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]

dic at start
1   : [1, 2, 3, 2, 1]
2   : [1, 2, 3, 2, 1]
3   : [1, 2, 3, 2, 1]
10  : [10, 5, 5, 5, 10]
5   : [10, 5, 5, 5, 10]
8   : [8, 5, 3, 3, 5]
45  : [45, 50, 12, 45, 40, 12]
50  : [45, 50, 12, 45, 40, 12]
12  : [45, 50, 12, 45, 40, 12]
40  : [45, 50, 12, 45, 40, 12]

treat(one_list) == [[1, 2, 3, 10, 5, 8], [45, 50, 12, 40]]

one_list after  == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]

====================================
one_list before == [[1, 2, 3], [10, 5], [3, 8, 5]]

dic at start
1   : [1, 2, 3]
2   : [1, 2, 3]
3   : [1, 2, 3]
10  : [10, 5]
5   : [10, 5]
8   : [3, 8, 5]

treat(one_list) == [[1, 2, 3, 10, 5, 8]]

one_list after  == [[1, 2, 3], [10, 5], [3, 8, 5]]

====================================

该解决方案有一个不便之处：它比 JF Sebastian 的解决方案慢 2 到 3 倍。

python - 用“精简”列表替换列表列表，同时保持顺序

7 回答 7

例子

性能比较

Related

Reference