python - Python：基于交集的简单列表合并

Question

考虑有一些整数列表：

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

问题是合并具有至少一个共同元素的列表。因此，仅给定部分的结果将如下所示：

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

对大数据（元素只是数字）执行此操作的最有效方法是什么？tree结构需要考虑吗？我现在通过将列表转换sets为交叉点并迭代交叉点来完成这项工作，但这很慢！此外，我有一种如此初级的感觉！此外，该实现缺少一些东西（未知），因为某些列表有时仍未合并！话虽如此，如果您提出自我实现，请慷慨并提供一个简单的示例代码 [显然Python是我最喜欢的 :)] 或伪代码。
更新 1： 这是我使用的代码：

#--------------------------------------
lsts = [[0,1,3],
        [1,0,3,4,5,10,11],
        [2,8],
        [3,1,0,16]];
#--------------------------------------

功能是（越野车！！）：

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i+1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else: j += 1                        #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------

结果是：

#--------------------------------------
>>> merge(lsts)
>>> [0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------

更新 2： 根据我的经验，下面Niklas Baumstark给出的代码对于简单的情况来说要快一些。尚未测试“Hooked”给出的方法，因为它是完全不同的方法（顺便说一句，看起来很有趣）。所有这些的测试程序可能真的很难或不可能确保结果。我将使用的真实数据集如此庞大和复杂，因此仅通过重复是不可能追踪任何错误的。也就是说，我需要对方法的可靠性 100% 满意，然后才能将其作为模块推送到大型代码中的位置。就目前而言 Niklas的方法更快，简单集合的答案当然是正确的。
但是，我如何确定它适用于真正的大型数据集？因为我将无法直观地跟踪错误！

更新 3： 请注意，对于这个问题，方法的可靠性比速度更重要。我希望最终能够将 Python 代码转换为 Fortran 以获得最大性能。

更新 4：
这篇文章有很多有趣的观点，并且慷慨地给出了答案，建设性的意见。我建议通读一遍。请接受我对问题的发展、惊人的答案以及建设性的评论和讨论的赞赏。

score 26 · Accepted Answer

我的尝试：

def merge(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = True
                    common |= x
            results.append(common)
        sets = results
    return sets

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print merge(lst)

基准：

import random

# adapt parameters to your own usage scenario
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5

if False:  # change to true to generate the test data file (takes a while)
    with open("/tmp/test.txt", "w") as f:
        lists = []
        classes = [
            range(class_size * i, class_size * (i + 1)) for i in range(class_count)
        ]
        for c in classes:
            # distribute each class across ~300 lists
            for i in xrange(list_count_per_class):
                lst = []
                if random.random() < large_list_probability:
                    size = random.choice(large_list_sizes)
                else:
                    size = random.choice(small_list_sizes)
                nums = set(c)
                for j in xrange(size):
                    x = random.choice(list(nums))
                    lst.append(x)
                    nums.remove(x)
                random.shuffle(lst)
                lists.append(lst)
        random.shuffle(lists)
        for lst in lists:
            f.write(" ".join(str(x) for x in lst) + "\n")

setup = """
# Niklas'
def merge_niklas(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

# Rik's
def merge_rik(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set
    return results

# katrielalex's
def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

import networkx

def merge_katrielalex(lsts):
    g = networkx.Graph()
    for lst in lsts:
        for edge in pairs(lst):
            g.add_edge(*edge)
    return networkx.connected_components(g)

# agf's (optimized)
from collections import deque

def merge_agf_optimized(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

# agf's (simple)
def merge_agf_simple(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

# alexis'
def merge_alexis(data):
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin

                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin

                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest

    # Filter out the empty bins
    have = [m for m in data if m]
    return have

def locatebin(bins, n):
    while bins[n] != n:
        n = bins[n]
    return n

lsts = []
size = 0
num = 0
max = 0
for line in open("/tmp/test.txt", "r"):
    lst = [int(x) for x in line.split()]
    size += len(lst)
    if len(lst) > max:
        max = len(lst)
    num += 1
    lsts.append(lst)
"""

setup += """
print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
""".format(class_count=class_count)

import timeit
print "niklas"
print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
print "rik"
print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
print "katrielalex"
print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
print "agf (1)"
print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
print "agf (2)"
print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
print "alexis"
print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)

这些时间显然取决于基准测试的特定参数，例如类数、列表数、列表大小等。根据您的需要调整这些参数以获得更有用的结果。

下面是我机器上不同参数的一些示例输出。他们表明，所有算法都有其优点和缺点，具体取决于它们获得的输入类型：

=====================
# many disjoint classes, large lists
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
=====================

niklas
5000 lists, 50 equally distributed classes, average size 298, max size 999
4.80084705353
rik
5000 lists, 50 equally distributed classes, average size 298, max size 999
9.49251699448
katrielalex
5000 lists, 50 equally distributed classes, average size 298, max size 999
21.5317108631
agf (1)
5000 lists, 50 equally distributed classes, average size 298, max size 999
8.61671280861
agf (2)
5000 lists, 50 equally distributed classes, average size 298, max size 999
5.18117713928
=> alexis
=> 5000 lists, 50 equally distributed classes, average size 298, max size 999
=> 3.73504281044

===================
# less number of classes, large lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
===================

niklas
4500 lists, 15 equally distributed classes, average size 296, max size 999
1.79993700981
rik
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.58237695694
katrielalex
4500 lists, 15 equally distributed classes, average size 296, max size 999
19.5465381145
agf (1)
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.75445604324
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 296, max size 999
=> 1.77850699425
alexis
4500 lists, 15 equally distributed classes, average size 296, max size 999
3.23530197144

===================
# less number of classes, smaller lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.1
===================

niklas
4500 lists, 15 equally distributed classes, average size 95, max size 997
0.773697137833
rik
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.0523750782
katrielalex
4500 lists, 15 equally distributed classes, average size 95, max size 997
6.04466891289
agf (1)
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.20285701752
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 95, max size 997
=> 0.714507102966
alexis
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.1286110878

score 13 · Accepted Answer

我试图总结在这个问题和重复的问题中关于这个主题的所有所说和所做的一切。

我试图测试和计时每个解决方案（所有代码在这里）。

测试

这是TestCase来自测试模块：

class MergeTestCase(unittest.TestCase):

    def setUp(self):
        with open('./lists/test_list.txt') as f:
            self.lsts = json.loads(f.read())
        self.merged = self.merge_func(deepcopy(self.lsts))

    def test_disjoint(self):
        """Check disjoint-ness of merged results"""
        from itertools import combinations
        for a,b in combinations(self.merged, 2):
            self.assertTrue(a.isdisjoint(b))

    def test_coverage(self):    # Credit to katrielalex
        """Check coverage original data"""
        merged_flat = set()
        for s in self.merged:
            merged_flat |= s

        original_flat = set()
        for lst in self.lsts:
            original_flat |= set(lst)

        self.assertTrue(merged_flat == original_flat)

    def test_subset(self):      # Credit to WolframH
        """Check that every original data is a subset"""
        for lst in self.lsts:
            self.assertTrue(any(set(lst) <= e for e in self.merged))

这个测试假设一个集合列表作为结果，所以我无法测试几个与列表一起工作的解决方案。

我无法测试以下内容：

katrielalex
steabert

在我可以测试的那些中，有两个失败了：

  -- Going to test: agf (optimized) --
Check disjoint-ness of merged results ... FAIL

  -- Going to test: robert king --
Check disjoint-ness of merged results ... FAIL

定时

性能与所采用的数据测试密切相关。

到目前为止，三个答案试图为他们和其他人的解决方案计时。由于他们使用了不同的测试数据，因此得出了不同的结果。

Niklas 基准测试非常灵活。有了他的基准，人们可以做不同的测试来改变一些参数。

我使用了他在自己的答案中使用的相同的三组参数，并将它们放在三个不同的文件中：

filename = './lists/timing_1.txt'
class_count = 50,
class_size = 1000,
list_count_per_class = 100,
large_list_sizes = (100, 1000),
small_list_sizes = (0, 100),
large_list_probability = 0.5,

filename = './lists/timing_2.txt'
class_count = 15,
class_size = 1000,
list_count_per_class = 300,
large_list_sizes = (100, 1000),
small_list_sizes = (0, 100),
large_list_probability = 0.5,

filename = './lists/timing_3.txt'
class_count = 15,
class_size = 1000,
list_count_per_class = 300,
large_list_sizes = (100, 1000),
small_list_sizes = (0, 100),
large_list_probability = 0.1,

这是我得到的结果：

从文件：timing_1.txt

Timing with: >> Niklas << Benchmark
Info: 5000 lists, average size 305, max size 999

Timing Results:
10.434  -- alexis
11.476  -- agf
11.555  -- Niklas B.
13.622  -- Rik. Poggi
14.016  -- agf (optimized)
14.057  -- ChessMaster
20.208  -- katrielalex
21.697  -- steabert
25.101  -- robert king
76.870  -- Sven Marnach
133.399  -- hochl

从文件：timing_2.txt

Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 305, max size 999

Timing Results:
8.247  -- Niklas B.
8.286  -- agf
8.637  -- Rik. Poggi
8.967  -- alexis
9.090  -- ChessMaster
9.091  -- agf (optimized)
18.186  -- katrielalex
19.543  -- steabert
22.852  -- robert king
70.486  -- Sven Marnach
104.405  -- hochl

从文件：timing_3.txt

Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 98, max size 999

Timing Results:
2.746  -- agf
2.850  -- Niklas B.
2.887  -- Rik. Poggi
2.972  -- alexis
3.077  -- ChessMaster
3.174  -- agf (optimized)
5.811  -- katrielalex
7.208  -- robert king
9.193  -- steabert
23.536  -- Sven Marnach
37.436  -- hochl

使用Sven的测试数据，我得到以下结果：

Timing with: >> Sven << Benchmark
Info: 200 lists, average size 10, max size 10

Timing Results:
2.053  -- alexis
2.199  -- ChessMaster
2.410  -- agf (optimized)
3.394  -- agf
3.398  -- Rik. Poggi
3.640  -- robert king
3.719  -- steabert
3.776  -- Niklas B.
3.888  -- hochl
4.610  -- Sven Marnach
5.018  -- katrielalex

最后，我得到了Agf的基准：

Timing with: >> Agf << Benchmark
Info: 2000 lists, average size 246, max size 500

Timing Results:
3.446  -- Rik. Poggi
3.500  -- ChessMaster
3.520  -- agf (optimized)
3.527  -- Niklas B.
3.527  -- agf
3.902  -- hochl
5.080  -- alexis
15.997  -- steabert
16.422  -- katrielalex
18.317  -- robert king
1257.152  -- Sven Marnach

正如我在开头所说的，所有代码都可以在这个 git 存储库中找到。所有合并函数都在一个名为的文件中core.py，其中名称以结尾的每个函数_merge都将在测试期间自动加载，因此添加/测试/改进自己的解决方案应该不难。

让我也知道是否有问题，这是很多编码，我可以用一些新鲜的眼睛:)

score 7 · Accepted Answer

使用矩阵操作

让我用以下评论作为这个答案的开头：

这是错误的做法。它容易出现数值不稳定，并且比现有的其他方法慢得多，使用风险自负。

话虽如此，我还是忍不住从动态的角度来解决这个问题（我希望你会对这个问题有一个全新的视角）。理论上这应该一直有效，但特征值计算经常会失败。这个想法是将您的列表视为从行到列的流。如果两行共享一个共同值，则它们之间存在连接流。如果我们将这些流视为水，我们会看到当它们之间有一条连接路径时，它们会聚集成小水池。为简单起见，我将使用一个较小的集合，尽管它也适用于您的数据集：

from numpy import where, newaxis
from scipy import linalg, array, zeros

X = [[0,1,3],[2],[3,1]]

我们需要将数据转换为流程图。如果第i行流入值j，我们将其放入矩阵中。这里我们有 3 行和 4 个唯一值：

A = zeros((4,len(X)), dtype=float)
for i,row in enumerate(X):
    for val in row: A[val,i] = 1

通常，您需要更改4以捕获您拥有的唯一值的数量。如果集合是我们所拥有的从 0 开始的整数列表，您可以简单地将其设为最大数。我们现在执行特征值分解。准确地说是 SVD，因为我们的矩阵不是正方形的。

S  = linalg.svd(A)

我们只想保留这个答案的 3x3 部分，因为它将代表池的流动。实际上我们只想要这个矩阵的绝对值；我们只关心这个集群空间中是否有流量。

M  = abs(S[2])

我们可以将此矩阵 M 视为马尔可夫矩阵，并通过行归一化使其显式化。一旦我们有了这个，我们就计算（左）特征值分解。这个矩阵的。

M /=  M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)
V = abs(V)

现在，一个不连通（非遍历）马尔可夫矩阵有一个很好的属性，即对于每个非连通簇，都有一个统一的特征值。与这些统一值相关的特征向量是我们想要的：

idx = where(U > .999)[0]
C = V.T[idx] > 0

由于前面提到的数值不稳定性，我必须使用 .999。至此，我们完成了！每个独立的集群现在可以拉出相应的行：

for cluster in C:
    print where(A[:,cluster].sum(axis=1))[0]

正如预期的那样：

[0 1 3]
[2]

换成X你的lst，你会得到：[ 0 1 3 4 5 10 11 16] [2 8]。

附录

为什么这可能有用？我不知道您的基础数据来自哪里，但是当连接不是绝对的时会发生什么？假设行1有380% 的时间进入 - 你会如何概括这个问题？上面的 flow 方法可以正常工作，并且完全由该.999值参数化，离统一越远，关联越松。

视觉表现

由于一张图片值 1K 字，这里分别是我的示例和您的示例的矩阵 A 和 V 的图lst。注意 in 如何V分裂成两个簇（它是一个块对角矩阵，排列后有两个块），因为对于每个示例只有两个唯一列表！

我的例子您的样本数据

更快的实施

事后看来，我意识到你可以跳过 SVD 步骤，只计算一个 decomp：

M = dot(A.T,A)
M /=  M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)

这种方法的优点（除了速度）是M现在是对称的，因此计算可以更快、更准确（无需担心虚值）。

score 6 · Accepted Answer

编辑：好的，其他问题已经结束，在这里发布。

好问题！如果您将其视为图中的连接组件问题，则要简单得多。下面的代码使用了优秀的networkx图形库和这个问题pairs中的函数。

def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx
g = networkx.Graph()
for sub_list in lists:
    for edge in pairs(sub_list):
            g.add_edge(*edge)

networkx.connected_components(g)
[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

解释

我们创建一个新的（空）图g。对于中的每个子列表lists，将其元素视为图的节点并在它们之间添加一条边。（因为我们只关心连通性，我们不需要添加所有的边——只需要相邻的！）请注意，add_edge需要两个对象，将它们视为节点（如果它们不存在则添加它们），然后添加他们之间的边缘。

然后，我们只需找到图的连通分量——一个已解决的问题！-- 并将它们作为我们的相交集输出。

score 6 · Accepted Answer

这是我的答案。我还没有对照今天的一批答案检查它。

基于交集的算法是 O(N^2)，因为它们检查每个新集合与所有现有集合，所以我使用了一种方法，索引每个数字并运行接近 O(N)（如果我们接受字典查找是O(1))。然后我跑了基准测试，感觉自己像个白痴，因为它跑得慢，但仔细检查后发现测试数据最终只有少数不同的结果集，所以二次算法没有太多工作要做做。用超过 10-15 个不同的 bin 对其进行测试，我的算法要快得多。尝试使用 50 多个不同的 bin 测试数据，它的速度非常快。

（编辑：基准测试的运行方式也存在问题，但我的诊断是错误的。我更改了代码以使用重复测试的运行方式）。

def mergelists5(data):
    """Check each number in our arrays only once, merging when we find
    a number we have seen before.
    """

    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data ]  # Convert to sets    
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue # already in the same bin

                if dest > r:
                    dest, r = r, dest   # always merge into the smallest bin

                data[dest].update(data[r]) 
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest 

    # Filter out the empty bins
    have = [ m for m in data if m ]
    print len(have), "groups in result"
    return have


def locatebin(bins, n):
    """
    Find the bin where list n has ended up: Follow bin references until
    we find a bin that has not moved.
    """
    while bins[n] != n:
        n = bins[n]
    return n

score 3 · Accepted Answer

这个新功能只进行最少必要数量的不相交测试，这是其他类似解决方案无法做到的。它还使用 adeque来避免尽可能多的线性时间操作，例如从列表的早期进行列表切片和删除。

from collections import deque

def merge(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

给定数据集中的集合之间的重叠越少，与其他函数相比，这将做得越好。

这是一个示例案例。如果你有4套，你需要比较：

1, 2
1、3
1、4
2、3
2、4
3、4

如果 1 与 3 重叠，则需要重新测试 2 以查看它现在是否与 1 重叠，以便安全地跳过针对 3 的测试 2。

有两种方法可以解决这个问题。首先是在每次重叠和合并后重新开始对第 1 组与其他组的测试。第二种是通过比较 1 和 4 来继续测试，然后返回并重新测试。后者导致更少的不相交测试，因为更多的合并发生在一次通过中，所以在重新测试通过时，剩下的测试集更少。

问题是跟踪哪些集合必须重新测试。在上面的示例中，需要针对 2 而不是针对 4 重新测试 1，因为在第一次测试 4 之前 1 已经处于当前状态。

计数器允许对此disjoint进行跟踪。

我的回答无助于找到一种改进的算法以重新编码为 FORTRAN 的主要问题；在我看来，这是用 Python 实现算法的最简单、最优雅的方法。

根据我的测试（或已接受答案中的测试），它比下一个最快的解决方案略快（高达 10%）。

def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

不需要其他实现中使用的非 Pythonic 计数器 ( i, range) 或复杂的突变 ( del, pop, insert)。它只使用简单的迭代，以最简单的方式合并重叠的集合，并在每次遍历数据时构建一个新列表。

我的（更快更简单）版本的测试代码：

import random

tenk = range(10000)
lsts = [random.sample(tenk, random.randint(0, 500)) for _ in range(2000)]

setup = """
def merge0(lists):
  newsets, sets = [set(lst) for lst in lists if lst], []
  while len(sets) != len(newsets):
    sets, newsets = newsets, []
    for aset in sets:
      for eachset in newsets:
        if not aset.isdisjoint(eachset):
          eachset.update(aset)
          break
      else:
        newsets.append(aset)
  return newsets

def merge1(lsts):
  sets = [set(lst) for lst in lsts if lst]
  merged = 1
  while merged:
    merged = 0
    results = []
    while sets:
      common, rest = sets[0], sets[1:]
      sets = []
      for x in rest:
        if x.isdisjoint(common):
          sets.append(x)
        else:
          merged = 1
          common |= x
      results.append(common)
    sets = results
  return sets

lsts = """ + repr(lsts)

import timeit
print timeit.timeit("merge0(lsts)", setup=setup, number=10)
print timeit.timeit("merge1(lsts)", setup=setup, number=10)

score 2 · Accepted Answer

这是一个使用不相交集数据结构（特别是不相交森林）的实现，这要归功于comingstorm关于合并甚至具有一个共同元素的集合的提示。我正在使用路径压缩来稍微提高（~5%）速度；这不是完全必要的（它可以防止find尾递归，这可能会减慢速度）。请注意，我使用 adict来表示不相交的森林；鉴于数据是ints，数组也可以工作，尽管它可能不会快多少。

def merge(data):
    parents = {}
    def find(i):
        j = parents.get(i, i)
        if j == i:
            return i
        k = find(j)
        if k != j:
            parents[i] = k
        return k
    for l in filter(None, data):
        parents.update(dict.fromkeys(map(find, l), find(l[0])))
    merged = {}
    for k, v in parents.items():
        merged.setdefault(find(v), []).append(k)
    return merged.values()

这种方法可与 Rik 基准测试中的其他最佳算法相媲美。

score 1 · Accepted Answer

这将是我更新的方法：

def merge(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i,res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0,i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set

    return results

注意：在合并过程中，空列表将被删除。

更新：可靠性。

您需要两次测试才能获得 100% 的成功可靠性：

检查所有结果集是否相互脱节：

merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]

from itertools import combinations
for a,b in combinations(merged,2):
    if not a.isdisjoint(b):
        raise Exception(a,b)    # just an example

检查合并集是否覆盖原始数据。（由 katrielalex 建议）

我认为这需要一些时间，但如果你想 100% 确定，也许这是值得的。

score 1 · Accepted Answer

这比 Niklas 提供的解决方案慢（我在 test.txt 上得到 3.9 秒而不是他的解决方案的 0.5 秒），但产生相同的结果并且可能更容易在例如 Fortran 中实现，因为它不使用集合，仅对元素总数进行排序，然后对所有元素进行一次运行。

它返回一个包含合并列表 id 的列表，因此还跟踪空列表，它们保持未合并。

def merge(lsts):
        # this is an index list that stores the joined id for each list
        joined = range(len(lsts))
        # create an ordered list with indices
        indexed_list = sorted((el,index) for index,lst in enumerate(lsts) for el in lst)
        # loop throught the ordered list, and if two elements are the same and
        # the lists are not yet joined, alter the list with joined id
        el_0,idx_0 = None,None
        for el,idx in indexed_list:
                if el == el_0 and joined[idx] != joined[idx_0]:
                        old = joined[idx]
                        rep = joined[idx_0]
                        joined = [rep if id == old else id for id in joined]
                el_0, idx_0 = el, idx
        return joined

score 1 · Accepted Answer

这是一个函数（Python 3.1），用于检查合并函数的结果是否正常。它检查：

结果集是否不相交？（联合的元素数==元素数之和）
结果集的元素是否与输入列表的元素相同？
每个输入列表都是结果集的子集吗？
每个结果集是否都是最小的，即不可能将它分成两个较小的集？
它不检查是否有空的结果集 - 我不知道你是否想要它们......

.

from itertools import chain

def check(lsts, result):
    lsts = [set(s) for s in lsts]
    all_items = set(chain(*lsts))
    all_result_items = set(chain(*result))
    num_result_items = sum(len(s) for s in result)
    if num_result_items != len(all_result_items):
        print("Error: result sets overlap!")
        print(num_result_items, len(all_result_items))
        print(sorted(map(len, result)), sorted(map(len, lsts)))
    if all_items != all_result_items:
        print("Error: result doesn't match input lists!")
    if not all(any(set(s).issubset(t) for t in result) for s in lst):
        print("Error: not all input lists are contained in a result set!")

    seen = set()
    todo = list(filter(bool, lsts))
    done = False
    while not done:
        deletes = []
        for i, s in enumerate(todo): # intersection with seen, or with unseen result set, is OK
            if not s.isdisjoint(seen) or any(t.isdisjoint(seen) for t in result if not s.isdisjoint(t)):
                seen.update(s)
                deletes.append(i)
        for i in reversed(deletes):
            del todo[i]
        done = not deletes
    if todo:
        print("Error: A result set should be split into two or more parts!")
        print(todo)

score 1 · Accepted Answer

只是为了好玩...

def merge(mylists):
    results, sets = [], [set(lst) for lst in mylists if lst]
    upd, isd, pop = set.update, set.isdisjoint, sets.pop
    while sets:
        if not [upd(sets[0],pop(i)) for i in xrange(len(sets)-1,0,-1) if not isd(sets[0],sets[i])]:
            results.append(pop(0))
    return results

和我重写的最佳答案

def merge(lsts):
  sets = map(set,lsts)
  results = []
  while sets:
    first, rest = sets[0], sets[1:]
    merged = False
    sets = []
    for s in rest:
      if s and s.isdisjoint(first):
        sets.append(s)
      else:
        first |= s
        merged = True
    if merged: sets.append(first)
    else: results.append(first)
  return results

score 1 · Accepted Answer

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx as nx
g = nx.Graph()

for sub_list in lists:
    for i in range(1,len(sub_list)):
        g.add_edge(sub_list[0],sub_list[i])

print nx.connected_components(g)
#[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

表现：

5000 lists, 5 classes, average size 74, max size 1000
15.2264976415

合并1的表现：

print timeit.timeit("merge1(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26998780571

所以它比最快的慢 11 倍。但代码更简单易读！

score 1 · Accepted Answer

首先，我不确定基准是否公平：

将以下代码添加到我的函数的开头：

c = Counter(chain(*lists))
    print c[1]
"88"

这意味着在所有列表中的所有值中，只有 88 个不同的值。通常在现实世界中重复很少见，并且您会期望更多不同的值。（当然我不知道你的数据来自哪里，所以无法做出假设）。

因为重复更常见，这意味着集合不太可能不相交。这意味着 set.isdisjoint() 方法会快得多，因为只有在几次测试之后，它才会发现集合不是不相交的。

说了这么多，我确实相信使用不相交的方法无论如何都是最快的，但我只是说，而不是快 20 倍，也许它们应该只比具有不同基准测试的其他方法快 10 倍。

无论如何，我想我会尝试一种稍微不同的技术来解决这个问题，但是合并排序太慢了，这种方法比使用基准测试的两种最快方法慢大约 20 倍：

我以为我会订购所有东西

import heapq
from itertools import chain
def merge6(lists):
    for l in lists:
        l.sort()
    one_list = heapq.merge(*[zip(l,[i]*len(l)) for i,l in enumerate(lists)]) #iterating through one_list takes 25 seconds!!
    previous = one_list.next()

    d = {i:i for i in range(len(lists))}
    for current in one_list:
        if current[0]==previous[0]:
            d[current[1]] = d[previous[1]]
        previous=current

    groups=[[] for i in range(len(lists))]
    for k in d:
        groups[d[k]].append(lists[k]) #add a each list to its group

    return [set(chain(*g)) for g in groups if g] #since each subroup in each g is sorted, it would be faster to merge these subgroups removing duplicates along the way.


lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
print merge6(lists)
"[set([1, 2, 3, 5, 6]), set([8, 9, 10]), set([11, 12, 13])]""



import timeit
print timeit.timeit("merge1(lsts)", setup=setup, number=10)
print timeit.timeit("merge4(lsts)", setup=setup, number=10)
print timeit.timeit("merge6(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26732238315
5000 lists, 5 classes, average size 74, max size 1000
1.16062907437
5000 lists, 5 classes, average size 74, max size 1000
30.7257182826

score 0 · Accepted Answer

用于flag确保您获得最终的互斥结果

def merge(lists): 
    while(1):
        flag=0
        for i in range(0,len(lists)):
            for j in range(i+1,len(lists)):
                if len(intersection(lists[i],lists[j]))!=0:
                    lists[i]=union(lists[i],lists[j])
                    lists.remove(lists[j])
                    flag+=1
                    break
        if flag==0:
            break
    return lists

score 0 · Accepted Answer

from itertools import combinations


def merge(elements_list):
    d = {index: set(elements) for index, elements in enumerate(elements_list)}

    while any(not set.isdisjoint(d[i], d[j]) for i, j in combinations(d.keys(), 2)):
        merged = set()
        for i, j in combinations(d.keys(), 2):
            if not set.isdisjoint(d[i], d[j]):
                d[i] = set.union(d[i], d[j])
                merged.add(j)

        for k in merged:
            d.pop(k)    

    return [v for v in d.values() if v]

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]

print(merge(lst))

score 0 · Accepted Answer

我发现@Niklas B. 的回答真的很有帮助……但我花了一段时间才通读并理解其中的逻辑。这是用新变量名和更多解释重新编写完全相同的代码......以帮助那里的其他 N00bs！

def mergeUntilOnlyDisjointSetsRemain(_listsOfLists):

    """Function for merging lists until there are only disjoint sets"""
    
    """Imagine this algorithm as if it were processing train cars filled with
    integers. It takes the first car of the train, separates it from the rest, 
    and then compares the first car to each subsequent car.
    Start by renaming the first car to 'common'
    If the two train cars have a common integer, you merge the two cars into
    common, and continue down the line until you reach the end of the train.
    Once you reach the end of the train, place the common car in results, (which
    is essentially a collection of train cars that have already been compared 
    to all other cars).
    You can exit the loop as soon as you get to the end of the train without
    merging any of the cars. This is controlled by the continueMerge variable.
    This variable is only reset to True after a merge operation.
    """

    # Start by creating a trainCar(i.e. a set) from each list in our listOfLists
    freightTrain = [set(trainCar) for trainCar in _listsOfLists if trainCar]

    # This continueMerge means that we have not yet compared all cars in the train.
    continueMerge = True
    while continueMerge:
        # Reset the while loop trigger.
        continueMerge = False
        # Create a fresh empty list of cars that have already been cross checked
        crossCheckedCars = []
        # While there are still elements in freightTrain
        while freightTrain:
            # Split the freightTrain up, into first car vs all the remaining cars
            commonFirstTrainCar = freightTrain[0]
            remainingCars = freightTrain[1:]
            # The freightTrain is now empty
            freightTrain = []
            # Iterate over all the remaining traincars
            for currentTrainCar in remainingCars:
                # If there are not any common integers with the first car...
                if currentTrainCar.isdisjoint(commonFirstTrainCar):
                    # Add each train car back onto the freightTrain
                    freightTrain.append(currentTrainCar)
                # But if they share a common integer...
                else:
                    # Trigger the reset switch to continueMerging cars
                    continueMerge = True
                    # and Join he cars together
                    commonFirstTrainCar |= currentTrainCar
            # Once we have checked commonFirstTrainCar, remove it from the 
            # freightTrain and place it in crossCheckedCars
            crossCheckedCars.append(commonFirstTrainCar)

        # Now we have compared the first car to all subsequent cars
        # (... but we aren't finished because the 5th and 7th cars might have
        # had a common integer with each other... but only 1st and 5th cars
        # shared an integer the first time around... so the 1st and 5th cars
        # were merged, but the 7th car is still alone!)

        # Reset the system by creating a new freightTrain
        freightTrain = crossCheckedCars

    # Post-process the freight train to turn it into lists instead of sets
    listsForReturnTripHome = []
    for processedTraincar in freightTrain:
        listsForReturnTripHome.append(list(processedTraincar))
    return listsForReturnTripHome

score -1 · Accepted Answer

我的解决方案适用于小型列表，并且在没有依赖关系的情况下非常易读。

def merge_list(starting_list):
    final_list = []
    for i,v in enumerate(starting_list[:-1]):
        if set(v)&set(starting_list[i+1]):
            starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
        else:
            final_list.append(v)
    final_list.append(starting_list[-1])
    return final_list

对其进行基准测试：

列表 = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

%timeit 合并列表（列表）

100000 次循环，最佳 3 次：每个循环 4.9 µs

score -1 · Accepted Answer

这可以通过使用 union-find 算法在 O(n) 中解决。给定数据的前两行，在联合查找中使用的边是以下对：(0,1),(1,3),(1,0),(0,3),(3,4 ),(4,5),(5,10)

python - Python：基于交集的简单列表合并

18 回答 18

测试

定时

使用矩阵操作

解释

Related

Reference