python - Finding overlapping sets of columns/rows

Question

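Background: this question takes the problem from another thread one step further.

Suppose I have a 2D array whose columns are partitioned into groups. For simplicity, we can assume the array contains int values, as follows: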

np.random.randint(3,size=(2,10))   

# Column indices:
#       0  1  2  3  4  5  6  7  8  9                     
array([[0, 2, 2, 2, 1, 1, 0, 1, 1, 2],
       [1, 1, 0, 1, 1, 0, 2, 1, 1, 0]])

As an example of a partition of the column indices, we can choose the following:

# Partitioning the column indices of the previous array:

my_partition = {}
my_partition['first']  = [0,1,2]
my_partition['second'] = [3,4]
my_partition['third']  = [5,6,7]
my_partition['fourth'] = [8, 9]

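I want to find the groups of column-index sets that contain columns with identical values. In the example above, some examples of such groups are: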

# The following sets include indices for a common column vector with values [2,0]^T
group['a'] = ['first', 'fourth'] 

# The following sets include indices for a common column vector with values [1,1]^T
group['b'] = ['second', 'third', 'fourth']

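I am interested in a solution to this problem that works for arrays holding real values (for example, the values 1.0/2 and 1.0/2 are the same, i.e. 1.0/2 == 1.0/2 returns True).

I am aware of the potential limitations of floating-point precision, so to keep things simple I am approaching the problem in two steps:

1. Assume two columns are the same if their values are identical.
2. Assume two columns are the same if their values are close to each other (e.g., the vector difference is below a threshold).

I tried to generalize the solution from the previous thread, but I am not sure it applies directly. I think it works for the first problem (the values in the columns are exactly the same), but we may need "a bigger boat" for the second one.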

score 2 · Accepted Answer

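If you want to build a set-like data structure from the collection of columns, here is one way to do it (I am sure there are more efficient approaches for larger data):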

group = {}
for i in range(array.shape[1]):
    # A tuple of the column's values is hashable, so it can serve as a dict key.
    tup = tuple(array[:,i])
    if tup in group:
        group[tup].append(i)
    else:
        group[tup] = [i]

Executed on your example array, this gives:

In [132]: group
Out[132]:
{(0, 1): [0],
 (0, 2): [6],
 (1, 0): [5],
 (1, 1): [4, 7, 8],
 (2, 0): [2, 9],
 (2, 1): [1, 3]}

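Since a numpy.ndarray (like a list) is not hashable, the columns themselves cannot be used as dict keys. I chose to use the tuple equivalent of each column, but there are many other options.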
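As one such alternative, the grouping can be sketched with numpy.unique, which may scale better to larger arrays than a pure-Python loop over tuples. This is only an illustration, not part of the original answer; the name group_by_unique is hypothetical, and the example array is copied from the question.

```python
import numpy as np

array = np.array([[0, 2, 2, 2, 1, 1, 0, 1, 1, 2],
                  [1, 1, 0, 1, 1, 0, 2, 1, 1, 0]])

# Unique columns, plus for every original column the index of its unique column.
unique_cols, inverse = np.unique(array, axis=1, return_inverse=True)
# Flatten defensively: the shape of the inverse has varied between NumPy versions.
inverse = np.asarray(inverse).ravel()

# Map each distinct column (as a tuple of plain ints) to the indices where it occurs.
group_by_unique = {}
for col_index, label in enumerate(inverse):
    key = tuple(int(v) for v in unique_cols[:, label])
    group_by_unique.setdefault(key, []).append(col_index)

print(group_by_unique)
```

The resulting dictionary has the same keys and index lists as the loop-based version; only the order of the keys may differ.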

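Also, I am assuming that a list is what you want for storing the column indices in group. If that is the case, you might consider using a defaultdict rather than a regular dict. But you could also use many other containers to store the column indices.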
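A minimal sketch of the defaultdict variant, assuming the example array from the question. Converting each column with tolist() is an extra choice made here so that the keys are plain Python ints.

```python
from collections import defaultdict

import numpy as np

array = np.array([[0, 2, 2, 2, 1, 1, 0, 1, 1, 2],
                  [1, 1, 0, 1, 1, 0, 2, 1, 1, 0]])

# Missing keys start out as an empty list, so no membership test is needed.
group = defaultdict(list)
for i in range(array.shape[1]):
    key = tuple(array[:, i].tolist())  # hashable tuple of plain Python ints
    group[key].append(i)

print(dict(group))
```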

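Update

I believe I now better understand what the question is asking: given a set of predefined, arbitrary collections of columns, how do you determine whether any two given groups contain a column in common?

If we assume you have already built the set-like structure from my answer above, you can take two groups, look at their constituent columns, and ask whether any of those columns end up in the same part of the set dictionary:

Suppose we define: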

my_partition['first']  = [0,1,2]
my_partition['second'] = [3,4]
my_partition['third']  = [5,6,7]
my_partition['fourth'] = [8, 9]

# Define a helper to back out the column that serves as a key in the set-like structure.
# Take the 0th element: a column index should belong to exactly one subset.
# Note: this snippet is Python 2 (iteritems, print statement).
get_key = lambda x: [k for k,v in group.iteritems() if x in v][0]

# use itertools
import itertools

# Print out the common columns between each pair of groups.
for pair_x, pair_y in itertools.combinations(my_partition.keys(), 2):
    print pair_x, pair_y, (set(map(get_key, my_partition[pair_x])) &
                           set(map(get_key, my_partition[pair_y])))

Whenever this is not the empty set, it means that the two groups have some column in common.
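The snippet above uses Python 2 syntax, and get_key scans the whole dictionary on every lookup. A possible Python 3 sketch of the same pairwise check is given here; it inverts group once so each column lookup is a single dict access. The names key_of_column, partition_keys and common are hypothetical, and group is written out literally from the example run above.

```python
import itertools

# The set-like structure from the example run, written out literally.
group = {(0, 1): [0], (0, 2): [6], (1, 0): [5],
         (1, 1): [4, 7, 8], (2, 0): [2, 9], (2, 1): [1, 3]}

my_partition = {'first': [0, 1, 2], 'second': [3, 4],
                'third': [5, 6, 7], 'fourth': [8, 9]}

# Invert group once: column index -> tuple of that column's values.
key_of_column = {col: key for key, cols in group.items() for col in cols}

def partition_keys(name):
    """Return the set of distinct column values found in one partition."""
    return {key_of_column[col] for col in my_partition[name]}

# For every pair of partitions, collect the column values they share.
common = {}
for name_x, name_y in itertools.combinations(my_partition, 2):
    common[(name_x, name_y)] = partition_keys(name_x) & partition_keys(name_y)

for pair, shared in common.items():
    print(pair, shared)
```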

Executed on your problem:

In [163]: for pair_x, pair_y in itertools.combinations(my_partition.keys(), 2):
    print pair_x, pair_y, set(map(get_key, my_partition[pair_x])) & set(map(get_key, my_partition[pair_y]))
   .....:
second fourth set([(1, 1)])
second third set([(1, 1)])
second first set([(2, 1)])
fourth third set([(1, 1)])
fourth first set([(2, 0)])
third first set([])
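For the second case in the question (columns that are close rather than identical), tuples of floats no longer work as exact keys. One possible sketch, not taken from the answer above, is a greedy pass that compares each column to one representative per group using the norm of the difference. The function name group_close_columns, the tol parameter and the sample values are all assumptions made for illustration. Because "close" is not transitive, the result can depend on column order.

```python
import numpy as np

def group_close_columns(array, tol=1e-8):
    """Greedily group column indices whose columns differ by less than tol."""
    representatives = []  # index of the first column seen in each group
    groups = []           # list of column-index lists, parallel to representatives
    for i in range(array.shape[1]):
        for g, rep in enumerate(representatives):
            if np.linalg.norm(array[:, i] - array[:, rep]) < tol:
                groups[g].append(i)
                break
        else:
            # No existing group is close enough: start a new one.
            representatives.append(i)
            groups.append([i])
    return groups

# Hypothetical sample data: columns 0, 1 and 3 differ only by tiny perturbations.
values = np.array([[0.5, 0.5 + 1e-12, 2.0, 0.5],
                   [1.0, 1.0,         0.0, 1.0 - 1e-12]])
close_groups = group_close_columns(values, tol=1e-8)
print(close_groups)
```

Each inner list plays the role of one value list in group, so the pairwise partition check can be applied unchanged once every column index is mapped to its group number.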
