17

Real-world problem:

I have data on directors across many firms, but sometimes "John Smith, director of XYZ" and "John Smith, director of ABC" are the same person, sometimes they're not. Also "John J. Smith, director of XYZ" and "John Smith, director of ABC" might be the same person, or might not be. Often examination of additional information (e.g., comparison of biographical data on "John Smith, director of XYZ" and "John Smith, director of ABC") makes it possible to resolve whether two observations are the same person or not.

Conceptual version of the problem:

In that spirit, am collecting data that will identify matching pairs. For example, suppose I have the following matching pairs: {(a, b), (b, c), (c, d), (d, e), (f, g)}. I want to use the transitivity property of the relation "is the same person as" to generate "connected components" of {{a, b, c, d, e}, {f, g}}. That is {a, b, c, d, e} is one person and {f, g} is another. (An earlier version of the question referred to "cliques", which are apparently something else; this would explain why find_cliques in networkx was giving the "wrong" results (for my purposes).

The following Python code does the job. But I wonder: is there a better (less computationally costly) approach (e.g., using standard or available libraries)?

There are examples here and there that seem related (e.g., Cliques in python), but these are incomplete, so I am not sure what libraries they are referring to or how to set up my data to use them.

Sample Python 2 code:

def get_cliques(pairs):
    from sets import Set

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for set in set_list:
            if pair[0] in set or pair[1] in set:
                set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))

This produces the desired output: [Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])].

Sample Python 3 code:

This produces [set(['a', 'c', 'b', 'e', 'd']), set(['g', 'f'])]):

def get_cliques(pairs):

    set_list = [set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for a_set in set_list:
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))
4

4 回答 4

12

使用networkX:

import networkx as nx
G1=nx.Graph()
G1.add_edges_from([("a","b"),("b","c"),("c","d"),("d","e"),("f","g")])
sorted(nx.connected_components(G1), key = len, reverse=True)

给予:

[['a', 'd', 'e', 'b', 'c'], ['f', 'g']]

您现在必须检查最快的算法...

操作:

这很好用!我现在在我的 PostgreSQL 数据库中有这个。只需将对组织成一个两列表,然后用于array_agg()传递给 PL/Python 函数get_connected()。谢谢。

CREATE OR REPLACE FUNCTION get_connected(
    lhs text[],
    rhs text[])
  RETURNS SETOF text[] AS
$BODY$
    pairs = zip(lhs, rhs)

    import networkx as nx
    G=nx.Graph()
    G.add_edges_from(pairs)
    return sorted(nx.connected_components(G), key = len, reverse=True)

$BODY$ LANGUAGE plpythonu;

(注意:我编辑了答案,因为我认为显示此步骤可能会有所帮助,但评论太长了。)

于 2015-01-15T16:44:42.000 回答
3

我不相信(如果我错了,请纠正我)这与最大的集团问题直接相关。团的定义(维基百科)说团“在无向图中是其顶点的子集,使得子集中的每两个顶点都由一条边连接”。在这种情况下,我们想要找出哪些节点可以互相到达(甚至是间接的)。

我做了一个小样本。它构建一个图并遍历它寻找邻居。这应该非常有效,因为每个节点只在组形成时被遍历一次。

from collections import defaultdict

def get_cliques(pairs):
    # Build a graph using the pairs
    nodes = defaultdict(lambda: [])
    for a, b in pairs:
        if b is not None:
            nodes[a].append((b, nodes[b]))
            nodes[b].append((a, nodes[a]))
        else:
            nodes[a]  # empty list

    # Add all neighbors to the same group    
    visited = set()
    def _build_group(key, group):
        if key in visited:
            return
        visited.add(key)
        group.add(key)
        for key, _ in nodes[key]:
            _build_group(key, group)

    groups = []
    for key in nodes.keys():
        if key in visited: continue
        groups.append(set())
        _build_group(key, groups[-1])

    return groups

if __name__ == '__main__':
    pairs = [
        ('a', 'b'), ('b', 'c'), ('b', 'd'), # a "tree"
        ('f', None),                        # no relations
        ('h', 'i'), ('i', 'j'), ('j', 'h')  # circular
    ]
    print get_cliques(pairs)
    # Output: [set(['a', 'c', 'b', 'd']), set(['f']), set(['i', 'h', 'j'])]

如果您的数据集最好建模为图形并且非常大,那么诸如Neo4j之类的图形数据库是否合适?

于 2015-01-15T16:29:43.553 回答
2

DSM 的评论让我在 Python 中寻找集合合并算法。Rosetta Code有两个版本的相同算法。示例使用(非递归版本):

[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

# Copied from Rosetta Code
def consolidate(sets):
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

print consolidate([set(pair) for pair in pairs])
# Output: [set(['a', 'c', 'b', 'd']), set([None, 'f']), set(['i', 'h', 'j'])]
于 2015-01-15T16:45:18.250 回答
1

我尝试了使用字典作为查找的替代实现,并且可能在计算延迟方面略有减少。

# Modified to use a dictionary
from collections import defaultdict

def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a,b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)
  for (k, v) in clique.iteritems(): reversed[v].append(k)
  return reversed

只是为了让自己相信它会返回正确的结果(get_cliques1这是您最初的 Python 2 解决方案):

>>> from cliques import *
>>> get_cliques1(pairs) # Original Python 2 solution
[Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])]
>>> get_cliques2(pairs) # Dictionary-based alternative
[['a', 'c', 'b', 'e', 'd'], ['g', 'f']]

以秒为单位的计时信息(1000 万次重复):

$ python get_times.py 
get_cliques: 75.1285209656
get_cliques2: 69.9816100597

为了完整和参考,这是两者cliques.pyget_times.py计时脚本的完整列表:

# cliques.py
# Python 2.7
from collections import defaultdict
from sets import Set  # I moved your import out of the function to try to get closer to apples-apples

# Original Python 2 solution
def get_cliques1(pairs):

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for set in set_list:
            if pair[0] in set or pair[1] in set:
                set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

# Modified to use a dictionary
def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a,b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)
  for (k, v) in clique.iteritems(): reversed[v].append(k)
  return reversed.values()

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]


# get_times.py
# Python 2.7
from timeit import timeit

REPS = 10000000

print "get_cliques: " + str(timeit(
  stmt='get_cliques1(pairs)', setup='from cliques import get_cliques1, pairs',
  number=REPS
))
print "get_cliques2: " + str(timeit(
  stmt='get_cliques2(pairs)', setup='from cliques import get_cliques2, pairs',
  number=REPS
))

所以至少在这个人为的场景中,有一个可测量的加速。诚然,这不是开创性的,而且我确信我在实现中留下了一些性能位,但也许它会帮助您考虑其他替代方案?

于 2015-01-15T16:24:29.727 回答