您的数据集可以被视为可能断开的有向图。在我看来,您想要为每个弱连接子图设置节点。自己写这个并不难:
- 我们认为该图是无向的。
- 我们将边存储在哈希中,因此条目
$edge{$a}{$b}
是从顶点$a
到的有向边$b
。
- 现在我们只需要一个迭代搜索,删除所有访问过的边。
示例代码:
use strict; use warnings; use feature qw/say/;
# build the graph
my %edge;
while (<>) {
my ($from, $to) = split;
$edge{$from}{$to} = $edge{$to}{$from} = undef;
}
while (my ($start) = keys %edge) {
my @seen = ($start);
my @stack = ($start);
while (@stack) {
my $vertex = pop @stack;
# delete edges from and to this vertex
# NB: any connections to seen vertices are already removed.
my @reachable = keys %{ delete($edge{$vertex}) // {} };
delete $edge{$_}{$vertex} for @reachable;
# mark new vertices as seen, and enqueue them
push @seen, @reachable;
push @stack, @reachable;
}
my $nodes = join ', ', sort @seen;
say "node set: {$nodes}";
}
数据的输出:
node set: {B, F, Q, R}
node set: {D, K, P}
node set: {A, C, E, I, S, Y}
该算法已经相当优化,并且在O ( n · k ) 时间和空间中运行(其中k是邻居的平均数量)。
当然,已经有一个实现图算法的模块。不出所料,它被称为Graph
. 上面的代码等价于:
use strict; use warnings; use feature qw/say/;
use Graph;
my $graph = Graph::Undirected->new;
while (<>) {
my ($from, $to) = split;
$graph->add_edge($from, $to);
}
for my $nodes_array ($graph->connected_components) {
my $nodes = join ', ', sort @$nodes_array;
say "node set: {$nodes}";
}
通过在构建图形时跟踪连接的组件,可以在更少的内存和更少的时间中执行这些计算。为此,我们有一个将顶点映射到其子图的哈希。
- 如果边的两个顶点都是未知的,则它们创建一个新的子图。
- 如果确切地知道一个顶点,则另一个节点映射到第一个节点的子图,并在那里列为成员。
- 如果两个顶点都是已知的,那么
- 如果它们指向同一个子图,则不会发生任何事情。
- 如果它们指向不同的子图,则所有条目都将更新为指向同一个子图,该子图现在包含先前子图的组合节点。
编码:
use strict; use warnings; use feature qw/say/;
my %subgraph_by_id;
my %subgraph_by_vertex;
while(<>) {
my ($x, $y) = split;
# case 1:
# If an both vertices of an edge are unknown, they create a new subgraph.
if (not exists $subgraph_by_vertex{$x} and not exists $subgraph_by_vertex{$y}) {
my $new = [$x, $y];
$subgraph_by_id{0+ $new} = $new;
$subgraph_by_vertex{$_} = $new for $x, $y;
}
# case 2:
# If exactly one vertex is known, the other node maps to the subgraph of the
# first node, and is listed there as a member.
elsif (not exists $subgraph_by_vertex{$x} or not exists $subgraph_by_vertex{$y}) {
my ($known, $unknown) = (exists $subgraph_by_vertex{$x}) ? ($x, $y) : ($y, $x);
my $subgraph = $subgraph_by_vertex{$unknown} = $subgraph_by_vertex{$known};
push @$subgraph, $unknown;
}
# case 3:
# both vertices are known. If they point to different subgraphs, all entries
# are updated to point to the same subgraph which now contains the combined
# nodes of the previous subgraphs.
# Except all that copying would make for a horrible worst case.
# Instead, we just add a reference to the smaller list, flattening it later.
else {
my $prev_x = $subgraph_by_vertex{$x};
my $prev_y = $subgraph_by_vertex{$y};
# don't test for inequality directly to allow subgraph nesting
if ($subgraph_by_id{0+ $prev_x} != $subgraph_by_id{0+ $prev_y}) {
my ($new, $old) = (@$prev_x > @$prev_y) ? ($prev_x, $prev_y) : ($prev_y, $prev_x);
push @$new, $old;
# $old not needed on top level any longer – associate it with $new by id
$subgraph_by_id{0+ $old} = 0+ $new;
}
}
}
# skip symbolic IDs
for my $nodes_array (grep ref, values %subgraph_by_id) {
my $nodes = join ', ', flatten($nodes_array);
say "node set: {$nodes}";
}
sub flatten {
return map { ref $_ ? flatten($_) : $_ } @{ shift() };
}
这仅使用O ( n ) 空间和时间,使用了很多尴尬的技巧。在构建子图期间,我不会合并两个连接的子图,而是将其推迟到以后。否则,边缘情况(自下而上构建的平衡树——对于每个非叶节点,将复制树的一半)可能需要指数时间——我没有进行全面分析。“ 0+
venus”伪操作符将其参数进行了数字化,在这里用于获取数组引用的 ID。