mysql - 使用存储过程实现的 Warshall 算法

Question

我在 MySQL 存储过程中实现了 Warshall 算法。不幸的是，该过程需要很长时间才能完成。我是编写存储过程的初学者，你知道我能做些什么来让它更快吗？

简要说明：我正在尝试计算邻接表的传递闭包。我想知道，连接了哪些节点（直接在一条边上，或间接在更多边上）。例如：

Input:  (1, 2), (2, 3), (3, 4)
Output: (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)

下图说明了图表：

_{图片来自维基共享资源：
https ://en.wikipedia.org/wiki/File:Transitive-closure.svg}

您可以在SQL Fiddle或此处查看代码：

# "Warshall's algorithm" to calculate the transitive closure
# (1) For k = 1 to n
# (2)   For i = 1 to n
# (3)     If d[i,k] = 1
# (4)       For j = 1 to n
# (5)         If d[k,j] = 1 : d[i,j] = 1
create procedure closure()
begin
    drop table if exists adjMatrix;
    drop table if exists idArray;
    create temporary table adjMatrix (idFrom int not null, idTo int not null,
                                      primary key (idFrom, idTo));
    create temporary table idArray (id int);
    insert into adjMatrix select parent_id, id
                          from article where parent_id is not null;
    insert into idArray select id from article;
    blockk: begin
        declare k, fink int;
        declare ck cursor for select id from idArray;
        declare continue handler for not found set fink = 1;
        open ck;
        loopk: loop
            fetch ck into k;
            if fink = 1 then
                leave loopk;
            end if;
            blocki: begin
                declare i, fini int;
                declare ci cursor for select id from idArray;
                declare continue handler for not found set fini = 1;
                -- select k from dual;
                open ci;
                loopi: loop
                    fetch ci into i;
                    if fini = 1 then
                        leave loopi;
                    end if;
                    blockj: begin
                        if exists (select * from adjMatrix where idFrom=i and idTo=k)
                        then
                            blockx: begin
                                declare j, finj int;
                                declare cj cursor for select id from idArray;
                                declare continue handler for not found set finj = 1;
                                open cj;
                                loopj: loop
                                    fetch cj into j;
                                    if finj = 1 then
                                        leave loopj;
                                    end if;
                                    if exists (select * from adjMatrix
                                               where idFrom=k and idTo=j) then
                                        insert into adjMatrix values (i, j);
                                    end if;
                                end loop loopj;
                                close cj;
                            end blockx;
                        end if;
                    end blockj;
                end loop loopi;
                close ci;
                -- select idFrom, idTo from adjMatrix order by idFrom, idTo;
            end blocki;
        end loop loopk;
        close ck;
    end blockk;
    insert into adjMatrix select id, id from article where parent_id is null;
    select idFrom, idTo from adjMatrix order by idFrom, idTo;
    drop temporary table adjMatrix;
    drop temporary table idArray;
end//

在我的计算机上运行具有 1466 条记录的表需要 45 秒：

mysql> call closure;
+--------+------+
| idFrom | idTo |
+--------+------+
|      1 |    1 |
|      1 |    2 |
|      1 |    3 |
|      1 |    4 |
|      1 |    5 |
|      2 |    3 |
|      2 |    4 |
|      2 |    5 |
|      3 |    4 |
|      3 |    5 |
|      4 |    5 |
~        ~      ~
|   1587 | 1587 |
|   1588 | 1588 |
|   1589 | 1589 |
+--------+------+
1472 rows in set (45.58 sec)

score 1 · Accepted Answer

警告：由于我不熟悉 mysql，因此我已将问题“翻译”为 MSSQL，因此您需要努力将其翻译回来 =)

我猜事情变慢的原因是因为 SQL 并不真正适合这种操作。它不“喜欢”分支和循环以及所有这些东西。它的作用很多是将数据从一个表匹配到另一个表，最好是大堆。（想想 RDBMS 中的 R）

因此，为了加快您的存储过程，您可以切换到更适合这种编码风格的不同编程语言；或者您可以将问题转换为更适合 SQL 的内容。当然，后者更有趣！=)

稍微思考了一下这个问题，我想到了这个：

CREATE TABLE #result (idFrom int not null, idTo int not null, primary key (idFrom, idTo));

INSERT INTO #result
SELECT parent_id, id
  FROM article 
 WHERE parent_id is not null;

 WHILE @@ROWCOUNT > 0
    BEGIN
        INSERT INTO #result 
        SELECT DISTINCT f.idFrom, t.idTo
          FROM #result f
          JOIN #result t
            ON t.idFrom = f.idTo
         WHERE NOT EXISTS ( SELECT *
                              FROM #result old
                             WHERE old.idFrom = f.idFrom
                               AND old.idTo = t.idTo )
    END

SELECT * FROM #result ORDER BY idFrom, idTo

同样，这是 TSQL（MSSQL 使用的 SQL 方言），但我猜想将其转换为 mysql（??）应该非常简单。

它的作用是：

创建一个临时表#result 与您的 adjMatrix 表几乎相同
从其中的源表中加载“直接链接”
通过将一条记录 idTo 匹配到另一条记录 idFrom 来插入所有“辅助组合”；确保它无法在表中找到所述组合，并确保所述列表仅具有唯一组合（不同）
如果添加了新记录（以及因此的组合），看看我们是否可以添加“下一个”层。

一旦没有新的发现，返回结果

例如：给定输入 (1, 2), (2, 3), (3, 4)

将首先用 (1, 2), (2, 3), (3, 4) 填充#result，然后进入循环：

迭代 1 将匹配以下记录：

(1,2)-(2,3) => (1,3)
(2,3)-(3,4) => (2,4)

并将它们添加到#result，我们因此找到 (1, 2), (2, 3), (3, 4), (1, 3), (2, 4)

迭代 2 将匹配以下记录：

(1,2)-(2,3) => (1,3) 但由于 WHERE NOT EXISTS() 它将被淘汰
(1,2)-(2,4) => (1,4)
(2,3)-(3,4) => (2,4) 但由于 WHERE NOT EXISTS() 它将被淘汰
(1,3)-(3,4) => (1,4)

然后它将 DISTINCT 所述列表，并且仅保留一个 (1,4) 的实例并将被添加，因此找到 (1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1,4)

迭代4将匹配以下记录：

(1,2)-(2,3) => (1,3) 但由于 WHERE NOT EXISTS() 它将被淘汰
(1,2)-(2,4) => (1,4) 但由于 WHERE NOT EXISTS() 它会被淘汰
(2,3)-(3,4) => (2,4) 但由于 WHERE NOT EXISTS() 它将被淘汰
(1,3)-(3,4) => (1,4) 但它会因为 WHERE NOT EXISTS() 而被淘汰

随着所有新记录都被消除，我们最终得到一个零行数，因此跳出循环。

我也尝试过使用 SqlFiddle 输入的算法，结果几乎是瞬间完成的，但我最终得到了 15 条记录而不是 16 条记录的输出。看来您的代码还包含一个 (1, 1) 记录，这有点令人惊讶我？！

无论如何，希望这会有所帮助。在使用 SQL 一段时间后，您将自动学会对数据块进行操作，而不是“按值”工作，这会使任何 RDBMS 陷入困境。它通常适用于一小组数据，但一旦放入更多数据，性能就会下降。编写良好的基于集合的 SQL 几乎不会注意到处理 100 条记录或 100000 条记录之间的区别。（可以这么说=）

mysql - 使用存储过程实现的 Warshall 算法

1 回答 1

Related

Reference