c - D&C 的 C 实现比矩阵乘法的 Naive 解决方案更快？

Question

用于矩阵乘法的分治 (D&C) 解决方案和 Naive 解决方案均使用 C 编程语言“就地”实现。所以根本没有动态内存分配。

正如我们对这两种解决方案所了解的那样，它们实际上具有相同的时间复杂度，即 O(n^3)。现在它们共享相同的空间复杂度，因为它们都是就地实现的。那么一个人怎么会比另一个人快这么多呢？

使用clock_gettime 获取时间。

在核心 i7 笔记本电脑的 Windows 7 上使用 Cygwin，D&C 解决方案的运行速度比 Naive 解决方案快得多（删除了冗余日志）：

编辑：

"algo0" 表示 Naive 解决方案，而 "algo1" 表示 D&C 解决方案。

“len”表示矩阵的宽度和高度。并且矩阵是NxN矩阵。

“00:00:00:000:003:421”表示：“时：分：秒：毫秒：微秒：纳秒”。

[alg0]time cost[0, len=00000002]: 00:00:00:000:003:421 (malloc_cnt=0)
[alg1]time cost[0, len=00000002]: 00:00:00:000:000:855 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[1, len=00000004]: 00:00:00:000:001:711 (malloc_cnt=0)
[alg1]time cost[1, len=00000004]: 00:00:00:000:001:711 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[2, len=00000008]: 00:00:00:000:009:408 (malloc_cnt=0)
[alg1]time cost[2, len=00000008]: 00:00:00:000:008:553 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[3, len=00000016]: 00:00:00:000:070:134 (malloc_cnt=0)
[alg1]time cost[3, len=00000016]: 00:00:00:000:065:858 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[4, len=00000032]: 00:00:00:000:564:066 (malloc_cnt=0)
[alg1]time cost[4, len=00000032]: 00:00:00:000:520:873 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[5, len=00000064]: 00:00:00:004:667:337 (malloc_cnt=0)
[alg1]time cost[5, len=00000064]: 00:00:00:004:340:188 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[6, len=00000128]: 00:00:00:009:662:680 (malloc_cnt=0)
[alg1]time cost[6, len=00000128]: 00:00:00:008:139:403 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[7, len=00000256]: 00:00:00:080:031:116 (malloc_cnt=0)
[alg1]time cost[7, len=00000256]: 00:00:00:065:395:329 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[8, len=00000512]: 00:00:00:836:392:576 (malloc_cnt=0)
[alg1]time cost[8, len=00000512]: 00:00:00:533:799:924 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[9, len=00001024]: 00:00:09:942:086:780 (malloc_cnt=0)
[alg1]time cost[9, len=00001024]: 00:00:04:307:021:362 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[10, len=00002048]: 00:02:53:413:046:992 (malloc_cnt=0)
[alg1]time cost[10, len=00002048]: 00:00:35:588:289:832 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[11, len=00004096]: 00:25:46:154:930:041 (malloc_cnt=0)
[alg1]time cost[11, len=00004096]: 00:04:38:196:205:661 (malloc_cnt=0)

即使在只有一个 ARM 内核的 Raspberry Pi 上，结果也是相似的（同时删除了冗余数据）：

[alg0]time cost[0, len=00000002]: 00:00:00:000:005:999 (malloc_cnt=0)
[alg1]time cost[0, len=00000002]: 00:00:00:000:051:997 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[1, len=00000004]: 00:00:00:000:004:999 (malloc_cnt=0)
[alg1]time cost[1, len=00000004]: 00:00:00:000:008:000 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[2, len=00000008]: 00:00:00:000:014:999 (malloc_cnt=0)
[alg1]time cost[2, len=00000008]: 00:00:00:000:023:999 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[3, len=00000016]: 00:00:00:000:077:996 (malloc_cnt=0)
[alg1]time cost[3, len=00000016]: 00:00:00:000:157:991 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[4, len=00000032]: 00:00:00:000:559:972 (malloc_cnt=0)
[alg1]time cost[4, len=00000032]: 00:00:00:001:248:936 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[5, len=00000064]: 00:00:00:005:862:700 (malloc_cnt=0)
[alg1]time cost[5, len=00000064]: 00:00:00:010:739:450 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[6, len=00000128]: 00:00:00:169:060:336 (malloc_cnt=0)
[alg1]time cost[6, len=00000128]: 00:00:00:090:290:373 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[7, len=00000256]: 00:00:03:207:909:599 (malloc_cnt=0)
[alg1]time cost[7, len=00000256]: 00:00:00:771:870:443 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[8, len=00000512]: 00:00:35:725:494:551 (malloc_cnt=0)
[alg1]time cost[8, len=00000512]: 00:00:08:139:712:988 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[9, len=00001024]: 00:06:29:762:101:314 (malloc_cnt=0)
[alg1]time cost[9, len=00001024]: 00:01:50:964:568:907 (malloc_cnt=0)
------------------------------------------------------
[alg0]time cost[10, len=00002048]: 00:52:03:950:717:474 (malloc_cnt=0)
[alg1]time cost[10, len=00002048]: 00:14:19:222:020:444 (malloc_cnt=0)

我的第一个猜测是，一定是 GCC 做了一些优化。但具体怎么做？

以下是 Naive 解决方案和 D&C 解决方案的代码。天真的解决方案：

void ClassicalMulti(int const * const mat1,
                    int const * const mat2,
                    int * const matrix,
                    const int n) {
    if (!mat1 || !mat2 || n<=0) {
        printf("ClassicalMulti: Invalid Input\n");
        return;
    }

    int cnt, row, col;

    for (row=0;row<n;++row) {
        for (col=0;col<n;++col) {
            for (cnt=0;cnt<n;++cnt) {
                matrix[row*n+col] += mat1[row*n+cnt] * mat2[cnt*n+col];
            }
        }
    }
}

分而治之解决方案：

void DCMulti(int const * const mat1,
             int const * const mat2,
             int * const matrix,
             const int p1,
             const int p2,
             const int pn,
             const int n) {
    if (!mat1 || !mat2 || !matrix || n<2 || p1<0 || p2 <0 || pn<2) {
        printf("DCMulti: Invalid Input\n");
        return;
    }

    if (pn == 2) {
        int pos = (p1/n)*n + p2%n;
        matrix[pos]     += mat1[p1]*mat2[p2] + mat1[p1+1]*mat2[p2+n];
        matrix[pos+1]   += mat1[p1]*mat2[p2+1] + mat1[p1+1]*mat2[p2+1+n];
        matrix[pos+n]   += mat1[p1+n]*mat2[p2] + mat1[p1+1+n]*mat2[p2+n];
        matrix[pos+1+n] += mat1[p1+n]*mat2[p2+1] + mat1[p1+1+n]*mat2[p2+1+n];
    } else {
        int a = p1;
        int b = p1 + pn/2;
        int c = p1 + pn*n/2;
        int d = p1 + pn*(n+1)/2;
        int e = p2;
        int f = p2 + pn/2;
        int g = p2 + pn*n/2;
        int h = p2 + pn*(n+1)/2;
        DCMulti(mat1, mat2, matrix, a, e, pn/2, n);   // a*e
        DCMulti(mat1, mat2, matrix, b, g, pn/2, n);   // b*g
        DCMulti(mat1, mat2, matrix, a, f, pn/2, n);   // a*f
        DCMulti(mat1, mat2, matrix, b, h, pn/2, n);   // b*h 
        DCMulti(mat1, mat2, matrix, c, e, pn/2, n);   // c*e 
        DCMulti(mat1, mat2, matrix, d, g, pn/2, n);   // d*g 
        DCMulti(mat1, mat2, matrix, c, f, pn/2, n);   // c*f 
        DCMulti(mat1, mat2, matrix, d, h, pn/2, n);   // d*h 
    }
}

score 6 · Accepted Answer

这两种方法的区别仅在于内存访问模式。即缓存位置；对于大型矩阵，尤其是行竞争相同的缓存行并导致缓存未命中的惩罚越来越大。最后，D&C 策略得到了回报，尽管全局更好的方法是将问题划分为 8x8 块——一种称为循环平铺的技术。（毫不奇怪，矩阵乘法在维基百科文章中作为拱形示例呈现......）

score 2 · Accepted Answer

有许多不同的方法可以进行简单的矩阵乘法。有些比其他的快得多。

您可以在做您正在做的事情之前尝试转置第二个矩阵。Aki Suikhonen 正确地指出平铺可能是有益的。您可能会尝试的另一件事是“打破巨大的负载存储链”——内部循环的每次迭代都会将一个数字添加到同一位置。CPU 可能会忽略几乎所有的负载，但它仍然必须等待前一个添加完成，然后才能开始下一个。尝试拥有多个累加器并在最后将它们相加。您可能希望展开内部循环以使其更易于编码。

c - D&C 的 C 实现比矩阵乘法的 Naive 解决方案更快？

2 回答 2

Related

Reference