1

我尝试通过将整数矩阵倍数划分为更小的矩阵块来优化整数矩阵倍数,以便在 raspberry pi 3b+ 上获得更好的缓存命中率(它是 Cortex-A53 内核,具有 64 字节缓存线,4 路关联。它是 32K 字节) .

这是代码:

#define L1_D_CACHE_SZ 32 * 1024
size_t cache_tune_g = 32;

void mat_mul(int *A, int *B, int *C, size_t M, size_t N, size_t strideA, size_t strideB, size_t strideC) {

  for(int i = 0; i < M; i++) {
    int *Ai = A + (N + strideA) * i;
    for(int j = 0; j < M; j++) {
        int sum = 0;
        int *Bj = B + j;

        for (int k = 0; k < N; k++) {
            int *Aik = Ai + k;
            int *Bjk = Bj + (M + strideB) * k;
            sum += (*Aik) * (*Bjk);
        }

        int *Cij = C + (M + strideC) * i + j;
        *Cij = (*Cij) + sum;
    }
  }
}

// if B 'fits' into L1 data cache, then do the multiplication, 
// else divide A and B into 4 sub-matrixes and then call itself recursively.
void mat_mul_opt(int *A, int *B, int *C, size_t M, size_t N, size_t strideA, size_t strideB, size_t strideC) {
  int B_size = sizeof(int) * M * N;
  if (B_size < L1_D_CACHE_SZ/cache_tune_g) {
    mat_mul(A, B, C, M, N, strideA, strideB, strideC);
  } else {
    size_t M_sub = M / 2;
    size_t N_sub = N / 2;
    size_t strideA_sub = N_sub + strideA;
    size_t strideB_sub = M_sub + strideB;
    size_t strideC_sub = M_sub + strideC;

    int *A1 = A;
    int *A2 = A + N_sub;
    int *A3 = A + (N + strideA) * M_sub;
    int *A4 = A3 + N_sub;

    int *B1 = B;
    int *B2 = B + M_sub;
    int *B3 = B + (M + strideB) * N_sub;
    int *B4 = B3 + M_sub;

    int *C1 = C;
    int *C2 = C + M_sub;
    int *C3 = C + (M + strideC) * M_sub;
    int *C4 = C3 + M_sub;

    // due to the result in C is accumulated, order here matters.
    mat_mul_opt(A1, B1, C1, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A2, B3, C1, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);

    mat_mul_opt(A1, B2, C2, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A2, B4, C2, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);

    mat_mul_opt(A3, B1, C3, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A4, B3, C3, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);

    mat_mul_opt(A3, B2, C4, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A4, B4, C4, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
  }
}

这是性能结果:

 1,244,238,488      cache-references:u                                            (87.41%)
   193,808,545      cache-misses:u            #   15.576 % of all cache refs      (87.42%)
   192,979,016      L1-dcache-load-misses:u                                       (75.14%)
 6,651,396,875      cycles:u                                                      (87.59%)
 3,499,761,427      instructions:u            #    0.53  insn per cycle           (87.62%)
   539,801,098      branches:u                                                    (87.62%)                                            
     1,632,374      armv7_cortex_a7/l2d_cache_refill/:u                                     (87.48%)

   4.847838433 seconds time elapsed

我在测试中将 A 设置为1024x512和 B设置为512x1024。并得到262144mat_mul函数的调用,并且MxN16x8最后调用mat_mul.

而且我对缓存丢失的计算远小于perf的结果,这里是:

因为矩阵 A 是 16x8 而 B 是 8x16,所以 B ( 16* sizeof(int) = 64 Byte) 的每一行都适合一个 L1 高速缓存行。现在 A 和 B 都应该适合 L1 缓存(16*8*2*sizeof(int) = 1024 Byte我假设有 32KB 的 L1D 缓存,并且考虑到所述 4 路的关联,1024 Byte也应该能够适合它)。因此,mat_mul使用 A ( 16x8) 和 B ( 8x16) 进行的计算应该包含16 + 8 = 24L1 缓存缺失。262,144 * 24 = 6,291,456所以在整个计算中存在缓存缺失。

但是 perf 的结果显示存在192,979,016缓存缺失。这30比我预期的要多几倍。

所以我的问题是我的计算有什么问题?或者缺少的额外缓存从何而来?

而且我还使用 perf 来记录 L1 D 缓存丢失的来源,结果如下所示。if from 的 99% 缺失mat_mul和 in 缺失的 80%mat_mul来自最内循环的行:sum += (*Aik) * (*Bjk);.

  1.21 │ 9c:┌─→ldr    r0, [r3], #4                                                                                                                                           
  2.84 │    │  ldr    ip, [r1], fp                                                                                                                                           
       │    │  cmp    lr, r3                                                                                                                                                 
 80.42 │    │  mla    r2, ip, r0, r2                                                                                                                                         
       │    └──bne    9c               

谢谢!

4

0 回答 0