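I'm working on an assignment where I transpose a matrix to reduce cache misses during matrix multiplication. From what a few classmates have told me, I should see about an 8x speedup. However, I'm only getting 2x... what might I be doing wrong?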
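void transpose(int size, matrix m) {
    int i, j;
    // Visit only the upper triangle (j > i): looping over every (i, j)
    // swaps each pair twice and leaves the matrix unchanged.
    for (i = 0; i < size; i++)
        for (j = i + 1; j < size; j++)
            std::swap(m.element[i][j], m.element[j][i]);
}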
void mm(matrix a, matrix b, matrix result) {
    int i, j, k;
    int size = a.size;
    long long before, after;
    before = wall_clock_time();
    // Do the multiplication
    transpose(size, b); // transpose the matrix to reduce cache miss
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++) {
            int tmp = 0; // save memory writes
            for (k = 0; k < size; k++)
                tmp += a.element[i][k] * b.element[j][k];
            result.element[i][j] = tmp;
        }
    after = wall_clock_time();
    fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before))/1000000000);
}
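To rule out a correctness problem before worrying about speed, a small self-contained check like the one below could compare the transposed multiply against the plain i-j-k version. This is only a sketch under assumptions: the flat row-major `std::vector<int>` storage and the names `Mat`, `transpose_in_place`, `mm_naive`, and `mm_transposed` are hypothetical and are not the assignment's `matrix` type.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical flat row-major storage (NOT the assignment's `matrix` type):
// element (i, j) of an n x n matrix lives at index i * n + j.
using Mat = std::vector<int>;

// In-place transpose that touches each off-diagonal pair exactly once.
void transpose_in_place(Mat& m, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            std::swap(m[i * n + j], m[j * n + i]);
}

// Reference i-j-k multiply: the inner loop walks b down a column (stride n).
Mat mm_naive(const Mat& a, const Mat& b, std::size_t n) {
    Mat c(n * n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            int tmp = 0;
            for (std::size_t k = 0; k < n; ++k)
                tmp += a[i * n + k] * b[k * n + j];
            c[i * n + j] = tmp;
        }
    return c;
}

// Transposed multiply: b is taken by value, so the caller's copy is not
// modified. After the transpose, the inner loop reads both operands row-wise.
Mat mm_transposed(const Mat& a, Mat b, std::size_t n) {
    transpose_in_place(b, n);
    Mat c(n * n, 0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            int tmp = 0;
            for (std::size_t k = 0; k < n; ++k)
                tmp += a[i * n + k] * b[j * n + k];
            c[i * n + j] = tmp;
        }
    return c;
}
```

Calling `mm_naive(a, b, n)` and `mm_transposed(a, b, n)` on the same inputs and comparing the results with `==` should show whether the transpose step is doing what I expect before I time anything.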
Am I doing this correctly so far?
FYI: the next optimization I need to do is to use SIMD/Intel SSE3.