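I wrote a simple test that sums an array using SIMD, adding 4 elements at a time, and compares that against accumulating into 4 separate scalar sum variables and adding them together at the end. Here is my test case: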
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <immintrin.h>
#include <x86intrin.h>
int main()
{
    double time1, time2;
    time1 = time2 = 0;
    int n = 50000000;
    int runs = 5;
    /* 32-byte alignment is required by _mm256_load_pd */
    double * test = _mm_malloc(sizeof(double) * n, 32);
    for(int i = 0; i < n; i++){
        test[i] = i;
    }

    /* Version 1: a single AVX accumulator, 4 doubles per add */
    time1 = omp_get_wtime();
    double overalla;
    for(int a = 0; a < runs; a++){
        __m256d accumulate = _mm256_setzero_pd();
        overalla = 0;
        for(int i = 0; i < n; i += 4){
            accumulate = _mm256_add_pd(_mm256_load_pd(test + i), accumulate);
        }
        /* Spill the 4 lanes to memory and add them (horizontal sum) */
        double result[4] __attribute__ ((aligned (32)));
        _mm256_store_pd(result, accumulate);
        overalla = result[0] + result[1] + result[2] + result[3];
    }
    time1 = omp_get_wtime() - time1;

    /* Version 2: loop unrolled by 4 with four scalar accumulators */
    double overall;
    time2 = omp_get_wtime();
    for(int a = 0; a < runs; a++){
        double sum1, sum2, sum3, sum4;
        sum1 = sum2 = sum3 = sum4 = 0;
        overall = 0;
        for(int i = 0; i < n; i += 4){
            sum1 += test[i];
            sum2 += test[i+1];
            sum3 += test[i+2];
            sum4 += test[i+3];
        }
        overall = sum1 + sum2 + sum3 + sum4;
    }
    time2 = omp_get_wtime() - time2;

    printf("A: %f, B: %f\n", overalla, overall);
    printf("Time 1: %f, Time 2: %f\n", time1, time2);
    /* time1/time2 > 1 means the unrolled scalar loop was faster */
    printf("Unroll %f times faster\n", time1/time2);
    _mm_free(test);
    return 0;
}
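I expected the SIMD version to be noticeably faster (it adds 4 elements per instruction), but it isn't. Can anyone point out why? This is the output I get when I run the code: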
A: 1249999975000000.000000, B: 1249999975000000.000000
Time 1: 0.317978, Time 2: 0.207965
Unroll 1.528996 times faster
I am compiling without optimization; my gcc command is: gcc -fopenmp -mavx -mfma -pthread
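In case it helps anyone reproduce this, here is a reduced sketch of the two reductions from the test above, pulled out into separate functions so they can be checked for equal results and rebuilt at different optimization levels. The names `v4d`, `sum_vector` and `sum_unrolled` are my own and not part of the original test. To keep it buildable without `-mavx`, it uses the GCC/Clang generic vector extension instead of the `_mm256_*` intrinsics, so it is not an exact stand-in for the intrinsic version. It also assumes `n` is a multiple of 4.

```c
#include <stddef.h>
#include <string.h>

/* Four packed doubles. This is the GCC/Clang vector extension,
   used here as a stand-in for __m256d (an assumption of this sketch). */
typedef double v4d __attribute__((vector_size(32)));

/* Sum n doubles with one 4-lane accumulator; n must be a multiple of 4. */
double sum_vector(const double *data, size_t n)
{
    v4d acc = {0.0, 0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i += 4) {
        v4d chunk;
        memcpy(&chunk, data + i, sizeof chunk); /* load with no alignment requirement */
        acc += chunk;                           /* lane-wise add of 4 doubles */
    }
    /* Horizontal sum of the four lanes */
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Same reduction with four independent scalar accumulators. */
double sum_unrolled(const double *data, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions add the elements in the same per-lane order, so they should return identical sums for the same input. Timing them from a `main` like the one above at different optimization levels would show whether the gap I measured is specific to the unoptimized build.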