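I wrote a simple test that sums an array with SIMD, adding 4 elements at a time, and compares it against a scalar loop that accumulates into 4 separate sum variables and adds them together at the end. Here is my test-case code: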

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h>
#include <omp.h>

#include <immintrin.h>
#include <x86intrin.h>

int main() 
{ 
    double time1, time2;
    time1 = time2 = 0;
    int n = 50000000;
    int runs = 5;
    double * test = _mm_malloc(sizeof(double) * n, 32);

    for(int i = 0; i < n; i++){
        test[i] = i;
    }

    time1 = omp_get_wtime();
    double overalla;
    for(int a = 0; a < runs; a++){
        __m256d accumulate = _mm256_setzero_pd();
        overalla = 0;
        for(int i = 0; i < n; i += 4){
            accumulate = _mm256_add_pd(_mm256_load_pd(test + i), accumulate);
        }
        /* Spill the four lanes to an aligned array and add them up. */
        double result[4] __attribute__ ((aligned (32)));
        _mm256_store_pd(result, accumulate);
        overalla = result[0] + result[1] + result[2] + result[3];
    }
    time1 = omp_get_wtime() - time1;

    double overall;
    time2 = omp_get_wtime();
    for(int a = 0; a < runs; a++){
        double sum1, sum2, sum3, sum4;
        sum1 = sum2 = sum3 = sum4 = 0;
        overall = 0;
        for(int i = 0; i < n; i += 4){
            sum1 += test[i];
            sum2 += test[i+1];
            sum3 += test[i+2];
            sum4 += test[i+3];
        }
        overall = sum1 + sum2 + sum3 + sum4;
    }
    time2 = omp_get_wtime() - time2;

    printf("A: %f, B: %f\n", overalla, overall);
    printf("Time 1: %f, Time 2: %f\n", time1, time2);
    printf("Unroll %f times faster\n", time1/time2);
    _mm_free(test); /* release the buffer obtained from _mm_malloc */
    return 0;
}

I expected the SIMD version to be noticeably faster (it adds 4 elements at a time), but it isn't. Can anyone point out why? Running the code gives:

A: 1249999975000000.000000, B: 1249999975000000.000000

Time 1: 0.317978, Time 2: 0.207965

Unroll 1.528996 times faster

I am compiling without optimization; the gcc options are gcc -fopenmp -mavx -mfma -pthread
