In a Fortran program, I have a large loop in which dot_product is called many times on small vectors generated inside the loop:

program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels
        !$acc loop independent private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        enddo
        !$acc end kernels
        !$acc end data

        print "(2(g0, 1x))", res
endprogram

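When compiled with the PGI compiler, the accelerated implementation of dot_product appears to use an accelerated loop of its own, which prevents the main loop from being parallelized more effectively (across both gang and vector):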

test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14, Loop is parallelizable
         Generating Tesla code
         14, !$acc loop gang ! blockidx%x
         15, !$acc loop vector(32) ! threadidx%x
         17, !$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14, CUDA shared memory used for subarray2,subarray1
     15, Loop is parallelizable
     17, Loop is parallelizable

As the log shows, it uses an implicit reduction and shared memory for the loop-private vectors.

Is there a way to force dot_product to run sequentially?

1 Answer

Is there a way to force dot_product to run sequentially?

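As long as you don't mind the array syntax being run sequentially as well, just add "gang vector" to the loop directive. The outer loop is then scheduled across both gangs and vector lanes, and the compiler falls back to "loop seq" for the array assignments and the dot_product, as the -Minfo output shows: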

% cat test.f90
program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels loop gang vector private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        enddo
        !$acc end data

        print "(2(g0, 1x))", res
endprogram
% nvfortran -acc -Minfo=accel test.f90
test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         14, !$acc loop seq
         16, !$acc loop seq
     13, Local memory used for subarray2,subarray1
     14, Loop is parallelizable
     16, Loop is parallelizable
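
If you would rather not depend on how the compiler schedules the array syntax and the intrinsic, one possible variation is to write the dot product as an explicit inner loop marked "loop seq" and accumulate into a private scalar. The following is only a sketch under assumptions: the program name test_seq, the scalar total and the index j are made up for illustration, and the use of "parallel loop" instead of "kernels loop" is my choice here, not something from the compiler output above. I have not checked what -Minfo reports for it. Built without -acc, the directives are plain comments and the program runs serially, and each element of res should come out as 4.

```fortran
program test_seq
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: total      ! hypothetical per-iteration accumulator
        integer :: i, j

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        ! Outer loop spread over gangs and vector lanes; total is private per iteration
        !$acc parallel loop gang vector private(total)
        do i = 1, 2
                total = 0
                ! Explicitly sequential inner loop replaces dot_product and the temporaries
                !$acc loop seq
                do j = 1, 2
                        total = total + array1(j, i) * array2(j, i)
                enddo
                res(i) = total
        enddo
        !$acc end data

        print "(2(g0, 1x))", res
endprogram
```

This also drops the private subarray1 and subarray2 copies, so there is nothing left for the compiler to place in shared or local memory.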
Answered 2021-02-02T17:24:32.603