In a Fortran program, I have a large loop in which dot_product
is called many times on small vectors generated inside the loop:
program test
    implicit none

    real :: array1(2, 2), array2(2, 2), res(2)
    real :: subarray1(2), subarray2(2)
    integer :: i

    array1 = 1
    array2 = 2

    !$acc data copyin(array1, array2) copyout(res)
    !$acc kernels
    !$acc loop independent private(subarray1, subarray2)
    do i = 1, 2
        subarray1(:) = array1(:, i)
        subarray2(:) = array2(:, i)
        res(i) = dot_product(subarray1, subarray2)
    enddo
    !$acc end kernels
    !$acc end data
    print "(2(g0, x))", res
endprogram
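When compiled with the PGI compiler, the accelerated implementation of dot_product appears to
use its own accelerated loop, which prevents the main loop from being accelerated more effectively (across both gang and vector):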
test:
11, Generating copyin(array1(:,:)) [if not already present]
Generating copyout(res(:)) [if not already present]
Generating copyin(array2(:,:)) [if not already present]
14, Loop is parallelizable
Generating Tesla code
14, !$acc loop gang ! blockidx%x
15, !$acc loop vector(32) ! threadidx%x
17, !$acc loop vector(32) ! threadidx%x
Generating implicit reduction(+:subarray1$r)
14, CUDA shared memory used for subarray2,subarray1
15, Loop is parallelizable
17, Loop is parallelizable
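As the log shows, the compiler uses an implicit reduction and shared memory for the loop-private vectors.
Is there a way to force dot_product
to run sequentially?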
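
For reference, here is a minimal sketch of the behavior I am after, written by hand. It is only an illustration, not something I have confirmed against a particular PGI release: the intrinsic is replaced by an explicit inner loop marked `loop seq`, and the outer loop is asked to take both gang and vector parallelism. The names `test_seq`, `acc` and `j` are made up for this example, and switching from `kernels` to `parallel loop` is an assumption on my part about what keeps the compiler from auto-parallelizing the inner loop.

```fortran
program test_seq
    implicit none

    real :: array1(2, 2), array2(2, 2), res(2)
    real :: acc          ! hypothetical scalar accumulator, private to each iteration
    integer :: i, j

    array1 = 1
    array2 = 2

    !$acc data copyin(array1, array2) copyout(res)
    ! Request gang and vector parallelism on the outer loop only.
    !$acc parallel loop gang vector private(acc)
    do i = 1, 2
        acc = 0.0
        ! Hand-written dot product, explicitly requested to run sequentially.
        !$acc loop seq
        do j = 1, 2
            acc = acc + array1(j, i) * array2(j, i)
        enddo
        res(i) = acc
    enddo
    !$acc end data

    print "(2(g0, 1x))", res
end program test_seq
```

Both elements of res should equal 4, the same as in the original program, and the temporary arrays subarray1 and subarray2 are no longer needed. Without OpenACC enabled, the directives are plain comments and the program runs serially.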