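Using "vector(value)" on a loop inside a "parallel" construct is not legal OpenACC syntax. Instead, set the vector length with the "vector_length" clause on the "parallel" directive. The reason is that "parallel" defines a single compute region to offload, so every vector loop in that region must use the same vector length.
You can only use "vector(value)" with the "kernels" construct, because the compiler is then free to split the region into multiple kernels, each with its own vector length.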
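To make the "kernels" case concrete, here is a minimal sketch. The routine "scale_and_shift" and its arguments are hypothetical and not part of the original code. It puts two loops in one "kernels" region and requests a different vector length for each, which is only possible because the compiler may generate a separate kernel per loop:

```c
/* Hypothetical example, not from the original post: one kernels region,
   two loops, each asking for its own vector length. */
void scale_and_shift(float *restrict x, float *restrict y, int n) {
#pragma acc kernels copy(x[0:n], y[0:n])
    {
        /* First loop: may become its own kernel with 128-wide vectors. */
#pragma acc loop independent vector(128)
        for (int i = 0; i < n; ++i) {
            x[i] = 2.0f * x[i];
        }

        /* Second loop: a separate kernel can use a different width. */
#pragma acc loop independent vector(256)
        for (int i = 0; i < n; ++i) {
            y[i] = y[i] + 1.0f;
        }
    }
}
```

Inside a "parallel" region the same two loops would have to share the one value given by "vector_length". When the file is compiled without OpenACC enabled, the pragmas are ignored and the loops simply run on the host.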
Option 1 ("parallel" with "vector_length"):
% cat test.c
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
#pragma acc parallel vector_length(128) copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop vector
  {
    for (int i = 0; i < 256; ++i) {
      float sum = 0;
      for (int j = 0; j < 256; ++j) {
        sum += *(a + a_stride * i + j);
      }
      *(c + c_stride * i) = sum;
    }
  }
}
% nvc -acc -c test.c -Minfo=accel
foo:
      4, Generating copyout(c[:c_stride*256]) [if not already present]
         Generating copyin(a[:a_stride*256]) [if not already present]
         Generating Tesla code
          5, #pragma acc loop vector(128) /* threadIdx.x */
          7, #pragma acc loop seq
      5, Loop is parallelizable
      7, Loop is parallelizable
Option 2 ("kernels" with "vector(128)" on the loop):
% cat test.c
void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
#pragma acc kernels copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop independent vector(128)
  {
    for (int i = 0; i < 256; ++i) {
      float sum = 0;
      for (int j = 0; j < 256; ++j) {
        sum += *(a + a_stride * i + j);
      }
      *(c + c_stride * i) = sum;
    }
  }
}
% nvc -acc -c test.c -Minfo=accel
foo:
      4, Generating copyout(c[:c_stride*256]) [if not already present]
         Generating copyin(a[:a_stride*256]) [if not already present]
      5, Loop is parallelizable
         Generating Tesla code
          5, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
          7, #pragma acc loop seq
      7, Loop is parallelizable
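If you want to confirm that either version still computes the right row sums, a small host-side check along these lines may help. This is only a sketch: the helper "check_foo", the fill pattern, and the choice of a_stride = 256 and c_stride = 1 are assumptions for illustration. The copy of "foo" uses the Option 1 directives, written without the extra braces so the loop directive sits directly on the "for" loop.

```c
#include <stdlib.h>

/* Copy of the Option 1 routine (extra braces dropped). */
void foo(const float *restrict a, int a_stride, float *restrict c, int c_stride) {
#pragma acc parallel vector_length(128) copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop vector
    for (int i = 0; i < 256; ++i) {
        float sum = 0;
        for (int j = 0; j < 256; ++j) {
            sum += *(a + a_stride * i + j);
        }
        *(c + c_stride * i) = sum;
    }
}

/* Hypothetical helper: returns the number of rows with a wrong sum,
   or -1 if allocation fails. */
int check_foo(void) {
    enum { N = 256 };
    float *a = malloc(sizeof(float) * N * N);
    float *c = malloc(sizeof(float) * N);
    if (!a || !c) {
        free(a);
        free(c);
        return -1;
    }

    /* Every element of row i holds i % 7, so the sums are small exact integers. */
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            a[i * N + j] = (float)(i % 7);
        }
    }

    foo(a, N, c, 1); /* assumed strides: a_stride = 256, c_stride = 1 */

    int bad = 0;
    for (int i = 0; i < N; ++i) {
        float expected = 256.0f * (float)(i % 7);
        if (c[i] != expected) {
            ++bad;
        }
    }

    free(a);
    free(c);
    return bad;
}
```

Call "check_foo()" from "main" and treat a nonzero result as a failure. Building with "nvc -acc" as above runs the loop on the accelerator; without "-acc" the pragmas are ignored and the same check runs on the host, which gives a handy reference result.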