g++ - Can I use `omp_get_thread_num()` on the GPU?

Question

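I have OpenMP code that works on the CPU by having each thread manage memory addressed by the thread's id number, which is available through `omp_get_thread_num()`. This works well on the CPU, but can it work on the GPU?

An MWE is: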

#include <iostream>
#include <omp.h>

int main(){
  const int SIZE = 400000;

  int *m;
  m = new int[SIZE];

  #pragma omp target
  {
    #pragma omp parallel for
    for(int i=0;i<SIZE;i++)
      m[i] = omp_get_thread_num();
  }

  for(int i=0;i<SIZE;i++)
    std::cout<<m[i]<<"\n";
}

score 2 · Accepted Answer

It works fine on the GPU with GCC. You need to map `m`, like this:

#pragma omp target map(tofrom:m[0:SIZE])

I compiled it like this:

g++ -O3 -Wall -fopenmp -fno-stack-protector so.cpp

You can see an example here, running on a system without offloading:

http://coliru.stacked-crooked.com/a/1e756410d6e2db61

A method I use to find out the number of teams and threads before doing any work is:

#pragma omp target teams defaultmap(tofrom:scalar)
{
    nteams = omp_get_num_teams();
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();
}

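On my system, with GCC 7.2, Ubuntu 17.10, gcc-offload-nvptx, and a GTX 1060, I get nteams = 30 and nthreads = 8. See this answer, where I use threads and teams to do a custom reduction over a target region. With -foffload=disable I get nteams = 1 and nthreads = 8 (a CPU with 4 cores/8 hardware threads).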

I added -fopt-info to the compile options, and the only message I get is:

note: basic block vectorized

score 1 · Accepted Answer

The answer seems to be no.

Compiling with PGI:

pgc++ -fast -mp -ta=tesla,pinned,cc60 -Minfo=all test2.cpp

gives:

13, Parallel region activated
    Parallel loop activated with static block schedule
    Loop not vectorized/parallelized: contains call
14, Parallel region terminated

whereas compiling with GCC:

g++ -O3 test2.cpp -fopenmp -fopt-info

gives:

test2.cpp:17: note: not vectorized: loop contains function calls or data references that cannot be analyzed
test2.cpp:17: note: bad data references.
