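I'm trying to run a TensorFlow-based project on a cluster. I installed all the relevant dependencies in my anaconda environment, just as I did on the local machine where the project runs, but I get the following error message: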

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: libdevice not found at ./libdevice.10.bc
         [[{{node cluster_2_1/xla_compile}}]]
         [[cluster_1_1/merge_oidx_20/_1]]
  (1) Internal: libdevice not found at ./libdevice.10.bc
         [[{{node cluster_2_1/xla_compile}}]]

Full traceback - https://pastebin.com/njqNFWvC

I can see that libdevice.10.bc, the file the error complains about, is present in /u/usr/anaconda3/envs/Project_BM/lib/.

2021-06-30 08:27:50.484735: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:69] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
2021-06-30 08:27:50.484775: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:70] Searched for CUDA in the following directories:
2021-06-30 08:27:50.484781: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73]   ./cuda_sdk_lib
2021-06-30 08:27:50.484784: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73]   /usr/local/cuda
2021-06-30 08:27:50.484787: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73]   .
2021-06-30 08:27:50.484791: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:75] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work. 

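This part of the traceback makes me think TensorFlow is searching for CUDA in system-wide locations rather than in the conda environment. To fix this, do I need to set XLA_FLAGS to /u/usr/anaconda3/envs/Project_BM/lib/libdevice.10.bc? If not, where can I find the /cuda/ directory in the Project_BM environment?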

It's also worth noting that I'm running this on a cluster, so I don't have root privileges.
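
For concreteness, here is a minimal sketch of one thing that could be tried without root access. It is a guess based only on the log above, not a confirmed fix. The log says XLA looks for `${CUDA_DIR}/nvvm/libdevice`, so the flag presumably needs to name a directory that contains `nvvm/libdevice/libdevice.10.bc`, not the `.bc` file itself. The helper name `prepare_cuda_data_dir` and the `~/xla_cuda_data` location are hypothetical; only the env path comes from the question.

```python
from pathlib import Path


def prepare_cuda_data_dir(libdevice_file, data_dir):
    """Build data_dir/nvvm/libdevice/ and link libdevice_file into it.

    Returns the XLA_FLAGS value that points XLA at data_dir.
    Hypothetical helper: the layout mirrors the path named in the log
    (${CUDA_DIR}/nvvm/libdevice), which is an assumption about what XLA needs.
    """
    libdevice_file = Path(libdevice_file).resolve()
    if not libdevice_file.is_file():
        raise FileNotFoundError(libdevice_file)

    # Everything is created under a user-writable directory, so no root is needed.
    target_dir = Path(data_dir) / "nvvm" / "libdevice"
    target_dir.mkdir(parents=True, exist_ok=True)

    link = target_dir / libdevice_file.name
    # Skip if a previous run already created the link (or a copy of the file).
    if not (link.is_symlink() or link.exists()):
        link.symlink_to(libdevice_file)

    return f"--xla_gpu_cuda_data_dir={Path(data_dir).resolve()}"


# Usage (paths are assumptions taken from the question):
#
#   import os
#   flags = prepare_cuda_data_dir(
#       "/u/usr/anaconda3/envs/Project_BM/lib/libdevice.10.bc",
#       Path.home() / "xla_cuda_data",
#   )
#   os.environ["XLA_FLAGS"] = flags   # set before importing tensorflow
#   import tensorflow as tf
```

The environment variable has to be set before TensorFlow is imported, so the same value could instead be exported in the job script that launches the program.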
