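I'm trying to run a TensorFlow-based project on a cluster. I installed all the required dependencies in my Anaconda environment, exactly as I did on the local machine where the project runs fine, but I get the following error message: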
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: libdevice not found at ./libdevice.10.bc
[[{{node cluster_2_1/xla_compile}}]]
[[cluster_1_1/merge_oidx_20/_1]]
(1) Internal: libdevice not found at ./libdevice.10.bc
[[{{node cluster_2_1/xla_compile}}]]
Full traceback: https://pastebin.com/njqNFWvC
In /u/usr/anaconda3/envs/Project_BM/lib/ I can see that libdevice.10.bc is present.
This is the part of the log where I think the problem is:
2021-06-30 08:27:50.484735: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:69] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
2021-06-30 08:27:50.484775: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:70] Searched for CUDA in the following directories:
2021-06-30 08:27:50.484781: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73] ./cuda_sdk_lib
2021-06-30 08:27:50.484784: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73] /usr/local/cuda
2021-06-30 08:27:50.484787: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:73] .
2021-06-30 08:27:50.484791: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:75] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
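This part of the traceback makes me think TensorFlow is searching for CUDA in system-wide locations rather than in the conda environment. To fix this, do I need to set XLA_FLAGS to /u/usr/anaconda3/envs/Project_BM/lib/libdevice.10.bc? If not, where can I find the /cuda/ directory inside the Project_BM environment?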
It's also worth noting that I'm running this on a cluster, so I don't have root access.
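
For reference, here is a minimal sketch of what the hint in the log seems to ask for. The warning refers to ${CUDA_DIR}/nvvm/libdevice, which suggests that xla_gpu_cuda_data_dir should name a directory containing nvvm/libdevice/ rather than the libdevice.10.bc file itself. The helpers find_cuda_data_dir and xla_flags_for are hypothetical names of my own, and the candidate directories are guesses; only the XLA_FLAGS variable and the --xla_gpu_cuda_data_dir flag come from the log output.

```python
import os
from pathlib import Path


def find_cuda_data_dir(candidates):
    """Return the first candidate that contains nvvm/libdevice/libdevice*.bc.

    Hypothetical helper: it only checks for the directory layout that the
    XLA warning mentions (${CUDA_DIR}/nvvm/libdevice).
    """
    for root in candidates:
        libdevice_dir = Path(root) / "nvvm" / "libdevice"
        if libdevice_dir.is_dir() and any(libdevice_dir.glob("libdevice*.bc")):
            return str(Path(root))
    return None


def xla_flags_for(cuda_data_dir, existing=""):
    """Append the flag named in the log to any XLA_FLAGS already set."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_data_dir}"
    return f"{existing} {flag}".strip()


if __name__ == "__main__":
    # Assumption: the conda environment prefix is available as CONDA_PREFIX;
    # the fallback is the environment path from this post.
    env_prefix = os.environ.get("CONDA_PREFIX", "/u/usr/anaconda3/envs/Project_BM")
    # Assumption: these are plausible places to look, not a documented list.
    candidates = [env_prefix, os.path.join(env_prefix, "lib"), "/usr/local/cuda"]

    cuda_dir = find_cuda_data_dir(candidates)
    if cuda_dir is None:
        print("No candidate contains nvvm/libdevice/libdevice*.bc")
    else:
        # Must be set before TensorFlow is imported in this process.
        os.environ["XLA_FLAGS"] = xla_flags_for(cuda_dir, os.environ.get("XLA_FLAGS", ""))
        print("XLA_FLAGS =", os.environ["XLA_FLAGS"])
```

With libdevice.10.bc sitting directly in lib/, as in my environment, none of these candidates would match, so the sketch would report that nothing was found. None of this needs root access, since it only reads directories and sets an environment variable for the current process.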