python - Error when using GPU-based remote execution with TensorFlow Federated

Question

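I am experimenting with the remote executor runtime, using the example provided at this link: https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/examples/remote_executor_example.py

With CPU-based TensorFlow, everything works fine. With GPU-based TensorFlow, however, the following error occurs and execution aborts: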

2020-03-29 16:27:22.904103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-29 16:27:22.904807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 978 MB memory) -> physical GPU (device: 0, name: GRID V100DX-32C, pci bus id: 0000:02:00.0, compute capability: 7.0)
2020-03-29 16:27:22.995000: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: No unary variant device copy function found for direction: 1 and Variant type_index: tensorflow::data::(anonymous namespace)::DatasetVariantWrapper
[[{{node partitionedcall_args_0/_2}}]]

How can I fix this? Has anyone run into a similar problem?

score 1 · Accepted Answer

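As of this commit, this issue should be fixed in TFF. Options for working around it in the meantime include:

Build TFF from master with Bazel, as described here.
Wait for the next pip package release, which is planned for next week.
Manually edit site-packages on the remote workers to explicitly pin dataset instantiation to the CPU, as in the linked change.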
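
The linked change itself is not reproduced in this post, so the following is only a minimal sketch of the general idea behind the third option: wrapping dataset construction in a `tf.device` scope so the dataset variant is created on the CPU and never needs a host-to-GPU copy. The helper name `make_dataset_on_cpu` and the sample values are hypothetical and are not taken from the TFF source.

```python
import tensorflow as tf


def make_dataset_on_cpu(values):
    """Builds a tf.data.Dataset with its construction pinned to the CPU.

    Hypothetical helper for illustration; the actual TFF fix may differ.
    """
    # Ops created inside this scope are placed on the CPU, even when a GPU
    # is visible to the process.
    with tf.device("/cpu:0"):
        return tf.data.Dataset.from_tensor_slices(values)


dataset = make_dataset_on_cpu([1, 2, 3])

# Iterate eagerly to confirm the dataset is usable after pinned construction.
elements = [int(x) for x in dataset.as_numpy_iterator()]
print(elements)
```

Computation on the tensors drawn from the dataset can still run on the GPU; only the dataset object itself stays on the CPU, which is the usual placement for `tf.data` pipelines.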
