pytorch - RPC failed with status = "Unavailable: Socket closed" error when training FairSeq RoBERTa on Cloud TPU using PyTorch

Question

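I followed the tutorial "Pre-training FairSeq RoBERTa on Cloud TPU using Pytorch" to set up a preemptible (v2-8) TPU environment and train my RoBERTa model. The PyTorch environment is based on torch-xla-1.6, as the documentation instructs. However, it does not print any training logs the way it normally does on GPU, and it has thrown the RPC failure warning below twice within 2-3 days, 12 hours apart (the network endpoint has been removed from the message).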

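My run has 161,529 training steps per epoch. According to the documentation, a v2-8 should take 80 hours for 5 epochs with my configuration. However, my job appears to be hanging.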
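As a rough sanity check on those numbers, the sketch below works out the throughput the 80-hour estimate implies and how many steps should be finished after a given number of hours. It uses only the figures quoted above; the helper name `expected_steps` is made up for illustration, and the calculation assumes a constant step rate, which a real run will not have.

```python
# Figures quoted in the question.
STEPS_PER_EPOCH = 161_529
EPOCHS = 5
ESTIMATED_HOURS = 80  # documented estimate for a v2-8

total_steps = STEPS_PER_EPOCH * EPOCHS
steps_per_second = total_steps / (ESTIMATED_HOURS * 3600)
seconds_per_step = 1 / steps_per_second


def expected_steps(elapsed_hours: float) -> int:
    """Steps that should be done after elapsed_hours at the estimated rate."""
    return int(steps_per_second * elapsed_hours * 3600)


print(f"total steps: {total_steps}")
print(f"implied rate: {steps_per_second:.2f} steps/s")
print(f"implied time per step: {seconds_per_step:.3f} s")
print(f"after 12 h: about {expected_steps(12)} steps")
```

If the step counter in the training output is far behind these figures, or never appears at all, the job is probably stalled rather than merely slow.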

Any suggestions?

 W    4566 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1599580717.037250202","description":"Error received from peer ipv4:<my_network_endpoint>:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

score -1 · Accepted Answer


It sounds like your TPU may have been preempted in this case. Please try using a non-preemptible TPU.
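To see what the warning actually reports, the JSON payload in `grpc_error_string` can be decoded. The sketch below is only an illustration: `parse_grpc_error` is a hypothetical helper, and the regular expression assumes the log line has exactly the shape posted in the question. Status 14 is gRPC's UNAVAILABLE code, meaning the connection to the TPU worker on port 8470 was dropped. That is consistent with a preemption, but on its own it does not prove one.

```python
import json
import re
from datetime import datetime, timezone

# The warning from the question, with the endpoint placeholder left as posted.
LOG_LINE = r'''W    4566 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1599580717.037250202","description":"Error received from peer ipv4:<my_network_endpoint>:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC'''


def parse_grpc_error(line: str) -> dict:
    """Extract and decode the JSON object embedded in grpc_error_string."""
    match = re.search(r'grpc_error_string = "(\{.*\})"', line)
    if match is None:
        raise ValueError("no grpc_error_string found in log line")
    return json.loads(match.group(1))


info = parse_grpc_error(LOG_LINE)

# "created" is a Unix timestamp prefixed with "@".
when = datetime.fromtimestamp(float(info["created"].lstrip("@")), tz=timezone.utc)

print("grpc_status:", info["grpc_status"])  # 14 is UNAVAILABLE
print("grpc_message:", info["grpc_message"])
print("time (UTC):", when.isoformat())
```

Comparing the decoded timestamp with the TPU node's state at that time is one way to check whether the warning lines up with a preemption.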

Answered 2020-11-09T00:22:25.677
