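I'm currently trying to run some code on multiple TPU cores in Google Colab. I get an error when the synchronisation call (xm.rendezvous) is made at the end of the target function, but not when it is made at the top. Here is an example: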
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# "Map function": acquires a corresponding Cloud TPU core and prints its
# device and ordinals
def simple_map_fn(index, flags):
    # xm.rendezvous('init')  # placing the rendezvous here instead of at the bottom works fine
    # Acquires the (unique) Cloud TPU core corresponding to this process's index
    device = xm.xla_device()
    ordinal = xm.get_ordinal()
    local_ordinal = xm.get_local_ordinal()
    print(f"index {index}, process device {device}, local ordinal {local_ordinal}, ordinal {ordinal}")
    # Barrier to prevent master from exiting before workers connect.
    xm.rendezvous('leave')

# Spawns eight of the map functions, one for each of the eight cores on
# the Cloud TPU
flags = {}
xmp.spawn(simple_map_fn, args=(flags,), nprocs=8, start_method='fork')
When I run the code above in Google Colab, as in this notebook, I get the following error:
Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'leave': Socket closed (14)
Any idea why the rendezvous fails when it is placed at the bottom of the target function?
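For reference, the behaviour expected from a rendezvous can be sketched with the standard library alone. This is only an analogy under stated assumptions: threading.Barrier stands in for xm.rendezvous, threads stand in for the spawned processes, and run_workers is a hypothetical helper name. It says nothing about how torch_xla implements its mesh service or why the 'leave' rendezvous fails.

```python
import threading


def run_workers(n_workers=8):
    """Run n_workers threads that all meet at one barrier (hypothetical helper).

    Returns the ordered list of ("arrive" | "release", index) events.
    """
    barrier = threading.Barrier(n_workers)  # stand-in for xm.rendezvous (assumption)
    events = []
    lock = threading.Lock()

    def worker(index):
        with lock:
            events.append(("arrive", index))
        # Blocks until all n_workers threads have called wait().
        barrier.wait(timeout=10)
        with lock:
            events.append(("release", index))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for kind, index in run_workers(8):
        print(kind, index)
```

In this sketch no worker is released until every worker has arrived, which is the guarantee the 'leave' barrier at the bottom of simple_map_fn is meant to provide.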