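I'm trying to get dask-kubernetes working with my GKE account. Maddeningly, it used to work, but now it doesn't. I set up a cluster, and the nodes are created just fine. They run for 60 seconds and then time out with the following message (as shown by kubectl logs podname):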
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/dask-worker", line 8, in <module>
sys.exit(go())
File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 446, in go
main()
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 432, in main
loop.run_sync(run)
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
return future_cell[0].result()
File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 426, in run
await asyncio.gather(*nannies)
File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 284, in _
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
I think this means the workers can't connect to the scheduler, which is running on my laptop? But I don't understand why. The ports appear to be open.
from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da

if __name__ == '__main__':
    cluster = KubeCluster.from_yaml('worker-spec-2.yml')
    cluster.scale(1)
    client = Client(cluster)
    array = da.ones((1000, 1000, 1000))
    print(array.mean().compute())
worker-spec-2.yml contains the following:
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 1GB, --death-timeout, '60']
    name: easyvvuq
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: "1"
        memory: 2G
      requests:
        cpu: 500m
        memory: 2G
Again, this, or something very much like it, used to work for me. I may have changed something in worker-spec.yml, but that's all.
My question is: how do I go about diagnosing this? I'm not a Kubernetes expert by any means.
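To make the "ports appear to be open" assumption testable, here is a minimal sketch of a plain TCP reachability check using only the Python standard library. The function name can_connect is hypothetical, and the host and port you pass in are placeholders for whatever scheduler address the workers are given; neither is taken from dask or dask-kubernetes. It only shows whether a TCP connection can be opened, not whether the Dask handshake itself succeeds.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; not part of dask or dask-kubernetes.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running can_connect("<scheduler-host>", <scheduler-port>) from inside a pod in the cluster, rather than from the laptop, would be the relevant test, since a port that is open locally may still be unreachable from GKE.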