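I am trying to get dask-kubernetes working with my GKE account. Maddeningly, it worked at one point, but now it doesn't. I set up a cluster, and the nodes are created fine. They run for 60 seconds and then time out with the following message (as shown by kubectl logs podname):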

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/dask-worker", line 8, in <module>
    sys.exit(go())
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 446, in go
    main()
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 432, in main
    loop.run_sync(run)
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 284, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

I think this means the worker can't connect to the scheduler, which is running on my laptop? But I don't understand why. The port seems to be open.
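
One way to test the "port seems to be open" assumption is a plain TCP connection check. This is only a minimal sketch using the standard library: can_connect is a hypothetical helper name, and the host and port in the example are placeholders (8786 is the default Dask scheduler port, but the real address is whatever the scheduler advertises, for example the value of cluster.scheduler_address). The result is only meaningful when the check is run from the worker's side of the network, such as inside a pod, because a port that is reachable from the laptop itself may still be unreachable from the cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the scheduler address the workers
    # are told to connect to (an assumption, not taken from the logs above).
    print(can_connect("127.0.0.1", 8786))
```

If this returns False from inside a pod for the scheduler's advertised address, the workers cannot reach the scheduler, which would be consistent with the timeout above.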

from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da

if __name__ == '__main__':
    cluster = KubeCluster.from_yaml('worker-spec-2.yml')
    cluster.scale(1)
    client = Client(cluster)
    array = da.ones((1000, 1000, 1000))  
    print(array.mean().compute())

worker-spec-2.yml contains the following:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 1GB, --death-timeout, '60']
    name: easyvvuq
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: "1"
        memory: 2G
      requests:
        cpu: 500m
        memory: 2G

Again, this or something very similar used to work for me. I may have changed something in worker-spec.yml, but that's about it.

My question is: how do I diagnose this? I am not a Kubernetes expert by any means.
