I created my own slightly modified Dockerfile, based on the dask-docker Dockerfile, that installs adlfs and copies one of my custom libraries into the container so that it is available to all worker nodes. I deployed the container to my Kubernetes cluster and connected to it from a REPL on my local machine, creating a client and a function locally:

>>> def add1(n): return n + 1
...
>>> from dask.distributed import Client
>>> client = Client(my_ip + ':8786')

But when I run client.submit, I'm getting either distributed.protocol.pickle "Failed to deserialize b'...'" error messages or Futures stuck in a 'pending' state:

>>> f = client.submit(add1, 2)
>>> distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x05\x95\xba\x03\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support...'
...
ValueError: unsupported pickle protocol: 5
>>>
>>> f = client.submit(add1, 2)
>>> f
<Future: pending, key: add1-d5d2ff94399d4bb4e41150868f4c6da7>

It seems like the pickle protocol error occurs only once, on the first job I submit; after that, everything is just stuck in pending.
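
One way to surface this kind of mismatch is distributed's built-in version check. The sketch below assumes the same client object as above; note the call itself may fail if the client cannot deserialize the scheduler's reply:

>>> # Compare dask/distributed/python versions across the client,
>>> # scheduler, and workers; check=True raises on any mismatch.
>>> client.get_versions(check=True)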

From kubectl, I see that I have:

  • one LoadBalancer service named dask-scheduler,
  • two deployments: 1x dask-scheduler and 3x dask-worker,
  • and the corresponding one dask-scheduler-... and three dask-worker-... pods

What would cause this, and how can I debug? I opened up the web interface to my Dask scheduler, and it shows that I have an instance of add1 that has erred, but it gives no details.
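
For the erred task on the dashboard, the usual way to get details is to pull the exception and traceback back through the client, and to ask the workers directly what they are running. A sketch, reusing the f and client from above (these calls may themselves hang or fail while the client cannot deserialize responses):

>>> f.exception()    # the remote exception, if the scheduler can send it back
>>> f.traceback()    # the remote traceback object
>>> # Check which Python each worker is actually running
>>> client.run(lambda: __import__("sys").version)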

For what it's worth, the only changes I made to the Dockerfile were:

    # ...
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs

RUN pip install adlfs==0.3.0          # new line

COPY prepare.sh /usr/bin/prepare.sh   # existing line
COPY foobar.sh /usr/bin/foobar.sh     # new line
COPY my_file.so /usr/bin/my_file.so   # new line

Edit: I'll note that if I deploy the stock Dask image (image: "daskdev/dask:2.11.0" in my K8s manifest), things work just fine. So something about my customized image seems to leave Dask misconfigured. I commented out my changes to the Dockerfile, ran docker rmi on my local and ACR images, tore down my deployed service and deployments, then rebuilt the container, pushed it, and redeployed, but it still fails.

1 Answer

It looks like the problem was the difference between the Dask image I could deploy successfully and the image built from my own Dockerfile. The former (2.11.0) has Dask 2.11.0 baked in, while the latter was baking in Dask 2.16.0 and Python 3.8. Some difference across those versions causes the problem.

When I updated my Dockerfile to use 2.11.0 and removed the explicit Python dependency, everything worked fine.
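
A quick way to confirm the rebuilt image agrees with the client is the same version check as earlier (a sketch, reusing the client and add1 from above):

>>> client.get_versions(check=True)   # should no longer raise
>>> client.submit(add1, 2).result()   # expect 3 once the versions match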

Answered 2020-05-26T23:17:09.480