I created my own very slightly modified Dockerfile
based on the dask-docker
Dockerfile
that installs adlfs
and copies one of my custom libraries into the container in order to make it available to all worker nodes. I deployed my container to my Kubernetes cluster and have connected to it from a REPL on my local machine, creating a client and a function locally:
>>> def add1(n): return n + 1
...
>>> client = Client(my_ip + ':8786')
But when I run client.submit
, I'm getting either distributed.protocol.pickle
"Failed to deserialize b'...'" error messages or Futures stuck in a 'pending' state:
>>> f = client.submit(add1, 2)
>>> distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x05\x95\xba\x03\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support...'
...
ValueError: unsupported pickle protocol: 5
>>>
>>> f = client.submit(add1, 2)
>>> f
<Future: pending, key: add1-d5d2ff94399d4bb4e41150868f4c6da7>
It seems like the pickle protocol error will occur only once when I submit the first job, then afterward, everything is just stuck in pending
.
From kubectl
, I see that I have:
- one
LoadBalancer
service nameddask-scheduler
, - two deployments: 1x
dask-scheduler
and 3xdask-worker
, - and the corresponding one
dask-scheduler-...
and threedask-worker-...
pods
What would cause this, and how can I debug? I opened up the web interface to my Dask scheduler, and it shows that I have an instance of add1
that has erred, but it gives no details.
For what it's worth, the only changes I made to the Dockerfile
were:
# ...
&& find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
&& rm -rf /opt/conda/pkgs
RUN pip install adlfs==0.3.0 # new line
COPY prepare.sh /usr/bin/prepare.sh # existing line
COPY foobar.sh /usr/bin/foobar.sh # new line
COPY my_file.so /usr/bin/my_file.so # new line
Edit: I'll note that if I deploy the Dask image (image: "daskdev/dask:2.11.0"
in my K8s manifest), things work just fine. So in trying to create a customized Docker image, something seems to get misconfigured with Dask. I commented out my changes to the Dockerfile
, ran docker rmi
on my local and ACR images, tore down my deployed service and deployments, then rebuilt a container, pushed it, and made the deployment, but it still fails.