I created a slightly modified Dockerfile, based on the dask-docker Dockerfile, that installs adlfs and copies one of my custom libraries into the container so it is available to all worker nodes. I deployed the container to my Kubernetes cluster and connected to it from a REPL on my local machine, creating a client and a function locally:
>>> def add1(n): return n + 1
...
>>> client = Client(my_ip + ':8786')
But when I run client.submit, I'm getting either distributed.protocol.pickle "Failed to deserialize b'...'" error messages or Futures stuck in a 'pending' state:
>>> f = client.submit(add1, 2)
>>> distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x05\x95\xba\x03\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support...'
...
ValueError: unsupported pickle protocol: 5
>>>
>>> f = client.submit(add1, 2)
>>> f
<Future: pending, key: add1-d5d2ff94399d4bb4e41150868f4c6da7>
The pickle protocol error seems to occur only once, on the first job I submit; after that, every Future is just stuck in pending.
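My current suspicion is a Python version mismatch: pickle protocol 5 only exists in Python 3.8+, so a client on 3.8 would produce frames that an older worker interpreter cannot read. A minimal local demonstration (no cluster involved; the payload dict is just a stand-in for whatever Dask actually serializes):

```python
import pickle

# Protocol-5 pickles begin with the bytes b'\x80\x05' -- the same prefix
# shown in the "Failed to deserialize b'\x80\x05...'" log line above.
frame = pickle.dumps({"task": "add1", "arg": 2}, protocol=5)
print(frame[:2])  # b'\x80\x05'

# The same (3.8+) interpreter round-trips it fine; a Python < 3.8
# interpreter raises "ValueError: unsupported pickle protocol: 5" instead.
print(pickle.loads(frame))  # {'task': 'add1', 'arg': 2}
```

That would explain why the error message starts with exactly the bytes from my log.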
From kubectl, I see that I have:
- one LoadBalancer service named dask-scheduler,
- two deployments: 1x dask-scheduler and 3x dask-worker,
- and the corresponding one dask-scheduler-... and three dask-worker-... pods.
What would cause this, and how can I debug? I opened up the web interface to my Dask scheduler, and it shows that I have an instance of add1 that has erred, but it gives no details.
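One check I plan to run is client.get_versions(check=True), which compares the Python and package versions on the client, scheduler, and workers and raises on a mismatch. The comparison it performs amounts to something like the sketch below (the dicts and their keys/values here are made up for illustration, not real get_versions output):

```python
# Hypothetical version info for my client and one worker, illustrating
# the mismatch I suspect between my local REPL and the container image.
client_info = {"python": "3.8.1.final.0", "pickle_protocol": 5}
worker_info = {"python": "3.7.6.final.0", "pickle_protocol": 4}

# Collect every key whose value differs between the two hosts.
mismatches = {key: (client_info[key], worker_info[key])
              for key in client_info
              if client_info[key] != worker_info[key]}
print(sorted(mismatches))  # ['pickle_protocol', 'python']
```

If get_versions reports different Python versions, that would point at my custom image rather than my adlfs/COPY changes themselves.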
For what it's worth, the only changes I made to the Dockerfile were:
# ...
&& find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
&& rm -rf /opt/conda/pkgs
RUN pip install adlfs==0.3.0 # new line
COPY prepare.sh /usr/bin/prepare.sh # existing line
COPY foobar.sh /usr/bin/foobar.sh # new line
COPY my_file.so /usr/bin/my_file.so # new line
Edit: I'll note that if I deploy the stock Dask image (image: "daskdev/dask:2.11.0" in my K8s manifest), everything works fine. So in building a customized image, something seems to get misconfigured with Dask. I commented out my changes to the Dockerfile, ran docker rmi on my local and ACR images, tore down the deployed service and deployments, then rebuilt the container, pushed it, and redeployed, but it still fails.