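I am facing the following problem.
This is the setup:
- A Docker container whose base image is python:3.7.8-stretch
- A local environment on Ubuntu 20.04
My goal is to launch several Jupyter notebooks in parallel inside the Docker container, using Python and the Papermill package.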
Dockerfile:
FROM python:3.7.8-stretch
WORKDIR /opt/app
COPY requirements.txt /opt/app/requirements.txt
RUN python3.7 -m pip install -r requirements.txt
COPY src/* /opt/app/
COPY src/notebooks /opt/app/notebooks
CMD ["python", "main.py"]
Here is the code (main.py):
import os
import time

import multiprocess as mp
import papermill as pm


def callback(result):
    print(f"{result} ended successfully.")


def main(*args, **kwargs):
    print(kwargs)
    print(args)
    result = pm.execute_notebook(**kwargs)
    return result


if __name__ == "__main__":
    package_name = os.getenv("PACKAGE_FOLDER_NAME")
    folder_name = f"notebooks/{package_name}"
    pool = mp.Pool(processes=2)
    # For every file in the folder, we are going to execute the notebook.
    # One nice addition is that you can choose which notebooks run inside a
    # package folder by only specifying parameters for those notebooks.
    notebook_parameters = eval(os.getenv("NOTEBOOK_PARAMETERS"))
    start_time = time.time()
    for filename in os.listdir(folder_name):
        if filename in notebook_parameters.keys():
            output_filename = filename.replace(".ipynb", "_output.ipynb")
            result = pool.apply_async(
                main,
                kwds={"input_path": folder_name + "/" + filename,
                      "output_path": folder_name + "/output_notebooks" + "/" + output_filename,
                      "parameters": notebook_parameters[filename],
                      "log_output": True},
                callback=callback)
            print(f"notebook {filename} has started.")
            # This prevents race conditions while creating the Jupyter notebook kernels.
            time.sleep(2)
    pool.close()
    pool.join()
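For comparison, here is a minimal sketch of how the same loop could keep each AsyncResult and wait on it with a timeout, so that a worker exception or a stuck notebook becomes visible instead of being hidden behind pool.join(). It is not my actual script: it uses the standard-library multiprocessing module rather than multiprocess, run_notebook is a hypothetical stand-in for the pm.execute_notebook call, and the fork start method and the timeout value are assumptions.

```python
import multiprocessing as mp


def run_notebook(input_path, fail=False):
    # Hypothetical stand-in for pm.execute_notebook(**kwargs).
    if fail:
        raise RuntimeError(f"{input_path} failed")
    return input_path


def run_all(jobs, timeout=60):
    """Run every job in a pool of 2 workers and return {input_path: outcome}."""
    # Assumption: a Linux host (as in the container), where "fork" is available.
    ctx = mp.get_context("fork")
    outcomes = {}
    with ctx.Pool(processes=2) as pool:
        pending = {
            kwargs["input_path"]: pool.apply_async(run_notebook, kwds=kwargs)
            for kwargs in jobs
        }
        for path, async_result in pending.items():
            try:
                # get() re-raises worker exceptions; without it they stay silent
                # unless an error_callback is passed to apply_async.
                outcomes[path] = ("ok", async_result.get(timeout=timeout))
            except mp.TimeoutError:
                outcomes[path] = ("timeout", None)
            except Exception as exc:
                outcomes[path] = ("error", str(exc))
    return outcomes


if __name__ == "__main__":
    jobs = [{"input_path": "a.ipynb"}, {"input_path": "b.ipynb", "fail": True}]
    print(run_all(jobs))
```

With this shape, a notebook that never returns shows up as a "timeout" entry after the deadline rather than blocking the script forever, which would at least tell me which notebook is responsible.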
This is the docker command I use to run the container (I am not listing all the environment variables because I lost my terminal history, sorry):
docker run -e ${environment_variable} rental-metrics-aggs:latest
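For context, NOTEBOOK_PARAMETERS is expected to hold a Python dict literal keyed by notebook filename. Below is a minimal sketch of how such a value could be parsed with ast.literal_eval rather than eval. The helper name and the value shown are hypothetical, for illustration only, since I no longer have the real one.

```python
import ast
import os


def load_notebook_parameters(env_var="NOTEBOOK_PARAMETERS"):
    """Parse a dict literal from an environment variable without executing code."""
    raw = os.getenv(env_var)
    if raw is None:
        raise KeyError(f"{env_var} is not set")
    # literal_eval only accepts Python literals (dicts, lists, strings, numbers...).
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise TypeError(f"{env_var} must be a dict literal, got {type(parsed).__name__}")
    return parsed


if __name__ == "__main__":
    # Hypothetical value, for illustration only.
    os.environ["NOTEBOOK_PARAMETERS"] = "{'example.ipynb': {'start_date': '2020-01-01'}}"
    print(load_notebook_parameters())
```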
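I have already tried a few things, such as the concurrent package, but it throws an error at the end even though the notebooks finish executing.
The strange thing about this code is that when I run it locally (I launch the script both from PyCharm and from the terminal), it completes the execution of all the notebooks and exits, but in the Docker container it does not: the process keeps hanging even after the notebooks have finished. I have confirmed that they finish, because I can see the results of the API calls they make.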
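To find out where the process is stuck inside the container, one option is the standard-library faulthandler module, which can dump the stack of every thread if the script is still running after a deadline. A minimal sketch follows. The helper names and the 600-second deadline are assumptions, and the dump only covers the process that arms it (the parent), not the pool workers.

```python
import faulthandler
import sys


def arm_hang_watchdog(seconds=600, stream=sys.stderr):
    """Dump all thread tracebacks to `stream` every `seconds` while the process is alive."""
    faulthandler.dump_traceback_later(seconds, repeat=True, file=stream, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump once the work has finished normally."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    arm_hang_watchdog(seconds=600)
    # ... pool setup, apply_async calls, pool.close(), pool.join() would go here ...
    disarm_hang_watchdog()
```

If the script hangs in pool.join(), the dump printed to stderr (visible through docker logs) should show that frame, which would separate a stuck worker from a problem in the parent process.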
So, if there is anything I can improve in the question, please let me know.