6

I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.

Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.

One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.

So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?

Appreciate all the help


2 Answers

3

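I'm not 100% sure this is correct, but it's the first thing I would try. When you call luigi.run, pass it --scheduler-remove-delay. My guess is that this is how long the scheduler waits, once all of a task's dependencies have completed, before it forgets the task. If you look at Luigi's source, the default is 600 seconds. For example: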

luigi.run(["--workers", "8", "--scheduler-remove-delay", "86400"], main_task_cls=task_name)
answered 2015-09-29T15:40:46.267
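To make the 86400 less of a magic number, here is a minimal sketch that builds the same argument list from a duration. The helper name build_luigi_args is hypothetical (it is not part of Luigi), and the sketch only assembles the list; the luigi.run call itself is left as a comment because task_name comes from your own pipeline.

```python
from datetime import timedelta
from typing import List


def build_luigi_args(workers: int, keep_for: timedelta) -> List[str]:
    """Hypothetical helper: build the argument list passed to luigi.run.

    keep_for is converted to whole seconds, the unit the answer above
    uses for --scheduler-remove-delay.
    """
    return [
        "--workers", str(workers),
        "--scheduler-remove-delay", str(int(keep_for.total_seconds())),
    ]


# One day of retention, matching the value used in the answer.
args = build_luigi_args(workers=8, keep_for=timedelta(days=1))
print(args)

# With Luigi installed, the list would be passed on as in the answer:
# luigi.run(args, main_task_cls=task_name)
```

Changing keep_for to timedelta(days=7), for instance, would ask for a week of retention, assuming the flag behaves the way this answer guesses.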
2

If you configure the remove_delay setting in luigi.cfg, the scheduler will keep tasks around for longer.

[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400

Note that there is a typo in the docs ("remove-delay" instead of "remove_delay"), which is being fixed under https://github.com/spotify/luigi/issues/2133
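As a quick sanity check on the [scheduler] values above, the following sketch parses the same settings with Python's standard configparser and converts remove_delay to hours. It does not use Luigi's own configuration loader; the inline CFG_TEXT string stands in for luigi.cfg, and the 600-second default is the figure quoted in the other answer, not something verified here.

```python
import configparser

# Stand-in for the contents of luigi.cfg (copied from the answer above).
CFG_TEXT = """
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
"""

# Default quoted in the other answer; treated as an assumption here.
ASSUMED_DEFAULT_REMOVE_DELAY = 600

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)
scheduler = parser["scheduler"]

remove_delay = scheduler.getint("remove_delay")                # seconds
record_history = scheduler.getboolean("record_task_history")

retention_hours = remove_delay / 3600
default_minutes = ASSUMED_DEFAULT_REMOVE_DELAY / 60

print(f"remove_delay = {remove_delay} s ({retention_hours:g} h)")
print(f"assumed default = {ASSUMED_DEFAULT_REMOVE_DELAY} s ({default_minutes:g} min)")
print(f"record_task_history = {record_history}")
```

Note that the key must be spelled remove_delay with an underscore; a section containing remove-delay would make the getint("remove_delay") lookup above return None rather than the intended value.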

answered 2017-06-22T20:37:17.273