
I'm used to running pipelines via AWS Data Pipeline, but I'm getting familiar with Airflow (Cloud Composer).

In Data Pipeline we would:

  • Spawn a task runner,
  • Bootstrap it,
  • Do work,
  • Kill the task runner.

I just realized that my Airflow runners are not ephemeral. I touched a file in /tmp, did it again in a separate DagRun, then listed /tmp and found both files. I expected to see only the one I had most recently touched.
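The check I ran looked roughly like this (DAG and task IDs are made up for this post, and the import paths may differ depending on your Airflow version):

```python
# Rough sketch of the /tmp persistence check; names are illustrative only.
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def touch_and_list(**context):
    # Drop a marker file named after this DagRun, then list /tmp.
    run_id = context["run_id"].replace(":", "_").replace("+", "_")
    open(f"/tmp/marker_{run_id}", "w").close()
    print(subprocess.check_output(["ls", "/tmp"]).decode())


with DAG(
    dag_id="tmp_persistence_check",
    start_date=datetime(2020, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="touch_and_list_tmp", python_callable=touch_and_list)
```

Trigger it manually twice and the second run's log lists both marker files, which is what surprised me.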

This seems to mean I need to be careful about how much "stuff" gets stored locally on the runner.

I know the /data folder is FUSE-mounted to GCS, so I'm defaulting to storing a lot of my working files there and moving files from there to final buckets elsewhere, but how do you approach this? What would be "best practice"?
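The hand-off I have in mind is roughly this (bucket names and object paths are placeholders, and the operator import path may differ by Airflow/provider version):

```python
# Sketch only: DAG ID, bucket names, and object paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(
    dag_id="publish_from_data_folder",
    start_date=datetime(2020, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    publish_output = GCSToGCSOperator(
        task_id="publish_output",
        # The environment bucket's data/ prefix is what appears locally
        # at /home/airflow/gcs/data on the workers.
        source_bucket="my-composer-environment-bucket",
        source_object="data/output/report.csv",
        destination_bucket="my-final-bucket",
        destination_object="reports/report.csv",
        move_object=True,  # delete the source object after the copy
    )
```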

Thanks for the advice.


1 Answer


Cloud Composer currently uses CeleryExecutor, which configures persistent worker processes that handle the execution of task instances. As you have discovered, you can make changes to the filesystems of the Airflow workers (which are Kubernetes pods), and they will indeed persist until the pod is restarted/replaced.

Best-practice-wise, you should treat the local filesystem as ephemeral, scoped to the task instance's lifetime, but you shouldn't expect it to be cleaned up for you. If you have tasks that perform heavy I/O, perform it outside of /home/airflow/gcs, because that path is network-mounted (GCSFUSE); if there is final data you want to persist, write it to /data.
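A minimal sketch of that pattern (paths, DAG/task IDs, and import paths here are illustrative and may vary by Airflow version): do the heavy scratch I/O in a local temp directory that the task cleans up itself, and copy only the final artifact onto the mounted data folder.

```python
# Illustrative sketch only; names and paths are not from the original post.
import os
import shutil
import tempfile
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# GCSFUSE-mounted path: anything written here persists in the environment's bucket.
FINAL_DIR = "/home/airflow/gcs/data/output"


def build_report(**_):
    # Heavy, scratch I/O stays on the worker's local disk, not on the mount.
    scratch = tempfile.mkdtemp(prefix="report_")
    try:
        local_file = os.path.join(scratch, "report.csv")
        with open(local_file, "w") as f:
            f.write("col_a,col_b\n1,2\n")  # stand-in for the real heavy work
        # Only the final artifact is copied onto the mounted data folder.
        os.makedirs(FINAL_DIR, exist_ok=True)
        shutil.copy(local_file, os.path.join(FINAL_DIR, "report.csv"))
    finally:
        # The worker filesystem persists between runs, so clean up explicitly.
        shutil.rmtree(scratch)


with DAG(
    dag_id="scratch_then_publish",
    start_date=datetime(2020, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="build_report", python_callable=build_report)
```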

answered 2020-03-28T02:24:39.793