I'm used to running pipelines via AWS Data Pipeline but am getting familiar with Airflow (Cloud Composer).
In Data Pipeline we would:
- Spawn a task runner,
- Bootstrap it,
- Do work,
- Kill the task runner.
I just realized that my Airflow workers are not ephemeral. I touched a file in /tmp, did it again in a separate DagRun, then listed /tmp and found two files. I expected to see only the one I had most recently touched.
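
Here's roughly what I did to check, as a minimal sketch assuming Airflow 2.x; the DAG id and the marker filename are just made up for illustration:

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def touch_and_list():
    # Create a uniquely named marker file on the worker's local disk.
    marker = f"/tmp/marker_{datetime.utcnow().strftime('%Y%m%dT%H%M%S')}"
    open(marker, "a").close()
    # Markers left over from previous DagRuns still show up here, because the
    # worker (and its /tmp) outlives the DagRun.
    print(sorted(f for f in os.listdir("/tmp") if f.startswith("marker_")))


with DAG(
    dag_id="tmp_persistence_check",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="touch_and_list", python_callable=touch_and_list)
```

Running this twice and checking the task logs is how I ended up with two marker files.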
This seems to mean I need to watch how much "stuff" gets stored locally on the worker.
I know Cloud Composer mounts the environment bucket's data/ folder with FUSE (at /home/airflow/gcs/data), so I'm defaulting to storing a lot of my working files there and then moving them from there to their final buckets elsewhere (a rough sketch of that is below). How do you approach this? What would be best practice?
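
This is roughly what my current approach looks like, as a sketch assuming Airflow 2.x with the Google provider package installed; the bucket names, object paths, and DAG id are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

# Cloud Composer maps gs://<env-bucket>/data to this local path via gcsfuse.
DATA_DIR = "/home/airflow/gcs/data"


def build_working_file():
    # Writing here lands the file in the environment bucket, not on worker disk.
    with open(f"{DATA_DIR}/output.csv", "w") as fh:
        fh.write("col_a,col_b\n1,2\n")


with DAG(
    dag_id="stage_then_publish",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage", python_callable=build_working_file)

    # Move the staged object from the environment bucket to the final bucket.
    publish = GCSToGCSOperator(
        task_id="publish",
        source_bucket="my-composer-env-bucket",  # placeholder
        source_object="data/output.csv",
        destination_bucket="my-final-bucket",    # placeholder
        destination_object="exports/output.csv",
        move_object=True,  # delete the staged copy after the move
    )

    stage >> publish
```

It works, but I'm not sure staging everything through the environment bucket is the intended pattern.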
Thanks for the advice.