I have multiple Dataflow jobs that run daily: they pick up data from GCS, apply transformations, and load the results into BigQuery. I recently noticed that the temp files being created (essentially the newline-delimited JSON that gets loaded into BigQuery) are all saved under the temp location I specify + /bq_load. I would expect these files to be deleted, or at least soft deleted, but instead they appear to stay there forever. I've experimented with DirectRunner and DataflowRunner from my local machine, and both left the temp files in place. Same with the jobs scheduled via Airflow using templates I built. Is this expected behavior, meaning I need to set up some sort of cleanup in my Airflow DAG, or is it unexpected?
Hi aorso-as,
Welcome to the Google Cloud Community!
When using Dataflow’s BigQuery integration, temporary files are created in your Cloud Storage temp_location/bq_load directory. This is expected behavior to support job reliability and retries. However, these files are not automatically removed, so implementing a cleanup strategy is essential.
Use a Cloud Storage Object Lifecycle Management rule with a prefix condition on the bq_load path to automatically delete files older than a certain age (e.g., 1-7 days). This is simple, cost-effective, and reliable.
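A minimal sketch of setting such a rule with the google-cloud-storage Python client. The bucket name, prefix, and 7-day age below are placeholders, and the matches_prefix condition assumes a reasonably recent client library version:

```python
from google.cloud import storage

# Placeholder values -- substitute the bucket and path from your temp_location.
BUCKET_NAME = "my-dataflow-temp-bucket"
BQ_LOAD_PREFIX = "temp/bq_load/"  # <temp_location path>/bq_load/

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

# Add a delete rule scoped to the bq_load prefix: objects older than 7 days
# are removed automatically by Cloud Storage.
bucket.add_lifecycle_delete_rule(age=7, matches_prefix=[BQ_LOAD_PREFIX])
bucket.patch()  # persist the updated lifecycle configuration on the bucket
```

You can also define the same rule in the Cloud Console or with gcloud; the key point is scoping it to the bq_load prefix so other objects in the bucket are untouched.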
Alternatively, you can use an Airflow DAG task to delete the files after the Dataflow job completes successfully. This offers greater control and flexibility but involves additional configuration and maintenance.
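For the Airflow route, a sketch using the GCSDeleteObjectsOperator from the Google provider package. The DAG id, bucket, and prefix are placeholders; in practice the cleanup task would live in your existing Dataflow DAG, downstream of the task that launches the job:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator

with DAG(
    dag_id="dataflow_bq_load_cleanup",          # placeholder DAG for illustration
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    cleanup_bq_load_temp = GCSDeleteObjectsOperator(
        task_id="cleanup_bq_load_temp",
        bucket_name="my-dataflow-temp-bucket",  # bucket from your temp_location
        prefix="temp/bq_load/",                 # deletes every object under bq_load
    )

    # Chain it after your Dataflow task so cleanup only runs on success, e.g.:
    # run_dataflow_job >> cleanup_bq_load_temp
```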
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Got it, thank you @nikacalupas. I do feel like this should be mentioned more explicitly somewhere (unless I missed it), because it has the potential to really rack up storage costs if you don't catch it. I don't see why temp files need to be kept around like this.