
Copy files from GCS folder to GCS folder and write to BigQuery

Hello all,

I have files that reside in a GCS folder. I need to check whether a certain number of files have been uploaded and haven't been processed yet (based on a BigQuery table). Once those files are uploaded, I need to copy them immediately to another GCS bucket, and then write a new row to a BigQuery table. What is the best approach for this use case? My initial thought was to use Pub/Sub or Cloud Functions, or a combination of both. Can I use Dataflow for this use case? Python is the preferred language.

What do you think?


Hi Tzachi_Israel,

Welcome to the Google Cloud Community!

It sounds like you want to detect new file uploads in Google Cloud Storage, check their status against a BigQuery table, copy the unprocessed files to another bucket, and record them in BigQuery. Both Pub/Sub + Cloud Functions and Dataflow can handle this. Here is a breakdown of the options, along with a recommendation:

Pub/Sub + Cloud Functions Approach:

Dataflow Approach:

  • Pub/Sub for notifications: As in the Cloud Functions approach, configure GCS event notifications to publish a message to a Pub/Sub topic whenever a file is uploaded.
  • Develop a streaming pipeline: Use Apache Beam to build the data processing pipeline. A Beam pipeline can be either batch or streaming; since file upload events arrive in real time, a streaming pipeline is most likely the best choice. This pipeline would:
    • Process the file upload events.
    • Perform lookups in BigQuery to check whether the file has been processed.
    • Copy the file from the source GCS bucket to the target GCS bucket.
    • Insert a row into BigQuery to mark the file as processed.
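
The four steps above can be sketched in plain Python as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: the function names `handle_upload_event` and `make_gcp_adapters`, the tracking-table columns `source_bucket`, `file_name`, and `target_bucket`, and the assumption that each notification carries a JSON payload with `bucket` and `name` fields are all illustrative. The core logic takes its GCS and BigQuery operations as plain callables, so the same function could be called from a Beam `DoFn` or from a Cloud Function. The adapters assume the `google-cloud-storage` and `google-cloud-bigquery` packages are installed, and the sketch does not guard against two events for the same file being handled concurrently.

```python
import json
from typing import Callable


def handle_upload_event(
    message_data: bytes,
    is_processed: Callable[[str, str], bool],
    copy_object: Callable[[str, str, str], None],
    record_processed: Callable[[str, str, str], None],
    target_bucket: str,
) -> bool:
    """Run the four steps for one upload notification.

    Returns True if the file was copied and recorded, False if it was skipped.
    """
    # Step 1: process the upload event (assumed to be JSON object metadata).
    meta = json.loads(message_data.decode("utf-8"))
    bucket, name = meta["bucket"], meta["name"]

    # Step 2: look up the file in the tracking table; skip it if already done.
    if is_processed(bucket, name):
        return False

    # Step 3: copy the file from the source bucket to the target bucket.
    copy_object(bucket, name, target_bucket)

    # Step 4: insert a row to mark the file as processed.
    record_processed(bucket, name, target_bucket)
    return True


def make_gcp_adapters(table_id: str):
    """Build the three callables on top of the Google Cloud client libraries.

    table_id is a hypothetical tracking table such as "project.dataset.table".
    """
    # Imported lazily so the core logic above can be tested without the SDKs.
    from google.cloud import bigquery, storage

    bq = bigquery.Client()
    gcs = storage.Client()

    def is_processed(bucket: str, name: str) -> bool:
        sql = (
            f"SELECT 1 FROM `{table_id}` "
            "WHERE source_bucket = @bucket AND file_name = @name LIMIT 1"
        )
        job_config = bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("bucket", "STRING", bucket),
                bigquery.ScalarQueryParameter("name", "STRING", name),
            ]
        )
        rows = bq.query(sql, job_config=job_config).result()
        return next(iter(rows), None) is not None

    def copy_object(bucket: str, name: str, target_bucket: str) -> None:
        source = gcs.bucket(bucket)
        source.copy_blob(source.blob(name), gcs.bucket(target_bucket), name)

    def record_processed(bucket: str, name: str, target_bucket: str) -> None:
        row = {
            "source_bucket": bucket,
            "file_name": name,
            "target_bucket": target_bucket,
        }
        errors = bq.insert_rows_json(table_id, [row])
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")

    return is_processed, copy_object, record_processed
```

Keeping the decision logic separate from the client calls makes it possible to test the skip-or-copy behavior with in-memory stand-ins before wiring in real buckets and tables.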

Both Pub/Sub + Cloud Functions and Dataflow can achieve your goal, but the best choice depends on the scale and complexity of your system. I recommend starting with Pub/Sub + Cloud Functions for simplicity and ease of use, especially if you're processing a moderate number of files. If you expect the volume of files to grow or need more complex processing, Dataflow could be a more scalable solution.
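
If you start with Pub/Sub + Cloud Functions, the function's first job is to decode the notification. Below is a minimal sketch, assuming a Pub/Sub-triggered function that receives the message as a dict with a base64-encoded `data` field and an `attributes` map, and assuming the GCS notification was configured with the JSON payload format so that `data` holds the object metadata. The names `parse_gcs_pubsub_event` and `on_upload` are hypothetical, and the BigQuery check, copy, and insert are left as a placeholder.

```python
import base64
import json
from typing import Optional, Tuple


def parse_gcs_pubsub_event(event: dict) -> Optional[Tuple[str, str]]:
    """Return (bucket, object name) for a new upload, or None for other events."""
    attributes = event.get("attributes") or {}
    # GCS notifications label newly created objects with OBJECT_FINALIZE.
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None
    # The Pub/Sub message body arrives base64-encoded; decode it to JSON.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return payload["bucket"], payload["name"]


def on_upload(event: dict, context=None) -> None:
    """Hypothetical entry point for a Pub/Sub-triggered Cloud Function."""
    parsed = parse_gcs_pubsub_event(event)
    if parsed is None:
        return  # Ignore deletes, metadata updates, and so on.
    bucket, name = parsed
    # Placeholder: check BigQuery, copy the file, then insert the tracking row.
    print(f"New object gs://{bucket}/{name}")
```

Filtering on the event type first keeps the function from copying or recording anything in response to deletions or metadata updates on the source bucket.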

I hope the above information is helpful.
