
Sequential Read and Write to BigQuery in Dataflow

Our team has several data workflows where we first read a start date from one table in BigQuery and then use that date to read the actual data from another table. A similar pattern occurs at the end: we write the processed data to BigQuery, then extract a timestamp from that data and log it in a separate table.

When running on Dataflow, we typically start with ReadFromBigQuery to fetch the start date. Then we use a ParDo with the Python BigQuery client library to read the actual data, filtered by that start date. At the end of the pipeline, another ParDo writes the data to our sink table, followed by WriteToBigQuery to record the timestamp of the last processed record. Roughly, the pipeline looks like the sketch below.
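To make the shape of the pipeline concrete, here is a minimal sketch of what I mean. The project, dataset, table, and field names are placeholders for illustration only, and the transforms between read and write are omitted:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery
from google.cloud import bigquery


class ReadDataForDate(beam.DoFn):
    """Reads the actual data with the BigQuery client, filtered by the start date."""

    def setup(self):
        self.client = bigquery.Client()

    def process(self, start_date_row):
        # Placeholder query/table; the real workflow uses our own schema.
        query = (
            "SELECT * FROM `my_project.my_dataset.source_table` "
            "WHERE event_date >= @start_date"
        )
        job_config = bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter(
                    "start_date", "DATE", start_date_row["start_date"]
                )
            ]
        )
        for row in self.client.query(query, job_config=job_config):
            yield dict(row.items())


class WriteDataRows(beam.DoFn):
    """Writes processed rows to the sink table with the BigQuery client."""

    def setup(self):
        self.client = bigquery.Client()

    def process(self, row):
        self.client.insert_rows_json("my_project.my_dataset.sink_table", [row])
        # Emit the processed timestamp so the connector can log it downstream.
        yield {"last_processed": row["event_timestamp"]}


with beam.Pipeline() as pipeline:
    start_date = (
        pipeline
        | "ReadStartDate" >> ReadFromBigQuery(
            query="SELECT start_date FROM `my_project.my_dataset.control_table`",
            use_standard_sql=True,
        )
    )
    processed = (
        start_date
        | "ReadActualData" >> beam.ParDo(ReadDataForDate())
        # ... the actual processing steps go here ...
        | "WriteActualData" >> beam.ParDo(WriteDataRows())
    )
    _ = processed | "LogTimestamp" >> WriteToBigQuery(
        "my_project:my_dataset.log_table",
        schema="last_processed:TIMESTAMP",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```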

I'm not sure this is the best approach: the built-in BigQuery connectors end up being used only to read and write a single date, while the actual data is fetched and written by hand inside ParDos rather than through the connectors.

What would you suggest?
