PubSub to BQ - Dataflow Standard Templates to upsert the records

Would like to know whether there is a standard Pub/Sub to BigQuery Dataflow template that inserts a record if it is new and updates the record if it already exists. If not, what is the recommended approach: a UDF or a custom template? Does anyone have any examples?

Solved

3 REPLIES

Unfortunately, there is no standard Pub/Sub to BigQuery Dataflow template that directly supports an insert-or-update (upsert) use case. However, you have a couple of options depending on your specific requirements:

  1. Using a Custom Dataflow Template:

    • This is a flexible option where you design your own Dataflow pipeline using Apache Beam. This pipeline can handle the logic of identifying new versus existing records and perform inserts or updates accordingly. This approach is ideal for complex data processing logic and offers the most control.
    • Resources such as the Apache Beam programming guide and the open-source DataflowTemplates repository on GitHub can help you build a custom template; a minimal pipeline sketch follows this list.
  2. Using BigQuery's Merge Statement:

    • An alternative to UDFs or custom upsert logic inside Dataflow is BigQuery's MERGE statement: stream data into a BigQuery staging table, then periodically run a MERGE to fold the staged rows into the target table (see the example after this list). This SQL-based approach is less real-time than processing directly within a Dataflow pipeline but can be simpler for certain use cases.
    • BigQuery's MERGE statement allows for complex operations like updates, inserts, and deletes in a single operation, based on a source table or query result.
    • More information on the MERGE statement can be found here: BigQuery MERGE Statement
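For the custom-pipeline option (1), here is a minimal Apache Beam (Java) sketch of the staging approach: it streams Pub/Sub messages into a BigQuery staging table, from which a scheduled MERGE (shown next) folds rows into the target table. The subscription, table, and field names (id, payload, ingested_at) are assumptions for illustration, not part of any template.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.time.Instant;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubSubToStagingPipeline {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);
    p.apply("ReadFromPubSub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub")) // assumed name
     .apply("ToTableRow", ParDo.of(new DoFn<String, TableRow>() {
       @ProcessElement
       public void processElement(@Element String msg, OutputReceiver<TableRow> out) {
         // Assumes a flat JSON payload with an "id" key; adapt to your schema.
         JsonObject obj = JsonParser.parseString(msg).getAsJsonObject();
         out.output(new TableRow()
             .set("id", obj.get("id").getAsString())
             .set("payload", msg)
             .set("ingested_at", Instant.now().toString()));
       }
     }))
     .apply("WriteToStaging", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.staging_events") // assumed staging table
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    p.run();
  }
}
```

A pipeline like this can also be packaged as a Flex Template if you want to launch it the same way as the Google-provided templates.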
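For the MERGE option (2), the statement itself might look like the following, run periodically against the hypothetical staging table from the sketch above (for example, via a BigQuery scheduled query or a small job like this one). Table and column names are again assumptions for illustration.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class StagingMerge {
  public static void main(String[] args) throws InterruptedException {
    // Upsert staged rows into the target table. If the staging table can hold
    // several events for the same id, deduplicate to the latest event first:
    // MERGE requires at most one matching source row per target row.
    String mergeSql =
        "MERGE `my-project.my_dataset.target` T "
      + "USING `my-project.my_dataset.staging_events` S "
      + "ON T.id = S.id "
      + "WHEN MATCHED THEN UPDATE SET payload = S.payload, updated_at = S.ingested_at "
      + "WHEN NOT MATCHED THEN INSERT (id, payload, updated_at) "
      + "VALUES (S.id, S.payload, S.ingested_at)";

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    bigquery.query(QueryJobConfiguration.newBuilder(mergeSql).build());
  }
}
```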

Recommendation:

  • For maximum flexibility and control over your specific use case, a custom Dataflow template is likely the better option. This approach is particularly beneficial for high-throughput, low-latency requirements and when dealing with complex logic.
  • If you are looking for a simpler solution and your data processing needs are not overly complex, using BigQuery's MERGE statement might be a more straightforward alternative. This approach is better suited for scenarios where real-time processing is not a critical requirement.
  • Ultimately, the best approach depends on your specific requirements, including the complexity of your project, performance needs, and your team's expertise with Google Cloud technologies.

Thanks @ms4446 for the response. Under GitHub - DataflowTemplates, I see PubSubToBigQuery under v1 and PubSubCdcToBigQuery under v2. Is the PubSubCdcToBigQuery template under v2 meant to update existing records in BigQuery?

Yes. The PubSubCdcToBigQuery template under v2 in Google Cloud's DataflowTemplates repository is specifically designed to handle Change Data Capture (CDC) events arriving via Pub/Sub and to apply them to existing records in BigQuery. It helps you maintain a near-real-time replica of your data in BigQuery without custom scripts or complex workflows.

Key Differences Between the Two Templates:

  1. PubSubToBigQuery (v1):

    • This template is primarily for ingesting new data from Pub/Sub into BigQuery. It reads messages from Pub/Sub and writes them as new records in BigQuery, without the capability to handle updates or deletions.
  2. PubSubCdcToBigQuery (v2):

    • This advanced template is designed to interpret CDC events from supported sources like MySQL, PostgreSQL, and SQL Server. It processes these events to identify changes in the source database, including updates, inserts, and deletions, and then applies these changes to the corresponding records in BigQuery.
    • It's important to ensure that the CDC events are properly formatted and structured for the template to interpret them effectively, and that the template's configuration matches the specifics of those events; an illustrative payload is sketched after this list.
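To make the formatting point concrete, a CDC event for an update typically carries before/after images of the row, in the style of Debezium or Datastream output. The envelope and field names below are assumptions for illustration only; check the template's README for the exact format it expects.

```java
// A hypothetical Debezium-style CDC event for an UPDATE on a `customers` table.
// The envelope ("op", "before", "after") is an assumption; verify the exact
// format the PubSubCdcToBigQuery template expects in its README.
String cdcEvent = """
    {
      "op": "u",
      "source": { "table": "customers" },
      "before": { "id": 42, "email": "old@example.com" },
      "after":  { "id": 42, "email": "new@example.com" }
    }
    """;
```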

Use Cases:

  • The PubSubCdcToBigQuery template is ideal for scenarios where you need to synchronize data between operational databases and BigQuery in near-real-time, especially when dealing with frequent updates or deletions.

Additional Considerations:

  • Configuration: Proper configuration of both the source system for generating CDC events and the Dataflow template is crucial for successful data synchronization.
  • Schema Management: If the source database schema changes, additional steps may be required to evolve the BigQuery table accordingly; a minimal DDL sketch follows.
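As one concrete illustration of the schema-management point, an additive change (a new nullable column appearing in the source) can be applied to the BigQuery target with DDL before the changed events arrive. The table and column names here are assumptions.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class AddColumnExample {
  public static void main(String[] args) throws InterruptedException {
    // BigQuery DDL supports additive schema changes; new columns must be
    // NULLABLE (or REPEATED). Table and column names are assumptions.
    String ddl = "ALTER TABLE `my-project.my_dataset.target` "
               + "ADD COLUMN IF NOT EXISTS phone_number STRING";
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    bigquery.query(QueryJobConfiguration.newBuilder(ddl).build());
  }
}
```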
