
My connection between Datastream and BigQuery is not keeping the history

Hey guys!

I have an active connection from Datastream to BigQuery, but when I change or delete records in my source, BigQuery replicates those changes and overwrites the rows. I would like a change-stream approach instead, keeping the change history in the BigQuery table.

Is it possible to do this with a connection between Datastream and BigQuery?


By default, Datastream's BigQuery destination merges each change into the target table, so updates overwrite rows and deletes remove them; to keep history, you need to capture the change events themselves. Implementing a CDC workflow with Datastream and BigQuery lets you capture and preserve historical data changes from your source database in near real time. Here's a detailed breakdown:

Prerequisites

  • Source Database with Change Capture:
    • Oracle: Enable archived log mode and supplemental logging so Datastream can capture changes via LogMiner.
    • MySQL: Enable binary logging (binlog) to track changes.
  • Google Cloud Project: Ensure you have a Google Cloud project with the necessary permissions.
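
If MySQL is the source, it's worth confirming on the source database that binary logging is enabled and row-based before creating the stream (Datastream requires ROW format):

```sql
-- Run these on the MySQL source before creating the stream:
-- log_bin should be ON, and binlog_format should be ROW for Datastream.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
```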

Steps

  1. Datastream CDC:

    • Create a Datastream stream with your Oracle or MySQL database as the source and BigQuery as the destination.
    • Datastream leverages LogMiner or binlogs to capture changes from the source database in near real-time.
  2. BigQuery Schema Design for CDC:

    • Explicit _CHANGE_TYPE Column: Add a column (e.g., _CHANGE_TYPE) to your BigQuery table to indicate the type of change ('INSERT', 'UPDATE', 'DELETE').
    • Timestamp Column: Include a timestamp column (e.g., change_timestamp) to record the time of each change.
    • Partitioning: Consider partitioning your BigQuery table based on the timestamp column for improved performance and manageability.
  3. Data Transformation for CDC:

    • Populate Change Metadata: Implement logic to populate the _CHANGE_TYPE and change_timestamp columns based on the changes received from Datastream. This can be done in several ways:
      • Before Data Reaches BigQuery: Employ Cloud Functions or Dataflow for pre-processing and more complex transformations.
      • Within BigQuery: Utilize SQL transformations for simpler adjustments.
  4. Querying CDC Data:

    • Use SQL queries to analyze and track historical changes:
       
      SELECT * 
      FROM my_cdc_table 
      WHERE record_id = 123 
      AND _CHANGE_TYPE = 'UPDATE' 
      ORDER BY change_timestamp; 
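
The schema from step 2 can be sketched as BigQuery DDL; the dataset, table, and payload columns below are illustrative assumptions, not objects Datastream creates for you:

```sql
-- Hypothetical CDC history table, partitioned on the change timestamp (step 2).
CREATE TABLE my_dataset.my_cdc_table (
  record_id        INT64,
  customer_name    STRING,     -- example payload column (assumption)
  _CHANGE_TYPE     STRING,     -- 'INSERT' | 'UPDATE' | 'DELETE'
  change_timestamp TIMESTAMP   -- when the change occurred
)
PARTITION BY DATE(change_timestamp);
```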
      

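With history retained, the current state of each record can still be reconstructed when needed; a common pattern (using the same illustrative column names) keeps only the latest non-delete change per key:

```sql
-- Latest surviving version of each record, derived from the history table.
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY record_id
                            ORDER BY change_timestamp DESC) AS rn
  FROM my_cdc_table
)
WHERE rn = 1
  AND _CHANGE_TYPE != 'DELETE';
```
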
Important Considerations:

  • Complexity: CDC implementation can be complex; ensure careful planning of data transformation and schema management.
  • Schema Changes: Gracefully handle schema changes in your source database to prevent disruptions. This may involve updating Datastream configuration, BigQuery schemas, and transformation logic.
  • Storage and Costs: Plan for increased storage requirements and associated costs with historical data in BigQuery.
  • Orchestration: For complex CDC workflows, consider Cloud Composer or Cloud Workflows for efficient management of data transformation and loading processes.

Monitoring and Maintenance:

  • Regular Monitoring: Monitor the health of your CDC pipeline, including Datastream connectivity, BigQuery query performance, and transformation pipelines.
  • Proactive Maintenance: Address issues promptly to ensure the system remains efficient and accurate.