Datastream vs Data Fusion replication

Hello!

Could you explain the difference under the hood between Datastream and a Data Fusion replication job for loading CDC data into BigQuery?

1 ACCEPTED SOLUTION

Datastream and Data Fusion are both powerful services provided by Google Cloud to handle data integration tasks, but they have some important differences:

Datastream is a serverless, real-time change data capture and replication service that provides access to streaming, low-latency data from databases like MySQL, PostgreSQL, AlloyDB, and Oracle. It allows for near real-time analytics in BigQuery, offers a simple setup with secure connectivity, and automatically scales with no infrastructure to manage. Datastream reads and delivers every change from your databases (insert, update, delete) to load data into BigQuery, Cloud SQL, Cloud Storage, and Spanner. It also normalizes data types across sources for easier downstream processing and handles schema drift resolution. In terms of security, it supports multiple secure, private connectivity methods, and data is encrypted in transit and at rest.

See https://cloud.google.com/datastream

Data Fusion is a fully managed, cloud-native data integration service that offers a visual point-and-click interface for code-free deployment of ETL/ELT data pipelines. It includes a broad library of preconfigured connectors and transformations and integrates natively with Google Cloud services. It's built on an open-source core (CDAP), allowing data pipeline portability across on-premises and public cloud platforms. It also has built-in features for data governance, including end-to-end data lineage, integration metadata, and cloud-native security and data protection services. Additionally, it supports data integration through collaboration and standardization, offering pre-built transformations for both batch and real-time processing and the ability to create an internal library of custom connections and transformations.

See https://cloud.google.com/data-fusion/

In a nutshell, while both services offer data integration capabilities, Datastream focuses on real-time change data capture and replication from databases, while Data Fusion provides a more extensive toolset for building and managing ETL/ELT pipelines with a visual interface, built-in connectors, and transformations. Datastream is the better choice if your main requirement is real-time, low-latency data replication with automatic schema handling. Data Fusion is more suitable if you need to build complex data pipelines with a visual interface, need a broad set of connectors and transformations, or require pipeline portability across different environments.


Ok, thanks a lot for the explanation. As I understand it, Data Fusion is a kind of ecosystem for creating and maintaining various ETL/ELT jobs, and replication is just one of its features, useful if you already use Data Fusion and don't want to end up with a zoo of different tools.

Data Fusion is a fully managed, cloud-native data integration service that helps you create, deploy, and manage ETL/ELT jobs. It provides a visual interface and a large library of pre-built connectors and transformations, making it easier to create and manage data pipelines. While it does support replication tasks, including change data capture (CDC), it's not limited to just that. It's a broader tool that can handle a wide variety of data integration tasks, making it more of an ecosystem, as you described it.

Data Fusion is built on the open-source project CDAP (Cask Data Application Platform), which gives it a lot of flexibility and portability. Its features allow you to build complex data pipelines that can work with a variety of data sources and destinations, integrate data across your organization, and ensure data governance and compliance. You can use it for both batch and real-time data processing, and it supports collaboration and standardization, with the ability to share and reuse custom connections and transformations across teams.

Hi, @ms4446 

How is Datastream 'triggered'? (Example: Cloud SQL (PostgreSQL) -> BigQuery.) We can confirm that insert, update, and delete operations are streamed. However, we'd like to learn more about how the streaming is being triggered.

Google Cloud Datastream captures and streams changes from databases continuously in near real-time. The triggering in this context is automatic as Datastream is designed to continuously monitor and replicate changes from the source database to the destination such as BigQuery. There's no need to manually trigger Datastream to capture and stream changes. Once it's set up and running, it will automatically capture changes like inserts, updates, and deletes from the source database and replicate them to the destination.
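To make the "no manual trigger" behavior concrete, here is a minimal sketch that checks whether a stream is up and capturing changes, using the google-cloud-datastream Python client. The project, location, and stream IDs are placeholders; this only illustrates the idea, not a required setup step.

```python
# Minimal sketch: confirm a Datastream stream is running.
# Assumes the google-cloud-datastream client library is installed
# (pip install google-cloud-datastream) and application default
# credentials are configured. Project, location, and stream IDs are
# placeholders.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()

stream_name = "projects/my-project/locations/us-central1/streams/my-stream"
stream = client.get_stream(name=stream_name)

# A stream in the RUNNING state captures inserts, updates, and deletes
# continuously; there is no per-change trigger to invoke.
print(f"Stream: {stream.display_name}")
print(f"State:  {stream.state.name}")
```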

Appreciate the quick feedback @ms4446 ! 

Is there a way we can enhance the operation/streaming capabilities? During our testing, we are averaging 19+ seconds to stream the data from Cloud SQL (PostgreSQL) to BigQuery.

Just wanted to check if there are still optimizations we can do on our side to achieve near real-time capabilities.

 

Thank you!

Yes, there are several strategies you can employ to enhance the operation and streaming capabilities of Datastream when replicating data from Cloud SQL (PostgreSQL) to BigQuery:

  1. Use a Dedicated Network:

    • Dedicate a private network connection between Cloud SQL (PostgreSQL) and BigQuery to minimize latency and bolster performance.
    • Ensure robust security measures, such as encryption and firewall settings, are in place to safeguard the data in transit.
  2. Increase the Datastream Stream Capacity:

    • Amplify the Datastream stream capacity to empower Datastream to process and stream more data concurrently.
    • Continuously monitor the system to ascertain it sustains the augmented load efficiently, ensuring no bottlenecks or system overloads.
  3. Allocate More Resources to BigQuery:

    • Allocate additional resources to BigQuery to diminish the contention on BigQuery resources and enhance performance.
    • This involves increasing the amount of allocated storage and computational power to handle larger datasets and queries more effectively.
  4. Optimize the Data Types:

    • Refine the data types of the columns in the source and destination tables to boost performance.
    • For example, use INT instead of STRING for integer values, and avoid using NULLABLE fields where NOT NULL is applicable.
  5. Choose the Correct Stream Mode:

    • Opt for the suitable stream mode (batch or real-time) aligned with your requirements.
    • Batch mode is more resource-efficient for substantial volumes of data, whereas real-time mode is superior for near real-time updates, albeit at a potential increase in resource usage.
  6. Ensure a Consistent Data Model:

    • Guarantee that the data models of the source and destination tables are harmonious, mitigating the necessity for extensive data transformations.
    • This involves ensuring consistent data types, structures, and relationships between the two systems.
  7. Adopt the Right Partitioning Scheme:

    • Partition the BigQuery dataset effectively to enhance performance when querying the data.
    • For instance, consider partitioning tables based on date or specific key fields relevant to query patterns (a short sketch follows this list).
  8. Implement Caching:

    • Utilize caching to retain frequently accessed data in memory, augmenting performance when querying the data.
    • Employ caching solutions that are compatible with BigQuery to ensure seamless integration and operation.
  9. Deploy a Load Balancer:

    • Utilize a load balancer to evenly distribute traffic across multiple BigQuery instances, enhancing performance and scalability.
    • Choose a load balancer that integrates seamlessly with Google Cloud services for optimal performance and reliability.
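
For tip 7, a minimal sketch of creating a date-partitioned destination table with the google-cloud-bigquery client. The project, dataset, table, and column names are placeholders; whether partitioning on a timestamp column fits your data depends on your query patterns.

```python
# Minimal sketch for tip 7: create a date-partitioned BigQuery table so
# downstream queries prune partitions instead of scanning everything.
# Assumes google-cloud-bigquery is installed and the dataset already
# exists; the table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.analytics.orders"
schema = [
    bigquery.SchemaField("order_id", "INT64", mode="REQUIRED"),
    bigquery.SchemaField("customer_id", "INT64"),
    bigquery.SchemaField("created_at", "TIMESTAMP", mode="REQUIRED"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition on the timestamp column your queries filter on most often.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="created_at",
)

table = client.create_table(table)
print(f"Created {table.full_table_id}, partitioned by {table.time_partitioning.field}")
```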

Thank you, @ms4446 !

Can you expound more on [2] Increase the Datastream Stream Capacity? Checking the Datastream dashboard, I can't seem to find capacity configurations/resources under the connection profiles and streams.

And on [5] Choose the Correct Stream Mode: is there a way to check whether the current Datastream mode is batch or streaming?

Thank you.

2. Increase the Datastream Stream Capacity:

Google Cloud's Datastream is a serverless, real-time change data capture and replication service. Datastream was designed to automatically scale based on the volume of changes and the complexity of transformations. This means that you don't typically have to manually adjust "stream capacity" as you might with some other services.

However, there are a few things you can check and adjust to ensure optimal performance:

  • Source Database Load: Ensure that your source Cloud SQL (PostgreSQL) instance has adequate resources (CPU, memory, and storage). If the source database is under heavy load, it might slow down the change data capture process.

  • Monitor Metrics: Use the Datastream dashboard (or the Cloud Monitoring API, sketched below) to monitor key metrics such as latency, throughput, and error rates. This can give you insights into any bottlenecks or issues.

  • Adjust Parallelism: While Datastream automatically manages resources, you can potentially increase performance by adjusting the parallelism of certain tasks, if such configurations become available in future updates.

If you don't see specific configurations related to "stream capacity" in the Datastream dashboard, it's likely because Google has abstracted this away to simplify the user experience rather than something missing from your setup.
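
As referenced in the "Monitor Metrics" point, here is a hedged sketch of pulling Datastream metrics through the Cloud Monitoring API rather than the console. The project ID is a placeholder, and the metric type used (datastream.googleapis.com/stream/freshness) is an assumption; check Metrics Explorer for the exact Datastream metric names exposed in your project.

```python
# Minimal sketch: pull recent Datastream metrics via the Cloud Monitoring
# API instead of the dashboard. Assumes google-cloud-monitoring is
# installed. The metric type below is an assumption; verify the exact
# Datastream metric names in Metrics Explorer.
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "datastream.googleapis.com/stream/freshness"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Each time series corresponds to one stream; print the latest points.
for series in results:
    stream_id = series.resource.labels.get("stream_id", "<unknown>")
    for point in series.points[:3]:
        print(stream_id, point.interval.end_time, point.value)
```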

5. Choose the Correct Stream Mode:

Datastream is designed to capture and replicate data in real-time. However, the distinction between "batch" and "real-time" modes might not be a direct configuration option in Datastream as it is in some other platforms. Instead, Datastream focuses on continuous, real-time replication.

To determine the mode or check configurations:

  • Dashboard Check: Review the Datastream dashboard and the details of your specific stream. Look for any configurations or settings related to replication mode or frequency (a small API sketch follows below).

  • Documentation & Release Notes: Google Cloud frequently updates its services. It's a good idea to check the official Datastream documentation or release notes for any recent changes or added features related to replication modes.

  • Logs & Metrics: Examine the logs and metrics associated with your Datastream. This might give you insights into the frequency and mode of replication.

If you're unsure about the current mode or can't find specific settings, it might be beneficial to reach out to Google Cloud Support or consult the official documentation for the most up-to-date information.
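
Following up on the "Dashboard Check" point, a small sketch that reads a stream's configuration through the google-cloud-datastream client. The stream path is a placeholder, and the exact fields (notably the BigQuery destination's data_freshness duration, which governs how stale the destination tables may get) are assumptions to verify against the current datastream_v1 reference.

```python
# Minimal sketch: read a stream's configuration via the API instead of
# the console. Assumes google-cloud-datastream is installed; the stream
# path is a placeholder. Field names (especially data_freshness) should
# be verified against the current datastream_v1 reference.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
stream = client.get_stream(
    name="projects/my-project/locations/us-central1/streams/my-stream"
)

print("State:", stream.state.name)

# For BigQuery destinations, data_freshness is the staleness the stream
# tolerates before changes must be applied; smaller = fresher, costlier.
bq_config = stream.destination_config.bigquery_destination_config
print("BigQuery data freshness:", bq_config.data_freshness)
```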

(I'm on the same project as @cyrille115 )

This scenario is replicating data from Cloud SQL (Postgres) to BigQuery.

To test the replication time, I created a test table with a timestamp column. I'm running the insert statement passing now() as the value (a minimal version of the test is sketched after the list below). What I can observe:

  • Once the value is inserted, replication is not visible for at least 20s.
  • Once replication happens, the timestamp column has the correct value
  • The auto-generated column source_timestamp also contains the same timestamp from the source (exactly when the statement was executed).
  • Looking at the logs, we can see the interval for fetches is exactly the one we're seeing (around 20 to 30 seconds).
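
For reference, a minimal sketch of the kind of lag test described above: insert a timestamped row into the source and poll the replicated BigQuery table until it appears. Connection details, table names, and the polling interval are placeholders; it assumes psycopg2 and google-cloud-bigquery are installed and the stream is already running.

```python
# Minimal sketch of the lag test: insert a row into Cloud SQL (PostgreSQL)
# and poll the replicated BigQuery table until it shows up.
# DSN, table names, and polling interval are placeholders.
import time

import psycopg2
from google.cloud import bigquery

PG_DSN = "host=10.0.0.3 dbname=appdb user=repl_test password=..."  # placeholder
BQ_TABLE = "my-project.datastream_ds.lag_test"                     # placeholder

# 1. Insert a row into the source with the database clock as the marker.
conn = psycopg2.connect(PG_DSN)
with conn, conn.cursor() as cur:
    cur.execute("INSERT INTO lag_test (created_at) VALUES (now()) RETURNING id")
    row_id = cur.fetchone()[0]
conn.close()

# 2. Poll BigQuery until the row appears, then report the elapsed time.
bq = bigquery.Client()
start = time.monotonic()
while True:
    rows = list(
        bq.query(f"SELECT created_at FROM `{BQ_TABLE}` WHERE id = {row_id}").result()
    )
    if rows:
        print(f"Row {row_id} visible after {time.monotonic() - start:.1f}s")
        break
    time.sleep(1)
```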

Is there anything else we can do to troubleshoot this case? Why would the CDC have such a huge delay to get started?

The delay of 20-30 seconds that you are observing in replication from Cloud SQL (Postgres) to BigQuery is likely due to several factors:

  1. Change Data Capture (CDC) Lag:

    • CDC relies on capturing changes from the database's transaction logs. There can be a delay in reading or processing these logs, especially if there's a backlog or if the logs aren't retained for a sufficient duration.
  2. Datastream Processing:

    • Datastream processes CDC events and transforms them into a format suitable for BigQuery. This transformation might involve operations like schema mapping and data type conversion, which can introduce delays.
  3. BigQuery Ingestion:

    • BigQuery's ingestion process, especially its streaming ingestion, might have micro-batching behaviors that introduce small delays. While the dataset's size can influence query performance, ingestion delays are more likely due to the streaming buffer, the staleness settings on the destination tables, or the ingestion mechanism itself (a staleness sketch appears a little further down).

In addition to these general factors, specific conditions might be contributing to the delay:

  • If your Cloud SQL instance is under heavy load, it may take longer to process the CDC events.
  • BigQuery's performance can be influenced by various factors, but the size of the dataset primarily affects query performance rather than ingestion.
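
One concrete thing to check here is the staleness window on the Datastream-managed BigQuery tables, since CDC upserts are applied in batches governed by it. The sketch below inspects and lowers the max_staleness table option; the table name is a placeholder, and whether your destination tables carry this option depends on how the stream and destination were configured, so treat it as an assumption to verify.

```python
# Minimal sketch: inspect and tighten the staleness window on a
# Datastream-managed BigQuery table. Lowering max_staleness (or the
# stream's data freshness setting) trades cost for lower apply latency.
# Dataset and table names are placeholders; verify that your destination
# tables actually use this option before changing it.
from google.cloud import bigquery

client = bigquery.Client()

# Inspect current table options.
options = client.query(
    """
    SELECT option_name, option_value
    FROM `my-project.datastream_ds.INFORMATION_SCHEMA.TABLE_OPTIONS`
    WHERE table_name = 'lag_test'
    """
).result()
for row in options:
    print(row.option_name, "=", row.option_value)

# Reduce the allowed staleness (smaller interval = fresher data, more
# frequent apply jobs, higher cost).
client.query(
    """
    ALTER TABLE `my-project.datastream_ds.lag_test`
    SET OPTIONS (max_staleness = INTERVAL 0 MINUTE)
    """
).result()
```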

Troubleshooting Steps:

  1. Check Datastream Logs and Metrics:

    • Review the Datastream logs and metrics for insights into the replication process. Look for error messages, warnings, or patterns that might indicate bottlenecks or issues.
  2. Datastream Capacity:

    • Datastream is designed to auto-scale. If newer releases expose manual scaling options, consider adjusting them to process more CDC events concurrently.
  3. BigQuery Performance:

    • While BigQuery is serverless and doesn't have traditional "clusters" to scale, if you're on a slot-based (reserved) pricing plan, you can manage slots to optimize query performance (see the sketch below).
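
For the slots point above, a short sketch listing reservations and their slot capacity with the BigQuery Reservation API. The project and location are placeholders, and it assumes the project uses capacity (slot-based) pricing; on-demand projects will simply return no reservations.

```python
# Minimal sketch: list BigQuery reservations and their slot capacity.
# Assumes google-cloud-bigquery-reservation is installed; project and
# location are placeholders.
from google.cloud import bigquery_reservation_v1

client = bigquery_reservation_v1.ReservationServiceClient()
parent = "projects/my-project/locations/US"

for reservation in client.list_reservations(parent=parent):
    print(reservation.name, "slots:", reservation.slot_capacity)
```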

Optimization Tips:

  1. Dedicated Network:

    • Establish a dedicated network connection between Cloud SQL and BigQuery to minimize latency.
  2. Partition the BigQuery Dataset:

    • Implement partitioning in BigQuery to enhance query performance.
  3. BigQuery Behavior:

    • Understand that BigQuery does not utilize caching in the traditional database sense. Also, it automatically manages its resources, eliminating the need for traditional load balancing.

By combining these clarifications with the troubleshooting steps above, you should get a clearer picture of the factors behind the replication delay and the measures available to reduce it.