
Issue with windowing on a large BigQuery table read

Hi,
This post is a continuation of https://www.googlecloudcommunity.com/gc/Data-Analytics/Dataflow-Stateful-processing-Issue-worker-fai... . In my Dataflow pipeline I need to read data from BigQuery using a custom DoFn and write it to Cloud SQL. Some of the tables hold large datasets, in the range of several million rows. For these heavy tables I was struggling to track the status of the written batches, so, as suggested in the previous post, I used an external system/store to maintain the state of those batches and sent an event with every batch commit/write to Cloud SQL.

That process was not optimal, though, because it produces a lot of events in the external system. To optimize it I implemented windowing (let's say 2 minutes) and tried to send one event per group after grouping, but then the previous issue started recurring: no data is captured in a window until the previous stage commits, i.e. the pipeline waits for all the data to be read from the table and written to Cloud SQL, which is large and takes time. I am using the BigQuery client to read from BigQuery in batches. Some research suggested that a Splittable DoFn would help here, but I have found only limited resources on it.

Can someone please point me to reference documents/articles on reading data from BigQuery using a Splittable DoFn?

@ms4446 @mohitshagcp 

 


Splittable DoFns in Apache Beam offer a powerful method for processing large datasets efficiently. They enable the division of work into smaller, manageable chunks that can be processed in parallel, which can be particularly beneficial when dealing with large datasets. However, when it comes to reading data from BigQuery, it's important to consider the capabilities of BigQueryIO in Beam, which is already optimized for handling large-scale data efficiently.
Key Considerations:
  • Complexity of Implementation: Implementing a Splittable DoFn can be complex and requires a deep understanding of Apache Beam's model for parallel processing and state management.
  • BigQueryIO's Efficiency: Before implementing a custom Splittable DoFn, it's beneficial to explore the capabilities of BigQueryIO, as it might already meet your performance requirements.
  • Balancing Overhead and Performance: Consider the overhead of splitting and state management against the actual performance gains. In some cases, the built-in I/O connectors might be more efficient.
  • Use Cases for Splittable DoFn: Splittable DoFn is most useful when the data source or sink is not natively splittable or for complex custom data processing tasks.
  • Testing and Monitoring: Ensure thorough testing and performance monitoring to validate that the splitting logic and pipeline processing are efficient and effective.
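The splitting idea underneath both BigQueryIO and a Splittable DoFn can be illustrated without Beam at all: partition the table's row count into offset ranges that workers can claim independently. A minimal pure-Python sketch (the chunk size is an assumed tuning knob, and the function name is mine):

```python
def split_row_range(total_rows, chunk_size):
    """Partition [0, total_rows) into (start, stop) offset ranges.

    Each range can be read independently (e.g. with a BigQuery client
    call using an offset/limit), which is what makes parallel reads and
    per-chunk progress tracking possible.
    """
    return [(start, min(start + chunk_size, total_rows))
            for start in range(0, total_rows, chunk_size)]
```

For example, `split_row_range(1_000_000, 250_000)` yields four ranges that can be processed and checkpointed separately; smaller chunks mean finer rebalancing but more splitting and state-management overhead, which is the trade-off noted above.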

Implementation Tips:

  • Define Splitting Logic: Determine how your DoFn will split the work into smaller chunks based on BigQuery data.
  • Process Split Outputs: Each split should process its assigned chunk of data and write to Cloud SQL.
  • Handle State Management: Manage state effectively for each split to track progress and handle errors.
  • Consider Batching: Optimize Cloud SQL writes through batching for better performance.

Here are some resources and considerations for using Splittable DoFns, especially in the context of reading data from BigQuery:

 

Thank you so much @ms4446 for your inputs!