I am streaming data from Pub/Sub to BigQuery using the Dataflow templates via this code [1]. Data typically shows up in the preview after about 40 seconds, but most of the time it only becomes queryable after up to 90 minutes (as stated in the documentation). I've also tried the STREAMING_INSERTS and STORAGE_WRITE_API methods, with the same latency.
But when using the BigQuery API directly [2], data stays in the streaming buffer for only 3-4 minutes and then becomes available for querying. Rows even enter the buffer and become queryable before the rows written in case [1].
This makes me think that the default behaviour of the Storage Write API is what I need to make the streaming faster, and that there should be some setting in Apache Beam to make it behave the same way as direct usage of the Storage Write API.
A quick fix I've found is to add "OR _PARTITIONTIME IS NULL" to the query, which also includes the rows from the streaming buffer in the results. But it still does not help when you need the exact partition time (e.g. in a materialized view for which you need to know the update time, which is only refreshed when the main table is updated).
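For reference, the quick fix run through the BigQuery Java client looks roughly like this (my_dataset.my_table and the date filter are placeholders, not from the actual pipeline):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class StreamingBufferQuery {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // The OR _PARTITIONTIME IS NULL clause also returns rows that are still
    // sitting in the streaming buffer and have no partition time assigned yet.
    QueryJobConfiguration query =
        QueryJobConfiguration.newBuilder(
                "SELECT * FROM `my_dataset.my_table` "
                    + "WHERE DATE(_PARTITIONTIME) = CURRENT_DATE() "
                    + "OR _PARTITIONTIME IS NULL")
            .build();

    TableResult result = bigquery.query(query);
    result.iterateAll().forEach(System.out::println);
  }
}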
Any ideas what could be tweaked in this pipeline [1], or in some other Dataflow setting, to make it work? It is probably the Type.PENDING that is passed to the write client, but that cannot be set via the Apache Beam library in Dataflow.
[1]
WriteResult writeResult =
    convertedTableRows
        .get(TRANSFORM_OUT)
        .apply(
            "WriteSuccessfulRecords",
            BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withExtendedErrorInfo()
                .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
                .withNumStorageWriteApiStreams(0)
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                .to(
                    input ->
                        getTableDestination(
                            input, tableNameAttr, datasetNameAttr, outputTableProject)));
[2]
com.google.cloud.bigquery.storage.v1.BigQueryWriteClient + Type.PENDING
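For [2], a minimal sketch of the pending-type flow I mean (project, dataset, table and the column name are placeholders):

import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.CreateWriteStreamRequest;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;
import org.json.JSONArray;
import org.json.JSONObject;

public class PendingStreamExample {
  public static void main(String[] args) throws Exception {
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
      // Create a write stream in PENDING mode: rows are buffered server-side and
      // only become visible once the stream is committed.
      WriteStream pending =
          WriteStream.newBuilder().setType(WriteStream.Type.PENDING).build();
      WriteStream writeStream =
          client.createWriteStream(
              CreateWriteStreamRequest.newBuilder()
                  .setParent(table.toString())
                  .setWriteStream(pending)
                  .build());

      // Append rows as JSON using the stream's schema.
      try (JsonStreamWriter writer =
          JsonStreamWriter.newBuilder(writeStream.getName(), writeStream.getTableSchema())
              .build()) {
        JSONObject row = new JSONObject();
        row.put("some_column", "some value"); // placeholder column
        JSONArray rows = new JSONArray();
        rows.put(row);
        ApiFuture<AppendRowsResponse> future = writer.append(rows);
        future.get(); // wait for the append to be acknowledged
      }

      // Finalize the stream and commit it, which makes the rows queryable.
      client.finalizeWriteStream(writeStream.getName());
      BatchCommitWriteStreamsResponse commit =
          client.batchCommitWriteStreams(
              BatchCommitWriteStreamsRequest.newBuilder()
                  .setParent(table.toString())
                  .addWriteStreams(writeStream.getName())
                  .build());
      System.out.println("Committed at: " + commit.getCommitTime());
    }
  }
}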
The latency you're experiencing with the Dataflow approach is a challenge. Here are a few considerations and potential solutions:
Data Ingestion and Latency: Your pipeline already uses BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE with WriteDisposition.WRITE_APPEND. This method optimizes the write process, ensuring data is appended promptly upon arrival, which can help reduce latency.
Partition Time Visibility: _PARTITIONTIME visibility is a common challenge with streaming data. While BigQueryIO.Write.Method.STREAMING_INSERTS with ignoreUnknownValues can help in some scenarios, consider implementing a ParDo transform in Dataflow to manage and store partition times more effectively (a sketch of this idea is included at the end of this reply). This approach can provide better control over the data, especially for updates in materialized views.
Write Client Configuration: Type.PENDING as used directly with the BigQuery Write API is not available in Dataflow's BigQueryIO. Continue with BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE and experiment with the numStorageWriteApiStreams setting. While setting it to 0 allows the service to choose the number of streams, fine-tuning this number might offer better latency (see the second sketch at the end of this reply). Additionally, explore BigQueryIO.Write.WriteOption configurations like IGNORE_PARTITION_ERRORS and IGNORE_UNKNOWN_VALUES for tailored data handling.
Additional Strategies:
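Building on the partition-time point above, here is a rough sketch of such a ParDo, assuming TableRow elements and a hypothetical ingest_ts TIMESTAMP column in the destination table:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

public class StampIngestTimeFn extends DoFn<TableRow, TableRow> {
  @ProcessElement
  public void processElement(@Element TableRow row, OutputReceiver<TableRow> out) {
    // Copy the row and record the wall-clock processing time in a regular column,
    // so the exact time is available even before _PARTITIONTIME is populated.
    TableRow stamped = row.clone();
    stamped.set("ingest_ts", Instant.now().toString()); // ISO-8601, accepted by TIMESTAMP columns
    out.output(stamped);
  }
}

It could be applied just before the existing WriteSuccessfulRecords step in [1], e.g. .apply("StampIngestTime", ParDo.of(new StampIngestTimeFn())).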
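And for the write client configuration point, a hedged sketch of what that experimentation could look like on the exactly-once Storage Write API path, reusing the variables from [1]; the 5-second triggering frequency and 2 streams are arbitrary starting values to test, not recommendations:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.joda.time.Duration;

// Same write as [1], with the storage-write knobs changed for experimentation.
convertedTableRows
    .get(TRANSFORM_OUT)
    .apply(
        "WriteSuccessfulRecords",
        BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            // Exactly-once Storage Write API: commits happen on the triggering
            // frequency, so a small value keeps rows visible sooner.
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
            .withTriggeringFrequency(Duration.standardSeconds(5))
            // A fixed, small number of streams instead of 0 (auto) to experiment with.
            .withNumStorageWriteApiStreams(2)
            .to(
                input ->
                    getTableDestination(
                        input, tableNameAttr, datasetNameAttr, outputTableProject)));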