I am streaming data from Pub/Sub to BigQuery using the Dataflow templates via this code [1]. Data typically shows up in the preview after about 40 seconds, but most of the time it only becomes queryable after up to 90 minutes (as stated in the documentation). I've also tried the STREAMING_INSERTS and STORAGE_WRITE_API methods, with the same latency.
But when using the BigQuery API directly [2], data stays in the streaming buffer for only 3-4 minutes and then becomes available for querying. Rows even enter the buffer and become queryable before the rows written in case [1].
This makes me think that the default behaviour of the Storage Write API is what I need to make the streaming faster, and that there should be some setting in Apache Beam to make it behave the same way as direct usage of the Storage Write API.
A quick fix I've found is to add "OR _PARTITIONTIME IS NULL" to the query, which also includes the rows from the streaming buffer in the results. But it still does not help when you need the exact partition time (e.g. in a materialized view for which you need to know the update time, which is only refreshed when the main table is updated).
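For reference, the quick fix run through the BigQuery Java client looks roughly like this (my_dataset.my_table and the date filter are placeholders, not from the actual pipeline):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class StreamingBufferQuery {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // The OR _PARTITIONTIME IS NULL clause also returns rows that are still
    // sitting in the streaming buffer and have no partition time assigned yet.
    QueryJobConfiguration query =
        QueryJobConfiguration.newBuilder(
                "SELECT * FROM `my_dataset.my_table` "
                    + "WHERE DATE(_PARTITIONTIME) = CURRENT_DATE() "
                    + "OR _PARTITIONTIME IS NULL")
            .build();

    TableResult result = bigquery.query(query);
    result.iterateAll().forEach(System.out::println);
  }
}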
Any ideas what could be tweaked in this pipeline [1], or in some other Dataflow setting, to make it work? It is probably the Type.PENDING that is passed to the write client, but that cannot be set via the Apache Beam library in Dataflow.
[1]
WriteResult writeResult =
    convertedTableRows
        .get(TRANSFORM_OUT)
        .apply(
            "WriteSuccessfulRecords",
            BigQueryIO.writeTableRows()
                .withoutValidation()
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withExtendedErrorInfo()
                .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
                .withNumStorageWriteApiStreams(0)
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                .to(
                    input ->
                        getTableDestination(
                            input, tableNameAttr, datasetNameAttr, outputTableProject)));
[2]
com.google.cloud.bigquery.storage.v1.BigQueryWriteClient + Type.PENDING
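For [2], a minimal sketch of the pending-type flow I mean (project, dataset, table and the column name are placeholders):

import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.CreateWriteStreamRequest;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;
import org.json.JSONArray;
import org.json.JSONObject;

public class PendingStreamExample {
  public static void main(String[] args) throws Exception {
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
      // Create a write stream in PENDING mode: rows are buffered server-side and
      // only become visible once the stream is committed.
      WriteStream pending =
          WriteStream.newBuilder().setType(WriteStream.Type.PENDING).build();
      WriteStream writeStream =
          client.createWriteStream(
              CreateWriteStreamRequest.newBuilder()
                  .setParent(table.toString())
                  .setWriteStream(pending)
                  .build());

      // Append rows as JSON using the stream's schema.
      try (JsonStreamWriter writer =
          JsonStreamWriter.newBuilder(writeStream.getName(), writeStream.getTableSchema())
              .build()) {
        JSONObject row = new JSONObject();
        row.put("some_column", "some value"); // placeholder column
        JSONArray rows = new JSONArray();
        rows.put(row);
        ApiFuture<AppendRowsResponse> future = writer.append(rows);
        future.get(); // wait for the append to be acknowledged
      }

      // Finalize the stream and commit it, which makes the rows queryable.
      client.finalizeWriteStream(writeStream.getName());
      BatchCommitWriteStreamsResponse commit =
          client.batchCommitWriteStreams(
              BatchCommitWriteStreamsRequest.newBuilder()
                  .setParent(table.toString())
                  .addWriteStreams(writeStream.getName())
                  .build());
      System.out.println("Committed at: " + commit.getCommitTime());
    }
  }
}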
The latency you're experiencing with the Dataflow approach is a challenge. Here are a few considerations and potential solutions:
Data Ingestion and Latency: Your pipeline already uses BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE with WriteDisposition.WRITE_APPEND. This method optimizes the write process, ensuring data is appended promptly upon arrival, which can help reduce latency.
Partition Time Visibility: _PARTITIONTIME visibility is a common challenge with streaming data. While BigQueryIO.Write.Method.STREAMING_INSERTS with ignoreUnknownValues can help in some scenarios, consider implementing a ParDo transform in Dataflow to manage and store partition times more effectively (a sketch of this idea is included at the end of this reply). This approach can provide better control over the data, especially for updates in materialized views.
Write Client Configuration: Type.PENDING as used directly with the BigQuery Write API is not available in Dataflow's BigQueryIO. Continue with BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE and experiment with the numStorageWriteApiStreams setting. While setting it to 0 allows the service to choose the number of streams, fine-tuning this number might offer better latency (see the second sketch at the end of this reply). Additionally, explore BigQueryIO.Write.WriteOption configurations like IGNORE_PARTITION_ERRORS and IGNORE_UNKNOWN_VALUES for tailored data handling.
Additional Strategies:
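Building on the partition-time point above, here is a rough sketch of such a ParDo, assuming TableRow elements and a hypothetical ingest_ts TIMESTAMP column in the destination table:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

public class StampIngestTimeFn extends DoFn<TableRow, TableRow> {
  @ProcessElement
  public void processElement(@Element TableRow row, OutputReceiver<TableRow> out) {
    // Copy the row and record the wall-clock processing time in a regular column,
    // so the exact time is available even before _PARTITIONTIME is populated.
    TableRow stamped = row.clone();
    stamped.set("ingest_ts", Instant.now().toString()); // ISO-8601, accepted by TIMESTAMP columns
    out.output(stamped);
  }
}

It could be applied just before the existing WriteSuccessfulRecords step in [1], e.g. .apply("StampIngestTime", ParDo.of(new StampIngestTimeFn())).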
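And for the write client configuration point, a hedged sketch of what that experimentation could look like on the exactly-once Storage Write API path, reusing the variables from [1]; the 5-second triggering frequency and 2 streams are arbitrary starting values to test, not recommendations:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.joda.time.Duration;

// Same write as [1], with the storage-write knobs changed for experimentation.
convertedTableRows
    .get(TRANSFORM_OUT)
    .apply(
        "WriteSuccessfulRecords",
        BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            // Exactly-once Storage Write API: commits happen on the triggering
            // frequency, so a small value keeps rows visible sooner.
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
            .withTriggeringFrequency(Duration.standardSeconds(5))
            // A fixed, small number of streams instead of 0 (auto) to experiment with.
            .withNumStorageWriteApiStreams(2)
            .to(
                input ->
                    getTableDestination(
                        input, tableNameAttr, datasetNameAttr, outputTableProject)));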