Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Structured Streaming data upsert into BigQuery

Hi

We are trying to merge streaming data managed in structured streaming, into BigQuery target table. But we meet blockers.

1. It seems spark-bigquery-connector doesn't support df.writeStream.outputMode("update").

2. We've been using spark-bigquery-connector to merge the dataframe in spark(batch mode) into the bigquery table, by creating a tmp table then use "merge" sql to upsert into the target table. Is it also the best solution in structured streaming? We may concern about the performance.

3. We have checked Storage Write API in document that contains Capture Data Change ability and it can upsert/delete data, but didn't find any code example. Could you confirm if CDC can implement merge case or give an example of code if it can do it.

Thanks

0 3 1,802
3 REPLIES 3

Yes, you're on the right track. The spark-bigquery-connector does not currently support the outputMode("update") option for streaming writes. This limitation isn't solely due to the BigQuery streaming API but also stems from the nature of BigQuery itself. BigQuery is primarily optimized for large-scale data analytics rather than real-time updates, which are typical in traditional databases.

While using the spark-bigquery-connector to merge data in Spark batch mode is a valid approach for structured streaming, it's essential to understand that this method can be performance-intensive. However, it's not because the entire streaming DataFrame is materialized in memory. Structured streaming in Spark processes data in micro-batches, and each micro-batch is processed separately. The performance concern arises from writing each micro-batch to a temporary table in BigQuery and then merging it into the target table.

Regarding the BigQuery Storage Write API, it indeed supports upserts and deletes. But it's worth noting that while this API allows for real-time inserts, updates, and deletes, it doesn't natively support the MERGE operation like traditional SQL databases.

In general, the optimal approach for merging streaming data into a BigQuery table will depend on your specific use case. If performance is a top priority, you might consider using a dedicated streaming data pipeline tailored for upserts and deletes. On the other hand, if simplicity is more crucial, the spark-bigquery-connector in batch mode might be more suitable.

Additional options to consider include:

  • Utilizing the BigQuery streaming API to write data to a temporary table, followed by a Cloud Scheduler job that periodically merges this data into the target table.
  • Exploring third-party streaming data pipelines that offer built-in support for merging data into BigQuery.
  • Designing a custom streaming data pipeline tailored to the unique requirements of your use case.

Thanks very much for your detailed explanation.

In addition, do u have any sample code that implements "upsert" with Storage Write API? Even though the API supports this feature, we still don't know how to use it.

Thanks.

Here are some resources that might help:

  1. Introduction to the BigQuery Storage Write API | Google Cloud: This is the official documentation for the Storage Write API. It provides a comprehensive overview of the API's capabilities and how to use it.

  2. BigQuery Storage Write API best practices | Google Cloud: This guide offers best practices for using the Storage Write API, which might provide insights into handling upserts efficiently.

  3. 13 tricks for the new Bigquery Storage Write API in Python: This article provides some tricks and tips for using the Storage Write API with Python. It might contain relevant examples or insights for your use case.

For a direct implementation of upserts using the Storage Write API, you might need to combine the real-time inserts capability of the API with a strategy to handle updates and deletes, as I mentioned earlier (e.g., streaming updated records with new timestamps and using window functions during analysis).