
Insert/Update to GCP BigQuery using Spring Boot

Hi,

 

I need to insert/update data received from another SaaS-based application into GCP BigQuery. Could you suggest the most cost-effective, loosely coupled, and synchronous way to develop this? I also need to send the status of the insertion/update back to the SaaS-based application. What would be the best approach to achieve this?

Thanks


Sure, here are some options for you to consider:

  • Google Cloud Pub/Sub:

    • Description: Pub/Sub is a fully-managed real-time messaging service that allows you to decouple applications and services.
    • How it Works: You can use Pub/Sub to send data from the SaaS application to a topic. To get this data into BigQuery, you'd typically pair Pub/Sub with another service like Dataflow.
    • Benefits: This approach is cost-effective as you only pay for the messages you send and receive. It's also loosely coupled, meaning the two applications don't need to be intimately aware of each other's operations.
  • Google Cloud Dataflow:

    • Description: Dataflow is a fully-managed stream and batch data processing service based on Apache Beam.
    • How it Works: You can use Dataflow to create a pipeline that ingests data (e.g., from Pub/Sub), processes it, and then loads it into BigQuery. Once the data is in BigQuery, a mechanism like a Cloud Function can be used to send back the status to the SaaS application.
    • Benefits: This approach is scalable and cost-effective since you only pay for the resources you use. It also maintains the loose coupling between the two applications.

If you're looking for a cost-effective, loosely coupled, and relatively synchronous solution, combining Google Cloud Pub/Sub with Dataflow is likely your best bet. This combination allows for real-time data ingestion into BigQuery and can provide timely feedback to the SaaS application.
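
To make the Pub/Sub side concrete, here is a minimal publisher sketch in Java using the google-cloud-pubsub client. The project ID, topic name, and JSON payload are placeholders, not anything from your setup:

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class RecordPublisher {

  public static void main(String[] args) throws Exception {
    // Placeholders: substitute your own project and topic IDs.
    TopicName topicName = TopicName.of("my-project", "saas-records");
    Publisher publisher = Publisher.newBuilder(topicName).build();
    try {
      String json = "{\"id\": \"C-1001\", \"status\": \"ACTIVE\"}";
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8(json))
          .build();
      // publish() is asynchronous; get() blocks until Pub/Sub acknowledges the message.
      ApiFuture<String> future = publisher.publish(message);
      System.out.println("Published message ID: " + future.get());
    } finally {
      publisher.shutdown(); // flush pending messages and release resources
    }
  }
}
```

On the other side of the topic, a Dataflow pipeline (for example, Apache Beam's PubsubIO source feeding a BigQueryIO sink) would read these messages and write them into BigQuery. Note that any 200/201 you return after publish() succeeds only confirms that Pub/Sub accepted the message, not that the row has landed in BigQuery, which is exactly the trade-off discussed below.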

Hi @ms4446,

Thanks for the input. In both approaches, returning the status to the SaaS-based application is not real-time (synchronous).

Basically, the SaaS-based application invokes REST endpoints; these endpoints in turn have to insert/update to BigQuery and return a 200/201 HTTP status to the SaaS app as soon as the insert completes. But if I go with the above two approaches, I am not sure when the data will be available in the BigQuery table (basically there will be a small delay).

If I go with the GCP BigQuery insert/update APIs: when it comes to an update, let's assume the table has around a million records; to update, it has to select the record based on some unique key and then update it. In this situation, will there be more charge involved, since it is egress to BigQuery?

 

You are right, the Pub/Sub and Dataflow approaches do not guarantee a real-time response. There might be a delay before the data is available in BigQuery, even after the REST endpoint has returned a 200/201 status code.

If you require a truly synchronous response, you can opt for the GCP BigQuery insert/update APIs. However, it's essential to clarify that using the BigQuery API for updates within GCP does not incur "egress" charges. Instead, you'll be charged based on the amount of data processed during the update. For instance, updating a few records might have minimal costs, but updating millions could be more significant due to BigQuery's approach to updates (old records are marked as deleted, and new records are written).
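
To sketch what the synchronous path could look like, here is a minimal Spring Boot endpoint that performs a BigQuery streaming insert and maps the result to the HTTP status the SaaS application expects. The dataset and table names are placeholders, and the incoming row map's keys are assumed to match the table's column names:

```java
import java.util.Map;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IngestController {

  private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

  @PostMapping("/records")
  public ResponseEntity<String> insertRecord(@RequestBody Map<String, Object> row) {
    // Placeholder dataset/table; the map's keys must match the table's columns.
    InsertAllRequest request = InsertAllRequest
        .newBuilder(TableId.of("my_dataset", "stage_table"))
        .addRow(row)
        .build();
    InsertAllResponse response = bigquery.insertAll(request); // synchronous call
    if (response.hasErrors()) {
      return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
          .body("Insert failed: " + response.getInsertErrors());
    }
    return ResponseEntity.status(HttpStatus.CREATED).body("Row inserted");
  }
}
```

One caveat with streaming inserts: freshly streamed rows sit in the streaming buffer for a while and cannot be modified by UPDATE/DELETE/MERGE until the buffer is flushed, so if the same endpoint must also update rows, a DML job (sketched further below) may be the safer route.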

The best approach will depend on your specific requirements and budget:

  • Volume of Data: Consider how much data you'll be inserting or updating regularly.
  • Latency Requirements: If real-time feedback is crucial, the direct BigQuery API approach offers the least latency.
  • Budget: Be mindful of the costs associated with frequent updates, especially on large datasets.
  • Scalability: Think about future growth. While direct API calls might work for smaller datasets, more extensive operations might benefit from the scalability of Pub/Sub and Dataflow.

If real-time feedback is essential and you're prepared for the associated costs, the BigQuery insert/update APIs are suitable. However, if you can accommodate a slight delay, Pub/Sub combined with Dataflow might be more cost-effective in the long run.
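
For the update case, here is a minimal sketch of a synchronous DML UPDATE through the Java client; the table and column names (customers, status, customer_id) are placeholders. bigquery.query() blocks until the job completes, so the endpoint can return its HTTP status right after the call:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;

public class RecordUpdater {

  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder table and columns; named parameters guard against SQL injection.
    String sql = "UPDATE `my_project.my_dataset.customers` "
        + "SET status = @status "
        + "WHERE customer_id = @id";

    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
        .addNamedParameter("status", QueryParameterValue.string("ACTIVE"))
        .addNamedParameter("id", QueryParameterValue.string("C-1001"))
        .build();

    // query() waits for the DML job to complete, so this call is synchronous.
    bigquery.query(config);
    System.out.println("Update finished");
  }
}
```

Since BigQuery bills DML by bytes scanned rather than rows changed, clustering or partitioning the table on the lookup key keeps the scanned volume, and hence the per-update cost, down on a million-row table.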

Hey ms4446, thanks for the detailed answer. While implementing, I ran into another doubt:

With respect to performance and cost, is it better to avoid duplicates in the BigQuery table, or to keep duplicates and fetch the latest record from them? Here we are inserting into a stage table before the data goes to the target table.

Again, this question is about the Java insert/update APIs with BigQuery.


Whether it is better to avoid duplicates in a BigQuery table or retrieve the latest record from duplicates depends on your specific requirements and use case.

Avoiding Duplicates:

  • Pros:
    • Can lead to cleaner data and potentially reduce storage costs.
    • May also simplify analytics by ensuring each record is unique.
    • Can improve performance in some scenarios, especially if the data is frequently updated.
  • Cons:
    • Can be complex to implement, especially if checking for duplicates during insertion. This can also slow down the ingestion process if not done efficiently.
    • May require additional logic to handle edge cases, such as when two records have the same value for all but one column.

Getting the Latest Record:

  • Pros:
    • Simpler to implement and can ensure you're always working with the most recent data.
    • Avoids duplicate checks at ingestion time; deduplication is deferred to query time.
    • Can be more scalable than avoiding duplicates, especially for large volumes of data.
  • Cons:
    • If not managed correctly, it can lead to increased storage costs due to duplicate records.
    • Identifying the "latest" record requires a reliable timestamp or versioning mechanism.

Additional Considerations:

  • Volume of Data: If you're inserting large volumes, avoiding duplicates during ingestion might be more efficient in the long run.
  • Frequency of Updates: Frequent updates might benefit from a system that inherently avoids duplicates, ensuring that the data remains clean.
  • Importance of Accuracy: If ensuring the most recent and accurate data is critical, you'll need a robust mechanism to identify and retrieve the latest record.

In conclusion, if performance, cost-efficiency, and data cleanliness are priorities, avoiding duplicates might be the preferred approach. However, if simplicity and ensuring access to the most recent data are more important, then designing the system to easily retrieve the latest record could be beneficial. Your final decision should align with your specific requirements and use case.
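
If you keep duplicates in the stage table and deduplicate on the way into the target table, a single MERGE statement can express "take the latest record per key and insert-or-update it". Below is a hedged sketch assuming placeholder tables my_dataset.stage and my_dataset.target with an id key, a status column, and an updated_at timestamp:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class StageToTargetMerge {

  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // ROW_NUMBER() keeps only the newest row per id before the MERGE runs,
    // so duplicates in the stage table never reach the target table.
    String sql =
        "MERGE `my_dataset.target` T "
      + "USING ( "
      + "  SELECT * EXCEPT(rn) FROM ( "
      + "    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn "
      + "    FROM `my_dataset.stage` "
      + "  ) WHERE rn = 1 "
      + ") S "
      + "ON T.id = S.id "
      + "WHEN MATCHED THEN UPDATE SET status = S.status, updated_at = S.updated_at "
      + "WHEN NOT MATCHED THEN INSERT (id, status, updated_at) "
      + "  VALUES (S.id, S.status, S.updated_at)";

    // Blocks until the MERGE job completes.
    bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
  }
}
```

Run after each batch lands in the stage table, this keeps the target table free of duplicates at a predictable MERGE cost, which tends to make downstream queries both cheaper and simpler.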

Hi ms4446,

Do the Java insert/update/select APIs have limitations/quotas on BigQuery query execution, in terms of the number of insertions, updates, or select queries executed? If yes, what are the limitations? Thanks

Hi @dharma,

Yes, there are some limitations and quota restrictions associated with the Java insert/update/select APIs and BigQuery query API execution. These limitations are designed to protect the stability and performance of the BigQuery service.

Insert/Update/Select API Limitations:

  • Maximum query length: 12 MB (resolved query length)
  • Maximum number of query parameters: 10,000
  • Maximum request size: 10 MB
  • Maximum response size: 10 GB compressed
  • Maximum row size: 100 MB
  • Maximum number of columns in a table, query result, or view definition: 10,000

BigQuery Query API Execution Limitations:

  • Query execution time limit: 6 hours
  • Maximum number of resources referenced per query: 10,000
  • Maximum number of concurrent queries per project: 10,000
  • Maximum number of concurrent queries per user: 100

Quota Restrictions:

  • BigQuery API free quota: 1 TB of data processed per month for on-demand queries
  • BigQuery API paid quota: You can purchase additional query capacity, which is billed based on the amount of data processed. You can also opt for flat-rate pricing.

Tips for optimizing your queries and using the BigQuery API efficiently:

  • Run a dry run to preview the estimated bytes (and therefore the estimated cost) a query will process before you execute it; see the sketch after this list.
  • Avoid using unnecessary joins and subqueries.
  • Use partitioned tables whenever possible.
  • Use the BigQuery caching mechanism to store the results of frequently run queries.
  • Use batch queries to process large volumes of data more efficiently.
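
As an illustration of the dry-run tip above, this small sketch asks BigQuery for the estimated bytes a query would process without executing (or billing) it; the query and table name are placeholders:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobStatistics.QueryStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class DryRunEstimate {

  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    QueryJobConfiguration config = QueryJobConfiguration
        .newBuilder("SELECT status, COUNT(*) FROM `my_dataset.target` GROUP BY status")
        .setDryRun(true)         // validate and estimate only; nothing runs or is billed
        .setUseQueryCache(false) // estimate as if the cache were cold
        .build();

    // With dryRun set, create() returns immediately with statistics populated.
    Job job = bigquery.create(JobInfo.of(config));
    QueryStatistics stats = job.getStatistics();
    System.out.println("Estimated bytes processed: " + stats.getTotalBytesProcessed());
  }
}
```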