
Querying a table with a large amount of data gives a null response the first time

Hi,

I am developing an integration with Google BigQuery. I have noticed that the "/queries" API endpoint sometimes responds with "jobComplete": false when I query a table holding a huge amount of data; when I hit it again, it responds with the desired table data and jobComplete as true. I wanted to know the maximum time BigQuery can take to complete a "select" job over a large table, as we are trying to implement a retry mechanism and we can only retry a couple of times within a restricted time range.

I need a suggestion here: if the time to complete the job is longer than our retry window, what should be done so that we can get the data on the first hit?

 


When dealing with Google BigQuery, especially for "select" jobs on large datasets, it's important to understand that there's no fixed maximum time for query completion. The duration can vary based on several factors:

  1. Data Size: The larger the dataset, the longer it might take to process. While millions of rows can be processed in minutes, billions could take significantly longer.

  2. Query Complexity: Simple queries like "SELECT *" are generally quicker than more complex ones involving joins, aggregations, and filters.

  3. Resource Availability: BigQuery operates on shared resources. During peak times, your query might be queued, which can add to the wait time.

  4. Network Latency: While usually a minor factor, the latency between your application and BigQuery can also influence the overall response time.

To effectively manage long-running queries, consider these strategies:

  1. Utilize Job Metadata:

    • The jobComplete field in the /queries response indicates whether the job has finished. A false value is not an error: the response still carries a jobReference, so poll the same job for its results instead of re-submitting the query.
    • Call jobs.getQueryResults with a timeoutMs parameter to wait for completion for a bounded period on each poll.
    • Inspect statistics such as totalBytesProcessed on the job to gauge how much work the query involves.
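The bounded-wait pattern above can be sketched in Python. This is a minimal illustration, not the official client: the `fetch_results` callable, deadline, and interval are assumptions standing in for a jobs.getQueryResults call (with timeoutMs) on an existing jobReference.

```python
import time


def wait_for_results(fetch_results, deadline_s=60.0, poll_interval_s=2.0):
    """Poll a query job until it completes or the deadline passes.

    fetch_results: a callable returning (job_complete, rows), mimicking
    a jobs.getQueryResults call on an existing jobReference (hypothetical
    shape for this sketch). Returns the rows once job_complete is True,
    or raises TimeoutError if the deadline is exceeded.
    """
    started = time.monotonic()
    while True:
        job_complete, rows = fetch_results()
        if job_complete:
            return rows
        if time.monotonic() - started >= deadline_s:
            raise TimeoutError("query did not complete before the deadline")
        time.sleep(poll_interval_s)
```

The key point is that each iteration re-fetches the results of the same job rather than re-submitting the query, so no work is thrown away between polls.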
  2. Optimize Your Query:

    • Simplify the query where possible, or break it into smaller parts.
    • Implement partitioning and clustering for more efficient data retrieval.
    • Use materialized views for common queries to improve performance.
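To illustrate the partitioning point above, here is a Python sketch that builds a query restricted to a date-partitioned table so BigQuery can prune partitions and scan far less data. The table name `mydataset.events` and its partitioning column `event_date` are hypothetical; in production, prefer query parameters over string interpolation.

```python
from datetime import date


def pruned_query(table="mydataset.events",
                 start=date(2024, 1, 1),
                 end=date(2024, 1, 7)):
    """Build a SELECT restricted to a date-partitioned table.

    Filtering on the (hypothetical) partitioning column event_date lets
    BigQuery skip partitions outside the range, instead of scanning the
    whole table as an unrestricted SELECT * would.
    """
    return (
        f"SELECT user_id, event_name FROM `{table}` "
        f"WHERE event_date BETWEEN '{start.isoformat()}' "
        f"AND '{end.isoformat()}'"
    )
```

Narrowing the partition range is often the single biggest lever for making a "huge table" query finish within a fixed retry window.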
  3. Increase Slot Capacity:

    • If slot availability is the bottleneck, consider purchasing additional slot capacity (a capacity reservation). This can expedite processing but incurs additional cost.
  4. Implement Asynchronous Processing:

    • Initiate the query asynchronously and proceed with other work, keeping the job ID so you can poll the job's status later (or route BigQuery job-completion audit log entries through a log sink to Pub/Sub for a push-style notification).
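The asynchronous pattern can be sketched as a small helper. With the google-cloud-bigquery library, the `job` argument would be the QueryJob returned by `client.query()` or re-fetched later via `client.get_job(job_id)`; this sketch only assumes an object exposing `done()` and `result()`.

```python
def collect_if_done(job):
    """Check a previously started query job without blocking.

    job: any object exposing done() and result(), such as the QueryJob
    from google-cloud-bigquery. Returns the rows as a list if the job
    has finished, or None so the caller can do other work and try again
    later.
    """
    if job.done():
        return list(job.result())
    return None
```

Because the job keeps running server-side, your application can check back on its own schedule instead of blocking a request thread for the whole query duration.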
  5. Explore Alternative Solutions:

    • For extremely large datasets, the BigQuery Storage Read API can provide direct, high-throughput access for processing with tools like Apache Spark.
    • For real-time analytics, a streaming pipeline built with Dataflow feeding BigQuery might be a suitable option.

Remember, setting overly short timeouts can lead to unnecessary retries and wasted resources. It's crucial to balance your expectations against the realities of data size, query complexity, and available resources. By using job metadata effectively, optimizing queries, and considering alternative approaches, you can handle large datasets in BigQuery efficiently.