Hello,
I have created a Datastream pipeline from MySQL to BigQuery, and made the table in BigQuery partitioned.
Inserting data appears to work fine with partitions.
But when I run a query while there is still data in the streaming buffer, it doesn't use the partitions and reads the whole table.
Once the streaming buffer is empty, it works fine.
Why does it have to read the whole table? Shouldn't it read only the required partitions plus the streaming buffer?
When using Datastream to stream data from MySQL to a partitioned table in BigQuery, the streaming buffer serves as a temporary holding area for incoming data before it's distributed into the appropriate partitions based on your table's partitioning scheme (e.g., a date column). This mechanism ensures that your data is continuously updated and available for querying in near real-time.
Why It Might Seem Like the Whole Table Is Being Scanned:
Streaming Buffer Inclusion: Queries need to scan the streaming buffer in addition to the specified partitions to ensure they capture all relevant data. This is because the streaming buffer contains new data that hasn't yet been assigned to a partition. If there's a significant amount of data in the streaming buffer, this can make it appear as though the query is scanning more of the table than intended.
Query Optimization and Data Freshness: BigQuery automatically optimizes queries to balance between data freshness and efficient data processing. When recent data is essential to your query, BigQuery includes the streaming buffer in the scan, which can increase the amount of data processed and potentially impact query performance.
Optimizing Your Queries:
Timing Your Queries: Consider running queries at times when the streaming buffer is likely to be smaller, allowing data to be fully processed into partitions. This can help reduce the volume of data scanned and improve query performance.
Monitoring the Streaming Buffer: Keeping an eye on the size and presence of the streaming buffer can give you insights into the best times to run your queries and how the streaming process might be affecting query costs and performance.
Partition and Cluster Design: Ensuring your table's partitioning and optional clustering are well-aligned with your query patterns can significantly reduce the amount of data scanned, thereby optimizing costs and performance.
Table Statistics: While BigQuery manages table statistics automatically, understanding how these statistics influence query optimization can help you design better tables and write more efficient queries.
In summary, while the presence of the streaming buffer requires BigQuery to scan additional data to ensure query completeness, careful planning and understanding of BigQuery's behavior can mitigate performance impacts and optimize query efficiency.
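To make the "monitoring the streaming buffer" point concrete: rows that have not yet been extracted into their final partitions are reported under the special `__UNPARTITIONED__` partition. A sketch using the standard `INFORMATION_SCHEMA.PARTITIONS` view (`mydataset` and `mytable` are placeholders, not names from this thread):

```sql
-- Sketch: inspect partition sizes, including the streaming buffer.
-- Rows still in the streaming buffer are reported under the
-- special __UNPARTITIONED__ partition_id.
SELECT partition_id, total_rows, total_logical_bytes
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'mytable'
ORDER BY partition_id;
```

Running this before and after heavy ingestion windows can tell you when the buffer is small enough that queries are cheap.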
Thanks @ms4446
I was thinking it should work like you described, but...
For example, I have some data in the streaming buffer (mostly recent data).
So when I make a query, I assume it would process data from today's partition plus data from the streaming buffer.
But every time it processes all 5 GB (the whole table), even if there is only one row in the streaming buffer.
This leaves no room for optimization.
Queries will use partitions only if I turn off Datastream; otherwise they scan the whole table, because for most of the day there is some data in the streaming buffer.
Your observations and the scenario you've shared highlight the complexities of working with BigQuery's partitioned tables and the streaming buffer. Here's an enhanced breakdown of your situation and refined strategies for potential improvements:
Impact of the Streaming Buffer on Full Table Scans: The presence of data in BigQuery's streaming buffer necessitates scanning this buffer alongside the specified partitions to ensure query completeness. This is crucial for capturing all relevant data, including the most recent additions. However, this can lead to broader table scans than expected, especially when the buffer contains a significant amount of unpartitioned data.
Table Size and Query Optimization: Although your table is 5GB, which is relatively modest by BigQuery standards, the expectation that partitioning will enhance query performance remains valid. Nonetheless, the streaming buffer's impact can sometimes diminish the anticipated benefits of partitioning, particularly for queries targeting real-time data.
Balancing Real-time Data Freshness with Query Efficiency: There's an inherent trade-off between ensuring data freshness through the streaming buffer and optimizing query performance via partitioning. This trade-off becomes particularly pronounced in use cases that demand up-to-the-minute data accuracy.
Optimization Strategies:
Optimal Query Timing: Consider scheduling your queries during periods when the streaming buffer is expected to be minimal. This approach can help mitigate the need for full table scans, although it may not always align with real-time data requirements.
Reevaluating Data Freshness Requirements: Assess the criticality of real-time data for your specific use case. If immediate data freshness is less crucial, you might find opportunities to optimize query performance by adjusting how you interact with the streaming buffer.
Exploring Clustering alongside Partitioning: While clustering is not an alternative to partitioning, it can serve as a complementary strategy. By clustering your table on relevant fields (such as the date field), you can further enhance query performance within each partition, especially for range-based queries.
Monitoring Data Ingestion Patterns: Keeping a close watch on the size and frequency of data updates in the streaming buffer can provide valuable insights. This understanding can guide the selection of optimization strategies by anticipating the impact on query performance.
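To illustrate the partitioning-plus-clustering point above, here is a minimal table definition sketch (all names and the `require_partition_filter` choice are illustrative, not taken from this thread):

```sql
-- Sketch: daily partitioning plus clustering on a frequently
-- filtered column; names are illustrative placeholders.
CREATE TABLE `mydataset.events`
(
  event_ts    TIMESTAMP,
  customer_id STRING,
  payload     STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id
OPTIONS (require_partition_filter = TRUE);
```

Setting `require_partition_filter` rejects queries that omit a partition filter entirely, which guards against accidental full scans (though, as this thread shows, it does not by itself prevent the streaming buffer from being read).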
Additional Considerations:
Automatic Management of Table Statistics: BigQuery automatically updates table statistics to optimize query execution. While direct user intervention is not possible, structuring queries effectively and adhering to best practices can aid in achieving optimal performance.
Engagement with Google Cloud Support: For ongoing performance issues or complex scenarios, consulting with Google Cloud support can offer tailored advice and insights specific to your project's needs.
Hi, I got the same issue here.
My partitioned table with Pub/Sub streaming always scans the whole table, even with a filter on the partitioning column (even when filtering only data from a year ago).
How can I use a partition filter to reduce the scan range?
The best "solution" I found was to create a mirror table and copy data from the main table.
I created a scheduled query that runs every 15 minutes:
MERGE `your_table_mirror` T
USING `your_table` S
ON T.guid = S.guid
WHEN MATCHED THEN
  -- if the row exists in both tables, update values from the main table
  UPDATE SET
    field_1 = S.field_1,
    field_2 = S.field_2
WHEN NOT MATCHED BY TARGET THEN
  -- if the row doesn't exist in the mirror table, insert it
  INSERT (guid, field_1, field_2)
  VALUES (S.guid, S.field_1, S.field_2)
WHEN NOT MATCHED BY SOURCE THEN
  -- if the row doesn't exist in the main table, delete it from the mirror table
  DELETE;
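For completeness, the one-time setup for such a mirror could look like this (a sketch; BigQuery's `CREATE TABLE ... COPY` clones the source's schema, partitioning, and data, after which the scheduled MERGE keeps the mirror in sync):

```sql
-- Sketch: create the mirror as a one-time copy of the main table.
CREATE TABLE `your_table_mirror`
COPY `your_table`;
```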
My problem was resolved by setting max_staleness (it was a CDC table).
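For reference, the staleness window can be set on an existing table like this (the 15-minute interval is only an example value; tune it to your freshness needs):

```sql
-- Sketch: let queries read the cached baseline table instead of
-- merging pending CDC upserts at query time.
ALTER TABLE `mydataset.mytable`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```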
I'm having a similar issue. If you attempt to expose a partitioned table with streaming inserts through a view, any queries against the view that attempt to select only a partition instead do a full scan. This holds even if the partition is old enough that the streaming buffer shouldn't have any data relevant to the partition in it.
Example:
--On the raw table
SELECT *
FROM `mydataset.raw_stream`
--- scans ~20 GB
SELECT *
FROM `mydataset.raw_stream`
WHERE TIMESTAMP_TRUNC(publish_time, DAY) = TIMESTAMP("2024-07-17")
--- scans ~6 GB
--On the view
SELECT *
FROM `mydataset.raw_view`
--- scans ~20 GB
SELECT *
FROM `mydataset.raw_view`
WHERE TIMESTAMP_TRUNC(publish_time, DAY) = TIMESTAMP("2024-07-17")
--- scans ~20 GB
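One thing worth trying (a sketch, not a guaranteed fix for the view case): express the filter as a half-open range on the raw partitioning column instead of wrapping it in a function, since an untransformed range predicate is generally easier for the optimizer to prune on:

```sql
-- Sketch: the same day selected as a half-open range on the
-- partitioning column itself.
SELECT *
FROM `mydataset.raw_view`
WHERE publish_time >= TIMESTAMP("2024-07-17")
  AND publish_time <  TIMESTAMP("2024-07-18");
```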
When you stream data into Google BigQuery from sources like MySQL, it goes into a temporary area called the streaming buffer before being added to the actual table. This buffer holds the data temporarily while it's being processed.
Now, when you try to run a query on this data while it's still in the streaming buffer, BigQuery might not be able to use the partitions (which are like organized sections of the table) efficiently. This means it might end up scanning the whole table instead of just the parts it needs to, which can slow things down.
To make sure your queries run smoothly, it's a good idea to wait until all the data has been moved out of the streaming buffer and into the table before running them. You can keep an eye on this process in the BigQuery interface or through programming tools.
If you're having trouble with slow queries or need to speed things up, you might want to consider adjusting how you're streaming data into BigQuery or changing how you're structuring your queries to make them more efficient.
Hey Kartik,
Based on your problem description, it sounds like you are streaming upserts to BigQuery using BigQuery CDC [ref] with the Storage Write API, perhaps with Datastream reading from your MySQL table. Assuming this is true, based on the description of your question, the reason why your queries aren't using partitioning effectively actually has nothing to do with your table's streaming buffer. It is because you've configured your table's max_staleness setting too low and your queries are performing runtime merge jobs to merge the most recent upserts during query time. If you were to increase your table's max_staleness and/or configured sufficient resources to keep your baseline table refreshed within your configured staleness window, your queries will resume using table partitioning.
You can learn more about query behavior using partitioning HERE, and learn more about recommended max_staleness values HERE.
For use cases not involving streaming CDC upserts to BigQuery, running queries against appended data streamed to BigQuery will leverage partitioning in your queries as expected.
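If you want to verify what a table is currently configured with, something like this should show it (a sketch using the standard `TABLE_OPTIONS` view; dataset and table names are placeholders):

```sql
-- Sketch: check whether max_staleness is set on the table.
SELECT option_name, option_value
FROM `mydataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE table_name = 'mytable'
  AND option_name = 'max_staleness';
```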
I'm still having issues with this, and the above answers hold no weight.
I have a table with 3.9 GB in it, and when I query based on a partition, it does a full table scan. The data in the buffer is only KBs, but my queries still scan 3.9 GB. This is the opposite of what you are suggesting.
My table:
My storage info:
My buffer stats:
My query and estimated cost:
Please explain why it is scanning 3.9 GB when there is so little in the streaming buffer. The responses Google has given are not telling the whole story.