Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Why does the viewed BigQuery data change after a certain period of time?

The January data I viewed in February is different from the January data I viewed in March.
The query is the same, but the number of sign_up events dropped by 25% in March, and I don't know why.

Solved Solved
0 2 229
1 ACCEPTED SOLUTION

 

When working with Google Cloud BigQuery, you might notice that the data observed through the same query changes over time. Here's why this can happen, along with how to investigate:

Usual Suspects

  • Late-arriving data and backfills: Data pipelines often experience delays or perform backfill operations to correct historical data. These are frequently the main reasons why your January sign-up numbers might look different between February and March.
  • Query errors: Before diving deeper, double-check your query for unintentional filters, incorrect joins, or changes that might have been introduced accidentally. Even small query mistakes can drastically skew your results.

Other Factors to Consider

  • Data ingestion patterns (CDC): While BigQuery doesn't directly utilize Change Data Capture (CDC), if your data originates from a system that does, updates or deletes in the source will be reflected in BigQuery over time.
  • Table Snapshots and Versioning: If you are unintentionally querying older snapshots instead of the live table, results might appear to change. Ensure you are referencing the correct table versions.
  • Streaming Inserts: BigQuery's streaming buffer might temporarily delay recent data availability for querying. This is less likely to explain major discrepancies over longer periods.
  • Schema Changes: Rarely, a table's schema may be modified. If your query doesn't align with the new schema, the results might be incorrect.
  • Data Deletion for Compliance: Systems might have processes that delete older data for privacy reasons (e.g., GDPR). This would cause records to disappear over time.

Investigation Strategies

 

  • Change History: BigQuery's change history is your first tool to see if data updates or deletions have occurred.
  • Data Pipeline Examination: Work with your data team to see if delays and backfills are normal parts of the pipeline.
  • Snapshot Verification: Ensure you're querying the primary table and not outdated snapshots.
  • Source System Checks: Consult with those who manage your source data to understand if historical modifications are expected.

 

View solution in original post

2 REPLIES 2

 

When working with Google Cloud BigQuery, you might notice that the data observed through the same query changes over time. Here's why this can happen, along with how to investigate:

Usual Suspects

  • Late-arriving data and backfills: Data pipelines often experience delays or perform backfill operations to correct historical data. These are frequently the main reasons why your January sign-up numbers might look different between February and March.
  • Query errors: Before diving deeper, double-check your query for unintentional filters, incorrect joins, or changes that might have been introduced accidentally. Even small query mistakes can drastically skew your results.

Other Factors to Consider

  • Data ingestion patterns (CDC): While BigQuery doesn't directly utilize Change Data Capture (CDC), if your data originates from a system that does, updates or deletes in the source will be reflected in BigQuery over time.
  • Table Snapshots and Versioning: If you are unintentionally querying older snapshots instead of the live table, results might appear to change. Ensure you are referencing the correct table versions.
  • Streaming Inserts: BigQuery's streaming buffer might temporarily delay recent data availability for querying. This is less likely to explain major discrepancies over longer periods.
  • Schema Changes: Rarely, a table's schema may be modified. If your query doesn't align with the new schema, the results might be incorrect.
  • Data Deletion for Compliance: Systems might have processes that delete older data for privacy reasons (e.g., GDPR). This would cause records to disappear over time.

Investigation Strategies

 

  • Change History: BigQuery's change history is your first tool to see if data updates or deletions have occurred.
  • Data Pipeline Examination: Work with your data team to see if delays and backfills are normal parts of the pipeline.
  • Snapshot Verification: Ensure you're querying the primary table and not outdated snapshots.
  • Source System Checks: Consult with those who manage your source data to understand if historical modifications are expected.

 

Thank you for answer!

I found a reason why the data i saw in the past is different from now.
That reason is that Big query table was set deleted after 60days!

I fixed that and solve this problem🙂