Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Datastream missing data, no error

Hi, I have 2 datastream from the same mysql source that ingest data to bigquery and google cloud storage (accessed in bigquery using external table). 

I found a lot of times, there are missing rows in bigquery table. So for example, the row count in source & destination are like this:

tablerow count
mysql (source)100
bigquery86
cloud storage100

To temporarily fix this, I usually do a backfill for the datastream to bigquery. However as we have no idea what the root cause is, it's keep happinening. Wondering if anyone have the same experience, and what can I do to fix the issue for the data that are directly ingested to bigquery.

0 1 405
1 REPLY 1

Hi @brown_squirrel,

Welcome to Google Cloud Community!

Discrepancies on Datastream between your source(MySQL) and destinations(Cloud Storage and BigQuery) indicate a possible problem on your ingestion pipeline specifically to BigQuery. Datastream missing data with BigQuery as a destination can be due to various reasons. Here are some potential causes and suggestions that may help resolve the issue:

  • Limitations of BigQuery as a destination: There are several limitations when using BigQuery as a destination on Datastream that could potentially affect your data. This includes primary keys, clustering columns, project consistency, table name restrictions, etc. For complete details, you can refer to this documentation. Ensuring that the Datastream setup addresses these limitations is crucial for maintaining data integrity.
  • Unsupported Events: The most common unsupported events in Datastream involve unsupported data types and unsupported schema changes, which cannot be replicated from the source (MySQL) to the destination (BigQuery). To prevent issues such as data loss or schema inconsistencies, it is important to ensure that data types in MySQL have compatible mappings in BigQuery or changes in schema in the source database are reflected and updated in BigQuery as your destination.
  • Primary Key Changes: Any changes to primary keys can affect data integrity. If changes, such as a new primary key, are not correctly inserted, it might result in missing rows in BigQuery.
  • Time Delay: Another possible scenario is data to BigQuery arrives later than expected. Check the Datastream logs for any possible latency issue or backlogs in the data pipeline. You can also check the timestamp in both of your source and destination to verify any delay in ingestion. Additionally, you can use Datastream’s monitoring to track the latency and backlog.
  • BigQuery Quota: Check if your quota does not exceed the limit set on your account. Exceeding the number of load jobs per day can definitely affect the data ingestion in BigQuery.

If the issue persists, I recommend reaching out to Google Cloud Support for further assistance, as they can provide insights into whether this behavior is specific to your project.

I hope the above information is helpful.