Hi everyone!
I am streaming data from a source DB to Cloud Storage (Avro format) using Datastream, and then reading those files into BigQuery using BigLake external tables.
My question is, I have now had to add a column to the source DB, and when reading the BigLake table I get this error:
Error while reading table: itest.test.hello, error message: The Apache Avro library failed to read data with the following error: Field in reader schema 'new_col' without default value not present in writer schema. File: bigstore/itest/hello_user/2023/11/14/21/47/0c74d72b1a43793ba622b7869a41bfdcd799f6cd_mysql-cdc-binlog_864522774_7_0.avro
How can I get this table working and reading correctly? The Datastream Cloud Storage destination documentation says that the files are "rotated" each time a source schema change is detected. Is this done automatically, or am I missing something here?
When you add a new column to the source DB, Datastream is designed to automatically detect this schema change and rotate the Avro files in Cloud Storage. However, there might be a delay in this process, during which the BigLake table may attempt to read from the old Avro files that do not contain the new column, leading to the error you're experiencing.
To address this and ensure your BigLake table reads data correctly, consider the following steps:
Wait for Rotated Files: Allow some time for Datastream to process the schema change and rotate the Avro files. The time required can vary depending on the data volume and Datastream configuration.
Reload BigLake Table Definition: Manually reloading the BigLake table definition can prompt it to adopt the new schema from the updated Avro files (see the DDL sketch after this list).
Create a New BigLake Table: If the issue persists, consider creating a new BigLake table with the updated schema. Because a BigLake table is an external table, the data stays in Cloud Storage; you only need to define a new table (or replace the existing definition) over the same Avro files so the revised schema is picked up.
Monitor Datastream: Keep an eye on Datastream notifications for schema changes. This proactive monitoring can help you anticipate and manage schema updates more effectively.
Implement Avro File Versioning: Adopting a versioning scheme for your Avro files can aid in tracking different data versions and resolving conflicts.
Regular Testing: Conduct regular tests on your BigLake tables to ensure they can read the latest Avro files without encountering errors.
Manage Schema Evolution: Ensure that new schemas are backward compatible with older ones, especially when processing a mix of old and new files. If backward compatibility isn't feasible, consider reprocessing older data with the new schema or maintaining separate data processing pipelines for different schema versions.
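As a rough sketch of what reloading or re-creating the table definition can look like in SQL: the project, dataset, bucket, path and connection names below are placeholders, and the WITH CONNECTION clause only applies if the table is a connection-based BigLake table rather than a plain external table.
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.hello_user`
WITH CONNECTION `my-project.eu.my-biglake-connection`  -- placeholder connection
OPTIONS (
  format = 'AVRO',
  -- same wildcard URI the original table pointed at (placeholder bucket/path)
  uris = ['gs://my-bucket/itest/hello_user/*'],
  -- interpret Avro logical types (timestamps, decimals) as BigQuery types
  enable_logical_types = TRUE
);
Re-creating the definition makes BigQuery re-read the schema from the Avro files present at that point.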
Thank you for the detailed response. I am, however, still struggling with this issue.
Should there be specific configurations made to datastream, gcs or bigquery for this to work, or is it possible to set up using just the console?
I'll add a few more details on the steps I followed. All of the steps were performed within the console, in region eu-west9:
1. I created a Cloud SQL MySQL instance and added a few records.
2. I then created a source Datastream connection to the Cloud SQL MySQL instance.
3. Then I created the destination connection to Cloud Storage.
4. I then started the stream and performed an automatic backfill. It is set to save to the Cloud Storage destination in Avro format.
5. I then set up a BigQuery BigLake table pointing to the parent directory containing all of the Avro files, using the wildcard (*) notation to pull in all of the files for reading. Schema detection was set to automatic.
6. I was then able to query the table without an issue.
7. I then added a single additional column to the source MySQL table and inserted a few records. Within a few minutes they were pulled into Cloud Storage by Datastream.
8. However, when I then tried executing queries on the BigLake table I received the error above:
...error message: The Apache Avro library failed to read data with the following error: Field in reader schema 'new_col' without default value not present in writer schema....
I followed your suggested steps 1, 2 and 3 but received the same error message.
Are there potentially specific configurations I need to set up using the CLI instead of the console defaults?
I am also not sure how to go about steps 4, 5 and 7.
I basically just need it to adapt to me adding additional columns to the dataset (no type changes or column name changes will occur, only additional columns).
And since I am streaming via Datastream, the only actual data I need is within the "payload" record. Is it possible to easily read in that part of the data, or do I need to create a materialised view to achieve that?
If more project-specific information is needed I'd be happy to share it, but I'm assuming that would have to go through a direct line of support?
Based on the details you've provided, it appears that Datastream is detecting the schema change and rotating the Avro files as expected. However, the BigLake table in BigQuery may not be immediately recognizing these schema changes. This delay could be due to the time Datastream takes to process the schema change and generate new Avro files, or the time required for BigQuery to refresh its metadata for the BigLake table.
To address this issue, consider the following steps:
Allow More Time for Processing:
Manually Reload BigLake Table Definition:
Create a New BigLake Table:
Advanced Troubleshooting:
Extracting "Payload" Data:
SELECT
payload.id,
payload.name,
payload.email
FROM `project.dataset.table`
Considerations for Schema Evolution:
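As a minimal sketch of wrapping the payload extraction in a standard view (the project, dataset, table, and view names here are placeholders):
-- placeholder names; exposes only the nested payload fields from the Datastream output
CREATE OR REPLACE VIEW `project.dataset.hello_user_payload` AS
SELECT
  payload.id,
  payload.name,
  payload.email
FROM `project.dataset.table`;
Downstream queries can then select from the view as if it were a flat table.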
Once again thank you for the answer.
I think, from the error "The Apache Avro library failed to read data with the following error: Field in reader schema 'new_col' without default value not present in writer schema", it is likely to be 6.
Do you perhaps have a guide on how I can add a default value to the schema? Unfortunately, since it is a BigQuery stream and the field sits within the "payload" record, I am not allowed to assign a default value to a nested field and get a "Default value does not apply to nested schema." notification.
Yes, you are correct that assigning default values directly to nested fields within a BigQuery stream is not feasible. BigQuery streams are optimized for handling continuously flowing data, and the complexity of managing default values for nested fields could introduce unnecessary overhead in the streaming process.
To effectively manage missing values in the "payload" record when a new column is added to the source database, consider the following approaches:
Utilize NULL Values:
Implement NULL Value Checks in Queries: Use IFNULL or COALESCE to handle these NULL values. For example, you can replace NULL values with a default value in your queries:
SELECT
payload.id,
payload.name,
payload.email,
IFNULL(payload.new_col, 'default_value') AS new_col
FROM `project.dataset.table`
Use Materialized Views for Data Transformation: A view that applies the default value during the payload extraction is sketched after this reply.
Data Transformation Before BigQuery Ingestion:
Adapt to Schema Evolution:
Consult Documentation and Seek Support:
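Since the default cannot be attached to the nested field itself, one workaround is to apply it in a view over the BigLake table so that downstream queries never see the NULL. This is only a sketch using a standard (non-materialized) view, and the project, dataset, table, view, and column names are placeholders:
-- placeholder names; substitutes a default wherever new_col is missing (e.g. rows from older Avro files)
CREATE OR REPLACE VIEW `project.dataset.hello_user_with_default` AS
SELECT
  payload.id,
  payload.name,
  payload.email,
  IFNULL(payload.new_col, 'default_value') AS new_col
FROM `project.dataset.table`;
Queries against the view then see 'default_value' instead of NULL for new_col, without any change to the underlying table schema.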
Hi Steve,
Did you manage to resolve it? How?