Hi everyone!
I am streaming data from a source DB to Cloud Storage (Avro format) using Datastream, and then reading those files into BigQuery using BigLake external tables.
My question is, I have now had to add a column to the source DB, and when reading the BigLake table I get this error:
Error while reading table: itest.test.hello, error message: The Apache Avro library failed to read data with the following error: Field in reader schema 'new_col' without default value not present in writer schema. File: bigstore/itest/hello_user/2023/11/14/21/47/0c74d72b1a43793ba622b7869a41bfdcd799f6cd_mysql-cdc-binlog_864522774_7_0.avro
How can I get this table working and reading correctly? The Datastream Cloud Storage destination documentation says that the files are "rotated" each time a source schema change is detected. Is this done automatically, or am I missing something here?
When you add a new column to the source DB, Datastream is designed to automatically detect this schema change and rotate the Avro files in Cloud Storage. However, there might be a delay in this process, during which the BigLake table may attempt to read from the old Avro files that do not contain the new column, leading to the error you're experiencing.
To address this and ensure your BigLake table reads data correctly, consider the following steps:
Wait for Rotated Files: Allow some time for Datastream to process the schema change and rotate the Avro files. The time required can vary depending on the data volume and Datastream configuration.
Reload BigLake Table Definition: Manually reloading the BigLake table definition can prompt it to adopt the new schema from the updated Avro files (see the DDL sketch after this list).
Create a New BigLake Table: If the issue persists, consider creating a new BigLake table with the updated schema. Because a BigLake table is an external table, the data stays in Cloud Storage; you only need to define a new table (or replace the existing definition) over the same Avro files so the revised schema is picked up.
Monitor Datastream: Keep an eye on Datastream notifications for schema changes. This proactive monitoring can help you anticipate and manage schema updates more effectively.
Implement Avro File Versioning: Adopting a versioning scheme for your Avro files can aid in tracking different data versions and resolving conflicts.
Regular Testing: Conduct regular tests on your BigLake tables to ensure they can read the latest Avro files without encountering errors.
Manage Schema Evolution: Ensure that new schemas are backward compatible with older ones, especially when processing a mix of old and new files. If backward compatibility isn't feasible, consider reprocessing older data with the new schema or maintaining separate data processing pipelines for different schema versions.
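As a rough sketch of what reloading or re-creating the table definition can look like in SQL: the project, dataset, bucket, path and connection names below are placeholders, and the WITH CONNECTION clause only applies if the table is a connection-based BigLake table rather than a plain external table.
CREATE OR REPLACE EXTERNAL TABLE `my-project.my_dataset.hello_user`
WITH CONNECTION `my-project.eu.my-biglake-connection`  -- placeholder connection
OPTIONS (
  format = 'AVRO',
  -- same wildcard URI the original table pointed at (placeholder bucket/path)
  uris = ['gs://my-bucket/itest/hello_user/*'],
  -- interpret Avro logical types (timestamps, decimals) as BigQuery types
  enable_logical_types = TRUE
);
Re-creating the definition makes BigQuery re-read the schema from the Avro files present at that point.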
Thank you for the detailed response. I am, however, still struggling with this issue.
Should there be specific configurations made to datastream, gcs or bigquery for this to work, or is it possible to set up using just the console?
I'll add a few more details on the steps I followed. All of the steps were performed within the console, in region eu-west9:
1. I created a Cloud SQL MySQL instance and added a few records.
2. I then created a source Datastream connection to the Cloud SQL MySQL instance.
3. Then I created the destination connection to Cloud Storage.
4. I then started the stream and performed an automatic backfill. It is set to save to the Cloud Storage destination in Avro format.
5. I then set up a BigQuery BigLake table pointing to the parent directory containing all of the Avro files, using the wildcard (*) notation to pull in all of the files for reading. Schema detection was set to automatic.
6. I was then able to query the table without an issue.
7. I then added a single additional column to the source MySQL table and inserted a few records. Within a few minutes they were pulled into Cloud Storage by Datastream.
8. However, when I then tried executing queries on the BigLake table I received the error above:
...error message: The Apache Avro library failed to read data with the following error: Field in reader schema 'new_col' without default value not present in writer schema....
I followed your suggested steps 1, 2 and 3 but received the same error message.
Are there potentially specific configurations I need to set up using the CLI instead of the console defaults?
I am also not sure how to go about steps 4, 5 and 7.
I basically just need it to adapt to me adding additional columns to the dataset (no type changes or column name changes will occur, only additional columns).
And since I am streaming via Datastream, the only actual data I need is within the "payload" record. Is it possible to easily read in that part of the data, or do I need to create a materialised view to achieve that?
If more project-specific information is needed I'd be happy to share it, but I'm assuming that would have to go through a direct line of support?
Based on the details you've provided, it appears that Datastream is detecting the schema change and rotating the Avro files as expected. However, the BigLake table in BigQuery may not be immediately recognizing these schema changes. This delay could be due to the time Datastream takes to process the schema change and generate new Avro files, or the time required for BigQuery to refresh its metadata for the BigLake table.
To address this issue, consider the following steps:
Allow More Time for Processing:
Manually Reload BigLake Table Definition:
Create a New BigLake Table:
Advanced Troubleshooting:
Extracting "Payload" Data:
SELECT
payload.id,
payload.name,
payload.email
FROM `project.dataset.table`
Considerations for Schema Evolution:
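As a minimal sketch of wrapping the payload extraction in a standard view (the project, dataset, table, and view names here are placeholders):
-- placeholder names; exposes only the nested payload fields from the Datastream output
CREATE OR REPLACE VIEW `project.dataset.hello_user_payload` AS
SELECT
  payload.id,
  payload.name,
  payload.email
FROM `project.dataset.table`;
Downstream queries can then select from the view as if it were a flat table.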
Once again thank you for the answer.
I think, from the error "The Apache Avro library failed to read data with the following error: Field in reader schema 'new_col' without default value not present in writer schema", it is likely to be 6.
Do you perhaps have a guide on how I can add a default value to the schema? Unfortunately, since it is a BigQuery stream and the field sits within the "payload" record, I am not allowed to assign a default value to a nested field and get a "Default value does not apply to nested schema." notification.
Yes, you are correct that assigning default values directly to nested fields within a BigQuery stream is not feasible. BigQuery streams are optimized for handling continuously flowing data, and the complexity of managing default values for nested fields could introduce unnecessary overhead in the streaming process.
To effectively manage missing values in the "payload" record when a new column is added to the source database, consider the following approaches:
Utilize NULL Values:
Implement NULL Value Checks in Queries: Use IFNULL or COALESCE to handle these NULL values. For example, you can replace NULL values with a default value in your queries:
SELECT
payload.id,
payload.name,
payload.email,
IFNULL(payload.new_col, 'default_value') AS new_col
FROM `project.dataset.table`
Use Materialized Views for Data Transformation: A view that applies the default value during the payload extraction is sketched after this reply.
Data Transformation Before BigQuery Ingestion:
Adapt to Schema Evolution:
Consult Documentation and Seek Support:
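Since the default cannot be attached to the nested field itself, one workaround is to apply it in a view over the BigLake table so that downstream queries never see the NULL. This is only a sketch using a standard (non-materialized) view, and the project, dataset, table, view, and column names are placeholders:
-- placeholder names; substitutes a default wherever new_col is missing (e.g. rows from older Avro files)
CREATE OR REPLACE VIEW `project.dataset.hello_user_with_default` AS
SELECT
  payload.id,
  payload.name,
  payload.email,
  IFNULL(payload.new_col, 'default_value') AS new_col
FROM `project.dataset.table`;
Queries against the view then see 'default_value' instead of NULL for new_col, without any change to the underlying table schema.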
Hi Steve,
Did you manage to resolve it? How?