
Failing to load empty cells when a column is converted to simple-date in Cloud Data Fusion

Hi everyone,

I'm working on a pipeline that loads a CSV file from a GCS bucket into BigQuery using Cloud Data Fusion. I'm running into an issue when I try to convert a string column to a simple date format: any rows that have empty cells in that column are skipped when the data is loaded into the BigQuery table.

I need help ensuring that the data loads into BigQuery without skipping the rows that have empty cells in the date column. Any advice or solutions would be greatly appreciated.

Thanks,
Murali

ACCEPTED SOLUTION

BigQuery is strict about data types: when you try to convert a string field that contains empty values into a DATE type, the conversion fails for those values and the entire row is rejected. You can work around this in Cloud Data Fusion as follows:

  1. Add a Wrangler Plugin: In your pipeline, insert a Wrangler plugin between your GCS source and BigQuery sink.

  2. Transformation:

    • Replace Empty with Null (note that the date column in your BigQuery table must be NULLABLE for these rows to load):
     
    set-column :YourDateColumn empty(YourDateColumn) ? null : YourDateColumn
    
    • Replace Empty with a Default Date (e.g., '1900-01-01'):
     
    fill-null-or-empty :YourDateColumn '1900-01-01'
    
    • Replace Empty with a Calculated Date (e.g., today's date): use set-column with a date expression; the exact date function available depends on your Wrangler version, so check the expression reference in your instance.
    
  3. Type Conversion: After handling the empty cells, convert the column to a date:

     
    parse-as-simple-date :YourDateColumn yyyy-MM-dd
    
    

Advantages:

Wrangler provides a flexible way to handle various data quality issues within your pipeline.

BigQuery Load Options:

Implementation:

  1. Configure BigQuery Sink: In your BigQuery sink settings, look for options related to "Write Disposition" and "Error Handling."
  2. Write Disposition: Choose "WRITE_APPEND" to add new data to your table.
  3. Error Handling:
    • Maximum Bad Records: Set this to a value greater than zero to allow some rows with errors to be skipped.
    • Ignore Unknown Values: Enable this if you have extra columns in your CSV that aren't defined in your BigQuery table.
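
For reference, these settings correspond to options on the underlying BigQuery load API. A minimal sketch with the google-cloud-bigquery Python client, just to show what the same options mean there (the bucket, dataset, and table names are placeholders, not from your pipeline):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                                  # skip the CSV header row
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    max_bad_records=10,                                   # tolerate up to 10 unparseable rows before failing
    ignore_unknown_values=True,                           # skip CSV columns not present in the table schema
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/your-file.csv",                     # placeholder GCS path
    "your-project.your_dataset.your_table",               # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish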

Disadvantages:

This is less granular than Wrangler: rows with errors are silently dropped, and if the number of bad records exceeds your threshold the load fails outright.

Pre-processing the CSV:

How it Works:

You could modify your CSV file before loading it into Data Fusion, either manually or with a script, to fill or remove empty cells.
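
For example, a small Python script along these lines (a sketch; the file names, column name, and default date are placeholders) could fill the empty date cells before you re-upload the file to GCS:

import csv

DATE_COLUMN = "YourDateColumn"   # placeholder column name
DEFAULT_DATE = "1900-01-01"      # value used for empty cells

with open("input.csv", newline="") as src, open("cleaned.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Fill empty or whitespace-only date cells with the default value
        if not row[DATE_COLUMN].strip():
            row[DATE_COLUMN] = DEFAULT_DATE
        writer.writerow(row)

You would then point the pipeline's GCS source at the cleaned file.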

Disadvantages:

Adds an extra step and might not be scalable for large datasets.

Example Wrangler Transformation (using the default-date option):

 
parse-as-csv :body ',' true
fill-null-or-empty :YourDateColumn '1900-01-01'
parse-as-simple-date :YourDateColumn yyyy-MM-dd

Important Considerations:

  • Data Quality: Before choosing a solution, assess why you have empty dates. If they represent genuinely missing data, consider whether filling them with a default value is appropriate for your analysis.
  • Date Format: Make sure the pattern you pass to parse-as-simple-date matches your source strings, and that the resulting values are in a format BigQuery accepts (e.g., YYYY-MM-DD); the sketch below can help you spot empty or malformed values before you build the pipeline.
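
A quick way to check both points is to scan the file up front. A minimal Python sketch (file name, column name, and expected format are placeholders):

import csv
from datetime import datetime

DATE_COLUMN = "YourDateColumn"   # placeholder column name
DATE_FORMAT = "%Y-%m-%d"         # expected format, e.g. 2024-01-31

empty, bad, ok = 0, 0, 0
with open("input.csv", newline="") as src:
    for row in csv.DictReader(src):
        value = row[DATE_COLUMN].strip()
        if not value:
            empty += 1            # count empty cells
        else:
            try:
                datetime.strptime(value, DATE_FORMAT)
                ok += 1           # parses with the expected format
            except ValueError:
                bad += 1          # present but malformed

print(f"valid: {ok}, empty: {empty}, malformed: {bad}")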
