Hi everyone,
I'm working on a pipeline to load a CSV file from a GCS bucket into BigQuery using Cloud Data Fusion. I'm encountering an issue when I try to convert a string column to a simple date format. The problem is that any empty cells in this column are causing the entire rows to be skipped when loading into the BigQuery table.
I need help ensuring that the data loads into BigQuery without skipping the rows that have empty cells in the date column. Any advice or solutions would be greatly appreciated.
Thanks,
Murali
Solved! Go to Solution.
BigQuery is particular about data types. When you try to convert a string field with empty values into a DATE type, it likely throws an error, causing the entire row to be rejected. You can follow these steps in Google Cloud Data Fusion:
Add a Wrangler Plugin: In your pipeline, insert a Wrangler plugin between your GCS source and BigQuery sink.
Transformation:
derive value:IFMISSING(YourDateColumn, null) as:YourDateColumn
derive value:IFMISSING(YourDateColumn, '1900-01-01') as:YourDateColumn
derive value:IFMISSING(YourDateColumn, CURRENT_DATE()) as:YourDateColumn
Type Conversion: After filling the empty cells, add a transformation to convert the column to the DATE type:
parse-as simple-date YourDateColumn 'yyyy-MM-dd'
Advantages:
Wrangler provides a flexible way to handle various data quality issues within your pipeline.
BigQuery Load Options:
Implementation:
Disadvantages:
This is less granular than Wrangler, and you might lose some data if errors occur beyond your allowed threshold.
Pre-processing the CSV:
How it Works:
You could modify your CSV file before loading it into Data Fusion, either manually or with a script, to fill or remove empty cells.
Disadvantages:
Adds an extra step and might not be scalable for large datasets.
Example Wrangler Transformation:
parse-as csv
derive value:IFMISSING(YourDateColumn, null) as:YourDateColumn
parse-as simple-date YourDateColumn 'yyyy-MM-dd'
Important Considerations:
BigQuery is particular about data types. When you try to convert a string field with empty values into a DATE type, it likely throws an error, causing the entire row to be rejected. You can follow these steps in Google Cloud Data Fusion:
Add a Wrangler Plugin: In your pipeline, insert a Wrangler plugin between your GCS source and BigQuery sink.
Transformation:
derive value:IFMISSING(YourDateColumn, null) as:YourDateColumn
derive value:IFMISSING(YourDateColumn, '1900-01-01') as:YourDateColumn
derive value:IFMISSING(YourDateColumn, CURRENT_DATE()) as:YourDateColumn
Type Conversion: After filling the empty cells, add a transformation to convert the column to the DATE type:
parse-as simple-date YourDateColumn 'yyyy-MM-dd'
Advantages:
Wrangler provides a flexible way to handle various data quality issues within your pipeline.
BigQuery Load Options:
Implementation:
Disadvantages:
This is less granular than Wrangler, and you might lose some data if errors occur beyond your allowed threshold.
Pre-processing the CSV:
How it Works:
You could modify your CSV file before loading it into Data Fusion, either manually or with a script, to fill or remove empty cells.
Disadvantages:
Adds an extra step and might not be scalable for large datasets.
Example Wrangler Transformation:
parse-as csv
derive value:IFMISSING(YourDateColumn, null) as:YourDateColumn
parse-as simple-date YourDateColumn 'yyyy-MM-dd'
Important Considerations: