
Adding created & updated timestamp columns in a Data Fusion pipeline using Wrangler

Hi GCP Community,

I'm currently working with Data Fusion and have set up a pipeline to transfer data from SQL Server to BigQuery using Wrangler. I'd like to add two columns, created_timestamp and updated_timestamp, populated with the current timestamp for each row. I attempted to use JEXL in Wrangler with the following transformation: "set-column: created_timestamp now()".

However, this doesn't seem to work. Could anyone provide guidance or examples on how to achieve this?

Thanks in advance for your help!

Best regards,

Naveen.


Wrangler's primary purpose is data cleaning and transformation. While it offers scripting capabilities with JEXL, it's not designed to generate dynamic timestamps within the data flow. The now() function isn’t natively recognized in Wrangler's JEXL context.

Solution 1: Pre-Processing with SQL Server

The most efficient approach is to handle timestamp generation at the source (SQL Server) before the data enters Data Fusion:

  1. Modify Your SQL Query: Adjust your SQL query to include created_timestamp and updated_timestamp columns. Use SQL Server's built-in functions to populate them with the current timestamp. For example:

     
    SELECT *, GETDATE() AS created_timestamp, GETDATE() AS updated_timestamp
    FROM your_table;
    
  2. Data Fusion Pipeline: The SQL Server source will now include these columns automatically. You won’t need any Wrangler transformations for this task.
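A note on time zones: GETDATE() returns the SQL Server host's local time, while BigQuery interprets TIMESTAMP values as UTC. If you want the audit columns in UTC, a variant of the query above (a sketch only; your_table is illustrative, as before) is:

    -- Same idea as above, but stamped in UTC to match BigQuery's TIMESTAMP
    -- semantics. SYSUTCDATETIME() returns the current UTC time as DATETIME2.
    SELECT t.*,
           SYSUTCDATETIME() AS created_timestamp,
           SYSUTCDATETIME() AS updated_timestamp
    FROM your_table AS t;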

Solution 2: Wrangler + JavaScript UDF (Less Ideal)

If generating timestamps within Data Fusion is necessary, use Wrangler’s JavaScript User-Defined Function (UDF) capability:

  1. Wrangler Transformation:

    • Add a Wrangler transform to your pipeline.

    • In the Wrangler UI, click the "Functions" tab.

    • Create a new JavaScript UDF with the following code:

       
      function addTimestamp() {
          // Return the current time as an ISO-8601 string,
          // e.g. "2024-06-20T10:54:27.010Z". Note this produces a STRING
          // column; cast it to TIMESTAMP downstream if the sink expects one.
          return new Date().toISOString();
      }
      
  2. Apply the UDF:

    • Switch to the "Transform" tab.

    • Add two directives:

      derive created_timestamp using addTimestamp()
      derive updated_timestamp using addTimestamp()
      

Important Considerations:

  • Accuracy: Because the JavaScript UDF is evaluated once per row, timestamps can vary slightly from row to row within a single run.
  • Performance: UDFs can be less performant than handling this at the SQL Server level.
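A note on the accuracy point: in SQL Server, GETDATE() and SYSUTCDATETIME() are evaluated once per statement, so every row returned by the Solution 1 query already gets an identical value. If several statements need to share one run timestamp, you can capture it in a variable first. This is a plain SQL Server sketch with illustrative names; whether your source plugin accepts a multi-statement import query is a separate consideration:

    -- Capture one timestamp for the whole run and reuse it.
    DECLARE @run_ts DATETIME2 = SYSUTCDATETIME();

    SELECT t.*,
           @run_ts AS created_timestamp,
           @run_ts AS updated_timestamp
    FROM your_table AS t;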

Example Pipeline (Solution 1)

SQL Server Source -> BigQuery Sink

SQL Server Query: Include the timestamp generation logic mentioned earlier.

Example Pipeline (Solution 2)

SQL Server Source -> Wrangler (JavaScript UDF) -> BigQuery Sink

 

Hi @ms4446 ,

Thank you for the suggestion.

I tried Solution 1 (Pre-Processing with SQL Server) using the following SQL query:

    SELECT *, GETDATE() AS created_timestamp, GETDATE() AS updated_timestamp
    FROM dataset.schema.My_table;

[screenshot: Naveen_kumar_0-1718883267010.png]

This works for transferring the table data from SQL Server to BigQuery, but the audit columns (`created_timestamp` and `updated_timestamp`) are not being added. Please see the attached screenshot for reference.

Any further assistance on how to properly add these timestamp columns would be greatly appreciated.

Thanks again!

Best regards,
Naveen.

 

Hi @Naveen_kumar ,

This issue might be related to how Data Fusion handles schema updates and transformations. Here are some potential solutions:

  1. Schema Mismatch in Data Fusion

    • Issue: The schema defined in Data Fusion might not automatically reflect the new columns from the SQL query.

    • Solution: Refresh the schema in Data Fusion.

      • Steps:
        1. Modify your SQL query as shown:

           
          SELECT *, GETDATE() AS created_timestamp, GETDATE() AS updated_timestamp
          FROM [database].[schema].[My_table];
          
        2. Click the GET SCHEMA button in the Data Fusion UI to update the schema.

  2. Verify Output Schema

    • Issue: The output schema may need explicit updates to reflect changes.

    • Solution: Manually add the new columns to the output schema in Data Fusion.

      • Steps:
        1. In the Output Schema section, click on the plus (+) icon to add new fields.
        2. Add created_timestamp and updated_timestamp manually as fields of type TIMESTAMP (see the DDL sketch after this list).
  3. SQL Query Adjustment

    • Issue: The SQL query might need adjustments depending on the SQL dialect or Data Fusion's handling.

    • Solution: Ensure correct SQL syntax and proper referencing of your dataset and table.

      • Steps:
        1. Use fully qualified names with SQL Server's bracket-quoted identifiers:

           
          SELECT t.*, GETDATE() AS created_timestamp, GETDATE() AS updated_timestamp
          FROM [database].[schema].[My_table] AS t;
          
  4. Data Fusion Version or Configuration

    • Issue: Your version of Data Fusion might not fully support dynamic schema updates from SQL queries.

    • Solution: Check for updates or known issues with your Data Fusion version.

      • Steps:
        1. Ensure your Data Fusion instance is up to date.
        2. Consult Data Fusion documentation or support for version-specific schema update issues.
  5. Pipeline Configuration

    • Issue: The Data Fusion pipeline might not propagate schema changes properly through the stages.

    • Solution: Review the pipeline stages to ensure schema changes are applied consistently.

      • Steps:
        1. Verify the schema at each pipeline stage (from source to sink).
        2. Ensure intermediate stages don't strip or ignore the new columns.
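
Two sketches to go with the points above. For point 2, if the destination table already exists in BigQuery, its schema must also include the new fields; a minimal example of matching DDL, where my_dataset, My_table, and the id column are placeholders for your real schema:

    -- Sketch of a matching destination table in BigQuery.
    CREATE TABLE IF NOT EXISTS my_dataset.My_table (
      id INT64,                      -- example source column; use your real columns
      created_timestamp TIMESTAMP,
      updated_timestamp TIMESTAMP
    );

For point 5, one way to confirm after a run that the sink actually received the new columns is BigQuery's INFORMATION_SCHEMA (again with placeholder names):

    SELECT column_name, data_type
    FROM my_dataset.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = 'My_table'
      AND column_name IN ('created_timestamp', 'updated_timestamp');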