
Removing columns from Datastream to BigQuery Dataflow job

Hi,
I have a Dataflow streaming job running with the Datastream to BigQuery template. My source database has only 4 columns, but after the Dataflow job my final table has many extra columns, such as _metadata_timestamp, _metadata_read_timestamp, etc. I have tried to remove these columns with the help of the UDF parameter, but now I am not getting any data in the staging or final table. This is my UDF:

function process(inJson) {
  const obj = JSON.parse(inJson);

  // Concatenate two fields: First_name and Last_Name
  obj.Name = obj.First_name + " " + obj.Last_Name;

  // Remove the original name fields
  delete obj.First_name;
  delete obj.Last_Name;

  // Remove the Datastream metadata fields
  delete obj._metadata_timestamp;
  delete obj._metadata_read_timestamp;
  delete obj._metadata_source_timestamp;
  delete obj._metadata_read_method;
  delete obj._metadata_source_type;
  delete obj._metadata_deleted;
  delete obj._metadata_change_type;
  delete obj._metadata_uuid;
  delete obj._metadata_log_file;
  delete obj._metadata_log_position;

  return JSON.stringify(obj);
}
How can I remove those columns in the Dataflow job itself?
 

 


To remove columns in a Datastream to BigQuery template, you need to use a UDF. Your UDF looks correct, but there are a few things to check:

  • Make sure the UDF returns a valid JSON string for every record; an exception thrown inside the UDF (for example, by JSON.parse) means the record never reaches BigQuery.
  • Make sure the UDF removes all of the columns that you want to remove.
  • Make sure the UDF does not add any unexpected new columns to the data.

If you have checked all of the above and you are still not getting any data in the staging or final table, there may be a problem with the Dataflow job itself. Check the job and worker logs in the Dataflow monitoring UI (or Cloud Logging) to see if any errors are occurring.
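For example, one way to pull recent errors for a job from Cloud Logging on the command line (JOB_ID is a placeholder for your job's ID):

gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND severity>=ERROR' \
  --limit=20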

Here are some additional tips for using UDFs with the Datastream to BigQuery template:

  • The UDF is a plain JavaScript function that receives each record as a JSON string and must return a JSON string.
  • Upload the .js file to Cloud Storage and reference it with the template's javascriptTextTransformGcsPath parameter.
  • Pass the function's name (process in your case) with the javascriptTextTransformFunctionName parameter.

For more information on using UDFs in Dataflow templates, see the official documentation: Extend your Dataflow template with UDFs: https://cloud.google.com/blog/topics/developers-practitioners/extend-your-dataflow-template-with-udfs.
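As a rough sketch of wiring those two parameters up when launching the template from the command line (the template path, bucket names, and job name are placeholders, and the template's required Datastream and BigQuery parameters are omitted here):

gcloud dataflow jobs run datastream-to-bq \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates-us-central1/latest/Cloud_Datastream_to_BigQuery \
  --parameters javascriptTextTransformGcsPath=gs://my-bucket/udf/process.js,javascriptTextTransformFunctionName=process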

Additional notes:

  • Note that using a UDF in a Datastream to BigQuery job adds per-record overhead, because the function is executed for every record the job processes.
  • If you need to remove a large number of columns, or perform complex data transformations, consider writing a custom Dataflow pipeline instead of using a template.

Thanks for replying 

I was going through the Datastream to BigQuery template source code in order to create my own custom template. I noticed that one of the optional parameters, "IgnoreFields", does not appear in the UI, even though the parameter is there in the source code.

[Screenshot: the IgnoreFields parameter in the template source code]

I want to know whether this parameter is still supported.

The IgnoreFields parameter still exists in the Datastream to BigQuery template, but it is not exposed in the UI, which means Google does not treat it as an officially supported, documented option. By modifying the template source code and deploying a custom template, you are essentially creating your own version of the template, which may not be supported or maintained by Google. Make sure you test thoroughly before using it in a production environment.
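Before forking the template, it may be worth trying to pass the parameter directly when launching the job, since gcloud lets you set parameters the UI does not show. A sketch, assuming the option is spelled ignoreFields in the source you were reading (the ^~^ prefix switches gcloud's parameter delimiter to ~ so the comma-separated field list is not split apart; the template's other required parameters are omitted):

gcloud dataflow jobs run datastream-to-bq \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates-us-central1/latest/Cloud_Datastream_to_BigQuery \
  --parameters ^~^ignoreFields=_metadata_timestamp,_metadata_read_timestamp,_metadata_deleted~inputFilePattern=gs://my-bucket/datastream-output/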

Template Source Code

The source code for the Datastream to BigQuery template is publicly available in the GoogleCloudPlatform/DataflowTemplates repository on GitHub, alongside the other Google-provided templates. You can fork that repository and build your own custom template from it.
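If you do want to build from source, cloning the repository is the starting point (the module path below is an assumption based on the repo layout at the time of writing; check the current tree):

git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git
cd DataflowTemplates/v2/datastream-to-bigquery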

gcloud Commands

There is no dedicated gcloud command that compiles a template from source. For a classic template, you stage it by running the pipeline itself with the templateLocation option pointed at a Cloud Storage path; that run writes the template file there instead of launching a job. (Newer Flex Templates are built with gcloud dataflow flex-template build and launched with gcloud dataflow flex-template run.)

Once the template is staged, you can launch a job from it with:

gcloud dataflow jobs run my-job --gcs-location gs://my-bucket/templates/my-template --region us-central1

Note that --gcs-location expects the Cloud Storage path of the staged template file, not a local file or the pipeline source code, so adjust the paths accordingly.

Modifying the template source code and deploying a custom template is an advanced task and may not be officially supported by Google. Use caution and test thoroughly before using a custom template in a production environment.

Thanks for replying


gcloud dataflow jobs run my-job --gcs-location gs://my-bucket/templates/my-template --region us-central1

So in this command, is my-template a Java source code file?
Also, if I create a custom template, can I create a job in the Dataflow UI?

 

The gs://my-bucket/templates/my-template file is a staged Dataflow template, not a Java source file, and it is more than just a JSON configuration file. It is created by running the Dataflow pipeline with the templateLocation option set to the desired output location; this staging step is normally done from a development environment (for example with Maven), not with the gcloud dataflow jobs run command.
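As a sketch of that staging step (the main class and paths are placeholders for your own pipeline; Flex Templates use gcloud dataflow flex-template build instead):

mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --stagingLocation=gs://my-bucket/staging \
    --templateLocation=gs://my-bucket/templates/my-template"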

Custom Template in Dataflow UI

Yes. In the Dataflow UI's "Create job from template" page, you can choose "Custom Template" and enter the Cloud Storage path of your staged template; the UI then prompts for the parameters defined in the template's metadata. If your template lacks a metadata file, the UI may not render its parameters correctly, in which case you can launch the job with the gcloud command-line tool instead.

Parameters

The parameters you specify in the Dataflow UI must match the parameters your custom template expects; otherwise, the job will fail to launch.
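For example, launching the same custom template from the command line with explicit parameters might look like this (the parameter names are placeholders; use whatever your template defines):

gcloud dataflow jobs run my-job \
  --gcs-location gs://my-bucket/templates/my-template \
  --region us-central1 \
  --parameters inputFilePattern=gs://my-bucket/input/*.json,outputTable=my-project:my_dataset.my_table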

Please Note: Using a custom template is an advanced task and may not be officially supported by Google. Use caution and test thoroughly before using a custom template in a production environment.

Additional suggestions

  • Custom templates let you create reusable Dataflow pipelines that can be customized for each job by changing specific pipeline parameters.
  • Custom templates can also implement complex data processing logic that is not possible with the pre-built templates.

I hope this is helpful. Please let me know if you have any other questions.