
Data Fusion - Preventing Duplicates from Sinking to BigQuery & Denormalising Tables

Please assist?
 
I would like to prevent duplicates from entering my Google BigQuery database. I have used the Distinct plugin and noticed that my final data is more than 50% smaller than my original data. Can I trust this plugin (see screenshots)?
 
In this ETL flow, I would like to denormalise some tables to make the data more compact in BigQuery. I would like to use the Joiner plugin (see attached). The idea is to join two or more tables in the ETL flow and write a single table to BigQuery. Please advise?
 
[Screenshots attached: Screenshot 2022-07-21 at 12.22.15.png, Screenshot 2022-07-21 at 12.23.00.png, Screenshot 2022-07-21 at 12.29.02.png]

Yes, you can trust the Distinct plugin. It de-duplicates input records so that all output records are distinct. A large reduction simply means that over half of your input records were identical on the fields the plugin compares; you can verify this against the source data, as in the sketch below.
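
If you want to double-check the Distinct plugin's behaviour, one option is to compare total and distinct row counts on the raw data directly in BigQuery. The sketch below assumes the raw data is already staged in a BigQuery table; the project, dataset, table, and field names are placeholders, not taken from your pipeline.

```python
# Minimal sketch: sanity-check the Distinct plugin's reduction by comparing
# total vs. distinct row counts on the staged source data.
# Table and field names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

SOURCE_TABLE = "my-project.my_dataset.raw_orders"       # assumed staging table
KEY_FIELDS = "order_id, customer_id, order_date"        # fields the Distinct plugin compares (assumption)

query = f"""
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT TO_JSON_STRING(STRUCT({KEY_FIELDS}))) AS distinct_rows
FROM `{SOURCE_TABLE}`
"""

row = next(iter(client.query(query).result()))
print(f"total rows:    {row.total_rows}")
print(f"distinct rows: {row.distinct_rows}")
# If distinct_rows is roughly half of total_rows, a >50% reduction by the
# Distinct plugin is expected rather than a sign of data loss.
```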

Use the Joiner analytics plugin to combine data from multiple inputs. Joins are based on equality, and the plugin supports inner and outer joins as well as selecting and renaming output fields. You can add a Joiner transformation at any stage in a data pipeline.

[Screenshot attached: Screenshot 2022-07-28 at 08.02.03.png]
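
Before configuring the Joiner stage, it can help to prototype the denormalised output as a plain BigQuery join and check the row count and schema you expect. The sketch below is illustrative only; the orders/customers tables, the customer_id key, and the selected fields are assumptions standing in for your own tables.

```python
# Minimal sketch: prototype the denormalised result in BigQuery SQL before
# wiring up the Joiner stage. The Joiner's inner/outer join on equal keys
# maps directly to a SQL JOIN ... ON clause.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  o.order_id,
  o.order_date,
  c.customer_name,        -- selected/renamed output fields, as in the Joiner
  c.customer_country
FROM `my-project.my_dataset.orders` AS o
LEFT JOIN `my-project.my_dataset.customers` AS c    -- outer join keeps unmatched orders
  ON o.customer_id = c.customer_id                  -- equality-based join key
"""

for row in client.query(query).result(max_results=10):
    print(dict(row))
```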

Can the Distinct plugin be moved to the first stage of the ETL pipeline to prevent data that has already been loaded from being loaded again, for an ETL job that runs daily?

No, it cannot be moved to the first stage. The Distinct plugin needs the incoming data flowing through the pipeline to check for duplicates; it de-duplicates records within the current run rather than comparing them against data that has already been loaded.

You could add an extra stage to your pipeline using the Deduplicate plugin; it allows you to apply additional filters to the data you are loading.
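
As an alternative pattern outside the pipeline itself (not mentioned above, so treat it as a suggestion rather than the Data Fusion approach), a daily job can land each run in a staging table and then MERGE it into the final table, so rows that already exist are not inserted again. The table and key names below are hypothetical.

```python
# Minimal sketch of a staging-table + MERGE pattern: insert only rows whose
# key does not already exist in the final table. Names are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my-project.my_dataset.orders` AS target
USING `my-project.my_dataset.orders_staging` AS source
ON target.order_id = source.order_id            -- assumed unique business key
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, order_date, amount)
  VALUES (source.order_id, source.customer_id, source.order_date, source.amount)
"""

client.query(merge_sql).result()  # waits for the MERGE to finish
```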