Managing multiple workflows with Google Dataform

VCimino · 11-07-2023 11:06 PM

Hi All,

I'm going to use Google Dataform in our GCP project, which hosts several BigQuery datasets.

We need to implement multiple workflows/DAGs, each with its own schedule and its own set of datasets.

I'm new to Dataform and I'd like to understand the best practices for setting up a Dataform project that handles several different workflows.

For example, how many workspaces are needed? One workspace per workflow, or one workspace for all of them?

Thanks in advance

DataEngineer

I think you have a misconception about workspaces. A repository should have as many workspaces as there are developers. So if you will be the only user then you will most likely only need 1 and maybe an additional workspace for testing purposes if need be.

As far as workflows go, you can have multiple workflows under the same repository. You can schedule as many as you like using the Workflow Configurations module in Dataform. I think you should focus on setting up the directory tree and getting a good understanding of how your data will move through transformations in your repository. Good organization will play a big role in how your workflows are laid out.

This section of the GCP documentation may help. https://cloud.google.com/dataform/docs/best-practices
This section of the legacy Dataform documentation may also help: https://docs.dataform.co/best-practices/start-your-dataform-project

VCimino

Hi,
thanks for your reply, your links are useful.
My main doubt was how to manage multiple workflow schedules that affect multiple datasets.

As far as I understand from the second link you posted, using tags is the correct strategy.
I'm also following this guideline, which reflects the same concepts: https://cloud.google.com/dataform/docs/structure-repositories
BR

DataEngineer

At the moment, I find tags to be the best way to manage multiple workflow schedules in the same release configuration. If you don't use tags, you either have to pick which tables to execute manually or run everything in your release at once. You also have the option to set up multiple release configurations and assign one dataset per release. I set up different tags for different datasets and break that down even further by frequency.
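
To make the "tag per dataset, then per frequency" idea concrete, here is a minimal Python sketch. It is not Dataform code and does not call any Dataform API. It only models the selection logic, assuming a schedule runs every action that carries at least one of its tags. All names, tags, and cron expressions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A table or view defined in the repository (hypothetical model)."""
    name: str
    tags: frozenset


@dataclass(frozen=True)
class Schedule:
    """One scheduled workflow: a cron expression plus the tags it runs."""
    name: str
    cron: str
    included_tags: frozenset


def select_actions(actions, schedule):
    """Return the sorted names of actions sharing at least one tag with the schedule."""
    return sorted(a.name for a in actions if a.tags & schedule.included_tags)


# Hypothetical actions, tagged "<dataset>_<frequency>".
ACTIONS = [
    Action("sales.orders_clean", frozenset({"sales_hourly"})),
    Action("sales.revenue_by_day", frozenset({"sales_daily"})),
    Action("marketing.campaign_stats", frozenset({"marketing_daily"})),
    # An action can carry several tags and so belong to several schedules.
    Action("marketing.spend_vs_revenue", frozenset({"marketing_daily", "sales_daily"})),
]

# Hypothetical schedules, one per dataset and frequency.
SCHEDULES = [
    Schedule("sales-hourly", "0 * * * *", frozenset({"sales_hourly"})),
    Schedule("sales-daily", "0 6 * * *", frozenset({"sales_daily"})),
    Schedule("marketing-daily", "30 6 * * *", frozenset({"marketing_daily"})),
]

if __name__ == "__main__":
    for s in SCHEDULES:
        print(f"{s.name} ({s.cron}): {select_actions(ACTIONS, s)}")
```

In a real repository the tags would be declared on each action and chosen in each workflow configuration. The sketch just shows why one release can serve several schedules once the tags encode both dataset and frequency.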

VCimino

I also find the repository-splitting strategy interesting: https://cloud.google.com/dataform/docs/splitting-repositories

Splitting the repository to get independent workflows could improve maintainability and let individual developers focus on specific domains. It seems a smarter solution than using tags when there are multiple data sources and multiple developers.