Re: run a Dataform execution in a specific region ...

syl · 12-09-2024 10:12 AM

Hi,

I'm new to Dataform, but have previously used dbt

in dataform (cloud), I have a repo named "mydata"
inside, I have workspaces named eudev and syldev
Everything is linked to a git repo named "mydata". the workspace generate the right branches
in the workspaces. There are SQLXs that are tagged wit eudata. these SQLX need to be executed on a schedule and run on the EU region.
The rest of the SQLX (not tagged) needs to be run on the US (default) region

I tried using Compilation variables (set to EU/US) and then try to add that in workflow_settings.yaml like

... other workflow_settings 
defaultLocation: "${variables.region}"

but I get an error

Location ${variables.region} does not support this operation.

How do I set that up in Dataform?

my goal is to have 1 repo managed data transformation across the regions, and have GCP data transfer moving the data between region for my needs

ms4446

In Dataform, dynamically setting the BigQuery location using compilation variables in workflow_settings.yaml is not possible due to the way BigQuery handles region settings. BQ requires a fixed location to allocate resources at the time of dataset creation or query execution. Similarly, Dataform validates the defaultLocation field during compilation, and it must be a statically defined value, such as EU or US. Attempting to use variables like ${variables.region} in this field results in errors because runtime substitution is not supported.

To address this limitation, there are three primary approaches. The first option is to create separate workflows or environments for each region, each with a fixed defaultLocation. For example, an "EU pipeline" can process EU-tagged datasets, while a "US pipeline" handles other datasets. This approach ensures clear separation of transformations, compliance with data residency requirements, and easy management of region-specific workflows.

The second approach is to use conditional logic within SQLX files, enabling a single codebase to accommodate multiple regions. Compilation variables can control which datasets are processed in each workflow based on the region. This method simplifies code management but may introduce additional complexity in SQL logic.

The third option is programmatic deployment using CI/CD pipelines. By triggering region-specific workflows with distinct parameters, teams gain fine-grained control over deployment, including pre-run validation and automation. This is particularly useful for organizations with sophisticated DevOps practices.

For inter-regional dependencies, such as workflows requiring data from another region, teams should ensure upstream workflows complete first and data replication is reliable. Workflow orchestration tools or carefully scheduled runs can help manage dependencies effectively.

Ultimately, Dataform’s design aligns with BigQuery’s architectural constraints, requiring static region definitions. While this limits dynamic flexibility, structured workflows, conditional logic, or CI/CD pipelines can provide robust solutions for managing multi-region data transformations.

syl

Thanks for the reply!

I would like to implement the 2nd approach, one codebase with some conditional Logic. do you have some example on how to implement it?

ms4446

In Dataform, handling multi-region transformations in a single codebase requires explicit separation of logic for each region while adhering to BigQuery’s constraints, which mandate static dataset locations. Dynamic schema or table selection using variables or expressions within SQL or the config block is not supported. Instead, a correct approach involves using separate SQLX files, tags, and fully qualified table names for region-specific workflows.

Each region-specific SQLX file should explicitly define its schema and location using the config block. For example, EU-specific files can be tagged with eu-region and configured to use the "EU" location, while US-specific files are tagged with us-region and use "US". Fully qualified table names (project.dataset.table) ensure clarity and proper execution within the intended BigQuery region. For shared logic, region-agnostic SQLX files without region-specific tags can be used.

Execution is controlled via tags and compilation variables. For instance, running dataform run --vars='{"target_region": "EU"}' --tag=eu-region filters and runs only EU-specific files. Automation tools, such as Bash scripts or orchestration platforms like Cloud Composer, can streamline this process by dynamically triggering region-specific runs based on a variable.

Dynamic expressions in the config block or SQL, such as attempting to switch schemas or tables using variables, are unsupported and should be avoided. Instead, clear separation of regional logic through distinct files and tags ensures compliance with BigQuery’s architecture while maintaining a manageable and flexible codebase. This approach effectively addresses multi-region requirements while adhering to Dataform and BigQuery’s constraints.

syl

How would to go about doing something like this in dataform then?

all_us_users
REGION: US
select * from prod-services.app_data_us.all_users

all_eu_users
REGION: EU
select * from prod-services.app_data_eu.all_users

then I would like to do something like

select * from all_us_users
UNION ALL
select * from all_eu_users

run a Dataform execution in a specific region (US vs EU)