
Normalize Multi-Tenant Data using Dataform

Hi All, 

At my company we are working on moving our production data from MySQL to BigQuery; however, it's in a multi-tenant structure. Each customer has their own dataset (schema), and the tables are all the same unless a customer has add-on features that don't come standard. Once the data is in BigQuery, we want to leverage Dataform to create new tables for ALL customers from the common tables in each dataset. Things like the Account table or Transactions table will all be common. How can we loop through each dataset to combine these tables into one? Does anyone have documentation, or has anyone faced something similar? Any help is appreciated!

Thanks,

Aaron


Hi @aaron_harkins

Welcome to Google Cloud Community!

You're absolutely right: Dataform is a fantastic tool for this scenario. It provides a powerful way to define and manage your BigQuery data pipelines.

Here's how you can leverage it to create consolidated tables across all your customer datasets in BigQuery:

  • Set up a new Dataform project. This involves creating a dataform.json file, specifying your BigQuery project ID, and potentially adding other configuration options.
  • Ensure that the dataform.json file specifies the BigQuery project ID in which your customer datasets are stored.
  • Create a Dataform model for each common table (e.g., Account, Transactions). This model will hold the SQL that combines data from the different datasets.
  • Use SQL to query the individual datasets and combine the data. You'll need a loop (e.g., a FOR loop in BigQuery scripting) or a UNION ALL operation to merge the data (see the sketch after this list):

SELECT column1, column2, ...
FROM table1
UNION ALL
SELECT column1, column2, ...
FROM table2;
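
To make that last bullet concrete, here is a minimal SQLX model sketch for a consolidated Account table. This is an illustration only: the dataset names customer_1 and customer_2, the "consolidated" schema name, and the column list are placeholders to replace with your own.

config {
  type: "table",
  schema: "consolidated",  // hypothetical target dataset for the merged table
  description: "All customers' Account rows combined into one table"
}

SELECT 'customer_1' AS customer_dataset, column1, column2
FROM customer_1.Account
UNION ALL
SELECT 'customer_2' AS customer_dataset, column1, column2
FROM customer_2.Account

Carrying the source dataset name along as a customer_dataset column is a small design choice that makes it easy to tell tenants apart after the merge.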

Additionally, for large datasets, you might need to consider partitioning, clustering, or other BigQuery optimization techniques to improve query performance. For more information, you can also check the Dataform and BigQuery documentation.
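
For example, Dataform's config block can pass partitioning and clustering options straight through to BigQuery. A sketch, assuming a transaction_date column and the customer_dataset column from the model above (both placeholders):

config {
  type: "table",
  bigquery: {
    partitionBy: "DATE(transaction_date)",  // one partition per day
    clusterBy: ["customer_dataset"]         // cheaper per-tenant filtering
  }
}

SELECT 'customer_1' AS customer_dataset, t.*
FROM customer_1.Transactions AS t
UNION ALL
SELECT 'customer_2' AS customer_dataset, t.*
FROM customer_2.Transactions AS t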

I hope the above information is helpful. 

Hi @dawnberdan, sorry, I've been caught up and I missed this... OK cool, I understand the concept of UNION; however, when we have 10,000+ customers and are adding more each day, the FOR loop is what I guess I was thinking of. I don't have much experience with that, though (well, not in a while). Curious if you might have an example? Either way, thanks for getting back to me!

Hi @aaron_harkins,

You’re right! When working with an extensive customer base and needing to perform repetitive actions for each customer, a for loop can be incredibly useful. The exact implementation will differ depending on your chosen programming language.

Here’s an example:

1. If your customer data is stored in a list or dictionary, a for loop is a suitable choice:

# Assuming customer data is stored in a list of dictionaries
customers = [
    {'id': 1, 'name': 'John Doe', 'email': 'johndoe@example.com'},
    # ... other customers
]

for customer in customers:
    # Perform action on the customer
    print(f"Sending email to {customer['name']} at {customer['email']}")

2. For extremely large datasets, consider batch processing to improve performance:

# Assuming a list of customer IDs
customer_ids = [1, 2, 3, ...]

batch_size = 1000
for i in range(0, len(customer_ids), batch_size):
    batch = customer_ids[i:i+batch_size]
    # Process the batch of customers
    print(f"Processing batch {i} to {i+batch_size}")

In this example:

  • The customers list holds all of your customer records (replace the sample entry with your actual data).
  • The for loop iterates through each element; on each iteration, the customer variable holds the current customer's record (or object, depending on the implementation).
  • Within the loop, you can perform any necessary operation on each customer, such as sending emails, updating databases, or calculating statistics. Replace the print statement with your actual processing logic.
  • Regardless of the number of customers (10, 10,000, or more), the for loop will iterate through all of them, and batching keeps the processing efficient.

I hope the above information is helpful.

Hi @aaron_harkins, migrating from a multi-tenant MySQL structure to BigQuery while using Dataform is a great choice. Consolidating common tables across multiple datasets will simplify your analysis and reporting workflows. Below is a step-by-step guide to help you achieve this efficiently:

Step 1: Identify Datasets and Tables

  • List all datasets (schemas) in your BigQuery project.
  • Pinpoint the common tables you want to merge, such as Account or Transactions.
  • If certain datasets or tables differ due to add-on features, handle those cases with conditional logic (one way to find them is shown in the query right after this list).
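
Regarding that last point, BigQuery's INFORMATION_SCHEMA.TABLES view can tell you which datasets actually contain a given table, so add-on tables are only unioned where they exist. A sketch (the table name AddOnFeature is a hypothetical placeholder):

-- List every customer dataset that contains the add-on table
SELECT table_schema
FROM `your_project`.`region-us`.INFORMATION_SCHEMA.TABLES
WHERE table_name = 'AddOnFeature'
  AND table_schema LIKE 'customer_%';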

Step 2: Automate Table Consolidation Using Dataform

a. Create a Parameterized SQL Template


In Dataform, you can automate table merging by writing SQL-based transformations. Start by using BigQuery’s INFORMATION_SCHEMA to dynamically fetch datasets and tables. For example, the following query lists all datasets with a specific naming pattern:

SELECT schema_name
FROM `<your_project>`.`region-us`.INFORMATION_SCHEMA.SCHEMATA
WHERE schema_name LIKE 'customer_%';

b. Dynamically Combine Tables

To merge tables like Account across datasets, you can write a SQL script in Dataform:

SELECT *
FROM `customer_1.Account`
UNION ALL
SELECT *
FROM `customer_2.Account`
UNION ALL
SELECT *
FROM `customer_3.Account`;

You can automate this process further by building dynamic UNION queries using JavaScript or SQLX in Dataform.
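
For example, a SQLX file along these lines builds the UNION ALL from a JavaScript array. This is only a sketch: the hard-coded dataset names are placeholders, and in practice you would likely maintain or generate the list in an includes/ file.

config { type: "table" }

js {
  // Hypothetical list of tenant datasets; maintain this centrally in practice
  const datasets = ["customer_1", "customer_2", "customer_3"];
  // Build one SELECT per dataset and join them with UNION ALL
  const unionAll = datasets
    .map(ds => "SELECT * FROM " + ds + ".Account")
    .join("\nUNION ALL\n");
}

${unionAll}

The same pattern scales to thousands of datasets, since the array itself can be generated rather than typed by hand.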

Step 3: Automate Using BigQuery Scripting

If you prefer scripting, BigQuery allows you to loop through datasets programmatically and generate queries dynamically. Here’s an example script:

-- Build one UNION ALL query over every customer dataset
DECLARE query STRING DEFAULT "";

FOR record IN (
  SELECT schema_name
  FROM `your_project`.`region-us`.INFORMATION_SCHEMA.SCHEMATA
  WHERE schema_name LIKE 'customer_%'
) DO
  SET query = CONCAT(query, "SELECT * FROM `", record.schema_name, ".Account` UNION ALL ");
END FOR;

-- Remove the trailing "UNION ALL " and execute the assembled query
SET query = LEFT(query, LENGTH(query) - 10);
EXECUTE IMMEDIATE query;

You can schedule this script to run regularly in BigQuery to keep your consolidated tables updated.
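
If you also want the script to materialize the result so the consolidated table persists between runs, one option (a sketch; your_project.consolidated is a placeholder dataset) is to wrap the assembled SQL in a CREATE OR REPLACE TABLE statement before executing it:

-- Materialize the assembled UNION into a persistent table
SET query = CONCAT(
  "CREATE OR REPLACE TABLE `your_project.consolidated.Account` AS ",
  query
);
EXECUTE IMMEDIATE query;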

Step 4: Explore Third-Party Tools for Streamlined Automation

If manually handling scripting and SQL transformations feels overwhelming, tools like Windsor.ai can make the process much simpler. While Windsor.ai primarily focuses on data integration, its automation features allow you to:

  • Seamlessly merge and load data into BigQuery.
  • Automate complex workflows without requiring advanced SQL or scripting.

Final thoughts

  • Use Dataform for creating parameterized SQL templates and automating transformations.
  • Take advantage of BigQuery’s INFORMATION_SCHEMA and scripting to dynamically build and execute UNION queries.
  • If you want a more intuitive and less manual solution, consider using tools like Windsor.ai for streamlined data workflows.

Hope this can be helpful!

Thanks @Snoshone07. Curious if you've ever run into query limits while doing this? We have a ton of data and I keep getting errors... my thought process was to split up the dynamic union a certain number of times and see if that gets past it. Not sure what else to do.