Hi All,
I am working on a POC to determine how useful Dataform can be in our ETL architecture. Our requirement is to fetch data for different clients from Google Cloud Storage (every client will have a different bucket) and create an external table in BigQuery for each client under that client's dataset.
Can someone please help with setting up this initial part of the project using Dataform?
Also, how can we structure this for multiple clients? Potentially 50+ clients?
To use Dataform for creating external tables in BigQuery from data stored in GCS for multiple clients, you can set up a parameterized Dataform project. Here's a possible solution that allows you to scale it for multiple clients:
1. Project Setup
Install Dataform CLI:
npm install -g @dataform/cli
Initialize Dataform:
dataform init your_project_name
cd your_project_name
Configure Project (dataform.json; newer Dataform core versions use workflow_settings.yaml instead):
{
  "defaultSchema": "default_dataset",
  "warehouse": "bigquery",
  "assertionSchema": "assertions",
  "defaultDatabase": "your_project_id"
}
2. Define Client Details
Create clients_config.json at the project root (add more clients to the array as needed):
{
  "clients": [
    { "name": "client_a", "bucket": "client_a_bucket", "dataset": "client_a_dataset" },
    { "name": "client_b", "bucket": "client_b_bucket", "dataset": "client_b_dataset" }
  ]
}
Load Configuration (for example in includes/clients.js):
// includes/clients.js: read the client config at compile time
const fs = require('fs');
const rawData = fs.readFileSync('clients_config.json');
const clients = JSON.parse(rawData).clients;
module.exports = { clients };
3. Create Dataform Scripts
Create External Table Script (definitions/create_external_tables.js). Because this single file generates a separate action per client, it is a JavaScript file rather than SQLX:
const { clients } = require("includes/clients");
clients.forEach(client => { /* ... (see full script below) */ });
Compile & Run:
dataform compile
dataform run
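If you tag each generated operation with its client name (as the full script below does), you can also run a single client at a time, which helps once you are dealing with 50+ clients (flag syntax may vary slightly between CLI versions):
dataform run --tags client_a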
4. Scaling for Multiple Clients
Dynamic Client Addition: Add new clients to clients_config.json and re-run Dataform.
Handle Different File Types: either add a "format" field to each client entry in clients_config.json, or keep a single constant in the script:
// In definitions/create_external_tables.js
const fileFormat = "CSV"; // Or dynamically determine, e.g. from client.format
clients.forEach(client => {
  // ... (rest of the script remains the same)
});
Monitoring: Use Dataform logs and UI for execution tracking.
Full script (definitions/create_external_tables.js):
// Generates one "create external table" operation per client.
const { clients } = require("includes/clients");

const fileFormat = "CSV"; // Set based on client or file extension

clients.forEach(client => {
  const bucket = client.bucket;
  const dataset = client.dataset;
  const tableName = `${client.name}_external_table`;

  operate(`create_${tableName}`)
    .queries(`
      CREATE OR REPLACE EXTERNAL TABLE \`${dataset}.${tableName}\`
      OPTIONS (
        format = "${fileFormat}",
        uris = ["gs://${bucket}/data/*.${fileFormat.toLowerCase()}"],
        skip_leading_rows = 1
      )
    `)
    .tags([client.name]);
});
Thank you for sharing the steps. This was very helpful. I actually created a table in BigQuery itself to store the config-related information and read the details needed to create the external tables.
What I am currently struggling with is integrating GCP Cloud Logging into my Dataform project. I am not able to find anything helpful in the documentation.
I want to capture info and error messages in Cloud Logging. How can I integrate that with my Dataform project?
One practical approach to logging in Dataform is leveraging BigQuery itself. By creating dedicated BigQuery tables such as dataform_logs and dataform_errors, you can store log messages, timestamps, error details, and relevant context. This method allows centralized logging, facilitating easy analysis and reporting within the familiar BigQuery environment.
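As a minimal sketch, the two tables could be created like this (the ops_logging dataset name is an assumption; the columns match the INSERT statements in the example below):
-- Sketch only: replace `ops_logging` with whichever dataset you keep operational data in.
CREATE TABLE IF NOT EXISTS `ops_logging.dataform_logs` (
  message STRING,
  timestamp TIMESTAMP
)
PARTITION BY DATE(timestamp); -- daily partitions keep the log table cheap to query and easy to expire

CREATE TABLE IF NOT EXISTS `ops_logging.dataform_errors` (
  error_message STRING,
  stack_trace STRING,
  timestamp TIMESTAMP
)
PARTITION BY DATE(timestamp);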
To implement this, you can incorporate INSERT statements into your Dataform SQLX scripts to populate these log tables after each operation or when errors occur. Additionally, Dataform assertions can be used to check data quality and log any violations to your designated table. An example SQLX action might look like this:
config { type: "operations" }

BEGIN
  -- Your Dataform SQL operation
  CREATE OR REPLACE EXTERNAL TABLE ...
  -- Log success message (optional)
  INSERT INTO `dataform_logs` (message, timestamp)
  VALUES ("External table created successfully", CURRENT_TIMESTAMP());
EXCEPTION WHEN ERROR THEN
  -- Log error message and details
  INSERT INTO `dataform_errors` (error_message, stack_trace, timestamp)
  VALUES (@@error.message, @@error.formatted_stack_trace, CURRENT_TIMESTAMP());
END;
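For the assertions mentioned above, a minimal sketch of a data-quality check might look like this (the table and column names are placeholders; an assertion fails, and is reported in the run results, whenever its query returns rows):
config { type: "assertion" }

-- Rows returned here are treated as violations.
SELECT *
FROM `client_a_dataset.client_a_external_table`
WHERE client_id IS NULL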
This approach offers several benefits. Firstly, it centralizes logging, allowing all logs and errors to be stored in BigQuery for easy access and analysis. Secondly, it is customizable; you can structure your log tables and messages to suit your specific needs. Lastly, it integrates seamlessly with the BigQuery ecosystem, enabling you to leverage BigQuery's powerful querying, analysis, and visualization capabilities.
For a more advanced logging solution, you can integrate with Cloud Logging through custom scripting. This involves writing a post-processing script (using Python or Bash, for instance) that runs after your Dataform execution. The script would query your BigQuery log tables (dataform_logs and dataform_errors) and push the log data into Cloud Logging using the Cloud Logging API or the gcloud logging command.
This script could be run manually, scheduled to run at specific intervals, or triggered based on Dataform run completion events. This approach offers full Cloud Logging integration, allowing you to take advantage of Cloud Logging's advanced features like log-based metrics, alerting, and integration with other Google Cloud services. It also provides a unified view, consolidating Dataform logs with other logs from your Google Cloud environment, which can be beneficial for comprehensive monitoring and troubleshooting.
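As a rough Bash sketch of that post-processing step (the dataset, table, and log names are assumptions, and a real script would also need to track which rows it has already forwarded):
#!/usr/bin/env bash
# Requires the bq, jq, and gcloud CLIs.
# Pull recent error rows from the BigQuery log table and forward each one to Cloud Logging.
bq --format=json query --nouse_legacy_sql \
  'SELECT error_message, timestamp
   FROM `ops_logging.dataform_errors`
   WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)' \
| jq -c '.[]' \
| while read -r row; do
    gcloud logging write dataform_errors "$row" --payload-type=json --severity=ERROR
  done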
Additional Logging Best Practices
When implementing logging in Dataform, several best practices can enhance the effectiveness and efficiency of your logging strategy:
- Use structured log rows: consistent columns such as timestamp, severity, action name, and a run identifier make logs much easier to query and correlate.
- Partition log tables by date and set an expiration so logging never becomes a storage or cost problem.
- Avoid writing sensitive data (credentials, PII) into log messages.
- Use assertions for data-quality checks so failures surface in the run results as well as in your log tables.
While Dataform does not natively support Cloud Logging, these strategies can help you achieve robust logging and error reporting. By leveraging BigQuery for centralized logging, integrating with Cloud Logging through custom scripts, and following best practices for structured logging, you can create a scalable and maintainable logging solution for your Dataform projects. These approaches ensure that you can monitor, analyze, and respond to your ETL processes effectively, enhancing the overall reliability and performance of your data workflows.
Thank you for sharing the info. I will try implementing logs in BigQuery itself.
I have another issue: I am trying to access a table from a Dataform project whose defaultLocation is set to us-east1, but the table is in the europe-west1 region.
For example, below is a helper function I have created in my Dataform repo.
function get_bucket_name(client_code) {
  return `'SELECT DISTINCT bucket_name FROM db_tm_configs.tm_client_config_latest where client_code = "${dataform.projectConfig.vars.code}"'`
}

module.exports = {
  get_bucket_name
}
The table db_tm_configs.tm_client_config_latest is actually in the europe-west1 region, but my workflow_settings.yaml looks like the below, with defaultLocation set to us-east1.
defaultProject: project_id
defaultLocation: us-east1
defaultDataset: dataform_tm_poc
defaultAssertionDataset: dataform_assertions
dataformCoreVersion: 3.0.0-beta.4
vars:
  code: A
The issue is that when I call my helper function in another sqlx file, it throws an error because db_tm_configs.tm_client_config_latest is not present in the us-east1 region.
I know I have the option to override the defaultLocation in my workflow_settings.yaml file, but I do not want to do that. I always want my helper function to run that query in the "europe-west1" region, irrespective of what defaultLocation is set in workflow_settings.yaml.
I have tried multiple ways and had no luck. Is there a way I can achieve this without overriding the defaultLocation in the project settings?