Hi All,
I am working on a PoC to determine how useful Dataform can be in our ETL architecture. Our requirement is to fetch data for each client from Google Cloud Storage (every client will have a different bucket) and create an external table in BigQuery for each client under that client's dataset.
Can someone please help with setting up this initial part of the project using Dataform?
Also, how can we structure this for multiple clients, potentially 50+?
One practical approach to logging in Dataform is leveraging BigQuery itself. By creating dedicated BigQuery tables, such as dataform_logs and dataform_errors, you can store log messages, timestamps, error details, and relevant context. This method centralizes logging, facilitating easy analysis and reporting within the familiar BigQuery environment.
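As a minimal sketch, the two tables could be created with DDL like the following. The project my_project and dataset ops are placeholders, and the columns simply mirror the values inserted in the example further down:
-- Success/info messages written after each operation
CREATE TABLE IF NOT EXISTS `my_project.ops.dataform_logs` (
  message STRING,
  timestamp TIMESTAMP
);

-- Error details captured when an operation fails
CREATE TABLE IF NOT EXISTS `my_project.ops.dataform_errors` (
  error_message STRING,
  stack_trace STRING,
  timestamp TIMESTAMP
);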
To implement this, you can incorporate INSERT statements into your Dataform SQLX scripts to populate these log tables after each operation or when errors occur. Additionally, Dataform assertions can be used to check data quality and log any violations to your designated table. The SQL for such an action might look like this (a full SQLX wrapper is sketched after the example):
BEGIN
  -- Your Dataform SQL operation
  CREATE OR REPLACE EXTERNAL TABLE ...

  -- Log success message (optional)
  INSERT INTO `dataform_logs` (message, timestamp)
  VALUES ("External table created successfully", CURRENT_TIMESTAMP());
EXCEPTION WHEN ERROR THEN
  -- Log error message and details using BigQuery scripting's @@error system variables
  INSERT INTO `dataform_errors` (error_message, stack_trace, timestamp)
  VALUES (@@error.message, @@error.formatted_stack_trace, CURRENT_TIMESTAMP());
END;
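In a Dataform repository, a block like this would typically live in a SQLX file declared as an operations action so it runs as part of the normal execution. A minimal sketch, assuming the placeholder table names from above, a hypothetical file such as definitions/log_demo.sqlx, and that your Dataform version passes the script through to BigQuery unchanged:
config {
  type: "operations",
  description: "Creates an external table and records the outcome in the log tables"
}

BEGIN
  -- Your Dataform SQL operation
  CREATE OR REPLACE EXTERNAL TABLE ...

  -- Record success
  INSERT INTO `my_project.ops.dataform_logs` (message, timestamp)
  VALUES ("External table created successfully", CURRENT_TIMESTAMP());
EXCEPTION WHEN ERROR THEN
  -- Record the failure with the error details BigQuery scripting exposes
  INSERT INTO `my_project.ops.dataform_errors` (error_message, stack_trace, timestamp)
  VALUES (@@error.message, @@error.formatted_stack_trace, CURRENT_TIMESTAMP());
END;
Using fully qualified table names here avoids any ambiguity about which dataset the log tables live in when the action runs under different default datasets.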
This approach offers several benefits. Firstly, it centralizes logging, allowing all logs and errors to be stored in BigQuery for easy access and analysis. Secondly, it is customizable; you can structure your log tables and messages to suit your specific needs. Lastly, it integrates seamlessly with the BigQuery ecosystem, enabling you to leverage BigQuery's powerful querying, analysis, and visualization capabilities.
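For instance, once the error table exists, an ordinary query can summarize failures per day (table and column names follow the earlier placeholder sketch):
SELECT
  DATE(timestamp) AS run_date,
  COUNT(*) AS error_count
FROM `my_project.ops.dataform_errors`
GROUP BY run_date
ORDER BY run_date DESC;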
For a more advanced logging solution, you can integrate with Cloud Logging through custom scripting. This involves writing a post-processing script (using Python or Bash, for instance) that runs after your Dataform execution. The script would query your BigQuery log tables (dataform_logs and dataform_errors) and push the log data into Cloud Logging using the Cloud Logging API or the gcloud logging command.
This script could be run manually, scheduled to run at specific intervals, or triggered based on Dataform run completion events. This approach offers full Cloud Logging integration, allowing you to take advantage of Cloud Logging's advanced features like log-based metrics, alerting, and integration with other Google Cloud services. It also provides a unified view, consolidating Dataform logs with other logs from your Google Cloud environment, which can be beneficial for comprehensive monitoring and troubleshooting.
Additional Logging Best Practices
When implementing logging in Dataform, a few practices make your strategy more effective: keep log messages structured and consistent, and always record timestamps, error details, and enough surrounding context to trace an issue back to the operation that produced it.
While Dataform does not natively support Cloud Logging, these strategies can help you achieve robust logging and error reporting. By leveraging BigQuery for centralized logging, integrating with Cloud Logging through custom scripts, and following best practices for structured logging, you can create a scalable and maintainable logging solution for your Dataform projects. These approaches ensure that you can monitor, analyze, and respond to your ETL processes effectively, enhancing the overall reliability and performance of your data workflows.