Hello everyone,
My team is grappling with the challenge of scaling our Dataform code both robustly and as automatically as possible. We manage a large number of tables (over 300) that fall into common categories, and we apply nearly identical configurations to each of them.
Given that schemas may change over time and other metadata like descriptions, primary keys, assertions, clustering columns, and partition columns can evolve, our goal is to automate our Dataform pipeline to the greatest extent possible.
Initially, we considered reading from external configuration files and dynamically generating transformations based on those configurations. However, we've found that Dataform restricts JavaScript usage quite heavily. For instance, it doesn't permit reading from GCS, BigQuery, or any external source. Surprisingly, it can't even read a .json file located within the repository. Additionally, it doesn't support JavaScript classes, which would let us model our requirements in a scalable and contemporary way.
Given these constraints, I'm reaching out to ask about best practices for addressing this. Should we build a bespoke pipeline that uses templating and Python to generate the necessary .js and .sqlx files? Or should we consider transitioning to a different tool, perhaps DBT or something similar, to achieve our objectives?
Thank you in advance for any guidance and insights you can offer. 🙂
It sounds like you're facing a common challenge with Dataform: the limitations of its JavaScript usage. While Dataform is a powerful tool, it's not always flexible enough to meet the needs of complex projects.
In your case, you need to be able to read from external configuration files and dynamically generate transformations based on those configurations. This is something that Dataform doesn't currently support. Additionally, the lack of support for JavaScript classes makes it difficult to model your requirements in a scalable and contemporary manner.
There are a few options you can consider to address this issue. One option is to create a bespoke pipeline that uses templating and Python to produce the necessary .js and .sqlx files. This would give you more flexibility in terms of how you read from external sources and generate transformations. However, it would also require you to develop and maintain your own pipeline, which can be time-consuming and error-prone.
Another option is to transition to a different tool that offers more flexibility. DBT is a popular alternative to Dataform that supports reading from external sources and generating transformations dynamically using SQL and Jinja, a Python-based templating engine. However, transitioning to DBT would require you to learn a new system and potentially refactor your existing code.
Ultimately, the best option for you will depend on your specific needs and requirements. If you need a lot of flexibility and are willing to invest in developing and maintaining your own pipeline, then creating a bespoke solution may be the best option. If you're looking for a more turnkey solution that's easier to learn and use, then transitioning to DBT may be a better choice.
Thank you very much for your reply, @ms4446.
The inability to use JS classes in Dataform or to download "data" from external APIs is indeed a challenge for the data pipeline approach we're proposing within BigQuery. Is there any chance these features are on the Dataform roadmap? Or is Dataform simply not designed for complex and dynamic pipelines? Would you recommend dbt core to overcome these obstacles, or would that tool (sorry, I don't have experience with it) not offer a solution either?
Best,
Hi there,
That's correct: you can't make network requests (and therefore can't use any APIs) during Dataform compilation, and this is unlikely to ever change.
You can, however, use JSON: our compilation process supports require-ing JSON files, e.g.:
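A minimal sketch of how that might look (the file name, require path, and config fields below are assumptions for illustration):

```js
// definitions/generated_tables.js
// Assumes a JSON config checked into the repo, e.g. includes/tables.json:
// [{"name": "orders", "source": "raw_orders", "partitionBy": "created_at", "description": "Orders"}]
const tables = require("includes/tables.json");

// One publish() call per entry in the config, so 300+ tables can be driven
// from a single committed file.
tables.forEach((t) => {
  publish(t.name, {
    type: "table",
    description: t.description,
    bigquery: { partitionBy: t.partitionBy },
  }).query((ctx) => `SELECT * FROM ${ctx.ref(t.source)}`);
});
```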
Thank you for your swift response, @lewish. I find the ability to load JSON using require extremely useful.
Just to clarify, am I correct in understanding that integrating Dataform with other Google services like GCS or external APIs (to load configuration files for pipeline generation, for example) is not on the roadmap? Is this possible with other tools like DBT?
Best regards,
Correct, it is not on the roadmap.
To explain our philosophy here: we want the compilation process (turning your code into a complete pipeline representation) to be fast, reproducible, secure, and able to scale to thousands of models. Querying external systems for data during compilation can hurt all of these things, and we would encourage you to avoid it if possible, especially as your project grows!
While it's a bit more involved, automating systems to commit or update data files (JSON) as part of your repository is our recommended way to do this, as it can happen asynchronously and keep compilation fast.
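For instance, one way to automate this is a small Node script run by CI or a scheduler outside of Dataform (a rough sketch; the dataset name, output path, and exported fields are all assumptions):

```js
// scripts/export_table_config.js
// Dumps BigQuery table metadata into a JSON file that CI then commits to the
// Dataform repository, so compilation itself never has to query anything.
const fs = require("fs");
const { BigQuery } = require("@google-cloud/bigquery");

async function main() {
  const bq = new BigQuery();
  // "raw_dataset" is a placeholder for the dataset whose tables drive the pipeline.
  const [tables] = await bq.dataset("raw_dataset").getTables();

  const config = [];
  for (const table of tables) {
    const [metadata] = await table.getMetadata();
    config.push({
      name: table.id,
      description: metadata.description || "",
      partitionBy: metadata.timePartitioning ? metadata.timePartitioning.field : null,
      columns: ((metadata.schema && metadata.schema.fields) || []).map((f) => f.name),
    });
  }

  // The file is committed back to the repo by the CI job and then consumed at
  // compile time via require("includes/tables.json").
  fs.writeFileSync("includes/tables.json", JSON.stringify(config, null, 2));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```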
You can also leverage BigQuery Remote Functions, which let you write any code in Cloud Functions to transform your data:
https://cloud.google.com/bigquery/docs/remote-functions
That's as flexible as it gets!
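If you go that route, the Cloud Function side of a remote function could look roughly like this (the function name and the transformation are assumptions; BigQuery sends a JSON body with a calls array and expects a replies array back, per the docs linked above):

```js
// index.js
// A Cloud Function implementing the BigQuery remote function contract:
// BigQuery POSTs {"calls": [[arg1, ...], ...]} and expects {"replies": [...]}.
const functions = require("@google-cloud/functions-framework");

functions.http("transformValue", (req, res) => {
  try {
    const replies = (req.body.calls || []).map(([value]) => {
      // Placeholder transformation: upper-case the incoming string value.
      return String(value).toUpperCase();
    });
    res.json({ replies });
  } catch (err) {
    // Returning errorMessage signals a failed invocation back to BigQuery.
    res.status(400).json({ errorMessage: err.message });
  }
});
```

You would then register it in BigQuery with CREATE FUNCTION ... REMOTE WITH CONNECTION pointing at the function's endpoint, as described in the linked documentation.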