Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Dataform Development

Hi All,

I am trying to set up my teams' infrastructure in BQ and currently, we are looking to have 2 projects 1 for prod and 1 for dev. and in each of the projects we hope to set up our BQ with a medallion structure. we hope to have the prod to be a copy of the dev at least the source tables are expected always to be updated simultaneously using airflow.

Since we use dataform for our transformation, how can I set up dataform to work effectively with both environments? Currently, I am thinking of setting the default schema in dataform to the development schema and using workspace compilation override to ensure that everyone can track what they are working on and then use the workflow configuration to run models that have tags (say daily, weekly, monthly) in them to run in the production environment and another to run in the development environment, so that this way, I promote only the final tables from the transformation development. I have 2 challenges with this approach

1. I might end up having too many redundant tables and might need to schedule a monthly cleanup of personal development schemas
2. My Boss wants even more as he wants the dev transformation layer to be a copy of the dev prod layer.

My major question is what is the best practice around this and how can I fulfil as many requirements as I have?

Thank you @ms4446 for always responding to my questions.

Solved Solved
8 1 691
1 ACCEPTED SOLUTION

Hi @francisatoyebi ,

Your approach of separate development and production projects, along with a medallion architecture, aligns perfectly with industry best practices. Utilizing Dataform's capabilities for environment-specific configurations is essential. By managing different schemas for development and production through variable settings, you ensure that dataset names are tailored per environment, minimizing risks of unintended production deployments.

Your idea to use workspace compilation overrides is excellent. This approach will maintain clarity within your team, enabling each member to effectively track their own contributions.

Implementing workflow configurations to selectively run models based on tags (e.g., daily, weekly, monthly) is a prudent strategy. It not only fosters automation but also helps in maintaining a clear separation between your environments.

Addressing Your Challenges:

  • Managing Redundant Tables:

    • Prefixing & Temporary Tables: Implementing a clear prefixing system for development tables and using temporary tables can significantly reduce clutter and optimize resource usage.
    • Automated Cleanup & Table Expiration: Setting up regularly scheduled scripts or configuring expiration times for development tables will help maintain an efficient and clean environment.
  • Mirroring Development and Production Transformations:

    • Risk Communication: It is crucial to discuss the complexities and potential performance impacts of mirroring production transformations with your stakeholders. This ensures that everyone is making informed decisions.
    • Hybrid Approach: A balanced strategy of mirroring essential transformations while simplifying others can maintain crucial accuracy while fostering development agility. Adjusting transformation logic or using scaled-down datasets can help avoid performance discrepancies if development data volumes are significantly lower than production.

Additional Recommendations:

  • Gradual Rollouts & CI/CD Integration: Strategies like canary deployments and integrating CI/CD pipelines will further protect your production data, allowing for controlled testing of new changes before a full-scale rollout.

  • Documentation & Training: Utilizing documentation tools like Sphinx or MkDocs to create comprehensive, accessible guides and regular training on BigQuery and Dataform will enhance your team's effectiveness.

  • Version Control & Robust Testing: Ensuring a robust version control system such as Git is in place, along with comprehensive testing frameworks, is critical. These systems are indispensable for managing code changes, tracking history, and ensuring the reliability and quality of both environments.

View solution in original post

1 REPLY 1

Hi @francisatoyebi ,

Your approach of separate development and production projects, along with a medallion architecture, aligns perfectly with industry best practices. Utilizing Dataform's capabilities for environment-specific configurations is essential. By managing different schemas for development and production through variable settings, you ensure that dataset names are tailored per environment, minimizing risks of unintended production deployments.

Your idea to use workspace compilation overrides is excellent. This approach will maintain clarity within your team, enabling each member to effectively track their own contributions.

Implementing workflow configurations to selectively run models based on tags (e.g., daily, weekly, monthly) is a prudent strategy. It not only fosters automation but also helps in maintaining a clear separation between your environments.

Addressing Your Challenges:

  • Managing Redundant Tables:

    • Prefixing & Temporary Tables: Implementing a clear prefixing system for development tables and using temporary tables can significantly reduce clutter and optimize resource usage.
    • Automated Cleanup & Table Expiration: Setting up regularly scheduled scripts or configuring expiration times for development tables will help maintain an efficient and clean environment.
  • Mirroring Development and Production Transformations:

    • Risk Communication: It is crucial to discuss the complexities and potential performance impacts of mirroring production transformations with your stakeholders. This ensures that everyone is making informed decisions.
    • Hybrid Approach: A balanced strategy of mirroring essential transformations while simplifying others can maintain crucial accuracy while fostering development agility. Adjusting transformation logic or using scaled-down datasets can help avoid performance discrepancies if development data volumes are significantly lower than production.

Additional Recommendations:

  • Gradual Rollouts & CI/CD Integration: Strategies like canary deployments and integrating CI/CD pipelines will further protect your production data, allowing for controlled testing of new changes before a full-scale rollout.

  • Documentation & Training: Utilizing documentation tools like Sphinx or MkDocs to create comprehensive, accessible guides and regular training on BigQuery and Dataform will enhance your team's effectiveness.

  • Version Control & Robust Testing: Ensuring a robust version control system such as Git is in place, along with comprehensive testing frameworks, is critical. These systems are indispensable for managing code changes, tracking history, and ensuring the reliability and quality of both environments.