BigQuery - automation

Hi,

In my project, the data lands in BigQuery via an Alteryx workflow. The BigQuery tables are then transformed using SQL, and the transformed tables are written back to BigQuery, from where they feed a dashboard.

Now I want to automate and schedule these BigQuery SQL scripts to run on the 5th day of every month. Can we do this using just BigQuery, or would we need another tool as well? Please advise.

Thanks

PS: There are 4-5 SQL scripts that run in succession, so there are dependencies between them


To automate and schedule the execution of your BigQuery SQL scripts, you have two primary options within Google Cloud: BigQuery's own scheduled queries for straightforward tasks, or an external orchestration tool for more complex workflows. Here's how each works:

Using BigQuery Scheduled Queries for Automation

Crafting Your SQL Scripts: First, verify that each of your SQL scripts runs correctly on its own and produces the expected output table, since in your pipeline one script's output is the next script's input.

Creating Scheduled Queries:

  1. Navigate to the BigQuery console.

  2. Execute the SQL query you wish to schedule to verify its correctness.

  3. Click on the 'Schedule' button, then select 'Create new scheduled query'.

  4. Configure the schedule for your query to run on the 5th of each month, specifying the destination table for your transformed data and any necessary notifications.

  5. Repeat the process for each SQL script.
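
If you prefer to create the schedules programmatically rather than through the console, the same thing can be done with the BigQuery Data Transfer Service client (scheduled queries are transfer configs under the hood). Below is a minimal sketch using the Python client; the project ID, dataset, query, and table names are placeholders, and the schedule string uses the Data Transfer Service's custom schedule syntax ("5 of month 06:00" fires at 06:00 UTC on the 5th of every month):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
project_id = "my-project"  # placeholder

# A scheduled query is a transfer config with the
# "scheduled_query" data source.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="Monthly transformation - step 1",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT * FROM my_dataset.raw_table",  # your transformation SQL
        "destination_table_name_template": "transformed_table",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="5 of month 06:00",  # 5th day of every month, 06:00 UTC
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")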

Managing Dependencies Between Scripts

Option 1: Sequential Scheduling

For linear dependencies, where the output of one script is the direct input to the next, you can stagger your scheduled queries: time each subsequent query to start after the expected completion of the previous one (for example, schedule the first query for 06:00 on the 5th and the second for 07:00 if the first reliably finishes within the hour). This method is fragile, though: a query that runs long or fails does not stop the later ones, so there is no real failure handling.

Option 2: Using External Orchestration Tools

For more complex dependencies or when you require advanced features like error handling, conditional logic, or integration with external services, consider using an orchestration tool:

  • Cloud Composer: A powerful, managed service based on Apache Airflow, ideal for complex workflow orchestration within Google Cloud. It allows for detailed dependency management, error handling, and scheduling.

  • Other Orchestration Tools: Tools like Prefect or Luigi can also orchestrate BigQuery jobs, offering various features and integrations.

Additional Considerations

  • BigQuery Scripting: For simpler dependencies, consider using BigQuery's multi-statement scripting to combine several SQL statements into a single scheduled query; the statements run in order, and the script stops if one fails (see the sketch after this list). This can remove the need for external tools for some workflows.

  • Cloud Workflows and Cloud Scheduler: For Google Cloud-centric workflows, Cloud Workflows can orchestrate tasks across services, including BigQuery, with Cloud Scheduler triggering these workflows based on your specified schedule.
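
As a minimal sketch of the scripting approach, a single scheduled query could contain a multi-statement script like the one below; all dataset, table, and column names are made up for illustration:

-- Step 1: build the staging table (hypothetical names throughout).
CREATE OR REPLACE TABLE my_dataset.stg_monthly_sales AS
SELECT customer_id, SUM(amount) AS total_amount
FROM my_dataset.raw_sales
WHERE sale_date < DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY customer_id;

-- Step 2: runs only after step 1 succeeds, because statements in a
-- multi-statement query execute in order and the script stops on error.
CREATE OR REPLACE TABLE my_dataset.rpt_monthly_sales AS
SELECT s.customer_id, s.total_amount, c.region
FROM my_dataset.stg_monthly_sales AS s
JOIN my_dataset.dim_customers AS c USING (customer_id);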

When deciding on the best approach to automate and schedule BigQuery SQL scripts, consider the complexity of your workflow, the need for robust error handling, and whether your process extends beyond BigQuery. For straightforward, linear workflows, BigQuery's scheduled queries may suffice. However, for more complex scenarios or when integrating with other cloud services, external orchestration tools like Cloud Composer or a combination of Cloud Workflows and Cloud Scheduler offer more flexibility and control.
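
For the Cloud Workflows route, here is a rough sketch of the idea (the project, dataset, and query strings are assumptions, not working values). Each step calls the BigQuery connector, which waits for the job to finish before the next step starts, and a Cloud Scheduler job with the cron 0 6 5 * * would trigger the workflow at 06:00 on the 5th of every month:

# Hypothetical workflow definition: two dependent BigQuery jobs, run in order.
main:
  steps:
    - run_script_1:
        call: googleapis.bigquery.v2.jobs.insert
        args:
          projectId: my-project  # assumed project ID
          body:
            configuration:
              query:
                query: "CREATE OR REPLACE TABLE my_dataset.stg_monthly_sales AS SELECT * FROM my_dataset.raw_sales"
                useLegacySql: false
    - run_script_2:  # starts only after the first job completes
        call: googleapis.bigquery.v2.jobs.insert
        args:
          projectId: my-project
          body:
            configuration:
              query:
                query: "CREATE OR REPLACE TABLE my_dataset.rpt_monthly_sales AS SELECT * FROM my_dataset.stg_monthly_sales"
                useLegacySql: false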

Hi, thanks for the info, just a follow up question:

If I am going with the cloud composer for scheduling, would it be possible to integrate files from GitHub repo?

The SQL scripts that I have are in a GitHub repo, and enhancements will be made to those scripts from time to time, so it would be ideal if we could just push changes to the GitHub repo and have Airflow pull the updated/latest SQL scripts from Git and execute them.

Thanks
 

Cloud Composer (Apache Airflow) works well with GitHub for version control, though the integration is indirect: Composer reads DAGs from a Cloud Storage bucket, and you keep that bucket in sync with your repo. Here's how you can achieve this:

1. Structure Your GitHub Repository

  • DAGs Folder: It's recommended to have a dedicated dags folder to store your Airflow DAG definition files (.py files).
  • SQL Scripts: Store SQL scripts in a separate folder for organization. Here's an example structure:
├── dags/  
│   ├── dag1.py 
│   ├── dag2.py 
├── sql/ 
│   ├── transformations/ 
│   │   ├── script1.sql 
│   │   ├── script2.sql
...

2. Set up Cloud Composer Environment

  • Sync from Git: Cloud Composer itself reads DAGs from the environment's Cloud Storage bucket rather than directly from Git, so the usual pattern is a small CI/CD step (for example, GitHub Actions or Cloud Build) that copies the repo contents into that bucket on every push (see the sketch after this list).
  • What to Sync: Copy both the dags/ and sql/ folders; anything placed in the bucket is visible to the Airflow workers.
  • Sync Trigger: Run the copy on every merge to your main branch so the environment always has the latest scripts.
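
As a minimal sketch of such a sync step (the environment name, location, and folder layout are assumptions), a CI job could run:

# Resolve the environment's DAGs prefix (gs://<bucket>/dags), then mirror
# the repo's dags/ and sql/ folders into it.
DAGS_PREFIX=$(gcloud composer environments describe my-composer-env \
    --location us-central1 --format='value(config.dagGcsPrefix)')

gsutil -m rsync -r -d dags "$DAGS_PREFIX"
gsutil -m rsync -r -d sql "$DAGS_PREFIX/sql"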

3. Create Airflow DAGs

  • DAG Definition: Your DAG code orchestrates the SQL script execution. Because Composer mounts the DAGs bucket onto the workers at /home/airflow/gcs/dags, SQL files synced alongside your DAGs can be referenced as ordinary template files.
  • Operators:
    • There's no GitSyncOperator; the Git-to-bucket sync happens outside Airflow, in your CI/CD step.
    • Use BigQueryExecuteQueryOperator to run the SQL scripts (newer versions of the Google provider recommend BigQueryInsertJobOperator instead, but the older operator still works). Point its sql parameter at a .sql file and let the DAG's template_searchpath resolve it.
  • Revised Example DAG Structure:
 
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
}

with DAG(
    'example_dag',
    default_args=default_args,
    schedule_interval='0 6 5 * *',  # 06:00 on the 5th of every month
    catchup=False,
    # Composer mounts the DAGs bucket here, so the sql/ folder synced
    # from GitHub is visible to the workers as ordinary files.
    template_searchpath=['/home/airflow/gcs/dags/sql'],
) as dag:

    execute_sql_script_1 = BigQueryExecuteQueryOperator(
        task_id='execute_sql_script_1',
        sql='transformations/script1.sql',  # resolved via template_searchpath
        use_legacy_sql=False,
    )

    execute_sql_script_2 = BigQueryExecuteQueryOperator(
        task_id='execute_sql_script_2',
        sql='transformations/script2.sql',
        use_legacy_sql=False,
    )

    # script2 runs only after script1 succeeds
    execute_sql_script_1 >> execute_sql_script_2

Benefits of Using Cloud Composer with GitHub

  • Version Control: Track changes and roll back if needed.
  • Seamless Updates: Sync new DAGs and changes automatically.
  • Collaboration: Enables teamwork on your data pipelines.
  • Workflow Management: Streamlines development and deployment.

Important Note: The GitHub workflow manages your DAG definitions and SQL files alike; Composer always reads them from the environment's bucket, so the CI/CD sync (not Airflow itself) is what keeps that bucket in step with your repo. For small snippets you can also embed the SQL directly in your DAGs.