Get hands-on experience with 20+ free Google Cloud products and $300 in free credit for new customers.

Dataform backfill for daily and historic snapshot tables

Hi there,

We are currently evaluating how to migrate our analytics setup from Scheduled Queries to a Dataform repository. At the moment, we have an SQL workflow that runs as a scheduled query to produce daily snapshots, which are then stored in a historical snapshots table containing data for a specified date range. Additionally, we have a separate query to handle backfilling.

Is it possible to implement a similar design in Dataform? Specifically, how would we handle backfilling the historical snapshot table? I came across the example in the documentation for backfilling an incremental table with JavaScript, but the example provided only covers a simple SQL statement rather than executing an entire SQL workflow over a given time frame.

Would you be able to provide guidance on the best approach for this in Dataform?

Thank you for your help!

0 3 209
3 REPLIES 3

In your original SQL workflow, what was your query for handling backfills?

Hi @mrni11,

Welcome to Google Cloud Community!

Dataform can support this setup by using SQL for transformations and JavaScript for managing the backfilling process. The core idea is to create dynamic SQL queries for each date in the backfill range and use Dataform’s operations to execute them sequentially. You can define SQL workflows in SQLX files, with tables, views, or incremental tables executing SQL queries. JavaScript in operations allows for handling more complex tasks like scheduling and backfilling.

You can implement your design in Dataform as follows:

  1. Daily Snapshot Table: In Dataform, you can define a table that represents the historical snapshot. This will be generated using SQL code similar to your existing scheduled query. This can either be a full refresh or an incremental load depending on your needs.
  2. Backfilling the Historical Snapshot: Since Dataform supports JavaScript in the operations section, you can write a script that loops through a date range and runs the transformation logic for each day. The primary challenge here is backfilling the historical snapshot table for a particular date range.

Backfilling in Dataform is typically managed by following these steps:

  1. Use JavaScript to Create a Loop for Backfilling: You can define a JavaScript operation to trigger your SQL transformation logic over a specific date range (like your backfilling process).
  2. Integrating with Existing SQL Workflow: You can reuse the SQL logic from your daily snapshots table for backfilling by adding a date filter, ensuring consistent transformation across both daily snapshots and backfilled data.
  3. Incremental Tables: If you're using incremental tables (as opposed to full refresh), you can define an incremental load model in Dataform. With incremental models, you can set up logic to only process new or modified data since the last run. If you want to backfill, you can simply set up a custom operation that populates the table for the missing date range.
  4. Scheduling the Backfill Process: You can schedule this backfilling operation to run through your Dataform CLI or Dataform Web UI, depending on your setup. If you're using the Dataform CLI, you can trigger the operations via a cron job or any scheduling system that works with your existing pipeline.

Additionally, I came across a thread where someone asked about backfilling tables via Dataform, and it was answered by one of our expert staff. Take a look at it, as I believe it will be very helpful for your migration process.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Hi, 

Thank you for your reply, it was helpful. So far, I understand the process, I also came across the thread you linked to. 

But I can't find a solution for the main question, is it possible to trigger a whole SQL workflow from the JavaScript backfill script? That is, in the example script, while looping through the dates, it is only possible to execute SQL operations defined in that very script.
which would not help with the outlined design. 

Thank you!