Solved: Backfilling tables with via Dataform

jack_bennett · 06-20-2023 06:47 AM

Hi all,

We're just starting the process of transitioning from scheduled queries to Dataform pipelines.

One of the really useful parts of scheduled queries is backfilling via the UI (and being able to use @run_time in the code to do this quickly).

I came across the 'Backfilling a daily table with JavaScript' here, but I'm unable to get this working and wondered how others are overcoming this challenge? https://cloud.google.com/dataform/docs/reference/sample-scripts

Cheers

ms4446

Here are a few general troubleshooting steps that you might find helpful:

Check for Syntax Errors: Ensure that the JavaScript code has been copied correctly and that there are no syntax errors. Also, make sure the SQL queries inside the operate functions are correct and compatible with your database.
Verify the Date Format: The getDateArray function expects dates in the "YYYY-MM-DD" format. Make sure the dates you're inputting are in this format.
Check the Table Structure: The operate function is used to create a table and then insert data into it. Ensure that the structure of the table (defined in the create table if not exists backfill_table ('fields') part) is correct and suitable for your data. You will need to replace 'fields' with the actual fields of your table.
Validate the Data Insertion Query: The data insertion operation is performed for each date in the array. Make sure the SQL query in this part (insert into backfill_table select fields where day = '${day}') is correct. You may need to adjust this query to suit your specific use case.

If you're still having trouble, it would be helpful if you could provide more specific details about the problem.

View solution in original post

ms4446

Here are a few general troubleshooting steps that you might find helpful:

Check for Syntax Errors: Ensure that the JavaScript code has been copied correctly and that there are no syntax errors. Also, make sure the SQL queries inside the operate functions are correct and compatible with your database.
Verify the Date Format: The getDateArray function expects dates in the "YYYY-MM-DD" format. Make sure the dates you're inputting are in this format.
Check the Table Structure: The operate function is used to create a table and then insert data into it. Ensure that the structure of the table (defined in the create table if not exists backfill_table ('fields') part) is correct and suitable for your data. You will need to replace 'fields' with the actual fields of your table.
Validate the Data Insertion Query: The data insertion operation is performed for each date in the array. Make sure the SQL query in this part (insert into backfill_table select fields where day = '${day}') is correct. You may need to adjust this query to suit your specific use case.

If you're still having trouble, it would be helpful if you could provide more specific details about the problem.

jack_bennett

Thanks @ms4446, this is really useful.

In terms of of point 3 above, if we've got a table that has 100+ fields in it, do we need to list each of these individually in the code here?

Thanks

ms4446

If you're creating a new table and it has 100+ fields, you would typically need to list each of these fields with their corresponding data types in the create table statement. But if you're backfilling an existing table, the create table if not exists clause will not do anything, as the table already exists.

If your goal is to backfill an existing table, you don't need to define the table structure in the script. The insert into statement doesn't require you to list out all the fields if you're inserting into all fields and the data you're inserting matches the existing structure of the table.

Keep in mind that insert into backfill_table select * from original_table where day = '${day}' assumes that the structure of original_table and backfill_table are identical. If that's not the case, you would need to list out the specific fields you're selecting and inserting.

sysph

Hi there,

May i know how you trigger the backfilling javascript in Dataform (https://cloud.google.com/dataform/docs/reference/sample-scripts#backfilling_a_daily_table_with_javas...) ? or do you manually execute (run/compile) in Dataform repo?

It seems i can only run one by one. Are you able to batch run them?

Thanks

ms4446

Yes, you would typically manually execute (run/compile) scripts in the Dataform web interface. As for batch running scripts, there is no built-in command for batch running scripts. However, you can create a custom script to execute multiple scripts in a batch. Here's an example of how you can do it:

Create a new JavaScript file in your project directory. Let's call it runScripts.js.
In runScripts.js, you can use Node.js's built-in child_process module to execute shell commands. Here's an example of how you can use it to run multiple Dataform scripts:

const { exec } = require("child_process");

// List of your scripts
const scripts = ["script1.js", "script2.js", "script3.js"];

scripts.forEach((script) => {
exec(`dataform run ${script}`, (error, stdout, stderr) => {
if (error) {
console.log(`Error: ${error.message}`);
return;
}
if (stderr) {
console.log(`Stderr: ${stderr}`);
return;
}
console.log(`Stdout: ${stdout}`);
});
});

In this script, replace script1.js, script2.js, script3.js with the paths to your scripts. This script will run each script in the scripts array using the dataform run command.

Run the runScripts.js file with Node.js:

node runScripts.js

This will execute all your Dataform scripts in a batch.

sysph

Thanks ms4446. This is very helpful!

sysph

Hi there,

I got another question here. How shall i improve the render_script_with() func if i have multiple recursive CTEs or multiple select statements to make it more flexible? Shall i write different functions? (I'm only over beginner level in JS)

I have this includes/script_builder.js

//render_script returns a query
function render_script_select(fromTable, selectStatement, whereConditions, groupByStatement) {
  return `${selectStatement}
      FROM ${fromTable}
      ${whereConditions}
      ${groupByStatement}
    `;
}

function render_script_with(withName, fromTable_inWith, selectStatement_inWith, whereConditions_inWith, groupBy_inWith,
  fromTable_inFrom, select_inFrom, where_inFrom,groupBy_inFrom,
  select_outsideWith, where_outsideWith, groupBy_outsideWith){
  return `WITH ${withName} as ( \n ${selectStatement_inWith} \n
    FROM ( \n
    ${render_script_select(fromTable_inFrom, select_inFrom, where_inFrom, groupBy_inFrom)}      
    ) \n
    ${whereConditions_inWith}
    ${groupBy_inWith}
    ) \n
    ${render_script_select(withName, select_outsideWith, where_outsideWith, groupBy_outsideWith)}
    `;
}


module.exports = { 
  render_script_select,
  render_script_with,
}

examples from google BigQuery doc.

WITH q1 AS (my_query)
SELECT *
FROM
  (WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2)

WITH subQ1 AS (SELECT SchoolID FROM Roster),
     subQ2 AS (SELECT OpponentID FROM PlayerStats)
SELECT * FROM subQ1
UNION ALL
SELECT * FROM subQ2

WITH q1 AS (my_query)
SELECT *
FROM
  (WITH q2 AS (SELECT * FROM q1),  # q1 resolves to my_query
        q3 AS (SELECT * FROM q1),  # q1 resolves to my_query
        q1 AS (SELECT * FROM q1),  # q1 (in the query) resolves to my_query
        q4 AS (SELECT * FROM q1)   # q1 resolves to the WITH subquery on the previous line.
    SELECT * FROM q1)              # q1 resolves to the third inner WITH subquery.

ms4446

Your current render_script_with() function is quite flexible and can handle a variety of SQL queries. However, if you have multiple recursive CTEs or multiple SELECT statements, you might want to modify it to accept arrays of parameters instead of individual parameters. This way, you can handle any number of CTEs or SELECT statements. You can use the example you provided from BQ documentation that use CTEs with the WITH clause.

CTEs are temporary result sets that can be referenced within another SQL statement, including SELECT, INSERT, UPDATE, or DELETE statements. They are useful for simplifying complex queries, breaking down large queries into smaller, more readable chunks, and performing multiple steps of data manipulation in a sequence.

In your examples, q1, q2, q3, q4, subQ1, and subQ2 are all CTEs. They are defined using the WITH clause and can be used in the subsequent SELECT statements.

However, please note that the scope of a CTE is only within the query where it is defined, and it cannot be referenced in subsequent queries. Also, if a CTE is defined inside a subquery, it can only be used in that subquery.

If you want to use these examples in your render_script_with() function, you would need to modify the function to accept an array of CTEs and an array of corresponding queries. Each CTE and query would be a string that contains the SQL code for that CTE or query. The function would then concatenate these strings to form the final SQL query.

Here's an example of how you can modify your function to handle these examples:

return `WITH ${withClause}`;
}

sysph

Thank you ms4446. It's so helpful!