EDIT: I've figured out how I could do this using operations within Dataform. Article here.
Hi everyone,
I have a data asset that needs to be computed every day (hencewhy I'm using dataform - it's a great tool), but I've come across a limitation. The computation requires pivoting data. The dataset looks like this:
Institution | Year | Result |
Glasgow | 2022 | 5 |
Glasgow | 2023 | 1 |
Aberdeen | 2023 | 6 |
Aberdeen | 2022 | 6 |
I need to pivot it such that it looks like this
Institution | 2022 Result | 2023 Result |
Glasgow | 5 | 1 |
Aberdeen | 6 | 6 |
It's easy enough to do in the BigQuery editor using dynamic pivot statements as this blog outlines. Super useful.
The problem is that I can't run EXECUTE IMMEDIATE FORMAT() statements in Dataform. I tried to dynamically reference my tables wrapped within a pivot() function I wrote:
let pivot = (
schemaRef,
tableName,
aggregatedColumn,
pivotColumn
) => {
let refString = `
ref({
schema: ${schemaRef},
name: "${tableName}"
})`
return `EXECUTE IMMEDIATE FORMAT("""
SELECT * FROM ${refString}
PIVOT (SUM(${aggregatedColumn}) as ${aggregatedColumn} FOR ${pivotColumn} IN %s);
""", (SELECT CONCAT("(", STRING_AGG(DISTINCT CONCAT("'", ${pivotColumn}, "'"), ','), ")") FROM ${refString}))
`
};
module.exports = {
pivot
};
And with the SQLX looking like this:
config {
type: "table",
description: "Each record represents a response within typeform.",
schema: dataform.projectConfig.vars.typeform_intermediate_schema,
tags: ["typeform_mart"]
}
${
pivot.pivot(schemaRef = schema,
tableName = "response_counts",
aggregatedColumn = "response_count",
pivotColumn = "form_year")
}
But the issue is that the ref() string can only be used within an SQLX context. Big issue, because I need this to run after the tables prior "response_counts" have materialised as a part of the data pipeline.
Any help would be appreciated!
The core issue lies in the way Dataform handles dynamic SQL execution. While EXECUTE IMMEDIATE
is excellent for flexible SQL statements, Dataform primarily prioritizes predeclaration of tables. This means you cannot dynamically build an entire table structure using EXECUTE IMMEDIATE FORMAT()
.
Hey, not to flame, but your response seems very GPT-like. Let's deconstruct your suggestions:
"Predefined Pivots for Predictable Columns"
This suggestion isn't sustainable due to the dynamic nature of one of the columns. The pivot needs to be dynamic. It's an analytics-ready table that relies on the columns spanning all distinct values of a column. Creating CI/CD overhead to achieve this isn't feasible, as we have SQL engineers who can debug SQL but not CICD engineers who can debug CICD pipelines.
"Conditional Logic for Data Integrity"
This solution is a non-solution that doesn't address the question.
"Custom Stored Procedure for Centralized Logic"
This could potentially be a solution, but it relies on business logic being encoded outside of dataform. This isn't scalable and maintainable, as when the logic breaks (all logic eventually breaks), engineers won't be able to find where the error is. I don't want to take on tech debt when I know that the solution can be achieved in Dataform.
Hybrid External Scripting for Maximum Flexibility
Again, totally misses the question.
Honestly, I'm quite disappointed with this response. I thought that forums were meant to be spaces where people can exchange ideas and come to novel solutions. I asked whether this solution was possible within Dataform and if anyone had come against this problem before. I don't feel that you addressed the question, and I feel like the GPT-like response (I can't confirm whether it is or isn't, but I'm 99% sure) totally devalues the reason for posting in a community forum.
I'd rather get no response than an LLM automated one. Thanks.
Chiming here. I understand many answers would help but unfortunately dataform remains unpopular after 2 years being taking over, and besides ms4443 reply, you won't find (m)any more.
I find that AI-assisted - ms4443 is still reviewing & editing the response - are usually quite useful, especially if there is no other feedback.
Back to your question, dataform only handles declarative substitutions ie. at compilation time. Your only option as frustrating as it is, are writing procedures. It's something we lost coming from airflow but pipelining sql's are easier
We all hope dataform will evolve but honestly there is little indication and no communication about any.
Another way I got around this issue was completing the EXECUTE IMMEDIATE as a preoperation. Within the execute immediate, create a temp table to place the results. Then pull from this temp table in the body of the Dataform definition.