Hi,
Dataform generates code with a procedure name like:
Unfortunately, you cannot directly change the auto-generated procedure names within Dataform. Dataform automatically generates procedure names in the format `project_id.dataset_id.df_d1b3851d423cca3c78ade6504a94655edc9a9d9e4b127897c0a14462705e8e9a`. These hashed names, while not directly modifiable, serve important purposes.
Workarounds for Improved Readability
While you cannot directly change the auto-generated names, here are effective strategies to enhance readability and manageability:
View Creation:
CREATE VIEW my_descriptive_view_name AS
SELECT * FROM `project_id.dataset_id.df_d1b3851d423cca3c78ade6504a94655edc9a9d9e4b127897c0a14462705e8e9a`;
Custom Operations: Use `CREATE OR REPLACE PROCEDURE` to define your own procedures with descriptive names:
-- Custom operation in Dataform
CREATE OR REPLACE PROCEDURE your_project.your_dataset.your_meaningful_procedure_name()
BEGIN
-- Your procedure logic here
END;
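To apply the view workaround consistently, it can help to generate the wrapper DDL from the hashed name. Here is a minimal sketch; the view name and hashed table name are just the placeholders from the example above:

```python
def wrapper_view_ddl(view_name: str, hashed_table: str) -> str:
    # Build a CREATE VIEW statement that gives the hashed table a readable alias.
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT * FROM `{hashed_table}`;"
    )

ddl = wrapper_view_ddl(
    "my_descriptive_view_name",
    "project_id.dataset_id.df_d1b3851d423cca3c78ade6504a94655edc9a9d9e4b127897c0a14462705e8e9a",
)
assert ddl.startswith("CREATE VIEW my_descriptive_view_name")
```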
Thank you, but there is another problem.
The hashed name doesn't depend only on the name but also on the start time of the execution.
If you execute two processes at the same time that merge data into the same target table, they execute the same procedure twice, because the same name is generated for both. This leads to skipping data from one source and producing duplicates.
You're correct! That's an important nuance, and I apologize for not addressing that earlier. The procedure names in Dataform are generated based on a hash that includes factors like the execution start time, which is intended to ensure uniqueness and to track changes to your data transformations. However, this means that procedures started simultaneously can end up with identical names. This scenario is particularly problematic when these processes aim to merge data into the same destination table, increasing the risk of data loss or duplication.
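The failure mode can be sketched in a few lines. The exact inputs to Dataform's hash are undocumented, so the definition text and start timestamp used here are assumptions for illustration only:

```python
import hashlib
import uuid

def dataform_style_name(definition: str, start_time: str) -> str:
    # Hypothetical: hash the action definition plus the execution start time.
    digest = hashlib.sha256(f"{definition}|{start_time}".encode()).hexdigest()
    return f"df_{digest}"

# Two workflows triggered at the same instant produce the same procedure name...
a = dataform_style_name("MERGE INTO target ...", "2024-01-15T10:00:00Z")
b = dataform_style_name("MERGE INTO target ...", "2024-01-15T10:00:00Z")
assert a == b  # collision: both runs create/drop the same procedure

# ...whereas a random UUID per invocation would keep each run isolated.
c = f"df_{uuid.uuid4().hex}"
d = f"df_{uuid.uuid4().hex}"
assert c != d
```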
Solutions and Best Practices
Sequential Execution: To circumvent this issue, the most reliable method is to schedule your Dataform processes to run sequentially. This ensures that no two processes targeting the same table are executed at the same time.
Dependency Management: In cases where sequential execution is impractical, meticulously structure your Dataform definitions to establish explicit dependencies between processes.
Advanced Strategy: Temporary Tables and Merging: For complex scenarios, consider the following approach:
Use a `MERGE` operation to consolidate the data from the temporary table with the target table in a controlled manner. Refer to the official BigQuery `MERGE` documentation for details.
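The temporary-table strategy above can be sketched as follows: stage each run into its own uniquely named table, then merge into the shared target. The dataset, table, and column names are placeholders, and the `MERGE` clauses are a simplified example rather than what Dataform generates:

```python
import uuid

def build_merge_job(dataset: str, target: str, unique_key: str) -> tuple[str, str]:
    """Stage each run into its own table, then MERGE into the shared target."""
    # A per-run suffix guarantees concurrent runs never share a staging table.
    staging = f"{dataset}.{target}_staging_{uuid.uuid4().hex}"
    merge_sql = f"""
    MERGE `{dataset}.{target}` AS t
    USING `{staging}` AS s
    ON t.{unique_key} = s.{unique_key}
    WHEN MATCHED THEN UPDATE SET t.payload = s.payload
    WHEN NOT MATCHED THEN INSERT ({unique_key}, payload) VALUES (s.{unique_key}, s.payload)
    """
    return staging, merge_sql

s1, q1 = build_merge_job("my_dataset", "events", "event_id")
s2, q2 = build_merge_job("my_dataset", "events", "event_id")
assert s1 != s2  # concurrent runs each get their own staging table
assert "MERGE" in q1
```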
ok thank you!
@ms4446 I just hit on the same problem while investigating random `Procedure is not found` errors which stop workflows with no option to retry.
Because our workflows are externally triggered (from GCP Workflows, on blob creation in GCS), 2 df procedures may get the same name, overwriting each other, and if timing is right (I mean wrong) one workflow can drop the procedure after execution, before the other one tries to execute it. It could lead either to the wrong SQL definition being run as per OP, or the error that procedure doesn't exist (anymore).
I understand we can't control the temp name that df generates for the procedure, but can you reach out internally and tell us which exact parameters are used to make up the hash, assuming it's not just the timestamp?
Hi @yan-hic,
The issue you're experiencing, particularly the 'Procedure not found' error in workflows triggered by external events like GCS blob creation, indeed points to a challenge in using Dataform in production environments.
In the interim, consider the following strategies to mitigate the impact:
Serialization: Where possible, adjust your workflow triggers to ensure Dataform processes are executed sequentially. This can help avoid race conditions.
Enhanced Error Handling: Implement comprehensive error logging and notification mechanisms within your Dataform projects to better manage and respond to 'Procedure not found' errors.
Provisional Procedures: As a temporary workaround, manually creating and managing procedures with known names might help reduce the frequency of these errors. While not ideal, this approach could offer a stopgap solution.
Engaging with Google Cloud Support
Support Channels: Reach out through Google Cloud's official support channels. Detailed reports, especially those affecting production workflows, are taken seriously and can lead to more direct action or guidance.
Issue Tracker: Check Issue Tracker for similar reports or create a new issue detailing your experience. This visibility can help prioritize a resolution.
@ms4446 that AI-assisted answer was unfortunately not edited by the product expert that you are, and just transcribed as-is. However, to answer on the proposed strategies:
- Serialization: it is not possible when triggering externally, if one doesn't control the time and rate at which files get copied to GCS. Dataform should have been designed, or corrected since the takeover, to support concurrency. The solution is easy; read on.
- Error Handling: errors are raised by the Dataform wrapping, not by user code. Users have no control over error handling in the procedures Dataform wraps around our code. This is always the case for incremental tables where a unique key is specified: DF generates a script that creates a procedure to `MERGE`. The reason we (wanted to) use DF is that we don't need to code the incremental/full-refresh boilerplate ourselves.
- Temp procedures: you basically advise dropping incremental tables and using `operations` instead across a workflow...
The solution in my mind is straightforward though: DF should use a UUID for naming the temp procedure, instead of a hash based on the execution timestamp. This ensures uniqueness and isolation. To me, this is more of a bug or a bad design assumption.
In the meantime, I would appreciate it if you could answer the question from my previous post: what exactly is the hash made of? If we know how it is built, we can see how to use distinct factors that will produce distinct hash values.