
Dataform: Exceeded rate limits: too many dataset metadata update operations for this dataset.

Hi,

When I run all the actions in my Dataform workspace, some actions fail if the dataset does not already exist.

In the job results, I see this error message:

CREATE SCHEMA IF NOT EXISTS `my-project.my_dataset` OPTIONS(location="EU")
Exceeded rate limits: too many dataset metadata update operations for this dataset. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

Because this CREATE SCHEMA command fails, the next part of the job, which attempts to CREATE OR REPLACE a resource in this dataset, fails with a "not found" error:

Not found: Dataset my-project:my_dataset was not found in location EU

My Dataform model has 5 actions that do not have dependencies (views, operations such as UDF creation, and a table). I suspect this is related to the issue.

This problem does not occur under the following conditions:

- When the dataset already exists; for example, a second execution succeeds because the first run created it.
- When I execute all actions using the Dataform CLI (see the command sketch after this list).
- When I add explicit dependencies between some of the initially independent actions.
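
For reference, executing all actions via the open-source CLI here means a plain run from the project root, along these lines (a sketch; flags omitted):

dataform run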

Questions:

1. How can I avoid this failure?
2. I can't find specific information about "Exceeded rate limits: too many dataset metadata update operations for this dataset" in https://cloud.google.com/bigquery/docs/troubleshoot-quotas. What does this limit mean?

Thanks for your help!

1 ACCEPTED SOLUTION

The error you're encountering in Google Cloud Dataform is due to a combination of BigQuery rate limits, Dataform concurrency, and dataset creation delay. BigQuery enforces limits on metadata operations like creating datasets or tables within a short timeframe to maintain system stability and prevent abuse. Running multiple independent actions in Dataform can trigger these limits, especially when they attempt to create the dataset concurrently. Additionally, Dataform’s default behavior is to execute independent actions in parallel to speed up processing, which can lead to conflicts. Even after a dataset is successfully created, there might be a slight propagation delay before it is fully available for other operations, causing subsequent actions to fail.

To address these issues, several strategies can be implemented:

Explicit Dependencies: Define dependencies between actions in your Dataform model to ensure that dataset creation is completed before any dependent actions run. This prevents concurrent attempts to create the dataset and avoids exceeding rate limits. For example, you can configure an action to depend on the dataset creation action using:

config {
  dependencies: [ref("my_dataset_creation_action")]
}
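
For context, the dataset-creation action referenced here could itself be a SQLX operation along these lines (a sketch; the action and file names are illustrative, with the dataset and location taken from the question):

definitions/my_dataset_creation_action.sqlx:

config {
  type: "operations"
}

CREATE SCHEMA IF NOT EXISTS `my-project.my_dataset` OPTIONS(location="EU")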

Serial Execution: As a temporary workaround, you can force Dataform to execute actions serially (one after another) by setting concurrentActions to 1 in your Dataform configuration file (dataform.json):

{
  "concurrentActions": 1
}

Pre-Create the Dataset: Manually creating the dataset in BigQuery before running your Dataform actions can avoid the race condition altogether. This ensures that the dataset already exists, preventing Dataform from attempting to create it multiple times concurrently.
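
For example, with the bq command-line tool (dataset ID and location taken from the error message above):

bq --location=EU mk --dataset my-project:my_dataset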

Implicit Dependencies with the ref Function: For finer control, use Dataform’s ref function to reference the outputs of other actions within your SQL statements. ref() interpolates the fully qualified name of the referenced action and also registers it as a dependency, so dependent actions execute in the correct order:

CREATE OR REPLACE TABLE `my-project.my_dataset.derived_table` AS
SELECT ... FROM ${ref("create_my_table")}

The exact thresholds for BigQuery dataset metadata operations can change over time. As a rule of thumb, occasional operations are fine, while rapid bursts of concurrent requests against the same dataset can trigger the limit.

Additional tips to consider include implementing error handling and retry logic in your Dataform actions to manage dataset creation failures gracefully. Using the Dataform CLI for running actions can also provide more detailed error messages, aiding in troubleshooting.
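
If you drive runs from the CLI, one coarse way to add retry logic at the invocation level is a small shell wrapper (a sketch; the retry count and pause are arbitrary):

# Retry the full run up to 3 times, pausing so the
# dataset metadata quota window can recover.
for attempt in 1 2 3; do
  dataform run && break
  sleep 15
done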


4 REPLIES


Hi @ms4446 ,

Thanks for your detailed reply! For my use case, I chose to create the dataset explicitly in an operation, and to give every action that does not already have a dependency an explicit dependency on that operation.
To make this work, I had to remove the "ref" part from your example:

config {
  dependencies: ["my_dataset_creation_action"]
}

instead of:

config {
  dependencies: [ref("my_dataset_creation_action")]
}
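
Put together, one of my previously independent actions now looks roughly like this (names are illustrative):

config {
  type: "view",
  dependencies: ["my_dataset_creation_action"]
}

SELECT ...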

I tried to set "concurrentActions", but it does not work.

I use "@dataform/core": "3.0.2" on GCP, with the following in my configuration file:

{
  "concurrentActions": 1
}

Hi @uiltjesrups,

Welcome to the Google Cloud Community!

In addition to what @ms4446 mentioned, please note that BigQuery allows a maximum of five dataset metadata update operations every 10 seconds per dataset. With five independent actions each issuing CREATE SCHEMA IF NOT EXISTS at the start of a run, that ceiling can be reached immediately.

I also agree that adding dependencies in Dataform is crucial for ensuring the correct execution order of your actions, and it can prevent multiple actions from attempting to create the dataset at the same time.

I hope the above information is helpful.