Question about Dataform Jobs

I am using Dataform's workflow to schedule regular executions.

In Dataform, only one job is shown, but in BigQuery, I see multiple queries being executed.

Why is this happening?

I have not checked the option to run dependent targets.

No other workflows are running at the same time.

Hi @hiracky16,

Welcome to Google Cloud Community!

Even though you only see one Dataform job, it likely contains multiple SQL operations (like creating or updating tables). Dataform translates each of these operations into individual queries that are run on BigQuery. This is why you see multiple BigQuery queries even though there's just one overarching Dataform job.

This behavior is expected because Dataform is designed to manage a series of related data transformations, and it ensures these transformations happen in the right order on your BigQuery data.

I hope the above information is helpful.

Thank you for your reply.

I understand that having multiple SQL operations (such as CREATE TABLE or ALTER TABLE) within SQLX can result in multiple jobs being created.

However, I have noticed that when a workflow is triggered from Dataform, BigQuery jobs with the prefix dataform-gcp- are issued. Sometimes there are several such jobs, and other times only one.

Why does this occur?

I hope this issue can be resolved.

Hi @hiracky16,

It's totally normal to see different numbers of "dataform-gcp-" jobs at different times. This comes from how Dataform schedules the actions in your pipeline. Tasks that don't rely on each other can run at the same time (which means more jobs running at once), while tasks that do rely on each other run one after the other (which could mean fewer jobs running at once). If you want to see exactly what each job is doing, you can always check the detailed logs.
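To make the parallel versus sequential idea above concrete, here is a minimal sketch in Python. It is not Dataform's actual scheduler; it only groups a dependency graph into "waves" of actions that could run at the same time. The action names and the `execution_waves` helper are hypothetical, made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_waves(dependencies):
    """Group actions into waves; every action in a wave has all its dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents become ready
    return waves


# Hypothetical workflow: each key is an action, each value is the set of actions it depends on.
workflow = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_report": {"orders_enriched"},
}

for number, wave in enumerate(execution_waves(workflow), start=1):
    print(f"wave {number}: {wave}")
```

In this made-up graph the two staging actions land in the same wave, so they could be submitted together, while the downstream actions each wait for the previous wave.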

Thank you for your explanation. However, could you explain why the number of dataform-gcp- jobs issued seems to vary depending on the day?

The number of dataform-gcp- jobs issued can vary day-to-day due to how Dataform optimizes your data pipeline's performance:

  • Parallel Execution: Dataform identifies independent tasks within your workflow and runs them simultaneously. If there are more independent tasks on a given day, you'll see more dataform-gcp- jobs running concurrently.

  • BigQuery Resource Availability: Dataform adapts to the current load on your BigQuery resources. If BigQuery is busy, Dataform might break down your workflow into smaller chunks (more jobs) to utilize available resources efficiently.

  • Cached Results: Dataform intelligently uses cached results for parts of your pipeline that haven't changed since the last run, potentially reducing the number of jobs on certain days.

  • Data Volume: Larger datasets might require Dataform to divide tasks into smaller, more manageable units, leading to more dataform-gcp- jobs for processing.

Essentially, Dataform adjusts its execution strategy dynamically, leading to fluctuations in the number of visible jobs while aiming for optimal performance and resource utilization.
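If you want to check how much the count really varies from day to day, one option is to export your BigQuery job history and tally the jobs whose ID starts with dataform-gcp-. The sketch below assumes you already have (job ID, creation time) pairs from somewhere, for example the BigQuery job history; the sample job IDs and the `dataform_jobs_per_day` helper are invented for illustration.

```python
from collections import Counter
from datetime import datetime

PREFIX = "dataform-gcp-"  # prefix mentioned in this thread


def dataform_jobs_per_day(jobs):
    """Count jobs whose ID starts with PREFIX, grouped by creation date (YYYY-MM-DD)."""
    counts = Counter()
    for job_id, created in jobs:
        if job_id.startswith(PREFIX):
            counts[created.date().isoformat()] += 1
    return dict(counts)


# Made-up sample records: (job_id, creation time).
sample_jobs = [
    ("dataform-gcp-aaa", datetime(2024, 5, 1, 3, 0)),
    ("dataform-gcp-bbb", datetime(2024, 5, 1, 3, 1)),
    ("bquxjob_123", datetime(2024, 5, 1, 9, 0)),  # not issued by Dataform, ignored
    ("dataform-gcp-ccc", datetime(2024, 5, 2, 3, 0)),
]

print(dataform_jobs_per_day(sample_jobs))
```

Comparing these daily counts against the actions listed in each workflow execution log should show which actions account for the extra jobs on a given day.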