
Composer Airflow Dataset Error

[Screenshot of the scheduler error traceback: wdrdg_0-1746810173281.png]

Hi there! We hit a scheduler error in Composer Airflow: the scheduler stops working and keeps raising this error whenever we submit DAGs with a Dataset trigger.


Hi wdrdg,

This kind of scheduler error with Airflow + Composer when using Dataset Triggers can sometimes be related to:
– Airflow version mismatch (older versions may not fully support Dataset Triggers)
– Bugs in dataset dependency resolution (especially if mixing DAGs and datasets across environments)
– Resource constraints in the Composer environment (check Cloud Logging for scheduler memory/cpu errors)

I recommend:
Confirm your Composer Airflow version — upgrade to at least Airflow 2.5+ if not already.
Check the exact stack trace (your screenshot shows a type comparison error — maybe related to how datasets are linked).
Test with a minimal DAG+Dataset setup to isolate the failing trigger pattern.
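
For that last step, a minimal pair along these lines could serve as the isolation test (all DAG ids and the dataset URI are placeholders, assuming Airflow 2.4+ so Datasets are available):

    # Minimal sketch of a dataset-triggered pair for isolation testing.
    # DAG ids and the dataset URI below are placeholders.
    import pendulum
    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    example_ds = Dataset("gs://example-bucket/example.csv")  # hypothetical URI

    @dag(start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), schedule="@daily", catchup=False)
    def minimal_producer():
        @task(outlets=[example_ds])
        def update_dataset():
            # Any successful run of this task emits a dataset event for example_ds.
            pass

        update_dataset()

    @dag(start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), schedule=[example_ds], catchup=False)
    def minimal_consumer():
        @task
        def react_to_update():
            pass

        react_to_update()

    minimal_producer()
    minimal_consumer()

If this pair schedules cleanly, you can then add a second dataset and a second consumer DAG to reproduce the failing pattern step by step.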

Hi @a_aleinikov!

Thanks for your response. In my case, the version is 2.9.1. We have a DAG A triggered by 2 Datasets, and it runs fine. However, if we create another DAG B triggered by another 2 Datasets, it causes the scheduler error shown in the screenshot.

Maybe there is something wrong with the way we defined the Datasets?

Hi @wdrdg,

Welcome to Google Cloud Community!

The error you’re encountering means that Airflow's scheduler, while calculating the data interval for a DAG run triggered by datasets, needed to find the minimum value (min()) of a list (start_dates) that contained None values. Python's min() cannot compare None with None (or None with a date/time object) using the less-than operator (<), hence the TypeError.

When a DAG is triggered by datasets, Airflow needs to determine the appropriate data_interval_start and data_interval_end for the triggered DAG run. This often involves looking at:

  1. The timestamps of the triggering dataset events.
  2. The DAG's own start_date.
  3. The DAG's schedule/timetable logic (even if primarily dataset-driven, some base logic applies).

The min(start_dates) call likely occurs when the timetable logic tries to determine the earliest possible start point for the data interval based on available information. If crucial date/time information is missing (resulting in None), this error occurs.
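
As a quick illustration of the failure mode, once start_dates contains None values, min() has nothing it can order:

    # Illustration only: Python cannot order None values, which is what the
    # scheduler's min(start_dates) call runs into when the list contains None.
    start_dates = [None, None]
    min(start_dates)
    # TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'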

Based on the error and your description (it works with DAG A, breaks when adding DAG B), here are some workarounds you may try:

  1. Ensure DAG B has a valid, concrete start_date defined (see the sketch after this list).

    Even though the DAG is triggered by datasets, the underlying timetable logic might still require a valid start_date for initialization or edge cases. If DAG B's start_date is missing, set to None, or perhaps set to pendulum.DateTime.min incorrectly, it could lead to None values during interval calculation.

  2. Ensure the schedule parameter in DAG B is set only to the list of Dataset objects that should trigger it. Do not mix it with cron strings or timedeltas directly in the schedule parameter if it's purely dataset-driven. 

    Mixing dataset triggers with traditional schedules in the schedule parameter, or having schedule=None when it should be the list of datasets, might confuse the timetable (dataset-based scheduling requires Airflow >= 2.4). Given that your environment runs Airflow 2.9.1, the list approach is the correct one.

  3. Check the Dataset Events in the Airflow UI (Datasets view -> select your dataset to see its events). See if the timestamps look correct for the events related to DAG B's datasets. Ensure the tasks producing the datasets for DAG B are correctly signaling the update.

    While less common, it's conceivable that the dataset events triggering DAG B somehow got recorded without proper timestamps, leading to None values when the scheduler processes them. This could indicate an issue with the tasks updating those datasets or a deeper Airflow bug.

  4. Check the Airflow GitHub issues and Google Cloud Composer release notes/issue tracker for similar TypeError problems related to datasets in your specific version. Consider upgrading Composer/Airflow if a fix is available in a later version.

    There might be a subtle bug in Airflow 2.9.1 (or the specific Composer image version) related to handling multiple dataset-triggered DAGs, especially concerning how their start dates or event times interact within the scheduler's timetable logic.
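
Putting points 1 and 2 together, DAG B's definition might look roughly like the sketch below (the DAG id, dataset URIs, and task are placeholders; the dataset URIs must match whatever DAG A's producing tasks declare as outlets):

    import pendulum
    from airflow import DAG
    from airflow.datasets import Dataset
    from airflow.operators.empty import EmptyOperator

    # Hypothetical URIs; they must match the outlets declared in DAG A.
    dataset_1 = Dataset("gs://example-bucket/table_1/")
    dataset_2 = Dataset("gs://example-bucket/table_2/")

    with DAG(
        dag_id="dag_b",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),  # concrete start_date (point 1)
        schedule=[dataset_1, dataset_2],                      # only Dataset objects, no cron (point 2)
        catchup=False,
    ) as dag_b:
        EmptyOperator(task_id="downstream_work")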

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Thanks for your reply! We re-deployed the DAG and the new DAG runs well with the Datasets. However, for the old one, I paused it and it showed "6 of 6 Dataset updated". If I unpause it, the scheduler hits the same issue. I have defined the start_date, and the Datasets have the update event records. Do you know what the issue is there?

I still have the issue.

My DAG A updates two Datasets and DAG B is scheduled on these 2 Datasets. The first run succeeded, but the second run hit the same error.

For the DAG triggered by the Datasets, I paused it and it showed "6 of 6 Dataset updated". If I unpause it, the scheduler hits the error again. I have defined the start_date, and the Datasets have the update event records. Do you know what the issue is there?