Tasks both in "no state" and in success state in Cloud Composer

Hello,

 

I am using the managed version of Apache Airflow via Cloud Composer, version composer-2.6.1-airflow-2.6.3.

 

Some of my tasks are both in a success state (I am able to find them and their logs using Browse > Task Instances) and in no state at all (in the grid view of the DAG, there is simply no square representing the task; see screenshot). The tasks are not in the removed state; there is really no square, hence no state.
1 ACCEPTED SOLUTION

OK, if someone is trying to access the Airflow DB from the "managed" version of Apache Airflow by Google, here is how I did it.

First of all, you need to find out the host, user, password, port, and database name.

In order to do that, you can export the list of connections using the airflow CLI.

gcloud composer environments run --location=YOUR_REGION YOUR_COMPOSER_ENV_NAME connections export -- - --format=env | grep airflow_db

The connection URI contains all the information you need.
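
For reference, the exported line is an env-style entry whose value is a standard connection URI; the shape below is illustrative and every value is a placeholder (the host is likely only reachable from inside the cluster, which is why the connection is made from inside a pod further down):

# Illustrative shape of the exported airflow_db connection (placeholders only)
airflow_db=postgresql://USER:PASSWORD@HOST:PORT/DATABASE_NAME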

Then, you need to set up kubectl access to your Composer environment's GKE cluster on your machine (a sketch of one way to do that follows).
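
As a rough sketch, assuming you use gcloud to locate and authenticate against the environment's GKE cluster (the --format field name and all values are placeholders to adapt to your setup):

# Find the GKE cluster backing the Composer environment (field name assumed)
gcloud composer environments describe YOUR_COMPOSER_ENV_NAME \
  --location=YOUR_REGION --format="value(config.gkeCluster)"

# Point kubectl at that cluster
gcloud container clusters get-credentials YOUR_CLUSTER_NAME \
  --region=YOUR_CLUSTER_REGION --project=YOUR_PROJECT
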
Find out the namespace in which your composer resources are deployed using
 
kubectl describe ns
 
(one of the namespaces is named roughly after your Composer version)
 
Then you need to identify a pod that has access to the airflow db, for instance the name of a scheduler pod:
 
kubectl get pods --namespace=YOUR_COMPOSER_NAMESPACE
 
Run a terminal in this pod using
 
kubectl exec --namespace=YOUR_COMPOSER_NAMESPACE -it YOUR_SCHEDULER_POD -- /bin/bash

Great, you are now logged into the pod. Using psql, you can access the Airflow DB (with the values retrieved in the first step):
 
psql -h HOST -U USER -p PORT -d DATABASE_NAME
 
Enjoy!


10 REPLIES

(continuing the issue description here as I wasn't allowed to post the rest of the message)

 

 

When I try to rerun the DAG run that contains the suspicious tasks using the Clear > Queue up new tasks menu, nothing happens, which seems fairly normal because the tasks actually ran and really are in a success state.

 

Well, one could think it's only a visual issue that does not matter too much, as I'm still able to access the logs of the tasks through the Browse > Task Instances menu.

 

However, I've got some external sensors that do not meet their requirements because they are waiting for the faulty tasks to be successful.

 

I believe this is all due to an upgrade of the Composer environment. I noticed I had a worker pod that was still trying to connect to an old version of the Airflow database (composer-2-3-5-airflow-2-5-3) even months after the upgrade.

 

The error message from the worker pod looked like this:

 

OperationalError: FATAL: database "composer-2-3-5-airflow-2-5-3-550b9f5a" does not exist

I somehow managed to kill this faulty worker pod, and now all the worker pods are "talking" to the correct Airflow database, but the tasks are still both in a successful state and in no state at all.

 

I could set all the ExternalTaskSensors to a successful state by hand, but I feel like this will happen again.

 

According to the documentation, I should be able to connect to the Airflow database; however, the sql_alchemy_conn parameter shows as < hidden > in my Configuration UI.

 

Does someone have any clue? Is there an official way to make an airflow task fully recognized as successful? How can I prevent this from happening again?

You're encountering a situation where tasks are marked as successful in the backend (Task Instances view) yet are missing from the DAG grid view (no state displayed). This inconsistency prevents external sensors that depend on these tasks from completing their checks, thus blocking your workflows. The likely cause is the recent Composer environment upgrade: a worker pod trying to access an outdated Airflow database suggests problems with database synchronization or task metadata consistency.

Troubleshooting and Resolution

1. Verify Database Consistency

Access the Airflow Database: Despite the sql_alchemy_conn being hidden in the UI, follow these steps to connect to the database:

  • Obtain Credentials: Navigate to the Google Cloud Console to find the connection details for your Composer environment's PostgreSQL database.

  • Set Up Cloud SQL Proxy: Install and configure the Cloud SQL Proxy to create a secure connection to your database from your local environment.

  • Connect using psql: Use the psql client and the credentials obtained to connect to the database.

  • Inspect Task Metadata: Execute SQL queries against the task_instance table to find discrepancies between the database records and what's shown in the Airflow UI (a query sketch follows this list).
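
As an illustration, a minimal inspection query once you have psql connectivity, assuming the standard Airflow 2.x task_instance schema; the connection values, dag_id, and run_id are placeholders:

# List the recorded state of every task instance in the suspicious DAG run
psql -h HOST -U USER -p PORT -d DATABASE_NAME -c \
  "SELECT dag_id, run_id, task_id, state, start_date, end_date
     FROM task_instance
    WHERE dag_id = 'YOUR_DAG_ID' AND run_id = 'YOUR_RUN_ID'
    ORDER BY task_id;"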

2. Clear Task State and Re-run DAGs

  • Clear Task Instances: Use the Airflow UI or CLI to clear the states of the affected tasks (a CLI sketch follows this list). Be cautious, as this could affect other tasks.

  • Re-trigger DAGs: Manually trigger the DAGs containing these tasks to attempt to normalize the state.
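
For the CLI route, a hedged sketch of what clearing could look like through Composer's gcloud wrapper; the DAG id, task regex, and dates are placeholders, and the flags should be double-checked against your Airflow version before running:

# Clear matching task instances so the scheduler re-queues them
gcloud composer environments run YOUR_COMPOSER_ENV_NAME --location=YOUR_REGION \
  tasks clear -- YOUR_DAG_ID -t "YOUR_TASK_REGEX" \
  -s "2024-01-01" -e "2024-01-02" --yes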

3. Address the Worker Pod Issue

  • Investigate the Pod: Use Kubernetes commands like kubectl logs on the worker pod to look for error messages or other clues pointing to the root cause (a sketch follows this list).

  • Potential Fixes:

    • Restart Composer: Consider restarting your Composer environment, which involves not just a simple reboot but a comprehensive refresh of the environment settings.

    • Upgrade Airflow: If you suspect a bug, update your Airflow version within Composer while ensuring compatibility with existing configurations.
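
A sketch of that log inspection; the namespace and pod name are placeholders, and the airflow-worker container name is an assumption about the Composer pod layout:

# Tail recent worker logs and search for database connection errors
kubectl logs --namespace=YOUR_COMPOSER_NAMESPACE YOUR_WORKER_POD \
  -c airflow-worker --tail=500 | grep -i "OperationalError"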

4. Preventive Measures

  • Monitor Worker Logs: Keep an eye on the logs for any anomalies.

  • Implement Health Checks: Use monitoring tools to continuously track the health of your Airflow workflows and databases.

  • Testing and Staging: Always test environment upgrades in a staging setup before rolling them out to production.

Important Considerations

  • Data Loss Risk: Evaluate the implications of clearing task states. If tasks perform critical data operations, their states should be managed carefully to avoid data loss.

  • Database Backups: Maintain regular backups of your Airflow database to safeguard against unforeseen issues.

Hello and thanks for your answer ms4446,

 

1. Verify Database Consistency

It's precisely because the sql_alchemy_conn is hidden that I am not able to find the credentials to the airflow database. 

I have no idea where to look in the Google Cloud Console to get this information (it doesn't seem like the Airflow database appears as a Cloud SQL instance). Do you have some hints?

 

2. Clear Task State and Re-run DAGs

I'd prefer to avoid this, as the tasks that disappeared (they were definitely considered successful, even in the UI, at some point) are not idempotent; they ran once and I'm happy with that.

 

3. Address the Worker Pod Issue

It looks like an issue with the last upgrade, which didn't go well. I find that acceptable, as this feature is documented as "in preview", so I'm fine with running into this issue.

 

Direct access to the Airflow database can be complex due to the hidden sql_alchemy_conn. Here are two practical methods to find the necessary information:

A. Cloud SQL Proxy

Mechanism: This method uses a local proxy to securely connect to the Cloud SQL instance that backs your Composer Airflow database.

Steps:

  • Enable Cloud SQL Admin API: Make sure the Cloud SQL Admin API is enabled in your Google Cloud Project.

  • Install Cloud SQL Proxy: Download and install the Cloud SQL Proxy for your operating system.

  • Locate Instance Connection Name:

    1. Navigate to the Composer Environments page in the Cloud Console.

    2. Click on your environment's name to view details.

    3. In the Configuration section, find the Database entry, which should display the Cloud SQL instance name (e.g., projects/<your-project>/instances/<instance-name>).

  • Start Proxy: Run the cloud_sql_proxy command with the instance connection name and specify a local port (sketched after this list).

  • Connect with psql: Use psql to connect to the database using the local port opened by the proxy. The typical database name is airflow.
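
For illustration, the last two steps might look like this with the v1 proxy binary, assuming the instance connection name were available; all values are placeholders:

# Open a local tunnel to the Cloud SQL instance on port 5432 (v1 proxy syntax)
./cloud_sql_proxy -instances=PROJECT:REGION:INSTANCE_NAME=tcp:5432

# In another terminal, connect through the tunnel
psql -h 127.0.0.1 -p 5432 -U USERNAME -d airflow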

B. Kubernetes Port Forwarding (Advanced)

Mechanism: This method forwards a port from a Kubernetes pod in your Composer environment directly to your local machine.

Steps:

  • Locate Airflow Pod: Use kubectl get pods to find a running Airflow pod in your Composer environment's GKE cluster.

  • Port Forward: Execute kubectl port-forward <pod-name> <local-port>:5432 to forward the PostgreSQL port (5432) to a local port.

  • Connect with psql: Connect to the forwarded local port using psql.

Important Notes:

  • Credentials: For both methods, you will need to authenticate using a service account with appropriate IAM permissions to access the Cloud SQL instance.

  • Security: Prioritize secure connections and ensure database credentials are not exposed.

2. Alternatives to Clearing Task States

If clearing task states is not a suitable option due to the non-idempotent nature of the tasks, consider the following approaches:

  • Database Manipulation (Careful): If you have expertise in SQL, you could directly update the task state records in the Airflow database (a hedged sketch follows this list). This approach should be undertaken with caution, as it requires a solid understanding of the Airflow database schema.

  • Wait and See: Sometimes, Airflow's internal synchronization mechanisms may resolve such inconsistencies over time. It's advisable to monitor the situation before taking drastic measures.
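
Purely as an illustration of the first option, a sketch of the kind of update involved, assuming the standard Airflow 2.x task_instance schema; back up the database first, and treat every value as a placeholder:

# Mark a single task instance as successful directly in the metadata database
psql -h HOST -U USER -p PORT -d DATABASE_NAME -c \
  "UPDATE task_instance
      SET state = 'success'
    WHERE dag_id = 'YOUR_DAG_ID'
      AND task_id = 'YOUR_TASK_ID'
      AND run_id  = 'YOUR_RUN_ID';"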

Hello,

 


@ms4446 wrote:

In the Configuration section, find the Database entry, which should display the Cloud SQL instance name (e.g., projects/<your-project>/instances/<instance-name>).


Unfortunately this is not accurate. As per the documentation, and as you can see on the environment page, there is no such "Database entry" nor any clue regarding the SQL connection, hence my post here.


@ms4446 wrote:

Port Forward: Execute kubectl port-forward <pod-name> <local-port>:5432 to forward the PostgreSQL port (5432) to a local port.


I believe this is wrong: it listens on the local port and forwards to the PostgreSQL port, not the other way around. Besides, I'm not even sure forwarding port 5432 could help me in any way to connect to the Airflow DB 😕

Anyway, I have no clue how to use a Google service account to connect to a DB using psql.

 

I am surprised that there doesn't seem to be an easy way to connect to the Airflow DB. Composer is meant to be a service managed by Google, so it's "ok-ish" not to give the end user access to some parts of the product, but if the "managed service" doesn't work, it would make sense to let the user access the parts that are not properly managed.

Is there really no way to connect to the Airflow DB to fix the metadata table content after a failed upgrade to a newer Composer version?

Cloud Composer indeed manages a lot of components for you, including the underlying database that powers Apache Airflow. However, direct access to the database can be restricted and not straightforward, especially since Google Cloud manages the environment to ensure security and stability. Here are a few steps and tips that could help you address the problem of connecting to the Airflow database and potentially fixing metadata issues:

  1. Access via Cloud SQL Proxy: Google Cloud Composer uses Google Cloud SQL to manage the Airflow metadata database. While direct access isn't provided in the Composer interface, you can use the Cloud SQL Proxy to securely connect to your database. To do this:

    • Enable the Cloud SQL Admin API.
    • Install the Cloud SQL Proxy on your local machine.
    • Use the gcloud command-line tool to obtain the instance connection name (you can find this in the Composer environment's details under the Cloud SQL section).
    • Start the proxy with commands like:
      ./cloud_sql_proxy -instances=<INSTANCE_CONNECTION_NAME>=tcp:5432
    • This sets up a secure tunnel to the database, and you can then use a PostgreSQL client tool to connect using the local port 5432.
  2. Service Account Configuration: To use the proxy and connect to the database, you'll need a service account with appropriate permissions. Ensure the service account has roles like Cloud SQL Client and Cloud SQL Editor. You can set this up in the IAM & Admin section of the Google Cloud Console.

  3. Connecting using psql: Once the proxy is running, you can connect using psql or any other PostgreSQL-compatible client. For instance:

    psql -h 127.0.0.1 -p 5432 -U <USERNAME> -d <DATABASE_NAME>

    The username and database name can usually be found in your environment's configuration; the default user for Airflow databases in Composer is often airflow.

  4. Port Forwarding Misunderstanding: The statement about port forwarding you mentioned is indeed about mapping a local port to the service's port in the pod. The command kubectl port-forward <pod-name> <local-port>:5432 maps a local port to port 5432 of a PostgreSQL pod (see the sketch after this list). This could be useful in Kubernetes environments where you might want to temporarily expose a pod's service for debugging. However, for Composer, using the Cloud SQL Proxy as described above is the recommended method.

  5. Troubleshooting Failed Upgrades: If an upgrade fails and you suspect metadata corruption or issues, connecting to the database to manually intervene should be done with caution. Ensure you have backups and understand the schema changes that might be involved with version upgrades.

  6. Contact Support: If you're facing persistent issues or the upgrade has corrupted data in a way that isn't manageable, contacting Google Cloud Support is advisable. They can provide more direct assistance and possibly access more tools to help resolve your issue.
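
To make the port-forward direction from point 4 concrete, a small sketch; the namespace, target name, and local port are placeholders, and this only helps if the forwarding target actually listens on 5432:

# Listen locally on 15432 and forward those connections to port 5432 in the target
# (for a Service instead of a pod, use svc/SERVICE_NAME as the target)
kubectl port-forward --namespace=YOUR_NAMESPACE YOUR_POD_NAME 15432:5432

# In another terminal, connect through the forwarded port
psql -h 127.0.0.1 -p 15432 -U USER -d DATABASE_NAME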

Hello,

Using the Airflow CLI, I was able to export all connections, including the airflow_db connection, and thus retrieve the user (root), the password, and the database name (I believe this should be documented, because none of the methods you suggested worked or corresponded to what's displayed in the UI).

Now, I'm struggling to get this cloud SQL proxy to work.

First of all, I'd like to confirm with you that Cloud SQL Proxy simply does not exist, and that you were referring to the Cloud SQL Auth Proxy?

If that's the case, the syntax you are showing is wrong, right? ./cloud_sql_proxy is actually ./cloud-sql-proxy, and it does not take -instances as an argument.

Last but not least, the instance name is obviously still unknown; could you maybe share a bit of information on how to find this instance name?

The ./cloud-sql-proxy binary seems to be looking for names like project:region:instance, and as you know, I have no clue what the project name is, since the DB is actually in one of your Google-managed projects, not in my project.

You're right; let me clarify a few things about using the Cloud SQL Auth Proxy and how it connects to Cloud Composer's underlying Cloud SQL instance.

  1. Cloud SQL Auth Proxy vs. Cloud SQL Proxy: The correct term is indeed "Cloud SQL Auth Proxy," often referred to just as "Cloud SQL Proxy." The executable's name can vary depending on how it was downloaded or installed. Typically, it is downloaded as cloud_sql_proxy on Linux and macOS. If your executable is named differently, you should use that name in your commands.

  2. Correct Command Syntax: The correct command to start the Cloud SQL Auth proxy, assuming the executable name is cloud_sql_proxy and the placeholder <INSTANCE_CONNECTION_NAME> represents your actual instance connection name, is:

    ./cloud_sql_proxy -instances=<INSTANCE_CONNECTION_NAME>=tcp:5432

    Here, instances is indeed the correct argument. The executable may not run if there's a typo or the instance connection name is not properly formatted.

  3. Finding the Instance Name: For Cloud Composer, the Cloud SQL instance is indeed managed by Google and not directly exposed in your Google Cloud project's resources. To find the instance connection name:

    • Go to the Cloud Composer section in the Google Cloud Console.
    • Select your environment.
    • Look under the Environment details or Cloud SQL instance section. While this is typically the place to find it, if it's not visible due to Google's management policies, this information might not be directly accessible.
  4. If Instance Name is Not Available: If you cannot locate the instance name through the Cloud Composer UI, it may be because Google does not expose these details for managed services to prevent unintended changes that could destabilize your environment. In such cases, the standard practice would be to handle database-related operations through the Airflow Web UI or the Airflow CLI, which it seems you have already been using effectively.

  5. Alternative Approaches: Since you've been able to extract the database credentials using the Airflow CLI, you might consider the following if direct access to the database is crucial:

    • Use the Airflow Web UI: For many operational tasks, such as modifying connections or adjusting configurations, the Airflow Web UI is sufficient and recommended.
  6. Final Steps: If the Cloud SQL Proxy method is essential for your workflow, ensure that:

    • You have the correct permissions assigned to your Google account or the service account you're using.
    • You're using the correct syntax for your specific setup and following any updated guidelines from the Google Cloud documentation.

Hey, thanks for the answer.


The executable's name can vary depending on how it was downloaded or installed. Typically, it is downloaded as cloud_sql_proxy on Linux and macOS. If your executable is named differently, you should use that name in your commands.

==> This is not true; on both Linux and macOS it's definitely cloud-sql-proxy. I am still a bit afraid that we aren't talking about the same thing, especially since you keep mistyping the name of the command.


Here, instances is indeed the correct argument.


No, there is definitely NO "--instances" parameter. I am asked to type the name of the instance right away, so I disagree with what you wrote. Can you confirm we are talking about the same tool? I really doubt it, since nothing you say matches what I see in the link I shared above.


./cloud-sql-proxy my-project:us-central1:my-db-server


Then, you ask me again to go to the environment details page. Can you send a screenshot of where to find the Database entry on this page? As I told you, I believe it simply does not exist, but you keep asking me to go there, so I'm not sure what to do.

 

"If Instance Name is Not Available" Well I don't know, you tell me! 

Composer is a product managed by Google, not by me. Also, none of the solutions you are referring to could possibly work, as you can't connect to the Airflow DB through the UI or using the CLI.

 

 

I won't mention the remaining answers, as they look like bot-generated responses.

 

 

 
