
Cloud SQL Proxy - randomly losing connection due to being "NOT_AUTHORIZED"

We are using the Cloud SQL Proxy as a sidecar container in a Kubernetes pod to access a Cloud SQL PostgreSQL instance. The proxy uses a Kubernetes service account linked to an IAM service account that has the roles "Cloud SQL Instance User" and "Cloud SQL Client". An IAM policy binding has also been added so that the Kubernetes service account can impersonate the IAM service account as a workloadIdentityUser.
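
For context, the wiring follows the usual Workload Identity pattern, roughly along these lines (all project, account, and namespace names are placeholders for our actual values):

# Grant the IAM service account the two Cloud SQL roles
gcloud projects add-iam-policy-binding PROJECT \
    --member="serviceAccount:IAM-SERVICE-ACCOUNT@PROJECT.iam.gserviceaccount.com" \
    --role="roles/cloudsql.client"
gcloud projects add-iam-policy-binding PROJECT \
    --member="serviceAccount:IAM-SERVICE-ACCOUNT@PROJECT.iam.gserviceaccount.com" \
    --role="roles/cloudsql.instanceUser"

# Allow the Kubernetes service account to impersonate the IAM service account
gcloud iam service-accounts add-iam-policy-binding \
    IAM-SERVICE-ACCOUNT@PROJECT.iam.gserviceaccount.com \
    --member="serviceAccount:PROJECT.svc.id.goog[NAMESPACE/SERVICEACCOUNT]" \
    --role="roles/iam.workloadIdentityUser"

# Point the Kubernetes service account at the IAM service account
kubectl annotate serviceaccount SERVICEACCOUNT --namespace NAMESPACE \
    iam.gke.io/gcp-service-account=IAM-SERVICE-ACCOUNT@PROJECT.iam.gserviceaccount.com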


Usually the connection works fine: our Spring Boot application (using HikariCP) establishes a JDBC connection, authenticating with a PostgreSQL user account. When the connection is established successfully, the audit log shows that the connection was made via the principal of the IAM service account, with it being delegated by the Kubernetes service account.


HikariCP refreshes its connections at least once every 30 minutes (its default maxLifetime), closing each connection and establishing a new one. Sometimes this does not work properly: the new connection is rejected with a status containing code 7 and the message "boss::NOT_AUTHORIZED: Not authorized to access resource. Possibly missing permission cloudsql.instances.connect on resource instances/<NAME_OF_OUR_CLOUDSQL_INSTANCE>". The Cloud SQL Proxy container logs the following: "[PROJECT:REGION:INSTANCE] failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = \"PROJECT:REGION:INSTANCE\"): googleapi: Error 403: boss::NOT_AUTHORIZED: Not authorized to access resource. Possibly missing permission cloudsql.instances.get on resource instances/INSTANCE., forbidden". Meanwhile our application logs errors whenever it tries to use a connection from the Hikari pool.
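
To see when such an episode starts and ends, we grep the proxy sidecar's logs; the deployment, container, and namespace names below are placeholders for our setup:

# Adjust deployment, container, and namespace names to your environment
kubectl logs deployment/OUR-APP -c cloud-sql-proxy -n NAMESPACE --since=24h \
    | grep NOT_AUTHORIZED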


This goes on for a random amount of time (between half an hour and maybe a couple of hours) before it suddenly works again for hours or a day without issues. The behavior is inexplicable to us. I already tried creating a new IAM service account, since I suspected the old one might have been broken somehow, but the new account shows the same unreliable behavior.


The Cloud SQL Proxy container is version 2.1.0. We are using Google Cloud SQL for PostgreSQL 15.2 in private IP mode.


Thanks in advance for all your input!



The error message "boss::NOT_AUTHORIZED: Not authorized to access resource. Possibly missing permission cloudsql.instances.connect on resource instances/<NAME_OF_OUR_CLOUDSQL_INSTANCE>" indicates that the Cloud SQL Proxy is not authorized to connect to your Cloud SQL instance. This can happen for a few reasons:

  • The Cloud SQL Proxy is not using the correct service account. In Kubernetes, this is typically specified via Workload Identity by mapping the Kubernetes service account to a GCP service account.
  • The service account does not have the necessary permissions on the Cloud SQL instance. Ensure the service account is assigned a role in the IAM & Admin console that includes the cloudsql.instances.connect permission, such as 'Cloud SQL Client' or 'Cloud SQL Admin'; the command after this list shows one way to verify the granted roles.
  • There is a problem with the IAM service account. This could be due to a number of factors, such as an expired token, a misconfigured Workload Identity binding, or an insufficient IAM role.
  • There is a problem with the Cloud SQL Proxy itself. There might be an issue with your current version of Cloud SQL Proxy, or some unexpected behavior. Try updating to the latest version.
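
For the permissions point, a quick way to verify is to list the roles actually bound to the service account; the project and account names are placeholders:

# List the project-level roles granted to the IAM service account
gcloud projects get-iam-policy PROJECT \
    --flatten="bindings[].members" \
    --filter="bindings.members:IAM-SERVICE-ACCOUNT@PROJECT.iam.gserviceaccount.com" \
    --format="table(bindings.role)"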

Additionally, consider the following:

  • Is there any specific pattern to when the connection failures occur? For example, do they always occur at the same time of day, or after a certain amount of time has passed?
  • Are there any other applications or services that are connecting through the same Cloud SQL Proxy instance or using the same service account? If so, are they also experiencing connection failures?
  • Have you made any recent changes to your Cloud SQL instance, the Cloud SQL Proxy configuration, or the Kubernetes environment? Consider any network changes, IAM role modifications, software updates, or configurations that might impact connectivity.

Look at the logs for both the Cloud SQL Proxy container and the Cloud SQL instance. Logs often provide valuable insights into connection failures and can be instrumental in troubleshooting.
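
On the Cloud SQL side, the audit logs can be filtered for the failing connect attempts, for example (the project name is a placeholder and the filter is only illustrative):

# Pull recent Cloud SQL audit-log entries mentioning the authorization failure
gcloud logging read \
    'resource.type="cloudsql_database" AND protoPayload.status.message:"NOT_AUTHORIZED"' \
    --project=PROJECT --limit=10 --format=json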

The protoPayload in the failed connect requests seems to indicate that access to Cloud SQL was attempted without delegation to the IAM account:

authenticationInfo: {
  principalSubject: "serviceAccount:PROJECT.svc.id.goog[NAMESPACE/SERVICEACCOUNT]"
  serviceAccountDelegationInfo: [
    0: {
    }
  ]
}

A successful login looks like this:

authenticationInfo: {
    principalEmail: "IAM-SERVICE-ACCOUNT@PROJECT.iam.gserviceaccount.com"
    principalSubject: "serviceAccount:IAM-SERVICE-ACCOUNT@PROJECT.iam.gserviceaccount.com"
    serviceAccountDelegationInfo: [
        0: {
            principalSubject: "serviceAccount:PROJECT.svc.id.goog[NAMESPACE/SERVICEACCOUNT]"
        }
    ]
}

But this happens only for a short period of time (roughly two hours) and with no recognizable time pattern.

Also, we have two pods with separate applications, each running its own proxy sidecar, and while one of them is having the issue, the other works fine at the same time (though it might simply not have been disconnected yet and still be living on an "old" session).

We have updated to the latest Cloud SQL Proxy container (2.6.1) and double-checked roles and bindings. The application works most of the time; it just sometimes stops working, and we are absolutely unsure how to proceed.

We will consult with a Google partner company later today, and we are discussing testing alternatives such as not using the Cloud SQL Proxy at all. But we would prefer to do it "the intended way"; it just needs to work reliably.

Thank you for the additional information. It is very helpful to know that the protoPayload in the failed connect requests indicates that access to Cloud SQL was attempted without delegation to the IAM account. This suggests that the Cloud SQL Proxy is not correctly authenticating with Cloud SQL using the IAM service account.

There are a few possible reasons for this:

  • The Cloud SQL Proxy may not be configured to use the IAM service account.
  • The IAM service account may not have the necessary permissions to access Cloud SQL.
  • There may be a problem with the IAM service account itself.
  • There may be a problem with the Cloud SQL Proxy itself.

Here are some things you can check:

  • Make sure that the Cloud SQL Proxy is actually picking up the IAM service account. With Workload Identity, the v2 proxy reads Application Default Credentials, so verify the iam.gke.io/gcp-service-account annotation on the Kubernetes service account; if you supply credentials explicitly, check the --credentials-file flag instead. A way to confirm the active identity from inside the pod is shown after this list.
  • Make sure that the IAM service account has the cloudsql.instances.connect permission on the Cloud SQL instance. You can check this in the IAM & Admin console.
  • Try restarting the Cloud SQL Proxy container.
  • Try updating the Cloud SQL Proxy to the latest version.
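
To confirm which identity the pod actually receives, you can query the GKE metadata server from inside it. The pod, container, and namespace names are placeholders, and this assumes the chosen container ships curl (the distroless proxy image does not, so use the application container):

# With a correct Workload Identity setup this prints the IAM service account's email,
# not the Compute Engine default service account
kubectl exec -it POD -c APPLICATION-CONTAINER -n NAMESPACE -- \
    curl -s -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email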

If you are still having problems, you can contact Google Cloud support for assistance.

Here are some additional suggestions:

  • Try to reproduce the problem in a test environment so that you can troubleshoot it more easily.
  • Check the logs for the Cloud SQL Proxy container and the Cloud SQL instance for any errors or warnings.
  • Try to identify any common factors between the times when the problem occurs. For example, does it only happen when the application is under heavy load? Does it only happen on certain days of the week?
  • Try to disable any non-essential features of the Cloud SQL Proxy or the application to see if that resolves the problem.


Hi All - We are facing exactly the same problem. We have a Java Spring Boot 3 app hosted on GKE with a bespoke Kubernetes service account that is bound to a Google service account. It tries to connect, through the Cloud SQL Auth Proxy sidecar and Workload Identity, to a Cloud SQL PostgreSQL instance in a different project that uses private IP mode only. The GKE cluster and namespace are enabled with Istio. Network rules are all good and IAM roles have been applied, yet the connection fails with the same error stated here.