Dear Google Cloud Community and Support Team,

12 hours of TransactionalTaskException when Adding Tasks (App Engine Java) - Self-Resolved - Cause Unknown

We had 12 hours of TransactionalTaskException errors when adding tasks to legacy Task Queues that Google had automatically converted to "Cloud Tasks" queues some time ago (maybe 2 years?). Then the exceptions stopped as mysteriously as they had started.

I spoke to Gemini a lot about it while I was trying to figure out the cause, and once it had stopped, I asked Gemini to summarise the situation:

You are seeking diagnostic assistance regarding a critical issue in your App Engine Java application experienced over a ~12-hour period. The problem has spontaneously resolved, but due to its severity and impact across all your deployed versions, you cannot simply disregard it.

Problem Summary:

Our App Engine Java application began throwing com.google.appengine.api.taskqueue.TransactionalTaskException errors when attempting to add tasks to a queue. The Datastore transaction context appeared to be valid at the point of the task enqueue attempt. No tasks were observed to be added to the queue, neither in a pending nor a failed state.

Key Observations & Debugging Steps:

  1. Affected All Versions: The issue impacted all deployed versions of our App Engine service simultaneously, including older versions that had been stable for extended periods. On the basis that the task queues were actively refusing to accept tasks, this makes complete sense. In this respect, at least, it was "not code related".
  2. Datastore Indexes Checked: We checked our Datastore Indexes in the Google Cloud Console. All indexes showed a "SERVING" status (green ticks), ruling out missing or building indexes as a direct cause.
  3. Google Cloud Quotas Reviewed: We accessed the IAM & Admin > Quotas page. While we initially faced a transient console loading error, once accessible, we confirmed that no quota metrics (for Datastore API, Cloud Tasks API, App Engine Admin API, etc.) were showing utilization at or above 90%.
  4. Transactional Task Limit (5 tasks/transaction): We implemented diagnostic logging to count the number of transactional tasks added within a single Datastore transaction. However, the issue resolved itself before we could capture logs from the point of failure with this new logging in place.
  5. Task Payload Simplification: We had planned to test with a simplified task payload to rule out serialization/size issues, but this was not executed as the problem disappeared.
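The counting described in item 4 can be sketched roughly as follows. This is a minimal illustration under assumptions, not the exact logging that was deployed: `TransactionalTaskCounter`, `recordEnqueue`, `finish`, and the string transaction id are hypothetical names, and the sketch uses only the JDK so it stays independent of the App Engine SDK. The limit of 5 is the one named in item 4.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

/**
 * Hypothetical helper (not part of the App Engine SDK): counts how many
 * transactional task enqueues are attempted within each Datastore transaction.
 */
final class TransactionalTaskCounter {
    // Limit quoted in the post: 5 transactional tasks per transaction.
    static final int MAX_TASKS_PER_TXN = 5;

    private static final Logger LOG =
            Logger.getLogger(TransactionalTaskCounter.class.getName());

    // Running enqueue count, keyed by an application-chosen transaction id.
    private final ConcurrentHashMap<String, AtomicInteger> counts =
            new ConcurrentHashMap<>();

    /** Call just before each transactional enqueue; returns the running count. */
    int recordEnqueue(String txnId) {
        int n = counts.computeIfAbsent(txnId, k -> new AtomicInteger())
                      .incrementAndGet();
        if (n > MAX_TASKS_PER_TXN) {
            LOG.warning("txn " + txnId + ": enqueue #" + n
                    + " exceeds the limit of " + MAX_TASKS_PER_TXN);
        } else {
            LOG.fine("txn " + txnId + ": enqueue #" + n);
        }
        return n;
    }

    /** Call after commit or rollback; returns the final count and forgets the txn. */
    int finish(String txnId) {
        AtomicInteger n = counts.remove(txnId);
        return n == null ? 0 : n.get();
    }
}
```

In the application, `recordEnqueue` would be called right before each transactional add to the queue, and `finish` in a `finally` block once the transaction has committed or rolled back, so that the final count appears in the logs even when the enqueue throws.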

The Incident's Resolution:

The TransactionalTaskException errors simply stopped appearing after approximately 12 hours, and our application's task enqueuing functionality returned to normal without any intervention on our part. By no intervention, I mean no intervention that made any obvious difference. We did deploy various test versions with adjusted code in the hope that the problem could be fixed by code changes. However, when the errors finally stopped, all previous versions started working again, as did the new test deployment. No configuration changes were made.

Request for help:

  1. Diagnosis: Given that no code changed, all versions were affected, quotas appeared fine, and indexes were healthy, what could have caused a prolonged TransactionalTaskException on App Engine (Java) that then spontaneously resolved? Are there known platform issues, internal throttling mechanisms, or other factors that could manifest in this way?
  2. Prevention/Future Insight: How can we definitively diagnose the root cause of such an incident after it has resolved itself? What are the best practices for setting up monitoring and alerts that would pinpoint such an issue (e.g., specific metrics for Datastore or Cloud Tasks that indicate underlying platform strain, even if not hitting a visible quota limit)?
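One inexpensive way to keep evidence for an after-the-fact diagnosis is to log the full cause chain of every failed enqueue, since the top-level exception message alone may not show what the underlying service reported. The following is a minimal sketch using only the JDK; `CauseChain` and `describe` are hypothetical names, and it assumes the caught exception exposes its underlying error through the standard `Throwable.getCause()`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: flattens an exception and its causes into log lines. */
final class CauseChain {
    /** Returns one "class: message" line per throwable, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        // Identity-based set guards against a cyclic cause chain.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            lines.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return lines;
    }
}
```

The `catch` block around the enqueue call would pass the caught exception to `CauseChain.describe(e)` and log each returned line together with a timestamp and the queue name, so that the logs still hold the detail once the incident is over.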

We have the diagnostic logging for transactional task counting in place, which we plan to keep deployed. 

Any insights, debugging strategies, or recommendations for engaging with Google Cloud Support effectively (beyond this forum post) would be greatly appreciated.

Thank you for your time and assistance.
