
12 hours of TransactionalTaskException when Adding Tasks (App Engine Java) - Self-Resolved - Cause Unknown

Dear Google Cloud Community and Support Team,

12 hours of TransactionalTaskException when Adding Tasks (App Engine Java) - Self-Resolved - Cause Unknown

We had 12 hours of TransactionalTaskException errors when adding tasks to legacy Task Queues that were automatically converted by Google (maybe two years ago?) to Cloud Tasks queues. Then the TransactionalTaskException errors stopped happening as mysteriously as they had started.

I spoke to Gemini a lot about it while I was trying to figure out the cause, and once it had stopped, I asked Gemini to summarise the situation:

We are seeking diagnostic assistance regarding a critical issue in our App Engine Java application, experienced over a roughly 12-hour period. The problem has spontaneously resolved, but given its severity and its impact across all of our deployed versions, we cannot simply disregard it.

Problem Summary:

Our App Engine Java application began throwing com.google.appengine.api.taskqueue.TransactionalTaskException errors when attempting to add tasks to a queue. The Datastore transaction context appeared to be valid at the point of the task enqueue attempt. No tasks were observed to be added to the queue, neither in a pending nor a failed state.
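
For reference, the enqueue pattern involved is the legacy App Engine transactional task add. The sketch below is illustrative only: the class, the queue name "my-queue", the worker URL and the "Order" entity are placeholders rather than our real code, but the queue.add call with a Datastore transaction is the call that was throwing TransactionalTaskException.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Transaction;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;
import com.google.appengine.api.taskqueue.TransactionalTaskException;

public class TransactionalEnqueueExample {

    private static final Logger LOG =
            Logger.getLogger(TransactionalEnqueueExample.class.getName());

    public void saveAndEnqueue(String orderId) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Queue queue = QueueFactory.getQueue("my-queue"); // placeholder queue name

        Transaction txn = datastore.beginTransaction();
        try {
            // Datastore write that the task is coupled to.
            Entity order = new Entity("Order", orderId);
            order.setProperty("status", "PENDING");
            datastore.put(txn, order);

            // Transactional enqueue: the task is only dispatched if the
            // transaction commits. This add call is where the
            // TransactionalTaskException was being thrown.
            queue.add(txn, TaskOptions.Builder
                    .withUrl("/worker/process-order") // hypothetical worker path
                    .param("orderId", orderId)
                    .method(TaskOptions.Method.POST));

            txn.commit();
        } catch (TransactionalTaskException e) {
            LOG.log(Level.WARNING, "Transactional enqueue failed for order " + orderId, e);
            throw e;
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}
```

Because a transactional task is only handed to the queue when the transaction commits, a failure at this point leaves nothing behind, which is consistent with our observation that no tasks appeared in either a pending or a failed state.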

Key Observations & Debugging Steps:

  1. Affected All Versions: The issue impacted all deployed versions of our App Engine service simultaneously, including older versions that had been stable for extended periods. Given that the task queues themselves were refusing to accept tasks, this makes sense; in this respect, at least, the problem was not code related.
  2. Datastore Indexes Checked: We checked our Datastore Indexes in the Google Cloud Console. All indexes showed a "SERVING" status (green ticks), ruling out missing or building indexes as a direct cause.
  3. Google Cloud Quotas Reviewed: We accessed the IAM & Admin > Quotas page. While we initially faced a transient console loading error, once accessible, we confirmed that no quota metrics (for Datastore API, Cloud Tasks API, App Engine Admin API, etc.) were showing utilization at or above 90%.
  4. Transactional Task Limit (5 tasks/transaction): We implemented diagnostic logging to count the number of transactional tasks added within a single Datastore transaction (a sketch of this counting approach follows this list). However, the issue resolved itself before we could capture logs from the point of failure with this new logging in place.
  5. Task Payload Simplification: We had planned to test with a simplified task payload to rule out serialization/size issues, but this was not executed as the problem disappeared.
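
The diagnostic logging mentioned in point 4 is, in outline, a small wrapper that counts transactional adds. The sketch below is a simplified, hypothetical version (the class name and log wording are ours), built around the documented limit of five transactional tasks per Datastore transaction.

```java
import java.util.logging.Logger;

import com.google.appengine.api.datastore.Transaction;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.TaskHandle;
import com.google.appengine.api.taskqueue.TaskOptions;

/** Hypothetical helper that logs how close a transaction gets to the 5-task limit. */
public class CountingTransactionalEnqueuer {

    private static final Logger LOG =
            Logger.getLogger(CountingTransactionalEnqueuer.class.getName());
    private static final int TRANSACTIONAL_TASK_LIMIT = 5;

    private final Transaction txn;
    private int tasksAdded = 0;

    public CountingTransactionalEnqueuer(Transaction txn) {
        this.txn = txn;
    }

    public TaskHandle add(Queue queue, TaskOptions options) {
        tasksAdded++;
        LOG.info("Transactional task #" + tasksAdded + " added to queue '"
                + queue.getQueueName() + "' in txn " + txn.getId());
        if (tasksAdded > TRANSACTIONAL_TASK_LIMIT) {
            LOG.warning("Exceeded " + TRANSACTIONAL_TASK_LIMIT
                    + " transactional tasks in txn " + txn.getId());
        }
        return queue.add(txn, options);
    }
}
```

Funnelling every transactional add through a single helper like this keeps the count accurate even when several code paths enqueue tasks inside the same transaction.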

The Incident's Resolution:

The TransactionalTaskException errors simply stopped appearing after approximately 12 hours, and our application's task enqueuing functionality returned to normal without any intervention on our part. By "no intervention", I mean no intervention that made any obvious difference: we did deploy various test versions with adjusted code in the hope that the problem was something that could be fixed by code changes. However, when the errors ultimately stopped, all previous versions started working again, as did the new test deployments. No configuration changes were made.

Request for help:

  1. Diagnosis: Given that no code changed, all versions were affected, quotas appeared fine, and indexes were healthy, what could have caused a prolonged TransactionalTaskException on App Engine (Java) that then spontaneously resolved? Are there known platform issues, internal throttling mechanisms, or other factors that could manifest in this way?
  2. Prevention/Future Insight: How can we definitively diagnose the root cause of such an incident after it has resolved itself? What are the best practices for setting up monitoring and alerts that would pinpoint such an issue (e.g., specific metrics for Datastore or Cloud Tasks that indicate underlying platform strain, even if not hitting a visible quota limit)?
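
To make point 2 above concrete, one option would be to emit a single fixed marker string whenever the exception is caught, so that a log-based metric in Cloud Logging (with an alerting policy on top in Cloud Monitoring) can count occurrences. The class and marker names below are purely illustrative, not anything Google provides:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

import com.google.appengine.api.taskqueue.TransactionalTaskException;

public class EnqueueFailureSignal {

    private static final Logger LOG =
            Logger.getLogger(EnqueueFailureSignal.class.getName());

    // Fixed, grep-friendly marker for a log-based metric to match on.
    private static final String MARKER = "TRANSACTIONAL_TASK_ENQUEUE_FAILED";

    public static void record(String queueName, TransactionalTaskException e) {
        LOG.log(Level.SEVERE,
                MARKER + " queue=" + queueName + " message=" + e.getMessage(), e);
    }
}
```

A metric filtering on that marker would give a clean time series showing exactly when such an incident starts and stops, independent of which versions are deployed.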

We have the diagnostic logging for transactional task counting in place, which we plan to keep deployed. 

Any insights, debugging strategies, or recommendations for engaging with Google Cloud Support effectively (beyond this forum post) would be greatly appreciated.

Thank you for your time and assistance.

1 ACCEPTED SOLUTION

So I guess no one has responded because no one knows, any more than I do, how this could happen?

In summary, a critical system running on Google's Cloud service mysteriously stops working for 12 hours without any explanation, and no one knows why it could happen or what I should do about it.


2 REPLIES

Hi @Bindon,

When exactly did you encounter the TransactionalTaskException issue? I just want to confirm the timeline to make sure we have everything aligned.

If it was around mid-June, it could be related to some updates that Google Cloud was pushing at the time. They were testing a new default behavior through an A/B experiment, which started the week of June 17. The experiment involved two changes—one on the Java Runtime and one on the AppServer side. However, due to an unrelated memory issue, the Google Cloud team had to roll back the AppServer to an older version on June 19, which caused the necessary backend change to be missing. This led to the TransactionalTaskException error you saw.

The Google Cloud team quickly rolled back the Java runtime experiment and restored everything to the correct state by June 20. They are now closely monitoring the situation and planning to push the changes again once they’re confident everything is stable.

If you’re still facing issues, you may reach out to Google Cloud Support for further assistance. They can also, if needed, exclude your application from any upcoming experiments for a period of time to help avoid further disruptions—though please note this depends on the specific context and is handled on a case-by-case basis.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.