My subscriber sometimes receives a message from a GCP subscription with exactly 60 seconds of delay

I have an issue that happens periodically in my system but causes major problems.

We use GCP Pub/Sub, and sometimes the subscriber receives a message with exactly 1 minute of delay. In these cases, only the following metrics show spikes:

  1. oldest_unacked_message_age
  2. delivery_latency_health_score.expired_ack_deadlines = 0
  3. expired_ack_deadlines_count

Here are the details of my subscription: subscription details

Notes:

  1. unacked_messages_count isn't spiked, so the load on the system is usual.
  2. I am sure that the delayed messages were successfully published to Pub/Sub.
  3. By all other metrics I can see that the system isn't overloaded and the subscriber continues to pull other messages.
  4. I print a log as soon as message processing starts in the subscriber; that's how I can see this delay.

We are using google-cloud-pubsub, spring-cloud-gcp-pubsub and proto-google-cloud-pubsub-v1 client libraries to StreamingPull messages.

I assume that messages could sometimes be lost due to a transient failure, but in that case they should be redelivered within 10 seconds based on my ack deadline, shouldn't they?

I also thought that the problem could be in modifyAckDeadline requests, but I don't have any custom overrides. As far as I can see, my client library uses DEFAULT_MAX_ACK_EXTENSION_PERIOD = 0 by default. But even if the client library overrides this value somewhere else, as I understand it, it is not possible to send modifyAckDeadline for a message that has been lost.
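
For reference, this is roughly what the equivalent setup looks like with the plain google-cloud-pubsub client (a simplified sketch, not our actual Spring wiring; the project and subscription names are placeholders, and depending on the client version the Duration type may be org.threeten.bp or java.time):

import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import org.threeten.bp.Duration;

public class SubscriberSketch {
  public static void main(String[] args) {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription");

    // Ack immediately; the interesting part is the builder setting below.
    MessageReceiver receiver = (message, consumer) -> consumer.ack();

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
        // With a zero max extension period the client should never send
        // modifyAckDeadline, so the subscription's 10-second ackDeadline is
        // the effective redelivery deadline.
        .setMaxAckExtensionPeriod(Duration.ZERO)
        .build();

    subscriber.startAsync().awaitRunning();
  }
}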

Any advice on how this could be resolved?

I expect the subscriber to receive the message immediately after it is published, or, if there is a failure/loss, that the message will be resent after 10 seconds based on my ack_deadline rather than after 60 seconds.


Here are some potential causes and troubleshooting steps:

  1. Explore Retry Policy:

    • The retry policy might cause delays beyond the 10-second ack deadline. Review logs for delivery attempts and backoff durations. Adjust the retry policy if it's contributing to delays.
  2. Investigate Ack Deadline Extensions:

    • Ensure there are no unintentional ack deadline extensions in the client libraries. Verify configurations and code for potential modifications.
  3. Examine StreamingPull Behavior:

    • Analyze how StreamingPull interacts with ack deadlines. Prompt acknowledgment or negative acknowledgment (nacking) of messages is crucial.
  4. Consider Client Library Issues:

    • Investigate known issues or limitations in the client libraries. Consider using alternative versions or libraries if necessary.
  5. Analyze Network Latency:

    • Monitor network performance for potential issues that could delay message delivery.
  6. Monitor for Resource Constraints:

    • Ensure subscriber instances have sufficient resources (CPU, memory, network) for efficient message processing.
  7. Additional Considerations:

    • Message Ordering: Check if message ordering is enabled, as it can introduce delays.
    • Pub/Sub System Behavior: Consider possible delays due to the Pub/Sub system, especially under high loads or transient failures.
    • Detailed Logging: Implement enhanced logging to capture the exact timing of message receipt and processing.
    • Architecture Review: Reassess the overall architecture and message flow for optimization.

Next Steps:

  • Gather more detailed logs and metrics to pinpoint the cause of delays.
  • Experiment with different configurations and client library versions.
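
For example, here is a minimal sketch of receipt-time logging with the plain google-cloud-pubsub Java client (an assumed setup; the project and subscription names are placeholders) that records the message id, Pub/Sub's own publish timestamp, the observed lag and the delivery attempt:

import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import java.time.Instant;

public class LatencyLoggingReceiver {
  public static void main(String[] args) {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription");

    MessageReceiver receiver = (message, consumer) -> {
      // publish_time is stamped by Pub/Sub when it accepts the publish request.
      Instant publishTime = Instant.ofEpochSecond(
          message.getPublishTime().getSeconds(), message.getPublishTime().getNanos());
      long lagMs = Instant.now().toEpochMilli() - publishTime.toEpochMilli();
      System.out.printf("id=%s publishTime=%s lagMs=%d deliveryAttempt=%s%n",
          message.getMessageId(), publishTime, lagMs,
          message.getAttributesOrDefault("googclient_deliveryattempt", "n/a"));
      consumer.ack();
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
  }
}

Comparing lagMs for the delayed messages against their neighbours should show whether the minute is spent before the message reaches the client or inside it.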

Small update:

I was able to reproduce this on my dev env when I restarted my subscribers while they were receiving messages. But this could just be a coincidence, because I restarted them many times and the “lucky timing” happened only once. I was also able to see that the **googclient_deliveryattempt** attribute of the delayed Pub/Sub message is 1! I assumed it would be a redelivery from Pub/Sub after 60 seconds, but since deliveryAttempt = 1, it can't be a redelivery. How is it possible that Pub/Sub doesn't resend the message within 60 seconds?

I notice that the subscription in your screenshot is defined as a pull subscription.  This means that after publication, the messages sit in the subscription until the client application explicitly requests the next message.  This makes me think that the messages might not be being requested more often than once a minute.

Let's drill down into the symptom.  I hear you say that the messages are received with a one-minute delay.  I'm imagining a publisher making a message available at 12:00:00; at that moment it should be in the topic.  Here is my first question ... when you say that the message is received after a minute's delay, how do you know that?  Is it a log from your subscribing app?  Is it a timestamp in the message ... or something else?   For there to be a delay, there has to be an event time (the time the message was published) and then a processing time (the time the message was seen by the subscriber).

Let's see if we can't capture the exact nature of how the subscriber is retrieving messages.  For example, if the client is Java, I could imagine a loop something like:

while (true) {
  if (messagesAvailable()) {
    // pull and process whatever is waiting
    processMessages(pullMessages());
  } else {
    // a fixed pause like this would explain a delay of exactly one minute
    Thread.sleep(60_000);
  }
}

Since a PULL subscription is a "loop" pulling messages, there is likely some delay/pause in the story ... so if we can understand the logic of how you are obtaining messages, that may help us.

Hi Kolban. Thank you for your quick reply.

We use the StreamingPull mechanism to receive messages in the subscriber app. As I described above, in 99.9% of cases messages are received without any delay. However, in rare but very critical cases for my system, these unpleasant delays still occur.

How I see this 1-minute delay:
When my publisher app receives a successful response from Pub/Sub, I log it. When my subscriber app receives a message, I log that too. The 60-second delay is the difference between the timestamps of these logs.
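
The publisher-side log is written once the publish future completes, roughly like this with the plain google-cloud-pubsub client (a simplified sketch, not our exact code; the project and topic names are placeholders):

import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import java.time.Instant;

public class PublisherLoggingSketch {
  public static void main(String[] args) throws Exception {
    Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();

    PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("payload"))
        .build();

    ApiFuture<String> future = publisher.publish(message);
    // get() blocks until Pub/Sub acknowledges the publish and returns the message id;
    // this timestamp is the "published" side of the 60-second measurement.
    String messageId = future.get();
    System.out.println("Published id=" + messageId + " at " + Instant.now());

    publisher.shutdown();
  }
}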

I am sure that everything is fine with my publisher app, because this problem is also reproduced when I have 3 subscriptions and 3 subscriber apps attached to 1 topic.
In that case, sometimes 2 subscriber applications receive a message without delay, but the third subscriber receives the message only after 60 seconds.  However, during this 60-second delay, that subscriber app continues to successfully receive other messages. So the subscriber application keeps doing StreamingPull and can't be asleep.
Again, this is a rare case. In 99.9% of cases everything works without delays.

According to the Pub/Sub documentation, this means that some messages are stuck for some reason, either in our client or in Pub/Sub.
I tried to check the behaviour of our client libraries, but I can't explain this "floating" delay.
As I explained above, these are the 3 libraries our app uses to communicate with Pub/Sub:
google-cloud-pubsub, spring-cloud-gcp-pubsub and proto-google-cloud-pubsub-v1.

Oh cool ... thanks for the link to Streaming pull, I'd not come across this before.  I'm going to work on the assumption that nothing is intrinsically "broken" in Google Cloud Pub/Sub ... as, if there were, then there would be a loud wailing to be heard.  This implies that whatever is causing our puzzle is locally related to how we are using the environment.  My first suggestion is that, given you are sharing your story here, maybe create a Google Doc with no write permission for anyone but you and post a link to it.  In that Google Doc, put as much PUBLIC information as you are willing to share.  Perhaps high-level schematics of the topics, projects and apps involved.

I'm also of a mind that we will need "clues".  I liken puzzles like this to "there has been a crime" ... but without evidence and clues, it's hard to find the culprit.  It sounds like you have logging involved (Cloud Logging?).  I'd be tempted to suggest that we log all published messages and all received messages and start looking for such clues.   I am imagining that a message is published at 12:00:00.  That message will have a unique message id.  I'm interested in the clues for a delayed message.  Maybe if we enable Cloud Logging Data Audit for Pub/Sub we can "track" that individual message.   If we can start the Google Doc with all the information in one place, maybe we can start to pull apart the pieces and contemplate remedies.   We may also be able to switch on diagnostics/logging in these client libraries.  Nothing is immediately coming to mind based on the information so far.  The best I can offer right now are suggested next steps to make progress.

Small update: perhaps this will give you other thoughts.

I was able to reproduce this on my dev env when I restarted my subscribers while they were receiving messages. But this could just be a coincidence, because I restarted them many times and the “lucky timing” happened only once. I was also able to see that the **googclient_deliveryattempt** attribute of the delayed Pub/Sub message is 1! I assumed it would be a redelivery from Pub/Sub after 60 seconds, but since deliveryAttempt = 1, it can't be a redelivery. How is it possible that Pub/Sub doesn't resend the message within 60 seconds?

By the way, you mentioned Cloud Logging Data Audit.  But in the docs I see this: 
Note: Pub/Sub does not currently write Data Access audit logs for message operations, such as Publish, Subscribe, and Acknowledge.

Does this mean that these logs are useless in our case?

That's a good point.  I had merely assumed that Pub/Sub (if Data Audit was fully enabled) might write publish/subscribe/ack audit logs ... but your finding seems to say that it doesn't.

Any new thoughts based on my comment above about **googclient_deliveryattempt** = 1 for the delayed messages?

It seems like changing the StreamAckDeadlineSeconds parameter of the StreamingPullRequest to 10 seconds fixed this problem.
However, I am confused about the difference between the GCP subscription's ackDeadline and the StreamAckDeadlineSeconds parameter of the StreamingPullRequest.
I don't understand in which cases one or the other deadline is used.
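
For anyone hitting the same confusion, my current understanding (treat this as an assumption rather than an authoritative answer): the subscription-level ackDeadline applies to unary Pull and acts as the default, while StreamAckDeadlineSeconds is sent by the client in the first StreamingPullRequest on a stream and takes precedence for messages delivered over that stream. A minimal sketch of where the field lives, using proto-google-cloud-pubsub-v1 (the names are placeholders):

import com.google.pubsub.v1.StreamingPullRequest;

public class StreamingPullRequestSketch {
  public static void main(String[] args) {
    StreamingPullRequest request = StreamingPullRequest.newBuilder()
        .setSubscription("projects/my-project/subscriptions/my-subscription")
        // Ack deadline applied to messages delivered over this stream;
        // it takes precedence over the subscription-level ackDeadline for them.
        .setStreamAckDeadlineSeconds(10)
        .build();

    System.out.println(request);
  }
}

The high-level Subscriber in google-cloud-pubsub builds this request itself and manages the stream deadline on its own, which (as far as I can tell) is why the value used on the wire may differ from the ackDeadline configured on the subscription.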