Hi Team,
While working with the Pub/Sub Java client library, I encountered an issue when broadcasting messages to multiple subscriptions. For reasons I could not identify, a few subscribers were removed or detached from certain subscriptions. Although the subscriber maintains its lifecycle state, I was unable to determine what went wrong, even after adding logging in the overridden subscriber listener. The issue only came to my attention through the alerting system configured in the application. If I had been able to identify why the subscriber stopped pulling messages, I could have addressed the issue proactively and prevented the subscription queue from piling up.
Flow control settings (FCS): maxCount=10, maxBytes=2 MiB. Ack deadline: 600 s. Ack latency: 300 ms. Publish rate: 10 msg/s. Pull rate: roughly the same as the publish rate.
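As a rough sanity check on these numbers, here is a minimal sketch that estimates the throughput ceiling flow control imposes. It assumes each message occupies one flow-control slot until it is acked or nacked. `FlowControlCeiling` and `maxMessagesPerSecond` are hypothetical names for illustration, not part of the client library.

```java
// Back-of-the-envelope flow-control ceiling (a sketch, not client-library code).
// Assumption: each message occupies one flow-control slot until it is acked or nacked.
class FlowControlCeiling {

    // Little's law: throughput <= slots / time each slot stays occupied.
    static double maxMessagesPerSecond(long maxOutstandingMessages, double holdSeconds) {
        if (maxOutstandingMessages <= 0 || holdSeconds <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        return maxOutstandingMessages / holdSeconds;
    }

    public static void main(String[] args) {
        // Values from the post: maxCount=10, ack latency 300 ms.
        System.out.printf("healthy:  %.1f msg/s%n", maxMessagesPerSecond(10, 0.3));
        // Hypothetical degraded case: handlers hold each message for the full 600 s ack deadline.
        System.out.printf("degraded: %.3f msg/s%n", maxMessagesPerSecond(10, 600));
    }
}
```

Under this assumption the ceiling is about 33 msg/s, above the 10 msg/s publish rate. If handlers ever held messages for the full 600 s, it would drop to about 0.017 msg/s, and the subscriber could look stopped even though its lifecycle state never changes.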
If I could learn why the subscriber stopped pulling messages, I could mitigate the problem before it leads to a pileup in the subscription queue.
What I came up with:
Workaround: check the subscriber's status, then start or restart it
Temporary solution: auto-heal
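The auto-heal idea could be sketched roughly as below. One caveat: in the Java client, `Subscriber` is an `ApiService`, and such a service cannot be started again once it has reached FAILED or TERMINATED, so a "restart" means building a new instance. The sketch uses only the standard library. `ManagedSubscriber`, `SubscriberSupervisor`, and `FakeSubscriber` are hypothetical names; in a real application the interface methods would presumably wrap `subscriber.startAsync().awaitRunning()` and `subscriber.isRunning()`.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the small part of the Subscriber lifecycle used here.
interface ManagedSubscriber {
    void start();         // assumed to wrap subscriber.startAsync().awaitRunning()
    boolean isRunning();  // assumed to wrap subscriber.isRunning()
}

// Rebuilds the subscriber from a factory whenever the current one is not running.
class SubscriberSupervisor {
    private final Supplier<ManagedSubscriber> factory;
    private ManagedSubscriber current;
    private int rebuilds = 0;

    SubscriberSupervisor(Supplier<ManagedSubscriber> factory) {
        this.factory = factory;
        this.current = factory.get();
        this.current.start();
    }

    // Call periodically, for example from a ScheduledExecutorService.
    // Returns true if a replacement subscriber was built and started.
    synchronized boolean healIfNeeded() {
        if (current.isRunning()) {
            return false;
        }
        current = factory.get(); // a stopped service is one-shot, so build a fresh one
        current.start();
        rebuilds++;
        return true;
    }

    synchronized int rebuildCount() {
        return rebuilds;
    }
}

// In-memory fake used only to exercise the sketch.
class FakeSubscriber implements ManagedSubscriber {
    private boolean running;

    public void start() { running = true; }
    public boolean isRunning() { return running; }
    void simulateFailure() { running = false; }
}
```

Counting rebuilds (and logging the `Throwable` passed to the listener's `failed` callback, when one fires) gives alerting something more specific to act on than queue depth alone.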
Let me know what else can be checked and how I can solve this problem.
Regards, Anurag
Hi @iamanurag,
Welcome to Google Cloud Community!
Here are some possible reasons and suggestions that may help resolve the issue:
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Hi @marckevin,
Based on your suggestions, I have reviewed certain use cases and observed the following behavior:
1. Audit Logging: Audit logging is in place; however, since this pertains to the subscriber being detached or stopped from the subscription, audit logging does not apply in this scenario.
2. Flow Control Settings (FCS): Our FCS configuration was overly restrictive. After updating the FCS, the message pile-up should be resolved; however, because my subscriber has stopped, flow control has no effect while it is not running.
3. Failure Handling: All failure scenarios are being handled, including mechanisms like exponential backoff and dead-letter queues.
4. Connectivity: I agree that network connectivity issues could cause the subscriber to stop. However, the subscriber listener is not logging any messages, even though it should log events such as "Subscriber terminated due to network connectivity issues," or similar.
My concern is that I am still unable to pinpoint the root cause of why the subscriber failed. While I understand there could be multiple reasons for the subscriber stopping, I am unable to identify the specific cause. Could you advise on the specific log lines or audit entries I should look for?
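One possibility worth ruling out, offered as a guess rather than a diagnosis: the lifecycle listener only fires on state transitions, so a subscriber that stays in the RUNNING state but receives nothing (for example, because every flow-control slot is held by a message that was never acked or nacked) would log nothing at all. A minimal sketch of a watchdog for that case follows. `StallWatchdog` is a hypothetical helper, and the 60-second threshold is an arbitrary illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: flags a subscriber that is still "running" but has gone quiet.
class StallWatchdog {
    private final Duration maxSilence;
    private volatile Instant lastMessageAt;

    StallWatchdog(Duration maxSilence, Instant startedAt) {
        this.maxSilence = maxSilence;
        this.lastMessageAt = startedAt;
    }

    // Call at the top of the message handler
    // (MessageReceiver.receiveMessage in the real client).
    void recordMessage(Instant now) {
        lastMessageAt = now;
    }

    // True when no message has arrived for longer than maxSilence.
    boolean isStalled(Instant now) {
        return Duration.between(lastMessageAt, now).compareTo(maxSilence) > 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.now();
        StallWatchdog watchdog = new StallWatchdog(Duration.ofSeconds(60), start);
        // Simulated clock: 90 s after start with no messages recorded.
        System.out.println("stalled: " + watchdog.isStalled(start.plusSeconds(90)));
    }
}
```

Silence alone is not proof of a fault, since the topic may simply be idle. It becomes meaningful when combined with a non-zero backlog, such as the subscription's `num_undelivered_messages` metric in Cloud Monitoring.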
Looking forward to your insights and eager to learn more about distributed systems and design. Thank you!
Regards, Anurag