Hi,
TL;DR
I have a push subscription for which the metrics charts always show one undelivered message (I have already ruled out problems in the application that receives the messages). It's one of two things:
1. My subscription with filters is showing inaccurate metrics (I know this can happen). This is a problem especially because I have alert policies over those metrics, so I can't really rely on them.
2. My subscription has a "stuck" message which the subscription is no longer trying to send to the webhook.
Questions
1. How can I determine with 100% certainty which of the two cases is happening?
2. How inaccurate can these metrics be under these circumstances?
3. How can I work around this inaccuracy?
4. If this is a Pub/Sub problem (i.e. it is not delivering the message), how can I approach it to get a resolution?
5. Was it Pub/Sub's fault for not delivering the message, or was the message never there in the first place?
Full Sad Story
I have a topic that receives 9 different types of messages. For a specific application I only need 3 of those 9 types, so I have a `push` subscription configured with filters. To get some degree of observability I have an alert policy configured over the `oldest_unacked_message` subscription metric; I have used this approach before in other cases and it has always worked very well, but not this time.
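For context, here is roughly how the subscription is set up (a minimal sketch using gcloud; the topic, subscription, endpoint, and attribute names are placeholders, and the filter assumes the message type is published as a `type` attribute):

```sh
# Hypothetical names; the real filter selects 3 of the 9 message types,
# assuming the type is carried in a message attribute.
gcloud pubsub subscriptions create my-filtered-sub \
  --topic=my-topic \
  --push-endpoint=https://my-app.example.com/pubsub \
  --message-filter='attributes.type = "a" OR attributes.type = "b" OR attributes.type = "c"'
```

The alert policy is on the oldest unacked message age for this subscription (full metric name: `pubsub.googleapis.com/subscription/oldest_unacked_message_age`).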
One day I got alerted by the `oldest_unacked_message` policy and immediately went to check the metrics charts: the `num_unacked_messages_by_region` chart showed that there was 1 message on the subscription. I proceeded to check the application that receives the messages, and there were no logs pointing to any message being received by the application at all (the app was idle). As I kept investigating, more messages came in; I checked the charts and the logs again, the logs showed the incoming messages and the number of messages in the chart fluctuated, but after the new messages were acked that one message was still hanging there.

I made sure it wasn't the application failing silently while processing the messages: I added more logging, error handling, tests and whatnot, and I still had no evidence whatsoever of this hanging message ever being sent to the application. Our infrastructure team checked directly in the application pod and found no evidence either. We even changed the subscription from push to pull and tried to pull the message using the GCP GUI and the CLI, with no success.
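For reference, this is roughly what that last step looked like from the CLI (a sketch; the subscription name is a placeholder):

```sh
# Clear the push endpoint so the subscription becomes a pull subscription.
gcloud pubsub subscriptions modify-push-config my-filtered-sub \
  --push-endpoint=""

# Try to pull the supposedly unacked message (nothing came back).
gcloud pubsub subscriptions pull my-filtered-sub --limit=10
```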
I kept researching and saw that these metrics may be inaccurate when you have a subscription with filters, but I have no idea if this is a simple case of metric inaccuracy. After the 7-day message retention period expired, the charts went back to "normal" and started showing 0 messages in the subscription, and the oldest unacked message age dropped to zero again, leaving me confused. I have no evidence whatsoever of what happened. Has anyone experienced this before? Was it Pub/Sub's fault for not delivering the message, or was the message never there in the first place?