
Using postgres as socket.io adapter in CloudSQL -- LISTEN/NOTIFY stops working

Has anyone else seen this? I'm using a table in our CloudSQL postgres db as a socket.io adapter (we're using socket.io in a CloudRun service). We've found a couple of times that the adapter suddenly stops working and our socket messages get lost when Cloud Run scales up the number of instances. According to the socket.io documentation:

The Postgres adapter relies on the NOTIFY and LISTEN commands.

I'm wondering if the notify/listen functionality stops working? Reads/writes in our other tables are working normally. 


The issue you're encountering with the Postgres adapter for socket.io in your Cloud Run service could be due to several factors, particularly related to the NOTIFY and LISTEN functionalities. Here are some potential causes and solutions.

First, consider connection limits. When Cloud Run scales up, new instances open additional connections to your Postgres database. If the total exceeds the instance's max_connections, new connections are refused and existing ones may be dropped, disrupting the NOTIFY/LISTEN mechanism. Check the maximum number of connections allowed in your Postgres configuration and make sure your Cloud Run instances stay under it. You might also increase the max_connections parameter if feasible, or use a connection pooler such as PgBouncer to manage and reuse connections. One caveat if you add a pooler: LISTEN/NOTIFY only works through PgBouncer in session-pooling mode; transaction pooling silently breaks it, because the LISTEN ends up bound to whichever server connection happened to execute it.
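One way to check the connection-headroom question is with two standard Postgres queries plus a small helper. This is a sketch; the `reserved` default of 3 mirrors Postgres's default `superuser_reserved_connections`, and the numbers in the example are illustrative.

```javascript
// Standard Postgres queries for checking connection pressure (run from any client):
//   SELECT count(*) FROM pg_stat_activity;  -- connections currently in use
//   SHOW max_connections;                   -- configured ceiling

// How many connections remain before new sessions (including the adapter's
// LISTEN connection) start being refused. `reserved` accounts for slots held
// back for superusers (superuser_reserved_connections, default 3).
function connectionHeadroom(current, maxConnections, reserved = 3) {
  return Math.max(0, maxConnections - reserved - current);
}

console.log(connectionHeadroom(180, 200)); // 17 slots left
console.log(connectionHeadroom(200, 200)); // 0 — new connections will fail
```

If the headroom regularly approaches zero around scale-up events, connection exhaustion is a strong suspect.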

Another factor could be idle connection timeouts. Cloud SQL or an intermediate proxy may drop a connection that has sat idle after issuing LISTEN, since a listening session runs no queries of its own. Implementing a keep-alive in your Cloud Run service, where a trivial query (e.g. SELECT 1) is periodically sent on the same connection that issued LISTEN, keeps that session active. Note that activity on other connections does not help: notifications are delivered only to the specific session that ran LISTEN.
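A minimal sketch of such a keep-alive, assuming you hold a dedicated `pg` client for the LISTEN session. The query function is injected here so the timing logic stands alone; with a real client it would be `(sql) => client.query(sql)`.

```javascript
// Periodically run a trivial query on the LISTEN connection so Cloud SQL and
// any proxies in between never see it as idle. Returns the timer so callers
// can stop it on shutdown.
function startKeepAlive(runQuery, intervalMs = 30_000) {
  const timer = setInterval(() => {
    runQuery("SELECT 1").catch((err) => {
      // A failed ping is the earliest signal that the LISTEN session is dead
      // and needs to be re-established.
      console.error("keep-alive failed:", err.message);
    });
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for the pinger
  return timer;
}
```

The important detail is that the ping runs on the same connection that issued LISTEN; frequent queries on other connections don't count as activity for that session.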

Network latency and timeouts are also potential disruptors. Network issues or latency between your Cloud Run instances and Cloud SQL can interrupt the NOTIFY/LISTEN mechanism. Ensuring that your Cloud Run services and Cloud SQL instance are located in the same region can minimize latency. Additionally, monitoring for network-related issues or timeouts is essential to maintaining stable connections.

Postgres configuration and server logs are also worth checking. Review your Postgres logs for warnings or errors around the times the adapter stops working, such as connection terminations or resource-related messages. One parameter you can rule out is listen_addresses: despite the name, it only controls which network interfaces the server accepts TCP connections on and has nothing to do with the LISTEN/NOTIFY commands (on Cloud SQL it is managed for you and is not user-configurable).

Your Cloud Run scaling settings and the adapter implementation should also be robust enough to handle new instances without losing messages. Review the scaling configuration and confirm it suits your workload. Periodically re-issuing the LISTEN command adds another layer of reliability: LISTEN is idempotent within a session, and re-running it after any reconnect guarantees the session is actually subscribed.
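A sketch of that renewal logic: the backoff helper below is runnable as-is, while the `pg` wiring in the comment is illustrative (the channel name, `makeFreshClient`, and the client lifecycle all depend on your adapter setup).

```javascript
// Exponential backoff for re-establishing a dropped LISTEN session: the delay
// doubles per attempt and is capped (jitter omitted for clarity).
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30_000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Illustrative wiring with a hypothetical `pg` client:
//
// client.on("error", async () => {
//   for (let attempt = 0; ; attempt++) {
//     try {
//       client = await makeFreshClient();          // reconnect
//       await client.query('LISTEN "my_channel"'); // channel name is illustrative
//       break;                                     // subscribed again
//     } catch {
//       await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
//     }
//   }
// });

console.log(backoffDelayMs(0));  // 500
console.log(backoffDelayMs(4));  // 8000
console.log(backoffDelayMs(10)); // 30000 (capped)
```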

Extensive logging and monitoring are crucial for capturing when the NOTIFY/LISTEN functionality stops working. Implement logging to track when LISTEN commands are issued and NOTIFY messages are received. Utilize monitoring tools to track the health of your database connections and alert you to any issues promptly.

For enhanced reliability, consider integrating a message queue (like Pub/Sub or RabbitMQ) with the Postgres adapter. This approach can mitigate issues related to message delivery during scaling, ensuring that messages are queued and delivered reliably even if there are temporary connection disruptions.
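Whichever broker you pick, the core decoupling idea can be sketched generically: buffer messages while the transport reports failure, and flush once publishing succeeds again. This is a simplified in-memory sketch, not a production queue (a real broker adds persistence and delivery guarantees); `publish` stands in for whatever transport call you end up using.

```javascript
// Buffer outgoing messages when the underlying transport (NOTIFY here, but any
// broker) is down, and flush the backlog in order once it recovers.
class BufferedPublisher {
  constructor(publish) {
    this.publish = publish; // async (msg) => void, throws on failure
    this.pending = [];
  }
  async send(msg) {
    this.pending.push(msg);
    await this.flush();
  }
  async flush() {
    while (this.pending.length > 0) {
      try {
        await this.publish(this.pending[0]);
        this.pending.shift(); // only drop a message after a confirmed publish
      } catch {
        return; // transport down: keep messages queued for the next flush
      }
    }
  }
}
```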

By investigating these areas and implementing the suggested solutions, you should be able to identify and mitigate the issues causing the NOTIFY/LISTEN functionality to stop working in your Postgres adapter for socket.io.

Hi again! I've tried out a few of the suggestions, and here's what I've found:

  • It appears that we are staying under our connection limits, so this is unlikely to be the case
  • It also appears that we have very frequent queries to the table in question, so it's unlikely that the connection has timed out (unless I'm misunderstanding the guidance for the keep-alive mechanism)
  • Our instances are located in the same region
  • I did not find any errors or warnings in our postgres logs related to the notify/listen commands
  • I wasn't easily able to find the listen_addresses configuration. However, since listen/notify does work the majority of the time (which we can see because our app functions normally), I don't think that's the issue.
  • Cloud run scaling seems to be working fine.
  • Some of the other suggestions aren't possible through the socket.io postgres adapter, since we don't have direct access to the listen/notify commands to add logging or a message queue. We will enable query logging for our instance to see if anything notable is happening with the listen/notify commands.

Right now, our best idea is to just do weekly restarts of the database, since restarts seem to resolve the issue for us. This obviously isn't ideal because it involves database downtime, but right now our user base is small and only uses the product during working hours, so that's okay. However, as we scale up, this won't be a sustainable solution.

Hi @miriamconnor ,

You've done a thorough job investigating the issue and ruling out some of the common causes. Given your findings, the problem might be more complex, possibly related to transient connection issues or resource exhaustion that builds up over time.

Since database restarts temporarily resolve the issue, something is likely accumulating over time, for example stale sessions or connection state that never gets cleaned up, leading to instability. One potential mitigation is connection pooling, which manages and reuses database connections more efficiently and reduces the likelihood of these transient issues. Scheduling restarts during off-peak hours can bridge the gap, but as you note it isn't sustainable long-term.
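If you do add pooling, the per-instance pool size has to be budgeted against Cloud Run's maximum instance count, since every instance opens its own pool. A sketch of the arithmetic (all numbers are illustrative):

```javascript
// Largest per-instance pool that keeps total connections under the database
// ceiling even when Cloud Run is scaled out to its maximum. `reserved` leaves
// room for superuser slots, migrations, and ad-hoc sessions.
function maxPoolSizePerInstance(maxConnections, maxInstances, reserved = 10) {
  return Math.floor((maxConnections - reserved) / maxInstances);
}

// e.g. max_connections = 200, up to 20 Cloud Run instances:
console.log(maxPoolSizePerInstance(200, 20)); // 9 connections per instance
```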

Enabling query logging and monitoring specific to NOTIFY/LISTEN events is a good step toward identifying any patterns or issues in your setup. Since the socket.io Postgres adapter abstracts away direct access to these commands, reviewing its source code or documentation could provide insights or reveal known issues under certain load conditions.
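For the query-logging step, statement logging on Cloud SQL is controlled by a database flag; a sketch with `gcloud` (the instance name is illustrative):

```shell
# Log every statement, including LISTEN/NOTIFY, to the Postgres log.
# This is verbose; revert it once the investigation is done.
gcloud sql instances patch my-postgres-instance \
    --database-flags=log_statement=all
```

One caveat: `--database-flags` replaces the instance's entire flag set, so include any flags you already have in the same command.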

Given that your application generally works well, this could be an edge case or a bug in either the Cloud SQL environment or the socket.io adapter. Reaching out to the maintainers or support channels might help uncover if others have encountered similar issues.

As your user base grows, consider planning for scalability by decoupling parts of your architecture. For example, moving message delivery onto a dedicated queue would keep messages flowing even when the Postgres NOTIFY/LISTEN mechanism fails, giving you a more robust foundation as the application scales.

While your current strategy of weekly database restarts is a temporary fix, investing in a more resilient architecture will be important as you continue to grow.

Thank you for the detailed answer! I'll try out these recommendations.