
Should I implement a Cloud Run bulk insert (from Pub/Sub subscription) using windowed buffering?

Evening all. We currently run a service on GCP that pushes payloads into Pub/Sub, triggering a Cloud Run instance which stores the data in Elastic Cloud (though it could conceptually be any database technology).


As a Pub/Sub subscriber, the Cloud Run instance sends one payload to the database at a time, which is fine when the system isn't processing much. At high throughput, though, this message-by-message pattern is inefficient and can overwhelm the database, which starts returning 429 or 503 errors.

Rather than throttling the rate at which messages flow from Pub/Sub to Cloud Run (i.e., buffered ingestion), I'd really like to use Cloud Run as a batch processor: receive the stream of data and, every x seconds, flush it in bulk to the database.
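To make that concrete, here's a minimal sketch in Go of the kind of windowed buffering I mean. Everything in it is illustrative: bulkInsert is a hypothetical stand-in for the Elastic _bulk request, and the two-second window and 500-item cap are numbers I picked arbitrarily.

```go
package main

import (
	"log"
	"time"
)

type Payload []byte

// batcher drains payloads from in and flushes them either every
// flushWindow or as soon as maxBatch items have accumulated,
// whichever comes first.
func batcher(in <-chan Payload, flushWindow time.Duration, maxBatch int) {
	ticker := time.NewTicker(flushWindow)
	defer ticker.Stop()

	batch := make([]Payload, 0, maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		if err := bulkInsert(batch); err != nil {
			// Real code would retry, or leave the messages unacked for redelivery.
			log.Printf("bulk insert failed: %v", err)
		}
		batch = batch[:0]
	}

	for {
		select {
		case p, ok := <-in:
			if !ok {
				flush() // channel closed: flush the remainder and stop
				return
			}
			batch = append(batch, p)
			if len(batch) >= maxBatch {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

// bulkInsert is a hypothetical placeholder for the Elastic _bulk request.
func bulkInsert(batch []Payload) error {
	log.Printf("flushing %d payloads", len(batch))
	return nil
}

func main() {
	in := make(chan Payload, 100)
	go batcher(in, 2*time.Second, 500)
	in <- Payload(`{"example":1}`)
	time.Sleep(3 * time.Second) // let the window elapse so the batch flushes
}
```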

I was wondering if this is a standard approach and if anyone has done this successfully? I'm assuming there are many considerations in getting this working just right, for example:

  1. Would I set the maximum Cloud Run instance count to 1? (to ensure there aren't multiple batches being created as the service scales the number of instances)
  2. Do I need to worry about the service shutting down when the queue is empty? (Cloud Run turns off all instances when not in use, by design)
  3. What is the recommended approach for collecting a batch of payloads? Our service is implemented in Go; if I were using .NET, I'd probably opt for Reactive Extensions. If batching like this is a standard approach, what do other people recommend for this part? (The channel-and-ticker sketch above is my current thinking.)
  4. If I am only going to have a single instance, how do I ensure Pub/Sub doesn't end up DoS'ing that one instance under high load?
  5. Obviously the ack deadline will need to cover the bulk window, otherwise some payloads will time out unacknowledged and be redelivered, meaning they'd be processed again even though the bulk insert actually succeeded! (One way of tying acks to the flush is sketched below.)
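On point 5: with a push subscription, Cloud Run acknowledges a message by returning a 2xx response, so one way to tie acks to the flush is to hold each request open until the batch containing its payload has been written. Below is a hedged sketch of that coupling; item, in, batchLoop and bulkInsert are all placeholder names of mine (not a real API), and the 30-second wait is arbitrary.

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"time"
)

// item pairs a payload with a channel the batcher uses to report
// the outcome of the flush that included it.
type item struct {
	payload []byte
	done    chan error
}

var in = make(chan item, 1000)

// pushHandler parses the Pub/Sub push envelope, enqueues the payload,
// and blocks until the flush result arrives. Returning non-2xx makes
// Pub/Sub redeliver the message, so nothing is lost on failure.
func pushHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	var envelope struct {
		Message struct {
			Data []byte `json:"data"` // encoding/json base64-decodes into []byte
		} `json:"message"`
	}
	if err := json.Unmarshal(body, &envelope); err != nil {
		http.Error(w, "bad envelope", http.StatusBadRequest)
		return
	}

	it := item{payload: envelope.Message.Data, done: make(chan error, 1)}
	in <- it

	select {
	case err := <-it.done:
		if err != nil {
			http.Error(w, "flush failed", http.StatusInternalServerError) // triggers redelivery
			return
		}
		w.WriteHeader(http.StatusOK) // this 2xx is the ack
	case <-time.After(30 * time.Second):
		http.Error(w, "flush timed out", http.StatusServiceUnavailable)
	}
}

// batchLoop flushes every flushWindow or at maxBatch items, then reports
// the result to every request waiting in that batch.
func batchLoop(flushWindow time.Duration, maxBatch int) {
	ticker := time.NewTicker(flushWindow)
	defer ticker.Stop()
	var batch []item
	flush := func() {
		if len(batch) == 0 {
			return
		}
		payloads := make([][]byte, len(batch))
		for i, it := range batch {
			payloads[i] = it.payload
		}
		err := bulkInsert(payloads) // hypothetical Elastic _bulk call, as above
		for _, it := range batch {
			it.done <- err
		}
		batch = nil
	}
	for {
		select {
		case it := <-in:
			batch = append(batch, it)
			if len(batch) >= maxBatch {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

func bulkInsert(payloads [][]byte) error { return nil }

func main() {
	go batchLoop(2*time.Second, 500)
	http.HandleFunc("/", pushHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The constraint from point 5 then falls out naturally: the subscription's ack deadline has to exceed the flush window plus the time a bulk insert takes, because Pub/Sub is waiting on the HTTP response for that long.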

The last point makes me wonder whether this is such a good idea, or whether I'm architecting a bad solution. If so, is there a better approach to avoid overwhelming the database when the service is extremely busy?

Thanks for your time, this is my first post here! 🙂

Solved
1 ACCEPTED SOLUTION

Advice from StackOverflow, sharing for reference: "If you're using Elasticsearch as a database, I recommend using Google Dataflow's 'Pub/Sub to Elasticsearch' template. Dataflow manages the data flow to prevent overwhelming your infrastructure. You can create a pipeline with the desired frequency (daily, hourly, etc.)"

