Cloud event management: Best practices to prepare for peak season, high traffic, and launch events

Lauren_vdv
Community Manager

Do you have a season or time frame when you expect higher volumes of traffic to your app, website, or service? 

For retailers, it’s usually the holidays or Black Friday/Cyber Monday, when shoppers are more active than normal. For the healthcare industry, it’s often October and November, when online traffic spikes for benefits enrollment.

With any peak traffic or launch event, it’s important that your team and services are prepared to deliver great performance, with minimal to no downtime that could impact your end users’ experience. 

In our latest Google Cloud Community event led by our Technical Account Management team, Jose Chavez (@thehokage) shared best practices and resources to help make sure you’re prepared going into your next peak season, high traffic, or launch event. 

In this article, we’ll dive into the key takeaways and recommendations from the session, including supporting resources and written Q&A. Let’s get started! 

Session recording

Tip 👉 Use the timestamp links in the YouTube description to quickly get to the topics you care about most. 

Key phases of cloud event management

There are three key phases to cloud event management, which we’ll cover in more detail throughout this article:

  • Preparation: Activities that help you prepare for your event, including an architecture review, capacity planning, and creating reservations - just to name a few.
  • Execution: As your event begins you’ll need to closely monitor and react accordingly.
  • Analysis: After your event is completed, analyze what went well, what didn’t, and how to improve for future events.

Preparation 

Consider the following activities and recommendations during the preparation phase that will help set you up for a successful peak season or high traffic event.  

Capacity planning

A critical component of the preparation phase is capacity planning, where you determine the amount of cloud resources your workloads need to operate effectively, without over-provisioning and paying for what you don’t need.

Capacity is the total amount of a particular resource that’s available, shared across all customers. To ensure that a few customers or projects can’t monopolize those shared resources, Google Cloud uses quotas to restrict how much of a particular resource you can use.

Each quota represents a specific countable resource, such as API calls to a particular service, the number of VMs used by your project at a given time, the number of load balancers used concurrently by your project, the number of projects that you can create, etc. 

While many services have default quotas for some resources, the set of quota limits that applies to your applications is specific to you, your project, or your organization. For example, if you are using a free trial account to explore the platform, you might have a very low quota for some resources compared to even the lowest quotas for a billed account. Enabling billing for your project increases quotas for most services. Quotas can also increase as your use of Google Cloud expands over time.

As you’re preparing for your high traffic event or peak season, you need to ensure your quotas match your resource requirements so you don’t face unexpected failures. Consider these recommendations:

  • Use Google’s monitoring tools to get visibility into application usage and capacity, and the overall health of your applications and infrastructure.
  • Evaluate the average and peak utilizations of your top cloud workloads, and their current and future capacity needs, to determine how much over-provisioning is needed to prepare for traffic spikes. 
  • Run load tests to determine how much load the system can handle while meeting its latency targets, given a fixed amount of resources.
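
As a rough illustration of the second recommendation, the sketch below turns an observed peak utilization and an expected traffic multiplier into an instance count that leaves headroom. The function name, the 65% target utilization, and the sample numbers are assumptions made for this example, not Google Cloud guidance, and the math assumes load scales linearly with instance count. Your own load tests should confirm whether that holds.

```python
import math


def instances_needed(current_instances, peak_utilization,
                     traffic_multiplier, target_utilization=0.65):
    """Estimate the instance count that keeps utilization at or below target.

    peak_utilization: highest utilization observed today (0.0 to 1.0).
    traffic_multiplier: expected event traffic relative to today's peak.
    target_utilization: hypothetical ceiling that leaves headroom for spikes.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Work done at today's peak, expressed in "fully busy" instances.
    busy_instances = current_instances * peak_utilization
    # Scale that work by the expected growth, then add headroom.
    required = busy_instances * traffic_multiplier / target_utilization
    return math.ceil(required)


# Example: 20 instances peaking at 50% today, expecting 3x traffic.
print(instances_needed(20, 0.50, 3.0))  # 47
```

Comparing the result with your current quota for that resource shows whether a quota increase request is needed before the event.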

See capacity planning template and manage capacity and quota for more details and recommendations.

Submit quota increase requests

If your quotas aren’t sufficient for what you need, you can request a quota increase. There are three primary ways to submit a quota increase request (as outlined in the diagram below): the Cloud Console, a Support ticket, and directly with your Account Team.

[Diagram: quota increase request process flows]

If you need a quota increase request addressed quickly, you may want to consider using the Cloud Console. However, if the request needs more analysis, you will need to work with your Account Team.

Most quota increase requests are evaluated by automated systems based on strict criteria, including the availability of resources, the length of time you've used Google Cloud, and other factors. In some cases, quota increase requests are escalated to human reviewers, who also follow strict criteria, but can consider your unique circumstances.

You can find out more about how quota increase requests work in About quota increase requests.

Use reservations

To make sure resources are available when you need them, we recommend using reservations. A reservation is a capacity fulfillment offering that gives Google Cloud customers very high assurance of obtaining capacity for their business-critical workloads by “reserving” Google Cloud resources. Currently, reservations apply to Compute Engine, Dataproc, and Google Kubernetes Engine (GKE) VM usage.

Reservations are billed at the same rate as the running resources they’re reserving, and as such, qualify for committed-use discounts and sustained-use discounts. Consider combining reservations with resource commitments to get discounted and reserved resources. 

The main reasons you’d want to consider using reservations are:

  • To make sure resources are available when you need them
  • To protect existing resources from being re-allocated and ensure elastic workloads have the necessary resources to scale up
  • To ensure your deployment can scale to meet demand for a known large event
  • To keep capacity available when GKE or Compute Engine autoscaling scales your deployment down/in and it later needs to scale back up/out
  • To protect the minimum or expected capacity of Managed Instance Groups (MIGs)
  • To keep capacity available when you need to stop-start (destroy-recreate) instances (e.g. a rolling update/patch that invokes an instance stop-start operation, or a reconfiguration such as changing a boot disk)

Learn more about how reservations work and how to get started here.

Execution

Execution is when your event begins and you’ll need to closely monitor activity and react as needed. Consider the following recommendations during the execution phase of your peak traffic or launch event.

Scale manually to peak capacity and pre-warm your resources 

Before your event, we recommend scaling up manually. Even if you have autoscaling configured, traffic during an event can ramp up so quickly that autoscaling may not be able to keep up with demand. So pre-warm any resources ahead of time, including:

  • Virtual machines
  • Caches if you want to pre-load
  • Serverless components to prevent cold-starts
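
To see why a fast ramp can outrun reactive scaling, here is a toy simulation. It is only a sketch: the function, the two-minute startup delay, and the traffic numbers are made up for illustration and do not model how any Google Cloud autoscaler actually behaves.

```python
def unserved_requests(traffic, start_capacity, step, delay):
    """Simulate a simple reactive scaler and return the total unserved load.

    traffic: requests per minute for each minute of the event.
    start_capacity: requests per minute the fleet can serve at minute 0.
    step: capacity added by each scale-up decision.
    delay: minutes between a scale-up decision and usable capacity.
    """
    capacity = start_capacity
    pending = []  # (minute_ready, extra_capacity) for capacity still starting
    unserved = 0
    for minute, demand in enumerate(traffic):
        # Capacity ordered earlier becomes usable once its delay has elapsed.
        capacity += sum(extra for ready, extra in pending if ready == minute)
        pending = [(ready, extra) for ready, extra in pending if ready > minute]
        if demand > capacity:
            unserved += demand - capacity
            # Count capacity already on order so the scaler does not over-order.
            on_order = sum(extra for _, extra in pending)
            if capacity + on_order < demand:
                pending.append((minute + delay, step))
    return unserved


spike = [100, 100, 1000, 1000, 1000, 1000]  # hypothetical launch-minute ramp
print(unserved_requests(spike, start_capacity=200, step=400, delay=2))   # 2000
print(unserved_requests(spike, start_capacity=1000, step=400, delay=2))  # 0
```

In this toy model, the fleet that starts at 200 requests per minute spends three minutes catching up, while the fleet pre-scaled to the expected peak serves everything from the first minute of the spike.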

One thing to note is that Google Cloud Load Balancing does not require pre-warming, but if you’re using other cloud providers, check with them, as some require load balancer pre-warming.

Set up a virtual war room

Create a chat room or conference call for cross-collaboration and communication between teams and vendors so there’s an instant channel for updates and progress. 

Monitor business-critical traffic, logs, and quota levels

Since you’ve set up monitoring, alerting, and logging in the preparation phase, you can use this information to help speed root-cause analysis and reduce mean time to resolution if any issues come up.

Leverage an incident management and escalation process

If any issues occur, a well-defined incident management process is key to reducing the effort and time it takes to address and resolve the issue.

If you don’t have one established yet, follow the steps to establish a cloud support and escalation process here.

Work with Google Support and loop in your Technical Account Manager for additional support. If you’re opening a support case, make sure to include all the relevant information so the Support Engineer has what they need to begin troubleshooting (e.g. which project is impacted, the time frame in which you noticed the problem, any specific logs or identifiers, and the location of impact). 

Analysis

When your peak season or launch event is over, review the event to document the lessons you learned so you can apply your learnings to the next major event. The following are recommended focus areas:

  • Timeline recap: Capture when your traffic began to increase and the key events (peaks) during the event period. Identify when issues, if any, arose.
  • Root cause analysis: Investigate any issues that occurred. Is there anything that you or Google could have done differently? Is this something to consider for next time? Document any lessons learned and necessary steps to improve for the future.
  • Compare predictions versus actual: Analyze your traffic prediction versus the actual traffic you recorded. Where were additional resources needed? Where were resources underutilized and/or unnecessary?
  • Postmortem: Share and review the above information with key stakeholders. As you do so, promote a blameless culture, where you assume everyone involved had good intentions and you focus on identifying contributing causes without indicting any individual or team. If a culture of finger pointing prevails, people will not bring issues to light for fear of punishment.
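
For the predictions-versus-actual comparison, a small script is often enough. The sketch below is one possible approach: the resource names, the numbers, and the 10% tolerance are hypothetical, so substitute the figures from your own capacity plan and monitoring data.

```python
def compare_forecast(predicted, actual, tolerance=0.10):
    """Label each resource by how actual usage compared with the forecast."""
    report = {}
    for name, forecast in predicted.items():
        if name not in actual or forecast <= 0:
            continue  # nothing to compare against
        ratio = actual[name] / forecast
        if ratio > 1 + tolerance:
            label = "under-forecast"  # more was needed than planned
        elif ratio < 1 - tolerance:
            label = "over-forecast"   # resources were underutilized
        else:
            label = "on target"
        report[name] = (round(ratio, 2), label)
    return report


# Hypothetical plan versus what was actually recorded during the event.
predicted = {"vm_count": 100, "peak_rps": 5000, "db_connections": 400}
actual = {"vm_count": 80, "peak_rps": 6500, "db_connections": 400}

for name, (ratio, label) in compare_forecast(predicted, actual).items():
    print(f"{name}: {ratio}x of forecast ({label})")
```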

For more information about postmortems, see Postmortem Culture: Learning from Failure and this detailed postmortem checklist from Google Cloud.

How Google Cloud can help: Event Management Service

Premium Support customers can use our Event Management Service, where Google Cloud supports you through each phase of your event - from preparation to execution and post-event analysis.

In addition to the Event Management Service as part of Premium Support, we’ve also designed an Advanced Event Management Service for deeper architecture reviews and resourcing that can be purchased separately.

See the table below for a comparison of our Event Management Service offerings. To get started or to learn more about which option is best for your upcoming event, please contact your Technical Account Manager or Account Team. 

                        | Event Management Service    | Advanced Event Management Service
Pricing                 | Included in Premium Support | PSO Engagement
Resourcing              | TAM, CE                     | TAM, CE, SCE, Cloud Consultant
Capacity planning       | Yes                         | Yes
Best practices          | Yes                         | Yes
Load testing and review | No                          | Yes
Architecture review     | No                          | Yes
Event monitoring room   | Best effort                 | Negotiable*
Proactive monitoring    | No                          | Yes*
Highly customizable     | No                          | Yes

*24x7 not guaranteed; depends on scope

Cloud event management Q&A

  1. What is the difference between Cloud Logging client libraries and agents? How do I know which one I should use?

    One of the main differences between these mechanisms is how they call the Logging API. An application using the client libraries calls the API directly, but an agent serves as a proxy for your applications. To decide, ask yourself questions like: How do you want your application to communicate with the Logging API? Is recovery of logs during an application crash critical to your business? Does your application sit outside of Google Cloud and need to connect to the Logging API? See the documentation that walks you through the differences and considerations for this question here.

  2. GKE in region us-east1 is not able to connect with an internal load balancer (us-central1) configured with Cloud Run as the backend in us-central1, even though they're in the same VPC. How do I make it possible?

    Your GKE cluster, load balancer, and Cloud Run service must be located within the same region and connected to the same VPC to work. Check out this internal load balancing documentation for more information. 

  3. How can I calculate the bandwidth usage of a single VM starting from the 1st of each month? I’m using Metrics Explorer, but it doesn't support each month.

    You can use Monitoring Query Language to construct a query that fetches the metrics for bytes sent or bytes received over the period of 30 days. Our data has limitations, such as retention timeline, so this may be more of a mitigation. Another option may be exporting the metrics to BigQuery and playing with the data there.
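
    If you go the export route, the month-to-date window is simple to compute yourself. The sketch below does not call the Monitoring API. It only shows the window logic, applied to hypothetical (timestamp, bytes) samples that are assumed to be per-interval byte counts you have already exported.

```python
from datetime import datetime, timezone


def month_to_date_window(now):
    """Return (start, end) covering the 1st of the current month up to `now`."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return start, now


def total_bytes(samples, start, end):
    """Sum byte counts for samples whose timestamp falls inside the window."""
    return sum(count for ts, count in samples if start <= ts <= end)


# Hypothetical exported samples: (timestamp, bytes sent in that interval).
samples = [
    (datetime(2024, 2, 29, 23, 0, tzinfo=timezone.utc), 500),   # last month
    (datetime(2024, 3, 1, 0, 30, tzinfo=timezone.utc), 1000),
    (datetime(2024, 3, 10, 9, 0, tzinfo=timezone.utc), 2500),
]
now = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
start, end = month_to_date_window(now)
print(total_bytes(samples, start, end))  # 3500
```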

  4. Suppose I’m running jobs with ~1000 parallel tasks each. Roughly how many jobs can I run simultaneously without overloading the system in a project?

    Assuming you’re referring to Cloud Run, currently, each job can execute up to 100 tasks in parallel. You can then run 1000 jobs per region.
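
    Put differently, a 1000-task job would not run all of its tasks at once under that limit. A quick back-of-the-envelope sketch, using the 100-task parallelism figure quoted above as a default (check the current Cloud Run quotas, since limits change over time):

```python
import math


def task_waves(total_tasks, max_parallel=100):
    """Minimum number of sequential waves when only max_parallel tasks run at once."""
    return math.ceil(total_tasks / max_parallel)


print(task_waves(1000))  # 10
```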

  5. How can I extract application logs (Magento 2 in my case) from a GCE virtual machine? I tried to use the Ops Agent but it doesn't extract the application logs (only the instances), audit, or user logs.

    Check out this video to learn how to install and configure the Ops Agent to stream any third party application log into Cloud Logging.

  6. Does either GKE Autopilot mode or Cloud Run support autoscaling based on custom metrics sent from my application?

    GKE does provide horizontal Pod autoscaling (HPA) based on specific metrics available in Cloud Monitoring. You can read more about this in our documentation here.

  7. How do I know which load balancer option to use?

    Below are links to documentation to help you decide which load balancer option to go with. In general, consider your requirements: Do you need external or internal load balancing? Regional or global? Pass-through or proxy? DDoS protection?

    Choose a load balancer [documentation]

    Choosing the right load balancer in Google Cloud [blog]

  8. Can I request GPU quota increases when I’m on the free GCP trial?

    If you’ve activated the free trial offer ($300 promotional credit over 90 days), you cannot add GPUs to your VM instances and, consequently, you cannot increase the GPU quota. Once you use the $300 in credits or 90 days have passed (whichever comes first), you must upgrade to a paid account, which will give you more access across the GCP console.

  9. Hi there, I was limited by Google: CPU all regions (12 vCPU) and CPU in the region (8 vCPU). Can I increase it? Because my GKE needs to auto-provision and auto scaling with high traffic. Can anyone explain the quota to me and help me resolve it?

    CPU (All Regions) quota is sometimes associated with new projects or accounts on Google Cloud. It takes priority over the regional quotas, so if your combined usage across regions exceeds the (All Regions) quota, you will see failures even when each region is within its own limit. You can work with our support team to submit a request to increase this limit.

    See more information on resource usage quotas and CPU quota limits here.
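
    The interaction between the two limits can be sketched as a simple check. The limits below mirror the numbers in the question (12 vCPUs across all regions, 8 per region). The region names, current usage, and function name are assumptions for illustration only.

```python
def cpu_quota_blockers(regional_usage, regional_limits,
                       all_regions_limit, region, extra_vcpus):
    """Return the quotas that a request for extra vCPUs would exceed."""
    blockers = []
    # The request must fit within the target region's own quota...
    if regional_usage.get(region, 0) + extra_vcpus > regional_limits[region]:
        blockers.append(f"CPUs in {region}")
    # ...and within the project-wide total summed over every region.
    if sum(regional_usage.values()) + extra_vcpus > all_regions_limit:
        blockers.append("CPUs (all regions)")
    return blockers


usage = {"us-central1": 6, "us-east1": 5}   # hypothetical current usage
limits = {"us-central1": 8, "us-east1": 8}  # regional limit from the question

# Two more vCPUs fit in the region (8 of 8) but not in the project (13 of 12).
print(cpu_quota_blockers(usage, limits, 12, "us-central1", 2))
```

    In this example, the scale-up fails on the (All Regions) quota even though the regional quota still has room, which is the situation described in the question.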

  10. Why is my quota increase request being denied? What do I do now? Support told me I could go through a partner, but this isn’t feasible for my use case.

    A quota increase request can be denied for many reasons, such as the size or type of the request. When this happens, you have a few options depending on your situation.

    If you’re a customer with a Google Cloud Account team, you can escalate this request to them and your Technical Account Manager can work with Support in the background to submit the proper capacity plans and provide additional context to our capacity teams.

    Additionally, you can create a capacity plan with your Account Team, so rather than working with support, your Technical Account Manager can submit a request to work with our capacity teams directly.

    If you have a project for a personal account rather than a business, the next recommendation is to try requesting a smaller increase. Our quota increase requests have certain thresholds that they review per customer, and you may be able to get an approval for a smaller amount.

    One thing to look out for is if you are using the free GCP trial. As part of the free offer, you are not allowed to request quota increases.

  11. Is there any way to hard cap money spend on GCP to avoid unexpected charges?

    Yes. One way to effectively cap any spend on Google Cloud is to disable billing to stop usage. There are some drawbacks to this, where resources might not shut down gracefully and might be irretrievably deleted. There is no graceful recovery if you disable Cloud Billing. You can re-enable Cloud Billing, but there is no guarantee of service recovery and manual configuration is required.

    To prevent overspending, consider configuring default budgets and alerts with high thresholds for all your projects. Learn more about cost optimization best practices here.

  12. I want to be more proactive about knowing when I reach quotas before I do. What do you recommend as the best route to go?

    You’ll want to set up alerts using Cloud Monitoring and set the alert to trigger at certain thresholds that you deem are necessary. See create and manage alerts using the console and How to monitor quotas in Google Cloud for more information.
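
    The threshold logic behind such an alert is straightforward. In practice a Cloud Monitoring alerting policy evaluates it for you. The sketch below only illustrates the idea, and the 80% threshold and the usage and limit values are sample numbers chosen for this example.

```python
def quotas_near_limit(usage, limits, threshold=0.8):
    """Return the quota names whose usage ratio is at or above the threshold."""
    return sorted(
        name for name, used in usage.items()
        if limits.get(name) and used / limits[name] >= threshold
    )


# Sample values for illustration; read real ones from your project's quotas.
usage = {"CPUS": 20, "IN_USE_ADDRESSES": 3, "SSD_TOTAL_GB": 450}
limits = {"CPUS": 24, "IN_USE_ADDRESSES": 8, "SSD_TOTAL_GB": 500}

print(quotas_near_limit(usage, limits))  # ['CPUS', 'SSD_TOTAL_GB']
```

    Picking a threshold well below 100% gives you time to submit a quota increase request before the limit causes failures.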

  13. I’m getting a “Quota exceeded: Your table exceeded quota for imports or query appends per table.” error while loading a file from Google Cloud Storage to BigQuery, but I can’t find where I exceeded any of the limits. I’m using the PHP SDK with the method loadFromStorage. Where can we find the quota problem we’re currently facing?

    This is something our Support team can provide more context on, but in short, this is more of a limits conversation. The quota we’re referring to is for standard_tables in BigQuery, specifically table modifications per day. Our recommendations are to:
    • Batch the jobs together to merge the updates into a single update, so that you stay under the 1,500 table operation limit.
    • Use the Streaming API if you want real-time updates to the data.
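
    As a sketch of the batching idea, the helper below groups source files so that the number of load jobs stays within a daily budget. It assumes that one load job can take several source files at once and that the whole 1,500-operation budget is available for loads. The bucket and file names are hypothetical, and the helper only plans the batches without calling BigQuery.

```python
import math


def plan_batches(file_uris, max_jobs_per_day=1500):
    """Group files so the number of load jobs stays within the daily limit."""
    if not file_uris:
        return []
    # Smallest batch size that keeps the job count at or under the limit.
    batch_size = math.ceil(len(file_uris) / max_jobs_per_day)
    return [file_uris[i:i + batch_size]
            for i in range(0, len(file_uris), batch_size)]


# Hypothetical day of 4,000 files arriving in Cloud Storage.
uris = [f"gs://example-bucket/part-{i}.csv" for i in range(4000)]
batches = plan_batches(uris)
print(len(batches), max(len(b) for b in batches))  # 1334 3
```

    Loading one file per job would need 4,000 table operations here. Grouping three files per job brings that down to 1,334.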

  14. If I have an existing GCE instance in a certain region and I purchase CUDs for it, will they automatically apply to the existing instance? How does that work?

    Yes. When you purchase Google Cloud committed use discounts, you commit to a consistent amount of usage for a one- or three-year period. Any usage over the committed amount will result in charges at the on-demand rate. 
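
    The billing split can be sketched with simple arithmetic. The rates below are made-up price units, not real Google Cloud prices, and the sketch assumes the committed amount is billed whether or not it is fully used.

```python
def hourly_charge(usage_vcpus, committed_vcpus, committed_rate, on_demand_rate):
    """Hourly cost: the commitment is always billed; overage is on demand."""
    overage = max(0, usage_vcpus - committed_vcpus)
    return committed_vcpus * committed_rate + overage * on_demand_rate


# Hypothetical rates in price units per vCPU-hour: 2 committed, 3 on demand.
print(hourly_charge(14, 10, 2, 3))  # 32: 10 committed vCPUs plus 4 on demand
print(hourly_charge(6, 10, 2, 3))   # 20: the commitment is billed even if unused
```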

Cloud event management resources