System Design: Compute Best Practices

In this article, you'll find recommendations and best practices focused on the topic of Compute, as part of the System Design Pillar of the Google Cloud Architecture Framework.

Throughout this article, we refer to the choose and manage compute documentation. We recommend reviewing it to learn the basic concepts before working through the following assessment questions and recommendations.

Designing workloads

How do you plan to use compute? Does your application implement heavyweight or lightweight logic?

  • For applications or workloads that require lightweight logic, consider serverless offerings like Cloud Functions or Cloud Run before investing in Google Kubernetes Engine (GKE) or Compute Engine implementations; they abstract away operational overhead and help you optimize for cost and performance (see the sketch after this list).

  • If your application is always “ON” or requires heavyweight logic, use Compute Engine for VM-based applications or GKE for container-based applications. You can use specialized hardware like graphics processing units (GPU) or tensor processing units (TPU) to accelerate these heavy logic implementations.
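
As a concrete illustration of the lightweight case, the following is a minimal sketch of an HTTP handler written with the Functions Framework for Python. It can be deployed to Cloud Functions or packaged into a container for Cloud Run; the function name and response shape are illustrative, not a prescribed pattern.

    # Minimal sketch of lightweight, stateless logic suited to a serverless runtime.
    # The function name "handle_request" and the response shape are illustrative.
    import functions_framework

    @functions_framework.http
    def handle_request(request):
        # Keep the logic small and stateless: parse the input, do a bounded
        # amount of work, and return. The platform handles scaling for you.
        name = request.args.get("name", "world")
        return {"message": f"Hello, {name}!"}, 200

Deployed this way, you pay only while requests are being handled and avoid managing servers or clusters yourself.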

Is your application stateful or stateless?

  • Where possible, decouple your applications and design them to be stateless so you can make the most of serverless options. This lets you use managed compute offerings, scale applications based on demand, and optimize for cost and performance.

  • If your application is designed to be stateful, consider the caching approach described in the database section to decouple state and make your workload scalable (see the sketch after this list).

  • Use live migration by setting instance availability policies to allow seamless Google maintenance upgrades.
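
To make the decoupling and caching ideas concrete, here's a minimal sketch that externalizes per-user state to a Memorystore for Redis endpoint so any serving replica can handle any request. The REDIS_HOST environment variable and the key names are assumptions made for illustration.

    # Minimal sketch of keeping session state outside the serving instance,
    # assuming a Memorystore for Redis endpoint in REDIS_HOST (illustrative).
    import os
    import redis

    client = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)

    def save_session(session_id: str, payload: str) -> None:
        # The instance itself stays stateless; state lives in the shared cache.
        client.set(f"session:{session_id}", payload, ex=3600)  # expire after 1 hour

    def load_session(session_id: str) -> str | None:
        # Any replica can read the state back on the next request.
        value = client.get(f"session:{session_id}")
        return value.decode() if value is not None else None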

Are your applications containerized or do they have any legacy dependencies?

  • Google Cloud supports running various open-source software (OSS) and third-party software. Review Google Cloud Marketplace offerings to evaluate if your application is listed under a supported vendor.

  • Consider Migrate for Anthos and GKE to extract and package your VM-based application as a containerized application running on GKE.

  • If you have legacy dependencies running in a VM-based application, you can use Compute Engine to run your applications on Google Cloud, provided you meet your vendor requirements.

Scaling workloads

How do you plan to scale your applications?

  • For stateful applications, where possible, use startup and shutdown scripts to start your application and save its state gracefully (see the shutdown sketch after this list).

  • When using Compute Engine VMs, managed instance groups (MIGs) support features like autohealing, load balancing, autoscaling, auto-updating and stateful workloads. You can create zonal or regional MIGs based on your availability goals. Use MIGs for both stateless serving or batch workloads and for stateful applications that need to preserve each VM’s unique state.

  • When using GKE, use horizontal and/or vertical pod autoscalers to scale your workloads and node auto-provisioning to scale the underlying compute resources.

  • Use Cloud Load Balancing to distribute your application instances across more than one region or zone. A load balancer allows you to easily scale your applications globally. Where possible, we recommend using Cloud CDN (Content Delivery Network) for caching static content to optimize for end-user latency.
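
As one way to apply the startup and shutdown scripts recommendation above, the sketch below shows a stateful process that traps SIGTERM, which the operating system or your shutdown script can deliver before the instance stops. The flush_state() helper is a hypothetical placeholder for your own persistence logic.

    # Minimal sketch of shutting down gracefully so a replacement instance can
    # resume from persisted state. flush_state() is a hypothetical placeholder.
    import signal
    import sys
    import time

    def flush_state() -> None:
        # Illustrative: persist in-memory state to a disk, bucket, or database.
        print("state flushed")

    def handle_sigterm(signum, frame):
        flush_state()
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)

    while True:
        # Main application loop; real work would happen here.
        time.sleep(1)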

How do you plan to scale your backend application?

  • Evaluate using Internal Load Balancing to scale your decoupled architecture (see the health check sketch after this list).

  • When switching from traditional on-premises patterns such as HAProxy, refer to the best practices for floating IP addresses guide to evaluate your options.
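
Backends behind a load balancer need a health check endpoint so traffic is routed only to healthy instances. The following is a minimal sketch using Python's standard library; the port (8080) and path (/healthz) are illustrative and must match the health check you configure on the load balancer.

    # Minimal sketch of a health check endpoint for instances behind a load
    # balancer. The port and path are illustrative.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                # Report healthy only when the instance can actually serve traffic.
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok")
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()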

How do you plan to operationalize scaling?

  • Automate instance creation and evaluate appropriate machine types based on your application needs (see the sketch after this list).

  • Minimize human errors in the production environment by automating compute creation and management. See operational best practices in the Operational Excellence Pillar.

  • Google Cloud offers various machine types that give you the flexibility to choose compute based on cost and performance parameters. You can choose a lower-performance offering to optimize for cost, or a higher-performance offering at a higher cost. For details, see the Cost Optimization Pillar and Performance Optimization Pillar.
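
To illustrate automating instance creation, here's a sketch that uses the google-cloud-compute Python client to create a small VM. The project, zone, instance name, machine type, and image family are illustrative values; adjust them to your workload.

    # Sketch of automated VM creation with the google-cloud-compute client.
    # Project, zone, names, machine type, and image are illustrative values.
    from google.cloud import compute_v1

    project_id = "my-project"  # illustrative
    zone = "us-central1-a"     # illustrative

    instance = compute_v1.Instance()
    instance.name = "example-instance"
    instance.machine_type = f"zones/{zone}/machineTypes/e2-small"

    disk = compute_v1.AttachedDisk()
    disk.boot = True
    disk.auto_delete = True
    disk.initialize_params = compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        disk_size_gb=10,
    )
    instance.disks = [disk]

    nic = compute_v1.NetworkInterface()
    nic.network = f"projects/{project_id}/global/networks/default"
    instance.network_interfaces = [nic]

    client = compute_v1.InstancesClient()
    operation = client.insert(project=project_id, zone=zone, instance_resource=instance)
    operation.result()  # wait for the create operation to finish

Expressing instance creation as code (or as instance templates and infrastructure-as-code) keeps configurations reproducible and reviewable instead of depending on manual console steps.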

Management operations

How do you plan to manage your VM configurations?

  • Use VM Manager to manage operating systems for your large VM fleets running Windows or Linux on Compute Engine. This will help apply consistent configuration policies and reduce operational overhead.

  • Use machine images to store all the configuration, metadata, permissions, and data from one or more disks required to create a virtual machine instance.

How do you plan to manage your GKE clusters?


Do you need base image management?

  • We recommend using public images supplied by Google Cloud, which are regularly updated, but you can create your own images with specific configurations and settings. Where possible, automate and centralize image creation in a separate project that can be shared with authorized users within your organization.

  • Snapshots let you create backups for your instances, which is especially useful for stateful applications. If you frequently find yourself using snapshots to create new instances, create a base image from that snapshot to streamline the process (see the sketch after this list).

  • Use a machine image to store all the configuration, metadata, permissions, and data from one or more disks required to create a VM instance. Refer to the machine images documentation to learn more.
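
To make the snapshot-to-base-image recommendation concrete, here's a sketch that creates a custom image from an existing snapshot with the google-cloud-compute Python client. The project, snapshot, image, and family names are illustrative values.

    # Sketch of turning a frequently used snapshot into a reusable base image.
    # Project, snapshot, image, and family names are illustrative values.
    from google.cloud import compute_v1

    project_id = "my-project"          # illustrative
    snapshot_name = "golden-snapshot"  # illustrative

    image = compute_v1.Image()
    image.name = "golden-base-image"
    image.source_snapshot = f"projects/{project_id}/global/snapshots/{snapshot_name}"
    image.family = "golden-base"  # optional: lets new instances track the latest image

    client = compute_v1.ImagesClient()
    operation = client.insert(project=project_id, image_resource=image)
    operation.result()  # wait for the image to be created

New instances or instance templates can then reference the image family instead of repeatedly restoring from the snapshot.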

Capacity, reservations, and isolation

Do you have capacity requirements for a specific zone or region?

  • Google Cloud can scale to facilitate your various compute needs, but if you need a large amount of a specific machine type in a specific region or zone, you’ll need to work with your account teams to ensure availability.

  • Google Cloud allows you to define reservations for your workloads to ensure those resources are available to you. There is no additional charge to create reservations, but you will pay for the reserved resources even if you don't use them. Refer to the consuming and managing reservations documentation for details.

  • You can reduce operating costs for always “ON” workloads by using committed use discounts. Review the Cost Optimization Pillar to learn best practices for cost optimization strategies like committed use discounts.

Do you need node isolation?

  • You can use sole-tenant nodes to minimize noisy neighbors or to meet your compliance requirements (e.g., isolating payments processing workloads).

VM migration

Are you planning to move your workloads to Google Cloud from on-premises or from another cloud environment?

  • Evaluate Google Cloud's native migration tools to move your workloads quickly. These tools and services can also help you optimize for cost and performance during the migration.

  • You can get a free migration cost assessment based on your current IT landscape with the Google Cloud Rapid Assessment & Migration Program (RAMP). RAMP helps simplify and accelerate your path to success.

Do you plan to bring your own license (BYOL)?

  • Use the virtual disk import tool to import customized images of supported operating systems. Sole-tenant nodes can help you meet hardware BYOL requirements for licenses that are billed per core or per processor.

  • For Oracle workloads, use the Bare Metal Solution for Oracle to jumpstart your cloud journey with minimal risk, while taking advantage of various Google services.

Key Google Cloud services

  • Compute Engine: Secure and customizable compute service that lets you create and run VMs on Google’s infrastructure

  • Cloud Run: Develop and deploy highly scalable containerized applications on a fully managed serverless platform

  • Cloud Functions: Scalable pay-as-you-go functions as a service (FaaS) to run your code with zero server management

  • Google Kubernetes Engine: A simple way to automatically deploy, scale, and manage Kubernetes

  • Spot VMs: Affordable compute instances suitable for batch jobs and fault-tolerant workloads

  • Google Cloud VMware Engine: Easily lift and shift your VMware-based applications to Google Cloud without changes to your apps, tools, or processes. The service provides all the hardware and VMware licenses you need to run in a dedicated VMware SDDC in Google Cloud.

  • Bare Metal Solution for Oracle: Bring your Oracle workloads to Google Cloud with Bare Metal Solution and jumpstart your cloud journey with minimal risk

  • Sole-tenant nodes: Dedicated hardware for compliance, licensing, and management

  • Shielded VMs: Hardened virtual machines on Google Cloud

  • Workflows: Orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows

What's next?

We've just covered Compute as part of the System Design Pillar of the Google Cloud Architecture Framework. Several other topics within the System Design Pillar may also be of interest to you.
