Congratulations! You've deployed your app, set up your Kubernetes cluster, or launched a new service. But now come the challenges of day 2 operations - managing configuration drift, service reliability, scaling up, and more.
In our latest Ask Me Anything session, Google Cloud experts, Omkar Suram (@omkarsuram_) and Rakesh Dhoopar, shared key principles and best practices to reduce the complexity of day 2 cloud operations, including how to use Google Cloud products and new features in the operational tools portfolio.
In this article, we cover the key takeaways from the session, along with written Q&A and supporting resources.
If you have any additional questions, please leave a comment below and someone from the Community or Google Cloud team will be happy to help.
Tip 👉 Use the timestamp links in the YouTube description to quickly get to the topics you care about most.
There are 3 key themes to optimize your day 2 operations. The following provides a quick overview of these themes, which we cover in more detail further below.
Based on the Google Cloud Architecture Framework - specifically the Operational Excellence and Reliability pillars - here are a few key recommendations to design resilient services:
Below is a sample architecture that incorporates various managed services, such as Google Cloud Armor, Cloud Load Balancing, Apigee, Managed Instance Groups, GKE Autopilot, Cloud SQL, and the operations suite. Such an architecture scales more easily and reduces the burden on your operations teams.
Your architecture design will vary based on your use case or business needs, but the main point here is to showcase the value of managed services in operationalizing your workloads. Let Google help you take care of maintaining availability of these services, while you focus on improving your applications.
Remember, optimization is a continuous, ongoing exercise and a robust architecture is foundational to achieving operational excellence.
Get started building your own architecture today with the Google Cloud Architecture diagramming tool.
As you embark on your operational journey, you should develop a good understanding of the complete portfolio of tools and capabilities that help you get your jobs done. With this knowledge, you can strategically plan your day-to-day tasks and leverage the most relevant capabilities to derive the most value for your work and time.
In the past, day 2 activities ranged widely and often required a multitude of tasks delivered through siloed tools. However, as cloud platforms have evolved, day 2 activities and tools (especially those considered to be in the operations space) can now be categorized into a few primary groups.
First, many day 2 operations activities that were performed manually in the past have been automated and integrated into the core cloud platform itself. These native day 2 capabilities are available as part of Google Cloud’s compute, network, storage, and security services, including auto-scaling, auto-patching, auto-healing, backups, runtime security postures, etc.
The second group of day 2 operations activities relates to aggregate observability across the entire platform, including Google Cloud services and the applications that users deploy on top of these services. These activities are performed at the fleet level, the service level, or the application level. The observability capabilities include monitoring, logging, tracing, and auditing, which help customers understand the health of their Google Cloud deployment and troubleshoot problems.
Last but not least is the top layer, which provides different surfaces through which your cloud operations tools and capabilities can be accessed. This includes APIs, SDKs, the Cloud Console, and automation using Terraform.
We’ll dive deeper into a few specific tools in Google Cloud’s operations portfolio later on in this article, including Google Cloud Managed Service for Prometheus, Log Analytics, Network Intelligence Center, Apigee API Management, and Active Assist.
After you develop an understanding of your solutions portfolio, how do you leverage these tools and implement an effective cloud operations strategy? In this section, we’ll cover best practices to help your organization simplify and improve its cloud operations, as well as touch on some of the newer capabilities Google Cloud has recently announced to support these best practices.
There are 3 types of observability data: metrics, logs, and traces. In many instances, customers deploy different collectors or agents for collecting these different signals, which can be challenging from both a collector administration and resource management point-of-view.
With that in mind, our first best practice is to deploy a unified collector that gathers metrics, logs, and traces together. Both the OpenTelemetry (OTEL) Collector and the unified Ops Agent offered by Google make this possible.
Traditionally, logs have been unstructured or semi-structured, but they are increasingly structured. So as a second best practice, we recommend that users adopt structured logging.
With unstructured logs, the primary mode of exploration is search oriented. You can enforce some structure during your query, but that becomes quite onerous. However, if you generate structured logs to begin with, then it’s easier to apply additional techniques, including analytics, AI and ML, correlation, etc. to easily find patterns and outliers. This comes in very handy especially in complex and highly distributed application architectures.
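As a minimal sketch of what structured logging can look like, the Python below writes one JSON object per log line and then filters on a field instead of searching text. The logger name, the `fields` convention passed through `extra`, and the `order_id` and `latency_ms` fields are all hypothetical choices for illustration, not part of any Google Cloud API.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("payment failed", extra={"fields": {"order_id": "A-1001", "latency_ms": 412}})
log.info("payment ok", extra={"fields": {"order_id": "A-1002", "latency_ms": 87}})

# Every line is JSON, so filtering is a field comparison rather than a regex.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
slow = [e["order_id"] for e in events if e["latency_ms"] > 200]
print(slow)
```

In a real service the stream would typically be stdout or a logging agent rather than an in-memory buffer; the point is only that fields such as `latency_ms` stay queryable after ingestion.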
Lastly, as applications become more complex, it becomes increasingly important to collect richer context from your signals. Whether you collect that context through labels in metrics or through fields in structured log events, it helps you slice and dice the data, analyze it quickly, and correlate signals.
Furthermore, to help make troubleshooting easier, consider capturing a unique identifier - such as a Trace ID - that spans a user request across all services.
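To illustrate the idea of a request-spanning identifier, here is a small sketch in which an edge service mints an ID, passes it downstream in a header, and every service includes it in its log events. The header name `x-trace-id`, the service names, and the in-memory `sink` are hypothetical; in practice a tracing library such as OpenTelemetry would usually handle this propagation for you.

```python
import json
import uuid
from collections import defaultdict

TRACE_HEADER = "x-trace-id"  # hypothetical header name, for illustration only


def log_event(sink, service, message, trace_id):
    """Append a structured log event that carries the trace ID."""
    sink.append(json.dumps(
        {"service": service, "message": message, "trace_id": trace_id}
    ))


def inventory_service(headers, sink):
    log_event(sink, "inventory", "reserved stock", headers[TRACE_HEADER])


def payment_service(headers, sink):
    log_event(sink, "payment", "charged card", headers[TRACE_HEADER])


def frontend(sink, incoming_headers=None):
    headers = dict(incoming_headers or {})
    # Reuse an upstream ID if one arrived, otherwise mint one at the edge.
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    log_event(sink, "frontend", "received order", headers[TRACE_HEADER])
    inventory_service(headers, sink)
    payment_service(headers, sink)
    return headers[TRACE_HEADER]


sink = []
first = frontend(sink)
second = frontend(sink)

# Group the interleaved log stream by trace ID to rebuild each request's path.
by_trace = defaultdict(list)
for line in sink:
    event = json.loads(line)
    by_trace[event["trace_id"]].append(event["service"])

print(by_trace[first])
```

Grouping by the shared ID recovers the full path of a single request even though events from many requests are mixed together in the log stream.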
The second group of best practices relates to building your observability footprint on open standards.
Over the past two decades, one of the big challenges for operators was how to keep up with collecting metrics and logs data. There were many proprietary solutions and all of them were siloed and incomplete. Users spent more time figuring out how to ingest data rather than how to derive actionable insights and get their jobs done. Adopting open APIs and standards like OTEL can reduce data collection pain and enable you to focus on using analytics for your operational use cases.
In addition to open APIs and standards, there’s a rich open source ecosystem available that enables you to adopt and deploy tools for a variety of different observability use cases, including tools like Prometheus, Elastic, Jaeger, etc. We recommend that users use tools and services that are compatible with the OSS ecosystem - either through DIY implementation or compatible managed services.
Lastly, we’ve often seen that tools offered in the OSS ecosystem work well as starter tools, but can pose challenges around scaling and operations as your business grows and your services expand. In such cases, look for managed services that support open APIs and interfaces, but have proprietary implementations that address scaling and globalization challenges.
In the next section, we’ll take a look at one such service that Google offers, Google Cloud Managed Service for Prometheus.
Google Cloud Managed Service for Prometheus is an example of a managed service that’s compatible with open source interfaces of Prometheus, but is delivered not just by running the OSS Prometheus, but by replacing the data store and query engine with a proprietary implementation. What does this mean exactly?
Prometheus is a popular solution, but we’ve seen that the OSS implementation has three problems:
Managed Service for Prometheus offers a managed service that addresses these challenges by supporting the APIs for data collection and query using PromQL, but it stores and manages data in Google’s proprietary time series database called Monarch, which is a highly scalable monitoring service.
When users want to switch from their own instance to the managed service, they simply replace the upstream OSS Prometheus binary with a Google-offered Prometheus binary. This distribution scrapes data exactly like the OSS version, but it writes the data to the Google backend. Once that’s done, users can run their PromQL-compatible queries against Google’s data store and everything continues to work as it did before. Onboarding is very low-friction, and all your existing dashboards and workflows continue to work.
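The sketch below shows what "everything continues to work" can mean for a query client: the PromQL expression and the Prometheus HTTP API path `/api/v1/query` stay the same, and only the base URL changes. The self-hosted hostname, the project ID `my-project`, and the metric name are hypothetical. The managed base URL follows the pattern we understand the service to expose, so treat it as an assumption and confirm it against the current documentation. Requests to the managed endpoint also need Google Cloud credentials, which are omitted here.

```python
from urllib.parse import urlencode

# The same PromQL expression is used against both backends.
PROMQL = "sum by (namespace) (rate(http_requests_total[5m]))"


def instant_query_url(base_url, promql):
    """Build a Prometheus HTTP API instant-query URL for a given base URL."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical self-hosted Prometheus server.
self_hosted = instant_query_url("http://prometheus.example.internal:9090", PROMQL)

# Assumed managed endpoint pattern; "my-project" is a placeholder project ID.
managed = instant_query_url(
    "https://monitoring.googleapis.com/v1/projects/my-project/location/global/prometheus",
    PROMQL,
)

print(self_hosted)
print(managed)
```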
The next cloud operations solution we want to highlight is Log Analytics - now in public preview - that helps you take advantage of structured logs and all the rich context you can capture from these logs.
Previously, when you stored logs in Cloud Logging, they were kept in a proprietary store with its own query language for searching through them. With Log Analytics, Cloud Logging is integrated with BigQuery, making BigQuery the native store for logs. This change enables users to now tackle a variety of use cases:
As your deployment grows, so does the networking complexity, so it’s important to understand how your network behaves, especially when things break and you’re trying to fix issues with complex dependencies.
Network Intelligence Center is Google Cloud’s automated analytics and observability platform that helps deepen your ability to proactively monitor, visualize, and troubleshoot network health.
Your operations teams can now have a single place to view network deployments, understand traffic flows, quickly identify issues, and focus on improving network performance.
We’ve recently launched a few new Network Intelligence Center features to simplify networking operations even further:
When it comes to APIs, Apigee can help you minimize many operational problems and improve observability through these key features:
As shown in the slide below, Apigee also offers security features that help address misconfigured APIs and identify malicious bots.
In this article, we covered Google Cloud’s operational tools portfolio and took a closer look at a few specific solutions and recently released capabilities that can help you improve day 2 operations in the cloud. That said, we understand that viewing various tools individually and making sense of their insights can become an operational task in itself, which is why the last solution we want to cover in this article is Active Assist.
Whether you’re interested in security, cost, networking, compute, data, or operations, Active Assist provides insights and recommendations that span across products, all in a single place.
The address type is part of the versionedResources metadata for the compute.googleapis.com/Address resource. Unfortunately, it's not currently possible to search versionedResources directly in Cloud Asset Inventory. You can, however, do one of the following:
- Export Cloud Asset Inventory to BigQuery and write your own queries.
- If you have a Linux shell and jq installed, you can stitch together the following to get a CSV-formatted list of all external IP addresses in the organization:
gcloud asset search-all-resources \
--scope=organizations/{org-id} \
--asset-types='compute.googleapis.com/Address' \
--read-mask='*' \
--format=json \
| jq -r '.[] | select(.versionedResources[].resource.addressType=="EXTERNAL") | [.displayName, .versionedResources[].resource.address, .parentFullResourceName] | @csv'
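If jq is not available, the same filtering can be sketched in Python against the JSON that the gcloud command prints with `--format=json`. This is a rough equivalent under stated assumptions, not a drop-in replacement: it emits one row per matching entry in `versionedResources`, and the sample records below are made up, shaped only after the fields the jq filter reads (`displayName`, `parentFullResourceName`, `versionedResources[].resource.addressType`, and `.address`).

```python
import csv
import io
import json


def external_addresses_csv(search_results_json):
    """Return CSV rows (name, address, parent) for EXTERNAL addresses."""
    rows = []
    for asset in json.loads(search_results_json):
        for version in asset.get("versionedResources", []):
            resource = version.get("resource", {})
            if resource.get("addressType") == "EXTERNAL":
                rows.append([
                    asset.get("displayName"),
                    resource.get("address"),
                    asset.get("parentFullResourceName"),
                ])
    out = io.StringIO()
    csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up sample data for illustration; real output contains more fields.
sample = json.dumps([
    {
        "displayName": "web-ip",
        "parentFullResourceName": "//cloudresourcemanager.googleapis.com/projects/123",
        "versionedResources": [
            {"resource": {"addressType": "EXTERNAL", "address": "203.0.113.10"}}
        ],
    },
    {
        "displayName": "db-ip",
        "parentFullResourceName": "//cloudresourcemanager.googleapis.com/projects/123",
        "versionedResources": [
            {"resource": {"addressType": "INTERNAL", "address": "10.0.0.5"}}
        ],
    },
])

result = external_addresses_csv(sample)
print(result, end="")
```

To use it with real data, you could save the gcloud output to a file and pass the file's contents to `external_addresses_csv`.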
Have additional questions? Please leave a comment below and someone from the Community or Google Cloud team will be happy to help.