Architecture Framework Event Recap: Operational Excellence and Security Best Practices in Action

Lauren_vdv · 02-25-2022 02:01 PM

In the latest session of our Architecture Framework Ask Me Anything series, we focused on how to apply the best practices and guidance outlined in the Operational Excellence and Security Pillars of the Google Cloud Architecture Framework. @omkarsuram_, Program Lead of the Architecture Framework, led the session by outlining core operational excellence and cloud security principles, explaining how to apply them using a business example, and then answering questions live at the end.

In this blog, we share the session recording, written questions and answers, as well as supporting documentation and resources, so you can refer back to them at any time. If you have any further questions, please add a comment below and we’d be happy to help!

With this series, it's our goal to provide a trusted space where you can receive support and guidance along your cloud journey. So if you have any feedback or topic requests for our next sessions, please let us know in the comments, or by submitting the feedback form. You can keep an eye on upcoming sessions from the Cloud Events page in the Community. Thank you!

Session recording and slides

Watch the recording: https://youtu.be/cBeflYlfsa4

Core principles of Operational Excellence

The Operational Excellence Pillar of the Architecture Framework explains how to set up observability, automation, and scalability to efficiently run, manage, and monitor systems that deliver business value.

The core principles of the Operational Excellence Pillar are:

Automate your deployments: Standardize builds, tests, and deployments to eliminate human-induced errors in repeated processes.
Set up monitoring, alerting, and logging: Collect, analyze, and use information about your applications and infrastructure to guide business decisions.
Establish cloud support and escalation processes: A well-defined escalation process is key to reducing the time and effort it takes to identify and address any issues, including issues that require support from Google Cloud or other service providers.
Manage capacity and quota: Optimize your spending by evaluating your capacity requirements and setting quotas that restrict how much of a particular shared Google Cloud resource you can use.
Plan for peak traffic and launch events: Avoid business disruptions by planning resource requirements around business-related events that cause traffic increases beyond an application's standard baseline.
Create a culture of automation: Leverage automation to reduce toil (manual and repetitive work with no enduring value), improve release velocity, and minimize human-induced errors.

Core principles of Security

The Security Pillar of the Architecture Framework shows you how to architect and operate secure services on Google Cloud through multiple layered defenses, including IAM policies and controls, encryption, networking, monitoring, detection and logging.

The core principles of the Security Pillar are:

Build a layered security approach: Implement security at each level in your application and infrastructure by applying a defense-in-depth approach.
Design for secured decoupled systems: Simplify system design to accommodate flexibility where possible, and document security requirements for each component.
Automate deployment of sensitive tasks: Take humans out of the workstream by automating deployment and other admin tasks.
Automate security monitoring: Use automated tools to monitor your application and infrastructure.
Meet regional compliance requirements: Obfuscate or redact personally identifiable information (PII). Where possible, automate your compliance efforts.
Comply with data residency and sovereignty requirements: Control the locations of data storage and processing based on systems design objectives, industry regulatory concerns, national law, tax implications, and culture.
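Shift security left: DevOps and deployment automation let your organization increase the velocity of delivering products. To keep products secure at that pace, build security processes into the development process from the start rather than adding them at the end.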
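As a small illustration of the "obfuscate or redact PII" principle listed above, here is a minimal Python sketch that masks email addresses before text is logged or exported. The regular expression, the function name, and the placeholder are assumptions made for this example and are not part of the Architecture Framework. A simple pattern like this only catches one kind of PII, so treat it as a starting point rather than a compliance control.

```python
import re

# Deliberately simple email pattern (an assumption for this sketch);
# it will miss unusual addresses and does not cover other PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[REDACTED_EMAIL]") -> str:
    """Replace every email-like substring with a fixed placeholder."""
    return EMAIL_RE.sub(placeholder, text)


if __name__ == "__main__":
    print(redact_emails("Contact jane.doe@example.com for access."))
```

Running redaction automatically in the logging or export path, instead of relying on manual review, is in line with the pillar's advice to automate compliance efforts where possible.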

Operational Excellence and Security questions and answers

1. How do I identify potential gaps in my cloud security I’m not aware of, before an incident occurs?

First, identify your security risks. Think about it like red teaming for your environment - looking for what could go wrong and what you can do proactively to fix it. Set up guardrails and security policies, apply them as a standard across your deployments, and test them frequently with red teaming exercises.

We recommend that you use an industry-standard risk assessment framework that applies to cloud environments and to your regulatory requirements. For example, the Cloud Security Alliance (CSA) provides the Cloud Controls Matrix (CCM). In addition, there are threat models such as OWASP application threat modeling that provide you with a list of potential gaps, and that suggest actions to remediate any gaps that are found. You can also check our partner directory for a list of experts in conducting risk assessments for Google Cloud.

To help catalog your risks, consider leveraging Google Cloud’s security tool, Risk Manager, which is part of the Risk Protection Program (this program is currently in preview). Risk Manager scans your workloads to help you understand your business risks. Its detailed reports provide you with a security baseline, and you can use them to compare your risks against the risks outlined in the Center for Internet Security (CIS) Benchmark.

Specifically for CISOs and enterprise organizations, we recommend you look into the Google Cybersecurity Action Team, Google’s premier security advisory team. Its mission is to support customers’ security transformation, from the first transformation roadmap and implementation, through increasing cyber-resilience preparedness for potential incidents, to engineering new solutions as requirements change. The Cybersecurity Action Team website hosts a variety of resources tailored to security leaders, including the CISO’s Guide to Security Transformation, best practices for building secure and reliable systems, the Board of Directors Handbook for Cloud Risk Governance, and many more.

Ultimately, there are a few key security themes to consider when you’re operating in the cloud to minimize security gaps:

Think differently about security: Use a layered approach with different components, rather than a blanket approach.
Adopt a zero-trust philosophy: Minimize blast radius by making sure resources are only accessible to authorized users with a specific intent. This is enforced with policies or guardrail controls.
Adopt a culture of automation: Embrace automation so you can adhere to the speed and agility of DevOps, reduce operational overhead, and minimize risk at scale across your environment.

Supporting resources:

2. How do I know if I need to invest in automation? When is it advised versus when is it not?

This question brings up the important concept of “toil.” As defined in Google’s Site Reliability Engineering book, “Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” Examples of toil include:

Handling quota requests
Applying database schema changes
Reviewing non-critical monitoring alerts
Copying and pasting commands from a playbook

You should continually aim to reduce or eliminate toil. Otherwise, operational work can eventually overwhelm operators, and any growth in product use or complexity can require additional staffing.
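One rough way to decide whether a given piece of toil is worth automating is to compare the hours it consumes each month with the one-time effort of automating it. The Python sketch below does that for two of the toil examples above. All of the numbers, the `ToilTask` type, and the break-even heuristic are assumptions made for illustration and do not come from the session or the SRE book.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    """A recurring manual task. All figures here are illustrative."""
    name: str
    minutes_per_run: float   # hands-on time each time the task is done
    runs_per_month: float    # how often the task recurs
    automation_hours: float  # estimated one-time effort to automate it


def monthly_toil_hours(task: ToilTask) -> float:
    """Hours of manual work the task costs per month."""
    return task.minutes_per_run * task.runs_per_month / 60.0


def break_even_months(task: ToilTask) -> float:
    """Months until the automation effort is repaid by saved toil."""
    monthly = monthly_toil_hours(task)
    if monthly == 0:
        return float("inf")  # nothing to save, automation never pays off
    return task.automation_hours / monthly


tasks = [
    ToilTask("schema changes", minutes_per_run=45, runs_per_month=4, automation_hours=60),
    ToilTask("quota requests", minutes_per_run=20, runs_per_month=30, automation_hours=40),
]

# Shortest payback first: these are the strongest automation candidates.
ranked = sorted(tasks, key=break_even_months)

if __name__ == "__main__":
    for t in ranked:
        print(f"{t.name}: {monthly_toil_hours(t):.1f} h/month, "
              f"break-even after {break_even_months(t):.1f} months")
```

Because toil scales linearly as a service grows, the run counts tend to rise over time, so a static estimate like this one probably understates the payoff for a growing service.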

Automation is a key way to minimize toil, and can also help improve release velocity and minimize human-induced errors. To help you identify, measure, eliminate, and automate toil in your organization, refer to the following resources:

3. How can I integrate my existing security systems with cloud? What’s the quickest way to secure deployments?

Many Google Cloud customers coming from on-premises environments want to modernize their security. One of the easiest ways to do this is to create a Cloud Interconnect between your Google Cloud projects and your existing on-premises environments. By leveraging solutions like Cloud Armor, Cloud Load Balancing, and BeyondCorp, you can start to shift from a more traditional, perimeter security approach to a zero trust approach.

If you already use a SIEM system or other security monitoring system, integrate your Google Cloud assets with that system. Integration ensures that your organization has a single, comprehensive view into all resources, regardless of environment. Google Security Command Center helps you connect your Google Cloud deployments to existing security solutions.

Security is an ongoing effort - modernizing one component at a time allows you to gradually adopt the latest security practices. Once you have integrated your existing deployments, you can move workloads to the cloud and leverage modern security controls to secure your deployments.

Supporting resources:

4. How can I know how much capacity I would need in the future? What are other things I should consider?

To manage your capacity effectively, you need to know your organization's capacity requirements. Start by identifying your top cloud workloads and evaluating their average and peak utilization, as well as their current and future capacity needs.

Identify the teams who use these top workloads. Work with them to establish an internal demand-planning process. Use this process to understand their current and forecasted cloud resource needs.

Analyze load patterns and call distribution. Use factors such as the peak over the last 30 days, the hourly peak, and the per-minute peak in your analysis.
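The peak figures mentioned above can be derived from request timestamps. The following Python sketch is one possible approach, assuming you have exported a list of request times from your logs. The function name, the bucketing by calendar minute, hour, and day, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def peak_counts(timestamps, now, window_days=30):
    """Return the busiest minute, hour, and day within the lookback window."""
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in timestamps if cutoff <= t <= now]

    # Bucket each request by truncating its timestamp to the bucket start.
    per_minute = Counter(t.replace(second=0, microsecond=0) for t in recent)
    per_hour = Counter(t.replace(minute=0, second=0, microsecond=0) for t in recent)
    per_day = Counter(t.date() for t in recent)

    return {
        "peak_per_minute": max(per_minute.values(), default=0),
        "peak_per_hour": max(per_hour.values(), default=0),
        "peak_per_day": max(per_day.values(), default=0),
    }


# Hypothetical sample data: seven requests, one of them outside the window.
now = datetime(2022, 2, 25, 12, 0)
sample = [
    datetime(2022, 2, 24, 9, 15, 1),
    datetime(2022, 2, 24, 9, 15, 20),
    datetime(2022, 2, 24, 9, 15, 59),
    datetime(2022, 2, 24, 9, 40, 0),
    datetime(2022, 2, 24, 9, 40, 30),
    datetime(2022, 2, 20, 10, 0, 0),
    datetime(2022, 1, 1, 0, 0, 0),  # older than 30 days, ignored
]

if __name__ == "__main__":
    print(peak_counts(sample, now))
```

Comparing the per-minute peak with the hourly and daily peaks gives a sense of how bursty the traffic is, which matters when you size for launch events rather than for average load.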

Consider using Cloud Monitoring to get visibility into the performance, uptime, and overall health of your applications and infrastructure.

For more detailed instructions on evaluating and planning your capacity and quotas in the cloud, please refer to the following resources:

5. I want to control my own encryption keys. How can I do that securely?

By default, Google encrypts all your data at rest and in transit. There are a few different options to control your own encryption keys in Google Cloud.

One such option is with Cloud Key Management Service (Cloud KMS), which allows you to create, import, and manage cryptographic keys and perform cryptographic operations from a single centralized cloud service. With Cloud KMS, you’re the custodian of your data - you can manage keys in the cloud in the same ways you do on-premises, and have a monitorable source of trust over your data.

You can also use Cloud HSM, a cloud-hosted Hardware Security Module (HSM) service that allows you to host encryption keys on a dedicated hardware module.

You can also choose to manage your keys externally and provide them during API calls to encrypt and decrypt your data. Google doesn't store these keys, so you are responsible for keeping them secure and available.
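To make the "provide them during API calls" option more concrete, the Python sketch below generates a 256-bit key and builds the request headers that, to our understanding, Cloud Storage expects for customer-supplied encryption keys. The original answer does not name a specific service, so the choice of Cloud Storage and the helper function names are assumptions. The sketch only prepares headers and makes no network calls.

```python
import base64
import hashlib
import secrets


def new_customer_supplied_key() -> bytes:
    """Generate a random 256-bit (32-byte) AES key."""
    return secrets.token_bytes(32)


def csek_headers(raw_key: bytes) -> dict:
    """Build the encryption headers to send alongside an object request.

    The key and its SHA-256 digest are both base64-encoded.
    """
    if len(raw_key) != 32:
        raise ValueError("customer-supplied keys must be exactly 32 bytes")
    key_b64 = base64.b64encode(raw_key).decode("ascii")
    digest_b64 = base64.b64encode(hashlib.sha256(raw_key).digest()).decode("ascii")
    return {
        "x-goog-encryption-algorithm": "AES256",
        "x-goog-encryption-key": key_b64,
        "x-goog-encryption-key-sha256": digest_b64,
    }


if __name__ == "__main__":
    key = new_customer_supplied_key()
    # Print header names only; never log the key material itself.
    print(sorted(csek_headers(key)))
```

Since the key is not stored on the Google side, losing it means losing access to the data it protects. In practice you would keep the key in your own secrets store rather than generating it ad hoc as shown here.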

Another option is to use a third-party key management system through Cloud External Key Manager (Cloud EKM). Cloud EKM protects your data at rest by using encryption keys that are stored and managed in a third-party key management system that you control outside of the Google infrastructure.

As you move away from Google-managed services, you take on the operational risk of managing keys yourself. We recommend you evaluate whether you need to manage your own keys based on your compliance and regulatory requirements.

Supporting resources: