Site Reliability Engineering (SRE) was born out of an internal need at Google in the early 2000s and has since evolved to become the industry-leading practice for service reliability.
In our latest Community event, Technical Account Manager, Pamella Canova (@pamcanova) shared the fundamental SRE principles and practices that you can apply in your organization that enable your systems to be more scalable, reliable, and efficient.
In this article, we share key takeaways from the presentation, the session recording, slides, and supporting resources so you can refer back to them at any time.
If you have any further questions, please add a comment below and we’d be happy to help!
It's our goal to provide a trusted space where you can receive support and guidance along your cloud journey. So if you have any feedback or topic requests for our next sessions, please let us know in the comments, or by submitting the feedback form.
You can keep an eye on upcoming sessions led by Google Cloud Technical Account Managers in the Community here. Thank you!
See the session in Portuguese here. Plus, see the slides in English and Portuguese attached at the bottom of this article.
In this article, we’ll cover five key areas of SRE:
Typically in software engineering, developers build new features, pass it along to operators, and then move on to the next thing. Whereas operators are responsible for everything from scalability, reliability, and maintenance of the software.
In other words, developers want to go as fast as they can, while operators are looking to slow things down to reduce risks.
Since each group's priorities are misaligned with the needs of the business, this kind of setup often creates tension between developers and operators, and ultimately, slows down the product lifecycle.
SRE was born at Google in the early 2000s when a team of software engineers led by Ben Treynor Sloss, was tasked to make Google’s already large-scale sites more reliable, efficient, and scalable. The practices they developed responded so well to Google’s needs that other big tech companies also adopted them and brought new practices to the table.
Site Reliability Engineering (SRE), as it has come to be generally defined at Google, is what happens when you ask a software engineer to solve an operational problem. SRE is an essential part of engineering at Google. It’s a mindset, job role, and a set of practices, metrics, and prescriptive ways to ensure systems reliability.
There’s a common misconception that 100% reliability should be the goal for your systems - that they’re available 100% of the time. However, having a 100% availability requirement slows down developers’ ability to deliver new features, updates, and improvements to a system.
Rather than a 100% reliability target, developers, product management, and SRE teams should define a mutually-acceptable availability target that will meet the needs of your users. This is where error budgets come into play.
The error budget is the amount of unavailability you’re willing to tolerate, which is used to achieve a balance between system stability and developer agility. As a general rule of thumb:
Hand in hand with error budgets are SLIs, SLOs, and SLAs.
Your error budget is calculated as 100% - SLO over a period of time. It indicates if your system has been more or less reliable than what is needed over a certain time period and how many minutes of downtime are allowed during that period.
For example, if your availability SLO is 99.9%, your error budget over a 30-day period is (1 - 0.999) x 30 days x 24 hours x 60 minutes = 43.2 minutes. So if your system has had 10 minutes of downtime in the past 30 days and started the 30-day period with the full error budget of 43.2 minutes unutilized, then the remaining error budget would be 33.2 minutes.
Your SLO is the target value for your SLI over a window of time, which is agreed upon by the business, developers, and operators.
So the concept of SLOs and error budgets are all good and well, but how do we actually put them into practice, defending them from day to day? We need to work in a few key areas of practice:
To maintain reliable systems, it’s important to measure how well you’re doing against your targets and to create alerts that signal when your error budget is threatened.
Google Cloud provides an SLO Monitoring feature that helps reduce the complexity of monitoring and alerting. With SLO Monitoring, the user selects the desired SLIs (reliability metrics) and SLOs (target values), and the framework handles the mathematical manipulations required to monitor SLOs and error budgets.
Demand forecasting and capacity planning are critical to make sure you have enough resources available to meet demand and your reliability goals, considering both organic growth (increased product adoption and usage by end users/customers) and inorganic growth (launches and new features that consume more resources).
Not only is it important to make sure you have enough capacity, but your CFO surely cares a lot about the cost and efficiency of running your service as well. If you over-provision resources, your service is running well, but you’re also spending unnecessarily on capacity you don’t need. If you under-provision resources, you’re saving more money, but at the detriment of your service performance and user experience.
Check out this article, Strategic Cloud Capacity Planning, for best practices based on the Google Cloud Architecture Framework.
Good change management is about managing risk - even with the smallest changes - in a routine, programmatic way. The business, developers, and operators should be aligned on change management processes, considering the following best practices:
It’s unrealistic to expect you’ll never have something break. So when it does, you need to have a well-defined process in place.
Both developers and SREs should know the emergency response processes, and practice them in advance, so when you’re faced with a real incident, you don’t panic and can focus on mitigating the issue and restoring the service to a healthy state. Once that’s done, then you can work on the longer-term work of troubleshooting and fixing underlying issues so that it will not happen again.
Before an incident occurs, you need to define your incident threshold - at what point do you declare an incident? When do you bring in other responders to help?
Typically, your incident threshold is determined by how much damage is done to your SLO (how much of your error budget you've consumed). If you're about to run out of your entire error budget for the quarter in the next few hours, you probably want to declare an incident. On the other hand, if you have it under control and your error budget isn't at risk, then you can spend a few more hours to look at what's happening on your own.
Define these thresholds ahead of time so that when things go wrong, people don't have to wonder or worry about whether to declare an incident.
After an incident is resolved, your focus should be on postmortem: making sure you’ve identified the root cause of the issue, preventing it from happening again, and having an action plan in place to improve for future scenarios. See a detailed postmortem checklist from Google Cloud here.
As a final note, postmortems must focus on a culture of blamelessness:
The last key practice area of SRE is a cultural one - the ability for SREs to regulate their operational workload, also known as toil management.
We define operational work as things you need to do to run a service that are manual, repetitive, automatable, reactive, without enduring value (no long-term system improvements), or that scale linearly (grows with user count or service size).
Operational work is important because not everything can be automated, and it provides the hands-on experience you need to improve your service, design better systems, and know what can be automated.
However, you don’t want all your SRE’s time spent on operational work. At Google, we have a policy of capping the amount of operational work at 50% of one’s time.
A good team culture with the right mix of skills and workload amongst team members helps alleviate operational burden and empower SREs to focus on work that delivers value for the business.
If you’re relatively new to the concept of SRE or are just beginning to stand up a culture of SRE in your organization, we recommend you start with these four things:
As a final piece of advice, start small and grow from there. Pick one service to run according to the SRE model, empower the team with strong executive sponsorship and support, provide a culture of psychological safety, measure your progress, and aim for improvements over time - not perfection.