We learn how to embrace risk as we continue our learning about Site Reliability Engineering while Johnny Underwood talks too much, Joe shares a (scary) journey through his mind, and Michael, Reader of Names, ends the show on a dark note.
Sadly, O’Reilly is ending their partnership with ACM, so you’ll no longer get access to their Learning Platform if you’re a member. (news.ycombinator.com)
Chapter 3: Embracing Risk
Google aims for 100% reliability right? Wrong…
Increasing reliability is always better for the service, right? Not necessarily.
It’s very expensive to add another 9 of reliability, and
You can’t iterate on features as quickly when you spend more time and resources making the service more stable.
Users don’t typically notice the difference between very reliable and extremely reliable services.
The systems using these services usually aren’t 100% reliable, so the chances of noticing are very low.
SREs try to balance the risk of unavailability against innovation, new features, and efficient service operations, rather than maximizing any one of them.
Unstable systems diminish user confidence. We want to avoid that.
Cost does not scale linearly with improvements to reliability.
As you improve reliability, the cost of each further improvement can actually increase many times over.
Two dimensions of cost:
Cost of redundancy in compute resources, and
The opportunity cost of trading features for reliability focused time.
SREs try to align the risk a service takes with the level of risk the business is willing to accept.
If the business goal is 99.99% reliability, then that’s exactly what the SRE will aim for, with maybe just a touch more.
They treat the target as both a minimum and a maximum.
Measuring Service Risk
Identify an objective metric for a property of the system to optimize.
Only by doing this can you measure improvements or degradation over time.
At Google, they focus on unplanned downtime.
Unplanned downtime is measured in relation to service availability.
Availability = Uptime / (Uptime + Downtime).
A 99.99% target means a maximum of 52.56 minutes downtime in a year.
At Google, they don’t use uptime as the metric as their services are globally distributed and may be up in many regions while being down in another.
Rather, they use the successful request rate.
Success rate = total successful requests / total requests.
A 99.99% target here would mean you could have 250 failures out of 2.5M requests in a day.
NOTE: not all services are the same.
A new user signup is likely way more important than a polling service for checking for new emails for a user.
At Google they also use this success rate for non-customer facing systems.
Google often sets quarterly availability targets and may track those targets weekly or even daily.
Doing so allows for fixing any issues as quickly as possible.
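The two availability formulas above are easy to sanity check with a few lines of Python. This is only a minimal sketch: the function names are our own, and a year is assumed to be 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def time_based_availability(uptime: float, downtime: float) -> float:
    """Availability = uptime / (uptime + downtime), in any consistent unit."""
    return uptime / (uptime + downtime)


def allowed_downtime_minutes(target: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Maximum downtime a period can absorb while still meeting the target."""
    return (1 - target) * period_minutes


def request_success_rate(successful: int, total: int) -> float:
    """Success rate = total successful requests / total requests."""
    return successful / total


def allowed_failures(target: float, total_requests: int) -> int:
    """Failed requests a target tolerates for a given request volume."""
    return round((1 - target) * total_requests)


if __name__ == "__main__":
    # 99.99% over a year leaves roughly 52.56 minutes of downtime.
    print(f"{allowed_downtime_minutes(0.9999):.2f}")  # prints 52.56
    # 99.99% of 2.5M daily requests leaves room for 250 failures.
    print(allowed_failures(0.9999, 2_500_000))  # prints 250
```

Both numbers match the ones quoted above, which is a nice way to see that the time-based and request-based views are the same idea applied to different units.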
Risk Tolerance of Services
SREs should work directly with the business to define goals that can be engineered.
This can be difficult: consumer services usually have a clear owner who can define those goals, whereas infrastructure services may not have a direct owner.
Identifying the Risk Tolerance of Consumer Services
Often a service will have its own dedicated team and that team will best know the reliability requirements of that service.
If there is no owning team, the engineers will often assume the role of defining the reliability requirements.
Factors in assessing the risk tolerance of a service
What level of availability is needed?
Do different failures have different effects on the service?
Use the service cost to help identify where on the risk continuum it belongs.
What are the important metrics to track?
Target level of availability
What do the users expect?
Is the service linked directly to revenue, either for Google or for a customer?
Is it a free or paid service?
If there’s a competing service, what is their level of service?
What’s the target market? Consumers or enterprises?
Consider Google Apps that drive businesses. Externally they may have a 99.9% reliability target because downtime really impacts the end business’s ability to carry out critical business processes. Internally they may have a higher reliability target to ensure the enterprises are getting the best level of customer service.
When Google purchased YouTube, its reliability target was set lower because Google was more focused on introducing features for the consumer.
Types of failures
Know the shapes of errors.
Which is worse, a constant trickle of errors throughout the day or a full site outage for a short amount of time?
Example they provided:
Intermittent avatars not loading so it’d show a missing icon on a page, vs
Potential issue where private user information may be leaked.
When the impact on user trust is large, a short period of full outage to fix the problem is preferable to risking a leak of sensitive information.
Another example they used was for ads:
Because most users used the ads system during working hours, they deemed it ok to have service periods (planned downtime) in off hours.
Cost is very high on the list of deciding factors for how reliable to make a service.
Questions to help determine cost vs reliability:
If we built in one more 9 of reliability, how much more revenue would it bring in?
Does the additional revenue offset the cost of that reliability goal?
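Those two questions boil down to simple arithmetic. Here is a rough sketch of that calculation. The function names, the $1M revenue figure, and the cost figures are illustrative assumptions, and it also assumes revenue scales directly with the fraction of requests that succeed.

```python
def value_of_extra_availability(current: float, proposed: float,
                                revenue: float) -> float:
    """Revenue gained by raising availability, assuming revenue is
    proportional to the fraction of requests served successfully."""
    return (proposed - current) * revenue


def worth_adding_a_nine(current: float, proposed: float,
                        revenue: float, cost: float) -> bool:
    """True when the extra revenue offsets the cost of the improvement."""
    return value_of_extra_availability(current, proposed, revenue) > cost


if __name__ == "__main__":
    # Hypothetical service: $1M revenue, going from 99.9% to 99.99%.
    gain = value_of_extra_availability(0.999, 0.9999, 1_000_000)
    print(round(gain))  # prints 900
    print(worth_adding_a_nine(0.999, 0.9999, 1_000_000, cost=500))     # True
    print(worth_adding_a_nine(0.999, 0.9999, 1_000_000, cost=50_000))  # False
```

With those made-up numbers, the extra 9 is only worth about $900, so it pays off only if the work costs less than that.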
Other service metrics
Knowing which metrics are important and which ones aren’t allows you to make better informed decisions.
Search’s primary metric was speed to results, i.e. lowest latency possible.
AdSense’s primary metric was making sure it didn’t slow down a page load it appeared on rather than the latency at which it appears.
Because of the looser goal on appearance latency, they could reduce their costs by reducing the number of regions AdSense is served by.
Identifying the Risk Tolerance of Infrastructure Services
Infrastructure services typically have different requirements than consumer services because they serve multiple clients.
Target level of availability
One approach to reliability may not be suitable for all needs.
Real time querying for online applications means it has a high availability/reliability requirement.
Offline analytical processing, however, has a lower availability requirement.
Using an always highly available reliability target for both use cases would be hyper expensive due to the amount of compute that would be required.
Types of failures
Real-time querying wants request queues to almost always be empty so it can service requests ASAP.
Offline analytical processing cares more about throughput, so it never wants the queues to be empty, i.e. always be processing.
Success and failure for both use cases are opposites in this scenario. It’s the same underlying infrastructure serving different use cases.
The services can be partitioned into different clusters based on needs.
A low latency/high availability Bigtable cluster offers a high level of service and is more costly.
A throughput cluster can be built with less redundancy and less headroom, so it is constantly processing, which makes it much more cost effective.
Exposing those cost savings to the end customer helps customers choose the right availability model for their real needs.
This is all done via delineated service levels.
Much of this can all be done via configurations of the various services, i.e. redundancy, amount of compute resources, etc.
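To make the cost difference between service levels a bit more concrete, here is a small sketch. Everything in it is hypothetical: the `ServiceLevel` class, the replica counts, the utilization targets, and the load numbers are made up for illustration and are not Google’s actual configuration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    replicas: int              # redundancy: full copies of the serving capacity
    target_utilization: float  # fraction of capacity kept busy; lower = more headroom


# Hypothetical tiers: the low-latency tier keeps spare headroom and extra copies,
# while the throughput tier runs hot with no redundancy.
LOW_LATENCY = ServiceLevel("low-latency", replicas=3, target_utilization=0.5)
THROUGHPUT = ServiceLevel("throughput", replicas=1, target_utilization=0.9)


def machines_needed(level: ServiceLevel, peak_load: float,
                    capacity_per_machine: float) -> int:
    """Machines required to serve peak_load at the tier's utilization target,
    multiplied by the number of redundant copies."""
    per_copy = math.ceil(peak_load / (capacity_per_machine * level.target_utilization))
    return per_copy * level.replicas


if __name__ == "__main__":
    for level in (LOW_LATENCY, THROUGHPUT):
        print(level.name, machines_needed(level, peak_load=900, capacity_per_machine=100))
    # prints:
    # low-latency 54
    # throughput 10
```

Even with invented numbers, the gap shows why exposing the price of each tier nudges customers toward the cheaper one when they don’t really need the low-latency guarantees.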
… Google SRE’s unofficial motto is “Hope is not a strategy”.
Site Reliability Engineering: How Google Runs Production Systems
Motivation for Error Budgets
Tensions form between feature development teams and SRE teams.
Software fault tolerance: How fault tolerant should the software be? How does it handle unexpected events?
Testing: Too little and it’s a bad end-user experience, too much and you never ship.
Push frequency: Code updates are risky. Should you reduce pushes or work on reducing the risks?
Canary duration and size: Test deploys on a subset of a usual workload. How long do you wait on canary testing and how big do you make the canary?
Forming Your Error Budget
Both teams should jointly define a quarterly error budget based on the service’s SLO (service level objective).
This determines how unreliable a service can be within a quarter.
This removes the politics between the SREs and product development teams.
Product management sets the SLO of the required uptime for the quarter.
Actual uptime is measured by an uninvolved third party, in Google’s case, “their monitoring system”.
The difference between the downtime allowed by the SLO and the actual downtime is the remaining budget.
As long as there is budget remaining, new releases and pushes are allowed.
This approach provides a good balance for both teams to succeed.
If the budget is nearly empty, the product developers will spend more time testing, hardening, or slowing release velocity.
This sort of has the effect of having a product development team become self-policing.
What about some uncontrollable event, such as hardware failures, etc.?
Everyone shares the same SLO, so those events still count against the budget and the number of releases will be reduced for the remainder of the quarter.
This also helps bring to light some of the overly aggressive reliability targets that can slow new features from being released. This may lead to renegotiating the SLO to allow for more feature releases.
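The error budget mechanics can be sketched in a few lines. This is only an illustration under our own assumptions: the `ErrorBudget` class and its field names are hypothetical, and it counts failed requests rather than downtime minutes, in line with the success-rate metric described earlier.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float              # e.g. 0.999, set by product management for the quarter
    expected_requests: int  # assumed request volume for the quarter
    failed_requests: int    # failures reported by the monitoring system so far

    @property
    def allowed_failures(self) -> float:
        """Total unreliability the SLO permits for the quarter."""
        return (1 - self.slo) * self.expected_requests

    @property
    def remaining(self) -> float:
        """Budget left: allowed failures minus the failures already spent."""
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        """New releases and pushes are allowed only while budget remains."""
        return self.remaining > 0


if __name__ == "__main__":
    healthy = ErrorBudget(slo=0.999, expected_requests=1_000_000, failed_requests=400)
    print(round(healthy.remaining), healthy.can_release())      # prints 600 True

    exhausted = ErrorBudget(slo=0.999, expected_requests=1_000_000, failed_requests=1_200)
    print(round(exhausted.remaining), exhausted.can_release())  # prints -200 False
```

Note that failures from hardware problems or other uncontrollable events land in `failed_requests` just like failures from bad pushes, which is why the whole team shares the consequences.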
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
There are a couple convenient flags for git checkout. Next time you are switching branches, try the --track or -t flag. It makes sure that your branch has your checkout.defaultRemote upstream set (typically “origin”), making for easier pulling and pushing. (git-scm.com)
git checkout -b <branchname> -t
There is a -vv flag you can pass to git branch to list all the branches you have locally, including the remote info if they are tracked so you can find any branches that don’t have the upstream set. (git-scm.com)
git branch -vv
You can configure git to always set up new branches so that git pull will automatically merge from the starting point branch (assuming you are tracking an upstream branch, see previous 2 tips.) (git-scm.com)
git config --global branch.autoSetupMerge always
From Michael Warren in the comments from last episode, Caffeine is an updated take on the caching code found in the Java Guava library from Google (GitHub)
Great tips from @msuriar!
Great talk from Tanya Reilly about “glue work”, some of the most important work can be hard to see and appreciate. How do we make this better? Technical leadership and glue work – Tanya Reilly | #LeadDevNewYork (YouTube)
Google has a free book available on Incident Response! Great advice on handling and preventing incidents. Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
Minikube is a great way to run Kubernetes clusters locally. It’s cross platform and has a lot of nice features while also still being relatively simple to use and user-friendly. (minikube.sigs.k8s.io)
Minikube has addons that you can install that add additional capabilities, like a metrics server you can use to see what resources are being used, and by what!
minikube addons enable metrics-server
You can also run a “top” style command to see utilization once you have enabled the metrics. (linuxhint.com)
kubectl top pods
There’s also a dashboard that’s available that you can use to deploy, troubleshoot, manage resources, and make changes. (minikube.sigs.k8s.io)