Welcome to the morning edition of Coding Blocks as we dive into what service level indicators, objectives, and agreements are while Michael clearly needs more sleep, Allen doesn’t know how web pages work anymore, and Joe isn’t allowed to beg.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Monolithic repos … meh. But monolithic builds … oh noes.
Chapter 4: Service Level Objectives
Service Level Indicators
A carefully defined, quantitative measure of some aspect of the service or system.
Response latency, error rate, system throughput are common SLIs.
SLIs are typically aggregated over some predefined period of time.
Usually, SLIs directly measure some aspect of a system but it’s not always possible, as with client side latency.
Availability is one of the most important SLIs. It is often expressed as the fraction of well-formed requests that succeed, sometimes called yield.
For storage purposes, durability, i.e. the retention of the data over time, is important.
Service Level Objectives
The SLO is the range of values that you want to achieve with your SLIs.
Choosing SLOs can be difficult. For one, you may not have any say in it!
An example of an SLO would be for response latency to be less than 250ms.
Often one SLI can impact another. For instance, if your number of requests per second rises sharply, so might your latency.
It is important to define SLOs so that users of the system have a realistic understanding of what the availability or reliability of the system is. This eliminates arbitrary “the system is slow” or the “system is unreliable” comments.
Google provided an example of a system called Chubby that is used extensively within Google where teams built systems on top of Chubby assuming that it was highly available, but no claim was made to that end.
Sort of crazy, but to ensure service owners didn’t have unrealistic expectations of Chubby’s uptime, they actually force downtime during the quarter.
Service Level Agreements
These are the agreements of what is to happen if/when the SLOs aren’t met.
If there is no consequence, then you’re likely talking about an SLO and not an SLA.
Typically, an SLA’s consequences are monetary, i.e. there will be a credit to your bill if some service doesn’t meet its SLO.
SLAs are typically decided by the business, but SREs help in making sure SLO consequences don’t get triggered.
SREs also help come up with objective ways to measure the SLOs.
Google search doesn’t have an SLA, even though Google has a very large stake in ensuring search is always working.
However, Google for Work does have SLAs with its business customers.
What Should You Care About?
You should not use every metric you can find as SLIs.
Too many and it’s just noisy and hard to know what’s important to look at.
Too few and you may have gaps in understanding the system reliability.
A handful of carefully selected metrics should be enough for your SLIs.
User facing services:
Availability – could the request be serviced,
Latency – how long did it take the request to be serviced, and
Throughput – how many requests were able to be serviced.
Storage systems:
Latency – how long did it take to read/write,
Availability – was it available when it was requested, and
Durability – is the data still there when needed.
Big data systems:
Throughput – how much data is being processed, and
End to end latency – how long from ingestion to completion of processing.
Everything should care about correctness.
Many metrics come from the server side.
Some metrics can be scraped from logs.
Don’t forget about client-side metric gathering as there might be some things that expose bad user experiences.
Raw numbers/metrics are typically aggregated, but you have to be careful.
Aggregations can hide true system behavior.
Example given averaging requests per second: if odd seconds have 200 requests per second and even seconds have 0, then your average is 100 but what’s being hidden is your true burst rate of 200 requests.
Same thing with latencies, averaging latencies may paint a pretty picture but the long tail of latencies may be terrible for a handful of users.
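The requests-per-second example can be reproduced in a few lines. This is only a sketch: the one-minute window and the variable names are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical per-second request counts for one minute:
# odd seconds see 200 requests, even seconds see none.
requests_per_second = [200 if second % 2 else 0 for second in range(60)]

average_rate = mean(requests_per_second)  # what a per-minute average reports
peak_rate = max(requests_per_second)      # the burst rate the average hides

print(f"average: {average_rate} req/s, peak: {peak_rate} req/s")
```

A dashboard showing only the average would suggest the service needs capacity for half of what it actually has to absorb in a busy second.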
Using distributions may be more effective at seeing the true story behind metrics.
In Prometheus, a Summary metric reports quantiles so that you can see typical and worst case scenarios.
Quantile of 50% would show you the median (typical) request, while
Quantile of 99.99% would show you the slowest request durations.
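To make the difference between an average and a quantile concrete, here is a small sketch in plain Python. It is not the Prometheus client and not how a Summary computes its quantiles internally; the `percentile` helper (nearest-rank method) and the sample latencies are assumptions chosen for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [1500, 2000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)  # typical request
p99_ms = percentile(latencies_ms, 99)  # long tail

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p99={p99_ms}ms")
```

With this data the mean sits far from both the typical request and the slow tail, which is why looking at the 50th percentile (median) and a high percentile together tells a clearer story.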
A really interesting takeaway here is that studies have shown that users prefer a system with low-variance but slower over a system with high variance but mostly faster.
In a low-variance system, SREs can focus on the 99% or 99.99% numbers, and if those are good, then everything else must be, too.
At Google, they prefer distributions over averages because distributions show the long tail of data points. As mentioned earlier, averages can hide problems.
Also, don’t assume that data is distributed normally. You need to see the real results.
Another important point here is if you don’t truly understand the distribution of your data, your system may be taking actions that are wrong for the situation. For instance, if you think that you are seeing long latency times but you don’t realize that those latencies actually occur quite often, your systems may be restarting themselves prematurely.
Standardize some SLIs
This just means if you standardize on how, when, and what tools you use for gathering some of the metrics, you don’t have to convince or describe those metrics on every new service or project. Examples might include:
Aggregation intervals – distribution per minute, and
Frequency of metrics gathered – pick a time such as every 5 seconds, 10, etc.
Build reusable SLI templates so you don’t have to recreate the wheel every time.
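One way to picture a reusable SLI template is a small shared definition that every new service starts from. The sketch below is only an illustration; the `SliTemplate` class, its field names, and its default values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SliTemplate:
    """Hypothetical shared defaults for how an SLI is gathered and aggregated."""
    name: str
    unit: str
    aggregation_window_seconds: int = 60  # e.g. one distribution per minute
    scrape_interval_seconds: int = 10     # e.g. gather a sample every 10 seconds

    def samples_per_window(self) -> int:
        """How many raw samples feed each aggregated data point."""
        return self.aggregation_window_seconds // self.scrape_interval_seconds

# New services reuse the defaults and override only what differs.
latency_sli = SliTemplate(name="response_latency", unit="ms")
throughput_sli = SliTemplate(
    name="requests_per_second", unit="req/s", scrape_interval_seconds=5
)

print(latency_sli.samples_per_window(), throughput_sli.samples_per_window())
```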
Objectives in Practice
Find out what the users care about, not what you can measure!
If you choose what’s easy to measure, your SLOs may not be all that useful.
SLOs should define how they’re measured and what conditions make them valid.
Example of a good SLO definition – 99% of RPC calls, averaged over one minute, return in less than 100ms as measured across all back-end servers.
It is unrealistic to expect your SLOs to be met 100% of the time.
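An SLO written this precisely can be checked mechanically. Here is a rough sketch of checking a single one-minute window; the function name, the parameters, and the decision to treat an empty window as passing are assumptions.

```python
def meets_latency_slo(latencies_ms, threshold_ms=100, target_percent=99):
    """Return True if at least target_percent of the calls in this window
    completed in under threshold_ms."""
    if not latencies_ms:
        return True  # assumption: a window with no calls does not violate the SLO
    fast_calls = sum(1 for latency in latencies_ms if latency < threshold_ms)
    # Integer comparison avoids floating point rounding at the boundary.
    return fast_calls * 100 >= target_percent * len(latencies_ms)

# Hypothetical one-minute windows of RPC latencies gathered across all back ends.
good_window = [40] * 99 + [250]     # 99 of 100 calls under 100ms
bad_window = [40] * 98 + [250] * 2  # only 98 of 100 calls under 100ms

print(meets_latency_slo(good_window), meets_latency_slo(bad_window))
```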
As we mentioned in the previous episode, striving for 100% takes time away from adding new features or makes your team design overly conservatively.
This is why you should operate with an error budget.
An error budget is just an SLO for meeting other SLOs!
Site Reliability Engineering: How Google Runs Production Systems
Don’t choose SLO targets based on current performance.
Keep the SLOs simple. Making them overly complex makes them hard to understand and can make it difficult to see the impact of system changes.
Avoid absolutes like “can scale infinitely”. It’s likely not true, and if it is, that means you had to spend a lot of time designing it to be that way, which is probably overkill.
Have as few SLOs as possible. You want just enough to be able to ensure you can track the status of your system and they should be defendable.
Perfection can wait. Start with loose targets that you can refine over time as you learn more.
SLOs should be a major driver in what SREs work on as they reflect what the business users care about. They can be used as a control loop:
Monitor system SLIs.
Compare SLIs to SLOs and see if action is needed.
If action is needed, figure out what action should be taken.
Take the action.
Example that was given is if you see latency climbing, and it appears to be CPU bound, then increasing the CPU capacity should lower latencies and not trigger an SLO consequence.
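The monitor, compare, decide, act steps above can be sketched as a small function. Everything here is hypothetical: the metric names, the thresholds, and the "add CPU capacity" action stand in for whatever a real monitoring and deployment setup would provide.

```python
from typing import Callable, Dict, List

def find_breaches(slis: Dict[str, float], slo_upper_bounds: Dict[str, float]) -> List[str]:
    """Compare SLIs to SLOs: names of SLIs whose value exceeds its upper bound."""
    return sorted(
        name for name, limit in slo_upper_bounds.items()
        if name in slis and slis[name] > limit
    )

def control_step(
    read_slis: Callable[[], Dict[str, float]],
    slo_upper_bounds: Dict[str, float],
    choose_action: Callable[[List[str], Dict[str, float]], Callable[[], None]],
) -> List[str]:
    slis = read_slis()                                # 1. monitor the SLIs
    breaches = find_breaches(slis, slo_upper_bounds)  # 2. compare to the SLOs
    if breaches:
        action = choose_action(breaches, slis)        # 3. figure out the action
        action()                                      # 4. take the action
    return breaches

# Hypothetical wiring: latency is climbing and the service looks CPU bound.
actions_taken: List[str] = []

def read_slis() -> Dict[str, float]:
    return {"latency_ms": 310.0, "cpu_utilization": 0.95}

def choose_action(breaches: List[str], slis: Dict[str, float]) -> Callable[[], None]:
    if "latency_ms" in breaches and slis["cpu_utilization"] > 0.9:
        return lambda: actions_taken.append("add CPU capacity")
    return lambda: actions_taken.append("investigate")

breaches = control_step(read_slis, {"latency_ms": 250.0}, choose_action)
print(breaches, actions_taken)
```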
SLOs Set Expectations
Publishing SLOs makes it so users know what to expect.
You may want to use one of the following approaches:
Keep a safety margin by having a stricter internal SLO than the public facing SLO.
Don’t overachieve. If your performance is consistently better than your SLO, it might be worth introducing purposeful downtime to set user expectations more in line with the SLO, i.e. failure injection.
Agreements in Practice
The SRE’s role is to help those writing SLAs understand the likelihood or difficulty of meeting the SLOs/SLA being implemented.
You should be conservative in the SLOs and SLAs that you make publicly available.
These are very difficult to change once they’ve been made public.
The term SLA is often misused when people are actually talking about an SLO. An SLA breach may trigger a court case.
If you can’t win an argument about a particular SLO, it’s probably not worth having an SRE team work on it.
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
If you switch to a Mac and you’re struggling with the CMD / CTRL switch from Windows, look for driver software from the keyboard manufacturer as they likely have an option to swap the keys for you!
Metrics aren’t free! Be careful to watch your costs or you can get up to babillions quickly!
Did you know there is a file format you can use to import bookmarks? It’s really simple, just an HTML file. You can even use it for onboarding! (j11g.com)
Powerlevel10k is a Zsh theme that looks nice and is easy to configure, but it’s also good about caching your git status so it doesn’t bog down your computer trying to pull the status on every command, a must for Zsh users with large repos! (GitHub)
We learn how to embrace risk as we continue our learning about Site Reliability Engineering while Johnny Underwood talked too much, Joe shares a (scary) journey through his mind, and Michael, Reader of Names, ends the show on a dark note.
Sadly, O’Reilly is ending their partnership with ACM, so you’ll no longer get access to their Learning Platform if you’re a member. (news.ycombinator.com)
Chapter 3: Embracing Risk
Google aims for 100% reliability right? Wrong…
Increasing reliability is always better for the service, right? Not necessarily.
It’s very expensive to add another 9 of reliability, and
Can’t iterate on features as you spend more time and resources making the service more stable.
Users don’t typically notice the difference between very reliable and extremely reliable services.
The systems using these services usually aren’t 100% reliable, so the chances of noticing are very low.
SREs try to balance the risk of unavailability with innovation, new features, and efficient service operations by optimizing for the right balance of all.
Unstable systems diminish user confidence. We want to avoid that.
Cost does not scale linearly with improvements to reliability.
As you improve reliability the cost can actually increase many times over.
Two dimensions of cost:
Cost of redundancy in compute resources, and
The opportunity cost of trading features for reliability focused time.
SREs try to align the risk a service takes on with the level of reliability the business actually needs.
If the business goal is 99.99% reliable, then that’s exactly what the SRE will aim for, with maybe just a touch more.
They treat the target as both a minimum and a maximum.
Measuring Service Risk
Identify an objective metric for a property of the system to optimize.
Only by doing this can you measure improvements or degradation over time.
At Google, they focus on unplanned downtime.
Unplanned downtime is measured in relation to service availability.
Availability = Uptime / (Uptime + Downtime).
A 99.99% target means a maximum of 52.56 minutes downtime in a year.
At Google, they don’t use uptime as the metric as their services are globally distributed and may be up in many regions while being down in another.
Rather, they use the successful request rate.
Success rate = total successful requests / total requests.
A 99.99% target here would mean you could have 250 failures out of 2.5M requests in a day.
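Both availability formulas and the two numbers quoted above can be checked with a short sketch. The function names are made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year

def time_based_availability(uptime, downtime):
    """Availability = Uptime / (Uptime + Downtime)."""
    return uptime / (uptime + downtime)

def allowed_downtime_minutes(target, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime in the period that still meets the availability target."""
    return (1 - target) * period_minutes

def request_success_rate(successful_requests, total_requests):
    """Success rate = total successful requests / total requests."""
    return successful_requests / total_requests

def allowed_failures(target, total_requests):
    """Maximum failed requests that still meet the success rate target."""
    return (1 - target) * total_requests

print(f"{allowed_downtime_minutes(0.9999):.2f} minutes of downtime per year")
print(f"{allowed_failures(0.9999, 2_500_000):.0f} failed requests out of 2.5M")
```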
NOTE: not all services are the same.
A new user signup is likely way more important than a polling service for checking for new emails for a user.
At Google they also use this success rate for non-customer facing systems.
Google often sets quarterly availability targets and may track those targets weekly or even daily.
Doing so allows for fixing any issues as quickly as possible.
Risk Tolerance of Services
SREs should work directly with the business to define goals that can be engineered.
Sometimes this can be difficult: the goals for consumer services are usually clearly definable, whereas infrastructure services may not have a direct owner.
Identifying the Risk Tolerance of Consumer Services
Often a service will have its own dedicated team and that team will best know the reliability requirements of that service.
If there is no owning team, often times the engineers will assume the role of defining the reliability requirements.
Factors in assessing the risk tolerance of a service
What level of availability is needed?
Do different failures have different effects on the service?
Use the service cost to help identify where on the risk continuum it belongs.
What are the important metrics to track?
Target level of availability
What do the users expect?
Is the service linked directly to revenue, either for Google or for a customer?
Is it a free or paid service?
If there’s a competing service, what is their level of service?
What’s the target market? Consumers or enterprises?
Consider Google Apps, which businesses depend on. Externally it may have a 99.9% reliability target because downtime really impacts the end business’s ability to carry out critical business processes. Internally it may have a higher reliability target to ensure the enterprises are getting the best level of customer service.
When Google purchased YouTube, its reliability target was lower because Google was more focused on introducing features for the consumer.
Types of failures
Know the shapes of errors.
Which is worse, a constant trickle of errors throughout the day or a full site outage for a short amount of time?
Example they provided:
Intermittent avatars not loading so it’d show a missing icon on a page, vs
Potential issue where private user information may be leaked.
Because leaking private information has a large impact on user trust, a short period of full outage to fix the problem is preferable to the potential of leaking sensitive information.
Another example they used was for ads:
Because most users used the ads system during working hours, they deemed it ok to have service periods (planned downtime) in off hours.
Cost is very high on the list of deciding factors for how reliable to make a service.
Questions to help determine cost vs reliability:
If we built in one more 9 of reliability, how much more revenue would it bring in?
Does the additional revenue offset the cost of that reliability goal?
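These two questions boil down to simple arithmetic. The sketch below uses made-up numbers (a service earning $1M a year, moving from 99.9% to 99.99%, with an assumed $50,000 engineering cost) and assumes revenue scales directly with availability, which is a simplification.

```python
def value_of_availability_gain(current_target, proposed_target, annual_revenue):
    """Extra revenue from the added availability, assuming revenue is
    directly proportional to availability (a simplifying assumption)."""
    return (proposed_target - current_target) * annual_revenue

# Hypothetical inputs, not figures from any real service.
annual_revenue = 1_000_000
extra_value = value_of_availability_gain(0.999, 0.9999, annual_revenue)
cost_of_extra_nine = 50_000  # assumed cost of the reliability work

worth_it = extra_value > cost_of_extra_nine
print(f"extra revenue: ${extra_value:,.0f}, worth it: {worth_it}")
```

With these numbers the extra nine brings in roughly $900 a year, so it only makes sense if it costs less than that.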
Other service metrics
Knowing which metrics are important and which ones aren’t allows you to make better informed decisions.
Search’s primary metric was speed to results, i.e. lowest latency possible.
AdSense’s primary metric was making sure it didn’t slow down a page load it appeared on rather than the latency at which it appears.
Because of the looser goal on appearance latency, they could reduce their costs by reducing the number of regions AdSense is served by.
Identifying the Risk Tolerance of Infrastructure Services
Infrastructure services typically have different requirements than consumer services because they are serving multiple clients.
Target level of availability
One approach of reliability may not be suitable for all needs.
Real time querying for online applications means it has a high availability/reliability requirement.
Offline analytical processing, however, has a lower availability requirement.
Using an always highly available reliability target for both use cases would be hyper expensive due to the amount of compute that would be required.
Types of failures
Real-time querying wants request queues to almost always be empty so it can service requests ASAP.
Offline analytical processing cares more about throughput, so it never wants the queues to be empty, i.e. always be processing.
Success and failure for the two use cases are opposites in this scenario. It’s the same underlying infrastructure serving different use cases.
Can partition the services into different clusters based on needs.
Low latency/high availability Bigtable cluster is a high level of service and more costly.
A throughput cluster can be built with less redundancy and needs less headroom, meaning it is constantly processing, which makes it much more cost effective.
Exposing those cost savings to the end customer helps customers choose the right availability model for their real needs.
This is all done via delineated service levels.
Much of this can all be done via configurations of the various services, i.e. redundancy, amount of compute resources, etc.
… Google SRE’s unofficial motto is “Hope is not a strategy”.
Site Reliability Engineering: How Google Runs Production Systems
Motivation for Error Budgets
Tensions form between feature development teams and SRE teams.
Software fault tolerance: How fault tolerant should the software be? How does it handle unexpected events?
Testing: Too little and it’s a bad end-user experience, too much and you never ship.
Push frequency: Code updates are risky. Should you reduce pushes or work on reducing the risks?
Canary duration and size: Test deploys on a subset of a usual workload. How long do you wait on canary testing and how big do you make the canary?
Forming Your Error Budget
Both teams should define a quarterly error budget based on the service’s SLO (service level objectives).
This determines how unreliable a service can be within a quarter.
This removes the politics between the SREs and product development teams.
Product management sets the SLO of the required uptime for the quarter.
Actual uptime is measured by an uninvolved third party, in Google’s case, “their monitoring system”.
The difference between actual downtime and allowed downtime is the budget.
As long as there is budget remaining, new releases and pushes are allowed.
This approach provides a good balance for both teams to succeed.
If the budget is nearly empty, the product developers will spend more time testing, hardening, or slowing release velocity.
This sort of has the effect of having a product development team become self-policing.
What about some uncontrollable event, such as hardware failures, etc.?
Everyone shares the same SLOs, so the number of releases will be reduced for the remainder of the quarter.
This also helps bring to light some of the overly aggressive reliability targets that can slow new features from being released. This may lead to renegotiating the SLO to allow for more feature releases.
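The budget arithmetic described above can be sketched as follows. A 90-day quarter, a downtime-based measurement, and all function names are assumptions made for illustration.

```python
QUARTER_MINUTES = 90 * 24 * 60  # assumption: a 90-day quarter

def allowed_downtime(slo_target, period_minutes=QUARTER_MINUTES):
    """Total error budget for the period, in minutes of downtime."""
    return (1 - slo_target) * period_minutes

def remaining_budget(slo_target, observed_downtime_minutes,
                     period_minutes=QUARTER_MINUTES):
    """Allowed downtime minus what monitoring has measured so far."""
    return allowed_downtime(slo_target, period_minutes) - observed_downtime_minutes

def can_release(slo_target, observed_downtime_minutes):
    """New releases and pushes are allowed only while budget remains."""
    return remaining_budget(slo_target, observed_downtime_minutes) > 0

# Hypothetical quarter with a 99.9% SLO set by product management.
print(f"budget: {allowed_downtime(0.999):.1f} minutes")
print(can_release(0.999, observed_downtime_minutes=45))   # budget left
print(can_release(0.999, observed_downtime_minutes=140))  # budget spent
```

Downtime from uncontrollable events such as hardware failures is counted the same way, which is why it also reduces how many releases fit in the rest of the quarter.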
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
There are a couple convenient flags for git checkout. Next time you are switching branches, try the --track or -t flag. It makes sure that your branch has your checkout.defaultRemote upstream set (typically “origin”), making for easier pulling and pushing. (git-scm.com)
git checkout -b <branchname> -t
There is a -vv flag you can pass to git branch to list all the branches you have locally, including the remote info if they are tracked so you can find any branches that don’t have the upstream set. (git-scm.com)
git branch -vv
You can configure git to always set up new branches so that git pull will automatically merge from the starting point branch (assuming you are tracking an upstream branch, see previous 2 tips.) (git-scm.com)
git config --global branch.autoSetupMerge always
From Michael Warren in the comments from last episode, Caffeine is an updated take on the caching code found in the Java Guava library from Google (GitHub)
Great tips from @msuriar!
Great talk from Tanya Reilly about “glue work”, some of the most important work can be hard to see and appreciate. How do we make this better? Technical leadership and glue work – Tanya Reilly | #LeadDevNewYork (YouTube)
Google has a free book available on Incident Response! Great advice on handling and preventing incidents. Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
Minikube is a great way to run Kubernetes clusters locally. It’s cross platform and has a lot of nice features while also still being relatively simple to use and user-friendly. (minikube.sigs.k8s.io)
Minikube has addons that you can install that add additional capabilities, like a metrics server you can use to see what resources are being used, and by what!
minikube addons enable metrics-server
You can also run a “top” style command to see utilization once you have enabled the metrics. (linuxhint.com)
kubectl top pods
There’s also a dashboard that’s available that you can use to deploy, troubleshoot, manage resources, and make changes. (minikube.sigs.k8s.io)