Welcome to the morning edition of Coding Blocks as we dive into what service level indicators, objectives, and agreements are while Michael clearly needs more sleep, Allen doesn’t know how web pages work anymore, and Joe isn’t allowed to beg.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Monolithic repos … meh. But monolithic builds … oh noes.
Chapter 4: Service Level Objectives
Service Level Indicators
A very well and carefully defined metric of some aspect of the service or system.
Response latency, error rate, system throughput are common SLIs.
SLIs are typically aggregated over some predefined period of time.
Usually, SLIs directly measure some aspect of a system but it’s not always possible, as with client side latency.
Availability is one of the most important SLIs often expressed as a ratio of the number of requests that succeed, sometimes called yield.
For storage purposes, durability, i.e. the retention of the data over time, is important.
Service Level Objectives
The SLO is the range of values that you want to achieve with your SLIs.
Choosing SLOs can be difficult. For one, you may not have any say in it!
An example of an SLO would be for response latency to be less than 250ms.
Often one SLI can impact another. For instance, if your number of requests per second rises sharply, so might your latency.
It is important to define SLOs so that users of the system have a realistic understanding of what the availability or reliability of the system is. This eliminates arbitrary “the system is slow” or the “system is unreliable” comments.
Google provided an example of a system called Chubby that is used extensively within Google where teams built systems on top of Chubby assuming that it was highly available, but no claim was made to that end.
Sort of crazy, but to ensure service owners didn’t have unrealistic expectations on the Chubby’s up-time, they actually force downtime through the quarter.
Service Level Agreements
These are the agreements of what is to happen if/when the SLOs aren’t met.
If there is no consequence, then you’re likely talking about an SLO and not an SLA.
Typically, SLA’s consequences are monetary, i.e. there will be a credit to your bill if some service doesn’t meet it’s SLO.
SLAs are typically decided by the business, but SREs help in making sure SLO consequences don’t get triggered.
SREs also help come up with objective ways to measure the SLOs.
Google search doesn’t have an SLA, even though Google has a very large stake in ensuring search is always working.
However, Google for Work does have SLAs with its business customers.
What Should You Care About?
You should not use every metric you can find as SLIs.
Too many and it’s just noisy and hard to know what’s important to look at.
Too few and you may have gaps in understanding the system reliability.
A handful of carefully selected metrics should be enough for your SLIs.
User facing services:
Availability – could the request be serviced,
Latency – how long did it take the request to be serviced, and
Throughput – how many requests were able to be serviced.
Latency – how long did it take to read/write,
Availability – was it available when it was requested, and
Durability – is the data still there when needed.
Big data systems:
Throughput – how much data is being processed, and
End to end latency – how long from ingestion to completion of processing.
Everything should care about correctness.
Many metrics come from the server side.
Some metrics can be scraped from logs.
Don’t forget about client-side metric gathering as there might be some things that expose bad user experiences.
Typically aggregate raw numbers/metrics but you have to be careful.
Aggregations can hide true system behavior.
Example given averaging requests per second: if odd seconds have 200 requests per second and even seconds have 0, then your average is 100 but what’s being hidden is your true burst rate of 200 requests.
Same thing with latencies, averaging latencies may paint a pretty picture but the long tail of latencies may be terrible for a handful of users.
Using distributions may be more effective at seeing the true story behind metrics.
In Prometheus, using a Summary metric uses quantiles so that you can see typical and worst case scenarios.
Quantile of 50% would show you the average request, while
Quantile of 99.99% would show you the worst request durations.
A really interesting takeaway here is that studies have shown that users prefer a system with low-variance but slower over a system with high variance but mostly faster.
In a low-variance system, SREs can focus on the 99% or 99.99% numbers, and if those are good, then everything else must be, too.
At Google, they prefer distributions over averages as they show the long-tail of data points, as mentioned earlier, averages can hide problems.
Also, don’t assume that data is distributed normally. You need to see the real results.
Another important point here is if you don’t truly understand the distribution of your data, your system may be taking actions that are wrong for the situation. For instance, if you think that you are seeing long latency times but you don’t realize that those latencies actually occur quite often, your systems may be restarting themselves prematurely.
Standardize some SLIs
This just means if you standardize on how, when, and what tools you use for gathering some of the metrics, you don’t have to convince or describe those metrics on every new service or project. Examples might include:
Aggregation intervals – distribution per minute, and
Frequency of metrics gathered – pick a time such as every 5 seconds, 10, etc.
Build reusable SLI templates so you don’t have to recreate the wheel every time.
Objectives in Practice
Find out what the users care about, not what you can measure!
If you choose what’s easy to measure, your SLOs may not be all that useful.
SLOs should define how they’re measured and what conditions make them valid.
Example of a good SLO definition – 99% of RPC calls averaged over one minute return in 100ms as measured across all back-end servers.
It is unrealistic to have your SLOs met 100%..
As we mentioned in the previous episode, striving for 100% takes time away from adding new features or makes your team design overly conservatively.
This is why you should operate with an error budget.
An error budget is just an SLO for meeting other SLOs!
Site Reliability Engineering: How Google Runs Production Systems
Don’t choose SLO targets based on current performance.
Keep the SLOs simple. Making them overly complex makes them hard to understand and may be difficult to see impacts of system changes.
Avoid absolutes like “can scale infinitely”. It’s likely not true, and if it is, that means you had to spend a lot of time designing it to be that way and is probably overkill.
Have as few SLOs as possible. You want just enough to be able to ensure you can track the status of your system and they should be defendable.
Perfection can wait. Start with loose targets that you can refine over time as you learn more.
SLOs should be a major driver in what SREs work on as they reflect what the business users care about
Monitor system SLIs.
Compare SLIs to SLOs and see if action is needed.
If action is needed, figure out what action should be taken.
Take the action.
Example that was given is if you see latency climbing, and it appears to be CPU bound, then increasing the CPU capacity should lower latencies and not trigger an SLO consequence.
SLOs Set Expectations
Publishing SLOs make it so users know what to expect.
You may want to use one of the following approaches:
Keep a safety margin by having a stricter internal SLO than the public facing SLO.
Don’t overachieve. If your performance is consistently better than your SLO, it might be worth introducing purposeful downtime to set user expectations more in line with the SLO, i.e. failure injection.
Agreements in Practice
The SRE’s role is to help those writing SLAs understand the likelihood or difficulty of meeting the SLOs/SLA being implemented.
You should be conservative in the SLOs and SLAs that you make publicly available.
These are very difficult to change once they’ve been made public.
SLAs are typically misused when actually talking about an SLO. SLA breaches may trigger a court case.
If you can’t win an argument about a particular SLO, it’s probably not worth having an SRE team work on it.
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
If you switch to a Mac and you’re struggling with the CMD / CTRL switch from Windows, look for driver software from the keyboard manufacturer as they likely have an option to swap the keys for you!
Metrics aren’t free! Be careful to watch your costs or you can get up to babillions quickly!
Did you know there is a file format you can use to import bookmarks? It’s really simple, just an HTML file. You can even use it for onboarding! (j11g.com)
Powerlevel10k is a Zsh theme that looks nice and is easy to configure, but it’s also good about caching your git status so it doesn’t bog down your computer trying to pull the status on every command, a must for Zsh users with large repos! (GitHub)