Coding Blocks

April 2022
S	M	T	W	T	F	S

					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30

Sun, 24 April 2022

Site Reliability Engineering – Service Level Indicators, Objectives, and Agreements

Welcome to the morning edition of Coding Blocks as we dive into what service level indicators, objectives, and agreements are while Michael clearly needs more sleep, Allen doesn’t know how web pages work anymore, and Joe isn’t allowed to beg.

The full show notes for this episode are available at https://www.codingblocks.net/episode183.

Survey Says

News

Monolithic repos … meh. But monolithic builds … oh noes.

Chapter 4: Service Level Objectives

Cover of the "Site Reliability Engineering" book from O'Reilly — The famous “SRE Book” from Google

Service Level Indicators

A very well and carefully defined metric of some aspect of the service or system.
- Response latency, error rate, system throughput are common SLIs.
SLIs are typically aggregated over some predefined period of time.
Usually, SLIs directly measure some aspect of a system but it’s not always possible, as with client side latency.
Availability is one of the most important SLIs often expressed as a ratio of the number of requests that succeed, sometimes called yield.
- For storage purposes, durability, i.e. the retention of the data over time, is important.

Service Level Objectives

The SLO is the range of values that you want to achieve with your SLIs.
Choosing SLOs can be difficult. For one, you may not have any say in it!
An example of an SLO would be for response latency to be less than 250ms.
Often one SLI can impact another. For instance, if your number of requests per second rises sharply, so might your latency.
It is important to define SLOs so that users of the system have a realistic understanding of what the availability or reliability of the system is. This eliminates arbitrary “the system is slow” or the “system is unreliable” comments.
- Google provided an example of a system called Chubby that is used extensively within Google where teams built systems on top of Chubby assuming that it was highly available, but no claim was made to that end.
- Sort of crazy, but to ensure service owners didn’t have unrealistic expectations on the Chubby’s up-time, they actually force downtime through the quarter.

Service Level Agreements

These are the agreements of what is to happen if/when the SLOs aren’t met.
If there is no consequence, then you’re likely talking about an SLO and not an SLA.
Typically, SLA’s consequences are monetary, i.e. there will be a credit to your bill if some service doesn’t meet it’s SLO.
SLAs are typically decided by the business, but SREs help in making sure SLO consequences don’t get triggered.
SREs also help come up with objective ways to measure the SLOs.
Google search doesn’t have an SLA, even though Google has a very large stake in ensuring search is always working.
However, Google for Work does have SLAs with its business customers.

What Should You Care About?

You should not use every metric you can find as SLIs.
- Too many and it’s just noisy and hard to know what’s important to look at.
- Too few and you may have gaps in understanding the system reliability.
- A handful of carefully selected metrics should be enough for your SLIs.

Some Examples

User facing services:
- Availability – could the request be serviced,
- Latency – how long did it take the request to be serviced, and
- Throughput – how many requests were able to be serviced.
Storage systems:
- Latency – how long did it take to read/write,
- Availability – was it available when it was requested, and
- Durability – is the data still there when needed.
Big data systems:
- Throughput – how much data is being processed, and
- End to end latency – how long from ingestion to completion of processing.
Everything should care about correctness.

Collecting Indicators

Many metrics come from the server side.
Some metrics can be scraped from logs.
Don’t forget about client-side metric gathering as there might be some things that expose bad user experiences.
- Example Google used is knowing what the latency before a page can be used is as it could be bad due to some JavaScript on the page.

Aggregation

Typically aggregate raw numbers/metrics but you have to be careful.
- Aggregations can hide true system behavior.
  - Example given averaging requests per second: if odd seconds have 200 requests per second and even seconds have 0, then your average is 100 but what’s being hidden is your true burst rate of 200 requests.
  - Same thing with latencies, averaging latencies may paint a pretty picture but the long tail of latencies may be terrible for a handful of users.
Using distributions may be more effective at seeing the true story behind metrics.
- In Prometheus, using a Summary metric uses quantiles so that you can see typical and worst case scenarios.
  - Quantile of 50% would show you the average request, while
  - Quantile of 99.99% would show you the worst request durations.
A really interesting takeaway here is that studies have shown that users prefer a system with low-variance but slower over a system with high variance but mostly faster.
- In a low-variance system, SREs can focus on the 99% or 99.99% numbers, and if those are good, then everything else must be, too.
At Google, they prefer distributions over averages as they show the long-tail of data points, as mentioned earlier, averages can hide problems.
- Also, don’t assume that data is distributed normally. You need to see the real results.
- Another important point here is if you don’t truly understand the distribution of your data, your system may be taking actions that are wrong for the situation. For instance, if you think that you are seeing long latency times but you don’t realize that those latencies actually occur quite often, your systems may be restarting themselves prematurely.

Standardize some SLIs

This just means if you standardize on how, when, and what tools you use for gathering some of the metrics, you don’t have to convince or describe those metrics on every new service or project. Examples might include:
- Aggregation intervals – distribution per minute, and
- Frequency of metrics gathered – pick a time such as every 5 seconds, 10, etc.
Build reusable SLI templates so you don’t have to recreate the wheel every time.

Objectives in Practice

Find out what the users care about, not what you can measure!
- If you choose what’s easy to measure, your SLOs may not be all that useful.

Defining Objectives

SLOs should define how they’re measured and what conditions make them valid.
- Example of a good SLO definition – 99% of RPC calls averaged over one minute return in 100ms as measured across all back-end servers.
It is unrealistic to have your SLOs met 100%..
- As we mentioned in the previous episode, striving for 100% takes time away from adding new features or makes your team design overly conservatively.
  - This is why you should operate with an error budget.

An error budget is just an SLO for meeting other SLOs!
Site Reliability Engineering: How Google Runs Production Systems

Choosing Targets

Don’t choose SLO targets based on current performance.
Keep the SLOs simple. Making them overly complex makes them hard to understand and may be difficult to see impacts of system changes.
Avoid absolutes like “can scale infinitely”. It’s likely not true, and if it is, that means you had to spend a lot of time designing it to be that way and is probably overkill.
Have as few SLOs as possible. You want just enough to be able to ensure you can track the status of your system and they should be defendable.
Perfection can wait. Start with loose targets that you can refine over time as you learn more.
SLOs should be a major driver in what SREs work on as they reflect what the business users care about

Control Measures

Kubernetes is great but … it’s complicated!

Monitor system SLIs.
Compare SLIs to SLOs and see if action is needed.
If action is needed, figure out what action should be taken.
Take the action.
Example that was given is if you see latency climbing, and it appears to be CPU bound, then increasing the CPU capacity should lower latencies and not trigger an SLO consequence.

SLOs Set Expectations

Publishing SLOs make it so users know what to expect.
You may want to use one of the following approaches:
- Keep a safety margin by having a stricter internal SLO than the public facing SLO.
- Don’t overachieve. If your performance is consistently better than your SLO, it might be worth introducing purposeful downtime to set user expectations more in line with the SLO, i.e. failure injection.

Agreements in Practice

The SRE’s role is to help those writing SLAs understand the likelihood or difficulty of meeting the SLOs/SLA being implemented.
You should be conservative in the SLOs and SLAs that you make publicly available.
- These are very difficult to change once they’ve been made public.
SLAs are typically misused when actually talking about an SLO. SLA breaches may trigger a court case.
If you can’t win an argument about a particular SLO, it’s probably not worth having an SRE team work on it.

Resources we Like

Links to Google’s free books on Site Reliability Engineering (sre.google)
Amazon S3 SLA (aws.amazon.com)

Tip of the Week

If you switch to a Mac and you’re struggling with the CMD / CTRL switch from Windows, look for driver software from the keyboard manufacturer as they likely have an option to swap the keys for you!
Metrics aren’t free! Be careful to watch your costs or you can get up to babillions quickly!
Did you know there is a file format you can use to import bookmarks? It’s really simple, just an HTML file. You can even use it for onboarding! (j11g.com)
Powerlevel10k is a Zsh theme that looks nice and is easy to configure, but it’s also good about caching your git status so it doesn’t bog down your computer trying to pull the status on every command, a must for Zsh users with large repos! (GitHub)
- Also mentioned in episode 172.

Direct download: coding-blocks-episode-183.mp3
Category:Software Development -- posted at: 8:01pm EDT

Sponsors