We haven’t finished the Site Reliability Engineering book yet as we learn how to monitor our systems, while the deals at Costco are so good Allen thinks they’re fake, Joe hasn’t attended a math class in a while, and Michael never had AOL.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
News
Thank you for the reviews! just_Bri, 1234556677888999900000, Mannc, good beer hunter
Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)
Monitor Some of the Things
Terminology
Monitoring – Collecting, processing, and aggregating quantitative information about a system.
White-box monitoring – Monitoring based on metrics exposed by a system, e.g., logs, JVM profiling, etc.
Black-box monitoring – Monitoring a system as a user would see it.
Dashboard – Provides a summary view of the most important service metrics. May display team information, ticket queue size, high priority bugs, current on call engineer, recent pushes, etc.
Alert – Notification intended to be read by a human, such as tickets, email alerts, pages, etc.
Root cause – A defect that, if corrected, gives high confidence that the same issue won’t be seen again. There can be multiple root causes for a particular incident (including a lack of testing!).
Node and machine – A single instance of a running kernel.
Kernel – The core of the operating system. Generally controls everything on the system, always resident in memory, and facilitates interactions between the system hardware and software. (Wikipedia)
There could be multiple services worth monitoring on the same node that could be either related or unrelated.
Push – Any change to a running service or its configuration.
Why Monitor?
Some of the main reasons include:
To analyze trends,
To compare changes over time,
To alert when there’s a problem,
To build dashboards that answer basic questions, and
To perform ad hoc analysis when things change in order to identify what may have caused it.
Monitoring lets you know when the system is broken or may be about to break.
You should never alert just because something seems off.
Paging a human is an expensive use of time.
Too many pages may be seen as noise and reduce the likelihood of thorough investigation.
Effective alerting systems have good signal and very low noise.
Setting Reasonable Expectations for Monitoring
Monitoring complex systems is a major undertaking.
The book mentions that Google SRE teams with 10-12 members have one or two people focused on building and maintaining their monitoring systems for their service.
They’ve reduced the headcount needed for maintaining these systems as they’ve centralized and generalized their monitoring systems, but there’s still at least one human dedicated to the monitoring system.
They also ensure that it’s not a requirement that an SRE stare at the screen to identify when a problem comes up.
Google has since moved to simpler and faster monitoring systems that provide better tools for ad hoc analysis, avoiding systems that try to determine causality.
This doesn’t mean they don’t monitor for major changes in common trends.
SREs at Google seldom use tiered rule triggering.
Why? Because they’re constantly changing their service and/or infrastructure.
When they do alert on these dependent types of rules, it’s for common tasks that are relatively simple to carry out.
From the instant a production issue arises, it is critical that the monitoring system alerts a human quickly and provides an easy-to-follow process that people can use to find the root cause.
Alerts need to be simple to understand and represent the failure clearly.
Symptoms vs Causes
A monitoring system should answer these two questions:
What is broken? This is the symptom.
Why is it broken? This is the cause.
The book says that drawing the line between the what and why is one of the most important ways to make a good monitoring system with high quality signals and low noise.
An example might be:
Symptom: The web server is returning 500s or 404s,
Cause: The database server ran out of hard-drive space.
Black-Box vs White-Box
Google SREs use white-box monitoring heavily, and use black-box monitoring much less, except for critical purposes.
White-box monitoring relies on inspecting the internals of a system.
Black-box monitoring is symptom oriented and helps identify unplanned issues.
An interesting takeaway for white-box monitoring is that it exposes issues that may otherwise be hidden by things like retries.
A symptom for one team can be a cause for another.
White-box monitoring is crucial for telemetry.
Example: The website thinks the database is slow, but does the database think it’s slow? If not, there may be a network issue (see the sketch below).
The benefit of black-box monitoring for alerting is that it indicates a problem that is currently happening; however, it’s basically useless for letting you know that a problem may be about to happen.
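To make the database-latency example concrete, here’s a minimal sketch in Python. Everything here is hypothetical (the `run_query` helper and its self-reported timing stand in for a real driver call); the idea is simply to compare the app’s round-trip view against the database’s white-box view of itself:

```python
import time

def run_query(conn, sql):
    """Stand-in for a real driver call. Returns (rows, seconds the
    database says it spent), e.g. parsed from the server's own timing."""
    time.sleep(0.005)  # pretend the DB did 5 ms of work
    return [], 0.005

def diagnose_slow_query(conn, sql):
    start = time.monotonic()
    rows, db_seconds = run_query(conn, sql)
    round_trip = time.monotonic() - start

    # If the database thinks it was fast but the round trip was slow,
    # suspect the network (or serialization) rather than the database.
    overhead = round_trip - db_seconds
    if overhead > 10 * max(db_seconds, 0.001):
        print(f"DB reports {db_seconds:.3f}s but round trip took "
              f"{round_trip:.3f}s: look at the network")
    return rows
```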
Four Golden Signals
Latency – The time it takes to service a request.
Important to separate successful request latency vs failed request latency.
A slow error is worse than a fast error!
Traffic – How much demand is being placed on your system, such as requests per second for a web request, or for streaming audio/video, it might be I/O throughput.
Errors – The rate of requests that fail, either explicitly or implicitly.
Explicit errors are things like a 500 HTTP response.
Implicit might be any request that took over 2 seconds to finish if your goal is to respond in less than 2 seconds.
Saturation – How full your service is.
A measure of resources that are the most constrained, such as CPU or I/O, but note that things usually start to degrade before 100% utilization.
This is why having a utilization target is important.
Latency increases are often indicators of saturation.
Measuring your 99th percentile response time over a small interval can be an early signal of saturation.
Saturation also concerns itself with predicting imminent issues, like drive space filling up, etc. A sketch of instrumenting all four signals follows this list.
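Here’s a minimal sketch of what instrumenting the four golden signals might look like, assuming the `prometheus_client` Python library, a made-up handler, and the 2-second latency goal mentioned above (the metric names and the `do_work` function are illustrative, not from the book):

```python
import time
from prometheus_client import Counter, Gauge, Histogram

# Traffic: overall demand placed on the system.
REQUESTS = Counter("app_requests_total", "Requests served")
# Errors: explicit (e.g., would become a 500) vs implicit (too slow).
ERRORS = Counter("app_errors_total", "Failed requests", ["kind"])
# Latency: labeled by outcome so successful and failed request
# latencies stay separate (a slow error is worse than a fast one!).
LATENCY = Histogram("app_request_seconds", "Request latency", ["outcome"])
# Saturation: how full the most constrained resource is.
QUEUE_FULLNESS = Gauge("app_queue_fullness_ratio", "Work queue fullness, 0.0-1.0")
QUEUE_FULLNESS.set(0.0)  # updated elsewhere as the queue fills

DEADLINE_SECONDS = 2.0  # the assumed "respond in under 2 seconds" goal

def do_work(request):
    return "ok"  # stand-in for real application logic

def handle(request):
    REQUESTS.inc()
    start = time.monotonic()
    outcome = "success"
    try:
        return do_work(request)
    except Exception:
        ERRORS.labels(kind="explicit").inc()
        outcome = "failure"
        raise
    finally:
        elapsed = time.monotonic() - start
        LATENCY.labels(outcome=outcome).observe(elapsed)
        if outcome == "success" and elapsed > DEADLINE_SECONDS:
            ERRORS.labels(kind="implicit").inc()  # "worked," but too slow
```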
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)
Tip of the Week
Prometheus has configurations that let you tune how often it looks for metrics, i.e. the scrape_interval. Scrape too often and you’re wasting resources; too seldom and you can miss important information and get false alerts. (Prometheus)
There’s a reason WordPress is so popular. It’s fast and easy to set up, especially if you use Webinonly. (Webinonly.com)
Looking for great encryption libraries for Java or PHP? Check out Bouncy Castle! (Bouncy Castle)
Big thanks to @bicylerepairmain for the tip on running lines of code in VS Code with a keyboard shortcut. The option workbench.action.terminal.runSelectedText is under File -> Preferences -> Keyboard Shortcuts. (Stack Overflow)
Need to see all of the files you’ve changed since you branched off of a commit? Use git diff --name-only COMMIT_ID_SHA HEAD. (git-scm.com)
Couple that with Allen’s tip from episode 182 to make it easier to find that starting point!
We say “toil” a lot this episode while Joe saw a movie, Michael says something controversial, and Allen’s tip is to figure it out yourself, all while learning how to eliminate toil.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Reviews
Thank you for the reviews! AA, Franklin MacDunnaduex, BillyVL, DOM3ag3
Toil is not just work you don’t wanna do, nor is it just administrative work or tedious tasks.
Toil is different for every individual.
Some administrative work has to be done and is not considered toil; rather, it’s overhead.
HR needs, trainings, meetings, etc.
Even some tedious tasks that pay long-term dividends can’t be considered toil.
Cleaning up service configurations was an example of this.
Toil, further defined, is work that is oftentimes manual, repetitive, automatable, tactical, lacking enduring value, and/or grows as the service does.
Manual – Something a human has to do.
Repetitive – Running something once or twice isn’t toil. Having to do it frequently is.
Automatable – If a machine can do it, then it should be done by the machine. If the task needs human judgement, it’s likely not toil.
Tactical – Interrupt driven rather than strategy driven. May never be able to eliminate completely but the goal is to minimize this type of work.
No enduring value – If your service didn’t change state after the task was completed, it was likely toil. If there was a permanent improvement in the state of the service then it likely wasn’t toil.
O(n) with service growth – If the amount of work grows with the growth of your service usage, then it’s likely toil.
Why is Less Toil Better?
At Google, the goal is to keep each SRE’s toil at less than 50%.
The other 50% should be developing solutions to reduce toil further, or make new features for a service.
Here, features mean things like improving reliability, performance, or utilization.
The goal is set at 50% because it can easily grow to 100% of an SRE’s time if not addressed.
The time spent reducing toil is the “engineering” in the SRE title.
This engineering time is what allows the service to scale with less time required by an SRE to keep it running properly and efficiently.
When Google hires an SRE, they promise that they don’t run a typical ops organization and mention the 50% rule. This is done to help ensure the group doesn’t turn into a full time ops team.
Calculating Toil
The book gave the example of a 6 person team and a 6 week cycle:
Assuming 1 week of primary on-call time and 1 week of secondary on-call time, an SRE has 2 of 6 weeks of “interrupt” type work, i.e. toil, meaning 33% is the lower bound of toil.
With an 8-person team, you move to an 8-week cycle, so 2 weeks on call out of 8 weeks means a 25% toil lower bound (see the quick math below).
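For the arithmetic itself, a quick sketch (the team sizes and on-call weeks come from the book’s example above):

```python
def toil_lower_bound(on_call_weeks, cycle_weeks):
    """Fraction of a cycle consumed by on-call (interrupt) work."""
    return on_call_weeks / cycle_weeks

# 1 week primary + 1 week secondary on-call per cycle:
print(f"{toil_lower_bound(2, 6):.0%}")  # 6-person team, 6-week cycle -> 33%
print(f"{toil_lower_bound(2, 8):.0%}")  # 8-person team, 8-week cycle -> 25%
```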
At Google, SREs report their toil is spent mostly on interrupts (non-urgent, service-related messages), then on-call urgent responses, then releases and pushes.
Surveys of Google SREs indicate that the average time spent on toil is closer to 33%.
Like all averages, it hides outliers, such as people who spend no time on toil and others who spend as much as 80% of their time on toil.
If there is someone taking on too much toil, it’s up to the manager to spread that work out better.
What Qualifies as Engineering?
Work that requires human judgement,
Work that produces permanent improvements in a service and requires strategy,
Work that takes a design-driven approach, and
Work that is as generic or general as possible, as it may then be applied to multiple services for even greater gains in efficiency and reliability.
Typical SRE Activities
Software engineering – Involves writing or modifying code.
Systems engineering – Configuring systems, modifying configurations, or documenting systems that provide long term improvements.
Toil – Work that is necessary to run a service but is manual, repetitive, etc.
Overhead – Administrative work not directly tied to a service such as hiring, HR paperwork, meetings, peer-reviews, training, etc.
The 50% goal is measured over a few quarters or a year. There may be some quarters where toil goes above 50%, but that shouldn’t be sustained. If it is, management needs to step in and figure out how to bring it back into the goal range.
“Let’s invent more, and toil less”
Site Reliability Engineering: How Google Runs Production Systems
Is Toil Always Bad?
The fact that some amount of toil is predictable and repeatable makes some individuals feel like they’re accomplishing something, i.e. quick wins that may be low risk and low stress.
Some amount of toil is expected and unavoidable.
When the amount of time spent on toil becomes too large, you should be concerned and “complain loudly”.
Potential issues with large amounts of toil:
Career stagnation – If you’re not spending enough time on projects, your career progression will suffer.
Low morale – Too much toil leads to burnout, boredom, and discontent.
Too much time on toil also hurts the SRE team.
Creates confusion – The SRE team is supposed to do engineering, and if that’s not happening, then the goal of the team doesn’t match the work being done by the team.
Slows progress – The team will be less productive if they’re focused on toil.
Sets precedent – If you take on too much toil regularly, others will give you more.
Promotes attrition – If your group takes on too much toil, talented engineers in the group may leave for a position with more development opportunities.
Causes breach of faith – If someone joins the team but doesn’t get to do engineering, they’ll feel like they were sold a bill of goods.
Commit to cleaning up a bit more toil each week with engineering activities.
Resources We Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
The Greatest Inheritance, uh stars Jaleel White (IMDb)
Clean Code – How to Write Amazing Unit Tests (episode 54)
DevOps Vs SRE: Enabling Efficiency And Resiliency (harness.io)
Tip of the Week
Pandas is a great tool for data analysis. It’s fast, flexible, and easy to use, and it makes it easy to work with data from GCS buckets. (pandas.pydata.org)
7 GUIs you can build to study graphical user interface design. Start with a counter and build up to recreating Excel, programming language agnostic! (eugenkiss.github.io)
Did you know there’s a bash util for sorting, i.e. sort? (manpages.ubuntu.com)
Using Minikube? Did you know you can transfer images with minikube image save from your Minikube environment to Docker easily? Useful for running things in a variety of ways. (minikube.sigs.k8s.io)
Ever have a multi-stage Dockerfile where you only wanted to build one of the intermediate stages? Great for debugging as well as for your caching strategy: use docker build --target <stage name> to build those intermediate stages. (docs.docker.com)