Toil is not just work you don’t want to do, nor is it just administrative work or tedious tasks.
Toil is different for every individual.
Some administrative work has to be done and is considered overhead rather than toil: HR needs, training, meetings, etc.
Even some tedious tasks that pay long term dividends cannot be considered toil.
Cleaning up service configurations was an example of this.
Defined further, toil is work that tends to be manual, repetitive, automatable, tactical, lacking in enduring value, and/or growing as the service does.
Manual – Something a human has to do.
Repetitive – Running something once or twice isn’t toil. Having to do it frequently is.
Automatable – If a machine can do it, then it should be done by the machine. If the task needs human judgement, it’s likely not toil.
Tactical – Interrupt driven rather than strategy driven. You may never be able to eliminate it completely, but the goal is to minimize this type of work.
No enduring value – If your service didn’t change state after the task was completed, it was likely toil. If there was a permanent improvement in the state of the service then it likely wasn’t toil.
O(n) with service growth – If the amount of work grows with the growth of your service usage, then it’s likely toil.
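The traits above can be treated as a checklist. Here is a minimal Python sketch of that idea; the `TaskTraits` class, the `toil_traits` helper, and the two example tasks are made up for illustration, and counting matching traits is just one informal way to apply the list, not a formula from the book.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskTraits:
    """One flag per toil trait from the list above (hypothetical helper type)."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_with_service: bool


def toil_traits(task: TaskTraits) -> list[str]:
    """Return the names of the toil traits that apply to a task."""
    return [f.name for f in fields(task) if getattr(task, f.name)]


# Hypothetical task: restarting a stuck process by hand every time an alert fires.
manual_restart = TaskTraits(
    manual=True, repetitive=True, automatable=True,
    tactical=True, no_enduring_value=True, scales_with_service=True,
)

# The one-off service configuration cleanup mentioned earlier: tedious, but it
# leaves the service permanently better, so most traits do not apply.
config_cleanup = TaskTraits(
    manual=True, repetitive=False, automatable=False,
    tactical=False, no_enduring_value=False, scales_with_service=False,
)

for name, task in [("manual restart", manual_restart), ("config cleanup", config_cleanup)]:
    matched = toil_traits(task)
    print(f"{name}: {len(matched)} of 6 toil traits -> {matched}")
```

The more traits a task matches, the more likely it is worth treating as toil and automating away.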
Why is Less Toil Better?
At Google, the goal is to keep toil below 50% of each SRE’s time.
The remaining 50% or more should go toward engineering work that reduces toil further or adds features to a service, where features means improvements to reliability, performance, or utilization.
The goal is set at 50% because toil, left unaddressed, can easily grow to fill 100% of an SRE’s time.
The time spent reducing toil is the “engineering” in the SRE title.
This engineering time is what allows the service to scale with less time required by an SRE to keep it running properly and efficiently.
When Google hires an SRE, it promises that SRE is not a typical ops organization and mentions the 50% rule. This helps ensure the group doesn’t turn into a full-time ops team.
The book gave the example of a 6 person team and a 6 week cycle:
Assuming 1 week of primary on-call and 1 week of secondary on-call per cycle, an SRE spends 2 of every 6 weeks on “interrupt” type work, or toil, so 33% is the lower bound on toil.
With an 8 person team, you move to an 8 week cycle, so 2 weeks on call out of 8 means a 25% lower bound on toil.
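The arithmetic behind those two lower bounds can be sketched in a few lines of Python. The function name `toil_lower_bound` is made up for illustration, and the sketch assumes what the example implies: the cycle length in weeks equals the team size, and each person takes one primary and one secondary on-call week per cycle.

```python
from fractions import Fraction


def toil_lower_bound(team_size: int, on_call_weeks_per_cycle: int = 2) -> Fraction:
    """Fraction of each SRE's time spent on call in a rotation.

    Assumes the cycle is `team_size` weeks long and every person is on call
    for `on_call_weeks_per_cycle` of those weeks (primary plus secondary).
    """
    if team_size < on_call_weeks_per_cycle:
        raise ValueError("team is too small to cover the rotation")
    return Fraction(on_call_weeks_per_cycle, team_size)


for team_size in (6, 8):
    bound = toil_lower_bound(team_size)
    print(f"{team_size} person team: {float(bound):.0%} lower bound on toil")
```

Under those assumptions, a bigger team lowers the floor on toil, since the same two on-call weeks are spread over a longer cycle.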
At Google, SREs report that their biggest source of toil is interrupts (non-urgent, service-related messages), followed by urgent on-call responses, then releases and pushes.
Surveys of SREs at Google indicate that the average time spent on toil is closer to 33%.
Like all averages, it hides outliers, such as people who spend no time on toil and others who spend as much as 80% of their time on it.
If someone is taking on too much toil, it’s up to the manager to spread that load out better.
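To see how an average near 33% can hide outliers like these, here is a small Python sketch. The names and percentages are invented for illustration and are not Google’s survey data; the only number taken from these notes is the 50% goal.

```python
from statistics import mean

# Hypothetical per-person toil percentages, invented for illustration.
toil_percent = {"sre_a": 0, "sre_b": 20, "sre_c": 30, "sre_d": 35, "sre_e": 80}

TOIL_GOAL_PERCENT = 50  # the goal is to stay below this

team_average = mean(toil_percent.values())

# People at or above the goal are the ones a manager should rebalance.
overloaded = sorted(
    name for name, pct in toil_percent.items() if pct >= TOIL_GOAL_PERCENT
)

print(f"team average: {team_average}%")
print(f"at or above {TOIL_GOAL_PERCENT}%: {overloaded}")
```

The team average looks healthy here even though one person is far over the goal, which is why per-person numbers matter more than the mean.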
What Qualifies as Engineering?
Requires human judgement,
Produces a permanent improvement in a service and is guided by strategy,
Takes a design-driven approach, and
Is as generic or general as possible, since it can then be applied to multiple services for even greater gains in efficiency and reliability.
Typical SRE Activities
Software engineering – Involves writing or modifying code.
Systems engineering – Configuring systems, modifying configurations, or documenting systems in a way that provides long-term improvements.
Toil – Work that is necessary to run a service but is manual, repetitive, etc.
Overhead – Administrative work not directly tied to a service, such as hiring, HR paperwork, meetings, peer reviews, training, etc.
The 50% goal is averaged over a few quarters or a year. There may be some quarters where toil goes above 50%, but that should not be sustained. If it is, management needs to step in and figure out how to bring it back into the goal range.
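One way to picture “averaged over a few quarters” is a simple trailing average. This is only a sketch: the function name `toil_over_goal`, the four-quarter window, and the sample numbers are assumptions made for illustration, not something the book prescribes.

```python
from statistics import mean

TOIL_GOAL_PERCENT = 50  # toil should stay below this on average


def toil_over_goal(quarterly_toil_percent: list[float], window: int = 4) -> bool:
    """True if average toil over the most recent `window` quarters misses the goal."""
    recent = quarterly_toil_percent[-window:]
    return mean(recent) >= TOIL_GOAL_PERCENT


# Hypothetical data: one bad quarter, but the year as a whole is fine.
one_spike = [40, 65, 45, 35]
# Hypothetical data: toil is above the goal quarter after quarter.
sustained = [60, 65, 55, 70]

print("one spike over goal:", toil_over_goal(one_spike))
print("sustained over goal:", toil_over_goal(sustained))
```

A single quarter above 50% does not trip the check, while a sustained run does, which is the point at which management should step in.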
“Let’s invent more, and toil less”
Site Reliability Engineering: How Google Runs Production Systems
Is Toil Always Bad?
The fact that some amount of toil is predictable and repeatable makes some individuals feel like they’re accomplishing something, i.e. quick wins that may be low risk and low stress.
Some amount of toil is expected and unavoidable.
When the amount of time spent on toil becomes too large, you should be concerned and “complain loudly”.
Potential issues with large amounts of toil:
Career stagnation – If you’re not spending enough time on projects, your career progression will suffer.
Low morale – Too much toil leads to burnout, boredom, and discontent.
Too much time on toil also hurts the SRE team.
Creates confusion – The SRE team is supposed to do engineering, and if that’s not happening, then the goal of the team doesn’t match the work being done by the team.
Slows progress – The team will be less productive if they’re focused on toil.
Sets precedent – If you take on too much toil regularly, others will give you more.
Promotes attrition – If your group takes on too much toil, talented engineers in the group may leave for a position with more development opportunities.
Causes breach of faith – If someone joins the team but doesn’t get to do engineering, they’ll feel like they were sold a bill of goods.
Commit to cleaning up a bit more toil each week with engineering activities.
Resources We Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
The Greatest Inheritance, starring Jaleel White (IMDb)
Using Minikube? Did you know you can easily transfer images from your Minikube environment to Docker with minikube image save? Useful for running things in a variety of ways. (minikube.sigs.k8s.io)
Ever have a multi-stage Docker build where you only wanted to build one of the intermediate stages? Use docker build --target <stage name> to build just that stage. It’s great for debugging as well as part of your caching strategy. (docs.docker.com)