rupeshbende asks: How do you find time to do this along with your day job and hobbies as this involves so much studying on your part?
Why Do We Automate Things?
Consistency: Humans make mistakes, even on simple tasks. Machines are much more reliable. Besides, tasks like creating accounts, resetting passwords, applying updates aren’t exactly fun.
Platform: Automation begets automation; smaller tasks can be tweaked or combined into bigger ones.
Pays dividends: automation provides value every time it’s used, as opposed to toil, which is essentially a tax.
Platforms centralize logic too, making it easier to organize, find, and fix issues.
Automation can provide metrics, measurements that can be used to make better decisions.
Faster Repairs: The more often automation runs, the more often it hits the same problems and solutions, which brings down the average time to fix. The more often the process runs, the cheaper repairs become.
Faster Actions: Automations are faster than humans, and many automated tasks would be prohibitively expensive for humans to perform.
Time Saving: It’s faster in terms of actions, and anybody can run it.
If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with fewer special effects and more pissed-off system administrators.
The Value of SRE at Google
Google has a strong bias for automation because of their scale.
Google’s core is software, and they don’t want to use software where they don’t own the code and they don’t want processes in place that aren’t automated. You can’t scale tribal knowledge.
They invest in platforms, i.e. systems that can be improved and extended over time.
Google’s Use Cases for Automation
Much of Google’s automation is around managing the lifecycle of systems, not their data.
They use tools such as Chef, Puppet, CFEngine, and Perl (!?).
The trick is getting the right level of abstraction.
Higher-level abstractions are easier to work with and reason about, but they’re “leaky”:
It’s hard to account for things like partial failures, partial rollbacks, timeouts, etc.
The more generic a solution, the more broadly it can be applied and the more reusable it tends to be, but the downside is that you lose flexibility and resolution.
The Use Cases for Automation
Google’s broad definition of automation is “meta-software”: software that controls software.
Account creation, termination,
Cluster setup, shutdown,
Software install and removal,
Configuration changes.
A Hierarchy of Automation Classes
Ideally you wouldn’t need to stitch separate systems together to get them to cooperate.
Systems that are kept separate and joined with glue code can suffer from “bit rot”: changes to either system can interact poorly with the other or with the glue code itself.
Glue code is some of the hardest to test and maintain.
There are levels of maturity in a system. The more rare and risky a task is, the less likely it is to be fully automated.
No automation: database failover to a new location manually.
Externally maintained system-specific automation: an SRE keeps a couple of commands they run in their notes.
Externally maintained generic automation: SRE adds a script to a playbook.
Internally maintained system-specific automation: the database ships with a script.
System doesn’t need automation: Database notices and automatically fails over.
Can you automate so much that developers are unable to manually support systems when a (very rare) need occurs?
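The top rung of that hierarchy could be sketched like this, with entirely hypothetical names (Node, auto_failover): the system itself notices the dead primary and fails over, with no external glue code involved.

```python
# Hypothetical sketch of the top of the hierarchy: the system itself
# notices an unhealthy primary and fails over, no external glue code.
class Node:
    """Stand-in for a database node; a real system would health-check over RPC."""
    def __init__(self, name, healthy):
        self.name, self.healthy, self.is_primary = name, healthy, False

    def promote(self):
        self.is_primary = True  # hypothetical promotion hook

def auto_failover(primary, replicas):
    """Return the node that should serve traffic, promoting a replica if needed."""
    if primary.healthy:
        return primary
    for replica in replicas:
        if replica.healthy:
            replica.promote()
            return replica
    raise RuntimeError("no healthy node to fail over to")

serving = auto_failover(Node("db1", healthy=False), [Node("db2", healthy=True)])
print(serving.name)  # db2
```

The catch, as the question above suggests, is that once this loop is the only thing doing failovers, humans get rusty at doing them by hand.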
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Chapter 7: The Evolution of Automation at Google (sre.google)
Ultimate List of Programmer Jokes, Puns, and other Funnies (Medium)
Shared success in building a safer open source community (blog.google)
One Man’s Nearly Impossible Quest to Make a Toaster From Scratch (Gizmodo)
The Man Who Spent 17 Years Building The Ultimate Lamborghini Replica In His Basement Wants to Sell It (Jalopnik)
Tip of the Week
There’s an easy way to see the Mongo queries that are running in your Spring app: just set the appropriate logging level, like logging.level.org.springframework.data.mongodb.core.MongoTemplate=DEBUG
This can easily be done at runtime if you have actuators enabled. (Spring)
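For reference, a sketch of the runtime approach, assuming the actuator loggers endpoint is exposed and the app is on the default port:

```shell
# Flip the MongoTemplate logger to DEBUG at runtime via the actuator
# (assumes management.endpoints.web.exposure.include covers "loggers").
curl -X POST http://localhost:8080/actuator/loggers/org.springframework.data.mongodb.core.MongoTemplate \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "DEBUG"}'
```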
There’s a new, open-core product from Grafana called OnCall that helps you manage production support. Might be really interesting if you’re already invested in Grafana, as a lot of organizations are. (Grafana)
How can you configure your Docker container to run as a restricted user? It’s easy! (docs.docker.com)
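A minimal sketch of that tip, with a hypothetical user name: create an unprivileged user in the Dockerfile and switch to it with USER so the container’s process doesn’t run as root.

```dockerfile
FROM alpine:3.19

# Create an unprivileged user (hypothetical name) and stop running as root.
RUN adduser -D -u 1000 appuser
USER appuser

# Everything from here on, including the container's main process, runs as appuser.
CMD ["id"]
```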
iOS – Remember the days of being able to rearrange your screens in iTunes? Turns out you still can, right on iOS. Tap and hold the dots to rearrange them! (support.apple.com)
Another great post from @msuriar, this time about the value of hiring junior developers. (suriar.net)
More about Monitoring Less
Instrumentation and Performance
Be careful not to track times, such as latencies, using only medians or means.
A better way is to bucketize the data as a histogram, meaning to count how many requests fall into each given bucket, such as the example latency buckets in the book of 0 ms – 10 ms, 10 ms – 30 ms, 30 ms – 100 ms, etc.
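A sketch of that bucketing, using bounds like the book’s example (bucketize and BUCKET_BOUNDS are illustrative names):

```python
# A minimal sketch of bucketizing request latencies into histogram counts.
from bisect import bisect_left

# Upper bounds (ms) of each bucket; the final implicit bucket is "everything above".
BUCKET_BOUNDS = [10, 30, 100, 300, 1000]

def bucketize(latencies_ms):
    """Count how many latencies land in each bucket (upper bound inclusive)."""
    counts = [0] * (len(BUCKET_BOUNDS) + 1)
    for latency in latencies_ms:
        counts[bisect_left(BUCKET_BOUNDS, latency)] += 1
    return counts
```

Keeping counts rather than raw samples is what makes percentile-style questions cheap to answer later.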
Choosing the Appropriate Resolution for Measurements
The gist is that you should measure at intervals that support your SLOs and SLAs.
For example, if you’re targeting 99.9% uptime, there’s no reason to check for hard-drive fullness more than once every minute or two.
Collecting measurements can be expensive, for both storage and analysis.
Best to take an approach like the histogram and keep counts in buckets and aggregate the findings, maybe per minute.
As Simple as Possible, No Simpler
It’s easy for monitoring to become very complex:
Alerting on varying thresholds and measurements,
Code to detect possible causes, and so on.
Monitoring can become so complex that it’s difficult to change and maintain, and it becomes fragile.
Some guidelines to follow to keep your monitoring useful and simple include:
Rules that find incidents should be simple, predictable and reliable,
Data collection, aggregation and alerting that is infrequently used (the book said less than once a quarter) should be a candidate for the chopping block, and
Data that is collected but not used in any dashboards or alerting should be considered for deletion.
Avoid attempting to pair simple monitoring with other things such as crash detection, log analysis, etc. as this makes for overly complex systems.
Tying these Principles Together
Google’s monitoring philosophy is admittedly maybe hard to attain, but it makes a good foundation for goals.
Ask the following questions to avoid pager duty burnout and false alerts:
Does the rule detect something that is urgent, actionable, and visible to a user?
When and why would I be able to ignore this alert, and how can I avoid that scenario?
Does this alert definitely indicate negatively impacted users and are there cases that should be filtered out due to any number of circumstances?
Can I take action on the alert and does it need to be done now and can the action be automated? Will the action be a short-term or long-term fix?
Are other people getting paged about this same incident, meaning this is redundant and unnecessary?
Those questions reflect these notions on pages and pagers:
Pages are extremely fatiguing and people can only handle a few a day, so they need to be urgent.
Every page should be actionable.
If a page doesn’t require human interaction or thought, it shouldn’t be a page.
Pages should be about novel events that have never occurred before.
It’s not important whether the alert came from white-box or black-box monitoring.
It’s more important to spend effort on catching symptoms than causes; when it comes to causes, only alert on definite, imminent ones.
Monitoring for the Long Term
Monitoring systems are tracking ever-changing software systems, so decisions about them need to be made with the long term in mind.
Sometimes, short-term fixes are important to get past acute problems and buy you time to put together a long term fix.
Two case studies that demonstrate the tension between short- and long-term fixes:
Originally, Bigtable’s SLO was based on the mean performance of an artificial, well-behaved client.
Bigtable had some low level problems in storage that caused the worst 5% of requests to be significantly slower than the rest.
These slow requests would trip alerts but ultimately the problems were transient and unactionable.
People learned to de-prioritize these alerts, which sometimes were masking legitimate problems.
Google SREs temporarily dialed back the SLO to the 75th percentile to trigger fewer alerts, and disabled email alerts, while working on the root cause: fixing the storage problems.
Dialing back the alerts gave engineers the breathing room they needed to deep dive into the problem.
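With made-up numbers, it’s easy to see why a mean-based SLO hid the slow 5% of requests while a high percentile exposes them:

```python
# Hypothetical latencies: the worst 5% of requests are dramatically slower.
latencies_ms = sorted([10] * 95 + [2000] * 5)

mean = sum(latencies_ms) / len(latencies_ms)
p75 = latencies_ms[int(0.75 * len(latencies_ms))]
p95 = latencies_ms[int(0.95 * len(latencies_ms))]

# The mean looks tolerable and the 75th percentile looks healthy,
# but the 95th percentile reveals the slow tail.
print(mean, p75, p95)  # 109.5 10 2000
```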
Gmail was originally built on a distributed process management system called Workqueue which was adapted to long-lived processes.
Tasks would get de-scheduled causing alerts, but the tasks only affected a very small number of users.
The root cause bugs were difficult to fix because ultimately the underlying system was a poor fit.
Engineers could “fix” the scheduler by manually interacting with it (imagine restarting a server every 24 hours).
Should the team automate the manual fix, or would this just stall out what should be the real fix?
These are two red flags: Why have rote tasks for engineers to perform? That’s toil. And why doesn’t the team trust itself to fix the root cause just because an alarm isn’t blaring?
What’s the takeaway? Do not think about alerts in isolation. You must consider them in the context of the entire system and make decisions that are good for the long term health of the entire system.
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Python has built-in functionality for dynamically reloading modules: Reloading modules in Python. (GeeksForGeeks)
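A self-contained sketch of the idea, writing a throwaway module (hypothetical name reload_demo) and reloading it after an edit:

```python
# A minimal sketch of importlib.reload: edit a module on disk, reload it,
# and see the new value without restarting the interpreter.
import importlib
import os
import pathlib
import sys

sys.path.insert(0, os.getcwd())  # make sure the demo module is importable

# Write a tiny throwaway module so the example is self-contained.
pathlib.Path("reload_demo.py").write_text("VALUE = 1\n")
import reload_demo
assert reload_demo.VALUE == 1

# Simulate editing the source, then reload the already-imported module.
pathlib.Path("reload_demo.py").write_text("VALUE = 100\n")
importlib.reload(reload_demo)
print(reload_demo.VALUE)  # 100
```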
Concatenate RUN statements, like RUN some_command && some_other_command, instead of splitting them into two separate RUN commands, to reduce the layer count.
Prefer apk add --no-cache some_package over apk update && apk add some_package to reduce the layer and image size. And if you’re using apt-get instead of apk, be sure to include apt-get clean as the final command in the RUN command string to keep the layer small.
When using ADD and COPY, be aware that Docker will need the file(s)/directory in order to compute the checksum to know if a cached layer already exists. This means that while you can ADD some_url, Docker needs to download the file in order to compute the checksum. Instead, use curl or wget in a RUN statement when possible, because Docker will only compute the checksum of the RUN command string before executing it. This means you can avoid unnecessarily downloading files during builds (especially on a build server and especially for large files). (docs.docker.com)
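Putting those tips together in one hedged sketch (the package names and download URL are placeholders):

```dockerfile
FROM alpine:3.19

# One concatenated RUN string instead of several, with --no-cache instead of
# apk update, keeping both the layer count and the image size down.
RUN apk add --no-cache curl ca-certificates

# Download in a RUN step rather than ADD-ing a URL, so Docker only checksums
# the command string instead of fetching the file to decide cache validity.
RUN curl -fsSL -o /usr/local/bin/some-tool https://example.com/some-tool && \
    chmod +x /usr/local/bin/some-tool
```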