We take a few minutes to step back and look at how things have changed since we first started the show while Outlaw is dancing on tables, Allen really knows his movie monsters, and Joe's math is on point.
We've wrapped up 9 years…how have we changed the most…why?
Bonus: Buying a window with 3 huge tvs (youtube.com)
Top 3 things you've gotten out of it …
Alphabetize all the things in your class
A better understanding of DB technologies and the impact of their underlying data structures
It's forced us to study various topics …
Amazing friends, community
The application tier can / should be your most powerful
Don't make your tech-du-jour a hammer
Tip of the Week
If you want to enable Markdown support, open a document in Google Docs, head over to the top of the screen, go to “Tools” then “Preferences” and enable “Automatically detect Markdown.” After that, you’re good to go … except this only works for the current doc. (techcrunch.com)
Markdown Viewer is also a plugin for Chrome that lets you support .md files in Google Drive (workspace.google.com)
DataGrip's useless "error at position" messages are frustrating, but the IDE actually does give you the info you need. Check your cursor!
Minikube's "profile" feature makes it easy to swap between clusters. No more tearing down and rebuilding if you need to switch to a new task! (minikube.sigs.k8s.io)
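As a quick hedged sketch (the profile name is just a placeholder), creating and switching between profiles looks roughly like this:
minikube start -p dev-cluster      # create/start a cluster under the "dev-cluster" profile
minikube profile list              # list profiles and see which one is active
minikube profile dev-cluster       # make "dev-cluster" the active profile for future commands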
SQLforDevs.com has a free ebook: Next-Level Database Techniques for Developers. (sqlfordevs.com)
We talk about career management and interview tips, pushing data contracts "left", and our favorite dev books while Outlaw is [redacted], Joe's trying to figure out how to hire junior devs, and Allen's trying to screw some nails in.
Interesting article about AI potentially replacing recruiters at Amazon (vox.com)
From 'Round the Water-Cooler
Why don't companies want junior developers?
You see a lot of advice out there for developers trying to get that first job, but what advice does the industry have for those trying to hire and support them? …not much
How long do you need to stay at a job?
What do you do if you're worried about being a "job hopper"?
Interviewing…know what the company is creating so you'll have an idea of what challenges they may have technically and so you can look up how you might solve some of those problems
How do you decide when to bring in new tech?
Right tool for the job - don't always be jumping ship to the newest, shiniest thing - it might be you just need to augment your stack with a new piece of technology rather than thinking something new will solve ALL your problems
Tip of the Week
Did you know Obsidian has a command palette similar to Code? Same short-cut (Cmd/Ctrl-P) as VS Code and it makes for a great learning curve! Don't know how to make something italic? Cmd-P. Insert a template? Cmd-P. Pretty much anything you want to do, but don't know how to do. Cmd P! (help.obsidian.md)
Ghostery plugin for Firefox cuts down on ads and protects your privacy. Thanks for the tip Aaron Jeskie! (addons.mozilla.org)
Amazing prank to play on a Windows user: hit F11 to full-screen this website next time your co-worker or family member leaves their computer unlocked. Thanks Scott Harden! (fakeupdate.net)
We take a peek into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.
In 2019, over 100 million people per day would visit Twitter.
Every tweet and user action creates an event that is used by machine learning and employees for analytics.
Their goal was to democratize data analysis within Twitter to allow people with various skillsets to analyze and/or visualize the data.
At the time, various technologies were used for data analysis:
Scalding which required programmer knowledge, and
Presto and Vertica which had performance issues at scale.
Another problem was having data spread across multiple systems without a simple way to access it.
Moving pieces to Google Cloud Platform
The Google Cloud big data tools at play:
BigQuery, a cost-effective, serverless, multicloud enterprise data warehouse to power your data-driven innovation.
Data Studio, unifying data in one place with the ability to explore, visualize, and tell stories with the data.
History of Data Warehousing at Twitter
2011 – Data analysis was done with Vertica and Hadoop and data was ingested using Pig for MapReduce.
2012 – Replaced Pig with Scalding using Scala APIs that were geared towards creating complex pipelines that were easy to test. However, it was difficult for people with SQL skills to pick up.
2016 – Started using Presto to access Hadoop data using SQL and also used Spark for ad hoc data science and machine learning.
2018 …
Scalding for production pipelines,
Scalding and Spark for ad hoc data science and machine learning,
Vertica and Presto for ad hoc, interactive SQL analysis,
Druid for interactive, exploratory access to time-series metrics, and
Tableau, Zeppelin, and Pivot for data visualization.
So why the change? To simplify analytical tools for Twitter employees.
BigQuery for Everyone
Challenges:
Needed to develop an infrastructure to reliably ingest large amounts of data,
Support company-wide data management,
Implement access controls,
Ensure customer privacy, and
Build systems for:
Resource allocation,
Monitoring, and
Charge-back.
In 2018, they rolled out an alpha release.
The most frequently used tables were offered with personal data removed.
Over 250 users, from engineering, finance, and marketing used the alpha.
Sometime around June of 2019, they had a month where 8,000 queries were run that processed over 100 petabytes of data, not including scheduled reports.
The alpha turned out to be a large success so they moved forward with expanding their use of BigQuery.
They have a nice diagram that’s an overview of what their processes looked like at this time, where they essentially pushed data into GCS from on-premise Hadoop data clusters, and then used Airflow to move that into BigQuery, from which Data Studio pulled its data.
Ease of Use
BigQuery was easy to use because it didn’t require the installation of special tools and instead was easy to navigate via a web UI.
Users did need to become familiar with some GCP and BigQuery concepts such as projects, datasets, and tables.
They developed educational material for users which helped get people up and running with BigQuery and Data Studio.
In regards to loading data, they looked at various pieces …
Cloud Composer (managed Airflow) couldn’t be used due to Domain Restricted Sharing (data governance).
Google Data Transfer Service was not flexible enough for data pipelines with dependencies.
They ended up using Apache Airflow as they could customize it to their needs.
For data transformation, once data was in BigQuery, they created scheduled jobs to do simple SQL transforms.
For complex transformations, they planned to use Airflow or Cloud Composer with Cloud Dataflow.
Performance
BigQuery is not for low-latency, high-throughput queries, or for low-latency, time-series analytics.
It is for SQL queries that process large amounts of data.
Their requirement for their BigQuery usage was to return results within a minute.
To achieve these requirements, they allowed their internal customers to reserve minimum slots for their queries, where a slot is a unit of computational capacity to execute a query.
The engineering team had to analyze 800+ queries, each processing around 1TB of data, to figure out how to allocate the proper slots for production and other environments.
Data Governance
Twitter focused on discoverability, access control, security, and privacy.
For data discovery and management, they extended their DAL to work with both their on-premise and GCP data, providing a single API to query all sets of data.
In regards to controlling access to the data, they took advantage of two GCP features:
Domain restricted sharing, meaning only users inside Twitter could access the data, and
VPC service controls to prevent data exfiltration as well as only allow access from known IP ranges.
Authentication, Authorization, and Auditing
For authentication, they used GCP user accounts for ad hoc queries and service accounts for production queries.
For authorization, each dataset had an owner service account and a reader group.
For auditing, they exported BigQuery stackdriver logs with detailed execution information to BigQuery datasets for analysis.
Ensuring Proper Handling of Private Data
They required registering all BigQuery datasets,
Annotate private data,
Use proper retention, and
Scrub and remove data that was deleted by users.
Privacy Categories for Datasets
Highly sensitive datasets are available on an as-needed basis with least privilege.
These have individual reader groups that are actively monitored.
Medium sensitivity datasets are anonymized data sets with no PII (personally identifiable information) and provide a good balance between privacy and utility, such as how many users used a particular feature without knowing who those users were.
Low sensitivity datasets are datasets where all user level information is removed.
Public datasets are available to everyone within Twitter.
Scheduled tasks were used to register datasets with the DAL, as well as a number of additional things.
Cost
Costs were roughly the same for querying with Presto vs. BigQuery.
There are additional costs associated with storing data in GCS and BigQuery.
Utilized flat-rate pricing so they didn’t have to figure out fluctuating costs of running ad hoc queries.
In some situations, such as querying tens of petabytes, it was more cost-effective to utilize Presto querying data in GCS storage.
Could you build Twitter in a weekend?
Resources
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Characters Sets (No Excuses!) (JoelOnSoftware.com)
Scaling data access by moving an exabyte of data to Google Cloud (blog.twitter.com)
Democratizing data analysis with Google BigQuery (blog.twitter.com)
Google BigQuery, Cloud data warehouse to power your data-driven innovation (cloud.google.com)
Elon Musk and Twitter employees engage in war of words (NewsBytesApp.com)
Tip of the Week
VS Code has a plugin for Kubernetes and it’s actually very nice! Particularly when you “attach” to the container. It installs a couple bits on the container, and you can treat it like a local computer. The only thing to watch for … it’s very easy to set your local context! (marketplace.visualstudio.com)
kafkactl is a great command line tool for managing Apache Kafka and has a consistent API that is intuitive to use. (deviceinsight.github.io)
Cruise Control is a tool for Apache Kafka that helps balance resource utilization, detect and alert on problems, and administrate. (GitHub)
iTerm2 is a terminal emulator for macOS that does amazing things. Why aren’t you already using it? (iterm2.com)
Message compression in Kafka will help you save a lot of space and network bandwidth, and the compression is per message so it’s easy to enable in existing systems! (cwiki.apache.org)
It’s that time of year where we’ve got money burning a hole in our pockets. That’s right, it’s time for the annual shopping spree. Meanwhile, Fiona Allen is being gross, Joe throws shade at Burger King, and Michael has a new character encoding method.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Well, you know Joe has to be a little different so the format’s a bit different here! What if there was a way to spend money that could actually make you happy? Check out this article: Yes, you can buy happiness … if you spend it to save time (CNBC).
Ideas for ways to spend $2k to save you time
A good mattress will improve your sleep, and therefore your amount of quality time in a day! ($1k),
Cleaning Service ($100 – $300 per month),
Massage ($50 per month),
Car Wash Subscription ($20 per month),
Grocery Delivery Service (Shipt is $10 a month + up charges on items),
How do you fix a typo on your phone? Try pressing and then sliding your thumb on the space bar! It’s a nifty trick to keep you in the flow. And it works on both Android and iOS.
Heading off to holiday? Here’s an addendum to episode 191‘s Tip of the Week … Don’t forget your calendar!
On iOS, go to Settings -> Mail -> Accounts -> Select your work account -> Turn off the Mail and Calendar sliders.
Also, in Slack, you can pause notifications for an extended period and if you do, it’ll automatically change your status to Vacationing .
Did you know that Docker only has an image cache locally, there isn’t a local registry installed? This matters if you go to use something like microk8s instead of minikube! (microk8s.io)
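One workaround, going from the microk8s docs, is to enable its built-in registry add-on and push images there; a hedged sketch (the image name is a placeholder, and the port below is the add-on's documented default):
microk8s enable registry                          # built-in registry add-on, exposed on localhost:32000
docker build -t localhost:32000/myapp:latest .    # tag the image for the local registry
docker push localhost:32000/myapp:latest          # push it so microk8s can pull it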
What if you want to see what process has a file locked?
In Windows, Ronald Sahagun let us know you can use File Locksmith in PowerToys from Microsoft. (learn.microsoft.com)
In Linux based systems, Dave Follett points out you can just cat the process ID file in your /proc directory: cat /proc/<processId> to see what’s locked. lslocks makes it easy too, just run the command and grep for your file. (Stack Exchange)
We gather around the watercooler to discuss the latest gossip and shenanigans have been called while Coach Allen is not wrong, Michael gets called out, and Joe gets it right the first time.
DuckDB is an in-process SQL OLAP database management system. You can use it from the command line, drop it into your POM file, pip install it, or npm install it, and then you can easily work with CSV or Parquet files as if they were a database. (duckdb.org)
It’s really easy to try out in the browser too! (shell.duckdb.org)
Want to be sure a file or URL is safe? Use Virus Total to find out. From VirusTotal: VirusTotal inspects items with over 70 antivirus scanners and URL/domain blocklisting services, in addition to a myriad of tools to extract signals from the studied content. (virustotal.com)
How to Show & Verify Code Signatures for Apps in Mac OS X (osxdaily.com)
tldr: codesign -dv --verbose=4 /path/to/some.app
How to Get GitHub-like Diff Support in Git on the Command-Line (matthewsetter.com)
Speed up development cycles when working in Kubernetes with Telepresence. (telepresence.io)
We wrap up Git from the Bottom Up by John Wiegley while Joe has a convenient excuse, Allen gets thrown under the bus, and Michael somehow made it worse.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
News
Thanks for the reviews on iTunes jessetsilva, Marco Fernandooo, and sysadmike702!
Git’s reset is likely one of the commands that people shy away from using because it can mess with your working tree as well as what commit HEAD references.
reset is a reference editor, an index editor and a working tree editor.
git reset --mixed
Modifies HEAD? YES.
Modifies the index? YES. Removes all staged changes from the index, effectively unstaging them back to the working tree.
Modifies the working tree? YES. All changes from the reset commit(s) are put in the working tree. Any previous changes are merged with the reset commit(s)’s changes in the working tree.
git reset --soft
Modifies HEAD? YES.
Modifies the index? YES. All changes from the reset commit(s) are put in the index. Any previously staged changes are merged with the reset commit(s)’s changes in the index.
Modifies the working tree? NO. Any changes in the working tree are left untouched.
git reset --hard
Modifies HEAD? YES.
Modifies the index? YES. Clears the index of any staged changes.
Modifies the working tree? YES. Clears the working tree of any unstaged changes.
What do the git reset mode flags change?
Mixed reset
--mixed is the default mode.
If you do a reset --mixed of more than one commit, all of those changes will be put back in the working tree together essentially setting you up for a squash of those commits.
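A hedged sketch of that squash-style workflow (the commit count and message are just illustrative):
git reset --mixed HEAD~3        # HEAD moves back 3 commits; their changes land unstaged in the working tree
git add .                       # re-stage everything
git commit -m "One commit instead of three"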
Soft Reset
These two commands are equivalent, both effectively ignoring the last commit:
git reset --soft HEAD^
git update-ref HEAD HEAD^
If you did a git status after either of the previous commands, you’d see more changes because your working tree is now being compared to a different commit, assuming you previously had changes in your working tree.
This effectively allows you to create a new commit in place of the old one.
Instead of doing this, you can always do git commit --amend.
Similar to the use of --mixed for multiple commits, if you do a reset --soft of more than one commit, all of those changes will be put back in the index together essentially setting you up for a squash of those commits.
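A similar hedged sketch using --soft, where the changes stay staged so you can commit immediately:
git reset --soft HEAD~3         # HEAD moves back 3 commits; their changes remain staged in the index
git commit -m "Squashed the last three commits"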
Hard Reset
This can be one of the most consequential commands.
Performing git reset --hard HEAD will get rid of any changes in your index and working tree to all tracked files, such that all of your files will match the contents of HEAD.
If you do a reset --hard to an earlier commit, such as git reset --hard HEAD~3, Git is removing changes from your working tree to match the state of the files from the earlier commit, and it’s changing HEAD to reference that earlier commit. Similar to the previous point, all uncommitted changes to tracked files are undone.
Again, this is a destructive/dangerous way to do something like this and there is another way that is safer:
Instead, perform a git stash followed by git checkout -b new-branch HEAD~3.
This will save, i.e. stash, your index and working tree changes, and then check out a new branch that references HEAD‘s great grandparent.
git stash saves your work in a stash that you can then apply to any branch you wish in the future; it is not branch specific.
Checking out a new branch to the older state allows you to maintain your previous branch and still make the changes you wanted on your new branch.
If you decide that you like what is in your new branch better than your old branch, you can run these commands:
git branch -D oldbranch
git branch -m newbranch oldbranch
After learning all of this, the author’s recommendation is to always do the stashing/branch creation as it’s safer and there’s basically no real overhead to it.
If you do accidentally blow away changes, the author mentions that you can do a restore from the reflog such as git reset --hard HEAD@{1}.
The author also recommends ALWAYS doing a git stash before doing a git reset --hard
This allows you to do a git stash apply and recover anything you lost, i.e. nice backup plan.
As mentioned previously, if you have other consumers of your branch/commits, you should be careful when making changes that modify history like this as it can force unexpected merges to happen to your consumers.
Stashing and the Reflog
There are two new ways that blobs can make their way into the repository.
The first is the reflog, a metadata repository that records everything you do in your repository.
So any time you make a commit in your repository, a commit is also being made to the reflog.
You can view the reflog with git reflog.
The glorious thing about the reflog is even if you did something like a git reset and blew away your changes, any changes previously committed would still exist in the reflog for at least 30 days, before being garbage collected (assuming you don’t manually run garbage collection).
This allows you to recover a commit that you deleted in your repository.
The other place that a blob can exist is in your working tree, albeit not directly noticeable.
If you modified foo.java but you didn’t add it to the index, you can still see what the hash would be by running git hash-object foo.java.
In this regard, the change exists on your filesystem instead of Git’s repository.
The author recommends stashing any changes at the end of the day even if you’re not ready to add anything to your index or commit it.
By doing so, Git will store all of your working tree changes and current index as the necessary trees and blobs in your git repository along with a couple of commits for storing the state of the working tree and index.
The next day, you come back in, run a git stash apply and all of your changes are back in your working tree.
So why do that? You’re just back in the same state you were the night before, yeah? Well, except now those commits that happened due to the stash are something you can go back to in your reflog, in case of an emergency!
Another special thing, because stashes are stored as commits, you can interact with them just like any other branch, at any time!
git checkout -b temp stash@{32}
In the above command, you can checkout a stash you did 32 days ago, assuming you were doing a single stash per day!
If you want to cleanup your stash history, DO NOT USE git stash clear as it kills all your stash history.
Instead, use git reflog expire --expire=30.days refs/stash to let your stashes expire.
One last tip the author mentioned is you could even roll your own snapshot type command by simply doing a git stash && git stash apply.
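If you wanted to wrap that up, one hedged way is a simple Git alias (the alias name is made up; it assumes you actually have local changes to stash):
git config --global alias.snapshot '!git stash && git stash apply'
git snapshot    # records the current working tree/index as stash commits (and reflog entries) without disturbing your work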
The Pragmatic Programmer – How to Build Pragmatic Teams (Episode 114)
Tip of the Week
A couple episodes back (episode 192), Allen mentioned Obsidian, a note taking app that operates on markdown files so you can use it offline if you want or you can keep the files in something like Dropbox or pay a monthly fee for syncing. Good stuff, and if you ever want to leave the service … you have the markdown files! That’s an old tip, but Joe has been using it lately and wanted to add a couple of supplemental tips now that he’s gotten more experience with it.
If Obsidian just manages markdown files, then why bother? Why not just use something like VSCode? Because, Obsidian is also a rich client that is designed to help you manage markdown with features built in for things like search, tags, cross-linking etc.
Obsidian supports templates, so you can, for example, create a template for common activities … like if you keep a daily TODO list that has the same items on it every day, you can just {{include}} it to dump a copy of that checklist or whatever in. (help.obsidian.md)
Obsidian is designed to support multiple “vaults” up front. This lets you, for example, have one vault that you use for managing your personal life that you choose to sync to all of your devices, and one for work that is isolated in another location and doesn’t sync so you don’t have to worry about exfiltrating your work notes.
Community extensions! People have written interesting extensions, like a Calendar View or a Kanban board, but ultimately they serialize down to markdown files so if the extension (for example) doesn’t work on mobile then you can still somewhat function.
All of the files that Obsidian manages have to have a .md file extension. Joe wanted to store some .http files in his vault because it’s easy to associate them with his notes, but he also wanted to be able to execute them using the REST Client extension … which assumes a .http extension. The easiest solution Joe found was just to change the file type in the lower right hand corner in VSCode and it works great. This works for other extensions, too, of course! (GitHub)
[Wireless] How to improve compatibility of IoT device with ASUS WiFi 6(AX) Router? (ASUS)
Google’s new mesh Wi-Fi solution with support for Wi-Fi 6e is out, Google Nest Wifi Pro, and looks promising. (store.google.com)
Terran Antipodes sent Allen a tip that we had to share, saying that you can place your lower lip between your teeth to hold back a sneeze. No need to bite down or anything, it just works! All without the worry of an aneurysm.
This episode, we learn more about Git’s Index and compare it to other version control systems while Joe is throwing shade, Michael learns a new command, and Allen makes it gross.
Ludum Dare is a bi-annual game jam that’s been running for over 20 years now. Jam #51 is coming up September 30th to October 3rd. (ldjam.com)
We previously talked about Ludum Dare in episode 146.
The Index
Meet the Middle Man
The index refers to the set of blobs and trees created when running a git add, when you “stage” files.
These trees and blobs are not a part of the repository yet!
If you were to unstage the changes using a reset, you’d have orphaned blob(s) that would eventually get cleaned up.
The index is a staging area for your next commit.
The staging area allows you to build up your next commit in stages.
You can almost ignore the index by doing a git commit -a (but shouldn’t).
In Subversion, the next set of changes is always determined by looking at the differences in the current working tree.
In Git, the next set of changes is determined by looking at your index and comparing that to the latest HEAD.
git add allows you to make additional changes before executing your commit with things like the git add --patch and git add --interactive parameters (see the sketch after this list).
For Emacs fans out there, the author mentioned gitsum. (GitHub)
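A hedged sketch of patch-mode staging (the file name is a placeholder); Git walks you through each hunk and asks what to do with it:
git add --patch foo.java
# For each hunk, Git prompts with options such as:
#   y - stage this hunk        n - skip this hunk
#   s - split into smaller hunks    e - manually edit the hunk    q - quit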
Taking the Index Further
The author mentions “Quilt!”, is it this? (man7.org)
The primary difference between Git and Quilt is Git only allows one patch to be constructed at a time.
The situation the author describes is: what if I had multiple changes I wanted to test independently of each other?
There isn’t anything built into Git to allow you to try out parallel sets of changes on the fly.
Multiple branches would allow you to try out different combinations and the index allows you to stage your changes in a series of commits, but you can’t do both at the same time.
To do this you’d need an index that allows for more than a single commit at a time.
Stacked Git is a tool that lets you prepare more than one index at a time. (stacked-git.github.io)
The author gives an example of using regular Git to do two commits by interactively selecting a patch.
Then, the author gives the example of how you’d have to go about disabling one set of changes to test the other set of changes. It’s not great … swapping between branches, cherry-picking changes, etc.
If you find yourself in this situation, definitely take a look at Stacked Git. Using Stacked Git, you are basically pushing and popping commits on a stack.
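A rough sketch of what that push/pop workflow might look like with Stacked Git (patch names are placeholders; check the stg docs for the exact syntax on your version):
stg new tweak-logging -m "Try alternate logging"    # start a new patch on the stack
# ...edit files...
stg refresh                                         # fold the working tree changes into the current patch
stg series                                          # list the patches in the stack
stg pop                                             # temporarily remove the patch to test without it
stg push                                            # re-apply it when you want it back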
Diffusion Bee is GUI for running Stable Diffusion on M1 macs. It’s got a one-click installer that you can get up and generating weird computer art in minutes … as long as you’re on a recent version of macOS and M1 hardware. (GitHub)
No M1 Mac? You can install the various packages you need to do it yourself, some assembly required! (assembly.ai)
Git Tower is a fresh take on Git UI that lets you drag-n-drop branches, undo changes, and manage conflicts. Give it a shot! (git-tower.com)
Git Kraken is the Gold Standard when it comes to Git UIs. It’s a rich, fully featured environment for managing all of your branches and changes. They are also the people behind the popular VS Code Extension GitLens (gitkraken.com)
GitHub CLI is an easy to use command line interface for interacting with GitHub. Reason 532 to love it … draft PR creation via gh pr create --draft ! (cli.github.com)
It’s time to understand the full power of Git’s rebase capabilities while Allen takes a call from Doc Brown, Michael is breaking stuff all day long, and Joe must be punished.
Ludum Dare is a bi-annual game jam that’s been running for over 20 years now. Jam #51 is coming up September 30th to October 3rd. (ldjam.com)
We previously talked about Ludum Dare in episode 146.
Branching and the power of rebase
Every branch you work in typically has one or more base commits, i.e. the commits the branch started from.
git branch shows the branches in your local repo.
git show-branch shows the branch ancestry in your local repo.
Reading the output from the bottom up takes you from the oldest to the newest history in the branches.
Plus signs are used to indicate commits on divergent branches from the one that’s currently checked out.
An asterisk is used to indicate commits that happened on the current branch.
At the top of the output above the dashed line, the output shows the branches, the column and color that will identify their commits, and the label used when identifying their commits.
Consider an example repo where we have two branches, T and F, where T = Trunk and F = Feature and the commit history looks like this:
What we want to do is bring Feature up to date with what’s in Trunk, so bring T2, T3, and T4 into F3.
In most source control systems, your only option here is to merge, which you can also do in Git, and should be done if this is a published branch where we don’t want to change history.
After a merge, the commit tree would look like this:
The F3' commit is essentially a “meta-commit” because it’s showing the work necessary to bring T4 and F3 together in the repository but contains no new changes from the working tree (assuming there were no merge conflicts to resolve, etc.)
If you would rather have your work in your Feature branch be directly based on the commits from Trunk rather than merge commits, you can do a git rebase, but you should only do this for local development.
The resulting branch would look like this:
You should only rebase local branches because you’re potentially rewriting commits and you should not change public history.
When doing the merge, the merge commit, F3' is an instruction on how to transform F3 + T4.
When doing the rebase, the commits are being rewritten, such that F1' is based on T4 as if that’s how it was originally written by the author.
Use rebase for local branches that don’t have other branches off it, otherwise use merge for anything else.
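In command form, the two approaches on that example might look roughly like this (branch names taken from the example):
# Merge: keeps published history intact and adds a merge commit (F3')
git checkout Feature
git merge Trunk
# Rebase: rewrites the Feature commits on top of T4 (local-only branches!)
git checkout Feature
git rebase Trunk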
Interactive rebasing
git rebase will try to automatically do all the merging.
git rebase -i will allow you to handle every aspect of the rebase process.
pick – This is the default behavior when not using -i. The commit should be applied to its rewritten parent. If there are conflicts, you’re allowed to resolve them before continuing.
squash – Use this option when you want to combine the contents of a commit into the previous commit rather than keeping the commits separate. This is useful for when you want multiple commits to be rewritten as a single commit.
edit – This will stop the rebasing process at that commit and let you make any changes before doing a git rebase --continue. This allows you to make changes in the middle of the process, making it look like the edit was always there.
drop – Use when you want to remove a commit from the history as if it had never been committed. You can also remove the commit from the list or comment it out from the rebase file to get the same results. If there were any commits later that depended on the dropped commit, you will get merge conflicts.
Interactive gives you the ability to reshape your branch to how you wish you’d done it in the first place, such as reordering commits.
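For reference, the todo list Git opens for something like git rebase -i HEAD~4 looks roughly like this, and you edit the verbs to reshape history (hashes and messages here are made up):
pick   a1b2c3d Add login form
squash d4e5f6a Fix typo in login form
edit   b7c8d9e Add request logging
drop   c0ffee1 Temporary debug output
# squash folds d4e5f6a into the commit above it, edit pauses so you can amend
# (then git rebase --continue), and drop removes the commit entirely.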
Site Reliability Engineering – Embracing Risk (episode 182)
Tip of the Week
Russian Circles is a rock band that makes gloomy, mid-tempo, instrumental music that’s perfect for coding. They just put out a new album and, much like the others, it’s great for coding to! (YouTube)
GitLens for Visual Studio Code is an open-source extension for Visual Studio Code that brings in a lot more information from your Git repository into your editor. (marketplace.visualstudio.com)
JSON Crack is a website that makes it easy to “crack” JSON documents and view them hierarchically. Great for large docs. Thanks for the tip Thiyagu! (JsonCrack.com)
Handle is a Windows utility that you can use to see which process has a “handle” on your resource. Thanks for the tip Larry Weiss! (docs.microsoft.com)
Crunchy Data has made it so you can run PostgreSQL in the browser thanks to WASM. Technically very cool, and it’s a great way to learn Postgres. Thanks for the tip Mikerg! (Crunchy Data)
Divvy is a cool new window manager for macOS. It’s cool, modern, and much more powerful than the built in manager! Thanks for the tip jonasbn! (apps.apple.com)
We are committed to continuing our deep dive into Git from the Bottom Up by John Wiegley, while Allen puts too much thought into onions, Michael still doesn’t understand proper nouns, and Joe is out hat shopping.
Ludum Dare is a bi-annual game jam that’s been running for over 20 years now. Jam #51 is coming up Sept 30th to October 3rd. (ldjam.com)
We previously talked about Ludum Dare in episode 146.
Commitment Issues
Commits
A commit can have one or more parents.
Those parent commits can, in turn, have one or more parents of their own.
It’s for this reason that commits can be treated like branches, because they know their entire lineage.
You can examine top level referenced commits with the following command: git branch -v.
A branch is just a named reference to a commit!
A branch and a tag both name a commit, with the exception that a tag can have a description, similar to a commit.
Branches are just names that point to a commit.
Tags have descriptions and point to a commit.
Knowing the above two points, you actually don’t technically need branches or tags. You could do everything pointing to the commit hash IDs if you were insane enough to do so.
Here’s a dangerous command:
git reset --hard commitHash – This is dangerous. --hard says to erase all changes in the working tree, whether they were registered for a check-in or not and reset HEAD to point to the commitHash.
Here’s a safer command:
git checkout commitHash – This is a safer option, because files changed in the working tree are preserved. However, adding the -f parameter acts similar as the previous command, except that it doesn’t change the branch’s HEAD, and instead only changes the working tree.
Some simple concepts to grasp:
If a commit has multiple parents, it’s a merge commit.
If a commit has multiple children, it represents the ancestor of a branch.
Simply put, Git is a collection of commits, each of which holds a tree which reference other trees and blobs, which store data.
All other things in Git are named concepts but they all boil down to the above statement.
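You can see the “branches are just names” point for yourself: a branch ref is usually just a 40-character commit hash sitting in a file. A hedged sketch, assuming the ref hasn’t been packed:
cat .git/refs/heads/master        # prints the commit hash the branch points to
git rev-parse master              # same answer, and works even when refs are packed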
A commit by any other name
The key to knowing Git is to truly understand commits.
Learning to name your commits is the way to mastering Git.
branchname – The name of a branch is an alias to the most recent commit on that branch.
tagname – Similar to the branch name in that the name points to a specific commit but the difference is a tag can never change the commit id it points to.
HEAD – The currently checked out commit. Checking out a specific commit takes you out of a “branch” and you are then in a “detached HEAD” state.
The 40 character hash id – A commit can always be referenced by the full SHA1 hash.
You can refer to a commit by a shorter version of the hash id, enough characters to make it unique, usually 6 or 7 characters is enough.
name^ – Using the caret tells Git to go to the parent of the provided commit. If a commit has more than one parent, the first one is chosen.
name^^ – Carets can be stacked, so doing two carets will give the parent of the parent of the provided commit.
name^2 – If a commit has multiple parents, you can choose which one to retrieve by using the caret followed by the number of the parent to retrieve. This is useful for things like merge commits.
name~10 – Same thing as using the commit plus 10 carets. It refers to the named commit’s 10th generation ancestor.
name:path – Used to reference a specific file in the commit’s content tree, excellent when you need to do things like compare file diffs in a merge, like: git diff HEAD^1:somefile HEAD^2:somefile.
name^{tree} – Reference the tree held by a commit rather than the commit itself.
name1..name2 – Get a range of commits reachable from name2 all the way back to, but not including, name1. Omitting name1 or name2 will substitute HEAD in the place.
name1...name2 – For commands like log, gets the unique commits that are referenced by name1 or name2. For commands like diff, the range is between name2 and the common ancestor of name1 and name2.
main.. – Equivalent to main..HEAD and useful when comparing changes made in the current branch to the branch named main.
..main – Equivalent to HEAD..main and useful for comparing changes since the last rebase or merge with the branch main, after fetching it.
--since="2 weeks ago" – All commits from a certain relative date.
--until="1 week ago" – All commits before a certain relative date.
--grep=pattern – All commits where the message meets a certain regex pattern.
--committer=pattern – Find all the commits where the committer matches a regex pattern.
--author=pattern – All commits whose author matches the pattern.
So how’s that different than the committer? “The author of a commit is the one who created the changes it represents. For local development this is always the same as the committer, but when patches are being sent by e-mail, the author and the committer usually differ.”
--no-merges – Only return commits with a single parent, i.e. ignore all merge commits.
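A few hedged examples combining the syntax above (patterns and paths are placeholders):
git log main..HEAD --no-merges                   # commits on the current branch that aren't in main, skipping merge commits
git log --since="2 weeks ago" --author="alice"   # recent commits by a particular author
git show HEAD~3:src/App.java                     # the version of a file three commits back
git diff HEAD^1:somefile HEAD^2:somefile         # compare a file across a merge commit's two parents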
Not sure where the history of your branch started from and want an easy button? Check out Allen’s TotW from episode 182.
Need to search the entire history of the repo for some content (text, code, etc.) that’s not part of the current branch? Content, not a commit comment, not a commit ID, but content. Check out Michael’s TotW from episode 31.
Nobody Likes Onions, a podcast that has been making audiences laugh at the absurd, the obvious, and the wrong, for a very long time. (NobodyLikesOnions.com)
Tip of the Week
Supabase is an open-source alternative to Google’s Firebase that is based on PostgreSQL. The docs are great and it’s really easy to work through the “Getting Started” guide to set up a new project in the top framework of your choice, complete with a (for now) free, hosted PostgreSQL database on Heroku, with authentication (email/password or a myriad of providers). RBAC is controlled via database policies and everything can be administered through the portal. You can query the database with a simple DSL. Joe was able to work through a small project and get it hosted on Netlify (with SSL!) all for free in under 2 hours. (supabase.com)
Obsidian is a really cool way to associate markdown data with your files. (Thanks Simon Barker!) (obsidian.md)
Ever use a “mind map” tool? MindNode is a great, free, mind mapping tool to help you organize your thoughts (Thanks Sean Martz!) (mindnode.com)
Ink Drop is a cool way to organize and search your markdown files (inkdrop.app) (Thanks Lars!)
Tired of git log knocking the rest of your content off screen? You can configure Git to run a custom “core.pager” command with the args you prefer: (serebrov.github.io)
To configure just Git: git config --global --replace-all core.pager "less -iXFR"
Or, to modify how less prints to the screen and commands that rely on it, including Git, edit your ~/.bashrc or ~/.zshrc, etc. and add export LESS=-iXFR to the file.
It’s surprising how little we know about Git as we continue to dive into Git from the Bottom Up, while Michael confuses himself, Joe has low standards, and Allen tells a joke.
Thanks for all the great feedback on the last episode and for sticking with us!
Directory Content Tracking
Put simply, Git just keeps a snapshot of a directory’s contents.
Git represents your file contents in blobs (binary large object), in a structure similar to a Unix directory, called a tree.
A blob is named by a SHA1 hashing of the size and contents of the file.
This verifies that the blob contents will never change (given the same ID).
The same contents will ALWAYS be represented by the same blob no matter where it appears, be it across commits, repositories, or even the Internet.
If multiple trees reference the same blob, it’s simply a hard link to the blob.
As long as there’s one link to a blob, it will continue to exist in the repository.
A blob stores no metadata about its content.
This is kept in the tree that contains the blob.
Interesting tidbit about this: you could have any number of files that are all named differently but have the same content and size and they’d all point to the same blob.
For example, even if one file were named abc.txt and another was named passwords.bin in separate directories, they’d point to the same blob.
The author creates a file and then calculates the ID of the file using git hash-object filename.
If you were to do the same thing on your system, assuming you used the same content as the author, you’d get the same hash ID, even if you name the file different than what they did.
git cat-file -t hashID will show you the Git type of the object, which should be blob.
git cat-file blob hashID will show you the contents of the file.
The commands above are looking at the data at the blob level, not even taking into account which commit contained it, or which tree it was in.
Git is all about blob management, as the blob is the fundamental data unit in Git.
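A quick hedged walk-through of those blob-level commands (the file name and contents are arbitrary, so your hash will differ; <that-hash> is a placeholder for whatever hash-object prints):
echo "Hello, world!" > greeting.txt
git hash-object greeting.txt          # prints the blob ID for this exact content
git add greeting.txt                  # actually writes the blob into .git/objects
git cat-file -t <that-hash>           # -> blob
git cat-file blob <that-hash>         # -> Hello, world!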
Blobs are Stored in Trees
Remember there’s no metadata in the blobs, and instead the blobs are just about the file’s contents.
Git maintains the structure of the files within the repository in a tree by attaching blobs as leaf nodes within a tree.
git ls-tree HEAD will show the tree of the latest commit in the current directory.
git rev-parse HEAD decodes the HEAD into the commit ID it references.
git cat-file -t HEAD verifies the type for the alias HEAD (should be commit).
git cat-file commit HEAD will show metadata about the commit including the hash ID of the tree, as well as author info, commit message, etc.
To see that Git is maintaining its own set of information about the trees, commits and blobs, etc., use find .git/objects -type f and you’ll see the same IDs that were shown in the output from the previous Git commands.
How Trees are Made
There’s a notion of an index, which is what you use to initially create blobs out of files.
If you just do a git add without a commit, assuming you are following along here (jwiegley.github.io), git log will fail because nothing has been committed to the repository.
git ls-files --stage will show your blob being referenced by the index.
At this point the file is not referenced by a tree or a commit, it’s only in the .git/index file.
git write-tree will take the contents of the index and write it to a tree, and the tree will have its own hash ID.
If you followed along with the link above, you’d have the same hash from the write-tree that we get.
A tree containing the same blob and sub-trees will always have the same hash.
The low-level write-tree command is used to take the contents of the index and write them into a new tree in preparation for a commit.
git commit-tree takes a tree’s hash ID and makes a commit that holds it.
If you wanted that commit to reference a parent, you’d have to manually pass in the parent’s commit ID with the -p argument.
This commit ID will be different for everyone because it uses the name of the creator of the commit as well as the date when the commit is created to generate the hash ID.
Now you have to overwrite the contents of .git/refs/heads/master with the latest commit hash ID.
This tells Git that the branch named master should now reference the new commit.
A safer way to do this, if you were doing this low-level stuff, is to use git update-ref refs/heads/master hashID.
git symbolic-ref HEAD refs/heads/master then associates the working tree with the HEAD of master.
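Stringing those plumbing commands together, a hedged end-to-end sketch (hashes will differ on your machine; the repo and file names are arbitrary):
git init demo && cd demo
echo "Hello, world!" > greeting.txt
git add greeting.txt
git ls-files --stage                                  # the blob is now referenced by the index
tree=$(git write-tree)                                # write the index out as a tree object
commit=$(git commit-tree -m "Initial commit" $tree)   # wrap the tree in a commit (add -p <parent> for later commits)
git update-ref refs/heads/master $commit              # point the master branch at the new commit
git symbolic-ref HEAD refs/heads/master               # associate the working tree with master
git log --oneline                                     # the hand-rolled commit shows up like any other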
What Have We Learned?
Blobs are unique!
Blobs are held by Trees, Trees are held by Commits.
HEAD is a pointer to a particular commit.
Commits usually have a parent, i.e. previous, commit.
We’ve got a better understanding of the detached HEAD state.
What a lot of those files mean in the .git directory.
Resources We Like
Things I wish everyone knew about Git (Part 1) (blog.plover.com)
Have you ever heard the tale of … the forbidden files in Windows? Windows has a list of names that you cannot use for files. Twitter user @foone has done the unthinkable and created a repository of these files. What would happen if you checked this repository out on Windows?
Check out this convenient repository in Windows. (GitHub)
When you use mvn dependency:tree, grep is your enemy. If you want to find out who is bringing in a specific dependency, you really need to use the -Dincludes flag.
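For example (the group/artifact below is just an illustration):
mvn dependency:tree -Dincludes=com.fasterxml.jackson.core:jackson-databind   # show only the paths that pull in this dependency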
Thanks to @ttutko for this tip about redirecting output:
kafkacat 2>&1 | grep "". If you’re not familiar with that syntax, it just means pipe STDERR to STDOUT and then pipe that to grep.
Thanks Volkmar Rigo for this one!
Dangit, Git!? Git is hard: messing up is easy, and figuring out how to fix your mistakes is impossible. This website has some tips to get you out of a jam. (DangitGit.com)
How to vacay … step 1 temporarily disable your work email (and silence Slack, Gchat, whateves).
On iOS, go to Settings -> Mail -> Accounts -> Select your work account -> Turn off the Mail slider.
After working with Git for over a decade, we decide to take a deep dive into how it works, while Michael, Allen, and Joe apparently still don’t understand Git.
This episode was inspired by an article written by Mark Dominus.
Git commits are immutable snapshots of the repository.
Branches are named sequences of commits.
Every object gets a unique id based on its content.
The author is not a fan of how the command set has evolved over time.
With Git, you need to think about what state your repository is in, and what state you would like to be in.
There are likely a number of ways to achieve that desired state.
If you try to understand the commands without understanding the model, you can get lost. For example:
git reset does three different things depending on the flags used,
git checkout even worse (per the author), and
The opposite of git-push is not git-pull, it’s git-fetch.
Possibly the worst part of the above is if you don’t understand the model and what’s happening to the model, you won’t know the right questions to ask to get back into a good state.
Mark said the thing that saved him from frustration with Git is the book Git from the Bottom Up by John Wiegley (jwiegley.github.io)
Mark doesn’t love Git, but he uses it by choice and he uses it effectively. He said that reading Wiegley’s book is what changed everything for him. He could now “see” what was happening with the model even when things went wrong.
It is very hard to permanently lose work. If something seems to have gone wrong, don’t panic. Remain calm and ask an expert.
Mark Dominus
Git from the Bottom Up
A repository – “is a collection of commits, each of which is an archive of what the project’s working tree looked like at a past date, whether on your machine or someone else’s.” It defines HEAD, which identifies the branch or commit the current tree started from, and contains a set of branches or tags that allow you to identify commits by a name.
The index is what will be committed on the next commit. Git does not commit changes from the working tree into the repository directly so instead, the changes are registered into the index, which is also referred to as a staging area, before committing the actual changes.
A working tree is any directory on your system that is associated with a Git repository and typically has a .git folder inside it.
Why typically? Thanks to the git-worktree command, one .git directory can be used to support multiple working trees, as previously discussed in episode 128.
A commit is a snapshot of your working tree at some point in time. “The state of HEAD (see below) at the time your commit is made becomes that commit’s parent. This is what creates the notion of a ‘revision history’.”
A branch is a name for a commit, also called a reference. This stores the history of commits, the lineage and is typically referred to as the “branch of development”
A tag is also a name for a commit, except that it always points to the same commit unlike a branch which doesn’t have to follow this rule as new commits can be made to the branch. A tag can also have its own description text.
master was typically, maybe not so much now, the default branch name where development is done in a repository. Any branch name can be configured as the default branch. Currently, popular default branch names include main, trunk, and dev.
HEAD is an alias that lets the repository identify what’s currently checked out. If you checkout a branch, HEAD now symbolically points to that branch. If you checkout a tag, HEAD now refers only to that commit and this state is referred to as a “detached HEAD“.
The typical work flow goes something like:
Create a repository,
Do some work in your working tree,
Once you’ve achieved a good “stopping point”, you add your changes to the index via git add, and then
Once your changes are in the state you want them and in your index, you are ready to put your changes into the actual repository, so you commit them using git commit.
Resources We Like
Things I wish everyone knew about Git (Part 1) (blog.plover.com)
Designing Data-Intensive Applications – SSTables and LSM-Trees (episode 128)
Tip of the Week
Celeste is a tough, but forgiving game that is on all major platforms. It was developed by a tiny team, 2 programmers, and it’s a really rewarding and interesting experience. Don’t sleep on this game any longer! (CelesteGame.com)
Enforcer Maven plugin is a tool for unknotting dependency version problems, which can easily get out of control and be a real problem when trying to upgrade!
Maven Enforcer Plugin – The Loving Iron Fist of MavenTM (maven.apache.org)
Tired of sending messages too early in Slack? You can set your Slack preferences to make ENTER just do a new line! Then use CMD + ENTER on MacOS or CTRL + ENTER on Windows to send the message! Thanks for the amazing tip from Jim Humelsine! (Slack)
Using Docker Desktop, and want to run a specific version? Well … you can’t really! You have to pick a version of Docker Desktop that corresponds to your target version of Kubernetes!
Alternatively you can just use Minikube to target a specific Kubernetes version (minikube.sigs.k8s.io)
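For instance (the version number is illustrative):
minikube start --kubernetes-version=v1.24.3   # spin up a cluster pinned to a specific Kubernetes version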
Save a life, donate blood, platelets, plasma, or marrow (redcrossblood.org)
What if you want to donate blood marrow or cord blood? You need to be matched with a recipient first. Check eligibility on the website at Be The Match. (bethematch.org)
Also, not quite as important, you can disable all of the stupid sounds (bells) in WSL!
Disable beep in WSL terminal on Windows 10 (Stack Overflow)
Once again, Stack Overflow takes the pulse of the developer community where we have all collectively decided to switch to Clojure, while Michael is changing things up, Joe is a future predicting trailblazer, and Allen is “up in the books”.
Joe’s going to be speaking at the Orlando Elastic Meetup about running Elasticsearch in Kubernetes on July 27th 2022 (Meetup)
Recommendation, keep your API interface in separate modules from your implementation! That makes it easier to re-use that code in new ways without having to refactor first.
Do you worry about talking too much in virtual meetings? This app monitors your mic and lets you know when you’re waffling on. (unblah.me)
Did you know you can do a Perl-compatible regex search in grep? Example: grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c (Stack Overflow)
But what if you want to do that in vim? By default vim treats characters literally, but you can turn on “magic” characters with :set magic and then you’re off to the races! (Stack Overflow)
Looking for some great IntelliJ Code Completion Tips? Check out this video! (YouTube)
Did you know yum install will not return an error code when installing multiple packages at the same time if one succeeds and another fails. Yikes! So be sure to install your dependencies independent of their dependents. (Stack Overflow)
We’re going back in time, or is it forward?, as we continue learning about Google’s automation evolution, while Allen doesn’t like certain beers, Joe is a Zacker™, and Michael poorly assumes that UPSes work best when plugged in.
A cautionary, err, educational tale of automating MySQL for Ads and automating replica replacements.
Migrating MySQL to Borg (Google Cluster Manager)
Large-scale cluster management at Google with Borg (research.google)
Desired goals of the project:
Eliminate machine/replica maintenance,
Ability to run multiple instances on same machine.
Came with additional complications – Borg task moving caused problems for master database servers.
Manual failovers took a long time.
Human involvement in the failovers would take longer than the required 30 seconds or less downtime.
Led to automating failover and the birth of MoB (MySQL on Borg).
Again, more problems because now application code needed to become much more failure tolerant.
After all this, mundane tasks dropped by 95%, and with that they were able to optimize and automate other things causing total operational costs to drop by 95% as well.
Automating Cluster Delivery
Story about a particular setup of Bigtable that didn’t use the first disk of a 12 disk cluster.
Some automation assumed that if the first disk wasn’t being utilized, then none of the disks were configured and all of them were safe to be wiped.
Automation should be careful about implicit “safety” signals.
Cluster delivery automation depended on a lot of bespoke shell scripts which turned out to be problematic over time.
Detecting Inconsistencies with ProdTest
Cluster automations required custom flags, which led to constant problems / misconfigurations.
Shell scripts became brittle over time.
Were all the services available and configured properly?
Were the packages and configurations consistent with other deployments?
Could configuration exceptions be verified?
For this, ProdTest was created.
Tests could be chained to other tests and failures in one would abort causing subsequent tests to not run.
The tests would show where something failed and with a detailed report of why.
If something new failed, they could be added as new tests to help quickly identify them in the future.
These tools gave visibility into what was causing problems with cluster deployments.
While finding things quicker was nice, that didn’t mean faster fixes. Dozens of teams with many shell scripts meant that fixing these things could be a problem.
The solution was to pair misconfigurations with automated fixes that were idempotent.
This sounded good, but in reality some fixes were flaky and not truly idempotent, which would leave the state “off” and cause other tests to start failing.
There was also too much latency between a failure, the fix, and another run.
Specializing
Automation processes can vary in one of three ways:
Competence,
Latency,
Relevance: the proportion of real world processes covered by automation.
They attempted to use “turnup” teams that would focus on automation tasks, i.e. teams of people in the same room. This would help get things done quicker.
This was short-lived.
Could have been over a thousand changes a day to running systems!
When the automation code wasn’t staying in sync with the code it was covering, that would cause even more problems. This is the real world. Underlying systems change quickly and if the automation handling those systems isn’t kept up, then more problems crop up.
This created some ugly side effects by relieving teams who ran services of the responsibility to maintain and run their automation code, which created ugly organizational incentives:
A team whose primary task is to speed up the current turnup has no incentive to reduce the technical debt of the service-owning team running the service in production later.
A team not running automation has no incentive to build systems that are easy to automate.
A product manager whose schedule is not affected by low-quality automation will always prioritize new features over simplicity and automation.
Turnups became inaccurate, high-latency, and incompetent.
They were saved by security’s removal of SSH-based approaches in favor of more auditable, less-privileged approaches.
Service Oriented Cluster Turnup
Changed from writing shell scripts to RPC servers with fine-grained ACL (access control lists).
Service owners would then create / own the admin servers that would know how their services operated and when they were ready.
These RPCs would send more RPCs to admin servers when their ready state was reached.
This resulted in low-latency, competent, and accurate processes.
The goal: autonomous systems that need no human intervention.
Borg: Birth of the Warehouse-Scale Computer
In the early days, Google’s clusters were racks of machines with specific purposes.
Developers would log into machines to perform tasks, like delivering “golden” binaries.
As Google grew, so did the number and type of clusters. Eventually machines started getting a descriptor file so developers could act on types of machines.
Automation eventually evolved to storing the state of machines in a proper database, with sophisticated monitoring tools.
This automation was severely limited by being tied to physical machines with physical volumes, network connections, IP addresses, etc.
Borg let Google orchestrate at the resource level, allocating compute dynamically. Suddenly one physical computer could have multiple types of workloads running on it.
This let Google centralize its logic, making it easier to make systemic changes that improve efficiency, flexibility, and reliability.
This allowed Google to greatly scale its resources without scaling its labor.
Thousands of machines are born, die, and go into repair daily without any developer interaction.
They effectively turned a hardware problem into a software problem, which allowed them to take advantage of well known techniques and algorithms for scheduling processes.
This couldn’t have happened if the system wasn’t self-healing. Systems can’t grow past a certain point without this.
Reliability is the Fundamental Feature
Internal operations that automation relies on need to be exposed to people as well.
As systems become more and more automated, the ability for people to reason about the system deteriorates due to lack of involvement and practice.
They note that the above holds when systems are non-autonomous, i.e. when the manual actions that were automated are assumed to still be doable by hand, but that assumption often no longer reflects reality.
While Google has to automate due to its scale, there is still a benefit to automation for software / systems that aren't at that scale, and that benefit is reliability. Reliability is the ultimate benefit of automation.
Automation also speeds processes up.
Best to start thinking about automation in the design phase as it’s difficult to retrofit.
Beware – Enabling Failure at Scale
Story about automation that wiped out almost all the machines on a CDN: when they re-ran the Diskerase process, it found that there were no machines needing to be wiped, and the automation treated that "empty set" as meaning wipe everything.
This caused the team to build in more sanity checks and some rate limiting!
Resources We Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Apple’s Self Service Repair now available (apple.com)
Tip of the Week
kubectl debug is a useful utility command that helps you debug issues. Here are a couple examples from the docs using kubectl debug (kubernetes.io)
Adding ephemeral debug containers to pods,
Copying existing pods and adding additional containers,
Debugging pending pods,
Pods that immediately fail.
The Kubernetes docs feature a lot of nice tips for debugging (kubernetes.io)
Did you know that JetBrains makes it easy to add logging while you’re debugging? Just highlight the code you want to log the value of, then SHIFT-CLICK the gutter to set a logging point during debugging!
Want to copy a file out of an image, without running it? You can't do it directly, but you can docker create a non-running container from the image, docker cp the file out of it, and then docker rm the container when you're done. Naming the container when you create it makes the later commands much easier. (docs.docker.com)
We explore the evolution of automation as we continue studying Google’s Site Reliability Engineering, while Michael, ah, forget it, Joe almost said it correctly, and Allen fell for it.
rupeshbende asks: How do you find time to do this along with your day job and hobbies as this involves so much studying on your part?
Survey Says
Automation
Why Do We Automate Things?
Consistency: Humans make mistakes, even on simple tasks. Machines are much more reliable. Besides, tasks like creating accounts, resetting passwords, applying updates aren’t exactly fun.
Platform: Automation begets automation, smaller tasks can be tweaked or combined into bigger ones.
Pays dividends, providing value every time it’s used as opposed to toil which is essentially a tax.
Platforms centralize logic too, making it easier to organize, find, and fix issues.
Automation can provide metrics, measurements that can be used to make better decisions.
Faster Repairs: The more often automation runs, the more often it hits the same problems and solutions, which brings down the average time to fix and makes each repair cheaper.
Faster Actions: Automation is faster than humans. Many automated actions would be prohibitively expensive for humans to do.
Time Saving: It’s faster in terms of actions, and anybody can run it.
If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with less special effects and more pissed off System Administrators.
Joseph Bironas
The Value of SRE at Google
Google has a strong bias for automation because of their scale.
Google’s core is software, and they don’t want to use software where they don’t own the code and they don’t want processes in place that aren’t automated. You can’t scale tribal knowledge.
They invest in platforms, i.e. systems that can be improved and extended over time.
Google’s Use Cases for Automation
Much of Google’s automation is around managing the lifecycle of systems, not their data.
They use tools such as chef, puppet, cfengine, and PERL(!?).
The trick is getting the right level of abstraction.
Higher level abstractions are easier to work with and reason about, but are “leaky”.
Hard to account for things like partial failures, partial rollbacks, timeouts, etc.
The more generic a solution, the more broadly it can be applied and the more reusable it tends to be, but the downside is that you lose flexibility and resolution.
The Use Cases for Automation
Google’s broad definition of automation is “meta-software”: software that controls software.
Examples:
Account creation, termination,
Cluster setup, shutdown,
Software install and removal,
Software upgrades,
Configuration changes, and
Dependency changes
A Hierarchy of Automation Classes
Ideally you wouldn’t need to stitch systems together to get them to work together.
Systems that are separate and held together by glue code can suffer from "bit rot", i.e. changes to either system can interact poorly with the other or wreak havoc.
Glue code is some of the hardest to test and maintain.
There are levels of maturity in a system. The more rare and risky a task is, the less likely it is to be fully automated.
Maturity Model
No automation: database failover to a new location manually.
Externally maintained system-specific automations: SRE has a couple commands they run in their notes.
Externally maintained generic system-specific automation: SRE adds a script to a playbook.
Internally maintained system-specific automation: the database ships with a script.
System doesn’t need automation: Database notices and automatically fails over.
Can you automate so much that developers are unable to manually support systems when a (very rare) need occurs?
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Chapter 7: The Evolution of Automation at Google (sre.google)
Ultimate List of Programmer Jokes, Puns, and other Funnies (Medium)
Shared success in building a safer open source community (blog.google)
One Man’s Nearly Impossible Quest to Make a Toaster From Scratch (Gizmodo)
The Man Who Spent 17 Years Building The Ultimate Lamborghini Replica In His Basement Wants to Sell It (Jalopnik)
Tip of the Week
There's an easy way to see the Mongo queries that are running in your Spring app: just set the appropriate logging level, like logging.level.org.springframework.data.mongodb.core.MongoTemplate=DEBUG
This can even be done at runtime if you have the actuator endpoints enabled. (Spring)
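If you want a sketch of what that runtime change can look like (assuming the loggers actuator endpoint is exposed and the app is reachable on localhost:8080, both assumptions here), you can POST the new level to the logger by name:

# Sketch: bump the MongoTemplate logger to DEBUG at runtime through Spring Boot's
# actuator loggers endpoint. The host, port, and endpoint exposure are assumptions.
import requests

ACTUATOR_LOGGERS = "http://localhost:8080/actuator/loggers"
LOGGER_NAME = "org.springframework.data.mongodb.core.MongoTemplate"

response = requests.post(
    f"{ACTUATOR_LOGGERS}/{LOGGER_NAME}",
    json={"configuredLevel": "DEBUG"},  # POSTing {"configuredLevel": null} resets it later
    timeout=5,
)
response.raise_for_status()
print("logger updated:", response.status_code)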
There's a new, open-core product from Grafana called OnCall that helps you manage production support. Might be really interesting if you're already invested in Grafana, as a lot of organizations are. (Grafana)
How can you configure your Docker container to run as a restricted user? It’s easy! (docs.docker.com)
USER <user>[:<group>]
USER <UID>[:<GID>]
iOS – Remember the days of being able to rearrange your screens in iTunes? Turns out you still can, right in iOS. Tap and hold the dots to rearrange them! (support.apple.com)
We finished. A chapter, that is, of the Site Reliability Engineering book as Allen asks to make it weird, Joe has his own pronunciation, and Michael follows through on his promise.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Another great post from @msuriar, this time about the value of hiring junior developers. (suriar.net)
Survey Says
More about Monitoring Less
Instrumentation and Performance
Be careful not to track timings, such as latencies, using only medians or means.
A better way is to bucketize the data as a histogram, meaning you count how many requests fell into each bucket, such as the book's example latency buckets of 0ms – 10ms, 10ms – 30ms, 30ms – 100ms, etc.
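Here's a rough illustration of the bucketing idea (the bucket boundaries mirror the book's example and the sample latencies are made up):

# Count request latencies into histogram buckets instead of tracking only a mean.
from bisect import bisect_right

# Upper bounds (ms) for each bucket: 0-10, 10-30, 30-100, 100-300, 300-1000, 1000+.
BOUNDS = [10, 30, 100, 300, 1000]

def bucketize(latencies_ms):
    counts = [0] * (len(BOUNDS) + 1)  # the last slot catches anything above the largest bound
    for value in latencies_ms:
        counts[bisect_right(BOUNDS, value)] += 1
    return counts

samples = [4, 12, 8, 250, 95, 33, 7, 1200]  # made-up latencies in milliseconds
print(bucketize(samples))                   # [3, 1, 2, 1, 0, 1]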
Choosing the Appropriate Resolution for Measurements
The gist is that you should measure at intervals that support the SLO’s and SLA’s.
For example, if you’re targeting a 99.9% uptime, there’s no reason to check for hard-drive fullness more than once or twice a minute.
Collecting measurements can be expensive, for both storage and analysis.
Best to take an approach like the histogram and keep counts in buckets and aggregate the findings, maybe per minute.
As Simple as Possible, No Simpler
It’s easy for monitoring to become very complex:
Alerting on varying thresholds and measurements,
Code to detect possible causes,
Dashboards, etc.
Monitoring can become so complex that it's difficult to change and maintain, and it becomes fragile.
Some guidelines to follow to keep your monitoring useful and simple include:
Rules that find incidents should be simple, predictable and reliable,
Data collection, aggregation and alerting that is infrequently used (the book said less than once a quarter) should be a candidate for the chopping block, and
Data that is collected but not used in any dashboards or alerting should be considered for deletion.
Avoid attempting to pair simple monitoring with other things such as crash detection, log analysis, etc. as this makes for overly complex systems.
Tying these Principles Together
Google's monitoring philosophy is admittedly hard to attain, but it's a good foundation for your goals.
Ask the following questions to avoid pager duty burnout and false alerts:
Does the rule detect something that is urgent, actionable and visible by a user?
Will I ever be able to ignore this alert and how can I avoid ignoring the alert?
Does this alert definitely indicate negatively impacted users and are there cases that should be filtered out due to any number of circumstances?
Can I take action on the alert and does it need to be done now and can the action be automated? Will the action be a short-term or long-term fix?
Are other people getting paged about this same incident, meaning this is redundant and unnecessary?
Those questions reflect these notions on pages and pagers:
Pages are extremely fatiguing and people can only handle a few a day, so they need to be urgent.
Every page should be actionable.
If a page doesn’t require human interaction or thought, it shouldn’t be a page.
Pages should be about novel events that have never occurred before.
It’s not important whether the alert came from white-box or black-box monitoring.
It's more important to spend effort on catching symptoms rather than causes, and to only alert on causes that are definite and imminent.
Monitoring for the Long Term
Monitoring systems are tracking ever-changing software systems, so decisions about it need to be made with long term in mind.
Sometimes, short-term fixes are important to get past acute problems and buy you time to put together a long term fix.
Two case studies that demonstrate the tension between short and long term fixes
Bigtable SRE
Originally Bigtable’s SLO was based on an artificial, good client’s mean performance.
Bigtable had some low level problems in storage that caused the worst 5% of requests to be significantly slower than the rest.
These slow requests would trip alerts but ultimately the problems were transient and unactionable.
People learned to de-prioritize these alerts, which sometimes were masking legitimate problems.
Google SRE’s temporarily dialed back the SLO to the 75th percentile to trigger fewer alerts and disabled email alerts, while working on the root cause, fixing the storage problems.
By slowing the alerts it gave engineers the breathing room they needed to deep dive the problem.
Gmail
Gmail was originally built on a distributed process management system called Workqueue which was adapted to long-lived processes.
Tasks would get de-scheduled causing alerts, but the tasks only affected a very small number of users.
The root cause bugs were difficult to fix because ultimately the underlying system was a poor fit.
Engineers could “fix” the scheduler by manually interacting with it (imagine restarting a server every 24 hours).
Should the team automate the manual fix, or would this just stall out what should be the real fix?
These are 2 red flags: Why have rote tasks for engineers to perform? That’s toil. Why doesn’t the team trust itself to fix the root cause just because an alarm isn’t blaring?
What’s the takeaway? Do not think about alerts in isolation. You must consider them in the context of the entire system and make decisions that are good for the long term health of the entire system.
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Python has built in functionality for dynamically reloading modules: Reloading modules in Python. (GeeksForGeeks)
Dockerfile tips-n-tricks:
Concatenate RUN statements like RUN some_command && some_other_command instead of splitting it out into two separate RUN command strings to reduce the layer count.
Prefer apk add --no-cache some_package over apk update && apk add some_package to reduce the layer and image size. And if you’re using apt-get instead of apk, be sure to include apt-get clean as the final command in the RUN command string to keep the layer small.
When using ADD and COPY, be aware that Docker will need the file(s)/directory in order to compute the checksum to know if a cached layer already exists. This means that while you can ADD some_url, Docker needs to download the file in order to compute the checksum. Instead, use curl or wget in a RUN statement when possible, because Docker will only compute the checksum of the RUN command string before executing it. This means you can avoid unnecessarily downloading files during builds (especially on a build server and especially for large files). (docs.docker.com)
We haven't finished the Site Reliability Engineering book yet as we learn how to monitor our system while the deals at Costco are so good, Allen thinks they're fake, Joe hasn't attended a math class in a while, and Michael never had AOL.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
News
Thank you for the reviews! just_Bri, 1234556677888999900000, Mannc, good beer hunter
Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)
Couldn’t resist posting this:
Survey Says
Monitor Some of the Things
Terminology
Monitoring – Collecting, processing, and aggregating quantitative information about a system.
White-box monitoring – Monitoring based on metrics exposed by a system, i.e. logs, JVM profiling, etc.
Black-box monitoring – Monitoring a system as a user would see it.
Dashboard – Provides a summary view of the most important service metrics. May display team information, ticket queue size, high priority bugs, current on call engineer, recent pushes, etc.
Alert – Notification intended to be read by a human, such as tickets, email alerts, pages, etc.
Root cause – A defect that, if corrected, creates high confidence that the same issue won't be seen again. There can be multiple root causes for a particular incident (including a lack of testing!).
Node and machine – A single instance of a running kernel.
Kernel – The core of the operating system. Generally controls everything on the system, always resident in memory, and facilitates interactions between the system hardware and software. (Wikipedia)
There could be multiple services worth monitoring on the same node that could be either related or unrelated.
Push – Any change to a running service or its configuration.
Why Monitor?
Some of the main reasons include:
To analyze trends,
To compare changes over time,
To alert when there's a problem,
To build dashboards that answer basic questions, and
To perform ad hoc analysis when things change, in order to identify what may have caused it.
Monitoring lets you know when the system is broken or may be about to break.
You should never alert just because something seems off.
Paging a human is an expensive use of time.
Too many pages may be seen as noise and reduce the likelihood of thorough investigation.
Effective alerting systems have good signal and very low noise.
Setting Reasonable Expectations for Monitoring
Monitoring complex systems is a major undertaking.
The book mentions that Google SRE teams with 10-12 members have one or two people focused on building and maintaining their monitoring systems for their service.
They’ve reduced the headcount needed for maintaining these systems as they’ve centralized and generalized their monitoring systems, but there’s still at least one human dedicated to the monitoring system.
They also ensure that it’s not a requirement that an SRE stare at the screen to identify when a problem comes up.
Google has since moved to simpler and faster monitoring systems that provide better tools for ad hoc analysis and avoid systems that try to determine causality.
This doesn’t mean they don’t monitor for major changes in common trends.
SRE’s at Google seldom use tiered rule triggering.
Why? Because they’re constantly changing their service and/or infrastructure.
When they do alert on these dependent types of rules, it’s when there’s a common task that’s carried out that is relatively simple.
It is critical that, from the instant a production issue arises, the monitoring system alerts a human quickly and provides an easy-to-follow process that people can use to find the root cause quickly.
Alerts need to be simple to understand and represent the failure clearly.
Symptoms vs Causes
A monitoring system should answer these two questions:
What is broken? This is the symptom.
Why is it broken? This is the cause.
The book says that drawing the line between the what and why is one of the most important ways to make a good monitoring system with high quality signals and low noise.
An example might be:
Symptom: The web server is returning 500s or 404s,
Cause: The database server ran out of hard-drive space.
Black-Box vs White-Box
Google SRE’s use white-box monitoring heavily, and much less black-box monitoring except for critical uses.
White-box monitoring relies on inspecting the internals of a system.
Black-box monitoring is symptom oriented and helps identify unplanned issues.
An interesting takeaway for white-box monitoring is that it exposes issues that may otherwise be hidden by things like retries.
A symptom for one team can be a cause for another.
White-box monitoring is crucial for telemetry.
Example: the website thinks the database is slow, but does the database itself think it's slow? If not, there may be a network issue.
The benefit of black-box monitoring for alerting is that it indicates a problem that is happening right now; however, it's basically useless for letting you know that a problem may happen soon.
Four Golden Signals
Latency – The time it takes to service a request.
Important to separate successful request latency vs failed request latency.
A slow error is worse than a fast error!
Traffic – How much demand is being placed on your system, such as requests per second for a web request, or for streaming audio/video, it might be I/O throughput.
Errors – The rate of requests that fail, either explicitly or implicitly.
Explicit errors are things like a 500 HTTP response.
Implicit might be any request that took over 2 seconds to finish if your goal is to respond in less than 2 seconds.
Saturation – How full your service is.
A measure of resources that are the most constrained, such as CPU or I/O, but note that things usually start to degrade before 100% utilization.
This is why having a utilization target is important.
Latency increases are often indicators of saturation.
Measuring the 99th percentile response time over a small interval can be an early signal of saturation.
Saturation is also concerned with predicting imminent issues, like filling up drive space, etc.
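As a toy sketch of deriving the four signals from a batch of requests (the record shape, the 2-second implicit-error threshold, and the CPU figures are all made-up assumptions for illustration):

# Toy calculation of the four golden signals from one minute of request records.
requests_last_minute = [
    {"latency_s": 0.12, "status": 200},
    {"latency_s": 0.45, "status": 200},
    {"latency_s": 2.30, "status": 200},  # implicitly an error: blew the 2s goal
    {"latency_s": 0.05, "status": 500},
]

ok = [r for r in requests_last_minute if r["status"] < 500]
failed = [r for r in requests_last_minute if r["status"] >= 500]

# Latency: keep successful and failed request latencies separate.
avg_ok_latency = sum(r["latency_s"] for r in ok) / len(ok) if ok else 0.0
avg_failed_latency = sum(r["latency_s"] for r in failed) / len(failed) if failed else 0.0

# Traffic: demand placed on the system, here requests per second.
traffic_rps = len(requests_last_minute) / 60

# Errors: explicit failures plus "implicit" ones that missed the 2s latency goal.
implicit_errors = [r for r in ok if r["latency_s"] > 2.0]
error_rate = (len(failed) + len(implicit_errors)) / len(requests_last_minute)

# Saturation: utilization of the most constrained resource (CPU in this sketch).
cpu_used_cores, cpu_capacity_cores = 5.5, 8.0
saturation = cpu_used_cores / cpu_capacity_cores

print(avg_ok_latency, avg_failed_latency, traffic_rps, error_rate, saturation)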
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Post-Incident Review on the Atlassian April 2022 outage (Atlassian)
Great episode on All The Code featuring Brandon Lyons and his journey to Microsoft. (ListenNotes.com)
Tip of the Week
Prometheus has configurations that let you tune how often it looks for metrics, i.e. the scrape_interval. Scrape too often and you're wasting resources; not often enough and you can miss important information and get false alerts. (Prometheus)
There’s a reason WordPress is so popular. It’s fast and easy to setup, especially if you use Webinonly. (Webinonly.com)
Looking for great encryption libraries for Java or PHP? Check out Bouncy Castle! (Bouncy Castle)
Big thanks to @bicylerepairmain for the tip on running selected lines of code in VS Code with a keyboard shortcut. The option workbench.action.terminal.runSelectedText is under File -> Preferences -> Keyboard Shortcuts. (Stack Overflow)
Need to see all of the files you’ve changed since you branched off of a commit? Use git diff --name-only COMMIT_ID_SHA HEAD. (git-scm.com)
Couple with Allen’s tip from episode 182 to make it easier to find that starting point!
We say “toil” a lot this episode while Joe saw a movie, Michael says something controversial, and Allen’s tip is to figure it out yourself, all while learning how to eliminate toil.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Reviews
Thank you for the reviews! AA, Franklin MacDunnaduex, BillyVL, DOM3ag3
Toil is not just work you don’t wanna do, nor is it just administrative work or tedious tasks.
Toil is different for every individual.
Some administrative work has to be done and is not considered toil but rather it’s overhead.
HR needs, trainings, meetings, etc.
Even some tedious tasks that pay long term dividends cannot be considered toil.
Cleaning up service configurations was an example of this.
Toil, further defined, is work that tends to be manual, repetitive, automatable, tactical, without enduring value, and/or grows linearly as the service does.
Manual – Something a human has to do.
Repetitive – Running something once or twice isn’t toil. Having to do it frequently is.
Automatable – If a machine can do it, then it should be done by the machine. If the task needs human judgement, it’s likely not toil.
Tactical – Interrupt driven rather than strategy driven. May never be able to eliminate completely but the goal is to minimize this type of work.
No enduring value – If your service didn’t change state after the task was completed, it was likely toil. If there was a permanent improvement in the state of the service then it likely wasn’t toil.
O(n) with service growth – If the amount of work grows with the growth of your service usage, then it’s likely toil.
Why is Less Toil Better?
At Google, the goal is to keep each SRE’s toil at less than 50%.
The other 50% should be developing solutions to reduce toil further, or make new features for a service.
Where features mean improving reliability, performance, or utilization.
The goal is set at 50% because it can easily grow to 100% of an SRE’s time if not addressed.
The time spent reducing toil is the “engineering” in the SRE title.
This engineering time is what allows the service to scale with less time required by an SRE to keep it running properly and efficiently.
When Google hires an SRE, they promise that they don’t run a typical ops organization and mention the 50% rule. This is done to help ensure the group doesn’t turn into a full time ops team.
Calculating Toil
The book gave the example of a 6 person team and a 6 week cycle:
Assuming 1 week of primary on-call time and 1 week of secondary on-call time, that means an SRE has 2 of 6 weeks with “interrupt” type of work, or toil, meaning 33% is the lower bound of toil.
With an 8 person team, you move to an 8 week cycle, so 2 weeks on call out of 8 weeks means a 25% toil lower bound (see the quick calculation after this list).
At Google, SRE’s report their toil is spent most on interrupts (non-urgent, service related messages), then on-call urgent responses, then releases and pushes.
Surveys at Google with SRE’s indicate that the average time spent in toil is closer to 33%.
Like all averages, it leaves out outliers, such as people who spend 0 time toiling, and others who spend as much as 80% of their time on toil.
If there is someone taking on too much toil, it's up to the manager to spread that out better.
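Here's the lower-bound arithmetic from above as a tiny sketch (the two weeks of primary-plus-secondary on-call is the book's example assumption):

# On-call toil lower bound: weeks of primary + secondary on-call divided by the
# length of the rotation cycle (assumed equal to the team size, as in the example).
def toil_lower_bound(team_size: int, on_call_weeks: int = 2) -> float:
    return on_call_weeks / team_size

print(f"{toil_lower_bound(6):.0%}")  # 33% for a 6-person team
print(f"{toil_lower_bound(8):.0%}")  # 25% for an 8-person team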
What Qualifies as Engineering?
Work that requires human judgement,
Produces permanent improvements in a service and requires strategy,
Design driven approach, and
The more generic or general, the better as it may be applied to multiple services to get even greater gains in efficiency and reliability.
Typical SRE Activities
Software engineering – Involves writing or modifying code.
Systems engineering – Configuring systems, modifying configurations, or documenting systems that provide long term improvements.
Toil – Work that is necessary to run a service but is manual, repetitive, etc.
Overhead – Administrative work not directly tied to a service such as hiring, HR paperwork, meetings, peer-reviews, training, etc.
The 50% goal is over a few quarters or year. There may be some quarters where toil goes above 50%, but that should not be sustained. If it is, management needs to step in and figure out how to bring that back into the goal range.
“Let’s invent more, and toil less”
Site Reliability Engineering: How Google Runs Production Systems
Is Toil Always Bad?
The fact that some amount of toil is predictable and repeatable makes some individuals feel like they’re accomplishing something, i.e. quick wins that may be low risk and low stress.
Some amount of toil is expected and unavoidable.
When the amount of time spent on toil becomes too large, you should be concerned and “complain loudly”.
Potential issues with large amounts of toil:
Career stagnation – If you’re not spending enough time on projects, your career progression will suffer.
Low morale – Too much toil leads to burnout, boredom, and being discontent.
Too much time on toil also hurts the SRE team.
Creates confusion – The SRE team is supposed to do engineering, and if that’s not happening, then the goal of the team doesn’t match the work being done by the team.
Slows progress – The team will be less productive if they’re focused on toil.
Sets precedent – If you take on too much toil regularly, others will give you more.
Promotes attrition – If your group takes on too much toil, talented engineers in the group may leave for a position with more development opportunities.
Causes breach of faith – If someone joins the team but doesn’t get to do engineering, they’ll feel like they were sold a bill of goods.
Commit to cleaning up a bit more toil each week with engineering activities.
Resources We Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
The Greatest Inheritance, uh stars Jaleel White (IMDb)
Clean Code – How to Write Amazing Unit Tests (episode 54)
DevOps Vs SRE: Enabling Efficiency And Resiliency (harness.io)
Tip of the Week
Pandas is a great tool for data analysis. It’s fast, flexible and easy to use. Easy to work with information from GCS buckets. (pandas.pydata.org)
7 GUIs you can build to study graphical user interface design. Start with a counter and build up to recreating Excel, programming language agnostic! (eugenkiss.github.io)
Did you know there’s a bash util for sorting, i.e. sort? (manpages.ubuntu.com)
Using Minikube? Did you know you can transfer images with minikube image save from your Minikube environment to Docker easily? Useful for running things in a variety of ways. (minikube.sigs.k8s.io)
Ever have a multi-stage Dockerfile where you only wanted to build one of the intermediate stages? Great for debugging as well as part of your caching strategy: use docker build --target <stage name> to build just that stage. (docs.docker.com)
Welcome to the morning edition of Coding Blocks as we dive into what service level indicators, objectives, and agreements are while Michael clearly needs more sleep, Allen doesn’t know how web pages work anymore, and Joe isn’t allowed to beg.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Survey Says
News
Monolithic repos … meh. But monolithic builds … oh noes.
Chapter 4: Service Level Objectives
Service Level Indicators
A carefully defined, quantitative measure of some aspect of the service or system.
Response latency, error rate, system throughput are common SLIs.
SLIs are typically aggregated over some predefined period of time.
Usually SLIs directly measure some aspect of a system, but that's not always possible, as with client-side latency.
Availability is one of the most important SLIs, often expressed as the ratio of requests that succeed, sometimes called yield.
For storage purposes, durability, i.e. the retention of the data over time, is important.
Service Level Objectives
The SLO is the range of values that you want to achieve with your SLIs.
Choosing SLOs can be difficult. For one, you may not have any say in it!
An example of an SLO would be for response latency to be less than 250ms.
Often one SLI can impact another. For instance, if your number of requests per second rises sharply, so might your latency.
It is important to define SLOs so that users of the system have a realistic understanding of what the availability or reliability of the system is. This eliminates arbitrary “the system is slow” or the “system is unreliable” comments.
Google provided an example of a system called Chubby that is used extensively within Google where teams built systems on top of Chubby assuming that it was highly available, but no claim was made to that end.
Sort of crazy, but to ensure service owners didn't have unrealistic expectations of Chubby's uptime, they actually force planned downtime throughout the quarter.
Service Level Agreements
These are the agreements of what is to happen if/when the SLOs aren’t met.
If there is no consequence, then you’re likely talking about an SLO and not an SLA.
Typically, an SLA's consequences are monetary, i.e. there will be a credit on your bill if some service doesn't meet its SLO.
SLAs are typically decided by the business, but SREs help in making sure SLO consequences don’t get triggered.
SREs also help come up with objective ways to measure the SLOs.
Google search doesn’t have an SLA, even though Google has a very large stake in ensuring search is always working.
However, Google for Work does have SLAs with its business customers.
What Should You Care About?
You should not use every metric you can find as SLIs.
Too many and it’s just noisy and hard to know what’s important to look at.
Too few and you may have gaps in understanding the system reliability.
A handful of carefully selected metrics should be enough for your SLIs.
Some Examples
User facing services:
Availability – could the request be serviced,
Latency – how long did it take the request to be serviced, and
Throughput – how many requests were able to be serviced.
Storage systems:
Latency – how long did it take to read/write,
Availability – was it available when it was requested, and
Durability – is the data still there when needed.
Big data systems:
Throughput – how much data is being processed, and
End to end latency – how long from ingestion to completion of processing.
Everything should care about correctness.
Collecting Indicators
Many metrics come from the server side.
Some metrics can be scraped from logs.
Don’t forget about client-side metric gathering as there might be some things that expose bad user experiences.
The example Google used is knowing the latency before a page becomes usable, which could be poor due to some JavaScript on the page.
Aggregation
Typically aggregate raw numbers/metrics but you have to be careful.
Aggregations can hide true system behavior.
Example given: averaging requests per second. If odd seconds have 200 requests per second and even seconds have 0, then your average is 100, but what's being hidden is your true burst rate of 200 requests per second (worked through in the sketch at the end of this section).
Same thing with latencies: averaging may paint a pretty picture, but the long tail of latencies may be terrible for a handful of users.
Using distributions may be more effective at seeing the true story behind metrics.
In Prometheus, using a Summary metric uses quantiles so that you can see typical and worst case scenarios.
Quantile of 50% would show you the average request, while
Quantile of 99.99% would show you the worst request durations.
A really interesting takeaway here is that studies have shown that users prefer a system with low-variance but slower over a system with high variance but mostly faster.
In a low-variance system, SREs can focus on the 99% or 99.99% numbers, and if those are good, then everything else must be, too.
At Google, they prefer distributions over averages as they show the long-tail of data points, as mentioned earlier, averages can hide problems.
Also, don’t assume that data is distributed normally. You need to see the real results.
Another important point here is if you don’t truly understand the distribution of your data, your system may be taking actions that are wrong for the situation. For instance, if you think that you are seeing long latency times but you don’t realize that those latencies actually occur quite often, your systems may be restarting themselves prematurely.
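As a quick sketch of how averages hide both the burst example above and the long tail (the latency numbers are made up):

# Averages can hide burstiness and the long tail; percentiles make them visible.

# The alternating-seconds example: 200 req/s on odd seconds, 0 on even seconds.
per_second = [200, 0] * 30                # one minute of samples
print(sum(per_second) / len(per_second))  # 100.0 -- the 200 req/s bursts disappear

# A latency distribution where the mean looks fine but the tail is ugly.
latencies_ms = sorted([40] * 98 + [2500, 4000])  # 98 fast requests, 2 terrible ones
mean = sum(latencies_ms) / len(latencies_ms)
p50 = latencies_ms[len(latencies_ms) // 2]             # rough median by index
p99 = latencies_ms[int(len(latencies_ms) * 0.99) - 1]  # rough 99th percentile by index
print(mean, p50, p99)                     # 104.2 mean, 40 median, 2500 at p99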
Standardize some SLIs
This just means if you standardize on how, when, and what tools you use for gathering some of the metrics, you don’t have to convince or describe those metrics on every new service or project. Examples might include:
Aggregation intervals – distribution per minute, and
Frequency of metrics gathered – pick a time such as every 5 seconds, 10, etc.
Build reusable SLI templates so you don’t have to recreate the wheel every time.
Objectives in Practice
Find out what the users care about, not what you can measure!
If you choose what’s easy to measure, your SLOs may not be all that useful.
Defining Objectives
SLOs should define how they’re measured and what conditions make them valid.
Example of a good SLO definition – 99% of RPC calls averaged over one minute return in 100ms as measured across all back-end servers.
It is unrealistic to have your SLOs met 100% of the time.
As we mentioned in the previous episode, striving for 100% takes time away from adding new features or makes your team design overly conservatively.
This is why you should operate with an error budget.
An error budget is just an SLO for meeting other SLOs!
Site Reliability Engineering: How Google Runs Production Systems
Choosing Targets
Don’t choose SLO targets based on current performance.
Keep the SLOs simple. Making them overly complex makes them hard to understand and may be difficult to see impacts of system changes.
Avoid absolutes like “can scale infinitely”. It’s likely not true, and if it is, that means you had to spend a lot of time designing it to be that way and is probably overkill.
Have as few SLOs as possible. You want just enough to be able to ensure you can track the status of your system and they should be defendable.
Perfection can wait. Start with loose targets that you can refine over time as you learn more.
SLOs should be a major driver in what SREs work on, as they reflect what the business users care about.
Control Measures
Monitor system SLIs.
Compare SLIs to SLOs and see if action is needed.
If action is needed, figure out what action should be taken.
Take the action.
Example that was given is if you see latency climbing, and it appears to be CPU bound, then increasing the CPU capacity should lower latencies and not trigger an SLO consequence.
SLOs Set Expectations
Publishing SLOs makes it so users know what to expect.
You may want to use one of the following approaches:
Keep a safety margin by having a stricter internal SLO than the public facing SLO.
Don’t overachieve. If your performance is consistently better than your SLO, it might be worth introducing purposeful downtime to set user expectations more in line with the SLO, i.e. failure injection.
Agreements in Practice
The SRE’s role is to help those writing SLAs understand the likelihood or difficulty of meeting the SLOs/SLA being implemented.
You should be conservative in the SLOs and SLAs that you make publicly available.
These are very difficult to change once they’ve been made public.
The term SLA is often misused when people actually mean an SLO; an actual SLA breach may trigger a court case.
If you can’t win an argument about a particular SLO, it’s probably not worth having an SRE team work on it.
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
If you switch to a Mac and you’re struggling with the CMD / CTRL switch from Windows, look for driver software from the keyboard manufacturer as they likely have an option to swap the keys for you!
Metrics aren’t free! Be careful to watch your costs or you can get up to babillions quickly!
Did you know there is a file format you can use to import bookmarks? It’s really simple, just an HTML file. You can even use it for onboarding! (j11g.com)
Powerlevel10k is a Zsh theme that looks nice and is easy to configure, but it’s also good about caching your git status so it doesn’t bog down your computer trying to pull the status on every command, a must for Zsh users with large repos! (GitHub)
We learn how to embrace risk as we continue our learning about Site Reliability Engineering while Johnny Underwood talked too much, Joe shares a (scary) journey through his mind, and Michael, Reader of Names, ends the show on a dark note.
Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
Survey Says
Reviews
Thanks for the help Richard Hopkins and JR! Want to help out the show? Leave us a review!
Sadly, O’Reilly is ending their partnership with ACM, so you’ll no longer get access to their Learning Platform if you’re a member. (news.ycombinator.com)
Chapter 3: Embracing Risk
Google aims for 100% reliability right? Wrong…
Increasing reliability is always better for the service, right? Not necessarily.
It’s very expensive to add another 9 of reliability, and
Can’t iterate on features as you spend more time and resources making the service more stable.
Users don’t typically notice the difference between very reliable and extremely reliable services.
The systems using these services usually aren’t 100% reliable, so the chances of noticing are very low.
SRE’s try to balance the risk of unavailability with innovation, new features, and efficient service operations by optimizing for the right balance of all.
Managing Risk
Unstable systems diminish user confidence. We want to avoid that.
Cost does not scale with improvements to reliability.
As you improve reliability the cost can actually increase many times over.
Two dimensions of cost:
Cost of redundancy in compute resources, and
The opportunity cost of trading features for reliability focused time.
SREs try to balance business goals in reliability with the risk of service reliability.
If the business goal is 99.99% reliable, then that’s exactly what the SRE will aim for, with maybe just a touch more.
They treat the target like a minimum and a maximum.
Measuring Service Risk
Identify an objective metric for a property of the system to optimize.
Only by doing this can you measure improvements or degradation over time.
At Google, they focus on unplanned downtime.
Unplanned downtime is measured in relation to service availability.
Availability = Uptime / (Uptime + Downtime).
A 99.99% target means a maximum of 52.56 minutes downtime in a year.
At Google, they don’t use uptime as the metric as their services are globally distributed and may be up in many regions while being down in another.
Rather, they use the successful request rate.
Success rate = total successful requests / total requests.
A 99.99% target here would mean you could have 250 failures out of 2.5M requests in a day (verified in the quick calculation after this list).
NOTE: not all services are the same.
A new user signup is likely way more important than a polling service for checking for new emails for a user.
At Google they also use this success rate for non-customer facing systems.
Google often sets quarterly availability targets and may track those targets weekly or even daily.
Doing so allows for fixing any issues as quickly as possible.
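To make the 99.99% examples above concrete, here's the arithmetic as a small sketch (a 365-day year is assumed):

# Quick check of the 99.99% availability examples.
target = 0.9999

# Time-based availability: Availability = Uptime / (Uptime + Downtime).
minutes_per_year = 365 * 24 * 60
allowed_downtime_min = (1 - target) * minutes_per_year
print(round(allowed_downtime_min, 2))  # 52.56 minutes of downtime per year

# Request-based availability: success rate = successful requests / total requests.
total_requests_per_day = 2_500_000
allowed_failures = round((1 - target) * total_requests_per_day)
print(allowed_failures)                # 250 failed requests per day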
Risk Tolerance Services
SRE’s should work directly with the business to define goals that can be engineered.
Sometimes this can be difficult because measuring consumer services is clearly definable whereas infrastructure services may not have a direct owner.
Identifying the Risk Tolerance of Consumer Services
Often a service will have its own dedicated team and that team will best know the reliability requirements of that service.
If there is no owning team, often times the engineers will assume the role of defining the reliability requirements.
Factors in assessing the risk tolerance of a service
What level of availability is needed?
Do different failures have different effects on the service?
Use the service cost to help identify where on the risk continuum it belongs.
What are the important metrics to track?
Target level of availability
What do the users expect?
Is the service linked directly to revenue, either for Google or for a customer?
Is it a free or paid service?
If there’s a competing service, what is their level of service?
What’s the target market? Consumers or enterprises?
Consider Google Apps that drive businesses: externally they may have a 99.9% reliability target because downtime really impacts the end business's ability to do critical business processes, while internally they may target a higher reliability to ensure the enterprises are getting the best level of customer service.
When Google purchased YouTube, their reliability was lower because Google was more focused on introducing features for the consumer.
Types of failures
Know the shapes of errors.
Which is worse, a constant trickle of errors throughout the day or a full site outage for a short amount of time?
Example they provided:
Intermittent avatars not loading so it’d show a missing icon on a page, vs
Potential issue where private user information may be leaked.
Avoiding a large trust impact is worth a short period of full outage to fix the problem, rather than risk leaking sensitive information.
Another example they used was for ads:
Because most users used the ads system during working hours, they deemed it ok to have service periods (planned downtime) in off hours.
Cost
Very high on the deciding factors for how reliable to make a service.
Questions to help determine cost vs reliability:
If we built in one more 9 of reliability, how much more revenue would it bring in?
Does the additional revenue offset the cost of that reliability goal?
Other service metrics
Knowing which metrics are important and which ones aren’t, allow you to make better informed decisions.
Search’s primary metric was speed to results, i.e. lowest latency possible.
AdSense’s primary metric was making sure it didn’t slow down a page load it appeared on rather than the latency at which it appears.
Because of the looser goal on appearance latency, they could reduce their costs by reducing the number of regions AdSense is served by.
Identifying the Risk Tolerance of Infrastructure Services
Infrastructure services typically have different requirements than consumer services because they serve multiple clients.
Target level of availability
One approach of reliability may not be suitable for all needs.
Bigtable example:
Real time querying for online applications means it has a high availability/reliability requirement.
Offline analytical processing, however, has a lower availability requirement.
Using an always highly available reliability target for both use cases would be hyper expensive due to the amount of compute that would be required.
Types of failures
Real-time querying wants request queues to almost always be empty so it can service requests ASAP.
Offline analytical processing cares more about throughput, so it never wants the queues to be empty, i.e. always be processing.
Success and failure for both use cases are opposites in this scenario. It's the same underlying infrastructure systems serving different use cases.
Cost
Can partition the services into different clusters based on needs.
Low latency/high availability Bigtable cluster is a high level of service and more costly.
A throughput cluster can be built with less redundancy and less headroom, meaning it's constantly processing, which makes it much more cost effective.
Exposing those cost savings to the end customer helps customers choose the right availability model for their real needs.
This is all done via delineated service levels.
Much of this can all be done via configurations of the various services, i.e. redundancy, amount of compute resources, etc.
… Google SRE’s unofficial motto is “Hope is not a strategy”.
Site Reliability Engineering: How Google Runs Production Systems
Motivation for Error Budgets
Tensions form between feature development teams and SRE teams.
Software fault tolerance: How fault tolerant should the software be? How does it handle unexpected events?
Testing: Too little and it’s a bad end-user experience, too much and you never ship.
Push frequency: Code updates are risky. Should you reduce pushes or work on reducing the risks?
Canary duration and size: Test deploys on a subset of a usual workload. How long do you wait on canary testing and how big do you make the canary?
Forming Your Error Budget
Both teams should define a quarterly error budget based on the service’s SLO (service level objectives).
This determines how unreliable a service can be within a quarter.
This removes the politics between the SREs and product development teams.
Product management sets the SLO of the required uptime for the quarter.
Actual uptime is measured by an uninvolved third party, in Google’s case, “their monitoring system”.
The difference between the allowed downtime (derived from the SLO) and the actual measured downtime is the remaining budget (see the sketch after this list).
As long as there is budget remaining, new releases and pushes are allowed.
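Here's a minimal sketch of that budget math (the SLO and the measured downtime below are made-up inputs):

# Sketch of a quarterly error budget check.
slo = 0.999                                  # product management's availability target
minutes_in_quarter = 90 * 24 * 60

budget_min = (1 - slo) * minutes_in_quarter  # total allowed downtime for the quarter
actual_downtime_min = 70                     # as measured by the uninvolved monitoring system

remaining_min = budget_min - actual_downtime_min
print(round(budget_min, 1), round(remaining_min, 1))  # 129.6 allowed, 59.6 remaining

# As long as budget remains, releases keep flowing; otherwise they pause.
print("releases allowed" if remaining_min > 0 else "freeze releases and harden instead")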
Benefits
This approach provides a good balance for both teams to succeed.
If the budget is nearly empty, the product developers will spend more time testing, hardening, or slowing release velocity.
This sort of has the effect of having a product development team become self-policing.
What about some uncontrollable event, such as hardware failures, etc.?
Everyone shares the same SLO objectives, so the number of releases will be reduced for the remainder of the quarter.
This also helps bring to light some of the overly aggressive reliability targets that can slow new features from being released. This may lead to renegotiating the SLO to allow for more feature releases.
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
There are a couple convenient flags for git checkout. Next time you are switching branches, try the --track or -t flag. It makes sure that your branch has your checkout.defaultRemote upstream set (typically “origin”), making for easier pulling and pushing. (git-scm.com)
git checkout -b <branchname> -t
There is a -vv flag you can pass to git branch to list all the branches you have locally, including the remote info if they are tracked so you can find any branches that don’t have the upstream set. (git-scm.com)
git branch -vv
You can configure git to always set up new branches so that git pull will automatically merge from the starting point branch (assuming you are tracking an upstream branch, see previous 2 tips.) (git-scm.com)
git config --global branch.autoSetupMerge always
From Michael Warren in the comments from last episode, Caffeine is an updated take on the caching code found in the Java Guava library from Google (GitHub)
Great tips from @msuriar!
Great talk from Tanya Reilly about “glue work”, some of the most important work can be hard to see and appreciate. How do we make this better? Technical leadership and glue work – Tanya Reilly | #LeadDevNewYork (YouTube)
Google has a free book available on Incident Response! Great advice on handling and preventing incidents. Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
Minikube!
Minikube is a great way to run Kubernetes clusters locally. It’s cross platform and has a lot of nice features while also still being relatively simple to use and user-friendly. (minikube.sigs.k8s.io)
Minikube has addons that you can install that add additional capabilities, like a metrics server you can use to see what resources are being used, and by what!
minikube addons enable metrics-server
You can also run a “top” style command to see utilization once you have enabled the metrics. (linuxhint.com)
kubectl top pods
There’s also a dashboard that’s available that you can use to deploy, troubleshoot, manage resources, and make changes. (minikube.sigs.k8s.io)
It’s finally time to learn what Site Reliability Engineering is all about, while Jer can’t speak nor type, Merkle got one (!!!), and Mr. Wunderwood is wrong.
Thanks for the review “Amazon Customer”! (You, er, we know who you are.)
Site Reliability Engineering
Site Reliability Engineering: How Google Runs Production Systems is a collection of essays, from Google's perspective, released in 2016 … and it's free. (sre.google)
There’s a free workbook to go along with it too. (sre.google)
These essays are what one company did, that company being Google.
The book is told from the perspective of people within the company.
It is about scaling a business process, rather than just the machinery.
Site Reliability Engineering: How Google Runs Production Systems
Their tale should be used for emulating, not copying.
40-90% of the total effort on a system comes after you have deployed it.
The notion that once your software is “stable”, the easy part starts is just plain wrong.
Yeah, but what is a Site Reliability Engineering role?
It’s engineers who apply the principles of computer science and engineering to the design and development of computing systems, usually large distributed ones.
It includes writing software for those systems.
Including building all the additional pieces those systems need, i.e. backups, load balancers, etc.
Reliability … the most fundamental feature of any product?
Software doesn’t matter much if it can’t be used.
Software only needs to be reliable "enough".
Once you’ve accomplished this, you spend time building more features or new products.
SRE’s also focus on operating services on top of the distributed computing systems. Examples include:
Storage,
Email, and
Search.
Reliability is regarded as the primary focus of the SRE.
The book was largely written to help the community as a whole by exposing what Google did to solve the post deploy problems as well as to help define what they believe the role and function is for an SRE.
They also call out in the book that they hope the information in the book will work for small to large businesses. Even though they know small businesses don’t have the budget and manpower of larger businesses, the concepts here should help any software development shop.
However, we acknowledge that smaller organizations may be wondering how they can best use the experience represented here: much like security, the earlier you care about reliability, the better.
Site Reliability Engineering: How Google Runs Production Systems
It’s less costly to implement the beginnings of lightweight reliability support early in the software process rather than introduce something later that’s not present at all or has no foundation.
Who was the first SRE? Maybe Margaret Hamilton? (Wikipedia)
The SRE way:
Thoroughness,
Dedication,
Belief in the value of preparation and documentation, and
Awareness of what could go wrong, and the strong desire to prevent it.
Hope is not a strategy.
Site Reliability Engineering: How Google Runs Production Systems
Chapter 1 – Introduction
Consider the sysadmin approach to system management:
The sysadmins run services and respond to events and updates as they happen.
Teams typically grow as the capacity is needed.
Usually the skills for a product developer and a sysadmin are different, therefore they end up on different teams, i.e. a development team and an operations team (i.e. the sysadmins).
This approach is easy to implement.
Disadvantages of the sysadmin approach:
Direct costs that are not subtle and are easy to see.
As the size and complexity of the services managed by the operations team grows, so does the operations team.
Doesn’t scale well because manual intervention with regards to change management and process updates requires more manpower.
Indirect costs that are subtle and often more costly than the direct costs.
Both teams speak about things with different vocabularies (i.e. no ubiquitous language from back in the DDD days).
Each team has different assumptions about risk and possibilities for technical solutions.
Each team has different assumptions about target level of product stability.
Due to these differences, these teams usually end up in conflict.
How quickly should software be released to production?
Developers want their features out as soon as possible for their customers.
Operations teams want to make sure the software won’t break and be a pain to manage in production.
A developer always wants their software released as fast as possible.
An ops person would want to minimize the amount of changes to ensure the system is as stable as possible.
This results in trench warfare between the two groups!
Operations introduces launch and change gates, such as tests for every problem that's ever happened.
Development teams introduce fewer changes and more feature flags, such as sharding the features so they're not beholden to the launch review.
What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team.
Site Reliability Engineering: How Google Runs Production Systems
Google’s Approach to this Problem?
Focus on hiring software engineers to run their products (not sysadmins).
Create systems to accomplish the work that would have historically been done by sysadmins.
SRE can be broken down into two main categories:
50-60% are Google software engineers, that is people who were hired via the standard hiring procedure.
40-50% are candidates who were very close to the Google software engineer qualifications but didn’t quite make the original cut.
Additionally, they had skills that would be very valuable for SRE’s but not as common in typical software engineers, like Unix system internals and networking knowledge.
SREs believe in building software to solve complex technical problems.
Google has tracked the progress career-wise of the two groups and have found very little difference in their performance over time.
Software engineers get bored by nature doing repetitive work and are mentally geared towards automating problems with software solutions.
SRE teams must be focused on engineering.
Traditional ops groups scale linearly by service size, hiring more people to do the same tasks over and over.
For this reason, Google puts a 50% utilization cap on SREs doing traditional ops work.
This ensures the SRE team has time to automate and stabilize the software through means of automation.
Over time, as the SRE team has automated most of the tasks, their operations workload should be reduced to minimal amounts as the software runs and heals itself.
The goal is that the other 50% of the SRE’s time is on development.
The only way to maintain those rates is to measure them.
Google has found that SRE teams are cheaper than traditional ops teams with fewer employees because they know the systems well and prevent problems.
… we want systems that are automatic, not just automated.
Site Reliability Engineering: How Google Runs Production Systems
Challenges
Hiring is hard and the SRE role competes with product teams.
Pager duty!
Requires developer skills as well as system engineering.
This is a new discipline.
Requires strong management support to protect the budgets, such as stopping releases, respecting the 50% rule, etc.
One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
Site Reliability Engineering: How Google Runs Production Systems
Tenets of SRE
Availability
Latency
Performance
Efficiency
Change Management
Monitoring
Emergency Response
Capacity Planning
Durable Focus on Engineering
In order to keep time for project work, SREs should receive a maximum of 2 events per 8-12 hour on-call shift.
This low volume allows the engineer to spend adequate time for accuracy, cleanup, and postmortem.
More events than that and you have a problem to solve or more SREs to hire; fewer and you may have too many SREs.
Postmortems should be written for all significant incidents, whether paged or not.
Non-paged work might be even more important since it can point to a hole in the monitoring.
Cultivate a blame-free postmortem culture.
Max Change Velocity
An error budget is an interesting way to balance innovation and reliability.
Too many problems and you need to slow down and focus more on reliability, not enough problems and you’re probably gold plating.
Ever have a manager push back on tech-debt? Maybe they aren’t aware of this balance. What can you do to quantify it?
100% uptime is generally considered not to be worth it, as it gets more expensive the closer you get to the mark, and your customers generally don’t have 100% uptime themselves, so it’s wasteful (see the quick worked example after this list).
What is the right reliability number though? That’s a business decision.
What downtime percentage will the users allow, based on their usage of the product?
How critical is your service? Is there a workaround?
How well does the experience degrade?
What could a team do if there’s not any more room in the budget?
What if there’s too much?
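As a rough worked example (our arithmetic, not a figure from the book): a 99.9% availability target leaves an error budget of 0.1%, which is about 43 minutes of downtime per 30-day month (30 × 24 × 60 × 0.001 ≈ 43.2 minutes), while 99.99% leaves only about 4.3 minutes. Each additional nine shrinks the budget by a factor of ten, which is part of why chasing 100% gets so expensive.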
Monitoring
Monitoring is how to track the system’s health and availability.
Classic approach was to have an alert get sent when some event or threshold is crossed.
This is flawed though, because anything that requires human intervention is, by its very definition, not automated and introduces latency.
Software should be interpreting and people should only be involved when the software can’t do what it needs to do.
Three types of valid monitoring:
Alerts – a person needs to take immediate action.
Tickets – a person needs to take action but not immediately. The event cannot automatically be handled but can wait a few days to be resolved.
Logging – nobody needs to do anything. The logs should only be viewed if something prompts them to do so.
Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).
Site Reliability Engineering: How Google Runs Production Systems
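A common way to express that relationship (our note, not a quote from the book) is steady-state availability ≈ MTTF / (MTTF + MTTR). Halving the time to repair improves availability by exactly as much as doubling the time between failures, which is why the next section cares so much about MTTR.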
Emergency Response
The best metric for determining effectiveness of an emergency response is the MTTR, i.e. how quickly things got back into a healthy state.
People add latency. Even if there are more failures, a system that can avoid emergencies that require people to do something will still have higher availability.
Thinking through problems before they happen and creating a playbook resulted in 3x improvement in MTTR as opposed to “winging it”.
On-call SREs always have on-call playbooks, while also doing exercises they dub the Wheel of Misfortune to prepare for on-call events.
Change Management
70% of outages are due to changes in a live system.
Best practices:
Progressive rollouts,
Quickly and accurately detecting problems, and
Ability to rollback safely when something goes wrong.
By removing people from the loop, the practices above help improve both release velocity and safety.
Demand Forecasting and Capacity Planning
Forecasting helps you ensure service availability and keep costs in check and understood.
Be sure to account for both organic growth, i.e. normal usage, and inorganic growth, such as launches, marketing, etc.
Three mandatory steps:
Accurate organic forecast, extending beyond the lead time for adding capacity,
Accurate incorporation of inorganic demand sources, and
Regular load testing.
Provisioning
The faster provisioning is, the later you can do it.
The later you can do it, the less expensive it is.
Not all scaling is created equally. Adding a new instance may be cheap but repartitioning can be very risky and time consuming.
Efficiency and Performance
Since SRE are in charge of provisioning and usage, they are close to the costs.
It’s important to maximize resources, which fundamentally affect the success of the project.
Systems get slower as load is added, and slowness can also be viewed as a loss of capacity.
There is a balance between cost and speed. SREs are responsible for defining and maintaining SLOs.
Resources we Like
Links to Google’s free books on Site Reliability Engineering (sre.google)
Why is SRE Becoming 2021’s Hottest Hire? (GlobalDots.com)
We’re living through the tail end, maybe?, of the Great Resignation, so we dig into how that might impact software engineering careers while Allen is very somber, Joe’s years are … different, and Michael pronounces each hump.
Mergify – Save time by automating your pull requests and securing the code merge using a merge queue.
Survey Says
Reviews
Thanks for the review Chuck Rugged (or is it Rugged?).
What is “the great resignation”?
The Great Resignation is an ongoing economic trend in which a lot of people started quitting their jobs in 2021; the quit rate peaked around 3% (up roughly 50% from the pre-COVID average).
Primarily, but not exclusively, in the US; it also trended in Europe, China, India, and Australia.
Some interesting factors:
High worker demand and labor shortages.
High unemployment.
Employees between 30 and 45 years old have had the greatest increase in resignation rates, with an average increase of more than 20% between 2020 and 2021.
Resignation rates actually dropped for people in their 20s.
Tech and healthcare led the trend, with resignations up 4.5% in tech and 3.6% in healthcare.
Reasons cited included stagnant wages and working conditions.
Why is this a big deal?
Hiring is expensive! Think of things like referral fees, the recruiter’s percentage, onboarding, the time it takes for people to become productive, etc.
What does this mean for working conditions? More remote, better compensation, more flexibility, etc.?
Senior engineers are senior developers who may specialize in a specific area, oversee projects, and manage junior developers.
Principal Engineer is a highly experienced engineer who oversees a variety of projects from start to finish.
Staff engineer is a senior, individual contributor role in a software engineering organization. There is no “one” kind of staff engineer and many fall into one of four archetypes: Tech Lead, Architect, Solver, and Right Hand. (staffeng.com)
Is there a hiring level cap? What does that mean?
What can you lose?
The people,
The grass isn’t always greener,
Seniority (don’t be the “At X we …” person).
Did you know you can expand or collapse all the files in a pull request on GitHub? Press Alt + Click on any file chevron in the pull request to collapse or expand them all! (github.blog)
Thanks to Dave Follett for sharing How to securely erase your hard drive or SSD! (pcworld.com)
Thanks to Fuzzy Muffin for sharing Nvchad, a nice face for Neovim (Nvim) that adds some nice features, like directory access and tabs. (nvchad.github.io)
Use git-sizer to get various statistics about your repository. (GitHub)
How to find/identify large commits in git history? (Stack Overflow)
Forget about BFG and filter-branch; git filter-repo is the way to remove large files from your Git repo. (GitHub)
Use --shallow-exclude to exclude commits found in the supplied ref in either (or both) your git clone (git-scm.com) or git fetch operations. (git-scm.com)
Limit your git push operation “up to” a commit by using the format git push <remote name> <commit ID>:refs/heads/<branch name>. (If the <branch name> already exists on the <remote name>, you can leave off the refs/heads/ portion.) (git-scm.com)
We dive into what it takes to adhere to minimum viable continuous delivery while Michael isn’t going to quit his day job, Allen catches the earworm, and Joe is experiencing full-on Stockholm syndrome.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Survey Says
Sidebar
Revisiting unit testing private methods in 2022, what would you do?
Minimum Viable Continuous Delivery
CD is the engineering discipline of delivering all changes in a standard way safely.
minimumcd.org
The belief is that you must at least put a certain core set of pieces in play to reap the benefits of Continuous Delivery.
The outcome that they’re looking for is the improved speed, quality, and safety of the deployment pipeline.
Minimum requirements:
Use continuous integration, continuously integrating work into the trunk of your version control and ensuring, as much as possible, that the product is releasable.
The application pipeline is the ONLY way to deploy to an environment.
The pipeline decides if the work is releasable.
The artifacts created by the pipeline meet the organization’s requirement for being deployable.
The artifacts are considered immutable; nobody may change them after the pipeline creates them.
All feature work stops if the pipeline status is red.
Must have a production-like test environment.
Must have rollback on demand capability.
Application configuration is deployed with the artifacts.
If the pipeline says everything looks good, that should be enough – it forces the focus on what ‘releasable’ means.
Dave Farley
Continuous Integration
Use trunk based development.
Integrated daily at a minimum.
Automated testing before merging work to the trunk.
Work is tested with other work automatically during a merge.
All feature work stops when the build is red.
New work does not break already delivered work.
Trunk-Based Development
What is trunk based development?
Developers collaborate on a single branch, usually named trunk, main, or something similar.
You must resist any pressure to create other long-lived development branches.
The argument is that the simplicity of this structure is more than worth anything you might gain by any other structure.
For small teams this is easy: each committer commits straight to trunk after a build/test gate.
For larger teams, you use short-lived feature branches that might live for a couple days max and end with a PR review and build/test gate.
What does this buy us?
The codebase is always releasable on demand.
Google, Facebook, and the authors of Continuous Delivery and The DevOps Handbook all advocate for it.
But how do we …
Big feature? Feature flag it off (see the sketch after this list).
Hot fix? Fix forward.
But …
What if you need multiple CONSECUTIVE releases? i.e. think of the Kubernetes release cycle.
What if you need multiple CONCURRENT releases? i.e. think of Microsoft support for multiple versions of Windows.
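To make the “feature flag it off” item above concrete, here’s a minimal sketch in Java. The flag name and the map-based flag source are made up for illustration; in practice the flag would come from whatever feature-flag service or configuration system you already use.

```java
import java.util.Map;

public class CheckoutService {

    // Hypothetical flag source; swap in your real feature-flag provider.
    private final Map<String, Boolean> flags;

    public CheckoutService(Map<String, Boolean> flags) {
        this.flags = flags;
    }

    public String checkout(String cartId) {
        // The half-finished feature merges to trunk behind the flag and
        // stays dark in production until someone flips it on.
        if (flags.getOrDefault("new-checkout-flow", false)) {
            return newCheckoutFlow(cartId);
        }
        return legacyCheckoutFlow(cartId);
    }

    private String newCheckoutFlow(String cartId) {
        return "order created via new flow for " + cartId;
    }

    private String legacyCheckoutFlow(String cartId) {
        return "order created via legacy flow for " + cartId;
    }
}
```

The point is that trunk stays releasable: the incomplete work lives in the codebase, but the flag decides whether users ever see it.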
Our discussions of The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Coding Blocks)
Tip of the Week
Did you know you cannot set environment variables for the current process in Java? (See the sketch at the end of these tips.)
Terms & Conditions Apply is a game where you have to avoid giving up all your juicy data to Evil Corp by carefully avoiding accepting the terms and conditions. Good luck. Thanks Lars Onselius! (TermsAndConditions.game)
Test Containers is a Java library that gives you a way to interact with and automate containers for testing purposes. Thanks Rionmonster! (TestContainers.org)
Maybe it’s time for JSON to die? YAML is finicky, but it’s easier to read and it allows comments.
YamlDotNet is a library that makes this easy in C#. (YamlDotNet.org)
PowerToys are a collection of utilities from Microsoft that extend Windows with some really powerful and easy to use features. Thanks Scott Harden! (Microsoft)
Did you know you can now include diagrams inside your markdown files on Github? Mermaid is the name and you can create the diagrams directly in your files and keep it versioned along with your code. Thanks Murali and Scott Harden! (github.blog)
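Circling back to the environment variable tip above, here’s a minimal sketch of what Java does and doesn’t let you do: System.getenv() is read-only for the current process, but ProcessBuilder lets you set variables for child processes (the printenv call assumes a Unix-like OS).

```java
public class EnvDemo {
    public static void main(String[] args) throws Exception {
        // Reading the current environment is fine ...
        System.out.println("PATH = " + System.getenv("PATH"));

        // ... but the map is unmodifiable, so writing to it throws.
        try {
            System.getenv().put("MY_FLAG", "on");
        } catch (UnsupportedOperationException e) {
            System.out.println("Can't modify the current process environment");
        }

        // You CAN set environment variables for a child process, though.
        ProcessBuilder pb = new ProcessBuilder("printenv", "MY_FLAG");
        pb.environment().put("MY_FLAG", "on");
        pb.inheritIO();
        pb.start().waitFor();
    }
}
```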
We have a retrospective about our recent Game Ja Ja Ja Jam, while Michael doesn’t know his A from his CNAME, Allen could be a nun, and Joe still wants to be a game developer.
Followers is a clever 2D platformer puzzle game where you play as 2 different characters in 2 different levels at the same time. You hit the jump button, they both jump. You move left or right, they both attempt to move left or right. The task is to safely collect all the fruit in the level, but “safely” is the keyword here because you have to be really thoughtful about your actions. The trick is to utilize the various obstacles in the different levels to position your characters just right to get through. Doing this is a real mind bender, but it’s really satisfying when you figure out how to get through. The art and music fit perfectly and it’s such a cool riff on the theme since both characters literally follow along with whatever you tell them…which usually leads to their demise!
Knock ‘Em is a 3D arcade style action game where you play as a bowling ball tasked with knocking down bowling pins…that sounds pretty normal, but once you start the game you realize that you are in for a treat. You have full reign of the bowling alley, and the characters in this game are all bowling pins. That’s right, you are a bowling ball blasting around in an alley full of bowling pins that are…bowling! You get a point for every pin and are chased by alley staff pins and police pins that will ultimately wear down your health and end the game. The action is frantic, fun, and funny, with a lot of attention to detail.
Michael M. – Programming Lead, Game Design, Level Design, UI Design
Cheyenne M. – Art, Programming, Game Design, Sound Design
Alex M. – Game Testing, Special Thanks
Just Down the Hall is a spooky 2D action platformer. In the game a deadly shadow is following behind you, jumping when you jump. If you clear an obstacle the shadow will plow into it, slowing it down. If you hit an obstacle, you slow down and the shadow catches up. The game has a really cool split screen feature where you can watch the shadow following you and smacking into obstacles. This is really cool to watch and it leads to some really tense and rewarding moments as it gets closer and closer to you. The art is beautiful as well so it actually feels good when it inevitably ends.
Live & Evil is a 2D puzzle platformer game about a robot named “Live” and their shadow “Evil”…which is also the word “live” spelled backwards, and the logo reflects that in a cool way. This is important to note because in this game you swap between the two characters by hitting the Q button. Live walks on top of the platforms, Evil on the bottom. Items like collectables, switches or platforms may only be visible or usable by one of the characters, so you’ll have to use both of them to win. This plays out in increasingly interesting ways as the game progresses…and again, you’ll have to see it to believe it.
Light of the World is a 2D puzzle platformer with dark, atmospheric, spooky levels in which you are charged with rushing between beautiful beacons of light before your enemy can attack. It has a light but powerful narrative with a really cool bouncing shield mechanic that you can throw and retrieve to solve some light, but clever, puzzles. It’s hard to really talk about this game because every aspect of it is done so well that your jaw is on the floor the whole time. Even the YouTube video trailer for the game was expertly done and you can tell that user experience was always the top priority. This game was rated highest in all 3 categories!
“a path alone” is a thoughtful 2D box pushing puzzle game. It’s dark, and moody, and beautiful and strange. Every level features a beautiful pixel art animal that will grant you a new ability in exchange for a favor. The advice they give you is somber and there’s something about the music and the animations and the mood that gives these interactions some emotional weight, so you are feeling something as you play this game. It’s hard to pin down exactly what that feeling is, and that’s part of the fun. The amount of polish is really evident here and you can really see how much that care and attention given to user experience pays off.
What am I supposed to do is a 2D drag-and-drop puzzle/action game in which the player is charged with helping a character escape from a terrifying and really cool looking “windy monstrosity”. You don’t directly control the character, but you do have items that you can drop into the stages to help the character out. Drop a sword and a character will slash, drop a cloud if there’s a long fall to help your character land softly. The music adds to the frantic pace and the game really steps it up in later levels where you get random items and have to quickly figure out how to make do with what you have available.
It Follows Me from Fussenkuh was a really fun and unique take on the theme, where you have to thumbs up or thumbs down pairs of words based on whether the first word has the letters “me” and the second word has “it”. Get it? You up vote the pairs of words where “it” follows “me”. You only have a few seconds to make each decision, so even though this is a word game, I’m probably more likely to classify it as an action game. Slapping the thumbs up and thumbs down actions is reminiscent of social media, and given the popularity of a little game called wordle, this game has an interesting contemporary tone that gives you this odd feeling that you’re playing a fun artifact from…right now.
Quirk
I became a Treasure Hunter to Pay Off My Student Debt, but now an Immortal Snail is Coming after me with a Knife (itch.io)
“I became a Treasure Hunter to Pay Off My Student Debt, but now an Immortal Snail is Coming after me with a Knife”, which also wins the award for best title. In this game you try to collect all the treasure before you are caught by a knife wielding snail. The snail is slow at first, but speeds up as you collect each treasure making for a really tense end game experience.
Ducks in Space is a beautiful 3D snake-like game where you gather little ducklings who follow you as you swim around a cool spherical planet. If you run into your duckling tail, the game is over, but you also have to avoid some hungry herons. The game looks, sounds, and plays great, and was written in native HTML and JavaScript, which makes the game even more impressive!
Tip of the Week
We just couldn’t help ourselves and we took all of our tips from Simon Barker this week!
Having a tough time trying to figure out a name for your new app? Check out this site, it’ll help you find that name and tell you what platforms are available for it! (namae.dev)
Check out the rebranded/relaunched podcast “All The Code” from Simon Barker (podcasts.apple.com)
Crontab guru makes it easy to build and understand cron schedule expressions (crontab.guru)
We wrap up our discussion of PagerDuty’s Security Training, while Joe declares this year is already a loss, Michael can’t even, and Allen says doody, err, duty.
Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Survey Says
News
Ja Ja Ja Jamuary is complete and there are 46 new games in the world. Go play! (itch.io)
Session Management
Session management is the ability to identify a user over multiple requests.
HTTP is stateless, so there needs to be a way to maintain state.
Cookies are commonly used to store information on the client to be sent back to the server on subsequent requests.
They usually contain a session token of some sort, which should be a random, unique string.
Do NOT store sensitive information in the cookie, such as usernames, passwords, etc.
Besides tampering, it can be difficult to revoke the cookies.
Session Hijacking
Session hijacking is stealing a user’s session, possibly by:
Guessing or stealing the session identifiers, or
Taking over cookies that weren’t properly locked down.
Session Fixation
Session fixation is when a bad actor creates a session that you will unknowingly take over, thus giving the bad actor access to the data in the user’s session.
This used to be more of an issue when session tokens were passed around in the URL (remember CFID and CFTOKEN?!).
Always treat cookies like any other user input, don’t implicitly trust it, because it can be manipulated on the client.
How to Secure / Verify Sessions
Add extra pieces of data to the session you can verify when requests are made.
Ensure you actually created the session.
Make sure it hasn’t expired and ensure you set expirations for sessions.
All of this just catches the easy stuff.
Session IDs should be unique and random.
Ensure the following when sending cookies to the client:
Secure flag is set,
httpOnly flag is set, and
The domain is set on the cookie so it can only be used by your application.
To avoid the session fixation we mentioned earlier, ALWAYS make sure to send a new session ID when privileges are elevated, e.g. at login (see the sketch after this list).
Always keep information stored on the server side, not on the client.
Make sure you have an expiration that is set on the server side session. This should be completely independent of the cookie because the cookie values can be manipulated.
When a user logs out or the session expires, ensure you fully destroy all session information.
NEVER TRUST USER INPUT!
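As a minimal sketch of the cookie and session-ID rules above using the Java Servlet API (the cookie name, domain, and login details are illustrative; older containers use the javax.servlet package instead of jakarta.servlet):

```java
import java.io.IOException;
import java.security.SecureRandom;
import java.util.Base64;
import jakarta.servlet.http.Cookie;
import jakarta.servlet.http.HttpServlet;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

public class LoginServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // ... authenticate the user here (omitted) ...

        // Avoid session fixation: issue a new session ID once privileges
        // are elevated, i.e. right after a successful login.
        req.changeSessionId();

        // If you manage an additional cookie yourself, lock it down.
        Cookie cookie = new Cookie("SESSIONID", randomToken());
        cookie.setSecure(true);               // only sent over HTTPS
        cookie.setHttpOnly(true);             // not readable from JavaScript
        cookie.setDomain("app.example.com");  // scoped to your application
        cookie.setPath("/");
        resp.addCookie(cookie);
    }

    private String randomToken() {
        byte[] bytes = new byte[32];
        new SecureRandom().nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }
}
```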
Permissions
Try to avoid using sudo in any shell scripts if you can.
If you can’t avoid it, use it with care.
Use the principle of least privilege, i.e. more restrictive permissions; as in, can you live with read-only perms?
Revoke permissions you don’t need.
Create separate users for separate needs.
If you need to delete files from a storage bucket, have a service account or user set up with just that permission.
Same for managing compute instances.
Use the least permissive approach you can as it greatly reduces risks.
Other Classic Vulnerabilities
Buffer overflow: This is when a piece of data is written into memory it shouldn’t be able to access.
From Wikipedia, a buffer overflow “is an anomaly where a program, while writing data to a buffer, overruns the buffer’s boundary and overwrites adjacent memory locations.”
Typically these are used to execute malicious code by putting instructions in a piece of memory that is to be executed after a previous statement completes.
One malicious use of a buffer overflow is using a NOP sled (no-operation sled) to fill up the buffer with a lot of NOPs with your malicious code at the end of the ride.
Apparently you can use this method to easily get a root shell – article linked in the resources.
Path Traversal: This is when you “break out” of the web server’s directory and are able to access, or serve up, content from elsewhere on the server.
Remember, your dependencies may also have vulnerabilities such as this. You need to run scans on your apps, code, and infrastructure.
Side Channel Attacks: This is when the attacker is using information that’s not necessarily part of a process to get information about that process. Examples include:
Timing attack: Understanding how long certain processes take can allow you to infer information about the process. For example, multiplication takes longer than addition so you might be able to determine that there’s multiplication happening.
Power analysis: This is when you can actually figure out what a processor is doing by analyzing the electrical power being consumed. An example of this process is called differential power analysis.
Acoustic cryptanalysis: This is when the attacker is analyzing sounds to find out what’s going on, such as using a microphone to listen to the sounds of typing a password.
Data remanence: This is when an attacker gets sensitive data after it was thought to have been deleted.
Did you know you can use your phone as a pro level webcam? Thanks Simon Barker! (reincubate.com)
From the tip hotline (cb.show/tips) – Mikerg sent us a great site for learning VSCode. Some are free, some require a $3 monthly subscription, but the ones Joe has done have been really good. Not just VSCode either! IntelliJ, Gmail, lots of other stuff! (keycombiner.com)
How to use Visual Studio Code as the default editor for Git MergeTool (stackoverflow.com)
Five Easy to Miss PostgreSQL Query Performance Bottlenecks (pawelurbanek.com)
We’re pretty sure we’re almost done and we’re definitely all present for the recording as we continue discussing PagerDuty’s Security Training, while Allen won’t fall for it, Joe takes the show to a dark place, and Michael knows obscure, um, stuff.
Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Survey Says
News
Thanks for the reviews!
iTunes: YouCanSayThisNickname
Game Ja Ja Ja Jam is coming up! Just a few days away! (itch.io)
XSS – Cross Site Scripting
Q: What is XSS? A: XSS is injecting snippets of code onto webpages that will be viewed by others.
This can allow the attacker to basically have access to everything a user does or types on a page.
Consider something like a comment on a forum, or blog that allows one to save malicious code.
The attacker could potentially access cookies and session information,
As well as gain access to keyboard entry on the page.
You can sanitize the inputs, but that’s not good enough.
You can’t check for everything in the world.
You really need to be encoding the stored information before you present it back to any users.
This allows things to be displayed as they were entered, but not executed by the browser.
Different languages, frameworks, libraries, etc., have their own ways of encoding information before it’s rendered by the browser. Get familiar with your library’s specific ways.
User supplied data should ALWAYS be encoded before being rendered by the browser. ALWAYS.
This goes for HTML, JS, CSS, etc.
Use a library for encoding, because the chances are it’s been vetted (see the sketch after this list).
Just like we mentioned before, you still have to be diligent about using 3rd party libraries. Using a 3rd party library doesn’t mean you can wash your hands of it.
Content Security Policy (CSP) is another way to handle this. (Wikipedia)
OWASP considers XSS a type of Injection attack in 2021.
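Here’s a minimal sketch of output encoding using the OWASP Java Encoder library (the markup and class names are ours; many template engines already encode by default, so check what your framework does first):

```java
import org.owasp.encoder.Encode;

public class CommentRenderer {

    // Render untrusted, user-supplied text into an HTML page.
    public String renderComment(String untrustedComment) {
        // Encoding for the HTML context means something like
        // <script>alert(1)</script> is displayed as text, not executed.
        return "<p class=\"comment\">" + Encode.forHtml(untrustedComment) + "</p>";
    }
}
```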
CSRF – Cross Site Request Forgery
Q: What is CSRF? A: CSRF is tricking someone into doing something they didn’t want to do, or didn’t know they were doing.
A couple of examples were given:
For example, set the img src to the logout URL for the site so that when someone visits the page, they’re automatically logged out.
Just imagine if the image source pointed to something a little more nefarious.
Another example is a button that tricked you into performing an action such as an account deletion on another site. Can be done using a form post and a simple button click.
How do you avoid this?
Synchronizer token:
This is a hidden field on every user-submittable form on a site that has a value that’s private to the user’s session.
These tokens should be cryptographically strong random values so they can never be guessed or reverse engineered.
These tokens should never be shared with anyone else.
When the form is submitted, the token is validated against the user’s session token; if it matches, go ahead with the action, otherwise abort (see the sketch after this list).
Again, there are a number of frameworks and libraries out there that have anti-forgery built in. Check with your specific documentation.
They go on to say that anything that is not a READ operation should have CSRF tokens.
NEVER use GET requests for state changing operations!
PagerDuty had a funny mention about an administrative site that included links to delete rows from the database using GET requests. However, as the browser pre-fetched the links, it deleted the database.
OWASP dropped CSRF from the Top 10 in 2017 because the statistical data didn’t rank it highly enough to make the list.
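To make the synchronizer token idea concrete, here’s a rough sketch in servlet-style Java (the attribute name and helper class are ours; frameworks like Spring Security ship this behavior out of the box):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;
import jakarta.servlet.http.HttpSession;

public final class CsrfTokens {

    // Generate a cryptographically strong token and stash it in the session;
    // embed the returned value in a hidden form field.
    public static String issueToken(HttpSession session) {
        byte[] bytes = new byte[32];
        new SecureRandom().nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        session.setAttribute("csrfToken", token);
        return token;
    }

    // On any state-changing request, compare the submitted token to the one
    // stored in the session using a constant-time comparison.
    public static boolean isValid(HttpSession session, String submittedToken) {
        String expected = (String) session.getAttribute("csrfToken");
        if (expected == null || submittedToken == null) {
            return false;
        }
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                submittedToken.getBytes(StandardCharsets.UTF_8));
    }
}
```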
Click-jacking
Q: What is click-jacking? A: Click-jacking is when you are fooled into clicking on something you didn’t intend to.
For example, rendering a page over the top of an iframe, and anything that was clicked on that top page (that seemed innocent) would actually make the click happen on the iframe‘d page, like clicking a Buy it Now button.
Another example is moving a window as soon as you click causing you to click on something you didn’t intend to click.
The best way to prevent click-jacking is to lock down what an iframe can load using the HTTP header X-FRAME-OPTIONS, set to either SAMEORIGIN or DENY. (developer.mozilla.org)
Account Enumeration
Q: What is account enumeration? A: Account enumeration is when an attacker attempts to extract users or information from a website.
Failed logins that take longer for one user than another may indicate that the one that took longer was a real user, maybe because it takes longer as it tries to hash the password.
A similar type of thing could happen if customers are subdomained: one subdomain resolves properly and another fails, which reveals information about the customers.
These may be frustrating, as they pointed out, as you have to walk the line between user experience and security.
Just be aware of what type of data you might be exposing with these types of operations.
Regarding logins:
Whether the user exists or not, run the same hashing algorithm so you don’t give away which accounts are real (see the sketch after this list).
If a user does a password reset, don’t give a message indicating whether the account really existed or not. Keep the flow and messaging the same.
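A sketch of the “hash either way” idea from the login bullets above (the in-memory user store and hash/verify helpers are placeholders; use a real password hashing library such as bcrypt or Argon2):

```java
import java.util.Map;
import java.util.Optional;

public class LoginService {

    // Placeholder user store: username -> stored password hash.
    private final Map<String, String> users;

    public LoginService(Map<String, String> users) {
        this.users = users;
    }

    public boolean authenticate(String username, String password) {
        // A hash that can never match a real password, so the same amount of
        // hashing work happens whether or not the account exists.
        String storedHash = Optional.ofNullable(users.get(username))
                .orElse(hash("dummy-password-that-never-matches"));

        // Same code path, roughly the same timing, and the caller should show
        // the same generic error message either way.
        return verify(password, storedHash) && users.containsKey(username);
    }

    // Stand-ins for your real password hashing library.
    private String hash(String password) {
        return Integer.toHexString(password.hashCode());
    }

    private boolean verify(String password, String storedHash) {
        return hash(password).equals(storedHash);
    }
}
```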
CloudFlare lets you deploy JAMStack websites for free using their edge network. (pages.cloudflare.com)
Amazon has their own open-source game engine, Open 3D Engine, aka O3DE. It’s the successor to Lumberyard: a AAA-capable, cross-platform, open source 3D engine licensed under Apache 2.0. (aws.amazon.com, o3de.org)
Let’s talk about CSS! Ever use border to try and figure out layout issues? Why not use outline instead? Thanks Andrew Diamond! (W3Schools.com)
We discussed a similar technique as a TotW for episode 81.
Have you seen those weird mobile game ads? Click this link, maybe when you’re not at work, and embrace the weird world of mobile game ads. (Reddit)
Nostalgia for the 80’s? People have uploaded some of the tapes that used to play on the loudspeakers at the US department store K-Mart. (Nerdist.com)
We continue our discussion of PagerDuty’s Security Training presentation while Michael buys a vowel, Joe has some buffer, and Allen hits everything he doesn’t aim for.
Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.
Survey Says
News
Thanks for the reviews!
iTunes: aodiogo
Game Ja-Ja-Ja-Jamuary is coming up, sign up is open now! (itch.io)
Encryption
OWASP has the more generic “Cryptographic Failures” at #2, up from #3 in 2017.
PagerDuty defines encryption as encoding information in such a way that only authorized readers can access it.
Note that this is an informal definition that speaks to the most common use of the word.
Encryption is really, really difficult to get right. There are people that spend their whole lives thinking about encryption, and breaking encryption. You may think you’re a genius by coming up with a non-standard implementation, but unfortunately the attackers are really sophisticated and this strategy has shown to fail over and over.
There are different types of encryption:
Symmetric/Asymmetric – refers to whether the keys for reading and writing the encrypted data are the same (see the symmetric sketch after this list).
Block Cipher – Lets you encrypt and decrypt the data in whole chunks. You need to have an entire block to encrypt or decrypt the whole block at once.
Public/Private Key – A kind of asymmetric encryption intended for situations where you want groups to be able to share one of the keys. For example, you can publish a public PGP key and then people can use that to send you a message. You keep the private key private, so you’re the only entity that can read the message.
Stream Cipher – Encode “on the fly”, think about HTTPS, great for streaming. You can start reading before you have the entire message. Great for situations where performance is important, or you might miss data.
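As a small example of the symmetric case, here’s a sketch using the JDK’s built-in javax.crypto support for AES-GCM (key management, which is the genuinely hard part, is glossed over here; in a real system the key would come from a KMS or secret manager rather than being generated inline):

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AesGcmExample {
    public static void main(String[] args) throws Exception {
        // One key is used for both encryption and decryption (symmetric).
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        // A fresh, random IV/nonce for every message.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("customer data".getBytes(StandardCharsets.UTF_8));

        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        String plaintext = new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8);
        System.out.println(plaintext); // prints: customer data
    }
}
```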
Encryption in Transit
Also known by other names such as data in motion.
Designed to protect against entities that can snoop (or manipulate!) our communications.
You can do this with HTTPS, TLS, IPsec.
Perfect Forward Secrecy is the key to protecting past communications, by generating a new key for a single session so that compromised keys only affect the specific session they were used for.
From Wikipedia “In cryptography, forward secrecy (FS), also known as perfect forward secrecy (PFS), is a feature of specific key agreement protocols that gives assurances that session keys will not be compromised even if long-term secrets used in the session key exchange are compromised.” (Wikipedia)
Encryption at Rest
Simply means that data is encrypted where it’s stored.
An example of this is full disk encryption on laptops and desktops. The entire drive is encrypted so if someone were to steal the drive, it’d essentially be useless without the keys to decrypt the data on the drive.
For PagerDuty, and many other companies, the most important information to protect is customer data, just as important as your own passwords.
PagerDuty’s data classifications:
General data – This is anything available to the public.
Business data – Includes operating data for the business, such as payroll, employee info, etc. This type of data is expected to be encrypted in transit and at rest.
Customer data – This is data provided to the company by the customer and is expected to be encrypted in transit and at rest.
Customer data includes controls such as authentication, access control, storage, auditing, encryption, and destruction.
Business data has similar controls except without the auditing.
PagerDuty called out that when you’re using cloud systems, make sure you’re enabling the encryption on the various services, like S3, GCS, Blob storage, etc.
They mentioned it’s just a checkbox, but in reality you’re probably using scripts, templates, etc. So make sure you know the configurations to include to enable encryption.
Another interesting thing they do at PagerDuty: they get alerted when a resource is created without encryption enabled.
What about third parties you use? Should they encrypt as well? YES!!!
Perform vendor risk assessments prior to using the vendor. If they don’t pass the security assessment, use a different vendor.
Secret Management
Q. What is it? A. Protecting and auditing access to secrets.
Auditing lets you see when someone who shouldn’t be using your secrets is, as well as keep track of the systems that should be (and are) using them.
Hashicorp Vault has a great video to learn about the challenges of managing secrets. (YouTube)
What are secrets?
Secrets are sensitive things such as tokens, keys, passwords, user names, many others.
Secrets should NOT be stored in source control.
Although it seems to happen all the time, be it on purpose, by accident, etc.
Anyone with access to the code can now access the secrets.
PagerDuty uses Vault. Vault:
Securely stores secrets,
Provides audit access to those secrets, and
Provides mechanisms to rotate the secrets if/when necessary.
Don’t hardcode secrets or come up with crazy ways to get them into your applications (see the sketch after this list).
Secrets should never be shared, i.e. if two people need access to a system, they should have their own secrets to access that system.
Or maybe you have a “jump” server that has access to an external system, and users have access to the jump server.
NEVER share passwords over insecure channels. This can include channels such as:
Slack,
Email,
SMS,
But this is not an exhaustive list.
If you do accidentally post a secret in a chat or an insecure channel, you should:
Let the security team know immediately (you have a security team right?!), and
Find out how to rotate the secret and do it.
Never allow a secret to be logged!
This can be especially egregious if you’re logging customer credentials you don’t control.
Be sure you are sanitizing your log data before you log.
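A small sketch of the “don’t hardcode it, don’t log it” points (the DB_PASSWORD variable name and the masking helper are ours; a secret manager like Vault is the better long-term home for the value itself):

```java
public class DatabaseConfig {

    public static void main(String[] args) {
        // Read the secret from the environment (injected by your secret
        // manager or deployment tooling) instead of hardcoding it.
        String dbPassword = System.getenv("DB_PASSWORD");
        if (dbPassword == null || dbPassword.isBlank()) {
            throw new IllegalStateException("DB_PASSWORD is not set");
        }

        // Never log the secret itself; at most, log a masked placeholder.
        System.out.println("Database password loaded: " + mask(dbPassword));
    }

    private static String mask(String secret) {
        return "****" + secret.substring(Math.max(0, secret.length() - 2));
    }
}
```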
Hashicorp Vault is a tool for managing secrets, but did you know they have a ton of plugins? Take a look! (VaultProject.io)
Unity has tools built in for common game functionality, it’s worth taking a few minutes to google for something before you start typing. Don’t worry, there is still plenty of code to write, but these tools improve the quality and consistency of your game.
You can use animation clips to create advanced character animations, but it’s also good for simple tweens and motions that need to happen once, or in a loop. No need for “Rotator.cs” type classes that you see in a lot of Unity tutorials. (docs.unity3d.com)
NavMeshes are an efficient way of handling pathfinding, which is an important piece of many games. You can learn the basics in just a few minutes and accomplish some amazing things. (docs.unity3d.com)