April 2024
S	M	T	W	T	F	S

	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Sun, 10 April 2022

Site Reliability Engineering - Embracing Risk

We learn how to embrace risk as we continue our learning about Site Reliability Engineering while Johnny Underwood talked too much, Joe shares a (scary) journey through his mind, and Michael, Reader of Names, ends the show on a dark note.

The full show notes for this episode are available at https://www.codingblocks.net/episode182.

Sponsors

Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.

Survey Says

Reviews

Thanks for the help Richard Hopkins and JR! Want to help out the show? Leave us a review!

News

Great notes in the Slack from @msuriar!
Sadly, O’Reilly is ending their partnership with ACM, so you’ll no longer get access to their Learning Platform if you’re a member. (news.ycombinator.com)

Chapter 3: Embracing Risk

Google aims for 100% reliability right? Wrong…
Increasing reliability is always better for the service, right? Not necessarily.
- It’s very expensive to add another 9 of reliability, and
- Can’t iterate on features as you spend more time and resources making the service more stable.
Users don’t typically notice the difference between very reliable and extremely reliable services.
The systems using these services usually aren’t 100% reliable, so the chances of noticing are very low.
SRE’s try to balance the risk of unavailability with innovation, new features, and efficient service operations by optimizing for the right balance of all.

Managing Risk

Unstable systems diminish user confidence. We want to avoid that.
Cost does not scale with improvements to reliability.
- As you improve reliability the cost can actually increase many times over.
Two dimensions of cost:
- Cost of redundancy in compute resources, and
- The opportunity cost of trading features for reliability focused time.
SREs try to balance business goals in reliability with the risk of service reliability.
- If the business goal is 99.99% reliable, then that’s exactly what the SRE will aim for, with maybe just a touch more.
  - They treat the target like a minimum and a maximum

Cover of the "Site Reliability Engineering" book from O'Reilly — The famous “SRE Book” from Google

Measuring Service Risk

Identify an objective metric for a property of the system to optimize.
- Only by doing this can you measure improvements or degradation over time.
- At Google, they focus on unplanned downtime.
Unplanned downtime is measured in relation to service availability.
- Availability = Uptime / (Uptime + Downtime).
  - A 99.99% target means a maximum of 52.56 minutes downtime in a year.
  - At Google, they don’t use uptime as the metric as their services are globally distributed and may be up in many regions while being down in another.
    - Rather, they use the successful request rate.
    - Success rate = total successful requests / total requests.
    - A 99.99% target here would mean you could have 250 failures out of 2.5M requests in a day.
NOTE: not all services are the same.
- A new user signup is likely way more important than a polling service for checking for new emails for a user.
At Google they also use this success rate for non-customer facing systems.
Google often sets quarterly availability targets and may track those targets weekly or even daily.
- Doing so allows for fixing any issues as quickly as possible.

Risk Tolerance Services

SRE’s should work directly with the business to define goals that can be engineered.
- Sometimes this can be difficult because measuring consumer services is clearly definable whereas infrastructure services may not have a direct owner.

Identifying the Risk Tolerance of Consumer Services

Often a service will have its own dedicated team and that team will best know the reliability requirements of that service.
- If there is no owning team, often times the engineers will assume the role of defining the reliability requirements.

Factors in assessing the risk tolerance of a service

What level of availability is needed?
Do different failures have different effects on the service?
Use the service cost to help identify where on the risk continuum it belongs.
What are the important metrics to track?

Target level of availability

What do the users expect?
Is the service linked directly to revenue, either for Google or for a customer?
Is it a free or paid service?
If there’s a competing service, what is their level of service?
What’s the target market? Consumers or enterprises?
- Consider Google Apps that drive businesses, externally they may have a 99.9% reliability because downtime really impacts the end businesses ability to do critical business processes. Internally they may have a higher targeted reliability to ensure the enterprises are getting the best level of customer service.
- When Google purchased YouTube, their reliability was lower because Google was more focused on introducing features for the consumer.

Types of failures

Know the shapes of errors.
- Which is worse, a constant trickle of errors throughout the day or a full site outage for a short amount of time?
- Example they provided:
  - Intermittent avatars not loading so it’d show a missing icon on a page, vs
  - Potential issue where private user information may be leaked.
  - A large trust impact is worth having a short period of full outage to fix the problem rather than have the potential of leaking sensitive information.
  - Another example they used was for ads:
    - Because most users used the ads system during working hours, they deemed it ok to have service periods (planned downtime) in off hours.

Cost

Very high on the deciding factors for how reliable to make a service.
Questions to help determine cost vs reliability:
- If we built in one more 9 of reliability, how much more revenue would it bring in?
- Does the additional revenue offset the cost of that reliability goal?

Other service metrics

Knowing which metrics are important and which ones aren’t, allow you to make better informed decisions.
Search’s primary metric was speed to results, i.e. lowest latency possible.
AdSense’s primary metric was making sure it didn’t slow down a page load it appeared on rather than the latency at which it appears.
Because of the looser goal on appearance latency, they could reduce their costs by reducing the number of regions AdSense is served by.

Identifying the Risk Tolerance of Infrastructure Services

Infrastructure services different requirements than consumer services typically because they are serving multiple clients.

Target level of availability

One approach of reliability may not be suitable for all needs.
Bigtable example:
- Real time querying for online applications means it has a high availability/reliability requirement.
- Offline analytical processing, however, has a lower availability requirement.
- Using an always highly available reliability target for both use cases would be hyper expensive due to the amount of compute that would be required.

Types of failures

Real-time querying wants request queues to almost always be empty so it can service requests ASAP.
Offline analytical processing cares more about throughput, so it never wants the queues to be empty, i.e. always be processing.
- Success and failure for both use cases are opposites in this scenario. Its the same underlying infrastructure systems serving different use-cases.

Cost

Can partition the services into different clusters based on needs.
- Low latency/high availability Bigtable cluster is a high level of service and more costly.
- Throughput cluster can be built with less redundancy and need less headroom meaning they’re constantly processing making it much more cost effective.
  - Exposing those cost savings to the end customer helps customers choose the right availability model for their real needs.
  - This is all done via delineated service levels.
  - Much of this can all be done via configurations of the various services, i.e. redundancy, amount of compute resources, etc.

… Google SRE’s unofficial motto is “Hope is not a strategy”.
Site Reliability Engineering: How Google Runs Production Systems

Motivation for Error Budgets

Tensions form between feature development teams and SRE teams.
- Software fault tolerance: How fault tolerant should the software be? How does it handle unexpected events?
- Testing: Too little and it’s a bad end-user experience, too much and you never ship.
- Push frequency: Code updates are risky. Should you reduce pushes or work on reducing the risks?
- Canary duration and size: Test deploys on a subset of a usual workload. How long do you wait on canary testing and how big do you make the canary?

*Anatomy of an Incident: Google’s Approach to Incident Management for Production Services*

Forming Your Error Budget

Both teams should define a quarterly error budget based on the service’s SLO (service level objectives).
- This determines how unreliable a service can be within a quarter.
- This removes the politics between the SREs and product development teams.
Product management sets the SLO of the required uptime for the quarter.
Actual uptime is measured by an uninvolved third party, in Google’s case, “their monitoring system”.
The difference between actual downtime and allowed downtime is the budget.
As long as there is budget remaining, new releases and pushes are allowed.

Benefits

This approach provides a good balance for both teams to succeed.
If the budget is nearly empty, the product developers will spend more time testing, hardening, or slowing release velocity.
- This sort of has the effect of having a product development team become self-policing.
What about some uncontrollable event, such as hardware failures, etc.?
- Everyone shares the same SLO objectives, so the number of releases will be reduced for the remainder of the quarter.
This also helps bring to light some of the overly aggressive reliability targets that can slow new features from being released. This may lead to renegotiating the SLO to allow for more feature releases.

Resources we Like

Links to Google’s free books on Site Reliability Engineering (sre.google)
Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
Your nines are not my nines (RachelByTheBay.com)
The Website Is Down – Sales Guy vs. Web Dude (YouTube)
- Also mentioned in episode 122.
Google services outages (Wikipedia)

Tip of the Week

Git tips!
- There are a couple convenient flags for git checkout. Next time you are switching branches, try the --track or -t flag. It makes sure that your branch has your checkout.defaultRemote upstream set (typically “origin”), making for easier pulling and pushing. (git-scm.com)
  - git checkout -b <branchname> -t
- There is a -vv flag you can pass to git branch to list all the branches you have locally, including the remote info if they are tracked so you can find any branches that don’t have the upstream set. (git-scm.com)
  - git branch -vv
- You can configure git to always set up new branches so that git pull will automatically merge from the starting point branch (assuming you are tracking an upstream branch, see previous 2 tips.) (git-scm.com)
  - git config --global branch.autoSetupMerge always
From Michael Warren on the comments from last episode, Caffeine is an updated take on the caching code founding in the Java Guava library from google (GitHub)
Great tips from @msuriar!
- Great talk from Tanya Reilly about “glue work”, some of the most important work can be hard to see and appreciate. How do we make this better? Technical leadership and glue work – Tanya Reilly | #LeadDevNewYork (YouTube)
- Google has a free book available on Incident Response! Great advice on handling and preventing incidents. Anatomy of an Incident: Google’s Approach to Incident Management for Production Services (sre.google)
Minikube!
- Minikube is a great way to run Kubernetes clusters locally. It’s cross platform and has a lot of nice features while also still being relatively simple to use and user-friendly. (minikube.sigs.k8s.io)
- Minikube has addons that you can install that add additional capabilities, like a metrics server you can use to see what resources are being used, and by what!
  - minikube addons enable metrics-server
- You can also run a “top” style command to see utilization once you have enabled the metrics. (linuxhint.com)
  - kubectl top pods
- There’s also a dashboard that’s available that you can use to deploy, troubleshoot, manage resources, and make changes. (minikube.sigs.k8s.io)
  - minikube dashboard

Direct download: coding-blocks-episode-182.mp3
Category:Software Development -- posted at: 8:01pm EDT

Sun, 27 March 2022

Software Reliability Engineering - Hope is not a strategy

It’s finally time to learn what Site Reliability Engineering is all about, while Jer can’t speak nor type, Merkle got one (!!!), and Mr. Wunderwood is wrong.

The full show notes for this episode are available at https://www.codingblocks.net/episode181.

Survey Says

Reviews

Thanks for the review “Amazon Customer”! (You, er, we know who you are.)

Site Reliability Engineering

Site Reliability Engineering: How Google Runs Production Systems is a collections of essays, from Google’s perspective, released in 2016 … and it’s free. (sre.google)
There’s a free workbook to go along with it too. (sre.google)
But how is SRE as a career? (GlobalDots.com)
- Career Advancement Score (out of 10): 9
- Median Base Salary: $200,000
- Job Openings (YoY growth): 1,400+ (72%)
These essays are what one company did, that company being Google.
The book is told from the perspective of people within the company.

It is about scaling a business process, rather than just the machinery.
Site Reliability Engineering: How Google Runs Production Systems

Their tale should be used for emulating, not copying.
40-90% of your effort is after you have deployed a system.
The notion that once your software is “stable”, the easy part starts is just plain wrong.
Yeah, but what is a Site Reliability Engineering role?
- It’s engineers who apply the principles of computer science and engineering to the design and development of computing systems, usually large distributed ones.
- It includes writing software for those systems.
- Including building all the additional pieces those systems need, i.e. backups, load balancers, etc.
Reliability … the most fundamental feature of any product?
- Software doesn’t matter much if it can’t be used.
- Software need only to be reliable “enough”.
  - Once you’ve accomplished this, you spend time building more features or new products.
SRE’s also focus on operating services on top of the distributed computing systems. Examples include:
- Storage,
- Email, and
- Search.
Reliability is regarded as the primary focus of the SRE.
The book was largely written to help the community as a whole by exposing what Google did to solve the post deploy problems as well as to help define what they believe the role and function is for an SRE.
They also call out in the book that they hope the information in the book will work for small to large businesses. Even though they know small businesses don’t have the budget and manpower of larger businesses, the concepts here should help any software development shop.

However, we acknowledge that smaller organizations may be wondering how they can best use the experience represented here: much like security, the earlier you care about reliability, the better.
Site Reliability Engineering: How Google Runs Production Systems

It’s less costly to implement the beginnings of lightweight reliability support early in the software process rather than introduce something later that’s not present at all or has no foundation.
Who was the first SRE? Maybe Margaret Hamilton? (Wikipedia)
The SRE way:
- Thoroughness,
- Dedication,
- Belief in the value of preparation and documentation, and
- Awareness of what could go wrong, and the strong desire to prevent it.

Hope is not a strategy.
Site Reliability Engineering: How Google Runs Production Systems

Chapter 1 – Introduction

Consider the sysadmin approach to system management:
- The sysadmins run services and respond to events and updates as they happen.
- Teams typically grow as the capacity is needed.
- Usually the skills for a product developer and a sysadmin are different, therefore they end up on different teams, i.e. a development team and an operations team (i.e. the sysadmins).
- This approach is easy to implement.
Disadvantages of the sysadmin approach:
- Direct costs that are not subtle and are easy to see.
  - As the size and complexity of the services managed by the operations team grows, so does the operations team.
  - Doesn’t scale well because manual intervention with regards to change management and process updates requires more manpower.
- Indirect costs that are subtle and often more costly than the direct costs.
  - Both teams speak about things with different vocabularies (i.e. no ubiquitous language from back in the DDD days).
  - Each team has different assumptions about risk and possibilities for technical solutions.
  - Each team has different assumptions about target level of product stability.
Due to these differences, these teams usually end up in conflict.
- How quickly should software be released to production?
  - Developers want their features out as soon as possible for their customers.
  - Operations teams want to make sure the software won’t break and be a pain to manage in production.
A developer always wants their software released as fast as possible.
An ops person would want to minimize the amount of changes to ensure the system is as stable as possible.
This results in trench warfare between the two groups!
- Operations introduces launch and change gates, such as test for every problem that’s ever happened.
- Development teams introduce fewer changes and introduce more feature flags, such as sharding the features so they’re not beholden to the launch review.

What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team.
Site Reliability Engineering: How Google Runs Production Systems

Google’s Approach to this Problem?

Focus on hiring software engineers to run their products (not sysadmins).
- Create systems to accomplish the work that would have historically been done by sysadmins.
SRE can be broken down into two main categories:
- 50-60% are Google software engineers, that is people who were hired via the standard hiring procedure.
- 40-50% are candidates who were very close to the Google software engineer qualifications but didn’t quite make the original cut.
  - Additionally, they had skills that would be very valuable for SRE’s but not as common in typical software engineers, like Unix system internals and networking knowledge.
SREs believe in building software to solve complex technical problems.
- Google has tracked the progress career-wise of the two groups and have found very little difference in their performance over time.
Software engineers get bored by nature doing repetitive work and are mentally geared towards automating problems with software solutions.
SRE teams must be focused on engineering.
Traditional ops groups scale linearly by service size, hiring more people to do the same tasks over and over.
For this reason, Google puts a 50% utilization cap on SRE’s doing traditional ops work.
- This ensures the SRE team has time to automate and stabilize the software through means of automation.
- Over time, as the SRE team has automated most of the tasks, their operations workload should be reduced to minimal amounts as the software runs and heals itself.
The goal is that the other 50% of the SRE’s time is on development.
Only way to maintain those rates is to measure them.
Google has found that SRE teams are cheaper than traditional ops teams with fewer employees because they know the systems well and prevent problems.

… we want systems that are automatic, not just automated.
Site Reliability Engineering: How Google Runs Production Systems

Challenges

Hiring is hard and the SRE role competes with product teams.
Pager duty!
Requires developer skills as well as system engineering.
This is a new discipline.
Requires strong management to protect the budgets, such as stopping releases, respecting the 50% rules, etc.

One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
Site Reliability Engineering: How Google Runs Production Systems

Tenants of SRE

Availability
Latency
Performance
Efficiency
Change Management
Monitoring
Emergency Response
Capacity Planning

Durable Focus on Engineering

In order to keep time for project work, SREs should receive a maximum of 2 events per 8-12 hour on-call shift.
This low volume allows the engineer to spend adequate time for accuracy, cleanup, and postmortem.
More than events that mean you have a problem to solve or more SREs to hire, less and you have too many SREs.
Postmortems should be written for all significant incidents, whether paged or not.
Non-paged work might be even more important since it can point to a hole in the monitoring.
Cultivate a blame-free postmortem culture.

Max Change Velocity

An error budget is an interesting way to balance innovation and reliability.
Too many problems and you need to slow down and focus more on reliability, not enough problems and you’re probably gold plating.
Ever have a manager push back on tech-debt? Maybe they aren’t aware of this balance? What can you do to quantity it?
100% uptime is generally considered to not be worth it, as gets more expensive as you get closer to the mark and your customers generally don’t have 100% uptime, so it’s wasteful.
What is the right reliability number though? That’s a business decision.
- What downtime percentage will the users allow, based on their usage of the product?
- How critical is your service? Is there a workaround?
- How well does the experience degrade?
What could a team do if there’s not anymore room in the budget?
What if there’s too much?

Monitoring

Monitoring is how to track the system’s health and availability.
- Classic approach was to have an alert get sent when some event or threshold is crossed.
  - This is flawed though because anything that requires human intervention is by it’s very definition, not automated and introduces latency.
  - Software should be interpreting and people should only be involved when the software can’t do what it needs to do.
Three types of valid monitoring:
- Alerts – a person needs to take immediate action.
- Tickets – a person needs to take action but not immediately. The event cannot automatically be handled but can wait a few days to be resolved.
- Logging – nobody needs to do anything. The logs should only be viewed if something prompts them to do so.

Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).
Site Reliability Engineering: How Google Runs Production Systems

Emergency Response

The best metric for determining effectiveness of an emergency response is the MTTR, i.e. how quickly things got back into a healthy state.
People add latency. Even if there are more failures, a system that can avoid emergencies that require people to do something, will still have higher availability.
- Thinking through problems before they happen and creating a playbook resulted in 3x improvement in MTTR as opposed to “winging it”.
- On call SRE’s always have on-call playbooks while also doing exercises they dub the Wheel of Misfortune to prepare for on call events.

Change Management

70% of outages are due to changes in a live system.
Best practices:
- Progressive rollouts,
- Quickly and accurately detecting problems, and
- Ability to rollback safely when something goes wrong.
Removing people from the loop, the practices above help improve release velocity and safety.

Demand Forecasting and Capacity Planning

Forecasting helps you ensure service availability and keep costs in check and understood.
Be sure to account for both organic growth, i.e. normal usage, and inorganic growth, such as launches, marketing, etc.
Three mandatory steps:
- Accurate organic forecast, extending beyond the leadtime for adding capacity,
- Accurate incorporation of inorganic demand sources, and
- Regular load testing.

Provisioning

The faster provisioning is, the later you can do it.
The later you can do it, the less expensive it is.
Not all scaling is created equally. Adding a new instance may be cheap but repartitioning can be very risky and time consuming.

Efficiency and Performance

Since SRE are in charge of provisioning and usage, they are close to the costs.
It’s important to maximize resources, which fundamentally affect the success of the project.
Systems get slower as load is added, and slowness can also be viewed as a loss of capacity.
There is a balance between cost and speed. SREs are responsible for defining and maintaining SLOs.

Resources we Like

Links to Google’s free books on Site Reliability Engineering (sre.g oogle)
Why is SRE Becoming 2021’s Hottest Hire? (GlobalDots.com)
How much money do SREs make? (Gremlin.com)
Margaret Hamilton (software engineer) (Wikipedia)

Tip of the Week

Don’t reinvent the wheel, if you’re in Java. Guava is a collection of utilities that solve common problems, courtesy of Google. (GitHub)
From the mindset of RTFM: There are some interesting flags you can pass for git cherry-pick … and other tools you might use. (git-scm.com)
You can use CTRL+NUM on Windows or CMD+NUM on macOS to navigate between tabs in Chrome. (support.google.com)

Direct download: coding-blocks-episode-181.mp3
Category:Software Development -- posted at: 8:01pm EDT

Sun, 13 March 2022

The Great Resignation

We’re living through the tail end, maybe?, of the Great Resignation, so we dig into how that might impact software engineering careers while Allen is very somber, Joe’s years are … different, and Michael pronounces each hump.

The full show notes for this episode are available at https://www.codingblocks.net/episode180.

Sponsors

Mergify – Save time by automating your pull requests and securing the code merge using a merge queue.

Survey Says

Reviews

Thanks for the review Chuck Rugged (or is it Rugged?).

What is “the great resignation”?

The Great Resignation is an ongoing economic trend where a lot of people started quitting their jobs in 2021 and peaked at 3% unemployment (up roughly 50% from the pre-COVID unemployment average).
Primarily, but not exclusively, in the US, but also trended in Europe, China, India, Australia as well.
Some interesting factors:
- High worker demand and labor shortages.
- High unemployment.
- Employees between 30 and 45 years old have had the greatest increase in resignation rates, with an average increase of more than 20% between 2020 and 2021.
- Resignation rates actually dropped for people in their 20s.
- Tech and healthcare led the trend, 4.5% for US, 3.6% for healthcare.
- Reasons cited included stagnant wages and working conditions.

Why is this a big deal?

Hiring is expensive! Think of thinks like referral fees, recruiter’s percentage, takes a while for people to become productive, onboarding, etc.
What does this mean for working conditions? More remote, better compensation, more flexibility, etc.?
Why do people change jobs?
- Promotion,
- Work life balance,
- Compensation,
- Flexibility,
- Leaving a bad environment, and/or
- Better company

What can you gain?

Salary bands, FAANG vs local vs remote vs startup
The “TC” (total compensation) Trap
- Restricted Stock Units vs Options
Top paying companies, by level (levels.fyi)
Comparing levels across orgs (levels.fyi)

About those levels

Senior engineers are senior developers who may specialize in a specific area, oversee projects, and manage junior developers.
Principal Engineer is a highly experienced engineer who oversees a variety of projects from start to finish.
Staff engineer is a senior, individual contributor role in a software engineering organization. There is no “one” kind of staff engineer and many fall into one of four archetypes: Tech Lead, Architect, Solver, and Right Hand. (staffeng.com)
Is there a hiring level cap? What does that mean?

What can you lose?

The people,
The grass isn’t always greener,
Seniority (don’t be the “At X we …” person), and/or
Comfort

Resources we Like

Great Resignation (Wikipedia)
The Great Resignation: Data and analysis show it’s not as great as screaming headlines suggest (NevadaCurrent.com)
The Great Resignation is here. What does that mean for developers? (stackoverflow.blog)
Who Is Driving the Great Resignation? (hbr.org)
What Do Software Developers Want Out of Their Next Job? (insights.dice.com)
How the Government Measures Unemployment (bls.gov)
Salary comparisons (levels.fyi)

Tip of the Week

Did you know you can expand or collapse all the files in a pull request on GitHub? Press Alt + Click on any file chevron in the pull request to collapse or expand them all! (github.blog)
Kotlin code for using Google Cloud! (cloud.google.com)
Thanks to Dave Follett for sharing How to securely erase your hard drive or SSD! (pcworld.com)
Thanks to Fuzzy Muffin for sharing Nvchad, a nice face for Neovim (Nvim) that adds some nice features, like directory access and tabs. (nvchad.github.io)
How do you merge two Git repositories? (Stack Overflow)
Use git-sizer to get various statistics about your repository. (GitHub)
How to find/identify large commits in git history? (Stack Overflow)
Then forget about BFG and filter-branch, git filter-repo is the way to remove large files from your Git repo (GitHub)
Use --shallow-exclude to exclude commits found in the supplied ref in either (or both) your git clone (git-scm.com) or git fetch operations. (git-scm.com)
Limit your git push operation “up to” a commit by using the format git push <remote name> <commit ID>:refs/heads/<branch name>. (If the <branch name> already exists on the <remote name>, you can leave off the refs/heads/ portion. (git-scm.com)

Direct download: coding-blocks-episode-180.mp3
Category:Software Development -- posted at: 8:39pm EDT

Sun, 27 February 2022

Minimum Viable Continuous Delivery

We dive into what it takes to adhere to minimum viable continuous delivery while Michael isn’t going to quit his day job, Allen catches the earworm, and Joe is experiencing full-on Stockholm syndrome.

The full show notes for this episode are available at https://www.codingblocks.net/episode179.

Sponsors

Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.

Survey Says

Sidebar

Revisiting unit testing private methods in 2022, what would you do?

Minimum Viable Continuous Integration

CD is the engineering discipline of delivering all changes in a standard way safely.
minimumcd.org

The belief is that you must at least put a certain core set of pieces in play to reap the benefits of Continuous Delivery.
The outcome that they’re looking for is the improved speed, quality, and safety of the deployment pipeline.
Minimum requirements:
- Use continuous integration, continuously integrating work into the trunk of your version control and ensuring, as much as possible, that the product is releasable.
- The application pipeline is the ONLY way to deploy to an environment.
- The pipeline decides if the work is releasable.
- The artifacts created by the pipeline meet the organization’s requirement for being deployable.
- The artifacts are considered immutable, nobody may change them after they were created by the pipeline.
- All feature work stops if the pipeline status is red.
- Must have a production like test environment.
- Must have rollback on demand capability.
- Application configuration is deployed with the artifacts.

If the pipeline says everything looks good, that should be enough – it forces the focus on what ‘releasable’ means.
Dave Farley

Continuous Integration

Use trunk based development.
Integrated daily at a minimum.
Automated testing before merging work code to the trunk.
Work is tested with other work automatically during a merge.
All feature work stops when the build is red.
New work does not break already delivered work.

Trunk Based development

What is trunk based development?
Developers collaborate on a single branch, usually named trunk, main, or something similar.
You must resist any pressure to create other long-lived development branches.
The argument is that the simplicity of this structure is more than worth anything you might gain by any other structure.
For small teams, this is easy, each committer commits straight to trunk, after a build/test gate.
For larger teams, you use short-lived feature branches that might live for a couple days max and end with a PR review and build/test gate.
What does this buy us?
- The codebase is always releasable on demand.
- Google, Facebook, authors of Continuous Delivery and The DevOps Handbook advocate for it.
But how do we …
- Big feature? Feature flag it off.
- Hot fix? Fix forward.
But …
- What if you need multiple CONSECUTIVE releases? i.e. think of the Kubernetes release cycle.
- What if you need multiple CONCURRENT releases? i.e. think of Microsoft support for multiple versions of Windows.

Resources we Like

Minimum Viable CD (MinimumCD.org)
- Beyond the Minimums, aka references (MinimumCD.org)
- Frequent Questions (MinimumCD.org)
What Is a Deployment Pipeline? (Informit)
Trunk Based Development: Introduction (TrunkBasedDevelopment.com)
Real Example of a Deployment Pipeline in the Fintech Industry (YouTube)
The Twelve-Factor App, III. Config (12Factor.net)
Kubernetes Best Practices: Blueprints for Building Successful Applications on Kubernetes (Amazon)
Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Amazon)
Comparing Git Workflows (episode 90)
Our discussions of The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Coding Blocks)

Tip of the Week

Did you know you cannot set environment variables in Java?
Terms & Conditions Apply is a game where you have to avoid giving up all your juicy data to Evil Corp by carefully avoiding accepting the terms and conditions. Good luck. Thanks Lars Onselius! (TermsAndConditions.game)
Test Containers is a Java library that gives you a way to interact with and automate containers for testing purposes. Thanks Rionmonster! (TestContainers.org)
Maybe it’s time for JSON to die? YAML is finicky, but it’s easier to read and it allows comments.
- YamlDotNet is a library that makes this easy in C#. (YamlDotNet.org)
PowerToys are a collection of utilities from Microsoft that extend Window with some really powerful and easy to use features. Thanks Scott Harden! (Microsoft)
Did you know you can now include diagrams inside your markdown files on Github? Mermaid is the name and you can create the diagrams directly in your files and keep it versioned along with your code. Thanks Murali and Scott Harden! (github.blog)

Direct download: coding-blocks-episode-179.mp3
Category:Software Development -- posted at: 8:01pm EDT

Sun, 13 February 2022

#CBJAM 22 Recap

We have a retrospective about our recent Game Ja Ja Ja Jam, while Michael doesn’t know his A from his CNAME, Allen could be a nun, and Joe still wants to be a game developer.

The full show notes for this episode are available at https://www.codingblocks.net/episode178.

Sponsors

Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.

Survey Says

News

Thanks for the review!
- iTunes: €l£M£N$!
All CBJam Games are playable now! (itch.io)
You can see a review of the top games we’ll be talking about as well as a montage of all the games on YouTube.

Check out all of the games!

Lessons Learned

itch.io is a social network(!) where you can follow other developers, learn about other game jams, etc.
Most games are playable in a browser.
Writing copy is hard!
Playtesting – lots of games are hard!
Game Duration – aim for a really good, well paced 1-5 minutes.
- Time zones – Start, end, defaulting to a little extra
- Teams – submissions, collaborators, voting
- How much is too much?
Does theme skew voting?
What do we get from competition?
Durations
- Would you think making the event longer would get more submissions?
- How long should you get to vote?

Breakdowns

Top 5 Overall

5: Followers (itch.io)

Tomaz Saraiva (itch.io) (youtube.com)

Followers is a clever 2D platformer puzzle game where you play as 2 different characters in 2 different levels at the same time. You hit the jump button, they both jump. You move left or right, they both attempt to move left or right. The task is to safely collect all the fruit in the level but you “safely” is the keyword here because you have to be really thoughtful about your actions. The trick is utilize the various obstacles in the different levels to position your characters just right to get through. Doing this is a real mind bender, but it’s really satisfying when you figure out how to get through. The art and music fit perfectly and it’s such a cool riff on the the theme since both characters literally follow along with whatever you tell them…which usually leads to their demise!

4: Knock ‘Em (itch.io)

Eike Kriesel
Leon Arndt (itch.io)

Knock ‘Em is a 3D arcade style action game where you play as a bowling ball tasked with knocking down bowling pins…that sounds pretty normal, but once you start the game you realize that you are in for a treat. You have full reign of the bowling alley, and the characters in this game are all bowling pins. That’s right, you are a bowling ball blasting around in an alley full of bowling pins that are…bowling! You get a point for every pin and are chased by alley staff pins and police pins that will ultimately wear down your health and end the game. The action is frantic and fun and funny, with a lot of attention to details.

3: Just Down the Hall (itch.io)

Michael M. – Programming Lead, Game Design, Level Design, UI Design
Cheyenne M. – Art, Programming, Game Design, Sound Design
Alex M. – Game Testing, Special Thanks

Just Down the Hall is a spooky 2D action platformer. In the game a deadly shadow is following behind you, jumping when you jump. If you clear an obstacle the shadow will plow into it, slowing it down. If you hit an obstacle, you slow down and the shadow catches up. The game has a really cool split screen feature where you can watch the shadow following you and smacking into obstacles. This is really cool to watch and it leads to some really tense and rewarding moments as it gets closer and closer to you. The art is beautiful as well so it actually feels good when it inevitably ends.

2: Live & Evil (itch.io)

Zaksley – Developer and Game Designer (itch.io)
Drenaya – Developer and Pixel Artist (itch.io)
Thalia Music (itch.io)

Live & Evil is a 2D puzzle platformer game about a robot named “Live” and their show “Evil”…which is also the word “live” spelled backwards and the the logo reflects that in a cool way. This is important to note because in this game you swap between the two characters by hitting the Q button. Live walks on top of the platforms, Evil on the bottom. Items like collectables, switches or platforms may only be visible or usable by one of the characters so you’ll have to use both of them to win. This plays out in increasingly interesting ways as the game progresses…and again, you’ll have to see it to believe it.

1: Light of the World (itch.io)

Nelson L: Lead Programmer (itch.io)
Michael P: Lead Designer (itch.io)

Light of the World is a 2D puzzle platformer with dark atmospheric and spooky levels, you are charged with rushing between beautiful beacons of light before the your enemy can attack. It has a light but powerful narrative with a really cool bouncing shield mechanic that you can throw and retrieve to solve some light, but clever, puzzles. It’s hard to really talk about this game because every aspect of it is done so well that your jaw is on the floor the whole time. Even the YouTube video trailer for the game was expertly done and you can tell that user experience was always the top priority. This game was rated highest in all 3 categories!

Fun

a path alone (itch.io)

Harrison D (itch.io)
Kyle E

“a path alone” is a thoughtful 2D box pushing puzzle game. It’s dark, and moody, and beautiful and strange. Every level features a beautiful pixel art animal that will grant you a new ability in exchange for a favor. The advice they give you is somber and there’s something about the music and the animations and the mood that gives these interactions some emotional weight, so you are feeling something as you play this game. It’s hard to pin down exactly what that feeling is, and that’s part of the fun. The amount of polish is really evident here and you can really see how much that care and attention given to user experience pays off.

Creativity

What am I supposed to do!? (itch.io)

FireBlue87 (itch.io)
GalaxiGames (itch.io)
theoggkaze (itch.io)

What am I supposed to do is 2D drag and drop puzzle / action game in which the player is charged with helping a character escape from a terrifying and really cool looking “windy monstrosity”. You don’t directly control the character, but you do have items that you can drop into the stages to help the character out. Drop a sword and a character will slash, drop a cloud if there’s a long fall to help your character land softly. The music adds to the frantic pace and the game really steps it up in later levels where you get random items and have to quickly figure out how to make do with what you have available.

It Follows Me (itch.io)

Fussenkuh (itch.io)

It Follows Me from Fussenkuh was a really fun and unique take on the theme, where you have to thumbs up or thumbs down pairs of words based on whether the first word has the letters “me” and the second word has “it”. Get it? You up vote the pairs of words where “it” follows “me”. You only have a few seconds to make each decision so even though this is a word game, I’m probably more likely to classify it as an action game. Slapping the thumbs up and thumbs down actions are reminiscent of social media, and given the popularity of a little game called wordle, this game has an interesting contemporary tone that gives you this odd feeling that you’re playing an fun artifact from…right now.

Quirk

I became a Treasure Hunter to Pay Off My Student Debt, but now an Immortal Snail is Coming after me with a Knife (itch.io)

“I became a Treasure Hunter to Pay Off My Student Debt, but now an Immortal Snail is Coming after me with a Knife”, which also wins the award for best title. In this game you try to collect all the treasure before you are caught by a knife wielding snail. The snail is slow at first, but speeds up as you collect each treasure making for a really tense end game experience.

Ducks in Space (itch.io)

Ducks in Space is a beautiful 3D snake like game where you gather little ducklings who follow you as you swim around a cool spherical planet. If you run into your duckling tail, the game is over, but you also have to avoid some hungry herons. The game looks, sounds, and plays great and was written in native html and JavaScript, which makes everything even the game even more impressive!

Tip of the Week

We just couldn’t help ourselves and we took all of our tips from Simon Barker this week!

Tons of free icons! (flaticon.com)
More great icons for material themes! (materialdesignicons.com)
Having a tough time trying to figure out a name for your new app? Check out this site, it’ll help you find that name and tell you what platforms are available for it! (namae.dev)
Check out the rebranded/relaunched podcast “All The Code” from Simon Barker (podcasts.apple.com)
Crontab guru makes it easy to build and understand cron schedule expressions (crontab.guru)

Direct download: coding-blocks-episode-178.mp3
Category:Software Development -- posted at: 10:12pm EDT

Sun, 30 January 2022

PagerDuty's Security Training for Engineers, The Dramatic Conclusion

We wrap up our discussion of PagerDuty’s Security Training, while Joe declares this year is already a loss, Michael can’t even, and Allen says doody, err, duty.

The full show notes for this episode are available at https://www.codingblocks.net/episode177.

Sponsors

Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.

Survey Says

News

Ja Ja Ja Jamuary is complete and there are 46 new games in the world. Go play! (itch.io)

Session Management

Session management is the ability to identify a user over multiple requests.
HTTP is stateless, so there needs to be a way to maintain state.
- Cookies are commonly used to store information on the client to be sent back to the server on subsequent requests.
  - They usually contains a session token of some sort, which should be a random unique string.
  - Do NOT store sensitive information in the cookie, such as no usernames, passwords, etc.
    - Besides tampering, it can be difficult to revoke the cookies.

Session Hijacking

Session hijacking is stealing a user’s session, possibly by:
- Guessing or stealing the session identifiers, or
- Taking over cookies that weren’t properly locked down.

Session Fixation

Session fixation is when a bad actor creates a session that you will unknowingly take over, thus giving the bad actor access to the data in the user’s session.
- This used to be more of an issue when session tokens were passed around in the URL (remember CFID and CFTOKEN?!).
Always treat cookies like any other user input, don’t implicitly trust it, because it can be manipulated on the client.

How to Secure / Verify Sessions

Add extra pieces of data to the session you can verify when requests are made.
Ensure you actually created the session.
Make sure it hasn’t expired and ensure you set expirations for sessions.
- All of this just catches the easy stuff.
Session ID’s should be unique and random.
Ensure the following when sending cookies to the client:
- Secure flag is set,
- httpOnly flag is set, and
- The domain is set on the cookie so it can only be used by your application.
To avoid the session fixation we mentioned earlier, ALWAYS make sure to send a new session ID when privileges are elevated, i.e. a login.
Always keep information stored on the server side, not on the client.
Make sure you have an expiration that is set on the server side session. This should be completely independent of the cookie because the cookie values can be manipulated.
When a user logs out or the session expires, ensure you fully destroy all session information.
NEVER TRUST USER INPUT!

Permissions

Try to avoid using sudo in any shell scripts if you can.
- If you can’t avoid it, use it with care.
The the principle of least privilege, i.e. more restrictive permissions, as in, can you live with read-only perms?
Revoke permissions you don’t need.
Create separate users for separate needs.
- If you need to delete files from a storage bucket, have a service account or user set up with just that permission.
- Same for managing compute instances.
Use the least permissive approach you can as it greatly reduces risks.

Other Classic Vulnerabilities

Buffer overflow: This is when a piece of data is stored somewhere it shouldn’t be able to access.
- From Wikipedia, a buffer overflow _”is an anomaly where a program, while writing data to a buffer, overruns the buffer’s boundary and overwrites adjacent memory locations.”_
- Typically these are used to execute malicious code by putting instructions in a piece of memory that is to be executed after a previous statement completes.
- One malicious use of a buffer overflow is using a NOP sled (no-operation sled) to fill up the buffer with a lot of NOPs with your malicious code at the end of the ride.
  - Apparently you can use this method to easily get a root shell – article linked in the resources
  - Metasploit (YouTube)
Path Traversal: This is when you “break out” of the web server’s directory and are able to access, or serve up, content from elsewhere on the server
- Remember, your dependencies may also have vulnerabilities such as this. You need to run scans on your apps, code, and infrastructure.
Side Channel Attacks: This is when the attacker is using information that’s not necessarily part of a process to get information about that process. Examples include:
- Timing attack: Understanding how long certain processes take can allow you to infer information about the process. For example, multiplication takes longer than addition so you might be able to determine that there’s multiplication happening.
- Power analysis: This is when you can actually figure out what a processor is doing by analyzing the electrical power being consumed. An example of this process is called differential power analysis.
- Acoustic cryptanalysis: This is when the attacker is analyzing sounds to find out what’s going on, such as using a microphone to listen to the sounds of typing a password.
- Data remanence: This is when an attacker gets sensitive data after it was thought to have been deleted.

Resources we Like

For Engineers – PagerDuty Security Training (sudo.PagerDuty.com)
For Everyone – PagerDuty Security Training (sudo.PagerDuty.com)
Session Management Cheat Sheet (OWASP.org)
Channel-Bound Cookies (BrowserAuth.net)
Origin Cookies (tools.ietf.org)
Channel Bindings for TLS (tools.ietf.org)
Firesheep (Wikipedia)
Buffer overflow (Wikipedia)
Smashing The Stack For Fun And Profit (phrack.org)
The Visual Microphone: Passive Recovery of Sound from Video (YouTube)
NSA’s involvement in the design of the Data Encryption Standard (Wikipedia)
The Data Encryption Standard (DES) and its strength against attacks by D. Coppersmith (simson.net)
Differential cryptanalysis (Wikipedia)
Power analysis (Wikipedia)
American Cryptology during the Cold War, 1945-1989 (NSA)
DirecTV attacks hacked smart cards (theregister.co.uk)
Oh mother… | Family Feud (YouTube)
Eagle Eye (IMDb)

Tip of the Week

Did you know you can use your phone as a pro level webcam? Thanks Simon Barker! (reincubate.com)
From the tip hotline (cb.show/tips) – Mikerg sent us a great site for learning VSCode. Some are free, some require a $3 monthly subscription, but the ones Joe has done have been really good. Not just VSCode either! IntelliJ, Gmail, lots of other stuff! (keycombiner.com)
How to use Visual Studio Code as the default editor for Git MergeTool (stackoverflow.com)
Five Easy to Miss PostgreSQL Query Performance Bottlenecks (pawelurbanek.com)

Direct download: coding-blocks-episode-177.mp3
Category:Software Development -- posted at: 9:52pm EDT

Mon, 17 January 2022

PagerDuty's Security Training for Engineers, Penultimate

We’re pretty sure we’re almost done and we’re definitely all present for the recording as we continue discussing PagerDuty’s Security Training, while Allen won’t fall for it, Joe takes the show to a dark place, and Michael knows obscure, um, stuff.

The full show notes for this episode are available at https://www.codingblocks.net/episode176.

Sponsors

Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.

Survey Says

News

Thanks for the reviews!
- iTunes: YouCanSayThisNickname
Game Ja Ja Ja Jam is coming up! Just a few days away! (itch.io)

XSS – Cross Site Scripting

Q: What is XSS? A: XSS is injecting snippets of code onto webpages that will be viewed by others.
- This can allow the attacker to basically have access to everything a user does or types on a page.
Consider something like a comment on a forum, or blog that allows one to save malicious code.
- The attacker could potentially access cookies and session information,
- As well as gain access to keyboard entry on the page.
You can sanitize the inputs, but that’s not good enough.
- You can’t check for everything in the world.
You really need to be encoding the stored information before you present it back to any users.
- This allows things to be displayed as they were entered, but not executed by the browser.
- Different languages, frameworks, libraries, etc., have their own ways of encoding information before it’s rendered by the browser. Get familiar with your library’s specific ways.
User supplied data should ALWAYS be encoded before being rendered by the browser. ALWAYS.
- This goes for HTML, JS, CSS, etc.
Use a library for encoding because the chances are they’ve been vetted.
- Just like we mentioned before, you still have to be diligent about using 3rd party libraries. Using a 3rd party library doesn’t mean you can wash your hands of it.
Content Security Policy (CSP) is another way to handle this. (Wikipedia)
OWASP considers XSS a type of Injection attack in 2021.

CSRF – Cross Site Request Forgery

Q: What is CSRF? A: CSRF is tricking someone into doing something they didn’t want to do, or didn’t know they were doing.
A couple of examples were given:
- For example, set the img src to the logout for the site so that when someone visits the page, they’re automatically logged out.
  - Just imagine if the image source pointed to something a little more nefarious.
- Another example is a button that tricked you into performing an action such as an account deletion on another site. Can be done using a form post and a simple button click.
How do you avoid this?
- Synchronizer token:
  - This is a hidden field on every user submittable form on a site that has a value that’s private to the user’s session.
    - These tokens should be cryptographically strong random values so they can never be guessed or reverse engineered.
    - These tokens should never be shared with anyone else.
  - When the form is submitted, the token is validated against the user’s session token, and if it matches, go ahead with the action, otherwise abort.
- Again, there are a number of frameworks and libraries out there that have anti-forgery built in. Check with your specific documentation.
They go on to say that anything that is not a READ operation should have CSRF tokens.
NEVER use GET requests for state changing operations!
- PagerDuty had a funny mention about an administrative site that included links to delete rows from the database using GET requests. However, as the browser pre-fetched the links, it deleted the database.
OWASP dropped CSRF from the Top 10 in 2017 because the statistical data didn’t rank it highly enough to make the list.

Click-jacking

Q: What is click-jacking? A: Click-jacking is when you are fooled into clicking on something you didn’t intend to.
- For example, rendering a page over the top of an iframe, and anything that was clicked on that top page (that seemed innocent) would actually make the click happen on the iframe‘d page, like clicking a Buy it Now button.
- Another example is moving a window as soon as you click causing you to click on something you didn’t intend to click.
The best way to prevent click-jacking is to lock down what an iframe can load using the HTTP header X-FRAME-OPTIONS, set to either SAMEORIGIN or DENY. (developer.mozilla.org)

Account Enumeration

Q: What is account enumeration? A: Account enumeration is when an attacker attempts to extract users or information from a website.
- Failed logins that take longer for one user than another may indicate that the one that took longer was a real user, maybe because it takes longer as it tries to hash the password.
- Similar type of thing could happen if customers are subdomained. One subdomain shows properly and another fails. This reveals information about the customers.
These may be frustrating, as they pointed out, as you have to walk the line between user experience and security.
- Just be aware of what type of data you might be exposing with these types of operations.
Regarding logins:
- If the user exists or doesn’t, run the same hashing algorithm to not give away which is real or not.
- If a user does a password reset, don’t give a message indicating whether the account really existed or not. Keep the flow and messaging the same.

Resources we Like

For Engineers – PagerDuty Security Training (sudo.PagerDuty.com)
For Everyone – PagerDuty Security Training (sudo.PagerDuty.com)
Cross-Site Request Forgery (OWASP.org, Wikipedia)
About User Enumeration (blog.rapid7.com)

Tip of the Week

CloudFlare let’s you deploy JAMStack websites for free using their edge network. (pages.cloudflare.com)
Amazon has their own open-source game engine, Open 3D Engine, aka O3DE. It’s the successor to Lumber Yard, a AAA-capable, cross-platform, open source, 3D engine licensed under Apache 2.0. (aws.amazon.com, o3de.org)
Let’s talk about CSS! Ever use border to try and figure out layout issues? Why not use outline instead? Thanks Andrew Diamond! (W3Schools.com)
- We discussed a similar technique as a TotW for episode 81.
Have you seen those weird mobile game ads? Click this link, maybe when you’re not at work, and embrace the weird world of mobile game ads. (Reddit)
- Nostalgia for the 80’s? People have uploaded some of the tapes that used to play on the loudspeakers at US department store, K-Mart (Nerdist.com)
OWASP publishes cheat sheets for security. (cheatsheetseries.owasp.org)

Direct download: coding-blocks-episode-176.mp3
Category:Software Development -- posted at: 8:01pm EDT

Mon, 3 January 2022

PagerDuty’s Security Training for Engineers! Part Deux

We continue our discussion of PagerDuty’s Security Training presentation while Michael buys a vowel, Joe has some buffer, and Allen hits everything he doesn’t aim for.

The full show notes for this episode are available at https://www.codingblocks.net/episode175.

Sponsors

Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.
Shortcut – Project management has never been easier. Check out how Shortcut is project management without all the management.

Survey Says

News

Thanks for the reviews!
- iTunes: aodiogo
Game Ja-Ja-Ja-Jamuary is coming up, sign up is open now! (itch.io)

Encryption

OWASP has the more generic “Cryptographic Failures” at #2, up from #3 in 2017.
PagerDuty defines encryption as encoding information in such a way that only authorized readers can access it.
- Note that this is an informal definition that speaks to the most common use of the word.
Encryption is really, really difficult to get right. There are people that spend their whole lives thinking about encryption, and breaking encryption. You may think you’re a genius by coming up with a non-standard implementation, but unfortunately the attackers are really sophisticated and this strategy has shown to fail over and over.
There are different types of encryption:
- Symmetric/Asymmetric – refers to whether the keys for reading and writing the encrypted data are the same.
- Block Cipher – Lets you encrypt and decrypt the data in whole chunks. You need to have an entire block to encrypt or decrypt the whole block at once.
- Public/Private Key – A kind of asymmetric encryption intended for situations where you want groups to be able to share one of the keys. For example, you can publish a public PGP key and then people can use that to send you a message. You keep the private key private, so you’re the only entity that can read the message.
- Stream Cipher – Encode “on the fly”, think about HTTPS, great for streaming. You can start reading before you have the entire message. Great for situations where performance is important, or you might miss data.

Encryption in Transit

Also known by other names such as data in motion.
Designed to protect against entities that can snoop (or manipulate!) our communications.
You can do this with HTTPS, TLS, IPsec.
Perfect Forward Secrecy is the key to protecting past communications, by generating a new key for a single session so that compromised keys only affect the specific session they were used for.
From Wikipedia “In cryptography, forward secrecy (FS), also known as perfect forward secrecy (PFS), is a feature of specific key agreement protocols that gives assurances that session keys will not be compromised even if long-term secrets used in the session key exchange are compromised.” (Wikipedia)

Encryption at Rest

Simply means that data is encrypted where it’s stored.
- An example of this is full disk encryption on laptops and desktops. The entire drive is encrypted so if someone were to steal the drive, it’d essentially be useless without the keys to decrypt the data on the drive.
For PagerDuty, and many other companies, the most important information to protect is customer data, just as important as your own passwords.
PagerDuty’s data classifications:
- General data – This is anything available to the public.
- Business data – Includes operating data for the business, such as payroll, employee info, etc. This type of data is expected to be encrypted in transit and at rest.
- Customer data – This is data provided to the company by the customer and is expected to be encrypted in transit and at rest.
  - Customer data includes controls such as authentication, access control, storage, auditing, encryption, and destruction.
  - Business data has similar controls except without the auditing.
PagerDuty called out when using cloud systems, make sure you’re enabling the encryption on the various services, like S3, GCS, Blob storage, etc.
- They mentioned it’s just a checkbox, but in reality you’re probably using scripts, templates, etc. So make sure you know the configurations to include to enable encryption.
Another interesting thing they do at PagerDuty: they get alerted when a resource is created without encryption enabled.
What about third parties you use? Should they encrypt as well? YES!!!
- Perform vendor risk assessments prior to using the vendor. If they don’t pass the security assessment, use a different vendor.

Secret Management

Q. What is it? A. Protecting and auditing access to secrets.
- Auditing so that you can see when someone is using your secrets that shouldn’t, as well as keep track of systems that should and are using secrets.
Hashicorp Vault has a great video to learn about the challenges of managing secrets. (YouTube)
What are secrets?
- Secrets are sensitive things such as tokens, keys, passwords, user names, many others.
Secrets should NOT be stored in source control.
- Although it seems to happen all the time, be it on purpose, by accident, etc.
- Anyone with access to the code can now access the secrets.
PagerDuty uses Vault. Vault:
- Securely stores secrets,
- Provides audit access to those secrets, and
- Provides mechanisms to rotate the secrets if/when necessary.
Don’t hardcode or come up with crazy ways to get secrets into your applications.
Secrets should never be shared, i.e. if two people need access to a system, they should have their own secrets to access that system.
- Or maybe you have a “jump” server that has access to an external system, and users have access to the jump server.
NEVER share passwords over insecure channels. This can include channels such as:
- Slack,
- Email,
- SMS,
- But this is not an exhaustive list.
If you do accidentally post a secret in a chat or an insecure channel, you should:
- Let the security team know immediately (you have a security team right?!), and
- Find out how to rotate the secret and do it.
Never allow a secret to be logged!
- This can be especially egregious if you’re logging customer credentials you don’t control.
- Be sure you are sanitizing your log data before you log.

Resources we Like

For Engineers – PagerDuty Security Training (sudo.PagerDuty.com)
For Everyone – PagerDuty Security Training (sudo.PagerDuty.com)
Security Now (TWiT.tv)
Have I Been Pwned (HaveIBeenPwned.com)
Forward secrecy (Wikipedia)
What is Sign in with Apple? (support.apple.com)
What is Hide My Email? (support.apple.com)
Introduction to HashiCorp Vault with Armon Dadgar (YouTube)
Encryption (NetworkSorcery.com)
OWASP Guide to Cryptography (OWASP.org)
Infrastructure Secret Management Software Overview (GitHub)

Tip of the Week

Hashicorp Vault is a tool for managing secrets, but did you know they have a ton of plugins? Take a look! (VaultProject.io)
Unity has tools built in for common game functionality, it’s worth taking a few minutes to google for something before you start typing. Don’t worry, there is still plenty of code to write, but these tools improve the quality and consistency of your game.
You can use animation clips to create advanced character animations, but it’s also good for simple tweens and motions that need to happen once, or in a loop. No need for “Rotator.cs” type classes that you see in a lot of Unity tutorials. (docs.unity3d.com)
NavMeshes are an efficient ways of handling pathfinding, which is an important piece of many games. You can learn the basics in just a few minutes and accomplish some amazing things. (docs.unity3d.com)
GoFullPage lets you take a screenshot of a whole webpage, bada bing, bada boom. (chrome.google.com, GoFullPage.com)

Direct download: coding-blocks-episode-175.mp3
Category:Software Development -- posted at: 8:01pm EDT

Sun, 19 December 2021

PagerDuty's Security Training for Engineers

We’re taking our time as we discuss PagerDuty’s Security Training presentations and what it means to “roll the pepper” while Michael is embarrassed in front of the whole Internet, Franklin Allen Underwood is on a full name basis, and don’t talk to Joe about corn.

The full show notes for this episode are available at https://www.codingblocks.net/episode174.

Sponsors

Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.

Survey Says

News

Thanks for the reviews!
- iTunes: Goofiw, totalwhine, Kpbmx, Viv-or-vyv
Game Ja-Ja-Ja-Jamuary is coming up, sign up is open now! (itch.io)
Question about unit tests, is extra code that’s only used by unit tests acceptable?
Huge congrats to Jamie Taylor for making Microsoft MVP! Check out some of his podcasts:
- The .NET Core Podcast (DotNetCore.show)
- Tabs and Spaces Podcast (TabsAndSpaces.io)
- RJJ Software Ltd (RJJ-Software.co.uk)

Why this topic?

It’s good to learn about the common security vulnerabilities when developing software! What they are, how they are exploited, and how they are prevented.
WebGoat is a website you can run w/ known vulnerabilities. It is designed for you to poke at and find problems with to help you learn how attackers can take advantage of problems. (OWASP.org)
“But the framework takes care of that for me”
- Don’t be that person!
- Recent vulnerability with Grafana, CVE-2021-43798. (SOCPrime.com)
- The Log4j fiasco begins. (CNN)
You can’t always wait for a vulnerability patch to be released. You may need to patch one yourself.
Basically, even if you’re using a framework, it doesn’t mean you can be naïve to everything about it.
You shouldn’t use the excuse “It’s just for a hackathon” or “It’s a proof of concept.”
- This can include things like disabling firewalls, etc.
- Don’t put things on a public repo, as you might accidentally share company secrets, intellectual property, etc.
  - Open sourcing may be an option later, but it should be looked through first.
- NEVER use customer data when doing hackathons or proofs of concepts. Too many things can go wrong if it leaks out.
  - Maybe a better rule of thumb would be to never use customer data for any type of development. Instead, always use fake data.
The slides had an interesting story that was redacted: there was a software vulnerability that was discovered that existed due to a missing check-in of code, i.e. everything was functioning perfectly fine, and there was an effort already to plug a hole in the code, but it just never made it into the repo. Nearly impossible to detect by automated tools.

Vulnerability #1 – SQL Injection

OWASP has more a generic “Injection” as the #3 position, down from #1 in 2017.
An example is manipulating a query at runtime with user provided input.
- This typically implies that strings are patched into a query directly, i.e. WHERE password = '$providedPassword'.
- Can be attacked by doing something like providedPassword = ' OR 1=1 --.
- Which effectively turns into WHERE password = '' OR 1=1 --.
- This is the basis for the tale of little Bobby Tables (xkcd).
Users should NEVER be able to directly impact the runnable query.
- They can provide values, and those should be parameterized, or validated first.
The real problem is that people with SQL knowledge can string multiple lines of SQL together to manipulate the original query in some scary ways.

Blind Injection

Boolean

Boolean based attacks take time but the scripting throws errors if script results are true.
- Example they provided is “If the first database starts with an A, throw”, “If the first database starts with a B, throw”, etc.

Time Based

Uses the Boolean based attack, but puts them on a delay so they won’t be as easily detected.
So you can just regular expressions for keywords and escape quotes right?! Ummm … no!
- There’s just too many combinations of things you’d need to know as well as weird characters and tricks you couldn’t even be aware of, double or triple encoding, exceptions, etc.
- It’s surprisingly tricky. For example, how would you allow single quotes? Replace them all with \'? Unless there’s already a \ in front of it, but what if it’s \?
- You can theoretically overcome all of these problems … but … why? Why not just do it the right way?
The answer is to use prepared statements and/or parameterized queries.
- The difference between a prepared statement and what was mentioned above is the user’s input doesn’t directly modify a query, rather the input is substituted in the appropriate place.
  - Side benefit is prepared statements often execute quicker than manually constructed SQL queries.

Vulnerability #2 – Storing Passwords

OWASP has the more generic “Cryptographic Failures” at #2, up from #3 in 2017.
Never store passwords in plain text!
I’ve heard hashing is good, right?
- Kind of, until you hear that there’s this thing called rainbow tables.
  - Rainbow tables are basically dictionaries of passwords that have been hashed using various algorithms. This allows you to quickly look up a previously known password with a common hashing algorithm.
Using a salt:
- This is essentially appending a random string of data to the end of a password before hashing it.
  - This salt must NEVER be reused, and it should be changed every time a password is created or changes.
  - The sole purpose of a salt is to ensure rainbow tables will be ineffective. The salts can be stored as plain text right next to the password, they are not a secret, they just ensure the hash will be different even if the same passwords are used multiple times.
Using “a” pepper:
- They referred to it as a site-wide salt, which is pretty accurate.
- The pepper does the same thing as the salt, it’s appended to every password before hashing.
  - The biggest difference is that the pepper is not stored alongside the data, rather it’s stored in a file on a server separate from the data.
  - Essentially you’re double-salting your password before hashing.
  - Password + Salt (stored next to the password with the data) + Pepper (stored on separate server), then hash.
Pepper can make it more difficult for hackers as if they steal the database, they still don’t have the pepper.
Pepper can also make it more difficult for the owners of the system as “rolling a pepper” can be difficult, and you have to potentially keep track of all historical peppers.
Even with the salts and peppers, this still doesn’t fully solve the problem. Why?
- Can’t use a rainbow table, but … if a hacker has the salt and pepper, they can try to brute force the password hashes.
  - They can do this because depending on the hashing algorithm chosen, the hashing is just too fast: MD5, SHA-1, etc.
  - Those algorithms weren’t designed for security, they were designed for speed.
- Solution: Key-stretching
  - This is running the password through a hash algorithm a large number of times.
    - The output of the first hash will be the input for the second hash, and so on.
    - The whole point is to make it take longer to hash. If you were to hash a password 100k times, it might take a second.
      - This means for a legit user, it’s going to take a second to hash and compare a login, but for a hacker trying to crack passwords, at MOST they’ll be able to do one attempt per second.
      - Following the math here, previously with a single MD5 or similar hash, the hacker could attempt 100k password cracks per second vs one per second.
  - It’s still not perfect. Hardware is constantly getting better. So what’s a good and slow today, may not be in a year.
Adaptive Hashing:
- Same concepts as above, except you can increase the number of hashing rounds as time goes on.
- Really what you want is the cost to hack a password for a given algorithm. PagerDuty had a nice slide on this that estimated the cost of hardware to crack a password in one year.
- Good algorithms for increasing the cost to hackers are bcrypt, scrypt and PBKDF2.
  - These were designed for hashing passwords specifically.
  - Salting and key stretching are also built into the algorithms so you don’t have to go do it on your own.

Resources we Like

For Engineers – PagerDuty Security Training (sudo.PagerDuty.com)
For Everyone – PagerDuty Security Training (sudo.PagerDuty.com)
OWASP Top 10 (OWASP.org)
SQL Injection References
- MySQL SQL Injection Cheat Sheet (PenTestMonkey.net)
- SQL Injection Prevention Cheat Sheet (OWASP.org)
- Time-Based Blind SQL Injection Attacks (SQLInjection.net)
- Blind SQL Injection (OWASP.org)
- SQL and NoSQL Injection (ckarande.GitBooks.io)
Security Now episode 843, Trojan Source (TWiT.tv)
Trojan Source Bug Threatens the Security of all Code (KrebsOnSecurity)
Check if your accounts have been compromised (HaveIBeenPwned)
How big is your haystack and how well hidden is your needle? Calculator for figuring out how good your password is. (GRC.com)
bcrypt (Wikipedia)
scrypt (Wikipedia)
PBKDF2 (Wikipedia)

Tip of the Week

Did you know you can mail merge in Gmail? It works well! (developers.google.com)
Tip from Jamie Taylor: DockerSlim is a tool for slimming down your Docker images to reduce your image sizes and security foot print. You can minify it by up to 30x. Free and open-source. (GitHub)
Game Jam is coming up, checking out the free assets provided by Unity in the asset store. The quality is incredible and inspiring and the items range from art work to controllers (think FPS, 3P) to full “microgames” that you can take and build with till your heart’s content. Most are free and the one’s that aren’t are cheap and interesting. (assetstore.unity.com)
while True: learn() is a puzzle video game that can help teach you machine learning techniques. Thanks to Alex from GamingFyx for sharing this!
- while True: learn() (SteamPowered.com)
- Gaming Fyx (fyx.space)
Now that Zsh is the default shell in macOS, it’s time to get comfy and set up tab completion (ScriptingOSX.com)
GiTerm is a command line tool for visualizing Git information. (GitHub)

Direct download: coding-blocks-episode-174.mp3
Category:Software Development -- posted at: 8:01pm EDT

Sun, 5 December 2021

What is a Game Engine?

With Game Ja-Ja-Ja-Jamuary coming up, we discuss what makes a game engine, while Michael’s impersonation is spot-on, Allen may really just be Michael, and Joe already has the title of his next podcast show at the ready.

The full show notes for this episode are available at https://www.codingblocks.net/episode173.

Sponsors

Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.

Survey Says

Game Jam ’22 is coming up in Ja-Ja-Ja-Jamuary

News

Thanks for the reviews!
- Podchaser: Jamie Introcaso
Game Ja-Ja-Ja-Jamuary is coming up, sign up is open now! (itch.io)

What is a Game Engine?

What’s a…
- Library,
- Framework,
- Toolkit,
- … Engine?
Want to see terrible explanations of a thing? Google “framework vs engine”.
Other types of engines: storage engine, rendering engine, for example.

Q: Why do people use game engines? Well, they reduce costs, complexities, and time-to-market. Consistency!
Q: Why do so many AAA games create their own custom engines?

Common Features of Game Engines

2D/3D rendering engine
- Basic shapes (planes, spheres, lines),
- Particles, Shaders,
- Masking/Culling,
- Progressive enhancement (either by distance or by some other means)
Physics engine
- Collision detection,
- Mass,
- Gravity,
- Torque,
- Force,
- Friction,
- Springiness,
- Fluid Dynamics,
- Wind
Sound
- Multiple sounds at once, looping, spatial settings, etc.
Scripting
AI
Networking
- Ever thought about how this works? Peer to peer, dedicated servers?
Streaming
- Streaming assets, as in, the player hasn’t installed your game.
Scene Management
Cinematics
UI
Often engines also include development tools to making working with these various systems easier … like an IDE.

Some Really Cool Things About Unity

Asset Store and Package Management,
ProBuilder (Unity),
Terrain,
Animation Manager,
Ad Systems and Analytics,
Target multiple platforms: Xbox, Windows, Linux, Android, MacOS, iOS, PSX, Switch, etc.

About the Industry

How big is the industry?
- $150B in 2019, estimated $250B for 2025 (TechJury.net)
- How does it compare to other industries?
  - Movies are $41B,
  - Books are $25B,
  - Netflix is $7B … that’s about half of Nintendo,
  - HBO is $2B
How many companies and employees?
- 2,457 companies and 220k jobs … in 2015! (Quora)
What’s the breakdown on sales?
- Mobile $90B, PC $35B, console $49B (NewZoo.com)
How many games released in a year?
- About 10k a year, on Steam (NewZoo.com)
How long does it take? 1 – 10 years?
The 10 Best Games Made By Just One Person (TheGamer.com)

Commentary on Popular Game Engines

Unity

Publish for 20+ platforms
50% of games are made with Unity (GameDeveloper.com)
List of Unity games (Wikpedia)
Pricing range: Free to $2,400. You can use the free plan if revenue or funding is less than $100k!
Program in C#
Great learning resources (learn.unity.com)

Unreal

Many AAA games built with Unreal. Basically think of the top 10 biggest, most beautiful, AAA games; those are probably all Unreal or custom (RAGE, Frostbyte, Last of Us)
Pricing: from free to “call for pricing”, 5% royalty after $1mm
List of Unreal Engine games (Wikipedia)
Originally came out of the Unreal series of games, and a new one is coming out soon! (Epic Games)
Program in C++

Godot

Open Source
Growing in popularity
You can program in a variety of languages, officially C/C++ and GDScript but there are other bindings (Wikipedia)

Custom Game Engines:

GameMaker
RPG Maker
Specialized: Frostbyte, Cryo, etc.
Korge
libGDX

Final Question

Game Jam sign-up is live … what are you thinking for technology and mechanics?

Allen: VR / Escape Room
Michael: Something web based
Joe: Going 3D, wanting to focus on level design and physics this time

Resources We Like

Splitgate, where Halo meets Portal (Splitgate.com)
Mythic Quest (Apple)
What’s wrong with TikTok? (Washington Post)
Blood, Sweat, and Pixels by Jason Schreier (Amazon)

Tip of the Week

ProBuilder is a free tool available in Unity that is great for making polygons and great for mocking out levels or building ramps. The coolest part is the way it works, giving you a bunch of tools that you do things like create vertices, edges, surfaces, extrude, intrude, mirror, etc. You have to add it via the package manager but it’s worth it for simple games and prototypes. (Unity)
Great blog on processing billions of events in real time at Twitter, thanks Mikerg! (blog.twitter.com)
forEachIndexed is a nice Kotlin method for iterating through items in a collection, with an index for positional based computations (ozenero.com)
How can you log out of Netflix on Samsung Smart TVs? Ever heard of the Konami code? Press Up Up Down Down Left Right Left Right Up Up Up Up (help.netflix.com)

Direct download: coding-blocks-episode-173.mp3
Category:Software Development -- posted at: 9:32pm EDT

Sponsors

Survey Says

How do we feel about DevOps?

Reviews

News

Chapter 3: Embracing Risk

Managing Risk

Measuring Service Risk

Risk Tolerance Services

Identifying the Risk Tolerance of Consumer Services

Factors in assessing the risk tolerance of a service

Target level of availability

Types of failures

Cost

Other service metrics

Identifying the Risk Tolerance of Infrastructure Services

Target level of availability

Types of failures

Cost

Motivation for Error Budgets

Forming Your Error Budget

Benefits

Resources we Like

Tip of the Week

Survey Says

So, DevOps is a culture, but SRE is a job title?

Reviews

Site Reliability Engineering

Chapter 1 – Introduction

Google’s Approach to this Problem?

Challenges

Tenants of SRE

Durable Focus on Engineering

Max Change Velocity

Monitoring

Emergency Response

Change Management

Demand Forecasting and Capacity Planning

Provisioning

Efficiency and Performance

Resources we Like

Tip of the Week

Sponsors

Survey Says

What's most important to you when you're looking for another job?

Reviews

What is “the great resignation”?

Why is this a big deal?

What can you gain?

About those levels

What can you lose?

Resources we Like

Tip of the Week

Sponsors

Survey Says

How mature is your CI/CD pipeline?

Sidebar

Minimum Viable Continuous Integration

Continuous Integration

Trunk Based development

Resources we Like

Tip of the Week

Sponsors

Survey Says

What percentage of time does your team devote to technical debt per release cycle?

News

Lessons Learned

Breakdowns

Top 5 Overall

Fun

Creativity

Quirk

Tip of the Week

Sponsors

Survey Says

How awesome was Game Ja-Ja-Ja-Jamuary?!

News

Session Management

Session Hijacking

Session Fixation