Coding Blocks

December 2019
S	M	T	W	T	F	S

1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Sun, 22 December 2019

Designing Data-Intensive Applications – Maintainability

We dig into what it takes to make a maintainable application as we continue to learn from Designing Data-Intensive Applications, as Allen is a big fan of baby Yoda, Michael’s index isn’t corrupt, and Joe has some latency issues.

In case you’re reading this via your podcast player, this episode’s full show notes can be found at https://www.codingblocks.net/episode122 where you can join in on the conversation.

Survey Says

News

Thank you for taking a moment out of your day to leave us a review:
- iTunes: Kodestar, CrouchingProbeHiddenCannon
- Stitcher: programticalpragrammerer, Patricio Page, TheOtherOtherJZ, Luke Garrigan
Be careful about sharing/using code to/from Stack Overflow! You need to be aware of the licensing and what it might mean for your application.
- What is the license for the content I post? (Stack Overflow)
- Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) (Creative Commons)
- Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) (tldrlegal.com)
- We Still Don’t Understand Open Source Licensing (episode 5)
Get your tickets now for NDC { London } for your chance to kick Allen in the shins where he will be giving his talk Big Data Analytics in Near-Real-Time with Apache Kafka Streams. (ndc-london.com)

Maintainability

A majority of cost in software is maintaining it, not creating it in the first place.
Coincidentally most people dislike working on these legacy systems possibly due to one or more of the following:
- Bad code,
- Outdated platforms, and/or
- Made to do things the system wasn’t designed to do.
We SHOULD build applications to minimize the pain during the maintenance phase which involves using the following design principals:
- Operability – make it easy to keep the system running smoothly.
- Simplicity – make it easy for new developers to pick up and understand what was created, remove complexity from the system.
- Evolvability – make it easy for developers to extend, modify, and enhance the system.

Operability

Good operations can overcome bad software, but great software cannot overcome bad operations.
Operations teams are vital for making software run properly.
Responsibilities include:
- Monitoring and restoring service if the system goes into a bad state.
- Tracking down the problems.
- Keeping the system updated and patched.
- Keeping track of how systems impact each other.
- Anticipating and planning for future problems, such as scale and/or capacity.
- Establish good practices for deployments, configuration management, etc.
- Doing complicated maintenance tasks, such as migrating from one platform to another.
- Maintaining security.
- Making processes and operations predictable for stability.
- Keeping knowledge of systems in the business even as people come and go. No tribal knowledge.
The whole point of this is to make mundane tasks easy allowing the operations teams to focus on higher value activities.
This is where data systems come in.
- Get visibility into running systems, i.e. monitoring.
- Supporting automation and integration.
- Avoiding dependencies on specific machines.
- Documenting easy to understand operational models, for example, if you do this action, then this will happen.
- Giving good default behavior with the ability to override settings.
- Self-healing when possible but also manually controllable.
- Predictable behavior.

Simplicity

As projects grow, they tend to get much more complicated over time.
- This slows down development as it takes longer to make changes.
- If you’re not careful, it can become a big ball of mud.
Indicators of complexity:
- Explosion of state space,
- Tight coupling,
- Spaghetti of dependencies,
- Inconsistent coding standards such as naming and terminology,
- Hacking in performance improvements,
- Code to handle one off edge cases sprinkled throughout.
Greater risk of introducing bugs.
Reducing complexity improves maintainability of software.
- For this reason alone, we should strive to make our systems simpler to understand.
Reducing complexity DOES NOT mean removing or reducing functionality.

Accidental complexity – Complexity that is accidental if it is not inherent in the problem that the software solves, but arises only from the implementation.
Ben Moseley and Peter Marks, Out of the Tar Pit

How to remove accidental complexity?

Abstraction
- Allows you to hide implementation details behind a facade.
- Also allows you to reduce duplication as abstractions can allow for reuse among many implementations.
Examples of good abstractions:
- Programming languages abstract system level architectures such as CPU, RAM registers, etc.
- SQL is an abstraction over complex memory and data structures.
Finding good abstractions is VERY HARD.
- This has become apparent with distributed systems as these abstractions are still being explored.

Evolvability

It’s VERY likely that your system will need to undergo changes as time moves on.
Agile methods allow you to adapt to change. Some methods include:
- Test Driven Development
- Refactoring
How easy you can modify a data system is closely linked to how simple the system is.
Rather than calling it agility, the book refers to it as evolvability.

Resources We Like

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
Grokking the System Design Interview (Educative.io)
All permutations of English letters with a handy search feature (libraryofbabel.info)
What is the license for the content I post? (Stack Overflow)
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) (Creative Commons)
Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) (tldrlegal.com)
We Still Don’t Understand Open Source Licensing (episode 5)
Streaming process output to a browser, with SignalR and C# (Coding Blocks, dev.to)
Googlewhack – A Google query using two words, without quotes, that returns a single search result. (Wikipedia)
The Website is Down – Sales Guy vs. Web Dude (YouTube – Explicit)
Software Design Anti-patterns (episode 65)
Clean Architecture – How to Quantify Component Coupling (episode 72)
Static Analysis w/ NDepends – How good is your code? (episode 15)
Static Analysis of Open Source .NET Projects (Coding Blocks)
Does Facebook still use PHP? (Quora)

Tip of the Week

Need to move data from here to there? Check out Fluentd. (fluentd.org)
Use NUnit’s Assert.Multiple to assert that multiple conditions are all met. (GitHub)
Turns out C#’s where generic type constraints are even cooler than you might have realized. (docs.microsoft.com)
Best Practices for Response Times and Latency (GitHub)

Direct download: coding-blocks-episode-122.mp3
Category:Software Development -- posted at: 8:33pm EDT

Sun, 8 December 2019

Designing Data-Intensive Applications – Scalability

We continue to study the teachings of Designing Data-Intensive Applications, while Michael’s favorite book series might be the Twilight series, Joe blames his squeak toy chewing habit on his dogs, and Allen might be a Belieber.

Head over to https://www.codingblocks.net/episode121 to find this episode’s full show notes and join in the conversation, in case you’re reading these show notes via your podcast player.

Survey Says

News

So many new reviews to be thankful for! Thanks for making it a point to share your review with us:
- iTunes: bobby_richard, SeanNeedsNewGlasses, Teshiwyn, vasul007, The Jin John
- Stitcher: HotReloadJalapeño, Leonieke, Anonymous, Juke0815
Joe was a guest on The Waffling Taylors. Check out the entire three episode series: Squidge The Ring, Jay vs Doors, and Exploding Horses.
Be aware of “free” VPNs.
- This iOS Security App Shares User Data With China: 8 Million Americans Impacted (Forbes)
- Facebook shuts down Onavo VPN app following privacy scandal (Engadget)
Allen will be speaking at NDC { London } where he will be giving his talk Big Data Analytics in Near-Real-Time with Apache Kafka Streams. Be sure to stop by for your chance to kick him in the shins! (ndc-london.com)

Scalability

Increased load is a common reason for degradation of reliability.
Scalability is the term used to describe a system’s ability to cope with increased load.
It can be tempting to say X scales, or doesn’t, but referring to a system as “scalable” really means how good your options are.
Couple questions to ask of your system:
- If the system grows in a particular way (more data, more users, more usage), what are our options for coping?
- How easy is it to add computing resources?

Describing Load

Need to figure out how to describe load before before we can truly answer any questions.
“Load Parameters” are the metrics that make sense to measure for your system.
Load Parameters are a measure of stress, not of performance.
Different parameters may matter more for your system.
Examples for particular uses may include:
- Web Server: requests per second
- Database: read/write ratio
- Application: # of simultaneous active users
- Cache: hit/miss ratio
This may not be a simple number either. Sometimes you may care more about the average number, sometimes only the extremes.

Describing Performance

Two ways to look at describing performance:
- How does performance change when you increase a load parameter, without changing resources?
- How much do you need to increase resources to maintain performance when increasing a load parameter?
Performance numbers measure how well your system is responding.
Examples can include:
- Throughput (records processed per second)
- Response time

Latency vs Response Time

Latency is how long a request is waiting to be handled, awaiting service.
Response time is the total time it takes for the client to see the response, including any latency.

What do you mean by “numbers”?

Performance numbers are generally a set of numbers: i.e. minimum, maximum, average, median, percentile.
Sometimes the outliers are really important.
- For example, a game may have an average of 59 FPS but that number might drop to 10 FPS when garbage collection is running.
- Average may not be your best measure if you want to see the typical response time.
For this reason it’s better to use percentiles.
- Consider the median (sort all the response times and the one in the middle is the median). Median is known as the 50th percentile or P50.
  - That means that half of your response times will be under the 50% mark and half will be over.
  - To find the outliers, you can look at the 95th, 99th and 99.9th percentiles, aka P95, P99, P999.
    - If you were to look at the P95 mark and the response time is 2s, then that means that 95% of all requests come back in under 2 seconds and 5% of all requests come back in over 2s.
      - The example provided is that Amazon describes response times in the P999 even though it only affects 1 in 1000 users. The reason? Because the slowest response times would be for the customers who’ve made the most purchases. In other words, the most valued customers!
Increased response times have been measured by many large companies in regards to completions of orders, leaving sites, etc.
Ultimately, trying to optimize for P999 is very expensive and may not make the investment worth it.
- Not only is it expensive but it’s also difficult because you may be fighting environmental things outside of your control.
Percentiles are often used in:
- SLO’s – Service Level Objectives
- SLA’s – Service Level Agreements
- Both are contracts that draw out the expected performance and availability of a service.
Queuing delays are a big part of response times in the higher percentiles.
- Servers can only process a finite amount of things in parallel and the rest of the requests are queued.
- A relatively small number of requests could be responsible for slowing many things down.
  - Known as head-of-the-line blocking.
    - For this reason – it’s important to make sure you’re measuring client side response times to make sure you’re getting the full picture.
    - In load testing, the client application needs to make sure it’s issuing new requests even if it’s waiting for older ones. This will more realistically mimic the real world.
For applications that make multiple service calls to complete a screen or page, slower response times become very critical because typically the user experience is not good even if you’re just waiting for one out of 20 requests. The slowest offender is typically what determines the user experience.
- Compounding slow requests is known as tail latency amplification.
Monitoring response times can be very helpful, but also dangerous.
- If you’re trying to calculate the real response time averages against an entire set of data every minute, the calculations can be very expensive.
  - There are approximation algorithms that are much more efficient, such as forward decay, t-digest, or HdrHistogram.
  - Averaging percentiles is meaningless. Rather you need to add the histograms.

Coping with Load

How do we retain good performance when our load increases?

An application that was designed for 1,000 concurrent users will likely not handle 10,000 concurrent users very well.
- It is often necessary to rethink your architecture every time load is increased significantly.
The options, simplified:
- Scaling up – adding more hardware resources to a single machine to handle additional load.
- Scaling out – adding more machines to handle additional load.
One is not necessarily better than the other.
- Scaling up can be much simpler and easier to maintain, but there is a limit to the power available on a single machine as well as the cost ramifications of creating an uber-powerful single machine.
- Scaling out can be much cheaper in hardware costs but cost more in developer time and maintenance to make sure everything is running as expected
- Rather than picking one over the other, consider when each makes the most sense.
Elasticity is when a system can dynamically resize itself based on load, i.e. adding more machines as necessary, etc.
- Some of this happens automatically based on some load detecting criteria.
- Some of this happens manually by a person.
  - Manually may be simpler and can protect against unexpected massive scaling that may hit the wallet hard.
For a long time, RDBMS’s typically ran on a single machine. Even with a failover, the load wasn’t typically distributed.
- As distributed systems are becoming more common and the abstractions are being better built, things like this are changing, and likely to change even more.

“[T]here is no such thing as a generic, one-size-fits-all scalable architecture …”
Martin Kleppmann

Problems could be reads, writes, volume of data, complexity, etc.
- A system that handles 100,000 requests per second at 1 KB in size is very different from a system that handles 3 requests per minute with each file size being 2 GB. Same throughput, but much different requirements.
Designing a scalable system based off of bad assumptions can be both wasted time and even worse counterproductive.
In early stage applications, it’s often much more important to be able to iterate quickly on the application than to design for an unknown future load.

Resources We Like

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
Grokking the System Design Interview (Educative.io)
HdrHistogram: A High Dynamic Range Histogram (hdrhistogram.org)
Performance (Stack Exchange)
AWS Snowmobile (aws.amazon.com)
Monitoring Containerized Application Health with Docker (Pluralsight)

Tip of the Week

Get back into slack: Instead of typing in your slack room name, just click the “Find your workspace” link and enter your email – it’ll send you all of the workspaces linked to that email.
Forget everything Michael said about CodeDOM in episode 112. Just use Roslyn. (GitHub)
The holidays are the perfect time for Advent of Code. (adventofcode.com)

Direct download: coding-blocks-episode-121.mp3
Category:Software Development -- posted at: 10:58pm EDT

Sponsors

Survey Says

Which sci-fi series is best?

News

Maintainability

Operability

Simplicity

How to remove accidental complexity?

Evolvability

Resources We Like

Tip of the Week

Sponsors

Survey Says

With the new year coming, what kind of learning resolution do you plan on setting? I plan to learn ...

News

Scalability

Describing Load

Describing Performance

Latency vs Response Time

What do you mean by “numbers”?

Coping with Load

Resources We Like

Tip of the Week