Coding Blocks

It’s time to take a break, stretch our legs, grab a drink, and maybe even join in some interesting conversations around the water cooler as Michael goes off script, Joe is very confused, and Allen insists that we stay on script.

The full show notes for this episode are available at https://www.codingblocks.net/episode163. Stop by, check it out, and join the conversation.

Sponsors

  • Educative.io – Learn in-demand tech skills with hands-on courses using live developer environments. Visit educative.io/codingblocks to get an additional 10% off an Educative Unlimited annual subscription.

Survey Says

Which desktop OS do you prefer?

Take the survey at: https://www.codingblocks.net/episode163

News

  • We really appreciate the latest reviews, so thank you!
    • iTunes: EveryNickIsTaken2858, Memnoch97
  • Allen finished his latest ergonomic keyboard review: Moonlander Ergonomic Keyboard Long Term Review (YouTube)
  • Sadly, the .http files tip from episode 161 for JetBrains IDEs is only applicable to JetBrains’ Ultimate version.

Meantime, at the watercooler….

GitHub Copilot (GitHub)

  • In short, it’s a VS Code Extension that leverages the OpenAI Codex, a product that translates natural language to code, in order to … wait for it … write code. It’s currently in limited preview.

What’s the value?

  • Is the code correct? GitHub says it is ~40-50% of the time in some large-scale test cases
  • It works best with small, documented functions
  • Does having the code written for you steer you towards solutions?
  • Could this encourage similar bugs/security holes across multiple languages by people importing the same code?
  • Is this any different from developers using the same common solutions from StackOverflow?
  • Could it become a crutch for new developers?
  • Better for certain kinds of code? (Boiler plate, common accessors, date math)
    • Boiler Plate (like angular / controller vars)
    • Common APIs (Twitter, Goodreads)
    • Common Algorithms, Design Patterns
    • Less Familiar Languages
  • But is it useful? We’ll see!

Is this the future?

  • We see more low, no, and now co-code solutions all the time, is this where things are going?
  • This probably won’t be “it”, but maybe we will see things like this more commonly – in any case it’s different, why not give it a shot?

Is it Ethical?

  • The “AI” or whatever has been trained on “billions of lines” of open-source code…but not only code under permissive licenses. This means a dev using this tool runs the risk of accidentally including proprietary code
    • Quake Engine Source Code Example (GPLv2) (Twitter)
  • From an article in VentureBeat:
    • The training data came from 54 million public software repositories hosted on GitHub as of May 2020, which (for Python) contained 179GB of unique Python files under 1MB in size. After some basic limitations on line and file length and some sanitization, the final training dataset totaled 159GB.
    • There is a problem with bias, especially in more niche categories
  • Is it ethical to use somebody else’s data to train an AI without their permission?
  • Can it get you sued?
  • Would your thoughts change if the data is public? License restricted?
  • Would your thoughts change if the product/model were open-sourced?

Abstractions… how far is too far?

  • Services should communicate with datastores and other services via APIs that hide the details. These provide a nice indirection that allows for easier maintenance in the future
  • Do you abstract at the service level or the feature level?
  • Are ORMs a foregone conclusion?
  • What about services that have a unique communication pattern, or assist with cross-cutting concerns for things like microservices? (We are looking at you here, Kafka!)

The 10 Best Practices for Remote Software Engineering

  • From article: The 10 Best Practices for Remote Software Engineering (ACM)
    • Work on Things You Care About
    • Define Goals for Yourself
    • Define Productivity for Yourself
    • Establish Routine and Environment
    • Take Responsibility for Your Work
    • Take Responsibility for Human Connection
    • Practice Empathetic Review
    • Have Self-Compassion
    • Learn to Say Yes, No, and Not Anymore
    • Choose Correct Communication Channels

Terminal Tricks (CodeMag.com)

Some of Michael’s (Linux/macOS) favorites from the article:

  • Abbreviate your directories with tab completion when changing directories, such as cd /v/l/a. Assuming that the abbreviated path uniquely identifies something like /var/logs/apache, tab completion will take care of the rest.
  • Use nl to get a numbered list of some previous command’s output, such as ls -l | nl.
    • ERRATUM: During the episode, Michael mentioned that the output would first list the total lines, but that just happened to be due to output from ll and was unrelated to the output from nl.
  • On macOS, you can use the powermetrics command to gain access to all sorts of metrics related to the internals of your computer, such as the temperature at various sensors.
  • Use !! to repeat the last command. This can be especially helpful when you want to do something like prepend/append the previous command, such as sudo !!.
    • ERRATUM: Wow, Michael really got this one wrong during the episode. It doesn’t repeat the “last sudo command” nor does it leave the command in edit mode. Listen to Allen’s description. /8)
  • Awesome keyboard shortcuts:
    • CTRL+A takes you to the start of the line and CTRL+E takes you to the end.
    • No need to type clear any longer as CTRL+L will clear your screen.
    • CTRL+U deletes the content to the left of the cursor and CTRL+K deletes the content to the right of the cursor.
    • Made a mistake while typing your command? Use CTRL+SHIFT+- to undo what you last typed.
  • Using the history command, you can see your previous commands and even limit it with a negative number, such as history -5 to see only the last five commands.

Tip of the Week

  • Partial Diff is a VS Code extension that makes it easy to compare text. You can right click to compare files or even blocks of text in the same file, as well as in different files. (Visual Studio Marketplace)
  • StackBlitz is an online development environment for full stack applications. (StackBlitz.com)
  • Microcks, an open source Kubernetes native tool for API mocking and testing. (Microcks.io)
  • Bridging the HTTP protocol to Apache Kafka (Strimzi.io)
  • Difference Between grep, sed, and awk (Baeldung.com)
  • As an alternative to the ruler hack mentioned in episode 161, there are several compact, travel ready laptop stands. (Amazon)
Direct download: coding-blocks-episode-163.mp3
Category:Software Development -- posted at: 10:16pm EDT

We wrap up our replication discussion of Designing Data-Intensive Applications, this time discussing leaderless replication strategies and issues, while Allen missed his calling, Joe doesn’t read the gray boxes, and Michael lives in a future where we use apps.

If you’re reading this via your podcast player, you can find this episode’s full show notes at https://www.codingblocks.net/episode162. As Joe would say, check it out and join in on the conversation.

Sponsors

  • Educative.io – Learn in-demand tech skills with hands-on courses using live developer environments. Visit educative.io/codingblocks to get an additional 10% off an Educative Unlimited annual subscription.

Survey Says

Do you have TikTok installed?

Take the survey at: https://www.codingblocks.net/episode162.

News

  • Thank you for the latest review!
    • iTunes: tuns3r

Designing Data Intensive Applications

Check out the book!

Single Leader to Multi-Leader to Leaderless

  • When you have leaders and followers, the leader is responsible for making sure the followers get operations in the correct order
  • Dynamo brought the trend to the modern era (all are Dynamo inspired) but also…
    • Riak
    • Cassandra
    • Voldemort
  • We talked about NoSQL Databases before:
  • What exactly is NewSQL? https://en.wikipedia.org/wiki/NewSQL
  • What if we just let every replica take writes? Couple ways to do this…
    • You can write to several replicas
    • You can use a coordinator node to pass on the writes
  • But how do you keep these operations in order? You don’t!
    • Thought exercise: how can you make sure operation order doesn’t matter?
    • Couple ideas: No partial updates, increments, version numbers

Multiple Writes, Multiple Reads

  • What do you do if your client (or coordinator) tries to write to multiple nodes…and some are down?
  • Well, it’s an implementation detail, but you can choose to enforce a “quorum”: some number of nodes have to acknowledge the write.
    • This ratio can be configurable, making it so some % is required for a write to be accepted
    • What about nodes that are out of date?
    • The trick to mitigating stale data…the replicas keep a version number, and you only use the latest data – potentially by querying multiple nodes at the same time for the requested data
    • We’ve talked about logical clocks before, it’s a way of tracking time via observed changes…like the total number of changes to a collection/table…no timezone or nanosecond differences
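The "only use the latest data" idea above can be sketched roughly as follows. This is a minimal illustration, not how any particular database does it: the Versioned type and the read_latest helper are hypothetical names, and the replica responses are hard-coded stand-ins for real network reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value plus its logical version number (hypothetical type for illustration)."""
    value: str
    version: int  # a logical counter, not wall-clock time


def read_latest(responses):
    """Return the response with the highest version among the replicas that answered."""
    if not responses:
        raise ValueError("no replica responded")
    return max(responses, key=lambda r: r.version)


# Pretend three replicas answered the same read; one of them has newer data.
responses = [Versioned("old", 6), Versioned("new", 7), Versioned("old", 6)]
latest = read_latest(responses)
print(latest.value, latest.version)
```

The client ignores the two stale answers and keeps the value with version 7.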

How do you keep data in sync?

  • About those unavailable nodes…2 ways to fix them up
    • Read Repair: When the client realizes it got stale data from one of the replicas, it can send the updated data (with the version number) back to that replica. Pretty cool! – works well for data that is read frequently
    • Anti-Entropy: The nodes can also do similar background tasks, querying other replicas to see which are out of date – ordering not guaranteed!
    • Voldemort: ONLY uses read repair – this could lead to loss of data if multiple replicas went down and the “new” data was never read from after being written
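Here is a rough sketch of read repair as described above. It assumes each replica is just an in-memory dict mapping a key to a (version, value) pair, and read_with_repair is a hypothetical helper name; a real client would be talking to remote nodes and handling timeouts.

```python
def read_with_repair(replicas, key):
    """Read a key from every replica, return the newest value, and
    write that newest (version, value) pair back to any stale replica.

    replicas: list of dicts mapping key -> (version, value).
    These dicts are stand-ins for real nodes (an assumption for this sketch).
    """
    answers = [(replica, replica.get(key)) for replica in replicas]
    found = [entry for _, entry in answers if entry is not None]
    if not found:
        return None  # no replica has the key at all

    newest = max(found, key=lambda entry: entry[0])  # highest version wins

    for replica, entry in answers:
        if entry is None or entry[0] < newest[0]:
            replica[key] = newest  # the "repair" step: fix up the stale replica

    return newest[1]


a = {"k": (2, "new")}
b = {"k": (1, "old")}   # stale
c = {}                  # missed the write entirely
print(read_with_repair([a, b, c], "k"))
print(b, c)
```

Note that the repair only happens because somebody read the key, which is exactly why data that is never read can stay stale in a read-repair-only system.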

Quorums for reading and writing

  • Quick Reminder: We are still talking about 100% of the data on each replica
  • 3 major numbers at play:
    • Number of nodes
    • Number of confirmed writes
    • Number of reads required
  • If you want to be safe, the nodes you write to and the ones you read from should include some overlap
  • A common way to ensure that: keep the number of writes + the number of reads greater than the number of nodes
  • Example: You have 10 nodes – if you use 5 for writing and 5 for reading…you may not have an overlap resulting in potentially stale data!
  • Common approach – take the number of nodes (an odd number) + 1, then divide that number by 2, and that’s the number of readers and writers you should have
    • 9 Nodes – 5 writes and 5 reads – ensures non-stale data
    • When using this approach, you can handle Nodes / 2 (rounded down) number of failed nodes
  • How would you tweak the numbers for a write heavy workload?
  • Typically, you write to and read from ALL replicas, but you only need a successful response from these numbers
  • What if you have a LOT of nodes?!?
  • Note: there’s still room for problems here – the author explicitly lists 5 types of edge cases, and one category of miscellaneous timing edge cases. All are variations of readers and writers getting out of sync or things happening at the same time
  • If you really want to be safe, you need consensus (r = w = n) or transactions (that’s a whole other chapter)
  • Note that if the number of required readers or writers doesn’t return an OK, then an error is returned from the operation
  • Also worth considering is you don’t have to have overlap – having readers + writers < nodes means you could have stale data, but at possibly lower latencies and lower probabilities of error responses
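The quorum arithmetic above is easy to check with a few lines of code. This is only a sketch of the numbers game; the function names are made up for illustration, and n, w, and r stand for the number of nodes, confirmed writes, and required reads.

```python
def quorums_overlap(n, w, r):
    """True when every set of w written nodes must share at least one
    node with every set of r read nodes, i.e. when w + r > n."""
    return w + r > n


def majority(n):
    """The common choice for both w and r. For odd n this equals (n + 1) / 2."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can be down while a majority quorum is still reachable.
    For odd n this equals n / 2 rounded down."""
    return n - majority(n)


# The two examples from the notes:
print(quorums_overlap(10, 5, 5))  # 5 + 5 is not greater than 10, so stale reads are possible
print(quorums_overlap(9, 5, 5))   # 5 + 5 is greater than 9, so reads and writes overlap
print(majority(9), tolerated_failures(9))
```

For a write-heavy workload you could lower w and raise r (or the reverse for read-heavy) as long as the sum stays above n.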

Monitoring staleness

  • Single/Multi Leader lag is generally easy to monitor – you just query the leader and the replicas to see which operation they are on
  • Leaderless databases don’t have guaranteed ordering so you can’t do it this way
  • If the system only uses read repair (where the data is fixed up by clients only as it is read) then you can have data that is ancient
  • It’s hard to give a good algorithm description here because so much relies on the implementation details

And when things don’t work?

  • Multi-writes and multi-reads are great when a small % of nodes are down, or slow
  • What if that % is higher?
    • Return an error when we can’t get quorum?
    • Accept writes and catch the unavailable nodes back up later?
  • If you choose to continue operating, we call it “sloppy quorum” – when you allow reads or writes from replicas that aren’t the “home” nodes – the author likened it to getting locked out of your house and asking your neighbor if you can stay at their place for the night
  • This increases (write) availability, at the cost of consistency
  • Technically it’s not a quorum at all, but it’s the best we can do in that situation if you really care about availability – the data is stored somewhere just not where it’d normally be stored

Detecting Concurrent Writes

  • What do you get when you write the same key at the same time with different values?
  • Remember, we’re talking about logical clocks here so imagine that 2 clients both write version #17 to two different nodes
  • This may sound unlikely, but when you realize we’re talking logical clocks, and systems that can operate at reduced capacity…it happens
  • What can we do about it?
    • Last write wins: But which one is considered last? Remember how we catch up? (Readers fix or leaders communicate) …either way, the data will eventually become consistent but we can’t say which one will win…just that one will eventually take over
      • Note: We can take something else into account here, like clock time…but no perfect answer
      • LWW is good when your data is immutable, like logs – Cassandra recommends using a UUID as a key for each write operation
    • Happens-Before Relationship – (Riak has CRDTs that bundle a version vector to help with this)
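A minimal sketch of last write wins, assuming each write carries a timestamp and a writer id to break ties (both are assumptions made for this example, not something any specific database prescribes):

```python
def last_write_wins(writes):
    """Pick a single winner from conflicting writes.

    writes: iterable of (timestamp, writer_id, value) tuples.
    The timestamp decides; writer_id only breaks exact ties so that every
    replica picks the same winner regardless of the order it saw the writes.
    """
    return max(writes, key=lambda w: (w[0], w[1]))[2]


# Two clients both wrote "version 17" of the same key with different values.
conflicting = [(17, "client-a", "x"), (17, "client-b", "y"), (16, "client-c", "z")]
print(last_write_wins(conflicting))
print(last_write_wins(reversed(conflicting)))  # same winner in any arrival order
```

The catch is that the losing writes are silently thrown away, which is why this fits immutable, write-once data better than values that get updated in place.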

This “happens-before” relationship and concurrency

  • How do we know whether the operations are concurrent or not?
    • Basically, if neither operation knows about the other, then they are concurrent…
  • Three possible states if you have writes A and B
    • A happened before B
    • B happened before A
    • A and B happened concurrently
  • When there is a happens before, then you take the later value
  • When they are concurrent, then you have to figure out how to resolve the conflicts
    • Merging concurrently written values
      • Last write wins?
      • Union the data?
      • No good answer

Version vectors

  • The collection of version numbers from all replicas is called a version vector
  • Riak uses dotted version vectors – the version vectors are sent back to the clients when values are read, and need to be sent back to the db when the value is written back
    • Doing this allows the db to understand if the write was an overwrite or concurrent
    • This also allows applications to merge siblings by reading from one replica and writing to another without losing data if the siblings are merged correctly
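To make the "overwrite or concurrent" check a bit more concrete, here is a sketch using plain version vectors, modeled as dicts of replica id to counter. This is not Riak's dotted version vector implementation; the compare and merge names are hypothetical, and a missing replica id is treated as a counter of 0.

```python
def compare(a, b):
    """Classify two version vectors (dicts of replica_id -> counter).

    Returns "equal", "before" (a happened before b), "after" (b happened
    before a), or "concurrent" (neither one knows about the other).
    """
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Version vector that covers both inputs: the per-replica maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


print(compare({"r1": 1}, {"r1": 2}))                    # a is an ancestor of b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}))  # each has something the other lacks
print(merge({"r1": 2}, {"r1": 1, "r2": 1}))
```

When compare says "before" or "after" the later value can simply overwrite the earlier one; "concurrent" is the case where the siblings have to be kept and merged by the application.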

Resources We Like

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
  • Past episode discussions on Designing Data-Intensive Applications (Coding Blocks)
  • Designing Data-Intensive Applications – Data Models: Relational vs Document (episode 123)
  • NewSQL (Wikipedia)
  • Do not allow Jeff Bezos to return to Earth (Change.org)
  • Man Invests $20 in Obscure Cryptocurrency, Becomes Trillionaire Overnight, at Least Temporarily (Newsweek)
  • Quantifying Eventual Consistency with PBS (Bailis.org)
  • Riak Distributed Data Types (Riak.com)

Tip of the Week

  • A GitHub repo for a list of “falsehoods”: common things that people believe but aren’t true, but targeted at the kinds of assumptions that programmers might make when they are working on domains they are less familiar with. (GitHub)
  • The Linux at command lets you easily schedule commands to run in the future. It’s really user friendly so you can be lazy with how you specify the command, for example echo "command_to_be_run" | at 09:00 or at 09:00 -f /path/to/some/executable (linuxize.com)
  • You can try Kotlin online at play.kotlinlang.org, it’s an online editor with links to lots of examples. (play.kotlinlang.org)
  • Docker has to re-run a COPY command whenever the files being copied have changed. You can use a .dockerignore to skip files that you don’t care about to trim down on unnecessary work and build times. (docs.docker.com)
Direct download: coding-blocks-episode-162.mp3
Category:Software Development -- posted at: 8:01pm EDT
