Coding Blocks

We continue our discussion of Designing Data-Intensive Applications, this time focusing on multi-leader replication, while Joe is seriously tired, and Allen is on to Michael’s shenanigans.

For anyone reading this via their podcast player, this episode’s show notes can be at, where you can join the conversation.


  • – Learn in-demand tech skills with hands-on courses using live developer environments. Visit to get an additional 10% off an Educative Unlimited annual subscription.

Survey Says

How do you put on your shoes?


  • Thank you very much for the new reviews:
    • iTunes: GubleReid, tbednarick, JJHinAsia, katie_crossing
    • Audible: Anonymous User, Anonymous User … hmm

When One Leader Just Won’t Do

DesigningData-Intensive Applications Talking about Multi-Leader Replication

Replication Recap and Latency

  • When you’re talking about single or multi-leader replication, remember all writes go through leaders
  • If your application is read heavy, then you can add followers to increase your scalability
  • That doesn’t work well with sync writes..the more followers, the higher the latency
    • The more nodes the more likely there will be a problem with one or more
    • The upside is that your data is consistent
  • The problem is if you allow async writes, then your data can be stale. Potentially very stale (it does dial up the availability and perhaps performance)
  • You have to design your app knowing that followers will eventually catch up – “eventual consistency
    • “Eventual” is purposely vague – could be a few seconds, could be an hour. There is no guarantee.
  • Some common use cases make this particularly bad, like a user updating some information…they often expect to see that change afterwards
  • There are a couple techniques that can help with this problem

Techniques for mitigation replication lag

  • Read You Writes Consistency refers to an attempt to read significant data from leader or in sync replicas by the user that submitted the data
  • In general this ensures that the user who wrote the data will get the same data back – other users may get stale version of the data
  • But how can you do that?
    • Read important data from a leader if a change has been made OR if the data is known to only be changeable by that particular user (user profile)
    • Read from a leader/In Sync Replica for some period of time after a change
    • Client can keep a timestamp of it’s most recent write, then only allow reads from a replica that has that timestamp (logical clocks keep problems with clock synchronization at bay here)
  • But…what if the user is using multiple devices?
    • Centralize MetaData (1 leader to read from for everything)
    • You make sure to route all devices for a user the same way
      • Monotonic Reads: a guarantee of sorts that ensures you won’t see data moving backwards in time. One way to do this – keep a timestamp of the most recent read data, discard any reads older than that…you may get errors, but you won’t see data older than you’ve already seen.
      • Another possibility – ensure that the reads are always coming from the same replica
    • Consistent Prefix Reads: Think about causal data…an order is placed, and then the order is shipped…but what if we had writes going to more than one spot and you query the order is shipped..but nothing was placed? (We didn’t have this problem with a Single Replica)
      • We’ll talk more about this problem in a future episode, but the short answer is to make sure that causal data gets sent to the same “partition”

Replication isn’t as easy as it sounds, is it?

Multi-Leader Rep…lication

Single leader replication had some problems. There was a single point of failure for writes, and it could take time to figure out the new leader. Should the old leader come back then…we have a problem. Multi-Leader replication…

  • Allows more than one node to receive writes
  • Most things behave just like single-leader replication
  • Each leader acts as followers to other leaders

When to use Multi-Leader Replication

  • Many database systems that support single-leader replication can be taken a step further to make them mulit-leader. Usually. you don’t want to have multiple leaders within the same datacenter because the complexity outweighs the benefits.
  • When you have multiple leaders you would typically have a leader in each datacenter
  • An interesting approach is for each datacenter to have a leader and followers…similar to the single leader. However, each leader would be followers to the other datacenter leaders
    • Sort of a chained single-leader replication setup

Comparing Single-Leader vs Multi-Leader Replication

Performance – because writes can occur in each datacenter without having to go through a single datacenter, latency can be greatly reduced in multi-leader

  • The synchronization of that data across datacenters can happen asynchronously making the system feel faster overall
  • Fault tolerance – in single-leader, everything is on pause while a new leader is elected
  • In multi-leader, the other datacenters can continue taking writes and will catch back up when a new leader is selected in the datacenter where the failure occurred
    Network problems
  • Usually a multi-leader replication is more capable of handling network issues as there are multiple data centers handling the writes – therefore a major issue in one datacenter doesn’t cause everything to take a dive

So it’s clear right? Multi-leader all the things? Hint: No!

Problems with Multi-Leader Replication

  • Changes to the same data concurrently in multiple datacenters has to be resolved – conflict resolution – to be discussed later
  • External tools for popular databases:
  • Additional problems – multi-leader is typically bolted on after the fact
  • Auto-incrementing keys, triggers, constraints can all be problematic
    • Those reasons alone are reasons why it’s usually recommended to avoid multi-leader replication

Clients with offline operation

  • Multi-leader makes sense when there are applications that need to continue to work even when they’re not connected to the network
    • Calendars were an example given – you can make changes locally and when your app is online again it syncs back up with the remote databases
    • Each application’s local database acts as a leader
    • CouchDB was designed to handle this type of setup

Collaborative editing

Google Docs, Etherpad, Changes are saved to the “local” version that’s open per user, then changes are synced to a central server and pushed out to other users of the document

Conflict resolution

  • One of the problems with multi-leader writes is there will come times when there will be conflicting writes when two leaders write to the same column in a row with different values
  • How do you solve this?
    • If you can automate, you should because you don’t want to be putting this together by hand
    • Make one leader more important than the others
    • Make certain writes always go through the same data centers
  • It’s not easy – Amazon was brought up as having problems with this as well

Multi-Leader Replication Toplogogies

  • A replication topology describes how replicas communicate
  • Two leaders is easy
  • Some popular topologies:
    • Ring: Each leader reads from “right”, writes to the “left”
    • All to All: Very Chatty, especially as you add more and more nodes
    • Star: 1 special leader that all other leaders read from
  • Depending on the topology, a write may need to pass through several nodes before it reaches all replicas
  • How do you prevent infinite loops? Tagging is a popular strategy
  • If you have a star or circular topology, then a single node failure can break the flow
  • All to all is safest, but some networks are faster than others that can cause problems with “overrun” – a dependent change can get recorded before the previous
  • You can mitigate this by keeping “version vectors”, kind of logical clock you can use to keep from getting too far ahead

Resources We Like

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
  • Past episode discussions on Designing Data-Intensive Applications (Coding Blocks)
  • Amazon Yesterday Shipping (YouTube)
  • Uber engineering blog (

Tip of the Week

  • .http files are a convenient way of running web requests. The magic is in the IDE support. IntelliJ has it built in and VSCode has an extension. (IntelliJ ProductsVSCode Extension)
  • iTerm2 is a macOS Terminal Replacement that adds some really nice features. Some of our Outlaw’s favorite short-cuts: (iTerm2Features and Screenshots)
    • CMD+D to create a new panel (split vertically)
    • CMD+SHIFT+D to create a new panel (split horizontally)
    • CMD+Option+arrow keys to navigate between panes
    • CMD+Number to navigate between tabs
  • Ruler Hack – An architect scale ruler is a great way to prevent heat build up on your laptop by giving the hottest parts of the laptop some air to breathe. (Amazon)
  • Fizz Buzz Enterprise Edition is a funny, and sadly reminiscent, way of doing FizzBuzz that incorporates all the buzzwords and most abused design patterns that you see in enterprise Code. (GitHub)
  • From our friend Jamie Taylor (of DotNet Core Podcast, Tabs ‘n Spaces, and Waffling Taylors), mkcert is a “zero-config” way to easily generate self-signed certificates that your computer will trust. Great for dev! (GitHub)
Direct download: coding-blocks-episode-161.mp3
Category:Software Development -- posted at: 8:55pm EDT

We dive back into Designing Data-Intensive Applications to learn more about replication while Michael thinks cluster is a three syllable word, Allen doesn’t understand how we roll, and Joe isn’t even paying attention.

For those that like to read these show notes via their podcast player, we like to include a handy link to get to the full version of these notes so that you can participate in the conversation at


  • Datadog –  Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
  • Linode – Sign up for $100 in free credit and simplify your infrastructure with Linode’s Linux virtual machines.
  • – Learn in-demand tech skills with hands-on courses using live developer environments. Visit to get an additional 10% off an Educative Unlimited annual subscription.

Survey Says

How important is it to learn advanced programming techniques?


  • Thank you to everyone that left us a new review:
    • Audible: Ashfisch, Anonymous User (aka András)

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair Douglas Adams

Douglas Adams
Book: Designing Data-Intensive Applications In this episode, we are discussing Data Replication, from chapter 5 of “Designing Data-Intensive Applications”.

Replication in Distributed Systems

  • When we talk about replication, we are talking about keeping copies of the same data on multiple machines connected by a network
  • For this episode, we’re talking about data small enough that it can fit on a single machine
  • Why would you want to replicate data?
    • Keeping data close to where it’s used
    • Increase availability
    • Increase throughput by allowing more access to the data
  • Data that doesn’t change is easy, you just copy it
  • 3 popular algorithms
    • Single Leader
    • Multi-Leader
    • Leaderless
  • Well established (1970’s!) algorithms for dealing with syncing data, but a lot data applications haven’t needed replication so the practical applications are still evolving
    • Cluster group of computers that make up our data system
    • Node each computer in the cluster (whether it has data or not)
    • Replica each node that has a copy of the database
  • Every write to the database needs to be copied to every replica
  • The most common approach is “leader based replication”, two of the algorithms we mentioned apply
  • One of the nodes is designated as the “leader”, all writes must go to the leader
  • The leader writes the data locally, then sends to data to it’s followers via a “replication log” or “change stream”
  • The followers tail this log and apply the changes in the same order as the leader
  • Reads can be made from any of the replicas
  • This is a common feature of many databases, Postgres, Mongo, it’s common for queues and some file systems as well

Synchronous vs Asynchronous Writes

  • How does a distributed system determine that a write is complete?
  • The system could hang on till all replicas are updated, favoring consistency…this is slow, potentially a big problem if one of the replicas is unavailable
  • The system could confirm receipt to the writer immediately, trusting that replicas will eventually keep up… this favors availability, but your chances for incorrectness increase
  • You could do a hybrid, wait for x replicas to confirm and call it a quorum
  • All of this is related to the CAP theorem…you get at most two: Consistency, Availability and Partition Tolerance
  • The book mentions “chain replication” and other variants, but those are still rare

Steps for Adding New Followers

  1. Take a consistent snapshot of the leader at some point in time (most db can do this without any sort of lock)
  2. Copy the snapshot to the new follower
  3. The follower connects to the leader and requests all changes since the back-up
  4. When the follower is fully caught up, the process is complete

Handling Outages

  • Nodes can go down at any given time
  • What happens if a non-leader goes down?
    • What does your db care about? (Available or Consistency)
    • Often Configurable
  • When the replica becomes available again, it can use the same “catch-up” mechanism we described before when we add a new follower
  • What happens if you lose the leader?
    • Failover: One of the replicas needs to be promoted, clients need to reconfigure for this new leader
  • Failover can be manual or automatic

Rough Steps for Failover

  1. Determining that the leader has failed (trickier than it sounds! how can a replica know if the leader is down, or if it’s a network partition?)
  2. Choosing a new leader (election algorithms determine the best candidate, which is tricky with multiple nodes, separate systems like Apache Zookeeper)
  3. Reconfigure: clients need to be updated (you’ll sometimes see things like “bootstrap” services or zookeeper that are responsible for pointing to the “real” leader…think about what this means for client libraries…fire and forget? try/catch?

Failover is Hard!

  • How long do you wait to declare a leader dead?
  • What if the leader comes back? What if it still thinks it’s leader? Has data the others didn’t know about? Discard those writes?
  • Split brain – two replicas think they are leaders…imagine this with auto-incrementing keys… Which one do you shut down? What if both shut down?
  • There are solutions to these problems…but they are complex and are a large source of problems
  • Node failures, unreliable networks, tradeoffs around consistency, durability, availability, latency are fundamental problems with distributed systems

Implementation of Replication Logs

  • 3 main strategies for replication, all based around followers replaying the same changes

Statement-Based Replication

  • Leader logs every Insert, Update, Delete command, and followers execute them
  • Problems
    • Statements like NOW() or RAND() can be different
    • Auto-increments, triggers depend on existing things happen in the exact order..but db are multi-threaded, what about multi-step transactions?
    • What about LSM databases that do things with delete/compaction phases?
  • You can work around these, but it’s messy – this approach is no longer popular
  • Example, MySQL used to do it

Write Ahead Log Shipping

  • LSM and B-Tree databases keep an append only WAL containing all writes
  • Similar to statement-based, but more low level…contains details on which bytes change to which disk blocks
  • Tightly coupled to the storage engine, this can mean upgrades require downtime
  • Examples: Postgres, Oracle

Row Based Log Replication

  • Decouples replication from the storage engine
  • Similar to WAL, but a litle higher level – updates contain what changed, deletes similar to a “tombstone”
  • Also known as Change Data Capture
  • Often seen as an optional configuration (Sql Server, for example)
  • Examples: (New MySQL/binlog)

Trigger-Based Replication

  • Application based replication, for example an app can ask for a backup on demand
  • Doesn’t keep replicas in sync, but can be useful

Resources We Like

Tip of the Week

  • A collection of CSS generators for grid, gradients, shadows, color palettes etc. from Smashing Magazine.
  • Learn This One Weird ? Trick To Debug CSS (
  • Use tree to see a visualization of a directory structure from the command line. Install it in Ubuntu via apt install tree. (
  • Initialize a variable in Kotlin with a try-catch expression, like val myvar: String = try { ... } catch { ... }. (Stack Overflow)
  • Manage secrets and protect sensitive data (and more with Hashicorp Vault. (Hashicorp)
Direct download: coding-blocks-episode-160.mp3
Category:Software Development -- posted at: 8:01pm EDT