Coding Blocks

Since we can’t leave the house, we discuss what it takes to effectively work remotely while Allen’s frail body requires an ergonomic keyboard, Joe finally takes a passionate stance, and Michael tells them why they’re wrong.

Reading these show notes via your podcast player? You can find this episode’s full show notes at https://www.codingblocks.net/episode129 and be a part of the conversation.

Sponsors

Survey Says

What's your preferred method to increase your productivity?

Take the survey at: https://www.codingblocks.net/episode129.

News

  • Thank you, krauseling, for the latest iTunes review.
  • TechSmith is offering Snagit and Video Review for free through June 2020. (TechSmith)

How to WFH

The Essentials

  • First and foremost, get a quality internet connection.
    • For video calls, favor lower latency over higher bandwidth.
  • Turn your camera on.
  • Use a comfortable headset with a good microphone.
    • Wired headphones are definitely the way to go. Better audio quality, fewer problems, and no battery life issues to worry about.
  • Mute unless talking.
  • Not all video sharing is equal. Know which to use when screen sharing.
    • Screen sharing in Zoom is much better than in Hangouts. The text on the screen is crisp and readable and the screen sharing session is responsive.
  • Communicate when you will be away during normal hours.
  • Make sure your IM application status and/or availability is accurate.

Sound Good

Price  Description
$30    Sony MDRXB50AP Extra Bass Earbuds Headset with mic (Amazon)
$149   SteelSeries Arctis 7 Gaming Headphones (Amazon)
$18    Apple EarPods with 3.5mm Headphone Plug (Amazon)

Avoid these headphones

Price  Description
NA     Sennheiser SC 130 USB Single Sided Headset (Amazon)
$65    Sennheiser SC 160 USB Double-Sided Headset (Amazon)

Look Good, too

Price Description  
NA Logitech c930e WebCam (Amazon)

Digging Deeper

  • Don’t be afraid to spend time on calls just chatting about non work related stuff.
    • Working from home means there’s little opportunity to connect personally and that is sorely needed when working from home. Taking time to chat will help to keep the team connected.
    • Keep it light and have fun!
  • Be available and over communicate.
    • During business hours make sure you’re available. That doesn’t mean you need to be in front of your computer constantly, but it does mean to make sure you can be reached via phone, email, or chat and can participate when needed.
    • Working from home also means it is super important to communicate status and make sure people feel like progress is being made.
    • Also, if you need to be offline for any reason, send up a flare, don’t just disappear.
    • Make sure your chat application status is really your status. People will rely on you showing “Active” meaning that you are available. Don’t game your status. Take a break if you need to but if you aren’t available, don’t show available. Also, if you don’t show “Active” many will assume that you aren’t available or online.
    • We’ve also found that sometimes it is good to show “offline” or “unavailable” to give us a chance to get into a flow and get things done, so don’t be afraid to do that. Having this be a “known agreement” will signal to others that they may just want to send you an e-mail or schedule a conference later.
    • If something is urgent in email, make sure to send the subject with a prefix of “URGENT:”
      • But beware that an “urgent” email doesn’t mean you’ll get an instant reply. If you need an answer right now, consider a phone call.
      • An “urgent” email should be treated as “as soon as you read this”, knowing that it might not be read for a while.
    • Make sure your calendar is up to date. If you are busy or out of the office (OOO) then make sure you schedule that in your calendar so that people will know when they can meet with you.
    • Along with the above, when scheduling meetings, check the availability of your attendees.
  • Be flexible.
    • This goes with things mentioned above. As a manager especially, you need to be flexible and recognize that working from home sometimes means people need to be away for periods of time for personal reasons. Don’t sweat that unless these people aren’t delivering per the next point.
    • Favor shorter milestones or deliverables and an iterative approach.
      • This helps keep people focused and results oriented. Science projects are easy to squash if you define short milestones that provide quick wins on the way to a longer term goal.
        • We use the term “fail fast” a lot where we break projects into smaller bits and try to attack what’s scariest first in an effort to “fail fast” and change course.
      • We use JIRA and work in 2 week sprints.
        • Define work in small enough increments. If something exceeds two weeks, it means it needs to be reviewed and refined into smaller work streams. Spend the time to think through it.
    • Require estimates on work items to help keep things on track.
  • Allow and encourage people to work in groups or teams if appropriate, for things like:
    • Brainstorming sessions.
    • Mini-scrums that are feature or project based.
    • Pair programming. Use of the proper video application for screen sharing is important here.
  • Conference etiquette:
    • Mute. If you’re not talking, mute.
      • Lots of participants? Mute.
      • Smaller/Team meeting? Up to you. But probably best to mute.
    • Use a microphone and verify people hear you okay. Don’t forgo a real headset or microphone and instead try to use your internal laptop microphone and speakers. You will either be super loud with background noise, for example people just hear you typing the whole time or hear your fan running, or people won’t hear you at all.
    • When you start presenting, it is a good practice to ask “can you see my screen?”
    • Give others opportunities to talk and if someone hasn’t said anything, mention it and ask for their feedback especially if you think their opinion is important on the subject at hand.
  • Use a tool to help you focus.
    • It is easy to get distracted by any number of things.
    • A technique that works well for some is the Pomodoro Technique. There are also nifty applications and timers that you can use to reinforce it.
  • Music may not be the answer.
    • For some people just putting on noise-cancelling headphones helps with external noise (kids, TV, etc.)
  • Choose the right desktop sharing tool when needed.
    • We’ve found that Hangouts is a great tool to meet quickly and while it does provide for screen sharing, the video quality isn’t great. It does not allow people who are viewing your screen to zoom in and if you have a very high resolution monitor, people may find it hard to read/see it.
    • While Webex is a little more challenging to use, it does provide the ability for others to zoom in when you share, and the shared screens are more clear than Hangouts.
    • Additionally, Webex allows you to view all participants in one gallery view, thus reinforcing team cohesion.
    • That said, we’ve found Zoom to be far superior to its competitors.
  • Develop a routine.
    • Get up and start working at roughly the same time if you can.
    • Shower and dress as if you’re going out for errands at least.
    • If possible, have a dedicated workspace.
    • Most importantly, make sure you stop work at some point and just be home. If your dedicated workspace has a physical barrier, such as a door, use it, i.e., close the door and “be home” and not “at work”.
    • It’s hard not to overeat at first, but try to avoid the pantry that is probably really close to your workspace.
    • Try to get out of the house for exercise or errands in the middle of the day to break things up.
    • Working from home is much more sedentary than working in an office. Make it a point to get up from your desk and walk around, check the mail, do whatever you can to stretch your legs.

Resources We Like

Tip of the Week

Direct download: coding-blocks-episode-129.mp3
Category:Software Development -- posted at: 9:58pm EDT

It’s time to learn about SSTables and LSM-Trees as Joe feels pretty zacked, Michael clarifies what he was looking forward to, and Allen has opinions about Dr Who.

These show notes can be found at https://www.codingblocks.net/episode128 where you can be a part of the conversation, in case you’re reading this via your podcast player.

Sponsors

Survey Says

Do you leave your laptop plugged in the majority of the time?

Take the survey at: https://www.codingblocks.net/episode128.

News

  • Thank you for all of the great reviews:
    • iTunes: devextremis, CaffinatedGamer, Matt Hussey, index out of range
    • Stitcher: Marcos Sagrado, MoarLiekCodingRokzAmirite, Asparges69
  • Sadly, due to COVID-19 (aka Coronavirus), the 15th Annual Orlando Code Camp & Tech Conference has been cancelled. We’ll keep you informed of your next opportunity to kick us in the shins. (orlandocodecamp.com)
  • During this unprecedented time, TechSmith is offering Snagit and Video Review for free through June 2020. (TechSmith)

SSTables and LSM-Trees

SSTables

  • SSTable is short for “Sorted String Table”.
  • SSTable requires that the writes be sorted by key.
    • This means we cannot append the new key/value pairs to the segment immediately because we need to make sure the data is sorted by key first.

What are the benefits of the SSTable over the hash indexed log segments?

  • Merging the segments is much faster, and simpler. It’s basically a mergesort against the segment files being merged. Look at the first key in each file, and take the lowest key (according to the sort order), add it to the new segment file … rinse-n-repeat.
    • When the same key shows in multiple segment files, keep the newer segment’s key/value pair, sticking with the notion that the last written key/value for any given key is the most up to date value.
  • To find keys, you no longer need to keep the entire hash of indexes in memory. Instead, you can use a sparse index where you store a key in memory for every few kilobytes from a segment file.
    • This saves on memory.
    • This also allows for quick scans as well.
      • For example, when you search for a key, Michael, and the key isn’t in the index, you can find two keys in the sparse index that Michael falls between, such as Micah and Mick, then start at the Micah offset and scan that portion of the segment until you find the Michael key.
  • Another improvement for speeding up read scans is to write chunks of data to disk in compressed blocks. Then, the keys in the sparse index point to the beginning of that compressed block.
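
The two ideas above, merging sorted segments and looking keys up through a sparse index, can be sketched roughly as follows. This is a minimal illustration, not how any particular database implements it: segments are plain in-memory lists of (key, value) pairs rather than files, the sparse index stores list positions rather than byte offsets, and the names merge_segments, build_sparse_index, and lookup are hypothetical.

```python
import bisect
import heapq


def merge_segments(segments):
    """Merge sorted segments (ordered oldest first) into one sorted segment.

    Each segment is a list of (key, value) pairs sorted by key. When a key
    appears in several segments, the value from the newest segment wins.
    """
    # Tag entries with the negated segment age so that, for equal keys,
    # the entry from the newest segment sorts first.
    tagged = [
        [(key, -age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue  # older duplicate of a key we already kept
        merged.append((key, value))
    return merged


def build_sparse_index(segment, every=2):
    """Keep one (key, position) entry for every `every` records (assumed spacing)."""
    return [(segment[i][0], i) for i in range(0, len(segment), every)]


def lookup(segment, sparse_index, key):
    """Jump to the closest indexed key at or before `key`, then scan forward."""
    index_keys = [k for k, _ in sparse_index]
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first key in the segment
    start = sparse_index[slot][1]
    for k, value in segment[start:]:
        if k == key:
            return value
        if k > key:
            break  # passed the spot where the key would be, so it is absent
    return None
```

With a segment holding Micah, Michael, Mick, and Mike and an index entry for every second key, looking up Michael starts the scan at Micah, much like the example in the list above.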

So how do you write this to disk in the proper order?

  • If you just write them to disk as you get them, they’ll be out of order in an append only manner because you’re likely going to receive them out of order.
  • One method is to actually write them to disk in a sorted structure; a B-Tree is one option. However, maintaining a sorted structure in memory is easier than maintaining it on disk, thanks to well-known tree data structures like red-black trees and AVL trees.
    • The keys are sorted as they’re inserted due to the way nodes are shuffled during inserts.
    • This allows you to write the data to memory in any order and retrieve it sorted.
  • When data arrives, write it to the memory balanced tree data structure, such as a red-black tree. This is also referred to as a memtable.
  • Once you’ve reached a predefined size threshold, you dump the data from memory to disk in a new SSTable file.
  • While the new segment is being written to disk, any incoming key/value pairs get written to a new memtable.
  • When serving up read requests, you search in your memtable first, then back to the most recent segment, and so on moving backwards until you find the key you’re looking for.
  • Occasionally run a merge on the segments to get rid of overwritten or deleted items.
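
The write and read path described above might look something like this sketch. It makes several simplifying assumptions: a sorted list kept in order with bisect stands in for a red-black tree, "segments" are in-memory lists instead of SSTable files, the flush threshold counts keys rather than bytes, and the class names Memtable and TinyLSM are made up for illustration.

```python
import bisect


class Memtable:
    """Keeps keys sorted in memory. A sorted list stands in for a balanced tree."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while keeping keys sorted
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def __len__(self):
        return len(self._keys)

    def sorted_items(self):
        return [(k, self._values[k]) for k in self._keys]


class TinyLSM:
    def __init__(self, threshold=2):
        self.threshold = threshold  # assumed flush threshold, in number of keys
        self.memtable = Memtable()
        self.segments = []  # oldest first; each is a sorted list of pairs

    def put(self, key, value):
        self.memtable.put(key, value)
        if len(self.memtable) >= self.threshold:
            # "Dump" the memtable as a new sorted segment and start a fresh one.
            self.segments.append(self.memtable.sorted_items())
            self.memtable = Memtable()

    def get(self, key):
        # Check the memtable first, then segments from newest to oldest.
        value = self.memtable.get(key)
        if value is not None:
            return value
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None
```

Keys can arrive in any order, yet every flushed segment comes out sorted, which is what makes the later merge step cheap.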

Downside of this method?

  • If the database crashes for some reason, the data in the memtable is lost.
  • To avoid this, you can use an append-only, unsorted log for each new record that comes in. If the database crashes, that log file can be used to recreate the memtable.
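
As a rough sketch of that recovery idea, the snippet below appends every write to an unsorted log file and rebuilds the in-memory state by replaying it. The JSON-lines format, the plain dict standing in for the memtable, and the function names append_to_log and recover_memtable are all assumptions made for the example.

```python
import json
import os


def append_to_log(path, key, value):
    """Append one record to the log before it is applied to the memtable."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps([key, value]) + "\n")
        log.flush()
        os.fsync(log.fileno())  # ask the OS to persist the write to disk


def recover_memtable(path):
    """Replay the log in order; later records overwrite earlier ones."""
    memtable = {}
    if not os.path.exists(path):
        return memtable
    with open(path, encoding="utf-8") as log:
        for line in log:
            key, value = json.loads(line)
            memtable[key] = value
    return memtable
```

Once a memtable has been safely written out as an SSTable, the corresponding log is no longer needed and can be discarded.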

LSM-Trees

This implementation is the ground work for:

  • LevelDB (GitHub) and RocksDB (GitHub), databases intended to be embedded in other applications.
    • RocksDB is embedded in Kafka Streams and is used for GlobalKTables.
  • Similar storage engines are used by Cassandra and HBase.
    • Both took some design cues from Google’s BigTable whitepaper, which introduced the terms SSTable and memtable.

All of this was initially described under the name Log-Structured Merge-Tree (LSM-Tree).

  • Storage engines that are based on the notion of storing compacted and sorted files are often called LSM storage engines.
    • Lucene, the indexing engine used in Solr and ElasticSearch, uses a very similar process.

Optimizing

  • One of the problems with the LSM-Tree model is that searching for keys that don’t exist can be expensive.
    • Must search the memtable first, then latest segment, then the next oldest segment, etc., all the way back through all the segments.
    • One solution for this particular problem is a Bloom filter.
      • A Bloom filter is a data structure used for approximating what is in a set of data. It can tell you if the key does not exist, saving a lot of I/O looking for the key.
  • There are competing strategies for determining when and how to perform the merge and compaction operations. The most common approaches include:
    • Leveled compaction – Key ranges are split into smaller SSTables and old data is moved to different “levels”, allowing the compaction process to use less disk space and be done incrementally. This is the strategy used by LevelDB and RocksDB.
    • Size-tiered compaction – Smaller and newer SSTables are merged into larger and older SSTables. This is the strategy used by HBase.
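
To make the Bloom filter mentioned above a little more concrete, here is a minimal sketch. The bit-array size, the number of hash functions, and the use of seeded SHA-256 digests are arbitrary choices for illustration; real storage engines tune these and use faster hashes.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions per key by hashing it with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True only means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

A storage engine could check might_contain before touching any segment: a False answer skips the disk reads entirely, while a True answer still requires the normal lookup because false positives are possible.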

Resources We Like

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
  • Red-black trees in 5 minutes – Insertions (examples) (YouTube)
  • Data Structures – (some) Trees (episode 97)
  • B-Tree Visualization (USFCA)
  • Red Black Tree vs AVL Tree (GeeksforGeeks)
  • How to: Use Bloom Filters in Redis (YouTube)
  • A Busy Developer’s Guide to Database Storage Engines – The Basics (yugabyteDB)

Tip of the Week

  • Save time typing paths by drag-n-dropping a folder from Finder/File Explorer to your command shell. Works on Windows and macOS in Command Prompt, Powershell, Cmder, and Terminal.
  • Popular and seminal white papers curated by Papers We Love (GitHub)
    • See if there is an upcoming PWL meetup in your area (paperswelove.org)
    • And there’s a corresponding Papers We Love Conference (pwlconf.org)
  • Ever find yourself in the situation where you’re asked to pivot from your current work to another task that would require you to stash your current changes and change branches? Maybe you do that. Or maybe you clone the repo into another path and work from there? But there’s a pro-tip way. Instead, you can use git worktree to work with your repo in another path without needing to re-clone the repo.
    • For example, git worktree add -b myhotfix /temp master copies the files from master to /temp and creates a new branch named myhotfix.
  • Get your Silicon Valley fix with Mythic Quest. (Apple)
  • Level up your programming skills with exercises and mentors with Exercism. (exercism.io)
    • Exercism has been worth mentioning a few times:
      • Algorithms, Puzzles, and the Technical Interview (episode 26)
      • Deliberate Practice for Programmers (episode 78)
  • Use elasticdump’s import and export tools for Elasticsearch. (GitHub)
  • Use docker run --network="NETWORK-NAME-HERE" to connect a container to an existing Docker network. (docs.docker.com)
Direct download: coding-blocks-episode-128.mp3
Category:Software Development -- posted at: 8:01pm EDT

In this episode, Allen is back, Joe knows his maff, and Michael brings the jokes, all that and more as we discuss the internals of how databases store and retrieve the data we save as we continue our deep dive into Designing Data-Intensive Applications.

If you’re reading these show notes via your podcast player, did you know that you can find them at https://www.codingblocks.net/episode127? Well you do now! Check it out and join in the conversation.

Sponsors

  • Datadog.com/codingblocks – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
  • Educative.io – Level up your coding skills, quickly and efficiently. Visit educative.io/codingblocks to get 10% off any course or annual subscription.
  • Clubhouse – The fast and enjoyable project management platform that breaks down silos and brings teams together to ship value, not features. Sign up to get two additional free months of Clubhouse on any paid plan by visiting clubhouse.io/codingblocks.

Survey Says

Which fast food restaurant makes the better fries?

Take the survey at: https://www.codingblocks.net/episode127.

News

  • We thank all of the awesome people that left us reviews:
    • iTunes: TheLunceforce, BrianMorrisonMe, Collectorofmuchstuff, Momentum Mori, brianbrifri, Isyldar, James Speaker
    • Stitcher: adigolee
  • Come see Allen, Joe, and Michael in person at the 15th Annual Orlando Code Camp & Tech Conference, March 28th. Sign up for your chance to kick them all in the shins and grab some swag. (orlandocodecamp.com)

Database Storage and Retrieval

  • A database is a collection of data.
  • A database management system includes the database, APIs for managing the data and access to it.

RDBMS Storage Data Structures

  • Generally speaking, data is written to a log in an append only fashion, which is very efficient.
    • Log: an append-only sequence of records; this doesn’t have to be human readable.
  • These write operations are typically pretty fast because writing to the end of a file is generally a very fast operation.
  • Reading for a key from a file is much more expensive though as the entire file has to be scanned for instances of the key.
  • To solve this problem, there are indexes.
    • Generally speaking, an index is an additional structure derived from the primary data and stored in a different way to make lookups faster.
    • Having indices incurs additional overhead on writes. You’re no longer just writing to the primary data file, but you’re also keeping the indices up to date at the same time.
      • This is a trade-off you incur in databases: indexes speed up reads but slow down writes.
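
The append-only log described above can be sketched in a few lines. This is only an illustration: an in-memory io.StringIO stands in for a file, the comma-separated text format is used purely for readability, and db_set and db_get are hypothetical names.

```python
import io


def db_set(log, key, value):
    """Append a record to the end of the log; nothing is ever overwritten."""
    log.seek(0, io.SEEK_END)
    log.write(f"{key},{value}\n")


def db_get(log, key):
    """Scan the whole log and return the last value written for the key."""
    log.seek(0)
    found = None
    for line in log:
        k, _, v = line.rstrip("\n").partition(",")
        if k == key:
            found = v  # keep going: a later record may supersede this one
    return found
```

Writes are a single append, but every read walks the entire log, which is the cost an index is meant to remove.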

Hash Indexes

  • One possible solution is to keep every key’s offset (which points to the location of the value of the key) in memory.
    • This is what is done for Bitcask, the default storage engine for Riak.
    • The system must have enough RAM for the index though.
  • In the example given, all the keys stay in memory, but the file is still always appended to, meaning that the key’s offset is likely to change frequently, but it’s still very efficient as you’re only ever storing a pointer to the location of the value.
  • If you’re always writing to a file, aren’t you going to run out of disk space?
    • File segmenting / compaction solves this.
      • Duplicate keys in a given file are compacted to store just the last value written for the key, and those values are written to a new file.
        • This typically happens on a background thread.
      • Once the new segment file has been created, after merging in changes from the previous file, then it becomes the new “live” log file.
      • This means while the background thread is running to create the new segment, the locations for keys are being read from the old segment files in the meantime so that processes aren’t blocked.
      • After the new segment file creation is completed, the old segment files can be deleted.
        • This is how Kafka topic retention policies work, and what happens when you run “force merge” on an Elasticsearch index (same goes for similar systems).
  • Some key factors in making this work well:
    • File format
      • CSV is not a great format for logs. Typically you want to use a binary format that encodes the length of the string in bytes with the actual string appended afterwards.
    • Deleting records requires some special attention
      • You have to add a tombstone record to the file. During the merge process, the key and values will be deleted.
    • Crash recovery
      • If things go south on the server, recovering might take some time if there are large segments or key/value pairs.
      • Bitcask makes this faster by snapshotting the in-memory hashes on occasion so that starting back up can be faster.
    • Incomplete record writes
      • Bitcask files include checksums so any corruption in the logs can be ignored.
    • Concurrency control
      • It’s common for there to only be one writer thread, but multiple reader threads, since written data is immutable.
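
Pulling several of these points together, the sketch below keeps an in-memory hash index of byte offsets into an append-only log, uses a length-prefixed binary record format, marks deletes with a tombstone record, and compacts by rewriting only live values. It is a simplified, single-threaded illustration and not Bitcask's actual format: the record layout, the use of io.BytesIO in place of files, and the class name HashIndexedLog are all assumptions.

```python
import io
import struct

# Assumed record header: tombstone flag (1 byte), key length, value length.
HEADER = struct.Struct(">BII")


class HashIndexedLog:
    def __init__(self):
        self.log = io.BytesIO()
        self.index = {}  # key -> byte offset of its latest record

    def _append(self, key, value, deleted):
        offset = self.log.seek(0, io.SEEK_END)
        header = HEADER.pack(int(deleted), len(key), len(value))
        self.log.write(header + key + value)
        return offset

    def put(self, key, value):
        self.index[key] = self._append(key, value, deleted=False)

    def delete(self, key):
        # Write a tombstone so the delete survives in the log, then drop the key.
        self._append(key, b"", deleted=True)
        self.index.pop(key, None)

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)
        _, key_len, value_len = HEADER.unpack(self.log.read(HEADER.size))
        self.log.seek(key_len, io.SEEK_CUR)  # skip over the key bytes
        return self.log.read(value_len)

    def compact(self):
        # Rewrite only the latest value of each live key into a new log.
        fresh = HashIndexedLog()
        for key in list(self.index):
            fresh.put(key, self.get(key))
        self.log, self.index = fresh.log, fresh.index
```

After compaction, overwritten values and tombstoned keys no longer take up space, and the index points into the new log.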

Why not update the file, instead of only appending to it?

  • Appending and merging are sequential operations, which are particularly efficient on HDD and somewhat on SSD.
  • Concurrency and crash recovery are much simpler.
  • Merging old segments is a convenient and unintrusive way to avoid fragmentation.

Downsides to Hash Indexes

  • The hash table must fit in memory or else you have to spill over to disk, which is inefficient for a hash table.
  • Range queries are not efficient; you have to look up each key individually.

Resources We Like

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
  • Grokking the System Design Interview (Educative.io)

Tip of the Week

  • Add authentication to your applications with minimum fuss using KeyCloak. (keycloak.org)
  • Master any major city’s transit system like a boss with CityMapper. (citymapper.com)
  • Spin up a new VM with a single command using Multipass. (GitHub)
  • Random User Generator – like Lorem Ipsum, but for people. (randomuser.me)
  • The perfect gifts for that nerd in your life. (remembertheapi.com)
  • Use CTRL+SHIFT+O in Chrome’s Sources tab to navigate to your JavaScript function by name.
  • tabs AND spaces – A new podcast that talks about the topics that developers care about. (tabsandspaces.io)
Direct download: coding-blocks-episode-127.mp3
Category:Software Development -- posted at: 8:01pm EDT
