Coding Blocks

Sun, 20 November 2022

Technical Challenges of Scale at Twitter

We take a peek into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.

The full show notes for this episode are available at https://www.codingblocks.net/episode198.

News

Want to help us out? Leave us a review!
The 2023 Game Ja-Ja-Ja Jam is coming up!

Twitter has a Data Problem

Moving an Exabyte of Data

In 2019, over 100 million people per day would visit Twitter.
Every tweet and user action creates an event that is used by machine learning and employees for analytics.
Their goal was to democratize data analysis within Twitter to allow people with various skillsets to analyze and/or visualize the data.
At the time, various technologies were used for data analysis:
- Scalding which required programmer knowledge, and
- Presto and Vertica which had performance issues at scale.
Another problem was having data spread across multiple systems without a simple way to access it.

Moving pieces to Google Cloud Platform

The Google Cloud big data tools at play:
- BigQuery, a cost-effective, serverless, multicloud enterprise data warehouse to power your data-driven innovation.
- Data Studio, unifying data in one place with the ability to explore, visualize, and tell stories with the data.

History of Data Warehousing at Twitter

2011 – Data analysis was done with Vertica and Hadoop and data was ingested using Pig for MapReduce.
2012 – Replaced Pig with Scalding using Scala APIs that were geared towards creating complex pipelines that were easy to test. However, it was difficult for people with SQL skills to pick up.
2016 – Started using Presto to access Hadoop data using SQL and also used Spark for ad hoc data science and machine learning.
2018 …
- Scalding for production pipelines,
- Scalding and Spark for ad hoc data science and machine learning,
- Vertica and Presto for ad hoc, interactive SQL analysis,
- Druid for interactive, exploratory access to time-series metrics, and
- Tableau, Zeppelin, and Pivot for data visualization.
So why the change? To simplify analytical tools for Twitter employees.

BigQuery for Everyone

Challenges:
- Needed to develop an infrastructure to reliably ingest large amounts of data,
- Support company-wide data management,
- Implement access controls,
- Ensure customer privacy, and
- Build systems for:
  - Resource allocation,
  - Monitoring, and
  - Charge-back.
In 2018, they rolled out an alpha release.
- The most frequently used tables were offered with personal data removed.
  - Over 250 users, from engineering, finance, and marketing used the alpha.
  - Sometime around June of 2019, they had a month where 8,000 queries were run that processed over 100 petabytes of data, not including scheduled reports.
  - The alpha turned out to be a large success, so they moved forward with broader use of BigQuery.
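As a rough check on the scale of that alpha month, the figures above work out to a double-digit number of terabytes per query. This is only back-of-the-envelope arithmetic: it assumes decimal units (1 PB = 1,000 TB), and because the source says "over 100 petabytes", the result is a lower bound on the average.

```python
# Average data processed per query in the alpha month described above.
# Assumption: decimal units, 1 PB = 1,000 TB.
TB_PER_PB = 1_000

queries_run = 8_000          # queries in the month (from the notes)
petabytes_processed = 100    # "over 100 petabytes", so treat as a minimum

avg_tb_per_query = petabytes_processed * TB_PER_PB / queries_run
print(f"At least {avg_tb_per_query} TB processed per query on average")
```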
They have a nice diagram that’s an overview of what their processes looked like at this time, where they essentially pushed data into GCS from on-premise Hadoop data clusters, and then used Airflow to move that into BigQuery, from which Data Studio pulled its data.

Ease of Use

BigQuery was easy to use because it didn’t require the installation of special tools and instead was easy to navigate via a web UI.
- Users did need to become familiar with some GCP and BigQuery concepts such as projects, datasets, and tables.
- They developed educational material for users which helped get people up and running with BigQuery and Data Studio.
With regard to loading data, they looked at various pieces …
- Cloud Composer (managed Airflow) couldn’t be used due to Domain Restricted Sharing (data governance).
- Google Data Transfer Service was not flexible enough for data pipelines with dependencies.
- They ended up using Apache Airflow as they could customize it to their needs.
  - For data transformation, once data was in BigQuery, they created scheduled jobs to do simple SQL transforms.
  - For complex transformations, they planned to use Airflow or Cloud Composer with Cloud Dataflow.

Performance

BigQuery is not for low-latency, high-throughput queries, or for low-latency, time-series analytics.
- It is for SQL queries that process large amounts of data.
Their requirement for BigQuery usage was to return results within a minute.
- To achieve this, they allowed their internal customers to reserve minimum slots for their queries, where a slot is a unit of computational capacity to execute a query.
The engineering team had to analyze 800+ queries, each processing around 1TB of data, to figure out how to allocate the proper slots for production and other environments.

Data Governance

Twitter focused on discoverability, access control, security, and privacy.
For data discovery and management, they extended their DAL to work with both their on-premise and GCP data, providing a single API to query all sets of data.
With regard to controlling access to the data, they took advantage of two GCP features:
- Domain restricted sharing, meaning only users inside Twitter could access the data, and
- VPC service controls to prevent data exfiltration as well as only allow access from known IP ranges.
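The "known IP ranges" idea can be illustrated with a few lines of Python. This is a minimal sketch of the concept only, not how VPC Service Controls is configured or enforced; the `ALLOWED_RANGES` list and the `is_allowed` helper are hypothetical, and the addresses are documentation-only ranges.

```python
from ipaddress import ip_address, ip_network

# Hypothetical allowlist built from documentation-only ranges (RFC 5737).
ALLOWED_RANGES = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]


def is_allowed(client_ip: str, allowed=ALLOWED_RANGES) -> bool:
    """Return True if client_ip falls inside any of the allowed networks."""
    addr = ip_address(client_ip)
    return any(addr in net for net in allowed)


print(is_allowed("192.0.2.17"))    # inside the first range
print(is_allowed("203.0.113.5"))   # outside both ranges
```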

Authentication, Authorization, and Auditing

For authentication, they used GCP user accounts for ad hoc queries and service accounts for production queries.
For authorization, each dataset had an owner service account and a reader group.
For auditing, they exported BigQuery Stackdriver logs with detailed execution information to BigQuery datasets for analysis.
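The authorization model above could be pictured roughly as follows. This is a hypothetical sketch, not Twitter's implementation or a GCP API: the `Dataset` class, the principal names, and the rule that only the owner service account may write are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One dataset with a single owner service account and a reader group."""
    name: str
    owner_service_account: str
    reader_group_members: set = field(default_factory=set)

    def can_read(self, principal: str) -> bool:
        # Assumption: the owner can read, as can anyone in the reader group.
        return (principal == self.owner_service_account
                or principal in self.reader_group_members)

    def can_write(self, principal: str) -> bool:
        # Assumption: only the owner service account may write.
        return principal == self.owner_service_account


ds = Dataset("ads_metrics", "svc-ads-owner@example.com", {"alice@example.com"})
print(ds.can_read("alice@example.com"), ds.can_write("alice@example.com"))
```

Keeping the check on the dataset itself mirrors the per-dataset ownership described in the notes, where each dataset carries its own owner and reader group.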

Ensuring Proper Handling of Private Data

They required registering all BigQuery datasets,
Annotating private data,
Using proper retention, and
Scrubbing and removing data that was deleted by users.

Privacy Categories for Datasets

Highly sensitive datasets are available on an as-needed basis with least privilege.
- These have individual reader groups that are actively monitored.
Medium sensitivity datasets are anonymized datasets with no personally identifiable information (PII) that provide a good balance between privacy and utility, such as showing how many users used a particular feature without revealing who the users were.
Low sensitivity datasets are datasets where all user level information is removed.
Public datasets are available to everyone within Twitter.
Scheduled tasks were used to register datasets with the DAL, as well as a number of additional things.
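The medium-sensitivity example above (counting how many users used a feature without knowing who they were) can be sketched like this. It is an illustration under assumptions: the event shape, the `feature_usage_counts` helper, and the sample data are made up, and real anonymization typically needs more care than dropping identifiers.

```python
from collections import defaultdict


def feature_usage_counts(events):
    """Count distinct users per feature.

    events: iterable of (user_id, feature) pairs.
    The result holds only counts; no user IDs are kept in the output.
    """
    users_by_feature = defaultdict(set)
    for user_id, feature in events:
        users_by_feature[feature].add(user_id)
    return {feature: len(users) for feature, users in users_by_feature.items()}


events = [
    ("u1", "search"),
    ("u2", "search"),
    ("u1", "search"),      # repeat use by the same user counts once
    ("u3", "bookmarks"),
]
print(feature_usage_counts(events))  # {'search': 2, 'bookmarks': 1}
```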

Cost

Querying costs were roughly the same for Presto and BigQuery.
There are additional costs associated with storing data in GCS and BigQuery.
They used flat-rate pricing so they didn’t have to work out the fluctuating costs of running ad hoc queries.
In some situations, such as querying tens of petabytes, it was more cost-effective to use Presto to query the data in GCS.

Could you build Twitter in a weekend?

Resources

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (JoelOnSoftware.com)
Scaling data access by moving an exabyte of data to Google Cloud (blog.twitter.com)
Democratizing data analysis with Google BigQuery (blog.twitter.com)
Google BigQuery, Cloud data warehouse to power your data-driven innovation (cloud.google.com)
Google Data Studio, Your data is beautiful. Use it. (datastudio.withgoogle.com)
Looker Studio, formerly Data Studio (datastudio.google.com)
Stack Overflow’s engineering blog (Stack Overflow)
Apache Airflow (Wikipedia)
Stack Exchange performance (Stack Exchange)
Elon Musk and Twitter employees engage in war of words (NewsBytesApp.com)

Tip of the Week

VS Code has a plugin for Kubernetes and it’s actually very nice! Particularly when you “attach” to the container. It installs a couple bits on the container, and you can treat it like a local computer. The only thing to watch for … it’s very easy to set your local context! (marketplace.visualstudio.com)
kafkactl is a great command line tool for managing Apache Kafka and has a consistent API that is intuitive to use. (deviceinsight.github.io)
Cruise Control is a tool for Apache Kafka that helps balance resource utilization, detect and alert on problems, and administrate. (GitHub)
iTerm2 is a terminal emulator for macOS that does amazing things. Why aren’t you already using it? (iterm2.com)
- Previously mentioned in episode 147 and episode 161.
Message compression in Kafka will help you save a lot of space and network bandwidth, and the compression is per message so it’s easy to enable in existing systems! (cwiki.apache.org)
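To get a feel for the kind of savings involved, here is a sketch that uses Python's standard `gzip` module rather than Kafka itself. The messages and field names are made up, and the ratio you see will depend entirely on your own payloads. On a real Kafka producer, the codec is chosen with the `compression.type` setting.

```python
import gzip
import json

# Hypothetical event payloads with repetitive field names and values.
messages = [
    {"event": "page_view", "user_agent": "Mozilla/5.0", "path": f"/item/{i}"}
    for i in range(1000)
]

# Serialize the batch as newline-delimited JSON, then compress it.
raw = "\n".join(json.dumps(m) for m in messages).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"compressed size is {len(compressed) / len(raw):.1%} of the original")

# Compression is lossless: the original bytes come back unchanged.
assert gzip.decompress(compressed) == raw
```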

Direct download: coding-blocks-episode-198.mp3
Category:Software Development -- posted at: 8:01pm EDT

Sun, 6 November 2022

The 2022 Shopping Spree

It’s that time of year where we’ve got money burning a hole in our pockets. That’s right, it’s time for the annual shopping spree. Meanwhile, Fiona Allen is being gross, Joe throws shade at Burger King, and Michael has a new character encoding method.

The full show notes for this episode are available at https://www.codingblocks.net/episode197.

News

Thank you to everyone that left a review!
- Anonymous User, rd, Ian Matchett, Glen Jakobsen
- Want to help out the show? Leave us a review!
Almost time to start talking about … Game JA JA JA JAMUARY!
What’s your perspective on strong, static, weak or dynamic typing and how is it shaped by your experiences?
How do you move into DevOps or SRE roles if you have developer experience?
- Check out Google’s career pages! (sre.google)
- We did an episode, or eight, on The DevOps Handbook once upon a time. (The DevOps Handbook episodes)

Allen’s List

Price	Description
Nerdy Stuff
$459.00	Kinesis Advantage360 – Bluetooth Version (Amazon)
$99.99	Logitech Ergonomic MX Vertical Wireless Mouse (Amazon)
Healthy Stuff
$109.97	Bodylastics Warrior Resistance Band Set (Amazon)
$19.99	Resistance Band Rack Storage / Hanger (Amazon)
Entertainment Stuff
$99.00	Wiim Mini Streamer (Amazon)
$49.99	Roku Streaming Stick 4k (Amazon)
$549.00	PS VR2 (PlayStation)
Audio Stuff
$169.00	Audio Technica M50x (Amazon)
$58.95	Honorable mention: AKG Pro Audio K240 Studio Headphones (Amazon)
$21.99	Honorable mention: Brainwavz Round Memory Foam Earpads (Amazon)
$56.95	AIYIMA DAC-A2 (Amazon)
Woodworking stuff
$349.00	20-Volt Maximum Lithium-Ion Cordless Combo Kit (4-Tool) with 4 Ah Battery, 2 Ah Battery, Charger and Bag (Amazon)
$44.00	Kreg KMA2685 Rip-Cut Circular Saw Guide (Amazon)

Joe’s List

Well, you know Joe has to be a little different so the format’s a bit different here! What if there was a way to spend money that could actually make you happy? Check out this article: Yes, you can buy happiness … if you spend it to save time (CNBC).

Ideas for ways to spend $2k to save you time

A good mattress will improve your sleep, and therefore your amount of quality time in a day! ($1k),
Cleaning Service ($100 – $300 per month),
Massage ($50 per month),
Car Wash Subscription ($20 per month),
Grocery Delivery Service (Shipt is $10 a month + up charges on items),
Hire landscapers ($100 per month),
Get a virtual assistant ($10 to $20 an hour),
Use a delivery service like DoorDash or Postmates, or
Get your meals mailed to you (Blue Apron, Factor ~$7 to $10 per meal per person).

Remember, it’s not just about the time you save, it’s also about increasing the quality and value of the time you’re already saving!

What to do with that time and energy?

You could …

Create a Business,
Create a hobby website or portfolio,
LeetCode, or
Game Ja Ja Ja Jamuary!

Or you could …

Hang out with friends or family,
Go to the gym,
Learn an instrument, or
Meditate.

Trust the process, knowing that whatever time you do put into tech will be more fruitful!

Michael’s List

	Description	Price
Workstations
	Honorable mention: Zero Gravity Workstations (ErgoQuest.com)	$$$$$.$$
Serious Stuff
	Google Nest Wifi Pro (Amazon) Connection Failed During Setup (Reddit)	$399.99
	Honorable mention: ASUS ZenWiFi ET8 (Amazon)	$480.00
	Apple AirPods Pro (2nd Generation) (Amazon)	$239.00
	Lifelong Office Chair Wheels (Black) (Amazon)	$36.95
	Alex Tech Wire Loom Tubing Cable Sleeve (Amazon)	$12.99
	OXO Good Grips Sweep & Swipe Laptop Cleaner (Amazon)	$11.95
Fun Stuff
	DJI OM 5 Smartphone Gimbal Stabilizer (Amazon)	$129.00
	Ember Temperature Control Travel Mug 2, 12 oz, Black (Amazon)	$191.95
	Gillette Heated Razor for Men (Amazon)	$99.99
	MScreen Standard Widescreen (Indiegogo)	$149.00
	Transformers Optimus Prime Auto-Converting Robot by Robosen (Elite Edition) (Amazon)	$699.00
	LuckyBot Food 3D Printer Extruder (Amazon)	$169.00
	Stealth Abs + Plank Core Trainer (Amazon)	$149.00

Tip of the Week

How do you fix a typo on your phone? Try pressing and then sliding your thumb on the space bar!
It’s a nifty trick to keep you in the flow. And it works on both Android and iOS.
Heading off to holiday? Here’s an addendum to episode 191’s Tip of the Week … Don’t forget your calendar!
- On iOS, go to Settings -> Mail -> Accounts -> Select your work account -> Turn off the Mail and Calendar sliders.
Also, in Slack, you can pause notifications for an extended period and if you do, it’ll automatically change your status to Vacationing.
Did you know that Docker only has a local image cache and doesn’t install a local registry? This matters if you go to use something like microk8s instead of minikube! (microk8s.io)
What if you want to see what process has a file locked?
- In Windows, Ronald Sahagun let us know you can use File Locksmith in PowerToys from Microsoft. (learn.microsoft.com)
- In Linux-based systems, Dave Follett points out you can just cat the locks file in your /proc directory: cat /proc/locks, then match on the process ID to see what’s locked. lslocks makes it easy too, just run the command and grep for your file. (Stack Exchange)
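If you'd rather script the Linux side, one option is to parse /proc/locks, which the kernel exposes as a text listing of current file locks. The sketch below is not from the episode: the `FileLock` type and `parse_locks` helper are hypothetical names, and the field layout (id, class, mode, access, PID, major:minor:inode, start, end) is assumed from the proc man page.

```python
from typing import NamedTuple


class FileLock(NamedTuple):
    lock_class: str   # e.g. POSIX or FLOCK
    mode: str         # ADVISORY or MANDATORY
    access: str       # READ or WRITE
    pid: int          # process holding the lock
    inode: int        # inode number of the locked file


def parse_locks(text: str) -> list:
    """Parse text in /proc/locks format into FileLock records."""
    locks = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and "->" entries, which describe blocked waiters.
        if len(fields) < 8 or fields[1] == "->":
            continue
        _, lock_class, mode, access, pid, device_inode = fields[:6]
        # device_inode looks like "major:minor:inode".
        inode = int(device_inode.split(":")[2])
        locks.append(FileLock(lock_class, mode, access, int(pid), inode))
    return locks


sample = """\
1: POSIX  ADVISORY  WRITE 3568 fd:00:2531452 0 EOF
2: FLOCK  ADVISORY  WRITE 3517 fd:00:2531448 0 EOF
"""
for lock in parse_locks(sample):
    print(lock.pid, lock.access, lock.inode)
```

On a real system you would pass in the contents of /proc/locks and compare each record's inode with `os.stat(path).st_ino` for the file you care about; matching on inode alone ignores the device, so treat it as a first pass.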

Direct download: coding-blocks-episode-197.mp3
Category:Software Development -- posted at: 10:01pm EDT
