Sun, 20 November 2022
We take a peak into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.
The full show notes for this episode are available at https://www.codingblocks.net/episode198.
- Want to help us out? Leave us a review!
- The 2023 Game Ja-Ja-Ja Jam is coming up!
Twitter has a Data Problem
Moving an Exabyte of Data
- In 2019, over 100 million people per day would visit Twitter.
- Every tweet and user action creates an event that is used by machine learning and employees for analytics.
- Their goal was to democratize data analysis within Twitter to allow people with various skillsets to analyze and/or visualize the data.
- At the time, various technologies were used for data analysis:
- Scalding which required programmer knowledge, and
- Presto and Vertica which had performance issues at scale.
- Another problem was having data spread across multiple systems without a simple way to access it.
Moving pieces to Google Cloud Platform
- The Google Cloud big data tools at play:
- BigQuery, a cost-effective, serverless, multicloud enterprise data warehouse to power your data-driven innovation.
- DataStudio, unifying data in one place with ability to explore, visualize and tell stories with the data.
History of Data Warehousing at Twitter
- 2011 – Data analysis was done with Vertica and Hadoop and data was ingested using Pig for MapReduce.
- 2012 – Replaced Pig with Scalding using Scala APIs that were geared towards creating complex pipelines that were easy to test. However, it was difficult for people with SQL skills to pick up.
- 2016 – Started using Presto to access Hadoop data using SQL and also used Spark for ad hoc data science and machine learning.
- 2018 …
- Scalding for production pipelines,
- Scalding and Spark for ad hoc data science and machine learning,
- Vertica and Presto for ad hoc, interactive SQL analysis,
- Druid for interactive, exploratory access to time-series metrics, and
- Tableau, Zeppelin, and Pivot for data visualization.
- So why the change? To simplify analytical tools for Twitter employees.
BigQuery for Everyone
- Needed to develop an infrastructure to reliably ingest large amounts of data,
- Support company-wide data management,
- Implement access controls,
- Ensure customer privacy, and
- Build systems for:
- Resource allocation,
- Monitoring, and
- In 2018, they rolled out an alpha release.
- The most frequently used tables were offered with personal data removed.
- Over 250 users, from engineering, finance, and marketing used the alpha.
- Sometime around June of 2019, they had a month where 8,000 queries were run that processed over 100 petabytes of data, not including scheduled reports.
- The alpha turned out to be a large success so they moved forward with more using BigQuery.
- They have a nice diagram that’s an overview of what their processes looked like at this time, where they essentially pushed data into GCS from on-premise Hadoop data clusters, and then used Airflow to move that into BigQuery, from which Data Studio pulled its data.
Ease of Use
- BigQuery was easy to use because it didn’t require the installation of special tools and instead was easy to navigate via a web UI.
- Users did need to become familiar with some GCP and BigQuery concepts such as projects, datasets, and tables.
- They developed educational material for users which helped get people up and running with BigQuery and Data Studio.
- In regards to loading data, they looked at various pieces …
- Cloud Composer (managed Airflow) couldn’t be used due to Domain Restricted Sharing (data governance).
- Google Data Transfer Service was not flexible enough for data pipelines with dependencies.
- They ended up using Apache Airflow as they could customize it to their needs.
- For data transformation, once data was in BigQuery, they created scheduled jobs to do simple SQL transforms.
- For complex transformations, they planned to use Airflow or Cloud Composer with Cloud Dataflow.
- BigQuery is not for low-latency, high-throughput queries, or for low-latency, time-series analytics.
- It is for SQL queries that process large amounts of data.
- Their requirements for their BigQuery usage was to return results within a minute.
- To achieve these requirements, they allowed their internal customers to reserve minimum slots for their queries, where a slot is a unit of computational capacity to execute a query.
- The engineering team had to analyze 800+ queries, each processing around 1TB of data, to figure out how to allocate the proper slots for production and other environments.
- Twitter focused on discoverability, access control, security, and privacy.
- For data discovery and management, they extended their DAL to work with both their on-premise and GCP data, providing a single API to query all sets of data.
- In regards to controlling access to the data, they took advantage of two GCP features:
- Domain restricted sharing, meaning only users inside Twitter could access the data, and
- VPC service controls to prevent data exfiltration as well as only allow access from known IP ranges.
Authentication, Authorization, and Auditing
- For authentication, they used GCP user accounts for ad hoc queries and service accounts for production queries.
- For authorization, each dataset had an owner service account and a reader group.
- For auditing, they exported BigQuery stackdriver logs with detailed execution information to BigQuery datasets for analysis.
Ensuring Proper Handling of Private Data
- They required registering all BigQuery datasets,
- Annotate private data,
- Use proper retention, and
- Scrub and remove data that was deleted by users.
Privacy Categories for Datasets
- Highly sensitive datasets are available on an as-needed basis with least privilege.
- These have individual reader groups that are actively monitored.
- Medium sensitivity datasets are anonymized data sets with no PII (Personally identifiable information) and provide a good balance between privacy and utility, such as, how many users used a particular feature without knowing who the users were.
- Low sensitivity datasets are datasets where all user level information is removed.
- Public datasets are available to everyone within Twitter.
- Scheduled tasks were used to register datasets with the DAL, as well as a number of additional things.
- Roughly the same for querying Presto vs BigQuery.
- There are additional costs associated with storing data in GCS and BigQuery.
- Utilized flat-rate pricing so they didn’t have to figure out fluctuating costs of running ad hoc queries.
- In some situations where querying 10’s of petabytes, it was more cost-effective to utilize Presto querying data in GCS storage.
Could you build Twitter in a weekend?
Tip of the Week
- VS Code has a plugin for Kubernetes and it’s actually very nice! Particularly when you “attach” to the container. It installs a couple bits on the container, and you can treat it like a local computer. The only thing to watch for … it’s very easy to set your local context! (marketplace.visualstudio.com)
kafkactl is a great command line tool for managing Apache Kafka and has a consistent API that is intuitive to use. (deviceinsight.github.io)
- Cruise Control is a tool for Apache Kafka that helps balance resource utilization, detect and alert on problems, and administrate. (GitHub)
- iTerm2 is a terminal emulator for macOS that does amazing things. Why aren’t you already using it? (iterm2.com)
- Message compression in Kafka will help you save a lot of space and network bandwidth, and the compression is per message so it’s easy to enable in existing systems! (cwiki.apache.org)
Direct download: coding-blocks-episode-198.mp3
-- posted at: 8:01pm EDT
Sun, 6 November 2022
It’s that time of year where we’ve got money burning a hole in our pockets. That’s right, it’s time for the annual shopping spree. Meanwhile, Fiona Allen is being gross, Joe throws shade at Burger King, and Michael has a new character encoding method.
The full show notes for this episode are available at https://www.codingblocks.net/episode197.
- Retool – Stop wrestling with UI libraries, hacking together data sources, and figuring out access controls, and instead start shipping apps that move your business forward.
- Thank you to everyone that left a review!
- Anonymous User, rd, Ian Matchett, Glen Jakobsen
- Want to help out the show? Leave us a review!
- Almost time to start talking about … Game JA JA JA JAMUARY!
- What’s your perspective on strong, static, weak or dynamic typing and how is it shaped by your experiences?
- How do you move into DevOps or SRE roles if you have developer experience?
Well, you know Joe has to be a little different so the format’s a bit different here! What if there was a way to spend money that could actually make you happy? Check out this article: Yes, you can buy happiness … if you spend it to save time (CNBC).
Ideas for ways to spend $2k to save you time
- A good mattress will improve your sleep, and therefore your amount of quality time in a day! ($1k),
- Cleaning Service ($100 – $300 per month),
- Massage ($50 per month),
- Car Wash Subscription ($20 per month),
- Grocery Delivery Service (Shipt is $10 a month + up charges on items),
- Hire landscapers ($100 per month),
- Get a virtual assistant ($10 to $20 an hour),
- Use a delivery services like DoorDash or Postmates, or
- Get your meals mailed to you (Blue Apron, Factor ~$7 to $10 per meal per person).
Remember, it’s not just about the time you save, it’s also about increasing the quality and value of the time you’re already saving!
What to do with that time and energy?
You could …
- Create a Business,
- Create a hobby website or portfolio,
- LeetCode, or
- Game Ja Ja Ja Jamuary!
Or you could …
- Hang out with friends or family,
- Go to the gym,
- Learn an instrument, or
Trust the process, knowing that whatever time you do put into tech will be more fruitful!
Tip of the Week
- How do you fix a typo on your phone? Try pressing and then sliding your thumb on the space bar!
It’s a nifty trick to keep you in the flow. And it works on both Android and iOS.
- Heading off to holiday? Here’s an addendum to episode 191‘s Tip of the Week … Don’t forget your calendar!
- On iOS, go to Settings -> Mail -> Accounts -> Select your work account -> Turn off the Mail and Calendar sliders.
- Also, in Slack, you can pause notifications for an extended period and if you do, it’ll automatically change your status to Vacationing .
- Did you know that Docker only has an image cache locally, there isn’t a local registry installed? This matters if you go to use something like microk8s instead of minikube! (microk8s.io)
- What if you want to see what process has a file locked?
- In Windows, Ronald Sahagun let us know you can use File Locksmith in PowerToys from Microsoft. (learn.microsoft.com)
- In Linux based systems, Dave Follett points out you can just cat the process ID file in your /proc directory:
cat /proc/<processId> to see what’s locked. LS Locks makes it easy too, just run the command and grep for your file. (Stack Exchange)
Direct download: coding-blocks-episode-197.mp3
-- posted at: 10:01pm EDT