Doing Data Science at Twitter


On June 17, 2015, I celebrated my two year #Twitterversary @Twitter. Looking back, the Data Science (DS for short) landscape at Twitter has shifted quite a bit:

Machine Learning has played an increasingly prominent role across many core Twitter products that were previously not ML driven (e.g. “While you were away”).

Tool-wise, we’ve moved away from Pig, and all new data pipelines are now written in Scalding, a Scala DSL built on top of Cascading that makes it easy to specify Hadoop MapReduce jobs.

Organizationally, we switched to an embedded model where DS now work more closely than ever with the product/engineering teams.

And these are only a handful of changes among many others! On a personal note, I’ve recently branched out from Growth to PIE (Product, Instrumentation, and Experimentation) to work on the statistical methodologies of our homegrown A/B testing platform.

Being at Twitter is truly exciting, because it allows me to observe and learn, first hand, how a major technology company leverages data and DS to create competitive edges.

Meanwhile, the demand and desire to do data science continue to skyrocket.

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it” — Dan Ariely

There are many, and I mean many, discussions around how to become a data scientist. While these discussions are extremely informative (I am one of the beneficiaries), they tend to over-emphasize techniques, tools, and skill-sets. In my opinion, it is equally important for aspiring Data Scientists to know what it is really like to work as a DS in practice.

So, as I hit my two year mark at Twitter, I want to use this reflection as an opportunity to share my personal experience, in the hope that others in the field will do the same!

Type A Data Scientist vs. Type B Data Scientist

Before Twitter, I got the impression that all DS need to be unicorns, skilled in everything from Math/Stat and CS/ML/Algorithms to data viz. In addition to technical skills, writing and communication skills are crucial. Furthermore, being able to prioritize, lead, and manage projects is paramount for execution. Oh yeah, you should also evangelize a data driven culture. Good luck!

A few months into my job, I learned that while unicorns do exist, for the majority of us who are still trying to get there, it is unrealistic to do all these things at once. That said, almost everything data related is tied to the term DS, and it was a bit daunting to find my place as a newbie.

Over time, I realized that there is an overly simplified but sufficiently accurate dichotomy of the different types of Data Scientists. I wasn’t able to articulate this well until I came across a Quora answer from Michael Hochster, who elegantly summarized this point. In his words:

Type A Data Scientist: The A is for Analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren’t taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.

Type B Data Scientist: The B is for Building. Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data “in production.” They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).

I wish I had known this earlier. In fact, as an aspiring DS, it is very useful to keep this distinction in mind as you make career decisions and choices.

Personally, my background is in Math, Operations Research, and Statistics. I identify mainly as a Type A Data Scientist, but I also really enjoy Type B projects that involve more engineering!

DS at early stage start-ups, growing start-ups, and companies that have achieved scale

One of the most common decisions to make while looking for tech jobs is whether to join a large or a small company. While there are a lot of good general discussions on this topic, there isn’t much information specifically for DS: namely, how the role of DS changes depending on the stage and size of the company.

Companies at different stages produce data at different velocities, varieties, and volumes (the infamous 3Vs). A start-up trying to find its product market fit probably doesn’t need Hadoop because there isn’t much data. A growing start-up will be more data intensive but might do just fine using PostgreSQL or Vertica. But a company like Twitter cannot efficiently process all its data without using Hadoop and the MapReduce framework.

One important lesson I learned at Twitter is that a Data Scientist’s capability to extract value from data is largely coupled with the maturity of their company’s data platform. Understanding what kind of DS work you want to get involved in, and doing your research to evaluate whether the company’s infrastructure can support your goal, is not only smart but paramount to ensuring the right mutual fit.

At early stage start-ups: The primary analytic focus is to implement logging, to build ETL processes, and to model data and design schemas so data can be tracked and stored. The goal here is to build the analytics foundation rather than to do analysis itself.

At mid-stage growing start-ups: Since the company is growing, the data is probably growing too. The data platform needs to adapt, but with the foundation already laid, there will be a natural shift to insight generation. Unless the company leverages Data Science for its strategic differentiation to start with, much of the analytics work revolves around defining KPIs, attributing growth, and finding the next opportunities to grow.

Companies that have achieved scale: When the company scales up, data also scales up. The company needs to leverage data to create or maintain a competitive edge: search results need to be better, recommendations need to be more relevant, and logistics or operations need to be more efficient. This is the time when specialists such as ML engineers, optimization experts, and experimentation designers can play a huge role in stepping up the game.

By the time I joined Twitter, it already had a very mature data platform and stable infrastructure in place. The warehouse is clean and reliable, and ETL processes easily run hundreds of MapReduce jobs on a daily basis. Most importantly, we have talented DS working on data platform, product insights, Growth, experimentation, and Search/Relevance, along with many other focus areas.

My Journey

I was the first dedicated Data Scientist on Growth, and the reality is, it took us a good few months before Product, Engineering, and DS converged on how DS can play a critical role in the process. Based on my experience working closely with the product team, I categorize my responsibilities into four general areas:

Product Insights

Data Pipeline

Experimentation (A/B Testing)


Let me describe my experience and learning in each of these topics.

