Data Science is the Old New Thing

In the last six months the term “Big Data” has reached the mainstream media. Companies are furiously hiring brilliant data scientists to make sense of all the data they have. Most revolutionary: I no longer need to sheepishly disclose that my college major was “Mathematical and Computational Science.”

Is this a step forward? Absolutely.

Does it mean that those same companies are doing all they can to use data in intelligent ways? In most cases, no.

The current de facto “data scientist” model is the third of four steps toward data utopia. Here are the steps:

1) Null Set. You don’t collect a lot of data. You don’t know what you don’t know, and you operate solely on intuition. No serious Internet companies are here, but some small startups are, as are most small offline businesses.

2) Write Only. You collect data, but you don’t really know how to use them. Maybe you have non-technical marketers telling engineers to run specific queries. This rarely works, because most (not all) marketers don’t have an analytical understanding of the technology or the product, and the engineers just do what they’re told. This is still pretty common.

3) Data Economics. You collect data, but have a line in the sand between builders of products and those who use the data. Your product and engineering teams spec and build features; the Analytics/Data Science team takes the data available to them and uses them to figure out cool stuff.

4) Data Hacking. Your systems (technical and human) are structured so that data geekery is integrated with product and engineering. Your team asks not (only) what they can do with the data they have, but how they can get the data they want. This requires really smart data geeks who can understand how to design a product and can write code.

Let’s say Twitter decides they want to figure out how much coffee each of their users drinks.

If they were at Null Set, they would be completely lost.

If they were at Write Only, they might look at the frequency with which users mention “coffee”, “Starbucks”, “latte”, and maybe “caffeine” or “cafe”. Those with two or more mentions would be pegged as heavy coffee drinkers. And their accuracy would be abominable.
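A “Write Only” approach might literally amount to a keyword counter like this sketch (the keyword list comes from the text above; the two-mention threshold and the function name are illustrative):

```python
# Naive "Write Only" heuristic: flag anyone who mentions two or more
# coffee-related keywords as a heavy coffee drinker.
COFFEE_KEYWORDS = {"coffee", "starbucks", "latte", "caffeine", "cafe"}

def is_heavy_coffee_drinker(tweets):
    """Count keyword mentions across a user's tweets; accuracy: abominable."""
    mentions = sum(
        1
        for tweet in tweets
        for word in tweet.lower().split()
        if word.strip(".,!?#@") in COFFEE_KEYWORDS
    )
    return mentions >= 2

print(is_heavy_coffee_drinker(["Need my latte!", "Starbucks run, anyone?"]))  # True
print(is_heavy_coffee_drinker(["Watching the game tonight"]))                 # False
```

Fast to build, and exactly as crude as it looks: it can’t tell a barista from a coffee hater who tweets complaints about Starbucks.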

If they were at Data Economics, they would have a couple of smart data scientists, and they’d probably manually classify the people they knew as heavy/light/non-coffee drinkers. They’d use those manually classified accounts, plus keyword data and connection data, to build a statistical model that estimates the coffee consumption for any given user. They’d do okay — they’d uncover keywords that are more predictive than the marketers’ list from Write Only, and find one or two surprising correlations — but they still probably wouldn’t be able to build a great model because the data Twitter has are not very predictive of coffee drinking.
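The Data Economics version might look like this minimal sketch: fit a logistic model on the manually classified accounts. The two features (coffee-keyword rate, follows a coffee brand) and every number below are invented for illustration:

```python
import math

# Hand-labeled accounts: [coffee-keyword rate, follows a coffee brand (0/1)]
X = [[0.9, 1], [0.7, 1], [0.8, 0], [0.1, 0], [0.2, 0], [0.05, 1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = manually classified as a heavy coffee drinker

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.5, epochs=2000):
    """Plain stochastic-gradient-descent logistic regression."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

w, b = fit(X, y)

def score(x):
    """Estimated probability that a user is a heavy coffee drinker."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

print(score([0.85, 1]) > 0.5)  # True: looks like a coffee drinker
print(score([0.05, 0]) > 0.5)  # False
```

The model can rank every Twitter user by estimated coffee consumption, but it can only be as good as the features Twitter happens to have.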

Were they at Data Hacking, they’d do everything in Data Economics. But then they’d also build out a dozen or so features to yield future insights. Maybe they’d suggest following Starbucks or Peet’s and use acceptance as a proxy for liking coffee. Maybe they’d create a “tweet about my coffee” iPhone/Android app. Maybe they’d ask a small percentage of people who’d just tweeted from a coffee shop what they ordered. Maybe they’d ask users who just tweeted about something coffee-related if they wanted a coupon for a free Starbucks coffee; presumably those who accept are much more likely to be coffee drinkers than those who decline.

I offer no guarantees of the effectiveness of any of those ideas. The point is that they’re interactive, and they facilitate iteration: hack, analyze, hack, analyze. The hackers are the analyzers, and vice versa.

Want to measure reputation or influence? Algorithms on top of the existing data from Facebook/LinkedIn/Twitter aren’t enough — you need better data. Cubeduel — which allows users to express a preference between two people they’ve worked with — is an interesting product because it’s closer to “real” reputation measurement. But on the reputation front, the three big social networks haven’t built a lot of clever measurement features themselves.

And that’s not unique to them: few if any Internet companies are in the Data Hacking phase today. The good ones are doing Data Economics: they have great data scientists (or search relevance teams, etc.), but those scientists are siloed. “Here are the data available to you,” they say, “go figure something out.”

That’s limiting, because there’s an entire class of problems that need Data Hacking brilliance to be solved. Andrew Chen’s post about Growth Hacker being the new VP Marketing is an example of companies moving to step 4 to improve their distribution: growth hackers embody the “data + dev + product” approach and do better on distribution than anyone else. In the next few years, we’ll see visionary data hackers tackle more of our deepest, most challenging problems in interactive, iterative ways.

Data Scale – why big data trumps small data

As I walk into a coffee shop, the guy behind the counter sees that I’m in a hurry and that I’m by myself. He’s seen me a few times before, and knows that I don’t usually order sweet snacks. He knows I tip reasonably well. He’s likely to treat me a certain way: efficiently, without small talk, and not trying to sell me a muffin.

In the “real world”, his behavior — and my user experience — is largely the result of subconscious processing (what Daniel Kahneman’s terrific book Thinking, Fast and Slow calls System 1). Online, personalization and improvement of my experience usually comes from lots of data. Offline, the cashier’s “data set” is the personal experience he’s accumulated. Online, it’s the same thing for the site I’m visiting. The big difference is scale: most working humans are between 15 and 75 years old, so even the most seasoned cashier has at most 5x the experience of the greenest. Online, Facebook has nearly a billion users, and my blog has… less than one fifth that number. Online differences are orders of magnitude larger.

That advantage compounds over time, as companies with many millions of users attain data scale. Data scale is the millions of pieces of information that allow a company to improve the user experience in ways that competitors with fewer users cannot. I saw this firsthand at PayPal, LinkedIn, and Circle of Moms: all three companies were able to provide features and additional value to new and returning users because of what we’d learned from millions of others.

Network Effects and Big Data

Network effects are well-known and understood in the consumer Internet world. As Facebook grows more popular, more of your friends are on the site with you, and it becomes more and more useful (or at least, entertaining) for you. And that size distinguishes it from an upstart: why sign up for a new site with just three of your friends when you can be on Facebook with almost everyone? Network effects have clear and well-defined values for both websites and users.

By contrast, consider the opening sentence for the big data Wikipedia entry:

In information technology, big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools

The entry depicts a purely technical set of requirements, with no bearing on the product or user. But having lots of data means more than awkwardness and database management tools. Companies with data scale can create a set of features and processes — prediction, testing, understanding, and segmentation — that aren’t possible for those with small user bases. Collectively, they allow a company to block access to fraudsters, tailor products to users, and understand them in deep ways. Data scale improves both the user experience and the bottom line.

The 4 Advantages of Data Scale

In my twelve years working with consumer Internet data, I’ve seen four things that companies with data scale can do much better than smaller competitors:

  1. Predict

    Fraud nearly destroyed PayPal’s business in its early years. Fortunately, we figured out how to accurately detect it, and wound up reducing fraud rates by 80-90%. After predicting which transactions were most risky, we’d block and reverse the bad ones — helping PayPal move from bleeding money in 2000 to profitability and an IPO in 2002.

    Data scale was necessary for that detection. More transactions — and more fraudulent transactions — give smart scientists the data they need to discover complex but statistically valid predictors of fraud. Start with a set of only 10,000 transactions and 100 fraudulent transactions, and you can put together a few simple rules to find fraud. But with millions of transactions and tens of thousands of fraudulent transactions, our fraud analytics team could find subtler patterns and detect fraud more accurately. A mini-PayPal might have the world’s smartest predictive modelers, but without a large data set, there’s only so much they could do.
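    A back-of-the-envelope illustration (the rates and counts below are invented, not PayPal’s): suppose some behavioral feature shows up in 30% of fraudulent transactions but 25% of good ones. A two-proportion z-test shows that this subtle signal is invisible at small scale and unmistakable at large scale:

```python
import math

def z_for_rate_difference(p_fraud, p_good, n_fraud, n_good):
    """Two-proportion z-statistic: how distinguishable are the two rates
    given these sample sizes?"""
    p_pool = (p_fraud * n_fraud + p_good * n_good) / (n_fraud + n_good)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_fraud + 1 / n_good))
    return (p_fraud - p_good) / se

# Mini-PayPal: 10,000 transactions, 100 of them fraudulent.
small = z_for_rate_difference(0.30, 0.25, 100, 9_900)
# Data scale: 3,000,000 transactions, 30,000 of them fraudulent.
large = z_for_rate_difference(0.30, 0.25, 30_000, 2_970_000)

print(round(small, 1))  # 1.1 -- indistinguishable from noise
print(round(large, 1))  # 19.9 -- unmistakable signal
```

A real fraud model combines hundreds of such signals, which makes the scale requirement even steeper.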

    Incidentally, this was a major reason PayPal needed to raise lots of capital. Losing a lot of money to fraud was a necessary byproduct in gathering the data needed to understand the problem and build good predictive models. A “lean startup” approach makes sense in some cases, but wouldn’t have cut it for PayPal.

    User perspective: if a company can figure out that you’re very likely a good, non-fraudulent customer, they can provide you with services they’d never want to offer their riskier users. That figuring out process is much more accurate when they have data scale.

  2. Understand

    Most websites start off with less-structured data — their databases contain lots of text. Free-form fields are easier for developers to code, and (early on) often make it easier for users to enter information. But unstructured data quickly get messy, and without scale, they don’t allow for easy inferences. With scale, though, unstructured data — along with a clever data scientist or two — can be a ticket to intelligently structuring them and building the corresponding features and insights. A few examples:

    • Until 2006, LinkedIn had no structure around company names. Users could type anything they wanted into a company field, and we had no way of automatically detecting that HP, H-P, Hewlett Packard, and HP, Inc. were all the same company. By parsing the data and matching it against other sources like email domains and address books, we were able to detect that those four names were synonyms. Without manual intervention, those processes are only possible with data scale: one person at “HP, Inc.” with an hp.com email address could be random, but when 97 out of 100 users have that property it’s a safe bet that it’s not a random fluke.

      Having an accurate list of companies allows LinkedIn to better guess who you know, it facilitates good company pages, and by using autocomplete, it improves data quality going forward.

    • Before we built Circle of Moms, we built a Facebook app called Circle of Friends. Less than a year in, with millions of users but weakening growth and minimal revenue, we started to search for ways we might shift our business. We found that moms were creating “circles of moms” and using them more than any other group was using theirs.

      Data scale enabled us to find that trend, and understand what was going on. And that wound up being the insight that ultimately pushed us toward being a successful company.
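    The company-name cleanup above can be sketched roughly like this (the records, the 60% threshold, and the function names are all illustrative — LinkedIn’s actual pipeline was certainly more sophisticated):

```python
from collections import defaultdict

# Each record: (user-entered company name, user's email domain). Made up.
records = [
    ("HP", "hp.com"), ("HP", "hp.com"), ("H-P", "hp.com"),
    ("Hewlett Packard", "hp.com"), ("HP, Inc.", "hp.com"),
    ("HP", "gmail.com"),  # one stray record shouldn't break the match
    ("Acme", "acme.com"),
]

def dominant_domain(records, min_share=0.6):
    """Map each raw name to its most common email domain, if common enough."""
    by_name = defaultdict(list)
    for name, domain in records:
        by_name[name].append(domain)
    result = {}
    for name, domains in by_name.items():
        top = max(set(domains), key=domains.count)
        if domains.count(top) / len(domains) >= min_share:
            result[name] = top
    return result

def synonym_groups(records):
    """Cluster raw names whose dominant domains agree."""
    groups = defaultdict(set)
    for name, domain in dominant_domain(records).items():
        groups[domain].add(name)
    return dict(groups)

print(synonym_groups(records)["hp.com"])
```

    The key point survives the simplification: with only a handful of records per name, the “dominant domain” signal is noise; with thousands, it is nearly certain.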

    User perspective: younger, smaller companies don’t really know what users want, and thus have to keep their product open-ended. When you use the product of a company with lots of data, they’ve learned what people actually want to do, and you get a cleaner, more structured experience.

  3. Test

    Circle of Moms was fanatical about a/b testing from day one (LinkedIn was not — much to my chagrin — but I digress). But in order to decide between a and b, you need meaningful differences in outcomes. If there’s a large difference (say, 40-50%), then 100 outcomes (signups, clicks, whatever the company is optimizing for) in each group is often sufficient for establishing statistical significance. If the difference is 10% or less, you’ll need on the order of 1000+ outcomes.
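    Those rules of thumb fall out of the standard two-proportion sample-size formula. A sketch, assuming a 2% baseline clickthrough rate (`outcomes_needed` is my own helper, not a library function):

```python
import math

def outcomes_needed(p_base, lift, z_alpha=1.96, z_beta=0.84):
    """Rough outcomes-per-arm needed to detect a relative lift at
    ~95% confidence / 80% power (standard two-proportion formula)."""
    p1, p2 = p_base, p_base * (1 + lift)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    visitors_per_arm = (z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2
    # convert visitors per arm into expected *outcomes* (clicks) per arm
    return visitors_per_arm * p_base

print(round(outcomes_needed(0.02, 0.50)))  # 76 -- a big lift needs ~100 outcomes
print(round(outcomes_needed(0.02, 0.10)))  # 1612 -- a 10% lift needs 1000+
```

    The quadratic blowup in the denominator is why subtle improvements are simply invisible to small sites.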

    Let’s take a graphical look. Below are overall simulated clickthrough rates (CTRs) for different-sized user bases.

    [Figure: clickthrough rate by population size]

    [Technical details: I ran a simulation where a company a/b tests a variety of emails or subject lines. Each subject has clickthrough rate between 0.75% and 3%, randomly selected with a uniform distribution. All tests are pairwise, so A is tested against B, the winner is tested against C, that winner against D, etc. Tests are resolved at p=.99. A few aspects of this are unrealistic -- uniformly distributed CTRs, non-improving (on average) subjects, only tests with two participants, the same rules for resolving tests for big user bases and small, etc. -- but it's close enough for these purposes.]
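    A simplified version of that simulation can be reproduced in a few lines. Here, fixed-size test batches stand in for the p=.99 stopping rule, and the batch size and subject count are illustrative:

```python
import random

random.seed(1)

def overall_ctr(n_users, n_subjects=10, test_size=5000):
    """Pairwise tournament: subject-line CTRs are uniform on [0.75%, 3%];
    each head-to-head uses test_size users per arm, the observed winner
    advances, and everyone left over gets the reigning champion."""
    subjects = [random.uniform(0.0075, 0.03) for _ in range(n_subjects)]
    champ = subjects[0]
    clicks = sent = 0
    for challenger in subjects[1:]:
        if n_users - sent < 2 * test_size:
            break  # not enough users left to run another test
        counts = []
        for ctr in (champ, challenger):
            c = sum(random.random() < ctr for _ in range(test_size))
            clicks += c
            sent += test_size
            counts.append((c, ctr))
        champ = max(counts)[1]  # observed winner advances
    # everyone remaining receives the current champion
    clicks += sum(random.random() < champ for _ in range(n_users - sent))
    return clicks / n_users

for n in (10_000, 100_000, 1_000_000):
    print(n, round(overall_ctr(n), 4))
```

    With 10,000 users the tournament barely starts, so overall CTR hovers near the average subject; with a million, nearly everyone gets a line close to the 3% ceiling.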

    With a small user base, CTR will be mediocre: about 1.9%, and slow to improve. As a user base gets bigger and bigger, a higher and higher percentage of users wind up receiving a very good, well-tested subject line: big companies see a CTR very close to 3%. The largest improvement comes between 100,000 users and 1,000,000 users — in this case, that represents data scale. Most of our successful emails at Circle of Moms would go to a few hundred thousand or a few million people; we were right on the edge of having data scale. If we’d had fewer users, a high percentage of our users would have been “guinea pigs”. With millions of registered moms, we had (roughly) the same number of guinea pigs, but many more users for whom we could use our guinea pig learnings and send the very best content.

    Note the magnitudes of these differences. With data scale, testing can mean a bump of 50% or more; the testing bump is much less for a small operation. For a product close to being viral, an additional 10% — a “small data” bump — might be huge and a/b testing worthwhile. For a product with millions of users, a 50% jump is large almost regardless of application. On the other hand, for a small product where 10-20% doesn’t represent the difference between success and failure, time is best spent somewhere else. In other words, a/b testing is something every company with a large user base should do; for smaller companies the value varies.

    User perspective: if you’re part of a large group, you likely get better content because of feedback from those before you. If you’re part of a small group, you are more likely to be giving feedback rather than profiting from the feedback of those before you.

  4. Segment

    At Circle of Moms, segmenting was essentially a mix of predicting and testing. After we tested emails and subject lines with small batches of users, we’d create predictive models to figure out which future users would be likely to click on them.

    This meant we could figure out the odds that someone would click on each of twenty possible emails we might send. And we’d send the very best one for her.

    For Circle of Moms, predictive models were relatively simple and didn’t need as many users/observations as PayPal’s did. But because we were testing twenty different emails at a time and didn’t want to test everything on everyone, scale still mattered. 50,000 people is usually enough to train one model; multiply that by 20 emails and you need a million people just in the training set (i.e., the guinea pigs). If you only have 1.5 million users, the benefit of this type of segmentation will be small — 2/3 of your users will have received a “random” email to gather data for the models. At 5 million, a company is at data scale, and the vast majority of its users (80%) will get a personalized email.
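    That arithmetic is simple enough to write down (all numbers come from the paragraph above):

```python
def personalized_share(total_users, n_emails=20, per_model=50_000):
    """Fraction of users who get a model-chosen email rather than
    serving as training-set 'guinea pigs'."""
    training = n_emails * per_model  # 20 models x 50,000 observations = 1M
    return max(total_users - training, 0) / total_users

print(round(personalized_share(1_500_000), 2))  # 0.33 -- only 1/3 personalized
print(round(personalized_share(5_000_000), 2))  # 0.8 -- data scale
```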

    I got started on a segmentation-type problem at LinkedIn — matching people to jobs. Job matching means segmenting people into thousands of buckets (each job is a bucket), rather than only 20. Back in 2006, the quality and quantity of LinkedIn’s data made the job very difficult: 5 million users and only a few thousand past job listings was not enough data to do matching well. Today, with 20-30 times as many users, 7 years of job listings, and some scientists who are likely much better than I, LinkedIn does much better finding jobs for people than I did in 2006. That’s data scale (plus a talent upgrade) at work.

    Automated segmentation is harder to simulate and precisely quantify than testing. But the overall picture is clear: it’s useless at small scale, but usually far more valuable than testing at data scale.

    User perspective: when a company can figure out what you like, they can provide you with content uniquely suited to your needs and interests. The more data they have — both on you and on others — the better they can perform this service.

Things I don’t know I don’t know?

I’m both a data guy and an early-stage startup guy, and that generally constrains the problems I see. I left PayPal a little while after the company was acquired by eBay; I left LinkedIn when it was an 80-person company that I found too slow-moving; I left Circle of Moms after Sugar’s acquisition. That means I’ve never worked on a product with over ten million users. No doubt I’m missing out on some of the advantages that truly massive companies have. Others have more firsthand knowledge on the topic of really big data scale — those of you in that category, ping me about your favorite post and I’ll add a link.

WTF is the “Numerate Choir”?

When I showed this site to my wife, she mostly liked what she saw. She’s kind of a skeptic, so that was a nice surprise.

But she did ask what the heck a nume-er-RAYTE choir is, and wondered why on earth I would call my blog that.

First off, let’s start with pronunciation. It’s NUME-er-it. Numerate is (in this context) an adjective, not a verb, and it doesn’t sound like enumerate.

Numerate means (roughly) mathematically literate — the opposite of innumerate. A numerate person is, among other things, able to tell the difference between fallacies and statistically sound conclusions. The world has too many innumerate people.

“Preaching to the choir” is telling something to people who likely already believe it. My usage of the word choir comes from this phrase.

Hence, the Numerate Choir: people who are using data, analytics, and science to build businesses and improve the world.