Data Science is the Old New Thing

In the last six months the term “Big Data” has reached the mainstream media. Companies are furiously hiring brilliant data scientists to make sense of all the data they have. Most revolutionary: I no longer need to sheepishly disclose that my college major was “Mathematical and Computational Science.”

Is this a step forward? Absolutely.

Does it mean that those same companies are doing all they can to use data in intelligent ways? In most cases, no.

The de facto current “data scientist” model is the third of four steps to data utopia. Here are the steps:

1) Null Set. You don’t collect a lot of data. You don’t know what you don’t know, and you operate solely based on intuition. No serious Internet companies are here, but some small startups are, as are most small offline businesses.

2) Collect Only. You collect data, but you don’t really know how to use them. Maybe you have non-technical marketers telling engineers to run specific queries. This rarely works, because most (not all) marketers don’t have an analytical understanding of the technology or the product, and the engineers just do what they’re told. This is still pretty common.

3) Data Economics. You collect data, but have a line in the sand between builders of products and those who use the data. Your product and engineering teams spec and build features; the Analytics/Data Science team takes the data available to them and uses them to figure out cool stuff.

4) Data Hacking. Your systems (technical and human) are structured so that data geekery is integrated with product and engineering. Your team asks not (only) what they can do with the data they have, but how they can get the data they want. This requires really smart data geeks who can understand how to design a product and can write code.

Let’s say Twitter decides they want to figure out how much coffee each of their users drinks.

If they were at Null Set, they would be completely lost.

If they were at Collect Only, they might look at the frequency with which users mention “coffee”, “Starbucks”, “latte”, and maybe “caffeine” or “cafe”. Those with two or more mentions would be pegged as heavy coffee drinkers. And their accuracy would be abominable.
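
As a rough illustration, the keyword heuristic described above might look like the sketch below. The function names, the tokenizing regex, and the sample tweets are all hypothetical; only the five keywords and the "two or more mentions" threshold come from the text.

```python
import re

# Keywords named in the text; a real marketer's list is an assumption.
COFFEE_KEYWORDS = {"coffee", "starbucks", "latte", "caffeine", "cafe"}


def count_coffee_mentions(tweets):
    """Count keyword tokens across all of one user's tweets."""
    mentions = 0
    for tweet in tweets:
        # Lowercase and split into alphabetic tokens.
        tokens = re.findall(r"[a-z]+", tweet.lower())
        mentions += sum(1 for token in tokens if token in COFFEE_KEYWORDS)
    return mentions


def is_heavy_coffee_drinker(tweets, threshold=2):
    """Flag a user as a heavy drinker at `threshold` or more mentions."""
    return count_coffee_mentions(tweets) >= threshold


# A tea drinker who mentions a cafe and caffeine still gets flagged.
tea_drinker = ["My cafe table wobbles", "Caffeine-free tea only for me"]
print(is_heavy_coffee_drinker(tea_drinker))
```

The tea drinker in the usage example is classified as a heavy coffee drinker, which is the kind of error that makes this approach so inaccurate.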

If they were at Data Economics, they would have a couple of smart data scientists, and they’d probably manually classify the people they knew as heavy/light/non-coffee drinkers. They’d use those manually classified accounts, plus keyword data and connection data, to build a statistical model that estimates the coffee consumption for any given user. They’d do okay (they’d uncover keywords that are more predictive than the marketers’ list from Collect Only, and find one or two surprising correlations), but they still probably wouldn’t be able to build a great model, because the data Twitter has are not very predictive of coffee drinking.
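
To make the "manually classified accounts plus keyword data" idea concrete, here is a minimal sketch that trains a naive Bayes classifier on a handful of hand-labeled users. Naive Bayes is my stand-in for the unspecified statistical model; the labels, the sample tweets, and every function name are invented for illustration, and connection data is left out entirely.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_users):
    """Build word and label counts from (tweets, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of users
    vocabulary = set()
    for tweets, label in labeled_users:
        label_counts[label] += 1
        for tweet in tweets:
            for word in tokenize(tweet):
                word_counts[label][word] += 1
                vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def predict(model, tweets):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts, vocabulary = model
    total_users = sum(label_counts.values())
    # Words never seen in training carry no information; skip them.
    words = [w for t in tweets for w in tokenize(t) if w in vocabulary]
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_users)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one (Laplace) smoothing avoids log(0).
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled accounts.
labeled = [
    (["double espresso before standup", "third espresso today"], "heavy"),
    (["espresso machine broke, sad morning"], "heavy"),
    (["green tea and a long walk"], "non"),
    (["herbal tea before bed"], "non"),
]
model = train(labeled)
print(predict(model, ["espresso again"]))
```

In this toy data the model picks up "espresso" as a signal even though nobody put it on a keyword list, which is the kind of discovery the labeled accounts make possible. It says nothing about how well such a model would work on real Twitter data.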

Were they at Data Hacking, they’d do everything in Data Economics. But then they’d also build out a dozen or so features to yield future insights. Maybe they’d suggest following Starbucks or Peet’s and use acceptance as a proxy for liking coffee. Maybe they’d create a “tweet about my coffee” iPhone/Android app. Maybe they’d ask a small percentage of people who’d just tweeted from a coffee shop what they ordered. Maybe they’d ask users who just tweeted about something coffee-related if they wanted a coupon for a free Starbucks coffee; presumably those who accept are much more likely to be coffee drinkers than those who don’t.

I offer no guarantees of the effectiveness of any of those ideas. The point is that they’re interactive, and they facilitate iteration: hack, analyze, hack, analyze. The hackers are the analyzers, and vice versa.

Want to measure reputation or influence? Algorithms on top of the existing data from Facebook/LinkedIn/Twitter aren’t enough — you need better data. Cubeduel — which allows users to express a preference between two people they’ve worked with — is an interesting product because it’s closer to “real” reputation measurement. But on the reputation front, the three big social networks haven’t built a lot of clever measurement features themselves.

And that’s not unique to them: few if any Internet companies are in the Data Hacking phase today. The good ones are doing Data Economics: they have great data scientists (or search relevance teams, etc.), but those scientists are siloed. “Here are the data available to you,” they say. “Go figure something out.”

That’s limiting, because there’s an entire class of problems that need Data Hacking brilliance to be solved. Andrew Chen’s post about Growth Hacker being the new VP Marketing is an example of companies moving to step 4 to improve their distribution: growth hackers embody the “data + dev + product” approach and do better on distribution than anyone else. In the next few years, we’ll see visionary data hackers tackle more of our deepest, most challenging problems in interactive, iterative ways.

Mike Greenfield founded Circle of Moms and Team Rankings, led LinkedIn's analytics team from 2004-2007, and built much of PayPal's early fraud detection technology. Ping him at [first_name] at mikegreenfield.com.