I spent the first four years of my career building predictive models. For PayPal (my employer), knowing the probability that a user or transaction would turn out fraudulent was incredibly valuable. 60% probability this transaction is fraudulent? Automatically block the transaction. 2% probability this large merchant will “take the money and run”? Get an agent on his account, look at his transaction patterns, talk to him on the phone, keep a close eye on him. 0.01% chance he’ll wind up bad? Move on to the next person: this guy isn’t interesting!
So we spent a lot of time building and optimizing statistical models that would estimate risk. Unlike the maligned risk models from the financial world, our models were completely empirical and built from huge data sets — thus likely to be stable and accurate (barring major product changes).
But because of the size of PayPal’s data set, the process of building a model was fairly cumbersome. Say we were looking at $500+ transactions for fraud by the merchant. We’d go back and gather almost all previous transactions in that range. For each payment, we’d gather a huge number of other features, as of the time the transaction happened: how much money, what day of the week, what time of day, how many payments had the seller accepted in the 24 hours prior, what was the largest payment the buyer had ever sent, what was the historical fraud rate on the IP address the seller was using… plus 1,000 other things. We’d use those 1,000 features to try to predict whether the transaction would wind up being marked as merchant fraud.
Armed with lots of data, we’d build a predictive model that could ultimately score any new transaction by looking at those features. Then we’d see how the candidate model performed against an independent test set. We’d repeat a few times to try to build a better model, then deploy the best model to production. All in all, it would take a fraud scientist at least a few weeks to build out a new model.
Once deployed, that model would score every transaction (or user), providing an extremely valuable way to highlight the riskiest people and transactions in the system. In many cases, we’d identify (and ultimately stop) half of the fraud in the system while blocking or examining well under 1% of the transactions.
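For concreteness, here’s a minimal sketch of that workflow, assuming the historical feature snapshots have been dumped to a flat file with a fraud label attached; the file name, columns, and use of scikit-learn’s logistic regression are illustrative stand-ins, not PayPal’s actual tooling.

```python
# Minimal sketch of the model-building loop described above. The file,
# column names, and logistic regression are illustrative stand-ins.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

snapshots = pd.read_csv("merchant_fraud_snapshots.csv")  # hypothetical dump
features = snapshots.drop(columns=["txn_id", "is_fraud"])
labels = snapshots["is_fraud"]

# Hold out an independent test set to judge each candidate model.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score the test set, flag only the riskiest ~1% of transactions,
# and check what share of the fraud that catches.
scores = model.predict_proba(X_test)[:, 1]
threshold = pd.Series(scores).quantile(0.99)
flagged = scores >= threshold
fraud_caught = y_test[flagged].sum() / y_test.sum()
print(f"review rate: {flagged.mean():.1%}, fraud caught: {fraud_caught:.1%}")
```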
This was a key asset for PayPal: because of our data and our models, we could predict better than anyone else.
I applied this same approach to predicting the outcome of sporting events for Team Rankings. For Team Rankings, the features are things like offensive rebound percentage, power rating in away games, and distance traveled for this game. And we try to predict a score (Stanford expected to win by 4.7 points), a winner (Stanford has a 72% chance to win), or the team that will cover the point spread (Stanford has a 54% chance to cover). We’ve done pretty well over the years, correctly predicting close to 55% of games against the spread (50% is average; 52.5% or higher means you’re actually profitable).
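Why is 52.5% the bar? Spread bets are typically priced at -110 (risk $110 to win $100), so the breakeven win rate is a bit above 52%. A quick back-of-the-envelope check, with nothing here specific to Team Rankings:

```python
# Breakeven win rate against the spread at standard -110 pricing:
# risk 110 to win 100, so you need p * 100 >= (1 - p) * 110.
risk, win = 110, 100
breakeven = risk / (risk + win)
print(f"breakeven win rate: {breakeven:.1%}")   # ~52.4%

# Expected profit per bet at the ~55% long-run accuracy mentioned above.
p = 0.55
print(f"EV per $110 risked: ${p * win - (1 - p) * risk:.2f}")   # $5.50
```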
So when I first joined LinkedIn to work on analytics, I found I had this awesome hammer (building predictive models) and went looking for nails. I was a little disappointed to find that there weren’t any obvious ones. Predict how likely someone was to become a premium subscriber? Kind of interesting, but unclear what you actually do with it. Estimate the likelihood someone changes jobs in the next year? A little more interesting — likely changers can be highlighted for recruiters and get more “new job”-focused content — but only really valuable if the differences between high likelihood and low likelihood are large.
For a couple of years, Circle of Moms was much the same story: no obvious nails to slam with the predictive modeling hammer. But then our editorial team launched The RoundUp and was writing a number of articles each week, in large part to allow us to cut the Facebook cord. What if we could estimate the likelihood that a mom would click on an article emailed to her, before actually sending the article? Then we could send out the article she’d be most likely to read.
This time around, I structured things a little differently. Not wanting to spend weeks and weeks re-assembling “at the time of transaction” data as we had at PayPal, we computed feature values (e.g., how old is this mom’s youngest kid, what percentage of food-related emails has she historically clicked on) as of the time we sent the email, and immediately stored those values in the database.
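Here’s a rough sketch of that snapshot-at-send-time idea; the table, the two features, and SQLite are made-up stand-ins for our real schema and database.

```python
# Sketch of recording feature values at the moment an email goes out, so no
# "as of that time" reconstruction is ever needed. Table, columns, and SQLite
# are illustrative stand-ins for the real schema.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the production database
conn.execute(
    """CREATE TABLE IF NOT EXISTS email_feature_snapshots (
           user_id INTEGER, article_id INTEGER, sent_at TEXT,
           youngest_kid_months REAL, food_email_ctr REAL, clicked INTEGER
       )"""
)

def snapshot_features(user_id, article_id, features):
    """Store the feature vector exactly as it looked when we sent the email."""
    conn.execute(
        "INSERT INTO email_feature_snapshots VALUES (?, ?, ?, ?, ?, NULL)",
        (user_id, article_id, datetime.now(timezone.utc).isoformat(),
         features["youngest_kid_months"], features["food_email_ctr"]),
    )
    conn.commit()

# When a click (or non-click) comes in later, it just fills in `clicked` on
# this row, leaving a ready-made training table.
snapshot_features(42, 7, {"youngest_kid_months": 18.0, "food_email_ctr": 0.12})
```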
Rather than build one statistical model to predict clickthrough rate for all user-email combinations, we built statistical models specific to each article. If we’d built just one predictive model, we would have used characteristics of the article as model features, and we could have started scoring people as soon as we began sending an article. However, our article-specific understanding of who would click on what would have been limited: no doubt the top predictor of clickthrough rate would have been overall email clickthrough rate.
By building an article-specific model, the unique predictors for this article — e.g., having a child between 12 and 24 months for that “how to start potty training” story — would feature more prominently in the model. The downside? We’d need to create lots and lots of statistical models, and we couldn’t use our models until we gathered enough data to build them out.
To handle building that number of models, we had to automate the process of model creation. Fortunately (okay, intentionally), we were already saving everything we’d use to build a statistical model in a single table we could dump out with one simple query. Building the technology out took a month or two, but once we did, everything was automated: no data scientists needed. Every week, we’d build ten or twenty models for the new articles we were testing with a small batch of users. The following week, we’d generate a score (predicted clickthrough odds) for each article for each user and send each user the best-scoring article.
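In simplified form, the weekly loop looked something like the sketch below, reusing the hypothetical snapshot table from the earlier example; the column names and the choice of logistic regression are illustrative, not the exact production system.

```python
# Simplified sketch of the automated weekly loop: fit one clickthrough model
# per article from the snapshot table, then pick each user's best article.
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["youngest_kid_months", "food_email_ctr"]  # illustrative subset

def build_article_models(training_rows: pd.DataFrame, min_rows: int = 500):
    """Fit one model per article, skipping articles without enough labeled sends."""
    models = {}
    for article_id, rows in training_rows.groupby("article_id"):
        if len(rows) < min_rows or rows["clicked"].nunique() < 2:
            continue  # keep testing on small batches until there's enough data
        models[article_id] = LogisticRegression(max_iter=1000).fit(
            rows[FEATURES], rows["clicked"]
        )
    return models

def pick_best_article(models, users: pd.DataFrame):
    """Score every (user, article) pair and keep each user's highest-scoring article."""
    best = {}  # user_id -> (article_id, predicted clickthrough probability)
    for article_id, model in models.items():
        probs = model.predict_proba(users[FEATURES])[:, 1]
        for user_id, p in zip(users["user_id"], probs):
            if p > best.get(user_id, (None, -1.0))[1]:
                best[user_id] = (article_id, p)
    return {user_id: article_id for user_id, (article_id, _) in best.items()}
```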
It turned out that all of this worked well. Our models were pretty good at finding targeted content for our six million moms, and once we tweaked the model building technology, everything scaled pretty nicely.
Automation is incredibly important. It democratizes the process of building and using statistical models, so that a small startup (with lots of data) can build pretty good statistical models without a team of statisticians. These automated statistical models will almost inevitably perform more poorly than their human-built counterparts, but they’re close enough to be competitive.
Kaggle allows companies like PayPal, Netflix, and Allstate to upload data sets; data experts then compete to build the best predictive model to win a cash prize. Somewhat counter-intuitively, this has a similar impact to automated statistical modeling: it commoditizes the process of coming up with predictive algorithms. A decade ago the algorithms were key. What we’ll likely see going forward is that data matter a lot, algorithms are close to a commodity, and clever applications of those algorithms grow in importance.
As it becomes easier to build predictive models (and most companies have a really long way to go to get there), we’ll see more and more applications. Here are some examples.
1) Finding the odds a user will like each of 10 or 20 choices. This is what we did at Circle of Moms, and is a great way to find the best content for an email, daily deal, or small (mobile) screen. It has huge potential for many content and purchase-driven sites, and is a key component of Retention Science, in which I’m an investor.
2) The last mile of a complex recommendation system. Amazon could likely get significant improvements by building predictive models for whether someone will buy (click on?) each of their top 10,000 items (i.e., 10,000 separate models). Then, when I go to the home page and other logic picks out the top 50 for me, they’d apply those models to order that top 50 (scoring all 10,000 for every visitor would likely be too intensive); see the sketch after this list.
3) Advanced user segmentation. As I mentioned before, it’s not obvious what LinkedIn gains from knowing my likelihood of becoming a subscriber. But if they can break that out in a few separate ways — odds of me being a premium job seeker, odds of me buying a recruiting package, odds of me being a non-subscriber who nonetheless invites a lot of friends — they can upsell the functionality that I’m most likely to use. If they want to get really clever, they can A/B test different upsells and build statistical models to predict someone’s response rate to each one.
4) Likelihood of being interested in a specific add-on. If Hertz determined that there’s a 0.5% chance I’d actually buy the additional insurance option or the prepaid gas, they likely wouldn’t bother trying to sell it to me. On the other hand, if it were 50%, they would (and should). If they avoided pitching me on things I’d never buy, they’d move me through the line more quickly and piss me off less; meanwhile, they could still use upsells to drive profit by pitching customers who are more likely to buy.
5) Predicting someone’s movement from where they are right now. Given who and what they are, and that they’re walking or driving in a certain direction, what is the likelihood that they enter this store or that restaurant?
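Here’s the sketch promised in item 2: some cheaper candidate step hands back roughly 50 item ids, and per-item purchase models reorder only that shortlist. The per-item models, the feature vector, and the numbers are all hypothetical; this is the shape of the idea, not how Amazon actually does it.

```python
# Hypothetical "last mile" reranking from item 2: only the ~50 candidates get
# scored by per-item purchase models, never the full catalog.
def rerank_last_mile(candidate_ids, user_features, item_models, top_k=10):
    """Reorder a small candidate list by each item's predicted purchase probability.

    `user_features` is a 1-D numpy feature vector for the current visitor;
    `item_models` maps item id -> a fitted scikit-learn-style classifier.
    """
    scored = []
    for item_id in candidate_ids:
        model = item_models.get(item_id)
        if model is None:
            continue  # no model (yet) for this item; let the default order stand
        p_buy = model.predict_proba(user_features.reshape(1, -1))[0, 1]
        scored.append((p_buy, item_id))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [item_id for _, item_id in scored[:top_k]]

# Usage: candidates come from whatever cheaper logic already exists.
# ranked = rerank_last_mile(candidate_ids, user_features, item_models)
```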
To be clear, automated predictive modeling is not a cure-all. It’s too expensive (in terms of processing power) to apply to every user-item pair (every user and every book, for example). It doesn’t add much when a user has already strongly indicated intent (this friend is in my address book, or I’ve searched for this exact flight); in those cases nothing fancy is required. But in many other areas, expect to see predictive modeling become automated and democratized, as data become more important and algorithms less so.