Why “A/B Testing vs. Holistic UX Design” Is a False Dichotomy

Many UX designers are skeptical of A/B testing. Is their skepticism warranted?

I’m about as hardcore a data geek as they come, and my answer is: Yes. But…

The “but” is absolutely key: A/B testing is almost always conducted in an overly simplistic way that focuses solely on hill climbing and not on understanding. A/B testing need not only be about growth hacking and optimizing clickthrough rates: done right, A/B testing contributes to deep knowledge about how people respond to different interfaces.

Most companies use A/B testing to try to optimize one result. If their goal is to sell widgets, they test having a blue button versus having a red button for a few days. If, after that period, more people who saw the red button bought the widget, they resolve the test and show the red button to everyone.

This is simple, and it’s effective if you’re trying to optimize exactly one metric. It’s not effective for comparing a new home page design with an old one, unless there’s only one thing you care about. Most companies care about more than one metric — they want to sell stuff, they want people to come back to their site regularly, and they want a strong community of engaged users on their site — so the red button/blue button “which sells more?” approach doesn’t apply.

But swearing off A/B testing because of the way it’s currently done would be like swearing off smartphones and tablets because you didn’t like the Apple Newton. A/B testing methodologies are currently quite primitive and need to get a lot better.

So how do you get to the next step? You evaluate two (or more) designs holistically. 95% of the time, a home page is designed to do more than optimize a single stat. The home page should facilitate browsing, product understanding, and usage. Those are measurable things, but tougher to measure than simply “clicks” or “purchases”. How many pages are people in each group viewing? How frequently are they coming back? What kinds of activity are they doing? How many of them are closing their accounts?

This sort of testing is hard to do — and done by very few Internet companies today — for a few reasons:

1) It’s technically challenging. At Circle of Moms, we built a system to track usage on both sides of a test, so that one could go to a test page and compare the two sides along dozens of stats. It required us to associate every event with every active A/B test the user had ever been part of. We also had to come up with clever ways to filter out outliers, like people who have viewed 5000 pages in a month. Setting this up is not impossible, but takes a fair amount of work to scale and maintain. As far as I’m aware, none of the mainstream web analytics companies are well set up to allow this kind of testing.

2) If you’re looking at many different stats when comparing two sides of a test, it’s tough to know what matters and what doesn’t. Consumers of test results — product managers, developers, designers, execs — need some intuition for what’s important, and shouldn’t feel overwhelmed by excessive data. Having only one metric to look at makes things a lot simpler, but it doesn’t reflect what the business actually needs.

3) Politically and emotionally, it’s tough to throw away features that a team has spent a long time on. If you switch over to a new home page without properly testing it, everyone can celebrate the hard work that went into it, and not worry about how it actually performs.
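A minimal sketch of what that kind of multi-metric comparison might look like, using hypothetical per-user numbers (the 5,000-page-view cap is the crude outlier filter from point 1; the data, user IDs, and metric names are all invented for illustration):

```python
from statistics import mean

# Hypothetical per-user monthly stats for two sides of a test.
side_a = {
    "u1": {"page_views": 40, "visits": 6, "purchases": 1},
    "u2": {"page_views": 12, "visits": 2, "purchases": 0},
    "u3": {"page_views": 8000, "visits": 90, "purchases": 2},  # outlier
}
side_b = {
    "u4": {"page_views": 55, "visits": 8, "purchases": 1},
    "u5": {"page_views": 20, "visits": 3, "purchases": 1},
}

MAX_PAGE_VIEWS = 5000  # crude cap, like the 5000-pages-in-a-month example

def summarize(side):
    """Average each metric across users, dropping extreme outliers first."""
    kept = [stats for stats in side.values() if stats["page_views"] <= MAX_PAGE_VIEWS]
    return {metric: mean(user[metric] for user in kept) for metric in kept[0]}

print(summarize(side_a))
print(summarize(side_b))
```

A real system also has to worry about which differences are statistically meaningful, and about associating each event with every test the user has been part of — but the core idea is just this: many metrics per side, outliers filtered, side by side.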

There’s a substantial cost to this lack of testing. I’ve seen a number of companies with data scale spend weeks or months building out a new version of something (home page, signup flow), not A/B test it, then indirectly discover that the new version performs worse than the old one. This leads to two terrible options: roll back to an old version of the product — a technical and logistical nightmare and a morale hit — or keep a version of the product that is significantly less effective.

The alternative is to use A/B testing as part of a holistic design approach. Have some outlandish idea for what the home page should look like? Sketch it out, get in-person feedback, refine. Build out a basic version, maybe get more in-person feedback, refine. A/B test it with a small percentage of users, see how usage looks relative to the baseline. If the results look terrible, move on to the next idea. If they look promising but have some issues (maybe usage is much better, but engagement is down), refine and A/B test the refined version. Keep repeating this process until you get to a design that is clearly more effective than the baseline.

Everyone knows the famous quote:

There are three kinds of lies: lies, damned lies, and statistics.

It’s easy to feel “lied to” — misled would be a better term — after being fed inappropriate statistics. A/B testing small changes, with the goal of optimizing one variable, often falls into this category: it leads to short-term, non-holistic product development processes.

The solution to being misled isn’t to stop using statistics, it’s to make sure the right ones are being used. Nor should one throw out the A/B testing framework baby with the poor methodology bathwater. Rather, companies need to measure the right things and ask the right questions that allow them to deeply understand the effects of their changes. Optimize sometimes; understand always. The end result of this approach is a holistic A/B testing mentality that everyone should be able to embrace.

The Forgotten Art of Estimation

My third grade class spent a lot of time studying “estimation”. Exercises were things like guessing how many jelly beans were in a jar, or how long it would take to fill up a gallon of water from a slow drip.

At the time, it seemed like a silly thing for eight and nine year olds to spend time on. Who needs to know how many jelly beans are in a jar, unless they literally want to become a bean counter?

Of course, I had no inkling that the Internet would emerge, and would be a dominant force in the world, and I’d spend most of my waking hours building products for it (give me a break, I was 9). So I didn’t know that estimation would be an essential skill — arguably the essential skill — for anyone deciding how and what to build online twenty-odd years later.

Why does estimation even matter? Well, if you can accurately estimate the effect of building Y or changing X, you wind up spending your time building things that your customers will use and that will help your business. It’s an especially valuable tool when you’re building something well-defined on top of your existing product. Twitter adding their own check-in feature? Should be pretty easy to make a good estimate of the effects. Google pushing out their glasses product? Much harder to estimate.

Many very smart people in Silicon Valley are surprisingly poor at estimation. LinkedIn is a rare visionary company, but in 2005 and 2006 we wasted a lot of time on features that had a very small probability of moving the needle for the business.

Let’s set the stage for those feature decisions. It’s 2006. LinkedIn has around 50 employees, and around 5 million registered users. We’re doing okay: our year-old job posting and premium subscription products are generating real revenues, and we’re getting close to profitability. But our new user growth is flat, most of our users have zero or one connection, and morale is only fair: half our engineering team left last summer for the hottest tech company around — Google.

The board and/or exec team decided to focus on LinkedIn’s usage more than on revenue or growth. That’s a debatable decision, but not necessarily a bad one; for this exercise, let’s just work under the assumption that building products to increase usage among existing users is the right thing for the team to do.

To estimate increases in user engagement, it’s valuable to understand and look at the individual levers that can drive those increases. For a social network like LinkedIn circa 2006, there were three major ways people would return to visit the site:

1) They’d directly type in the URL, or click on a bookmark in their browser.
2) They’d get pinged (emailed) after a friend or acquaintance performed an action: invite them to be a connection, respond to their comment, etc. They’d click through on that email.
3) They’d get pinged (emailed) independent of actions by other people, e.g., a marketing message or feature update. They’d click through on that email.

This isn’t exhaustive or current for LinkedIn or other sites: Google search is no doubt a huge driver of re-engagement for sites like Yelp; Like/Share/Tweet widgets across the web encourage lots of usage of Facebook/LinkedIn/Twitter; category 2 is much more built out than it was in 2006 with mobile notifications, Facebook and Twitter integration and more; mobile apps create huge amounts of category 1 usage today but didn’t in 2006.

Option #3 (non-social emails) wasn’t really on the table for LinkedIn in 2006: we’d had a couple of successful pure marketing emails, but no template for replicating that success (it likely would have been possible — sections 3 and 4 of my post on the value of large data sets explain how we did this at Circle of Moms — but would have been more of a pivoter move than a visionary one).

That left us two ways to increase engagement with a new feature. The feature could be so memorable that it would get someone to come back more frequently on their own, or it could generate more, and more relevant, pings from their friends, inviting them to connect or telling them about activity.

So now we have a baseline for estimation: anything we’d build to increase LinkedIn’s engagement would have to do one of those things.

Unfortunately, we built two major features in 2006 — Services and Answers — and neither one ever really had a chance to succeed at increasing engagement.

Services was a way for service providers — accountants, lawyers, auditors, and many more — to list their offerings. Providers would structure their LinkedIn profiles more carefully, so that they’d fit into a directory. Other users could then go and find, for instance, Sarbanes Oxley experts in the Bay Area.

Services, our team hoped, would provide one clear use case for the (many) people who asked “what do I use LinkedIn for?”: you use LinkedIn to find an accountant. So that raises the question… how often do people need to find accountants? Considering that most Americans do their own taxes, and those who don’t probably keep their accountant for an average of at least a few years, it’s unlikely that over 20% of American households look for an accountant in a given year. So that means LinkedIn might have an average user look for an accountant once every five years.

Add in lawyers, Sarbanes Oxley experts, branding consultants and dozens of other specialist service providers; you get more frequency — but not much more. Best case, people find themselves searching for those sorts of specialists every few months. So even if LinkedIn got all of those searches — and I’ve heard about a site called Google where that search thing is popular — we would have brought our users back to the site maybe once every two months. Since there was nothing inherently social about the Services product, friend pings weren’t a viable means of increasing usage. In other words, we weren’t likely to get the deep engagement we were looking for.
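The arithmetic here is worth writing out explicitly. A quick back-of-envelope sketch, using the rough guesses from the text (not real data):

```python
# Generous upper bound from the text: under 20% of American households
# look for an accountant in a given year.
searches_per_user_per_year = 0.20
years_between_searches = 1 / searches_per_user_per_year  # roughly every 5 years

# Best case after adding dozens of other specialist categories:
# one relevant search every ~2 months, i.e. ~6 per year -- and LinkedIn
# would capture only a fraction of those.
best_case_visits_per_year = 12 / 2

print(years_between_searches)
print(best_case_visits_per_year)
```

Five minutes of this kind of arithmetic, done before building, is the whole point of the estimation discipline discussed below.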

The other product we built was Answers. Yahoo had recently built what was seen by Silicon Valley as a very successful product — Yahoo Answers had much better usage than Google Answers or any of its competitors — and LinkedIn saw opportunity in building out a professional version. Answers had a vision somewhat akin to Quora, but without functionality to follow (non-connection) users or follow topics.

Let’s again go back to the key engagement questions: would Answers provide a reason for people to come back to LinkedIn regularly of their own accord, and/or because they were pinged by friends?

The vast majority of Internet users don’t contribute deep content, whether questions or otherwise. We didn’t have any reason to think that questions would be special; again a best case scenario would be that users might ask a question every few months.

That means two things. First, because most of our users had fewer than ten connections, it would be very rare for a typical one to get pinged saying “your friend has asked a question”. Second, if and when that user logged in, there would be the same issue: not many questions from friends.

In summary, we weren’t creating much directly relevant content and weren’t providing people with good reasons to come back to the site on their own. Not a winner.

[Aside: it’s even debatable whether Quora has solved these problems, in spite of its better model (followed topics and followed users), which better facilitates discovery. I love the Quora product, and they have a tremendous amount of interesting content, but I’d guess that most of their users aren’t regularly getting pinged about relevant stuff, nor are they logging in randomly to see what’s going on.]

LinkedIn’s estimation process failed because the collective logic within the company was as follows:

1) It’s not clear to people how they use LinkedIn. How do we give them more obvious use cases?
2) Once we give them those use cases, they’ll use the product.

Unfortunately, the question “how and how many of those people will use the product for each use case?” didn’t really get asked, and we built a bunch of stuff that wound up getting scrapped. Good estimation generally happens when you ask good but simple questions, and follow them up with other good and relevant questions. What percentage of people will do X? Once they do X, what percentage will then do Y?
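Those chained questions amount to multiplying an audience by each step’s conversion rate. A minimal sketch, with all numbers hypothetical:

```python
from functools import reduce

def funnel_estimate(audience, step_rates):
    """Estimate how many people survive a chain of 'what % will do X?' steps."""
    return reduce(lambda remaining, rate: remaining * rate, step_rates, audience)

# Hypothetical Q&A feature: 5M registered users, 2% ever ask a question,
# 30% of askers get a useful answer, 50% of those come back the next month.
engaged_users = funnel_estimate(5_000_000, [0.02, 0.30, 0.50])
print(engaged_users)  # on the order of 15,000 users
```

The specific rates are guesses; the value is in being forced to write them down and defend each one before building anything.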

Of course, I didn’t manage to convince my peers that other features (most notably, those that increased the number of connections a typical user had) were more important than Answers or Services. I’m glad to report that LinkedIn made many solid product decisions before and after this period, and has of course attained great success. Still, these were two major instances where the estimation process left something to be desired.

At Circle of Moms, we were okay but not great at estimation. That’s largely on my shoulders — it’s something management needs to push people on, asking them to estimate the effects of product features and then justify those assumptions. The process doesn’t need to be anything formal; the back of the envelope is fine. I’ve heard that Zynga is phenomenally accurate at guessing the effects of different features — in part because they’ve built so many games — and that serves as an enormous advantage. It’s also a skill that “business” types — most notably finance types — use well to size markets. In my next startup, I’ll make a conscious effort to bring the skill of estimation to my product team.

Data Science is the Old New Thing

In the last six months the term “Big Data” has reached the mainstream media. Companies are furiously hiring brilliant data scientists to make sense of all the data they have. Most revolutionary: I no longer need to sheepishly disclose that my college major was “Mathematical and Computational Science.”

Is this a step forward? Absolutely.

Does it mean that those same companies are doing all they can to use data in intelligent ways? In most cases, no.

The de facto current “data scientist” model is the third of four steps to data utopia. Here are the steps:

1) Null Set. You don’t collect a lot of data. You don’t know what you don’t know, and you operate solely based on intuition. No serious Internet companies are here, but some small startups are, as are most small offline businesses.

2) Collect Only. You collect data, but you don’t really know how to use them. Maybe you have non-technical marketers telling engineers to run specific queries. This rarely works, because most (not all) marketers don’t have an analytical understanding of the technology or the product, and the engineers just do what they’re told. This is still pretty common.

3) Data Economics. You collect data, but have a line in the sand between builders of products and those who use the data. Your product and engineering teams spec and build features; the Analytics/Data Science team takes the data available to them and uses them to figure out cool stuff.

4) Data Hacking. Your systems (technical and human) are structured so that data geekery is integrated with product and engineering. Your team asks not (only) what they can do with the data they have, but how they can get the data they want. This requires really smart data geeks who can understand how to design a product and can write code.

Let’s say Twitter decides they want to figure out how much coffee each of their users drinks.

If they were at Null Set, they would be completely lost.

If they were at Collect Only, they might look at the frequency with which users mention “coffee”, “Starbucks”, “latte”, and maybe “caffeine” or “cafe”. Those with two or more mentions would be pegged as heavy coffee drinkers. And their accuracy would be abominable.
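As a toy illustration of that keyword approach — the word list and the two-mention threshold come from the text, while the matching logic is my own crude sketch:

```python
COFFEE_WORDS = {"coffee", "starbucks", "latte", "caffeine", "cafe"}

def coffee_mentions(tweets):
    """Count words matching the coffee keyword list (very naive matching)."""
    return sum(
        1
        for tweet in tweets
        for word in tweet.lower().split()
        if word.strip(".,!?#@") in COFFEE_WORDS
    )

def pegged_as_heavy_drinker(tweets):
    # Two or more mentions => "heavy coffee drinker", per the example.
    return coffee_mentions(tweets) >= 2

print(pegged_as_heavy_drinker(["Morning latte at Starbucks!", "shipping code"]))  # True
print(pegged_as_heavy_drinker(["I prefer tea"]))  # False
```

Mentioning coffee twice is a weak proxy for actually drinking it — which is exactly why the accuracy would be abominable.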

If they were at Data Economics, they would have a couple of smart data scientists, and they’d probably manually classify the people they knew as heavy/light/non-coffee drinkers. They’d use those manually classified accounts, plus keyword data and connection data, to build a statistical model that estimates the coffee consumption for any given user. They’d do okay — they’d uncover keywords that are more predictive than the marketers’ list from Collect Only, and find one or two surprising correlations — but they still probably wouldn’t be able to build a great model because the data Twitter has are not very predictive of coffee drinking.

Were they at Data Hacking, they’d do everything in Data Economics. But then they’d also build out a dozen or so features to yield future insights. Maybe they’d suggest following Starbucks or Peet’s and use acceptance as a proxy for liking coffee. Maybe they’d create a “tweet about my coffee” iPhone/Android app. Maybe they’d ask a small percentage of people who’d just tweeted from a coffee shop what they ordered. Maybe they’d ask users who just tweeted about something coffee-related if they wanted a coupon for a free Starbucks coffee; presumably those who accept are much more likely to be coffee drinkers than those who decline.

I offer no guarantees of the effectiveness of any of those ideas. The point is that they’re interactive, and they facilitate iteration: hack, analyze, hack, analyze. The hackers are the analyzers, and vice versa.

Want to measure reputation or influence? Algorithms on top of the existing data from Facebook/LinkedIn/Twitter aren’t enough — you need better data. Cubeduel — which allows users to express a preference between two people they’ve worked with — is an interesting product because it’s closer to “real” reputation measurement. But on the reputation front, the three big social networks haven’t built a lot of clever measurement features themselves.

And that’s not unique to them: few if any Internet companies are in the Data Hacking phase today. The good ones are doing Data Economics: they have great data scientists (or search relevance teams, etc.), but those scientists are siloed. “Here are the data available to you,” they say, “go figure something out.”

That’s limiting, because there’s an entire class of problems that need Data Hacking brilliance to be solved. Andrew Chen’s post about Growth Hacker being the new VP Marketing is an example of companies moving to step 4 to improve their distribution: growth hackers embody the “data + dev + product” approach and do better on distribution than anyone else. In the next few years, we’ll see visionary data hackers tackle more of our deepest, most challenging problems in interactive, iterative ways.