UX designers are skeptical of the use of A/B testing. Is their skepticism warranted?
I’m about as hardcore a data geek as they come, and my answer is: Yes. But…
The “but” is absolutely key: A/B testing is almost always conducted in an overly simplistic way that focuses solely on hill climbing and not on understanding. A/B testing need not only be about growth hacking and optimizing clickthrough rates: done right, A/B testing contributes to deep knowledge about how people respond to different interfaces.
Most companies use A/B testing to try to optimize a single result. If their goal is to sell widgets, they test a blue button against a red button for a few days. If, after that period, more people who saw the red button bought the widget, they resolve the test and show the red button to everyone.
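For concreteness, here’s roughly what that single-metric approach boils down to. The conversion counts below are made up, and the pooled two-proportion z-test is just one standard way of calling a winner; treat this as a sketch, not a prescription:

```python
# Minimal sketch of a single-metric A/B comparison: did the red button (B)
# sell more widgets than the blue button (A)? All numbers are hypothetical.
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Z-score and two-sided p-value for the difference between two
    conversion rates, using a pooled-variance z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 10,000 visitors per side
z, p = two_proportion_z_test(conv_a=310, n_a=10_000, conv_b=370, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # "resolve the test" if p is small enough
```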
This is simple, and it’s effective if you’re trying to optimize exactly one metric. It’s not effective for comparing a new home page design with an old one, unless there’s only one thing you care about. Most companies care about more than one metric — they want to sell stuff, they want people to come back to their site regularly, and they want a strong community of engaged users on their site — so the red button/blue button “which sells more?” approach doesn’t apply.
But swearing off A/B testing because of the way it’s currently done would be like swearing off smartphones and tablets because you didn’t like the Apple Newton. A/B testing methodologies are currently quite primitive and need to get a lot better.
So how do you get to the next step? You evaluate two (or more) designs holistically. 95% of the time, a home page is designed to do more than optimize a single stat. The home page should facilitate browsing, product understanding, and ongoing usage. Those things are measurable, but they’re tougher to measure than simple “clicks” or “purchases”. How many pages are people in each group viewing? How frequently are they coming back? What kinds of activity are they doing? How many of them are closing their accounts?
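To make that concrete, here’s a rough sketch of rolling those questions up per test group from a raw event log. The event schema and metric names are purely illustrative, not any particular company’s setup:

```python
# Sketch of a holistic comparison: instead of one conversion rate, compute
# several engagement stats per test group from a raw event log.
from collections import defaultdict
from datetime import date

# Each event: (user_id, test_group, event_type, day)
events = [
    ("u1", "A", "page_view", date(2013, 5, 1)),
    ("u1", "A", "page_view", date(2013, 5, 3)),
    ("u2", "B", "page_view", date(2013, 5, 1)),
    ("u2", "B", "purchase", date(2013, 5, 2)),
    ("u3", "B", "account_close", date(2013, 5, 4)),
]

def summarize(events):
    counts = defaultdict(lambda: defaultdict(int))   # group -> event_type -> count
    users = defaultdict(set)                         # group -> distinct users
    active_days = defaultdict(set)                   # group -> (user, day) pairs
    for user, group, event_type, day in events:
        counts[group][event_type] += 1
        users[group].add(user)
        active_days[group].add((user, day))
    summary = {}
    for group in counts:
        n = len(users[group])
        summary[group] = {
            "pages_per_user": counts[group]["page_view"] / n,
            "active_days_per_user": len(active_days[group]) / n,
            "purchases_per_user": counts[group]["purchase"] / n,
            "account_closes": counts[group]["account_close"],
        }
    return summary

print(summarize(events))
```

The point isn’t these particular metrics; it’s that each side of the test gets a profile of behavior rather than a single number.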
This sort of testing is hard to do — and done by very few Internet companies today — for a few reasons:
1) It’s technically challenging. At Circle of Moms, we built a system to track usage on both sides of a test, so that one could go to a test page and compare the two sides along dozens of stats. It required us to associate every event with every active A/B test the user had ever been part of. We also had to come up with clever ways to filter out outliers, like people who had viewed 5000 pages in a month (there’s a rough sketch of both pieces after this list). Setting this up isn’t impossible, but it takes a fair amount of work to scale and maintain. As far as I’m aware, none of the mainstream web analytics companies are well set up to allow this kind of testing.
2) If you’re looking at many different stats when comparing the two sides of a test, it’s tough to know what matters and what doesn’t. Consumers of test results — product managers, developers, designers, execs — need some intuition for what’s important, without feeling overwhelmed by excessive data. Having only one metric to look at makes things a lot simpler, but it doesn’t reflect what the business actually needs.
3) Politically and emotionally, it’s tough to throw away features that a team has spent a long time on. If you switch over to a new home page without properly testing it, everyone can celebrate the hard work that went into it, and not worry about how it actually performs.
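To give a flavor of point 1, here’s a toy version of the two pieces described there: tagging each event with the user’s test assignments so it can later be sliced by any active test, and filtering out outlier users before computing stats. The data structures are hypothetical and nothing like production scale:

```python
# Toy sketch of point 1: tag every event with the A/B tests the user is
# enrolled in, then drop outlier users before computing per-test stats.
# Assignments and thresholds here are illustrative.
from collections import defaultdict

# user_id -> {test_name: group}
assignments = {
    "u1": {"home_page_v2": "A", "signup_flow": "B"},
    "u2": {"home_page_v2": "B"},
}

def tag_event(user_id, event_type):
    """Attach every test/group the user belongs to, so the event can
    later be sliced by any of those tests."""
    return {
        "user_id": user_id,
        "event_type": event_type,
        "tests": dict(assignments.get(user_id, {})),
    }

def filter_outliers(events, max_events_per_user=5000):
    """Drop users with implausibly high activity (bots, shared accounts,
    the person who viewed 5000 pages in a month)."""
    per_user = defaultdict(int)
    for e in events:
        per_user[e["user_id"]] += 1
    keep = {u for u, n in per_user.items() if n <= max_events_per_user}
    return [e for e in events if e["user_id"] in keep]

events = [tag_event("u1", "page_view"), tag_event("u2", "purchase")]
print(filter_outliers(events))
```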
There’s a substantial cost to this lack of testing. I’ve seen a number of companies with plenty of data spend weeks or months building out a new version of something (home page, signup flow), skip A/B testing it, and then discover indirectly that the new version performs worse than the old one. This leaves two terrible options: roll back to an old version of the product — a technical and logistical nightmare and a morale hit — or keep a version of the product that is significantly less effective.
The alternative is to use A/B testing as part of a holistic design approach. Have some outlandish idea for what the home page should look like? Sketch it out, get in-person feedback, refine. Build out a basic version, maybe get more in-person feedback, refine. A/B test it with a small percentage of users, see how usage looks relative to the baseline. If the results look terrible, move on to the next idea. If they look promising but have some issues (maybe usage is much better, but engagement is down), refine and A/B test the refined version. Keep repeating this process until you get to a design that is clearly more effective than the baseline.
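As one small illustration of the “A/B test it with a small percentage of users” step, here’s a common way to do stable bucketing so the same user always sees the same side. The test name and the 5% rollout are arbitrary examples, not a recommended setting:

```python
# Sketch of the "small percentage first" step: deterministically bucket users
# so that roughly 5% see the new design and everyone else stays on baseline.
import hashlib

def in_variant(user_id: str, test_name: str, rollout_pct: float) -> bool:
    """Hash the user into a stable bucket in [0, 1); users below the
    rollout threshold see the new design, the rest see the baseline."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return bucket < rollout_pct

# Start the outlandish home page with ~5% of users; widen the rollout (or
# kill the test) as the usage numbers come in relative to baseline.
for uid in ["u1", "u2", "u3", "u4", "u5"]:
    group = "new_home_page" if in_variant(uid, "home_page_v3", 0.05) else "baseline"
    print(uid, group)
```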
Everyone knows the famous quote:
“There are three kinds of lies: lies, damned lies, and statistics.”
It’s easy to feel “lied to” — misled would be a better term — after being fed inappropriate statistics. A/B testing small changes, with the goal of optimizing one variable, often falls into this category: it leads to short-term, non-holistic product development processes.
The solution to being misled isn’t to stop using statistics; it’s to make sure the right ones are being used. Nor should one throw out the A/B testing framework baby with the poor methodology bathwater. Rather, companies need to measure the right things and ask the right questions that allow them to deeply understand the effects of their changes. Optimize sometimes; understand always. The end result of this approach is a holistic A/B testing mentality that everyone should be able to embrace.