A Primer on A/B Testing (Yummy Candy!)

I think I know how it feels to be a nagging dentist.

I spend lots of time helping startup founders figure out how to increase the number of people using their product. Sometimes, founders think that because a few silly folks labelled me with the (soon-to-be-cliched) title of growth hacker, I am “magical” like an Apple product. With one quick suggestion from me, they can get to a million users!

Unfortunately, it doesn’t usually work that way. Instead, I tell them, they need to (among other things) rigorously A/B test a dozen interface changes on their three or four most important pages. And then I get that “what-do-you-mean-I-need-to-floss-every-single-night?” kind of look.

I’ve run hundreds of A/B tests over the years, and in the process I’ve learned a lot about what messages people respond to. After seeing the results of those tests, I present a shocking hypothesis: “you should try this yummy candy!” will be more effective than “you really need to start flossing every day.”

So… I need to tell you why A/B testing is like yummy candy. Fortunately, I can make that argument without being misleading: running A/B tests can be really fun and addictive (like Skittles!). You’ve probably experienced an eager expectation that something new would immediately improve your world in a significant way. Maybe as part of a website — a new, beautiful signup flow will mean a super engaged user base — or, in your personal life: a new hairstyle will encourage people to respond to you in a better way.

A/B testing can provide that spark of hope on a very frequent basis: at Circle of Moms we’d have dozens of tests running at any given time, each serving as a quantitatively sound way to understand our usage and improve our product. Pushing out new tests multiple times a week, getting rapid feedback on each, is like regularly handing out chocolate to your team. Each test is a yummy morsel of hope: it has the potential to bring users in, excite and engage existing users, and make money. Frequent testing is like frequent chocolate consumption. Yum!

Frequent chocolate consumption has risks, and so does frequent A/B testing. With A/B testing, it’s important to be holistic and patient about collecting data. But a product development strategy involving A/B testing is generally both more fun and more effective than the alternative “change and pray” approach.

Now that we’ve established that A/B testing is fun, we get to the real questions. Why does it actually matter to your business? What should you be testing? When does it make sense to do? (brief answer: not always) And how, technically, should you do it?

Let’s tackle each of those.

WHY

The reason to A/B test is simple: newer doesn’t always mean better, and everyone I’ve met is mediocre at predicting how effective a new experience will be. There’s often an implicit assumption that ______ in my product isn’t very good, and by spending time on it, we can only make it better. In extreme cases (the current version is a 404 page-not-found error), that’s very likely to be true. But in more common cases (the signup flow is a little bit ugly and awkward), product changes don’t always mean progress.

We saw this time and again at Circle of Moms. We had a new homepage that looked cleaner and more usable… and users who saw it stopped contributing to conversations. We had a signup flow that seemed much simpler and more professional… but fewer people got through it and those who got through it didn’t invite their friends to join our site. Surely asking people to share their answers on Facebook would be good, right? Turns out no: very few moms actually shared their activity, while many others were scared off by the thought of us making content too public (this only applied for some content types).

Okay, you say, that’s all fair and well, but how about just making a change and seeing how it affects overall metrics for the product? There is a case where this is a good approach, and I’ll walk through it in the “When” section. But most of the time, it’s the wrong way to go.

To work, serial “testing” requires three things: the rest of the world staying steady, large changes, and a close eye on metrics. Let’s say you’re looking at how a new homepage design affects activity, and all of a sudden your email-sending IP is blacklisted by Yahoo. Your numbers will almost certainly go down, regardless of the effectiveness of your new homepage. New signup flow, and all of a sudden you get a surge of search traffic that broadens your audience but decreases the quality? Same type of issue. Major site downtime or technical issues can have the same impact.

If you have a huge increase or decrease, and you know that the outside world is more or less the same over the test period, and you measure different cohorts properly, and of course you only measure one thing at a time… serial testing can work. If you really think those can happen consistently, you’re a lot more optimistic than I am.

WHAT

There are two reasons to A/B test something:

1) You have a product enhancement that might improve your metrics at a level material to your business, and want to try it.

2) You have a radically revamped piece of your product, and want to verify that it’s at least as effective as the current version.

Generally, #1 is about iteration and optimization, while #2 is about design and vision. The thought processes for the two are very different.

Optimization is only useful on products close enough to “good” to be optimized. Overused but apropos cliche: A/B testing something that’s badly broken is akin to rearranging the deck chairs on the Titanic. Here are a couple of cases where you may or may not want to use optimization:

  • Viral signup flows. If your current signup flow features 1000 signups inviting 3000 people, 900 of whom register for your product, you’re very close to being viral (K=0.9). A/B testing would be a good use of time. If your current flow features 1000 signups inviting 600 people, 80 of whom join (K=0.08), then you aren’t in the ballpark: optimizing button text is likely a waste of time. Go bigger.
  • Email content. Subject lines and link text can have a huge impact on email clickthrough rates. One typical example: an email with the subject “5 Embarrassing Kid Moments” gets 2.5 times as many clicks as one with the subject “The Craziest Thing My Child Has Done.” But again, being close to “good” is key: if that 2.5x is the difference between 50 clicks a week and 125 clicks a week, does it matter? If it doesn’t matter (and good estimation is key), no point spending time A/B testing it.
  • Purchase funnel. Much of Team Rankings’ revenue comes from subscriptions, and some purchase funnels can be much more effective than others. Last year, for instance, our March Madness product, BracketBrains, generated 30% more sales when we prompted people to “Get 2011 Picks” than to “Get BracketBrains”. Same caveats apply here, though: test if and only if the differences are likely to matter for your business.
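The K values in the viral signup example are simple arithmetic: invites sent per signup, multiplied by the fraction of invitees who register. Here is a minimal Python sketch of that calculation; the function name `k_factor` is made up for illustration, and the inputs are the two scenarios from the bullet above.

```python
def k_factor(signups, invites_sent, invitees_registered):
    """Viral coefficient: new registrations generated per existing signup."""
    if signups <= 0:
        raise ValueError("signups must be positive")
    invites_per_signup = invites_sent / signups
    # Guard against a flow that sends no invites at all.
    invite_conversion = invitees_registered / invites_sent if invites_sent else 0.0
    return invites_per_signup * invite_conversion


# The two signup flows described in the text.
near_viral = k_factor(1000, 3000, 900)
far_off = k_factor(1000, 600, 80)

print(f"near-viral flow: K = {near_viral:.2f}")
print(f"far-off flow:    K = {far_off:.2f}")
```

A K close to 1 is worth optimizing; a K near 0.08 calls for a bigger change than button text.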

The thought process for a new design is very different. At Circle of Moms, I’d explain to my team that we tested a new home page not because we wanted it to improve our metrics but because we didn’t want to tank them. I wrote an entire post on why A/B testing vs. holistic design is a false dichotomy; the TL;DR is “test entire designs, see how each does on a variety of metrics, then make an informed judgment call on how to go forward.”

WHEN

If you thought flossing was exciting, wait until he starts talking about statistical significance!

The most important type of significance in assessing when to A/B test isn’t statistical significance, it’s business significance. Can a new version of this page make a real difference to our business? Size of user base, development team, and revenues inform what’s useful for different companies. Facebook can move the needle with hundreds of 0.1% improvements. A small startup with no revenue and 100 users doesn’t care about 0.1%, nor will they be able to detect it. In a small product with no usage, serial testing is fine: there’s no chance that the business will be built around the existing product, so rapid change is more important than scientific understanding.

Statistical significance is the second most important variable in assessing when to A/B test. Making a decision between two options without enough data can undermine the entire point of A/B testing. Doing statistical significance properly can be difficult, but the 80-20 solution is pretty simple. Just use an online split test calculator: plug in your estimated sample sizes and conversion rates to see whether you are likely to attain statistical significance for a single output variable.
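If you’d rather see what such a calculator does under the hood, one common approach is a two-proportion z-test. The sketch below is an assumption about how a typical calculator works, not a description of any particular one, and the 5% vs. 6% conversion numbers are hypothetical. It uses only Python’s standard library.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both options convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical estimates: 5% vs. 6% conversion at two different traffic levels.
p_small = two_proportion_p_value(50, 1000, 60, 1000)
p_large = two_proportion_p_value(500, 10000, 600, 10000)

print(f"1,000 users per option:  p = {p_small:.3f}")
print(f"10,000 users per option: p = {p_large:.4f}")
```

With these made-up numbers, the same one-point difference is nowhere near significant at 1,000 users per option but clears the usual 0.05 threshold at 10,000, which is the kind of check worth running before you commit to a test.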

A big caveat: a little common sense regarding statistical significance can go a long way. If you’ve been running a test for a while and don’t have a clear winner but have some other ideas that might move the needle a lot more, you might be well-served by resolving the test now and trying something new. You’re running a company, not trying to publish in an academic journal.

At Team Rankings, we regularly do this with BracketBrains. Most of our sales happen over a four-day period, so we have a limited window in which to test things. The “cost” of resolving slightly sub-optimally (say, choosing the 5% option rather than the 5.1% option) is likely to be lower than the opportunity cost of not running an additional test. And since resolving a test when 99% of sales have occurred does us no good, we’re more aggressive than traditional statistical tests would dictate. If, on the other hand, you’re early in a product’s lifetime, conservative decision-making might be more appropriate.

HOW

There’s one way to A/B test properly: build your own system. Lots of people probably don’t want to hear that, but products like Optimizely are too simplistic and optimization-focused to be broadly useful. Outsourcing your A/B testing is like outsourcing your relationship with your users: you need to understand how people are using your product, and the A/B testing services currently available don’t cut it.

I wish I could recommend an open source A/B testing framework to avoid re-inventing the wheel; ping me if you know of a good one or are creating one (if so, I’d be happy to help). A/Bingo is the closest.

The good news is that it’s pretty simple to get your own very basic A/B testing system up and running, and it’s easy to build up functionality over time. Here’s the bare minimum:

Data structure:
AB_TESTS (id, name, time_created)
AB_TEST_OPTIONS (id, ab_test_id, weight, name)
USER_AB_TEST_OPTIONS (id, user_id/visitor_id, ab_test_option_id, time_created)

Code

// CONFIG FILE //
// weights are relative: here 1 in 10 users gets the 2-column design
$ab_tests = array(
	"home_design" => array(
		array("name" => "2_columns", "weight" => 1),
		array("name" => "1_column", "weight" => 9)
	)
);

// VIEW FILE //
if ($user->has_ab("home_design", "2_columns")) {
  // show new 2-column layout
}
if ($user->has_ab("home_design", "1_column")) {
  // show old 1-column layout
}

// USER OR VISITOR OBJECT //
function has_ab($test_name, $option_name) {
  // check if this test exists
  // if not, create it in the DB
  // (one row in AB_TESTS, multiple rows in AB_TEST_OPTIONS)

  // check if this user/visitor already has an ab test option selected for this test
  // if not, select a random number
  // use the weighting to decide which version he should get
  // record it in the USER_AB_TEST_OPTIONS table

  // return true if the recorded option for this test is $option_name
}
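
For anyone who wants something runnable, here is a minimal Python sketch of the assignment logic outlined in the comments above. It is an illustration under stated assumptions, not a finished framework: a plain dictionary stands in for the USER_AB_TEST_OPTIONS table, the helper `assigned_option` is a made-up name, and the step that creates test rows in the DB is skipped.

```python
import random

# Mirrors the config above: test name -> {option name: relative weight}.
AB_TESTS = {
    "home_design": {"2_columns": 1, "1_column": 9},
}

# In-memory stand-in for the USER_AB_TEST_OPTIONS table (an assumption for
# this sketch): (user_id, test_name) -> option name.
_assignments = {}


def assigned_option(user_id, test_name, rng=None):
    """Return the user's option for a test, assigning one on first sight."""
    if rng is None:
        rng = random
    key = (user_id, test_name)
    if key not in _assignments:
        options = AB_TESTS[test_name]
        names = list(options)
        weights = [options[name] for name in names]
        # Weighted random pick; the result is stored so the user keeps
        # seeing the same version on every later visit.
        _assignments[key] = rng.choices(names, weights=weights, k=1)[0]
    return _assignments[key]


def has_ab(user_id, test_name, option_name):
    """True if this user is assigned to the given option of the test."""
    return assigned_option(user_id, test_name) == option_name


if has_ab(42, "home_design", "2_columns"):
    print("show new 2-column layout")
else:
    print("show old 1-column layout")
```

The important property is stickiness: the random draw happens once per user per test, and every later call reads the stored answer.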

Reporting (SQL)
I’m assuming you have a USER_ACTIVITY table that records different types of activity, with user, time, and activity type. A table like that makes A/B test reporting a whole lot easier.

select
  b.AB_TEST_OPTION_ID,
  a.ACTIVITY_ID,
  count(distinct a.USER_ID) USERS_DOING_ACTIVITY,
  count(1) TOTAL_ACTIVITIES
from USER_ACTIVITY a, USER_AB_TEST_OPTIONS b
where a.USER_ID=b.USER_ID
  and b.AB_TEST_OPTION_ID in (…)
  and a.ACTIVITY_ID in (…)
  and a.time_created>b.time_created
GROUP BY b.AB_TEST_OPTION_ID, a.ACTIVITY_ID;
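
To try the report without touching a production database, here is a small Python sketch that runs the same kind of query against an in-memory SQLite database. Everything specific in it is made up for illustration: the TIME_CREATED column on USER_ACTIVITY, option IDs 10 and 11, activity ID 7, and the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USER_AB_TEST_OPTIONS (
        USER_ID INTEGER, AB_TEST_OPTION_ID INTEGER, TIME_CREATED INTEGER);
    CREATE TABLE USER_ACTIVITY (
        USER_ID INTEGER, ACTIVITY_ID INTEGER, TIME_CREATED INTEGER);
""")

# Users 1 and 2 see option 10; user 3 sees option 11 (all assigned at t=100).
conn.executemany(
    "INSERT INTO USER_AB_TEST_OPTIONS VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 11, 100)],
)
# User 2's only activity happened before assignment, so it must not count.
conn.executemany(
    "INSERT INTO USER_ACTIVITY VALUES (?, ?, ?)",
    [(1, 7, 150), (1, 7, 160), (2, 7, 90), (3, 7, 120)],
)

REPORT_SQL = """
    SELECT b.AB_TEST_OPTION_ID,
           a.ACTIVITY_ID,
           COUNT(DISTINCT a.USER_ID) AS USERS_DOING_ACTIVITY,
           COUNT(1) AS TOTAL_ACTIVITIES
    FROM USER_ACTIVITY a
    JOIN USER_AB_TEST_OPTIONS b ON a.USER_ID = b.USER_ID
    WHERE b.AB_TEST_OPTION_ID IN (10, 11)
      AND a.ACTIVITY_ID IN (7)
      AND a.TIME_CREATED > b.TIME_CREATED
    GROUP BY b.AB_TEST_OPTION_ID, a.ACTIVITY_ID
    ORDER BY b.AB_TEST_OPTION_ID
"""

rows = conn.execute(REPORT_SQL).fetchall()
for option_id, activity_id, users, total in rows:
    print(option_id, activity_id, users, total)
```

The time comparison matters: only activity that happens after a user was assigned to an option should be credited to that option.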

Scaling
If your site is or becomes massive, scaling the framework will entail some additional work. USER_AB_TEST_OPTIONS may require a large number of writes, and that little query joining USER_ACTIVITY and USER_AB_TEST_OPTIONS might take a while. Writing to the table in batch, using a separate tracking database, and/or using non-SQL options may all help to scale everything.
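
As one illustration of the batch-write idea, here is a rough Python sketch that buffers assignments in memory and writes them in a single transaction. The class name `AssignmentBuffer`, the batch size, and the SQLite backing store are all assumptions for the sake of a runnable example; a real system would also need to handle assignments lost in a crash and reads of assignments that have not been flushed yet.

```python
import sqlite3


class AssignmentBuffer:
    """Collects A/B assignments in memory and writes them in batches."""

    def __init__(self, conn, batch_size=500):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []

    def record(self, user_id, option_id, time_created):
        self.pending.append((user_id, option_id, time_created))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One transaction per batch instead of one per assignment.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO USER_AB_TEST_OPTIONS "
                "(USER_ID, AB_TEST_OPTION_ID, TIME_CREATED) VALUES (?, ?, ?)",
                self.pending,
            )
        self.pending = []


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE USER_AB_TEST_OPTIONS "
    "(USER_ID INTEGER, AB_TEST_OPTION_ID INTEGER, TIME_CREATED INTEGER)"
)

buffer = AssignmentBuffer(conn, batch_size=3)
for user_id in range(1, 6):
    buffer.record(user_id, 10, 100 + user_id)

count_sql = "SELECT COUNT(*) FROM USER_AB_TEST_OPTIONS"
rows_before_flush = conn.execute(count_sql).fetchone()[0]
buffer.flush()  # write whatever is left at shutdown or on a timer
rows_after_flush = conn.execute(count_sql).fetchone()[0]
print(rows_before_flush, rows_after_flush)
```

The trade-off is fewer, larger writes in exchange for a short window where an assignment exists only in memory.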

At Circle of Moms, we built out a system to automatically report lots of stats for every test. This was awesome, but it takes some work to scale, and I would never recommend it as a first step.

So…
As I said, A/B testing is like candy: fun and sometimes addictive. Done correctly, it can be part of the best form of mature and thoughtful product development. It builds a culture of testing and measuring. It lets you understand what works and what doesn’t. It forces you to get smarter about what actually moves metrics. Most important, it fosters an environment where data trumps opinions… anyone want to volunteer to try to take that to D.C.?

Mike Greenfield founded Circle of Moms and Team Rankings, led LinkedIn's analytics team from 2004-2007, and built much of PayPal's early fraud detection technology. Ping him at [first_name] at mikegreenfield.com.