AB Testing 101

What I wish I knew about AB testing when I started my career

Jonathan Fulton
Jonathan’s Musings


AB testing launches your product and company to the next level

I started my career as a software engineer at Applied Predictive Technologies (APT), which sold multi-million dollar contracts for sophisticated AB testing software to Fortune 500 clients and was acquired by Mastercard in 2015 for $600 million. So I’ve been involved in AB testing since the beginning of my career.

A few years later I was VP of Engineering at Storyblocks where I helped build our online platform for running AB tests to scale our revenue from $10m to $30m+. Next at Foundry.ai as VP of Engineering I helped build a multi-armed bandit for media sites. Then a few years later I helped stand up AB testing as an architect at ID.me, a company valued at over $1.5B with $100m+ in ARR.

So I should have known a lot about AB testing, right? Wrong! I was surprised just how little I knew after I joined the AB testing platform company Eppo. Below is a synthesis of what I wish I’d known about AB testing when I started my career.

What is AB Testing?

First things first, let’s define what AB testing is (also known as split testing). To quote Harvard Business Review, “A/B testing is a way to compare two versions of something to figure out which performs better.”

Here’s a visual representation of an AB test of a web page.

Source: https://vwo.com/blog/ab-testing-examples/

The fake “YourDelivery” company is testing to figure out which variant is going to lead to more food delivery orders. Whichever variant wins will be rolled out to the entire population of users.

For the rest of the article, I’ll be assuming we’re working on AB testing a web or mobile product.

Ok, so let’s define what goes into running an AB test:

  • Feature flagging and randomization: how you determine who sees what variant (e.g., which experience or features)
  • Metrics: what we measure to determine which variant wins and to make sure we don’t inadvertently break anything in the process
  • Statistics: the fancy math that determines whether a metric for one variant is genuinely better than another, or whether the difference is just noise
  • Drawing conclusions: how we make decisions from the metrics and statistics

Let’s walk through each of these systematically.

Feature flagging and randomization

Feature flagging is used to enable or disable a feature for a given user. And we can think of each AB test as a flag that determines which variant a user sees. That’s where randomization comes in.

Randomization is about “rolling the dice” and figuring out which variant the user sees. It sounds simple but it’s actually complicated to do well. Let’s start with the “naive” version of randomization to illustrate the complexity.

Naive version of randomization

A user hits the home page and we have a test we’re running on the main copy, with variants control and test. The backend has to determine which one to render to the user. Here’s how the randomization works for a user visiting the home page:

  1. Look up the user’s variant from the database. If it exists, use the recorded variant. If it doesn’t exist, then…
  2. Roll the dice. Execute Math.random() (or whatever randomization function exists for your language)
  3. Assign the variant. If the number returned from Math.random() is < 0.5, assign the control variant. Otherwise assign the test variant.
  4. Save the variant to the database (so it can be looked up in step 1 the next time the user visits the home page)
  5. Use the variant to determine what copy to render on the home page
  6. Render the home page
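In code, the naive flow looks roughly like this. It’s just a sketch: the VariantStore interface stands in for whatever transactional table and data-access layer you actually use.

interface VariantStore {
  findVariant(userId: string, experimentId: string): Promise<string | null>;
  saveVariant(userId: string, experimentId: string, variant: string): Promise<void>;
}

async function getVariant(
  store: VariantStore,
  userId: string,
  experimentId: string
): Promise<string> {
  // 1. Look up a previously recorded variant
  const existing = await store.findVariant(userId, experimentId);
  if (existing) return existing;

  // 2 & 3. Roll the dice and assign
  const variant = Math.random() < 0.5 ? "control" : "test";

  // 4. Save it so the next page load gets the same experience
  await store.saveVariant(userId, experimentId, variant);

  // 5 & 6. The caller uses the returned variant to decide which copy to render
  return variant;
}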

Simple enough. This is actually what we implemented at Storyblocks back in 2014 when we first started AB testing. It works but it has some noticeable downsides:

  1. Hard dependency on reading from and writing to a transactional database. If you have a test running on your home page, there’s potentially a lot of write volume going to your testing table. And if you have two, three, four or more tests running on that page, multiply that load accordingly. The app is also trying to read from this table with every page load, so there’s a lot of read/write contention. We crashed our app a few times at Storyblocks due to database locks related to this high write volume, particularly when a new test was added to a high-traffic page.
  2. Primarily works with a traditional architecture where the server renders an HTML page. It doesn’t work well for single-page or mobile apps because every test you’re running would need a blocking network call before the app could render its content appropriately.
  3. Race condition. If a user opens your home page multiple times at once (say from a Chrome session restore), it’s indeterminate which variant the user will see and have recorded. The lookups from step 1 may all return nothing, so each page load rolls the dice and may assign a different variant. It’s random which one wins the race to be saved to the database, and the user is potentially served different experiences.

Improved randomization via hashing

So how do we do randomization better? The simple answer: hashing. Instead of simply rolling the dice with Math.random(), we hash the combination of the experiment identifier and the user identifier using something like MD5, which effectively creates a consistent random number for the experiment/user combination. We then take the first few bytes and modulo by a relatively large number (say 10,000), and divide your variants across these 10,000 “shards” to determine which variant to serve. (If you’re interested in actually seeing some code for this, you can check out Eppo’s SDK.) Here’s what that looks like in a diagram with 10 shards.
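And in code, a minimal sketch of that hashing scheme might look like the following (the shard count and the 50/50 split are illustrative, not Eppo’s exact implementation):

import { createHash } from "crypto";

const TOTAL_SHARDS = 10_000;

function getShard(experimentId: string, userId: string): number {
  // MD5 the experiment/user combination to get a stable pseudo-random value
  const hash = createHash("md5").update(`${experimentId}-${userId}`).digest("hex");
  // Take the first 4 bytes (8 hex chars) and modulo into a shard
  return parseInt(hash.slice(0, 8), 16) % TOTAL_SHARDS;
}

function assignVariant(experimentId: string, userId: string): "control" | "test" {
  // 50/50 split: shards 0-4999 get control, 5000-9999 get test
  return getShard(experimentId, userId) < TOTAL_SHARDS / 2 ? "control" : "test";
}

// The same user always lands in the same variant for a given experiment
console.log(assignVariant("home-page-copy", "user-123"));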

After you’ve computed the variant, you log the result, but instead of writing to a transactional database, which is blocking, you write the result to a data firehose (such as AWS Kinesis) in a non-blocking way. Eventually the data makes its way into a table in your data lake/warehouse for analysis (often called the “assignments” table).
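For instance, the assignment event might be logged fire-and-forget like this (a rough sketch using the AWS SDK; the stream name and event shape are made up):

import { KinesisClient, PutRecordCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "us-east-1" });

// Fire-and-forget: we don't await the send, so rendering is never blocked on logging
function logAssignment(experimentId: string, userId: string, variant: string): void {
  const event = { experimentId, userId, variant, assignedAt: new Date().toISOString() };
  kinesis
    .send(
      new PutRecordCommand({
        StreamName: "experiment-assignments", // hypothetical stream name
        PartitionKey: userId,
        Data: Buffer.from(JSON.stringify(event)),
      })
    )
    .catch((err) => console.error("Failed to log assignment", err));
}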

Ok, so why do I need a feature flagging tool? Can’t I just implement this hashing logic myself? Yes, you could (and we did at Storyblocks back in the day), but there are some downsides:

  1. There’s still a dependency on a database/service to store the test configuration. The naive implementation requires fetching the data each time you need to randomize a user into a variant. This is challenging for SPAs and mobile apps.
  2. There’s no good way to opt users in to a specific variant. This is often necessary for testing or rollout purposes (e.g., someone should be able to test each variant). The reason for this is that with only hashing, the variant is fully determined by the experiment/user combination.

The answer to randomization: feature flagging

So what do we do? Feature flagging! I won’t go into it in full detail here, but feature flagging solves these issues for us by combining the best of both worlds: the ability to opt specific groups of users into a test and the ability to randomize everyone else. There’s a great Eppo blog post that describes what goes into building a global feature flagging service if you want to learn more.

A feature flagging tool determines which variant a user sees
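To make that concrete, here’s a hypothetical sketch of flag evaluation that layers opt-in overrides on top of the hashing from earlier (the FlagConfig shape is made up; real tools like Eppo also handle targeting rules, traffic exposure, and config delivery):

// Reuses the getShard helper from the hashing sketch above
declare function getShard(experimentId: string, userId: string): number;

interface Allocation {
  variant: string;
  shardStart: number; // inclusive
  shardEnd: number;   // exclusive
}

interface FlagConfig {
  experimentId: string;
  overrides: Record<string, string>; // explicit opt-ins, e.g. for QA: userId -> variant
  allocations: Allocation[];         // shard ranges per variant
}

function evaluateFlag(config: FlagConfig, userId: string): string {
  // 1. Opt-ins win, so testers can force a specific variant
  const override = config.overrides[userId];
  if (override) return override;

  // 2. Everyone else is randomized deterministically via hashing
  const shard = getShard(config.experimentId, userId);
  const allocation = config.allocations.find(
    (a) => shard >= a.shardStart && shard < a.shardEnd
  );
  return allocation ? allocation.variant : "control";
}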

Metrics

Metrics are probably the easiest part of AB testing to understand. Each business or product typically comes with its own set of metrics that define user engagement, financial performance, and anything else you can measure that will help drive business strategy and decisions. For Storyblocks, a stock media site, those included 30-day revenue for a new signup (financial), downloads (engagement), search speed (performance), net promoter score (customer satisfaction), and many more.

The naive approach here is simply to join your assignments table to other tables in your database to compute metric values for each of the users in your experiment. Here are some illustrative queries:

SELECT a.user_id, a.variant, SUM(p.revenue) AS revenue
FROM assignments a
JOIN purchases p
  ON a.user_id = p.user_id
WHERE a.experiment_id = 'some-experiment'
  AND p.purchased_at >= a.assigned_at
GROUP BY a.user_id, a.variant;

SELECT a.user_id, a.variant, COUNT(*) AS num_page_views
FROM assignments a
JOIN page_views p
  ON a.user_id = p.user_id
WHERE a.experiment_id = 'some-experiment'
  AND p.viewed_at >= a.assigned_at
GROUP BY a.user_id, a.variant;

-- etc.

This becomes cumbersome for a few reasons:

  1. As your user base grows, the ad hoc joins to your assignments table become repetitive and expensive
  2. As the number of metrics grows, the SQL to compute them becomes hard to manage (just imagine having 50 or even 1000 of these)
  3. If the underlying data varies with time, it becomes hard to reproduce results

So to scale your AB testing, you need a system with the following:

  1. The ability to compute the assignments for a particular experiment once per computation cycle
  2. A repository to manage the SQL defining your metrics
  3. An immutable event or “fact” layer to define your underlying primitives to compute your metrics

Let me explain the event/fact layer in more detail. A critical aspect to making metrics easily reproducible and measurable is to base them on events or “facts” that occur in the product or business. These should be immutable and have a timestamp associated with them. At Storyblocks those facts included subscription payments, downloads, page views, searches and the like. The metric for 30-day revenue for a new signup is simply an operation (sum) on top of a fact (subscription payments). Number of searches is simply a count of the number of search events. And so on. A company like Eppo makes these facts and other definitions a core part of your AB testing infrastructure and also provides the capabilities for computing assignments once and building out a fact/metric repository.
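As a rough illustration of what a fact/metric repository might hold (the shape below is hypothetical, not Eppo’s actual schema):

// Hypothetical fact and metric definitions for a stock media business
interface Fact {
  name: string;
  table: string;           // immutable event table in the warehouse
  timestampColumn: string;
  valueColumn?: string;    // omitted for count-style facts
}

interface Metric {
  name: string;
  fact: string;                 // which fact it aggregates
  operation: "sum" | "count";
  windowDays?: number;          // e.g., 30 days after assignment
}

const facts: Fact[] = [
  { name: "subscription_payments", table: "analytics.subscription_payments", timestampColumn: "paid_at", valueColumn: "revenue" },
  { name: "searches", table: "analytics.searches", timestampColumn: "searched_at" },
];

const metrics: Metric[] = [
  { name: "30-day revenue per new signup", fact: "subscription_payments", operation: "sum", windowDays: 30 },
  { name: "Number of searches", fact: "searches", operation: "count" },
];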

An important aspect of configuring an experiment is defining primary and guardrail metrics. The primary metric for an experiment is the metric most closely associated with what you’re trying to test. So for the homepage refresh of YourDelivery where you’re testing blue vs red background colors, your primary metric is probably revenue. Guardrail metrics are things that you typically aren’t trying to change but you’re going to measure them to make sure you don’t negatively impact user experience. Stuff like time on site, page views, etc.

Statistics

Ok, statistics. This is the hardest part for someone new to AB testing to understand. You’ve probably heard that we want a p-value to be less than 0.05 for a given metric difference to be statistically significant, but you might not know much else. So I’m going to start with the naive approach that you can find in a Statistics 101 textbook. Then I’ll show what’s wrong with the naive approach. Finally, I’ll explain the approach you should be taking. There are also a couple of bonus sections at the end.

The naive approach: the Student t-test

Let’s assume we’re running the home page test for YourDelivery shown above, with two variants control (blue) and test (red) with an even 50/50 split between them. Let’s also assume we’re only looking at one metric, revenue. Every user that visits the home page will be assigned to one of the variants and then we can compute the revenue metric for each user. How do we determine if there’s a statistically significant difference between test and control? The naive approach is simply to use a Student t-test to check if there’s a statistical difference. You compute the mean and standard deviation for test and control, plug them into the t-statistic formula, compare that value to a critical value you look up, and voila, you know if your metric, in this case revenue, is statistically different between the groups.

Let’s dive into the details. The formula for the classic t-statistic is as follows:

t-statistic
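Written out for two groups, treatment (T) and control (C), and using the unpooled (Welch) form commonly used for AB tests, the statistic is

t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}}}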

Variable definitions in the formula are as follows:
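\bar{x}_T, \bar{x}_C — sample means of the metric in treatment and control; s_T^2, s_C^2 — sample variances; n_T, n_C — number of users assigned to each group.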

To look up the critical value for a given significance level (typically 5%), you need to know the degrees of freedom. However, for the large sample sizes we typically have when AB testing, the t-distribution converges to the normal distribution, so we can just use that to look up the critical value. The parameters of that normal distribution under the null hypothesis (i.e., there is no difference between the groups) are:
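Under the null hypothesis the difference in sample means is approximately

\bar{x}_T - \bar{x}_C \sim \mathcal{N}\!\left(0,\ \frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}\right)

so the t-statistic above is compared against a standard normal critical value (about 1.96 for a two-sided test at 5%).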

At Storyblocks this is the approach we used. Since we wanted to track how the test was performing over time, we would plot the lift and p-value over time and use that for making decisions.

Example plot of lift and p-value over time

What’s wrong with the naive approach

The naive approach seems sound, right? After all, it’s following textbook statistics. However there are a few major downsides:

  1. The “Peeking Problem”. The classic t-test only guarantees its statistical significance if you look at the results once (e.g., a fixed sample size). More details below.
  2. P-values are notoriously prone to misinterpretation, have arbitrary thresholds (e.g., 5%), and do not indicate effect size.
  3. Absolute vs relative differences. The classic t-test looks at absolute differences instead of relative differences.

The Peeking Problem

Using the naive t-test approach, we thought we were getting a 5% significance level. However, the classic t-test only provides the advertised statistical significance guarantees if you look at the results once (in other words, you pre-determine a fixed sample size). Evan Miller writes a great blog post about this problem that I highly recommend reading to understand more. Below is a table from Evan’s blog post illustrating how bad the peeking problem is.

Illustration of how peeking impacts significance

So if you’re running a test for 2+ weeks and checking results daily, then to get a true 5% significance level you need to tighten your nominal threshold to ≤ 1%. That’s a big change: for a two-sided test, the critical value moves from roughly 1.96 to 2.58 standard errors.

The approach you should take

Ok, now that we know some pitfalls of the naive approach, let’s outline key aspects of the way we should approach the statistics for our AB testing (I’ll include more info about each below the list in separate sections).

  1. Relative lifts. Look at relative lifts instead of absolute differences.
  2. Sequential confidence intervals. These confidence intervals give you significance guarantees that hold across all of time, so you can peek at results as much as you want, and they’re easier to interpret than p-values.
  3. Controlled-Experiment Using Pre-Experiment Data (CUPED). We can actually use sophisticated methods that leverage pre-experiment data to reduce our variance, thus shrinking our confidence intervals and speeding up tests.

(1) Relative lifts

The rationale behind relative lifts is straightforward: we typically care about relative changes instead of absolute changes, and they’re easier to discuss. It’s easier to understand a “5% increase in revenue” than a “$5 increase in revenue per user”.

How does the math change for relative lifts? I’m going to quote from Eppo’s documentation on the subject. First, let’s define relative lift:

Relative lift and related variables
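In terms of the treatment and control means defined earlier, the relative lift is

\text{lift} = \frac{\bar{x}_T - \bar{x}_C}{\bar{x}_C}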

From the central limit theorem, we know that the treatment and control means are normally distributed for large sample sizes. This allows us to model the relative lift as a normal distribution with the following parameters:

Relative lift is normally distributed
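Roughly speaking (I’m paraphrasing the standard delta-method result here; see Eppo’s docs for the exact treatment), the estimated lift is approximately normal, centered on the true relative lift, with variance estimated by

\frac{s_T^2}{n_T\,\bar{x}_C^2} + \frac{\bar{x}_T^2\, s_C^2}{n_C\,\bar{x}_C^4}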

Ok, that’s somewhat complicated. But it’s necessary to compute the sequential confidence intervals.

(2) Sequential confidence intervals

First, let’s start with the confidence interval using a visual representation from an Eppo experiment dashboard:

Confidence interval for a single metric

So you can see that the “point estimate” is a 5.9% lift, with a confidence interval of ~2.5% on either side representing where the true relative lift should be 95% (one minus the typical significance of 5%) of the time. These are much easier for non-statisticians to interpret than p-values — the visuals really help illustrate the data and statistics together.

So what are sequential confidence intervals? Simply put, they’re confidence intervals that hold to a certain confidence level over all of time. They solve the “peeking problem” so you can look at your results as often as you want knowing that your significance level holds. The math here is super tricky so I’ll simply refer you to Eppo’s documentation on the subject if you’re interested in learning more.

(3) Controlled-Experiment Using Pre-Experiment Data (CUPED)

Sequential confidence intervals are wider than their fixed sample counterparts, so it’s harder for metrics to reach statistical significance when using sequential confidence intervals. Enter “Controlled-Experiment Using Pre-Experiment Data” (commonly called CUPED), a method for reducing variance by using pre-experiment data. In short, we can leverage what we know about user behavior before an experiment to help predict the relative lift more accurately. Visually, it looks something like the following:

Source: https://docs.geteppo.com/statistics/cuped/
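The core adjustment from the original CUPED paper is short enough to state: replace each user’s metric Y with a covariate-adjusted version using their pre-experiment value X,

Y_i^{\text{adj}} = Y_i - \theta\,(X_i - \bar{X}), \qquad \theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}

which shrinks the variance by a factor of 1 - \rho^2, where \rho is the correlation between X and Y.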

The math is complicated so I won’t bore you with the details. Just know that powerful AB testing platforms like Eppo provide CUPED implementations out of the box.

Bonus material — simplifying computation

While I didn’t fully write out the math for sequential confidence intervals, know that we need to compute the number of users, the mean, and the standard deviation of each group, treatment and control, and we can plug those in to the various formulas.

First, the means are relatively simple to compute:
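\bar{x}_T = \frac{1}{n_T}\sum_{i \in T} x_i, \qquad \bar{x}_C = \frac{1}{n_C}\sum_{i \in C} x_i

where x_i is the metric value (say, revenue) for user i.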

The standard deviation is slightly harder to compute but is defined as follows:
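For a group of n users,

s = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}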

As you can see, we must first compute the mean and then go back and compute the standard deviation. That’s computationally expensive because it requires two passes. But there’s a reformulation we can employ to do the computation in one pass. Let me derive it for you:
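Expanding the square and simplifying gives a form that only needs running sums:

\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}

so

s^2 = \frac{1}{n - 1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)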

Ok, that looks pretty complicated. The original formula seems simpler. However you’ll notice we can compute these sums in one pass. In SQL it’s something like:

SELECT count(*) as n
, sum(revenue) as revenue
, sum(revenue * revenue) as revenue_2
FROM user_metric_dataframe

So that’s great from a computation standpoint: we only need the count, the sum, and the sum of squares for each group.

Bonus material — Bayesian statistics

Perhaps you’ve heard of Bayes’ theorem before, but you’ve likely not heard of Bayesian statistics. I certainly hadn’t until I arrived at Eppo. I won’t go into the details but will try to provide a brief overview.

Let’s start with Bayes’ theorem:
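P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}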

In Bayesian statistics, you have a belief about your population and then the observed data. Let’s simplify this to “belief” and “data” and write Bayes’ theorem slightly differently.
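P(\text{belief} \mid \text{data}) = \frac{P(\text{data} \mid \text{belief})\, P(\text{belief})}{P(\text{data})}

Here P(\text{belief}) is the prior, P(\text{data} \mid \text{belief}) is the likelihood, P(\text{belief} \mid \text{data}) is the posterior, and P(\text{data}) is the normalization factor.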

So basically you use the likelihood to update your prior, giving you the posterior probability (ignoring for a moment the normalization factor, which is generally hard to compute).

Why is this methodology potentially preferred if you have a small sample size? Because you can set your prior to be something that’s relatively informed and get tighter confidence intervals than you would with classical frequentist statistics. Referring back to the original example, you could say that you expect the relative difference between test (red) and control (blue) is normally distributed with standard deviation of 5% (or something like that, it’s a bit up to you to set your priors).

I totally understand that’s hard to follow if you have no knowledge of Bayesian statistics. If you want to learn more, I recommend picking up a copy of the book Bayesian Statistics the Fun Way. You could also read through the sections of Eppo’s documentation on Bayesian analysis and confidence intervals.

Drawing conclusions

Drawing conclusions is the art of AB testing. Sometimes the decision is easy: diagnostics are all green, your experiment metrics moved in the expected direction and there were no negative impacts on guardrail metrics. However, studies show that only around 1/3 of experiments produce positive results. A lot of experiments might look similar to the report card below for a “New User Onboarding” test:

New user onboarding experiment report card

The primary metric “Total Upgrades to Paid Plan” is up ~6% while there are some negative impacts such as “Site creations” being down ~10%. So what do you do? Ultimately, there’s no right answer. It’s up to you and your team to make the tough calls in situations like this.

In addition to experiment report cards, it’s important to look at experiment diagnostics to make sure the underlying data is in good shape. A very common problem with AB testing is what’s called “sample ratio mismatch” or SRM, which is just a fancy way of saying that the number of users in test and control don’t match what’s expected. For instance you might be running a 50/50 test but your data is showing 55/45. Here’s what an SRM looks like in Eppo:

Example traffic chart for an SRM detected by Eppo
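If you want a quick back-of-the-envelope check of your own, a simple z-test on the split works for a two-variant experiment (this is a simplification; platforms typically run a chi-squared test across all variants):

// Simplified SRM check for a two-variant experiment
function srmZScore(
  controlCount: number,
  treatmentCount: number,
  expectedTreatmentRatio = 0.5
): number {
  const total = controlCount + treatmentCount;
  const expected = total * expectedTreatmentRatio;
  const stdDev = Math.sqrt(total * expectedTreatmentRatio * (1 - expectedTreatmentRatio));
  return (treatmentCount - expected) / stdDev;
}

// |z| above ~3 is a common rule of thumb for flagging an SRM rather than random noise
const z = srmZScore(49_500, 50_800);
console.log(`SRM z-score: ${z.toFixed(2)}, likely SRM: ${Math.abs(z) > 3}`);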

There’s also a variety of other ways your data could be off: one or more of your metrics may not have data; there could be an SRM for a particular dimension value; there may not be any assignments at all; there might be an imbalance of pre-experiment data across variants; and more.

Tools like Eppo help make your life easier by providing you easy-to-understand dashboards that are refreshed nightly. So you can grab your cup of coffee, open up your experiment dashboard, check on your experiments, and potentially start making decisions (or at least monitoring to make sure you haven’t broken something).

Tools and platforms

While you might have initially thought that building an AB testing platform is relatively straightforward, I hope I’ve illustrated that doing it well is extremely challenging. From building a feature flagging tool, to constructing a metrics repository, to getting the stats right, to actually computing the results on a nightly basis, there’s a lot that goes into a robust platform. Thankfully, you don’t need to build one from scratch. There are a variety of tools and platforms that help make AB testing easier. Below is a list of some relevant ones:

Analyzing each of these platforms is beyond the scope of this article. Given all the requirements for an AB testing platform outlined above, however, I can confidently say that Eppo (even though I may be slightly biased because I work there) is the best all-in-one platform for companies that have their data centralized in a modern data warehouse (Snowflake, BigQuery, Redshift, or Databricks) and are looking to run product tests on web or mobile, including tests of ML/AI systems. Eppo provides a robust, global feature flagging system, a fact/metric repository management layer, advanced statistics, nightly computations of experiment results, detailed experiment diagnostics, and a user-friendly opinionated UX that is easy to use even for non-statisticians. If you’re just looking to run simple marketing copy tests, then a tool like Optimizely is probably better for you though it’ll be pretty expensive.

Recommended reading

There’s a lot out there to read about AB testing. Here are some of my recommendations:

And that’s a wrap. Thanks for sticking around folks!


Engineering at Eppo. Formerly SVP Product & Engineering at Storyblocks, McKinsey consultant, software engineer at APT. Catholic, husband, father of three.