We often hear people ask "how do I decide which option in an A/B test is the best?" The answer provided by a traditional approach to A/B testing goes like this:
- Set each variation to be displayed to a fixed proportion of your users, let’s say 50% each
- Run the test until one variation is winning at your chosen level of confidence
- Disable the loser and set the winner to 100%
This approach is straightforward and easy to understand, but there is a more sophisticated way to run an experiment, one that lets you capture more conversions, reach results faster, and spend less time managing experiments and interpreting results.
Auto-Optimization is our approach to A/B testing: it automatically shows what’s working better more often.
True vs. Observed Performance
In any A/B test, each variation being tested has a true conversion rate: the theoretical probability that a user converts, given that they are shown that variation. This is distinct from the variation’s observed conversion rate, which is the actual number of conversions divided by the actual number of test users.
Many A/B testers make the false assumption that the observed conversion rate is the same as the true conversion rate. In fact, the two measure distinct things, and the difference has important implications for your A/B tests. Most A/B testing tools account for this difference by showing you confidence intervals for each variation. What they don’t show you is each variation’s full probability distribution:
Examples of what some probability distributions look like
The greater the statistical power of an experiment (in practice, the more users each variation has been shown to), the denser, or more tightly concentrated, the probability distribution becomes and the closer the observed conversion rate is likely to be to the true conversion rate. Take three variations (A, B and C) where we’ve observed the following data:
- Variation A: 5 conversions, 25 users
- Variation B: 50 conversions, 250 users
- Variation C: 500 conversions, 2,500 users
Even though all three variations have an observed conversion rate of 20%, we can say with greater confidence that the observed conversion rate of Variation C is closer to the true conversion rate. We can use a probability distribution to visualize and quantify this difference between A, B and C:
They’re all converting at 20%, but different levels of statistical power lead to more or less dense probability distributions for where the true conversion rate lies.
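To put numbers on the comparison above, here is a minimal sketch. It assumes a uniform Beta(1, 1) prior on each variation's true conversion rate, which is a common choice for conversion data but is not something this article specifies for Auto-Optimization. Under that assumption the distribution for the true rate is Beta(1 + conversions, 1 + non-conversions), and its standard deviation measures how spread out it is. The function name `beta_posterior_summary` is illustrative.

```python
import math

def beta_posterior_summary(conversions, users, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of the Beta posterior for a conversion rate.

    Assumes a Beta(prior_a, prior_b) prior; Beta(1, 1) is uniform on [0, 1].
    """
    a = prior_a + conversions
    b = prior_b + (users - conversions)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)

# The three variations from the example: (conversions, users)
observations = {"A": (5, 25), "B": (50, 250), "C": (500, 2500)}
for name, (conversions, users) in observations.items():
    mean, sd = beta_posterior_summary(conversions, users)
    print(f"Variation {name}: observed {conversions / users:.0%}, "
          f"posterior mean {mean:.3f}, posterior sd {sd:.3f}")
```

Each tenfold increase in users shrinks the standard deviation by roughly a factor of three, which is why Variation C's distribution is so much tighter around 20% than Variation A's.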
With each additional data point, the probability distribution of a variation’s performance is altered.
Example of a changing probability distribution over time.
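The way a single data point alters the distribution can be sketched as follows, again assuming a Beta distribution that starts from a uniform Beta(1, 1) prior (an assumption made for illustration, not a documented detail of Auto-Optimization). Each converting user adds one to the first parameter and each non-converting user adds one to the second. The `update` helper is hypothetical.

```python
def update(posterior, converted):
    """Return the Beta (a, b) parameters after observing one more user."""
    a, b = posterior
    return (a + 1, b) if converted else (a, b + 1)

posterior = (1, 1)  # assumed uniform prior before any data
for converted in [False, False, True, False, False]:
    posterior = update(posterior, converted)
    a, b = posterior
    print(f"Beta({a}, {b}) -> posterior mean {a / (a + b):.3f}")
```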
Auto-Optimization uses this dynamic information to algorithmically determine how often competing variations are shown. At a high level, this means that we will show better performing variations more often and worse performing variations less often, saving your business time and money.
The advantages of this automated approach over traditional A/B testing are:
- Reduce the resources you’ll need as a business to actively manage experiments, since they are being optimized automatically
- Spend test resources (i.e., time and users) on establishing confidence in the performance of the variations that matter the most: the winners
- Realize a higher average conversion rate during the course of the experiment itself by sending a greater proportion of users to better performing variations
If this approach is so great, why isn't all A/B testing done this way? Reliable algorithmic designs can be computationally intensive, and historically have been too costly to use. But new methods in Bayesian computation and parallel processing have made it easier and more affordable than ever to reap the benefits of high-frequency computing for the purpose of making better mobile apps.
How it Works
When using Auto-Optimization, all variations start out being shown in equal proportions. So if you have a total of five variations, the probability that any given variation will be shown to a user when you launch the experiment is one-fifth, or 20%.
As soon as we start to collect data for each variation, we’ll also observe differences in their performance and estimate, for each variation, a probability distribution over where its true performance lies. Eventually, these distributions ‘separate out’ and we can programmatically exploit those differences.
As our Auto-Optimization algorithm learns, it incorporates knowledge into test variations’ probability distributions.
New users included in experiments are then shown a variation selected using a randomized probability matching heuristic. The best performing arm (variation) is the most likely to be picked because its probability distribution is more heavily weighted towards a higher value. However, there is always a chance of picking the other arms when their probability distributions overlap with the best performer’s, allowing the algorithm to keep learning about the true conversion rate of all variations.
Usually the best-performing variation is picked because it has a probability distribution that is more heavily weighted towards a higher value, so a randomly selected value from its probability distribution is greater than that of other variations.
Sometimes, another variation is picked because a randomly selected value from its probability distribution is greater than that of the best-performing variation. This is important because it allows the algorithm to explore where the true conversion rate of all variations lies more precisely, and make better decisions in the future.
As the algorithm continues to learn, the variations’ distributions ‘separate out’ and the best-performer will be picked more frequently.
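The selection step described above can be sketched in a few lines. This is a simplified illustration of randomized probability matching (often called Thompson sampling), not Auto-Optimization's actual implementation: it assumes Beta distributions with uniform Beta(1, 1) priors, and the function names, true rates and user count are made up for the simulation.

```python
import random

def choose_variation(posteriors, rng):
    """Draw one value from each variation's Beta posterior and return the
    index of the variation with the largest draw."""
    draws = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(draws)), key=draws.__getitem__)

def run_experiment(true_rates, n_users, seed=0):
    """Simulate an experiment and return how many users saw each variation."""
    rng = random.Random(seed)
    posteriors = [[1, 1] for _ in true_rates]  # assumed uniform priors
    shown = [0] * len(true_rates)
    for _ in range(n_users):
        i = choose_variation(posteriors, rng)
        shown[i] += 1
        converted = rng.random() < true_rates[i]
        # A conversion increments a; a non-conversion increments b.
        posteriors[i][0 if converted else 1] += 1
    return shown

# Hypothetical true conversion rates, unknown to the algorithm.
shown = run_experiment(true_rates=[0.10, 0.12, 0.20], n_users=5000)
print("users per variation:", shown)
```

Before any data arrives, every distribution is identical, so each variation is equally likely to produce the largest draw. As data accumulates, the third variation's draws win more and more often, and it receives most of the traffic.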
Auto-Optimization vs. Traditional A/B
In this way, Auto-Optimization prunes away the weaker variations and drives new users to the better ones, so that you can achieve statistical significance faster and maintain higher average performance while the experiment itself is still running. In contrast, traditional A/B testing uses a fixed exploration rate, so it will consistently test suboptimal arms throughout the life of the experiment.
Traditional A/B tests do not change their exploration rate over time, since the splits are fixed at the beginning of the experiment. The downside of traditional A/B testing is that a fixed percentage of users has to be given a suboptimal experience. Epsilon-Greedy algorithms also have a fixed exploration rate over time; they dynamically choose the best arm more often, but will continue to explore through the length of the experiment. Auto-Optimization progressively stops exploring suboptimal variations until only the best variation(s) remain.
Traditional A/B tests also have a fixed rate of exploitation, and traffic is split evenly through the length of the experiment. Epsilon-Greedy algorithms may initially choose the best arm more often due to the fixed exploitation rate, but will continue to explore even after learning which arm is best. Auto-Optimization exploits the best options and will continue to increase the rate at which the best arm is selected over time.
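The three allocation strategies compared above can be simulated side by side. This is a rough sketch under stated assumptions: the "thompson" branch stands in for the randomized probability matching described earlier (with uniform Beta(1, 1) priors), the epsilon value of 0.1 and the true conversion rates are arbitrary, and none of it is meant to mirror the product's actual implementation.

```python
import random

def simulate(policy, true_rates, n_users, seed=0, epsilon=0.1):
    """Return how many users each arm received under the given policy.

    policy is one of "fixed" (even split), "epsilon_greedy", or "thompson".
    """
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [0] * k   # conversions per arm
    shown = [0] * k  # users per arm

    def observed_rate(i):
        # Arms that have not been shown yet count as 0.
        return wins[i] / shown[i] if shown[i] else 0.0

    for _ in range(n_users):
        if policy == "fixed":
            arm = rng.randrange(k)
        elif policy == "epsilon_greedy":
            if rng.random() < epsilon:
                arm = rng.randrange(k)  # explore at a constant rate
            else:
                arm = max(range(k), key=observed_rate)  # exploit
        elif policy == "thompson":
            draws = [rng.betavariate(1 + wins[i], 1 + shown[i] - wins[i])
                     for i in range(k)]
            arm = max(range(k), key=draws.__getitem__)
        else:
            raise ValueError(f"unknown policy: {policy}")
        shown[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return shown

for policy in ("fixed", "epsilon_greedy", "thompson"):
    shown = simulate(policy, true_rates=[0.10, 0.12, 0.20], n_users=20000)
    print(policy, [round(s / sum(shown), 3) for s in shown])
```

In this sketch the fixed split keeps sending about a third of users to each arm, epsilon-greedy keeps a constant trickle of traffic on the weaker arms, and the probability matching policy sends a steadily shrinking share to them.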
Auto-Optimization uses two metrics to determine when an experiment has been completed and a winner should be rolled out to the entire user base or segment: Chance to Beat Baseline* and Potential Value Remaining**.
An experiment is completed when:
- Any one variation reaches a 95% Chance to Beat Baseline; and
- There is a 95% chance that the Potential Value Remaining in the experiment is less than 1% of the best performing variation’s conversion rate.
In this way, we can declare with confidence both that the winning variation is better and that continuing to run the experiment would not result in any significant further improvement in performance.
*You can read up on how Chance to Beat Baseline is calculated here.
**Potential Value Remaining is calculated using a Monte Carlo simulation based on randomized probability sampling to find the additional improvement in performance that could be realized from continuing to run the experiment.
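The two completion criteria can be approximated with a short Monte Carlo sketch. Several details here are assumptions rather than facts from this article: the Beta(1, 1) priors, treating the first variation as the baseline, estimating Chance to Beat Baseline as the share of random draws in which a variation's rate exceeds the baseline's, and defining the value remaining in each draw as the relative gap between the best draw and the draw of the variation that is most often best. The function names are hypothetical.

```python
import random

def stopping_metrics(data, baseline=0, n_draws=20000, seed=0):
    """Monte Carlo estimates of the two stopping metrics.

    data: list of (conversions, users) per variation.
    Returns (chance to beat baseline per variation, 95th percentile of the
    potential value remaining as a fraction of the current best's rate).
    """
    rng = random.Random(seed)
    k = len(data)
    beats = [0] * k
    optimal = [0] * k
    samples = []
    for _ in range(n_draws):
        draw = [rng.betavariate(1 + c, 1 + n - c) for c, n in data]
        samples.append(draw)
        optimal[max(range(k), key=draw.__getitem__)] += 1
        for i in range(k):
            if draw[i] > draw[baseline]:
                beats[i] += 1
    best = max(range(k), key=optimal.__getitem__)  # most often optimal
    remaining = sorted((max(d) - d[best]) / d[best] for d in samples)
    pvr_95 = remaining[int(0.95 * n_draws) - 1]
    chance = [b / n_draws for b in beats]
    return chance, pvr_95

def is_complete(data, baseline=0):
    """Apply the two criteria: 95% chance to beat baseline, and a 95%
    chance that the value remaining is under 1% of the best rate."""
    chance, pvr_95 = stopping_metrics(data, baseline)
    return max(chance) >= 0.95 and pvr_95 < 0.01

# Hypothetical data: (conversions, users) for baseline and one challenger.
print(is_complete([(100, 1000), (200, 1000)]))  # clear separation
print(is_complete([(10, 100), (12, 100)]))      # too little data so far
```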