Choosing the Right Statistical Method for AB Testing: A Comprehensive Guide for Non-Technical Users

Written by Julie Trenque

Updated on 09/11/2024

3 min

Intermediate


Introduction

As AB testing practitioners, we benefit from the extensive study of controlled experiments by renowned statisticians over the years. This provides us with numerous effective methods for decision-making.

Making decisions with a controlled level of risk is the heart of AB testing, but names such as “Frequentist” or “Bayesian” can be overwhelming. We hope this article helps you choose the right tool for the job. At Kameleoon, we want you to Experiment your way. This means making sure our customers are comfortable with the tools they use and understand both the risk they take and the reward they get when they implement a variation. This is why we support a wide range of statistical computations and empower our customers to make the most of each experiment: classical fixed-sample frequentist tests, sequential tests and Bayesian statistics, as well as multiple testing correction and CUPED. We have also developed an opportunity detection algorithm to make sure you don’t miss a chance to improve the experience of your users.

Main available paths

When interpreting your AB test results today, the pivotal decision is choosing between the frequentist, Bayesian, and sequential approaches. In this context, the frequentist approach means the fixed sample size method, as opposed to the sequential one. Each approach has its own advantages and drawbacks. We believe you should opt for the one that aligns most closely with your operational preferences and with the experimentation culture you aim to foster within your company.

Fixed Sample Frequentist

The Fixed Sample Frequentist method imposes a fairly rigid framework: you state your hypothesis clearly before launching the test, estimate the length of the experiment from the minimal effect size you want to detect, and stick to it. If you manage to respect all these restrictions, without stopping early because you saw an unexpected drop in one metric and without extending the experiment because you have nearly reached your desired reliability level, the results are excellent: you get very good statistical power and control the risk at the level you chose. You have to be well organized to get the most out of the fixed sample frequentist method, but the results are worth it.
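
To make the step of estimating the experiment length more concrete, here is a minimal sketch of a textbook sample size calculation for comparing two conversion rates with a two-sided z-test. The function name, the default 5% significance level and 80% power, and the example baseline are illustrative assumptions; this is not a description of Kameleoon's internal computation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation (two-sided z-test on proportions).

    baseline_rate: conversion rate of the original, e.g. 0.05 for 5%.
    relative_mde:  minimal relative lift to detect, e.g. 0.10 for +10%.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls false positives
    z_beta = NormalDist().inv_cdf(power)           # controls false negatives
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical example: 5% baseline conversion rate, +10% relative lift.
n = sample_size_per_variation(0.05, 0.10)
print(f"Visitors needed per variation: {n}")
```

Dividing this number by the expected daily traffic per variation gives a rough duration to commit to before the test starts. Note how quickly the requirement grows when the minimal effect size shrinks.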

Pros

  • Maximal statistical power.
  • Most widely used method for randomized controlled trials.

Cons

  • Very rigid.

Target

Mature experimentation teams who understand the risk they take when they don’t follow the method by the book. If you know what you are doing, this method leaves you some latitude, but if your focus is only on testing ideas, you should stick to the guidelines.

Bayesian

The Bayesian method, on the other hand, gives more freedom because it hides some complexity behind the prior. This method is a good fit if you have a large amount of historical data that you can leverage to build an informed prior for your experiments. Or if you are convinced that the world is Bayesian… The Bayesian framework gives you access to probability distributions from which you can extract a wide range of information. One of the most useful is usually the probability that your variation improves the conversion rate of a given objective compared to the original.
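
As an illustration of the probability that a variation beats the original, here is a minimal sketch using a Beta-Binomial model and Monte Carlo sampling. The uniform Beta(1, 1) prior, the function name, and the visitor counts are assumptions chosen for the example; an informed prior would replace `prior_alpha` and `prior_beta`, and this is not presented as Kameleoon's implementation.

```python
import random


def probability_to_beat_original(conv_a, n_a, conv_b, n_b,
                                 prior_alpha=1.0, prior_beta=1.0,
                                 draws=50_000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors by random sampling."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of a conversion rate: Beta(prior + successes, prior + failures).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: original 500/10,000 conversions, variation 560/10,000.
p = probability_to_beat_original(500, 10_000, 560, 10_000)
print(f"Probability that the variation beats the original: {p:.3f}")
```

The same posterior samples could also be used to derive other quantities, such as a credible interval on the lift.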

Pros

  • Allows you to leverage your knowledge.
  • Gives you access to many interesting quantities based on the estimated distribution.

Cons

  • Can lead you in the wrong direction if the prior is not built correctly.

Target

Experimentation teams that are used to working with Bayesian techniques and that value the flexibility and the ability to leverage data to set a prior. This technique is aimed at slightly more advanced users, as understanding what a probability distribution is helps to get the most out of it.

Sequential

Finally, the sequential testing framework emerged as a solution to the peeking problem: checking a fixed sample test repeatedly and acting on the first significant result inflates the risk of a false positive. Experimenters often face a conundrum where they see the experiment results going sideways but must stick to the sample size set before launching the test. Sequential tests shine where classical fixed sample size tests fail. As you can imagine, this comes at a cost, which is statistical power. If you experiment in a very flexible way and want to move fast, at the cost of a less precise estimate of the effect size, this can be the solution for you.
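
The peeking problem can be observed with a small simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so every significant result is a false positive) and compares a single final look with ten interim looks using an unadjusted z-test. All parameters and the function name are illustrative assumptions; the code shows the problem that sequential methods address, not a sequential test itself.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_sims, n_per_arm, looks, rate=0.1, alpha=0.05, seed=0):
    """Share of simulated A/A tests declared significant at any of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Evenly spaced sample sizes at which the experimenter checks the results.
    checkpoints = {n_per_arm * (i + 1) // looks for i in range(looks)}
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        for n in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n in checkpoints:
                pooled = (conv_a + conv_b) / (2 * n)
                se = sqrt(2 * pooled * (1 - pooled) / n)
                if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                    false_positives += 1  # stop at the first "significant" peek
                    break
    return false_positives / n_sims


fixed_rate = false_positive_rate(n_sims=500, n_per_arm=1000, looks=1)
peeking_rate = false_positive_rate(n_sims=500, n_per_arm=1000, looks=10)
print(f"One final look: {fixed_rate:.3f}  |  ten peeks: {peeking_rate:.3f}")
```

With a single look the false positive rate should stay close to the chosen 5%, while repeated peeking pushes it noticeably higher. Sequential tests widen their decision thresholds so that the risk stays controlled at every look.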

Pros

  • Very flexible, no need to estimate a sample size.
  • Get a valid confidence interval at any point during the life of the experiment.

Cons

  • Lower statistical power.

Target

Fast-moving teams that feel constrained by the rigidity of the fixed sample framework and are ready to sacrifice some statistical power to gain flexibility.

Important: In a real-world scenario, you will get the most out of sequential testing by coupling it with the fixed sample methodology. This means setting sequential alerts so that you are notified when the sequential reliability reaches a given threshold, or validating any early peek against its sequential counterpart. Incorporating this into your fixed sample methodology unlocks new scenarios and allows you to increase your experimentation velocity.

And other tools…

We also want to mention two of the many other tools available to us as CRO practitioners. They are often overlooked, but we believe they are valuable when used wisely.

Multiple testing correction

When running an experiment, it is often a good idea to test more than one change, which results in more than one variation. This increases the overall velocity of your experimentation program. But each additional comparison against the original adds another chance of a false positive, so make sure you apply a correction to avoid unintentionally increasing that risk.
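
As a minimal sketch of what such a correction does, the example below applies two standard procedures, Bonferroni and Holm, to the p-values of three variations compared against the original. The p-values are made up for illustration, and the source does not state which correction Kameleoon applies.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only the p-values below alpha divided by the number of comparisons."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure: less conservative, same family-wise error control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return rejected


# Hypothetical p-values for three variations versus the original.
p_values = [0.012, 0.04, 0.02]
print([p <= 0.05 for p in p_values])  # no correction: [True, True, True]
print(bonferroni(p_values))           # [True, False, False]
print(holm(p_values))                 # [True, True, True]
```

Without correction, every variation is judged against the full 5% threshold, so the chance of at least one false positive across the experiment is higher than 5%. Both corrections bring that overall risk back under control, with Holm giving up less power than Bonferroni.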

CUPED

CUPED is becoming well known and is a very reliable way to reduce the sample size needed to detect a given lift. If you want a refresher on how CUPED works, be sure to read our dedicated article. Here, we want to recall the use cases where applying the technique makes the most sense. We have found that CUPED has the most impact in the following circumstances:

  • Your experiment includes returning visitors: We can improve our predictions when the experiment is exposed to returning visitors because we’ll have data on them.
  • You’ve already run many experiments in Kameleoon: The more you use Kameleoon, the more experiment data you’ll have that our algorithm can use to improve the effectiveness of CUPED.
  • There is a correlation between the goal conversions before the start of the experiment and during the live experiment: Make sure you’ve been collecting data for the main goal of your current experiment (for which you want to enable CUPED) before launching it. The stronger the correlation between the goal conversions before and during the experiment, the better we’ll be able to predict the real impact of the current test.
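
The role of this correlation can be seen in a minimal sketch of the textbook CUPED adjustment, where each visitor's metric is corrected using their pre-experiment value. The synthetic data, the function name, and the variable names are assumptions made for illustration; this is the generic formula, not Kameleoon's algorithm.

```python
import random
from statistics import fmean, variance


def cuped_adjust(pre, post):
    """Return post-experiment values adjusted with the pre-experiment covariate.

    adjusted = post - theta * (pre - mean(pre)), with theta = cov(pre, post) / var(pre).
    """
    mean_pre, mean_post = fmean(pre), fmean(post)
    cov = sum((x - mean_pre) * (y - mean_post)
              for x, y in zip(pre, post)) / (len(pre) - 1)
    theta = cov / variance(pre)
    return [y - theta * (x - mean_pre) for x, y in zip(pre, post)]


# Synthetic visitors whose behaviour during the test correlates with their past.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5_000)]
post = [0.8 * x + rng.gauss(2, 1) for x in pre]

adjusted = cuped_adjust(pre, post)
print(f"Variance before CUPED: {variance(post):.2f}")
print(f"Variance after CUPED:  {variance(adjusted):.2f}")
```

The adjustment leaves the average of the metric unchanged but removes the part of its variance explained by past behaviour. The stronger the correlation, the larger the reduction, and the fewer visitors are needed to detect a given lift. In practice, theta is typically estimated on the data of all variations pooled together before comparing the adjusted metric between variations.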
