Sample size calculator

You want to hit statistical reliability - fast.
Figure out how big your sample needs to be with our A/B test calculator.
No Mathematics PhD required.

When running A/B tests to improve your conversion rate, it is highly recommended to calculate a sample size before testing and to measure your confidence interval.

This advice comes from long-established industries (agriculture, pharmaceuticals…) where it is important to know your confidence level in advance, because it determines the cost of the experiment, which should be kept as low as possible.

This is what sample size calculators are used for. You are asked for the current success rate (conversion rate) and the size of the minimum effect to be measured. The result of the calculation is the number of visitors needed to draw a conclusion from such an experiment.
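As an illustration, the calculation a classic sample size calculator performs can be sketched in Python. This is a minimal sketch based on the textbook two-proportion z-test approximation, not AB Tasty's actual algorithm. The function name, the choice of a relative effect, and the defaults (95% confidence level, 80% power, two-sided test) are assumptions made for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_effect,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (hypothetical helper).

    baseline_rate   -- current conversion rate, e.g. 0.05 for 5%
    relative_effect -- minimum relative lift to detect, e.g. 0.10 for +10%
    alpha           -- accepted type I error rate (assumed 0.05)
    power           -- desired statistical power (assumed 0.80)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_effect)

    # Standard normal quantiles for a two-sided test and the chosen power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Sum of the Bernoulli variances of both variations.
    variance = p1 * (1 - p1) + p2 * (1 - p2)

    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)


# A 5% conversion rate and a +10% relative lift need roughly
# 31,000 visitors per variation under these assumptions.
print(sample_size_per_variation(0.05, 0.10))
```

Halving the effect to detect roughly quadruples the required sample, which is why the expected effect size matters so much in this approach.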

This transposes badly to the digital world for three main reasons:

  1. Measuring conversions costs nothing (unlike in industry).
  2. The number of visitors is part of the problem (not the answer).
  3. The effect of a variation is difficult to predict (in practice, this is precisely the question you are asking yourself!).

This makes it very difficult to use sample size calculators. So, our data scientists at AB Tasty have developed a Minimum Detectable Effect (MDE) calculator.

Just enter the number of visitors you have on your site and the conversion rate of the page you want to test!
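To give an idea of what such a calculation involves, here is a hedged Python sketch that turns a visitor count and a conversion rate into a minimum detectable effect. It is not AB Tasty's implementation: the function name is hypothetical, the visitor count is assumed to be per variation, the baseline variance is used for both variations as a simplification, and the defaults (95% confidence level, 80% power, two-sided test) are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(visitors_per_variation, baseline_rate,
                              alpha=0.05, power=0.80):
    """Approximate relative MDE (hypothetical helper, simplified formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Standard error of the difference between two conversion rates,
    # assuming both variations share the baseline variance.
    std_error = sqrt(2 * baseline_rate * (1 - baseline_rate)
                     / visitors_per_variation)

    absolute_mde = (z_alpha + z_beta) * std_error
    return absolute_mde / baseline_rate  # expressed relative to the baseline


# 10,000 visitors per variation at a 5% conversion rate can detect
# a relative lift of roughly 17% under these assumptions.
print(f"{minimum_detectable_effect(10_000, 0.05):.1%}")
```

More visitors or a higher baseline conversion rate both shrink the smallest effect the test can reliably detect.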

Minimum Detectable Effect calculator

Minimal Detectable Effect

Calculate the minimum sample size as well as the ideal duration of your A/B tests based on your audience, conversions and other factors like the Minimum Detectable Effect.

How many users do you need?

Required number of tested visitors per variation

How long should your test run?

Our A/B test calculator also gives you an idea of the duration of your A/B test. For this test duration calculator to work, please fill in the information above, as well as your average daily traffic on the tested page and your number of variations – including the control version. Read more about confidence interval and methods to interpret test results.

Required duration in days

The goal is to provide a simple way to calculate the population size required for a test to be statistically significant (e.g. the number of visitors you need in order to assess that a lift/loss of x% can be trusted with a 95% confidence level).

The null hypothesis is the convention in “frequentist” statistical tests, stating that there is no difference between variations (hence the name “null”).

When a test’s result is positive, it means that the null hypothesis is rejected: the data indicate a difference between variations. On the contrary, when the test’s result is negative, the null hypothesis is not rejected: no difference between variations could be detected.

This is linked to the concept of p-value.

The p-value is the probability of observing a result at least as extreme as the one measured in the A/B test, assuming the null hypothesis is true.

In short, if the p-value is low (smaller than 0.05), the observed result would be unlikely under the null hypothesis, which suggests that there is a difference between variations.

On the contrary, if the p-value is high (greater than 0.05), the data are compatible with the null hypothesis: no real difference between the variations has been shown. You cannot conclude at this point and need more data to further the analysis.

The p-value only informs about the existence of a difference; it doesn’t give any information about its size or about whether A > B or B > A.

Notation: since the p-value formulation is a bit confusing, it is often translated into a “confidence level” expressed as a percentage: (1 - p-value) * 100.
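As a worked example, the p-value and the corresponding confidence level can be computed with a pooled two-proportion z-test. This is a sketch under assumptions, not AB Tasty's algorithm: the function names and the visitor and conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value of a pooled two-proportion z-test (illustrative)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Under the null hypothesis both variations share one conversion rate.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = sqrt(pooled * (1 - pooled)
                     * (1 / visitors_a + 1 / visitors_b))

    z = (rate_b - rate_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def confidence_level(p_value):
    """Translate a p-value into a confidence level in percent."""
    return (1 - p_value) * 100


# Hypothetical data: 500/10,000 conversions for A, 560/10,000 for B.
p = two_sided_p_value(500, 10_000, 560, 10_000)
print(f"p-value: {p:.3f}, confidence level: {confidence_level(p):.1f}%")
```

With these made-up numbers the confidence level lands just below 95%, so the test would not yet be declared significant at the conventional threshold.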

Reaching statistical significance means that the confidence level is equal to or greater than a given threshold. Theory dictates that this threshold is fixed once, before the start of the experiment.

For the confidence level, a conventional threshold for its statistical significance is 95% (corresponding to a p-value of 0.05), but it is only a convention.

This threshold should be set with the distinctive characteristics of each business in mind, as it is directly linked to the risk deemed reasonable for the experiment.

Also remember that a 95% statistical significance threshold means that, when there is no real difference between variations, about 1 in every 20 tests will still report one, without any possibility of detecting it.

The algorithm is currently based on an extrapolation of the z-statistic formula, usually used for the normal distribution. AB Tasty also offers Bayesian A/B Testing.

Statistical power is the ability of a test to detect an effect if the effect actually exists, i.e. to detect a difference between variations if a real difference exists.

A statistical test can make two types of errors. For A/B tests, a type I error, also called a “false positive”, is declaring a bad variation as the winner, while a type II error is missing a winning variation.

The distinction is not just theoretical: type I and type II errors often don’t carry the same cost! It is therefore desirable to handle them differently.
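The link between power and the type II error rate can be made concrete with a short Python sketch. It uses the usual normal approximation for two proportions and ignores the negligible probability of a significant result in the wrong direction. The function name and the example figures are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_of_test(rate_a, rate_b, visitors_per_variation, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (illustrative)."""
    std_error = sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
                     / visitors_per_variation)

    # alpha is the accepted type I error rate (false positives).
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(rate_b - rate_a) / std_error

    return NormalDist().cdf(z_effect - z_alpha)


# True rates of 5.0% vs 5.5% with 31,231 visitors per variation.
power = power_of_test(0.05, 0.055, 31_231)
print(f"power: {power:.2f}, type II error rate: {1 - power:.2f}")
```

Lowering alpha to reduce false positives also lowers power for a fixed sample, so the two error rates have to be traded off against each other.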

One-sided and two-sided tests, also named one- and two-tailed tests, differ in the scope of their result:

  • One-sided tests only look for a difference in one direction chosen in advance (for example, B > A). A difference in the opposite direction goes undetected.
  • Two-sided tests look for a difference in either direction: if A != B, it could be that A > B or A < B.

This is really important for A/B testing as the direction of a difference, if any, is generally unknown before an experiment starts.

Two-sided tests are safer to use and this is what we use at AB Tasty.
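To make the difference concrete, here is a small Python sketch that derives a one-sided and a two-sided p-value from the same z-statistic. The function name, the dictionary keys, and the example z value are hypothetical. The one-sided test is assumed to look only for B > A.

```python
from statistics import NormalDist


def z_test_p_values(z):
    """P-values for a z-statistic computed as (rate_b - rate_a) / std_error."""
    cdf = NormalDist().cdf
    return {
        # Only evidence that B beats A counts as extreme.
        "one_sided_b_greater": 1 - cdf(z),
        # Evidence in either direction counts as extreme.
        "two_sided": 2 * (1 - cdf(abs(z))),
    }


# A strongly negative z means B performs clearly worse than A.
print(z_test_p_values(-2.5))
```

With z = -2.5, the two-sided p-value is below 0.05 and flags the loss, while the one-sided p-value is close to 1 and reports nothing, which illustrates why a two-sided test is the safer choice when the direction is unknown.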