Author: Adam G. Dobrakowski
Editing: Zuzanna Kwiatkowska
If you work in machine learning, you've almost certainly heard of A/B testing. It is the most reliable experiment you can conduct to confirm the performance and quality of your model.
In this post, I'm going to introduce you to the topic of A/B testing and the steps it involves. It is the first of 3 posts in this series:
- A/B Testing in Machine Learning. Part 2: Most common problems
- A/B Testing in Machine Learning. Part 3: 4 most common mistakes
What are A/B tests?
Originally, A/B tests were introduced as a way to test changes to websites or applications. Their most common use case is to increase conversion rate: we observe which version of a given website performs better and then use it as the main version of the website. However, they can also be used to test other aspects of the system, such as usability, readability or how memorable it is.
In A/B testing, we show 2 versions of, let’s say, a product to 2 separate groups of users, one version to one group. We then calculate performance metrics in each group and analyse if the difference in metrics between those groups was random or a result of differences between versions.
In the machine learning world, A/B testing can have a broader definition – it can be a final confirmation that our model works better in the real world than a different model or than a human. We can test 2 completely different models or two versions of the same model.
We can use A/B testing to validate the majority of machine learning models across various industries, for example recommendation systems or predictive models, both online and in the real world.
Importantly, in A/B testing we validate both of our groups at the same time. This protects our measurements from factors that occur regardless of our model, for example seasonal increases or decreases in sales volume in some periods of the year.
Steps in A/B testing
A/B tests consist of the following steps:
1. We deploy our model to production.
2. We split our user traffic into 2 streams.
3. We decide on the duration and assumptions of our test.
4. We gather the data.
5. We analyse the data and draw conclusions about the performance of our model.
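The whole procedure can be sketched end to end. The snippet below is a toy illustration with made-up conversion rates (10% and 11%), not the article's actual setup:

```python
import random

random.seed(0)

# Step 2: split users deterministically into two streams.
def assign_group(user_id: int) -> str:
    return "A" if user_id % 2 == 0 else "B"

# Step 4: gather (here: simulated) conversion data per group.
# The true rates below are assumed purely for illustration.
conversions = {"A": [], "B": []}
for user_id in range(10_000):
    group = assign_group(user_id)
    true_rate = 0.10 if group == "A" else 0.11
    conversions[group].append(1 if random.random() < true_rate else 0)

# Step 5: compare the observed conversion rates.
rate_a = sum(conversions["A"]) / len(conversions["A"])
rate_b = sum(conversions["B"]) / len(conversions["B"])
print(f"conversion A: {rate_a:.3f}, conversion B: {rate_b:.3f}")
```

In a real experiment, step 5 would also include a significance test rather than a raw comparison of the two rates.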
The following diagram shows the procedure for websites, but the overall idea stays the same:
Points 1 and 5 are fairly obvious, so let’s focus on 2-4.
A. Users’ split
As I already mentioned, we need to split our user base into 2 separate streams that are as similar to each other as possible. This step reduces the influence of external factors and allows a better, fairer comparison.
When we analyse a website visited by a large number of users, we can simply show a randomly chosen version of the website to each visitor. However, this is only easy in theory. One problem is that we must ensure we always show the same version of the website to a given user, even if they come back after a couple of days. This can be solved, for example, by using cookies.
Another challenge arises when we want to split between larger elements, such as marketing campaigns or versions of the e-store. Imagine that your ML model is tasked with optimising the marketing campaign for a given product by placing ads on chosen websites. If we want our comparison to be fair, we would need to create exactly the same campaign, but managed by a human or by another version of the model. Unfortunately, those campaigns would never be identical: if we use an ad slot for one of the campaigns, we simply cannot use it for the other at the same time.
The problem is even bigger if we want to optimise those campaigns by using tools like Facebook Ads, which has its own optimization algorithm. It can influence our own model and make the comparison impossible.
But don’t worry! Those are the problems that I’m going to cover in detail in Part 2 of this article.
B. Test assumptions and statistical significance
Statistical significance tells us the probability that the difference between 2 groups is not random. It's important when using tools like A/B testing, because the differences between groups may be completely random and lead us to wrong conclusions.
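For conversion rates, one standard way to check significance is a two-proportion z-test. The sketch below implements it with the standard library only; the conversion counts are made up for illustration:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical results: 500/5000 conversions for A, 560/5000 for B.
z, p = two_proportion_z_test(conv_a=500, n_a=5000, conv_b=560, n_b=5000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

If the resulting p-value is below our chosen threshold (commonly 0.05), we treat the difference between the groups as non-random.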
The size of each group should depend on the assumed confidence level and the length of the experiment. Moreover, A/B testing can be quite expensive, so it's good to estimate the sample size and experiment length up front, either by running simulations on historical data or by using statistical methods.
How to do that? First, we define the KPI of our model, for example increasing the conversion rate by 5%. Next, we perform N simulations, assuming different sample sizes and experiment lengths. In each simulation, we compare 2 models: one with the historical conversion rate, and one with the increased conversion rate. Then, we observe the differences between the models across those simulations. In some of them the difference may be 4% or 10%, and in some the second algorithm may even come out worse. This is why we choose the sample size and length for which the difference was significant in, for example, 95% of the simulations.
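A minimal sketch of this simulation approach, assuming an illustrative 10% historical conversion rate and the hoped-for 5% relative uplift (these numbers are not from any real experiment):

```python
import numpy as np

rng = np.random.default_rng(42)

BASE_RATE = 0.10   # historical conversion rate (assumed for illustration)
UPLIFT = 1.05      # hoped-for 5% relative improvement of the new model
Z_CRIT = 1.96      # two-sided 5% significance threshold
N_SIM = 2_000      # simulated experiments per candidate sample size

def simulated_power(n):
    """Fraction of simulated A/B tests in which the uplift is detected."""
    a = rng.binomial(n, BASE_RATE, size=N_SIM)
    b = rng.binomial(n, BASE_RATE * UPLIFT, size=N_SIM)
    # Two-proportion z statistic for each simulated experiment.
    p_pool = (a + b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = np.abs(b - a) / n / se
    return np.mean(z > Z_CRIT)

for n in (10_000, 30_000, 60_000):
    print(f"n = {n:>6} users per group -> power ~ {simulated_power(n):.2f}")
```

The sample size we pick is the smallest one where the detection rate reaches the level we are comfortable with; for a small relative uplift like 5% it tends to be surprisingly large.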
The same parameters can be calculated without simulations if we know that the model's performance follows some probability distribution. In such a case, we can simply test the null hypothesis that the second algorithm does not outperform the first one by 5% or more. The sample size that allows us to reject this null hypothesis at the assumed confidence level is the parameter we're looking for.
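For conversion rates, the closed-form calculation is usually done with the standard normal-approximation formula for two proportions. A sketch, again assuming a 10% baseline and a 5% relative uplift purely for illustration:

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Detecting a 5% relative uplift over an assumed 10% baseline:
print(sample_size_two_proportions(0.10, 0.105))
```

Larger effects need far fewer users: the required sample size shrinks with the square of the difference between the two rates.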
C. Gathering the data
During the experiment, we must collect all the data about metrics and performance that will allow us to compare the models and their quality. While doing so, you should create a database (logs) of events triggered in production, together with the information about which model triggered each of them.
If you can't use a more complex system for splitting the users, you can simply generate 2 pseudo-random groups with a deterministic function such as f(user_id) = user_id mod 2. This way, you avoid having to store any per-user information to ensure the test is conducted correctly.
In this article, we talked about A/B testing. Is this something new for you? Or do you have prior experience in creating and conducting such tests?
Share your thoughts on our LinkedIn and remember – this is only the first part of our series of articles regarding A/B testing. Stay tuned for more!