Author: Adam G. Dobrakowski
Editing: Zuzanna Kwiatkowska
This article is the second part of the series regarding A/B testing. You can access other articles with the following links:
- A/B Testing in Machine Learning. Part 1: How to prepare the A/B tests?
- A/B Testing in Machine Learning. Part 3: 4 most common mistakes
In this article, I will show you the most common problems I encountered when performing A/B tests in real-life scenarios. I will also tell you how to deal with those challenges!
How to construct two identical test groups?
As I already mentioned in the previous article, our machine learning model is often a component of a bigger system (for example an online shop). Because of that, it might not be feasible to divide our users into two identical groups.
What can we do about that?
Create groups with similar historical performance
Let’s assume that your model will optimise the performance of 20 online stores. Each of them may have a different size, different products and different profits. If you randomly divide the stores into two groups, one group may perform much better than the other even without the model. In that case, a raw comparison between the groups will tell you little about the model’s effect.
The solution may be a “smart” selection of groups, so that the total profit of both groups over the last few months is the same. Then we can expect this profit to be similar in the next period as well. When we later enable the model for one group, we can compare its results with the other group.
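One way to sketch this “smart” selection is a greedy heuristic: sort the stores by historical profit and always assign the next store to the group with the smaller running total. The store names and profit figures below are illustrative assumptions, not real data.

```python
def balanced_split(profits):
    """Greedily assign stores (largest profit first) to the group
    with the smaller running total, so group totals end up close."""
    group_a, group_b = [], []
    total_a = total_b = 0.0
    for store, profit in sorted(profits.items(), key=lambda kv: -kv[1]):
        if total_a <= total_b:
            group_a.append(store)
            total_a += profit
        else:
            group_b.append(store)
            total_b += profit
    return group_a, group_b

# Illustrative monthly profits (in thousands) for six stores.
profits = {"store_1": 120.0, "store_2": 95.0, "store_3": 80.0,
           "store_4": 75.0, "store_5": 60.0, "store_6": 40.0}
a, b = balanced_split(profits)
```

A greedy split is not guaranteed to be optimal, but for a handful of stores it usually gets the group totals within one store’s profit of each other, which is enough for this purpose.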
Dynamic group change
If enabling and disabling the model for a given store does not generate a high cost, we can think of a solution where we regularly rotate the content of the group which uses the model (for example every day). If the experiment lasts long enough, each store will be in both groups for a similar amount of time. Additionally, frequent changes and randomness will offset the effect of external trends in both groups.
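A minimal sketch of such rotation: each day, deterministically reshuffle the stores and let half of them use the model. Seeding the random generator per day keeps the assignment reproducible; the store IDs are illustrative.

```python
import random

def daily_assignment(stores, day, seed=0):
    """Return the set of stores that use the model on the given day.
    Deterministic for a given (seed, day) pair, so the assignment
    can be reproduced and audited later."""
    rng = random.Random(f"{seed}-{day}")
    shuffled = stores[:]
    rng.shuffle(shuffled)
    return set(shuffled[: len(shuffled) // 2])

stores = [f"store_{i}" for i in range(20)]
on_model_today = daily_assignment(stores, day=1)
```

Over a long enough experiment, each store spends a similar amount of time in both groups, and the daily reshuffle spreads external trends (seasonality, promotions) across both conditions.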
Comparing percentage changes
Let’s assume that you can’t get two similar groups of stores anyway (for example, because the algorithm can only be implemented in two stores that are completely different). Then you implement the algorithm in one of them and check how that store’s performance changed compared to the same period in the past.
If, for example, its results increased by 11%, verify whether the same change occurred in the second store during the same period. If the second store’s results increased by only 5%, then (after checking statistical significance) you can attribute a gain of roughly +6 percentage points (p.p.) to your algorithm.
Note that if you were to implement the model in both stores right away and see an average increase of +10%, you would not know if this is due to the model’s performance or other factors. That’s why A/B testing is so important.
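The comparison described above is essentially a difference-in-differences estimate. A minimal sketch, with illustrative numbers matching the example (an 11% change in the treated store vs. a 5% change in the control store):

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Estimated model effect in percentage points: the treated store's
    percentage change minus the control store's percentage change."""
    treated_change = (treated_after - treated_before) / treated_before * 100
    control_change = (control_after - control_before) / control_before * 100
    return treated_change - control_change

# Illustrative revenue figures: treated store grew 100 -> 111 (+11%),
# control store grew 200 -> 210 (+5%).
effect = diff_in_diff(100.0, 111.0, 200.0, 210.0)  # ≈ +6 p.p.
```

Note the implicit assumption: without the model, both stores would have followed similar trends. If that assumption is shaky, the estimate inherits the same uncertainty.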
If the problem you are working on is very urgent and you observe large gains from the model on historical data, you or the decision makers may be strongly tempted to roll the model out to the whole set at once, e.g. in a medical application, so that all patients can benefit from its good results.
However, you must be aware (and be able to communicate to others) that such a gain is only an estimate: results obtained on historical data do not necessarily carry over to live operations. Therefore, at least a small part of the set (e.g. 10%) should be kept outside the model as a control group, so that you can compare its results with the model’s.
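Keeping such a holdout can be as simple as a seeded random split. A minimal sketch; the unit IDs and the 10% share are illustrative assumptions:

```python
import random

def split_holdout(units, holdout_share=0.1, seed=42):
    """Split units into (rollout, holdout) groups. The holdout never
    receives the model and serves as the ongoing control group."""
    rng = random.Random(seed)  # fixed seed -> reproducible split
    shuffled = units[:]
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * holdout_share))
    return shuffled[cut:], shuffled[:cut]

units = [f"unit_{i}" for i in range(1000)]
rollout, holdout = split_holdout(units)
```

The fixed seed matters: the split must stay stable for the whole observation period, otherwise units leak between the rollout and the control group.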
Interactions between group A and group B
It may turn out that even if you divide your population cleanly into two groups (A, where you run the model, and B, where you do not), the results obtained in group A will affect group B. How is this possible?
A good example is marketing campaigns where you use intermediary portals with their own optimisation algorithms. It may turn out that the broker’s algorithm learns that it should give the ML-optimised campaign better advertising placements because it brings more profit. Group B then becomes disadvantaged and performs worse than it would have if group A had not been improved. This may not be a problem in itself, but you must be careful not to draw incorrect conclusions about your model.
Another example is when an algorithm supports human work. Suppose we have a group of sellers A who use the recommendations of the ML model and a group of sellers B who operate without it. In such a setup, you need to give some recommendations to both groups so that sellers don’t know which group they belong to; if they knew, psychological factors could alter the results. These placeholder recommendations are often random or come from a simpler baseline model. Despite this, sellers from the two groups may pass information to each other, and group B may end up benefiting from the recommendations given to group A.
How can you deal with these problems? I don’t think there is a universal solution. First of all, you need to be aware that such effects may occur. You can, for example, start with a minimal version of the A/B test to observe whether these phenomena appear.
No statistical significance
Imagine that you tested 10 stores in group A (with the model) and 10 stores in group B (without it). After a month of testing, stores in group A achieved an average improvement of 9% and stores in group B an average improvement of 5%. Let’s assume that in the previous months the results of both groups were similar. The assumed KPI was to improve the stores’ results by 5%.
Would you consider the A/B test successful in this situation? On the surface, you achieved a 4 p.p. improvement over the control group. However, every test is subject to uncertainty, and with only 10 stores per group the observed difference may well fall within the noise. In this situation, it is worth continuing the test or increasing its scope before declaring success.
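One way to check whether such a difference could be noise is a permutation test: repeatedly reshuffle the 20 improvement figures between the two groups and see how often a difference as large as the observed one arises by chance. The per-store improvement figures below are illustrative, chosen so the group means are roughly 9% and 5%.

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided p-value for the difference in group means,
    estimated by randomly reshuffling labels n_perm times."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Illustrative per-store improvements (%): group A mean ~9, group B mean ~5.
group_a = [9.5, 12.0, 7.8, 10.1, 8.9, 11.2, 6.5, 9.0, 8.4, 6.6]
group_b = [5.2, 4.8, 6.1, 3.9, 5.5, 4.4, 6.0, 5.1, 4.6, 4.4]
p_value = permutation_test(group_a, group_b)
```

A permutation test makes no assumption about the distribution of the improvements, which is convenient with only 10 stores per group; a small p-value suggests the difference is unlikely to be pure chance, while a large one is a signal to keep the test running.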
I have presented you with some problems that I have encountered in my work. And what is your experience with this topic? Have you encountered any other difficulties? I would love to hear your opinion on LinkedIn!