Author: Adam G. Dobrakowski
Editing: Zuzanna Kwiatkowska
In Machine Learning, a benchmark is a model used to compare the performance of other models. There are different types of benchmarks. Sometimes it is a so-called state-of-the-art model, i.e. the best one on a given dataset for a given problem. The goal of benchmarking is then to see if we can create a better model and beat the published results.
However, in this article I want to talk about a simple benchmark which you create at the very beginning of your project. Its goal is to track our progress and see how we compare to our past selves.
Properties of a benchmark
In my opinion, there are several properties that a useful benchmark must have:
- It must be easy to run.
- It must have a simple structure/architecture.
- It must be interpretable (i.e. we must know why it gives particular results).
- If it requires training, it must be relatively fast.
- It must be well suited to solve our problem and give relatively good performance.
Creating such benchmarks is easier than it seems. Let’s dive into a couple of examples!
Creating naive models with sklearn
Python’s scikit-learn is an extremely useful library if you work in Machine Learning. Not only does it provide multiple algorithms for both classification and regression, but also ready-made metrics, tools for data preprocessing, and the Pipeline module for, e.g., chaining operations together.
However, not everyone knows that it also provides two simple classes for benchmarking: DummyRegressor and DummyClassifier.
DummyRegressor is a model which returns a fixed value for a given regression problem. This can be the mean, the median, or a quantile computed on the training dataset, or some arbitrarily selected constant.
Similarly, DummyClassifier gives a naive prediction, for example the most frequent class in the training dataset. Other options are a random class drawn according to class frequencies, a random class drawn from a uniform distribution over classes, or a fixed class chosen by the developer.
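As a minimal sketch, both dummy estimators can be used like this (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Toy training data; dummy estimators ignore the features entirely
X = np.arange(10).reshape(-1, 1)
y_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y_clf = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Always predicts the training-set mean
reg = DummyRegressor(strategy="mean").fit(X, y_reg)
print(reg.predict([[100]]))  # [5.5]

# Always predicts the most frequent class
clf = DummyClassifier(strategy="most_frequent").fit(X, y_clf)
print(clf.predict([[100]]))  # [0]
```

Whatever input you pass at prediction time, the output stays the same — which is exactly what makes these models such a clean baseline.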
Despite their simplicity, both models can be extremely useful to know where we stand with our problem.
A good rule of thumb when creating your benchmark for supervised learning problems on tabular data is using regression (linear for a continuous target, logistic for classification).
Many times you will observe that such a simple statistical method achieves relatively good performance. What's more, it is sometimes hard to beat with more complex models like neural networks or XGBoost (tip: if you encounter a situation like this, check out the Data-Centric AI approach).
What to remember? If you want to interpret regression coefficients as feature importances, you need to normalise your data first!
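A minimal sketch of such a benchmark, using scikit-learn's bundled breast cancer dataset (the dataset choice is just an assumption for the example) and a scaler so the coefficients are comparable across features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale features first, so the learned coefficients are on a common scale
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# On standardised inputs, |coefficient| is a rough feature-importance proxy
coefs = model.named_steps["logisticregression"].coef_[0]
feature_names = load_breast_cancer().feature_names
top3 = sorted(zip(abs(coefs), feature_names), reverse=True)[:3]
print(top3)
```

Without the StandardScaler step, a feature measured in large units would get a tiny coefficient regardless of how informative it is, making the comparison meaningless.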
Using intuitive heuristics
If your models are meant to be used by humans, the goal is often to automate or improve a task they perform regularly. Examples?
- Assessing a client's credit rating.
- Predicting the optimal dosage of a drug taken by a patient.
- Recommending treatment based on the symptoms of a disease.
- Forecasting the number of sales for a given product.
- Recommending a product to buy.
- Predicting train delays.
In all of those situations, we have an expert who, based on their knowledge and experience, makes a decision. Those decisions can be supported or replaced by machine learning algorithms.
Building a simple heuristic requires understanding how the decision is currently made by an expert. Most of the time, the full process is too complex to replicate, but we can at least identify which aspects are important to the expert. In my experience, there are usually a dozen or so of the most important ones.
Product recommendation is a good example. When a seller offers additional products to a client, they often propose something that sells best in the shop, or something they have already sold to clients similar to the current one.
Understanding the human decision process can make it much easier for us when creating benchmarks, because we can simply translate those rules and processes to code. Even though it seems simple, it can be a powerful model in terms of performance.
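Sticking with the product-recommendation example, the "propose the best seller" rule translates into a few lines of code (the sales log below is invented for illustration):

```python
from collections import Counter

# Hypothetical sales log: (client_id, product_id) pairs
sales = [
    ("alice", "laptop"), ("bob", "mouse"), ("carol", "mouse"),
    ("dave", "keyboard"), ("erin", "mouse"),
]

def recommend_bestseller(sales_log, already_owned=()):
    """Heuristic benchmark: recommend the best-selling product
    that the client does not own yet."""
    counts = Counter(product for _, product in sales_log)
    for product, _ in counts.most_common():
        if product not in already_owned:
            return product
    return None

print(recommend_bestseller(sales))                           # 'mouse'
print(recommend_bestseller(sales, already_owned={"mouse"}))
```

Any learned recommender we build later has to beat this rule to justify its extra complexity.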
Using subset of features
If you have to merge data from multiple databases when creating your model, you can start by using a single table and a small subset of features. Remember that, when deployed, our model needs to collect features in real time, so all those merges can drastically increase inference time. Choosing only a small subset of features also makes the model much smaller, faster, and easier to interpret.
Benchmarking for NLP problems
The methods I have shown so far can be easily used in most standard ML problems. In this section, let's cover the specific case of natural language processing.
Obviously, if your NLP task is in fact a simple regression or classification, feel free to use the previous methods. However, NLP also brings different or more complex problems, such as:
- Sentiment analysis.
- Text summarization.
- Classification of document type.
Each of those problems may require a completely different, complex heuristic. But let me show you an example benchmark model for each of them:
- For sentiment analysis, we can create a model that looks for keywords. For example, if we build a simple dictionary of positive and negative adjectives, we can find them in our text and, based on the ones that occur, estimate whether the opinion is positive or negative.
- For summarization, we can simply return the first few sentences of the input text.
- For document classification, a good benchmark is a word-frequency method like TF-IDF. It's particularly easy to use in sklearn.
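For the document-classification case, a TF-IDF benchmark can be sketched like this (the four-document corpus and its labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: invoices vs. complaints
texts = [
    "invoice payment due total amount",
    "please find attached invoice for payment",
    "I am very unhappy with the service",
    "terrible experience, I want a refund",
]
labels = ["invoice", "invoice", "complaint", "complaint"]

# TF-IDF features fed into a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["attached the invoice with the amount due"]))
```

This whole benchmark is two pipeline steps, trains in milliseconds, and its mistakes are easy to explain by looking at the vocabulary — exactly the properties we listed at the start.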
But why do I even need a benchmark?
When reading the previous sections, you may have had some concerns regarding benchmarking. Why waste time instead of using the most advanced models straight away?
In my experience, this is not a good approach.
First of all, when we approach a Machine Learning project and set some goal or KPI, we must know if this goal is even achievable. Without building a benchmark, it’s hard to estimate how much time we need to finish the project.
Secondly, in many real-life scenarios, achieving the best possible performance is not really the project's goal. It may sound bizarre, but when deploying a model, performance is not the only factor we use to assess its quality. There are many more, for example training time, inference time, feature-extraction time, or interpretability. From a business perspective, it's not important how we achieve a business goal (e.g. with what model), but IF we achieve it.
Another reason is that benchmarks show us what we can achieve with simplicity. I have seen multiple projects in which someone started with neural networks and ended up surprised that linear regression works just as well, while being multiple times smaller and faster. You will come across as more of an expert if you always check this yourself.
Last but not least, benchmarks are a good way to test deployment environments and pipelines. You can verify that the system works correctly end-to-end without having to serve the most advanced model.
In this article, I showed you my approach to building benchmarks. What is your experience? Do you use something else? Or maybe you disagree with my approach? I would love to hear your opinion on LinkedIn!