When obtaining information from our clients, we often receive access to data consisting only of positive events, e.g. a list of items purchased by each user or clicked ads.
Many machine learning models need negative events as well as positive ones to estimate the probability of a positive event correctly. Negative events could be items that a user did not buy during a visit to the store (despite having the chance to buy them) or ads that the user saw but did not click on. In some projects, there are so many negative events that processing all of them is too time-consuming. In such situations, we use negative event sampling, i.e. we select a random subset of all potentially available negative events.
In this strategy of building a training set, you have to watch out for several traps:
- It is important to avoid selecting a negative event that is identical to a positive event.
- You have to draw from the full set of available negative events, but avoid adding contradictory data to the training set, e.g. the purchase of a product that was unavailable on a given day or a purchase from a brick-and-mortar store that was closed that day.
- When the goal is to distinguish good product recommendations from average ones, the randomly selected events in the training set should consist of good and average recommendations, not good and bad ones. We used this strategy in the RecSys 2016 competition: https://arxiv.org/pdf/1612.00959.pdf
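The first two traps can be illustrated with a minimal sketch of rejection sampling. It is not our production code: the function name `sample_negatives`, the `(user, item)` event representation, and the `is_feasible` callback (standing in for checks such as product availability or store opening hours) are all assumptions made for illustration.

```python
import random


def sample_negatives(positives, users, items, is_feasible,
                     n_per_positive=1, seed=0, max_tries_per_sample=1000):
    """Draw random (user, item) pairs to serve as negative events.

    Hypothetical helper: candidates are drawn uniformly from all
    user/item combinations and rejected if they collide with a
    positive event or describe an event that could not have happened.
    """
    rng = random.Random(seed)
    positives = set(positives)
    target = n_per_positive * len(positives)
    negatives = set()
    tries = 0
    # The try limit guards against an endless loop when fewer valid
    # negatives exist than were requested.
    while len(negatives) < target and tries < max_tries_per_sample * target:
        tries += 1
        candidate = (rng.choice(users), rng.choice(items))
        if candidate in positives:
            continue  # trap 1: identical to a positive event
        if not is_feasible(*candidate):
            continue  # trap 2: contradictory event, e.g. product unavailable
        negatives.add(candidate)
    return negatives


# Toy data: item "c" is treated as unavailable, so it must never be sampled.
positives = {("u1", "a"), ("u2", "b")}
negatives = sample_negatives(
    positives,
    users=["u1", "u2", "u3"],
    items=["a", "b", "c"],
    is_feasible=lambda user, item: item != "c",
)
```

Rejecting invalid candidates after the draw keeps the sampling uniform over the valid negative events, at the cost of some wasted draws when positives or infeasible events are common.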
If model predictions are used as accurate probability estimates, for example to calculate the expected revenue from an ad impression, they need to be recalibrated, because a model trained on a sample of the negative events overestimates the probability of a positive event. We do this exactly like the Facebook team in section 6.3 of the publication.
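As a rough illustration, the sketch below applies the usual correction for negative downsampling, q = p / (p + (1 - p) / w), which we assume is the one meant here. In it, p is the prediction of the model trained on sampled data and w is the fraction of negative events that were kept. The function name `recalibrate` is hypothetical.

```python
def recalibrate(p, w):
    """Map a prediction made in the downsampled space back to the
    original space.

    p: predicted probability from the model trained on sampled negatives
    w: negative sampling rate, i.e. the fraction of negatives kept (0 < w <= 1)
    """
    if not 0.0 < w <= 1.0:
        raise ValueError("w must be in the interval (0, 1]")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in the interval [0, 1]")
    return p / (p + (1.0 - p) / w)


# With only 10% of negatives kept, a raw prediction of 0.5
# corresponds to a much lower probability in the original data.
calibrated = recalibrate(0.5, 0.1)
```

With w = 1 (no sampling) the prediction is returned unchanged, and the smaller w is, the more strongly the prediction is pulled towards zero.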