When obtaining information from our clients, we often receive access to data consisting only of positive events, e.g., a list of items purchased by each user or clicked ads.
Many machine learning models need not only positive but also negative events to estimate the probability of a positive event correctly. These could be items a user did not buy during their visit to the store (despite having a chance to buy them) or ads that the user saw but did not click on. In some projects, there are so many negative events that processing all of them is too time-consuming. In such situations, we use negative event sampling, i.e., selecting a random subset of all potentially available negative events.
In this strategy of building a training set, you have to watch out for several traps:
• It is essential to avoid sampling a negative event that is identical to a positive event already in the data.
• You have to draw from the complete set of available negative events, but keep contradictory data out of the training set, e.g., an event involving a product that was unavailable on a given day or a brick-and-mortar store that was closed that day.
• When the goal is to distinguish good product recommendations from average ones, the training set should contrast good recommendations with average ones, not good with bad, so the randomly selected events should be average recommendations. We used this strategy in the RecSys 2016 competition: https://lnkd.in/dgUb-FzC
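The first two traps can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline we actually use: the function name `sample_negatives`, the `(user, item)` event format, and the `is_valid` predicate are all hypothetical, and it assumes the full candidate set fits in memory.

```python
import random

def sample_negatives(positives, candidates, is_valid, n, seed=0):
    """Draw up to n negative events uniformly at random.

    positives  -- set of (user, item) pairs observed as positive events
    candidates -- iterable of all (user, item) pairs that could have occurred
    is_valid   -- hypothetical predicate rejecting impossible events
                  (e.g. product unavailable, store closed)
    """
    rng = random.Random(seed)
    # Trap 1: never sample an event that is also a positive.
    # Trap 2: never sample an event that could not have happened.
    pool = [e for e in candidates if e not in positives and is_valid(e)]
    return rng.sample(pool, min(n, len(pool)))

# Toy data, made up for illustration only.
positives = {("u1", "apple"), ("u2", "pear")}
users = ["u1", "u2"]
items = ["apple", "pear", "plum"]
unavailable = {"plum"}  # assumption: out of stock that day

candidates = [(u, i) for u in users for i in items]
negatives = sample_negatives(
    positives, candidates, lambda e: e[1] not in unavailable, n=10
)
```

With this toy data only two candidates survive both filters, so the sample contains exactly those two events regardless of the seed.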
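If model predictions are used as accurate probability estimates, for example, to calculate expected revenue from an ad impression, they need to be recalibrated: sampling only a subset of negative events raises the share of positive events in the training set, so the raw predictions overestimate the true probability.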
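One common way to recalibrate is to rescale the predicted odds by the sampling rate. The sketch below assumes that all positive events were kept, that negatives were sampled uniformly with a known rate `keep_rate`, and that the model is otherwise well calibrated on the sampled data; the function and parameter names are illustrative.

```python
def recalibrate(p_sampled, keep_rate):
    """Map a prediction made on negatively sampled data back to the
    original event distribution.

    p_sampled -- probability predicted by the model trained on sampled data
    keep_rate -- fraction of negative events kept (0 < keep_rate <= 1),
                 assuming uniform sampling and all positives retained
    """
    # Sampling divides the number of negatives by 1 / keep_rate, which
    # multiplies the odds of a positive by 1 / keep_rate. Undo that here.
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)

# A prediction of 0.5 from a model trained on 10% of the negatives
# corresponds to a probability of about 0.09 on the full data.
corrected = recalibrate(0.5, 0.1)
```

If the sampling was not uniform, or the model is poorly calibrated to begin with, fitting a separate calibration model on a held-out, unsampled data set is a safer option.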