Calibration

  • In one sense ("calibrated recommendations"), calibration measures how closely the mix of recommended items matches a particular user's interest profile.
  • More commonly, calibration refers to adjusting the predicted scores or probabilities produced by the model so that they align with actual user behavior, i.e., so that they accurately reflect the likelihood of a user's interest in a particular item.
  • Calibration is important in recommender systems because it improves the accuracy and reliability of the recommendations: well-calibrated predictions give users more reason to trust the system and give downstream components more accurate estimates of user preference.
  • Several techniques are commonly used for calibration in recommender systems:
     1. Platt scaling: fit a logistic regression on the model's predicted scores against the true labels; the learned transformation maps raw scores to calibrated probabilities.
     2. Isotonic regression: fit a non-decreasing, piecewise-constant function from predicted scores to observed outcomes to obtain calibrated probabilities.
     3. Bayesian calibration: incorporate prior knowledge about the distribution of user preferences to adjust the predicted scores; Bayesian methods can account for uncertainty in the model's predictions.
  • The choice of calibration technique depends on the requirements of the recommender system and the available data. Calibration quality is typically evaluated with calibration plots, reliability diagrams, or other metrics that measure the alignment between predicted probabilities and observed outcomes. A minimal code sketch follows below.
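A minimal sketch of the first two techniques and the evaluation step, using scikit-learn; the dataset, base model, and bin count are illustrative assumptions rather than anything prescribed above (exact API details may vary slightly across scikit-learn versions):

```python
# Wrap an uncalibrated base model with Platt scaling ("sigmoid") and isotonic regression,
# then compare reliability on held-out data. Toy dataset and model; names are illustrative.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "raw": GradientBoostingClassifier().fit(X_tr, y_tr),
    "platt": CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=5).fit(X_tr, y_tr),
    "isotonic": CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5).fit(X_tr, y_tr),
}

for name, model in models.items():
    p = model.predict_proba(X_te)[:, 1]
    obs, pred = calibration_curve(y_te, p, n_bins=10)   # points of a reliability diagram
    print(f"{name:9s} mean |observed - predicted| per bin: {np.abs(obs - pred).mean():.4f}")
```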

  • Overall, calibration plays a crucial role in ensuring that recommender systems provide accurate and trustworthy recommendations by aligning the predicted scores or probabilities with the true user preferences.

  • Calibration in recommender systems typically happens at the ranking stage, specifically before the final presentation of recommendations to the user. It is a step that occurs after candidate generation and retrieval.

  • The calibration process involves adjusting the raw scores or predictions generated by the model to improve their accuracy and alignment with user preferences. This adjustment is based on historical data and user feedback. The goal is to calibrate the scores in such a way that they accurately reflect the relative preferences of the user for different items.

  • Once the calibration is performed, the calibrated scores are used in the ranking stage to determine the order in which the recommended items are presented to the user. This ranking is typically based on a combination of factors, including the calibrated scores, user preferences, item relevance, diversity, and potentially other business-specific considerations.
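As a purely hypothetical illustration of that last point, a final ranking score might blend several calibrated probabilities with other signals through business-tuned weights; the field names and weights below are assumptions, not something specified above:

```python
# Hypothetical blend of calibrated probabilities and other signals into one ranking score.
def final_score(item, w_click=1.0, w_like=0.5, w_diversity=0.2):
    return (w_click * item["p_click_calibrated"]
            + w_like * item["p_like_calibrated"]
            + w_diversity * item["diversity_bonus"])

# ranked = sorted(candidates, key=final_score, reverse=True)   # candidates: list of dicts
```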

  • Existing methods for calibrating models: let's quickly go through several that are commonly used.
  • Platt scaling
  • Platt scaling was first introduced for Support Vector Machines (SVMs) but also applies to other classification models. It takes two parameters α and β and uses the original output of the model as a feature: pᵢ = σ(α·zᵢ + β), where zᵢ is the original output of the model and σ is the sigmoid function. The two parameters can be fit on the dataset. [7]
  • Platt scaling is effective for SVMs and boosted trees. It's less effective on models that are already reasonably well calibrated by construction, such as logistic regression or deep-learning classifiers with a sigmoid output layer (a code sketch of plain Platt scaling and the feature-augmented variant follows after this list).
  • Platt scaling with features
  • A common practice is to extend the Platt scaling method by adding one-hot encoding for additional categorical features such as browse, country, day of the week, etc., inside the sigmoid function. In this case, this method is essentially a logistic regression model on top of the existing model. It can be shown that the predictions for subgroups specified by the categorical features will be calibrated upon the convergence of the corresponding model parameters.
  • This logistic-regression-like method will be referred to as “Platt scaling with features” in the sections below.
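A minimal sketch of both variants, assuming a held-out calibration set; `raw_score`, `country`, and `day_of_week` are illustrative column names, and the scikit-learn pipeline is just one way to express the logistic-regression-on-top idea:

```python
# Plain Platt scaling: fit alpha, beta so that p = sigmoid(alpha * z + beta),
# where z is the raw model output. A logistic regression on z alone does exactly this.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def fit_platt(z, y):
    """z: raw scores on held-out calibration data, y: binary labels. Returns (alpha, beta)."""
    lr = LogisticRegression(C=1e6)                 # large C ~= effectively unregularized
    lr.fit(np.asarray(z).reshape(-1, 1), y)
    return lr.coef_[0, 0], lr.intercept_[0]

def platt_transform(z, alpha, beta):
    return 1.0 / (1.0 + np.exp(-(alpha * np.asarray(z) + beta)))

# "Platt scaling with features": the same logistic regression, but it also sees one-hot
# encoded categoricals, so subgroups (e.g., per country) end up calibrated as well.
platt_with_features = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["country", "day_of_week"])],
        remainder="passthrough")),                 # the raw score column passes through
    ("lr", LogisticRegression(max_iter=1000)),
])
# platt_with_features.fit(df_cal[["raw_score", "country", "day_of_week"]], y_cal)
# p = platt_with_features.predict_proba(df_new[["raw_score", "country", "day_of_week"]])[:, 1]
```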

Post by Wenzhe:

  • Probability calibration for recommender systems is an important topic. When implementing a ranking model, our choice of sampling strategy and loss function doesn't necessarily produce calibrated probabilities, or even probabilities at all.
  • We might balance the dataset by downsampling negatives, use a ranking loss, or perform hard negative mining, all of which distort the output of the model. However, downstream tasks, including rank scoring and the ads marketplace, often assume calibrated probabilities.
  • There are two common methods for probability calibration: Platt scaling and isotonic regression (https://lnkd.in/dJ469b-z). Platt scaling is most effective when the distortion in the predicted probabilities is already sigmoid-shaped.
  • That is quite a strong assumption in most cases. Isotonic regression, on the other hand, can correct any monotonic, nonlinear distortion. Its cost relative to Platt scaling is twofold. First, it is easier to overfit, although that is rarely a problem in recsys, where we have hundreds of millions if not billions of samples.
  • More importantly, it is not easy to update in real time or continuously because it is non-parametric. Real-time updating is a key aspect of most modern recsys systems, which means the calibration needs to be updated in real time as well. What's your favourite method for calibrating recsys models?
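A minimal isotonic-regression calibrator sketch with scikit-learn; the scores and labels are synthetic stand-ins, and the note on refitting reflects the non-parametric limitation raised in the post:

```python
# Isotonic regression as a calibrator: a non-decreasing step function from raw scores
# to probabilities. Synthetic data; variable names are illustrative.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw_scores_cal = rng.normal(size=50_000)                             # stand-in for model outputs
labels_cal = rng.binomial(1, 1.0 / (1.0 + np.exp(-raw_scores_cal)))  # stand-in for clicks

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores_cal, labels_cal)
print(np.round(iso.predict(raw_scores_cal[:5]), 3))

# Because the fitted step function is non-parametric, it cannot be updated incrementally;
# in a streaming setup it is typically refit periodically on a recent window of data.
```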

Weights and Biases

  • Link to the post for the content below.
  • Calibration ensures that predictions lie on the normalized interval. It isn't necessary for purely rank-based recommendations; it's valuable when the output probabilities serve as confidence estimates in addition to the ranking.
  • When building classifiers, whether binary or multiclass, the learner generates a probability for each class to indicate the likelihood of an input belonging to that class. In cases where the main objective is to accurately predict the correct class most of the time, such as minimizing cross-entropy, the relative magnitudes of the probabilities per class are the most important factor. In other words, the class with the highest probability is selected as the predicted class, and the exact values of the probabilities are not of great significance.
  • However, there are situations where you may require more information from your model than just the class recommendation. In certain applications, it is important for the probability estimates to reflect the actual confidence or certainty of the model in its predictions. For instance, if you are interested in risk estimates resulting from misclassifications, you would like the probabilities to be reliable indicators of the model’s confidence in its accuracy for each specific label.
  • A calibrated model is one that establishes a direct relationship between the predicted probabilities and the actual likelihood of the model’s accuracy for a given label. In other words, a calibrated model’s probability estimates can be interpreted as accurate confidence estimates. If a calibrated model predicts a probability of 0.8 for a certain class, it means that, on average, it is correct about 80% of the time when assigning that class to similar inputs.
  • Having a calibrated model can be advantageous in applications where accurate confidence estimates are crucial. It allows for a better understanding of the model’s reliability and aids in decision-making processes that involve risk assessment or uncertainty management.
  • To summarize, while the relative magnitudes of probabilities are typically sufficient for selecting the correct class in most cases, there are scenarios where calibrated models are desired to ensure that probability estimates accurately reflect the model’s confidence or likelihood of accuracy in assigning a specific label.
  • In the context of Sebastian Raschka's tweet and the subsequent discussion, the focus was on the effect of calibration on the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) metric. It was pointed out that calibration adjustments do not actually impact the AUC, because the AUC is invariant to the magnitudes of the probabilities and depends only on the ordering of the predictions. Calibrating the probabilities therefore does not change the model's overall performance as measured by ROC-AUC (see the sketch after this list).
  • Some participants in the discussion mentioned that the observed performance increase was likely due to the introduction of cross-validation (CV) during model training rather than the specific calibration techniques. In this context, the calibration models were essentially acting like bagging, which is an ensemble technique that combines multiple models to improve performance.
  • Additionally, it was suggested that for those interested in risk-adjusted models that incorporate confidence estimates, the Kolmogorov-Smirnov Statistic could be a more suitable metric than ROC-AUC. The Kolmogorov-Smirnov Statistic measures the maximum difference between cumulative distribution functions and can provide insights into the distribution of predicted probabilities, which is relevant when assessing risk.
  • Moving on to the application of classifier calibration in recommender systems, Wenzhe Shi’s post highlighted the importance of confidence estimates in models used for ad recommendations and similar applications. While relative affinity scores may be sufficient for selecting the top-k recommendations, confidence estimates become crucial in such scenarios. However, the post also highlighted some challenges with applying isotonic regression, a calibration technique discussed earlier, in the context of continuous learning in modern recommender systems. Since these systems often involve streaming events and require real-time updates, a meta-learner that can adapt and update in real-time is necessary. Commenters suggested alternative approaches such as conditional calibration with respect to a covariate or frequent retraining of the calibration model to address these challenges and ensure accurate confidence estimates in dynamic recommender systems.
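Two of the points above can be checked numerically on toy data: ROC-AUC is unchanged by any strictly monotonic rescaling of the scores (so calibration alone cannot move it), and the Kolmogorov-Smirnov statistic between the positive- and negative-class score distributions is straightforward to compute; the data below is synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)
scores = rng.normal(size=10_000) + y                       # toy scores correlated with labels
rescaled = 1.0 / (1.0 + np.exp(-(2.0 * scores - 1.0)))     # a sigmoid (monotonic) transform

print(roc_auc_score(y, scores), roc_auc_score(y, rescaled))    # identical AUC values
print(ks_2samp(rescaled[y == 1], rescaled[y == 0]).statistic)  # KS statistic between classes
```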

How it works



User Interactions / Ratings
        |
        v
Collaborative Filtering for Embedding
        |
        v
[User Embeddings, Item Embeddings]
        |
        v
ANN for Efficient Retrieval
        |
        v
[Candidate Item List]
        |
        v
Multiple Models for First Ranking
        |
        v
[Multiple Ranked Lists]
        |
        v
MOO Optimization
        |
        v
[Optimized Ranked Lists]
        |
        v
Calibration
        |
        v
[Calibrated Ranked Lists]
        |
        v
Diversity Re-ranking
        |
        v
Final Recommendation List
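Read as code, the diagram places calibration between multi-objective optimization and diversity re-ranking; the sketch below is purely illustrative, with every callable a placeholder supplied by the caller rather than an API defined anywhere above:

```python
# Purely illustrative composition of the stages in the diagram; all arguments are
# caller-supplied placeholder callables, not real APIs.
def recommend(user_id, embed_user, ann_retrieve, ranking_models,
              moo_optimize, calibrate, diversity_rerank, top_k=20):
    user_emb = embed_user(user_id)                        # collaborative-filtering embedding
    candidates = ann_retrieve(user_emb)                   # ANN retrieval -> candidate items
    ranked = [rank(user_id, candidates) for rank in ranking_models]   # first ranking
    blended = moo_optimize(ranked)                        # MOO -> list of (item, raw_score)
    calibrated = [(item, calibrate(score)) for item, score in blended]
    return [item for item, _ in diversity_rerank(calibrated)][:top_k]
```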
  • To calibrate multiple expert ranking models, such as P(click) and P(like), you can follow these general steps (a numerical sketch of steps 3 to 7 appears at the end of this section):

     1. **Data Preparation**: Collect a labeled dataset with ground-truth labels (e.g., actual clicks and likes) for each model's ranking predictions. Make sure you have enough examples for reliable calibration.

     2. **Evaluation Metric**: Choose an appropriate metric to assess the calibration of your models. Common choices include reliability diagrams, expected calibration error (ECE), and the Brier score; these measure the agreement between predicted probabilities and observed frequencies.

     3. **Binning**: Divide the predicted probabilities (P(click) and P(like)) into bins. For example, create ten bins covering 0.0 to 1.0, each with a width of 0.1.

     4. **Compute Average Probability**: For each bin, average the predicted probabilities of the examples that fall into it.

     5. **Compute Observed Frequency**: For each bin, count the positive events (e.g., actual clicks and likes) among the labeled examples that fall into it and divide by the bin size.

     6. **Plot Calibration Curve**: Plot the average predicted probabilities (step 4) on the x-axis against the observed frequencies (step 5) on the y-axis. A well-calibrated model's curve lies close to the diagonal.

     7. **Assess Calibration Metrics**: Compute the chosen calibration metrics (ECE, Brier score, or reliability diagrams) from the values obtained in steps 4 and 5 to get a quantitative measure of calibration quality.

     8. **Iterative Calibration**: If the models are not well calibrated, apply techniques such as Platt scaling or isotonic regression to bring the predicted probabilities in line with the observed frequencies.

     9. **Evaluate Calibration**: After applying the adjustments, repeat steps 4 to 7. Iterate until the models exhibit satisfactory calibration.
    
  • Remember that calibration is essential to ensure that the predicted probabilities align well with the actual probabilities or frequencies of events. It helps to make reliable decisions based on the ranking predictions provided by the models.
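A numerical sketch of steps 3 to 7, assuming numpy arrays of predicted probabilities and binary outcomes for one model (e.g., P(click)); the bin count and helper name are illustrative:

```python
# Bin predicted probabilities, compare each bin's mean prediction with its observed
# positive rate, and aggregate into expected calibration error (ECE).
import numpy as np

def expected_calibration_error(p, y, n_bins=10):
    """p: predicted probabilities, y: binary outcomes (same length)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, bins) - 1, 0, n_bins - 1)   # bin index per example
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred = p[mask].mean()        # step 4: average predicted probability
            obs_freq = y[mask].mean()         # step 5: observed positive frequency
            ece += mask.mean() * abs(mean_pred - obs_freq)   # bin-weighted gap
    return ece

# toy check: probabilities that are calibrated by construction give ECE close to 0
rng = np.random.default_rng(0)
p = rng.uniform(size=100_000)
y = rng.binomial(1, p)
print(round(expected_calibration_error(p, y), 4))
```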