Overview

  • It’s important to understand the distinction between evaluation metrics and loss functions in machine learning and deep learning.
  • Loss functions are used to optimize a model during training. They are functions defined on a data point, prediction, and label, and they measure the “distance” between the prediction and the label. The goal of a machine learning model is to minimize this difference, or “loss.” For instance, popular loss functions include Mean Squared Error for regression tasks, and Cross-Entropy Loss for classification tasks.
  • On the other hand, evaluation metrics are used to measure the performance of a model on the validation or test set. These metrics help to quantify the performance of a model in terms that make sense to us and align with our goals. For instance, accuracy is a commonly used evaluation metric that tells us the proportion of correct predictions made by our model.
  • The key difference lies in their usage. Loss functions are used to train the model and are differentiable so that the gradients can guide the optimization process. Evaluation metrics, however, are used to understand the model’s performance in real-world, human-understandable terms.
  • Often, there’s a disconnect between the two. The optimization objective (loss function) of a model may not always align with what we ultimately care about (evaluation metric). For instance, you might train a binary classification model with Cross-Entropy Loss, but evaluate it based on its AUC-ROC score, which considers the rank ordering of the predicted probabilities.
  • If that does happen, there are a few strategies to help resolve this disconnect:
    • Surrogate Loss Functions: Use a differentiable surrogate loss function which is a good proxy for the non-differentiable evaluation metric. For example, instead of optimizing for accuracy directly, we might optimize the cross-entropy loss as a proxy, which works well in many cases.
    • Direct Optimization: Some approaches aim to directly optimize the evaluation metric, even if it’s non-differentiable. One way to do this is by using reinforcement learning techniques, where the non-differentiable metric is used as a reward signal. Another approach is to use the method of sub-gradients for non-differentiable functions.
    • Early Stopping: Monitor the evaluation metric on a validation set during training, and stop training when the evaluation metric stops improving, even if the loss function could still be decreased on the training set.
    • Post-processing: Train with a standard differentiable loss function, but then adjust the decision threshold or otherwise post-process the predictions to optimize the evaluation metric (see the sketch after this list).
    • Custom Loss Function: In some cases, it might be possible to design a new differentiable loss function which more closely aligns with the evaluation metric of interest.
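  • As a concrete illustration of the post-processing strategy, here is a minimal sketch that sweeps a decision threshold over held-out predictions and keeps the one that maximizes F1. The function name tune_threshold and the dummy arrays are made up for this example, and it assumes NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(val_probs, val_labels, thresholds=np.linspace(0.05, 0.95, 19)):
    """Pick the decision threshold that maximizes F1 on a validation set."""
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        # Convert probabilities to hard 0/1 predictions at this threshold.
        preds = (val_probs >= t).astype(int)
        score = f1_score(val_labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical usage with made-up validation probabilities and labels.
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.65])
labels = np.array([0, 0, 1, 1, 1])
threshold, best_f1 = tune_threshold(probs, labels)
print(threshold, best_f1)
```
  • The model itself is still trained with an ordinary differentiable loss; only the final decision rule is adjusted to favor the metric we actually report.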
  • With that foundation in place, let’s look at a few common metrics.

Accuracy

  • This is one of the most common evaluation metrics in ML, and it is often used in binary and multiclass classification problems. It calculates the proportion of correct predictions out of the total number of predictions.
\[Accuracy = (True Positives + True Negatives) / (Total Number of Predictions)\]
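
  • As a quick sketch, accuracy can be computed directly with NumPy as the fraction of matching entries; the labels and predictions below are made up for illustration.

```python
import numpy as np

# Dummy true labels and model predictions for illustration.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

accuracy = np.mean(y_true == y_pred)  # proportion of correct predictions
print(accuracy)  # 5 correct out of 6 -> 0.8333...
```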

Precision

  • Precision is often used in ML and NLP for tasks like information retrieval, where it’s crucial that the documents we retrieve are actually relevant. Precision measures the proportion of relevant instances among the retrieved instances.
\[Precision = True Positives / (True Positives + False Positives)\]

Recall

  • Also known as sensitivity or true positive rate, recall measures the proportion of actual positives that were identified correctly.
\[Recall = True Positives / (True Positives + False Negatives)\]

F1 Score

  • The F1 score is the harmonic mean of precision and recall, providing a balance between both. It is particularly useful in cases where either precision or recall may individually present a biased view of the model’s performance, such as in imbalanced datasets.
\[F1 Score = 2 * ((Precision * Recall) / (Precision + Recall))\]
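
  • To see how the three preceding formulas relate in practice, here is a brief sketch using scikit-learn’s built-in scorers (assuming scikit-learn is installed); the binary labels and predictions are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Dummy binary labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(precision, recall, f1)
```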

Area Under Curve - Receiver Operating Characteristics (AUC-ROC)

  • This metric is used for binary classification problems. It measures the two-dimensional area underneath the entire ROC curve, which plots the true positive rate against the false positive rate at various threshold settings.
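  • A minimal sketch, assuming scikit-learn is available: roc_auc_score takes the true labels and the predicted positive-class probabilities (the values below are made up).

```python
from sklearn.metrics import roc_auc_score

# Dummy true labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```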

BLEU Score

  • BLEU (Bilingual Evaluation Understudy) is a score for comparing a candidate translation of text to one or more reference translations. It’s commonly used in machine translation tasks.
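  • A small sketch using NLTK’s sentence-level BLEU (assuming NLTK is installed); the reference and candidate sentences are made up, and smoothing is applied so that missing higher-order n-grams don’t force the score to zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up tokenized reference translation(s) and candidate translation.
reference = [["the", "cat", "sits", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)
```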

Word Error Rate (WER)

  • WER is commonly used in speech recognition and measures the difference between sequences of words. It is a type of edit distance that counts substitution, deletion, and insertion operations needed to change one sequence into another.
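  • Since WER is a word-level edit distance normalized by the reference length, it can be computed with a small dynamic program; the sentences below are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.33
```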

Perplexity

  • Perplexity is a measurement of how well a probability model predicts a sample. In NLP, it’s used to evaluate language models. A lower perplexity indicates better performance.
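  • A minimal sketch of the calculation: given the probabilities a language model assigned to each observed token (made-up numbers below), perplexity is the exponential of the average negative log-likelihood per token.

```python
import numpy as np

# Made-up per-token probabilities assigned by a language model to the observed tokens.
token_probs = np.array([0.2, 0.5, 0.1, 0.4])

perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)  # lower is better
```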

Mean Squared Error (MSE)

  • Commonly used in regression tasks, MSE measures the average of the squared errors, that is, the average squared difference between the predicted values and the actual values.
\[MSE = 1/n * Σ(actual value - predicted value)^2\]

Mean Absolute Error (MAE)

  • This is another common regression metric, which averages the absolute differences between the target and predicted values.
\[MAE = 1/n * Σ|actual value - predicted value|\]
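
  • Both regression metrics reduce to a single line of NumPy each; the targets and predictions below are dummy values for illustration.

```python
import numpy as np

# Dummy regression targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)    # average squared error -> 0.375
mae = np.mean(np.abs(y_true - y_pred))   # average absolute error -> 0.5
print(mse, mae)
```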