NLP • Metrics
Evaluating Natural Language Processing (NLP) Models: An In-Depth Look at BLEU, ROUGE, and F1 Score
 Natural Language Processing has revolutionized the way we interact with machines and interpret textual data. Models in NLP are designed to understand, interpret, and generate human-like text, making them integral to fields like text analysis, translation, summarization, and information extraction.
 Evaluating the performance of these models is crucial for their development and improvement. In this regard, several metrics have been proposed, such as BLEU, ROUGE, and the F1 score, each of which serves different purposes and has its own strengths and weaknesses.
 This article will explore three widely used evaluation metrics in NLP: BLEU, ROUGE, and F1 score.
BLEU (Bilingual Evaluation Understudy)
 BLEU, an acronym for Bilingual Evaluation Understudy, is predominantly used in machine translation. It quantifies the quality of machine-generated text by comparing it with a set of reference translations. The crux of the BLEU score calculation is the precision of n-grams (contiguous sequences of n items in a text) in the machine-translated output. However, to prevent the overestimation of quality from overly short outputs, BLEU includes a brevity penalty factor. Despite its widespread use, it is important to note that BLEU mainly focuses on precision and lacks a recall component.
 Mathematically, precision for unigrams (single words) is calculated as follows:
 Precision = (Number of words in the machine translation that also appear in the reference) / (Total words in the machine translation)
 BLEU extends this idea to the precision of n-grams. However, BLEU uses a modified ("clipped") precision calculation to avoid artificially inflated scores: each candidate n-gram is credited at most as many times as it occurs in the reference, so repeating a correct word does not raise precision.
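As a minimal sketch of this clipped precision for unigrams (assuming whitespace tokenization and a single reference; the function name is illustrative):

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision: each candidate word is credited at most
    as many times as it appears in the reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# The reference contains "the" twice, so only two of the four candidate
# occurrences are credited: 2 / 4 = 0.5
print(modified_unigram_precision("the the the the", "the cat sat on the mat"))  # -> 0.5
```

Without clipping, this degenerate candidate would score a perfect precision of 1.0, since every one of its words appears in the reference.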

The equation of the BLEU score for n-grams is:
 \[BLEU = BP \cdot \exp\left(\sum_{i=1}^{n} w_i \log p_i\right)\]
where BP is the brevity penalty (to penalize short sentences), w_i are the weights for each i-gram precision (usually equal weights), and p_i is the modified precision for i-grams.
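Putting the pieces together, here is a minimal sketch of a sentence-level BLEU score (assuming a single reference, whitespace tokenization, uniform weights, and no smoothing; production implementations such as NLTK's add smoothing and multi-reference support):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(ngrams(cand, n))
        ref_ngrams = Counter(ngrams(ref, n))
        # Clipped count: each candidate n-gram credited at most as often
        # as it appears in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; real implementations smooth instead
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Uniform weights w_i = 1 / max_n.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # identical -> 1.0
```

Note how a candidate that matches the reference perfectly but stops early still loses score through the brevity penalty, even though all of its n-gram precisions are 1.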
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
 ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, is used primarily for evaluating automatic summarization and, sometimes, machine translation. The key feature of ROUGE is its focus on recall, measuring how many of the reference n-grams are found in the system-generated summary. This makes it especially useful for tasks where coverage of key points is important. Among its variants, ROUGE-N computes the overlap of n-grams, ROUGE-L uses the longest common subsequence to account for sentence-level structure similarity, and ROUGE-S includes skip-bigram statistics.

ROUGE-N specifically refers to the overlap of n-grams between the system and reference summaries.
 \[ROUGE\text{-}N = (Number of n-grams in both system and reference summary) / (Total number of n-grams in reference summary)\]
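This recall-oriented ratio can be sketched as follows (assuming whitespace tokenization and a single reference; the function name is illustrative):

```python
from collections import Counter

def rouge_n(system, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams that appear
    in the system summary (with counts clipped on both sides)."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    sys_grams, ref_grams = grams(system), grams(reference)
    overlap = sum(min(c, sys_grams[g]) for g, c in ref_grams.items())
    return overlap / sum(ref_grams.values())

# The system recovers 3 of the 6 reference unigrams: 3 / 6 = 0.5
print(rouge_n("the cat sat", "the cat sat on the mat", n=1))  # -> 0.5
```

Because the denominator is the reference length, a short system summary that hits only a few key words is penalized, which is exactly the coverage behavior summarization evaluation wants.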
 ROUGE-L considers sentence-level structure similarity naturally, identifying the longest co-occurring in-sequence run of words (the longest common subsequence) automatically, without requiring consecutive matches.
 ROUGE-S includes skip-bigram plus unigram-based co-occurrence statistics. A skip-bigram is any pair of words appearing in their sentence order, regardless of the gap between them.
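The longest-common-subsequence idea behind ROUGE-L can be sketched with a standard dynamic-programming LCS (recall flavor only, single reference, whitespace tokenization; function names are illustrative):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with the classic O(len(a) * len(b)) dynamic program."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(system, reference):
    sys_toks, ref_toks = system.split(), reference.split()
    return lcs_length(sys_toks, ref_toks) / len(ref_toks)

# Every reference word appears in order in the system output, so recall is 1.0,
# even though the extra word "found" breaks the contiguous n-gram matches.
print(rouge_l_recall("the cat was found under the bed", "the cat was under the bed"))  # -> 1.0
```

The inserted word "found" would break the 4-gram match for ROUGE-N, but the subsequence view still gives full recall; the full ROUGE-L metric also computes an LCS-based precision (dividing by the system length) and combines the two.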
F1 Score
 The F1 score, commonly used in machine learning classification problems, is also applicable to various NLP tasks like Named Entity Recognition, POS-tagging, etc. The F1 score is the harmonic mean of precision and recall; it balances the two and prevents extreme cases where one is favored over the other. It ranges from 0 to 1, where 1 signifies perfect precision and recall.
 Precision is the number of true positive results divided by the total number of positive predictions (true positives plus false positives). Recall, on the other hand, is the number of true positive results divided by the number of all samples that should have been identified as positive (true positives plus false negatives).
 The mathematical representation is as follows:
 \[F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\]
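The computation reduces to three counts, as in this small sketch (the entity counts are hypothetical):

```python
def f1_score(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical NER run: the system predicts 8 entities, 6 of which are
# correct (tp=6, fp=2), and misses 2 gold entities (fn=2).
# precision = 6/8 = 0.75, recall = 6/8 = 0.75 -> F1 = 0.75
print(f1_score(tp=6, fp=2, fn=2))  # -> 0.75
```

Because F1 is a harmonic mean, it is dominated by the weaker of the two components: with precision 1.0 but recall 0.1, F1 is only about 0.18, not the 0.55 an arithmetic mean would give.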
 In NLP tasks such as Named Entity Recognition and POS-tagging, the F1 score is the standard summary metric because both missed items (hurting recall) and spurious predictions (hurting precision) matter.