• The Transformer architecture heralded a new era in natural language processing (NLP) with its unique attention mechanism. Not only did this model significantly accelerate training times, but its design also made it uniquely amenable to parallelization, unlocking unprecedented efficiency.


  • The images below display the Transformer architecture.

  • Emerging in 2017, the Transformer architecture carved a new niche within the deep learning community. Despite the reigning popularity of recurrent neural networks (RNNs) for sequential data tasks like translation and summarization, Transformers introduced an alternative: processing sequences without adhering to their order. By liberating itself from the linear processing constraint of RNNs, Transformers opened the door to substantial parallelization and swift training periods.
  • This groundbreaking model was first showcased in “Attention is All You Need” by Vaswani et al., and its foundations have been leveraged in successive models like BERT, GPT-2, and T5.
  • In the field of deep learning, particularly in sequence-to-sequence models used for tasks such as translation, text summarization, and speech recognition, the concepts of “encoder” and “decoder” play pivotal roles.


  • Within the realm of deep learning, especially in sequence-to-sequence paradigms for tasks like translation or summarization, encoders serve as the backbone.
  • Composition: At its core, the Transformer’s encoder processes sequences in their entirety. This encoder comprises a series of N=6 identical layers, each containing two principal sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
  • Self-Attention Mechanism: Here, each word in the sequence is scrutinized in relation to all others, resulting in scores indicating the importance or “attention” each word should receive. This design choice ensures that significant words receive due emphasis during processing.
  • Purpose: The encoder’s role transcends mere data input. In sequence-to-sequence applications, the encoder processes the input (like an English sentence) and produces a contextual vector representation for each input token. In translation tasks, these representations of an English sentence serve as the foundation for the decoder to produce a French equivalent.


  • Contrasting the encoder, which processes input, the decoder’s mandate is output generation.
  • Composition: Mirroring the encoder’s design, the Transformer’s decoder processes entire sequences, relying on N=6 identical layers. However, it introduces a third sub-layer that facilitates multi-head attention over the encoder’s output.
  • Masked Self-Attention: The decoder’s self-attention mechanism integrates a masking feature. By only attending to preceding positions in the output sequence, it ensures predictions at position i are solely based on outputs from positions before i.
  • Role: The decoder doesn’t generate outputs en masse but produces them sequentially. In translation scenarios, the decoder attends to the encoder’s output representations to iteratively construct the translated sentence.

Transformer: Encoder vs. Decoder

  • A Transformer model typically consists of an encoder stack and a decoder stack. In the original paper, the encoder processes the input sequence (e.g., a sentence in a translation task), and the decoder generates the output sequence (e.g., the translated sentence). However, variations like BERT use only the encoder stack, while GPT models are built on the decoder stack.

1. Encoder

a. Input Embedding:

  • Each word/token in the input sequence is first converted into a dense vector using token embeddings.
  • Positional Encoding: Since the Transformer doesn’t have a built-in notion of sequence order, a positional encoding is added to each embedding to provide positional information.

b. Self Attention Mechanism:

  • Queries (Q), Keys (K), and Values (V) are derived from the input embeddings.
  • Attention scores are computed by taking the dot product of Q and K, followed by scaling and applying a softmax.
  • The scores determine how much focus each word in the sequence should have on every other word.
  • The attention outputs are computed by multiplying the attention scores with V.

c. Feed-Forward Neural Network:

  • Each attention output is passed through a feed-forward neural network (identical for each position).
  • Each sub-layer (self-attention and feed-forward) is wrapped in a residual connection followed by layer normalization.
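The feed-forward sub-layer described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes a ReLU between the two linear maps and the post-norm arrangement LayerNorm(x + Sublayer(x)), and it omits layer normalization's learned gain and bias. The names `layer_norm`, `feed_forward`, and `ffn_sublayer` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise FFN: the same two linear maps (ReLU in between) at every position.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

def ffn_sublayer(x, w1, b1, w2, b2):
    # Residual connection around the FFN, followed by layer normalization.
    return layer_norm(x + feed_forward(x, w1, b1, w2, b2))

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32  # illustrative sizes, not the paper's
x = rng.normal(size=(n, d_model))
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
out = ffn_sublayer(x, w1, b1, w2, b2)
```

Because every operation acts on one row at a time, applying the sub-layer to a single position gives the same result as applying it to the whole sequence and then selecting that position.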

d. Stacking:

  • Multiple such encoder layers (6 in the original Transformer, 12 in BERT-base) are stacked to form the encoder part.

2. Decoder

a. Input Embedding:

  • Just like the encoder, the decoder also starts with embedding the input tokens (which are the output tokens so far) and adding positional encodings.

b. Self Attention Mechanism:

  • The mechanism is similar to the encoder’s self-attention, but with one major difference: to maintain the auto-regressive property, positions in the decoder can only attend to earlier positions in the output sequence.
  • This is achieved using a masked version of the attention to prevent future tokens from being used.
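One common way to realize this masking is to set disallowed scores to negative infinity before the softmax. The sketch below assumes the decoder input is the output sequence shifted right by one position, so letting input position i see input positions 0..i only exposes outputs produced before i. The helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True where attention is allowed: row i may see columns 0..i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores):
    n = scores.shape[-1]
    # Future positions get -inf, which becomes weight 0 after the softmax.
    masked = np.where(causal_mask(n), scores, -np.inf)
    return softmax(masked, axis=-1)

# With all-zero scores, each row spreads its weight evenly over the allowed positions.
scores = np.zeros((4, 4))
weights = masked_attention_weights(scores)
```

In this example the first row puts all of its weight on position 0, while the last row assigns 0.25 to each of the four positions.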

c. Encoder-Decoder Attention Layer:

  • After the masked self-attention layer, the decoder has another attention mechanism.
  • This attention layer helps the decoder focus on relevant parts of the input sentence, similar to how attention works in seq2seq models with LSTMs.
  • Here, the Q comes from the decoder’s previous layer, and K and V come from the encoder’s output.
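As a rough sketch of this layer, the function below takes queries from the decoder states and keys and values from the encoder output. The projection matrices `w_q`, `w_k`, `w_v` and all sizes are illustrative assumptions, and only a single head is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    q = decoder_states @ w_q    # (m, d_k): one query per decoder position
    k = encoder_outputs @ w_k   # (n, d_k): one key per input token
    v = encoder_outputs @ w_v   # (n, d_v): one value per input token
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (m, n)
    weights = softmax(scores, axis=-1)       # each decoder row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
m, n, d_model, d_k = 3, 5, 8, 4  # 3 decoder positions, 5 encoder positions
decoder_states = rng.normal(size=(m, d_model))
encoder_outputs = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
context, weights = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
```

The weight matrix has one row per decoder position and one column per input token, which is what lets each output step focus on different parts of the input.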

d. Feed-Forward Neural Network:

  • Just like in the encoder, the output of the encoder-decoder attention goes through a feed-forward network.

e. Stacking:

  • Again, multiple decoder layers (usually the same number as encoder layers) are stacked to form the complete decoder.

3. Key Differences

  • Encoder-Decoder Attention: This mechanism is exclusive to the decoder, enabling it to refer back to the input sequence, which is particularly useful for sequence-to-sequence tasks like machine translation.

  • Masked Attention: Only present in the decoder to preserve the autoregressive property, ensuring that the prediction for a particular word doesn’t depend on future words.
  • Usage Variations:
    • BERT: Uses only the encoder stack, pre-trained to predict masked-out tokens in a sequence to achieve context-rich understanding.
    • GPT: Relies solely on the decoder stack, adopting an auto-regressive training paradigm predicting subsequent sequence words.

4. Variations & Standalone Uses

  • BERT: Uses only the encoder stack. It’s pre-trained on tasks that predict masked-out words in a sequence, making it powerful for tasks requiring understanding of context.

  • GPT: Uses only the decoder stack. It’s trained auto-regressively where the model predicts the next word in a sequence.

  • The Transformer architecture, with its encoder and decoder components, has revolutionized how we handle sequence data. The self-attention mechanism within these components allows models to consider different words in a sequence with varying degrees of attention, providing a rich, context-aware representation. Understanding the intricacies of the encoder and decoder sheds light on the versatility and capability of Transformer-based models in diverse NLP tasks.

The Attention Mechanism

  • One of the key innovations in the Transformer model is the self-attention mechanism. In self-attention, the attention scores are computed as a function of the input sequence itself. In other words, the attention scores are based on the other words in the sequence, allowing the model to focus on words that are relevant to the current word being processed.
  • The Transformer uses a variant of self-attention called “scaled dot-product attention.” In this variant, the attention score between two words is calculated as the dot product of one word’s query vector and the other word’s key vector, divided by the square root of the key dimension; a softmax is then applied so that the attention weights for each word sum to one.
  • The attention mechanism has revolutionized the field of natural language processing and is a pivotal part of many state-of-the-art models, including the Transformer, BERT, and GPT. It was designed to resolve a specific issue in sequence-to-sequence models, namely the difficulty in handling long sequences.

The Problem with Long Sequences

  • Prior to the development of attention mechanisms, sequence-to-sequence models, typically built using Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), or Gated Recurrent Units (GRU), faced a challenge when processing long sequences. In these models, the encoder processed the input sequence to produce a single fixed-length context vector, which was then used by the decoder to generate an output sequence. However, compressing the entire input sequence into a single vector led to loss of information, particularly in the case of long sequences, resulting in reduced model performance.

The Attention Mechanism: A Solution

  • The attention mechanism was introduced as a solution to this problem. Instead of encoding the entire input sequence into a single context vector, the attention mechanism allows the decoder to focus on different parts of the input sequence at each step of the output generation, thereby maintaining the context and addressing the information loss.

How Attention Works

  • Suppose we have an input sequence (e.g., a sentence in English) and we want to generate an output sequence (e.g., the sentence translated into French). Here’s how the attention mechanism would work:
  1. Encoding with Attention: When the encoder processes the input sequence, it produces a set of vectors, each representing a word or a part of the input sequence. These are known as annotation vectors.
  2. Calculating Attention Weights: For each step of the output sequence, the attention mechanism calculates the similarity between the decoder’s current state and each of the encoder’s annotation vectors. These similarity scores, once normalized, become the attention weights. The normalization ensures that the weights sum up to one, and is typically done using the softmax function.
  3. Context Vector Calculation: The attention mechanism then computes a weighted sum of the annotation vectors, with the weights given by the attention weights. This weighted sum, called the context vector, captures the parts of the input sequence that are relevant to the current output step.
  4. Decoding with Attention: The decoder uses the context vector along with its current state to generate the next word in the output sequence.
  • A quick note on attention during training vs inference: during training, you can process all tokens in parallel since the entire sequence is available. This additionally means that the attention computation can be done in batches of multiple tokens or even the entire sequence at once.
    • However, during inference for text generation, operations and attention computation must be done sequentially because the next token depends on the previous ones. Thus, you can calculate the attention scores incrementally and cache previous results for future tokens.
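The training-versus-inference point above can be illustrated with a small key/value cache. This is a sketch under simplifying assumptions (a single head, with queries, keys, and values already projected); `cached_attention_step` and the dictionary-based cache are hypothetical, not a particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    # Training-style: all positions at once, with a causal mask.
    n, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    allowed = np.tril(np.ones((n, n), dtype=bool))
    return softmax(np.where(allowed, scores, -np.inf)) @ v

def cached_attention_step(q_t, k_t, v_t, cache):
    # Inference-style: append this step's key/value, then attend over the cache.
    cache["k"].append(k_t)
    cache["v"].append(v_t)
    k = np.stack(cache["k"])  # (t, d_k)
    v = np.stack(cache["v"])  # (t, d_v)
    weights = softmax(k @ q_t / np.sqrt(q_t.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
n, d_k = 5, 4
q, k, v = (rng.normal(size=(n, d_k)) for _ in range(3))

full = causal_attention(q, k, v)
cache = {"k": [], "v": []}
stepwise = np.stack([cached_attention_step(q[t], k[t], v[t], cache) for t in range(n)])
```

Both paths produce the same outputs. The cached version avoids recomputing keys and values for earlier tokens at each generation step.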

Benefits of Attention

  • The attention mechanism offers several advantages. First, by allowing the model to focus on different parts of the input for each output step, it significantly reduces the loss of information. Second, it allows for better handling of long sequences, leading to improved model performance on tasks like machine translation and text summarization. Finally, the attention weights provide interpretability, giving insights into which parts of the input sequence the model considers important when generating each part of the output sequence.
  • In summary, the attention mechanism has become an integral part of modern sequence-to-sequence models, contributing to significant advances in the field of natural language processing.

Deeper look at Self-Attention Mechanism

  • The Transformer architecture’s pioneering feature is its self-attention mechanism. Distinct from traditional attention mechanisms, self-attention determines attention scores directly from the input sequence. This implies that for every element in the sequence, the mechanism computes its relevance in relation to every other element, facilitating focus on pertinent information during processing.

  • The image below shows the scaled dot-product and multi-head attention from the ‘Attention is All You Need’ paper.

Multi-Head Attention

  • The Transformer architecture uses Q (queries), K (keys), and V (values) matrices for its attention mechanism.
  • As Eugene Yan mentioned in his blog, you can think of these matrices through the analogy of visiting a library:
  • Keys:
    • Library Analogy:
      • In a library, the titles on the spines of books (keys) indicate the content within. These titles help you assess how relevant each book is to your specific question.
    • Attention Mechanism:
      • In neural networks, the keys represent the words in the input sentence. Each word has its key vector (e.g., k1 for the first word, k2 for the second). These key vectors help the model understand how each word relates to the word currently being focused on.
  • Query:
    • Library Analogy:
      • Your specific question in the library serves as the query. You use this query to evaluate and compare the relevance of the book titles (keys) to find the information you need.
    • Attention Mechanism:
      • The query in the attention mechanism refers to the word currently under focus. In an encoder, the query vector points to the current input word, such as q1 for the first word in a sentence. The model uses this query vector to assess the relevance of each key.
  • Value:
    • Library Analogy:
      • Once you identify relevant books based on their titles (keys), you extract the information (value) from these books to answer your query.
    • Attention Mechanism:
      • Each word in the input sentence is also represented by a value vector containing that word’s information. The attention scores, derived from the relevance assessment between the query and keys and normalized to sum to 1 through a softmax function, are used to weigh these value vectors. This results in each focal word being represented by a weighted combination of all words in the sentence, with more relevant words having a higher weight in the final representation.
      • For any given sequence of \(n\) tokens, the model first extracts a text embeddings matrix \(X\) of size \((n, d)\). This matrix is then enhanced with Positional Sinusoidal Embedding.
      • The aim of the Multi-Head Attention layer is to recompute the token embeddings, emphasizing both the relative importance of tokens and their positions.
      • The embedding matrix \(X\) is processed in parallel across \(h\) attention heads. For each head, \(X\) is linearly projected to obtain \(Q\), \(K\), and \(V\) matrices. These projections lead to matrices of sizes \((n, k)\), \((n, k)\), and \((n, v)\) respectively.
  • Why do we need multi-headed attention?
    • Multi-headed attention in Transformer models is a crucial component that enhances the model’s ability to focus on different positions and capture various aspects of the input sequence. Here’s why multi-headed attention is needed:
      1. Capturing Different Contextual Relationships:
        • Each ‘head’ in a multi-headed attention mechanism can focus on different parts of the input sequence. This allows the model to capture a variety of relationships, such as different syntactic and semantic aspects of a sentence.
      2. Improving Representation Power:
        • With multiple attention heads, the model can attend to information from different representation subspaces at different positions. This leads to a richer and more nuanced understanding of the input.
      3. Parallel Processing:
        • Multi-headed attention enables parallel processing of multiple attention layers. This not only makes the model more efficient but also allows it to integrate information from different perspectives simultaneously.
      4. Robustness to Input Variations:
        • The ability to focus on different parts of the input makes the model more robust to variations and noise in the data, as it does not overly rely on a single attention pattern.
      5. Flexibility in Learning:
        • Multi-headed attention provides the flexibility to learn to focus on different types of information that might be relevant for different tasks, making the Transformer model versatile for a wide range of applications.
  • Multi-headed attention enhances the Transformer’s ability to understand complex input sequences by allowing it to simultaneously consider information from different perspectives and representation subspaces, leading to more effective and contextually rich models.
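A minimal multi-head sketch follows. It assumes each head projects to \(k = v = d / h\) dimensions and that the concatenated heads pass through an output projection `w_o`, as in the original paper; the loop over heads is for readability, whereas practical implementations batch the heads into a single tensor operation. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    # w_q, w_k, w_v: (h, d, d_head); w_o: (h * d_head, d)
    heads = []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        q, k, v = x @ wq, x @ wk, x @ wv                   # each (n, d_head)
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n, n)
        heads.append(weights @ v)                          # (n, d_head)
    concat = np.concatenate(heads, axis=-1)                # (n, h * d_head)
    return concat @ w_o                                    # back to (n, d)

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
d_head = d // h
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(h, d, d_head)) for _ in range(3))
w_o = rng.normal(size=(h * d_head, d))
out = multi_head_attention(x, w_q, w_k, w_v, w_o)
```

Each head has its own projections, so it computes its own \((n, n)\) attention pattern over the same tokens; the output projection then mixes the heads back into a single \(d\)-dimensional embedding per token.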

Scaled Dot-Product Attention:

  • Within the umbrella of self-attention, the Transformer adopts the “scaled dot-product attention.”
  • To compute the attention score between two elements, the query vector of one is dotted with the key vector of the other (both are linear projections of the embeddings). The resulting score is divided by the square root of the key dimensionality. A softmax function is then applied to these scores to derive attention weights that sum to one and can be read as probabilities.
  • It is to be noted that this focuses on the operations within a single attention head.
  • \(Q\), \(K\), and \(V\) are derived from linear projections of \(X\).
  • Attention scores are computed by taking the dot product of \(Q\) and \(K^T\), resulting in a matrix of size \((n, n)\).
  • This score matrix represents the relative importance between tokens. After masking (used in decoding) and scaling, it’s passed through a softmax function.
  • The attention scores are then used to weight the “values” \(V\), providing a contextually rich embedding of the input.
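The single-head computation above can be sketched as follows, using the shapes from the text: \(X\) is \((n, d)\) and the projections give \(Q\), \(K\) of size \((n, k)\) and \(V\) of size \((n, v)\). The optional boolean `mask` argument and the random projection matrices are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (n, n) scaled scores
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masking, as used in decoding
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d, k_dim, v_dim = 5, 8, 4, 6
x = rng.normal(size=(n, d))
w_q = rng.normal(size=(d, k_dim))
w_k = rng.normal(size=(d, k_dim))
w_v = rng.normal(size=(d, v_dim))
out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
```

A useful sanity check: if every key is identical, all scores in a row are equal, the weights become uniform, and each output row is simply the mean of the value vectors.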

The Impact on Natural Language Processing:

  • This mechanism has profoundly influenced the domain of natural language processing. Architectures like Transformer, BERT, and GPT owe their exceptional performance to it. The attention mechanism was conceptualized to address sequence-to-sequence models’ challenge in managing extensive sequences.

Challenges with Extended Sequences:

  • Historically, sequence-to-sequence models predominantly utilized architectures like RNNs, LSTMs, or GRUs. The encoder in these models processed the input sequence, resulting in a singular, fixed-length context vector. This vector subsequently directed the decoder in producing the output sequence. The inherent limitation was the compression of potentially vast information into a singular context vector, inevitably leading to data attrition, especially with lengthy sequences. This truncation culminated in diminished model efficacy.

Addressing the Challenge with Attention:

  • Attention mechanisms were introduced to ameliorate this deficiency. Contrary to the traditional method of compressing the entire input into a singular vector, attention allows the decoder to reference diverse segments of the input sequence while generating each output component. This method preserves contextuality and mitigates information degradation.

The Functional Workflow of Attention:

  • Consider a scenario where an English sentence (input sequence) is being translated into French (output sequence). The attention mechanism operates as follows:
  1. Annotation Vector Creation: The encoder processes the input sequence, yielding a set of vectors. Each of these vectors—termed “annotation vectors”—represents distinct words or segments of the input sequence.

  2. Determination of Attention Weights: During each phase of the output generation, attention computes the similarity (often via dot-product operations) between the current state of the decoder and all annotation vectors from the encoder. Post normalization (typically using the softmax function), these similarity measures manifest as attention weights.

  3. Context Vector Derivation: Using the attention weights, a weighted sum of all annotation vectors is computed. This resultant vector, termed the “context vector,” encapsulates pertinent segments of the input sequence for the current output generation phase.

  4. Attention-Aided Decoding: The decoder amalgamates information from the context vector and its current state to generate subsequent elements in the output sequence.
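Steps 2 and 3 of this workflow can be sketched in a few lines. The example assumes plain dot-product similarity between the decoder state and the annotation vectors (other scoring functions exist), and it uses toy one-hot annotations so that the effect is easy to see.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, annotations):
    scores = annotations @ decoder_state  # similarity with each annotation vector
    weights = softmax(scores)             # attention weights, summing to 1
    context = weights @ annotations       # weighted sum of annotation vectors
    return context, weights

# Three toy annotation vectors; the decoder state is most similar to the second one.
annotations = np.eye(3)
decoder_state = np.array([0.0, 5.0, 0.0])
context, weights = attention_context(decoder_state, annotations)
```

Here most of the attention weight lands on the second annotation vector, so the context vector is dominated by it. The decoder would then combine this context with its current state to produce the next output element.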

The Merits of Attention:

Implementing the attention mechanism offers multiple benefits:

  • It drastically mitigates information loss by allowing the model to reference various input segments for every output element.
  • The mechanism can adeptly manage extended sequences, thus enhancing model performance for tasks like translation or summarization.
  • Attention weights offer an interpretability layer, elucidating the input segments deemed crucial by the model during specific output generation phases.

  • To encapsulate, the self-attention mechanism has become an indispensable component in contemporary sequence-to-sequence models, catalyzing considerable advancements in natural language processing.

Positional Encoding

  • Since the Transformer does not process the words in the sequence in order, it does not have any inherent sense of the positional relationships between words. To address this, the Transformer includes positional encodings in its input representations. These encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension as the embeddings, so the two can be summed.
  • The Transformer uses sine and cosine functions of different frequencies as its positional encodings; the original authors hypothesized that this may allow the model to extrapolate to sequence lengths beyond those seen in training.
  • The Transformer model has been a pivotal development in NLP, leading to significant improvements in machine translation, text summarization, and other language-related tasks. Its key innovations, including self-attention and positional encoding, have addressed many of the limitations of previous sequence processing models. Subsequent work, including models such as BERT and GPT-3, has built upon the Transformer’s foundations, further advancing the state-of-the-art in NLP.
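The sinusoidal scheme can be sketched as below, following the formulation in the original paper: even dimensions use a sine and odd dimensions a cosine, with wavelengths forming a geometric progression controlled by the constant 10000. The function name is hypothetical, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    positions = np.arange(n_positions)[:, None]  # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # even indices 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)  # PE(pos, 2i + 1)
    return pe

pe = sinusoidal_positional_encoding(n_positions=10, d_model=16)
# The encoding has the same shape as the embeddings, so the two can be summed.
embeddings = np.zeros((10, 16))
inputs = embeddings + pe
```

Since the encoding is a fixed function of the position, it can be computed for any sequence length without learned parameters.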

Layer Normalization: A Technique to Accelerate Model Training

  • Layer normalization reduces uninformative variation in hidden-vector values by normalizing the activations within each layer to zero mean and unit variance, which tends to make training faster and more stable.
  • Scaled dot-product attention introduces a related modification that also eases Transformer training.
  • As the dimensionality \(d\) increases, the dot products between vectors tend to grow large in magnitude.
  • These large values push the softmax function into regions where its gradients are vanishingly small, which is why the dot products are divided by \(\sqrt{d}\).
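A short derivation makes the growth with \(d\) concrete. It assumes, as an idealization, that the components of a query \(q\) and a key \(k\) are independent random variables with mean 0 and variance 1:

$$
q \cdot k = \sum_{i=1}^{d} q_i k_i, \qquad \mathbb{E}[q \cdot k] = 0, \qquad \operatorname{Var}(q \cdot k) = \sum_{i=1}^{d} \operatorname{Var}(q_i k_i) = d
$$

Under that assumption the typical magnitude of a dot product is \(\sqrt{d}\), so dividing the scores by \(\sqrt{d}\) brings their variance back to 1 regardless of the dimensionality:

$$
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d}}\right) = \frac{d}{d} = 1
$$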

Limitations of Transformers

  • Inherent Quadratic Computation in Self-Attention: The computational complexity is \(O(T^2d)\), where \(T\) denotes sequence length and \(d\) represents dimensionality.
  • “The quadratic time and space complexity of attention layer computations w.r.t. the number of input tokens n.” source
  • “When the embedding size d > n, the 2nd problem is the quadratic time complexity of linear layers w.r.t. embedding size d.” source
  • “3rd problem is Positional Sinusoidal Embedding used in the original architecture.” source
    • In other words:
      1. Quadratic time and space complexity of the attention layer computation w.r.t. input tokens:
        • Transformers use what’s known as self-attention, where each token in a sequence attends to all other tokens (including itself). If you have a sequence of \(n\) tokens, you’ll essentially have to compute attention scores for each pair of tokens, resulting in \(n^2\) (quadratic) computations.
        • Similarly, for storing these attention scores, you’d need space that scales with \(n^2\), leading to a quadratic space complexity.
        • This becomes problematic for very long sequences as both the computation time and memory usage grow quickly, limiting the practical use of standard transformers for lengthy inputs.
      2. Quadratic time complexity of linear layers w.r.t. embedding size \(d\):
        • In transformers, after calculating the attention scores, the result is passed through linear layers, which have weights that scale with the dimension of the embeddings. If your token is represented by an embedding of size \(d\), and if \(d\) is greater than \(n\) (the number of tokens), then the computation associated with these linear layers can also be demanding.
        • The complexity arises because for each token, you’re doing operations in a \(d\)-dimensional space. For densely connected layers, if \(d\) grows, the number of parameters and hence computations grows quadratically.
      3. Positional Sinusoidal Embedding:
        • Transformers, in their original design, do not inherently understand the order of tokens (i.e., they don’t recognize sequences). To address this, positional information is added to the token embeddings.
        • The original Transformer model (by Vaswani et al.) proposed using sinusoidal functions to generate these positional embeddings. This method allows models to theoretically handle sequences of any length (since sinusoids are periodic and continuous), but it might not be the most efficient or effective way to capture positional information, especially for very long sequences or specialized tasks. Hence, it’s often considered a limitation or area of improvement, leading to newer positional encoding methods like Rotary Positional Embeddings (RoPE).
  • In comparison, the computational cost of RNNs grows linearly with sequence length, although their steps cannot be parallelized across time.
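A back-of-the-envelope calculation shows how quickly the quadratic term grows. The helper below is hypothetical: it counts only the \(n \times n\) score matrix for one head of one layer, ignores constant factors, and assumes 4 bytes per stored value.

```python
def attention_cost(n_tokens, d_model, bytes_per_value=4):
    # Rough counts for one head of one layer; constants are ignored.
    score_entries = n_tokens * n_tokens
    return {
        "score_entries": score_entries,                   # O(n^2) space
        "score_bytes": score_entries * bytes_per_value,
        "qk_multiply_adds": score_entries * d_model,      # O(n^2 d) time for Q K^T
    }

short = attention_cost(1_000, 512)
long = attention_cost(10_000, 512)
growth = long["score_entries"] // short["score_entries"]
```

Making the sequence 10 times longer makes the score matrix 100 times larger, which is the practical meaning of the quadratic limitation described above.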

Pretraining Strategies for Transformers

  • Transformers have achieved state-of-the-art performance on a multitude of NLP tasks, in part due to the effective use of pretraining strategies. Here is a deeper dive into these strategies and their significance:
  1. Motivation for Pretraining:
    • Most NLP tasks have limited labeled data available, which makes training deep models like Transformers from scratch challenging. Pretraining on large unlabeled datasets allows models to learn the semantics and syntax of the language, providing a strong initialization for fine-tuning on specific tasks.
  2. Pretraining Tasks:
    • Language Modeling (Masked LM): In this task, some words in a sentence are masked out, and the model aims to predict them based on the context. This encourages the model to learn a comprehensive understanding of language structure and semantics.
    • Next Sentence Prediction: Alongside masked language modeling, some versions of the Transformer, like BERT, predict whether two sentences are consecutive in a text. This task helps the model understand sentence relationships.
  3. Vocabulary Strategies:
    • Byte-Pair Encoding (BPE): This algorithm repeatedly merges the most frequent pair of adjacent symbols (bytes or characters) in the training data until a desired vocabulary size is reached. The result is a flexible vocabulary that can represent common words, morphemes, and even single characters.
    • UNK Token: For words or sub-words not in the learned vocabulary, a special “UNK” (for “unknown”) token is used.
  4. Context Sensitivity:
    • Unlike models like Word2Vec, which assign a single embedding to each word irrespective of its context, Transformers generate context-sensitive embeddings. This means the same word can have different embeddings based on its surrounding words, capturing nuances and polysemy effectively.
  5. Procedure:
    • Phase 1: Pretraining:
      • Transformers are trained on extensive text corpora like Wikipedia, BooksCorpus, or web texts. They aim to predict masked out words or other objectives as described above.
      • Once trained, the weights act as a representation of the language, capturing general features, patterns, and structures.
    • Phase 2: Fine-tuning:
      • The pretrained Transformer is then fine-tuned on a specific task (e.g., text classification, named entity recognition). During this phase, the model adapts its generalized knowledge to the peculiarities and specific requirements of the target task.
  6. Benefits:
    • Robust Initialization: Starting with weights that already understand language makes convergence faster and often results in better performance.
    • Data Efficiency: By leveraging knowledge from large unlabeled datasets, Transformers can achieve competitive results on tasks even with limited labeled data.
    • Generalization: The broad knowledge acquired during pretraining aids in generalizing better to various NLP tasks.
  • In conclusion, pretraining has become a cornerstone in the success of Transformer-based models in NLP. The ability to leverage vast amounts of unlabeled data to benefit tasks with limited labeled examples is a paradigm shift in how models are trained and deployed.
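The Byte-Pair Encoding procedure mentioned under vocabulary strategies can be sketched as a merge-learning loop. This is a simplified, character-level illustration on a toy word-frequency table; it is not the tokenizer of any particular model, and the function names are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    # words maps a tuple of symbols to its frequency in the corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, vocab_words = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
```

After two merges on this toy corpus, the frequent stem "low" has become a single symbol, while the rarer suffixes remain split into characters.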

Pretrained Models: A Refined Explanation

Decoders as Language Models:

  • Decoders, traditionally utilized in sequence-to-sequence architectures, serve as language models in the context of pretraining.
  • Generative Nature: Decoders are adept at generating sequences. However, their limitation is that they cannot condition on subsequent words in a sequence.
  • Pretraining Perspective: Decoders are pretrained as language models, that is, they are trained to predict the probability of a word given its preceding words, denoted as $$p(w_t \mid w_1, \dots, w_{t-1})$$. When a pretrained decoder is used in a downstream application, however, this original training objective can be set aside.
  • Learning and Updates: During the training process, errors are comprehensively back-propagated throughout the entire network, ensuring a holistic update of all parameters.
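Written out, the language-modeling view factorizes the probability of a whole sequence into next-word predictions. The notation (\(w_t\) for the word at position \(t\), \(T\) for the sequence length) is introduced here for illustration:

$$
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})
$$

Training then typically minimizes the negative log-likelihood of the observed text, which is the sum of the per-position prediction losses:

$$
\mathcal{L} = -\sum_{t=1}^{T} \log p(w_t \mid w_1, \dots, w_{t-1})
$$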

GPT (Generative Pretrained Transformer):

  • GPT marked a significant advancement in the realm of pretrained decoders.
  • Structural Overview: GPT, as a model, boasts a Transformer decoder architecture with 12 layers. Each layer consists of 768-dimensional hidden states and feed-forward networks with 3072-dimensional hidden layers.
  • Token Representation: GPT employs the Byte Pair Encoding (BPE) methodology for tokenization. Specifically, this approach involved 40,000 merge operations.
  • Applications: One notable application of GPT, beyond language modeling, is Natural Language Inference (NLI). In this task, the model classifies pairs of sentences into categories: entailment, contradiction, or neutral.

Teacher Forcing

  • Decoders are autoregressive models in nature, which means they generate one token at a time based on previously generated tokens. Without teacher forcing, training would feed the decoder’s own predictions back in as input for the next step.
  • However, with teacher forcing, during training, the decoder is fed with the correct or ground truth tokens from the target sequence at each step. This accelerates learning and improves stability, but it may not fully prepare the model for real-world inference, where it must rely on its own predictions. Techniques like scheduled sampling can mitigate this exposure bias.
  • Let’s delve a bit deeper into this concept:
    • Teacher Forcing, as we mentioned, says that during training, the model is fed with the ground truth (true) target sequence at each time step as input, rather than the model’s own predictions. This helps the model learn faster and more accurately during training because it has access to the correct information at each step. However, when the model is deployed for inference (generating sequences), it typically does not have access to ground truth information and must rely on its own predictions, which can be less accurate, leading to a problem known as exposure bias or the “train-test discrepancy.”
    • Scheduled sampling is a technique whose primary goal is to address the discrepancy between the training and inference phases that arises due to teacher forcing.
    • Scheduled sampling is introduced to bridge the train-test discrepancy by gradually transitioning from teacher forcing to using the model’s own predictions during training. Here’s how it works:

      1. Teacher Forcing Phase:
        • In the early stages of training, scheduled sampling follows a schedule where teacher forcing is dominant. This means that the model is mostly exposed to the ground truth target sequence during training.
        • At each time step, the model has a high probability of receiving the true target as input, which encourages it to learn from the correct data.
      2. Transition Phase:
        • As training progresses, scheduled sampling gradually reduces the probability of using the true target as input and increases the probability of using the model’s own predictions.
        • This transition phase helps the model get accustomed to generating its own sequences and reduces its dependence on the ground truth data.
      3. Inference Phase:
        • During inference (when the model generates sequences without access to the ground truth), scheduled sampling is typically turned off. The model relies entirely on its own predictions to generate sequences.
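
  • The three phases above can be sketched as a probability schedule plus a coin flip at each decoding step. This is a minimal illustration rather than a reference implementation: the linear decay and the names `truth_probability` and `next_decoder_input` are assumptions made for the example.

```python
import random

def truth_probability(step, total_steps):
    """Probability of feeding the ground-truth token at a given training step.

    Linear decay from 1.0 to 0.0 is an assumed schedule, chosen for simplicity.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress

def next_decoder_input(true_token, predicted_token, p_truth, rng):
    """Feed the ground truth with probability p_truth, else the model's own prediction."""
    return true_token if rng.random() < p_truth else predicted_token

# Early, middle, and late training: the share of ground-truth inputs shrinks.
rng = random.Random(0)
for step in (0, 5_000, 10_000):
    p = truth_probability(step, total_steps=10_000)
    fed = [next_decoder_input("truth", "prediction", p, rng) for _ in range(1_000)]
    print(step, p, fed.count("truth") / len(fed))
```

  • At inference there is no ground truth to choose from, so the decoder always receives its own prediction.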

Encoders in NLP:

Encoders are specialized in capturing bidirectional context from input data.

  • Bidirectional Context: Encoders have the capacity to perceive and leverage context from both previous and subsequent words in a sequence.

  • Structural Components: Essential elements of encoders include multi-headed self-attention mechanisms, feed-forward neural networks, and layer normalization.

  • Lack of Locality Bias: Transformer-based encoders have no built-in locality bias. Self-attention connects every pair of positions directly, so context from distant words is as accessible as context from nearby words. This is an advantage over models like LSTMs, which can struggle with long-term dependencies.

BERT Explored:

Decoding BERT: An In-depth Analysis

  • BERT (Bidirectional Encoder Representations from Transformers) stands as a testament to the power of deep learning in NLP. Crafted by Google, BERT harnesses the potential of transformer-based architectures.

  • Structural Dynamics: BERT comes in two primary flavors: BERT-Base, with twelve encoder layers, and BERT-Large, with twenty-four. The hidden size is 768 for BERT-Base and 1024 for BERT-Large, with feed-forward layers of 3072 and 4096 units and 12 and 16 attention heads, respectively. These exceed the base configuration of the original Transformer (6 layers, 512 hidden units, and 8 attention heads).
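
  • As a rough check on what these dimensions imply, the sketch below counts the parameters of a BERT-style encoder. The vocabulary size (30,522), the 512 positions, the two segment types, and the feed-forward widths (3072 and 4096) are assumptions taken from the released BERT checkpoints rather than from the text above, and the pre-training heads are left out.

```python
def bert_param_count(layers, hidden, ffn, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT-style encoder (pre-training heads excluded)."""
    layer_norm = 2 * hidden                                   # scale and bias
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_param_count(layers=12, hidden=768, ffn=3072)
large = bert_param_count(layers=24, hidden=1024, ffn=4096)
print(f"BERT-Base:  {base / 1e6:.1f}M")
print(f"BERT-Large: {large / 1e6:.1f}M")
```

  • Under these assumptions the counts come out near 110M and 335M, with most of the parameters sitting in the stacked encoder layers rather than in the embeddings.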

Unmasking BERT’s Language Modeling

  • BERT’s language modeling strategy is unique. Unlike traditional models which predict the next word in a sequence, BERT employs a “masked” language model approach. Words in the input sequence are randomly replaced with a [MASK] token. The model then tries to predict these masked words, leveraging context from both before and after the mask. This bidirectional context utilization is central to BERT’s proficiency in understanding language semantics.

BERT: Fine-tuning and Pre-training

  • In the operational hierarchy of BERT, the models are pre-trained once and fine-tuned multiple times, making the model highly adaptable. Fine-tuning BERT models is often manageable on a single GPU, making it an accessible and efficient process.

BERT’s Contribution to Reading Comprehension

  • BERT has made significant strides in advancing reading comprehension. By masking 15% of the input, BERT creates a diverse learning environment for the model, improving its fine-tuning potential. To further enhance this, BERT sometimes randomly substitutes a word with another and instructs the model to predict the correct word for that position.
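
  • A minimal sketch of this masking procedure is shown below. The 80/10/10 split between [MASK], a random token, and the unchanged token follows the BERT paper and is not stated in the text above; the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                        # position not selected for prediction
        labels[i] = tok                     # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK                # assumed 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # assumed 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, vocab=["dog", "ran", "under"], rng=random.Random(0))
print(inputs)
print(labels)
```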

BERT’s Edge over GPT and ELMo

  • Earlier models such as GPT and ELMo rely on unidirectional context: GPT reads left to right, while ELMo concatenates independently trained left-to-right and right-to-left models. This works for predicting the next word but is not deeply bidirectional, since no single layer conditions on both previous and subsequent words at once. BERT, on the other hand, conditions on both sides jointly in every layer, giving it an advantage in understanding the mutual relation of words in a sentence.

BERT’s Training Mechanism

  • BERT accepts input sequences of up to 512 tokens. During pre-training, it masks out 15% of the tokens and learns to predict them. Through the next sentence prediction (NSP) objective, BERT also learns the relationship between two sentences. This pre-training scheme allows BERT to outperform earlier models on key NLP tasks such as question answering (QA) and Natural Language Inference (NLI).

BERT’s Fine-tuning Procedure

  • The fine-tuning process in BERT differs based on the task. For sentiment analysis, for instance, a sentence is encoded with BERT and a small output matrix maps the resulting representation to the class labels. This adds only a marginal number of new parameters compared with the pre-trained ones.

The Future of BERT

  • BERT is a deep bidirectional transformer encoder pre-trained on a large volume of text. BERT-Base comprises 12 layers and 110M parameters, and BERT-Large comprises 24 layers and 340M parameters. For question answering, the question is designated as segment A and the passage as segment B, and BERT predicts two endpoints in segment B that mark the start and end of the answer span.

  • Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. It has proven to be a groundbreaking model in the NLP landscape, outperforming the traditional models in a wide variety of tasks.
  • BERT is designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
  • The architecture of BERT is based on a transformer model architecture, originally introduced in “Attention is All You Need” by Vaswani et al. BERT leverages the Transformer’s encoder part. While the transformer model has an encoder-decoder structure, BERT only uses the encoder mechanism.
  • BERT uses the transformer’s self-attention mechanism to take into account the context of words. Unlike previous models, which were unidirectional, BERT is bidirectional. This means it can understand the context of a word based on all of its surroundings (left and right of the word).

Pre-training and Fine-tuning

  • The key innovation in BERT is bidirectional training, achieved through a masked language model objective. During pre-training, the model learns to predict randomly masked words in a sentence by considering the context words on both sides.
  • In the fine-tuning stage, BERT can be tailored to specific tasks (e.g., question answering, sentiment analysis) with a minimal amount of task-specific parameters. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labelled data from the downstream tasks.

Applications of BERT

  • BERT has set new standards for NLP tasks. Its applications include but are not limited to:
  1. Question answering systems: BERT can be used to build a system that can answer questions about a given piece of text.
  2. Language understanding: BERT’s bidirectional approach provides a deeper understanding of the context of words in a sentence, helping to understand the nuances of a language.
  3. Text classification: BERT can classify texts into specific categories, enabling it to perform tasks like sentiment analysis.
  4. Named entity recognition: BERT can identify and categorize named entities in a text into predefined categories like persons, organizations, locations, etc.
  5. Improving search engine results: BERT’s understanding of query context can help provide more relevant search results.
  • BERT is a powerful model that has drastically changed the landscape of NLP. It provides a more contextual representation of words in a text, enabling systems to understand natural language better than ever before. The impact of BERT is evident, as it has been adopted by Google for improving search results. The future holds more potential for transformer-based models like BERT, as more advances are made in the field of NLP.

Flavors of BERT

  • BERT, or Bidirectional Encoder Representations from Transformers, is a significant development in Natural Language Processing (NLP) that was introduced by Google in 2018. BERT is revolutionary because it provides a pre-trained language model that can be fine-tuned for a wide variety of NLP tasks with minimal additional task-specific parameters. Since the introduction of BERT, several “flavors” or variations of the original model have been developed, each with unique characteristics. Here are some of the most notable variants:
  1. RoBERTa (Robustly optimized BERT approach): Developed by Facebook’s AI team, RoBERTa is a reimplementation of BERT with certain training modifications. The team removed the next-sentence pretraining objective and trained with larger mini-batches and learning rates.

  2. DistilBERT: Developed by the Hugging Face team, DistilBERT is a smaller, faster, cheaper, and lighter version of BERT. DistilBERT has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance.

  3. ALBERT (A Lite BERT): Another variant developed by Google Research, ALBERT reduces the parameters of BERT by sharing parameters across the layers. It also introduces a sentence-order prediction task to improve on BERT’s next-sentence prediction task.

  4. BERTweet: This variant is a BERT model that is specifically trained for English Tweets. This enables it to understand the specific language, emojis, and hashtags commonly used on Twitter.

  5. SpanBERT: SpanBERT improves BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the model to predict the entire span, rather than just masked tokens.

  6. BioBERT: BioBERT is a version of BERT pre-trained on large-scale biomedical corpora. It significantly outperforms the original BERT on various biomedical text-mining tasks, such as biomedical named entity recognition, relation extraction, and question answering.

  7. SciBERT: Similar to BioBERT, SciBERT is pre-trained on a large corpus of scientific text, allowing it to perform better on scientific NLP tasks.

  8. CamemBERT: This is a BERT model trained on French text. It showcases how the BERT architecture can be applied to other languages apart from English.

  9. ClinicalBERT: Like BioBERT and SciBERT, ClinicalBERT is a domain-specific BERT model that is pre-trained on clinical notes, making it better suited for NLP tasks in the healthcare domain.

  • Each of these flavors of BERT has been designed with specific modifications or pre-trained on specific corpora to enhance performance on particular tasks or domains. As the field of NLP continues to evolve, it’s likely that we will see even more variations of BERT in the future.


  • The Generative Pretrained Transformer (GPT) is an autoregressive language model that uses deep learning techniques to produce human-like text. It is the foundation of models like GPT-2 and GPT-3, developed by OpenAI.
  • GPT models are a significant milestone in Natural Language Processing (NLP) and have gained popularity due to their efficiency in a wide array of language processing tasks. The underlying concept of GPT is that it uses transformer-based neural networks and unsupervised learning to generate highly contextualized outputs.

Architecture of GPT

  • GPT is based on the transformer model, specifically the transformer’s decoder part. A Transformer model has an encoder-decoder structure, but GPT only uses the decoder mechanism without the need for an encoder. It uses the transformer’s self-attention mechanism for connecting the entire sequence of inputs.
  • The GPT model follows an autoregressive approach where it predicts the next word in a sentence based on the previous words. The model learns to predict the probability of a word given the previous words used in the text. Once trained, it can generate text by sampling words from these probability distributions.
  • In contrast to encoder models such as BERT, GPT doesn’t use bidirectional context; instead, it uses left-to-right (unidirectional) context, where each position only attends to the preceding words.
  • GPT follows a two-step process: Pre-training and Fine-tuning.
  • During pre-training, the model is trained on a large corpus of text from the internet, learning to predict the next word in a sentence. At this stage, the model learns a wide array of language patterns.
  • In the fine-tuning phase, the base model is further trained on a specific task with labeled data. For example, if the task is about sentiment analysis, the model is fine-tuned using labeled datasets indicating whether a given sentence expresses a positive or negative sentiment.
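
  • The left-to-right restriction described above is usually enforced with a causal mask on the attention scores. The sketch below is a simplified illustration under stated assumptions: it takes a ready-made score matrix (no learned projections), and `causal_attention_weights` is a hypothetical helper name.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix, hiding future positions."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                       # positions 0..i only
        shift = max(visible)                         # subtract max for numerical stability
        exps = [math.exp(s - shift) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)          # future positions get zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights

scores = [[0.0, 2.0, 1.0],
          [1.0, 0.0, 3.0],
          [0.5, 0.5, 0.5]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

  • The first token can only attend to itself, and every entry above the diagonal is zero, so no position sees the words that follow it.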

Applications of GPT

  • GPT models can generate coherent, diverse, and contextually rich sentences, making them versatile in many NLP tasks. These tasks include but are not limited to:
  1. Text generation: GPT can generate human-like text which can be used in chatbots, story writing, or as an AI writing assistant.
  2. Translation: Given a text in one language, GPT can be used to translate it into another language.
  3. Question answering: GPT can understand a piece of text and answer questions related to it.
  4. Sentiment analysis: By fine-tuning GPT on specific datasets, it can classify the sentiment of a piece of text.

Future of GPT

  • GPT models, especially the newer versions like GPT-3, provide a highly effective, scalable solution for many NLP tasks. They are not without their shortcomings, like generating text that can be nonsensical or sensitive to slight changes in input phrasing, but the concept has already had a significant impact on the field.
  • GPT and its subsequent iterations have shown us a glimpse of how far transformer-based models can go in understanding and generating human-like text. They represent a step forward in the journey toward more sophisticated language-based AI applications. Future versions, like GPT-4 and beyond, promise even more advancements and capabilities.

Flavors of GPT

  • Similar to BERT, GPT (Generative Pretrained Transformer) has several iterations or “flavors”, which are significant developments in the field of Natural Language Processing (NLP). The official variants were introduced by OpenAI, with each version featuring improvements and modifications over the previous ones, and open-source groups have released related models. Here are the main flavors of GPT:
  1. GPT-1: The first GPT model was introduced by OpenAI in June 2018. It was a transformer-based language model, trained using unsupervised learning and then fine-tuned for specific tasks. This model marked a shift from traditional methods, showing that pretraining language models could improve performance on a variety of NLP tasks.

  2. GPT-2: Released in 2019, GPT-2 was an extension of GPT-1 but featured a larger model size. With 1.5 billion parameters, GPT-2 showcased the “scalability” hypothesis — the idea that larger models, given sufficient data and compute, can perform better. OpenAI initially chose not to release the full model due to concerns about potential misuse, demonstrating the significant capabilities of this variant.

  3. GPT-3: Introduced in 2020, GPT-3 drastically scaled up the model size to 175 billion parameters. The model’s capabilities were shown to improve substantially with the increase in size, and it demonstrated strong performance even without task-specific fine-tuning. GPT-3 is capable of producing human-like text and has been applied in various domains, from drafting emails and writing code to creating written content.

  4. GPT-J: This is not an official GPT variant released by OpenAI but a model developed by EleutherAI. GPT-J has 6 billion parameters, and at the time of its release it was among the largest openly available transformer language models. It has a similar architecture to GPT-3, and its performance is somewhere between GPT-2 and GPT-3.

  5. GPT-Neo: Again, this is not an official GPT variant released by OpenAI but a project by EleutherAI. The goal of GPT-Neo was to provide the open-source community with large-scale transformer models, as an alternative to the proprietary models like GPT-3. The released GPT-Neo models have 1.3 billion and 2.7 billion parameters.

  • Each of these flavors of GPT marks a significant advancement in NLP, demonstrating the potential of transformer-based architectures and the benefits of scaling model size. However, as these models become more capable, they also raise important questions about ethics and potential misuse, which remain key concerns for the field.

T5 Language Model: A Detailed Perspective

  • Transfer learning has reshaped the manner in which NLP models are trained and fine-tuned.

  • Basic Paradigm:

    • Unsupervised Pretraining: With a wealth of unlabeled text data at our disposal, models are first pretrained using unsupervised objectives. One common approach involves masking certain words in a sequence and prompting the model to predict them.
    • Fine-tuning: Once the pretraining phase has reached a certain point, the model is then fine-tuned on specific tasks, such as sentiment analysis.
    • Benefits: This approach has proven to be highly efficient, leading to superior performance compared to training models from scratch. The year 2018 witnessed a surge in research papers advocating for the transfer learning paradigm, elevating the excitement in the field.

T5: Text-to-Text Transfer Transformer

  • T5, or the Text-to-Text Transfer Transformer, emerged as a significant milestone in the world of NLP.

  • Primary Functionality: T5 casts every task as text-to-text: both the input and the output are strings. One application is assessing the similarity between two sentences, where T5 outputs the similarity score as text, such as “3.8”.

  • Innovative Approach: Because the scores are rounded to a small set of values and emitted as strings, this regression problem is effectively converted into a classification task with more structured outcomes.

  • Architectural Roots: The Transformer architecture, which lies at the heart of T5, was originally proposed as an encoder-decoder model.

  • Pretraining Objective: The baseline T5 model is an encoder-decoder in which the encoder and the decoder are each similar in size to BERT-Base. It is pretrained on the C4 dataset with a denoising objective, in which masked-out parts of the input must be reconstructed. The pretrained model is then fine-tuned on various tasks such as the General Language Understanding Evaluation (GLUE) benchmark for natural language understanding and the Stanford Question Answering Dataset (SQuAD).

Deep Dive: Encoder vs. Decoder vs. Encoder-Decoder

The Transformer architecture can be divided into encoder-only, decoder-only, and encoder-decoder configurations. Each serves a unique purpose, but they also come with challenges.

  • Encoder-Decoder Dilemma: Traditional encoder-decoder frameworks sometimes grapple with encoding irrelevant information. This problem is exacerbated when handling lengthy or information-dense inputs, where selective encoding becomes challenging.

  • Real-world Application: Take text summarization as an instance. It’s framed as a sequence-to-sequence task where the input is an extensive text, and the output is a concise version. Expecting a fixed-sized vector to encapsulate all the nuances of potentially long text can be unrealistic, leading to potential information loss.

  • Attention to the Rescue: The attention mechanism was born out of the necessity to allow decoders to refer back to the input sequence. It provides a “context” vector, computed based on input hidden states, to the decoder during its token generation process. This mechanism ensures that the decoder doesn’t just rely on the final hidden state of the encoder but can also focus on specific parts of the input text that are relevant to the current decoding step.

  • In essence, T5 and similar models leverage the power of transfer learning and Transformer architectures, coupled with attention mechanisms, to push the boundaries of what’s possible in NLP tasks.

What are some drawbacks of the Transformer?

  • The runtime of Transformer architecture is quadratic in the length of the input sequence, which means it can be slow when processing long documents or taking characters as inputs. In other words, computing all pairs of interactions during self-attention means our computation grows quadratically with the sequence length, i.e., \(O(T^2 d)\), where \(T\) is the sequence length, and \(d\) is the dimensionality. Note that for recurrent models, it only grew linearly!
    • Say, \(d = 1000\). So, for a single (shortish) sentence, \(T \leq 30 \Rightarrow T^{2} \leq 900 \Rightarrow T^2 d \approx 900K\). Note that in practice, we set a bound such as \(T=512\). Imagine working on long documents with \(T \geq 10,000\)!?
  • Wouldn’t it be nice for Transformers if we didn’t have to compute pair-wise interactions between each word pair in the sentence? Several recent studies explore ways to avoid this cost.
  • Compared to CNNs, the data appetite of transformers is very high. CNNs are more sample efficient, which makes them great candidates for low-resource tasks. This is especially true for image/video generation tasks, where an exceptionally large amount of data is needed even for CNN architectures (which implies that Transformer architectures would have an even higher data requirement). For example, the CLIP models by Radford et al. include variants trained with CNN-based ResNets as vision backbones alongside ViT-like transformer variants. While transformers do offer accuracy bumps once their data requirement is satisfied, CNNs offer a way to deliver decent accuracy in tasks where the amount of data available is not exceptionally high. Both architectures thus have their use cases.
  • The runtime of the Transformer architecture is quadratic in the length of the input sequence. Computing attention over all word pairs means the number of edges in the attention graph scales quadratically with the number of nodes, i.e., in an \(n\) word sentence, a Transformer computes over \(n^{2}\) pairs of words. This implies a large memory footprint for the attention scores as well as high computational complexity. More in the section on What Would We Like to Fix about the Transformer?
  • High compute requirements have a negative impact on power and battery life, especially for portable device targets.
  • Overall, a transformer requires more computational power, data, power/battery life, and memory than its conventional competitors in order to offer better performance (in terms of, say, accuracy).
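
  • The quadratic growth discussed above can be made concrete with a small calculation. The sketch below uses the rough cost model \(T^2 d\) for self-attention and, as an assumed point of comparison, \(T d^2\) for a recurrent layer; both ignore constant factors, and the function names are hypothetical.

```python
def attention_cost(seq_len, d_model):
    """Rough multiply-add count for the T x T attention scores: T^2 * d."""
    return seq_len ** 2 * d_model

def recurrent_cost(seq_len, d_model):
    """Rough count for a recurrent layer: one d x d update per step, T * d^2."""
    return seq_len * d_model ** 2

d = 1000
for T in (30, 512, 10_000):
    print(f"T={T:>6}  attention={attention_cost(T, d):.2e}  recurrent={recurrent_cost(T, d):.2e}")
```

  • Doubling the sequence length quadruples the attention cost but only doubles the recurrent cost, which is why long documents are the difficult case for self-attention.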

Sentiment Analysis and Encoders:

  • Objective: The primary aim of sentiment analysis is to determine the sentiment or emotion expressed in a piece of text, categorizing it as positive, negative, neutral, or sometimes more fine-grained classifications.

  • Why Encoders:

    1. Fixed-length Representation: Sentiment analysis requires the model to understand and summarize the entire content of a text. Encoders convert variable-length input sequences into a fixed-length context vector, capturing the essence of the content. This holistic understanding is crucial for determining sentiment.

    2. No Sequence Generation Needed: Unlike tasks that require a new sequence as output (e.g., translation or summarization), sentiment analysis outputs a class label. The encoder’s role is to analyze and extract features from the input text, which can then be classified. The generation capabilities of decoders are unnecessary.

    3. End-to-end Classification: Once the encoder processes the input text, the resulting context vector can be directly fed into a classifier (like a softmax layer) to determine the sentiment.
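
  • The pipeline above, from per-token encoder outputs to a class label, can be sketched as follows. Mean pooling is an assumption made for simplicity (BERT-style models typically use the [CLS] token instead), and the token vectors, weights, and bias are toy values invented for the example.

```python
import math

def mean_pool(token_vectors):
    """Collapse a variable-length list of token vectors into one fixed-length vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(token_vectors, weights, bias):
    """Pool the encoder outputs, apply one linear layer, and return class probabilities."""
    pooled = mean_pool(token_vectors)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy encoder outputs for a two-token sentence (hidden size 2, hypothetical values).
encoded = [[1.0, 0.0], [3.0, 2.0]]
weights = [[1.0, 0.0],   # row for the "positive" class
           [0.0, 1.0]]   # row for the "negative" class
probs = classify(encoded, weights, bias=[0.0, 0.0])
print(dict(zip(["positive", "negative"], probs)))
```

  • However long the input is, the pooled vector has the same size, so the classifier adds only a weight matrix and a bias on top of the encoder.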

Generative AI and Decoders:

  • Objective: Generative AI tasks involve producing new content or sequences, whether that’s text, images, music, or other data.

  • Why Decoders:

    1. Sequence Generation: By design, decoders are built to produce sequences. In generative AI tasks, where the aim is to create novel sequences, the decoding process becomes paramount.

    2. Conditioned Generation: Generative tasks can often be conditioned on some input (e.g., generating a continuation of a given text). Decoders can take a context vector (potentially from an encoder) and produce sequences that align with that context.

    3. Flexibility in Length: Decoders can generate sequences of variable lengths, which is essential in generative tasks where the desired output’s length may not be predefined.

    4. Sampling and Exploration: Decoders can leverage mechanisms like temperature sampling, top-k sampling, etc., to introduce randomness and creativity in the generated content, which is crucial for generative AI to produce diverse outputs.

In summary, the inherent design and capabilities of encoders align with the objectives of sentiment analysis, where understanding and summarizing content is the primary goal. In contrast, the sequence generation capabilities of decoders make them well-suited for generative AI tasks.

Running on-device

  • Running deep learning models on-device is an increasingly popular approach due to its benefits such as reduced latency and improved privacy. However, these benefits come with constraints on model size and complexity due to the limited computational and memory resources of devices. This has led to the development of smaller, more efficient versions of transformer-based models such as BERT and GPT that are suitable for on-device deployment. Here are some examples:
  1. DistilBERT: DistilBERT is a distilled version of the BERT model, developed by Hugging Face. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while retaining over 95% of BERT’s performance as measured on the GLUE language understanding benchmark. DistilBERT is designed to be smaller, faster, and more lightweight than its parent model BERT, making it well suited to on-device applications.

  2. ALBERT: ALBERT, short for “A Lite BERT,” is another variant of BERT developed by Google Research. It has two major optimizations to lower memory consumption and increase the training speed: factorized embedding parameterization and cross-layer parameter sharing. Factorized embedding parameterization disentangles the size of the hidden layers from the size of vocabulary embedding, which significantly reduces the memory requirement for large vocabulary tasks. Cross-layer parameter sharing prevents the increase of parameters with the depth of the network, leading to a dramatic decrease in parameters. These characteristics make ALBERT more suitable for on-device applications.

  3. DistilGPT2: Following the success of DistilBERT, Hugging Face also released a distilled version of GPT-2 called DistilGPT2. It is a smaller, faster version of GPT-2 that maintains much of the parent model’s performance while being more suitable for on-device deployment.

  4. MiniLM: MiniLM is a family of distilled BERT-like models developed by Microsoft Research. It uses deep self-attention distillation: a small student model is trained to mimic the self-attention behavior of the last Transformer layer of a larger teacher model. MiniLM is designed to have comparable performance to BERT on a range of NLP tasks while being significantly smaller and faster.

  • These transformer models are designed to maintain much of the performance of the original, full-size models while having significantly lower computational and memory requirements, making them suitable for on-device deployment. However, it should be noted that there will still be a trade-off between model size/performance and the computational/memory constraints of the device.

Why was dropout used in the original transformer if there was enough data that overfitting was not a problem?

  • Dropout was used in the original Transformer as a general regularization technique rather than as a response to a shortage of data. Even with a substantial amount of training data, a model with many parameters can still overfit, so dropout was included in the design to improve generalization and to make the model more robust across inputs and tasks.
  • Here are some reasons why dropout was used in the original Transformer:

    1. Model Generalization: Even with large datasets, deep neural networks like Transformers can still overfit to the training data, especially when they have a large number of parameters. Dropout helps in preventing overfitting by randomly deactivating a portion of neurons during each training step, making the model less reliant on specific neurons and features.

    2. Model Robustness: A robust model is one that can handle different types of inputs, noisy data, perturbations, and adversarial examples while maintaining its performance and reliability. Dropout can increase the model’s robustness by forcing it to learn more redundant representations. This means that even if some neurons are dropped out during training, the model can still make accurate predictions because it has learned to rely on multiple pathways through the network.

    3. Implicit Ensembling: Dropout can be seen as training an implicit ensemble. Each training step uses a different subnetwork that shares parameters with all the others but has its own dropout mask. At inference, dropout is turned off and the full network approximates the average of these subnetworks’ predictions, which often leads to better generalization and more accurate results.

    4. Regularization: Dropout acts as a form of regularization, which can help the model avoid memorizing specific examples in the training data and instead focus on learning more abstract and generalizable patterns.
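
  • The mechanism behind these points can be sketched with “inverted” dropout, the variant used by most modern frameworks. This is an illustration under stated assumptions rather than the original implementation: the default rate of 0.1 matches the base model in the original paper, while the function name and the list-based interface are hypothetical.

```python
import random

def dropout(values, p_drop=0.1, training=True, rng=None):
    """Inverted dropout: zero each value with probability p_drop and rescale survivors.

    Rescaling by 1 / (1 - p_drop) keeps the expected activation unchanged,
    so nothing needs to be adjusted at inference time.
    """
    if not training or p_drop == 0.0:
        return list(values)                  # inference: the full network is used
    rng = rng or random.Random()
    keep = 1.0 - p_drop
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [2.0] * 10
print(dropout(activations, p_drop=0.5, rng=random.Random(0)))   # a random subnetwork
print(dropout(activations, training=False))                     # unchanged at inference
```

  • Each training call samples a different mask, which is what gives the ensemble-like behavior described above.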

How to choose the output token from the decoder

  • In the context of language models and sequence generation tasks, the softmax function at the end of a decoder plays a crucial role. Let’s break down its purpose and how it relates to different sampling strategies like argmax, top-k sampling, etc.

Softmax Function in Decoder

  1. Purpose: The softmax function is applied to the logits (the raw output of the last layer of the neural network before the softmax layer) to convert them into probabilities. Each logit corresponds to a possible next token (like a word or character), and the softmax function ensures that the output is a valid probability distribution, with all values between 0 and 1 and summing up to 1.

  2. Mechanism: The softmax function exponentiates its inputs and then normalizes them, making sure that larger input values (higher logits) correspond to larger probabilities. This is crucial for understanding which tokens are more likely to be the correct next token in the sequence.
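
  • The exponentiate-then-normalize step can be written in a few lines. The sketch below subtracts the largest logit before exponentiating, a common numerical-stability trick that is an addition here and does not change the result; the example logits are arbitrary.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    shift = max(logits)                          # avoids overflow for large logits
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs], sum(probs))
```

  • Larger logits map to larger probabilities, and the outputs always sum to 1.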

Sampling Strategies

  • Once we have a probability distribution for the next token, different strategies can be used to select the actual token:
  1. Argmax: This is the simplest approach. You just pick the token with the highest probability. This method is deterministic – given the same context, it will always produce the same output. However, it can lead to repetitive and predictable text generation.

  2. Top-k Sampling: This method involves narrowing down the choices to the top ‘k’ tokens according to their probabilities and then sampling from this subset. This introduces randomness and helps generate more diverse and interesting outputs. The value of ‘k’ controls the diversity: a smaller ‘k’ leads to less randomness, while a larger ‘k’ increases diversity.

  3. Other Methods: There are other sampling methods like top-p (or nucleus) sampling, where tokens are selected based on a cumulative probability threshold, and temperature-based sampling, where the ‘temperature’ parameter modifies the probability distribution to control randomness.
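
  • The strategies above can be compared side by side in a short sketch. The function names (`argmax_pick`, `top_k_sample`, `top_p_sample`) and the example logits are assumptions made for illustration; real decoding libraries differ in their details. Top-k here samples in proportion to the kept probabilities, which is one common choice.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax with temperature: values below 1 sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_pick(probs):
    """Greedy choice: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens, then sample among them."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # choices() treats weights as relative
    return rng.choices(top, weights=weights, k=1)[0]

def top_p_sample(probs, p, rng):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

rng = random.Random(0)                 # seeded so runs are repeatable
logits = [3.0, 2.0, 1.0, 0.0, -1.0]    # hypothetical 5-token vocabulary
probs = softmax(logits)
greedy = argmax_pick(probs)
sampled_k = top_k_sample(probs, k=2, rng=rng)
sampled_p = top_p_sample(probs, p=0.9, rng=rng)
sharp = softmax(logits, temperature=0.5)
```

  • With these logits, `greedy` is always token 0, `sampled_k` can only be token 0 or 1, and `sharp` places more mass on token 0 than `probs` does.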

Selection Process

  • Deterministic vs. Stochastic: Argmax is deterministic (always picks the highest probability token), while top-k and other methods are stochastic, introducing randomness into the selection process.
  • Uniformity: In top-k sampling, once the top ‘k’ tokens are selected, the choice among them is usually made in proportion to their renormalized probabilities; a uniform choice (each of the top ‘k’ tokens has an equal chance of being picked) is also possible.
  • Bias: The method used can introduce a bias in the type of text generated. Argmax tends to be safe and less creative, while stochastic methods can generate more novel and varied text but with a higher chance of producing irrelevant or nonsensical output.

Transformer overall architecture

  • In the following section, we will go over the overall architecture as presented below from the original paper by Vaswani et al.

Inputs to the Encoder

  1. Input Embedding: Tokens from the input sequence are converted into high-dimensional vectors using an embedding layer.

  2. Positional Encoding: Positional encodings are added to the input embeddings. These encodings provide information about the relative or absolute position of the tokens within the sequence. This is necessary because the Transformer does not process the tokens sequentially and therefore does not inherently know their order.
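
  • As one concrete possibility, the sketch below uses the fixed sinusoidal encodings described by Vaswani et al. (sine on even dimensions, cosine on odd ones). It is an illustration under that assumption; other models learn their positional embeddings instead. The function names and the tiny dimensions are chosen only for this example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so i / d_model equals 2k / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings):
    """Element-wise sum of token embeddings and positional encodings."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

# Three hypothetical token embeddings of width 4 (all zeros for clarity).
embeddings = [[0.0] * 4 for _ in range(3)]
encoded = add_positions(embeddings)
```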


  1. Multi-Head Self-Attention: This mechanism allows the encoder to consider other words in the input sequence when encoding a specific word. The multi-head aspect means that this process is done in parallel multiple times with different weight matrices, allowing the model to focus on different parts of the sentence simultaneously.

  2. Add & Norm: The output of the attention sub-layer is added to the sub-layer’s input (residual connection) and then normalized (layer normalization). The residual path helps with training deep networks by mitigating the vanishing gradient problem, and layer normalization brings further benefits:
    • Training Stability: By normalizing the inputs to have zero mean and unit variance, we ensure that the scale of the inputs doesn’t drastically affect the learning. This is particularly important in deep networks where the scale can compound across layers.
    • Faster Convergence: Layer normalization can help the model to train faster by smoothing the optimization landscape.
    • Independence from Batch Size: Unlike batch normalization, layer normalization does not depend on the batch size, which is beneficial since Transformers often deal with variable sequence lengths.
  3. Position-wise Feed-Forward Networks: Each position’s output from the self-attention layer is fed into a feed-forward neural network. This network is applied to each position separately and identically: the same parameters are shared across all positions within a layer, but each layer has its own parameters.

  4. Add & Norm: Again, after the feed-forward network, another residual connection and layer normalization are applied.
  • The above steps (multi-head self-attention through the second Add & Norm) are repeated \(N\) times, which means there are \(N\) identical layers stacked on top of each other in the encoder.
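
  • The Add & Norm and feed-forward steps can be sketched as follows. This is a simplified illustration: the learned gain and bias of layer normalization are omitted, ReLU is assumed as the activation, and the tiny weight matrices are made-up values rather than trained parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per position."""
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: the same weights are applied to every position."""
    out = []
    for row in x:
        # Each entry of w1 holds the incoming weights of one hidden unit.
        hidden = [max(0.0, sum(r * w for r, w in zip(row, unit)) + b)
                  for unit, b in zip(w1, b1)]
        out.append([sum(h * w for h, w in zip(hidden, unit)) + b
                    for unit, b in zip(w2, b2)])
    return out

# Three positions with d_model = 4; positions 0 and 1 are identical on purpose.
x = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.1, 0.0, 0.1, 0.2]]       # 2 hidden units
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]   # back to 4 outputs
b2 = [0.0, 0.0, 0.0, 0.0]

out = add_and_norm(x, feed_forward(x, w1, b1, w2, b2))
```

  • Because the feed-forward weights are shared across positions, the two identical input positions produce identical outputs.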


  1. Output Embedding: Here, “output embedding” refers to the embedding of the tokens the decoder has generated so far, since the decoder is autoregressive. Generation starts from a start-of-sequence token.

  2. Positional Encoding: Just like in the encoder, positional encodings are added to the output embeddings to provide positional information.

  3. Masked Multi-Head Attention: In the decoder, the self-attention mechanism is masked to prevent each position from attending to subsequent positions. This ensures that predictions for a position can only depend on known outputs at positions before it.

  4. Add & Norm: Similar to the encoder, the result of the masked multi-head attention is combined with a residual connection and layer normalization.

  5. Multi-Head Cross Attention: This step is where the decoder starts to consider the encoder output. The keys (K) and values (V) are computed from the output of the final encoder layer, while the queries (Q) come from the output of the decoder’s preceding masked multi-head attention sub-layer.

  6. Add & Norm: The result is then combined with a residual connection and normalized.

  7. Feed-Forward: The output of the cross-attention sub-layer (after Add & Norm) is then passed through a position-wise feed-forward network.

  8. Add & Norm: Followed by another residual connection and layer normalization.

These steps (masked multi-head attention through the final Add & Norm) are also repeated \(N\) times, with \(N\) layers in the decoder.
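
  • The difference between masked self-attention and cross-attention can be sketched with a single scaled dot-product attention function. This is a simplified, single-head illustration: the learned Q, K and V projections are omitted, and the small vectors stand in for hypothetical decoder and encoder states.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        if causal:
            # Position i may not attend to any later position j > i.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

decoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# Masked self-attention: Q, K and V all come from the decoder.
_, self_weights = attention(decoder_states, decoder_states,
                            decoder_states, causal=True)

# Cross-attention: Q from the decoder, K and V from the encoder, no mask.
cross_out, cross_weights = attention(decoder_states, encoder_states,
                                     encoder_states)
```

  • In `self_weights`, every entry above the diagonal is zero, so the first position attends only to itself; in `cross_weights`, each decoder position spreads its attention over all four encoder positions.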

  • In the decoder, the output embedding that is “shifted right” refers to a technique used during training known as “teacher forcing.” Here’s what it entails:

    • Teacher Forcing: During training, the decoder is given the correct output (e.g., the translation of a sentence) as input but shifted one position to the right. This means that the token at each position in the input is the token that should have been predicted at the previous step.
    • Why Shift Right: This shift ensures that the prediction for a particular position (say position \(i\)) is only dependent on the known outputs at positions less than \(i\). Essentially, it prevents the model from “cheating” by seeing the correct output for position \(i\) when predicting position \(i\).
    • Training vs. Inference: While during training the model uses this shifted output, during inference (when the model is generating text), the model does not have access to the ground truth and must instead generate the sequence one token at a time based on its own predictions.
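
  • The right shift can be made concrete with a tiny helper. The function name `make_decoder_pairs` and the `<End>` token are assumptions for illustration; the idea is only that the decoder input at each position is the token that should have been predicted one step earlier.

```python
def make_decoder_pairs(target_tokens, start="<Start>", end="<End>"):
    """Build the shifted-right decoder input and the labels to predict."""
    decoder_input = [start] + target_tokens   # shifted one position right
    labels = target_tokens + [end]            # what each position should emit
    return decoder_input, labels

decoder_input, labels = make_decoder_pairs(["le", "chat", "dort"])
for seen, expected in zip(decoder_input, labels):
    print(f"input: {seen:8s} -> predict: {expected}")
```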


  1. Linear: The output of the decoder is passed through a final linear layer.

  2. Softmax: A softmax layer is applied to the linear layer’s output to produce a probability distribution over the possible output tokens.

  • This sequence of operations defines the flow from input tokens to output probabilities, encoding the input information and decoding it into an output sequence with the help of learned weights and the self-attention mechanism. Each block within the encoder and decoder layers works to refine the representation and relationship between tokens to produce accurate and contextually relevant outputs.


  • The choice of sampling method (argmax, top-k, etc.) depends on the specific requirements of the task. For tasks requiring high creativity and variability (like story generation), stochastic methods are preferred. For tasks demanding high precision and less variability (like translation), deterministic methods like argmax might be more suitable. The softmax function is fundamental in all these cases, as it provides the necessary probability distribution from which these decisions are made.

Use Cases

  • Let’s look at a few encoder-only, decoder-only, and encoder-decoder use cases to see how each would handle a specific task. This will give more insight into how they work internally.
  • Transformers are a type of sequence transduction model. Sequence transduction refers to any model that transforms a sequence from one domain into a sequence in another domain. In the context of Transformers, this typically means translating a sequence of input tokens (such as words in a sentence) into a sequence of output tokens.

Encoder-Only Model - BERT

  • The primary objective of an encoder-only model like BERT is to create contextually rich embeddings. It is pre-trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
  • To utilize an encoder-only model like BERT for sentiment classification, follow these steps:
    • Input Processing: BERT processes all input tokens simultaneously. Each token in the input text is converted into an embedding vector.
    • Positional Encoding: Positional information (learned position embeddings in BERT’s case) is added to these embeddings to preserve the order of the tokens.
    • Multi-Headed Attention: The model then applies multi-headed self-attention to these embeddings, allowing each token to interact with all others and gather contextual information.
    • Residual Connections and Layer Normalization: After attention, residual connections (or skip connections) are used to add the input of each layer to its output. This is followed by layer normalization to stabilize the activations within the model.
    • Feed-Forward Networks: Each attention layer is followed by feed-forward neural networks for additional processing.
    • Additional Add and Norm: Another round of residual connections and layer normalization follows the feed-forward networks.
    • MLM and NSP Tasks: During pre-training, BERT uses MLM and NSP tasks. These tasks are crucial for learning context-rich embeddings but are not directly used in the sentiment classification task.
    • Output Linear Layer: For sentiment classification, a linear layer is added on top of BERT’s output (usually the output corresponding to the [CLS] token). This layer maps the complex embeddings to a simpler space suitable for classification.
    • Softmax Layer: Finally, a softmax layer is applied to the output of the linear layer. This transforms the linear layer’s outputs into probabilities, enabling multi-class sentiment classification.

In this workflow, BERT’s powerful self-attention mechanism and deep architecture enable it to understand complex language nuances, making it highly effective for sentiment analysis and other NLP tasks.
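
  • The final two steps (linear layer, then softmax) can be sketched on their own. The `[CLS]` vector, weights, biases and the three label names below are made-up stand-ins for illustration; in practice the vector comes from the fine-tuned encoder and the weights are learned.

```python
import math

def classify_from_cls(cls_vector, weights, biases):
    """Linear layer over the [CLS] vector, followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-dimensional [CLS] output and a 3-class head.
cls_vector = [0.2, -0.4, 0.9, 0.1]
weights = [[0.5, -0.2, 0.1, 0.0],
           [-0.3, 0.4, 0.8, 0.2],
           [0.0, 0.1, -0.5, 0.3]]
biases = [0.0, 0.1, -0.1]
labels = ["negative", "positive", "neutral"]

probs = classify_from_cls(cls_vector, weights, biases)
predicted = labels[probs.index(max(probs))]
```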

Encoder During Training

  1. Data Preparation:
    • BERT is pre-trained on a large corpus of unlabeled text data.
    • The data is processed to create instances for two primary tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
  2. Masked Language Modeling (MLM):
    • In MLM, a certain percentage of the input tokens are randomly masked. The goal for BERT is to predict these masked tokens based on their context.
    • This task trains BERT to understand the context and relationships between words.
  3. Next Sentence Prediction (NSP):
    • In NSP, the model is given pairs of sentences and must predict if the second sentence is a logical follow-up of the first.
    • This task helps BERT learn relationships between sentences.
  4. Bidirectional Contextual Training:
    • Unlike traditional left-to-right language models, BERT is trained to understand the context from both sides (left and right) of each token in the input.
  5. Loss Calculation and Optimization:
    • During training, separate losses are calculated for the MLM and NSP tasks, and the model’s parameters are updated to minimize their sum.
    • Backpropagation and optimization algorithms (like Adam) are used to adjust the weights of the network.
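
  • The masking step of MLM can be sketched as below. This is a simplification: the default rate of 15% is the figure reported for BERT, and BERT’s additional rule of sometimes substituting a random or unchanged token instead of `[MASK]` is left out. The function name `mask_tokens` is chosen for this example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Randomly hide tokens; labels keep the originals at masked positions."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)     # position is ignored by the MLM loss
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```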

Encoder During Inference

  1. Fine-Tuning for Specific Tasks:
    • BERT is often fine-tuned for specific downstream tasks such as sentiment analysis, question answering, or named entity recognition.
    • During fine-tuning, the pre-trained BERT model is trained further on a smaller, task-specific dataset with labeled data.
  2. Input Processing:
    • For inference tasks, inputs are tokenized, and positional encodings are added. The model then processes the input to generate embeddings.
  3. Contextualized Embeddings:
    • BERT generates contextualized embeddings for each token in the input. These embeddings encapsulate the learned bidirectional context.
  4. Task-Specific Outputs:
    • Depending on the task, the output layer(s) added during fine-tuning are used to generate the final task-specific predictions.
    • For example, in a classification task, the output corresponding to the [CLS] token is used to determine the class.
  5. No Backpropagation:
    • During inference, there is no backpropagation or adjustment of weights. The model uses its learned parameters to generate predictions based on the input.
  • In summary, BERT’s training involves pre-training on large corpora with MLM and NSP to learn bidirectional context and relationships, followed by fine-tuning for specific tasks with task-relevant data. Inference with BERT involves processing input through the fine-tuned model to leverage its learned representations for generating accurate predictions for the given task.

Decoder-Only Model - GPT

  • GPT is an autoregressive, generative model well-suited for tasks like text generation. An example of this model in action is ChatGPT.
  • For a text summarization use-case, the process is as follows:
    • Input Processing: The text to be summarized is first pre-processed and tokenized. Each token is then converted into an embedding vector.
    • Positional Encoding: Positional encodings are added to these embeddings to incorporate information about the order of the tokens in the sequence.
    • Masked Multi-Headed Self-Attention: GPT employs masked self-attention, where each token can only attend to preceding tokens (to maintain the autoregressive property). This is crucial for generating coherent and contextually relevant text.
    • Add and Norm: After attention, each layer applies residual connections, adding the input of the layer to its output. This is followed by layer normalization to ensure stable activations within the model.
    • Absence of Cross-Attention: In a decoder-only model like GPT, cross-attention (common in encoder-decoder models) is not present. GPT relies solely on self-attention for generating text.
    • Teacher Forcing during Training: During the training phase, teacher forcing is used. This involves feeding the ground-truth preceding tokens as input to the model, regardless of what the model itself predicted at earlier steps, which accelerates learning.
    • Output Linear Layer: A linear layer is used to map the complex, high-dimensional representations to a simpler space suitable for token prediction.
    • Softmax Layer: The softmax function is applied to the output of the linear layer to obtain a probability distribution over possible next tokens.
    • Autoregressive Generation: The model generates one token at a time. After each token is generated, it is fed back into the model as input for generating the subsequent token.
  • In this workflow, GPT’s capacity for masked self-attention and autoregressive generation makes it highly effective for tasks like summarization, where the goal is to produce coherent and concise versions of the input text. The model’s design allows it to generate text token-by-token, building up a summary based on the context it has seen so far.

Decoder During Training

  1. Data Preparation:
    • The training data is typically a large corpus of text.
    • Each text instance is tokenized into a sequence of tokens.
  2. Teacher Forcing:
    • The model is trained using a technique called “teacher forcing.” Here, the model receives the ground-truth preceding tokens as input and the correct next token as the target, regardless of its own previous predictions.
    • This method accelerates the learning process as the model always learns from the correct sequence.
  3. Masked Self-Attention:
    • The model uses masked self-attention, meaning each token can only attend to previous tokens in the sequence. This ensures that the prediction for each token only depends on the preceding tokens.
  4. Sequential Processing:
    • At every position in the sequence, the model predicts the next token based on the context provided by the previous tokens. Thanks to the mask, all positions can be processed in parallel during training.
  5. Loss Calculation:
    • The loss is calculated between the predicted token and the actual next token in the sequence.
    • Commonly, Cross-Entropy Loss is used for this purpose.
  6. Backpropagation and Optimization:
    • Based on the loss, gradients are calculated and backpropagated through the network.
    • An optimizer, like Adam, updates the model’s weights to minimize the loss.
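
  • The loss calculation in step 5 can be sketched as follows. This is a minimal illustration with made-up logits over a three-token vocabulary; the function names are chosen for this example, and the gradient and optimizer steps are not shown.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the correct next token."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def sequence_loss(all_logits, target_ids):
    """Average cross-entropy over every position in the sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, target_ids)]
    return sum(losses) / len(losses)

# Hypothetical logits at two positions, vocabulary of size 3.
all_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
good = sequence_loss(all_logits, [0, 2])   # targets match the confident guesses
bad = sequence_loss(all_logits, [1, 0])    # targets contradict them
```

  • The loss is small when the model puts high probability on the actual next token (`good`) and large when it does not (`bad`).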

Decoder During Inference

  • At inference, the model is given only a start token or a prompt and must predict each subsequent token from it.
  1. Starting the Sequence:
    • Inference begins with a start token or a prompt provided to the model.
  2. Autoregressive Generation:
    • The model generates one token at a time in an autoregressive manner.
    • Each generated token is added to the sequence and used as part of the input for generating the next token.
  3. No Teacher Forcing:
    • Unlike training, there is no teacher forcing during inference. The model relies solely on its own predictions to generate the next token.
  4. Temperature and Top-K Sampling:
    • Techniques like temperature scaling or top-k sampling are often used to control the randomness and diversity of the generated text.
  5. End of Sequence:
    • The generation process continues until an end-of-sequence token is generated or a maximum length is reached.
  6. No Backpropagation:
    • Since the model is not being trained, there is no loss calculation or backpropagation. The focus is on generating coherent and contextually appropriate text.
  • In Summary
    • Training Phase: The full ground truth is fed in, but masked attention ensures predictions are made based only on previous tokens. This allows for parallel processing while maintaining the sequential nature of the task.
    • Inference Phase: The model generates the sequence one token at a time, using its own previous outputs as part of the input for each subsequent prediction. This is a truly sequential process, reflecting how the model will be used in practice.
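
  • The inference loop can be sketched with a stand-in for the trained model. Everything named here (`generate`, `toy_next_token`, the lookup table, the `<End>` token) is hypothetical; a real decoder would run masked self-attention over the whole sequence and then apply a sampling strategy to the softmax output.

```python
def generate(next_token_fn, prompt, eos_token, max_length):
    """Autoregressive loop: feed the growing sequence back into the model."""
    sequence = list(prompt)
    while len(sequence) < max_length:
        token = next_token_fn(sequence)   # the model sees only its own outputs
        sequence.append(token)
        if token == eos_token:
            break
    return sequence

# Toy lookup table standing in for a trained decoder (illustration only).
TOY_TABLE = {"<Start>": "the", "the": "cat", "cat": "sat", "sat": "<End>"}

def toy_next_token(sequence):
    return TOY_TABLE.get(sequence[-1], "<End>")

generated = generate(toy_next_token, ["<Start>"], "<End>", max_length=10)
truncated = generate(toy_next_token, ["<Start>"], "<End>", max_length=3)
```

  • Here `generated` stops at the end-of-sequence token, while `truncated` stops because the maximum length is reached first.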

Training Phase

  • Encoder:
    • The source sentence is fed into the encoder.
    • Multi-headed self-attention is applied, allowing each token to interact with all other tokens in the source sentence.
    • An Add & Norm step follows, which includes residual connections and layer normalization.
    • The data then passes through a feedforward neural network.
    • Another Add & Norm step is applied after the feedforward network.
  • Decoder:
    • Training begins with a <Start> token and the target sentence.
    • Masked multi-headed self-attention is applied. In this step, the model can only attend to earlier positions in the target sentence to preserve the autoregressive property.
    • This is followed by an Add & Norm step.
    • Cross-attention is then performed where the model uses the output of the self-attention as queries (Q) and the encoder’s output as keys (K) and values (V). This allows the decoder to focus on relevant parts of the source sentence.
    • Another Add & Norm step follows.
    • The data then passes through a feedforward network.
    • Teacher forcing is used: the ground-truth target tokens, shifted right, are always provided as the decoder input during training regardless of the model’s own predictions, and the loss between the predicted and actual next tokens is used to adjust the model.
    • The objective is to generate the text in the target language.

Inference Phase

  • Encoder:
    • Similar to training, the source sentence is input to the encoder.
    • The encoder processes this sentence through multi-headed self-attention and feedforward layers, with Add & Norm steps, producing a set of encoded vectors.
  • Decoder:
    • Inference begins with a <Start> token.
    • The decoder generates the translation one token at a time.
    • For each token:
      • Masked multi-headed self-attention is applied to the tokens generated thus far. Because no future tokens exist yet, the newest position simply attends to every earlier token.
      • Cross-attention is performed where the decoder uses its output as queries to attend to the encoder’s output (keys and values). This helps the decoder focus on relevant parts of the source sentence for each token it generates.
      • The decoder then predicts the next token based on this context.
      • The newly predicted token is added to the sequence and used for generating subsequent tokens.
    • This process continues until an end-of-sequence token is produced or a maximum length is reached.


If you found our work useful, please cite it as:

  title   = {Transformers},
  author  = {Jain, Vinija and Chadha, Aman},
  journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
  year    = {2021},
  note    = {\url{https://aman.ai}}