Background: Pre-Training

  • One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training).
  • The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.

Enter BERT

  • In 2018, Google open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. As the name suggests, it generates representations using an encoder from Vaswani et al.’s Transformer architecture. However, there are notable differences between BERT and the original Transformer, especially in how they train those models.
  • With BERT, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models.
  • In the paper, Devlin et al. (2018) demonstrate state-of-the-art results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1).

Contextualized Embeddings

  • ELMo introduced the concept of contextualized embeddings by combining the hidden states of its LSTM-based model (and the initial non-contextualized embedding) in a particular way: concatenation followed by weighted summation.

BERT: An Overview

  • At the input, BERT (and many other transformer models) consumes a maximum of 512 tokens, truncating anything beyond this length. Since it generates one output per input token, it can output up to 512 tokens.
  • BERT actually uses WordPieces as tokens rather than the input words – so some words are broken down into smaller chunks.
  • BERT is trained using two objectives: (i) Masked Language Modeling (MLM), and (ii) Next Sentence Prediction (NSP).
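  • The WordPiece splitting mentioned above can be sketched with a greedy longest-match-first procedure. This is a minimal illustrative sketch: the tiny vocabulary below is made up, whereas BERT's real vocabulary has about 30,000 entries.

```python
# Toy sketch of WordPiece-style tokenization: greedily match the longest
# vocabulary entry, marking continuation pieces with a "##" prefix.
def wordpiece_tokenize(word, vocab):
    """Split a word into subword pieces using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no match: map the whole word to the unknown token
        pieces.append(cur)
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```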

Masked Language Modeling (MLM)

  • While the OpenAI transformer (a decoder) gave us a fine-tunable pre-trained model based on the Transformer, something went missing in the transition from LSTMs (ELMo) to Transformers (OpenAI transformer): ELMo's language model was bi-directional, whereas the OpenAI transformer trains only a forward language model. Could we build a transformer-based model whose language model looks both forward and backward (in the technical jargon, "is conditioned on both left and right context")?
    • However, here’s the issue with bidirectional conditioning when pre-training a language model. A language model is usually trained on a related task that develops a contextual understanding of words, most often predicting the next word or words in close vicinity. Such training methods can’t be extended to bidirectional models, because they would allow each word to indirectly “see itself”: approaching the same sentence from the opposite direction, the model already knows what to expect — a case of data leakage. In other words, bidirectional conditioning would allow each word to indirectly see itself in a multi-layered context. The training objective of Masked Language Modelling, which seeks to predict the masked tokens, solves this problem.
    • While the masked language modelling objective allows us to obtain a bidirectional pre-trained model, note that a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, BERT does not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the \(i^{th}\) token is chosen, BERT replaces the \(i^{th}\) token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time, and (3) the unchanged \(i^{th}\) token 10% of the time.
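  • The 15% / 80-10-10 masking procedure above can be sketched as follows. This is a minimal illustrative sketch, not BERT's actual data generator; the token list and vocabulary passed in are placeholders.

```python
import random

# Sketch of BERT's masking procedure: ~15% of positions are chosen for
# prediction; of those, 80% become [MASK], 10% become a random token,
# and 10% are left unchanged.
def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position not selected for prediction
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # random replacement
            # else: keep the original token unchanged
    return masked, labels

tokens = ["the", "man", "went", "to", "the", "store"] * 10
masked, labels = mask_tokens(tokens, vocab=["cat", "dog", "milk"])
```

Only the positions with a non-`None` label contribute to the MLM loss; all other positions are ignored during training.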

Next Sentence Prediction (NSP)

  • To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (\(A\) and \(B\)), is \(B\) likely to be the sentence that follows \(A\), or not?
  • More on the two pre-training objectives in the sections on Masked Language Model (MLM) and Next Sentence Prediction (NSP) below.
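  • NSP training pairs can be generated from any corpus along the lines of the following sketch (the sentences are placeholders; a real implementation would also make sure the "random" sentence is not accidentally the true next one).

```python
import random

# Sketch of NSP training-pair construction: for each sentence A, B is the
# true next sentence 50% of the time and a random corpus sentence otherwise.
def make_nsp_pairs(sentences, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            b, label = sentences[i + 1], "IsNext"
        else:
            b, label = rng.choice(sentences), "NotNext"
        pairs.append((a, b, label))
    return pairs

corpus = ["the man went to the store .",
          "he bought a gallon of milk .",
          "penguins are flightless .",
          "they live in the southern hemisphere ."]
pairs = make_nsp_pairs(corpus)
```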

BERT Flavors

  • BERT comes in two flavors:
    • BERT Base: 12 layers (transformer blocks), 12 attention heads, a hidden size of 768 (the dimensionality of each token representation; each attention head operates on 768/12 = 64-dimensional \(q\), \(k\), and \(v\) vectors), and 110 million parameters.
    • BERT Large: 24 layers (transformer blocks), 16 attention heads, a hidden size of 1024 (with 1024/16 = 64-dimensional \(q\), \(k\), and \(v\) vectors per head), and 340 million parameters.
  • By default, BERT (which typically refers to BERT Base) produces word embeddings with 768 dimensions.

Sentence Embeddings with BERT

  • To calculate sentence embeddings using BERT, there are multiple strategies, but a simple approach is to average the second-to-last hidden layer across tokens, producing a single 768-dimensional vector. You can also take a weighted sum of the word vectors in the sentence.
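  • The averaging strategy above can be sketched as follows. The tiny 2-dimensional hidden states are made-up placeholders purely for illustration; real BERT Base states are 768-dimensional, and one layer would hold one vector per input token.

```python
# Sketch of sentence-embedding pooling: average the second-to-last hidden
# layer over all tokens to get a single sentence vector.
def sentence_embedding(hidden_states):
    """hidden_states: list of layers, each a list of per-token vectors."""
    layer = hidden_states[-2]          # second-to-last layer
    dim = len(layer[0])
    n = len(layer)
    return [sum(tok[d] for tok in layer) / n for d in range(dim)]

layers = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 1 (2 tokens, 2 dims for illustration)
    [[2.0, 4.0], [4.0, 2.0]],  # layer 2 (second to last)
    [[9.0, 9.0], [9.0, 9.0]],  # final layer
]
print(sentence_embedding(layers))  # [3.0, 3.0]
```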

BERT’s Encoder Architecture vs. Other Decoder Architectures

  • BERT is based on the Transformer encoder. Unlike BERT, decoder models (GPT, TransformerXL, XLNet, etc.) are auto-regressive in nature. As an encoder-based architecture, BERT traded off auto-regression and gained the ability to incorporate context on both sides of a word, thereby offering better results.
  • Note that XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.
  • More on this in the article on Encoder vs. Decoder Models.

What makes BERT different?

  • BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit.
  • However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).
  • Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary.
  • For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
  • A visualization of BERT’s neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown below. BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional. The arrows indicate the information flow from one layer to the next, and the green boxes at the top indicate the final contextualized representation of each input word.

Why Unsupervised Pre-Training?

  • Vaswani et al. employed supervised learning to train the original Transformer models for language translation tasks, which requires pairs of source and target language sentences. For example, a German-to-English translation model needs a training dataset with many German sentences and corresponding English translations. Collecting such data involves much work, but we require it to ensure machine translation quality. There is not much else we can do about it, or can we?
  • We actually can use unsupervised learning to tap into many unlabelled corpora. However, before discussing unsupervised learning, let’s look at another problem with supervised representation learning. The original Transformer architecture has an encoder for a source language and a decoder for a target language. The encoder learns task-specific representations, which are helpful for the decoder to perform translation, i.e., from German sentences to English. It sounds reasonable that the model learns representations helpful for the ultimate objective. But there is a catch.
  • If we wanted the model to perform other tasks like question answering and language inference, we would need to modify its architecture and re-train it from scratch. It is time-consuming, especially with a large corpus.
  • Does the human brain learn different representations for each specific task? It does not seem so. When children learn a language, they do not aim at a single task; they somehow come to understand how words are used in many situations and learn to adjust and apply them across multiple activities.
  • To summarize what we have discussed so far: the question is whether we can train a model on many unlabeled texts to generate representations, then adjust the model for different tasks without training from scratch.
  • The answer is a resounding yes, and that’s exactly what Devlin et al. did with BERT. They pre-trained a model with unsupervised learning to obtain non-task-specific representations helpful for various language model tasks. Then, they added one additional output layer to fine-tune the pre-trained model for each task, achieving state-of-the-art results on eleven natural language processing tasks like GLUE, MultiNLI, and SQuAD v1.1 and v2.0 (question answering).
  • So, the first step in the BERT framework is to pre-train a model on a large amount of unlabeled data, giving many contexts for the model to learn representations in unsupervised training. The resulting pre-trained BERT model is a non-task-specific feature extractor that we can fine-tune quickly to a specific objective.
  • The next question is how they pre-trained BERT using text datasets without labeling.

The Strength of Bidirectionality

  • If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
  • To solve this problem, BERT uses the straightforward technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words. For example, given “the man went to the [MASK] to buy milk”, the model must predict the masked word from both its left and right context.

  • While this idea has been around for a very long time, BERT was the first to adopt it to pre-train a deep neural network.
  • BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: given two sentences \(A\) and \(B\), is \(B\) the actual next sentence that comes after \(A\) in the corpus, or just a random sentence? For example, if \(A\) is “the man went to the store”, then “he bought a gallon of milk” is a plausible next sentence, while a random sentence like “penguins are flightless” is not.

  • Thus, BERT has been trained on two main tasks:
    • Masked Language Model (MLM)
    • Next Sentence Prediction (NSP)

Masked Language Model (MLM)

  • Language models (LM) estimate the probability of the next word following a sequence of words in a sentence. One application of LM is to generate texts given a prompt by predicting the most probable word sequence. It is a left-to-right language model.
  • Note: “left-to-right” may suggest languages such as English and German, but other languages are written right-to-left or top-to-bottom. Here, “left-to-right” simply means the natural (forward) order of a language sequence, and “right-to-left” the reverse (backward) order; neither term refers to specific languages.
  • Unlike left-to-right language models, the masked language models (MLM) estimate the probability of masked words. We randomly hide (mask) some tokens from the input and ask the model to predict the original tokens in the masked positions. It lets us use unlabeled text data since we hide tokens that become labels (as such, we may call it “self-supervised” rather than “unsupervised,” but I’ll keep the same terminology as the paper for consistency’s sake).
  • The model predicts each hidden token solely based on its context, where the self-attention mechanism from the Transformer architecture comes into play. The context of a hidden token originates from both directions since the self-attention mechanism considers all tokens, not just the ones preceding the hidden token. Devlin et al. call such representation bi-directional, comparing with uni-directional representation by left-to-right language models.
  • Note: IMHO, the term “bi-directional” is a bit of a misnomer because the self-attention mechanism is not directional at all. However, we should treat the term as the antithesis of “uni-directional”.
  • In the paper, they have a footnote (4) that says:

We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only unidirectional version is referred to as a “Transformer decoder” since it can be used for text generation.

  • So, Devlin et al. trained an encoder (including the self-attention layers) to generate bi-directional representations, which can be richer than uni-directional representations from left-to-right language models for some tasks. Also, it is better than simply concatenating independently trained left-to-right and right-to-left language model representations because bi-directional representations simultaneously incorporate contexts from all tokens.
  • However, many tasks involve understanding the relationship between two sentences, such as Question Answering (QA) and Natural Language Inference (NLI). Language modeling (LM or MLM) does not capture this information.
  • We need another unsupervised representation learning task for multi-sentence relationships.

Next Sentence Prediction (NSP)

  • Next sentence prediction is a task to predict a binary value (i.e., Yes/No, True/False) to learn the relationship between two sentences: given two sentences \(A\) and \(B\), the model predicts whether \(B\) is the actual next sentence that follows \(A\). During data generation, the true next sentence is used for \(B\) 50% of the time and a random sentence otherwise. It is easy to generate such a dataset from any monolingual corpus; hence, it is unsupervised learning.
  • But how can we pre-train a model for both MLM and NSP tasks?
  • To understand how a model can accommodate two pre-training objectives, let’s look at how they tokenize inputs.
  • They used WordPiece for tokenization, which has a vocabulary of 30,000 tokens, based on the most frequent sub-words (combinations of characters and symbols). They also used the following special tokens:
    • [CLS] — stands for classification and is the first token of every sequence. The final hidden state is the aggregate sequence representation used for classification tasks.
    • [SEP] — a separator token for separating sentences, questions and related passages.
    • [MASK] — a token used to hide some of the input tokens.
  • For the MLM task, they randomly choose tokens to replace with the [MASK] token and calculate the cross-entropy loss between the model’s predictions and the original tokens to train the model.
  • For the NSP task, given sentences \(A\) and \(B\), the model learns to predict whether \(B\) follows \(A\). They separated sentences \(A\) and \(B\) using the [SEP] token and used the final hidden state of the [CLS] token for binary classification. The output of the [CLS] token tells us how likely the current sentence is to follow the prior sentence; you can think of it as a probability. The motivation is that the [CLS] embedding should contain a “summary” of both sentences, enough to decide whether they follow each other. They can’t take any other token from the input sequence for this purpose, because each token’s output is that token’s own representation; so they add a token whose sole purpose is to serve as a sentence-level representation for classification. Their final model achieved 97%-98% accuracy on this task.
  • Now you may ask: instead of using [CLS]’s output, can’t we just output a number as the probability? Yes, we could, if predicting the next sentence were a separate task. However, BERT is trained on the MLM and NSP tasks simultaneously. Organizing inputs and outputs in this format (with both [MASK] and [CLS]) helps BERT learn both tasks at the same time and boosts its performance.
  • When it comes to classification tasks (e.g., sentiment classification), the output of [CLS] is helpful because it contains BERT’s understanding at the sentence level.
  • Note: there is one more detail when separating two sentences. They add a learned “segment” embedding to every token, indicating whether it belongs to sentence \(A\) or \(B\). It is similar to positional encoding, but at the sentence level. The paper calls these segment embeddings; Figure 2 of the paper shows the various embeddings corresponding to the input.
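  • The packing of a sentence pair with the special tokens and segment ids described above can be sketched as follows (the input tokens are placeholders; real inputs would be WordPiece tokens).

```python
# Sketch of how a sentence pair is packed for BERT: [CLS] A [SEP] B [SEP],
# with segment ids distinguishing sentence A (0) from sentence B (1).
def pack_pair(tokens_a, tokens_b):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = pack_pair(["the", "man"], ["he", "left"])
print(tokens)  # ['[CLS]', 'the', 'man', '[SEP]', 'he', 'left', '[SEP]']
print(segs)    # [0, 0, 0, 0, 1, 1, 1]
```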

  • So, Devlin et al. pre-trained BERT using the two unsupervised tasks and empirically showed that pre-trained bi-directional representations could help execute various language tasks involving single text or text pairs.
  • The final step is to conduct supervised fine-tuning to perform specific tasks.

Supervised Fine-Tuning

  • Fine-tuning adjusts all pre-trained model parameters for a specific task, which is a lot faster than from-scratch training. Furthermore, it is more flexible than feature-based training that fixes pre-trained parameters. As a result, we can quickly train a model for each specific task without heavily engineering a task-specific architecture.
  • The pre-trained BERT model can generate representations for single text or text pairs, thanks to the special tokens and the two unsupervised language modeling pre-training tasks. As such, we can plug task-specific inputs and outputs into BERT for each downstream task.
  • For classification tasks, we feed the final [CLS] representation to an output layer. For multi-sentence tasks, the encoder can process a text pair concatenated with [SEP] and apply bi-directional cross-attention between the two sentences; for example, we can use it for the question-passage pair in a question-answering task.
  • By now, it should be clear why and how they repurposed the Transformer architecture, especially the self-attention mechanism through unsupervised pre-training objectives and downstream task-specific fine-tuning.
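  • A task-specific classification head of the kind described above can be sketched as a single linear layer plus softmax over the final [CLS] hidden state. This is a minimal sketch with made-up 3-dimensional vectors and weights; real BERT Base uses 768 dimensions and learns the weights during fine-tuning.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(cls_hidden, weights, bias):
    """Linear layer + softmax on the [CLS] vector; one weight row per class."""
    logits = [sum(w * h for w, h in zip(row, cls_hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Hypothetical [CLS] vector and weights for a 2-class task.
probs = classify([0.5, -1.0, 2.0],
                 weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                 bias=[0.0, 0.0])
# probs sums to 1; here the second class wins because the third feature dominates
```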

Training with Cloud TPUs

  • Everything that we’ve described so far might seem fairly straightforward, so what’s the missing piece that made it work so well? Cloud TPUs. Cloud TPUs gave us the freedom to quickly experiment, debug, and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques.
  • The Transformer model architecture, developed by researchers at Google in 2017, gave BERT the foundation to make it successful. The Transformer is implemented in Google’s open source release, as well as the tensor2tensor library.

Results with BERT

  • To evaluate performance, we compared BERT to other state-of-the-art NLP systems. Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture.
  • On SQuAD v1.1, BERT achieves 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%:

  • BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them:

  • Below are the GLUE test results from table 1 of the paper. They reported results on the two model sizes:
    • The base BERT uses 110M parameters in total:
      • 12 encoder blocks
      • 768-dimensional embedding vectors
      • 12 attention heads
    • The large BERT uses 340M parameters in total:
      • 24 encoder blocks
      • 1024-dimensional embedding vectors
      • 16 attention heads

Making BERT Work for You

  • The models that Google has released can be fine-tuned on a wide variety of NLP tasks in a few hours or less. The open source release also includes code to run pre-training, although we believe the majority of NLP researchers who use BERT will never need to pre-train their own models from scratch. The BERT models that Google has released so far are English-only, but they are working on releasing models which have been pre-trained on a variety of languages in the near future.
  • The open source TensorFlow implementation and pointers to pre-trained BERT models can be found here. Alternatively, you can get started using BERT through Colab with the notebook “BERT FineTuning with Cloud TPUs”.

A Visual Notebook to Using BERT for the First Time

  • This Jupyter notebook by Jay Alammar offers a great intro to using a pre-trained BERT model to carry out sentiment classification using the Stanford Sentiment Treebank (SST2) dataset.

Fascinating input by Sebastian Raschka

  • Source for image and excerpt below
  • The below summary is based on Cramming: Training a Language Model on a Single GPU in One Day by Jonas Geiping and Tom Goldstein.
  • “It was a very interesting read with lots of practical insights!
  • Let’s start with the most interesting takeaway: Sure, smaller models have higher throughput, but smaller models learn less efficiently. Consequently, larger models don’t take longer to train!
  • In this paper, the researchers trained a masked language model / encoder-style LLM (here: BERT) for 24h on 1 GPU — for comparison, the original 2018 BERT paper trained it on 16 TPUs for 4 days. (Caveat: This doesn’t factor in time and resources for hyperparameter tuning.)
  • And why do you need to pretrain LLMs, anyway, given that pretrained models are available off the shelf? Consider research/study purposes, or you may want to adapt them to new languages or domains (e.g., think of protein or DNA sequences).
  • The quite impressive outcome of this project was that they could train BERT to a 78.6 average (GLUE) performance, compared to 80.9 for the original (higher is better).
  • What were some of the performance squeezing tricks?
  • They used automated operator fusion & 32/16-bit mixed-precision training.
  • They disabled biases in the QKV attention matrices and fully-connected layers.
  • They decreased the input length from 512 -> 128 tokens.
  • There were no benefits in replacing the original multi-head self-attention mechanism with FLASH attention or Fourier attention.
  • There was no advantage gained from changing GELU activations to something else.
  • Keeping the original 12 attention heads is important to maintain finetuning performance.

  • From a predictive performance standpoint:

  • A triangular one-cycle learning rate schedule works best.
  • Dropout was not needed during pretraining due to the dataset and 1-epoch training schedule.
  • Increasing the vocabulary size past 32k does not improve GLUE performance (but MNLI improves).” Source
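  • The triangular one-cycle learning-rate schedule mentioned in the takeaways can be sketched as linear warmup to a peak followed by linear decay. The peak rate and step counts below are illustrative values, not the paper's exact settings.

```python
# Sketch of a triangular one-cycle learning-rate schedule: linear warmup
# to peak_lr at warmup_frac of training, then linear decay back to zero.
def one_cycle_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.5):
    peak_step = int(total_steps * warmup_frac)
    if step <= peak_step:
        return peak_lr * step / max(peak_step, 1)
    return peak_lr * (total_steps - step) / max(total_steps - peak_step, 1)

schedule = [one_cycle_lr(s, 10) for s in range(11)]
# rises linearly to peak_lr at step 5, then falls linearly to 0 at step 10
```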

Other models in the family

  • Big Bird • Family: BERT • Pretraining Architecture: Encoder • Pretraining Task: MLM • Extension: Big Bird can extend other architectures such as BERT, Pegasus, or RoBERTa by using a sparse attention mechanism that eliminates the quadratic dependency, thus making it more suitable for longer sequences • Application: Particularly well suited for longer sequences, not only in text but also e.g. in genomics • Date (of first known publication): 07/2020 • Num. Params: Depends on the overall architecture • Corpus: Books, CC-News, Stories and Wikipedia • Lab: Google

Further Reading



If you found our work useful, please cite it as:

  title   = {BERT},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{}}