Models • Bidirectional Encoder Representations from Transformers (BERT)
- Pre-Training in NLP
- BERT: Revolutionizing NLP Pre-Training
- What makes BERT different?
- Why Unsupervised Pre-Training?
- The Strength of Bidirectionality
- Masked Language Model (MLM) Overview
- Next Sentence Prediction (NSP) Simplified
- Supervised Fine-Tuning
- Training with Cloud TPUs
- Results with BERT
- Making BERT Work for You
- Flavors of BERT
- Further Reading
- References
- Citation
Pre-Training in NLP
- A major hurdle in natural language processing (NLP) is limited training data. Task-specific NLP datasets often have only thousands to hundreds of thousands of examples, while deep learning NLP models excel with millions or billions of examples. To bridge this gap, researchers use vast amounts of unannotated web text for pre-training general-purpose language models.
- These pre-trained models, when fine-tuned, significantly enhance performance in small-data tasks like question answering and sentiment analysis, surpassing models trained from scratch.
BERT: Revolutionizing NLP Pre-Training
- In 2018, Google unveiled BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking NLP pre-training technique, leveraging the encoder from Vaswani et al.’s Transformer architecture. Distinct from the original Transformer, BERT introduces unique training methods. It allows for the training of state-of-the-art question answering systems and various other models swiftly—approximately 30 minutes on a single Cloud TPU or a few hours on a GPU. Google’s open-source release includes TensorFlow-based source code and several pre-trained models.
- BERT marked a significant advancement by achieving top results in 11 NLP tasks, as shown in the paper by Devlin et al. (2018). This included impressive performance on the competitive Stanford Question Answering Dataset (SQuAD v1.1).
- Preceding BERT, ELMo introduced the concept of contextualized embeddings, which integrated the hidden states of an LSTM-based model with initial non-contextualized embeddings through concatenation and weighted summation.
Technical Aspects of BERT
- BERT’s input mechanism is designed to handle a maximum of 512 tokens, truncating longer inputs. It generates outputs corresponding to each input token, resulting in up to 512 output tokens. Unlike traditional models, BERT employs WordPieces for tokenization, breaking down words into smaller segments. Its training is underpinned by two primary objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), enabling it to capture a deep understanding of language context and structure.
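- To make the tokenization concrete, here is a minimal sketch using the Hugging Face `transformers` package (an assumption on our part; the original release ships its own TensorFlow tokenization code):

```python
# Illustrative sketch: WordPiece tokenization and the 512-token input limit.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits rare words into sub-word pieces prefixed with "##",
# e.g. "embeddings" typically becomes ['em', '##bed', '##ding', '##s'].
print(tokenizer.tokenize("BERT builds contextual embeddings"))

# Inputs longer than the model's limit are truncated to 512 tokens.
encoded = tokenizer("a very long document " * 400, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512
```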
- In the development of BERT and similar models, two key pre-training tasks stand out: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). These tasks are crucial for enhancing the model’s understanding of language context and structure.
- MLM tackles a fundamental challenge in language processing: understanding words in the context of their surrounding text. Unlike traditional models that predict the next word in a sequence, MLM randomly masks some words in a sentence and tasks the model with predicting them. This approach allows the model to learn bidirectional context, meaning it considers both the words that come before and after the masked word. To avoid the model merely memorizing specific word patterns, BERT randomly alters the masked words during training: sometimes it replaces them with a `[MASK]` token, other times with a random word, and occasionally leaves them unchanged. This variability ensures that the model genuinely understands context rather than relying on specific tokens.
- NSP is designed to give the model a deeper understanding of how sentences relate to each other. In this task, the model is given two sentences and must predict whether the second sentence logically follows the first. This ability is crucial for tasks that require a deeper comprehension of text structure, such as question answering and paragraph completion. By training the model on NSP, it learns to recognize not just individual words or phrases, but also the broader narrative and argumentative structures in a text.
- Together, MLM and NSP provide a comprehensive framework for training language models like BERT, enabling them to grasp the nuances of language and text structure in a way that was not possible with earlier models.
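- As a concrete illustration of the masking scheme described above, here is a minimal, self-contained sketch in plain Python; the 15% selection rate and the 80/10/10 split follow the paper, while the function name and toy vocabulary are purely illustrative:

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """BERT-style corruption: of the ~15% of tokens selected for prediction,
    80% are replaced with [MASK], 10% with a random token, 10% left unchanged."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = not a prediction target
    for i, token in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = token      # the model must recover the original token here
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"
            elif roll < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token as-is
    return corrupted, labels

print(mask_tokens("the man went to the store".split(),
                  vocab=["store", "milk", "river", "bank"]))
```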
Sentence Embeddings with BERT
- To calculate sentence embeddings using BERT, there are multiple strategies; a simple approach is to average the second-to-last hidden layer across all tokens, producing a single 768-dimensional vector. You can also take a weighted sum of the vectors of the words in the sentence.
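- A minimal sketch of the averaging strategy, assuming the Hugging Face `transformers` and PyTorch packages (neither is part of the original TensorFlow release):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds the embedding layer plus all 12 encoder layers,
# each of shape (batch, seq_len, 768); index -2 is the second-to-last layer.
second_to_last = outputs.hidden_states[-2]
sentence_embedding = second_to_last.mean(dim=1).squeeze()
print(sentence_embedding.shape)  # torch.Size([768])
```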
BERT’s Encoder Architecture vs. Other Decoder Architectures
- BERT is based on the Transformer encoder. Unlike BERT, decoder models (GPT, Transformer-XL, XLNet, etc.) are auto-regressive in nature. As an encoder-based architecture, BERT traded off auto-regression and gained the ability to incorporate context on both sides of a word, thereby offering better results.
- Note that XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.
- More on this in the article on Encoder vs. Decoder Models.
What makes BERT different?
- BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit.
- However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).
- Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary.
- For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
- A visualization of BERT’s neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown below. BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional. The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word:
Why Unsupervised Pre-Training?
- Vaswani et al. employed supervised learning to train the original Transformer models for language translation tasks, which requires pairs of source and target language sentences. For example, a German-to-English translation model needs a training dataset with many German sentences and corresponding English translations. Collecting such parallel data takes considerable effort, yet it seems unavoidable if we want high-quality machine translation. Or is it?
- We actually can use unsupervised learning to tap into many unlabelled corpora. However, before discussing unsupervised learning, let’s look at another problem with supervised representation learning. The original Transformer architecture has an encoder for a source language and a decoder for a target language. The encoder learns task-specific representations, which are helpful for the decoder to perform translation, i.e., from German sentences to English. It sounds reasonable that the model learns representations helpful for the ultimate objective. But there is a catch.
- If we wanted the model to perform other tasks like question answering and language inference, we would need to modify its architecture and re-train it from scratch. It is time-consuming, especially with a large corpus.
- Do human brains learn different representations for each specific task? It does not seem so. When children learn a language, they do not aim at any single task. They come to understand how words are used across many situations and learn to adjust and apply that knowledge to a wide range of activities.
- To summarize the discussion so far: the question is whether we can train a model on large amounts of unlabeled text to learn general representations, and then adapt that model to different tasks without training it from scratch.
- The answer is a resounding yes, and that’s exactly what Devlin et al. did with BERT. They pre-trained a model with unsupervised learning to obtain non-task-specific representations helpful for various language model tasks. Then, they added one additional output layer to fine-tune the pre-trained model for each task, achieving state-of-the-art results on eleven natural language processing tasks like GLUE, MultiNLI, and SQuAD v1.1 and v2.0 (question answering).
- So, the first step in the BERT framework is to pre-train a model on a large amount of unlabeled data, giving many contexts for the model to learn representations in unsupervised training. The resulting pre-trained BERT model is a non-task-specific feature extractor that we can fine-tune quickly to a specific objective.
- The next question is how they pre-trained BERT using text datasets without labeling.
The Strength of Bidirectionality
- If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
- To solve this problem, BERT uses the straightforward technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words. For example:
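- (An illustrative instance; the exact sentences here are ours, in the spirit of the example from the original Google announcement.)

```
Input:  the man went to the [MASK1] . he bought a [MASK2] of milk .
Labels: [MASK1] = store; [MASK2] = gallon
```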
- While this idea has been around for a very long time, BERT was the first to adopt it to pre-train a deep neural network.
- BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: given two sentences \(A\) and \(B\), is \(B\) the actual next sentence that comes after \(A\) in the corpus, or just a random sentence? For example:
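- (Again, an illustrative pair of examples; the sentences themselves are ours.)

```
Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label:      IsNextSentence

Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label:      NotNextSentence
```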
- Thus, BERT has been trained on two main tasks:
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
Masked Language Model (MLM) Overview
- Masked Language Modeling (MLM) is a pivotal concept in modern NLP models like BERT. Unlike traditional left-to-right language models, MLM estimates the probability of a word in a sentence by masking some words and predicting them based on context. This approach allows for bidirectional learning—understanding words from both preceding and following context. The self-attention mechanism of the Transformer architecture plays a key role here, providing a non-directional approach to understanding word context.
- In MLM, the model predicts hidden tokens based on surrounding words, creating a bi-directional representation of text. This is more effective than concatenating separate left-to-right and right-to-left models, as it integrates context from all directions. Devlin et al.’s approach in BERT involves training an encoder to generate these bi-directional representations.
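- Written out loosely (the notation below is ours, not the paper's): if \(M\) is the set of masked positions in an input sequence \(x\) and \(\hat{x}\) is the corrupted version of \(x\), the MLM objective maximizes

\[ \mathcal{L}_{\text{MLM}}(\theta) = \sum_{i \in M} \log p_\theta\!\left(x_i \mid \hat{x}\right), \]

so each prediction is conditioned on the full left and right context of the corrupted sequence.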
Next Sentence Prediction (NSP) Simplified
- Next Sentence Prediction (NSP) is another crucial task for models like BERT, focusing on understanding sentence relationships. It involves predicting whether one sentence logically follows another. This binary prediction task, easily created from any monolingual corpus, adds an essential layer of understanding multi-sentence relationships, vital for tasks like Question Answering and Natural Language Inference.
- BERT handles MLM and NSP simultaneously through specific tokenization and input formatting. WordPiece tokenization breaks down text into sub-word units. Special tokens like `[CLS]` (classification), `[SEP]` (separator), and `[MASK]` (for hidden tokens) play specific roles. For NSP, BERT uses the `[CLS]` token's output as a probability indicator of whether a sentence follows logically after another. Additionally, BERT uses segment embeddings to distinguish between two separate sentences in its inputs, enhancing its sentence-level understanding (a minimal sketch of this input format appears below).
- In summary, MLM and NSP are foundational for BERT, enabling it to understand both the context of individual words and the relationship between sentences, crucial for sophisticated NLP tasks.
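- To make the input format concrete, here is a minimal sketch using the Hugging Face tokenizer (an assumption on our part; the original release ships equivalent tokenization code in TensorFlow):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("the man went to the store", "he bought a gallon of milk")

# Layout: [CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# Segment (token type) ids: 0 for sentence A and its [SEP], 1 for sentence B.
print(encoded["token_type_ids"])
```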
- So, Devlin et al. pre-trained BERT using the two unsupervised tasks and empirically showed that pre-trained bi-directional representations could help execute various language tasks involving single text or text pairs.
- The final step is to conduct supervised fine-tuning to perform specific tasks.
Supervised Fine-Tuning
- Fine-tuning adjusts all pre-trained model parameters for a specific task, which is a lot faster than from-scratch training. Furthermore, it is more flexible than feature-based training that fixes pre-trained parameters. As a result, we can quickly train a model for each specific task without heavily engineering a task-specific architecture.
- The pre-trained BERT model can generate representations for single text or text pairs, thanks to the special tokens and the two unsupervised language modeling pre-training tasks. As such, we can plug task-specific inputs and outputs into BERT for each downstream task.
- For classification tasks, we feed the final `[CLS]` representation to an output layer. For multi-sentence tasks, the encoder processes a concatenated text pair (separated by `[SEP]`) with bi-directional cross-attention between the two sentences; for example, a question-passage pair in a question-answering task. A minimal fine-tuning sketch appears below.
- By now, it should be clear why and how they repurposed the Transformer architecture, especially the self-attention mechanism, through unsupervised pre-training objectives and downstream task-specific fine-tuning.
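- The sketch below fine-tunes a classification head on top of the `[CLS]` output using Hugging Face `transformers` and PyTorch; this is not the authors' original TensorFlow code (which ships scripts such as run_classifier.py), and the toy sentences and labels are made up:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One toy batch; in practice this would be a DataLoader over the task's training set.
batch = tokenizer(["a delightful movie", "a tedious, overlong mess"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)  # classification loss computed from the [CLS] output
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```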
Training with Cloud TPUs
- Everything that we’ve described so far might seem fairly straightforward, so what’s the missing piece that made it work so well? Cloud TPUs. Cloud TPUs gave us the freedom to quickly experiment, debug, and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques.
- The Transformer model architecture, developed by researchers at Google in 2017, gave BERT the foundation to make it successful. The Transformer is implemented in Google’s open source release, as well as the tensor2tensor library.
Results with BERT
- To evaluate performance, we compared BERT to other state-of-the-art NLP systems. Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture.
- On SQuAD v1.1, BERT achieves 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%:
- BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them:
- Below are the GLUE test results from table 1 of the paper. They reported results on the two model sizes:
- The base BERT uses 110M parameters in total:
- 12 encoder blocks
- 768-dimensional embedding vectors
- 12 attention heads
- The large BERT uses 340M parameters in total:
- 24 encoder blocks
- 1024-dimensional embedding vectors
- 16 attention heads
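- As a quick sanity check of these figures, here is a sketch assuming the Hugging Face `transformers` package (the exact count differs slightly depending on which heads are included):

```python
from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: ~{n_params / 1e6:.0f}M parameters")  # roughly 110M and 340M
```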
Making BERT Work for You
- The models that Google has released can be fine-tuned on a wide variety of NLP tasks in a few hours or less. The open source release also includes code to run pre-training, although we believe the majority of NLP researchers who use BERT will never need to pre-train their own models from scratch. The BERT models that Google has released so far are English-only, but they are working on releasing models which have been pre-trained on a variety of languages in the near future.
- The open source TensorFlow implementation and pointers to pre-trained BERT models can be found here. Alternatively, you can get started using BERT through Colab with the notebook “BERT FineTuning with Cloud TPUs”.
Flavors of BERT
RoBERTa (Robustly Optimized BERT Approach):
- Development: RoBERTa is an optimized version of BERT (Bidirectional Encoder Representations from Transformers). It was developed by Facebook AI.
- Key Improvements Over BERT: RoBERTa modifies BERT’s pre-training procedure, including training the model longer, on more data, and with bigger batches. It also removes the Next Sentence Prediction (NSP) task and dynamically changes the masking pattern applied to the training data.
- Performance: RoBERTa has been shown to outperform BERT on a range of NLP tasks, achieving state-of-the-art results on several benchmark datasets.
DeBERTa (Decoding-enhanced BERT with Disentangled Attention):
- Development: DeBERTa, developed by Microsoft, enhances the BERT and RoBERTa models with a novel disentangled attention mechanism.
- Key Innovations:
- Disentangled Attention: Unlike traditional attention mechanisms in models like BERT and RoBERTa, DeBERTa decouples the attention scores of content and position and applies them separately, which allows the model to learn more sophisticated patterns in the data (a rough sketch of this decomposition appears after this list).
- Enhanced Mask Decoder: DeBERTa introduces an enhanced mask decoder to predict masked positions more effectively during pre-training.
- Performance: DeBERTa achieves superior performance on a range of NLP tasks, even surpassing RoBERTa and other contemporaries, especially in tasks that require a deeper understanding of context and complex relationships in text.
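- Roughly, and with simplified notation (projection matrices omitted; this is our paraphrase of the idea, not the paper's exact equation): if \(H_i\) is the content vector of token \(i\) and \(P_{i \mid j}\) its relative-position vector with respect to token \(j\), the disentangled attention score decomposes as

\[ A_{i,j} = \underbrace{H_i H_j^{\top}}_{\text{content-to-content}} + \underbrace{H_i P_{j \mid i}^{\top}}_{\text{content-to-position}} + \underbrace{P_{i \mid j} H_j^{\top}}_{\text{position-to-content}}, \]

with the position-to-position term dropped.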
- In summary, while RoBERTa focuses on optimizing the BERT model through changes in pre-training strategies and model configurations, DeBERTa introduces a novel attention mechanism to improve the model’s ability to understand complex relationships in the text. Both models represent significant advancements in the field of NLP and have been used to push the boundaries of what’s possible in tasks like text classification, question answering, and more.
Further Reading
- Generating word embeddings from BERT
- How are the TokenEmbeddings in BERT created?
- BERT uses WordPiece, RoBERTa uses BPE
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
References
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- BERT: How and Why Does It Use The Transformer Architecture?
- What is purpose of the [CLS] token and why is its encoding output important?
- What is the vector value of [CLS] [SEP] tokens in BERT
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledBERT,
title = {BERT},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}