• Fine-tuning of large pre-trained models on downstream tasks is called “transfer learning”.
  • Full fine-tuning of pre-trained models on downstream tasks is common and effective, but it is an inefficient approach to transfer learning.
  • The simplest way to make fine-tuning more efficient is to freeze the network’s lower layers and adapt only the top ones to specific tasks.
  • In this article, we’ll explore Parameter-Efficient Fine-Tuning (PEFT) methods that adapt a pre-trained model to downstream tasks more efficiently: they train fewer parameters, which saves cost and training time, while yielding performance similar to full fine-tuning.
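As a rough illustration of the "freeze the lower layers" idea, the sketch below marks all but the top few layers as frozen and reports the fraction of parameters that would still be trained. The `Layer` class, the 12-layer network, and the layer sizes are hypothetical and chosen only for illustration; in a real framework the same effect is usually achieved by disabling gradient updates on the frozen layers.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    n_params: int
    trainable: bool = True


def freeze_lower_layers(layers, n_top):
    """Mark all but the last `n_top` layers as frozen (not updated during fine-tuning)."""
    cutoff = len(layers) - n_top
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return layers


def trainable_fraction(layers):
    """Fraction of all parameters that will receive updates."""
    total = sum(layer.n_params for layer in layers)
    trainable = sum(layer.n_params for layer in layers if layer.trainable)
    return trainable / total


# Hypothetical 12-layer network with equally sized layers.
layers = [Layer(f"layer_{i}", 1_000_000) for i in range(12)]
freeze_lower_layers(layers, n_top=2)
print(f"trainable fraction: {trainable_fraction(layers):.3f}")
```

With two of twelve equally sized layers left trainable, roughly one sixth of the parameters are updated.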

Parameter-Efficient Fine-Tuning (PEFT)

  • Let’s start off by defining what parameter-efficient fine-tuning is and give some context on it.
  • Parameter-efficient fine-tuning is used mainly with large-scale pre-trained models (such as those in NLP) to adapt a pre-trained model to a new task without drastically increasing the number of parameters.
  • The challenge is this: modern pre-trained models (like BERT, GPT, T5, etc.) contain hundreds of millions, if not billions, of parameters. Fine-tuning all these parameters on a downstream task, especially when the available dataset for that task is small, can easily lead to overfitting. The model may simply memorize the training data instead of learning genuine patterns. Moreover, introducing additional layers or parameters during fine-tuning can drastically increase computational requirements and memory consumption.
  • As mentioned earlier, PEFT fine-tunes only a small number of model parameters while freezing most of the parameters of the pre-trained LLM. This also helps mitigate catastrophic forgetting, an issue seen in fully fine-tuned LLMs where the model forgets the original task it was trained on after being fine-tuned.


  • Parameter-efficient fine-tuning is useful for the following reasons:
    1. Reduced computational costs (requires fewer GPUs and GPU time).
    2. Faster training times (finishes training faster).
    3. Lower hardware requirements (works with cheaper GPUs with less VRAM).
    4. Better modeling performance (reduces overfitting).
    5. Less storage (the majority of weights can be shared across different tasks).

Practical use-case

  • Credits to the below section go to Pranay Pasula.
  • PEFT obviates the need for 40 or 80GB A100s to make use of powerful LLMs. In other words, you can fine-tune 10B+ parameter LLMs for your desired task for free or on cheap consumer GPUs.
  • Using PEFT methods like LoRA, especially with 4-bit quantized base models via QLoRA, you can fine-tune 10B+ parameter LLMs that are 30-40GB in size on 16GB GPUs. If it’s out of your budget to buy a 16GB GPU/TPU, Google Colab occasionally offers a 16GB VRAM Tesla T4 for free. Remember to save your model checkpoints every now and then and reload them as necessary, in the event of a Colab disconnect/kernel crash.
  • If you’re fine-tuning on a single task, the base models are already so expressive that you need only a few (tens to hundreds of) examples to perform well on this task. With PEFT via LoRA, you train only a trivial fraction of the parameters (in this case, 0.08%), and although the weights are stored in 4-bit, computations are still done in 16-bit.
  • Note that a good amount of VRAM is still needed for the fine-tuning process, but PEFT with a small enough batch size and a little gradient accumulation can do the trick while still retaining ‘fp16’ computation. In some cases, performance on the fine-tuned task can be comparable to that of a fine-tuned 16-bit model.
  • Key takeaway: You can fine-tune powerful LLMs to perform well on a desired task using free compute. Use a <10B parameter model (which is still huge) together with quantization, PEFT, checkpointing, and a small training set, and you can quickly fine-tune this model for your use case.
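To make the memory argument concrete, here is a back-of-the-envelope sketch of how much memory the weights alone occupy at different precisions. The 15B parameter count is a hypothetical example, and the estimate deliberately ignores activations, gradients, optimizer state, and quantization constants, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 15e9  # hypothetical 15B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 16-bit weights: 30.0 GB
print(f" 4-bit weights: {int4_gb:.1f} GB")  #  4-bit weights: 7.5 GB
```

Under these assumptions, the 16-bit weights would not fit on a 16GB GPU, whereas the 4-bit weights leave room for the small set of trainable parameters and their optimizer state.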

PEFT methods

  • Below, we delve into individual PEFT methods and their nuances.

Prompt Tuning

  • Prompt tuning was first introduced in The Power of Scale for Parameter-Efficient Prompt Tuning by Lester et al. It is a simple yet effective method that learns soft prompts to condition frozen language models to perform specific downstream tasks. Unlike discrete text prompts, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples.
  • Also, prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model.
  • The authors show that prompt tuning outperforms few-shot learning by a large margin, and becomes more competitive with scale.
  • This is an interesting approach that can help to effectively use a single frozen model for multi-task serving.
  • Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. With a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By contrast, tuned prompts require only 20,480 parameters per task (assuming a prompt length of 5 tokens), a reduction of over five orders of magnitude.
  • Thus, instead of using discrete text prompts, prompt tuning employs soft prompts: learnable vectors that are optimized through backpropagation, which makes them adaptable to specific tasks.
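The mechanics can be sketched in a few lines: each task owns a small matrix of trainable vectors that is prepended to the embedded input, while the model itself is shared and untouched. Everything named below (`EMBED_DIM`, `PROMPT_LEN`, `embed`, the task names) is a hypothetical stand-in, and the training loop that would update the soft prompts is omitted.

```python
import random

random.seed(0)
EMBED_DIM = 4    # hypothetical; real models use thousands of dimensions
PROMPT_LEN = 2   # hypothetical soft-prompt length


def new_soft_prompt():
    """One trainable soft prompt: PROMPT_LEN vectors of size EMBED_DIM."""
    return [[random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(PROMPT_LEN)]


def embed(token_ids):
    """Stand-in for the frozen model's embedding lookup (hypothetical)."""
    return [[float(t)] * EMBED_DIM for t in token_ids]


# One small prompt per task; the model itself is shared and never updated.
soft_prompts = {"sentiment": new_soft_prompt(), "topic": new_soft_prompt()}


def build_input(task, token_ids):
    """Prepend the task's soft prompt to the embedded input tokens."""
    return soft_prompts[task] + embed(token_ids)


# A mixed-task batch: each example carries its own task's prompt.
batch = [("sentiment", [5, 9, 2]), ("topic", [7, 1])]
inputs = [build_input(task, ids) for task, ids in batch]

print([len(seq) for seq in inputs])  # [5, 4]
```

Because the task identity lives entirely in the prepended vectors, examples from different tasks can share one batch and one copy of the model.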

  • Prompt Tuning offers many benefits such as:
    • Memory-Efficiency: Prompt tuning dramatically reduces memory requirements. For instance, while a T5 “XXL” model necessitates 11 billion parameters for each task-specific model, prompt-tuned models need a mere 20,480 parameters (assuming a prompt length of 5 tokens).
    • Versatility: Enables the use of a single frozen model for multi-task operations.
    • Performance: Outshines few-shot learning and becomes more competitive as the scale grows.
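The parameter figures quoted above can be checked with simple arithmetic. The embedding dimension of 4096 is an assumption here: it is the value implied by 20,480 parameters for a 5-token prompt, not something stated explicitly in the text.

```python
import math

prompt_len = 5
embed_dim = 4096  # assumption: implied by 20,480 parameters / 5 prompt tokens
model_params = 11_000_000_000

prompt_params = prompt_len * embed_dim
orders_of_magnitude = math.log10(model_params / prompt_params)

print(prompt_params)  # 20480
print(f"reduction: about {orders_of_magnitude:.1f} orders of magnitude")
```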


Adapters

  • Adapter layers, often termed “Adapters”, add minimal additional parameters to the pretrained model. These adapters are inserted between existing layers of the network.
  • Adapters are a PEFT technique shown to achieve performance similar to tuning the top layers while requiring up to two orders of magnitude fewer parameters.
  • Adapter-based tuning simply inserts new modules called “adapter modules” between the layers of the pre-trained network.
  • During fine-tuning, only the parameters of these adapter layers are updated, while the original model parameters are kept fixed. This results in a model with a small number of additional parameters that are task-specific.
  • With the full pre-trained model kept frozen, these modules are the only optimizable ones during fine-tuning, so only very few parameters are introduced per task, yielding “compact” models.
  • They offer many benefits such as:
    • Parameter-Efficiency: By keeping the main model frozen and only updating the adapter layers, a minimal number of parameters are added per task. This results in compact models that are memory-efficient.
    • Performance: Despite the small parameter footprint, adapters often achieve performance comparable to conventional fine-tuning.

What is an Adapter Module?

  • Let’s look at the application of the adapter module in the transformer architecture in three points:
    • The adapter module (right) first projects the original \(d\)-dimensional features into a smaller \(m\)-dimensional vector, applies a non-linearity, and then projects it back to \(d\) dimensions.
    • As can be seen, the module features a skip-connection. With it in place, initializing the parameters of the projection layers to near-zero yields a near-identity initialization of the module. This is required for stable fine-tuning and is intuitive: with it, we essentially do not disturb the learning from pre-training.
    • In a transformer block (left), the adapter is applied directly to the outputs of each of the layers (attention and feedforward).
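The three points above can be sketched as a small forward pass. This is a minimal illustration rather than the paper's implementation: ReLU is assumed as the non-linearity, biases are omitted, and the sizes `d = 6` and `m = 2` as well as the initialization scale are hypothetical.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def relu(v):
    return [max(0.0, x) for x in v]


def adapter_forward(x, W_down, W_up):
    """Bottleneck adapter: d -> m, non-linearity, m -> d, plus a skip connection."""
    z = relu(matvec(W_down, x))   # project d -> m and apply the non-linearity
    delta = matvec(W_up, z)       # project m -> d
    return [x_i + d_i for x_i, d_i in zip(x, delta)]  # skip connection


random.seed(0)
d, m = 6, 2
scale = 1e-3  # near-zero initialization of both projections
W_down = [[random.gauss(0.0, scale) for _ in range(d)] for _ in range(m)]  # m x d
W_up = [[random.gauss(0.0, scale) for _ in range(m)] for _ in range(d)]    # d x m

x = [1.0, -2.0, 0.5, 3.0, -1.5, 2.0]
y = adapter_forward(x, W_down, W_up)
max_change = max(abs(a - b) for a, b in zip(x, y))
print(f"largest change to the input: {max_change:.2e}")
```

With near-zero projections the module's output is almost identical to its input, which is the near-identity behavior the skip connection is meant to provide at the start of fine-tuning.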

How to decide the value of \(m\)?

  • The size \(m\) of the adapter module determines the number of optimizable parameters and hence poses a parameter-versus-performance tradeoff.
  • The original paper finds experimentally that performance remains fairly stable across varying adapter sizes \(m\); hence, for a given model, a fixed size can be used for all downstream tasks.
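To see how \(m\) drives the parameter budget, the sketch below counts the parameters of a single adapter with biases included: two projections of \(d \times m\) weights each, plus \(m\) and \(d\) bias terms. The hidden size `d = 768` and the candidate values of `m` are hypothetical.

```python
def adapter_params(d, m):
    """Parameters in one adapter: two projections (d*m each) plus biases (m and d)."""
    return 2 * m * d + d + m


d = 768  # hypothetical hidden size
for m in (8, 64, 256):
    print(f"m={m:>3}: {adapter_params(d, m):>7} parameters per adapter")
```

The count grows linearly in \(m\), so choosing a small bottleneck keeps the per-task overhead low.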

Low Rank Adaptation (LoRA)

  • Figure (not reproduced here): LoRA in action for a diffusion model.

  • Looking to avoid high GPU costs when fine-tuning a model?
  • The basic idea behind LoRA is:

Heavily Parameterized Large Language Models + Basic Linear Algebra Theorem = Save GPU memory!

  • The downsides of some of the other fine-tuning techniques for multitask learning are:
    • Adapters: introduce inference latency that becomes significant in online, low-batch-size inference settings.
    • Prefix tuning: reduces the model’s usable sequence length.
  • LoRA (low rank adaptation) is a PEFT (parameter efficient fine-tuning) technique that relies on a simple concept: the decomposition of non-full-rank matrices.
  • LoRA hypothesizes that the “change in weights” during adaptation has a “low intrinsic rank”. \(\Delta W\) is non-full rank and so can be written as \(\Delta W = BA\), where \(B\) and \(A\) are two much smaller matrices whose shared inner dimension is the rank \(r\) (c.f. figure below).
    • A matrix is said to be rank-deficient if it does not have full rank. The rank deficiency of a matrix is the difference between the lesser of the number of rows and columns, and the rank. For more, refer to Wikipedia: Rank.
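A quick count shows why the factorization saves parameters. The sketch assumes the usual shapes, \(B\) of size \(d \times r\) and \(A\) of size \(r \times k\), so that \(BA\) has the shape of the full \(d \times k\) update but rank at most \(r\). The matrix size of 4096 is a hypothetical example.

```python
def full_update_params(d, k):
    """Parameters in a full d x k weight update."""
    return d * k


def lora_params(d, k, r):
    """B is d x r and A is r x k, so BA is d x k but has rank at most r."""
    return r * (d + k)


d = k = 4096  # hypothetical square weight matrix
for r in (1, 4, 16):
    fraction = lora_params(d, k, r) / full_update_params(d, k)
    print(f"r={r:>2}: {lora_params(d, k, r):>7} trainable parameters ({fraction:.4%} of full)")
```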

  • “Low intrinsic rank” is inspired by the idea of “low intrinsic dimensionality”: these over-parameterized pre-trained models are seen to reside on a low intrinsic dimension, which also explains why fine-tuning only a part of the model, rather than the full model, can yield good results.
    • LoRA operates under the hypothesis that the weight changes in the adaptation of a model (fine-tuning) have a low intrinsic rank. In other words, even though a weight matrix may be large, the actual changes made to this matrix during adaptation can be represented in a compressed format, specifically through a low-rank approximation.
  • During training, the outputs from \(W\) and \(\Delta W\) are added component wise, like so:
\[h = Wx + BAx\]
  • All that is left to optimize are the new matrices \(B\) and \(A\), which together contain far fewer parameters than the full matrix due to their dimensions.
  • In summary, all of the pre-trained weights \(W\) are kept frozen and the rank decomposition matrices of the “change in weight matrix”, \(B\) and \(A\), are optimized.
  • This yields significant benefits as compared to full fine-tuning:
    • Time and memory efficiency: With a large percentage of the parameters frozen, both training time and GPU memory are saved. The savings are larger with stateful optimizers such as Adam and Adadelta, which keep per-parameter state only for the trainable parameters.
    • Storage efficiency: No need to store huge checkpoints for different downstream tasks. Checkpoint size is greatly reduced with reduction in trainable parameters.
    • No additional inference latency: (unlike adapters) just add the learned matrix to the pre-trained one.
    • Easy task-switching in deployment: all we need to change is a handful of weights as compared to the full model.
  • Results:
    • With GPT-3 175B, the VRAM consumption during training is reduced from 1.2TB to 350GB, and the trained checkpoint size is reduced from 350GB to 35MB.
    • LoRA achieves performance comparable to, and sometimes even better than, fine-tuning the full model.
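The forward pass \(h = Wx + BAx\) and the "no additional inference latency" property can both be checked on a toy example. The matrices below are hypothetical values with rank \(r = 1\); the point is only that adding \(BA\) into \(W\) once gives the same output as keeping the two paths separate.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n x r) and Y (r x m)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


# Frozen pre-trained weight W (3 x 4) and trainable factors B (3 x 1), A (1 x 4).
W = [[1.0, 0.0, 2.0, -1.0],
     [0.5, 1.0, 0.0, 3.0],
     [-2.0, 1.5, 1.0, 0.0]]
B = [[0.5], [-1.0], [2.0]]
A = [[1.0, 0.0, -1.0, 2.0]]
x = [1.0, 2.0, 3.0, 4.0]

# Training-time forward pass: h = Wx + BAx, computed as B(Ax) without forming BA.
Ax = matvec(A, x)
h_train = [wx + bax for wx, bax in zip(matvec(W, x), matvec(B, Ax))]

# Deployment: fold the update into the weights once, then use a single product.
BA = matmul(B, A)
W_merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
h_merged = matvec(W_merged, x)

print(h_train)   # [6.0, 8.5, 16.0]
print(h_merged)  # [6.0, 8.5, 16.0]
```

Since the merged matrix has the same shape as \(W\), inference costs the same as for the original model.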


Quantized Low-Rank Adaptation (QLoRA)

  • Definition:
    • Quantized Low-Rank Adaptation (QLoRA) is a parameter-efficient fine-tuning technique that aims to fine-tune massive pre-trained language models (like a 65B parameter model) on limited GPU memory without compromising the model’s performance. It builds on LoRA’s principles and adds 4-bit NormalFloat (NF4) quantization and Double Quantization to further improve memory efficiency.
  • Key Components:
    1. Low-Rank Adaptation: Like LoRA, QLoRA injects trainable low-rank matrices into the architecture (specifically, the Transformer layers) of pretrained LLMs. This technique ensures efficient fine-tuning by optimizing these low-rank matrices rather than the full model, resulting in fewer trainable parameters and reduced computational costs.
    2. Quantization: The standout feature of QLoRA is its use of quantization to achieve higher memory efficiency.
      • NF4 Quantization: QLoRA employs 4-bit NormalFloat (NF4) quantization. By transforming all weights to a specific distribution that fits within the range of NF4, this technique can efficiently quantize weights without needing intricate algorithms for quantile estimation.
      • Double Quantization: This involves the quantization of quantization constants themselves to further reduce memory overhead. By using 8-bit Floats with a block size of 256 for the secondary quantization, significant memory savings are achieved without compromising model performance.
  • QLoRA leverages a frozen, 4-bit quantized pretrained language model and backpropagates the gradients into Low Rank Adapters (LoRA). This combination reduces both memory usage (through low-bit quantization of the base model) and the number of trainable parameters (through low-rank structures).
  • The QLoRA paper introduces a method for democratizing access to large transformer models through quantization, specifically aiming to significantly reduce the memory usage during the fine-tuning phase of these models. Here’s a brief breakdown of the method and its practical application with the transformers library:
  • Goal:
    • QLoRA minimizes memory utilization during fine-tuning of large language models (LLM) without compromising on performance as compared to the conventional 16-bit model fine-tuning.
  • Mechanism:
    • 4-bit Quantization: Pretrained language models are compressed using 4-bit quantization.
    • Low-Rank Adapters: After freezing the quantized parameters of the language model, Low-Rank Adapters (LoRA) are added as trainable parameters.
    • Backpropagation: Gradients are backpropagated through the frozen, quantized pretrained model, specifically targeting the LoRA layers, which are the only trainable parameters during fine-tuning.
  • Data Types:
    • Storage: QLoRA employs a storage data type, typically 4-bit NormalFloat, for the base model weights.
    • Computation: A 16-bit BrainFloat data type is utilized for computations. Weights are dequantized to this computation data type for forward and backward passes. Weight gradients are only computed for the LoRA parameters.
  • Results:
    • Comparable performance with 16-bit fine-tuning methods in various tests.
    • Models finetuned with QLoRA, specifically the Guanaco models trained on the OpenAssistant dataset, deliver state-of-the-art chatbot performance, closely matching that of ChatGPT on the Vicuna benchmark.
  • Advantages:
    • Further Memory Reduction: QLoRA achieves higher memory efficiency through quantization.
    • Preserved Performance: Despite being more memory-efficient, QLoRA preserves high model quality, on par with or even superior to fully fine-tuned models in various tasks.
    • Applicability: QLoRA is versatile, being compatible with various LLMs like RoBERTa, DeBERTa, GPT-2, and GPT-3.
  • The transformers library, maintained by Hugging Face, offers an integration of the QLoRA method. This integration facilitates the efficient quantization of supported models.
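The two quantization ideas can be illustrated without any library. The sketch below is a simplified stand-in, not the QLoRA implementation: it uses uniform 4-bit levels rather than the actual NF4 levels, and the first-level block size of 64 and the 32-bit constants in the overhead calculation are assumptions taken from the QLoRA paper rather than from the text above.

```python
def quantize_blockwise(weights, block_size):
    """Return (codes, scales): 4-bit integer codes plus one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(w / scale * 7) for w in block)  # integers in [-7, 7]
    return codes, scales


def dequantize_blockwise(codes, scales, block_size):
    """Map integer codes back to approximate weights for computation."""
    return [c / 7 * scales[i // block_size] for i, c in enumerate(codes)]


def constant_overhead_bits(block_size, const_bits=32, double_quant=False,
                           second_block_size=256, second_const_bits=8):
    """Extra bits per parameter spent on storing quantization constants."""
    if not double_quant:
        return const_bits / block_size
    # The constants are themselves quantized to 8 bits, with one 32-bit
    # second-level constant per `second_block_size` first-level constants.
    return second_const_bits / block_size + const_bits / (block_size * second_block_size)


weights = [0.12, -0.50, 0.33, 0.05, -1.20, 0.80, 0.02, -0.40]
codes, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)   # [2, -7, 5, 1, -7, 5, 0, -2]
print(scales)  # [0.5, 1.2]
print(constant_overhead_bits(64))                     # 0.5
print(constant_overhead_bits(64, double_quant=True))  # 0.126953125
```

Under these assumptions, Double Quantization cuts the storage cost of the constants from 0.5 to about 0.127 bits per parameter.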

Which Technique to Choose: A Mental Model

  • Choosing a PEFT method is simply a matter of matching it with your objectives.

Prompt Tuning

  • What: Prompt Tuning learns a small set of continuous, trainable parameters (a soft prompt) that is prepended to the input of a frozen pre-trained LLM. Only the soft prompt is trained on the task data; at inference time, the task’s prompt conditions the unchanged model.

  • When to use: Prompt Tuning is a good choice when you have a large pre-trained LLM and want to adapt it to multiple different downstream tasks with minimal computational resources, serving all of them from a single frozen model. It is also useful when you want to generate diverse and high-quality text outputs based on specific prompts.


Adapters

  • Adds thin adapter modules between the layers of the pretrained model

  • Only adapters are updated, original weights are frozen

  • Adapters have a much smaller bottleneck dimension than the original layers

  • Allows efficient task-specific tuning with minimal new parameters

  • Avoids forgetting original knowledge in pretrained model

LoRA

  • What: LoRA (Low-Rank Adaptation) freezes the pre-trained weights and learns a low-rank matrix factorization of the weight update, typically for the projection matrices of the LLM’s attention mechanism.

  • When to use: LoRA is a good choice when you want to fine-tune a pre-trained LLM for a specific downstream task with limited computational resources, since it sharply reduces the number of trainable parameters in the model. Specifically:

    • Memory Efficiency is Desired but Not Critical: LoRA offers substantial savings in terms of parameters and computational requirements. If you’re looking to achieve a balanced reduction in trainable parameters without diving into the complexities of quantization, LoRA is an ideal choice.
    • Real-time Application: LoRA ensures no added inference latency, making it suitable for real-time applications.
    • Task-Switching is Required: LoRA can share the pretrained model across multiple tasks, reducing the need for maintaining separate models for each task.
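Task switching with a shared base model can be pictured as keeping one frozen weight matrix and a small pair of factors per task. The class and task names below are hypothetical and only sketch the bookkeeping; a real system would apply this per layer.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n x r) and Y (r x m)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


class SharedBaseWithAdapters:
    """One frozen weight matrix W shared by all tasks, plus a (B, A) pair per task."""

    def __init__(self, W):
        self.W = W
        self.adapters = {}

    def add_task(self, task, B, A):
        self.adapters[task] = (B, A)

    def effective_weight(self, task):
        """Return W + BA for the given task without modifying W."""
        B, A = self.adapters[task]
        BA = matmul(B, A)
        return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(self.W, BA)]


W = [[1.0, 2.0], [3.0, 4.0]]
layer = SharedBaseWithAdapters(W)
layer.add_task("summarize", B=[[1.0], [0.0]], A=[[0.5, 0.5]])
layer.add_task("translate", B=[[0.0], [1.0]], A=[[-1.0, 1.0]])

print(layer.effective_weight("summarize"))  # [[1.5, 2.5], [3.0, 4.0]]
print(layer.effective_weight("translate"))  # [[1.0, 2.0], [2.0, 5.0]]
```

Switching tasks only changes which small pair of factors is combined with the shared weights; the base matrix is stored once.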


QLoRA

  • Applies low-rank adapter modules

  • Quantizes the frozen base model weights to a low-bit representation (4-bit NormalFloat)

  • Quantization compresses the base model to reduce the memory footprint

  • Fine-tunes only the adapters, which are kept in higher precision, on top of the quantized base model

  • Can further increase memory efficiency over standard low-rank adapters

  • Quantization alone can cost some performance; fine-tuning the adapters is intended to recover it

  • What: Building on LoRA, QLoRA employs quantization techniques to ensure higher memory efficiency. This method uses both Low-Rank Adaptation and Quantization to optimize parameter efficiency and maintain model performance.

  • When to use: Choose QLoRA if:

    • Maximum Memory Efficiency is Needed: If you’re deploying LLMs on devices with stringent memory constraints or if memory overhead is a primary concern, QLoRA, with its additional quantization techniques, offers a more aggressive approach to reducing memory requirements.
    • Performance is Crucial: In some scenarios, QLoRA can perform at par or even better than fully fine-tuned models. If preserving or improving performance while optimizing for efficiency is essential, then QLoRA would be the method of choice.
    • Versatility Across Models: If you’re working with multiple LLM architectures and require a method that’s adaptable across them, QLoRA’s broad applicability makes it an attractive option.
    • Summary: choose QLoRA when you are:
      • Operating within strict memory constraints.
      • Prioritizing both performance and efficiency.
      • Working with a variety of LLM architectures.


Adapters

  • What: Adapters are tiny neural network modules that are added to pre-trained LLMs, typically between the pre-trained layers, to adapt the model to new downstream tasks. During fine-tuning, only the weights of the adapter are learned, while the pre-trained model’s parameters remain fixed.

  • When to use: When you need to fine-tune multiple downstream tasks on the same pre-trained model. Additionally, Adapters are flexible and can be quickly and easily plugged into different parts of the pre-trained model without requiring major modifications.

Prefix Tuning

  • What: Prefix tuning involves adding a small trainable prefix to the input of the pre-trained LLM during fine-tuning, which modifies the representation learned by the pre-trained model to better suit the downstream task.

  • When to use: When you want to fine-tune a pre-trained LLM for a specific downstream task, have limited computational resources, and want to modify the representation learned by the pre-trained model for that task.

Comparison of all PEFT methods

| PEFT Method | Description | When to Use | Computational Overhead | Memory Efficiency | Versatility across Tasks | Performance Impact |
| --- | --- | --- | --- | --- | --- | --- |
| Prompt Tuning | Learns a soft prompt (trainable vectors) prepended to the input of a frozen LLM. | Large pre-trained LLM. Adaptation to multiple tasks. | Low | Moderate | High | Depends on prompt quality |
| LoRA | Learns a low-rank factorization of the weight update, typically for the attention projection matrices. | Task-specific adaptation with no added inference latency. Limited resources. | Low-Moderate | Good | Moderate | Generally positive with good training |
| QLoRA | Builds on LoRA with quantization for enhanced memory efficiency. | Strict memory constraints. Emphasis on performance & efficiency. | Low | Excellent | High | Comparable or better than full fine-tuning |
| Prefix Tuning | Adds a trainable prefix to modify LLM's learned representation. | Task-specific adaptation. Limited resources. | Low | Moderate | Moderate | Can vary, but usually positive with proper tuning |
| Adapters | Inserts neural modules between LLM layers; only adapter weights are updated during fine-tuning. | Multiple tasks on one LLM. Flexibility required. | Moderate | Good (only adapters are fine-tuned) | High (can be added for multiple tasks) | Typically positive if adapters are well-tuned |

Surgical fine-tuning

  • Authors: Yoonho Lee, Annie S. Chen, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, Chelsea Finn
  • Definition: Surgical fine-tuning is a method of selectively updating specific layers in a neural network based on how a fine-tuning dataset differs from the original pretraining dataset, rather than retraining every layer.
  • Motivation:
    1. Layer Specificity: Early layers in a neural network capture fundamental features of inputs (e.g., edges or shapes in images), while deeper layers combine these features for predictions (e.g., classifying images).

    2. Efficiency: Rather than universally fine-tuning every layer, selectively updating specific layers can achieve better performance, especially when the fine-tuning dataset has notable differences from the pretraining dataset.

  • Approaches:
    1. Manual Approach:
      • Fine-tune each layer individually and create a distinct model for each layer.
      • Compare the performance of each model to identify the best layers for fine-tuning.
    2. Automated Approach:
      • Calculate gradients for each layer.
      • Derive relative gradients by dividing each layer’s gradient norm by the norm of its weights.
      • Normalize these relative gradients across layers so that they lie between 0 and 1.
      • Assign learning rates for layers based on their normalized relative gradient value during training.
  • Results:
    • CIFAR-C Dataset:
      • Manual approach yielded an accuracy of 82.8%.
      • Fine-tuning the entire network resulted in 79.9% accuracy.
      • The automated approach achieved an accuracy of 81.4%.
  • Significance: Surgical fine-tuning is rooted in understanding how neural networks process input. This enhanced understanding can drive the discovery of more efficient methods to improve machine learning models.
  • Consideration: For more complex datasets, discerning differences between pretraining and fine-tuning datasets can be challenging. This complexity might make automated approaches like the one proposed more valuable, even if it didn’t yield the best performance on CIFAR-C.
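The automated approach described above can be sketched as follows. This is one plausible reading rather than the authors' code: normalization is done by dividing by the largest relative gradient, and the layer names, gradients, weights, and base learning rate are all hypothetical.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def per_layer_learning_rates(grads, weights, base_lr):
    """Scale `base_lr` per layer by its normalized relative gradient norm.

    Assumes every layer has non-zero weights and at least one non-zero gradient overall.
    """
    relative = {name: l2_norm(grads[name]) / l2_norm(weights[name]) for name in grads}
    largest = max(relative.values())
    # Dividing by the largest value maps every layer into the range (0, 1].
    return {name: base_lr * value / largest for name, value in relative.items()}


# Hypothetical flattened gradients and weights for three layers.
grads = {"early": [0.30, -0.40], "middle": [0.03, 0.04], "late": [0.006, 0.008]}
weights = {"early": [1.0, 0.0], "middle": [0.0, 1.0], "late": [0.6, 0.8]}

lrs = per_layer_learning_rates(grads, weights, base_lr=1e-3)
for name, lr in lrs.items():
    print(f"{name:>6}: learning rate {lr:.1e}")
```

In this toy setup the early layer has the largest relative gradient, so it receives the full base learning rate while the other layers are updated more gently.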





If you found our work useful, please cite it as:

  title   = {Multitask Learning},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}