• In 2017, OpenAI introduced a groundbreaking approach to machine learning called Reinforcement Learning from Human Feedback (RLHF), specifically focusing on human preferences, in their paper “Deep reinforcement learning from human preferences”. This innovative concept has since inspired further research and development in the field.
  • The concept behind RLHF is straightforward yet powerful: it involves using a pretrained language model and having human evaluators rank its outputs. This ranking then informs the model to develop a preference for certain types of responses, leading to more reliable and safer outputs.
  • RLHF effectively leverages human feedback to enhance the performance of language models. It combines the strengths of reinforcement learning algorithms with the nuanced understanding of human input, facilitating continuous learning and improvement in the model.
  • Incorporating human feedback, RLHF not only improves the model’s natural language understanding and generation capabilities but also boosts its efficiency in specific tasks like text classification or translation.
  • Moreover, RLHF plays a crucial role in addressing bias within language models. By allowing human input to guide and correct the model’s language use, it fosters more equitable and inclusive communication. However, it’s important to be mindful of the potential for human-induced bias in this process.

Refresher: Basics of Reinforcement Learning

  • To comprehend why reinforcement learning is employed in RLHF, we need to gain a better understanding of what it entails.
  • Reinforcement learning has its basics in mathematics where an agent is interacting with the environment as shown below (source):

  • In this interaction, the agent takes an action, and the environment responds with a state and a reward. Here’s a brief on the key terms:
    • The reward is the objective that we want to optimize.
    • A state is the representation of the environment/world at the current time index.
    • A policy is used to map from that state to an action.


  • Let’s start out by talking about what the motivation behind aligning LLMs to human feedback is.
  • The initial objective of training large language models like GPT was to predict subsequent text tokens accurately. However, this approach did not ensure that the outputs were helpful, harmless, or honest.
  • Consequently, there was a risk of generating content that might not align with ethical or safe human standards. To address this, a process was required to guide the model towards outputs that reflect human values, and that’s the role RLHF fulfills.
  • The image below (source), depicts how RLHF was leveraged in InstructGPT and will be used as the foundation of our understanding.
  • The image outlines a three-step process used to train a language model using RLHF. Here’s an explanation of each step: 1. Collect Demonstration Data, and Train a Supervised Policy.
    • A prompt is taken from a collection of prompts.
    • A human labeler (an annotator) provides the desired output, demonstrating how the model should ideally respond.
    • This labeled data is then used to fine-tune the language model (like GPT-3) using supervised learning techniques. Essentially, the model is taught to imitate the demonstrations. 2. Collect Comparison Data, and Train a Reward Model.
    • A prompt is chosen, and the model generates several potential outputs.
    • A labeler then ranks these outputs from best to worst according to criteria like helpfulness or accuracy.
    • This ranked data is used to train a reward model. The reward model learns to predict the quality of the language model’s outputs based on the rankings provided by human labelers. 3. Optimize a Policy Against the Reward Model Using Reinforcement Learning.
    • A new prompt is selected from the dataset.
    • The current policy (strategy the model uses to generate outputs) creates a response.
    • The reward model evaluates this response and assigns a reward.
    • This reward information is used to update and improve the policy through a reinforcement learning algorithm known as Proximal Policy Optimization (PPO) . The policy is adjusted to increase the likelihood of generating higher-reward outputs in the future.
  • Chip Huyen provides a zoomed out view of how the overall process works in her flowchart below:

  • Here’s a breakdown of the flowchart:

    1. Language Modeling:
      • This is the first stage where a language model is trained on a large dataset. The dataset is composed of a vast amount of text data, which can be of varying quality. The training at this stage is optimized for text completion tasks. The scale mentioned is over 1 trillion tokens, and examples of such models include GPT-x, Gopher, Falcon, LLama, Pythia, Bloom, and StableLM. This results in a Pretrained Large Language Model (LLM).
      • To expand further: This is phase of pretraining involves developing a large language model (LLM) that functions as a completion machine, using statistical knowledge to predict the likelihood of sequences in language. This is achieved by feeding the model extensive text data, often exceeding trillions of tokens, from varied sources to learn language patterns. The model’s efficacy is contingent on the quality of the training data, with the aim to minimize cross-entropy loss across training samples. As the Internet becomes saturated with data, including that generated by LLMs themselves, there’s a growing need to access proprietary data for further model improvement.
    2. Supervised Finetuning:
      • In the second stage, the pretrained LLM is further finetuned using high-quality data, which is often dialogue-focused to better suit conversational AI. This is done using demonstration data, and the process generates a Supervised Finetuning (SFT) model. The amount of data used for finetuning ranges from 10,000 to 100,000 (prompt, response) pairs. Examples of models that go through this process are Dolly-v2 and Falcon-Instruct.
      • To elaborate: This is phase involves Supervised Fine-Tuning (SFT) for dialogue, where a pre-trained model is optimized to generate preferred responses to prompts, such as direct answers to questions. High-quality demonstration data, consisting of prompt-response pairs, guides the model’s behavior. With about 13,000 such pairs, OpenAI’s approach emphasizes quality through expert labelers, while others like DeepMind use heuristics for data selection. The SFT process is critical for tailoring the model’s outputs to practical use cases, leveraging a smaller yet refined dataset to minimize cross-entropy loss for the dialogue-specific responses.
    3. Classification and Reward Modeling:
      • The model undergoes a classification process where it is trained to give a scalar score to responses based on human feedback. This is to ensure that the model can evaluate the quality of its own responses. The data used here is called comparison data, and involves 100,000 to 1 million comparisons between a prompt, a winning response, and a losing response. This stage results in the creation of a Reward model.
    4. Reinforcement Learning (RLHF):
      • This phase involves using Reinforcement Learning techniques to train the model to generate responses that maximize the scores given by the reward model, effectively teaching the AI to prefer high-quality responses as judged by humans. This stage uses prompts (10,000 to 100,000) to adjust the model’s responses. The end product is the Final model, which should be adept at handling prompts in a way that aligns with human preferences. Examples of such models are InstructGPT, ChatGPT, Claude, and StableVicuna.
      • This phase of RLHF is an advanced training process that refines the behavior of a Supervised Fine-Tuned (SFT) model. It uses human feedback to score AI-generated responses, guiding the model to produce high-quality outputs. RLHF involves training a reward model to evaluate responses and optimizing the language model to prioritize these high scores. This phase addresses the limitations of SFT by providing nuanced feedback on the quality of responses, not just their plausibility, and mitigates issues like hallucination by aligning model outputs more closely with human expectations. Despite its complexity, RLHF has been shown to enhance model performance significantly over SFT alone.
  • Below, we will expand on the key steps mentioned in this flow.

Reward Model

  • In the context of RLHF, the key function of a reward model is to evaluate a given input (such as a sequence of text) and produce a scalar reward. This reward is indicative of human preferences or judgments about the quality or desirability of the input.

  • The image above (source) displays how the reward model works internally.
  • A reward model is a function or model that takes as input the output or behavior of an AI agent, which can include sequences of text, and produces a scalar reward signal that quantifies how well those outputs align with human preferences or desired behavior.
  • Architectures for reward models include:
    • LM classifiers: An LLM fine-tuned as a binary classifier to score which response better fits the human preference
    • Value networks: Regression models that predict a scalar rating representing relative human preference
    • Critique generators: LMs trained to generate an evaluative critique explaining which response is better and why. The critique is used with instruction tuning.
  • The goal is converting noisy human subjective judgments into a consistent reward function that can guide an RL agent’s training. Better reward modeling yields superior performance.
  • To summarize, the reward model is trained using the ranked comparison data (several outputs generated by the model) based on it’s alignment criteria which can be helpful, harmless, and honesty. The reward function combines various models into the RLHF process. It evaluates generated text’s “preferability.” by including a penalty term based on the Kullback-Leibler (KL) divergence between probability distributions from the RL policy and the initial model. This penalty prevents the RL policy from deviating significantly from the pretrained model, ensuring coherent text generation.
    • The Kullback-Leibler (KL) divergence, which is a measure of the difference between two probability distributions, can be used to overlap the two distributions (initial LM output vs. tuned LM output).
      • KL divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. It quantifies the difference between two probability distributions.
      • Thus, with RLHF, KL divergence can be used to compare the probability distribution of an agent’s current policy with a reference distribution that represents the desired behavior.

Optimizing the Policy

  • The “policy” refers to a strategy or a set of rules that an agent uses to make decisions in an environment. The policy defines how the agent selects actions based on its current observations or state.
  • The policy in PPO is iteratively updated to maximize reward while maintaining a certain level of similarity to its previous version (to prevent drastic changes that could lead to instability).
  • In DPO, the policy is optimized directly from human preferences, where it increases the relative log probability of preferred responses to unpreferred ones, thus aligning with human feedback while maintaining a balance as specified by the KL divergence constraint.

Proximal Policy Optimization (PPO)

  • Proximal Policy Optimization (PPO) is a reinforcement learning algorithm that addresses some key challenges in training agents through policy gradient methods. Here’s a detailed look at how PPO works:

Core Principles of PPO

  1. Policy Gradient Approach: PPO operates on the policy gradient approach, where the agent directly learns a policy, typically parameterized by a neural network. The policy maps states to actions based on the current understanding of the environment.

  2. Iterative Policy Improvement: The agent collects a set of trajectories under its current policy, and then updates the policy to maximize a specially designed objective function. This process is repeated iteratively, allowing the policy to gradually improve over time.

Key Components of PPO

  1. Surrogate Objective Function: Central to PPO is its surrogate objective function, which considers the ratio of the probability of an action under the current policy to the probability under the old policy, multiplied by the advantage function. The advantage function assesses how much better an action is compared to the average action at a given state.

  2. Policy Ratio and Clipping Mechanism: The “policy ratio,” which is the ratio of the probability of an action under the new policy to that under the old policy, plays a crucial role. PPO employs a clipping mechanism in its objective function, limiting the policy ratio within a defined range (typically \([1-\epsilon, 1+\epsilon]\)). This clipping ensures that the updates to the policy are kept within a reasonable range, preventing the new policy from deviating excessively from the old one. Ultimately, this mechanism helps in maintaining the stability of the learning process.

  3. Multiple Epochs of Stochastic Gradient Ascent: In PPO, each batch of experiences is used for multiple epochs of stochastic gradient ascent. This efficient use of data for policy updates makes PPO more sample-efficient compared to some other methods.

  4. Value Function and Baseline: A value function is often trained alongside the policy in PPO. This value function estimates the expected return (cumulative future rewards) from each state and is used to compute the advantage function, which in turn informs the policy update.

Advantages of PPO

  • Stability and Reliability: The clipping mechanism in the objective function helps to avoid large, destabilizing updates to the policy, making the learning process more stable and reliable.
  • Efficiency: By reusing data for multiple gradient updates, PPO can be more sample-efficient compared to some other methods.
  • General Applicability: PPO has demonstrated good performance across a wide range of environments, from simple control tasks to complex simulations like those in 3D simulations. It offers a simpler and more robust approach compared to previous algorithms like TRPO.

Simplified Example

  • Imagine an agent learning to play a game. The agent tries different actions (moves in the game) and learns a policy that predicts which action to take in each state (situation in the game). The policy is updated based on the experiences, but instead of drastically changing the policy based on recent success or failure, PPO makes smaller, incremental changes. This way, the agent avoids drastically changing its strategy based on limited new information, leading to a more stable and consistent learning process.

Proximal Policy Optimization (PPO) is designed with a specific objective function that helps in stabilizing and improving the training process in reinforcement learning. The objective function of PPO and the role of KL divergence in it can be described as follows:

PPO’s Objective Function

  1. Policy Ratio: The core of the PPO objective function involves the policy ratio, which is the ratio of the probability of taking a certain action under the current policy to the probability under the previous policy. This ratio is multiplied by the advantage estimate, which reflects how much better a given action is compared to the average action at a given state.

  2. Clipped Surrogate Objective: To prevent excessively large updates, which could destabilize training, PPO introduces a clipping mechanism in its objective function. The policy ratio is clipped within a certain range, typically [1-ε, 1+ε] (where ε is a small value like 0.1 or 0.2). This clipping ensures that the updates to the policy are not too large, which maintains stability in training.

  3. Value Function Loss: PPO also typically includes a value function loss in its objective. This part of the objective function ensures that the estimated value of the states (as predicted by the value function) is as accurate as possible, which is important for computing reliable advantage estimates.

  4. Entropy Bonus: Some implementations of PPO include an entropy bonus to encourage exploration. This part of the objective function rewards the policy for taking a variety of actions, which helps prevent premature convergence to suboptimal policies.

Role of KL Divergence

  • While the KL divergence is not a direct component of the basic PPO objective function, it plays a significant role in some implementations of PPO:
  1. Monitoring Policy Stability: KL divergence is used as a measure to monitor how much the policy changes during training. A large KL divergence indicates a significant change in the policy, which might lead to instability.

  2. Adjusting Policy Updates:

    • KL Penalty: In some implementations, a KL penalty is added to the PPO objective function. This penalty increases when the KL divergence between the new and old policies becomes too large, thus discouraging drastic policy updates.
    • KL Constraint: Alternatively, PPO can enforce a KL constraint, where the algorithm aims to keep the KL divergence below a predefined threshold. If this threshold is exceeded, the algorithm modifies its updates to reduce the divergence.


  • Proximal Policy Optimization (PPO) stands out in the realm of reinforcement learning for its innovative approach to policy updates via gradient ascent. Its key innovation is the introduction of a clipped surrogate objective function that judiciously constrains the policy ratio. This mechanism is fundamental in preventing drastic policy shifts and ensuring a smoother, more stable learning progression.
  • PPO is particularly favored for its effectiveness and simplicity across diverse environments, striking a fine balance between policy improvement and stability.
  • The PPO objective function is designed to balance the need for effective policy improvement with the need for training stability. It does this through a clipped surrogate objective function, value function loss, and potentially an entropy bonus. KL divergence, while not a direct part of the basic PPO objective function, is often used in tandem with it to ensure that policy updates do not destabilize the learning process, either by penalizing large changes or by enforcing a constraint on the extent of change allowed between policy updates.

Direct Preference Optimization (DPO)

  • Large-scale unsupervised language models (LMs) acquire extensive world knowledge and reasoning skills, but precisely controlling their behavior is challenging due to their unsupervised training nature. Traditionally, methods like Reinforcement Learning from Human Feedback (RLHF), discussed earlier in this article, are used to steer these models, involving two stages: training a reward model based on human preference labels and then fine-tuning the LM to align with these preferences using reinforcement learning (RL). However, RLHF presents complexities and instability issues, necessitating fitting a reward model and then training a policy to optimize this reward, which is prone to stability concerns.
  • Proposed in Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al. from Stanford in 2023, Direct Preference Optimization (DPO) is a novel approach that simplifies and enhances the aforementioned process. DPO leverages a mathematical relationship between optimal policies and reward functions, demonstrating that the constrained reward maximization problem in RLHF can be optimized more effectively with a single stage of policy training. DPO redefines the RLHF objective by showing that the reward can be rewritten purely as a function of policy probabilities, allowing the LM to implicitly define both the policy and the reward function. This innovation eliminates the need for a separate reward model and the complexities of RL.
  • This paper introduces a novel algorithm that gets rid of the two stages of RL, namely - fitting a reward model, and training a policy to optimize the reward via sampling. The second stage is particularly hard to get right due to stability concerns, which DPO obliterates. The way it works is, given a dataset of the form <prompt, worse completion, better completion>, you train your LLM using a new loss function which essentially encourages it to increase the likelihood of the better completion and decrease the likelihood of the worse completion, weighted by how much higher the implicit reward model. This method obviates the need for an explicit reward model, as the LLM itself acts as a reward model. The key advantage is that it’s a straightforward loss function optimized using backpropagation.
  • The stability, performance, and computational efficiency of DPO are significant improvements over traditional methods. It eliminates the need for sampling from the LM during fine-tuning, fitting a separate reward model, or extensive hyperparameter tuning.
  • The figure below from the paper illustrates that DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL.

  • Experiments demonstrate that DPO can fine-tune LMs to align with human preferences as effectively, if not more so, than traditional RLHF methods. It notably surpasses RLHF in controlling the sentiment of generations and enhances response quality in tasks like summarization and single-turn dialogue. Its implementation and training processes are substantially simpler.
  • In essence, DPO represents a groundbreaking shift in training language models to align with human preferences. It consolidates the two-stage process of RLHF into a single, efficient end-to-end policy learning approach. By reparameterizing the reward function and unifying policy learning and reward modeling into one streamlined optimization process, DPO offers a more efficient and lightweight method for training language models to match human preferences.


Aspect Proximal Policy Optimization (PPO) Direct Preference Optimization (DPO)
Objective Maximizes expected reward while preventing large policy updates (clipped objective function). Directly optimizes policy based on human preferences, using a binary cross-entropy objective.
Input States and rewards from the environment. States from the environment and human preference feedback.
Output Actions to be taken in the environment. Actions to be taken in the environment, aligned with human preferences.
Learning Mechanism Policy gradients with a clipped surrogate objective to update policy and value networks. Binary cross-entropy optimization on human preference data, updating a single policy network.
Network Components Separate policy and value networks. A single policy network.
Feedback Mechanism Uses rewards from the environment as feedback for learning. Uses human preference data as direct feedback for learning.
Stability Clipping mechanism in objective function to maintain stability in policy updates. Inherent stability by directly optimizing preferences with dynamic per-example importance weighting.
Complexity More complex due to dual network structure and balancing reward maximization with policy update stability. Simpler, as it bypasses explicit reward modeling and directly optimizes policy from human preferences.
Applicability Suitable for a wide range of RL environments where reward signals are available. Particularly effective in scenarios where aligning with human preferences is crucial.

Reinforcement Learning with AI Feedback (RLAIF)

  • RLAIF uses AI-generated preferences instead of human annotators. It leverages a powerful LLM to generate these preferences, offering a cost-effective and efficient alternative to human-generated feedback.
  • RLAIF operates by using a pre-trained LLMs to generate feedback for training another LLM. Essentially, the feedback-generating LLM serves as a stand-in for human annotators. This model evaluates and provides preferences or feedback on the outputs of the LLM being trained, guiding its learning process.
  • The feedback is used to optimize the LLM’s performance for specific tasks like summarization or dialogue generation. This method enables efficient scaling of the training process while maintaining or improving the model’s performance compared to methods relying on human feedback.

RLHF in Llama 2

  • RLHF in Llama 2 works through a multi-stage process that integrates both human and model-generated feedback to refine the performance of language models. Here’s how it functions:
    1. Pretraining (RLHF Step 1): Llama 2 undergoes initial pretraining with large amounts of data through self-supervised learning. This stage lays the foundation for the model by enabling it to understand language patterns and context.
    2. Supervised Fine-Tuning (RLHF Step 1): The model then undergoes supervised fine-tuning with instruction data, where it is trained to respond to prompts in ways that align with specific instructions.
    3. Reward Models Creation (RLHF Step 2): Two separate reward models are created using human preference data – one for helpfulness and one for safety. These models are trained to predict which of two responses is better based on human judgments.
    4. Margin Loss and Ranking: Unlike the previous approach that generates multiple outputs and uses a “k choose 2” comparison method, Llama 2’s dataset is based on binary comparisons, and each labeler is presented with only two responses at a time. A margin label is collected alongside binary ranks to indicate the degree of preference, which can inform the ranking loss calculation.
    5. Fine-Tuning with Reward Models (RLHF Step 2): The fine-tuning process utilizes the reward models to continue training the Llama 2 model, integrating feedback on safety and helpfulness into the model’s responses.
    6. Rejection Sampling and PPO (RLHF Step 3): Finally, Llama 2 employs rejection sampling and Proximal Policy Optimization (PPO). Rejection sampling is used to draw multiple outputs and select the one with the highest reward for the gradient update. PPO is then used to fine-tune the model further, following up on the previous models updated via rejection sampling.
  • The image below (source) showing how Llama 2 leverages RLHF.

Bias Concerns and Mitigation Strategies

  • A fair question to ask now is if RLHF can add bias to the model. This is an important topic as large conversational language models are being deployed in various applications from search engines (Bing Chat, Google’s Bard) to word documents (Microsoft office co-pilot, Google docs, Notion, etc.).
  • The answer is, yes, just as with any machine learning approach with human input, RLHF has the potential to introduce bias.
  • Let’s look at the different forms of bias it can introduce:
    • Selection bias:
      • RLHF relies on feedback from human evaluators, who may have their own biases and preferences (and can thus limit their feedback to topics or situations they can relate to). As such, the agent may not be exposed to the true range of behaviors and outcomes that it will encounter in the real world.
    • Confirmation bias:
      • Human evaluators may be more likely to provide feedback that confirms their existing beliefs or expectations, rather than providing objective feedback based on the agent’s performance.
      • This can lead to the agent being reinforced for certain behaviors or outcomes that may not be optimal or desirable in the long run.
    • Inter-rater variability:
      • Different human evaluators may have different opinions or judgments about the quality of the agent’s performance, leading to inconsistency in the feedback that the agent receives.
      • This can make it difficult to train the agent effectively and can lead to suboptimal performance.
    • Limited feedback:
      • Human evaluators may not be able to provide feedback on all aspects of the agent’s performance, leading to gaps in the agent’s learning and potentially suboptimal performance in certain situations.
  • Now that we’ve seen the different types of bias possible with RLHF, lets look at ways to mitigate them:
    • Diverse evaluator selection:
      • Selecting evaluators with diverse backgrounds and perspectives can help to reduce bias in the feedback, just as it does in the workplace.
      • This can be achieved by recruiting evaluators from different demographic groups, regions, or industries.
    • Consensus evaluation:
      • Using consensus evaluation, where multiple evaluators provide feedback on the same task, can help to reduce the impact of individual biases and increase the reliability of the feedback.
      • This is almost like ‘normalizing’ the evaluation.
    • Calibration of evaluators:
      • Calibrating evaluators by providing them with training and guidance on how to provide feedback can help to improve the quality and consistency of the feedback.
    • Evaluation of the feedback process:
      • Regularly evaluating the feedback process, including the quality of the feedback and the effectiveness of the training process, can help to identify and address any biases that may be present.
    • Evaluation of the agent’s performance:
      • Regularly evaluating the agent’s performance on a variety of tasks and in different environments can help to ensure that it is not overfitting to specific examples and is capable of generalizing to new situations.
    • Balancing the feedback:
      • Balancing the feedback from human evaluators with other sources of feedback, such as self-play or expert demonstrations, can help to reduce the impact of bias in the feedback and improve the overall quality of the training data.

Use cases

  • Now let’s look at how different papers have used this methodology with their own tweaks (source).
  • The latest LLMs like ChatGPT tend to use RLHF for finetuning instead of supervised learning.
  • Anthropic (Constitutional AI):
    • The initial policy they use for RLHF has context distillation that helps improve helpfulness, honesty, and harmlessness (HHH).
    • Preference model pretraining (PMP): fine-tunes LM on dataset of binary ranking.
  • OpenAI (InstructGPT, ChatGPT):
    • Pioneered using RLHF.
    • Both Instruct and ChatGPT first finetune the model via Supervised Learning then update it with RLHF.
    • Human generated initial LM training text, then trains the RL policy to match this.
    • Extensively uses human annotation.
    • Uses PPO.
  • DeepMind (A2C):
    • Does not use PPO, uses advantage actor-critic (A2C) instead the algorithm.
    • Trains on different rules and preferences as well as trains on things the model should not do.

Relevant publications

OpenAI’s Paper on InstructGPT

  • Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.
  • Ouyang et al. (2022) from OpenAI introduces InstructGPT, a model that aligns language models with user intent on a wide range of tasks by fine-tuning with human feedback.
  • Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, they collect a dataset of labeler demonstrations of the desired model behavior, which they use to fine-tune GPT-3 using supervised fine-tuning (SFT). This process is referred to as “instruction tuning” by other papers such as Wei et al. (2022).
  • They then collect a dataset of rankings of model outputs, which they use to further fine-tune this supervised model using Reinforcement Learning from Human Feedback (RLHF).
  • In human evaluations on their prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.
  • Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, their results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
  • It is important to note that ChatGPT is trained using the same methods as InstructGPT (using SFT followed by RLHF), but is fine-tuned from a model in the GPT-3.5 series.
  • Furthermore, the fine-tuning process proposed in the paper isn’t without its challenges. First, we need a significant volume of demonstration data. For instance, in the InstructGPT paper, they used 13k instruction-output samples for supervised fine-tuning, 33k output comparisons for reward modeling, and 31k prompts without human labels as input for RLHF. Second, fine-tuning comes with an alignment tax “negative transfer” – the process can lead to lower performance on certain critical tasks. (There’s no free lunch after all.) The same InstructGPT paper found that RLHF led to performance regressions (relative to the GPT-3 base model) on public NLP tasks like SQuAD, HellaSwag, and WMT 2015 French to English. A potential workaround is to have several smaller, specialized models that excel at narrow tasks.
  • The figure below from the paper illustrates the three steps of training InstructGPT: (1) SFT, (2) reward model training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train the respective model in the diagram. In Step 2, boxes A-D are samples from the SFT model that get ranked by labelers.

Constitutional AI: Harmlessness from AI Feedback

  • The paper extends Reinforcement Learning from Human Feedback (RLHF) by training language models on datasets labeled for helpfulness and harmlessness. It introduces ‘HH’ models, which are trained on both criteria and have shown to be more harmless and better at following instructions than models trained on helpfulness alone.
  • An evaluation of these models’ ability to identify harmful behavior in language model interactions was conducted using a set of conversations rated for harmfulness. The study leveraged ‘red teaming’ where humans attempted to provoke the AI into harmful responses, thereby improving the training process.
  • The effectiveness of the training method was demonstrated through models’ performance on questions assessing helpfulness, honesty, and harmlessness, without relying on human labels for harmlessness.
  • This research aligns with other efforts like LaMDA and InstructGPT, which also utilize human data to train language models. The concept of ‘constitutional AI’ was introduced, focusing on self-critique and revision by the AI to foster both harmless and helpful interactions. The ultimate goal is to create AI that can self-regulate harmfulness while remaining helpful and responsive.

OpenAI’s Paper on PPO

  • Schulman et al. (2017) proposes a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent.
  • Whereas standard policy gradient methods perform one gradient update per data sample, they propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which they call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).
  • Their experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, showing that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall clock time.

Anthropic’s Paper on Constitutional AI

  • As AI systems become more capable, we would like to enlist their help to supervise other AIs.
  • Bai et al. (2022) experiments with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so they refer to the method as ‘Constitutional AI’.
  • The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase they sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, they sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences.
  • They then train with RL using the preference model as the reward signal, i.e. they use ‘RL from AI Feedback’ (RLAIF). As a result they are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
  • The figure below from the paper shows the basic steps of their Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a Reinforcement Learning (RL) stage, shown as the sequence of steps at the bottom of the figure. Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model, and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.

  • The graph below shows harmlessness versus helpfulness Elo scores (higher is better, only differences are meaningful) computed from crowdworkers’ model comparisons for all 52B RL runs. Points further to the right are later steps in RL training. The Helpful and HH models were trained with human feedback as in [Bai et al., 2022], and exhibit a tradeoff between helpfulness and harmlessness. The RL-CAI models trained with AI feedback learn to be less harmful at a given level of helpfulness. The crowdworkers evaluating these models were instructed to prefer less evasive responses when both responses were equally harmless; this is why the human feedback-trained Helpful and HH models do not differ more in their harmlessness scores.

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

  • This paper by Lee et al. from Google Research, introduces a novel method for training large language models (LLMs) with AI-generated feedback, addressing the challenges and costs associated with traditional human feedback methods.
  • The paper presents Reinforcement Learning from AI Feedback (RLAIF) as a promising alternative to the conventional Reinforcement Learning from Human Feedback (RLHF). RLAIF utilizes an off-the-shelf LLM as a preference labeler, streamlining the training process and, in some cases, surpassing the performance of models trained with human feedback.
  • This approach is applied to text generation tasks such as summarization, helpful dialogue generation, and harmless dialogue generation. The performance of RLAIF, as assessed by human raters, is comparable or superior to RLHF, challenging the assumption that larger policy models are always more effective.
  • A key advantage of RLAIF is its potential to significantly reduce reliance on expensive human annotations. The study shows the efficacy of using the same model size for both the LLM labeler and the policy model, and highlights that directly prompting the LLM for reward scores can be more effective than using a distilled reward model.
  • The authors explore methodologies for generating AI preferences aligned with human values, emphasizing the effectiveness of chain-of-thought reasoning and detailed preamble in improving AI labeler alignment.
  • The following figure from the paper shows a diagram depicting RLAIF (top) vs. RLHF (bottom).

  • RLAIF’s scalability and cost-effectiveness are notable, with the approach being over ten times cheaper than human annotation. This aligns with the growing trend in LLM research focusing on quality over quantity in datasets.
  • The paper suggests that combining RLHF and RLAIF could be a strategic approach, especially considering that LLMs like GPT-4 have been trained with human feedback. This hybrid model could represent a balanced integration of high-quality human data, amplified significantly by AI, potentially shaping the future of LLM training and influencing approaches like the development of GPT-5.

Diffusion Model Alignment Using Direct Preference Optimization

  • This paper by Wallace et al. from Salesforce AI and Stanford University proposes a novel method for aligning diffusion models to human preferences.
  • The paper introduces Diffusion-DPO, a method adapted from Direct Preference Optimization (DPO), for aligning text-to-image diffusion models with human preferences. This approach is a significant shift from typical language model training, emphasizing direct optimization on human comparison data.
  • Unlike typical methods that fine-tune pre-trained models using curated images and captions, Diffusion-DPO directly optimizes a policy that best satisfies human preferences under a classification objective. It re-formulates DPO to account for a diffusion model notion of likelihood using the evidence lower bound, deriving a differentiable objective.
  • The authors utilized the Pick-a-Pic dataset, comprising 851K crowdsourced pairwise preferences, to fine-tune the base model of the Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. The fine-tuned model showed significant improvements over both the base SDXL-1.0 and its larger variant in terms of visual appeal and prompt alignment, as evaluated by human preferences.
  • The paper also explores a variant of the method that uses AI feedback, showing comparable performance to training on human preferences. This opens up possibilities for scaling diffusion model alignment methods.
  • The figure below from paper illustrates: (Top) DPO-SDXL significantly outperforms SDXL in human evaluation. (L) PartiPrompts and (R) HPSv2 benchmark results across three evaluation questions, majority vote of 5 labelers. (Bottom) Qualitative comparisons between SDXL and DPO-SDXL. DPOSDXL demonstrates superior prompt following and realism. DPO-SDXL outputs are better aligned with human aesthetic preferences, favoring high contrast, vivid colors, fine detail, and focused composition. They also capture fine-grained textual details more faithfully.

  • Experiments demonstrate the effectiveness of Diffusion-DPO in various scenarios, including image-to-image editing and learning from AI feedback. The method significantly outperforms existing models in human evaluations for general preference, visual appeal, and prompt alignment.
  • The paper’s findings indicate that Diffusion-DPO can effectively increase measured human appeal across an open vocabulary with stable training, without increased inference time, and improves generic text-image alignment.
  • The authors note ethical considerations and risks associated with text-to-image generation, emphasizing the importance of diverse and representative sets of labelers and the potential biases inherent in the pre-trained models and labeling process.
  • In summary, the paper presents a groundbreaking approach to align diffusion models with human preferences, demonstrating notable improvements in visual appeal and prompt alignment. It highlights the potential of direct preference optimization in the realm of text-to-image diffusion models and opens avenues for further research and application in this field.

Nash Learning from Human Feedback

  • This paper by Munos et al. from Google DeepMind introduces an alternative approach to the conventional Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human preferences. This new approach, termed Nash Learning from Human Feedback (NLHF), focuses on learning a preference model from pairwise human feedback and pursuing a policy that generates responses preferred over any competing policy, thus achieving a Nash equilibrium for this preference model.
  • The NLHF approach aims to encompass a broader spectrum of human preferences, maintain policy independence, and better align with the diversity of human preferences. This method marks a significant shift from the traditional RLHF framework, which is more limited in capturing the richness and diversity of human preferences.
  • Key contributions of this work include the introduction and definition of a regularized variant of the preference model, the establishment of the existence and uniqueness of the corresponding Nash equilibrium, and the introduction of novel algorithms such as Nash-MD and Nash-EMA. Nash-MD, founded on mirror descent principles, converges to the Nash equilibrium without requiring the storage of past policies, making it particularly suitable for LLMs. Nash-EMA, inspired by fictitious play, uses an exponential moving average of past policy parameters. The paper also introduces policy-gradient algorithms Nash-MD-PG and Nash-EMA-PG for deep learning architectures. Extensive numerical experiments conducted on a text summarization task using the TL;DR dataset validate the effectiveness of the NLHF approach.
  • The regularized preference model in NLHF uses KL-regularization to quantify the divergence between the policy under consideration and a reference policy. This regularization is particularly crucial in situations where the preference model is more accurately estimated following a given policy or where it is essential to remain close to a known safe policy.
  • In terms of implementation, the paper explores gradient-based algorithms for deep learning architectures, focusing on computing the Nash equilibrium of a preference model. This exploration emphasizes the applicability of these algorithms in the context of LLMs.