• Retrieval-Augmented Generation (RAG) is a technique that enhances language model generation by incorporating external knowledge.
  • This is typically done by retrieving relevant information from a large corpus of documents and using that information to inform the generation process.
  • Let’s delve into the specifics below.


  • In numerous instances, clients possess extensive proprietary documents, such as technical manuals, and require the extraction of specific information from this voluminous content. This task can be likened to locating a needle in a haystack.
  • Recently, OpenAI introduced a novel model, GPT4-Turbo, which boasts the capability to process large documents, potentially addressing this need. However, this model is not entirely efficient due to the “Lost In The Middle” phenomenon. This phenomenon mirrors the experience where, akin to reading the Bible in its entirety but struggling to recall what follows the Book of Samuel, the model tends to forget content located towards the middle of its contextual window.
  • To circumvent this limitation, an alternative approach known as Retrieval-Augmented-Generation (RAG) has been developed. This method involves creating an index for every paragraph in the document. When a query is made, the most pertinent paragraphs are swiftly identified and subsequently fed into a Large Language Model (LLM) like GPT4. This strategy of providing only select paragraphs, as opposed to the entire document, prevents information overload within the LLM and significantly enhances the quality of the results.

Neural Retrieval

  • Before we jump into RAG, let’s take a moment to talk about neural retrievers holistically.
  • Neural retrievers are a type of information retrieval model that uses neural networks to match queries to relevant documents. They encode the query and documents into dense vector representations and compute similarity scores between them. This allows them to go beyond lexical matching and capture semantic relevance.
  • Neural retrievers represent a significant shift from traditional keyword-based information retrieval systems to ones that can understand the underlying meanings and relationships in textual data. Here’s an expanded explanation of how they work and their significance:
  • Here’s how they generally operate:
    1. Vector Encoding:
      • Both queries and documents are transformed into vectors in a high-dimensional space. This process is done by neural network-based encoders that have been trained to capture the semantic essence of text.
      • During training, these models are often exposed to vast amounts of text, allowing them to learn complex patterns and relationships between words and phrases.
    2. Semantic Matching:
      • The similarity between query and document vectors is calculated using measures such as cosine similarity. This allows the system to determine which documents are most relevant to a query, based on the content’s meaning rather than just keyword overlap.
      • This process can capture nuanced relationships, like synonyms or related concepts, that traditional methods might miss.
  • Advantages of Neural Retrievers:
    • Neural retrievers can understand the context in which terms are used, allowing for more accurate retrieval when queries or documents have ambiguous or multiple meanings.
    • They are adept at dealing with long and complex queries because they can grasp the overall intent rather than just isolated terms.
    • Many neural retrievers are trained on multilingual datasets, enabling them to handle queries in different languages effectively.
  • Challenges and Considerations:
    • Neural models, particularly those used for encoding large documents, require significant computational power for both training and inference.
    • The performance of neural retrievers depends heavily on the data they are trained on, and they may inherit biases present in the training data.
    • Keeping the document representations current is a challenge, especially for dynamically changing content.

The Retrieval Augmented Generation (RAG) Pipeline

  • With RAG, the LLM is able to leverage knowledge and information that is not necessarily in its weights by providing it access to external knowledge sources such as databases.
  • It leverages a retriever to find relevant contexts to condition the LLM, in this way, RAG is able to augment the knowledge-base of an LLM with relevant documents.
  • The retriever here could be any of the following depending on the need for semantic retrieval or not:
    • Vector database: Typically, queries are embedded using models like BERT for generating dense vector embeddings. Alternatively, traditional methods like TF-IDF can be used for sparse embeddings. The search is then conducted based on term frequency or semantic similarity.
    • Graph database: Constructs a knowledge base from extracted entity relationships within the text. This approach is precise but may require exact query matching, which could be restrictive in some applications.
    • Regular SQL database: Offers structured data storage and retrieval but might lack the semantic flexibility of vector databases.
  • The image below from Damien Benveniste, PhD talks a bit about the difference between using Graph vs Vector database for RAG.

  • In his post linked above, Damien states that Graph Databases are favored for Retrieval Augmented Generation (RAG) when compared to Vector Databases. While Vector Databases partition and index data using LLM-encoded vectors, allowing for semantically similar vector retrieval, they may fetch irrelevant data.
  • Graph Databases, on the other hand, build a knowledge base from extracted entity relationships in the text, making retrievals concise. However, it requires exact query matching which can be limiting.
  • A potential solution could be to combine the strengths of both databases: indexing parsed entity relationships with vector representations in a graph database for more flexible information retrieval. It remains to be seen if such a hybrid model exists.

  • After retrieving, you may want to look into filtering the candidates further by adding ranking and/or fine ranking layers that allow you to filter down candidates that do not match your business rules, are not personalized for the user, current context, or response limit.
  • Let’s succinctly summarize the process of RAG and then delve into its pros and cons:
    1. Vector Database Creation: RAG starts by converting an internal dataset into vectors and storing them in a vector database (or a database of your choosing).
    2. User Input: A user provides a query in natural language, seeking an answer or completion.
    3. Information Retrieval: The retrieval mechanism scans the vector database to identify segments that are semantically similar to the user’s query (which is also embedded). These segments are then given to the LLM to enrich its context for generating responses.
    4. Combining Data: The chosen data segments from the database are combined with the user’s initial query, creating an expanded prompt.
    5. Generating Text: The enlarged prompt, filled with added context, is then given to the LLM, which crafts the final, context-aware response.
  • The image below (source) displays the high-level working of RAG.

Benefits of RAG

  • So why should you use RAG for your application?
    • With RAG, the LLM is able to leverage knowledge and information that is not necessarily in its weights by providing it access to external knowledge bases.
    • RAG doesn’t require model retraining, saving time and computational resources.
    • It’s effective even with a limited amount of labeled data.
    • However, it does have its drawbacks, namely RAG’s performance depends on the comprehensiveness and correctness of the retriever’s knowledge base.
    • RAG is best suited for scenarios with abundant unlabeled data but scarce labeled data and is ideal for applications like virtual assistants needing real-time access to specific information like product manuals.
      • Scenarios with abundant unlabeled data but scarce labeled data: RAG is useful in situations where there is a lot of data available, but most of it is not categorized or labeled in a way that’s useful for training models. As an example, the internet has vast amounts of text, but most of it isn’t organized in a way that directly answers specific questions.
      • Furthermore, RAG is ideal for applications like virtual assistants: Virtual assistants, like Siri or Alexa, need to pull information from a wide range of sources to answer questions in real-time. They need to understand the question, retrieve relevant information, and then generate a coherent and accurate response.
      • Needing real-time access to specific information like product manuals: This is an example of a situation where RAG models are particularly useful. Imagine you ask a virtual assistant a specific question about a product, like “How do I reset my XYZ brand thermostat?” The RAG model would first retrieve relevant information from product manuals or other resources, and then use that information to generate a clear, concise answer.
      • In summary, RAG models are well-suited for applications where there’s a lot of information available, but it’s not neatly organized or labeled.
  • Below, let’s take a look at the publication that introduced RAG and how the original paper implemented the framework.

RAG vs. Fine-tuning

  • The table below (source) compares RAG vs. fine-tuning.

  • To summarize the above table:
    1. Retrieval systems (RAG) give LLM systems access to factual, access-controlled, timely information. Fine tuning can not do this, so there’s no competition.
    2. Fine tuning (not RAG) adapts the style, tone, and vocabulary of LLMs so that your linguistic “paint brush” matches the desired domain and style
    3. All in all, focus on RAG first. A successful LLM application must connect specialized data to the LLM workflow. Once you have a first full application working, you can add fine tuning to improve the style and vocabulary of the system. Fine tuning will not save you if the RAG connection to data is built improperly.

Ensemble of RAG

  • Leveraging an ensemble of RAG systems offers a substantial upgrade to the model’s ability to produce rich and contextually accurate text. Here’s an enhanced breakdown of how this procedure could work:
    • Knowledge sources: RAG models retrieve information from external knowledge stores to augment their knowledge in a particular domain. These can include passages, tables, images, etc. from domains like Wikipedia, books, news, databases.
    • Combining sources: At inference time, multiple retrievers can pull relevant content from different corpora. For example, one retriever searches Wikipedia, another searches news sources. Their results are concatenated into a pooled set of candidates.
    • Ranking: The model ranks the pooled candidates by their relevance to the context.
    • Selection: Highly ranked candidates are selected to condition the language model for generation.
    • Ensembling: Separate RAG models specialized on different corpora can be ensembled. Their outputs are merged, ranked, and voted on.
  • Multiple knowledge sources can augment RAG models through pooling and ensembles. Careful ranking and selection helps integrate these diverse sources for improved generation.
  • One thing to keep in mind when using multiple retrievers is to rank the different outputs from each retriever before merging them to form a response. This can be done in a variety of ways, using LTR algorithms, multi-armed bandit framework, multi-objective optimization, or according to specific business use cases.

Choosing a Vector DB using a Feature Matrix

  • To compare the plethora of Vector DB offerings, a feature matrix that highlights the differences between Vector DBs and which to use in which scenario is essential.
  • Vector DB Comparison by VectorHub offers a great comparison spanning 37 vendors and 29 features (as of this writing).

  • As a secondary resource, the following table (source) shows a comparison of some of the prevalent Vector DB offers along various feature dimensions:

  • Access the full spreadsheet here.

Building a RAG pipeline

  • The image below (source), gives a visual overview of the three different steps of RAG: Ingestion, Retrieval, and Synthesis/Response Generation.

  • In the sections below, we will go over these key areas.



  • Chunking is the process of dividing the prompts and/or the documents to be retrieved, into smaller, manageable segments or chunks. These chunks can be defined either by a fixed size, such as a specific number of characters, sentences or paragraphs.
  • In RAG, each chunk is encoded into an embedding vector for retrieval. Smaller, more precise chunks lead to a finer match between the user’s query and the content, enhancing the accuracy and relevance of the information retrieved.
  • Larger chunks might include irrelevant information, introducing noise and potentially reducing the retrieval accuracy. By controlling the chunk size, RAG can maintain a balance between comprehensiveness and precision.
  • So the next natural question that comes up is, how do you choose the right chunk size for your use case? The choice of chunk size in RAG is crucial. It needs to be small enough to ensure relevance and reduce noise but large enough to maintain the context’s integrity. Let’s look at a few methods below referred from Pinecone:
    • Fixed-size chunking: Simply decide the number of tokens in our chunk along with whether there should be overlap between them or not. Overlap between chunks guarantees there to be minimal semantic context loss between chunks. This option is computationally cheap and simple to implement.
      text = "..." # your text
      from langchain.text_splitter import CharacterTextSplitter
      text_splitter = CharacterTextSplitter(
          separator = "\n\n",
          chunk_size = 256,
          chunk_overlap  = 20
      docs = text_splitter.create_documents([text])
    • Context-aware chunking: Content-aware chunking leverages the intrinsic structure of the text to create chunks that are more meaningful and contextually relevant. Here are several approaches to achieving this:
      1. Sentence Splitting This method aligns with models optimized for embedding sentence-level content. Different tools and techniques can be used for sentence splitting:
        • Naive Splitting: A basic method where sentences are split using periods and new lines. Example:
             text = "..."  # Your text
             docs = text.split(".")
          • This method is quick but may overlook complex sentence structures.
        • NLTK (Natural Language Toolkit): A comprehensive Python library for language processing. NLTK includes a sentence tokenizer that effectively splits text into sentences. Example:
          text = "..."  # Your text
          from langchain.text_splitter import NLTKTextSplitter
          text_splitter = NLTKTextSplitter()
          docs = text_splitter.split_text(text)
        • spaCy: An advanced Python library for NLP tasks, spaCy offers efficient sentence segmentation. Example:
          text = "..."  # Your text
          from langchain.text_splitter import SpacyTextSplitter
          text_splitter = SpacyTextSplitter()
          docs = text_splitter.split_text(text)
      2. Recursive Chunking: Recursive chunking is an iterative method that splits text hierarchically using various separators. It adapts to create chunks of similar size or structure by recursively applying different criteria. Example using LangChain:
           text = "..."  # Your text
           from langchain.text_splitter import RecursiveCharacterTextSplitter
           text_splitter = RecursiveCharacterTextSplitter(
               chunk_size = 256,
               chunk_overlap = 20
           docs = text_splitter.create_documents([text])
      3. Specialized Chunking: For formatted content like Markdown or LaTeX, specialized chunking can be applied to maintain the original structure:
        • Markdown Chunking: Recognizes Markdown syntax and divides content based on structure. Example:
          from langchain.text_splitter import MarkdownTextSplitter
          markdown_text = "..."
          markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
          docs = markdown_splitter.create_documents([markdown_text])
        • LaTeX Chunking: Parses LaTeX commands and environments to chunk content while preserving its logical organization.
  • “As a rule of thumb, if the chunk of text makes sense without the surrounding context to a human, it will make sense to the language model as well. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.” (source)


  • Once you have your prompt chunked appropriately, the next step is to embedd it. Embedding prompts and documents in RAG involves transforming both the user’s query (prompt) and the documents in the knowledge base into a format that can be effectively compared for relevance. This process is critical for RAG’s ability to retrieve the most relevant information from its knowledge base in response to a user query. Here’s how it typically works:
  • One option to help pick which embedding model would be best suited for your task is to look at HuggingFace’s Massive Text Embedding Benchmark (MTEB) leaderboard.. There is a question of whether a dense or sparse embedding can be used so let’s look into benefits of each below:
  • Sparse embedding: Sparse embeddings such as TF-IDF are great for lexical matching the prompt with the documents. Best for applications where keyword relevance is crucial. It’s computationally less intensive but may not capture the deeper semantic meanings in the text.
  • Semantic embedding: Semantic embeddings, such as BERT or SentenceBERT lend themselves naturally to the RAG usecase.
    • BERT: Suitable for capturing contextual nuances in both the documents and queries. Requires more computational resources compared to sparse embeddings but offers more semantically rich embeddings.
    • SentenceBERT: Ideal for scenarios where the context and meaning at the sentence level are important. It strikes a balance between the deep contextual understanding of BERT and the need for concise, meaningful sentence representations. This is usually the preferred route for RAG.
Sentence Embeddings: The What and Why
Background: Differences compared to Token-Level Models like BERT
  • As an overview, let’s look into how sentence transformers differ compared to token-level embedding models such as BERT.
  • Sentence Transformers are a modification of the traditional BERT model, tailored specifically for generating embeddings of entire sentences (i.e., sentence embeddings). The key differences in their training approaches are:
    1. Objective: BERT is trained to predict masked words in a sentence and next sentence prediction. It’s optimized for understanding words and their context within a sentence. Sentence Transformers, on the other hand, are trained specifically to understand the meaning of entire sentences. They generate embeddings where sentences with similar meanings are close in the embedding space.
    2. Level of Embedding: The primary difference lies in the level of embedding. BERT provides embeddings for each token (word or subword) in a sentence, whereas sentence transformers provide a single embedding for the entire sentence.
    3. Training Data and Tasks: While BERT is primarily trained on large text corpora with tasks focused on understanding words in context, Sentence Transformers are often trained on data sets that include sentence pairs. This training focuses on similarity and relevance, teaching the model how to understand and compare the meanings of entire sentences.
    4. Siamese and Triplet Network Structures: Sentence Transformers often use Siamese or Triplet network structures. These networks involve processing pairs or triplets of sentences and adjusting the model so that similar sentences have similar embeddings, and dissimilar sentences have different embeddings. This is different from BERT’s training, which does not inherently involve direct comparison of separate sentences.
    5. Fine-Tuning for Specific Tasks: Sentence Transformers are often fine-tuned on specific tasks like semantic similarity, paraphrase identification, or information retrieval. This fine-tuning is more focused on sentence-level understanding as opposed to BERT, which might be fine-tuned for a wider range of NLP tasks like question answering, sentiment analysis, etc., focusing on word or phrase-level understanding.
    6. Applicability: BERT and similar models are more versatile for tasks that require understanding at the token level (like named entity recognition, question answering), whereas sentence transformers are more suited for tasks that rely on sentence-level understanding (like semantic search, sentence similarity).
    7. Efficiency in Generating Sentence Embeddings or Similarity Tasks: In standard BERT, generating sentence embeddings usually involves taking the output of one of the hidden layers (often the first token, [CLS]) as a representation of the whole sentence. However, this method is not always optimal for sentence-level tasks. Sentence Transformers are specifically optimized to produce more meaningful and useful sentence embeddings and are thus more efficient for tasks involving sentence similarity computations. Since they produce a single vector per sentence, computing similarity scores between sentences is computationally less intensive compared to token-level models.
  • In summary, while BERT is a general-purpose language understanding model with a focus on word-level contexts, Sentence Transformers are adapted specifically for understanding and comparing the meanings of entire sentences, making them more effective for tasks that require sentence-level semantic understanding.
  • Let’s look into how sentence transformers trained differently compared to token-level embedding models such as BERT.
  • Sentence transformers are trained to generate embeddings at the sentence level, which is a distinct approach from token-level embedding models like BERT. Here’s an overview of their training and how it differs from token-level models:
    1. Model Architecture: Sentence transformers often start with a base model similar to BERT or other transformer architectures. However, the focus is on outputting a single embedding vector for the entire input sentence, rather than individual tokens.
    2. Training Data: They are trained on a variety of datasets, often including pairs or groups of sentences where the relationship (e.g., similarity, paraphrasing) between the sentences is known.
    3. Training Objectives: BERT is pre-trained on objectives like masked language modeling (predicting missing words) and next sentence prediction, which are focused on understanding the context at the token level. Sentence transformers, on the other hand, are trained specifically to understand the context and relationships at the sentence level. Their training objective is typically to minimize the distance between embeddings of semantically similar sentences while maximizing the distance between embeddings of dissimilar sentences. This is achieved through contrastive loss functions like triplet loss, cosine similarity loss, etc.
    4. Output Representation: In BERT, the sentence-level representation is typically derived from the embedding of a special token (like [CLS]) or by pooling token embeddings (and averaging, MaxPooling, or concatenating them). Sentence transformers are designed to directly output a meaningful sentence-level representation.
    5. Fine-tuning for Downstream Tasks: Sentence transformers can be fine-tuned on specific tasks, such as semantic text similarity, where the model learns to produce embeddings that capture the nuanced meaning of entire sentences.
  • In summary, sentence transformers are specifically optimized for producing representations at the sentence level, focusing on capturing the overall semantics of sentences, which makes them particularly useful for tasks involving sentence similarity and clustering. This contrasts with token-level models like BERT, which are more focused on understanding and representing the meaning of individual tokens within their wider context.
Applying Sentence Transformers for RAG
  • Now, let’s look into why sentence transformers are the numero uno choice of models to generate embeddings for RAG.
  • RAG leverages Sentence Transformers for their ability to understand and compare the semantic content of sentences. This integration is particularly useful in scenarios where the model needs to retrieve relevant information before generating a response. Here’s how Sentence Transformers are useful in a RAG setting:
    1. Improved Document Retrieval: Sentence Transformers are trained to generate embeddings that capture the semantic meaning of sentences. In a RAG setting, these embeddings can be used to match a query (like a user’s question) with the most relevant documents or passages in a database. This is critical because the quality of the generated response often depends on the relevance of the retrieved information.
    2. Efficient Semantic Search: Traditional keyword-based search methods might struggle with understanding the context or the semantic nuances of a query. Sentence Transformers, by providing semantically meaningful embeddings, enable more nuanced searches that go beyond keyword matching. This means that the retrieval component of RAG can find documents that are semantically related to the query, even if they don’t contain the exact keywords.
    3. Contextual Understanding for Better Responses: By using Sentence Transformers, the RAG model can better understand the context and nuances of both the input query and the content of potential source documents. This leads to more accurate and contextually appropriate responses, as the generation component of the model has more relevant and well-understood information to work with.
    4. Scalability in Information Retrieval: Sentence Transformers can efficiently handle large databases of documents by pre-computing embeddings for all documents. This makes the retrieval process faster and more scalable, as the model only needs to compute the embedding for the query at runtime and then quickly find the closest document embeddings.
    5. Enhancing the Generation Process: In a RAG setup, the generation component benefits from the retrieval component’s ability to provide relevant, semantically-rich information. This allows the language model to generate responses that are not only contextually accurate but also informed by a broader range of information than what the model itself was trained on.
  • In summary, Sentence Transformers enhance the retrieval capabilities of RAG models with LLMs by enabling more effective semantic search and retrieval of information. This leads to improved performance in tasks that require understanding and generating responses based on large volumes of text data, such as question answering, chatbots, and information extraction.


  • Let’s look at three different types of retrieval: standard, sentence window, and auto-merging. Each of these approaches has specific strengths and weaknesses, and their suitability depends on the requirements of the RAG task, including the nature of the dataset, the complexity of the queries, and the desired balance between specificity and contextual understanding in the responses.

Standard/Naive approach

  • As we see in the image below (source), the standard pipeline uses the same text chunk for indexing/embedding as well as the output synthesis.

In the context of Retrieval-Augmented Generation (RAG) in Large Language Models (LLMs), here are the advantages and disadvantages of the three approaches:

  1. Simplicity and Efficiency: This method is straightforward and efficient, using the same text chunk for both embedding and synthesis, simplifying the retrieval process.
  2. Uniformity in Data Handling: It maintains consistency in the data used across both retrieval and synthesis phases.
  1. Limited Contextual Understanding: LLMs may require a larger window for synthesis to generate better responses, which this approach may not adequately provide.
  2. Potential for Suboptimal Responses: Due to the limited context, the LLM might not have enough information to generate the most relevant and accurate responses.

Sentence-Window Retrieval / Small-to-Large Chunking

  • The sentence-window approach breaks down documents into smaller units, such as sentences or small groups of sentences.
  • It decouples the embeddings for retrieval tasks (which are smaller chunks stored in a Vector DB), but for synthesis it adds back in the context around the retrieved chunks, as seen in the image below (source).

  • During retrieval, we retrieve the sentences that are most relevant to the query via similarity search and replace the sentence with the full surrounding context (using a static sentence-window around the context, implemented by retrieving sentences surrounding the one being originally retrieved) as shown in the figure below (source).

  1. Enhanced Specificity in Retrieval: By breaking documents into smaller units, it enables more precise retrieval of segments directly relevant to a query.
  2. Context-Rich Synthesis: It reintroduces context around the retrieved chunks for synthesis, providing the LLM with a broader understanding to formulate responses.
  3. Balanced Approach: This method strikes a balance between focused retrieval and contextual richness, potentially improving response quality.
  1. Increased Complexity: Managing separate processes for retrieval and synthesis adds complexity to the pipeline.
  2. Potential Contextual Gaps: There’s a risk of missing broader context if the surrounding information added back is not sufficiently comprehensive.

Auto-merging Retriever / Hierarchical Retriever

  • The image below (source), illustrates how auto-merging retrieval can work where it doesn’t retrieve a bunch of fragmented chunks as would happen with the naive approach.

  • The fragmentation in the naive approach would be worse with smaller chunk sizes as shown below (source).

  • Auto-merging retrieval aims to combine (or merge) information from multiple sources or segments of text to create a more comprehensive and contextually relevant response to a query. This approach is particularly useful when no single document or segment fully answers the query but rather the answer lies in combining information from multiple sources.
  • It allows smaller chunks to be merged into bigger parent chunks. It does this via the following steps:
    1. Define a hierarchy of smaller chunks linked to parent chunks.
    2. If the set of smaller chunks linking to a parent chunk exceeds some threshold (say, cosine similarity), then “merge” smaller chunks into the bigger parent chunk.
  • The method will finally retrieve the parent chunk for better context.
  1. Comprehensive Contextual Responses: By merging information from multiple sources, it creates responses that are more comprehensive and contextually relevant.
  2. Reduced Fragmentation: This approach addresses the issue of fragmented information retrieval, common in the naive approach, especially with smaller chunk sizes.
  3. Dynamic Content Integration: It dynamically combines smaller chunks into larger, more informative ones, enhancing the richness of the information provided to the LLM.
  1. Complexity in Hierarchy and Threshold Management: The process of defining hierarchies and setting appropriate thresholds for merging is complex and critical for effective functioning.
  2. Risk of Over-generalization: There’s a possibility of merging too much or irrelevant information, leading to responses that are too broad or off-topic.
  3. Computational Intensity: This method might be more computationally intensive due to the additional steps in merging and managing the hierarchical structure of text chunks.

Figuring out the ideal chunk size

  • Oftentimes when building a RAG applications there are many retrieval parameters/strategies to decide from (from chunk size to vector vs. keyword vs. hybrid search, for instance). Let’s zoom in on the chunk size aspect.
  • Building a RAG system involves determining the ideal chunk sizes for the documents that the retriever component will process. The ideal chunk size depends on several factors:

    1. Data Characteristics: The nature of your data is crucial. For text documents, consider the average length of paragraphs or sections. If the documents are well-structured with distinct sections, these natural divisions might serve as a good basis for chunking.

    2. Retriever Constraints: The retriever model you choose (like BM25, TF-IDF, or a neural retriever like DPR) might have limitations on the input length. It’s essential to ensure that the chunks are compatible with these constraints.

    3. Memory and Computational Resources: Larger chunk sizes can lead to higher memory usage and computational overhead. Balance the chunk size with the available resources to ensure efficient processing.

    4. Task Requirements: The nature of the task (e.g., question answering, document summarization) can influence the ideal chunk size. For detailed tasks, smaller chunks might be more effective to capture specific details, while broader tasks might benefit from larger chunks to capture more context.

    5. Experimentation: Often, the best way to determine the ideal chunk size is through empirical testing. Run experiments with different chunk sizes and evaluate the performance on a validation set to find the optimal balance between granularity and context.

    6. Overlap Consideration: Sometimes, it’s beneficial to have overlap between chunks to ensure that no important information is missed at the boundaries. Decide on an appropriate overlap size based on the task and data characteristics.

  • To summarize, determining the ideal chunk size for a RAG system is a balancing act that involves considering the characteristics of your data, the limitations of your retriever model, the resources at your disposal, the specific requirements of your task, and empirical experimentation. It’s a process that may require iteration and fine-tuning to achieve the best results.
Retriever Ensembling and Reranking
  • Thought: what if we could try a bunch of chunk sizes at once, and have a reranker prune the results?
  • This achieves two purposes:
    • Better (albeit more costly) retrieved results by pooling results from multiple chunk sizes, assuming the reranker has a reasonable level of performance.
    • A way to benchmark different retrieval strategies against each other (w.r.t reranker).
  • The process is as follows:
    • Chunk up the same document in a bunch of different ways, say with chunk sizes: 128, 256, 512, and 1024.
    • During retrieval, we fetch relevant chunks from each retriever, thus ensembling them together for retrieval.
    • Use a reranker to rank/prune results.
  • The following figure (source) delineates the process.

  • Based on evaluation results from LlamaIndex, faithfulness metrics go up slightly for the ensembled approach, indicating retrieved results are slightly more relevant. But pairwise comparisons lead to equal preference for both approaches, making it still questionable as to whether or not ensembling is better.
  • Note that the ensembling strategy can be applied for other aspects of a RAG pipeline too, beyond chunk size, such as vector vs. keyword vs. hybrid search, etc.

Using Approximate Nearest Neighbors for Retrieval

  • The next step is to consider which approximate nearest neighbors (ANN) library to choose from indexing. One option to pick the best option is to look at the leaderboard here.
  • More on ANN can be found in the ANN primer.

Response Generation / Synthesis

  • The last step of the RAG pipeline is to generate responses back to the user. In this step, the model synthesizes the retrieved information with its pre-trained knowledge to generate coherent and contextually relevant responses. This process involves integrating the insights gleaned from various sources, ensuring accuracy and relevance, and crafting a response that is not only informative but also aligns with the user’s original query, maintaining a natural and conversational tone.
  • Note that while creating the expanded prompt (with the retrieved top-\(k\) chunks) for an LLM to make an informed response generation, a strategic placement of vital information at the beginning or end of input sequences could enhance the RAG system’s effectiveness and thus make the system more performant. This is summarized in the paper below.

Lost in the Middle: How Language Models Use Long Contexts

  • While recent language models have the ability to take long contexts as input, relatively little is known about how well the language models use longer context.
  • This paper by Liu et al. from Percy Liang’s lab at Stanford, UC Berkeley, and Samaya AI analyzes language model performance on two tasks that require identifying relevant information within their input contexts: multi-document question answering and key-value retrieval. Put simply, they analyze and evaluate how LLMs use the context by identifying relevant information within it.
  • They tested open-source (MPT-30B-Instruct, LongChat-13B) and closed-source (OpenAI’s GPT-3.5-Turbo and Anthropic’s Claude 1.3) models. They used multi-document question-answering where the context included multiple retrieved documents and one correct answer, whose position was shuffled around. Key-value pair retrieval was carried out to analyze if longer contexts impact performance.
  • They find that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts. In other words, their findings basically suggest that Retrieval-Augmentation (RAG) performance suffers when the relevant information to answer a query is presented in the middle of the context window with strong biases towards the beginning and the end of it.
  • A summary of their learnings is as follows:
    • Best performance when the relevant information is at the beginning.
    • Performance decreases with an increase in context length.
    • Too many retrieved documents harm performance.
    • Improving the retrieval and prompt creation step with a ranking stage could potentially boost performance by up to 20%.
    • Extended-context models (GPT-3.5-Turbo vs. GPT-3.5-Turbo (16K)) are not better if the prompt fits the original context.
  • Considering that RAG retrieves information from an external database – which most commonly contains longer texts that are split into chunks. Even with split chunks, context windows get pretty large very quickly, at least much larger than a “normal” question or instruction. Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models. Their analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context models.
  • “There is no specific inductive bias in transformer-based LLM architectures that explains why the retrieval performance should be worse for text in the middle of the document. I suspect it is all because of the training data and how humans write: the most important information is usually in the beginning or the end (think paper Abstracts and Conclusion sections), and it’s then how LLMs parameterize the attention weights during training.” (source)
  • In other words, human text artifacts are often constructed in a way where the beginning and the end of a long text matter the most which could be a potential explanation to the characteristics observed in this work.
  • You can also model this with the lens of two popular cognitive biases that humans face (primacy and recency bias), as in the following figure (source).

  • The final conclusion is that combining retrieval with ranking (as in recommender systems) should yield the best performance in RAG for question answering.
  • The following figure (source) shows an overview of the idea proposed in the paper: “LLMs are better at using info at beginning or end of input context”.

  • The following figure from the paper illustrates the effect of changing the position of relevant information (document containing the answer) on multidocument question answering performance. Lower positions are closer to the start of the input context. Performance is generally highest when relevant information is positioned at the very start or very end of the context, and rapidly degrades when models must reason over information in the middle of their input context.

The “Needle in a Haystack” Test

  • To understand the in-context retrieval ability of long-context LLMs over various parts of their prompt, a simple ‘needle in a haystack’ analysis could be conducted. This method involves embedding specific, targeted information (the ‘needle’) within a larger, more complex body of text (the ‘haystack’). The purpose is to test the LLM’s ability to identify and utilize this specific piece of information amidst a deluge of other data.
  • In practical terms, the analysis could involve inserting a unique fact or data point into a lengthy, seemingly unrelated text. The LLM would then be tasked with tasks or queries that require it to recall or apply this embedded information. This setup mimics real-world situations where essential details are often buried within extensive content, and the ability to retrieve such details is crucial.
  • The experiment could be structured to assess various aspects of the LLM’s performance. For instance, the placement of the ‘needle’ could be varied—early, middle, or late in the text—to see if the model’s retrieval ability changes based on information location. Additionally, the complexity of the surrounding ‘haystack’ can be modified to test the LLM’s performance under varying degrees of contextual difficulty. By analyzing how well the LLM performs in these scenarios, insights can be gained into its in-context retrieval capabilities and potential areas for improvement.
  • This can be accomplished using the Needle In A Haystack library. The following plot shows OpenAI’s GPT-4-128K’s (top) and (bottom) performance with varying context length.

  • The following figure (source) shows Claude 2.1’s long context question answering errors based on the areas of the prompt context length. On an average, Claude 2.1 demonstrated a 30% reduction in incorrect answers compared to Claude 2.

  • However, in their Long context prompting for Claude 2.1 blog, Anthropic noted that adding “Here is the most relevant sentence in the context:” to the start of Claude’s response raised the score from 27% to 98% on the original evaluation! The figure below from the blog shows that Claude 2.1’s performance when retrieving an individual sentence across its full 200K token context window. This experiment uses the aforementioned prompt technique to guide Claude in recalling the most relevant sentence.

Component-Wise Evaluation

  • Component-wise evaluation in RAG systems for LLMs involves assessing individual components of the system separately. This approach typically examines the performance of the retrieval component, which fetches relevant information from a database or corpus, and the generation component, which synthesizes responses based on the retrieved data. By evaluating these components individually, researchers can identify specific areas for improvement in the overall RAG system, leading to more efficient and accurate information retrieval and response generation in LLMs.
  • While metrics such as Context Precision, Context Recall, and Context Relevance provide insights into the performance of the retrieval component of the RAG system, Groundedness, and Answer Relevance offer a view into the quality of the generation.
  • Specifically,
    • Metrics to evaluate retrieval: Context Relevance, Context Recall, and Context Precision, which collectively assess the relevance, completeness, and accuracy of the information retrieved in response to a user’s query. Context Precision focuses on the system’s ability to rank relevant items higher, Context Recall evaluates how well the system retrieves all relevant parts of the context, and Context Relevance measures the alignment of retrieved information with the user’s query. These metrics ensure the effectiveness of the retrieval system in providing the most relevant and complete context for generating accurate responses.
    • Metrics to evaluate generation: Faithfulness and Answer Relevance, which measure the factual consistency of the generated answer with the given context and its relevance to the original question, respectively. Faithfulness focuses on the factual accuracy of the answer, ensuring all claims made can be inferred from the given context. Answer Relevance assesses how well the answer addresses the original question, penalizing incomplete or redundant responses. These metrics ensure the generation component produces contextually appropriate and semantically relevant answers.
  • The harmonic mean of these four aspects gives you the overall score (also called ragas score) which is a single measure of the performance of your RAG system across all the important aspects.
  • Most of the measurements do not require any labeled data, making it easier for users to run it without worrying about building a human-annotated test dataset first. In order to run ragas all you need is a few questions and if your using context_recall, a reference answer.
  • Overall, these metrics offer a comprehensive view of the RAG system’s retrieval performance, which can be implemented using libraries for evaluating RAG pipelines such as Ragas or TruLens and offer detailed insights about your RAG pipeline’s performance, focusing on the contextual and factual alignment of retrieved and generated content in response to user queries. Specifically, Ragas, offers metrics tailored for evaluating each component of your RAG pipeline in isolation. This approach complements the broader, system-level end-to-end evaluation of your system (which is detailed in End-to-End Evaluation), allowing for a deeper understanding of how well a RAG system performs in real-world scenarios where the intricacies of context and factual accuracy are paramount. The figure below (source) shows the metrics that Ragas offers which are tailored for evaluating each component (retrieval, generation) of your RAG pipeline in isolation.

  • The image below (source), shows the “triad” of metrics that can be used to evaluate RAG: Groundedness (also known as Faithfulness), Answer Relevance, and Context Relevance. Note that Context Precision and Context Recall are also important and were more recently introduced in a newer version of Ragas.

Retrieval Metrics

  • Evaluating the retrieval component of RAG in the context of LLMs involves assessing how effectively the system retrieves relevant information to support the generation of accurate and contextually appropriate responses.

Context Precision

  • Category: Contextual Alignment and Precision
  • Focus: Assesses the accuracy of the RAG system in ranking ground-truth relevant items from the context higher in the results. It is crucial in determining whether the most relevant chunks of information are prioritized at the top ranks when responding to a query.
  • Measurement Methods: Utilizes a formula that takes into account the presence of true positives (relevant items correctly ranked high) and false positives (irrelevant items incorrectly ranked high) within the top K results. This is typically evaluated using the question and its contextual information.
  • Evaluation Approach: The metric is computed by first identifying the true and false positives within the top K chunks of the context, followed by the calculation of precision at K. The formula for Context Precision is as follows:

    \[\begin{gathered}\text {Context Precision@k }=\frac{\sum \text { precision@k }}{\text { total number of relevant items in the top K results }} \\ \text {Precision@k }=\frac{\text { true positives@k }}{\text { (true positives@k }+ \text { false positives@k) }}\end{gathered}\]
    • where \(k\) represents the total number of chunks considered in the context. The value of Context Precision ranges between 0 and 1, with higher scores signifying a more precise alignment of the context with the query’s relevant items.

Context Recall

  • Category: Contextual Alignment and Recall
  • Focus: Evaluates the extent to which the context retrieved by the RAG system aligns with the annotated answer, considered as the ground truth. It specifically measures the system’s ability to retrieve all relevant parts of the context that are directly related to the ground truth answer.
  • Measurement Methods: Context recall is calculated by analyzing the correspondence between the sentences in the ground truth answer and the retrieved context. It can be measured using methods that assess the attribution of ground truth sentences to the retrieved context.
  • Evaluation Approach: The process involves identifying each sentence in the ground truth answer and determining whether it is represented in the retrieved context. The formula for calculating context recall is as follows:
\[\text{Context Recall} = \frac{\text{Number of GT sentences that can be attributed to context }}{\text{ Number of sentences in GT }}\]
  • Example:
    • Question: Where is France and what is its capital?
    • Ground truth: France is in Western Europe and its capital is Paris.
    • High context recall example: The context includes information about France being in Western Europe and mentions Paris as its capital.
    • Low context recall example: The context talks about France’s geographical features and history but does not mention its capital.
  • In this metric, values range between 0 and 1, with higher values indicating a better performance in aligning the retrieved context with the ground truth answer.

Context Relevance

  • Category: Contextual Alignment and Relevance
  • Focus:
    • “Is the passage returned relevant for answering the given query?”
    • Measures how well the context or content retrieved by the RAG system aligns with the user’s query. It specifically evaluates whether the retrieved information is relevant and appropriate for the given query, ensuring that only essential information is included to address the query effectively.
  • Measurement Methods: Can be measured with smaller BERT-style models, embedding distances, or with LLMs. The approach involves estimating the value of context relevance by identifying sentences within the retrieved context that are directly relevant for answering the given question.
  • Evaluation Approach: Involves a two-step procedure: first, the identification of relevant sentences using semantic similarity measures to produce a relevance score for each sentence. This is followed by the quantification of overall context relevance, where the final score is calculated using the formula:
\[\text {Context Relevance} = \frac{\text { Number of sentences that are relevant to the query within the retrieved context}}{\text { Total number of sentences in retrieved context}}\]
  • Examples:
    • High context relevance example: For a question like “What is the capital of France?”, a highly relevant context would be “France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.”
    • Low context relevance example: For the same question, a less relevant context would include additional, unrelated information such as “The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.”
  • This metric ensures that the RAG system provides concise and directly related information, enhancing the efficiency and accuracy of the response given to a specific query.

Generation Metrics

  • Evaluating the generation component of RAG in the context of LLMs involves assessing the ability of the system to seamlessly integrate retrieved information into coherent, contextually relevant, and linguistically accurate responses, ensuring a harmonious blend of retrieved data and generative language skills. Put simply, these metrics collectively provide a nuanced and multidimensional approach to evaluating RAG systems, emphasizing not just the retrieval of information but its contextual relevance, factual accuracy, and semantic alignment with user queries.

Groundedness (a.k.a. Faithfulness)

  • Category: Factual Alignment and Semantic Similarity
  • Focus:
    • “Is the answer generated faithful to the retrieved passage? Or does it contain hallucinated or extrapolated statements beyond the passage?”
    • This metric assesses the factual alignment and semantic similarity between the model’s response and the retrieved documents. It ensures that the generated response is contextually appropriate and factually grounded in the retrieved information. Specifically, it evaluates if all claims made in the model’s answer can be directly inferred from the given context.
  • Measurement Methods: Groundedness can be measured using Natural Language Inference (NLI) models, Large Language Models (LLMs), and a combination of automated systems and human judgment. The faithfulness score is calculated by comparing the claims made in the generated answer against those in the given context, using the formula:
\[\text {Faithfulness} = \frac{\text { Number of claims that can be inferred from given context }}{\text { Total number of claims in the generated answer }}\]
  • Evaluation Approach: Utilizes Chains of Thought (CoT) prompting to simulate a reasoning process, scoring the alignment on a scale of 0 or 1 through a hybrid approach of automated systems (for semantic match and factuality checking) and human judgment. The evaluation process involves identifying a set of claims from the generated answer and cross-checking each of these claims with the given context to determine their factual consistency.

  • Examples:
    • High faithfulness example: For the question “Where and when was Einstein born?” with the context “Albert Einstein (born 14 March 1879) was a German-born theoretical physicist,” a high faithfulness answer would be “Einstein was born in Germany on 14th March 1879.”
    • Low faithfulness example: For the same question and context, a low faithfulness answer would be “Einstein was born in Germany on 20th March 1879.”
  • This metric is crucial in ensuring the reliability and trustworthiness of responses generated by RAG systems, as it directly relates to the accuracy and factual consistency of the information provided in response to a query.

Answer Relevance

  • Category: Response Quality and Semantic Relevance
  • Focus:
    • “Is the generated answer relevant given the query and retrieved passage?”
    • This metric evaluates the relevance of the answer generated by the model to the user’s query. It assesses how pertinent the generated answer is to the given prompt, penalizing answers that are incomplete or contain redundant information. It does not consider factuality but focuses on the directness and appropriateness of the response to the original question.
  • Measurement Methods: Quantified using BERT-style models, embedding distances, or with LLMs. The metric is computed reference-free, using the mean cosine similarity between multiple questions generated from the answer and the original question. The formula for this measurement is:

    \[\text {Answer Relevance} = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(q, q_i)\]
    • where \(q\) is the original question, \(q_i\) are the questions generated from the answer, and sim represents the cosine similarity between their embeddings.
  • Evaluation Process: Involves prompting the LLM to generate an appropriate question for the generated answer multiple times, then measuring the mean cosine similarity between these generated questions and the original question. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align closely with the original question.

  • Examples:
    • Low relevance answer example: For the question “Where is France and what is its capital?”, a low relevance answer would be “France is in western Europe.”
    • High relevance answer example: For the same question, a high relevance answer would be “France is in western Europe and Paris is its capital.”

This metric is crucial for ensuring that the answers provided by RAG systems are not only accurate but also complete and directly address the user’s query without including unnecessary details.

  • The image below (source) shows the output format of Answer Relevance.

End-to-End Evaluation

  • Evaluating the end-to-end performance of a pipeline is also crucial, as it directly affects the user experience. Ragas provides metrics that can be employed to assess the overall performance of your pipeline, ensuring a comprehensive evaluation.

Answer Semantic Similarity

  • Category: Answer Quality and Semantic Alignment
  • Focus: Evaluates the degree of semantic similarity between the generated answer by the RAG system and the ground truth. This metric specifically assesses how closely the meaning of the generated answer mirrors that of the ground truth.
  • Measurement Methods: This metric is measured using cross-encoder models that are designed to calculate the semantic similarity score. These models analyze the semantic content of both the generated answer and the ground truth.
  • Evaluation Approach: The approach involves comparing the generated answer with the ground truth to determine the extent of semantic overlap. The semantic similarity is quantified on a scale from 0 to 1, where higher scores indicate a greater alignment between the generated answer and the ground truth. The formula for Answer Semantic Similarity is implicitly based on the evaluation of semantic overlap rather than a direct formula.

  • Example:
    • Ground truth: Albert Einstein’s theory of relativity revolutionized our understanding of the universe.
    • High similarity answer: Einstein’s groundbreaking theory of relativity transformed our comprehension of the cosmos.
    • Low similarity answer: Isaac Newton’s laws of motion greatly influenced classical physics.
  • In this metric, a higher score reflects a better quality of the generated response in terms of its semantic closeness to the ground truth, indicating a more accurate and contextually relevant answer.

Answer Correctness

  • Category: Answer Accuracy and Correctness
  • Focus: This metric assesses the accuracy of the answer generated by the RAG system in comparison to the ground truth. It emphasizes not just the semantic similarity but also the factual correctness of the generated answer relative to the ground truth.
  • Measurement Methods: The evaluation of answer correctness involves a combination of assessing semantic similarity and factual similarity. These aspects are integrated using a weighted scheme, which can include the use of cross-encoder models or other sophisticated methods for semantic analysis. Users can also apply a threshold value to interpret the scores in a binary manner.
  • Evaluation Approach: The approach entails comparing the generated answer with the ground truth to evaluate both semantic and factual alignment. The combined assessment of these two aspects results in the answer correctness score, which ranges from 0 to 1, where higher scores denote greater accuracy and alignment with the ground truth.

  • Example:
    • Ground truth: Einstein was born in 1879 in Germany.
    • High answer correctness example: In 1879, in Germany, Einstein was born.
    • Low answer correctness example: In Spain, Einstein was born in 1879.
  • This metric highlights the importance of not just understanding the context and content of the user’s query (as in the context relevance evaluation) but also ensuring that the answers provided are factually and semantically aligned with the established truth, thereby ensuring a high-quality response from the RAG system.

Multimodal RAG

  • Many documents contain a mixture of content types, including text and images. Yet, information captured in images is lost in most RAG applications. With the emergence of multimodal LLMs, like GPT-4V, it is worth considering how to utilize images in RAG.
  • Here are three ways to use images in RAG with LangChain:
    • Option 1:
      • Use multimodal embeddings (such as CLIP) to embed images and text.
      • Retrieve both using similarity search.
      • Pass raw images and text chunks to a multimodal LLM for answer synthesis.
    • Option 2:
      • Use a multimodal LLM (such as GPT-4V, LLaVA, or FUYU-8b) to produce text summaries from images.
      • Embed and retrieve text.
      • Pass text chunks to an LLM for answer synthesis.
    • Option 3:
      • Use a multimodal LLM (such as GPT-4V, LLaVA, or FUYU-8b) to produce text summaries from images.
      • Embed and retrieve image summaries with a reference to the raw image. You can use a multi-vector retriever with a Vector DB such as Chroma to store raw text and images along with their summaries for retrieval.
      • Pass raw images and text chunks to a multimodal LLM for answer synthesis.
  • Option 2 is appropriate for cases when a multi-modal LLM cannot be used for answer synthesis (e.g., cost, etc).
  • The following figure (source) offers an overview of all three aforementioned options.

  • LangChain offers cookbooks for Option 1 and Option 3.
  • The following infographic (source) also offers a top-level overview of Multimodal RAG:

Improving RAG Systems

  • To enhance and refine RAG systems, consider the following three structured methods, each accompanied by comprehensive guides and practical implementations:
    1. Re-ranking Retrieved Results: A fundamental and effective method involves employing a Re-ranking Model to refine the results obtained through initial retrieval. This approach prioritizes more relevant results, thereby improving the overall quality of the generated content. MonoT5, MonoBERT, DuoBERT, etc. are examples of deep models that can be used as re-rankers. For a detailed exploration of this technique, refer to the guide and code example provided by Mahesh Deshwal.
    2. FLARE Technique: Subsequent to re-ranking, one should explore the FLARE methodology. This technique dynamically queries the internet (could also be a local knowledge base) whenever the confidence level of a segment of the generated content falls below a specified threshold. This overcomes a significant limitation of conventional RAG systems, which typically query the knowledge base only at the outset and subsequently produce the final output. Akash Desai’s guide and code walkthrough offer an insightful understanding and practical application of this technique. More on the FLARE technique in the Active Retrieval Augmented Generation section.
    3. HyDE Approach: Finally, the HyDE technique introduces an innovative concept of generating a hypothetical document in response to a query. This document is then converted into an embedding vector. The uniqueness of this method lies in using the vector to identify a similar neighborhood within the corpus embedding space, thereby retrieving analogous real documents based on vector similarity. To delve into this method, refer to Akash Desai’s guide and code implementation . More on the HyDE technique in the Precise Zero-Shot Dense Retrieval Without Relevance Labels section.
  • Each of these methods offers a unique approach to refining RAG systems, contributing to more accurate and contextually relevant results.

RAFT: Adapting Language Model to Domain Specific RAG

  • A novel training recipe adapts pre-trained large language models (LLMs) to domain-specific RAG tasks, fine-tuning them to better utilize domain-specific documents by identifying and focusing on useful documents while ignoring irrelevant ones (distractor documents).
  • RAFT trains models to selectively ignore distractor documents that do not contribute to answering a given question, achieved by training with a mix of relevant (“oracle”) documents and distractors to enhance the model’s ability to filter out noise.
  • Models are trained to generate answers using a chain-of-thought approach, where they construct a reasoning path that leads to the answer, improving accuracy and making the reasoning process transparent.
  • Specifically tailored for domain-specific applications, ensuring that the model performs well on in-domain documents and is robust to variations in the quality and relevance of retrieved documents.
  • Empirically evaluated on several datasets, including PubMed, HotpotQA, and various coding-related datasets from the Gorilla project, consistently outperforming other models.
  • Designed to be robust against the inclusion of distractor documents during testing, a common challenge in practical applications of RAG.
  • These features make RAFT suitable for applications requiring accurate and reliable domain-specific information retrieval and question answering, allowing models to handle the “open-book” nature of many real-world tasks effectively.
  • RAFT, or Retrieval-Augmented Fine-Tuning, is a training approach designed to adapt language models to domain-specific retrieval-augmented generation (RAG) tasks. It improves how models utilize external domain-specific documents to enhance the accuracy and relevance of generated answers. RAFT trains models to effectively filter out irrelevant information (distractor documents) and focus on useful content for answering questions, thereby enhancing the model’s ability to handle “open-book” question-answering scenarios. RAFT should be used when:
  • You need to fine-tune language models for tasks that require integrating specific, external knowledge sources into the generation process.
  • There is a requirement to improve the precision of answers by leveraging relevant documents, particularly in specialized fields such as legal, medical, or technical domains.
  • The goal is to enhance how models manage and utilize large volumes of information effectively in their responses.

RAFT is especially valuable in scenarios where the accuracy of information is critical and where leveraging a vast amount of domain-specific knowledge can significantly improve performance and relevance.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  • The paper by Lewis et al. from Facebook AI Research, University College London, and New York University, introduces Retrieval-Augmented Generation (RAG) models combining pre-trained parametric and non-parametric memory for language generation tasks.
  • Addressing limitations of large pre-trained language models, such as difficulty in accessing and precisely manipulating knowledge, RAG models merge a pre-trained sequence-to-sequence (seq2seq) model with a dense vector index of Wikipedia, accessed by a neural retriever.
  • The RAG framework encompasses two models: RAG-Sequence, using the same retrieved document for the entire sequence, and RAG-Token, allowing different passages for each token.
  • The retrieval component, Dense Passage Retriever (DPR), uses a bi-encoder architecture with BERT-based document and query encoders. The generator component utilizes BART-large, a pre-trained seq2seq transformer with 400M parameters.
  • RAG models were trained jointly on the retriever and generator components without direct supervision on which documents to retrieve, using stochastic gradient descent with Adam. The training used a Wikipedia dump as the non-parametric knowledge source, split into 21M 100-word chunks.
  • A summary of the methods and models used for query/document embedding and retrieval, as well as the end-to-end structure of the RAG framework is as below:
    1. Query/Document Embedding:
      • The retrieval component, Dense Passage Retriever (DPR), follows a bi-encoder architecture.
      • DPR uses BERTBASE as the foundation for both document and query encoders.
      • For a document \(z\), a dense representation \(d(z)\) is produced by a document encoder, \(BERT_d\).
      • For a query \(x\), a query representation \(q(x)\) is produced by a query encoder, \(BERT_q\).
      • The embeddings are created such that relevant documents for a given query are close in the embedding space, allowing effective retrieval.
    2. Retrieval Process:
      • The retrieval process involves calculating the top-\(k\) documents with the highest prior probability, which is essentially a Maximum Inner Product Search (MIPS) problem.
      • The MIPS problem is solved approximately in sub-linear time to efficiently retrieve relevant documents.
    3. End-to-End Structure:
      • The RAG model uses the input sequence \(x\) to retrieve text documents \(z\), which are then used as additional context for generating the target sequence \(y\).
      • The generator component is modeled using BART-large, a pre-trained seq2seq transformer with 400M parameters. BART-large combines the input \(x\)with the retrieved content \(z\) for generation.
      • The RAG-Sequence model uses the same retrieved document for generating the complete sequence, while the RAG-Token model can use different passages per token.
      • The training process involves jointly training the retriever and generator components without direct supervision on what document should be retrieved. The training minimizes the negative marginal log-likelihood of each target using stochastic gradient descent with Adam.
      • Notably, the document encoder BERTd is kept fixed during training, avoiding the need for periodic updates of the document index.
  • The following figure from the paper illustrates an overview of the proposed approach. They combine a pre-trained retriever (Query Encoder + Document Index) with a pre-trained seq2seq model (Generator) and fine-tune end-to-end. For query \(x\), they use Maximum Inner Product Search (MIPS) to find the top-\(K\) documents \(z_i\). For final prediction \(y\), they treat \(z\) as a latent variable and marginalize over seq2seq predictions given different documents.

  • In open-domain QA tasks, RAG established new state-of-the-art results, outperforming both parametric seq2seq models and task-specific retrieve-and-extract architectures. RAG models showed the ability to generate correct answers even when the right answer wasn’t in any retrieved document.
  • RAG-Sequence surpassed BART in Open MS-MARCO NLG, indicating less hallucination and more factually correct text generation. RAG-Token outperformed RAG-Sequence in Jeopardy question generation, demonstrating higher factuality and specificity.
  • On the FEVER fact verification task, RAG models achieved results close to state-of-the-art models that require more complex architectures and intermediate retrieval supervision.
  • This study showcases the effectiveness of hybrid generation models, combining parametric and non-parametric memories, offering new directions in combining these components for a range of NLP tasks.
  • Code; interactive demo.

Active Retrieval Augmented Generation

  • Despite the remarkable ability of large language models (LLMs) to comprehend and generate language, they have a tendency to hallucinate and create factually inaccurate output.
  • Augmenting LLMs by retrieving information from external knowledge resources is one promising solution. Most existing retrieval-augmented LLMs employ a retrieve-and-generate setup that only retrieves information once based on the input. This is limiting, however, in more general scenarios involving generation of long texts, where continually gathering information throughout the generation process is essential. There have been some past efforts to retrieve information multiple times while generating outputs, which mostly retrieve documents at fixed intervals using the previous context as queries.
  • This paper from Jiang et al. at CMU, Sea AI Lab, and Meta AI in EMNLP 2023 presents Forward-Looking Active REtrieval augmented generation (FLARE), a method addressing the tendency of large language models (LLMs) to produce factually inaccurate content.
  • FLARE iteratively uses predictions of upcoming sentences to actively decide when and what to retrieve across the generation process, enhancing LLMs with dynamic, multi-stage external information retrieval.
  • Unlike traditional retrieve-and-generate models that use fixed intervals or input-based retrieval, FLARE targets continual information gathering for long text generation, reducing hallucinations and factual inaccuracies.
  • The system triggers retrieval when generating low-confidence tokens, determined by a probability threshold. This anticipates future content, forming queries to retrieve relevant documents for regeneration.
  • The following figure from the paper illustrates FLARE. Starting with the user input \(x\) and initial retrieval results \(D_x\), FLARE iteratively generates a temporary next sentence (shown in gray italic) and check whether it contains low-probability tokens (indicated with underline). If so (step 2 and 3), the system retrieves relevant documents and regenerates the sentence.

  • FLARE was tested on four long-form, knowledge-intensive generation tasks/datasets, exhibiting superior or competitive performance, demonstrating its effectiveness in addressing the limitations of existing retrieval-augmented LLMs.
  • The model is adaptable to existing LLMs, as shown with its implementation on GPT-3.5, and employs off-the-shelf retrievers and the Bing search engine.
  • Code.

MuRAG: Multimodal Retrieval-Augmented Generator

  • This paper by Chen et al. from Google Research proposes Multimodal Retrieval-Augmented Transformer (MuRAG), which looks to extend the retrieval process beyond text to include other modalities like images or structured data, which can then be used alongside textual information to inform the generation process.
  • MuRAG’s magic lies in its two-phase training approach: pre-training and fine-tuning, each carefully crafted to build the model’s ability to tap into a vast expanse of multimodal knowledge.
  • The key goal of MuRAG is to incorporate both visual and textual knowledge into language models to improve their capability for multimodal question answering.
  • MuRAG is distinct in its ability to access an external non-parametric multimodal memory (images and texts) to enhance language generation, addressing the limitations of text-only retrieval in previous models.
  • MuRAG has a dual-encoder architecture combines pre-trained visual transformer (ViT) and a text encoder (T5) models to create a backbone encoder, enabling the encoding of image-text pairs, image-only, and text-only inputs into a unified/joint multimodal representation.
  • MuRAG is pre-trained on a mixture of image-text data (LAION, Conceptual Captions) and text-only data (PAQ, VQA). It uses a contrastive loss for retrieving relevant knowledge and a generation loss for answer prediction. It employs a two-stage training pipeline: initial training with small in-batch memory followed by training with a large global memory.
  • During the retriever stage, MuRAG takes a query \(q\) of any modality as input and retrieves from a memory \(\mathcal{M}\) of image-text pairs. Specifically, we apply the backbone encoder \(f_\theta\) to encode a query \(q\), and use maximum inner product search (MIPS) over all of the memory candidates \(m \in \mathcal{M}\) to find the top-\(k\) nearest neighbors \(\operatorname{Top}_K(\mathcal{M} \mid q)=\left[m_1, \cdots, m_k\right]\). Formally, we define \(\operatorname{Top}_K(\mathcal{M} \mid q)\) as follows:
\[\operatorname{Top}_K(\mathcal{M} \mid q)=\underset{m \in \mathcal{M}}{\operatorname{Top}} \quad f_\theta(q)_{[\mathrm{CLS}]} \cdot f_\theta(m)_{[\mathrm{CLS}]}\]
  • During the reader stage, the retrievals (the raw image patches) are combined with the query \(q\) as an augmented input \(\left[m_1, \cdots, m_k, q\right]\), which is fed to the backbone encoder \(f_\theta\) to produce retrievalaugmented encoding. The decoder model \(g_\theta\) uses attention over this representation to generate textual outputs \(\mathbf{y}=y_1, \cdots, y_n\) token by token.

    \[p\left(y_i \mid y_{i-1}\right)=g_\theta\left(y_i \mid f_\theta\left(\operatorname{Top}_K(\mathcal{M} \mid q) ; q\right) ; y_{1: i-1}\right)\]
    • where \(y\) is decoded from a given vocabulary \(\mathcal{V}\).
  • The figure below from the original paper (source) shows how the model taps into an external repository to retrieve a diverse range of knowledge encapsulated within both images and textual fragments. This multimodal information is then employed to enhance the generative process. The upper section outlines the setup for the pre-training phase, whereas the lower section specifies the framework for the fine-tuning phase.

  • The process can be summarized as follows:
    • For retrieval, MuRAG uses maximum inner product search to find the top-\(k\) most relevant image-text pairs from the memory given a question. The “memory” here refers to the external knowledge base that the model can retrieve information from. Specifically, the memory contains a large collection of image-text pairs that are encoded offline by the backbone encoder prior to training.
    • During training and inference, given a question, MuRAG’s retriever module will search through this memory to find the most relevant image-text pairs using maximum inner product search.
    • The memory serves as the knowledge source and can contain various types of multimodal data like images with captions, passages of text, tables, etc. that are related to the downstream task.
    • For example, when fine-tuning on the WebQA dataset, the memory contains 1.1 million image-text pairs extracted from Wikipedia that the model can retrieve from to answer questions.
    • So in summary, the memory is the large non-parametric external knowledge base encoded in a multimodal space that MuRAG learns to retrieve relevant knowledge from given a question, in order to augment its language generation capabilities. The memory provides the world knowledge to complement what is stored implicitly in the model’s parameters.
    • For reading, the retrieved multimodal context is combined with the question embedding and fed into the decoder to generate an answer.
  • MuRAG achieves state-of-the-art results on two multimodal QA datasets - WebQA and MultimodalQA, outperforming text-only methods by 10-20% accuracy. It demonstrates the value of incorporating both visual and textual knowledge.
  • Key limitations are the reliance on large-scale pre-training data, computational costs, and issues in visual reasoning like counting objects. But overall, MuRAG represents an important advance in building visually-grounded language models.

Hypothetical Document Embeddings (HyDE)

  • Published in Precise Zero-Shot Dense Retrieval without Relevance Labels by Gao et al. from CMU and University of Waterloo, proposes an innovative approach called Hypothetical Document Embeddings (HyDE) for effective zero-shot dense retrieval in the absence of relevance labels. HyDE leverages an instruction-following language model, such as InstructGPT, to generate a hypothetical document that captures relevance patterns, although it may contain factual inaccuracies. An unsupervised contrastive encoder, like Contriever, then encodes this document into an embedding vector to identify similar real documents in the corpus embedding space, effectively filtering out incorrect details.
  • The implementation of HyDE combines InstructGPT (a GPT-3 model) and Contriever models, utilizing OpenAI playground’s default temperature setting for generation. For English retrieval tasks, the English-only Contriever model was used, while for non-English tasks, the multilingual mContriever was employed.
  • The following image from the paper illustrates the HyDE model. Documents snippets are shown. HyDE serves all types of queries without changing the underlying GPT-3 and Contriever/mContriever models.

  • Experiments were conducted using the Pyserini toolkit. The results demonstrate HyDE’s significant improvement over the state-of-the-art unsupervised dense retriever Contriever, with strong performance comparable to fine-tuned retrievers across various tasks and languages. Specifically, in web search and low-resource tasks, HyDE showed sizable improvements in precision and recall-oriented metrics. It remained competitive even compared to fine-tuned models, particularly in terms of recall. In multilingual retrieval, HyDE improved the mContriever model and outperformed non-Contriever models fine-tuned on MS-MARCO. However, there were some performance gaps with fine-tuned mContrieverFT, likely due to under-training in non-English languages.
  • Further analysis explored the effects of using different generative models and fine-tuned encoders with HyDE. Larger language models brought greater improvements, and the use of fine-tuned encoders with HyDE showed that less powerful instruction language models could impact the performance of the fine-tuned retriever.
  • One possible pitfall of HyDE is that it can potentially “hallucinate” in the sense that it generates hypothetical documents that may contain invented or inaccurate details. This phenomenon occurs because HyDE uses an instruction-following language model, like InstructGPT, to generate a document based on a query. The generated document is intended to capture the relevance patterns of the query, but since it’s created without direct reference to real-world data, it can include false or fictional information. This aspect of HyDE is a trade-off for its ability to operate in zero-shot retrieval scenarios, where it creates a contextually relevant but not necessarily factually accurate document to guide the retrieval process.
  • In conclusion, the paper introduces a new paradigm of interaction between language models and dense encoders/retrievers, showing that relevance modeling and instruction understanding can be effectively handled by a powerful and flexible language model. This approach eliminates the need for relevance labels, offering practical utility in the initial stages of a search system’s life, and paving the way for further advancements in tasks like multi-hop retrieval/QA and conversational search.

RAGAS: Automated Evaluation of Retrieval Augmented Generation

  • This paper by Es et al. from Exploding Gradients, Cardiff University, and AMPLYFI introduces RAGAS, a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) systems.
  • RAGAS focuses on evaluating the performance of RAG systems in dimensions such as the effectiveness of the retrieval system in providing relevant context, the LLM’s ability to utilize this context, and the overall quality of generation.
  • The framework proposes a suite of metrics to evaluate these dimensions without relying on ground truth human annotations.
  • RAGAS focuses on three quality aspects: Faithfulness, Answer Relevance, and Context Relevance.
    • Faithfulness: Defined as the extent to which the generated answer is grounded in the provided context. It’s measured using the formula: \(F = \frac{|V|}{|S|}\) where, \(|V|\) is the number of statements supported by the context and \(|S|\) is the total number of statements extracted from the answer.
    • Answer Relevance: This metric assesses how well the answer addresses the given question. It’s calculated by generating potential questions from the answer and measuring their similarity to the original question using the formula: \(AR = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(q, q_i)\) where \(q\) is the original question, \(q_i\) are the generated questions, and sim denotes the cosine similarity between their embeddings.
    • Context Relevance: Measures the extent to which the retrieved context contains only the information necessary to answer the question. It is quantified using the proportion of extracted relevant sentences to the total sentences in the context: \(CR = \frac{\text{number of extracted sentences}}{\text{total number of sentences in } c(q)}\)
  • The paper validates RAGAS using the WikiEval dataset, demonstrating its alignment with human judgments in evaluating these aspects.
  • The authors argue that RAGAS contributes to faster and more efficient evaluation cycles for RAG systems, which is vital due to the rapid adoption of LLMs.
  • RAGAS is validated using the WikiEval dataset, which includes question-context-answer triples annotated with human judgments for faithfulness, answer relevance, and context relevance.
  • The evaluation shows that RAGAS aligns closely with human judgments, particularly in assessing faithfulness and answer relevance.
  • Code.

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs

  • This paper by Ovadia et al. from Microsoft presents an insightful comparison of knowledge injection methods in large language models (LLMs). The core question addressed is whether unsupervised fine-tuning (USFT) is more effective than retrieval-augmented generation (RAG) for improving LLM performance on knowledge-intensive tasks.
  • The researchers focus on LLMs’ ability to memorize, understand, and retrieve factual data, using a knowledge base scraped from Wikipedia and a dataset of current events questions created with GPT-4. The study employs models like Llama2-7B, Mistral-7B, and Orca2-7B, evaluating them on tasks from the Massively Multitask Language Understanding Evaluation (MMLU) benchmark and a current events dataset.
  • Two methods of knowledge injection are explored: fine-tuning, which continues the model’s pre-training process using task-specific data, and retrieval-augmented generation (RAG), which uses external knowledge sources to enhance LLMs’ responses. The paper also delves into supervised, unsupervised, and reinforcement learning-based fine-tuning methods.
  • The key finding is that RAG outperforms unsupervised fine-tuning in knowledge injection. RAG, which uses external knowledge sources, is notably more effective in terms of knowledge injection than USFT alone and even more so than a combination of RAG and fine-tuning, particularly in scenarios where questions directly corresponded to the auxiliary dataset. This suggests that USFT may not be as efficient in embedding new knowledge into the model’s parameters.
  • The figure below from the paper shows a visualization of the knowledge injection framework.

  • Note that USFT in this context is a direct continuation of pre-training (hence also called continued pre-training in literature), predicting the next token on the dataset. Interestingly, fine-tuning with multiple paraphrases of the same fact significantly improves the baseline performance, indicating the importance of repetition and varied presentation of information for effective knowledge assimilation.
  • The authors created a knowledge base by scraping Wikipedia articles relevant to various topics, which was used for both fine-tuning and RAG. Additionally, a dataset of multiple-choice questions about current events was generated using GPT-4, with paraphrases created to augment this dataset.
  • Limitations of the study include the exclusive focus on unsupervised fine-tuning, without exploring supervised fine-tuning or reinforcement learning from human feedback (RLHF). The study also notes a high variance in accuracy performance across experiments, making it challenging to ascertain the statistical significance of the results.
  • The paper also questions why baseline models don’t achieve a 25% accuracy rate for multiple-choice questions with four options, suggesting that the tasks may not represent truly “unseen” knowledge. Moreover, the research primarily assesses straightforward knowledge or fact tasks, without delving into reasoning capabilities.
  • In summary, while fine-tuning can be beneficial, RAG is identified as a superior method for knowledge injection in LLMs, especially for tasks involving new information. The results highlight the potential of using diverse fine-tuning techniques and auxiliary knowledge bases for further research in this domain.

Dense X Retrieval: What Retrieval Granularity Should We Use?

  • One crucial choice in RAG pipeline design is chunking: should it be sentence level, passage level, or chapter level? This choice significantly impacts your retrieval and response generation performance.
  • This paper by Chen et al. from the University of Washington, Tencent AI Lab, University of Pennsylvania, Carnegie Mellon University introduces a novel approach to dense retrieval in open-domain NLP tasks by using “propositions” as retrieval units, instead of the traditional document passages or sentences. A proposition is defined as an atomic expression within text, encapsulating a distinct factoid in a concise, self-contained natural language format. This change in retrieval granularity has a significant impact on both retrieval and downstream task performances.
  • Propositions follow three key principles:
    1. Each proposition encapsulates a distinct meaning, collectively representing the semantics of the entire text.
    2. They are minimal and indivisible, ensuring precision and clarity.
    3. Each proposition is contextualized and self-contained, including all necessary text context (like coreferences) for full understanding.
  • The authors developed a text generation model, named “Propositionizer,” to segment Wikipedia pages into propositions. This model was fine-tuned in two steps, starting with prompting GPT-4 for paragraph-to-propositions pairs generation, followed by fine-tuning a Flan-T5-large model.
  • The effectiveness of propositions as retrieval units was evaluated using the FACTOIDWIKI dataset, a processed English Wikipedia dump segmented into passages, sentences, and propositions. Experiments were conducted on five open-domain QA datasets: Natural Questions (NQ), TriviaQA (TQA), Web Questions (WebQ), SQuAD, and Entity Questions (EQ). Six different dense retriever models were compared: SimCSE, Contriever, DPR, ANCE, TAS-B, and GTR.
  • The figure below from the paper illustrates the fact that that segmenting and indexing a retrieval corpus on the proposition level can be a simple yet effective strategy to increase dense retrievers’ generalization performance at inference time \((A, B)\). We empirically compare the retrieval and downstream open-domain QA tasks performance when dense retrievers work with Wikipedia indexed at the level of 100-word passage, sentence or proposition \((C, D)\).

  • Results:
    1. Passage Retrieval Performance: Proposition-based retrieval consistently outperformed sentence and passage-level retrieval across all datasets and models. This was particularly evident with unsupervised retrievers like SimCSE and Contriever, which showed an average Recall@5 improvement of 12.0% and 9.3%, respectively.
    2. Cross-Task Generalization: The advantage of proposition retrieval was most pronounced in cross-task generalization settings, especially for queries about less common entities. It showed significant improvement over other granularities in datasets not seen during the training of the retriever models.
    3. Downstream QA Performance: In the retrieve-then-read setting, proposition-based retrieval led to stronger downstream QA performance. This was true for both unsupervised and supervised retrievers, with notable improvements in exact match (EM) scores.
    4. Density of Question-Related Information: Propositions proved to offer a higher density of relevant information, resulting in the correct answers appearing more frequently within the top-l retrieved words. This was a significant advantage over sentence and passage retrieval, particularly in the range of 100-200 words.
    5. Error Analysis: The study also highlighted the types of errors typical to each retrieval granularity. For example, passage-level retrieval often struggled with entity ambiguity, while proposition retrieval faced challenges in multi-hop reasoning tasks.
  • The figure plot from the paper shows that retrieving by propositions yields the best retrieval performance in both passage retrieval task and downstream open-domain QA task, e.g. with Contriever or GTR as the backbone retriever.

  • The research demonstrates that using propositions as retrieval units significantly improves dense retrieval performance and downstream QA task accuracy, outperforming traditional passage and sentence-based methods. The introduction of FACTOIDWIKI, with its 250 million propositions, is expected to facilitate future research in information retrieval.

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

  • This paper by Saad-Falcon et al. from Stanford University and UC Berkeley, the paper introduces ARES (Automated RAG Evaluation System) for evaluating Retrieval-Augmented Generation (RAG) systems in terms of context relevance, answer faithfulness, and answer relevance.
  • ARES generates synthetic training data using a language model and fine-tunes lightweight LM judges to assess individual RAG components. It utilizes a small set of human-annotated data points for prediction-powered inference (PPI), enabling statistical guarantees for its predictions.
  • The framework has three stages:
    1. LLM Generation of Synthetic Dataset: ARES uses generative LLMs (like FLAN-T5 XXL) to create synthetic datasets of question-answer pairs derived from target corpus passages. This stage includes both positive and negative examples for training.
    2. Preparing LLM Judges: Separate lightweight LM models are fine-tuned for three classification tasks - context relevance, answer faithfulness, and answer relevance - using the synthetic dataset. These models are tuned using a contrastive learning objective.
    3. Ranking RAG Systems with Confidence Intervals:
      • After preparing the LLM judges, the next step involves using them to score and rank various RAG systems. This process begins with ARES sampling in-domain query-document-answer triples from each RAG approach. The judges then label each triple, assessing context relevance, answer faithfulness, and answer relevance. These labels are averaged for each in-domain triple to evaluate the performance of the RAG systems across the three metrics.
      • While average scores could be reported as quality metrics for each RAG system, these scores are based on unlabeled data and predictions from synthetically-trained LLM judges, which may introduce noise. An alternative is to rely solely on a small human preference validation set for evaluation, examining the extent to which each RAG system aligns with human annotations. However, this method requires labeling outputs from each RAG system separately, which can be time-consuming and expensive.
      • To enhance the precision of the evaluation, ARES employs prediction-powered inference (PPI). PPI is a statistical method that narrows the confidence interval of predictions on a small annotated dataset by utilizing predictions on a larger, non-annotated dataset. It combines labeled datapoints and ARES judge predictions on non-annotated datapoints to construct tighter confidence intervals for RAG system performance.
      • PPI involves using LLM judges on the human preference validation set to learn a rectifier function. This function constructs a confidence set of the ML model’s performance, taking into account each ML prediction in the larger non-annotated dataset. The confidence set helps create a more precise confidence interval for the average performance of the ML model (e.g., its context relevance, answer faithfulness, or answer relevance accuracy). By integrating the human preference validation set with a larger set of datapoints with ML predictions, PPI develops reliable confidence intervals for ML model performance, outperforming traditional inference methods.
      • The PPI rectifier function addresses errors made by the LLM judge and generates confidence bounds for the success and failure rates of the RAG system. It estimates performances in context relevance, answer faithfulness, and answer relevance. PPI also allows for estimating confidence intervals with a specified probability level; in these experiments, a standard 95% alpha is used.
      • Finally, the accuracy confidence interval for each component of the RAG is determined, and the midpoints of these intervals are used to rank the RAG systems. This ranking enables a comparison of different RAG systems and configurations within the same system, aiding in identifying the optimal approach for a specific domain.
        • In summary, ARES employs PPI to score and rank RAG systems, using human preference validation sets to calculate confidence intervals. PPI operates by first generating predictions for a large sample of data points, followed by human annotation of a small subset. These annotations are used to calculate confidence intervals for the entire dataset, ensuring accuracy in the system’s evaluation capabilities.
  • To implement ARES for scoring a RAG system and comparing to other RAG configurations, three components are needed:
    • A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples but several hundred examples is ideal.
    • A set of few-shot examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system.
    • A much larger set of unlabeled query-document-answer triples outputted by your RAG system for scoring.
  • The figure below from the paper shows an overview of ARES: As inputs, the ARES pipeline requires an in-domain passage set, a human preference validation set of 150 annotated datapoints or more, and five few-shot examples of in-domain queries and answers, which are used for prompting LLMs in synthetic data generation. To prepare our LLM judges for evaluation, we first generate synthetic queries and answers from the corpus passages. Using our generated training triples and a constrastive learning framework, we fine-tune an LLM to classify query–passage–answer triples across three criteria: context relevance, answer faithfulness, and answer relevance. Finally, we use the LLM judge to evaluate RAG systems and generate confidence bounds for the ranking using PPI and the human preference validation set.

  • Experiments conducted on datasets from KILT and SuperGLUE demonstrate ARES’s accuracy in evaluating RAG systems, outperforming existing automated evaluation approaches like RAGAS. ARES is effective across various domains, maintaining accuracy even with domain shifts in queries and documents.
  • The paper highlights the strengths of ARES in cross-domain applications and its limitations, such as its inability to generalize across drastic domain shifts (e.g., language changes, text-to-code). It also explores the potential of using GPT-4 for generating labels as a replacement for human annotations in the PPI process.
  • ARES code and datasets are available for replication and deployment at GitHub.
  • Code


  title   = {Retrieval Augmented Generation},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}