• Up until now, in all the sections previously, we’ve looked through many of the foundational elements of NLP.
  • Now, lets talk about the current research within this domain. As I make my way through more papers, I’ll keep updating this page with all the information I think is vital to share!


Dense Passage Retrieval for Open-Domain Question Answering

  • Authors: Vladimir Karpukhin, Barlas Oguz,Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih
  • In open-domain question answering (system’s capability to answer questions on any topic rather than being restricted on a specific domain), it’s vital to efficiently identify the right passages from vast information sources (retrieval). Traditional methods, like TF-IDF and BM25, utilize sparse vector models to pick these passages. However, Karpukhin and colleagues in their 2020 EMNLP paper demonstrate a novel approach: using dense vector representations. They employ a dual-encoder framework to generate embeddings from a select set of questions and passages.
  • Their objective is metric learning: crafting a vector space where relevant question-passage pairs are closer together than unrelated ones. They optimize this by focusing on the likelihood of selecting the correct (positive) passage amidst a sea of irrelevant (negative) ones.
  • Collecting negative examples for training from such a vast pool is challenging. Their solution? Utilizing random passages, ones that match the most question tokens without the actual answer (via BM25), and relevant passages paired with other questions. The most effective model they produced uses these “gold” passages from the same training batch as negative instances, combined with one BM25 negative passage.
  • Results were promising. When tested on diverse open-domain QA datasets, their model greatly outperformed the established Lucene-BM25 system, enhancing top-20 passage retrieval accuracy by 9%-19%. This led to their model setting new performance benchmarks in open-domain QA.

Dense Passage Retriever (DPR):

  1. Purpose: The goal of the DPR is to improve the retrieval component in open-domain QA. This involves efficiently retrieving relevant text passages from a vast collection when given a question.
  2. Key Task: Given a large number \(M\) of text passages, the DPR aims to index all of these passages in a low-dimensional continuous space, making it efficient to retrieve the top \(k\) most relevant passages for a given input question. \(M\) can be very large, like 21 million passages, but \(k\) (the number of passages we want to retrieve for a given question) is relatively small, often between 20 and 100.
  3. DPR’s Mechanism:
    • Dense Encoder for Passages \(EP(\cdot)\): It converts any text passage to a \(d\)-dimensional real-valued vector. This encoder processes and indexes all \(M\) passages for retrieval.
    • Encoder for Questions \(EQ(\cdot)\): At runtime, when a question is posed, this encoder turns the question into a \(d\)-dimensional vector.
    • Similarity Measurement: The similarity between a question and a passage is calculated using the dot product of their respective vectors: \(sim(q, p) = EQ(q) \cdot EP(p)\).
  4. Passage Size and Boundaries: The passage’s size and the decision of where a passage begins and ends affect the retriever and reader. Fixed-length passages have been found to be more effective in retrieval and QA accuracy.
  5. Encoders Implementation: The encoders for both questions and passages are based on BERT networks, a popular deep learning model for NLP. They use the representation at the [CLS] token as the output, meaning the output vector has 768 dimensions.
  6. Inference: During the process of answering a question, the system uses the passage encoder to process all passages and then indexes them using FAISS, an efficient library for similarity search. For any given question, its embedding is computed, and the top \(k\) passages with the closest embeddings are retrieved.
  7. Training:
    • The main goal during training is to optimize the encoders such that relevant questions and passages have a high similarity (close in vector space) and irrelevant ones have a low similarity.
    • The training data consists of question-passage pairs with both positive (relevant) and negative (irrelevant) passages. The system is trained to increase the similarity for relevant pairs and decrease it for irrelevant ones.
    • For training, they have explicit positive examples (relevant passages) but need to choose negatives from a vast collection. They experimented with different types of negative passages: random, those ranked high by BM25 but not containing the answer, and relevant passages for other questions.
  8. In-batch Negatives: A training optimization method is discussed where they use relevant passages from the same batch of questions as negatives, which makes computation more efficient. This technique leverages the similarities between passages in the same batch to boost the number of training examples, effectively reusing computation.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  • The paper by Lewis et al. from Facebook AI Research, University College London, and New York University, introduces Retrieval-Augmented Generation (RAG) models combining pre-trained parametric and non-parametric memory for language generation tasks.
  • Addressing limitations of large pre-trained language models, such as difficulty in accessing and precisely manipulating knowledge, RAG models merge a pre-trained sequence-to-sequence (seq2seq) model with a dense vector index of Wikipedia, accessed by a neural retriever.
  • The RAG framework encompasses two models: RAG-Sequence, using the same retrieved document for the entire sequence, and RAG-Token, allowing different passages for each token.
  • The retrieval component, Dense Passage Retriever (DPR), uses a bi-encoder architecture with BERT-based document and query encoders. The generator component utilizes BART-large, a pre-trained seq2seq transformer with 400M parameters.
  • RAG models were trained jointly on the retriever and generator components without direct supervision on which documents to retrieve, using stochastic gradient descent with Adam. The training used a Wikipedia dump as the non-parametric knowledge source, split into 21M 100-word chunks.
  • In open-domain QA tasks, RAG established new state-of-the-art results, outperforming both parametric seq2seq models and task-specific retrieve-and-extract architectures. RAG models showed the ability to generate correct answers even when the right answer wasn’t in any retrieved document.
  • RAG-Sequence surpassed BART in Open MS-MARCO NLG, indicating less hallucination and more factually correct text generation. RAG-Token outperformed RAG-Sequence in Jeopardy question generation, demonstrating higher factuality and specificity.
  • On the FEVER fact verification task, RAG models achieved results close to state-of-the-art models that require more complex architectures and intermediate retrieval supervision.
  • This study showcases the effectiveness of hybrid generation models, combining parametric and non-parametric memories, offering new directions in combining these components for a range of NLP tasks.


Zephyr: Direct Distillation of LM Alignment

  • Authors: Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clementine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf
  • The paper introduces a technique termed “distilled direct preference optimization” (dDPO), designed to align a small language model (LM) to user intent via distillation, eliminating the need for human feedback. Furthermore, the study presents a 7B parameter language model named Zephyr, which is specifically tailored to align with user intent. Their approach has 3 main steps:
    1. Distilled Supervised Fine-Tuning (dSFT): They first fine-tune the base 7B Mistral model using the UltraChat dataset, which contains 1.4M dialogues generated by having a large proprietary teacher model like GPT-3.5 Turbo converse with itself. This provides a strong initialization for the student model.
    2. AI Feedback (AIF) Collection: An ensemble of diverse open chat models (e.g. Claude, Falcon) are used to generate responses to prompts from the UltraFeedback dataset. These responses are then scored by a powerful teacher model like GPT-4. The top scoring response is taken as the “chosen” response and one random lower scoring response as the “rejected” response. This provides training pairs of good vs bad responses.
    3. Distilled Direct Preference Optimization (dDPO): The dSFT model is further optimized by training it to rank the “chosen” responses higher than “rejected” responses from the AIF collection step. This is done by directly optimizing a preference likelihood objective on the static AIF data without needing to sample from the model during training.
  • They apply this approach to train Zephyr-7B, starting from Mistral-7B. First dSFT using UltraChat (1.4M examples from GPT-3.5), then AIF from UltraFeedback (64K prompts ranked by GPT-4), then dDPO.
  • Results:
    • Zephyr-7B sets a new SOTA for 7B models on MT-Bench (7.34 score) and AlpacaEval (90.6% win rate), surpassing prior best dSFT and PPO distillation methods.
    • It matches performance of 70B RLHF models like LLaMA2 on MT-Bench.
    • Ablations show dSFT is necessary before dDPO, and overfitting dDPO can still improve performance.
  • The key technical innovation is direct distillation of preferences without human involvement, through dSFT then dDPO, achieving strong alignment for small 7B models.
  • The resulting 7B Zephyr model sets a new SOTA for alignment and conversational ability compared to other 7B models. It even outperforms the 70B LLaMA2 model on the MT-Bench benchmark.
  • Key advantages are that it requires no human labeling or feedback, scales easily to larger models, and can be trained in just a few hours on commercially available hardware. Limitations are potential biases inherited from the teacher models and lack of safety considerations. Overall, it demonstrates the surprising efficacy of distillation and preference learning for aligning smaller open models.
  • The image below (source) gives a graphical sense of Zephyr’s performance on tasks as compared with our LLMs.


Lost in the Middle: How Language Models Use Long Contexts

  • This paper by Liu et al. from Stanford University, University of California Berkeley, and Samaya AI, focuses on analyzing language models’ performance in tasks that require identifying relevant information in long input contexts. The research particularly highlights issues in multi-document question answering and key-value retrieval tasks, revealing a significant degradation in performance when relevant information is situated in the middle of lengthy contexts.
  • The study involved an experimental setup for multi-document question answering. Models were tasked with identifying relevant information from a set of documents to answer questions. The researchers manipulated both the length of the input context and the position of the relevant information to observe changes in task performance.
  • Several state-of-the-art open and closed language models were evaluated. Among the open models were MPT-30B-Instruct, capable of handling up to 8192 tokens, and LongChat-13B (16K), which extends the context window to 16384 tokens. Closed models included GPT-3.5-Turbo and its variant with an expanded context length of 16K tokens, as well as Claude-1.3 and Claude-1.3 (100K).
  • The results revealed a distinct U-shaped performance curve across these models. They performed best when relevant information appeared at the beginning or end of the input context. However, the performance significantly declined when accessing information in the middle of long contexts, challenging the efficacy of extended-context models in utilizing their input effectively.
  • A synthetic key-value retrieval task was also used to assess models’ ability to retrieve exact matches from an input context. The task’s simplicity varied across models, with some achieving near-perfect performance, while others struggled with larger contexts.
  • The study also explored the impact of model architecture on context usage, comparing decoder-only and encoder-decoder models. Encoder-decoder models like Flan-T5-XXL and Flan-UL2 exhibited more stable performance across various contexts. However, they also began to show performance degradation with sequences longer than their training-time context windows.
  • The impact of query-aware contextualization was examined. While this dramatically improved performance in the key-value retrieval task, it had only a minimal effect on the multi-document question answering task.
  • Instruction fine-tuning’s effect was analyzed by comparing models like MPT-30B and MPT-30B-Instruct, both fine-tuned for instructions. Both models showed similar U-shaped performance curves, indicating that instruction fine-tuning alone is not responsible for these trends.
  • In a case study on open-domain question answering, the research found that model performance does not always improve with an increase in the amount of context provided. The study observed that performance saturates before retriever recall, suggesting that providing too much context may not be beneficial and could potentially reduce accuracy.


Precise Zero-Shot Dense Retrieval without Relevance Labels

  • The paper by Gao, Ma, Lin, and Callan from Carnegie Mellon University and University of Waterloo introduces Hypothetical Document Embeddings (HyDE), a novel approach for fully zero-shot dense retrieval in the absence of relevance labels. HyDE utilizes instruction-following language models (like InstructGPT) to generate a hypothetical document capturing relevance patterns, although these documents may contain inaccuracies or fictional details.
  • Dense retrieval has been effective across various tasks and languages but creating an effective fully zero-shot dense retrieval system without relevance labels remains challenging. Traditional methods like negative mining, distillation, and task-specific pre-training have been proposed to enhance supervised dense retrieval models, yet zero-shot dense retrieval still presents difficulties.
  • HyDE’s methodology involves two main steps: generating a hypothetical document that answers the query, and then encoding this document into an embedding vector using an unsupervised contrastively learned encoder like Contriever. This process pivots away from traditional dense retrieval’s reliance on relevance judgments, instead utilizing a language model’s ability to generate relevant content.
  • Experiments conducted with HyDE used InstructGPT and Contriever models, along with datasets such as TREC DL19, DL20 (based on MS-MARCO), and a collection from the BEIR dataset for web search, question answering, fact verification, and non-English retrieval tasks. The results showed that HyDE outperforms the state-of-the-art unsupervised dense retriever Contriever and is comparable to fine-tuned retrievers across these tasks and languages.
  • The paper concludes by reflecting on HyDE’s novel approach to relevance modeling, which shifts from traditional numerical relevance scores to leveraging natural language generation models. This paradigm suggests a future where the need for relevance labels might be eliminated, and relevance modeling and instruction understanding can be delegated to more powerful and flexible language models. HyDE is practical in the initial stages of a search system’s life, providing performance comparable to fine-tuned models without reliance on relevance labels.

ALCUNA: Large Language Models Meet New Knowledge

  • Authors: Xunjian Yin, Baizhou Huang, and Xiaojun Wan
  • The paper proposes a new method called KnowGen to generate artificial entities with new knowledge by making changes to the attributes and relationships of existing entities. This simulates the natural process of new knowledge emerging in the real world.
  • KnowGen is applied to structured biological taxonomic data from the EOL database to create artificial organisms. This results in a benchmark dataset called ALCUNA for evaluating large language models (LLMs) on their ability to handle new knowledge.
  • ALCUNA contains questions testing the model’s knowledge understanding, differentiation, and association abilities when faced with new entities.
  • Several popular LLMs like ChatGPT, Alpaca, Vicuna, and ChatGLM are evaluated on ALCUNA in zero-shot and few-shot settings. The results show these models still struggle with reasoning between new and existing knowledge.
  • Analysis reveals factors impacting model performance on new knowledge like entity similarity, contextual knowledge, and input representation format.
  • The paper argues benchmarks with truly new knowledge like ALCUNA are important to drive progress in LLMs’ ability to understand and reason with new information, as opposed to existing knowledge already seen during training.
  • The artificial nature of the knowledge in ALCUNA makes it reusable as a standard benchmark to assess different models on new knowledge without having to collect new data repeatedly.
  • This paper proposes a novel method to automatically generate new structured knowledge for evaluating LLMs’ capabilities in more realistic and challenging settings involving unfamiliar information. The ALCUNA benchmark constructed using this approach provides insights into current model limitations and opportunities for improvement.

The Perils & Promises of Fact-checking with Large Language Models

  • Authors: Dorian Quelle & Alexandre Bovet
  • The paper evaluates using large language models (LLMs) like GPT-3.5 and GPT-4 for automated fact-checking of claims. This is important as LLMs are being used more in high stakes domains like research and journalism.
  • They test the models on two datasets: PolitFact (US political claims) and a multilingual dataset from Data Commons. The models are evaluated with and without providing contextual information from web searches.
  • Motivation: Fact-checking is important to combat misinformation, but manual fact-checking has limited capacity. Large language models (LLMs) like GPT-3.5 and GPT-4 are increasingly used for writing and information gathering, so understanding their fact-checking abilities is critical.

  • Methods: Evaluated GPT-3.5 and GPT-4 on fact-checking claims from PolitiFact and a multilingual dataset. Tested models with and without retrieving context from Google. Compared performance across languages.

  • Key Results:
    • GPT-4 outperformed GPT-3.5 overall.
    • Providing context significantly improved accuracy, highlighting the importance of evidence gathering.
    • Models struggled with ambiguous “half-true” type verdicts.
    • Performance varied across languages - non-English claims saw a boost when translated to English first.
    • No sharp drop in accuracy after GPT-3.5/4 training cutoff dates, suggesting continued learning from human feedback.
  • Limitations:
    • Biased evaluation due to use of GPT-4 as a scorer.
    • Did not explore model scaling or curating better training data.
    • Safety/ethics of potential misinformation not addressed.
  • Implications:
    • LLMs show promise for assisting human fact-checkers but cannot fully automate the process yet.
    • Critical examination of LLM reasoning is important before deployment.
    • Understanding model limitations and language-specific differences is key.
    • Continued learning after initial training needs more investigation.
  • The paper provides a comprehensive evaluation of GPT-3.5 and GPT-4 on fact-checking, using novel context retrieval and multilingual data. Key findings highlight the models’ strengths as well as areas needing improvement before responsible LLM-assisted fact-checking.