• Below we will look at a few designs

Designing an autocomplete search system for Airbnb

1. Inputs and Outputs

Input: User input as they type into the search bar. This could be partial or complete words.

Output: A list of suggested search terms or destinations, dynamically updated as more characters are typed.
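
As a baseline for the input/output contract above, the sketch below maps a typed prefix to the most frequent matching queries in a log. It is a minimal illustration, not Airbnb's method: the toy `log`, the function names `build_index` and `suggest`, and ranking purely by frequency are all assumptions.

```python
from bisect import bisect_left
from collections import Counter

def build_index(query_log):
    """Count each logged query and return (sorted unique queries, counts)."""
    counts = Counter(q.strip().lower() for q in query_log)
    return sorted(counts), counts

def suggest(prefix, sorted_queries, counts, k=3):
    """Return up to k logged queries that start with prefix, most frequent first."""
    prefix = prefix.strip().lower()
    start = bisect_left(sorted_queries, prefix)  # first query >= prefix
    matches = []
    for q in sorted_queries[start:]:
        if not q.startswith(prefix):
            break  # sorted order: no later query can match
        matches.append(q)
    return sorted(matches, key=lambda q: (-counts[q], q))[:k]

# Hypothetical query log
log = ["paris", "paris", "palo alto", "palm springs", "paris", "palm springs", "porto"]
sorted_queries, counts = build_index(log)
print(suggest("pa", sorted_queries, counts))  # ['paris', 'palm springs', 'palo alto']
```

A learned model would typically re-rank these candidates using the contextual and user-history features listed in the next section.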

2. Data and Features

Data Collection:

  • Query Logs: Historical search data, including successful queries that led to bookings or prolonged engagement.
  • User Data: Information might include past searches, bookings, and user preferences if logged in.


Features:

  • Text Input Features: Current string typed by the user.
  • Contextual Features: Time of day, user location (if available), user language settings.
  • User History Features (if available): Past search queries, favorite destinations, previous bookings.

3. Handling Sparsity and Biases

  • Handling Sparsity: Use embeddings for textual input to capture semantic meanings from sparse data.
  • Bias Mitigation: Ensure that the autocomplete suggestions do not favor only popular destinations but also provide personalized suggestions based on user history and context.

4. Model Selection

Several model types could be effective for an autocomplete system:

  • Sequence-to-Sequence Models (Seq2Seq): Good for predicting the next item in a sequence, such as the next word in a search query.
  • Recurrent Neural Networks (RNNs), particularly LSTM (Long Short-Term Memory) networks, which are effective in handling sequences of text data.
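  • Transformer Models: Use attention mechanisms that parallelize better than RNNs and often capture long-range context more effectively. A decoder-style (causal) Transformer fits next-token prediction directly, while BERT-like masked language modeling is better suited to encoding the typed query than to generating completions.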

5. Loss Function and Optimization

Loss Function:

  • Cross-Entropy Loss: Commonly used for classification tasks like predicting the next word in a sequence.

Optimization Algorithm:

  • Adam: A popular choice for training deep learning models because its per-parameter adaptive learning rates handle sparse gradients well.

6. Fine-Tuning Step

  • Personalization Layer: Introduce a layer that adjusts predictions based on the user’s historical data, making the suggestions more relevant to the individual user.

7. Evaluation Metrics

  • Offline Metrics:
    • Top-k Accuracy: Measures whether the correct suggestion is within the top k predictions.
    • Mean Reciprocal Rank (MRR): Useful for ranking problems, especially where the order of suggestions matters.
  • Online Metrics:
    • Click-Through Rate (CTR) on suggestions: Higher CTR indicates more relevant suggestions.
    • Engagement Metrics: Time spent after clicking a suggestion, conversion rates into bookings.
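
The two offline metrics above can be computed as in this short sketch. The prediction lists and targets are made-up examples, and the function names are illustrative.

```python
def top_k_accuracy(ranked_lists, targets, k):
    """Fraction of examples whose target appears in the top k suggestions."""
    hits = sum(t in r[:k] for r, t in zip(ranked_lists, targets))
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the target; contributes 0 when the target is missing."""
    total = 0.0
    for r, t in zip(ranked_lists, targets):
        if t in r:
            total += 1.0 / (r.index(t) + 1)
    return total / len(targets)

# Hypothetical suggestions for three prefixes and the query each user finally chose
preds = [["paris", "parma"], ["rome", "roma norte"], ["lisbon", "lima"]]
truth = ["paris", "roma norte", "lyon"]

print(top_k_accuracy(preds, truth, k=1))   # 1 of 3 targets is ranked first
print(mean_reciprocal_rank(preds, truth))  # (1 + 1/2 + 0) / 3 = 0.5
```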

8. End-to-End Process

  1. Data Preprocessing: Clean and preprocess historical search data and user data.
  2. Feature Engineering: Develop features that capture the essence of the user’s current input and context.
  3. Model Development: Train the chosen model on historical data.
  4. Model Evaluation: Use offline metrics to evaluate model performance. Adjust parameters as necessary.
  5. Deployment: Integrate the model into the Airbnb search infrastructure.
  6. Online Testing: Implement A/B testing to compare the new model against the current system.
  7. Continuous Monitoring and Updating: Regularly monitor the system’s performance and update the model with new data or fine-tune parameters to adapt to changing user behaviors or preferences.

By following these steps, you can build a robust autocomplete system for Airbnb that not only enhances user experience by reducing search friction but also potentially increases the likelihood of bookings by making relevant suggestions based on the user’s intent and context.

Airbnb search ranking with personalization

1. Inputs and Outputs

Input: The input to the model includes data about:

  • User Profile: Age, location, past booking history, click behavior, user preferences.
  • Listing Data: Location, price, amenities, host rating, number of reviews, availability, historical occupancy.
  • Search Context: Time of search, length of stay, number of guests, search filters.
  • Temporal Dynamics: Seasonality, local events.

Output: The output is a ranked list of listings tailored to maximize the likelihood of booking based on the user’s profile and search context.

2. Data and Features

Data Collection:

  • User Data: Gather from user profiles and interaction logs.
  • Listing Data: Extract from the property management system.
  • Search Data: Capture real-time data when users perform searches.


Features:

  • User Features: Demographics, booking history vector, interaction patterns.
  • Listing Features: Categorical embeddings (location, room type), numerical features (price normalized by location, number of beds).
  • Contextual Features: Time features (day of the week, time of day), special dates (holidays).
  • Text Features: NLP features from listing descriptions using TF-IDF or embeddings.

Handling Sparsity:

  • Use embedding layers for categorical data to reduce dimensionality and handle sparsity.
  • Apply techniques like feature hashing on high-cardinality features.

Positional Bias:

  • Implement models that account for positional bias in how users view search results, such as using a Position-Based Model.
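
One way to make the position-based model concrete is inverse propensity weighting: if a click requires that a position was examined and that the listing was relevant, dividing each click by the examination probability of its position gives a position-corrected relevance estimate. The sketch below assumes the examination propensities are already known (in practice they must be estimated); the log and the numbers are hypothetical.

```python
def ipw_relevance(click_log, propensity):
    """Estimate per-listing relevance from (listing_id, position, clicked) rows.

    Each click is weighted by 1 / P(position examined), so clicks at rarely
    examined positions count for more.
    """
    weighted_clicks, impressions = {}, {}
    for listing_id, position, clicked in click_log:
        impressions[listing_id] = impressions.get(listing_id, 0) + 1
        if clicked:
            weighted_clicks[listing_id] = (
                weighted_clicks.get(listing_id, 0.0) + 1.0 / propensity[position]
            )
    return {lid: weighted_clicks.get(lid, 0.0) / n for lid, n in impressions.items()}

# Assumed examination probabilities per display position
propensity = {1: 1.0, 2: 0.5}
# Listing A is always shown first, listing B always second
click_log = [("A", 1, True), ("A", 1, True), ("A", 1, False), ("A", 1, False),
             ("B", 2, True), ("B", 2, False), ("B", 2, False), ("B", 2, False)]

print(ipw_relevance(click_log, propensity))  # {'A': 0.5, 'B': 0.5}
```

The raw click-through rates are 0.5 for A and 0.25 for B, but after correcting for position both listings look equally relevant.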

3. Model Selection


  • Gradient Boosted Decision Trees (GBDT): Effective for sparse, heterogeneous data.
  • Deep Learning Models: Utilize neural networks with embeddings for deep feature interaction.
  • Ranking Models: Pairwise models like RankNet or listwise models like ListNet.

4. Loss Function and Optimization

Loss Function:

  • Use a ranking-specific loss, such as pairwise logistic loss or listwise log likelihood loss.
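
As a small illustration of the pairwise logistic loss mentioned above (the RankNet-style form), the sketch below scores every pair in which one listing has a higher label than the other. The scores and labels are made up; a real system would backpropagate this loss through the scoring model.

```python
import math

def pairwise_logistic_loss(score_pos, score_neg):
    """log(1 + exp(-(s_pos - s_neg))): small when the preferred item scores higher."""
    return math.log1p(math.exp(-(score_pos - score_neg)))

def query_loss(scores, labels):
    """Average pairwise loss over all pairs (i, j) with labels[i] > labels[j]."""
    pairs = [(i, j) for i in range(len(labels)) for j in range(len(labels))
             if labels[i] > labels[j]]
    if not pairs:
        return 0.0
    return sum(pairwise_logistic_loss(scores[i], scores[j]) for i, j in pairs) / len(pairs)

# Labels: 1 = booked, 0 = shown but not booked (hypothetical)
print(query_loss([2.0, 0.0], [1, 0]))  # booked listing scored higher: low loss
print(query_loss([0.0, 2.0], [1, 0]))  # booked listing scored lower: high loss
```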

Optimization Algorithm:

  • Stochastic Gradient Descent (SGD) for deep learning models.
  • Gradient boosting implementations such as XGBoost or LightGBM for tree-based models.

5. Fine-Tuning Step

Layer Addition:

  • Add a personalization layer in neural networks that adjusts weights based on user-specific data to personalize rankings.
  • Use LoRA (Low-Rank Adaptation) on attention layers to fine-tune deep learning models without large-scale retraining.

6. Evaluation

Offline Metrics:

  • NDCG (Normalized Discounted Cumulative Gain): Measures the quality of the ranking.
  • Precision@k, Recall@k: Effectiveness of capturing relevant listings.
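
A minimal NDCG@k sketch follows. It uses the linear-gain variant (relevance divided by a log2 rank discount); some formulations use 2^rel − 1 as the gain instead. The graded relevance values are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    """DCG of the first k results divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Relevance of each listing in the order the model ranked them (e.g., 3 = booked)
print(ndcg_at_k([3, 2, 1], k=3))  # already ideally ordered, so 1.0
print(ndcg_at_k([1, 2, 3], k=3))  # best listing ranked last, so below 1.0
```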

Online Metrics:

  • CTR (Click-Through Rate): Measures engagement.
  • Conversion Rate: Direct measure of booking effectiveness.

7. End-to-End Process

  1. Data Collection: Gather and preprocess data from various sources.
  2. Feature Engineering: Develop and select relevant features.
  3. Model Training: Train the model using historical data.
  4. Validation: Validate the model using a separate validation set.
  5. Fine-Tuning: Optimize the model with user-specific data.
  6. Deployment: Deploy the model in a production environment.
  7. Monitoring and Updating: Continuously monitor model performance and update the model as needed based on performance and new data.

This system provides a comprehensive approach to personalizing search rankings to maximize bookings, addressing challenges such as sparsity and positional bias, and keeping the model effective and relevant over time.

Vacation Rental

User Search Sessions as Sentences

  • Session 1: L1 → L2 → L3 → L4 → L5 → L2
  • Session 2: L2 → L6 → L7 → L2 → L8
  • Session 3: L9 → L10 → L2 → L11 → L12

L = Listing (analogous to a word in NLP)

Training the Model (Skip-gram with Negative Sampling)

| Central Listing (Target) | Context Listings (Positive) | Negative Listings (Randomly Sampled) |
|--------------------------|-----------------------------|--------------------------------------|
| L2                       | L1, L3, L6, L7, L8          | L15, L20, L25                        |
| L7                       | L6, L2                      | L17, L21, L26                        |
| L10                      | L9, L2, L11                 | L18, L22, L27                        |

Training Objectives

  • Minimize distance between L2 and L1, L3, L6, L7, L8
  • Maximize distance between L2 and L15, L20, L25
  • Refine embeddings iteratively using gradient descent


  • Booked listings treated as global context, influencing all session listings
  • Market-specific negatives added for more relevant within-market comparisons

Usage of Embeddings

  • Search Ranking: Embeddings influence the order of listing display
  • Recommendations: “Similar Listings” powered by nearest neighbor embeddings
  • Real-Time Personalization: Adjust search based on session interactions

  • Airbnb creates listing embeddings by employing a model inspired by techniques used in natural language processing, particularly those similar to the Skip-gram model from Word2Vec. This model is adapted to the domain of online property listings rather than textual data. The essence of the approach is to treat user interactions with listings—clicks within their search sessions—as sequences similar to sentences in a text corpus, where the ‘words’ are the listings.

Detailed Breakdown of the Model Used for Listing Embeddings

1. Model Foundation: Skip-gram with Negative Sampling

  • Skip-gram Model: This model is generally used in NLP to predict the context words (surrounding words) for a given target word. In the case of Airbnb, the target ‘word’ is a central listing, and the ‘context words’ are other listings that users also viewed or interacted with during their search sessions.
  • Negative Sampling: A technique used to improve the efficiency of training the Skip-gram model, especially with a large number of outputs (i.e., the vast number of possible listings). Negative sampling involves randomly selecting a small number of ‘negative’ listings (those not in the context) to update weights for, instead of updating all possible listings in the vocabulary. This reduces computational complexity significantly.

2. Adaptation to Listing Data

  • Session Data as Sentences: Airbnb treats each user’s search session as a ‘sentence’ and each listing clicked within that session as a ‘word’. This analogy allows the use of NLP techniques to capture the semantic relationships between listings based on their co-occurrence in sessions.
  • Global Context and Market-Specific Modifications:
    • Global Context: Listings that result in a booking are given special importance and are treated as global context, always considered in the positive context for the central listing, regardless of their position in the session.
    • Market-Specific Negatives: To address the issue of non-representative negative samples (which might be from completely different geographical or market contexts), Airbnb adds negatives from the same market as the target listing to refine the learned similarities and ensure they are market-relevant.

3. Training Process

  • The embeddings are initialized randomly and then refined through training. The training involves sliding through user sessions and, for each listing (the central listing), adjusting its embedding to be more similar to other listings within its context window (positive examples) and less similar to a set of negative examples drawn randomly and from the same market.
  • The adjustments are made using gradient descent, optimizing a loss function designed to increase the similarity (reduce distance) between the embeddings of listings that appear together and decrease the similarity between those that do not.
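
The training loop described above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not Airbnb's implementation: negatives are drawn uniformly from a fixed ID range (no market-specific sampling), the window size, learning rate, and dimensions are arbitrary, and `training_pairs` and `sgns_step` are hypothetical helper names.

```python
import numpy as np

def training_pairs(session, window=2, booked=None):
    """Build (center, context) pairs from one click session.

    If a booked listing is given, it is added as a context for every center
    listing, mirroring the "global context" idea.
    """
    pairs = []
    for i, center in enumerate(session):
        lo, hi = max(0, i - window), min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, session[j]))
        if booked is not None and booked != center:
            pairs.append((center, booked))
    return pairs

def sgns_step(emb_in, emb_out, center, context, negatives, lr=0.05):
    """One SGD step on the skip-gram negative-sampling loss for a single pair."""
    v = emb_in[center].copy()
    grad_v = np.zeros_like(v)
    targets = [(context, 1.0)] + [(n, 0.0) for n in negatives]
    for target, label in targets:
        u = emb_out[target]
        g = 1.0 / (1.0 + np.exp(-(v @ u))) - label  # d(loss)/d(score)
        grad_v += g * u
        emb_out[target] = u - lr * g * v  # pull positives closer, push negatives away
    emb_in[center] = v - lr * grad_v

rng = np.random.default_rng(0)
num_listings, dim = 30, 8
emb_in = rng.normal(scale=0.1, size=(num_listings, dim))   # random initialization
emb_out = rng.normal(scale=0.1, size=(num_listings, dim))

session = [1, 2, 3, 4, 5, 2]  # Session 1 from the example above
for _ in range(50):
    for center, context in training_pairs(session, window=1):
        negatives = rng.choice(np.arange(15, 30), size=3, replace=False)
        sgns_step(emb_in, emb_out, center, context, negatives)

print("score(L2, L3) after training:", emb_in[2] @ emb_out[3])
```

Listings that co-occur in the session should end up with higher scores than the sampled negatives, which is the behavior the training objectives above describe.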

4. Practical Implementation and Usage

  • Embedding Calculation: Once trained, each listing in Airbnb’s marketplace is represented as a dense vector in a 32-dimensional space.
  • Usage in Search and Recommendations: These embeddings are used to calculate similarity scores between listings, influencing both the ranking of search results and the content of recommendation carousels like “Similar Listings.” The embeddings allow for real-time personalization by dynamically adjusting recommendations based on the listings a user interacts with during their session.
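
A “Similar Listings” lookup over trained embeddings can be sketched as a cosine nearest-neighbor search. The 2-dimensional vectors below are made up for readability (the text above mentions 32 dimensions), and a production system would likely use an approximate nearest-neighbor index instead of a full matrix product.

```python
import numpy as np

def most_similar(embeddings, listing_idx, k=2):
    """Return indices of the k listings with highest cosine similarity to listing_idx."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[listing_idx]
    sims[listing_idx] = -np.inf  # never recommend the listing itself
    return np.argsort(-sims)[:k].tolist()

# Hypothetical listing embeddings
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
print(most_similar(embeddings, 0))  # [1, 3]
```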

5. Evaluation and Optimization

  • Airbnb evaluates the effectiveness of these embeddings through both offline tests (e.g., measuring the accuracy of the embeddings in predicting user bookings) and online A/B testing to assess their impact on user engagement and conversion rates.

The model Airbnb uses for listing embeddings exemplifies an innovative application of NLP techniques to an e-commerce setting, leveraging user behavior data to learn deep semantic relationships between properties, which are then used to enhance user experience through more personalized and relevant search results.

  • ML problem: Recommend similar listings on a vacation rental platform based on users’ browsing history

Designing an ML System for Recommending Similar Vacation Rental Properties

1. Input and Output

  • Input: User profile (past bookings, clicks, search preferences), current property details (location, price, amenities), contextual information (time of year, local events).
  • Output: A ranked list of vacation rental properties similar to the user’s current or past preferences.

2. Data Collection

  • Property Data: Information about each property including location, price, amenities, property type, capacity, photos, reviews, and ratings.
  • User Data: User profiles, historical interaction data (views, bookings, preferences, reviews left by the user).
  • Contextual Data: Seasonality, local events, and holidays which might influence rental popularity.

3. Features

  • Property Features: Categorical (type, amenities), numerical (price, capacity, number of bedrooms and bathrooms), text (description, title).
  • User Features: Demographic information, booking history, search queries.
  • Interaction Features: Click-through rates, booking rates, and properties that were viewed but not booked.
  • Contextual Features: Time-related features like booking date, seasonality.

4. Handling Sparsity

  • Use of Embeddings: Represent sparse categorical features (like property type or amenities) as dense embeddings which can capture the underlying similarities between different categories.
  • Matrix Factorization Techniques: Helpful for handling sparsity in user-item interaction data.

5. Handling Positional Bias

  • Randomization When Collecting Training Data: To mitigate position bias in click data, randomize the display order of recommendations for a small share of traffic and train on the resulting logs.
  • Debiasing Techniques: Implement models that explicitly model and correct for positional biases.

6. Model Selection

  • Collaborative Filtering Models: Such as matrix factorization to capture the implicit relationships between users and properties based on historical interactions.
  • Content-Based Filtering Models: Using property features to recommend similar properties.
  • Hybrid Models: Combining collaborative and content-based methods to leverage both user behavior and property characteristics.
  • Neural Network Approaches: Such as using Multi-Layer Perceptrons (MLP) or more complex architectures like GPT for processing sequential interaction data and text features from user queries and property descriptions.

7. Loss Function

  • Ranking Loss Functions: Such as pairwise ranking loss (e.g., hinge loss) or listwise approaches (e.g., ListNet, ListMLE) suitable for training recommendation systems where the goal is to rank a list of items.

8. Optimization Algorithm

  • Stochastic Gradient Descent (SGD) or more advanced variants like Adam which are commonly used for training deep learning models due to their efficiency in handling large datasets.

9. Fine-Tuning Steps

  • Adding a Personalization Layer: A layer added to fine-tune the model to adapt to individual user preferences based on their interaction history.
  • Fine-Tuning on Specific Features: Like textual descriptions using pre-trained models like GPT fine-tuned on property descriptions to capture nuanced similarities.

10. Evaluation Metrics

  • Offline Metrics: Precision@k, Recall@k, NDCG (Normalized Discounted Cumulative Gain).
  • Online Metrics: Conversion rate, click-through rate, user satisfaction surveys.

11. End-to-End Process

  1. Data Collection: Gather and preprocess data from various sources.
  2. Feature Engineering: Create meaningful features from raw data.
  3. Model Training: Train candidate models on a training split and compare them on a hold-out validation set.
  4. Evaluation: Evaluate models using both offline and online metrics.
  5. Deployment: Deploy the chosen model into a production environment.
  6. Monitoring and Updating: Continuously monitor the model’s performance and update it with new data or retrain if performance degrades.

This comprehensive ML system combines data-driven insights with advanced machine learning techniques to provide personalized and context-aware recommendations for vacation rental properties.


  • Q and A

Data set

  • Documents to index


  • Document -> Chunk -> Embedding -> Index


  • Query -> Index -> Top K


  • LLM -> Response
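
The pipeline above (Document -> Chunk -> Embedding -> Index, then Query -> Index -> Top K -> LLM) can be sketched end to end with a toy embedding. Everything here is an assumption for illustration: the bag-of-words `embed` stands in for a learned encoder, the index is a plain list rather than a vector database, and the LLM call itself is omitted, so the sketch stops at building the prompt.

```python
import math
from collections import Counter

def chunk(text, size=8):
    """Split a document into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system would use a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents):
    """Document -> Chunk -> Embedding -> Index."""
    return [(c, embed(c)) for d in documents for c in chunk(d)]

def top_k(query, index, k=2):
    """Query -> Index -> Top K chunks by cosine similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical documents to index
docs = ["Check-in is after 3 pm and check-out is before 11 am.",
        "Pets are allowed in listings marked pet friendly."]
index = build_index(docs)
question = "are pets allowed"
print(build_prompt(question, top_k(question, index, k=1)))
```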

Extreme Classification Problem

Data set

  • Input: Wikipedia article
  • Output: labels of the article from its topics


  • BERT for Feature Extraction:
    • Utilize a pre-trained BERT model to convert article text into embeddings.
  • Clustering for Label Indexing:
    • Employ clustering algorithms (like K-means) to group similar labels.
  • Matching Model:
    • Choose between simpler models (like GBT) or more complex ones (like neural networks) to match articles to label clusters.
  • Ranking Model:
    • Options include linear models, gradient boosting machines, or neural networks for ranking labels within clusters.

End-to-End Process

  • Data Collection and Preprocessing: Source and preprocess Wikipedia articles.
  • Feature Extraction with BERT: Generate embeddings for each article using BERT.
  • Semantic Label Indexing: Cluster labels to group similar topics using clustering algorithms.
  • Training Matching Model: Train a model to map articles to relevant label clusters.
  • Training Ranking Model: Develop a model to rank labels within each identified cluster.
  • Fine-Tuning BERT: Optionally fine-tune BERT with Wikipedia articles.
  • Evaluation: Evaluate model performance with top-k metrics such as Precision@k and nDCG@k.
  • Deployment: Deploy the model for real-time classification of Wikipedia articles.
  • Monitoring and Updates: Continuously monitor performance and update models as needed.

+---------------------+    +----------------------+    +--------------------------+
|  Wikipedia Article  | -> | BERT Text Embeddings | -> | Semantic Label Indexing  |
|                     |    |                      |    |    (Label Clustering)    |
+---------------------+    +----------------------+    +--------------------------+
                                                                    |
                                                                    v
                                                       +--------------------------+
                                                       | Machine-Learned Matching |
                                                       +--------------------------+
                                                                    |
                                                                    v
                                                       +--------------------------+
                                                       |         Ranking          |
                                                       +--------------------------+
                                                                    |
                                                                    v
                                                       +--------------------------+
                                                       |      Output Labels       |
                                                       +--------------------------+


  1. Wikipedia Article: The raw input text.

  2. BERT Text Embeddings: This stage processes the input article through a pre-trained BERT model to generate rich text embeddings.

  3. Semantic Label Indexing (Label Clustering): This phase involves clustering the labels into groups of semantically similar topics. It’s an offline process that organizes the label space for more efficient processing.

  4. Machine-Learned Matching: In this step, a model (like a neural network or logistic regression) matches the BERT embeddings of the article with relevant label clusters.

  5. Ranking: A ranking model then scores and sorts the labels within the identified clusters to prioritize the most relevant labels for the article.

  6. Output Labels: The final output is a set of labels representing the topics of the Wikipedia article.

This architecture represents a pipeline where each component feeds into the next, starting from raw text input and ending with a set of topic labels. The system leverages BERT’s ability to understand complex language patterns, combined with clustering and machine learning models for efficient and effective label prediction in an extreme multi-label classification context.
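
The matching and ranking stages can be illustrated with a toy example. This sketch assumes the label embeddings and their cluster assignments already exist (for example from K-means), and it uses plain dot products as stand-ins for the learned matching and ranking models; all vectors are made up.

```python
import numpy as np

def predict_labels(article_vec, label_vecs, label_cluster, top_clusters=1, top_labels=2):
    """Match an article to label clusters, then rank labels inside those clusters."""
    clusters = np.unique(label_cluster)
    centroids = np.stack([label_vecs[label_cluster == c].mean(axis=0) for c in clusters])
    # Matching: keep the clusters whose centroid scores highest against the article
    chosen = clusters[np.argsort(-(centroids @ article_vec))[:top_clusters]]
    # Ranking: score only the labels that belong to the chosen clusters
    candidates = np.flatnonzero(np.isin(label_cluster, chosen))
    scores = label_vecs[candidates] @ article_vec
    return candidates[np.argsort(-scores)[:top_labels]].tolist()

# Four hypothetical labels in two clusters
label_vecs = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
label_cluster = np.array([0, 0, 1, 1])

print(predict_labels(np.array([0.9, 0.1]), label_vecs, label_cluster))  # [0, 1]
print(predict_labels(np.array([0.1, 0.9]), label_vecs, label_cluster))  # [2, 3]
```

Restricting ranking to the matched clusters is what keeps inference cheap when the label space is very large.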


Data set

  • Customer Service Conversations: A collection of transcripts from customer interactions, including queries and responses.
  • FAQs and Knowledge Base: Company-specific frequently asked questions and detailed articles or documents providing in-depth information.
  • User Feedback Logs: Historical logs of user feedback, including ratings and comments about the chatbot’s performance.
  • Intent and Entity Recognition Data: Data annotated for training the chatbot to recognize various user intents (e.g., booking a ticket, asking for help) and entities (e.g., dates, locations, product names).


  • Generative model
  • Treat the task as a Q and A system


  • Offline Evaluation
    • Accuracy of Intent Recognition: Measuring the percentage of correctly identified user intents.
    • Entity Extraction Accuracy: The precision and recall in identifying and extracting relevant entities.
    • Response Relevance: Assessing how relevant and correct the chatbot’s responses are to the given context.
    • Latency: The response time of the chatbot, which impacts user experience.
    • User Simulation Testing: Using simulated conversations to evaluate the chatbot’s performance in various scenarios.
  • Online Evaluation
    • User Satisfaction Surveys: Gathering direct feedback from users regarding their experience with the chatbot.
    • Engagement Metrics: Tracking metrics like the number of conversations, conversation length, and user return rate.
    • Resolution Rate: The percentage of conversations where the user’s query was successfully resolved.
    • Fallback Rate: The frequency at which the chatbot has to fall back to a default response or transfer to a human agent.
    • Conversion Metrics: In commercial settings, measuring the chatbot’s effectiveness in facilitating transactions or conversions.
    • A/B Testing: Comparing different versions of the chatbot to see which performs better in real-world user interactions.

Text to Speech

Data set



Sentiment Analysis

Data Sets

  • Amazon Reviews Dataset
    • Format Example:
      • Review ID: R123456789
      • Product ID: B00XYZABC
      • Reviewer ID: U987654321
      • Rating: 4
      • Review Title: "Great product, excellent quality!"
      • Review Text: "I've been using this product for a month now, and it's been fantastic. The quality is top-notch, and it works exactly as advertised. Highly recommend!"
      • Review Date: 2023-03-15
      • Verified Purchase: Yes
      • Helpful Votes: 82
      • Total Votes: 100
      • Product Category: Electronics
  • Stanford Sentiment Treebank
    • Format Example:
      • Phrase ID: 226166
      • Sentence ID: 8545
      • Phrase: "The movie was a fantastic display of creativity and imagination."
      • Sentiment Label: Positive (4 out of 4)
  • IMDB Movie Dataset
    • Format: Text, Label


Classification vs. Generative

  • Training Decoder LLM (Large Language Model)
    • Training Phase: Full ground truth is fed, with masked attention ensuring predictions are based only on previous tokens.
    • Model Selection: Utilize Open LLM Leaderboard for choosing an appropriate model.
    • Input: Text to be categorized.
    • Output: Sentiment classification (Positive, Negative) and explanation.
    • Training Strategy: Initial training with teacher forcing.
    • Architecture Components:
      • Output Embeddings: Tokenized using Byte Pair Encoding (BPE), frequency-based.
      • Positional Encoding: Sinusoidal (absolute position) or RoPE (Rotary Position Embedding, which encodes relative position).
      • Masked Multi-Headed Attention: Queries (Q), Keys (K), Values (V); Masking prevents previewing future tokens.
      • Skip Connections and Layer Normalization: For backpropagation and training stability.
      • Feed-Forward Network.
      • Final Layer: Linear layer with softmax function and sampling.
      • Loss Function: Next token prediction or causal loss.
    • Prompting Strategies:
      • Zero-shot, Few-shot, Chain of Thought.
      • Context Windows: LLaMA 2 (4k tokens), GPT-4 (8k tokens), Claude (100k tokens).
    • Further Training Techniques:
      • FFT (Full Fine-Tuning): Updates all model weights; needs access to the weights, significant compute, and enough data, and risks catastrophic forgetting.
      • PEFT (Parameter-Efficient Fine-Tuning): Techniques like Soft Prompt Tuning, LoRA (Low-Rank Adaptation).
  • Training Encoder
    • Resource: HuggingFace’s MTEB Leaderboard.
    • Input/Output: Text to classification (0, 1, -1).
    • Architecture Components:
      • Input Embedding: WordPiece tokenization.
      • Positional Encoding: Contextual position understanding.
      • Multi-Headed Attention: Queries (Q), Keys (K), Values (V).
      • Layer Normalization and Skip Connections.
      • Feedforward Network.
      • Loss Function (pre-training): Next Sentence Prediction (NSP) and Masked Language Model (MLM).
    • Fine-Tuning:
      • Labeling strategy using heuristics or existing datasets.
      • Addition of a classification layer on top of BERT for binary sentence classification.
      • Fine-tuning process to adjust weights of BERT layers and classification layer.
  • Inference Phase
    • Sequential generation: The model generates one token at a time using previous outputs as part of the input.
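
The masked attention described above can be sketched in a few lines of NumPy. This is a single-head illustration without learned projection matrices (the same toy matrix is used as Q, K, and V), so it shows only the masking mechanics, not a full Transformer layer.

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal (look-back only) mask."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)  # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional toy embeddings
out, weights = causal_self_attention(x, x, x)
print(np.round(weights, 2))  # lower-triangular: row t attends only to tokens 0..t
```

Because each row ignores later positions, all positions can be trained in parallel with teacher forcing while still matching the token-by-token behavior used at inference.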


Offline Evaluation

  • Accuracy: Measures the percentage of correctly predicted sentiments compared to the ground truth.
  • Precision, Recall, and F1-Score: Precision evaluates the correctness of positive predictions, recall assesses coverage of actual positive cases, and F1-score provides a balance between precision and recall.
  • ROUGE Score: Particularly useful for generative models; measures overlap between generated text (such as the explanation) and reference text.
  • Confusion Matrix: Visual representation of the model’s predictions to understand types of errors (true positives, false positives, true negatives, false negatives).
  • AUC-ROC: Area Under the Receiver Operating Characteristic curve; summarizes the trade-off between true positive rate and false positive rate across thresholds.
  • Mean Squared Error (MSE): For regression-based sentiment models, measures the average squared difference between estimated values and actual values.
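
For a binary sentiment classifier, the confusion-matrix counts and the metrics derived from them can be computed as in this sketch. The labels and predictions are made up (1 = positive, 0 = negative).

```python
def binary_classification_report(y_true, y_pred):
    """Confusion-matrix counts plus accuracy, precision, recall, and F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
report = binary_classification_report(y_true, y_pred)
print(report)  # tp=2, fp=1, fn=1, tn=2; precision = recall = F1 = 2/3
```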

Online Evaluation

  • User Satisfaction Surveys: Direct feedback from users regarding the perceived quality of sentiment analysis.
  • Engagement Metrics: Measures such as time spent on the analyzed content, click-through rates for recommended content based on sentiment analysis, etc.
  • Conversion Rates: In commercial settings, how sentiment analysis influences user actions (purchases, subscriptions).
  • A/B Testing: Comparing different models or algorithm versions in live environments to see which performs better in real-world applications.
  • Retention Analysis: Understanding if and how sentiment analysis impacts user retention on platforms or services.
  • Behavioral Metrics: Observing changes in user behavior in response to content, which has been analyzed for sentiment (e.g., changes in commenting or sharing behavior).

Data sets

  • XSum: fields are Document (a paragraph or more), one-line Summary, and ID
  • CNN/DailyMail: fields are Article, Highlights, and ID
  • BookSum: long-form literary content (books, plays) with chapter-by-chapter summaries
  • SAMSum: fields are ID, Dialogue, and Summary


  • Extractive vs Abstractive Summary
    • Extractive: It’s like a highlighter. The content is factually faithful to the source, but the selected sentences may not read coherently together and can be repetitive or redundant, so the summary may fail to convey the overall meaning.
    • Abstractive: It’s like a pen. Sentences are fully formed and coherent and give adequate summaries, but they can contain factual errors.
  • Training Abstractive Decoder LLM:
    • Training Phase: The full ground truth is fed in, but masked attention ensures predictions are made based only on previous tokens. This allows for parallel processing while maintaining the sequential nature of the task.
    • How do you pick a model? Open LLM leaderboard
    • Input text: The text to be summarized, ending with a delimiter token so that the decoder knows to start generating auto-regressively
    • Train initially by teacher forcing
    • Architecture:
      • Output embeddings: Tokenized w/ Byte Pair Encoding: frequency based
      • Positional encoding -> sinusoidal (absolute position), RoPE rotary (LLaMA)
      • Masked multi-headed attention (Q (query - question), K (key - book title), V (value - book’s pages w/ answer)), masked to prevent “cheating” by looking at future tokens
      • Add (skip connections, backprop) + Norm (Layer norm training stability)
      • (No cross-attention, since there is no encoder)
      • Feed forward
      • ADD and NORM
      • output: linear layer + softmax + sampling
      • loss: next token prediction or causal loss
    • Prompting:
      • Zero-shot, few-shot, CoVe (Chain-of-Verification), “let’s think step by step” (Chain of Thought).
      • An analogy for prompting: a student is sitting an exam, and you hand them a snippet of text from a book that can help them answer a question. Fine-tuning is teaching the student the fundamental concept behind the question the night before. Fine-tuning often outperforms prompting.
      • Llama 2 context window of 4k, GPT 4 8K, Claude 100K tokens ~ a book
    • FFT (full fine-tuning):
      • Assuming you have access to the model and its weights
      • Catastrophic forgetting
      • Resources/Compute is quite significant
      • Significant amount of data else overfitting
    • PEFT (parameter-efficient fine-tuning):
      • Soft prompt tuning
      • LoRA
      • QLoRA
      • QA-LoRA
  • Inference:
    • Inference Phase: The model obtains the prompt and then generates the sequence one token at a time, using its own previous outputs as part of the input for each subsequent prediction. This is a truly sequential process, reflecting how the model will be used in practice.
  • Training Extractive Encoder:
    • HuggingFace MTEB leaderboard
    • Architecture:
      • Input embedding: Tokenized w/ WordPiece encoding: subword merges chosen by likelihood rather than raw frequency
      • Positional encoding: to understand the position of each word in the context
      • Multi-headed attention: Q, K, V
      • Add (skip/residual connections smooth gradients during backprop and improve the loss landscape) and Layer Norm
      • Feedforward
      • Add and Norm
      • loss (pre-training): NSP and MLM
    • Finetune:
      • For training, you need labels for each sentence indicating whether it should be part of the summary. This can be done by:
      • Using heuristics (e.g., sentences that appear in both the document and the summary are labeled as 1).
      • Leveraging existing datasets where this annotation is already provided.
      • Add a classification layer on top of BERT. This layer will output a binary classification (e.g., 0 or 1) for each sentence, indicating whether it should be included in the summary.
      • Fine-tune the model on your prepared dataset. This involves adjusting the weights of both the pre-trained BERT layers and the newly added classification layer to minimize the loss function.
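
The labeling heuristic from the fine-tuning steps above (a sentence is labeled 1 when it also appears in the summary) can be sketched as follows. The exact-match rule after light normalization is an assumption; practical pipelines often use a softer overlap criterion. The example sentences are made up.

```python
def label_sentences(document_sentences, summary_sentences):
    """Label each document sentence 1 if it also appears in the summary, else 0."""
    def normalize(sentence):
        # Lowercase, collapse whitespace, and drop trailing punctuation
        return " ".join(sentence.lower().split()).rstrip(".!?")

    summary_set = {normalize(s) for s in summary_sentences}
    return [1 if normalize(s) in summary_set else 0 for s in document_sentences]

document = ["The host was friendly.",
            "The flat is near the metro.",
            "We arrived late."]
summary = ["The flat is near the metro."]
print(label_sentences(document, summary))  # [0, 1, 0]
```

These 0/1 labels are what the added classification layer would be trained to predict for each sentence.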


  • Offline
    • Metrics
      • ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares the generated summary to reference summaries.
      • BERT-Score
  • Online
    • Metrics
      • Engagement
      • Feedback survey
    • Dogfooding
    • A/B testing
      • KPIs
      • Click through
      • Asking again
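
For the ROUGE metric listed above, a minimal ROUGE-1 (unigram overlap) sketch looks like this. It tokenizes by whitespace only and skips stemming, so its scores can differ from those of standard ROUGE packages; the sentences are made up.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 between a candidate and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat lay on a mat", "the cat sat on the mat")
print(scores)  # 4 of 6 unigrams overlap, so precision = recall = F1 = 2/3
```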



The encoder’s function is to convert raw audio input into a higher-level representation that captures both the acoustic and contextual information present in the speech signal:

  1. VGG Convolution: The architecture starts with a VGG-style convolutional network, which likely processes the input spectrogram. This network, inspired by the VGG architecture used in image processing, can capture local patterns within the spectrogram, such as the formants and harmonics that are crucial for understanding speech.

  2. Multi-head Attention: After the convolutional layers, the architecture employs multi-head attention layers. In the encoder, this attention mechanism allows the model to focus on different parts of the spectrogram to capture various aspects of the acoustic context.

  3. Add & Norm: The addition of residual connections and layer normalization ensures that the encoder can learn deep representations without running into issues like vanishing gradients.


The decoder’s role is to generate the output sequence, which in the case of speech recognition, is the textual transcript of the spoken content:

  1. 1-D Convolution: Before reaching the attention layers, there is a 1-D convolution layer that may act on some form of intermediate representation, possibly handling temporal aspects of the encoded speech.

  2. Masked Multi-head Attention: This layer in the decoder allows the model to use the previously predicted output (during training, this is the ground truth shifted by one position) without seeing future positions, preserving the auto-regressive property of sequence generation.

  3. Add & Norm and FFN: As in the encoder, these components further process the data, helping in learning complex mappings from the encoded audio to the textual output.

  4. CTC: The Connectionist Temporal Classification layer maps the dense output from the attention and FFN layers into a sequence of tokens (such as characters or subwords) that represent the transcript. CTC does not require pre-segmented data, making it suitable for the variable lengths of input audio and output text.

  5. Softmax & Linear: The linear layer projects the decoder output to the size of the token vocabulary, and the softmax layer converts those scores into a probability distribution over the set of possible tokens at each position in the sequence.

  6. Output: The output is the transcribed text generated from the audio input.
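
The CTC step in the list above can be illustrated with greedy (best-path) decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank symbol. This is a minimal sketch, assuming the per-frame argmax has already been taken; the blank symbol `"_"` and the function name are illustrative choices, not part of the architecture described.

```python
def ctc_greedy_decode(frame_tokens, blank="_"):
    """Collapse repeated frame predictions and remove CTC blanks."""
    decoded = []
    previous = None
    for token in frame_tokens:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token != previous and token != blank:
            decoded.append(token)
        previous = token
    return decoded


# Per-frame predictions for a short utterance; blanks separate the two l's.
frames = list("hh_e_ll_l_oo")
transcript = "".join(ctc_greedy_decode(frames))
print(transcript)
```

The blank between the two `l` runs is what allows a doubled letter to survive the collapse step, which is why CTC needs a dedicated blank symbol.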

In summary, the encoder in this architecture processes audio input, likely spectrograms, and encodes it into a context-aware representation. The decoder then takes this representation and generates a sequence of text tokens that correspond to the spoken words in the audio.

Hate Speech

Seq-Seq: RNN/Bidirectional LSTM:

  • Input: Sequential tokenized text data; each token is represented as an embedding vector, either pre-trained (like GloVe or Word2Vec) or learned from scratch.
  • Output: Similar to BERT, a probability distribution over classes for the classification task.
  • Layers Added:
    • Essentially, a BiLSTM consists of two LSTMs: one taking the input in a forward direction, and the other in a backward direction.
    • A fully connected layer for classification, with a dropout layer preceding it.
  • Loss Function: Categorical or binary cross-entropy, depending on the nature of the classification.
  • Regularization:
    • Dropout layers within the LSTM layers and/or before the fully connected layer.
    • L2 regularization can be applied to the LSTM and dense layers.
  • Finetuning Method:
    • Training the LSTM layers along with the classification head, adjusting the model to the specifics of the hate speech detection task.
    • Typically involves training from scratch if using a non-pre-trained LSTM model.
  • Token Sampling:
    • In cases of very long sequences, techniques like truncated backpropagation through time can be used to handle long dependencies without processing the entire sequence at once.

4) Evaluation Metrics:

  • Offline Metrics: Accuracy, Precision, Recall, F1 Score, and possibly AUC-ROC for understanding model performance across various thresholds; BERTScore and MoverScore.
  • Online Metrics: If deployed in a live environment, track user feedback, the model’s response time, and overall user engagement with the system.
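
As a small illustration of the offline classification metrics, the sketch below computes precision, recall, and F1 for a binary hate-speech label. It assumes labels are encoded as 1 (hate speech) and 0 (not hate speech); the function name `binary_prf` is hypothetical.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = binary_prf(y_true, y_pred)
print(precision, recall, f1)
```

For hate speech detection the classes are usually imbalanced, so precision and recall on the positive class tend to be more informative than accuracy.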

Fake News

  • Classification of fake or real

1) Input and Output:

  • Input: Textual data of news articles.
  • Output: Classification result indicating whether the news is ‘real’ or ‘fake’.

2) Data:

  • Sources: Collect datasets comprising both genuine and fake news articles, including examples of neural fake news.
  • Volume and Variety: Ensure a large and diverse dataset to cover various news styles and content.

2) Data Preparation and Preprocessing:

  • Cleaning: Remove irrelevant information, normalize text (like lowercasing), and correct formatting issues.
  • Labeling: Accurately label each article as ‘real’ or ‘fake’ based on verified sources.

2) Tokenization and Embeddings:

  • Tokenization: Convert text into tokens. For deep learning models, consider using subword tokenization like BPE (Byte Pair Encoding) or WordPiece to handle out-of-vocabulary words.
  • Embeddings: Use pre-trained embeddings like GloVe or Word2Vec, or embeddings from models like BERT for richer contextual representation.

3) Features:

  • Textual Features: Extract features like n-grams, TF-IDF scores, sentiment scores, etc.
  • Contextual Features: If available, include author credibility, publishing source reliability, etc.
  • Handling Sparsity: Use dimensionality reduction techniques like truncated SVD or PCA for high-dimensional sparse features.
  • Positional Information: Recurrent models (like LSTM) capture word order through sequential processing; Transformer models need positional encodings to maintain the order of words.
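
The TF-IDF scores listed under Textual Features can be sketched in a few lines. This is a simplified version, assuming whitespace tokenization, term frequency normalized by document length, and the unsmoothed `log(N / df)` form of IDF; library implementations typically add smoothing. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    num_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(num_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


vectors = tfidf(["fake news spreads fast", "real news spreads slowly"])
print(vectors[0])
```

Terms that occur in every document (here "news" and "spreads") receive a weight of zero, so the distinguishing terms dominate the feature vector.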

4) Model Selection:

  • BERT and Variants: For their ability to understand context and perform well on classification tasks.
    • BERT for Contextual Understanding:
    • Text Analysis with BERT: BERT’s role is to deeply analyze the structure and context of the news text. It can identify linguistic patterns and anomalies that are indicative of fake news.
    • Feature Extraction: BERT processes the news text and provides embeddings that can be used as features in identifying fake news.
  • GPT and RAG: Useful for generating responses based on retrieved information and input text.
    • RAG for Information Retrieval: RAG retrieves relevant information from a large corpus or database, providing factual data that can be used to verify the claims made in the news article.
    • GPT for Response Generation: GPT uses the retrieved information to generate an analysis or a verdict on the likelihood of the news being fake or real.
    • RAG:
      • New document -> embedding model -> vector db
      • Query -> embedding model -> FAISS -> document retrieval
  • LSTM/BiLSTM: For capturing long-term dependencies in text.
  • CNNs: For extracting local features from text.
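  • Grover: a model from the Allen Institute for AI (AI2) and the University of Washington for generating and detecting neural fake news.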
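
The RAG flow listed above (document -> embedding model -> vector db, then query -> embedding model -> retrieval) can be sketched with toy stand-ins. Here a normalized bag-of-words vector replaces the embedding model and a brute-force dot-product search replaces FAISS; both are simplifying assumptions, and `embed` and `VectorStore` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: L2-normalized bag-of-words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


class VectorStore:
    """Brute-force in-memory index (a real system would use FAISS or similar)."""

    def __init__(self):
        self.docs = []
        self.vectors = []

    def add(self, doc):
        self.docs.append(doc)
        self.vectors.append(embed(doc))

    def search(self, query, k=1):
        q = embed(query)
        # Dot product of unit vectors equals cosine similarity.
        scored = [
            (sum(q.get(word, 0.0) * w for word, w in vec.items()), doc)
            for vec, doc in zip(self.vectors, self.docs)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]


store = VectorStore()
store.add("the eiffel tower is in paris")
store.add("the moon orbits the earth")
top = store.search("where is the eiffel tower", k=1)
print(top)
```

The retrieved passages would then be placed in the generator's prompt so that its verdict is grounded in the retrieved facts.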

5) Fine-Tuning and Regularization:

  • Loss Function: Binary cross-entropy for binary classification tasks.
  • Optimization Algorithm: Adam or AdamW for efficient training.
  • Regularization Techniques: Implement dropout, early stopping, and possibly L1/L2 regularization.
  • Parameter-Efficient Fine-Tuning (PEFT): Techniques like Adapters or LoRA for efficient fine-tuning without extensive retraining.

5) Integrating Grover, BERT, and GPT with RAG:

  • Parallel Analysis: Each model analyzes the input news independently. BERT focuses on contextual embeddings, GPT with RAG on fact-checking and response generation, and Grover on identifying AI-generated text patterns.
  • Combined Decision-Making: The outputs from all three models are combined to make a final decision. This can be done through a voting system, a weighted average, or another decision-making algorithm that takes into account the strengths of each model.
  • Cross-Verification: The conclusions drawn by each model can be cross-verified with the others. For instance, if Grover identifies a piece as likely fake, but BERT and GPT with RAG suggest otherwise, the system might flag the article for further human review.
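
The combined decision-making and cross-verification steps could look roughly like the sketch below: a weighted average of each model's fake-probability, plus a human-review flag when the models disagree strongly. The weights, thresholds, model keys, and the function name `combine_verdicts` are all illustrative assumptions.

```python
def combine_verdicts(scores, weights, fake_threshold=0.5, disagreement_gap=0.5):
    """Weighted-average ensemble with a disagreement flag for human review."""
    total_weight = sum(weights[name] for name in scores)
    combined = sum(scores[name] * weights[name] for name in scores) / total_weight
    # A large spread between the most and least suspicious model
    # means the models disagree, so route the article to a reviewer.
    spread = max(scores.values()) - min(scores.values())
    return {
        "fake_probability": combined,
        "label": "fake" if combined >= fake_threshold else "real",
        "needs_human_review": spread >= disagreement_gap,
    }


# Hypothetical per-model probabilities that the article is fake.
scores = {"bert": 0.2, "gpt_rag": 0.3, "grover": 0.9}
weights = {"bert": 1.0, "gpt_rag": 1.0, "grover": 1.0}
decision = combine_verdicts(scores, weights)
print(decision)
```

In this example Grover flags the article while the other two do not, so the averaged label is "real" but the article is still sent for human review.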

6) Token Sampling and Context Extension:

  • Token Sampling: Techniques like truncated backpropagation through time in sequence models to handle long texts.
  • Context Extension: Use mechanisms to handle longer context (like attention in Transformer models).

7) Evaluation Metrics:

  • Offline Metrics: Accuracy, Precision, Recall, F1 Score, and AUC-ROC.
  • Online Metrics: User feedback, real-time performance, adaptability to emerging fake news styles.

Named Entity Recognition

Designing a Named Entity Recognition (NER) system for medical documents involves creating a model that can accurately identify and classify medical terms and entities in text. Here’s a detailed plan for such a system:

1) Input and Output:

  • Input: Unstructured text data from medical documents, such as clinical notes, research papers, or patient records.
  • Output: Entities within the text identified and classified into categories like medication names, dosages, medical conditions, procedures, patient information, etc.

2) Examples of Data:

  • Clinical notes describing patient symptoms, diagnoses, and treatments.
  • Research papers with detailed descriptions of medical procedures, drugs, and clinical trials.
  • Patient records containing demographic information, medical history, and treatment plans.

3) Data Preparation and Preprocessing:

  • Data Cleaning: Remove irrelevant elements like headers, footers, or non-textual content.
  • Standardization: Convert text to a consistent format (e.g., uniform date formats, capitalization).
  • Anonymization: Ensure that sensitive patient information is anonymized for privacy.

4) Tokenization and Embeddings:

  • Tokenization: Use a tokenizer capable of handling medical terminology, possibly a custom tokenizer if standard ones (like spaCy’s tokenizer) are insufficient.
  • Embeddings: Employ domain-specific embeddings trained on medical texts (like BioBERT, a variant of BERT trained on biomedical literature) to capture the context and nuances of medical language.

5) Features:

  • Entity-Based Features: Include features specific to medical entities, such as drug names, symptoms, or procedure terms.
  • Contextual Features: Use embeddings that capture the context in which the entities occur.
  • Handling Sparsity: If using sparse representations like TF-IDF, apply dimensionality reduction techniques.
  • Positional Encoding: For models like Transformer-based ones, incorporate positional encodings to maintain the sequence of words.

6) Model Selection:

  • BERT and Variants (BioBERT, ClinicalBERT): Fine-tune these models on the medical NER task. They are effective in capturing contextual information in complex texts.
  • CRF (Conditional Random Fields): Useful for sequence modeling in NER. Can be combined with LSTM for enhanced performance.
  • BiLSTM with Attention Mechanism: A bidirectional LSTM to capture contextual dependencies, with an attention mechanism to focus on relevant parts of the text.

7) Fine-Tuning and Regularization:

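  • Loss Function: The negative log-likelihood of a Conditional Random Fields (CRF) layer is often used for sequence tagging tasks like NER; without a CRF layer, token-level cross-entropy over the tag set is the usual choice.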
  • Optimization Algorithm: Adam or AdamW, with learning rate scheduling.
  • Regularization: Implement dropout in LSTM layers and possibly weight decay in AdamW.
  • Parameter-Efficient Fine-Tuning (PEFT): Use techniques like adapters or LoRA for efficient fine-tuning; layer-wise learning rate adjustments can complement them.

8) Token Sampling and Context Extension:

  • Token Sampling: Apply techniques like truncated backpropagation through time for dealing with long texts in RNN-based models.
  • Context Extension: Utilize attention mechanisms or Transformer-based models to handle longer contexts, essential in medical texts for capturing relevant information.

9) Evaluation Metrics:

  • Offline Metrics: Precision, Recall, and F1 score at the entity level.
  • Online Metrics (if deployed in a real-world setting): User feedback, system’s response time, and integration capability with existing medical record systems.
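
Entity-level scoring, as opposed to token-level scoring, counts a prediction as correct only when both the span boundaries and the entity type match. The sketch below assumes BIO-style tags (e.g., `B-DRUG`, `I-DRUG`, `O`), which is a common but not the only tagging scheme; the function names are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    spans = set()
    start, entity_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and entity_type == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 with exact span matching."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["B-DRUG", "I-DRUG", "O", "B-DOSE"]
pred = ["B-DRUG", "O", "O", "B-DOSE"]
print(entity_f1(gold, pred))
```

Here the predicted drug span is one token too short, so it counts as both a false positive and a false negative even though most of its tokens are tagged correctly.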

10) Implementation Steps:

  1. Data Annotation: Annotate a corpus of medical texts with the relevant entities.
  2. Model Training: Train the chosen model on this annotated dataset.
  3. Model Evaluation: Evaluate the model using the specified metrics on a held-out test set.
  4. Iterative Improvement: Continuously improve the model by retraining it with new data and refining based on feedback.

Document Similarity

1) Given two documents, find if they are of the same topic or not

Design Overview:

1) Data: large corpus of news data

  • Preprocessing:
    • remove stopwords (“a”, “in”, etc.)
    • lowercase
    • remove punctuation
    • stemming/lemmatization: reduce words to their root, e.g., builds, building, built -> lemma “build”
  • Tokenize:
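    • Byte pair encoding (BPE): GPT, subword tokenization, frequency based; repeatedly merges the most frequent adjacent symbol pair.
    • WordPiece: BERT, subword tokenization, likelihood based; WordPiece picks the merge that most improves the likelihood of the training data under the language model.
    • SentencePiece: operates on raw text and treats whitespace as an ordinary symbol, so it doesn’t require pre-splitting on spaces.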
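
To show what "frequency based" means for byte pair encoding, the sketch below learns merge rules from a tiny word list by repeatedly merging the most frequent adjacent symbol pair. It is a simplified illustration: it starts from characters rather than bytes and omits end-of-word markers; the function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (character-level start)."""
    # Each word is a tuple of symbols, mapped to its corpus frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

WordPiece follows the same merge loop but scores candidate pairs by likelihood gain instead of raw frequency.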

2) Input: two text documents. Output: a similarity score indicating topic relatedness (between 0 and 1).

3) Options with just embeddings + cosine similarity: TF-IDF or Doc2Vec. Doc2Vec provides a fixed-length vector for text of any length, making it more manageable for downstream tasks. It also captures the context of words in a document, providing a more nuanced understanding than simple word-level analysis.
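
The cosine-similarity step can be sketched independently of how the vectors are produced. For simplicity this example uses raw term counts as the document vectors, which is an assumption; TF-IDF weights or Doc2Vec embeddings would plug into the same formula. The function name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between term-count vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same_topic = cosine_similarity("stocks fell as markets closed",
                               "markets closed and stocks fell")
different_topic = cosine_similarity("stocks fell sharply",
                                    "the team won the final")
print(same_topic, different_topic)
```

Because term counts are non-negative, the score falls between 0 and 1, matching the output range described above.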

4) Model: Embedding model like BERT for similarity (Twin-Tower Architecture with BERT)

  1. Architecture Setup: Two identical BERT models are set up in parallel. These models share the same architecture and weights. Each ‘tower’ processes one document independently of the other.

  2. Input Processing: Each document is preprocessed (tokenized, padded or truncated) to fit BERT’s input requirements. Document A is fed into Tower A, and Document B into Tower B.

  3. BERT Processing: Each tower processes its respective document. BERT’s layers analyze the text, encoding it into a high-dimensional vector space. This results in two sets of embeddings: one for each document.

5) Feed in Sentence Embeddings, Pool: Since each sentence may have a different number of tokens, and thus a different number of vectors, a pooling operation is required to create a fixed-size sentence embedding. Pooling is a way to aggregate multiple vectors into a single vector. Common pooling methods include taking the mean or the maximum of the vectors, or using the embedding of the [CLS] token that BERT outputs, which is intended to represent the entire sequence.
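
The three pooling options described above can be sketched on plain Python lists. The example assumes the token vectors are given as a list of equal-length lists with the [CLS] vector in the first position, and it ignores padding; in a real batch, padded positions would need to be masked out before pooling. The function names are hypothetical.

```python
def mean_pool(token_vectors):
    """Average each dimension over all tokens."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def max_pool(token_vectors):
    """Take the maximum of each dimension over all tokens."""
    return [max(column) for column in zip(*token_vectors)]


def cls_pool(token_vectors):
    """Use the first token's vector (the [CLS] position in BERT)."""
    return list(token_vectors[0])


# Three tokens with 2-dimensional embeddings (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(mean_pool(tokens), max_pool(tokens), cls_pool(tokens))
```

Whichever method is used, both towers must apply the same pooling so that the two fixed-size embeddings are comparable.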