Vinija's Notes • Recommendation Systems • Candidate Generation and Retrieval

Overview
- Candidate Generation in Recommender Systems
- Content-Based Filtering
Collaborative Filtering
Multiple sources of candidates
- Candidate Generation
- Retrieval
Retrieval
Negative Sampling
Evaluation
References

Overview

Candidate Generation in Recommender Systems

Overview: Candidate generation is the first step in the recommendation process where potential items are identified based on user and item features. For instance, a search engine might use a user’s location to filter and identify relevant candidates using approximate nearest neighbor algorithms.

Key Points:

Content-Based Filtering:
- Recommends items similar to what the user has engaged with before (e.g., more corgi videos for a user who watches corgi content).
Collaborative Filtering:
- Recommends items based on similarities between user behaviors (e.g., recommending corgi videos to User B based on User A’s preferences, assuming both users have similar profiles).

Content-Based Filtering

Content-based filtering is a type of recommendation system that suggests items to users based on the similarity between the features of items and the user’s preferences. In other words, it recommends items that are similar to those a user has liked or interacted with in the past. This technique works by analyzing the attributes or features of items, such as text descriptions, images, or tags, and building a profile of the user’s preferences based on the attributes of the items they have interacted with. The system then suggests items that are similar to those the user has interacted with, based on the similarity between the item attributes and the user profile.
For example, consider a content-based filtering system for recommending movies. The system analyzes the attributes of the movies, such as the genre, actors, directors, and plot summary, and builds a user profile based on the movies they have liked in the past. If a user has previously liked action movies featuring Bruce Willis, the system might recommend other action movies that feature Bruce Willis.
Content-based filtering has several advantages over other recommendation approaches. For instance, it can be used effectively in situations where there is limited or no data on user behavior, such as in new or niche domains where there may not be a large user base. Additionally, content-based filtering can provide more personalized recommendations since it relies on the user’s individual preferences rather than the behavior of others. However, it also has limitations. It may fail to recommend items that are outside the user’s existing preferences or that are not similar to the items they have previously interacted with. Moreover, it may not be effective in situations where there is a lack of item information or where the items have limited attributes or features.
To implement a content-based filtering system, you need to analyze the features of the items and the user interactions. Here’s a simplified implementation example using Python and TensorFlow.
Input:
- A matrix representing user interactions with items (e.g., ratings, likes).
- Features of the items (e.g., genres, actors for movies).
Output:
- A list of recommended items for each user.
Here’s a simplified implementation of a content-based filtering system using Python and TensorFlow:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Sample data: user-item interaction matrix (ratings)
user_ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

# Sample data: item features matrix
item_features = np.array([
    [1, 0, 1],  # Features for item 1
    [1, 0, 0],  # Features for item 2
    [0, 1, 0],  # Features for item 3
    [0, 1, 1],  # Features for item 4
])

# Define the model
input_layer = Input(shape=(item_features.shape[1],))
hidden_layer = Dense(10, activation='relu')(input_layer)
output_layer = Dense(user_ratings.shape[0], activation='relu')(hidden_layer)

model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(item_features, user_ratings.T, epochs=50, verbose=0)

# Predict user preferences for new items
new_item_features = np.array([[0, 0, 1]])  # New item features
predicted_ratings = model.predict(new_item_features)

print("Predicted Ratings for New Item:", predicted_ratings)

As the lecture slide above shows, from Stanford University, Neural Networks can also be used to run content-based filtering in order to scale. The code above implements a simple feed forward network in order to predict whether a user will interact with an item or not.

Collaborative Filtering

Collaborative filtering recommends items to users based on similarities in user behavior, such as ratings, purchases, or clicks. It identifies users with similar preferences and suggests items liked by these similar users. For example, in a movie recommendation system, if a user rates several action movies highly, the system might recommend other action movies highly rated by similar users.
Collaborative filtering doesn’t need detailed item attributes and can provide serendipitous recommendations, suggesting items users may not discover on their own. However, it struggles with the “cold start” problem, where new items or users have limited data available, making accurate recommendations difficult. It also may not be effective for users with unique preferences not shared by others.
To overcome these limitations, collaborative filtering uses user and item embeddings learned automatically. The dot product of user and item embeddings, plus bias terms, estimates ratings.

Matrix factorization

Matrix factorization techniques decompose the user-item interaction matrix into lower-dimensional matrices representing latent factors for users and items. This helps in identifying the underlying structure in the interaction data.
Matrix factorization is a simple embedding model. Given the feedback matrix \(\mathrm{A} \in R^{m \times n}\), where \(m\) is the number of users (or queries) and \(n\) is the number of items, the model learns:
- A user embedding matrix \(U \in \mathbb{R}^{m \times d}\), where row \(i\) is the embedding for user \(i\).
- An item embedding matrix \(V \in \mathbb{R}^{n \times d}\), where row \(j\) is the embedding for item \(j\). The image below depicts this: (source)

The embeddings are learned such that the product \(U V^{T}\) is a good approximation of the feedback matrix. Observe that the \((i, j)\) entry of \(U . V^{T}\) is simply the dot product \(\left\langle U_{i}, V_{j}\right\rangle\) of the embeddings of user \(i\) and item \(j\), which you want to be close to \(A_{i, j}\).
Matrix factorization is a fundamental technique used in recommender systems to decompose the user-item interaction matrix into lower-dimensional latent factor matrices. Various algorithms and tools can implement matrix factorization, each suited for different scenarios in terms of scalability, complexity, and flexibility. Here are some of the most commonly used methods and technologies for implementing matrix factorization in recommender systems:

Using Scikit-learn’s NMF:

NMF restricts the matrices to have only non-negative elements, which makes this method particularly suitable for datasets where the factors are inherently non-negative (e.g., pixel representations, text data).
Scikit-learn offers an implementation of NMF that can be readily used for building recommender systems with constraints on factorization.

from sklearn.decomposition import NMF
import numpy as np

# Sample data: User-item matrix
data = np.array([
    [4, np.nan, np.nan, 5, np.nan],
    [np.nan, 3, np.nan, 4, 3],
    [2, np.nan, np.nan, 5, np.nan],
    [np.nan, 4, 4, np.nan, np.nan],
])

# Replace np.nan with 0 for NMF
data[np.isnan(data)] = 0

model = NMF(n_components=2, init='random', random_state=0)
user_features = model.fit_transform(data)
item_features = model.components_

print("User Features:\n", user_features)
print("Item Features:\n", item_features)

Using PySpark’s ALS:

ALS is particularly popular in collaborative filtering for implicit datasets. It alternates between fixing the user factors and solving for item factors, and vice versa, which makes it particularly effective for sparse datasets.
Apache Spark MLlib provides a scalable ALS implementation suitable for large datasets and distributed computing environments.

from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName('Recommender').getOrCreate()

# Sample data: DataFrame format
data = spark.createDataFrame([
    (0, 0, 4.0),
    (1, 1, 3.0),
    (2, 3, 5.0),
    (3, 1, 4.0),
    (1, 3, 4.0),
], ["user", "item", "rating"])

# ALS model
als = ALS(maxIter=5, regParam=0.01, userCol="user", itemCol="item", ratingCol="rating")
model = als.fit(data)

# Predictions
predictions = model.transform(data)
predictions.show()

Neural Collaborative Filtering (NCF)

Neural Collaborative Filtering (NCF) is a deep learning-based approach to collaborative filtering for recommendation systems. Unlike traditional matrix factorization techniques, NCF uses neural networks to model the complex, non-linear interactions between users and items. Here’s a detailed explanation of NCF, including how it works and how it performs candidate generation.

Architecture Details

Input Embeddings:
- User Embeddings: Each user is assigned a unique identifier, which is mapped to a dense vector through an embedding layer. This embedding captures the latent features of the user.
- Item Embeddings: Similarly, each item is assigned a unique identifier, which is mapped to a dense vector through another embedding layer. This embedding captures the latent features of the item.
Embedding Layers:
- Embedding layers are initialized randomly and are learned during the training process.
- These layers convert sparse user/item IDs into dense, continuous vector representations.
Interaction Modeling:
- Concatenation or Element-wise Product: The user and item embeddings are either concatenated or combined using an element-wise product.
- Multi-Layer Perceptron (MLP): The combined embeddings are passed through multiple dense layers (MLP). These layers capture the non-linear interactions between users and items. Each layer typically uses activation functions like ReLU (Rectified Linear Unit) to introduce non-linearity.
- Output Layer: The final dense layer outputs a prediction score, which can represent the likelihood of a user interacting with an item (e.g., clicking, rating, purchasing).
Loss Function:
- The model is trained using a loss function appropriate for the task. For implicit feedback (e.g., clicks), binary cross-entropy loss is commonly used. For explicit feedback (e.g., ratings), mean squared error loss can be used.

Candidate Generation

Candidate generation is the first stage in a recommendation pipeline, where a large set of potential items (candidates) are selected for each user. NCF can be used to perform candidate generation as follows:

Embedding Space Mapping:
- During training, NCF learns embeddings for users and items. These embeddings map users and items into a shared latent space where similar users and items are close to each other.
Generating Candidates:
- For each user, candidate items can be generated by finding items that are close in the embedding space.
- Dot Product Similarity: One common method is to compute the dot product between the user’s embedding and all item embeddings. Items with higher dot product values are considered more relevant to the user.
- Top-N Selection: After computing the similarity scores, the top-N items with the highest scores are selected as candidates.
Negative Sampling:
- To efficiently train the model, negative sampling is used. For each positive user-item interaction, multiple negative samples (items not interacted with by the user) are generated.
- Negative samples help the model learn to distinguish between items a user might interact with and items they are less likely to interact with.

Example Process

Initialization:
- Initialize the embedding layers for users and items with random weights.
- Define the MLP architecture with several dense layers.
Training:
- For each user-item pair in the training data, look up the user and item embeddings.
- Combine the embeddings and pass them through the MLP.
- Compute the prediction score and compare it to the actual interaction.
- Adjust the embedding weights and MLP parameters using backpropagation to minimize the loss.
Candidate Generation:
- For a given user, retrieve the user’s embedding.
- Compute the similarity scores (dot product) between the user’s embedding and all item embeddings.
- Sort the items by their similarity scores and select the top-N items as candidates.

Example Code (Pseudocode for Clarity)

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from tensorflow.keras.models import Model

num_users = 10000  # Number of users
num_items = 5000   # Number of items
embedding_size = 50  # Dimensionality of embeddings

# User embedding layer
user_input = Input(shape=(1,), name='user_input')
user_embedding = Embedding(input_dim=num_users, output_dim=embedding_size, name='user_embedding')(user_input)
user_vector = Flatten()(user_embedding)

# Item embedding layer
item_input = Input(shape=(1,), name='item_input')
item_embedding = Embedding(input_dim=num_items, output_dim=embedding_size, name='item_embedding')(item_input)
item_vector = Flatten()(item_embedding)

# Concatenate user and item vectors
concatenated = Concatenate()([user_vector, item_vector])

# MLP layers
dense_1 = Dense(128, activation='relu')(concatenated)
dense_2 = Dense(64, activation='relu')(dense_1)
output = Dense(1, activation='sigmoid')(dense_2)

# Define the model
model = Model([user_input, item_input], output)
model.compile(optimizer='adam', loss='binary_crossentropy')

# Training
model.fit([user_ids, item_ids], labels, epochs=10, batch_size=32)

# Candidate Generation
user_id = 123  # Example user ID
user_embedding_vector = model.get_layer('user_embedding').get_weights()[0][user_id]
item_embedding_matrix = model.get_layer('item_embedding').get_weights()[0]

# Compute similarity scores
similarity_scores = item_embedding_matrix.dot(user_embedding_vector)
top_n_items = similarity_scores.argsort()[-10:][::-1]  # Get top-N items

print("Top-N items for user {}: {}".format(user_id, top_n_items))

Neural Collaborative Filtering (NCF) creates embeddings for users and items, which are learned during training. These embeddings are used to model interactions through a neural network. For candidate generation, NCF uses the learned embeddings to find items similar to a user’s preferences by calculating similarity scores and selecting the top-N items. This process leverages the shared latent space to efficiently generate personalized recommendations.

Collaborative Filtering Visualization

2D Embedding

One feature was not enough to explain the preferences of all users. To overcome this problem, let’s add a second feature: the degree to which each movie is a blockbuster or an art-house movie. With a second feature, we can now represent each movie with the following two-dimensional embedding: (source)

We again place our users in the same embedding space to best explain the feedback matrix: for each \((user, item)\) pair, we would like the dot product of the user embedding and the item embedding to be close to 1 when the user watched the movie, and to 0 otherwise. (source)

We use a common embedding space for both items and users to measure similarity. While surprising, this abstract representation allows us to effectively compare them. Instead of hand-engineering, embeddings can be learned automatically, which is the strength of collaborative filtering models.
When the model learns embeddings, fixing movie embeddings allows learning user embeddings that group similar users together. Conversely, fixing user embeddings allows learning movie embeddings that group similar movies together based on user preferences.

Examples

As a real-world example, let’s look at an example (source) with binary labels signifying engagement (using favorites, likes, and clicks):

Let’s talk about the meaning of each rating in the above dataset. We can choose what each of our labels mean so here, we have chosen the following:
- 1: user was engaged after being shown the item.
- 0: user did not engage after being shown the item.
- ?: the item is not yet shown to the user.

Multiple sources of candidates

The image above (source) represents options of both candidate gen and retrieval.

Candidate Generation

Heuristic-Based, Real-Time:
- Recent Media from Followed Author: This method retrieves the latest content from authors or creators that a user follows. For example, on Instagram, this could mean showing the newest posts from followed accounts. It captures real-time interactions by ensuring users see the freshest content from their network and serves as a candidate generation method.
Heuristic-Based, Pre-Generated :
- Pool of Media Related to Interest Topic: This method involves creating a pre-generated pool of media items related to user interest topics identified through past interactions. On a news platform, this could involve showing articles related to a user’s frequently read topics like technology or sports. It ensures content related to the user’s general interests is readily available, capturing long-term interests, and serves as a candidate generation method.

Retrieval

ML-Based, Pre-Generated:
- Personalized PageRank (PPR): This method applies the Personalized PageRank algorithm to a graph of user-item interactions to precompute a ranking of items for each user. For example, on Amazon, this algorithm could precompute a list of recommended products based on browsing and purchase history. It provides efficient retrieval of personalized recommendations, capturing long-term user interests, and functions as a retrieval method.
- PPR is typically applied after a candidate set is already generated. It ranks these pre-generated candidates to determine which items are the most relevant to present to the user.
Two-Tower Neural Network (NN): This method uses a neural network architecture with two towers: one for users and one for items. Each tower generates embeddings, which are then used to match users with items. For instance, on Netflix, this model could recommend movies or shows based on the latest viewing behavior of the user. It provides highly personalized and dynamic recommendations by leveraging real-time user data and functions as a retrieval method.
- This can serve as both candidate generation and retrieval.

Retrieval

While candidate generation is the first step in a recommendation pipeline, responsible for selecting a broad set of potentially relevant items for a user. This process needs to be efficient and scalable, as it narrows down the pool of items from a very large dataset to a smaller, manageable set that can be further refined in subsequent steps.
Candidate retrieval involves finding the specific items that are most relevant to a user’s query or context from the set of generated candidates. Here’s a detailed explanation of how candidate retrieval works, particularly using embedding models.

Embedding Models

Embedding models transform user and item IDs into dense vector representations that capture latent features. There are two main approaches:

Matrix Factorization Model: The user’s embedding is precomputed and stored in a user embedding matrix. At serve time, the system simply looks up the user’s embedding from this matrix.
Deep Neural Network (DNN) Model: The user’s embedding is computed dynamically at serve time by passing the user’s feature vector through the DNN. This embedding, denoted as \(\psi(x)\), captures the user’s preferences based on their features.

Generating the Query Embedding

Once the query embedding, \(q\), is obtained (either from the precomputed user embedding matrix or dynamically computed via the DNN), the next step is to find items whose embeddings are similar to \(q\).

Nearest Neighbor Search

The system searches for item embeddings, \(V_{j}\), that are close to the query embedding \(q\) in the embedding space. This is a nearest neighbor search problem, where similarity can be measured using metrics such as cosine similarity or dot product. The similarity score, \(s(q, V_{j})\), is calculated for each item embedding. The system then retrieves the top \(k\) items with the highest similarity scores as the recommended items.

Large-scale Retrieval

Handling large-scale data efficiently is crucial for real-time recommendations. There are several methods to optimize the retrieval process:
Exhaustive Scoring: For static query embeddings (precomputed user embeddings), the system can precompute and store a list of top candidates for each query, allowing for fast retrieval during serving. For dynamically computed embeddings, the system might need to score each potential candidate in real-time, which can be computationally expensive.
Approximate Nearest Neighbors (ANN): ANN algorithms approximate the nearest neighbor search, significantly reducing computation time while maintaining high accuracy. ANN libraries like FAISS, Annoy, or ScaNN are commonly used to handle large-scale nearest neighbor searches efficiently.

Use Cases

YouTube

The image below (source) shows the candidate generation architecture.

At a high level, the architecture employs a multi-class classification neural network for non-linear matrix factorization based on user watch history data. Fully watched videos are considered positive examples, while negative examples are randomly sampled, ensuring they do not overlap with the positive examples.
During inference, Approximate Nearest Neighbors (ANN) and similarity measures are used to retrieve the top 100 or so candidate videos based on user and video embeddings.
In the architecture shown in the image, embedded vectors are created for each user based on different categories: search (captures what the user searched for), watch (captures positive examples), and geolocation. These embeddings are averaged to form comprehensive user vectors.
During training, the model uses softmax to predict class probabilities. However, during serving, ANN is employed to quickly identify the nearest video candidates. The resulting user vector is then processed through a feed-forward neural network with ReLU activation functions to refine the recommendation.

Negative Sampling

Purpose: Negative sampling is used during the training phase of recommender systems to address bias in the training data and improve model performance and fairness.
Process: During training, items that the user has not interacted with are randomly selected and assigned negative labels, representing disinterest or lack of preference. This helps balance the dataset by providing a more representative view of user preferences.
Learning: Including negative samples allows the model to learn to distinguish between items a user is likely to prefer and those they are likely to dislike. This helps the model make better predictions.
Serving Phase: In the serving phase, the recommender system uses the trained model to generate personalized recommendations based on user preferences and other relevant factors, aiming to suggest items that align with the user’s interests.
Goal: The primary goal of negative sampling is to improve the model’s overall performance and fairness by considering both positive and negative examples during training. This helps capture nuanced user preferences and mitigate biases from the limited availability of positive data.
Techniques: The exact techniques for mitigating bias may vary depending on the specific recommender system and data characteristics. The ultimate objective is to balance addressing biases with providing accurate and relevant recommendations.
To illustrate, consider a user and three posts (0, 1, 2) on a platform. The user engaged with post 1 but not post 2, although post 2 is slightly more relevant than post 0. Each post has a 1D embedding corresponding to its ID. Without negative sampling, training the user embedding with gradient descent might result in the user embedding being closer to post 0 than post 2, leading to less accurate recommendations.

Evaluation

To test if a recommender system has a good candidate generation or candidate retrieval pool, you can use various evaluation metrics that measure the quality of the recommended items. Some commonly used metrics are:
- Precision: Precision measures the percentage of relevant items among the recommended items. In the context of candidate generation, precision measures how many of the recommended candidates are actually relevant to the user’s preferences.
- Recall: Recall measures the percentage of relevant items that were recommended. In the context of candidate generation, recall measures how many of the relevant candidates were recommended to the user.
- F1 Score: F1 Score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is a good overall metric for evaluating candidate generation performance.
- Mean Average Precision (mAP): MAP measures the average precision across different queries. In the context of candidate generation, MAP measures the average precision of the recommended candidates across all the users.
- Normalized Discounted Cumulative Gain (NDCG): NDCG measures the relevance of the recommended items by assigning higher scores to items that are more relevant. In the context of candidate generation, NDCG measures how well the recommended candidates are ranked according to their relevance.
By evaluating these metrics, you can assess the quality of the recommended items and identify areas where the candidate generation or retrieval pool needs improvement.

Overview

Candidate Generation in Recommender Systems

Content-Based Filtering

Collaborative Filtering

Matrix factorization

Using Scikit-learn’s NMF:

Using PySpark’s ALS:

Neural Collaborative Filtering (NCF)

Architecture Details

Candidate Generation

Example Process

Example Code (Pseudocode for Clarity)

Collaborative Filtering Visualization

2D Embedding

Examples

Multiple sources of candidates

Candidate Generation

Retrieval

Retrieval

Embedding Models

Generating the Query Embedding

Nearest Neighbor Search

Large-scale Retrieval

Use Cases

YouTube

Negative Sampling

Evaluation

References