Recommendation Systems • Candidate Generation
 Overview
 Notation
 Contentbased Filtering
 Collaborative Filtering
 Deep Neural Network based Recommendations
 Candidate Retrieval
 Evaluation
 References
Overview
 Candidate generation is the first stage of recommendations and is typically achieved by finding features for users that relate to features of the items. Example, if the user searches for food, the search engine will look at the users location to determine possible candidates. It would filter out searches that were not in New York and look at latent representation of the user and find items that have latent representations that are close using approximate nearest neighbor algorithms.
 Note: Candidate generation is for recommender systems like a search engine where the candidates are not particularly based on time (Eg: searching for how to tie a tie should roughly return the same results in every instance) whereas candidate retrieval is for newsfeed based systems where time is important, such as Instagram or Facebook, to make sure you don’t keep seeing the same content again.
 However, in order to achieve this, we would have to iterate over every item which could be computationally expensive.
 Let’s dive a little deeper into candidate generation, aka the first step in a general recommendation architecture.
 In recommender systems, candidate generation is the process of selecting a set of items that are likely to be recommended to a user. This process is typically performed after user preferences have been collected, and it involves filtering a large set of potential recommendations down to a more manageable number. The goal of candidate generation is to reduce the number of items that must be evaluated by the system, while still ensuring that the most relevant recommendations are included.
 Given a query (user information), the model generates a set of relevant candidates (videos, movies).
 There are two common candidate generation approaches:
 Contentbased filtering: Uses similarity between content to recommend new content.
 For e.g., if user watches corgi videos, the model will recommend more corgi videos.
 Collaborative filtering: Uses similarity between queries (2 or more users) and items (videos, movies) to provide recommendations.
 For e.g., if user \(A\) watches corgi videos and user \(A\) is similar to user \(B\) (in demographics and other areas), then the model can recommend corgi videos to user \(B\) even if user \(B\) has never watched a corgi video before.
 Contentbased filtering: Uses similarity between content to recommend new content.
Notation
 A quick note on notation you’ll see in the notes moving forward: we will represent \(r\) to be ratings, specifically if user \(i\) has rated item \(j\), \(n\) to hold a number value of users or items, and \(y\) to hold the rating given by a user to a particular item. We can see the notation in the image below (source)
Contentbased Filtering
 Contentbased filtering is a type of recommendation system that suggests items to users based on the similarity between the features of items and the user’s preferences. In other words, it recommends items that are similar to items that a user has liked or interacted with in the past.
 Contentbased filtering works by analyzing the attributes or features of the items, such as text descriptions, images, or tags, and building a profile of the user’s preferences based on the attributes of the items they have interacted with. The system then suggests items that are similar to those the user has interacted with, based on the similarity between the item attributes and the user profile.
 For example, consider a contentbased filtering system for recommending movies. The system analyzes the attributes of the movies, such as the genre, actors, directors, and plot summary, and builds a user profile based on the movies they have liked in the past. If a user has previously liked action movies with Bruce Willis, the system might recommend other action movies that feature Bruce Willis.
 Contentbased filtering has several advantages over other recommendation approaches. For instance, it can be used effectively in situations where there is limited or no data on user behavior, such as in new or niche domains where there may not be a large user base. Additionally, contentbased filtering can provide more personalized recommendations since it relies on the user’s individual preferences rather than the behavior of others.
 However, contentbased filtering also has limitations. For instance, it may fail to recommend items that are outside the user’s existing preferences or that are not similar to the items they have previously interacted with. Additionally, it may not be effective in situations where there is a lack of item information or where the items have limited attributes or features.
 Itembased: In itemcentered contentbased filtering, the recommender system suggests new items based on their similarity to the previously selected items, which are considered as implicit feedback. We can see an example in the image below. (source)
 Userbased:Usercentered contentbased filtering is a recommendation system approach where user preferences are collected through explicit feedback, such as questionnaires. This information is then used to recommend items with similar features to the user’s liked items. We can see an example in the image below. (source)
Pros
 Model does not need data about other users, only needs info for the current user. Makes it easier to scale.
 Can recommend niche items to each user, items that other users may not be interested in.
Cons
 Requires a lot of domain knowledge as the features are handengineered. Thus, the model is only as good as the handengineered features.
 Model will only make recommendations based on existing interests of the user and is not able to expand on their interests.
Example
 Let’s look at an example, first by looking at features in the image below (source):
 We will be creating an algorithm that learns to match users to items via Deep learning.
 Below is what our architecture will look like (source):
 Let’s look at a sequential model in TensorFlow implementing a contentbased filtering model: (source)
 We have 2 dense hidden layers and the final layer outputs 32 numbers. These are all using the relu activation function.
Code deepdive
 Now, lets build a contentbased filtering system using neural networks to recommend movies. The architecture is listed below (source)
 This has been taken from  Coursera: DeepLearning.AI’s specialization
 Lets import our packages and load our dataset:
import numpy as np
import numpy.ma as ma
from numpy import genfromtxt
from collections import defaultdict
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
from recsysNN_utils import *
pd.set_option("display.precision", 1)
# Load Data, set configuration variables
item_train, user_train, y_train, item_features, user_features, item_vecs, movie_dict, user_to_genre = load_data()
num_user_features = user_train.shape[1]  3 # remove userid, rating count and ave rating during training
num_item_features = item_train.shape[1]  1 # remove movie id at train time
uvs = 3 # user genre vector start
ivs = 3 # item genre vector start
u_s = 3 # start of columns to use in training, user
i_s = 1 # start of columns to use in training, items
scaledata = True # applies the standard scalar to data if true
print(f"Number of training vectors: {len(item_train)}")
 Let’s take a quick look at our feature vectors and data:
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)
[user id] [rating count] [rating ave] Act ion Adve nture Anim ation Chil dren Com edy Crime Docum entary Drama Fan tasy Hor ror Mys tery Rom ance Sci Fi Thri ller
2 16 4.1 3.9 5.0 0.0 0.0 4.0 4.2 4.0 4.0 0.0 3.0 4.0 0.0 4.2 3.9
2 16 4.1 3.9 5.0 0.0 0.0 4.0 4.2 4.0 4.0 0.0 3.0 4.0 0.0 4.2 3.9
2 16 4.1 3.9 5.0 0.0 0.0 4.0 4.2 4.0 4.0 0.0 3.0 4.0 0.0 4.2 3.9
2 16 4.1 3.9 5.0 0.0 0.0 4.0 4.2 4.0 4.0 0.0 3.0 4.0 0.0 4.2 3.9
2 16 4.1 3.9 5.0 0.0 0.0 4.0 4.2 4.0 4.0 0.0 3.0 4.0 0.0 4.2 3.9
 Let’s prepare the data by doing a bit of preprocessing:
# scale training data
if scaledata:
item_train_save = item_train
user_train_save = user_train
scalerItem = StandardScaler()
scalerItem.fit(item_train)
item_train = scalerItem.transform(item_train)
scalerUser = StandardScaler()
scalerUser.fit(user_train)
user_train = scalerUser.transform(user_train)
print(np.allclose(item_train_save, scalerItem.inverse_transform(item_train)))
print(np.allclose(user_train_save, scalerUser.inverse_transform(user_train)))
 And now split it into test and train:
item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
y_train, y_test = train_test_split(y_train, train_size=0.80, shuffle=True, random_state=1)
print(f"movie/item training data shape: {item_train.shape}")
print(f"movie/item test data shape: {item_test.shape}")
movie/item training data shape: (46549, 17)
movie/item test data shape: (11638, 17)
 Now, let’s construct our neural network as the architecture displayed above. It will have two networks that are combined by a dot product.
# GRADED_CELL
# UNQ_C1
num_outputs = 32
tf.random.set_seed(1)
user_NN = tf.keras.models.Sequential([
tf.keras.layers.Dense(256, activation='relu'),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(num_outputs),
])
item_NN = tf.keras.models.Sequential([
tf.keras.layers.Dense(256, activation='relu'),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(num_outputs),
])
# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)
# create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)
# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])
# specify the inputs and output of the model
model = Model([input_user, input_item], output)
model.summary()
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 14)] 0
__________________________________________________________________________________________________
input_2 (InputLayer) [(None, 16)] 0
__________________________________________________________________________________________________
sequential (Sequential) (None, 32) 40864 input_1[0][0]
__________________________________________________________________________________________________
sequential_1 (Sequential) (None, 32) 41376 input_2[0][0]
__________________________________________________________________________________________________
tf_op_layer_l2_normalize/Square [(None, 32)] 0 sequential[0][0]
__________________________________________________________________________________________________
tf_op_layer_l2_normalize_1/Squa [(None, 32)] 0 sequential_1[0][0]
__________________________________________________________________________________________________
tf_op_layer_l2_normalize/Sum (T [(None, 1)] 0 tf_op_layer_l2_normalize/Square[0
__________________________________________________________________________________________________
tf_op_layer_l2_normalize_1/Sum [(None, 1)] 0 tf_op_layer_l2_normalize_1/Square
__________________________________________________________________________________________________
tf_op_layer_l2_normalize/Maximu [(None, 1)] 0 tf_op_layer_l2_normalize/Sum[0][0
__________________________________________________________________________________________________
tf_op_layer_l2_normalize_1/Maxi [(None, 1)] 0 tf_op_layer_l2_normalize_1/Sum[0]
__________________________________________________________________________________________________
tf_op_layer_l2_normalize/Rsqrt [(None, 1)] 0 tf_op_layer_l2_normalize/Maximum[
__________________________________________________________________________________________________
tf_op_layer_l2_normalize_1/Rsqr [(None, 1)] 0 tf_op_layer_l2_normalize_1/Maximu
__________________________________________________________________________________________________
tf_op_layer_l2_normalize (Tenso [(None, 32)] 0 sequential[0][0]
tf_op_layer_l2_normalize/Rsqrt[0]
__________________________________________________________________________________________________
tf_op_layer_l2_normalize_1 (Ten [(None, 32)] 0 sequential_1[0][0]
tf_op_layer_l2_normalize_1/Rsqrt[
__________________________________________________________________________________________________
dot (Dot) (None, 1) 0 tf_op_layer_l2_normalize[0][0]
tf_op_layer_l2_normalize_1[0][0]
==================================================================================================
Total params: 82,240
Trainable params: 82,240
Nontrainable params: 0
__________________________________________________________________________________________________
 Finally, it’s time to use a mean squared error loss and an Adam optimizer
tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt,
loss=cost_fn)
 Let’s get to training!
tf.random.set_seed(1)
model.fit([user_train[:, u_s:], item_train[:, i_s:]], ynorm_train, epochs=30)
Train on 46549 samples
Epoch 1/30
46549/46549 [==============================]  6s 122us/sample  loss: 0.1254
Epoch 2/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.1187
Epoch 3/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1169
Epoch 4/30
46549/46549 [==============================]  5s 118us/sample  loss: 0.1154
Epoch 5/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1142
Epoch 6/30
46549/46549 [==============================]  5s 114us/sample  loss: 0.1130
Epoch 7/30
46549/46549 [==============================]  5s 114us/sample  loss: 0.1119
Epoch 8/30
46549/46549 [==============================]  5s 114us/sample  loss: 0.1110
Epoch 9/30
46549/46549 [==============================]  5s 114us/sample  loss: 0.1095
Epoch 10/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.1083
Epoch 11/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.1073
Epoch 12/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1066
Epoch 13/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.1059
Epoch 14/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1054
Epoch 15/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1047
Epoch 16/30
46549/46549 [==============================]  5s 114us/sample  loss: 0.1041
Epoch 17/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1036
Epoch 18/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.1030
Epoch 19/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1027
Epoch 20/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.1021
Epoch 21/30
46549/46549 [==============================]  5s 114us/sample  loss: 0.1018
Epoch 22/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1014
Epoch 23/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.1010
Epoch 24/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.1006
Epoch 25/30
46549/46549 [==============================]  5s 116us/sample  loss: 0.1003
Epoch 26/30
46549/46549 [==============================]  5s 114us/sample  loss: 0.0999
Epoch 27/30
46549/46549 [==============================]  5s 115us/sample  loss: 0.0997
Epoch 28/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.0991
Epoch 29/30
46549/46549 [==============================]  5s 113us/sample  loss: 0.0989
Epoch 30/30
46549/46549 [==============================]  5s 112us/sample  loss: 0.0985
<tensorflow.python.keras.callbacks.History at 0x7fab691f12d0>
 Evaluate the model to determine the loss on the test data.
model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], ynorm_test)
11638/11638 [==============================]  0s 33us/sample  loss: 0.1045
0.10449595100221243
 Making predictions is the next step, first lets start off by predicting for a new user:
new_user_id = 5000
new_rating_ave = 1.0
new_action = 1.0
new_adventure = 1
new_animation = 1
new_childrens = 1
new_comedy = 5
new_crime = 1
new_documentary = 1
new_drama = 1
new_fantasy = 1
new_horror = 1
new_mystery = 1
new_romance = 5
new_scifi = 5
new_thriller = 1
new_rating_count = 3
user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
new_action, new_adventure, new_animation, new_childrens,
new_comedy, new_crime, new_documentary,
new_drama, new_fantasy, new_horror, new_mystery,
new_romance, new_scifi, new_thriller]])
 Let’s look at the toprated movies for the new user:
# generate and replicate the user vector to match the number movies in the data set.
user_vecs = gen_user_vecs(user_vec,len(item_vecs))
# scale the vectors and make predictions for all movies. Return results sorted by rating.
sorted_index, sorted_ypu, sorted_items, sorted_user = predict_uservec(user_vecs, item_vecs, model, u_s, i_s,
scaler, scalerUser, scalerItem, scaledata=scaledata)
print_pred_movies(sorted_ypu, sorted_user, sorted_items, movie_dict, maxcount = 10)
y_p movie id rating ave title genres
4.86762 64969 3.61765 Yes Man (2008) Comedy
4.86692 69122 3.63158 Hangover, The (2009) ComedyCrime
4.86477 63131 3.625 Role Models (2008) Comedy
4.85853 60756 3.55357 Step Brothers (2008) Comedy
4.85785 68135 3.55 17 Again (2009) ComedyDrama
4.85178 78209 3.55 Get Him to the Greek (2010) Comedy
4.85138 8622 3.48649 Fahrenheit 9/11 (2004) Documentary
4.8505 67087 3.52941 I Love You, Man (2009) Comedy
4.85043 69784 3.65 Brüno (Bruno) (2009) Comedy
4.84934 89864 3.63158 50/50 (2011) ComedyDrama
 Now let’s make predictions for an existing user:
uid = 36
# form a set of user vectors. This is the same vector, transformed and repeated.
user_vecs, y_vecs = get_user_vecs(uid, scalerUser.inverse_transform(user_train), item_vecs, user_to_genre)
# scale the vectors and make predictions for all movies. Return results sorted by rating.
sorted_index, sorted_ypu, sorted_items, sorted_user = predict_uservec(user_vecs, item_vecs, model, u_s, i_s, scaler,
scalerUser, scalerItem, scaledata=scaledata)
sorted_y = y_vecs[sorted_index]
#print sorted predictions
print_existing_user(sorted_ypu, sorted_y.reshape(1,1), sorted_user, sorted_items, item_features, ivs, uvs, movie_dict, maxcount = 10)
y_p y user user genre ave movie rating ave title genres
3.1 3.0 36 3.00 2.86 Time Machine, The (2002) Adventure
3.0 3.0 36 3.00 2.86 Time Machine, The (2002) Action
2.8 3.0 36 3.00 2.86 Time Machine, The (2002) SciFi
2.3 1.0 36 1.00 4.00 Beautiful Mind, A (2001) Romance
2.2 1.0 36 1.50 4.00 Beautiful Mind, A (2001) Drama
1.6 1.5 36 1.75 3.52 Road to Perdition (2002) Crime
1.6 2.0 36 1.75 3.52 Gangs of New York (2002) Crime
1.5 1.5 36 1.50 3.52 Road to Perdition (2002) Drama
1.5 2.0 36 1.50 3.52 Gangs of New York (2002) Drama
Collaborative Filtering
 Collaborative filtering is a type of recommendation system that suggests items to users based on the similarities and patterns in the behavior of a group of users. In other words, it recommends items that users with similar preferences have liked or interacted with.
 Collaborative filtering works by analyzing the behavior and preferences of a group of users, such as their ratings, purchases, or clicks, and identifying users who have similar preferences or behavior. The system then recommends items to a user based on the items that have been liked or interacted with by other users who share similar preferences or behavior.
 For example, consider a collaborative filtering system for recommending movies. The system analyzes the ratings or reviews of movies provided by a group of users and identifies users who have similar preferences. If a user has rated several movies highly that have also been highly rated by other users with similar preferences, the system might recommend other movies that have been highly rated by those similar users.
 Collaborative filtering has several advantages over other recommendation approaches. For instance, it can be used effectively in situations where there is limited information on item attributes or where the user base is large and diverse. Additionally, collaborative filtering can provide serendipitous recommendations since it can suggest items that a user may not have discovered otherwise.
 However, collaborative filtering also has limitations. For instance, it may not be effective in situations where there is limited or no data on user behavior or where users have unique preferences that are not shared by others. Additionally, it may suffer from the “cold start” problem, where new items or users have limited data available, making it difficult to recommend items accurately.
 To address some of the limitations of contentbased filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user \(A\) based on the interests of a similar user \(B\).
 In contrast to contentbased filtering, collaborative filtering is able to learn embeddings automatically, without relying on handengineered features.
 It is also able to take interests of user \(A\) and recommend it to user \(B\).
 The goal of a collaborative filtering recommender system is to generate two vectors:
 A ‘parameter vector’ for each user that embodies the item’s tastes of a user.
 A feature vector for each item which embodies some description of the movie.
 The dot product of the two vectors plus the bias term should produce an estimate of the rating the user might give to that movie.
Pros
 We don’t need to have prior domain knowledge as the embeddings are automatically learned.
Cons
 Cold start problem: If the item/document has not been seen during training, our model can not create an embedding from it and can not query the model for this item.
 Mean Normalization can help with this, as we see below. Another solution is to show some reasonable items to the new users and have them rate a few items.
 You can also use the user’s information (demographics, expressed preferences, web browser they are using, their IP) to recommend them reasonable items.
A Movie Recommendation Casestudy
 Consider a movie recommendation system (taken from Google’s course on Recommendation Systems in which the training data consists of a feedback matrix in which:
 Each row represents a user.
 Each column represents an item (a movie).
 The feedback about movies falls into one of two categories:
 Explicit: users specify how much they liked a particular movie by providing a numerical rating.
 Implicit: if a user watches a movie, the system infers that the user is interested.
 To simplify, we will assume that the feedback matrix is binary; that is, a value of 1 indicates interest in the movie.
 When a user visits the homepage, the system should recommend movies based on both:
 Similarity to movies the user has liked in the past.
 Movies that similar users liked.
 For the sake of illustration, let’s handengineer some features for the movies described in the following table: (source for image below)
1D Embedding
 Suppose we assign to each movie a scalar in [1, 1] that describes whether the movie is for children (negative values) or adults (positive values). Suppose we also assign a scalar to each user in [1, 1] that describes the user’s interest in children’s movies (closer to 1) or adult movies (closer to +1). The product of the movie embedding and the user embedding should be higher (closer to 1) for movies that we expect the user to like. (source for image below)
 In the diagram below (source), each checkmark identifies a movie that a particular user watched. The third and fourth users have preferences that are well explained by this feature—the third user prefers movies for children and the fourth user prefers movies for adults. However, the first and second users’ preferences are not well explained by this single feature.
2D Embedding
 One feature was not enough to explain the preferences of all users. To overcome this problem, let’s add a second feature: the degree to which each movie is a blockbuster or an arthouse movie. With a second feature, we can now represent each movie with the following twodimensional embedding: (source)
 We again place our users in the same embedding space to best explain the feedback matrix: for each \((user, item)\) pair, we would like the dot product of the user embedding and the item embedding to be close to 1 when the user watched the movie, and to 0 otherwise. (source)

Note: We represented both items and users in the same embedding space. This may seem surprising. After all, users and items are two different entities. However, you can think of the embedding space as an abstract representation common to both items and users, in which we can measure similarity or relevance using a similarity metric. In this example, we handengineered the embeddings. In practice, the embeddings can be learned automatically, which is the power of collaborative filtering models. In the next two sections, we will discuss different models to learn these embeddings, and how to train them.

The collaborative nature of this approach is apparent when the model learns the embeddings. Suppose the embedding vectors for the movies are fixed. Then, the model can learn an embedding vector for the users to best explain their preferences. Consequently, embeddings of users with similar preferences will be close together. Similarly, if the embeddings for the users are fixed, then we can learn movie embeddings to best explain the feedback matrix. As a result, embeddings of movies liked by similar users will be close in the embedding space.
Examples
 As a realworld example, let’s look at an example (source) with binary labels signifying engagement (using favorites, likes, and clicks):
 Let’s talk about the meaning of each rating in the above dataset. We can choose what each of our labels mean so here, we have chosen the following:
1
: user was engaged after being shown the item.0
: user did not engage after being shown the item.?
: the item is not yet shown to the user.
Matrix factorization
 Matrix factorization is a simple embedding model. Given the feedback matrix \(\mathrm{A} \in R^{m \times n}\), where \(m\) is the number of users (or queries) and \(n\) is the number of items, the model learns:
 A user embedding matrix \(U \in \mathbb{R}^{m \times d}\), where row \(i\) is the embedding for user \(i\).
 An item embedding matrix \(V \in \mathbb{R}^{n \times d}\), where row \(j\) is the embedding for item \(j\). The image below depicts this: (source)

The embeddings are learned such that the product \(U V^{T}\) is a good approximation of the feedback matrix. Observe that the \((i, j)\) entry of \(U . V^{T}\) is simply the dot product \(\left\langle U_{i}, V_{j}\right\rangle\) of the embeddings of user \(i\) and item \(j\), which you want to be close to \(A_{i, j}\).

Note: Matrix factorization typically gives a more compact representation than learning the full matrix. The full matrix has \(O(n m)\) entries, while the embedding matrices \(U, V\) have \(O((n+m) d)\) entries, where the embedding dimension \(d\) is typically much smaller than \(m\) and \(n\). As a result, matrix factorization finds latent structure in the data, assuming that observations lie close to a lowdimensional subspace. In the preceding example, the values of \(n, m\), and \(d\) are so low that the advantage is negligible. In realworld recommendation systems, however, matrix factorization can be significantly more compact than learning the full matrix.
Choosing the Objective Function
 One intuitive objective function is the squared distance. To do this, minimize the sum of squared errors over all pairs of observed entries:
 In this objective function, you only sum over observed pairs (i, j), that is, over nonzero values in the feedback matrix. However, only summing over values of one is not a good ideaa matrix of all ones will have a minimal loss and produce a model that can’t make effective recommendations and that generalizes poorly. The image below (source), depicts this.
 Perhaps you could treat the unobserved values as zero, and sum over all entries in the matrix. This corresponds to minimizing the squared Frobenius distance between \(A\) and its approximation \(U V^{T}\):

You can solve this quadratic problem through Singular Value Decomposition (SVD) of the matrix. However, SVD is not a great solution either, because in real applications, the matrix $A$ may be very sparse. For example, think of all the videos on YouTube compared to all the videos a particular user has viewed. The solution \(U V^{T}\) (which corresponds to the model’s approximation of the input matrix) will likely be close to zero, leading to poor generalization performance.

In contrast, Weighted Matrix Factorization decomposes the objective into the following two sums:
 A sum over observed entries.
 A sum over unobserved entries (treated as zeroes).

Here, \(w_{0}\) is a hyperparameter that weights the two terms so that the objective is not dominated by one or the other. Tuning this hyperparameter is very important.

Note: In practical applications, you also need to weight the observed pairs carefully. For example, frequent items (for example, extremely popular YouTube videos) or frequent queries (for example, heavy users) may dominate the objective function. You can correct for this effect by weighting training examples to account for item frequency. In other words, you can replace the objective function by:
\[\sum_{(i, j) \in \mathrm{obs}} w_{i, j}\left(A_{i, j}\left\langle U_{i}, V_{j}\right\rangle\right)^{2}+w_{0} \sum_{i, j \notin \mathrm{obs}}\left\langle U_{i}, V_{j}\right\rangle^{2}\] where \(w_{i, j}\) is a function of the frequency of query \(\mathrm{i}$ and item $\mathrm{j}\).
Minimizing the Objective Function
 Common algorithms to minimize the objective function include:
 Stochastic gradient descent (SGD) is a generic method to minimize loss functions.
 Weighted Alternating Least Squares (WALS) is specialized to this particular objective.
The objective is quadratic in each of the two matrices \(\mathrm{U}\) and \(\mathrm{V}\). (Note, however, that the problem is not jointly convex.) WALS works by initializing the embeddings randomly, then alternating between:
 Fixing \(U\) and solving for \(V\).
 Fixing \(V\) and solving for \(U\).
 Each stage can be solved exactly (via solution of a linear system) and can be distributed. This technique is guaranteed to converge because each step is guaranteed to decrease the loss.
Cost function for binary labels (regression to classification)
 Let’s look at how we can generalize our algorithm, note that this is a lot like linear regression but we instead to predict these binary outputs, thus move from regression to a binary classification problem.
 Below we have the cost function for this binary application: (source)
 This is how we take linear regressionlike collaborative filtering algorithm and generalize it to work for binary labels.
SGD vs. WALS
 SGD and WALS have advantages and disadvantages. Review the information below to see how they compare:
SGD
 (+) Very flexible—can use other loss functions.
 (+) Can be parallelized.
 () Slower—does not converge as quickly.
 () Harder to handle the unobserved entries (need to use negative sampling or gravity).
WALS
 () Reliant on Loss Squares only.
 (+) Can be parallelized.
 (+) Converges faster than SGD.
 (+) Easier to handle unobserved entries.
Example: Learning process for vectors
 Below is a diagram of how vectors are learned using matrix factorization. (source)
 \(Y\) contains ratings; 0.5 to 5 inclusive in 0.5 steps. 0 if the movie has not been rated.
 \(R\) has a 1 where movies have been rated.
 Movies are in rows, users in columns. Each user has a parameter vector \(w_{user}\) and bias. Each movie has a feature vector \(x_{movie}\). These vectors are simultaneously learned by using the existing user/movie ratings as training data.
 One training example is shown above: \(w^{(1)} \cdot x^{(1)} + b^{(1)}=4\).
 The feature vector \(x_{movie}\) must satisfy all the users while the user vector \(w_{user}\) must satisfy all the movies. Since all the users collaborate to generate the rating set, we call it collaborative filtering.
 Once the feature vectors and parameters are learned, they can be used to predict how a user might rate an unrated movie.
Code deepdive
 Now lets look into the code for a movie recommendation model from Andrew Ng’s Coursera specialization.
 The first part of the exercise was to implement the collaborative filtering cost function. Let’s skip ahead to training:
movieList, movieList_df = load_Movie_List_pd()
my_ratings = np.zeros(num_movies) # Initialize my ratings
# Check the file small_movie_list.csv for id of each movie in our dataset
# For example, Toy Story 3 (2010) has ID 2700, so to rate it "5", you can set
my_ratings[2700] = 5
#Or suppose you did not enjoy Persuasion (2007), you can set
my_ratings[2609] = 2;
# We have selected a few movies we liked / did not like and the ratings we gave are as follows:
my_ratings[929] = 5 # Lord of the Rings: The Return of the King, The
my_ratings[246] = 5 # Shrek (2001)
my_ratings[2716] = 3 # Inception
my_ratings[1150] = 5 # Incredibles, The (2004)
my_ratings[382] = 2 # Amelie (Fabuleux destin d'Amélie Poulain, Le)
my_ratings[366] = 5 # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
my_ratings[622] = 5 # Harry Potter and the Chamber of Secrets (2002)
my_ratings[988] = 3 # Eternal Sunshine of the Spotless Mind (2004)
my_ratings[2925] = 1 # Louis Theroux: Law & Disorder (2008)
my_ratings[2937] = 1 # Nothing to Declare (Rien à déclarer)
my_ratings[793] = 5 # Pirates of the Caribbean: The Curse of the Black Pearl (2003)
my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]
print('\nNew user ratings:\n')
for i in range(len(my_ratings)):
if my_ratings[i] > 0 :
print(f'Rated {my_ratings[i]} for {movieList_df.loc[i,"title"]}');
New user ratings:
Rated 5.0 for Shrek (2001)
Rated 5.0 for Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
Rated 2.0 for Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Rated 5.0 for Harry Potter and the Chamber of Secrets (2002)
Rated 5.0 for Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Rated 5.0 for Lord of the Rings: The Return of the King, The (2003)
Rated 3.0 for Eternal Sunshine of the Spotless Mind (2004)
Rated 5.0 for Incredibles, The (2004)
Rated 2.0 for Persuasion (2007)
Rated 5.0 for Toy Story 3 (2010)
Rated 3.0 for Inception (2010)
Rated 1.0 for Louis Theroux: Law & Disorder (2008)
Rated 1.0 for Nothing to Declare (Rien à déclarer) (2010)
 Above we are starting with my ratings and printing all the ratings > 0.
 Now let’s add these reviews to 𝑌 and 𝑅 and normalize the ratings.
# Reload ratings and add new ratings
Y, R = load_ratings_small()
Y = np.c_[my_ratings, Y]
R = np.c_[(my_ratings != 0).astype(int), R]
# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)
 Now, let’s initialize the parameters and select the Adam optimizer
# Useful Values
num_movies, num_users = Y.shape
num_features = 100
# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users, num_features),dtype=tf.float64), name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64), name='X')
b = tf.Variable(tf.random.normal((1, num_users), dtype=tf.float64), name='b')
# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e1)
 The final step left here is to train our collaborative filtering model. This will learn the parameters \(X\), \(W\), and \(b\).
 Since learning these parameters doesn’t follow the typical ‘layers’ offered in TensorFlow neural network package, we will need to use custom training loop instead of the traditional model flow:
compile()
,fit()
, andpredict()
.  In order to do this,we will first return the gradient of the loss relative to the tracked variables.
 Further, the gradients can then be applied to the parameters using an optimizer.
iterations = 200
lambda_ = 1
for iter in range(iterations):
# Use TensorFlow’s GradientTape
# to record the operations used to compute the cost
with tf.GradientTape() as tape:
# Compute the cost (forward pass included in cost)
cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss
grads = tape.gradient( cost_value, [X,W,b] )
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients( zip(grads, [X,W,b]) )
# Log periodically.
if iter % 20 == 0:
print(f"Training loss at iteration {iter}: {cost_value:0.1f}")
 Below is the loss we had logged periodically:
Training loss at iteration 0: 2321138.6
Training loss at iteration 20: 136158.3
Training loss at iteration 40: 51858.7
Training loss at iteration 60: 24596.9
Training loss at iteration 80: 13629.5
Training loss at iteration 100: 8487.1
Training loss at iteration 120: 5807.2
Training loss at iteration 140: 4311.2
Training loss at iteration 160: 3434.9
Training loss at iteration 180: 2901.7
 Below we will start making our predictions which will be based off our initial ratings given earlier. We can compare the original and predicted value in the output below:
# Make a prediction using trained weights and biases
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()
#restore the mean
pm = p + Ymean
my_predictions = pm[:,0]
# sort predictions
ix = tf.argsort(my_predictions, direction='DESCENDING')
for i in range(17):
j = ix[i]
if j not in my_rated:
print(f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')
print('\n\nOriginal vs Predicted ratings:\n')
for i in range(len(my_ratings)):
if my_ratings[i] > 0:
print(f'Original {my_ratings[i]}, Predicted {my_predictions[i]:0.2f} for {movieList[i]}')
Predicting rating 4.55 for movie My Sassy Girl (Yeopgijeogin geunyeo) (2001)
Predicting rating 4.53 for movie Martin Lawrence Live: Runteldat (2002)
Predicting rating 4.53 for movie Delirium (2014)
Predicting rating 4.53 for movie Particle Fever (2013)
Predicting rating 4.53 for movie Laggies (2014)
Predicting rating 4.53 for movie One I Love, The (2014)
Predicting rating 4.52 for movie Eichmann (2007)
Predicting rating 4.52 for movie Battle Royale 2: Requiem (Batoru rowaiaru II: Chinkonka) (2003)
Predicting rating 4.52 for movie Into the Abyss (2011)
Predicting rating 4.52 for movie Son of the Bride (Hijo de la novia, El) (2001)
Predicting rating 4.51 for movie Rivers and Tides (2001)
Original vs Predicted ratings:
Original 5.0, Predicted 4.89 for Shrek (2001)
Original 5.0, Predicted 4.84 for Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
Original 2.0, Predicted 2.13 for Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Original 5.0, Predicted 4.89 for Harry Potter and the Chamber of Secrets (2002)
Original 5.0, Predicted 4.91 for Lord of the Rings: The Return of the King, The (2003)
Original 3.0, Predicted 3.00 for Eternal Sunshine of the Spotless Mind (2004)
Original 5.0, Predicted 4.89 for Incredibles, The (2004)
Original 2.0, Predicted 2.15 for Persuasion (2007)
Original 5.0, Predicted 4.80 for Toy Story 3 (2010)
Original 3.0, Predicted 3.01 for Inception (2010)
Original 1.0, Predicted 1.48 for Louis Theroux: Law & Disorder (2008)
 Below we have a dataframe showing data from the top 20 ratings:
filter=(movieList_df["number of ratings"] > 20)
movieList_df["pred"] = my_predictions
movieList_df = movieList_df.reindex(columns=["pred", "mean rating", "number of ratings", "title"])
movieList_df.loc[ix[:300]].loc[filter].sort_values("mean rating", ascending=False)
pred mean rating number of ratings title
2112 4.097693 4.238255 149 Dark Knight, The (2008)
155 4.004862 4.155914 93 Snatch (2000)
211 4.246676 4.122642 159 Memento (2000)
929 4.911655 4.118919 185 Lord of the Rings: The Return of the King, The...
2700 4.795389 4.109091 55 Toy Story 3 (2010)
653 4.331339 4.021277 188 Lord of the Rings: The Two Towers, The (2002)
2804 4.393629 3.989362 47 Harry Potter and the Deathly Hallows: Part 1 (...
773 4.082829 3.960993 141 Finding Nemo (2003)
1771 4.246275 3.944444 81 Casino Royale (2006)
2455 4.244220 3.887931 58 Harry Potter and the HalfBlood Prince (2009)
361 4.052389 3.871212 132 Monsters, Inc. (2001)
246 4.893447 3.867647 170 Shrek (2001)
1150 4.885246 3.836000 125 Incredibles, The (2004)
366 4.839989 3.761682 107 Harry Potter and the Sorcerer's Stone (a.k.a. ...
79 4.226652 3.699248 133 XMen (2000)
622 4.889378 3.598039 102 Harry Potter and the Chamber of Secrets (2002)
Deep Neural Network based Recommendations
 Deep neural networks (DNNs) have been increasingly used in recommendation systems due to their ability to learn complex patterns in user behavior and item attributes. DNNbased recommendation systems typically use a neural network architecture to model the useritem interactions and make predictions for item recommendations.
 One common approach for DNNbased recommendation is to use a matrix factorization model as a baseline and then incorporate additional layers of neural networks to capture more complex patterns in the useritem interactions. This is known as a deep matrix factorization model. The neural network layers can be used to learn nonlinear transformations of the input features, which can improve the accuracy of the recommendations.
 Another popular approach for DNNbased recommendation is to use a sequence modeling architecture, such as a recurrent neural network (RNN) or a transformer network. These models can capture temporal dependencies in user behavior and item popularity, allowing for more accurate and personalized recommendations. For example, an RNN can be used to model the sequence of items that a user has interacted with over time, and then use this information to predict which item the user is likely to interact with next.
 “DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.” (source)
 A possible DNN model for recommendation is the softmax model, which frames the problem as a multiclass prediction task where the input is the user query and the output is a probability vector with the same size as the item corpus. This vector represents the likelihood of a user interacting with each item, such as watching or clicking on a YouTube video.
 The input features to the DNN can include both dense features like watch time and time since last watch, as well as sparse features such as watch history and country. Unlike matrix factorization, side features like age or country can also be added to the input vector, which is denoted as x.
 The image below (source) illustrates how that architecture would look like.
 By incorporating hidden layers and nonlinear activation functions like ReLU, the model can capture more intricate relationships within the data. However, as the number of parameters increases, the model can become more challenging to train and more resourceintensive to deploy.
Candidate Retrieval
 “Suppose you have an embedding model. Given a user, how would you decide which items to recommend?
 At serve time, given a query, you start by doing one of the following:
 For a matrix factorization model, the query (or user) embedding is known statically, and the system can simply look it up from the user embedding matrix.
 For a DNN model, the system computes the query embedding \(\psi(x)\) at serve time by running the network on the feature vector \(x\).
 Once you have the query embedding $q$, search for item embeddings \(V_{j}\) that are close to \(q\) in the embedding space. This is a nearest neighbor problem. For example, you can return the top \(\mathrm{k}\) items according to the similarity score \(s\left(q, V_{j}\right)\).” (source)
 “You can use a similar approach in relateditem recommendations. For example, when the user is watching a YouTube video, the system can first look up the embedding of that item, and then look for embeddings of other items that are close in the embedding space.” (source)
Largescale Retrieval
 “To compute the nearest neighbors in the embedding space, the system can exhaustively score every potential candidate. Exhaustive scoring can be expensive for very large corpora, but you can use either of the following strategies to make it more efficient:
 If the query embedding is known statically, the system can perform exhaustive scoring offline, precomputing and storing a list of the top candidates for each query. This is a common practice for relateditem recommendation.
 Use approximate nearest neighbors.” (source)
Use Cases
 Twitter: Twitter uses candidate generation to recommend accounts for users to follow. The recommendation engine generates a list of accounts that the user might be interested in based on their activity on the platform. Twitter’s algorithm considers factors such as the user’s tweets, retweets, likes, and interactions with other accounts to generate a list of recommended accounts. In addition to recommending accounts, Twitter also generates candidate tweets for users to engage with. The system analyzes the user’s activity and suggests tweets that they may be interested in liking or retweeting.
 Pinterest: Pinterest uses candidate generation to recommend pins to users based on their interests. The recommendation engine analyzes the user’s activity on the platform, such as the pins they have saved, liked, and commented on, to generate a list of recommended pins. Pinterest also generates candidate boards for users to follow. The system identifies boards that are related to the user’s interests and suggests them as potential candidates for follow.
 Netflix: Netflix uses candidate generation to recommend TV shows and movies to users. The recommendation engine analyzes the user’s viewing history and behavior on the platform to generate a list of recommended titles. Netflix’s algorithm considers factors such as the user’s watch history, search queries, and interactions with the platform to generate a personalized list of recommended titles. The platform also generates candidate trailers for users to watch. The system analyzes the user’s viewing history and suggests trailers for titles that the user might be interested in.
 Amazon: Amazon uses candidate generation to recommend products to users based on their browsing and purchasing history. The recommendation engine analyzes the user’s activity on the platform, such as the items they have searched for, added to their cart, and purchased, to generate a list of recommended products. Amazon’s algorithm considers factors such as the user’s search history, purchase history, and interactions with the platform to generate a personalized list of recommended products. The platform also generates candidate reviews for users to read. The system analyzes the user’s browsing history and suggests reviews for products that the user might be interested in.
Youtube
 The image below (source) shows the candidate generation architecture.
 How this works on a high level is, they use a multiclass classification neural network model that is non linear matrix factorization model based on user watch history data.
 Here, they count a fully watched video as a positive example and negative examples are randomly sampled but they made sure these dont contain positive examples as well.
 ANN with similarity measures are used at inference time based on user and video embeddings to get the top 100 or so candidates.
 Now lets go back to the architecture from the image above. Embedded videos are created for each user and a weighted average is taken for different categories such as search(since the user could search something and not watch it), watch(shows a positive example), and geolocation.
 During training, it leverages softmax to predict class probabilities but during serving it uses ANN.
 This then goes through a feed forward neural network that uses a ReLU activation.
Evaluation
 To test if a recommender system has a good candidate generation or candidate retrieval pool, you can use various evaluation metrics that measure the quality of the recommended items. Some commonly used metrics are:
 Precision: Precision measures the percentage of relevant items among the recommended items. In the context of candidate generation, precision measures how many of the recommended candidates are actually relevant to the user’s preferences.
 Recall: Recall measures the percentage of relevant items that were recommended. In the context of candidate generation, recall measures how many of the relevant candidates were recommended to the user.
 F1 Score: F1 Score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is a good overall metric for evaluating candidate generation performance.
 Mean Average Precision (MAP): MAP measures the average precision across different queries. In the context of candidate generation, MAP measures the average precision of the recommended candidates across all the users.
 Normalized Discounted Cumulative Gain (NDCG): NDCG measures the relevance of the recommended items by assigning higher scores to items that are more relevant. In the context of candidate generation, NDCG measures how well the recommended candidates are ranked according to their relevance.
 By evaluating these metrics, you can assess the quality of the recommended items and identify areas where the candidate generation or retrieval pool needs improvement.
References
 Google’s Recommendation Systems Developer Course
 Coursera: Music Recommender System Project
 Coursera: DeepLearning.AI’s specialization.
 Recommender system from learned embeddings
 Google’s Recommendation Systems Developer Crash Course: Embeddings Video Lecture
 ALS introduction by Sophie Wats
 Matrix Factorization
 Recommendation System for Ecommerce using Alternating Least Squares (ALS) on Apache Spark