Recommendation Systems • Architectures
 Overview
 Deep Neural Network Models for Recommendation
 Wide and Deep (2016)
 Deep Factorization Machine / DeepFM (2017)
 Neural Collaborative Filtering / NCF (2017)
 Deep and Cross Networks / DCN (2017)
 AutoInt (2019)
 DLRM (2019)
 DCN V2 (2020)
 DHEN (2022)
 GDCN (2023)
 Graph Neural Networks-based RecSys Architectures
 Two Towers in RecSys
 Summary
 References
Overview
 This article explores the various architectures used in recommender systems, focusing on how these systems process and utilize different types of features for generating recommendations.
 Recommender systems typically deal with two kinds of features: dense and sparse. Dense features are continuous real values, such as movie ratings or release years. Sparse features, on the other hand, are categorical and can vary in cardinality, like movie genres or the list of actors in a film.
 The architectural transformation of these features in RecSys models can be broadly divided into two parts:
 Dense Features (Continuous / real / numerical values):
 Movie Ratings: This feature represents the continuous real values indicating the ratings given by users to movies. For example, a rating of 4.5 out of 5 would be a dense feature value.
 Movie Release Year: This feature represents the continuous real values indicating the year in which the movie was released. For example, the release year 2000 would be a dense feature value.
 Sparse Features (Categorical with low or high cardinality):
 Movie Genre: This feature represents the categorical information about the genre(s) of a movie, such as “Action,” “Comedy,” or “Drama.” These categorical values have low cardinality, meaning there are a limited number of distinct genres.
 Movie Actors: This feature represents the categorical information about the actors who starred in a movie. These categorical values can have high cardinality, as there could be numerous distinct actors in the dataset.
 In the model architecture, the dense features like movie ratings and release year can be directly fed into a feedforward dense neural network. The dense network performs transformations and computations on the continuous real values of these features.
 On the other hand, sparse features like movie genre and actors require a different approach. Such features are often encoded as one-hot vectors, e.g., [0,1,0]; however, this often leads to excessively high-dimensional feature spaces for large vocabularies. This is especially true of web-scale recommender systems for tasks such as CTR prediction, where the inputs are mostly categorical features, e.g., country = usa.
 Instead of directly using the raw categorical values, an embedding network is employed to reduce the dimensionality. Each sparse, high-dimensional categorical feature value (e.g., a genre or an actor) is first converted into a low-dimensional, dense real-valued vector, often referred to as an embedding vector. The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly, and their values are then trained to minimize the final loss function during model training. These embeddings capture the semantic relationships and similarities between different categories. Once trained, the embeddings are stored in lookup tables keyed by sparse feature value, allowing for efficient retrieval during the model’s inference.
 By combining the outputs of the dense neural network and the embedding lookup tables, the model can capture the interactions between dense and sparse features, leading to better recommendations based on both continuous and categorical information.
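To make the relationship between one-hot encoding and embedding lookup concrete, here is a minimal sketch in plain Python. The vocabulary, the toy embedding size of 4, and the helper names (one_hot, embed_by_matmul, embed_by_lookup) are assumptions made for illustration; they do not come from any particular library. The sketch shows that multiplying a one-hot vector by the embedding table yields the same result as simply indexing the table, which is why embedding layers are implemented as lookups.

```python
import random

random.seed(0)

# Hypothetical vocabulary for a single categorical feature.
vocab = ["usa", "canada", "mexico"]
index_of = {value: i for i, value in enumerate(vocab)}

embedding_dim = 4  # toy size; real systems typically use tens to hundreds
# One row per category, randomly initialized (training would update these values).
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in vocab
]

def one_hot(value):
    """Return the one-hot vector for a category, e.g. 'canada' -> [0, 1, 0]."""
    vec = [0] * len(vocab)
    vec[index_of[value]] = 1
    return vec

def embed_by_matmul(value):
    """Multiply the one-hot row vector by the embedding table."""
    x = one_hot(value)
    return [
        sum(x[i] * embedding_table[i][d] for i in range(len(vocab)))
        for d in range(embedding_dim)
    ]

def embed_by_lookup(value):
    """Index the table directly; the result matches the matrix product."""
    return list(embedding_table[index_of[value]])

canada_onehot = one_hot("canada")
canada_vec = embed_by_lookup("canada")
```

The lookup avoids ever materializing the one-hot vector, which matters when the vocabulary has millions of entries.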
 The figure below (source) illustrates a deep neural network (DNN) architecture for processing both dense and sparse features: dense features are processed through an MLP (multilayer perceptron) to create dense embeddings, while sparse features are converted to sparse embeddings via separate embedding tables (A and B). These embeddings are then combined to facilitate dense-sparse interactions before being fed into the DNN architecture to produce the output.

Additionally, the Criteo Leaderboard helps us see which architecture performs well on the Criteo dataset based on CTR.
 When serving these models, further optimizations can be applied. These include both physical transformations (like model pruning and quantization) and logical transformations (such as splitting the model or separating out embedding tables).
 Logical transformations are particularly focused on optimizing the model’s execution storage and latency concerning the available hardware for serving. For instance, embedding tables might be hosted on CPU machines equipped with large memory and IO bandwidth, while the processing of dense features and their interaction with sparse features can be allocated to GPUs, benefiting from parallel processing speedups. The chosen serving paradigm is essentially a deployment plan tailored to the hardware setup, often seen in enterprise environments.
 A common paradigm in recommender systems involves using GPU for dense networks and high CPU memory machines for embedding tables.
 The plot below (source) is a visual representation of the models and architectures for the task of Click-Through Rate Prediction on the Criteo dataset. With this use case as our poster child, we will discuss the inner workings of some of the major model architectures listed in the plot.
Deep Neural Network Models for Recommendation
 Deep neural network models have gained significant popularity in the field of recommendation systems. These models leverage several variants of artificial neural networks (ANNs) to effectively capture complex patterns and make accurate recommendations.
 Pros: Capable of learning complex, nonlinear relationships between inputs. Can handle a variety of feature types. Suitable for both candidate generation and fine ranking.
 Cons: Can be computationally expensive and require a lot of data to train effectively. Might overfit on small datasets. The inner workings of the model can be hard to interpret (“black box”).
 Use case: Best suited when you have a large dataset and require a model that can capture complex patterns and interactions between features.
 Let’s explore some of the variations of neural building blocks (source):
 Feedforward Neural Networks (FFNNs):
 FFNNs are a type of ANN where information flows in a single direction, from one layer to the next, with no cycles.
 Multilayer perceptrons (MLPs) are a specific type of FFNN consisting of at least three layers: an input layer, one or more hidden layers, and an output layer.
 MLPs are versatile and can be applied to a wide range of scenarios.
 Convolutional Neural Networks (CNNs):
 CNNs are primarily known for their effectiveness in image processing tasks, such as object identification and image classification.
 They employ convolutional operations to extract important features from input data.
 Recurrent Neural Networks (RNNs):
 RNNs are specifically designed to handle sequential data and capture temporal dependencies.
 They are commonly used in natural language processing (NLP) tasks to parse language patterns and process sequential data.
 In the realm of recommendation systems, deep learning (DL) models build upon traditional techniques like factorization to model interactions between variables. DL models also utilize embeddings to handle categorical variables. Embeddings are learned vector representations of entity features, where similar entities (users or items) have smaller distances in the vector space. For example, a deep learning approach to collaborative filtering can learn user and item embeddings based on their interactions using a neural network.
 Deep learning techniques tap into the extensive library of novel network architectures and optimization algorithms. They excel in training on large datasets, leverage the power of deep learning for feature extraction, and enable the creation of more expressive models.
Wide and Deep (2016)
 A quick note that the first half of this section is taken from (ML Frontiers)
 While NCF revolutionized the domain of recommender systems, it lacks an important ingredient that turned out to be extremely important for the success of recommenders: cross features. The idea of cross features was first popularized in Google’s 2016 paper Wide & Deep Learning for Recommender Systems by Cheng et al.
Background: Cross Features
What are feature crosses and why are they important?

A cross feature is a second-order feature (i.e., a cross-product transformation) that’s created by “crossing” two of the original features, thus modeling the interactive effects between the two features. For example, in the Google Play Store, first-order features include the impressed app, or the list of user-installed apps. These two can be combined to create powerful cross features, such as AND(user_installed_app='netflix', impression_app='hulu'), which is 1 if the user has Netflix installed and the impressed app is Hulu.

Cross features can also be more coarse-grained, such as AND(user_installed_category='video', impression_category='video'), which is 1 if the user installed video apps before and the impressed app is a video app as well. The authors argue that adding cross features of different granularities enables both memorization (from more granular crosses) and generalization (from less granular crosses).
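The AND-style crosses described above can be sketched in a few lines of Python. The impression record and the cross_feature helper are hypothetical and exist only to illustrate the idea; a production system would compute such features in its feature pipeline rather than per example like this.

```python
# A hypothetical impression: the user's installed apps and the app being shown.
impression = {
    "user_installed_app": {"netflix", "spotify"},
    "impression_app": {"hulu"},
    "user_installed_category": {"video", "music"},
    "impression_category": {"video"},
}

def cross_feature(example, conditions):
    """AND(...) over binary indicators: 1 only if every condition holds, else 0."""
    return int(all(
        value in example.get(feature, set())
        for feature, value in conditions.items()
    ))

# Fine-grained cross: a specific installed app AND a specific impressed app.
fine_grained = cross_feature(
    impression, {"user_installed_app": "netflix", "impression_app": "hulu"}
)
# Coarse-grained cross: category-level versions of the same two features.
coarse_grained = cross_feature(
    impression, {"user_installed_category": "video", "impression_category": "video"}
)
# A cross that does not fire because the impressed app differs.
no_match = cross_feature(
    impression, {"user_installed_app": "netflix", "impression_app": "disney"}
)
```

Each cross becomes one extra binary input to a linear model, so the number of candidate crosses grows quickly with the number of feature values, which is why choosing them by hand is costly.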

As another example (source), imagine that we are building a recommender system to sell a blender to customers. Then, a customer’s past purchase history, such as purchased_bananas and purchased_cooking_books, or geographic features, are single features. If a customer has purchased both bananas and cooking books, then this customer will more likely click on the recommended blender. The combination of purchased_bananas and purchased_cooking_books is referred to as a feature cross, which provides additional interaction information beyond the individual features.
What are the challenges in learning feature crosses?
 In web-scale applications, data is mostly categorical, leading to a large and sparse feature space. Identifying effective feature crosses in this setting often requires manual feature engineering or exhaustive search.
 Traditional feedforward multilayer perceptron (MLP) models are universal function approximators; however, they cannot efficiently approximate even 2nd- or 3rd-order feature crosses (Wang et al. (2020), Beutel et al. (2018)).
Motivation
 Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. However, memorization and generalization are both important for recommender systems. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank.
The Wide and Deep architecture demonstrated the critical importance of cross features, that is, second-order features that are created by crossing two of the original features. It combines a wide and shallow module for cross features with a deep module much like NCF. It seeks to obtain the best of both worlds by combining the unique strengths of wide and deep models, i.e., memorization and generalization respectively, thus enabling better recommendations.
 Wide and Deep learning jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings.
Architecture
Wide part: The wide part of the model is a generalized linear model that takes into account cross-product feature transformations, in addition to the original features. The cross-product transformations capture interactions between categorical features. For example, if you were building a real estate recommendation system, you might include a cross-product transformation of city=San Francisco AND type=condo. These cross-product transformations can effectively capture specific, niche rules, offering the model the benefit of memorization.
Deep part: The deep part of the model is a feedforward neural network that takes all features as input, both categorical and continuous. However, categorical features are typically transformed into embeddings first, as neural networks work with numerical data. The deep part of the model excels at generalizing patterns from the data to unseen examples, offering the model the benefit of generalization.
 As a recap, a Generalized Linear Model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables to have error distribution models other than a normal distribution. GLMs are used to model relationships between a response variable and one or more predictor variables. Examples of GLMs include logistic regression (used for binary outcomes like pass/fail), Poisson regression (for count data), and linear regression (for continuous data with a normal distribution).
 As an example (source), say you’re trying to offer food/beverage recommendations based on an input query. People looking for specific items like “iced decaf latte with nonfat milk” really mean it. Just because it’s pretty close to “hot latte with whole milk” in the embedding space doesn’t mean it’s an acceptable alternative. Similarly, there are millions of these rules where the transitivity (a relation between three elements such that if it holds between the first and second and it also holds between the second and third it must necessarily hold between the first and third) of embeddings may actually do more harm than good. On the other hand, queries that are more exploratory like “seafood” or “italian food” may be open to more generalization and discovering a diverse set of related items.
Building upon the food recommendation example earlier, as shown in the graph below (source), sparse features like query="fried chicken" and item="chicken fried rice" are used in both the wide part (left) and the deep part (right) of the model.
 For the wide component utilizing a generalized linear model, cross-product transformations are carried out on the binary features. For example, AND(gender=female, language=en) is 1 if and only if the constituent features (gender=female and language=en) are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.
 For the deep component utilizing a feedforward neural network, each sparse, high-dimensional categorical feature is first converted into a low-dimensional, dense real-valued vector, often referred to as an embedding vector. The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly, and their values are then trained to minimize the final loss function during model training.
 During training, the prediction errors are backpropagated to both sides to train the model parameters, i.e., the two models function as one “cohesive” architecture and are trained jointly with the same loss function.
 The figure below from the paper shows how Wide and Deep models form a sweet middle compared to simply Wide and simply Deep models:
 Thus, the key architectural choice in Wide and Deep is to have both a wide module, which is a linear model that takes all cross features directly as inputs, and a deep module, which is essentially an NCF, and then combine both modules into a single output task head that learns from user/app engagements.
 By combining these two components, Wide and Deep models aim to achieve a balance between memorization and generalization, which can be particularly useful in recommendation systems, where both aspects can be important. The wide part can capture specific item combinations that a particular user might like (based on historical data), while the deep part can generalize from user behavior to recommend items that the user hasn’t interacted with yet but might find appealing based on their broader preferences. Put simply, Wide and Deep architectures combine a deep neural network component for capturing complex patterns and a wide component using a generalized linear model that models feature interactions explicitly. This allows the model to learn both deep representations and exploit feature interactions, providing a balance between memorization and generalization.
 The Wide and Deep model consists of two parts: a wide linear model over the raw sparse features and their cross-product transformations, and a deep neural network that consumes the continuous features together with learned embeddings of the categorical features. The architectural diagram below (source) showcases this structure.
 In the Wide & Deep Learning model, both the wide and deep components handle sparse features, but in different ways:
 Wide Component:
 The wide component is a generalized linear model that uses raw input features and transformed features.
 An important transformation in the wide component is the cross-product transformation. This is particularly useful for binary features, where a cross-product transformation like “AND(gender=female, language=en)” is 1 if and only if both constituent features are 1, and 0 otherwise.
 Such transformations capture the interactions between binary features and add nonlinearity to the generalized linear model.
 Deep Component:
 The deep component is a feedforward neural network.
 For handling categorical features, which are often sparse and high-dimensional, the deep component first converts these features into low-dimensional, dense real-valued vectors, commonly referred to as embedding vectors. The dimensionality of these embeddings usually ranges from 10 to 100.
 These dense embedding vectors are then fed into the hidden layers of the neural network. The embeddings are initialized randomly and trained to minimize the final loss function during model training.
 Combined Model:
 The wide and deep components are combined using a weighted sum of their output log odds, which is then fed to a common logistic loss function for joint training.
 In this combined model, the wide part focuses on memorization (exploiting explicit feature interactions), while the deep part focuses on generalization (learning implicit feature representations).
 The combined model ensures that both sparse and dense features are effectively utilized, with sparse features often transformed into dense representations for efficient processing in the deep neural network.
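The combination step described above (summing the wide and deep log odds and applying a shared logistic loss) can be sketched as a toy forward pass. All weights, feature values, and layer sizes below are made-up assumptions, and the helpers (dense, relu, rand_matrix) are illustrative rather than taken from any library; a real model would learn these parameters jointly via backpropagation.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    """Affine layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Wide input: raw binary features plus one cross-product feature.
gender_female, language_en = 1, 1
wide_x = [gender_female, language_en, gender_female * language_en]
w_wide = [0.2, -0.1, 0.5]  # hypothetical linear weights

# Deep input: concatenated embeddings (toy 2-d each) plus one dense feature.
user_emb, item_emb, age_scaled = [0.3, -0.2], [0.1, 0.4], 0.5
deep_x = user_emb + item_emb + [age_scaled]
W1, b1 = rand_matrix(4, len(deep_x)), [0.0] * 4
w_deep = [random.uniform(-0.1, 0.1) for _ in range(4)]
bias = 0.0

# Each side produces a logit (log odds); the two are summed before the sigmoid.
hidden = relu(dense(deep_x, W1, b1))
wide_logit = sum(w * x for w, x in zip(w_wide, wide_x))
deep_logit = sum(w * h for w, h in zip(w_deep, hidden))
prob = sigmoid(wide_logit + deep_logit + bias)

# A single logistic loss drives the gradients for both components.
label = 1
log_loss = -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
```

Because the loss is computed on the summed logit, the gradient with respect to that sum flows into both the wide weights and the deep network, which is what makes the training joint rather than an ensemble of two separately trained models.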
Music Example
 In a music recommendation app using the Wide & Deep Learning model, the input features for both the wide and deep components would be tailored to capture different aspects of user preferences and characteristics of the music items. Let’s consider what these inputs might look like:
Input to the Wide Component
The wide component would primarily use sparse, categorical features, possibly transformed to capture specific interactions:
 User Features: Demographics (age, gender, location), user ID, historical user behavior (e.g., genres listened to frequently, favorite artists).
 Music Item Features: Music genre, artist ID, album ID, release year.
 Cross-Product Transformations: Combinations of categorical features that are believed to interact in meaningful ways. For instance, “user’s favorite genre = pop” AND “music genre = pop”, or “user’s location = USA” AND “artist’s origin = USA”. These cross-products help capture interaction effects that are specifically relevant to music recommendations.
Input to the Deep Component
 The deep component would use both dense and sparse features, with sparse features transformed into dense embeddings:
 User Features (as Embeddings): Embeddings for user ID, embedding vectors for historical preferences (like a vector summarizing genres listened to), demographics if treated as categorical.
 Music Item Features (as Embeddings): Embeddings for music genre, artist ID, album ID. These embeddings capture the nuanced relationships in the music domain.
 Additional Dense Features: If available, numerical features like the number of times a song has been played, user’s average listening duration, or ratings given by the user.
 The embeddings created to serve as the input to the deep component are “learned embeddings” or “trainable embeddings,” as they are learned directly from the data during the training process of the model.
 Here’s a Python code snippet using TensorFlow to illustrate how a categorical feature (like user IDs) is embedded:
import tensorflow as tf
# Assuming we have 10,000 unique users and we want to embed them into a 64-dimensional space
num_unique_users = 10000
embedding_dimension = 64
# Create an input layer for user IDs (assuming user IDs are integers ranging from 0 to 9999)
user_id_input = tf.keras.Input(shape=(1,), dtype='int32')
# Create an embedding layer
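# Note: input_length is left out; newer Keras releases deprecate it,
# and the Input layer above already fixes the sequence length to 1.
user_embedding_layer = tf.keras.layers.Embedding(
    input_dim=num_unique_users,
    output_dim=embedding_dimension,
    name='user_embedding',
)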
# Apply the embedding layer to the user ID input
user_embedding = user_embedding_layer(user_id_input)
# Flatten the embedding output to feed into a dense layer
user_embedding_flattened = tf.keras.layers.Flatten()(user_embedding)
# Add a dense layer (more layers can be added as needed)
dense_layer = tf.keras.layers.Dense(128, activation='relu')(user_embedding_flattened)
# Create a model
model = tf.keras.Model(inputs=user_id_input, outputs=dense_layer)
# Compile the model
model.compile(optimizer='adam', loss='mse') # Adjust the loss based on your specific task
# Model summary
model.summary()
In this code:
 We first define the number of unique users (num_unique_users) and the dimensionality of the embedding space (embedding_dimension).
 An input layer is created to accept user IDs.
 An embedding layer (tf.keras.layers.Embedding) is added to transform each user ID into a 64-dimensional vector. This layer is trainable, meaning its weights (the embeddings) are learned during training.
 The embedding layer’s output is then flattened and passed through a dense layer for further processing.

The model is compiled with an optimizer and loss function, which should be chosen based on the specific task (e.g., classification, regression).
 This code example demonstrates how to create trainable embeddings for a categorical feature within a neural network using TensorFlow. These embeddings are specifically tailored to the data and task at hand, learning to represent each category (in this case, user IDs) in a way that is useful for the model’s predictive task.
Combining Inputs in Wide & Deep Model
 Joint Model: The wide and deep components are joined in a unified model. The wide component helps with memorization of explicit feature interactions (especially useful for categorical data), while the deep component contributes to generalization by learning implicit patterns and relationships in the data.
 Feature Transformation: Sparse features are more straightforwardly handled in the wide part through cross-product transformations, while in the deep part, they are typically converted into dense embeddings.

Model Training: Both parts are trained jointly, allowing the model to leverage the strengths of both memorization and generalization.
 In a music recommendation app, this combination allows the model to not only consider obvious interactions (like a user’s past preferences for certain genres or artists) but also to uncover more subtle patterns and relationships within the data, which might not be immediately apparent but are influential in determining a user’s music preferences.
Results
 The authors productionized and evaluated the system on Google Play Store, a massive-scale commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide and Deep significantly increased app acquisitions compared with wide-only and deep-only models.
 Compared to a deep-only model, Wide and Deep improved acquisitions in the Google Play store by 1%. Consider that Google makes tens of billions in revenue each year from its Play Store, and it’s easy to see how impactful Wide and Deep was.
Summary
 Architecture: The Wide and Deep model in recommendation systems incorporates cross features, particularly in the “wide” component of the model. The wide part is designed for memorization and uses linear models with cross-product feature transformations, effectively capturing interactions between categorical features. This is crucial for learning specific, rule-based information, which complements the “deep” part of the model that focuses on generalization through deep neural networks. By combining these approaches, Wide and Deep models effectively capture both simple, rule-based patterns and complex, nonlinear relationships within the data.
 Pros: Balances memorization (wide component) and generalization (deep component), capturing both complex patterns and explicit feature interactions.
 Cons: Increased model complexity and potential challenges in training and optimization.
 Advantages: Improved performance by leveraging both deep representations and explicit feature interactions.
 Example Use Case: Ecommerce platforms where a combination of user behavior and item features plays a crucial role in recommendations.
 Phase: Ranking.
 Recommendation Workflow: Given its complexity, the Wide and Deep architecture is suitable for the ranking phase. The wide component captures explicit feature interactions over the candidates produced by candidate generation. The deep component allows for learning complex patterns and interactions, improving the ranking of candidate items based on user-item preferences.
Deep Factorization Machine / DeepFM (2017)
 The first half of this section is taken from (ML Frontiers linked here)
 Similar to Google’s DCN, Huawei’s DeepFM, introduced in Guo et al. (2017), also replaces manual feature engineering in the wide component of Wide and Deep with a dedicated neural network that learns cross features. However, unlike DCN, the wide component is not a cross neural network, but instead a so-called factorization machine (FM) layer.
What does the FM layer do? It simply takes the dot products of all pairs of embeddings. For example, if a movie recommender takes 4 ID features as inputs, such as user id, movie id, actor ids, and director id, then the FM layer computes 6 dot products, corresponding to the combinations user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. The output of the FM layer is then concatenated with the output of the deep component and passed into a sigmoid layer which outputs the model’s predictions.
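The pairwise dot products in the movie example can be sketched as follows. The 2-dimensional embedding values are made up for illustration, and the sketch covers only the pairwise-interaction step, not the full DeepFM model (it omits the first-order terms and the deep component).

```python
from itertools import combinations

# Hypothetical 2-d embeddings, one per ID feature (real ones are learned).
embeddings = {
    "user": [0.1, 0.3],
    "movie": [0.2, -0.1],
    "actor": [0.0, 0.5],
    "director": [-0.4, 0.2],
}

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# One dot product per unordered pair of features: C(4, 2) = 6 interactions.
pairwise = {
    f"{a}-{b}": dot(embeddings[a], embeddings[b])
    for a, b in combinations(embeddings, 2)
}
```

With n features the layer produces n(n-1)/2 interactions, and every pair is treated uniformly, which is the brute-force behavior the text contrasts with attention-based approaches.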
 However, note that DeepFM (similar to DCN) learns this in a brute-force manner simply by considering all possible combinations uniformly (i.e., it calculates all pairwise interactions), while newer implementations such as AutoInt leverage self-attention to automatically determine the most informative feature interactions, i.e., which feature interactions to pay the most attention to (and which to ignore by setting the attention weights to zero).
 The figure below from the paper shows the wide and deep architecture of DeepFM. The wide and deep components share the same raw input feature vector, which enables DeepFM to learn low- and high-order feature interactions simultaneously from the raw input features. Note the circle marked “+” in the FM layer of the figure, in addition to the inner products. Think of this like a skip connection that passes the concatenation of the inputs directly into the output unit.
 The authors show that DeepFM beats a host of its competitors, including Google’s Wide and Deep, by more than 0.42% Logloss on company-internal data.
 DeepFM replaces the cross neural network in DCN with factorization machines, that is, dot products.
 DeepFM combines FM with deep neural networks. It utilizes FM to model pairwise feature interactions and a deep neural network to capture higher-order feature interactions. This architecture leverages both linear and nonlinear relationships between features.
Summary
 Pros: Combines the benefits of FM and deep neural networks, capturing both pairwise and higher-order feature interactions. In other words, it models both linear and nonlinear relationships between features, providing a comprehensive understanding of feature interactions.
 Cons:
 DeepFM creates feature crosses in a brute-force way, simply by considering all possible combinations. This is not only inefficient, it could also create feature crosses that aren’t helpful at all, and just make the model overfit.
 Increased model complexity and potential challenges in training and optimization.
 Example Use Case: Click-through rate prediction in online advertising or personalized recommendation systems.
 Phase: Candidate Generation, Ranking.
 Recommendation Workflow: DeepFM is commonly utilized in both the candidate generation and ranking phases. It combines the strengths of factorization machines and deep neural networks. In the candidate generation phase, DeepFM can capture pairwise feature interactions efficiently. In the ranking phase, it can leverage deep neural networks to model higher-order feature interactions and improve the ranking of candidate items.
Neural Collaborative Filtering / NCF (2017)
 The integration of deep learning into recommender systems witnessed a significant breakthrough with Neural Collaborative Filtering (NCF), introduced in He et al. (2017) from NUS Singapore, Columbia University, Shandong University, and Texas A&M University.
 This innovative approach marked a departure from the (then standard) matrix factorization method. Prior to NCF, the gold standard in recommender systems was matrix factorization, which relied on learning latent vectors (a.k.a. embeddings) for both users and items, and then generating recommendations for a user by taking the dot product between the user vector and the item vectors. The closer the dot product is to 1, the better the match. As such, matrix factorization can be simply viewed as a linear model of latent factors.
The key idea behind NCF is to substitute the inner product in matrix factorization with a neural network architecture that can learn an arbitrary nonlinear function from data. To supercharge the learning process of the user-item interaction function with nonlinearities, the authors concatenated user and item embeddings, and then fed them into a multilayer perceptron (MLP) with a single task head predicting user engagement, like clicks. Both MLP weights and embedding weights (which user/item IDs are mapped to) were learned through backpropagation of loss gradients during model training.
 The hypothesis underpinning NCF posits that user-item interactions are nonlinear, contrary to the linear assumption in matrix factorization.
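The contrast between the two scoring functions can be sketched with toy, randomly initialized parameters. The embedding size, the single hidden layer of 8 units, and the function names (mf_score, ncf_score) are assumptions for illustration only; the model in the paper is deeper and its parameters are learned from interaction data.

```python
import math
import random

random.seed(0)
dim = 4  # toy embedding size

def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

user_emb = rand_vec(dim)
item_emb = rand_vec(dim)

def mf_score(u, v):
    """Matrix factorization: a fixed inner product of the latent vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy MLP parameters (hypothetical sizes): 2*dim inputs -> 8 hidden units -> 1 output.
W1 = [rand_vec(2 * dim) for _ in range(8)]
b1 = [0.0] * 8
w2 = rand_vec(8)
b2 = 0.0

def ncf_score(u, v):
    """NCF-style scoring: concatenate the embeddings and pass them through an MLP."""
    x = u + v  # list concatenation of user and item embeddings
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)  # ReLU units
        for row, b in zip(W1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # predicted engagement probability

mf = mf_score(user_emb, item_emb)
ncf = ncf_score(user_emb, item_emb)
```

The inner product has no parameters of its own, so all modeling capacity sits in the embeddings; the MLP adds learnable weights to the interaction function itself.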
 The figure below from the paper illustrates the neural collaborative filtering framework.
 NCF proved the value of replacing (then standard) linear matrix factorization algorithms with a neural network. With a relatively simple 4-layer neural network, NCF proved that there’s immense value in applying deep neural networks in recommender systems, marking the pivotal transition away from matrix factorization and towards deep recommenders. The authors were able to beat the best matrix factorization algorithms at the time by 5% hit rate on the Movielens and Pinterest benchmark datasets. Empirical evidence showed that using deeper layers of neural networks offers better recommendation performance.
 Despite its revolutionary impact, NCF lacked an important ingredient that turned out to be extremely important for the success of recommenders: cross features, a concept popularized by the Wide & Deep paper described above.
Summary
 NCF proved the value of replacing (then standard) linear matrix factorization algorithms with a neural network.
 The NCF framework, which is both generic and capable of expressing and generalizing matrix factorization, utilized a multilayer perceptron to imbue the model with nonlinear capabilities.
 With a relatively simple 4-layer neural network, they were able to beat the best matrix factorization algorithms at the time by 5% hit rate on the MovieLens and Pinterest benchmark datasets.
Deep and Cross Networks / DCN (2017)
 Wide and Deep has proven the significance of cross features, however it has a huge downside: the cross features need to be manually engineered, which is a tedious process that requires engineering resources, infrastructure, and domain expertise. Cross features à la Wide and Deep are expensive. They don’t scale.
 The key idea of Deep and Cross Networks (DCN), introduced in Wang et al. (2017) by Google, is to replace the wide component in Wide and Deep with a “cross neural network”, a neural network dedicated to learning cross features of arbitrarily high order. However, note that DCN (similar to DeepFM) learns this in a brute-force manner simply by considering all possible combinations uniformly (i.e., it calculates all pairwise interactions), while newer implementations such as AutoInt leverage self-attention to automatically determine the most informative feature interactions, i.e., which feature interactions to pay the most attention to (and which to ignore by pushing the attention weights towards zero).
 Similar to Huawei’s DeepFM, introduced in Guo et al. (2017), DCN also replaces manual feature engineering in the wide component of Wide and Deep with a dedicated cross neural network that learns cross features. However, unlike DeepFM, the wide component is a cross neural network, instead of a so-called factorization machine layer.
 DCN was designed to learn explicit and bounded-degree cross features more effectively. It starts with an input layer (typically an embedding layer), followed by a cross network containing multiple cross layers that models explicit feature interactions, and then combines with a deep network that models implicit feature interactions.
 Cross Network: This is the core of DCN. It explicitly applies feature crossing at each layer, and the highest polynomial degree increases with layer depth. The following figure shows the \((i + 1)^{th}\) cross layer.
 Deep Network: It is a traditional feedforward multilayer perceptron (MLP).
 The deep network and cross network are then combined to form DCN (Wang et al. (2020)). As shown in the figure below, we could stack a deep network on top of the cross network (stacked structure); we could also place them in parallel (parallel structure).
 What makes a cross neural network different from a standard MLP? As a reminder, in a (fully-connected) MLP, each neuron in the next layer is a linear combination of all neurons in the previous layer, plus a bias term: \(x_{l+1} = b_{l+1} + W \cdot x_l\)

By contrast, in the cross neural network the next layer is constructed by forming second-order combinations of the input layer \(x_0\) with the current layer \(x_l\), plus a bias term and a residual connection: \(x_{l+1} = x_0 x_l^{T} w_l + b_l + x_l\)

The Cross Network helps in better generalizing on sparse features by learning explicit bounded-degree feature interactions. This is particularly useful for sparse data, where traditional deep learning models might struggle due to the high dimensionality and lack of explicit feature interaction modeling.
 At the input, sparse features are transformed into dense vectors through an embedding procedure while dense features are normalized. These processed features are then combined into a single vector \(x_0\), which includes the stacked embedding vectors for the sparse features and the normalized dense features. This combined vector is then fed into the network.
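 Hence, a cross neural network of depth \(L\) will learn cross features in the form of polynomials of degree up to \(L + 1\) in the input \(x_0\), since every cross layer multiplies by \(x_0\) once more (a single layer already yields terms that are quadratic in \(x_0\)). The deeper the cross network, the higher the order of the interactions it can learn.
 The unified deep and cross model architecture is trained jointly with mean squared error (MSE) as its loss function.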

For model evaluation, the Root Mean Squared Error (RMSE, the lower the better) is reported per TensorFlow: Deep & Cross Network (DCN).
 The Deep and Cross Network (DCN) introduces a novel approach to handling feature interactions and dealing with sparse features. Let’s break down how DCN accomplishes these tasks:
DCN and Variants
 The following section is referenced from ML Frontiers Substack.
 DCN was one of the first algorithms to replace the manual engineering of cross features in Wide & Deep-like models with an algorithm that exhaustively computes all possible crosses. The cross layers in DCN have two free parameters, the weight vector w and the bias vector b.
 DCN-V2 replaced DCN’s crossing vector w with a crossing matrix W, which makes the cross layers more expressive and allows us to get further performance gains, in particular when stacking more layers. While in DCN we saw performance plateau after 2 layers, DCN-V2 allows us to stack 4 layers or even more and still see performance improvements.
 DCN-Mix replaces the crossing matrix W with a more expressive mixture of low-rank experts which are combined using a gating network, beating DCN-V2 on the MovieLens benchmark dataset.
 GDCN adds an information gate on top of each cross layer in DCN-V2 which controls how much weight the model should assign to each feature interaction, preventing the model from overfitting to noisy feature crosses. At the time of the referenced article, GDCN was the champion on the Criteo benchmark.
Forming Higher-Order Feature Interactions

Mechanism of the Cross Network: In a standard Multi-Layer Perceptron (MLP), each neuron in a layer is a linear combination of all neurons from the previous layer. The formula for this is typically \(x_{l+1} = b_{l+1} + W \cdot x_l\), where \(x_l\) is the input from the previous layer, \(W\) is the weight matrix, and \(b_{l+1}\) is the bias. However, in the Cross Network of DCN, the idea is to explicitly form higher-order interactions of features.

Second-Order Combinations: In the Cross Network, the next layer is created by crossing the original input \(x_0\) with the previous layer’s output \(x_l\). The formula used is \(x_{l+1} = x_0 x_l^{T} w_l + b_l + x_l\). This approach allows the network to automatically learn feature interactions (cross features) that are higher than first-order, which a standard MLP can only approximate implicitly unless the cross features are engineered by hand.
Handling Sparse Features through Embedding

Sparse to Dense Transformation: Neural networks generally work better with dense input data. However, in many real-world applications, features are often sparse (like categorical data). DCN addresses this challenge by transforming sparse features into dense vectors through an embedding process.

Embedding Process: This embedding is a technique where sparse, high-dimensional data (like one-hot encoded vectors) are converted into a lower-dimensional, continuous, and dense vector. Each unique category in the sparse feature is mapped to a dense vector, and these vectors are learned during the training process. This transformation is crucial because it enables the network to work with a dense representation of the data, which is more efficient and effective for learning complex patterns.
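
The embedding step can be illustrated with a small NumPy sketch. The genre vocabulary, the embedding size of 3, and the random table below are assumptions for the example; in a real model the table rows are trained parameters. The sketch shows that multiplying a one-hot vector by the embedding table is the same as looking up one row, which is why embedding layers are implemented as lookups.

```python
import numpy as np

rng = np.random.default_rng(0)

genres = ["Action", "Comedy", "Drama", "Horror"]   # hypothetical vocabulary
vocab = {g: i for i, g in enumerate(genres)}
emb_dim = 3
table = rng.normal(size=(len(genres), emb_dim))    # random stand-in for learned weights

def one_hot(index, size):
    """Sparse, high-dimensional representation of a single category."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

idx = vocab["Drama"]
via_matmul = one_hot(idx, len(genres)) @ table     # one-hot times table
via_lookup = table[idx]                            # equivalent row lookup

print(via_lookup)
```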
Explicit Feature Crossing and Polynomial Degree

Explicit Feature Crossing: The Cross Network in DCN explicitly handles feature crossing at each layer. By doing this, it models interactions between different features directly, rather than relying on the deep network to implicitly capture these interactions.

Increasing Polynomial Degree with Depth: As the depth of the Cross Network increases, the polynomial degree of the feature interactions also increases. This means that in deeper layers of the Cross Network, the model can capture more complex interactions (higher-order feature combinations). The network is essentially learning polynomials of features, where the degree of the polynomial increases with the depth of the network.

Bounded-Degree Cross Features: The design of the Cross Network ensures that the degree of these polynomials is bounded and controlled by the depth of the network. This control is crucial to avoid an explosion in the complexity of the model, which could lead to overfitting and computational inefficiency.

DCN’s Cross Network forms higher-order feature interactions by explicitly crossing features at each layer, increasing the polynomial degree with the depth of the network. At the same time, it addresses the challenge of sparse features by embedding them into dense vectors, making them suitable for processing by the neural network. This design allows DCN to automatically and efficiently learn complex feature interactions without the need for manual feature engineering.
 Integrating Outputs: The outputs from the Cross Network and the Deep Network are concatenated.
 Final Prediction: The concatenated vector is then fed into a logits layer for the final prediction, such as in a classification task. This layer effectively combines the strengths of both explicit feature interactions and deep learned representations.
Input and Output to Each Component
 Input to Cross and Deep Networks: Both networks take the same input vector, which is a combination of dense embeddings (from sparse features) and normalized dense features.

Output: The outputs of both networks are combined in the Combination Layer for the final model output.
 Based on the paper, the architecture and composition of each layer in the Cross and Deep Networks of the Deep & Cross Network (DCN) are as follows:
Cross Network Layers
Each layer in the Cross Network is defined by the following formula: \(x_{l+1} = x_0 x^{T}_l w_l + b_l + x_l\)
 Inputs and Outputs: \(x_l\) and \(x_{l+1}\) are the outputs from the \(l\)-th and \((l + 1)\)-th cross layers respectively, represented as column vectors.
 Weight and Bias Parameters: Each layer has its own weight (\(w_l\)) and bias (\(b_l\)) parameters, which are learned during training.
 Feature Crossing Function: The feature crossing function is represented by \(f(x_l, w_l, b_l)\), and it is designed to fit the residual \(x_{l+1} - x_l\). This function captures interactions between the features.
 Residual Connection: Each layer adds back its input after the feature crossing, which helps in preserving the information and building upon the previous layer’s output.
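 The cross layer formula can be sketched directly in NumPy. The function names and the toy vectors are assumptions for illustration. Because \(x_l^{T} w_l\) is a scalar, the layer never needs to materialize the \(d \times d\) outer product \(x_0 x_l^{T}\); the last line checks that the cheap form matches the explicit one.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl, all of shape (d,)."""
    # xl @ w is a scalar, so x0 x_l^T w_l is just x0 scaled by that scalar.
    return x0 * (xl @ w) + b + xl

def cross_network(x0, params):
    """Stack cross layers; every layer crosses with the original input x0."""
    xl = x0
    for w, b in params:
        xl = cross_layer(x0, xl, w, b)
    return xl

x0 = np.array([1.0, 2.0])
params = [
    (np.array([1.0, 0.0]), np.array([0.0, 0.0])),  # layer 0: (w_0, b_0)
    (np.array([0.0, 1.0]), np.array([1.0, 1.0])),  # layer 1: (w_1, b_1)
]

x1 = cross_layer(x0, x0, *params[0])
x2 = cross_network(x0, params)

# Same first layer computed with the explicit outer product, for comparison.
explicit_x1 = np.outer(x0, x0) @ params[0][0] + params[0][1] + x0
print(x1, x2)
```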
Deep Network Layers
Each layer in the Deep Network is structured as a standard fully-connected layer and is defined by the following formula: \(h_{l+1} = f(W_l h_l + b_l)\)
 Inputs and Outputs: \(h_l\) and \(h_{l+1}\) are the \(l\)-th and \((l + 1)\)-th hidden layers’ outputs respectively.
 Weight and Bias Parameters: Similar to the cross layer, each deep layer has its own weight matrix (\(W_l\)) and bias vector (\(b_l\)).
 Activation Function: The function \(f(\cdot)\) is typically a nonlinear activation function, such as ReLU (Rectified Linear Unit), which introduces nonlinearity into the model, allowing it to learn complex patterns in the data.
Summary
 Cross Network Layers: These layers are specifically designed for efficient feature crossing, capturing interactions between different input features at each layer. They employ a unique operation combining linear transformation, feature interaction, and residual connections.
 Deep Network Layers: These are standard fully-connected layers that use weights, biases, and nonlinear activation functions to learn abstract representations and complex patterns in the data.
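 Putting the two kinds of layers together, the parallel structure described above (both branches read the same input, their outputs are concatenated and fed to a logits layer) might look like the following sketch. The dimensions, the random weights, and the name `dcn_parallel_forward` are assumptions for the example, not values from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dcn_parallel_forward(x0, cross_params, deep_params, w_logit, b_logit):
    """Parallel DCN: cross branch and deep branch share x0, outputs are concatenated."""
    # Cross branch: x_{l+1} = x0 * (x_l . w_l) + b_l + x_l
    xl = x0
    for w, b in cross_params:
        xl = x0 * (xl @ w) + b + xl
    # Deep branch: h_{l+1} = ReLU(W_l h_l + b_l)
    h = x0
    for W, b in deep_params:
        h = relu(W @ h + b)
    combined = np.concatenate([xl, h])            # combination layer
    prob = sigmoid(w_logit @ combined + b_logit)  # logits layer + sigmoid
    return prob, combined

rng = np.random.default_rng(0)
d, hidden = 6, 4                                  # hypothetical sizes
x0 = rng.normal(size=d)                           # stacked embeddings + dense features
cross_params = [(rng.normal(size=d) * 0.1, np.zeros(d)) for _ in range(2)]
deep_params = [
    (rng.normal(size=(hidden, d)) * 0.1, np.zeros(hidden)),
    (rng.normal(size=(hidden, hidden)) * 0.1, np.zeros(hidden)),
]
w_logit = rng.normal(size=d + hidden) * 0.1

prob, combined = dcn_parallel_forward(x0, cross_params, deep_params, w_logit, 0.0)
print(prob)
```

 The cross branch keeps the input dimension \(d\), so the concatenated vector has \(d\) plus the width of the last deep layer entries.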
Results
 Compared to a model with just the deep component, DCN has a 0.1% statistically significant lower logloss on the Criteo display ads benchmark dataset. And that’s without any manual feature engineering, unlike in Wide and Deep! (It would have been nice to see a comparison between DCN and Wide and Deep. Alas, the authors of DCN didn’t have a good method to manually create cross features for the Criteo dataset, and hence skipped this comparison.)
 The Deep and Cross Network (DCN) architecture includes a cross network component that captures cross-feature interactions. It combines a deep network with cross layers, allowing the model to learn explicit feature interactions and capture nonlinear relationships between features.
Summary
 DCN showed that we can get even more performance gains by replacing manual engineering of cross features with an algorithmic approach that automatically creates all possible feature crosses up to an order set by the number of cross layers. Compared to a model with just the deep component, DCN achieved 0.1% lower logloss on the Criteo display ads benchmark dataset.
 Pros: Captures explicit high-order feature interactions and nonlinear relationships through cross layers, allowing for improved modeling of complex patterns.
 Cons:
 DCN creates feature crosses in a brute-force way, simply by considering all possible combinations. This is not only inefficient, it could also create feature crosses that aren’t helpful at all, and just make the model overfit.
 More complex than simple feedforward networks.
 May not perform well on tasks where feature interactions aren’t important.
 Increased model complexity, potential overfitting on sparse data.
 Use case: Useful for tasks where high-order feature interactions are critical, such as CTR prediction and ranking tasks.
 Example Use Case: Advertising platforms where understanding the interactions between user characteristics and ad features is essential for personalized ad targeting.
 Phase: Ranking, Final Ranking.
 Recommendation Workflow: The deep and cross architecture is typically applied in the ranking phase and the final ranking phase. The deep and cross network captures explicit feature interactions and nonlinear relationships, enabling accurate ranking of candidate items based on user preferences. It contributes to the final ranking of candidate items, leveraging its ability to model complex patterns and interactions.
Music use case
 Using the Deep & Cross Network (DCN) for a music recommender system involves several steps, from processing the input data to obtaining output recommendations. Here’s a step-by-step approach on how you would use DCN in this context:
 Data Preparation
 Collect Data: Gather user data and music metadata. User data might include user IDs, past listening history, ratings, and demographic information. Music metadata could include song IDs, genres, artists, albums, and release years.
 Feature Engineering:
 Sparse Features: Categorical data like user IDs, song IDs, artist names, genres, etc., are considered sparse features. These will be transformed into dense vectors using embedding layers.
 Dense Features: Numerical data like age, listening duration, and rating scores are dense features.
 Building the Model
 Embedding Layer for Sparse Features:
 Use embedding layers to transform sparse categorical features into dense vectors. For instance, map each user ID and song ID to a fixed-size embedding vector.
 Deep Component:
 Construct a series of dense layers. These layers will process both the dense features and the output of the embedding layers (the transformed sparse features).
 Apply nonlinear activation functions (like ReLU) in these layers to capture complex patterns in the data.
 Cross Component:
 Build the cross layers to model feature interactions. Each layer in the Cross Network explicitly captures interactions between the features.
 The initial input to the Cross Network is the concatenated embeddings and normalized dense features.
 Combining Deep and Cross Components:
 Merge the outputs of the deep and cross components. This combination enables the model to leverage both deep feature transformations and explicit feature interactions.
 Training the Model
 Compile the Model: Choose an appropriate loss function (like binary cross-entropy for predicting whether a user will like a song) and an optimizer (like Adam).
 Input Data: Feed the processed input data (both embeddings of sparse features and dense features) into the model.
 Train the Model: Use user-song interactions as training data. For instance, if a user has listened to or rated a song, these interactions are used as positive samples.
 Generating Recommendations
 Model Prediction: Use the trained model to predict the likelihood of a user liking a particular song or set of songs.
 Post-Processing: Sort the songs for each user based on the predicted likelihoods and recommend the top songs.
 Model Evaluation
 Evaluate the model using metrics like accuracy, precision, recall, or more sophisticated ones like Mean Average Precision at K (MAP@K).
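 The ranking and evaluation steps can be sketched in plain Python. The song IDs and scores are made up, and the helper names are hypothetical. Note that several conventions exist for normalizing average precision at K; the sketch divides by min(K, number of relevant items), which is one common choice, and it assumes each recommendation list contains no duplicates.

```python
def top_k(scores, k):
    """Return the k item ids with the highest predicted score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def average_precision_at_k(recommended, relevant, k):
    """AP@K: mean of precision@rank over the ranks where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(k, len(relevant))

def map_at_k(all_recommended, all_relevant, k):
    """MAP@K: AP@K averaged over users."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Hypothetical predicted likelihoods for one user.
scores = {"song_a": 0.9, "song_b": 0.2, "song_c": 0.7, "song_d": 0.4}
recs = top_k(scores, 3)
ap = average_precision_at_k(recs, {"song_a", "song_d"}, 3)
print(recs, ap)
```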
 Example Use Case
 Personalized Song Recommendations: For a given user, the model predicts which songs they would likely enjoy, based on their past interactions and the learned feature interactions.
 Discovering New Music: The model can help users discover new songs or artists that they might not have found on their own, but are likely to enjoy based on their profile and listening history.
 DCN’s ability to handle both sparse and dense features, along with its unique architecture that captures deep feature representations and explicit feature interactions, makes it well-suited for complex tasks like music recommendation. This model can effectively leverage the rich and varied data in music recommender systems to provide personalized and accurate recommendations.
AutoInt (2019)
 Proposed in AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks by Song et al. from Peking University, Mila-Quebec AI Institute, and HEC Montreal in CIKM 2019.
 The paper introduces AutoInt (short for “automated feature interaction learning”), a novel method for efficiently learning high-order feature interactions in an automated way. Developed to address the inefficiencies and overfitting problems in existing models like DCN and DeepFM, which create feature crosses in a brute-force manner, AutoInt leverages self-attention to determine the most informative feature interactions.
 AutoInt employs a multi-head self-attentive neural network with residual connections, designed to explicitly model feature interactions in a 16-dimensional embedding space. It overcomes the limitations of prior models by focusing on relevant feature combinations, avoiding unnecessary and unhelpful feature crosses.
 Processing Steps:
 Input Layer: Represents user profiles and item attributes as sparse vectors.
 Embedding Layer: Projects each feature into a 16-dimensional space.
 Interacting Layer: Utilizes several multi-head self-attention layers to automatically identify the most informative feature interactions. The attention mechanism is based on dot product for its effectiveness in capturing feature interactions.
 Output Layer: Uses the learned feature interactions for CTR estimation.
 The goal of AutoInt is to map the original sparse and high-dimensional feature vector into low-dimensional spaces and meanwhile model the high-order feature interactions. As shown in the below figure, AutoInt takes the sparse feature vector \(x\) as input, followed by an embedding layer that projects all features (i.e., both categorical and numerical features) into the same low-dimensional space. Next, embeddings of all fields are fed into a novel interacting layer, which is implemented as a multi-head self-attentive neural network. For each interacting layer, high-order features are combined through the attention mechanism, and different kinds of combinations can be evaluated with the multi-head mechanisms, which map the features into different subspaces. By stacking multiple interacting layers, different orders of combinatorial features can be modeled. The output of the final interacting layer is the low-dimensional representation of the input feature, which models the high-order combinatorial features and is further used for estimating the click-through rate through a sigmoid function. The figure below from the paper shows an overview of AutoInt.
 The figure below from the paper illustrates the input and embedding layer, where both categorical and numerical fields are represented by low-dimensional dense vectors.
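 A single interacting layer can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the authors' code: the number of fields (4), the per-head size (8), the random weights, and the name `interacting_layer` are assumptions, while the 16-dimensional embeddings and the two heads follow the text. The attention scores use a plain dot product between queries and keys, and a projected residual connection is added before the ReLU.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def interacting_layer(E, Wq, Wk, Wv, W_res):
    """E: (num_fields, d). Wq/Wk/Wv: (heads, d, d_head). W_res: (d, heads * d_head)."""
    head_outputs, attn_maps = [], []
    for q_w, k_w, v_w in zip(Wq, Wk, Wv):
        Q, K, V = E @ q_w, E @ k_w, E @ v_w      # each (num_fields, d_head)
        attn = softmax(Q @ K.T, axis=-1)         # field-to-field attention weights
        head_outputs.append(attn @ V)            # weighted combination of other fields
        attn_maps.append(attn)
    multi_head = np.concatenate(head_outputs, axis=-1)
    out = np.maximum(0.0, multi_head + E @ W_res)  # residual connection + ReLU
    return out, np.stack(attn_maps)

rng = np.random.default_rng(0)
num_fields, d, heads, d_head = 4, 16, 2, 8
E = rng.normal(size=(num_fields, d))             # one embedding per feature field
Wq = rng.normal(size=(heads, d, d_head)) * 0.1
Wk = rng.normal(size=(heads, d, d_head)) * 0.1
Wv = rng.normal(size=(heads, d, d_head)) * 0.1
W_res = rng.normal(size=(d, heads * d_head)) * 0.1

out, attn = interacting_layer(E, Wq, Wk, Wv, W_res)
print(out.shape, attn.shape)
```

 Each row of `attn[h]` shows how strongly one field attends to every other field in head `h`, which is the kind of matrix used to inspect which feature crosses the model relies on. Stacking several such layers yields higher-order combinations.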
 AutoInt demonstrates superior performance over competitors like Wide and Deep and DeepFM on benchmark datasets like MovieLens and Criteo, thanks to its efficient handling of feature interactions.
 The technical innovations in AutoInt consist of: (i) introduction of multi-head self-attention to learn which cross features really matter, replacing the brute-force generation of all possible feature crosses, and (ii) the model’s ability to learn important feature crosses such as Genre-Gender, Genre-Age, and RequestTime-ReleaseTime, which are crucial for accurate CTR prediction.
 AutoInt showcases efficiency in processing large-scale, sparse, high-dimensional data, with a stack of 3 attention layers, each having 2 heads. The attention mechanism improves model explainability by highlighting relevant feature interactions, as exemplified in the attention matrix learned on the MovieLens dataset.
 AutoInt addresses the need for a model that is both powerful in capturing complex interactions and interpretable in its recommendations, without the inefficiency and overfitting issues seen in models that generate feature crosses in a brute-force manner.
Summary
 This section is adapted from ML Frontiers.
 The key idea in DCN and DeepFM was to create feature crosses in a brute-force way, simply by considering all possible combinations. This is not only inefficient, it could also create feature crosses that aren’t helpful at all, and just make the model overfit.
 What we need, then, is a way to determine automatically which feature interactions to pay attention to and which to ignore. We need, you’ve guessed it, self-attention!
AutoInt introduces the idea of multi-head self-attention in the context of recommender systems: instead of simply generating all possible pairwise cross features in a brute-force way, we use the attention mechanism to learn which cross features really matter.
 That was the key insight behind AutoInt, short for “automated feature interaction learning”, proposed by Song et al. (2019) from Peking University, China. In particular, the authors first project each individual feature into a 16-dimensional embedding space and then pass these embeddings into a stack of several multi-head self-attention layers that automatically create the most informative feature interactions. The inputs going into the key, query, and value matrices are simply the list of all feature embeddings, and the attention function is simply the dot product, “due to its simplicity and effectiveness” in capturing feature interactions.
 This sounds complicated, but really there’s no magic here – just a bunch of matrix multiplications. As a concrete example, here’s the attention matrix that one of the attention heads in AutoInt learns on the MovieLens benchmark dataset:
 The model learns that the feature crosses formed by Genre-Gender, Genre-Age, and RequestTime-ReleaseTime are important, which are all marked in green. This makes sense: men usually prefer different movies than women, and kids prefer different movies than adults. What about the RequestTime-ReleaseTime cross feature? It simply encodes movie freshness, at the time of the training example.
 Using a stack of three attention layers with two heads each, the authors of AutoInt were able to beat a host of competitors, including Wide and Deep and DeepFM, on the MovieLens and Criteo benchmark datasets.
DLRM (2019)
 Let’s fastforward by a year to Meta’s DLRM (“deep learning for recommender systems”) architecture, proposed in Naumov et al. (2019), another important milestone in recommender system modeling.
 This paper by Naumov et al. from Facebook in 2019 introduces the DLRM (deep learning for recommender systems) architecture, a significant development in recommender system modeling, which was open-sourced in both PyTorch and Caffe2 frameworks.
 Contrary to the “deep learning” part in its name, DLRM represents a progression from the DeepFM architecture, maintaining the FM (factorization machine) component while discarding the deep neural network part. The fundamental hypothesis of DLRM is that interactions are paramount in recommender systems, which can be modeled using shallow MLPs (and complex deep learning components are thus not essential).
 The DLRM model handles continuous (dense) and categorical (sparse) features that describe users and products. DLRM exercises a wide range of hardware and system components, such as memory capacity and bandwidth, as well as communication and compute resources as shown in the figure below from the paper.
 The figure below from the paper shows the overall structure of DLRM.
 DLRM uniquely handles both continuous (dense) and categorical (sparse) features that describe users and products, projecting them into a shared embedding space. These features are then passed through MLPs before and after computing pairwise feature interactions (dot products). This method significantly differs from other neural network-based recommendation models in its explicit computation of feature interactions and treatment of each embedded feature vector as a single unit, contrasting with approaches like Deep and Cross which consider each element in the feature vector separately.
DLRM shows that interactions are all you need: it’s akin to using just the FM component of DeepFM but with MLPs added before and after the interactions to increase modeling capacity.
 The architecture of DLRM includes multiple MLPs, which are added to increase the model’s capacity and expressiveness, enabling it to model more complex interactions. This aspect is critical as it allows for fitting data with higher precision, given adequate parameters and depth in the MLPs.
 Compared to other DL-based approaches to recommendation, DLRM differs in two ways. First, it computes the feature interactions explicitly while limiting the order of interaction to pairwise interactions. Second, DLRM treats each embedded feature vector (corresponding to categorical features) as a single unit, whereas other methods (such as Deep and Cross) treat each element in the feature vector as a new unit that should yield different cross terms. These design choices help reduce computational/memory cost while maintaining competitive accuracy.
 A key contribution of DLRM is its specialized parallelization scheme, which utilizes model parallelism on the embedding tables to manage memory constraints and exploits data parallelism in the fully-connected layers for computational scalability. This approach is particularly effective for systems with diverse hardware and system components, like memory capacity and bandwidth, as well as communication and compute resources.
 The paper demonstrates that DLRM surpasses the performance of the DCN model on the Criteo dataset, validating the authors’ hypothesis about the predominance of feature interactions. Moreover, DLRM has been characterized for its performance on the Big Basin AI platform, proving its utility as a benchmark for future algorithmic experimentation, system co-design, and benchmarking in the field of deep learning-based recommendation models.
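 The interaction step described above (a bottom MLP for dense features, one embedding per sparse feature, all pairwise dot products between whole vectors, then a top MLP) can be sketched in NumPy. The table sizes, layer widths, the separate output weight `w_out`, and the function names are assumptions for this example; the open-source DLRM code is organized differently.

```python
import numpy as np

def mlp(x, layers):
    """Apply a stack of fully-connected ReLU layers."""
    for W, b in layers:
        x = np.maximum(0.0, W @ x + b)
    return x

def dlrm_forward(dense_x, sparse_ids, tables, bottom_mlp, top_mlp, w_out):
    dense_vec = mlp(dense_x, bottom_mlp)                  # project dense features to size d
    vectors = [dense_vec] + [table[i] for table, i in zip(tables, sparse_ids)]
    Z = np.stack(vectors)                                 # (n_vectors, d)
    dots = Z @ Z.T                                        # all pairwise dot products
    rows, cols = np.triu_indices(len(vectors), k=1)       # keep each pair once
    interactions = dots[rows, cols]
    top_in = np.concatenate([dense_vec, interactions])
    logit = w_out @ mlp(top_in, top_mlp)
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, interactions

rng = np.random.default_rng(0)
d = 4                                                     # shared embedding size
dense_x = rng.normal(size=3)                              # 3 hypothetical dense features
tables = [rng.normal(size=(10, d)), rng.normal(size=(20, d)), rng.normal(size=(5, d))]
sparse_ids = [7, 13, 2]
bottom_mlp = [(rng.normal(size=(d, 3)) * 0.1, np.zeros(d))]
n_vectors = 1 + len(tables)
n_pairs = n_vectors * (n_vectors - 1) // 2
top_mlp = [(rng.normal(size=(8, d + n_pairs)) * 0.1, np.zeros(8))]
w_out = rng.normal(size=8) * 0.1

prob, interactions = dlrm_forward(dense_x, sparse_ids, tables, bottom_mlp, top_mlp, w_out)
print(prob, interactions.shape)
```

 With 4 vectors there are only 6 interaction terms, one per pair of feature vectors, regardless of the embedding size. This is what treating each embedded vector as a single unit means in practice.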
 Facebook AI post.
Summary
 The key idea behind DLRM is to take the approach from DeepFM but only keep the FM part, not the Deep part, and expand on top of that. The underlying hypothesis is that the interactions of features are really all that matter in recommender systems. “Interactions are all you need!”, you may say.
 The deep component is not really needed. DLRM uses a bunch of MLPs to model feature interactions. Under the hood, DLRM projects all sparse and dense features into the same embedding space, passes them through MLPs (blue triangles in the above figure), computes all pairs of feature interactions (the cloud), and finally passes this interaction signal through another MLP (the top blue triangle). The interactions here are simply dot products, just like in DeepFM.
 The key difference to the DeepFM’s “FM” though is the addition of all these MLPs, the blue triangles. Why do we need those? Because they’re adding modeling capacity and expressiveness, allowing us to model more complex interactions. After all, one of the most important rules in neural networks is that given enough parameters, MLPs with sufficient depth and width can fit data to arbitrary precision!
 In the paper, the authors show that DLRM beats DCN on the Criteo dataset. The authors’ hypothesis proved to be true. Interactions, it seems, may really be all you need.
DCN V2 (2020)
 Proposed in DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems by Wang et al. from Google, DCN-V2 is an enhanced version of the Deep & Cross Network (DCN), designed to effectively learn feature interactions in large-scale learning to rank (LTR) systems.
 The paper addresses DCN’s limited expressiveness in learning predictive feature interactions, especially in web-scale systems with extensive training data.
 DCN-V2 is focused on the efficient and effective learning of predictive feature interactions, a crucial aspect of applications like search recommendation systems and computational advertising. It tackles the inefficiency of traditional methods, including manual identification of feature crosses and reliance on deep neural networks (DNNs) for higher-order feature crosses.
 The embedding layer in DCN-V2 processes both categorical (sparse) and dense features, supporting various embedding sizes, essential for industrial-scale applications with diverse vocabulary sizes.
 The core of DCN-V2 is its cross layers, which explicitly create feature crosses. These layers are based on a base layer with original features, utilizing learned weight matrices and bias vectors for each cross layer.
 The figure below from the paper visualizes a cross layer.
 As shown in the figure below, DCN-V2 employs a novel architecture that combines a cross network with a deep network. This combination is realized through two architectures: a stacked structure where the cross network output feeds into the deep network, and a parallel structure where outputs from both networks are concatenated. The cross operation in these layers is represented as \(\mathrm{x}_{l+1}=\mathrm{x}_0 \odot\left(W_l \mathrm{x}_l+\mathrm{b}_l\right)+\mathrm{x}_l\).
A key feature of DCN-V2 is the use of low-rank techniques to approximate feature crosses in a subspace, improving performance and reducing latency. This is further enhanced by a Mixture-of-Experts architecture, which decomposes the matrix into multiple smaller subspaces aggregated through a gating mechanism.
 DCN-V2 demonstrates superior performance in extensive studies and comparisons with state-of-the-art algorithms on benchmark datasets like Criteo and MovieLens-1M. It offers significant gains in offline accuracy and online business metrics in Google’s web-scale LTR systems.
 The paper also delves into polynomial approximation from both bit-wise and feature-wise perspectives, illustrating how DCN-V2 creates feature interactions up to a certain order with a given number of cross layers, thus being more expressive than the original DCN.
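 The cross operation above translates almost literally into NumPy, with \(\odot\) as element-wise multiplication. The function name and the toy values are assumptions for illustration.

```python
import numpy as np

def cross_layer_v2(x0, xl, W, b):
    """DCN V2 cross layer: x_{l+1} = x0 ⊙ (W xl + b) + xl."""
    return x0 * (W @ xl + b) + xl

x0 = np.array([1.0, 2.0])
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # toy weights that swap the two coordinates
b = np.zeros(2)

x1 = cross_layer_v2(x0, x0, W, b)   # element i becomes x0[i] * (W[i] . x0) + x0[i]
print(x1)

d = x0.shape[0]
params_v2 = d * d + d               # full matrix W plus bias
params_v1 = 2 * d                   # original DCN: weight vector plus bias
```

 With this choice of `W`, both outputs contain the product of the two input features, which shows that each output element gets its own row of crossing weights instead of sharing one scalar projection.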
Architecture changes
 In DCN V2, several specific architectural changes were made to enhance its performance and efficiency, particularly in the cross layers. Here are the detailed aspects of how these changes enable the model to capture a wider range of interactions:
 Mixture of Low-Rank Cross Layers:
 DCN V2 introduces a mixture of low-rank cross layers. This means that instead of using full-rank matrices (which can be computationally expensive and might overfit), the model employs low-rank matrices in the cross layers.
 Low-Rank Approximation: This involves representing the weight matrices in the cross layers using a factorization approach, where a weight matrix is approximated as the product of two smaller matrices. This reduces the number of parameters and computational complexity.
 Effect on Feature Interactions: By using low-rank matrices, the model efficiently captures the essential interactions without the overhead of full-rank operations. This approach strikes a balance between model expressiveness and computational efficiency, particularly beneficial for large-scale applications.
 Enhanced Expressiveness in Cross Network:
 Modifying Cross Layer Operations: Where the original DCN cross layer used a weight vector, the DCN V2 cross layer uses a weight matrix \(W_l\), which lets each layer model richer explicit cross terms between individual features.
 Capturing Higher-Order Interactions: Stacking these cross layers enables the model to capture higher-order interactions more effectively. This is crucial for dealing with complex and high-dimensional data where simple pairwise interactions are not sufficient.
 Mixture of Low-Rank Cross Layers:
 Background: A full-rank DCN V2 cross layer uses a \(d \times d\) weight matrix \(W\) for the feature crossing operation (the original DCN used only a weight vector). While expressive, a full matrix can be computationally intensive and less efficient for large-scale data.
 Low-Rank Approach in DCN V2: DCN V2 introduces low-rank matrices in the cross layers. A low-rank matrix can be represented as the product of two smaller matrices (\(U\) and \(V\)), such that the original weight matrix \(W\) is approximated by \(U V^T\).
 Implication: This means that the feature crossing operation, which originally involved the full matrix \(W\), now utilizes this low-rank approximation. The operation becomes more efficient in terms of computation while still maintaining the ability to capture essential feature interactions.
 Capturing Higher-Order Interactions:
 Original Operation: Traditionally, a cross layer would perform a feature crossing by computing the outer product of the input vector with itself and then applying a linear transformation using the weight matrix. This process captures second-order interactions.
 Enhancement in DCN V2: With low-rank matrices, the model can still effectively capture these interactions but in a more computationally efficient manner. The low-rank approximation allows the model to handle more complex interactions without exponentially increasing the computational complexity. This is crucial in high-dimensional data, where the number of potential feature interactions can be very large.
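 To make the low-rank idea concrete, the sketch below replaces the full matrix \(W\) with the product \(U V^T\) and checks that both forms give the same output while the factored form stores fewer parameters. The sizes `d = 8` and `r = 2` and the function name `low_rank_cross_layer` are illustrative assumptions, and the gated mixture of several such experts is left out for brevity.

```python
import numpy as np

def low_rank_cross_layer(x0, xl, U, V, b):
    """Cross layer with W approximated by U @ V.T (U and V have shape (d, r))."""
    projected = V.T @ xl                 # project into the rank-r subspace: (r,)
    return x0 * (U @ projected + b) + xl # project back to d dims, then cross

rng = np.random.default_rng(0)
d, r = 8, 2                              # hypothetical input dimension and rank
U = rng.normal(size=(d, r))
V = rng.normal(size=(d, r))
b = np.zeros(d)
x0 = rng.normal(size=d)

out = low_rank_cross_layer(x0, x0, U, V, b)
full = x0 * ((U @ V.T) @ x0 + b) + x0    # same result using the materialized d x d matrix

full_params = d * d                      # weights in a full-rank layer
low_rank_params = 2 * d * r              # weights in the factored layer
```

 Computing `V.T @ xl` first keeps the cost at roughly \(2dr\) multiplications instead of \(d^2\), which is where the latency savings come from when \(r \ll d\).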
 Stacked and Parallel Structures:
 Stacked Structure: In this structure, the model processes data through the cross network and then the deep network sequentially. This allows the deep network to further refine and process the feature interactions captured by the cross network.
 Parallel Structure: Here, the cross and deep networks operate in parallel, and their outputs are combined at the end. This allows the model to learn from both explicit (cross network) and implicit (deep network) feature interactions simultaneously and then combine these insights.
 DCN V2, with its matrix-valued and low-rank cross layers, enhances its ability to model complex feature interactions more efficiently. The choice between stacked and parallel structures offers flexibility in how these interactions are processed and combined, making DCN V2 adaptable to a variety of data characteristics and application requirements. These specific architectural advancements position DCN V2 as a more effective and efficient model for handling web-scale data.
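 The difference between the stacked and parallel structures is easiest to see side by side. The following is a simplified sketch under assumed sizes (`d = 4`, one hidden layer of width `h = 3`); the helper names are hypothetical and the final prediction layer is omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def cross_net(x0, Ws, bs):
    x = x0
    for W, b in zip(Ws, bs):
        x = x0 * (W @ x + b) + x         # explicit feature crosses
    return x

def deep_net(x, Ws, bs):
    for W, b in zip(Ws, bs):
        x = relu(W @ x + b)              # implicit feature interactions
    return x

def stacked(x0, cross_params, deep_params):
    # Cross network output feeds into the deep network.
    return deep_net(cross_net(x0, *cross_params), *deep_params)

def parallel(x0, cross_params, deep_params):
    # Both networks read x0; their outputs are concatenated.
    return np.concatenate([cross_net(x0, *cross_params),
                           deep_net(x0, *deep_params)])

rng = np.random.default_rng(0)
d, h = 4, 3
x0 = rng.normal(size=d)
cross_params = ([rng.normal(size=(d, d))], [np.zeros(d)])
deep_params = ([rng.normal(size=(h, d))], [np.zeros(h)])

stacked_out = stacked(x0, cross_params, deep_params)     # shape (h,)
parallel_out = parallel(x0, cross_params, deep_params)   # shape (d + h,)
```

 Which structure works better is an empirical question that depends on the data, so in practice both are usually tried.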
Music use case
 Creating a music recommender system using DCN V2 involves several steps, from data preparation to model deployment. Here’s a detailed use case illustrating how DCN V2 can be effectively utilized for this purpose:
 Data Collection and Preparation:
 Collect Data: Gather comprehensive data involving user interactions with music tracks. This data might include:
 User Data: User demographics, historical listening data, ratings, and preferences.
 Music Data: Track IDs, genres, artists, albums, release years, and other metadata.
 Feature Engineering:
 Categorical Features: User IDs, track IDs, artist names, genres (sparse features).
 Numerical Features: User listening duration, frequency of listening to certain genres or artists (dense features).
 Model Architecture Setup:
 Embedding Layer for Sparse Features:
 Convert sparse categorical features into dense embeddings. For instance, create embeddings for user IDs and track IDs.
 Deep Component of DCN V2:
 Set up a series of dense layers for processing both dense features and embeddings from the sparse features.
 Cross Component of DCN V2:
 Implement the cross network with a mixture of low-rank cross layers to efficiently model explicit feature interactions.
 Stacked or Parallel Structure:
 Choose between a stacked or parallel architecture based on exploratory analysis and experimentation.
 Model Training:
 Input Data: Process and feed the data into the model, including user-track interaction data.
 Training Process:
 Train the model using appropriate loss functions (e.g., categorical cross-entropy for multiclass classification of music tracks).
 Employ techniques like batch normalization, dropout, or regularization as needed to improve performance and reduce overfitting.
 Generating Music Recommendations:
 Model Prediction: For a given user, use the model to predict the likelihood of them enjoying various tracks.
 Recommendation Strategy:
 Generate a list of recommended tracks for each user based on predicted likelihoods.
 Consider personalizing recommendations based on user-specific data like historical preferences.
 Model Evaluation and Refinement:
 Evaluation Metrics: Use accuracy, precision, recall, F1-score, or more complex metrics like Mean Average Precision at K (MAP@K) for evaluation.
 Feedback Loop: Incorporate user feedback to refine and improve the model iteratively.
 Deployment and Scaling:
 Deployment: Deploy the model in a production environment where it can handle real-time recommendation requests.
 Scalability: Ensure the system is scalable to handle large numbers of users and tracks, leveraging the efficiency of the DCN V2 architecture.
 Example Use Case:
 Personalized Playlist Creation: For each user, the system generates a personalized playlist based on their unique preferences, historical listening habits, and interactions with different music tracks.

New Music Discovery: The system recommends new tracks and artists that the user might enjoy but hasn’t listened to yet, broadening their music experience.
 Using DCN V2 for a music recommender system leverages the model’s ability to understand both explicit and implicit feature interactions, offering a powerful tool for delivering personalized music experiences. Its efficient architecture makes it suitable for handling the complexity and scale of music recommendation tasks.
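 For the evaluation step mentioned above, MAP@K can be computed as in the following sketch. Normalization conventions for average precision differ between libraries; this version divides by `min(k, number of relevant items)`, which is an assumption of the example, and it also assumes each recommendation list contains no duplicates. The track names are made up.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user: mean of precision@i over the ranks i that hit a relevant item."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i            # precision at rank i
    return score / min(k, len(relevant))

def map_at_k(all_recommended, all_relevant, k):
    """MAP@K: average of AP@K over users."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Hypothetical example: two users, three recommended tracks each.
ap_user1 = average_precision_at_k(["a", "b", "c"], {"a", "c"}, k=3)
score = map_at_k([["a", "b", "c"], ["x", "y", "w"]], [{"a", "c"}, {"z"}], k=3)
```

 For the first user the hits fall at ranks 1 and 3, so AP@3 is (1/1 + 2/3) / 2; the second user has no hits and contributes zero.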
Summary
 Proposed in DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems by Wang et al. from Google, DCN-V2 is an enhanced version of the Deep & Cross Network (DCN) that effectively learns feature interactions in large-scale learning to rank (LTR) systems.
 DCN-V2 addresses the limitations of the original DCN, particularly in web-scale systems with vast amounts of training data, where DCN exhibited limited expressiveness in its cross network for learning predictive feature interactions.
 The paper focuses on efficient and effective learning of predictive feature interactions, crucial in applications like search, recommendation systems, and computational advertising. Traditional approaches often involve manual identification of feature crosses or rely on deep neural networks (DNNs), which can be inefficient for higher-order feature crosses.
 DCN-V2 includes an embedding layer that processes both categorical (sparse) and dense features. It supports different embedding sizes, crucial for industrial-scale applications with varying vocabulary sizes.
 The core of DCN-V2 is its cross layers, which create explicit feature crosses. These layers are built upon a base layer containing original features and use learned weight matrices and bias vectors for each cross layer.
 DCN-V2’s effectiveness is demonstrated through extensive studies and comparisons with state-of-the-art algorithms on benchmark datasets like Criteo and MovieLens-1M. It outperforms these algorithms and offers significant offline accuracy and online business metrics gains in Google’s web-scale LTR systems.
 In summary, the key changes in DCN V2’s cross network are the move to matrix-valued cross weights, which enhances its expressiveness, and the incorporation of low-rank matrices in the cross layers, which optimizes the computation of feature interactions. Together they make the network more efficient and scalable, especially for complex, high-dimensional datasets, and allow it to capture complex feature interactions (including higher-order interactions) without the computational burden of full-rank operations.
DHEN (2022)
 Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, the DHEN authors observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. That indicates different designs may have different advantages and the interactions captured by them have non-overlapping information.
 Proposed in DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction, this paper by Zhang et al. from Meta introduces DHEN (Deep and Hierarchical Ensemble Network), a novel architecture designed for large-scale Click-Through Rate (CTR) prediction. The significance of DHEN lies in its ability to learn feature interactions effectively, a crucial aspect in the performance of online advertising services. Recognizing that different interaction models offer varying advantages and capture non-overlapping information, DHEN integrates a hierarchical ensemble framework with diverse interaction modules, including AdvancedDLRM, self-attention, Linear, Deep Cross Net, and Convolution. These modules enable DHEN to learn a hierarchy of interactions across different orders, addressing the limitations and variable performance of previous models on different datasets.
 The following figure from the paper shows a two-layer two-module hierarchical ensemble (left) and its expanded details (right). A general DHEN can be expressed as a mixture of multiple high-order interactions. Dense feature input for the interaction modules is omitted in this figure for clarity.
 In CTR prediction tasks, the feature inputs usually contain discrete categorical terms (sparse features) and numerical values (dense features). DHEN uses the same feature processing layer as DLRM, which is shown in the figure below. The sparse lookup tables map the categorical terms to a list of “static” numerical embeddings. Specifically, each categorical term is assigned a trainable \(d\)-dimensional vector as its feature representation. On the other hand, the numerical values are processed by dense layers. The dense layers consist of several Multilayer Perceptrons (MLPs) from which an output of a \(d\)-dimensional vector is computed. After a concatenation of the output from sparse lookup table and dense layer, the final output of the feature processing layer \(X_0 \in \mathbb{R}^{d \times m}\) can be expressed as \(X_0=\left(x_0^1, x_0^2, \ldots, x_0^m\right)\), where \(m\) is the number of the output embeddings and \(d\) is the embedding dimension.
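 The feature processing layer described above can be sketched as follows. The vocabulary sizes, the three dense inputs, and the single ReLU layer standing in for the dense MLP are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # embedding dimension
vocab_sizes = [10, 5]                        # two hypothetical categorical fields
tables = [rng.normal(size=(v, d)) for v in vocab_sizes]   # sparse lookup tables

# Single-layer stand-in for the dense MLP: maps 3 numerical values to d dims.
W_dense = rng.normal(size=(d, 3))
b_dense = np.zeros(d)

def feature_processing(sparse_ids, dense_values):
    """Return X0 with shape (d, m): one column per output embedding."""
    sparse_emb = [tables[field][idx] for field, idx in enumerate(sparse_ids)]
    dense_emb = np.maximum(W_dense @ dense_values + b_dense, 0.0)
    return np.stack(sparse_emb + [dense_emb], axis=1)

X0 = feature_processing([3, 1], np.array([0.5, 1.0, -2.0]))   # m = 3 columns
```

 Here \(m = 3\): two columns come from the lookup tables and one from the dense branch, matching the \(X_0 \in \mathbb{R}^{d \times m}\) layout in the text.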
 A key technical advancement in this work is the development of a co-designed training system tailored for DHEN’s complex, multi-layer structure. This system introduces the Hybrid Sharded Data Parallel, a novel distributed training paradigm. This approach not only caters to the deeper structure of DHEN but also significantly enhances training efficiency, achieving up to 1.2x better throughput compared to existing models.
 Empirical evaluations on large-scale datasets for CTR prediction tasks have demonstrated the effectiveness of DHEN. The model showed an improvement of 0.27% in Normalized Entropy (NE) gain over state-of-the-art models, underlining its practical effectiveness. The paper also discusses improvements in training throughput and scaling efficiency, highlighting the system-level optimizations that make DHEN particularly adept at handling large and complex datasets in the realm of online advertising.
Summary
 This section is taken from ML Frontiers.
 In contrast to DCN, the feature interactions in DLRM are limited to be second-order (i.e., pairwise) only: they’re just dot products of all pairs of embeddings. Going back to the movie example (with features user, movie, actors, director), the second-order interactions would be user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. A third-order interaction would be something like user-movie-director, actor-actor-user, director-actor-user, and so on.
 For example, certain users may be fans of Steven Spielberg-directed movies starring Tom Hanks, and there should be a cross feature for that! Alas, in standard DLRM, there isn’t. That’s a major limitation.
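 The pairwise interactions listed above can be enumerated programmatically. The sketch below uses random 8-dimensional embeddings as stand-ins for learned ones, and `itertools.combinations` only yields crosses of distinct features, so repeated-feature terms such as actor-actor-user are not included.

```python
from itertools import combinations

import numpy as np

features = ["user", "movie", "actor", "director"]

# All second-order (pairwise) and distinct third-order feature crosses.
second_order = ["-".join(pair) for pair in combinations(features, 2)]
third_order = ["-".join(triple) for triple in combinations(features, 3)]

# DLRM-style interaction: one dot product per pair of embeddings.
rng = np.random.default_rng(0)
emb = {name: rng.normal(size=8) for name in features}   # placeholder embeddings
pairwise_dots = {f"{a}-{b}": float(emb[a] @ emb[b])
                 for a, b in combinations(features, 2)}
```

 Four features yield six pairwise dot products, and nothing in this interaction layer produces a term such as user-actor-director, which is the limitation the text describes.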
 Enter DHEN, short for “Deep and Hierarchical Ensemble Network”. Proposed in Zhang et al. (2022), the key idea is to create a “hierarchy” of cross features that grows deeper with the number of DHEN layers, and so can include third, fourth, and arbitrarily high orders of interactions.

Here’s how DHEN works at a high level: suppose we have two input features going into DHEN, and let’s denote them by A and B. A 1-layer DHEN module would then create the entire hierarchy of cross features including the features themselves up to second order, namely:
A, AxA, AxB, BxA, B, BxB,
 where “x” is not just a single interaction but stands for a combination of the following 5 interactions:
 dot product,
 self-attention (similar to AutoInt),
 convolution,
 linear: y = Wx, or
 the cross module from DCN.

Add another layer, and things start to get pretty complex:
A, AxA, AxB, AxAxA, AxAxB, AxBxA, AxBxB, B, BxB, BxA, BxBxB, BxBxA, BxAxB, BxAxA,
 where “x” stands for one of 5 interactions, resulting in 62 distinct signals! DHEN is a beast, and its computational complexity (due to its recursive nature) is a nightmare. In order to get it to work, the authors of the DHEN paper even invented a new distributed training paradigm called “Hybrid Sharded Data Parallel”, which achieves 1.2X higher throughput than the (then) state-of-the-art distributed learning algorithm.
 But most importantly, DHEN works: in their experiments on internal click-through rate data, the authors measure a 0.27% improvement in NE compared to DLRM, using a stack of 8 DHEN layers. You may question whether such a seemingly small improvement in NE is worth such an enormous increase in complexity; alas, at a scale such as Meta’s, it probably is!
 DHEN goes not just a step but one giant leap further than DLRM by introducing a hierarchy of feature interactions consisting of dot product, AutoInt-like self-attention, convolution, linear processing, and DCN-like crossing, that replace DLRM’s simple dot product.
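 A heavily simplified sketch of the stacking idea follows. It assumes only two interaction modules (a linear map and a toy dot-product module), combines them by summation, adds an identity shortcut, and normalizes the result; the real DHEN uses more modules and learned projections, so every function here is a hypothetical stand-in rather than the paper's code.

```python
import numpy as np

def linear_module(X, W):
    return X @ W                             # (d, m) @ (m, m) -> (d, m)

def dot_product_module(X, P):
    gram = X.T @ X                           # (m, m): all pairwise dot products
    return P @ gram                          # project back to (d, m)

def layer_norm(X, eps=1e-5):
    # Normalize each embedding (column) across its d dimensions.
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def dhen_layer(X, W, P):
    ensemble = linear_module(X, W) + dot_product_module(X, P)
    return layer_norm(ensemble + X)          # shortcut connection, then norm

rng = np.random.default_rng(0)
d, m = 4, 3                                  # hypothetical embedding dim and count
X = rng.normal(size=(d, m))
for _ in range(2):                           # each extra layer crosses the crosses
    X = dhen_layer(X, rng.normal(size=(m, m)), rng.normal(size=(d, m)))
```

 Because the output of one layer is the input of the next, the second layer's dot products operate on features that already contain second-order terms, which is how the hierarchy reaches higher orders.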
GDCN (2023)
 Proposed in the paper Towards Deeper, Lighter, and Interpretable Cross Network for CTR Prediction by Wang et al. (2023) from Fudan University and Microsoft Research Asia in CIKM ’23. The paper introduces the Gated Deep Cross Network (GDCN) and the Field-level Dimension Optimization (FDO) approach. GDCN aims to address significant challenges in Click-Through Rate (CTR) prediction for recommender systems and online advertising, specifically the automatic capture of high-order feature interactions, interpretability issues, and the redundancy of parameters in existing methods.
 GDCN is inspired by DCN-V2 and consists of an embedding layer, a Gated Cross Network (GCN), and a Deep Neural Network (DNN). The GCN forms its core structure, which captures explicit bounded-degree high-order feature crosses/interactions. The GCN employs an information gate in each cross layer (representing a higher order interaction) to dynamically filter and amplify important interactions. This gate controls the information flow, ensuring that the model focuses on relevant interactions. This approach not only allows for deeper feature crossing but also adds a layer of interpretability by identifying crucial interactions, while the DNN models implicit feature crosses.
 GDCN is a generalization of DCN-V2, offering dynamic instance-based interpretability and the ability to utilize deeper cross features without a loss in performance.
 The key difference is that DCN-V2 treats all cross features equally, while GDCN uses information gates for fine-grained control over feature importance.
 GDCN transforms high-dimensional, sparse input into low-dimensional, dense representations. Unlike most CTR models, GDCN allows arbitrary embedding dimensions.
 Two structures are proposed: GDCN-S (stacked) and GDCN-P (parallel). GDCN-S feeds the output of GCN into a DNN, while GDCN-P feeds the input vector in parallel into GCN and DNN, concatenating their outputs.
 Alongside GDCN, the FDO approach focuses on optimizing the dimensions of each field in the embedding layer based on their importance. FDO addresses the issue of redundant parameters by learning independent dimensions for each field based on its intrinsic importance. This approach allows for a more efficient allocation of embedding dimensions, reducing unnecessary parameters and enhancing efficiency without compromising performance. FDO uses methods like PCA to determine optimal dimensions and only needs to be done once, with the dimensions applicable to subsequent model updates.
 The following figure shows the architecture of the GDCN-S and GDCN-P. \(\otimes\) is the cross operation (a.k.a. the gated cross layer).
 The following figure visualizes the gated cross layer. \(\odot\) is elementwise/Hadamard product, and \(\times\) is matrix multiplication.
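 A gated cross layer can be sketched as a DCN-V2 cross term multiplied elementwise by a sigmoid gate. The form used here, \(c_{l+1} = c_0 \odot (W_c c_l + b) \odot \sigma(W_g c_l) + c_l\), is our reading of the layer described above; the weight names `Wc` and `Wg` and the dimension are assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_cross_layer(c0, cl, Wc, b, Wg):
    feature_cross = c0 * (Wc @ cl + b)       # same cross term as in DCN-V2
    gate = sigmoid(Wg @ cl)                  # information gate, values in (0, 1)
    return feature_cross * gate + cl         # gated crosses plus residual

rng = np.random.default_rng(0)
d = 4                                        # hypothetical input dimension
c0 = rng.normal(size=d)
Wc = rng.normal(size=(d, d))
Wg = rng.normal(size=(d, d))
b = np.zeros(d)

c1 = gated_cross_layer(c0, c0, Wc, b, Wg)
```

 A gate value near 0 suppresses a cross feature for this particular input and a value near 1 lets it through, which is what makes the gate values usable as an instance-level importance signal.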
 Results indicate that GDCN, especially when paired with the FDO approach, outperforms state-of-the-art methods in terms of prediction performance, interpretability, and efficiency. GDCN was evaluated on five datasets (Criteo, Avazu, Malware, Frappe, ML-tag) using metrics like AUC and Logloss, showcasing the effectiveness and superiority of GDCN in capturing deeper high-order interactions. These experiments also demonstrate the interpretability of the GCN model and the successful parameter reduction achieved by the FDO approach. The datasets underwent preprocessing like feature removal for infrequent items and normalization. The comparison included various classes of CTR models and demonstrated GDCN’s effectiveness in handling high-order feature interactions without the drawbacks of overfitting or performance degradation observed in other models. GDCN achieves comparable or better performance with only a fraction (about 23%) of the original model parameters.
 In summary, GDCN addresses the limitations of existing CTR prediction models by offering a more interpretable, efficient, and effective approach to handling high-order feature interactions, supported by the innovative use of information gates and dimension optimization techniques.
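 The PCA step of FDO can be illustrated as follows: for one field's trained embedding table, keep the smallest number of principal components that explains a target share of the variance. The 95% threshold, the function name `field_dimension`, and the synthetic tables are assumptions of this sketch rather than settings taken from the paper.

```python
import numpy as np

def field_dimension(embedding_table, info_ratio=0.95):
    """Smallest number of principal components covering `info_ratio` of the variance."""
    centered = embedding_table - embedding_table.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, info_ratio)) + 1
    return min(k, len(singular_values))

rng = np.random.default_rng(0)
# A field whose 16-dim embeddings all lie along one direction needs 1 dimension.
redundant_field = np.outer(rng.normal(size=50), rng.normal(size=16))
# A field with unstructured 16-dim embeddings needs most of its dimensions.
rich_field = rng.normal(size=(200, 16))

dim_redundant = field_dimension(redundant_field)
dim_rich = field_dimension(rich_field)
```

 Running this once per field gives each field its own embedding width, so low-information fields stop consuming as many parameters as high-information ones.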
Graph Neural Networks-based RecSys Architectures
 Graph Neural Networks (GNN) architectures utilize graph structures to capture relationships between users, items, and their interactions. GNNs propagate information through the user-item interaction graph, enabling the model to learn user and item representations that incorporate relational dependencies. This is particularly useful in scenarios with rich graph-based data.
 Pros: Captures relational dependencies and propagates information through graph structures, enabling better modeling of complex relationships.
 Cons: Requires graph-based data and potentially higher computational resources for training and inference.
 Advantages: Improved recommendations by incorporating the rich relational information among users, items, and their interactions.
 Example Use Case: Social recommendation systems, where user-user connections or item-item relationships play a significant role in personalized recommendations.
 Phase: Candidate Generation, Ranking, Retrieval.
 Recommendation Workflow: GNN architectures are suitable for multiple phases of the recommendation workflow. In the candidate generation phase, GNNs can leverage graph structures to capture relational dependencies and generate potential candidate items. In the ranking phase, GNNs can learn user and item embeddings that incorporate relational information, leading to improved ranking. In the retrieval phase, GNNs can assist in efficient retrieval of relevant items based on their graph-based representations.
 For a detailed overview of GNNs in RecSys, please refer to the GNN primer.
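 As a minimal illustration of propagating information through the user-item interaction graph, the sketch below runs one round of mean aggregation: each user embedding becomes the average of the items they interacted with, and vice versa. The tiny interaction matrix and the absence of learned weights are simplifying assumptions; practical GNN recommenders add trainable transformations and several layers.

```python
import numpy as np

# Rows are users, columns are items; 1 marks an interaction (toy data).
R = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(2, 4))
item_emb = rng.normal(size=(3, 4))

def propagate(R, user_emb, item_emb):
    """One message-passing round with mean aggregation over graph neighbors."""
    user_deg = np.maximum(R.sum(axis=1, keepdims=True), 1.0)      # (users, 1)
    item_deg = np.maximum(R.sum(axis=0, keepdims=True).T, 1.0)    # (items, 1)
    new_user = (R @ item_emb) / user_deg      # average of each user's items
    new_item = (R.T @ user_emb) / item_deg    # average of each item's users
    return new_user, new_item

new_user, new_item = propagate(R, user_emb, item_emb)
```

 Applying `propagate` twice lets a user's representation absorb information from other users who share items with them, which is the relational signal the section refers to.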
Two Towers in RecSys
 This section is inspired by ML Frontiers.
 “One of the more popular architecture in personalization / RecSys is two tower network. The two towers of the network usually represent user tower (\(U\)) and candidate tower (\(C\)). The towers produce a dense vector (embedding representation) of \(U\) and \(C\) respectively. The final network is just a dot product or cosine similarity function.
 Let’s consider the cost of executing user tower/network is \(u\) and cost of executing candidate tower is \(c\) and dot product is \(d\).
 At request time, the cost of executing the whole network for ranking N candidates for one user: \(N*(u + c + d)\).
 Since the user is fixed, you need to compute it only once. So, the cost becomes: \(u + N*(c+d)\). Embeddings could be cached. So, the final cost becomes \(u + N* d+ k\) when \(k\) is.” (source)
 The image below (source) showcases this.
 The two-tower architecture consists of two separate branches: a query tower and a candidate tower. The query tower learns user representations based on user history, while the candidate tower learns item representations based on item features. The two towers are typically combined in the final stage to generate recommendations.
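 The serving-time saving from separating the towers can be seen in a small sketch: candidate embeddings are computed once and cached, the user tower runs once per request, and scoring reduces to one dot product per candidate. The single-layer `tanh` towers, feature sizes, and function names are placeholders assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_user, d_item, d_emb, n_items = 6, 5, 4, 100

W_user = rng.normal(size=(d_emb, d_user))    # stand-in for the query tower
W_item = rng.normal(size=(d_emb, d_item))    # stand-in for the candidate tower

def user_tower(user_features):
    return np.tanh(W_user @ user_features)               # (d_emb,)

def candidate_tower(item_features):
    return np.tanh(item_features @ W_item.T)              # (n_items, d_emb)

# Offline: embed every candidate once and cache the result.
item_features = rng.normal(size=(n_items, d_item))
candidate_cache = candidate_tower(item_features)

def recommend(user_features, k=5):
    query = user_tower(user_features)        # run the user tower once per request
    scores = candidate_cache @ query         # N dot products against the cache
    top = np.argsort(-scores)[:k]
    return top, scores[top]

top_items, top_scores = recommend(rng.normal(size=d_user), k=5)
```

 In production the cached matrix is typically served from an approximate nearest neighbor index rather than scanned exhaustively, but the division of work between the towers is the same.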
 Pros: Explicitly models user and item representations separately, allowing for better understanding of user preferences and item features.
 Cons: Requires additional computation to learn and combine the representations from the query and candidate towers.
 Advantages: Improved personalization by learning user and item representations separately, which can capture fine-grained preferences.
 Example Use Case: Personalized recommendation systems where understanding the user’s historical behavior and item features separately is critical.
 Phase: Candidate Generation, Ranking.
 Recommendation Workflow: The two-tower architecture is often employed in the candidate generation and ranking phases. In the candidate generation phase, the two-tower architecture enables the separate processing of user and item features, capturing their respective representations. In the ranking phase, the learned representations from the query and candidate towers are combined to assess the relevance of candidate items to the user’s preferences.
 The two-tower model approach in recommender systems gained formal recognition in the machine learning community with Huawei’s 2019 PAL paper. This model was developed to address biases in ranking models, particularly position bias observed in recommender systems.
 The two-tower model consists of two separate “towers”: one for learning relevance (user/item interactions) and another for learning biases (like position bias). These towers are combined in different ways, either multiplicatively or additively, to yield the final output.
 Examples of popular two-tower implementations:
 Huawei’s PAL model uses a multiplicative approach to combine the outputs of the two towers, addressing position bias within the context of their app store.
 YouTube’s “Watch Next” paper introduced an additive two-tower model, which not only addresses position bias but also incorporates other selection biases by using additional features like device type.
 The two-tower model has been shown to significantly improve recommendation systems. For instance, Huawei’s PAL model demonstrated improvements in click-through and conversion rates by around 25%. YouTube’s model, by adding a shallow tower for bias learning, showed an improvement in their engagement metric.
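 The two ways of combining the relevance and bias towers can be written down directly. In this sketch each tower is reduced to a single logit that would, in a real model, come from a neural network; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multiplicative_click_probability(relevance_logit, position_logit):
    """PAL-style: P(click) = P(seen | position) * P(click | seen, user, item)."""
    return sigmoid(position_logit) * sigmoid(relevance_logit)

def additive_click_probability(relevance_logit, bias_logit):
    """Watch Next-style: the bias tower's logit is added before the sigmoid."""
    return sigmoid(relevance_logit + bias_logit)

# Same relevance, different positions: a lower position logit lowers P(click).
p_top = multiplicative_click_probability(relevance_logit=2.0, position_logit=3.0)
p_low = multiplicative_click_probability(relevance_logit=2.0, position_logit=-1.0)
```

 In both variants the bias tower is only needed to explain the training data; at serving time the ranking is meant to rely on the relevance tower, which is the sense in which the resulting model is debiased.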
 Challenges and considerations:
 A key challenge in two-tower models is ensuring that both towers learn independently during training, as relevance can confound the learning of position bias.
 Techniques like Dropout have been applied to mitigate over-reliance on certain features, like position, and improve generalization.
 The two-tower model approach is seen as a powerful method for building unbiased ranking models in recommender systems. It’s a research domain with substantial potential, indicating that the field is still far from reaching its full capability.
Split Network
 A split network is a generalized version of a two-tower network. The same optimization of embedding lookup holds here as well. Instead of a dot product, a simple neural network could be used to produce output.
 The image below (source) showcases this.
 In a split network architecture, different components of the recommendation model are split and processed separately. For example, the user and item features may be processed independently and combined in a later stage. This allows for parallel processing and efficient handling of large-scale recommender systems.
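 A split network differs from the two-tower setup only in the final combination step, as the following sketch shows: the user and candidate embeddings are still produced independently (and can be cached), but a small neural network replaces the dot product. The layer sizes and the name `split_score` are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # embedding size of each branch
W1 = rng.normal(size=(8, 2 * d))             # hidden layer over [user; candidate]
b1 = np.zeros(8)
w2 = rng.normal(size=8)                      # output weights

def split_score(user_emb, cand_emb):
    """Score a (user, candidate) pair with a small MLP instead of a dot product."""
    hidden = np.maximum(W1 @ np.concatenate([user_emb, cand_emb]) + b1, 0.0)
    return float(1.0 / (1.0 + np.exp(-(w2 @ hidden))))

user_emb = rng.normal(size=d)                # output of the user branch
cand_emb = rng.normal(size=d)                # cached output of the item branch
score = split_score(user_emb, cand_emb)
```

 The MLP can model interactions between the two embeddings that a dot product cannot, at the price of running a small network for every candidate.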
 Pros: Enables parallel processing, efficient handling of large-scale systems, and flexibility in designing and optimizing different components separately.
 Cons: Requires additional coordination and synchronization between the split components, potentially increasing complexity.
 Advantages: Scalability, flexibility, and improved performance in handling large-scale recommender systems.
 Example Use Case: Recommendation systems with a massive number of users and items, where parallel processing is crucial for efficient computation.
 Phase: Candidate Generation, Ranking, Final Ranking.
 Recommendation Workflow: The split network architecture can be utilized in various phases. During the candidate generation phase, the split network can be used to process user and item features independently, allowing efficient retrieval of potential candidate items. In the ranking phase, the split network can be employed to learn representations and capture interactions between the user and candidate items. Finally, in the final ranking phase, the split network can contribute to the overall ranking of the candidate items based on learned representations.
Summary
 Neural Collaborative Filtering (NCF) represents a pioneering approach in recommender systems. It was one of the initial studies to replace the then-standard linear matrix factorization algorithms with neural networks, thus facilitating the integration of deep learning into recommender systems.
 The Wide & Deep model underscored the significance of cross features, specifically second-order features formed by intersecting two original features. This model effectively combines a broad, shallow module for handling cross features with a deep module, paralleling the approach of NCF.
 Deep and Cross Neural Network (DCN) was among the first to transition from manually engineered cross features to an algorithmic method capable of autonomously generating all potential feature crosses to any desired order.
 Deep Factorization Machine (DeepFM) shares conceptual similarities with DCN. However, it distinctively substitutes the cross layers in DCN with factorization machines, or more specifically, dot products.
 Automatic Interactions (AutoInt) brought multi-head self-attention mechanisms, previously known in Large Language Models (LLMs), into the domain of feature interaction. This technique moves away from brute-force generation of all possible feature interactions, which can lead to model overfitting on noisy feature crosses. Instead, it employs attention mechanisms to enable the model to selectively focus on the most relevant feature interactions.
 Deep Learning Recommendation Model (DLRM) marked a departure from previous models by discarding the deep module. It relies solely on an interaction layer that computes dot products, akin to the factorization machine component in DeepFM, followed by a Multi-Layer Perceptron (MLP). This model emphasizes the sufficiency of interaction layers alone.
 Deep and Hierarchical Ensemble Network (DHEN) builds upon the DLRM framework by replacing the conventional dot product with a sophisticated hierarchy of feature interactions, including dot product, convolution, self-attention akin to AutoInt, and crossing features similar to those in DCN.
 Gated Deep Cross Network (GDCN) enhances Click-Through Rate (CTR) prediction in recommender systems by improving interpretability, efficiency, and handling of high-order feature interactions.
 The Two Towers model in recommender systems, known for its separate user and candidate towers, optimizes personalized recommendations and addresses biases like position bias, representing an evolving and powerful approach in building unbiased ranking models.