• This article delves into the architectures of recommender systems, focusing on the processing of two main types of data: dense and sparse features. These features are fundamental to generating personalized recommendations, but each presents unique challenges and requires specific strategies for optimal processing.

Dense vs. Sparse Features

  • Dense Features: These are continuous and quantifiable values that generally represent a measurable attribute. They are straightforward to integrate into models due to their numerical nature.
    • Examples:
      • User Age: A single integer value representing the age of a user.
      • Movie Ratings: Real values such as 4.5 on a 5-point scale indicating the user’s rating of a movie.
  • Sparse Features: These features are categorical and can have a wide range of possible values, many of which may occur infrequently, leading to sparsity. Sparse features often result in large, high-dimensional datasets where most of the data points are zeros.
    • Examples:
      • Movie Genres: Categories like Action, Comedy, or Drama, with each movie potentially tagged with multiple genres.
      • User Occupation: A wide range of job titles, many of which might appear rarely in the dataset.

Handling Sparsity

  • Sparsity refers to the situation where the data available for making recommendations is very limited compared to the total universe of possible data points. This issue becomes prominent as the number of users and items in the system increases, but the number of interactions (ratings, purchases, views, etc.) per user or item does not keep pace.
  • Embedding Layers: Transforming categorical data into a dense, lower-dimensional form. This not only reduces the dimensionality but also helps to capture more complex patterns in the data.
  • One-hot Encoding: A simple method where each category is represented by a binary vector. However, this often exacerbates the dimensionality problem, especially with high-cardinality features.
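  • To make the contrast between the two encodings concrete, here is a minimal sketch that encodes one categorical value both ways. The vocabulary, the 3-dimensional embedding size, and the randomly filled table are illustrative assumptions; in a real model the table entries would be learned during training.

```python
import random

# Hypothetical vocabulary for a categorical feature (illustrative only).
occupations = ["engineer", "teacher", "nurse", "pilot", "chef"]
index_of = {name: i for i, name in enumerate(occupations)}


def one_hot(name):
    """Binary vector with a single 1 at the category's index."""
    vec = [0] * len(occupations)
    vec[index_of[name]] = 1
    return vec


# Embedding table: one small dense row per category. Random here; a real
# model would learn these values during training.
random.seed(0)
embedding_dim = 3
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in occupations]


def embed(name):
    """Embedding lookup: select the table row for this category."""
    return embedding_table[index_of[name]]


def one_hot_times_table(name):
    """Multiplying the one-hot vector by the table selects the same row."""
    oh = one_hot(name)
    return [sum(oh[i] * embedding_table[i][d] for i in range(len(oh)))
            for d in range(embedding_dim)]


print(one_hot("nurse"))  # length grows with the vocabulary size
print(embed("nurse"))    # length stays at embedding_dim
```

  • The one-hot vector has one slot per category, so its length grows with cardinality, while the embedding keeps a fixed, small width. The last helper shows that an embedding lookup is equivalent to multiplying the one-hot vector by the table.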

Feature Crossing

  • Feature crossing involves creating new features by combining two or more existing features, aiming to capture interactions between them that may be predictive of the outcome but are not captured when considering the features independently.

  • How It’s Done:
    • Manual Crossing: Explicitly defining new features based on domain knowledge, such as combining ‘age’ and ‘income’ to create an ‘age x income’ feature.
    • Automated Crossing: Using algorithms like decision trees or deep learning models that can learn to combine features internally and create non-linear combinations of features.
  • Issues with Feature Crossing:
    • Scalability: Manual feature engineering, including feature crossing, can be labor-intensive and hard to scale as the number of features grows.
    • Model Complexity: Automatically learned feature crosses can make the model more complex and harder to interpret. This can also lead to overfitting if not managed properly with regularization techniques.
  • The figure below (source) illustrates a deep neural network (DNN) architecture for processing both dense and sparse features: dense features are processed through an MLP (multi-layer perceptron) to create dense embeddings, while sparse features are converted to sparse embeddings via separate embedding tables (A and B). These embeddings are then combined to facilitate dense-sparse interactions before being fed into the DNN architecture to produce the output.

  • Additionally, the Criteo Leaderboard helps us see which architecture performs well on the Criteo dataset based on CTR.
  • The plot below (source) is a visual representation of the models and architectures for the task of Click-Through Rate Prediction on the Criteo dataset. With this use-case as our poster child, we will discuss the inner workings of some major model architectures listed in the plot.

Wide and Deep (2016)

  • A quick note that the first half of this section is taken from (ML Frontiers)
  • While NCF revolutionized the domain of recommender systems, it lacks an important ingredient that turned out to be extremely important for the success of recommenders: cross features. The idea of cross features was first popularized in Google’s 2016 paper Wide & Deep Learning for Recommender Systems by Cheng et al.

Background: Cross Features

What are feature crosses and why are they important?

  • A cross feature is a second-order feature (i.e., a cross-product transformation) that’s created by “crossing” two of the original features, thus modeling the interactive effects between the two features. For example, in the Google Play Store, first-order features include the impressed app, or the list of user-installed apps. These two can be combined to create powerful cross-features, such as:

     AND(user_installed_app='netflix', impression_app='hulu')
    • which is 1 if the user has Netflix installed and the impressed app is Hulu.
  • Cross features can also be more coarse-grained such as:

     AND(user_installed_category='video', impression_category='video')
    • which is 1 if the user installed video apps before and the impressed app is a video app as well. The authors argue that adding cross features of different granularities enables both memorization (from more granular crosses) and generalization (from less granular crosses).
  • As another example (source), imagine that we are building a recommender system to sell a blender to customers. Then, a customer’s past purchase history such as purchased_bananas and purchased_cooking_books, or geographic features, are single features. If one has purchased both bananas and cooking books, then this customer will more likely click on the recommended blender. The combination of purchased_bananas and purchased_cooking_books is referred to as a feature cross, which provides additional interaction information beyond the individual features.
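  • The AND(...) cross features above can be sketched as a small function. This is only an illustration: the record layout and the helper name cross_feature are assumptions, not something taken from the paper.

```python
def cross_feature(example, conditions):
    """Return 1 if every (feature, value) condition holds for the example, else 0.

    Multi-valued features (e.g. a list of installed apps) match when the
    required value is contained in the collection.
    """
    for feature, value in conditions.items():
        actual = example.get(feature)
        if isinstance(actual, (list, set, tuple)):
            if value not in actual:
                return 0
        elif actual != value:
            return 0
    return 1


# Hypothetical impression record; the field names are assumptions.
impression = {
    "user_installed_app": ["netflix", "spotify"],
    "impression_app": "hulu",
}

fine_grained = cross_feature(
    impression, {"user_installed_app": "netflix", "impression_app": "hulu"})
mismatch = cross_feature(
    impression, {"user_installed_app": "netflix", "impression_app": "maps"})
print(fine_grained, mismatch)
```

  • Each such cross becomes one extra binary input to the model, which is why the number of candidate crosses grows quickly with feature cardinality.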

What are the challenges in learning feature crosses?

  • In web-scale applications, data is mostly categorical, leading to large and sparse feature space. Identifying effective feature crosses in this setting often requires manual feature engineering or exhaustive search.
  • Traditional feed-forward multilayer perceptron (MLP) models are universal function approximators; however, they cannot efficiently approximate even 2nd or 3rd-order feature crosses (Wang et al. (2020), Beutel et al. (2018)).


  • Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. However, memorization and generalization are both important for recommender systems. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank.

The Wide and Deep architecture demonstrated the critical importance of cross features, that is, second-order features that are created by crossing two of the original features. It combines a wide and shallow module for cross features with a deep module much like NCF. It seeks to obtain the best of both worlds by combining the unique strengths of wide and deep models, i.e., memorization and generalization respectively, thus enabling better recommendations.

  • Wide and Deep learning jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings.


Wide part: The wide part of the model is a generalized linear model that takes into account cross-product feature transformations, in addition to the original features. The cross-product transformations capture interactions between categorical features. For example, if you were building a real estate recommendation system, you might include a cross-product transformation of city=San Francisco AND type=condo. These cross-product transformations can effectively capture specific, niche rules, offering the model the benefit of memorization.

Deep part: The deep part of the model is a feed-forward neural network that takes all features as input, both categorical and continuous. However, categorical features are typically transformed into embeddings first, as neural networks work with numerical data. The deep part of the model excels at generalizing patterns from the data to unseen examples, offering the model the benefit of generalization.

  • As a recap, a Generalized Linear Model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables to have error distribution models other than a normal distribution. GLMs are used to model relationships between a response variable and one or more predictor variables. Examples of GLMs include logistic regression (used for binary outcomes like pass/fail), Poisson regression (for count data), and linear regression (for continuous data with a normal distribution).
  • As an example (source), say you’re trying to offer food/beverage recommendations based on an input query. People looking for specific items like “iced decaf latte with nonfat milk” really mean it. Just because it’s pretty close to “hot latte with whole milk” in the embedding space doesn’t mean it’s an acceptable alternative. Similarly, there are millions of these rules where the transitivity (a relation between three elements such that if it holds between the first and second and it also holds between the second and third it must necessarily hold between the first and third) of embeddings may actually do more harm than good. On the other hand, queries that are more exploratory like “seafood” or “italian food” may be open to more generalization and discovering a diverse set of related items.

Building upon the food recommendation example earlier, as shown in the graph below (source), sparse features like query="fried chicken" and item="chicken fried rice" are used in both the wide part (left) and the deep part (right) of the model.

  • For the wide component utilizing a generalized linear model, cross-product transformations are carried out on the binary features. For example, AND(gender=female, language=en) is 1 if and only if the constituent features (gender=female and language=en) are all 1, and 0 otherwise. This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.
  • For the deep component utilizing a feed-forward neural network, each of the sparse, high-dimensional categorical features is first converted into a low-dimensional, dense real-valued vector, often referred to as an embedding vector. The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly and then the values are trained to minimize the final loss function during model training.
  • During training, the prediction errors are backpropagated to both sides to train the model parameters, i.e., the two models function as one “cohesive” architecture and are trained jointly with the same loss function.
  • The figure below from the paper shows how Wide and Deep models form a sweet middle compared to simply Wide and simply Deep models:

  • Thus, the key architectural choice in Wide and Deep is to have both a wide module, which is a linear model that takes all cross features directly as inputs, and a deep module, which is essentially an NCF, and then combine both modules into a single output task head that learns from user/app engagements.
  • By combining these two components, Wide and Deep models aim to achieve a balance between memorization and generalization, which can be particularly useful in recommendation systems, where both aspects can be important. The wide part can capture specific item combinations that a particular user might like (based on historical data), while the deep part can generalize from user behavior to recommend items that the user hasn’t interacted with yet but might find appealing based on their broader preferences. Put simply, Wide and Deep architectures combine a deep neural network component for capturing complex patterns and a wide component using a generalized linear model that models feature interactions explicitly. This allows the model to learn both deep representations and exploit feature interactions, providing a balance between memorization and generalization.
  • The Wide and Deep model consists of two parts: a generalized linear model over raw and cross-product categorical features (wide), and a dense neural network over continuous features and embeddings of the categorical features (deep). The architectural diagram below (source) showcases this structure.

  • In the Wide & Deep Learning model, both the wide and deep components handle sparse features, but in different ways:
  1. Wide Component:
    • The wide component is a generalized linear model that uses raw input features and transformed features.
    • An important transformation in the wide component is the cross-product transformation. This is particularly useful for binary features, where a cross-product transformation like “AND(gender=female, language=en)” is 1 if and only if both constituent features are 1, and 0 otherwise.
    • Such transformations capture the interactions between binary features and add nonlinearity to the generalized linear model.
  2. Deep Component:
    • The deep component is a feed-forward neural network.
    • For handling categorical features, which are often sparse and high-dimensional, the deep component first converts these features into low-dimensional, dense real-valued vectors, commonly referred to as embedding vectors. The dimensionality of these embeddings usually ranges from 10 to 100.
    • These dense embedding vectors are then fed into the hidden layers of the neural network. The embeddings are initialized randomly and trained to minimize the final loss function during model training.
  3. Combined Model:
    • The wide and deep components are combined using a weighted sum of their output log odds, which is then fed to a common logistic loss function for joint training.
    • In this combined model, the wide part focuses on memorization (exploiting explicit feature interactions), while the deep part focuses on generalization (learning implicit feature representations).
    • The combined model ensures that both sparse and dense features are effectively utilized, with sparse features often transformed into dense representations for efficient processing in the deep neural network.
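  • The combination step can be sketched as a single forward pass: the wide and deep log odds are summed, a shared bias is added, and the result goes through a sigmoid whose output feeds the logistic loss. All weights, layer sizes, and inputs below are hand-picked assumptions for illustration, not values from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(v):
    return [max(0.0, x) for x in v]


def dense(v, weights, bias):
    """Fully connected layer: `weights` has one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def wide_and_deep_predict(wide_x, deep_x, params):
    # Wide part: linear model over raw and cross-product binary features.
    wide_logit = sum(w * x for w, x in zip(params["w_wide"], wide_x))
    # Deep part: one hidden ReLU layer over embeddings and dense features.
    hidden = relu(dense(deep_x, params["W1"], params["b1"]))
    deep_logit = sum(w * h for w, h in zip(params["w_deep"], hidden))
    # Sum the log odds of both parts, add a shared bias, apply the sigmoid.
    return sigmoid(wide_logit + deep_logit + params["bias"])


def log_loss(y, p):
    """Logistic loss for a single example with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical parameters, hand-picked for illustration.
params = {
    "w_wide": [0.8, -0.3],
    "W1": [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]],
    "b1": [0.0, 0.1],
    "w_deep": [0.7, -0.5],
    "bias": -0.2,
}
wide_x = [1, 0]            # e.g. one cross feature fired, one did not
deep_x = [0.2, -0.1, 0.4]  # e.g. concatenated embeddings and dense features

p = wide_and_deep_predict(wide_x, deep_x, params)
print(p, log_loss(1, p))
```

  • Because a single loss is computed on the summed logit, its gradient flows into both the wide weights and the deep weights, which is what joint training means here.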

Music Example

  • In a music recommendation app using the Wide & Deep Learning model, the input features for both the wide and deep components would be tailored to capture different aspects of user preferences and characteristics of the music items. Let’s consider what these inputs might look like:

Input to the Wide Component

  • The wide component would primarily use sparse, categorical features, possibly transformed to capture specific interactions:

  1. User Features: Demographics (age, gender, location), user ID, historical user behavior (e.g., genres listened to frequently, favorite artists).
  2. Music Item Features: Music genre, artist ID, album ID, release year.
  3. Cross-Product Transformations: Combinations of categorical features that are believed to interact in meaningful ways. For instance, “user’s favorite genre = pop” AND “music genre = pop”, or “user’s location = USA” AND “artist’s origin = USA”. These cross-products help capture interaction effects that are specifically relevant to music recommendations.

Input to the Deep Component

  • The deep component would use both dense and sparse features, with sparse features transformed into dense embeddings:
  1. User Features (as Embeddings): Embeddings for user ID, embedding vectors for historical preferences (like a vector summarizing genres listened to), demographics if treated as categorical.
  2. Music Item Features (as Embeddings): Embeddings for music genre, artist ID, album ID. These embeddings capture the nuanced relationships in the music domain.
  3. Additional Dense Features: If available, numerical features like the number of times a song has been played, user’s average listening duration, or ratings given by the user.
    • The embeddings created to serve as the input to the deep component are “learned embeddings” or “trainable embeddings,” as they are learned directly from the data during the training process of the model.
  • Here’s a Python code snippet using TensorFlow to illustrate how a categorical feature (like user IDs) is embedded:
import tensorflow as tf

# Assuming we have 10,000 unique users and we want to embed them into a 64-dimensional space
num_unique_users = 10000
embedding_dimension = 64

# Create an input layer for user IDs (assuming user IDs are integers ranging from 0 to 9999)
user_id_input = tf.keras.Input(shape=(1,), dtype='int32')

# Create an embedding layer
user_embedding_layer = tf.keras.layers.Embedding(input_dim=num_unique_users,
                                                 output_dim=embedding_dimension)

# Apply the embedding layer to the user ID input
user_embedding = user_embedding_layer(user_id_input)

# Flatten the embedding output to feed into a dense layer
user_embedding_flattened = tf.keras.layers.Flatten()(user_embedding)

# Add a dense layer (more layers can be added as needed)
dense_layer = tf.keras.layers.Dense(128, activation='relu')(user_embedding_flattened)

# Create a model
model = tf.keras.Model(inputs=user_id_input, outputs=dense_layer)

# Compile the model
model.compile(optimizer='adam', loss='mse')  # Adjust the loss based on your specific task

# Model summary
model.summary()

In this code:

  • We first define the number of unique users (num_unique_users) and the dimensionality of the embedding space (embedding_dimension).
  • An input layer is created to accept user IDs.
  • An embedding layer (tf.keras.layers.Embedding) is added to transform each user ID into a 64-dimensional vector. This layer is set to be trainable, meaning its weights (the embeddings) are learned during training.
  • The embedding layer’s output is then flattened and passed through a dense layer for further processing.
  • The model is compiled with an optimizer and loss function, which should be chosen based on the specific task (e.g., classification, regression).

  • This code example demonstrates how to create trainable embeddings for a categorical feature within a neural network using TensorFlow. These embeddings are specifically tailored to the data and task at hand, learning to represent each category (in this case, user IDs) in a way that is useful for the model’s predictive task.

Combining Inputs in Wide & Deep Model

  • Joint Model: The wide and deep components are joined in a unified model. The wide component helps with memorization of explicit feature interactions (especially useful for categorical data), while the deep component contributes to generalization by learning implicit patterns and relationships in the data.
  • Feature Transformation: Sparse features are more straightforwardly handled in the wide part through cross-product transformations, while in the deep part, they are typically converted into dense embeddings.
  • Model Training: Both parts are trained jointly, allowing the model to leverage the strengths of both memorization and generalization.

  • In a music recommendation app, this combination allows the model to not only consider obvious interactions (like a user’s past preferences for certain genres or artists) but also to uncover more subtle patterns and relationships within the data, which might not be immediately apparent but are influential in determining a user’s music preferences.


  • The authors productionized and evaluated the system on Google Play Store, a massive-scale commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide and Deep significantly increased app acquisitions compared with wide-only and deep-only models.
  • Compared to a deep-only model, Wide and Deep improved acquisitions in the Google Play store by 1%. Consider that Google makes tens of billions in revenue each year from its Play Store, and it’s easy to see how impactful Wide and Deep was.


  • Architecture: The Wide and Deep model in recommendation systems incorporates cross features, particularly in the “wide” component of the model. The wide part is designed for memorization and uses linear models with cross-product feature transformations, effectively capturing interactions between categorical features. This is crucial for learning specific, rule-based information, which complements the “deep” part of the model that focuses on generalization through deep neural networks. By combining these approaches, Wide and Deep models effectively capture both simple, rule-based patterns and complex, non-linear relationships within the data.
  • Pros: Balances memorization (wide component) and generalization (deep component), capturing both complex patterns and explicit feature interactions.
  • Cons: Increased model complexity and potential challenges in training and optimization.
  • Advantages: Improved performance by leveraging both deep representations and explicit feature interactions.
  • Example Use Case: E-commerce platforms where a combination of user behavior and item features plays a crucial role in recommendations.
  • Phase: Ranking.
  • Recommendation Workflow: Given its complexity, the Wide and Deep architecture is suitable for the ranking phase. The wide component captures explicit feature interactions, while the deep component learns complex patterns and interactions, improving the ranking of candidate items based on user-item preferences.

Deep Factorization Machine / DeepFM (2017)

  • The first half of this section is taken from (ML Frontiers)
  • Similar to Google’s DCN, Huawei’s DeepFM, introduced in Guo et al. (2017), also replaces manual feature engineering in the wide component of Wide and Deep with a dedicated neural network that learns cross features. However, unlike DCN, the wide component is not a cross neural network, but instead a so-called factorization machine (FM) layer.

What does the FM layer do? It simply takes the dot-products of all pairs of embeddings. For example, if a movie recommender takes 4 id-features as inputs, such as user id, movie id, actor ids, and director id, then the FM layer computes 6 dot products, corresponding to the combinations user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. The output of the FM layer is then concatenated with the output of the deep component and passed into a sigmoid layer which outputs the model’s predictions.
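The pairwise step described above can be sketched in a few lines. The 3-dimensional embedding values are made up for illustration, and a single vector stands in for the actor ids to keep the example small.

```python
from itertools import combinations

# Hypothetical 3-dimensional embeddings for the four id-features.
embeddings = {
    "user":     [0.1, 0.3, -0.2],
    "movie":    [0.4, -0.1, 0.2],
    "actor":    [-0.3, 0.2, 0.5],
    "director": [0.2, 0.2, 0.1],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# One dot product per unordered pair of fields: C(4, 2) = 6 in total.
pairwise = {
    (f1, f2): dot(embeddings[f1], embeddings[f2])
    for f1, f2 in combinations(embeddings, 2)
}

for pair, value in pairwise.items():
    print(pair, round(value, 3))
```

With n fields the layer produces n(n-1)/2 interaction terms, one per pair, regardless of whether a given pair is informative.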

  • However, note that DeepFM (similar to DCN) learns this in a brute-force manner simply by considering all possible combinations uniformly (i.e., it calculates all pair-wise interactions), while newer implementations such as AutoInt leverage self-attention to automatically determine the most informative feature interactions, i.e., which feature interactions to pay the most attention to (and which to ignore by setting the attention weights to zero).
  • The figure below from the paper shows the wide and deep architecture of DeepFM. The wide and deep components share the same raw input feature vector, which enables DeepFM to learn low- and high-order feature interactions simultaneously from the raw input features. Note that in the figure, the FM layer contains a circle marked “+” in addition to the inner products. Think of this like a skip connection that passes the concatenation of the inputs directly into the output unit.

  • The authors show that DeepFM beats a host of its competitors, including Google’s Wide and Deep, by more than 0.42% Logloss on company-internal data.
  • DeepFM replaces the cross neural network in DCN with factorization machines, that is, dot products.

  • DeepFM combines FM with deep neural networks. It utilizes FM to model pairwise feature interactions and a deep neural network to capture higher-order feature interactions. This architecture leverages both linear and non-linear relationships between features.


  • Pros: Combines the benefits of FM and deep neural networks, capturing both pairwise and higher-order feature interactions. In other words, accurate modeling of both linear and non-linear relationships between features, providing a comprehensive understanding of feature interactions.
  • Cons:
    • DeepFM creates feature crosses in a brute-force way, simply by considering all possible combinations. This is not only inefficient, it could also create feature crosses that aren’t helpful at all, and just make the model overfit.
    • Increased model complexity and potential challenges in training and optimization.
  • Example Use Case: Click-through rate prediction in online advertising or personalized recommendation systems.
  • Phase: Candidate Generation, Ranking.
  • Recommendation Workflow: DeepFM is commonly utilized in both the candidate generation and ranking phases. It combines the strengths of factorization machines and deep neural networks. In the candidate generation phase, DeepFM can capture pairwise feature interactions efficiently. In the ranking phase, it can leverage deep neural networks to model higher-order feature interactions and improve the ranking of candidate items.

Neural Collaborative Filtering / NCF (2017)

  • The integration of deep learning into recommender systems witnessed a significant breakthrough with the introduction of Neural Collaborative Filtering (NCF), introduced in He et al. (2017) from NUS Singapore, Columbia University, Shandong University, and Texas A&M University.
  • This innovative approach marked a departure from the (then standard) matrix factorization method. Prior to NCF, the gold standard in recommender systems was matrix factorization, which relied on learning latent vectors (a.k.a. embeddings) for both users and items, and then generating recommendations for a user by taking the dot product between the user vector and the item vectors. The closer the dot product is to 1, the better the match. As such, matrix factorization can be simply viewed as a linear model of latent factors.

The key idea behind NCF is to substitute the inner product in matrix factorization with a neural network architecture that can learn an arbitrary non-linear function from data. To supercharge the learning process of the user-item interaction function with non-linearities, the authors concatenated user and item embeddings, and then fed them into a multi-layer perceptron (MLP) with a single task head predicting user engagement, like clicks. Both MLP weights and embedding weights (which user/item IDs are mapped to) were learned through backpropagation of loss gradients during model training.
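The difference between the two scoring functions can be sketched as follows. This is a toy illustration with random, untrained weights; the embedding size, the single 3-unit hidden layer, and all names are assumptions rather than the configuration used in the paper.

```python
import math
import random

random.seed(0)
dim = 4


def random_vector(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


# Hypothetical embeddings for one user and one item (learned in practice).
user_vec = random_vector(dim)
item_vec = random_vector(dim)


def mf_score(u, v):
    """Matrix factorization: the score is the inner product."""
    return sum(a * b for a, b in zip(u, v))


# A tiny MLP: 2*dim inputs -> 3 hidden ReLU units -> 1 sigmoid output.
W1 = [random_vector(2 * dim) for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
w2 = random_vector(3)
b2 = 0.0


def ncf_score(u, v):
    """NCF-style scoring: concatenate the embeddings and apply an MLP."""
    x = u + v  # list concatenation of user and item embeddings
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


print(mf_score(user_vec, item_vec))
print(ncf_score(user_vec, item_vec))
```

The inner product is fixed and linear in each embedding, while the MLP has its own trainable weights and can therefore represent non-linear user-item interaction functions.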

  • The hypothesis underpinning NCF posits that user-item interactions are non-linear, contrary to the linear assumption in matrix factorization.
  • The figure below from the paper illustrates the neural collaborative filtering framework.

  • NCF proved the value of replacing (then standard) linear matrix factorization algorithms with a neural network. With a relatively simple 4-layer neural network, NCF proved that there’s immense value in applying deep neural networks in recommender systems, marking the pivotal transition away from matrix factorization and towards deep recommenders. The authors were able to beat the best matrix factorization algorithms at the time by 5% hit rate on the Movielens and Pinterest benchmark datasets. Empirical evidence showed that using deeper layers of neural networks offers better recommendation performance.
  • Despite its revolutionary impact, NCF lacked an important ingredient that turned out to be extremely important for the success of recommenders: cross features, a concept popularized by the Wide & Deep paper described above.


  • NCF proved the value of replacing (then standard) linear matrix factorization algorithms with a neural network.
  • The NCF framework, which is both generic and capable of expressing and generalizing matrix factorization, utilized a multi-layer perceptron to imbue the model with non-linear capabilities.
  • With a relatively simple 4-layer neural network, they were able to beat the best matrix factorization algorithms at the time by 5% hit rate on the Movielens and Pinterest benchmark datasets.

Deep and Cross Networks / DCN (2017)

  • Wide and Deep has proven the significance of cross features; however, it has a huge downside: the cross features need to be manually engineered, which is a tedious process that requires engineering resources, infrastructure, and domain expertise. Cross features à la Wide and Deep are expensive. They don’t scale.
  • The key idea of Deep and Cross Networks (DCN), introduced in Wang et al. (2017) by Google, is to replace the wide component in Wide and Deep with a “cross neural network”, a neural network dedicated to learning cross features of arbitrarily high order. However, note that DCN (similar to DeepFM) learns this in a brute-force manner simply by considering all possible combinations uniformly (i.e., it calculates all pair-wise interactions), while newer implementations such as AutoInt leverage self-attention to automatically determine the most informative feature interactions, i.e., which feature interactions to pay the most attention to (and which to ignore by setting the attention weights to zero).
  • Similar to Huawei’s DeepFM, introduced in Guo et al. (2017), DCN also replaces manual feature engineering in the wide component of Wide and Deep with a dedicated cross neural network that learns cross features. However, unlike DeepFM, the wide component is a cross neural network, instead of a so-called factorization machine layer.
  • DCN was designed to learn explicit and bounded-degree cross features more effectively. It starts with an input layer (typically an embedding layer), followed by a cross network containing multiple cross layers that models explicit feature interactions, and then combines with a deep network that models implicit feature interactions.
    • Cross Network: This is the core of DCN. It explicitly applies feature crossing at each layer, and the highest polynomial degree increases with layer depth. The following figure shows the \((i + 1)^{th}\) cross layer.

    • Deep Network: It is a traditional feedforward multilayer perceptron (MLP).
    • The deep network and cross network are then combined to form DCN (Wang et al. (2020)). As shown in the figure below, we could stack a deep network on top of the cross network (stacked structure); we could also place them in parallel (parallel structure).

  • What makes a cross neural network different from a standard MLP? As a reminder, in a (fully connected) MLP, each neuron in the next layer is a linear combination of all neurons in the previous layer, plus a bias term (the activation function is omitted here for brevity):
\[x_{l+1} = b_{l+1} + W\cdot x_l\]
  • The Cross Network helps in better generalizing on sparse features by learning explicit bounded-degree feature interactions. This is particularly useful for sparse data, where traditional deep learning models might struggle due to the high dimensionality and lack of explicit feature interaction modeling.

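  • By contrast, in the cross neural network each layer crosses the original input vector \(x_0\) with the current layer \(x_l\) and adds \(x_l\) back through a residual connection. In the original DCN formulation, with weight vector \(w_l\) and bias vector \(b_l\):

\[x_{l+1} = x_0 \, x_l^T w_l + b_l + x_l\]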
  • At the input, sparse features are transformed into dense vectors through an embedding procedure while dense features are normalized. These processed features are then combined into a single vector \(x_0\), which includes the stacked embedding vectors for the sparse features and the normalized dense features. This combined vector is then fed into the network.
  • Hence, a cross neural network of depth \(L\) will learn cross features in the form of polynomials of degrees up to \(L + 1\). The deeper the cross network, the higher the order of the interactions it learns.
  • The unified deep and cross model is trained jointly, with mean squared error (MSE) as its loss function.
  • For model evaluation, the Root Mean Squared Error (RMSE; lower is better) is reported, per TensorFlow: Deep & Cross Network (DCN).

  • The Deep and Cross Network (DCN) introduces a novel approach to handling feature interactions and dealing with sparse features. Let’s break down how DCN accomplishes these tasks:

DCN and Variants

  • The following section is referenced from ML Frontiers Substack.
  • DCN was one of the first algorithms to replace the manual engineering of cross features in Wide&Deep-like models with an algorithm that exhaustively computes all possible crosses. The cross layers in DCN have two free parameters, the weight vector w and the bias vector b.
  • DCN-V2 replaced DCN’s crossing vector w with a crossing matrix W, which makes the cross layers more expressive and allows us to get further performance gains, in particular when stacking more layers. While in DCN we saw performance plateau after 2 layers, DCN-V2 allows us to stack 4 layers or even more and still see performance improvements.
  • DCN-Mix replaces the crossing matrix W with a more expressive mixture of low-rank experts which are combined using a gating network, beating DCN-V2 on the Movielens benchmark dataset.
  • GDCN adds an information gate on top of each cross layer in DCN-V2 which controls how much weight the model should assign to each feature interaction, preventing the model from overfitting to noisy feature crosses. According to the referenced post, GDCN was the best-performing model on the Criteo benchmark at the time of writing.

Forming Higher-Order Feature Interactions

  1. Mechanism of the Cross Network: In a standard Multi-Layer Perceptron (MLP), each neuron in a layer is a linear combination of all neurons from the previous layer. The formula for this is typically \(x_{l+1} = b_{l+1} + W \cdot x_l\), where \(x_l\) is the input from the previous layer, \(W\) is the weight matrix, and \(b_{l+1}\) is the bias. However, in the Cross Network of DCN, the idea is to explicitly form higher-order interactions of features.

  2. Second-Order Combinations: In the Cross Network, the next layer is created by crossing the previous layer’s features with the original input \(x_0\). The formula used is \(x_{l+1} = x_0 x_l^T w_l + b_l + x_l\). This approach allows the network to automatically learn feature interactions (cross features) that are higher than first-order, which a standard MLP can only capture implicitly unless cross features are engineered by hand.

Handling Sparse Features through Embedding

  1. Sparse to Dense Transformation: Neural networks generally work better with dense input data. However, in many real-world applications, features are often sparse (like categorical data). DCN addresses this challenge by transforming sparse features into dense vectors through an embedding process.

  2. Embedding Process: This embedding is a technique where sparse, high-dimensional data (like one-hot encoded vectors) are converted into a lower-dimensional, continuous, and dense vector. Each unique category in the sparse feature is mapped to a dense vector, and these vectors are learned during the training process. This transformation is crucial because it enables the network to work with a dense representation of the data, which is more efficient and effective for learning complex patterns.
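
  • As a minimal sketch of the embedding step described above (the table sizes and the randomly initialized table are assumptions for illustration; in a real model the table is learned), multiplying a one-hot vector by an embedding table gives the same dense vector as simply reading the matching row:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 5, 3   # hypothetical sizes for illustration
# Stand-in for a learned embedding table: one dense row per category.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

category_id = 2                # e.g. the index of one genre

# One-hot route: a sparse, vocab_size-dimensional vector times the table.
one_hot = np.zeros(vocab_size)
one_hot[category_id] = 1.0
dense_via_matmul = one_hot @ embedding_table

# Lookup route: just read the corresponding row.
dense_via_lookup = embedding_table[category_id]

same = np.allclose(dense_via_matmul, dense_via_lookup)
```

  • The lookup route is what embedding layers do in practice, since it avoids materializing the high-dimensional one-hot vector at all.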

Explicit Feature Crossing and Polynomial Degree

  1. Explicit Feature Crossing: The Cross Network in DCN explicitly handles feature crossing at each layer. By doing this, it models interactions between different features directly, rather than relying on the deep network to implicitly capture these interactions.

  2. Increasing Polynomial Degree with Depth: As the depth of the Cross Network increases, the polynomial degree of the feature interactions also increases. This means that in deeper layers of the Cross Network, the model can capture more complex interactions (higher-order feature combinations). The network is essentially learning polynomials of features, where the degree of the polynomial increases with the depth of the network.

  3. Bounded-Degree Cross Features: The design of the Cross Network ensures that the degree of these polynomials is bounded and controlled by the depth of the network. This control is crucial to avoid an explosion in the complexity of the model, which could lead to overfitting and computational inefficiency.
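
  • As a small worked example of how the degree grows with depth (a sketch under simplifying assumptions: a single scalar feature \(x_0\), all weights \(w_l = 1\), all biases \(b_l = 0\), and the cross-layer form \(x_{l+1} = x_0 x_l^T w_l + b_l + x_l\) from the paper), each additional cross layer raises the highest polynomial degree by exactly one:

\[x_1 = x_0 \cdot x_0 + x_0 = x_0^2 + x_0\]
\[x_2 = x_0 \cdot x_1 + x_1 = x_0^3 + 2x_0^2 + x_0\]
\[x_3 = x_0 \cdot x_2 + x_2 = x_0^4 + 3x_0^3 + 3x_0^2 + x_0\]

  • With learned weights and vector-valued inputs the coefficients differ, but the highest degree is still fixed by the number of cross layers.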

  • DCN’s Cross Network forms higher-order feature interactions by explicitly crossing features at each layer, increasing the polynomial degree with the depth of the network. At the same time, it addresses the challenge of sparse features by embedding them into dense vectors, making them suitable for processing by the neural network. This design allows DCN to automatically and efficiently learn complex feature interactions without the need for manual feature engineering.

  • Integrating Outputs: The outputs from the Cross Network and the Deep Network are concatenated.
  • Final Prediction: The concatenated vector is then fed into a logits layer for the final prediction, such as in a classification task. This layer effectively combines the strengths of both explicit feature interactions and deep learned representations.

Input and Output to Each Component

  • Input to Cross and Deep Networks: Both networks take the same input vector, which is a combination of dense embeddings (from sparse features) and normalized dense features.
  • Output: The outputs of both networks are combined in the Combination Layer for the final model output.
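
  • The shared input vector described above can be sketched as follows. This is an illustrative assumption rather than the paper's code: the field names, table sizes, and normalization statistics are all hypothetical, and the embedding tables are random stand-ins for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables for two sparse fields (learned in a real model).
genre_table = rng.normal(size=(4, 3))        # 4 genres -> 3-dim vectors
occupation_table = rng.normal(size=(6, 2))   # 6 jobs   -> 2-dim vectors

def build_x0(genre_id, occupation_id, dense, dense_mean, dense_std):
    """Stack the embedding vectors and the normalized dense features into x_0."""
    normalized = (dense - dense_mean) / dense_std
    return np.concatenate([
        genre_table[genre_id],
        occupation_table[occupation_id],
        normalized,
    ])

# Dense features: [age, average rating]; mean/std are made-up training statistics.
x0 = build_x0(
    genre_id=1,
    occupation_id=4,
    dense=np.array([30.0, 4.5]),
    dense_mean=np.array([35.0, 3.5]),
    dense_std=np.array([10.0, 1.0]),
)
```

  • Here \(x_0\) has 3 + 2 + 2 = 7 entries, and the same vector would be passed to both the cross network and the deep network.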

  • Based on the paper, the architecture and composition of each layer in the Cross and Deep Networks of the Deep & Cross Network (DCN) are as follows:

Cross Network Layers

Each layer in the Cross Network is defined by the following formula: \(x_{l+1} = x_0 x_l^{T} w_l + b_l + x_l\)

  • Inputs and Outputs: \(x_l\) and \(x_{l+1}\) are the outputs from the \(l\)-th and \((l + 1)\)-th cross layers respectively, represented as column vectors.
  • Weight and Bias Parameters: Each layer has its own weight (\(w_l\)) and bias (\(b_l\)) parameters, which are learned during training.
  • Feature Crossing Function: The feature crossing function is represented by \(f(x_l, w_l, b_l)\), and it is designed to fit the residual of \(x_{l+1} - x_l\). This function captures interactions between the features.
  • Residual Connection: Each layer adds back its input after the feature crossing, which helps in preserving the information and building upon the previous layer’s output.
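
  • A minimal NumPy sketch of one cross layer, using the form \(x_{l+1} = x_0 x_l^{T} w_l + b_l + x_l\) given above. The weights are hypothetical values picked to keep the arithmetic easy to follow, not learned parameters.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl.

    x0, xl, w, b are 1-D arrays of the same length d. Computing the scalar
    xl . w first avoids materializing the d x d outer product x0 xl^T.
    """
    return x0 * np.dot(xl, w) + b + xl

x0 = np.array([1.0, 2.0])
w = np.array([1.0, 0.0])   # hypothetical weights chosen for easy arithmetic
b = np.zeros(2)

x1 = cross_layer(x0, x0, w, b)   # the first layer takes x0 as its input
x2 = cross_layer(x0, x1, w, b)

# Same result when the outer product is formed explicitly.
x2_explicit = np.outer(x0, x1) @ w + b + x1
```

  • Because \(x_l^{T} w_l\) is a scalar, a cross layer costs time and memory linear in the input dimension, which is why stacking several of them stays cheap.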

Deep Network Layers

Each layer in the Deep Network is structured as a standard fully-connected layer and is defined by the following formula: \(h_{l+1} = f(W_l h_l + b_l)\)

  • Inputs and Outputs: \(h_l\) and \(h_{l+1}\) are the \(l\)-th and \((l + 1)\)-th hidden layers’ outputs respectively.
  • Weight and Bias Parameters: Similar to the cross layer, each deep layer has its own weight matrix (\(W_l\)) and bias vector (\(b_l\)).
  • Activation Function: The function \(f(\cdot)\) is typically a non-linear activation function, such as ReLU (Rectified Linear Unit), which introduces non-linearity into the model, allowing it to learn complex patterns in the data.


  • Cross Network Layers: These layers are specifically designed for efficient feature crossing, capturing interactions between different input features at each layer. They employ a unique operation combining linear transformation, feature interaction, and residual connections.
  • Deep Network Layers: These are standard fully-connected layers that use weights, biases, and non-linear activation functions to learn abstract representations and complex patterns in the data.
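
  • Putting the two layer types together, a parallel-structure forward pass for a single example might look like the sketch below. The layer sizes, the random initialization, and the single sigmoid output are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dcn_forward(x0, cross_params, deep_params, w_out, b_out):
    """Parallel-structure DCN forward pass for a single example.

    cross_params: list of (w, b) pairs, each of shape (d,)
    deep_params:  list of (W, b) pairs for the fully-connected layers
    w_out, b_out: logits layer applied to [cross output, deep output]
    """
    # Cross network: explicit feature crossing with a residual connection.
    x = x0
    for w, b in cross_params:
        x = x0 * np.dot(x, w) + b + x

    # Deep network: standard ReLU MLP on the same input x0.
    h = x0
    for W, b in deep_params:
        h = relu(W @ h + b)

    # Combination layer: concatenate both outputs, then a single logit.
    combined = np.concatenate([x, h])
    return sigmoid(np.dot(w_out, combined) + b_out)

rng = np.random.default_rng(0)
d, hidden = 4, 8   # hypothetical sizes
x0 = rng.normal(size=d)
cross_params = [(rng.normal(size=d) * 0.1, np.zeros(d)) for _ in range(2)]
deep_params = [
    (rng.normal(size=(hidden, d)) * 0.1, np.zeros(hidden)),
    (rng.normal(size=(hidden, hidden)) * 0.1, np.zeros(hidden)),
]
w_out = rng.normal(size=d + hidden) * 0.1
p = dcn_forward(x0, cross_params, deep_params, w_out, 0.0)
```

  • A stacked variant would instead feed the cross network's output `x` into the deep network rather than running both on `x0`.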


  • Compared to a model with just the deep component, DCN achieves a statistically significant 0.1% lower logloss on the Criteo display ads benchmark dataset. And that’s without any of the manual feature engineering that Wide and Deep requires! (It would have been nice to see a comparison between DCN and Wide and Deep. Alas, the authors of DCN didn’t have a good method to manually create cross features for the Criteo dataset, and hence skipped this comparison.)
  • The Deep and Cross Network (DCN) architecture includes a cross network component that captures cross-feature interactions. It combines a deep network with cross layers, allowing the model to learn explicit feature interactions and capture non-linear relationships between features.


  • DCN showed that we can get even more performance gains by replacing manual engineering of cross features with an algorithmic approach that automatically creates all possible feature crosses up to an order set by the number of cross layers. Compared to a model with only the deep component, DCN achieved 0.1% lower logloss on the Criteo display ads benchmark dataset.
  • Pros: Captures explicit high-order feature interactions and non-linear relationships through cross layers, allowing for improved modeling of complex patterns.
  • Cons:
    • DCN creates feature crosses in a brute-force way, simply by considering all possible combinations. This is not only inefficient, it could also create feature crosses that aren’t helpful at all, and just make the model overfit.
    • More complex than simple feed-forward networks.
    • May not perform well on tasks where feature interactions aren’t important.
    • Increased model complexity, potential overfitting on sparse data.
  • Use case: Useful for tasks where high-order feature interactions are critical, such as CTR prediction and ranking tasks.
  • Example Use Case: Advertising platforms where understanding the interactions between user characteristics and ad features is essential for personalized ad targeting.
  • Phase: Ranking, Final Ranking.
  • Recommendation Workflow: The deep and cross architecture is typically applied in the ranking phase and the final ranking phase. The deep and cross network captures explicit feature interactions and non-linear relationships, enabling accurate ranking of candidate items based on user preferences. It contributes to the final ranking of candidate items, leveraging its ability to model complex patterns and interactions.

Music use case

  • Using the Deep & Cross Network (DCN) for a music recommender system involves several steps, from processing the input data to obtaining output recommendations. Here’s a step-by-step approach on how you would use DCN in this context:
  1. Data Preparation
    • Collect Data: Gather user data and music metadata. User data might include user IDs, past listening history, ratings, and demographic information. Music metadata could include song IDs, genres, artists, albums, and release years.
    • Feature Engineering:
      • Sparse Features: Categorical data like user IDs, song IDs, artist names, genres, etc., are considered sparse features. These will be transformed into dense vectors using embedding layers.
      • Dense Features: Numerical data like age, listening duration, and rating scores are dense features.
  2. Building the Model
    • Embedding Layer for Sparse Features:
      • Use embedding layers to transform sparse categorical features into dense vectors. For instance, map each user ID and song ID to a fixed-size embedding vector.
    • Deep Component:
      • Construct a series of dense layers. These layers will process both the dense features and the output of the embedding layers (the transformed sparse features).
      • Apply non-linear activation functions (like ReLU) in these layers to capture complex patterns in the data.
    • Cross Component:
      • Build the cross layers to model feature interactions. Each layer in the Cross Network explicitly captures interactions between the features.
      • The initial input to the Cross Network is the concatenated embeddings and normalized dense features.
    • Combining Deep and Cross Components:
      • Merge the outputs of the deep and cross components. This combination enables the model to leverage both deep feature transformations and explicit feature interactions.
  3. Training the Model
    • Compile the Model: Choose an appropriate loss function (like binary cross-entropy for predicting whether a user will like a song) and an optimizer (like Adam).
    • Input Data: Feed the processed input data (both embeddings of sparse features and dense features) into the model.
    • Train the Model: Use user-song interactions as training data. For instance, if a user has listened to or rated a song, these interactions are used as positive samples.
  4. Generating Recommendations
    • Model Prediction: Use the trained model to predict the likelihood of a user liking a particular song or set of songs.
    • Post-Processing: Sort the songs for each user based on the predicted likelihoods and recommend the top songs.
  5. Model Evaluation
    • Evaluate the model using metrics like accuracy, precision, recall, or more sophisticated ones like Mean Average Precision at K (MAP@K).
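
  • Steps 4 and 5 above can be sketched as follows. The scores are made-up model outputs, and the AP@K normalization used here (dividing by the smaller of K and the number of relevant items) is one common convention among several, so treat it as an assumption.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest-scoring songs, best first."""
    return np.argsort(-scores, kind="stable")[:k]

def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user; MAP@K is the mean of this value over users."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this cut-off
    return precision_sum / min(len(relevant), k)

# Hypothetical predicted "like" probabilities for five songs (IDs 0..4).
scores = np.array([0.10, 0.90, 0.30, 0.80, 0.50])
recommended = top_k(scores, k=3).tolist()
ap = average_precision_at_k(recommended, relevant=[1, 4], k=3)
```

  • In this toy case the user's two relevant songs land at ranks 1 and 3, so the ranking is rewarded for the early hit and penalized for the miss at rank 2.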
  • Example Use Case
  • Personalized Song Recommendations: For a given user, the model predicts which songs they would likely enjoy, based on their past interactions and the learned feature interactions.
  • Discovering New Music: The model can help users discover new songs or artists that they might not have found on their own, but are likely to enjoy based on their profile and listening history.
  • DCN’s ability to handle both sparse and dense features, along with its unique architecture that captures deep feature representations and explicit feature interactions, makes it well-suited for complex tasks like music recommendation. This model can effectively leverage the rich and varied data in music recommender systems to provide personalized and accurate recommendations.

AutoInt (2019)

  • Proposed in AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks by Song et al. from Peking University, Mila-Quebec AI Institute, and HEC Montreal in CIKM 2019.
  • The paper introduces AutoInt (short for “automated feature interaction learning”), a novel method for efficiently learning high-order feature interactions in an automated way. Developed to address the inefficiencies and overfitting problems in existing models like DCN and DeepFM, which create feature crosses in a brute-force manner, AutoInt leverages self-attention to determine the most informative feature interactions.
  • AutoInt employs a multi-head self-attentive neural network with residual connections, designed to explicitly model feature interactions in a 16-dimensional embedding space. It overcomes the limitations of prior models by focusing on relevant feature combinations, avoiding unnecessary and unhelpful feature crosses.
  • Processing Steps:
    1. Input Layer: Represents user profiles and item attributes as sparse vectors.
    2. Embedding Layer: Projects each feature into a 16-dimensional space.
    3. Interacting Layer: Utilizes several multi-head self-attention layers to automatically identify the most informative feature interactions. The attention mechanism is based on dot product for its effectiveness in capturing feature interactions.
    4. Output Layer: Uses the learned feature interactions for CTR estimation.
  • The goal of AutoInt is to map the original sparse and high-dimensional feature vector into low-dimensional spaces and meanwhile model the high-order feature interactions. As shown in the below figure, AutoInt takes the sparse feature vector \(x\) as input, followed by an embedding layer that projects all features (i.e., both categorical and numerical features) into the same low-dimensional space. Next, embeddings of all fields are fed into a novel interacting layer, which is implemented as a multi-head self-attentive neural network. For each interacting layer, high-order features are combined through the attention mechanism, and different kinds of combinations can be evaluated with the multi-head mechanisms, which map the features into different subspaces. By stacking multiple interacting layers, different orders of combinatorial features can be modeled. The output of the final interacting layer is the low-dimensional representation of the input feature, which models the high-order combinatorial features and is further used for estimating the clickthrough rate through a sigmoid function. The figure below from the paper shows an overview of AutoInt.

  • The figure below from the paper illustrates the input and embedding layer, where both categorical and numerical fields are represented by low-dimensional dense vectors.

  • AutoInt demonstrates superior performance over competitors like Wide and Deep and DeepFM on benchmark datasets like MovieLens and Criteo, thanks to its efficient handling of feature interactions.
  • The technical innovations in AutoInt consist of: (i) introduction of multi-head self-attention to learn which cross features really matter, replacing the brute-force generation of all possible feature crosses, and (ii) the model’s ability to learn important feature crosses such as Genre-Gender, Genre-Age, and RequestTime-ReleaseTime, which are crucial for accurate CTR prediction.
  • AutoInt showcases efficiency in processing large-scale, sparse, high-dimensional data, with a stack of 3 attention layers, each having 2 heads. The attention mechanism improves model explainability by highlighting relevant feature interactions, as exemplified in the attention matrix learned on the MovieLens dataset.
  • AutoInt addresses the need for a model that is both powerful in capturing complex interactions and interpretable in its recommendations, without the inefficiency and overfitting issues seen in models that generate feature crosses in a brute-force manner.


  • This section is taken from ML Frontiers.
  • The key idea in DCN and DeepFM was to create feature crosses in a brute-force way, simply by considering all possible combinations. This is not only inefficient, it could also create feature crosses that aren’t helpful at all, and just make the model overfit.
  • What we need, then, is a way to determine automatically which feature interactions to pay attention to and which to ignore. We need – you’ve guessed it – self-attention!

AutoInt introduces the idea of multi-head self-attention in the context of recommender systems: instead of simply generating all possible pair-wise cross features in a brute-force way, we use the attention mechanism to learn which cross features really matter.

  • That was the key insight behind AutoInt, short for “automated feature interaction learning”, proposed by Song et al. (2019) from Peking University, China. In particular, the authors first project each individual feature into a 16-dimensional embedding space and then pass these embeddings into a stack of several multi-head self-attention layers that automatically create the most informative feature interactions. The inputs going into the key, query, and value matrices are simply the list of all feature embeddings, and the attention function is simply the dot product, “due to its simplicity and effectiveness” in capturing feature interactions.
  • This sounds complicated, but really there’s no magic here – just a bunch of matrix multiplications. As a concrete example, here’s the attention matrix that one of the attention heads in AutoInt learns on the MovieLens benchmark dataset:

  • The model learns that the feature crosses formed by Genre-Gender, Genre-Age, and RequestTime-ReleaseTime are important, which are all marked in green. This makes sense: men usually prefer different movies than women, and kids prefer different movies than adults. What about the RequestTime-ReleaseTime cross feature? It simply encodes movie freshness, at the time of the training example.
  • Using a stack of three attention layers with two heads each, the authors of AutoInt were able to beat a host of competitors, including Wide and Deep and DeepFM, on the MovieLens and Criteo benchmark datasets.
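
  • The interacting layer described above can be sketched in NumPy as below. This is a simplified illustration, not the authors' code: the weights are random rather than learned, and the number of fields and the per-head size of 8 are assumptions (the 16-dimensional embeddings and two heads follow the text).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def interacting_layer(E, Wq, Wk, Wv, W_res):
    """One multi-head self-attention layer over field embeddings.

    E:          (num_fields, d), one embedding per feature field
    Wq, Wk, Wv: (num_heads, d, d_head) per-head projections
    W_res:      (d, num_heads * d_head) projection for the residual connection
    Returns the new field representations and the attention weights.
    """
    Q = np.einsum("fd,hde->hfe", E, Wq)        # (heads, fields, d_head)
    K = np.einsum("fd,hde->hfe", E, Wk)
    V = np.einsum("fd,hde->hfe", E, Wv)

    scores = Q @ K.transpose(0, 2, 1)          # dot-product attention: (heads, fields, fields)
    attn = softmax(scores, axis=-1)            # each row sums to 1
    heads = attn @ V                           # (heads, fields, d_head)

    # Concatenate the heads per field, then add the residual and apply ReLU.
    concat = heads.transpose(1, 0, 2).reshape(E.shape[0], -1)
    out = np.maximum(concat + E @ W_res, 0.0)
    return out, attn

rng = np.random.default_rng(0)
num_fields, d, num_heads, d_head = 4, 16, 2, 8
E = rng.normal(size=(num_fields, d))
Wq, Wk, Wv = (rng.normal(size=(num_heads, d, d_head)) * 0.1 for _ in range(3))
W_res = rng.normal(size=(d, num_heads * d_head)) * 0.1

out, attn = interacting_layer(E, Wq, Wk, Wv, W_res)
```

  • Each `attn[h]` is a field-by-field matrix of the kind shown for MovieLens: entry (m, k) says how strongly field m attends to field k in head h. Stacking several such layers lets the model combine already-combined fields, which is how higher-order interactions arise.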

DLRM (2019)

  • Let’s turn to Meta’s DLRM (“deep learning recommendation model”) architecture, proposed in Naumov et al. (2019), another important milestone in recommender system modeling.
  • This paper by Naumov et al. from Facebook in 2019 introduces the DLRM (deep learning recommendation model) architecture, a significant development in recommender system modeling, which was open-sourced in both PyTorch and Caffe2 frameworks.
  • Contrary to the “deep learning” part in its name, DLRM represents a progression from the DeepFM architecture, maintaining the FM (factorization machine) component while discarding the deep neural network part. The fundamental hypothesis of DLRM is that interactions are paramount in recommender systems, which can be modeled using shallow MLPs (and complex deep learning components are thus not essential).
  • The DLRM model handles continuous (dense) and categorical (sparse) features that describe users and products. DLRM exercises a wide range of hardware and system components, such as memory capacity and bandwidth, as well as communication and compute resources as shown in the figure below from the paper.

  • The figure below from the paper shows the overall structure of DLRM.

  • DLRM uniquely handles both continuous (dense) and categorical (sparse) features that describe users and products, projecting them into a shared embedding space. These features are then passed through MLPs before and after computing pairwise feature interactions (dot products). This method significantly differs from other neural network-based recommendation models in its explicit computation of feature interactions and treatment of each embedded feature vector as a single unit, contrasting with approaches like Deep and Cross which consider each element in the feature vector separately.

DLRM shows that interactions are all you need: it’s akin to using just the FM component of DeepFM but with MLPs added before and after the interactions to increase modeling capacity.

  • The architecture of DLRM includes multiple MLPs, which are added to increase the model’s capacity and expressiveness, enabling it to model more complex interactions. This aspect is critical as it allows for fitting data with higher precision, given adequate parameters and depth in the MLPs.
  • Compared to other DL-based approaches to recommendation, DLRM differs in two ways. First, it computes the feature interactions explicitly while limiting the order of interaction to pairwise interactions. Second, DLRM treats each embedded feature vector (corresponding to categorical features) as a single unit, whereas other methods (such as Deep and Cross) treat each element in the feature vector as a new unit that should yield different cross terms. These design choices help reduce computational/memory cost while maintaining competitive accuracy.
  • A key contribution of DLRM is its specialized parallelization scheme, which utilizes model parallelism on the embedding tables to manage memory constraints and exploits data parallelism in the fully-connected layers for computational scalability. This approach is particularly effective for systems with diverse hardware and system components, like memory capacity and bandwidth, as well as communication and compute resources.
  • The paper demonstrates that DLRM surpasses the performance of the DCN model on the Criteo dataset, validating the authors’ hypothesis about the predominance of feature interactions. Moreover, DLRM has been characterized for its performance on the Big Basin AI platform, proving its utility as a benchmark for future algorithmic experimentation, system co-design, and benchmarking in the field of deep learning-based recommendation models.
  • Facebook AI post.


  • The key idea behind DLRM is to take the approach from DeepFM but only keep the FM part, not the Deep part, and expand on top of that. The underlying hypothesis is that the interactions of features are really all that matter in recommender systems. “Interactions are all you need!”, you may say.
  • The deep component is not really needed. DLRM uses a bunch of MLPs to model feature interactions. Under the hood, DLRM projects all sparse and dense features into the same embedding space, passes them through MLPs (blue triangles in the above figure), computes all pairs of feature interactions (the cloud), and finally passes this interaction signal through another MLP (the top blue triangle). The interactions here are simply dot products, just like in DeepFM.
  • The key difference to the DeepFM’s “FM” though is the addition of all these MLPs, the blue triangles. Why do we need those? Because they’re adding modeling capacity and expressiveness, allowing us to model more complex interactions. After all, one of the most important rules in neural networks is that given enough parameters, MLPs with sufficient depth and width can fit data to arbitrary precision!
  • In the paper, the authors show that DLRM beats DCN on the Criteo dataset. The authors’ hypothesis proved to be true. Interactions, it seems, may really be all you need.
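
  • The DLRM data flow described above (bottom MLP, embedding lookups, pairwise dot products, top MLP) can be sketched as follows. This is a toy illustration with hypothetical sizes and random weights, not the open-sourced implementation.

```python
import numpy as np

def mlp(x, layers):
    """ReLU MLP; `layers` is a list of (W, b) pairs."""
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)
    return x

def pairwise_dots(feats):
    """Dot product of every unordered pair of rows, each row treated as one unit."""
    gram = feats @ feats.T
    rows, cols = np.triu_indices(len(feats), k=1)
    return gram[rows, cols]

def dlrm_forward(dense, sparse_ids, tables, bottom, top, w_out):
    # Bottom MLP maps the dense features into the embedding space.
    dense_vec = mlp(dense, bottom)
    # One embedding lookup per categorical feature.
    sparse_vecs = [table[i] for table, i in zip(tables, sparse_ids)]

    # Explicit second-order interactions between all feature vectors.
    interactions = pairwise_dots(np.stack([dense_vec] + sparse_vecs))

    # Top MLP consumes the dense vector plus the interaction terms.
    hidden = mlp(np.concatenate([dense_vec, interactions]), top)
    return 1.0 / (1.0 + np.exp(-np.dot(w_out, hidden)))

rng = np.random.default_rng(0)
d, num_dense = 4, 3                                   # hypothetical sizes
tables = [rng.normal(size=(10, d)), rng.normal(size=(20, d))]
bottom = [(rng.normal(size=(d, num_dense)) * 0.1, np.zeros(d))]
num_pairs = 3 * 2 // 2                                # 3 feature vectors -> 3 pairs
top = [(rng.normal(size=(8, d + num_pairs)) * 0.1, np.zeros(8))]
w_out = rng.normal(size=8) * 0.1

p = dlrm_forward(rng.normal(size=num_dense), [7, 13], tables, bottom, top, w_out)
```

  • Note that the number of interaction terms grows with the number of features, not with the embedding dimension, because each embedded vector is treated as a single unit.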

DCN V2 (2020)

  • Proposed in DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems by Wang et al. from Google, DCN-V2 is an enhanced version of the Deep & Cross Network (DCN), designed to effectively learn feature interactions in large-scale learning to rank (LTR) systems.
  • The paper addresses DCN’s limited expressiveness in learning predictive feature interactions, especially in web-scale systems with extensive training data.
  • DCN-V2 is focused on the efficient and effective learning of predictive feature interactions, a crucial aspect of applications like search, recommendation systems, and computational advertising. It tackles the inefficiency of traditional methods, including manual identification of feature crosses and reliance on deep neural networks (DNNs) for higher-order feature crosses.
  • The embedding layer in DCN-V2 processes both categorical (sparse) and dense features, supporting various embedding sizes, essential for industrial-scale applications with diverse vocabulary sizes.
  • The core of DCN-V2 is its cross layers, which explicitly create feature crosses. These layers are based on a base layer with original features, utilizing learned weight matrices and bias vectors for each cross layer.
  • The figure below from the paper visualizes a cross layer.

  • As shown in the figure below, DCN-V2 employs a novel architecture that combines a cross network with a deep network. This combination is realized through two architectures: a stacked structure where the cross network output feeds into the deep network, and a parallel structure where outputs from both networks are concatenated. The cross operation in these layers is represented as \(\mathrm{x}_{l+1}=\mathrm{x}_0 \odot\left(W_l \mathrm{x}_l+\mathrm{b}_l\right)+\mathrm{x}_l\).

A key feature of DCN-V2 is the use of low-rank techniques to approximate feature crosses in a subspace, improving performance and reducing latency. This is further enhanced by a Mixture-of-Experts architecture, which decomposes the matrix into multiple smaller sub-spaces aggregated through a gating mechanism.

  • DCN-V2 demonstrates superior performance in extensive studies and comparisons with state-of-the-art algorithms on benchmark datasets like Criteo and MovieLens-1M. It offers significant gains in offline accuracy and online business metrics in Google’s web-scale LTR systems.
  • The paper also delves into polynomial approximation from both bitwise and feature-wise perspectives, illustrating how DCN-V2 creates feature interactions up to a certain order with a given number of cross layers, thus being more expressive than the original DCN.
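
  • A minimal sketch of the DCN-V2 cross operation \(\mathrm{x}_{l+1}=\mathrm{x}_0 \odot\left(W_l \mathrm{x}_l+\mathrm{b}_l\right)+\mathrm{x}_l\), with hypothetical hand-picked weights. It also checks that when every row of \(W_l\) equals the same vector \(w\) and the bias is zero, the layer gives the same output as the original DCN layer \(x_0 x_l^T w + x_l\).

```python
import numpy as np

def cross_layer_v2(x0, xl, W, b):
    """DCN-V2 cross layer: x_{l+1} = x0 * (W xl + b) + xl (element-wise product)."""
    return x0 * (W @ xl + b) + xl

x0 = np.array([1.0, 2.0])
b = np.zeros(2)

# Hypothetical weights that swap the two coordinates before the element-wise product.
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
x1 = cross_layer_v2(x0, x0, W, b)

# Special case: every row of W equals w, which matches the original DCN layer.
w = np.array([1.0, 0.0])
x1_shared_rows = cross_layer_v2(x0, x0, np.outer(np.ones(2), w), b)
x1_dcn_v1 = x0 * np.dot(x0, w) + b + x0
```

  • With a full matrix, each output coordinate gets its own weighting of the inputs, which is the extra expressiveness over the vector-based layer.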

Architecture changes

  • In DCN V2, several specific architectural changes were made to enhance its performance and efficiency, particularly in the cross layers. Here are the detailed aspects of how these changes enable the model to capture a wider range of interactions:
  1. Mixture of Low-Rank Cross Layers:
    • DCN V2 introduces a mixture of low-rank cross layers. This means that instead of using full-rank matrices (which can be computationally expensive and might overfit), the model employs low-rank matrices in the cross layers.
    • Low-Rank Approximation: This involves representing the weight matrices in the cross layers using a factorization approach, where a weight matrix is approximated as the product of two smaller matrices. This reduces the number of parameters and computational complexity.
    • Effect on Feature Interactions: By using low-rank matrices, the model efficiently captures the essential interactions without the overhead of full-rank operations. This approach strikes a balance between model expressiveness and computational efficiency, particularly beneficial for large-scale applications.
  2. Enhanced Expressiveness in Cross Network:
    • Modifying Cross Layer Operations: The cross network in DCN V2 replaces the weight vector of the original DCN cross layer with a weight matrix \(W_l\) and combines it with the input through an element-wise product, \(\mathrm{x}_{l+1}=\mathrm{x}_0 \odot\left(W_l \mathrm{x}_l+\mathrm{b}_l\right)+\mathrm{x}_l\), so that each layer can weight every pair of input elements separately.
    • Capturing Higher-Order Interactions: Adjustments in the cross layer operations enable the model to capture higher-order interactions more effectively. This is crucial for dealing with complex and high-dimensional data where simple pairwise interactions are not sufficient.
    • Mixture of Low-Rank Cross Layers:
      • Background: In the basic DCN V2 cross layer, each layer uses a full \(d \times d\) weight matrix \(W\) for the feature crossing operation (the original DCN used only a weight vector \(w\)). While effective, this can be computationally intensive and less efficient for large-scale data.
      • Low-Rank Approach in DCN V2: DCN V2 introduces low-rank matrices in the cross layers. A low-rank matrix can be represented as the product of two smaller matrices (\(U\) and \(V\)), such that the original weight matrix \(W\) is approximated by \(U V^T\).
      • Implication: This means that the feature crossing operation, which originally involved the full matrix \(W\), now utilizes this low-rank approximation. The operation becomes more efficient in terms of computation while still maintaining the ability to capture essential feature interactions.
    • Capturing Higher-Order Interactions:
      • Original Operation: In the original DCN, a cross layer performs feature crossing by taking the outer product of the input vector \(x_0\) with the current layer \(x_l\) and then projecting it with the weight vector \(w_l\). In the first layer, this captures second-order interactions.
      • Enhancement in DCN V2: With low-rank matrices, the model can still effectively capture these interactions but in a more computationally efficient manner. The low-rank approximation allows the model to handle more complex interactions without the quadratic growth in parameters and computation that a full \(d \times d\) matrix incurs. This is crucial in high-dimensional data, where the number of potential feature interactions can be very large.
  3. Stacked and Parallel Structures:
    • Stacked Structure: In this structure, the model processes data through the cross network and then the deep network sequentially. This allows the deep network to further refine and process the feature interactions captured by the cross network.
    • Parallel Structure: Here, the cross and deep networks operate in parallel, and their outputs are combined at the end. This allows the model to learn from both explicit (cross network) and implicit (deep network) feature interactions simultaneously and then combine these insights.
  • DCN V2, with its introduction of low-rank cross layers and potential modifications to the cross layer operations, enhances its ability to model complex feature interactions more efficiently. The choice between stacked and parallel structures offers flexibility in how these interactions are processed and combined, making DCN V2 adaptable to a variety of data characteristics and application requirements. These specific architectural advancements position DCN V2 as a more effective and efficient model for handling web-scale data.
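  • The low-rank cross layer and the two ways of combining it with a deep network can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the DCN V2 cross layer form \(x_{l+1} = x_0 \odot (W x_l + b) + x_l\), and the sizes `d` and `r`, the helper names, and the tiny ReLU `mlp` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # input width and rank (illustrative values, r << d)


def cross_layer_full(x0, xl, W, b):
    """Full-matrix cross layer: x_{l+1} = x0 * (W @ xl + b) + xl."""
    return x0 * (W @ xl + b) + xl


def cross_layer_low_rank(x0, xl, U, V, b):
    """Low-rank cross layer: W is replaced by U @ V.T, applied right to left."""
    return x0 * (U @ (V.T @ xl) + b) + xl


def mlp(x, weights):
    """Tiny ReLU MLP standing in for the deep network."""
    for W in weights:
        x = np.maximum(W @ x, 0.0)
    return x


x0 = rng.normal(size=d)
U = rng.normal(size=(d, r))
V = rng.normal(size=(d, r))
b = np.zeros(d)

# The low-rank layer matches a full layer whose matrix is exactly U @ V.T ...
full_out = cross_layer_full(x0, x0, U @ V.T, b)
low_rank_out = cross_layer_low_rank(x0, x0, U, V, b)

# ... but it stores 2*d*r weights instead of d*d.
full_params, low_rank_params = d * d, 2 * d * r

deep_weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]

# Stacked: the cross network's output is fed into the deep network.
stacked = mlp(low_rank_out, deep_weights)

# Parallel: both networks read x0 and their outputs are concatenated.
parallel = np.concatenate([low_rank_out, mlp(x0, deep_weights)])
```

  • Computing `V.T @ xl` before multiplying by `U` is what avoids ever forming the \(d \times d\) matrix; with the toy sizes above the layer holds 32 weights instead of 64.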

Music use case

  • Creating a music recommender system using DCN V2 involves several steps, from data preparation to model deployment. Here’s a detailed use case illustrating how DCN V2 can be effectively utilized for this purpose:
  1. Data Collection and Preparation:
    • Collect Data: Gather comprehensive data involving user interactions with music tracks. This data might include:
      • User Data: User demographics, historical listening data, ratings, and preferences.
      • Music Data: Track IDs, genres, artists, albums, release years, and other metadata.
    • Feature Engineering:
      • Categorical Features: User IDs, track IDs, artist names, genres (sparse features).
      • Numerical Features: User listening duration, frequency of listening to certain genres or artists (dense features).
  2. Model Architecture Setup:
    • Embedding Layer for Sparse Features:
      • Convert sparse categorical features into dense embeddings. For instance, create embeddings for user IDs and track IDs.
    • Deep Component of DCN V2:
      • Set up a series of dense layers for processing both dense features and embeddings from the sparse features.
    • Cross Component of DCN V2:
      • Implement the cross network with a mixture of low-rank cross layers to efficiently model explicit feature interactions.
    • Stacked or Parallel Structure:
      • Choose between a stacked or parallel architecture based on exploratory analysis and experimentation.
  3. Model Training:
    • Input Data: Process and feed the data into the model, including user-track interaction data.
    • Training Process:
      • Train the model with a loss function that matches the task (e.g., binary cross-entropy when predicting whether a user will enjoy a given track, or categorical cross-entropy when the task is framed as multi-class classification over music tracks).
      • Employ techniques like batch normalization, dropout, or regularization as needed to improve performance and reduce overfitting.
  4. Generating Music Recommendations:
    • Model Prediction: For a given user, use the model to predict the likelihood of them enjoying various tracks.
    • Recommendation Strategy:
      • Generate a list of recommended tracks for each user based on predicted likelihoods.
      • Consider personalizing recommendations based on user-specific data like historical preferences.
  5. Model Evaluation and Refinement:
    • Evaluation Metrics: Use accuracy, precision, recall, F1-score, or more complex metrics like Mean Average Precision at K (MAP@K) for evaluation.
    • Feedback Loop: Incorporate user feedback to refine and improve the model iteratively.
  6. Deployment and Scaling:
    • Deployment: Deploy the model in a production environment where it can handle real-time recommendation requests.
    • Scalability: Ensure the system is scalable to handle large numbers of users and tracks, leveraging the efficiency of the DCN V2 architecture.
  • Example Use Case:
    • Personalized Playlist Creation: For each user, the system generates a personalized playlist based on their unique preferences, historical listening habits, and interactions with different music tracks.
    • New Music Discovery: The system recommends new tracks and artists that the user might enjoy but hasn’t listened to yet, broadening their music experience.

  • Using DCN V2 for a music recommender system leverages the model’s ability to understand both explicit and implicit feature interactions, offering a powerful tool for delivering personalized music experiences. Its efficient architecture makes it suitable for handling the complexity and scale of music recommendation tasks.
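  • The evaluation step can be made concrete with a small MAP@K computation. This is a sketch under stated assumptions: the track IDs are made up, each recommendation list is assumed to contain no duplicates, and AP@K is normalized by `min(len(relevant), k)`, which is one common convention among several.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user: precision@i summed over the ranks i that are hits."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for i, track in enumerate(recommended[:k], start=1):
        if track in relevant:
            hits += 1
            precision_sum += hits / i
    # Normalization convention (an assumption): best achievable number of hits.
    return precision_sum / min(len(relevant), k)


def map_at_k(all_recommended, all_relevant, k):
    """Mean of AP@K over users."""
    scores = [
        average_precision_at_k(recs, rel, k)
        for recs, rel in zip(all_recommended, all_relevant)
    ]
    return sum(scores) / len(scores)


# Hypothetical ranked recommendations and held-out listens for two users.
recommended = [["t1", "t2", "t3"], ["t4", "t5", "t6"]]
relevant = [{"t1", "t3"}, {"t5"}]
score = map_at_k(recommended, relevant, k=3)
```

  • For the first user the hits at ranks 1 and 3 give \((1/1 + 2/3)/2\); for the second user the single hit at rank 2 gives \(1/2\), so ranking position matters and not just how many relevant tracks appear.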


  • Proposed in DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems by Wang et al. from Google. An enhanced version of the Deep & Cross Network (DCN), DCN-V2, effectively learns feature interactions in large-scale learning to rank (LTR) systems.
  • DCN-V2 addresses the limitations of the original DCN, particularly in web-scale systems with vast amounts of training data, where DCN exhibited limited expressiveness in its cross network for learning predictive feature interactions.
  • The paper focuses on efficient and effective learning of predictive feature interactions, which is crucial in applications like search, recommender systems, and computational advertising. Traditional approaches often involve manual identification of feature crosses or rely on deep neural networks (DNNs), which can be inefficient for higher-order feature crosses.
  • DCN-V2 includes an embedding layer that processes both categorical (sparse) and dense features. It supports different embedding sizes, crucial for industrial-scale applications with varying vocabulary sizes.
  • The core of DCN-V2 is its cross layers, which create explicit feature crosses. These layers are built upon a base layer containing original features and use learned weight matrices and bias vectors for each cross layer.
  • DCN-V2’s effectiveness is demonstrated through extensive studies and comparisons with state-of-the-art algorithms on benchmark datasets like Criteo and MovieLens-1M. It outperforms these algorithms and offers significant offline accuracy and online business metrics gains in Google’s web-scale LTR systems.
  • In summary, DCN V2 makes the cross network more expressive by giving each cross layer a learned weight matrix, and it keeps that expressiveness affordable by factorizing the matrix into low-rank components. The low-rank cross layers reduce the cost of computing feature interactions, making the network more efficient and scalable for complex, high-dimensional datasets while still capturing complex (including higher-order) feature interactions without the computational burden of full-rank operations.

DHEN (2022)

  • Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, the DHEN authors observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. This indicates that different designs may have different advantages and that the interactions they capture carry non-overlapping information.
  • Proposed in DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction, this paper by Zhang et al. from Meta introduces DHEN (Deep and Hierarchical Ensemble Network), a novel architecture designed for large-scale Click-Through Rate (CTR) prediction. The significance of DHEN lies in its ability to learn feature interactions effectively, a crucial aspect in the performance of online advertising services. Recognizing that different interaction models offer varying advantages and capture non-overlapping information, DHEN integrates a hierarchical ensemble framework with diverse interaction modules, including AdvancedDLRM, self-attention, Linear, Deep Cross Net, and Convolution. These modules enable DHEN to learn a hierarchy of interactions across different orders, addressing the limitations and variable performance of previous models on different datasets.
  • The following figure from the paper shows a two-layer two-module hierarchical ensemble (left) and its expanded details (right). A general DHEN can be expressed as a mixture of multiple high-order interactions. The dense feature input for the interaction modules is omitted in this figure for clarity.

  • In CTR prediction tasks, the feature inputs usually contain discrete categorical terms (sparse features) and numerical values (dense features). DHEN uses the same feature processing layer as DLRM, which is shown in the figure below. The sparse lookup tables map the categorical terms to a list of “static” numerical embeddings. Specifically, each categorical term is assigned a trainable \(d\)-dimensional vector as its feature representation. The numerical values, on the other hand, are processed by dense layers. The dense layers consist of several multi-layer perceptrons (MLPs) that output a \(d\)-dimensional vector. After concatenating the outputs of the sparse lookup tables and the dense layers, the final output of the feature processing layer \(X_0 \in \mathbb{R}^{d \times m}\) can be expressed as \(X_0=\left(x_0^1, x_0^2, \ldots, x_0^m\right)\), where \(m\) is the number of output embeddings and \(d\) is the embedding dimension.
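  • A minimal NumPy sketch of such a feature processing layer follows. The vocabulary sizes, the number of dense features, and the single two-layer MLP are hypothetical choices for illustration; only the overall shape of the output, \(X_0 \in \mathbb{R}^{d \times m}\), follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                     # embedding dimension
vocab_sizes = [100, 50]   # hypothetical vocabularies, e.g. user ID and ad ID
num_dense = 3             # hypothetical number of numerical features

# One trainable lookup table per categorical feature.
tables = [rng.normal(size=(v, d)) for v in vocab_sizes]

# A small MLP that maps the numerical features to one d-dimensional vector.
W1 = rng.normal(size=(8, num_dense))
W2 = rng.normal(size=(d, 8))


def feature_processing(sparse_ids, dense_values):
    """Return X0 with shape (d, m): one column per output embedding."""
    sparse_embs = [table[i] for table, i in zip(tables, sparse_ids)]
    dense_emb = W2 @ np.maximum(W1 @ dense_values, 0.0)
    return np.stack(sparse_embs + [dense_emb], axis=1)


X0 = feature_processing(sparse_ids=[7, 42], dense_values=np.array([0.5, 1.0, -2.0]))
```

  • Here \(m = 3\): two embeddings come from the lookup tables and one from the dense MLP, so `X0` has shape `(4, 3)`.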

  • A key technical advancement in this work is the development of a co-designed training system tailored for DHEN’s complex, multi-layer structure. This system introduces the Hybrid Sharded Data Parallel, a novel distributed training paradigm. This approach not only caters to the deeper structure of DHEN but also significantly enhances training efficiency, achieving up to 1.2x better throughput compared to existing models.
  • Empirical evaluations on large-scale datasets for CTR prediction tasks demonstrate the effectiveness of DHEN. The model achieved a 0.27% improvement in Normalized Entropy (NE) over the state-of-the-art baseline, underlining its practical effectiveness. The paper also discusses improvements in training throughput and scaling efficiency, highlighting the system-level optimizations that make DHEN particularly adept at handling large and complex datasets in online advertising.


  • This section is taken from ML Frontiers.
  • In contrast to DCN, the feature interactions in DLRM are limited to second order (i.e., pairwise): they’re just dot products of all pairs of embeddings. Going back to the movie example (with features user, movie, actors, director), the second-order interactions would be user-movie, user-actor, user-director, movie-actor, movie-director, and actor-director. A third-order interaction would be something like user-movie-director, actor-actor-user, director-actor-user, and so on.
  • For example, certain users may be fans of Steven Spielberg-directed movies starring Tom Hanks, and there should be a cross feature for that! Alas, in standard DLRM, there isn’t. That’s a major limitation.
  • Enter DHEN, short for “Deep and Hierarchical Ensemble Network”. Proposed in Zhang et al. (2022), the key idea is to create a “hierarchy” of cross features that grows deeper with the number of DHEN layers, and so can include third, fourth, and arbitrarily high orders of interactions.
  • Here’s how DHEN works at a high level: suppose we have two input features going into DHEN, and let’s denote them by A and B. A 1-layer DHEN module would then create the entire hierarchy of cross features including the features themselves up to second order, namely:

    A, AxA, AxB, BxA, B, BxB,
    • where “x” is not just a single interaction but stands for a combination of the following 5 interactions:
      • dot product,
      • self-attention (similar to AutoInt),
      • convolution,
      • linear: y = Wx, or
      • the cross module from DCN.
  • Add another layer, and things start to get pretty complex:

    A, AxA, AxB, AxAxA, AxAxB, AxBxA, AxBxB,
    B, BxB, BxA, BxBxB, BxBxA, BxAxB, BxAxA,
    • where “x” stands for one of 5 interactions, resulting in 62 distinct signals! DHEN is a beast, and its computational complexity (due to its recursive nature) is a nightmare. In order to get it to work, the authors of the DHEN paper even invented a new distributed training paradigm called “Hybrid Sharded Data Parallel”, which achieves 1.2X higher throughput than the (then) state-of-the-art distributed learning algorithm.
  • But most importantly, DHEN works: in their experiments on internal click-through rate data, the authors measure a 0.27% improvement in NE compared to DLRM, using a stack of 8 DHEN layers. You may question whether such a seemingly small improvement in NE is worth such an enormous increase in complexity - alas, at a scale such as Meta’s, it probably is!
  • DHEN goes not just a step but one giant leap further than DLRM by introducing a hierarchy of feature interactions consisting of dot product, AutoInt-like self-attention, convolution, linear processing, and DCN-like crossing, that replace DLRM’s simple dot product.
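  • The idea of stacking layers that each ensemble several interaction modules can be sketched as follows. This is a heavily simplified illustration, not the paper's configuration: it assumes only two modules (linear and dot product), a sum as the ensemble, an identity shortcut, and a per-embedding normalization; all shapes and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3  # embedding dimension and number of embeddings


def linear_interaction(X, W):
    """Linear module: mixes the m embeddings with a learned (m, m) matrix."""
    return X @ W


def dot_product_interaction(X, W):
    """Dot-product module: all pairwise dot products, projected back to (d, m)."""
    gram = X.T @ X  # (m, m) pairwise dot products between embeddings
    return (W @ gram.reshape(-1)).reshape(X.shape)


def layer_norm(X, eps=1e-5):
    """Normalize each embedding (column) to zero mean and unit variance."""
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    return (X - mean) / (std + eps)


def dhen_layer(X, params):
    """One layer: sum of module outputs plus a shortcut, then normalization."""
    ensemble = (
        linear_interaction(X, params["linear"])
        + dot_product_interaction(X, params["dot"])
    )
    return layer_norm(ensemble + X)


def init_params():
    """Fresh (untrained) parameters for one layer."""
    return {
        "linear": rng.normal(size=(m, m)) * 0.1,
        "dot": rng.normal(size=(d * m, m * m)) * 0.1,
    }


X0 = rng.normal(size=(d, m))
X1 = dhen_layer(X0, init_params())  # second-order terms enter via the dot product
X2 = dhen_layer(X1, init_params())  # the next layer crosses those terms again
```

  • Because each layer takes the previous layer's output as input, crosses of crosses appear as layers are added, which is the hierarchy described above.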

GDCN (2023)

  • Proposed in the paper Towards Deeper, Lighter, and Interpretable Cross Network for CTR Prediction by Wang et al. (2023) from Fudan University and Microsoft Research Asia in CIKM ’23. The paper introduces the Gated Deep Cross Network (GDCN) and the Field-level Dimension Optimization (FDO) approach. GDCN aims to address significant challenges in Click-Through Rate (CTR) prediction for recommender systems and online advertising, specifically the automatic capture of high-order feature interactions, interpretability issues, and the redundancy of parameters in existing methods.
  • GDCN is inspired by DCN-V2 and consists of an embedding layer, a Gated Cross Network (GCN), and a Deep Neural Network (DNN). The GCN forms its core structure and captures explicit, bounded-degree high-order feature crosses/interactions. The GCN employs an information gate in each cross layer (each layer raising the interaction order by one) to dynamically filter and amplify important interactions. This gate controls the information flow, ensuring that the model focuses on relevant interactions. This approach not only allows for deeper feature crossing but also adds a layer of interpretability by identifying crucial interactions. The DNN, in turn, models implicit feature crosses.
  • GDCN is a generalization of DCN-V2, offering dynamic instance-based interpretability and the ability to utilize deeper cross features without a loss in performance.

The key difference from DCN-V2 is that DCN-V2 treats all cross features equally, while GDCN uses information gates for fine-grained control over feature importance.

  • GDCN transforms high-dimensional, sparse input into low-dimensional, dense representations. Unlike most CTR models, GDCN allows arbitrary embedding dimensions.
  • Two structures are proposed: GDCN-S (stacked) and GDCN-P (parallel). GDCN-S feeds the output of GCN into a DNN, while GDCN-P feeds the input vector in parallel into GCN and DNN, concatenating their outputs.
  • Alongside GDCN, the FDO approach focuses on optimizing the dimensions of each field in the embedding layer based on their importance. FDO addresses the issue of redundant parameters by learning independent dimensions for each field based on its intrinsic importance. This approach allows for a more efficient allocation of embedding dimensions, reducing unnecessary parameters and enhancing efficiency without compromising performance. FDO uses methods like PCA to determine optimal dimensions and only needs to be done once, with the dimensions applicable to subsequent model updates.
  • The following figure shows the architecture of the GDCN-S and GDCN-P. \(\otimes\) is the cross operation (a.k.a. the gated cross layer).

  • The following figure visualizes the gated cross layer. \(\odot\) is elementwise/Hadamard product, and \(\times\) is matrix multiplication.

  • Results indicate that GDCN, especially when paired with the FDO approach, outperforms state-of-the-art methods in terms of prediction performance, interpretability, and efficiency. GDCN was evaluated on five datasets (Criteo, Avazu, Malware, Frappe, ML-tag) using metrics like AUC and Logloss, showcasing the effectiveness and superiority of GDCN in capturing deeper high-order interactions. These experiments also demonstrate the interpretability of the GCN model and the successful parameter reduction achieved by the FDO approach. The datasets underwent preprocessing like feature removal for infrequent items and normalization. The comparison included various classes of CTR models and demonstrated GDCN’s effectiveness in handling high-order feature interactions without the drawbacks of overfitting or performance degradation observed in other models. GDCN achieves comparable or better performance with only a fraction (about 23%) of the original model parameters.
  • In summary, GDCN addresses the limitations of existing CTR prediction models by offering a more interpretable, efficient, and effective approach to handling high-order feature interactions, supported by the innovative use of information gates and dimension optimization techniques.
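  • A gated cross layer can be sketched in NumPy as a DCN-V2 cross term multiplied elementwise by a sigmoid gate, plus a residual: \(c_{l+1} = c_0 \odot (W^{(c)} c_l + b) \odot \sigma(W^{(g)} c_l) + c_l\). The names `W_cross` and `W_gate`, the width `d`, and the random weights are assumptions for illustration; this is a sketch of the mechanism, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # width of the concatenated embedding vector (illustrative)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_cross_layer(c0, cl, W_cross, b, W_gate):
    """c_{l+1} = c0 * (W_cross @ cl + b) * sigmoid(W_gate @ cl) + cl."""
    cross = c0 * (W_cross @ cl + b)  # feature crossing, as in DCN-V2
    gate = sigmoid(W_gate @ cl)      # information gate, values in (0, 1)
    return cross * gate + cl, gate


c0 = rng.normal(size=d)
W_cross = rng.normal(size=(d, d))
W_gate = rng.normal(size=(d, d))
b = np.zeros(d)

c1, gate = gated_cross_layer(c0, c0, W_cross, b, W_gate)

# What a plain DCN-V2 cross layer would produce (gate fixed to 1 everywhere).
ungated = c0 * (W_cross @ c0 + b) + c0
```

  • The gate is computed per instance from the layer input, so inspecting `gate` shows which crossed dimensions were passed through or suppressed for a given example, which is the basis of the interpretability claim.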

Graph Neural Networks-based RecSys Architectures

  • Graph Neural Networks (GNN) architectures utilize graph structures to capture relationships between users, items, and their interactions. GNNs propagate information through the user-item interaction graph, enabling the model to learn user and item representations that incorporate relational dependencies. This is particularly useful in scenarios with rich graph-based data.
    • Pros: Captures relational dependencies and propagates information through graph structures, enabling better modeling of complex relationships.
    • Cons: Requires graph-based data and potentially higher computational resources for training and inference.
    • Advantages: Improved recommendations by incorporating the rich relational information among users, items, and their interactions.
    • Example Use Case: Social recommendation systems, where user-user connections or item-item relationships play a significant role in personalized recommendations.
    • Phase: Candidate Generation, Ranking, Retrieval.
    • Recommendation Workflow: GNN architectures are suitable for multiple phases of the recommendation workflow. In the candidate generation phase, GNNs can leverage graph structures to capture relational dependencies and generate potential candidate items. In the ranking phase, GNNs can learn user and item embeddings that incorporate relational information, leading to improved ranking. In the retrieval phase, GNNs can assist in efficient retrieval of relevant items based on their graph-based representations.
  • For a detailed overview of GNNs in RecSys, please refer to the GNN primer.
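  • To make "propagating information through the user-item interaction graph" concrete, here is one propagation step in the style of LightGCN (a simplified technique chosen for illustration, not one named in this section). The toy interaction matrix, the embedding size, and the averaging of layers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interaction matrix: rows are users, columns are items, 1 = interaction.
R = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)
num_users, num_items = R.shape
n = num_users + num_items
dim = 4

# Adjacency matrix of the bipartite user-item graph.
A = np.zeros((n, n))
A[:num_users, num_users:] = R
A[num_users:, :num_users] = R.T

# Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes get weight 0.
deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1.0)), 0.0)
A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

E0 = rng.normal(size=(n, dim))  # initial user and item embeddings
E1 = A_norm @ E0                # each node aggregates its neighbors
final = (E0 + E1) / 2           # combine layers by averaging

user_emb, item_emb = final[:num_users], final[num_users:]
scores = user_emb @ item_emb.T  # (num_users, num_items) relevance scores
```

  • After one step, a user's embedding already mixes in the embeddings of the items they interacted with; further steps would reach users who share those items.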

Two Towers in RecSys

  • This section is inspired by ML Frontiers.
  • “One of the more popular architecture in personalization / RecSys is two tower network. The two towers of the network usually represent user tower (\(U\)) and candidate tower (\(C\)). The towers produce a dense vector (embedding representation) of \(U\) and \(C\) respectively. The final network is just a dot product or cosine similarity function.
  • Let’s consider the cost of executing user tower/network is \(u\) and cost of executing candidate tower is \(c\) and dot product is \(d\).
  • At request time, the cost of executing the whole network for ranking N candidates for one user: \(N*(u + c + d)\).
  • Since the user is fixed, you need to compute it only once. So, the cost becomes: \(u + N*(c+d)\). Embeddings could be cached. So, the final cost becomes \(u + N* d+ k\) when \(k\) is.” (source)
  • The image below (source) showcases this.
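  • The cost argument above can be sketched in NumPy. The tower below is a single hypothetical `tanh` layer, and the cached candidate embeddings are random stand-ins for precomputed candidate-tower outputs. Reading \(k\) as the cost of fetching cached embeddings is an assumption, since the quoted text leaves it undefined.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_candidates, user_feature_dim = 16, 1000, 32


def user_tower(features, W):
    """Hypothetical one-layer user tower producing a dense embedding."""
    return np.tanh(W @ features)


# Offline: the candidate tower runs once per item and its outputs are cached.
candidate_cache = rng.normal(size=(num_candidates, dim))

# Request time: run the user tower once (cost u) ...
W_user = rng.normal(size=(dim, user_feature_dim))
user_emb = user_tower(rng.normal(size=user_feature_dim), W_user)

# ... then score all N cached candidates with N dot products (cost N * d).
scores = candidate_cache @ user_emb
top_k = np.argsort(-scores)[:10]
```

  • The candidate tower never runs at request time in this sketch, which is why its cost \(c\) disappears from the per-request total.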

  • The two-tower architecture consists of two separate branches: a query tower and a candidate tower. The query tower learns user representations based on user history, while the candidate tower learns item representations based on item features. The two towers are typically combined in the final stage to generate recommendations.
    • Pros: Explicitly models user and item representations separately, allowing for better understanding of user preferences and item features.
    • Cons: Requires additional computation to learn and combine the representations from the query and candidate towers.
    • Advantages: Improved personalization by learning user and item representations separately, which can capture fine-grained preferences.
    • Example Use Case: Personalized recommendation systems where understanding the user’s historical behavior and item features separately is critical.
    • Phase: Candidate Generation, Ranking.
    • Recommendation Workflow: The two-tower architecture is often employed in the candidate generation and ranking phases. In the candidate generation phase, the two-tower architecture enables the separate processing of user and item features, capturing their respective representations. In the ranking phase, the learned representations from the query and candidate towers are combined to assess the relevance of candidate items to the user’s preferences.
  • The term “two-tower” is also used for a different design aimed at debiasing ranking models, which gained formal recognition in the machine learning community with Huawei’s 2019 PAL paper. That model was developed to address biases in ranking models, particularly position bias observed in recommender systems.
  • In this debiasing variant, the two separate “towers” are one for learning relevance (user/item interactions) and another for learning biases (like position bias). Their outputs are combined either multiplicatively or additively to yield the final output.
  • Examples of popular two-tower implementations:
    • Huawei’s PAL model uses a multiplicative approach to combine the outputs of the two towers, addressing position bias within the context of their app store.
    • YouTube’s “Watch Next” paper introduced an additive two-tower model, which not only addresses position bias but also incorporates other selection biases by using additional features like device type.
  • The two-tower model has been shown to significantly improve recommendation systems. For instance, Huawei’s PAL model demonstrated improvements in click-through and conversion rates by around 25%. YouTube’s model, by adding a shallow tower for bias learning, showed an improvement in their engagement metric.
  • Challenges and considerations:
    • A key challenge in two-tower models is ensuring that both towers learn independently during training, as relevance can confound the learning of position bias.
    • Techniques like Dropout have been applied to mitigate over-reliance on certain features, like position, and improve generalization.
  • The two-tower model approach is seen as a powerful method for building unbiased ranking models in recommender systems. It’s a research domain with substantial potential, indicating that the field is still far from reaching its full capability.
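  • The multiplicative and additive ways of combining a relevance tower with a bias tower can be sketched as follows. The logits are made-up numbers standing in for tower outputs, and dropping the bias tower at serving time is one common choice that this sketch assumes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def multiplicative_click_prob(relevance_logit, position_logit):
    """PAL-style: P(click) = P(seen | position) * P(click | seen, features)."""
    return sigmoid(position_logit) * sigmoid(relevance_logit)


def additive_click_prob(relevance_logit, bias_logit):
    """Additive style: the bias tower's logit is added to the main logit."""
    return sigmoid(relevance_logit + bias_logit)


relevance_logit = 1.2                          # hypothetical relevance-tower output
position_logits = np.array([0.0, -1.0, -2.0])  # hypothetical bias-tower outputs, positions 1..3

# Training-time click probabilities for the same item shown at each position.
training_mult = multiplicative_click_prob(relevance_logit, position_logits)
training_add = additive_click_prob(relevance_logit, position_logits)

# Serving time (assumption): only the relevance tower is used for ranking.
serving_score = sigmoid(relevance_logit)
```

  • In both variants the predicted click probability falls as the position gets worse, while the relevance score used for ranking stays the same.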

Split Network

  • A split network is a generalized version of a two tower network. The same optimization of embedding lookup holds here as well. Instead of a dot product, a simple neural network could be used to produce output.
  • The image below (source) showcases this.

  • In a split network architecture, different components of the recommendation model are split and processed separately. For example, the user and item features may be processed independently and combined in a later stage. This allows for parallel processing and efficient handling of large-scale recommender systems.
    • Pros: Enables parallel processing, efficient handling of large-scale systems, and flexibility in designing and optimizing different components separately.
    • Cons: Requires additional coordination and synchronization between the split components, potentially increasing complexity.
    • Advantages: Scalability, flexibility, and improved performance in handling large-scale recommender systems.
    • Example Use Case: Recommendation systems with a massive number of users and items, where parallel processing is crucial for efficient computation.
    • Phase: Candidate Generation, Ranking, Final Ranking.
    • Recommendation Workflow: The split network architecture can be utilized in various phases. During the candidate generation phase, the split network can be used to process user and item features independently, allowing efficient retrieval of potential candidate items. In the ranking phase, the split network can be employed to learn representations and capture interactions between the user and candidate items. Finally, in the final ranking phase, the split network can contribute to the overall ranking of the candidate items based on learned representations.
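  • A split network can be sketched by replacing the dot product with a small neural network over the separately computed user and candidate embeddings. The layer sizes and random weights are hypothetical, and the cached candidate embeddings are stand-ins for precomputed outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_candidates, hidden = 16, 1000, 8

user_emb = rng.normal(size=dim)                           # computed once per request
candidate_cache = rng.normal(size=(num_candidates, dim))  # precomputed and cached

# Small scoring network applied to each [user, candidate] pair.
W1 = rng.normal(size=(2 * dim, hidden)) * 0.1
w2 = rng.normal(size=hidden) * 0.1


def split_network_scores(user_emb, candidates):
    """Score every candidate with an MLP instead of a dot product."""
    user_rows = np.broadcast_to(user_emb, candidates.shape)
    pairs = np.concatenate([user_rows, candidates], axis=1)
    hidden_act = np.maximum(pairs @ W1, 0.0)
    return hidden_act @ w2


scores = split_network_scores(user_emb, candidate_cache)
```

  • The embedding lookups are still shared across candidates; only the final combination step costs more than a dot product.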


  • Neural Collaborative Filtering (NCF) represents a pioneering approach in recommender systems. It was one of the initial studies to replace the then-standard linear matrix factorization algorithms with neural networks, thus facilitating the integration of deep learning into recommender systems.
  • The Wide & Deep model underscored the significance of cross features—specifically, second-order features formed by intersecting two original features. This model effectively combines a broad, shallow module for handling cross features with a deep module, paralleling the approach of NCF.
  • Deep & Cross Network (DCN) was among the first to transition from manually engineered cross features to an algorithmic method capable of autonomously generating all potential feature crosses to any desired order.
  • Deep Factorization Machine (DeepFM) shares conceptual similarities with DCN. However, it distinctively substitutes the cross layers in DCN with factorization machines, or more specifically, dot products.
  • Automatic Interactions (AutoInt) brought multi-head self-attention mechanisms, previously known in Large Language Models (LLMs), into the domain of feature interaction. This technique moves away from brute-force generation of all possible feature interactions, which can lead to model overfitting on noisy feature crosses. Instead, it employs attention mechanisms to enable the model to selectively focus on the most relevant feature interactions.
  • Deep Learning Recommendation Model (DLRM) marked a departure from previous models by discarding the deep module. It relies solely on an interaction layer that computes dot products, akin to the factorization machine component in DeepFM, followed by a Multi-Layer Perceptron (MLP). This model emphasizes the sufficiency of interaction layers alone.
  • Deep and Hierarchical Ensemble Network (DHEN) builds upon the DLRM framework by replacing the conventional dot product with a sophisticated hierarchy of feature interactions, including dot product, convolution, linear processing, self-attention akin to AutoInt, and crossing features similar to those in DCN.
  • Gated Deep Cross Network (GDCN) enhances Click-Through Rate (CTR) prediction in recommender systems by improving interpretability, efficiency, and handling of high-order feature interactions.
  • The Two Towers model in recommender systems, known for its separate user and candidate towers, optimizes personalized recommendations and addresses biases like position bias, representing an evolving and powerful approach in building unbiased ranking models.