• The concept of Multi-Armed Bandits (MAB) is inspired from the idea of gambling in a casino. MAB is a classic problem in probability theory and decision-making that involves a gambler who faces a row of slot machines, or “one-armed bandits,” each with a different unknown probability of winning. The gambler must decide which machines to play in order to maximize their total reward over a series of plays.
  • The MAB problem is often used as a metaphor for decision-making under uncertainty in various fields, such as statistics, machine learning, reinforcement learning, and economics.
  • MABs are a type of classical reinforcement learning that aims to strike a balance between exploration and exploitation. They achieve this by exploring new actions to understand their potential rewards and then exploiting the current best action to maximize overall reward. The objective is to gain knowledge about and select actions that maximize the total reward while minimizing regret.
  • Let’s look at the different facets of this problem below.

MAB Use-cases

  • MAB are a class of algorithms used for decision-making and optimization in a variety of contexts. They are commonly used in the context of online testing and personalization.


  • MAB A/B testing are both common approaches for testing and optimizing user experiences in various scenarios such as website design, online advertising, and recommendation systems. Both approaches involve running experiments to compare different versions of a product or service and measure their performance. However, they differ in several key respects, and each has its own strengths and limitations:
  1. A/B Testing:
    • Method: A/B testing is a traditional approach where you split your audience into two (or more) groups and expose each group to a different version of your product or service. You then measure the performance of each version based on some defined metric (e.g., click-through rate, conversion rate) and pick the version that performs the best.
    • Advantages: A/B testing is simple to implement and interpret. It gives you a clear comparison between two or more options. This approach is also robust to various statistical and methodological issues, as long as the test is properly designed.
    • Limitations: A/B testing can be inefficient. While the test is running, a significant portion of your audience is exposed to sub-optimal versions. Furthermore, it does not allow for real-time learning or adjustment based on the incoming data. A/B testing also doesn’t work well with personalization because it assumes each version performs uniformly across all audience segments.
  2. MAB:
    • Method: MAB approaches aim to balance the exploration-exploitation trade-off. In other words, they try to learn from the ongoing experiment and adjust the allocation of traffic to different versions based on their observed performance. This allows for more efficient use of resources and real-time learning.
    • Advantages: MAB approaches can be more efficient than A/B testing because they dynamically adjust the allocation of users to different versions based on their performance. This can lead to higher overall performance during the testing phase. MAB methods are also better suited to personalization because they can adapt to variations in performance across different user segments.
    • Limitations: MAB methods can be more complex to implement and interpret. They also require careful tuning to balance exploration and exploitation. Furthermore, while MAB methods can learn and adapt more quickly, they might also be more sensitive to short-term fluctuations in performance metrics.


  • In terms of personalization, MAB approaches often outperform A/B testing because they can handle more complex situations and adapt more quickly to the observed data. They can dynamically allocate more resources to versions that perform better for specific user segments, leading to more personalized experiences.
  • However, the choice between A/B testing and MAB often depends on the specific context, the available resources, and the level of sophistication required.

Bandit Algorithms

  • Let’s look at how the bandit algorithms work at a high level first.
  • Looking at the image below (source), we see that there is a learner that interacts with the environment with an action and the environment responds back with a reward.

  • The bandit game then, proceeds as follows for each round:
    • The learner chooses one action from a set of actions and shares it with the environment.
    • The environment generates a real-valued reward and sends it back to the learner.
    • The goal for the learner here is to maximize the cumulative reward or minimize the cumulative regret, where regret is the difference in total reward gained in \(n\) rounds and the total reward that could have been gained had the optimal action been chosen.

Exploration vs. Exploitation Tradeoff

  • Exploration vs. exploitation is at the heart of every bandit algorithm.


  • Exploitation is about recommending the optimal candidate given all the evidence and data that is currently available. Let’s look at a few strategies for exploitation:
    • Greedy exploit policy: the greedy exploit policy would involve always selecting the arm with the highest estimated reward and not exploring other arms. It assumes that the estimated rewards are accurate and the agent has a good understanding of the underlying reward distribution. It aims to make the best use of the current knowledge to select the arm with the highest expected reward at each time step, without allocating resources to explore other arms.
      • The image below (source) shows an overview of how Netflix uses greedy exploit policy in recommending a title.

      • When the member arrives at the homepage, we can see they have four titles in their candidate pool and the member features as well as the title features are extracted and provided to four pre-trained models.
      • At online time, the four models are scored based on the features that were extracted and a probability of playing the title is computed and the title with the highest probability is selected.


  • Exploration is about recommending other candidates to gather more feedback.
  • While exploration sounds like a sacrifice, it may be a good long term strategy as it allows us to gather information to make the best overall decision. Let’s look at the different strategies of exploration:
    • Naive Exploration or random exploration: Add noise to the greedy policy such as E-greedy algorithm. It chooses to explore without any specific strategy or information. In this approach, the system selects arms or recommendations randomly, without taking into account any knowledge or estimation of the probabilities of success or quality of the options.
    • Optimism in the Face of Uncertainty: prefer actions where you have little information thus, gaining more information about unknown actions. Upper Confidence Bound (UCB) is a great example of this. Arms with higher uncertainty or unknown performance are given higher optimism or initial estimates of their potential rewards. These strategies are designed to encourage exploration of uncertain arms in order to gather more information and potentially identify higher-rewarding arms.
    • Probability Matching: select the action according to the best probability such as what Thompson Sampling does. It’s a strategy where an agent distributes its actions or choices across different arms in proportion to their probabilities of success, rather than consistently choosing the arm with the highest estimated reward.

Contextual Bandit Algorithms

Epsilon Greedy

  • Epsilon-greedy algorithm is a strategy that combines both exploitation and exploration in multi-armed bandit problems. It is not solely an exploitation or exploration algorithm, but rather a hybrid approach that balances both aspects.
  • The “exploitation” aspect of the epsilon-greedy algorithm refers to the selection of the arm with the currently estimated highest reward based on the exploitation rate (1 - \(\epsilon\)). This is done to maximize the short-term rewards by choosing the arm that appears to be the best according to the current knowledge.
  • The “exploration” aspect of the epsilon-greedy algorithm refers to the selection of a random arm with a probability of \(\epsilon\). This is done to explore other arms and gather more information about their rewards, even if they are not currently estimated to be the best. This exploration allows the agent to potentially discover better arms that may have initially lower estimated rewards but could have higher rewards in the long run.
  • By balancing exploitation and exploration, the epsilon-greedy algorithm aims to find a trade-off between taking advantage of the current best arm and exploring other arms to improve the overall performance. The value of epsilon (\(\epsilon\)) controls the level of exploration, where higher epsilon values result in more exploration, and lower epsilon values result in more exploitation.
  • The pseudo-code below (source) shows how the exploitation and exploration process work given the value of \(\epsilon\).
    machines :: [ GamblingMachine ]
num_plays :: Map(GamblingMachine -> Int)
    total_reward :: Map(GamblingMachine -> Float)

select e // In the original post, e = 0.1
while True:
    x <- UNIFORM([0,1]) // A uniformly distributed random variable
if x < e: //exploration phase
    let m = random_choice(machines)
    reward <- play_machine( m )
    num_plays[m] += 1
    total_reward[m] += reward
else: //exploitation phase
    average_rewards = Map( (m, total_reward[m] / num_plays[m]) for m in machines )
    best_machine = argmax( average_rewards ) //Find the machine with the highest reward
    reward <- play_machine( best_machine )
    num_plays[best_machine] += 1
    total_reward[best_machine] += reward

Thompson Sampling (Bayesian)

  • This is a probabilistic algorithm that samples arms from their posterior distributions and selects the arm with the highest sampled reward. It uses Bayesian methods to update the posterior distributions based on observed data.
  • The key idea behind Thompson Sampling is to maintain a probability distribution over the expected rewards of each arm or action, and sample from these distributions to determine which arm to choose at each decision point.
  • The algorithm uses Bayesian methods to update the probability distributions based on observed rewards, which allows it to adaptively learn the optimal arm over time.
  • In the code below (source), we assume there are 5 arms with unknown probabilities of success (true_probs), and we want to run the Thompson Sampling algorithm for 1000 rounds (n_rounds). We initialize counters for each arm (n_pulls and n_successes) as zeros, and in each round, we sample from the beta distribution using the updated counters, choose the arm with the highest sampled probability, simulate pulling the chosen arm, and observe the reward.
import numpy as np

# Number of arms
n_arms = 5

# True unknown probability of success for each arm (unknown to the algorithm)
true_probs = np.random.rand(n_arms)

# Number of rounds
n_rounds = 1000

# Initialize counters for each arm
n_pulls = np.zeros(n_arms)
n_successes = np.zeros(n_arms)

# Main loop for Thompson Sampling
for t in range(n_rounds):
    # Sample from the probability distributions for each arm
    sampled_probs = np.random.beta(n_successes + 1, n_pulls - n_successes + 1)
    # Choose the arm with the highest sampled probability
    chosen_arm = np.argmax(sampled_probs)
    # Simulate pulling the chosen arm and observe the reward
    reward = np.random.binomial(1, true_probs[chosen_arm])
    # Update the counters for the chosen arm
    n_pulls[chosen_arm] += 1
    n_successes[chosen_arm] += reward
    # Print the chosen arm and reward for this round
    print("Round:", t+1)
    print("Chosen Arm:", chosen_arm)
    print("Reward:", reward)

# Print the estimated probabilities of success for each arm
estimated_probs = n_successes / n_pulls
print("Estimated Probabilities of Success:", estimated_probs)

Upper Confidence Bound

  • As we saw earlier, this strategy is based on Optimism in the Face of Uncertainty principle.
  • This algorithm uses an upper confidence bound to balance exploration and exploitation. It selects arms with higher upper confidence bounds, which represent the uncertainty of the estimated rewards.
  • The UCB algorithm leverages the optimistic estimate of the expected rewards by using the upper confidence bound, which encourages exploration by choosing arms with uncertain or poorly estimated rewards. As the algorithm accumulates more data through the number of pulls, the confidence interval term shrinks, and the algorithm tends to exploit the arms with higher estimated rewards more often.
  • In the code below (source), we can see how UCB is implemented with 5 arms.
import numpy as np

# Number of arms
n_arms = 5

# True unknown probability of success for each arm (unknown to the algorithm)
true_probs = np.random.rand(n_arms)

# Number of rounds
n_rounds = 1000

# Initialize counters for each arm
n_pulls = np.zeros(n_arms)
n_successes = np.zeros(n_arms)

# Main loop for UCB
for t in range(n_rounds):
    # Calculate the upper confidence bound for each arm
    ucb_values = n_successes / n_pulls + np.sqrt(2 * np.log(t + 1) / (n_pulls + 1e-6))
    # Choose the arm with the highest upper confidence bound
    chosen_arm = np.argmax(ucb_values)
    # Simulate pulling the chosen arm and observe the reward
    reward = np.random.binomial(1, true_probs[chosen_arm])
    # Update the counters for the chosen arm
    n_pulls[chosen_arm] += 1
    n_successes[chosen_arm] += reward
    # Print the chosen arm and reward for this round
    print("Round:", t+1)
    print("Chosen Arm:", chosen_arm)
    print("Reward:", reward)

# Print the estimated probabilities of success for each arm
estimated_probs = n_successes / n_pulls
print("Estimated Probabilities of Success:", estimated_probs)

Advantages of Multi-armed Bandits

  • MABs are particularly useful when there’s a need to balance exploration (trying new options) and exploitation (sticking with the best-known option). Here are some of the benefits of using multi-armed bandits:

  • Efficient Exploration-Exploitation Balance: MABs help in balancing the need to explore unknown options and exploit known good ones, optimizing for long-term rewards.

  • Online Learning: They are designed to learn and adapt over time, making them suitable for non-stationary environments where the underlying distribution can change.

  • Resource Optimization: By continually exploring and exploiting the best options, MABs can optimize the allocation of resources, such as budget, time, or any other limited resource.

  • Personalization: In contexts like recommendation systems, MABs can personalize content to individual users, adapting to their preferences and behaviors in real-time.

  • Ease of Implementation: Some MAB algorithms, like epsilon-greedy, are quite simple to implement, making them accessible for a variety of applications.

  • Robustness: Many MAB algorithms are robust to noisy or incomplete information and can work well even when the assumptions about the underlying models are not exactly met.

  • Low Regret: MAB algorithms are designed to minimize regret, which is the difference between the reward obtained and the reward that would have been obtained by always choosing the optimal action. This ensures that they perform near-optimally with enough observations.

  • Scalability: MABs can be applied to problems with many actions (or arms) and can scale to handle complex scenarios.

  • Real-time Decision Making: They enable real-time decision-making, allowing systems to respond quickly to changes in the environment.

  • Experimentation and A/B Testing: MABs can be used to conduct efficient A/B testing, dynamically allocating more traffic to better-performing variations. This can speed up experimentation and lead to more rapid improvements.

  • Contextual Information: Contextual bandit algorithms, a variant of MAB, incorporate additional context information into the decision-making process, allowing for more nuanced and sophisticated strategies.

  • Application Across Various Domains: Multi-armed bandits have wide applicability across domains like healthcare (for personalized treatment strategies), online advertising (for ad placement), recommender systems, finance, and more.

  • Ethical Considerations: By potentially reducing unnecessary exploration in critical scenarios, MABs may help in making more ethical decisions, like in healthcare, where unnecessary exploration could have significant consequences.

  • In summary, the benefits of multi-armed bandits are manifold, and they provide a flexible, robust, and efficient means of decision-making and optimization in a wide array of applications and industries.

Contextual Bandits


  • Contextual bandits extend the concept of multi-armed bandits by considering the context or additional information available before making each action selection. Instead of solely relying on historical data, contextual bandits take into account the specific context or conditions associated with each decision point.
  • In the context of recommendation systems and search algorithms, this could involve considering various customer-related data such as demographics, device type, historical preferences, as well as environmental factors like the day of the week or time of day. By incorporating context, contextual bandits aim to learn how different actions interact with specific contexts and their impact on the expected reward. This enables more personalized and optimized decision-making in dynamic environments.

Industrial Deployments


  • Spotify introduced a method that extends the use of contextual bandits, a technique commonly used in recommendation systems, to handle multiple objectives fairly. The approach uses a mathematical function called the Generalized Gini index (GGI) to combine and balance the different objectives. The paper proposes an algorithm that learns from user interactions and adjusts recommendations to maximize long-term rewards for each objective, as measured by the GGI function
  • This paper discusses the design of multi-objective recommender systems for online platforms that involve multiple stakeholders. The recommender systems aim to optimize various objectives such as user satisfaction, fairness, diversity, and revenue. Traditional bandit models are limited in handling multiple objectives simultaneously, but contextual bandit models offer more flexibility in incorporating side information.
  • The paper proposes a multi-objective formulation of a contextual bandit recommender system, which considers the uncertainty and relevance of items in the decision-making process. Unlike traditional bandits, contextual bandits use context or side information to guide recommendations. However, multi-objective variants of contextual bandits have received less attention in previous research.
  • The focus of the paper is on developing multi-objective contextual bandit models for digital platforms with multiple stakeholders. The reward in this setting is assumed to be a noisy linear function of the context or side information. By considering multiple objectives, these models can effectively optimize recommendations while balancing different stakeholder goals.
  • The paper describes the use of a multi-objective contextual bandit (MAB) model to optimize the recommender system on Spotify. The proposed model is based on the Generalized Gini Index (GGI) and assimilates contextual information to optimize multiple objectives simultaneously. The MAB framework allows Spotify to make sequential decisions by balancing exploration and exploitation, considering various user-centric and supplier-centric objectives.
  • In the article, the authors explain that traditional bandit settings only consider a single scalar feedback after each action, while multi-objective systems require joint optimization of multiple criteria. By extending the contextual bandit framework to accommodate multiple objectives, Spotify aims to achieve efficiency and fairness in optimizing their recommender system.
  • Therefore, the approach presented in the article involves the use of a multi-objective contextual bandit model, which leverages the GGI and contextual information to optimize for multiple objectives simultaneously.
  • Existing research in multi-armed bandits has explored addressing the multi-objective nature of the problem. Various algorithms, such as the knowledge gradient, hierarchical optimistic optimization strategy, Thompson sampling, and combinations of bi-objective optimization with combinatorial bandits, have been proposed to tackle multi-objective optimization (MOO) in traditional bandit settings. However, these approaches do not specifically address the contextual bandit setting, where additional context information is observed at each iteration.
  • In the context of contextual bandits, there has been relatively less focus on multi-objective variants. Only a few algorithms exist that consider similarity information or dominant objectives, but these approaches rely on assumptions about access to distances between context-arm pairs rather than user behavioral data. These assumptions are often restrictive in real-world industrial settings, where rewards are derived from user behavioral data.
  • Therefore, the proposed method aims to extend multi-objective bandit models specifically to the contextual bandit setting, considering the challenges and constraints of real-world scenarios where user behavioral data is used to derive rewards.
  • The authors discusses the importance of optimizing multiple objectives in user-centric systems, such as music streaming platforms. Different user-centric objectives, like clicks, streams, and number of songs played, can be optimized, but they may have correlations and trade-offs with each other. For example, optimizing for relevance may harm diversity. The article also considers additional objectives related to diversity and promotion. To understand the interplay between these objectives, the authors analyze user streaming data and estimate various metrics. They find that different objectives have different degrees of correlation, with user-centric objectives being strongly positively correlated and diversity and promotion objectives showing weak negative correlation with user-centric objectives.
  • They emphasizes that optimizing for a single metric is insufficient in a multi-objective, multi-stakeholder platform setting. There is often a delicate trade-off between objectives, and a recommender system designed for a single metric is not suitable. The authors propose exploring multiple metric optimization and describe a recommendation approach based on contextual bandits, which are commonly used to optimize a single user satisfaction metric.
  • The article describes the use of contextual bandits for recommender systems. The recommendation problem is formalized as a combinatorial contextual bandit problem, where the system repeatedly interacts with users. The system observes a context, chooses an action (recommendation), and receives a reward based on user satisfaction. In the case of music streaming platforms, the context includes user features, playlist features, user-playlist affinity, and other contextual information.
  • Actions in this context are sets of recommendations, where users are presented with playlists. The rewards in a traditional user-centric system are based on user satisfaction with the recommendation. However, in a multi-stakeholder system, multiple objectives are considered, and vectorial rewards are observed, with each objective having its corresponding reward.
  • To optimize multiple objectives in content recommendation, the article proposes a multi-objective contextual bandit algorithm. The algorithm is described in detail, including mathematical notations and the problem setup. The arm-selection strategy is explained, which involves selecting a playlist to show to the user based on observed contextual features. The article highlights the benefits of multi-objective optimization for balancing competing objectives in recommender systems.
  • The authors describes the challenges faced by recommender systems in multi-stakeholder platforms, where different stakeholders have different objectives. To address this, the researchers proposed MO-LinCB, a multi-objective linear contextual bandit model. They used the Gini aggregation function to balance and optimize multiple objectives. The approach involved learning a recommendation policy using a scalable gradient ascent method.
  • The findings of the study indicate that optimizing for multiple objectives can lead to improvements in all the objectives involved. By considering multiple satisfaction metrics, the model was able to provide better recommendations, as different metrics capture different aspects of user behavior. The experiments also showed that it is possible to achieve gains in both complementary and competing objectives through multi-objective optimization. Specifically, the proposed model successfully achieved gains in a promotional objective without negatively impacting user satisfaction metrics.
  • While the presented model and findings have implications for recommender system design in multi-stakeholder platforms, there are several future research directions suggested. These include exploring alternative objective weighting functions to provide more control over the importance of different objectives, considering global objectives aggregated over time, quantifying objectives of other stakeholders, and expanding the range of objectives considered to include more competing objectives. These future research directions aim to further enhance the effectiveness and versatility of multi-objective recommender systems in multi-stakeholder settings.


  • Yahoo implemented a contextual bandit approach for news article recommendations, focusing on the use of shared features across different arms (hybrid linear models) instead of disjoint linear models. In their experiments, they discovered that this approach enabled transfer learning, allowing click-through rate information from one article to inform the recommendations of other articles.
  • For user representation, they considered 1,193 categorical features encompassing demographics, geography, and behavioral aspects related to news consumption history. News articles were represented by 83 categorical features, including URL category and manually tagged editor categories.
  • To reduce dimensionality, they projected user features and article features into respective categories and clustered users and articles with similar preferences. This resulted in six-dimensional representations for both users and articles. To create user-article cross features, they computed the outer product of the user and article features, resulting in a 36-dimensional vector, which formed the basis of the hybrid linear models.
  • Evaluating the performance of their policy was challenging due to the availability of offline data collected under a different policy. To address this, they made assumptions about the independence and identical distribution of individual events and the uniform random selection of arms by the policy that collected the logged data. They also had a learning bucket where some users were randomly assigned and served articles randomly.
  • Their policy evaluator compared the learned policy with the logged policy. If the learned policy selected the same arm as the logged policy, the event was retained, added to the history, and used to update the payoff. If the learned policy chose a different arm, the event was ignored, and the algorithm proceeded to the next event without any changes in the state.


  • Netflix utilized contextual bandits to personalize movie images on the home page, aiming to provide a better user experience and reduce regret. They opted for contextual bandits because traditional batch machine learning approaches require significant time to collect data, train models, and conduct AB testing, resulting in a delay in delivering an improved experience to members.
  • Initially, Netflix employed a non-contextual multi-armed bandit to determine the best artwork for all users. However, they transitioned to contextual bandits to personalize images for each individual. In this setup, the bandit selects from a set of images for each show (action) and observes the duration of time the user watches the show after being influenced by the image (reward). Additionally, the bandit takes into account user attributes (e.g., played titles, preferred genres, country, language preferences), temporal factors like the day of the week and time of day, and other relevant context information.
  • Netflix mentioned several bandit models, including a greedy policy implemented through supervised regression models, epsilon-greedy, LinUCB, and Thompson Sampling. However, they did not specify which specific model was used in their production system, suggesting that it may be an ensemble or a dynamic approach that is frequently updated.


  • “Contextual bandit models are a popular approach for personalizing the user experience by recommending relevant products. However, these models become challenging to train and evaluate when the number of actions, or products, in the recommendation pool become large.” (source)
  • Within the Multi-Armed Bandit (MAB) framework, the system acquires knowledge about the potential actions through real-time exploration. Following the execution of an action, the system observes a probabilistic outcome, representing the reward obtained. This process continues, and over time, the MAB model gains insights into the reward distribution associated with each action. Eventually, based on the learned information, the system selects the action with the highest average performance across all users as the production policy to be deployed.
  • Contextual bandits contain context which includes user-specific features that allow the CB model to personalize the recommended action for each user. By leveraging this context, the CB model has the potential to achieve a higher average reward across all customers compared to an MAB model, which lacks context information
  • At Instacart, the application of the Contextual Bandit (CB) framework was explored to enhance item retrieval and ranking, aiming to provide users with a more personalized shopping experience. Item retrieval and ranking involve finding and ordering products relevant to a user’s search query. Traditionally, a machine learning model is used to compute the relevance of each item to the query, based on the probability of it being added to the cart.
  • However, the problem of ranking recommended items presents challenges for the CB framework. Retrieving several hundred candidate items from a large pool and sorting them based on their ranking scores does not have a clear set of discrete actions. Considering each individual item as a discrete action is also problematic since the reward for a specific item (click or purchase) depends on the presence of other items shown to the customer, violating the independent reward attribution assumption of CB.
  • To address these challenges, one approach is to represent each potential action (sorted list of items) as a low-dimensional vector, as suggested in “Off-Policy Evaluation for Large Action Spaces via Embeddings.” For instance, in an item ranking problem, the vector can include features such as the average ranking score and average price of the top 5, 10, 15, 20 items. While additional numerical features can be added, the dimensionality of the action vector should be carefully considered to ensure the accuracy of the CB model in recommending the most suitable vector for each context.
  • In 2022, Instacart initially took a simpler approach to apply CBs to the ranking problem. They observed that for specific queries like “milk,” a linear model that predicts relevance scores for items (ranking them in decreasing order) without relying on user features performs well by recommending popular milk products. On the other hand, for broader queries like “healthy snack,” a nonlinear item scoring model that considers various user features as inputs performs well by suggesting personalized items for the customer. These observations guided their early work in utilizing CBs for ranking problems.
  • The Instacart team conducted experiments to investigate the effectiveness of Contextual Bandit (CB) models in improving search ranking and personalizing multi-objective ranking. Initially, they aimed to increase the relevance of search results by training a CB model to select the best search ranker for each user-query context, measured by cart_adds_per_search (items added to cart per search query). The CB model, implemented using the X-learner framework, achieved a significant 0.66% increase in cart_adds_per_search compared to the neural network item scoring model.
  • Building upon this success, the team explored the challenge of personalizing multi-objective ranking, which involves balancing personalized relevance, popularity, novelty, and other factors in recommending items. In their online experiment, they defined various ranking formulas that represented different objectives and treated them as potential actions for the CB model. Each action corresponded to a specific combination of weights assigned to the relevance, popularity, price, and availability scores of items.
  • To evaluate the performance of the CB models, a randomized data collection experiment was conducted, exploring different coefficient values around those used by the existing production model. The X-learner model and the XGBoost model demonstrated promising trade-offs between counterfactual estimates for metrics such as Cart Adds Per Search (CAPS) and Gross Merchandise Value (GMV) per search. Subsequently, an online A/B test was conducted with these models, resulting in a statistically significant increase of nearly 0.6% in CAPS for Android users (who are more price sensitive), while GMV per user showed a positive but not statistically significant increase. Considering the lower latency of the XGBoost model, it was chosen for production deployment.
  • The team plans to continue experimenting in this domain, exploring different action spaces for the CB model, and further refining the personalized ranking algorithms for improved user experience and business outcomes.
  • The team is exploring the use of Contextual Bandit (CB) models not only for personalizing search ranking but also for recommending a preference vector over retrieval sources. The goal is to maximize a specific objective that can be computed from each sorted list of results. Different strategies can be employed to translate the preference vector into user treatment. This includes sampling candidates from each source based on their preferences, upranking candidates from the most preferred source, and utilizing CB preferences as an additional ranking feature.
  • To train a CB model, a dataset needs to be collected where actions have been taken in various contexts, and the outcomes have been observed. Customers are randomly assigned to different variants, with each variant using a specific action for ranking items. The collected data is then used to train the CB model, learning the predicted reward for each action given a context. Counterfactual evaluation is performed to assess the trained model’s performance.
  • There are different algorithms that can be used to train a CB model. A simple approach involves training an XGBoost or LightGBM model with the action as a categorical feature. To obtain recommended actions from the model, all allowed actions are sequentially substituted into the model, and the action with the highest predicted reward is selected.
  • A more sophisticated approach is to use a neural network to represent the CB model, with input nodes corresponding to context features and output nodes representing predicted rewards for each possible action. During serving, the CB model can recommend the action with the highest predicted reward or make probabilistic recommendations based on predicted rewards. During training, the error between the predicted reward and the observed reward for the taken action is backpropagated through the corresponding output node.
  • These training approaches enable the CB model to learn and make informed recommendations based on observed context and historical outcomes, allowing for personalized and effective decision-making in various domains.
  • The images below (source) show the general architecture.
  • A more advanced framework for training Contextual Bandit (CB) models called X-learners was introduced in a 2019 paper. This method involves grouping the data based on the treatment (action) used for each group. For each group, a first-level machine learning (ML) model is trained to predict the observed reward. Then, second-level ML models are trained to predict the incremental lift, known as Conditional Average Treatment Effect (CATE), that would be observed if that group were to receive the “control” action (the currently deployed policy to be improved).
  • During serving, the X-learner CB model recommends the action with the highest predicted lift compared to the control action or makes probabilistic recommendations based on predicted lifts. The diagram provided illustrates the training steps for an X-learner model with two actions: treatment0 (control) and treatment1. The input features are denoted as X, the recorded action as T, the observed rewards as Y (with Y0 for cases where T=0 was applied and Y1 for cases where T=1 was applied), and the predicted value of Y is represented as Y* by the first-level model.
  • The X-learner framework allows for a more sophisticated and accurate estimation of treatment effects by considering the incremental lift that can be achieved by different actions. By incorporating this framework into CB models, more informed and effective action recommendations can be made to optimize desired outcomes in various applications.
  • The passage describes the implementation and evaluation of Contextual Bandit (CB) models for various applications at Instacart. The X-learner framework, introduced in a 2019 paper, is used to train CB models. The framework involves grouping the data based on the treatment (action) used, training ML models to predict observed rewards, and then training additional models to predict the incremental lift compared to a control action. The trained CB models can recommend actions with the highest predicted lift or make probabilistic recommendations.
  • To evaluate the trained CB models, two commonly used approaches are discussed. The first is Inverse Propensity Sampling (IPS), which weighs the rewards based on the ratio of probabilities between the new policy and the logging policy. The second is Doubly Robust (DR) estimation, which combines the predicted rewards and the IPS weighting for more accurate evaluation.
  • Different tools and libraries are utilized for training and evaluating CB models. XGBoost is used for training CB models with discrete action spaces due to its fast training time and good predictive power. RLlib, an open-source library for training contextual bandit and reinforcement learning models, is used for neural network-based CB models with discrete action spaces. RLlib supports a wide range of industry applications and can handle both discrete and continuous action spaces.
  • Engineers from Anyscale collaborated with Instacart to create a Python script that leverages RLlib for training and evaluating CB/RL models. The script can be run on a laptop or in the cloud and provides IPS and DR estimates during training, which can be visualized using tensorboard. The best-performing model based on IPS and DR metrics can be selected for further use.
  • In conclusion, Instacart is actively exploring the use of CB models for retrieval and ranking problems in various areas. They have developed tools and techniques to train and evaluate these models and plan to share success stories in future blog posts.

Offline Evaluation Replay

  • Offline evaluation replay is a method used to evaluate the performance of a recommendation or ranking system using historical data.
  • In offline evaluation replay, the system’s recommendations or rankings are generated using a given algorithm, and then compared against the ground truth or actual outcomes from past interactions or historical data and is commonly used for contextual bandit algorithms.
  • Offline evaluation replay has some limitations as well such as its inability to capture the dynamics of real-time online interactions, user feedback, or system updates. It assumes that the historical dataset is representative of the current user preferences and behaviors, which may not always be the case.
  • Nevertheless, offline evaluation replay can provide valuable insights and initial assessment of the performance of recommendation or ranking algorithms before deploying them in live online systems.

Industry examples of bandits for RecSys

  • Spotify: their recommending explanations for music recommendations, “recsplanations”, is an example of epsilon-greedy.
  • Yahoo: their news recommendations utilize UCB.
  • Alibaba: also leverages UCB for item recommendations.
  • DoorDash: uses Thompson Sampling for cuisine recommendations.
  • Amazon: uses a bandit algorithm for realtime conversions on landing pages as well as click-through rates on search engine result pages.
  • Twitter: uses bandit strategies to supply personalized recommendations.

Batched Bandits

  • Bandit algorithms typically operate on individual recommendations rather than on batches. Bandit algorithms are designed to balance the exploration of new options with the exploitation of known options to maximize rewards in decision-making processes. They are widely used in recommendation systems, adaptive routing, online advertising, and other areas where decisions must be made under uncertainty.
  • The term “bandit” comes from the analogy of a gambler playing multiple slot machines (one-armed bandits), where each machine has an unknown probability of payout. The gambler wants to find the best machine (or machines) to play in order to maximize their winnings over time. In this context, each “play” or “trial” typically involves a single decision about which “arm” of the bandit to pull, based on past performance and the need to explore less-known options.
  • In recommendation systems, a bandit algorithm might be used to recommend individual items to users one at a time. Each recommendation is treated as an individual trial, with the user’s response (e.g., click or no click) serving as the reward signal. Over time, the algorithm learns which items are more likely to be appealing to users while still occasionally testing less-recommended items to discover potential hidden gems.
  • While bandit algorithms are primarily oriented towards individual decisions, there are variations and extensions that allow for batch recommendations or decisions. These variations adapt the bandit framework to situations where multiple recommendations are made simultaneously or where decisions are grouped in batches for practical reasons. However, the core principle of balancing exploration and exploitation remains central to these approaches.

  • The core concept of bandit algorithms involves making sequential decisions, one at a time, to balance the exploration of untested options with the exploitation of known rewarding options. This principle is well-suited for scenarios where decisions are made individually, such as recommending a single article to a user or choosing an ad to display in real-time. However, there are scenarios where making decisions in batches is more practical or necessary. For these situations, variations and extensions of bandit algorithms have been developed to handle batch recommendations or decisions. Let’s explore how these variations adapt the traditional bandit approach for batch scenarios:

Contextual Bandits with Batching

  • In many real-world applications, decisions need to be made in groups or batches due to operational constraints. For example, an email marketing campaign may involve sending a batch of personalized offers to different segments of users at the same time. Contextual bandit algorithms can be extended to handle batch recommendations by considering the context or features of each decision or user in the batch. These extensions may use models that predict the expected reward for each action in the context of each user, selecting a batch of actions that maximizes the expected reward across the group.

Parallel Bandit Algorithms

  • Some applications require decisions to be made in parallel across multiple instances or agents. For instance, a distributed web service might use bandit algorithms to optimize response strategies across different servers simultaneously. Parallel bandit algorithms allow multiple agents to explore and exploit simultaneously, sharing information to improve overall decision making. This approach can be seen as a form of batch processing where each “batch” consists of parallel decisions made by different agents.

Batched Exploration and Exploitation

  • Batched bandit algorithms explicitly incorporate the batching process into the exploration and exploitation strategy. Instead of making a single decision at a time, these algorithms decide on a set of actions to take simultaneously, based on the current understanding of the environment. The batch size can be fixed or dynamically adjusted based on the algorithm’s performance or operational constraints. After each batch of decisions is made and the outcomes observed, the algorithm updates its understanding and prepares the next batch of decisions. This approach is useful in situations where decisions must be made or evaluated in groups rather than individually.

Online Learning with Batch Updates

  • In some scenarios, the environment or user preferences may change rapidly, making it impractical to update the decision-making model after every single action. Batch updates in bandit algorithms allow for the accumulation of data from multiple decisions before updating the model. This approach can reduce computational overhead and adapt more smoothly to changes in the environment. While the decision-making process itself might still focus on individual actions, the learning and updating process is batched for efficiency.

Challenges and Considerations

  • Implementing batch recommendations or decisions with bandit algorithms introduces several challenges, including how to optimally allocate actions within a batch to balance exploration and exploitation, how to handle dependencies or interactions between decisions within a batch, and how to update models effectively based on batch outcomes. Researchers and practitioners in fields like machine learning, operations research, and applied statistics continue to develop new techniques and algorithms to address these challenges, extending the applicability of bandit algorithms to a wide range of batch decision-making scenarios.

Further Reading