Overview

  • The multi-armed bandit problem is inspired by the idea of gambling in a casino. It is a classic problem in probability theory and decision-making involving a gambler who faces a row of slot machines, or “one-armed bandits,” each with a different, unknown probability of winning. The gambler must decide which machines to play in order to maximize their total reward over a series of plays.
  • The multi-armed bandit problem is often used as a metaphor for decision-making under uncertainty in various fields, such as statistics, machine learning, reinforcement learning, and economics.
  • Let’s look at the different facets of this problem below.

Bandit Algorithms

  • Let’s look at how the bandit algorithms work at a high level first.
  • Looking at the image below (source), we see that a learner interacts with the environment by taking an action, and the environment responds with a reward.

  • The bandit game then proceeds as follows for each round:
    • The learner chooses one action from a set of actions and shares it with the environment.
    • The environment generates a real-valued reward and sends it back to the learner.
    • The goal for the learner is to maximize the cumulative reward or, equivalently, minimize the cumulative regret, where regret is the difference between the total reward that could have been gained over \(n\) rounds had the optimal action always been chosen and the total reward actually gained. A minimal simulation of this interaction loop is sketched below.
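  • To make the learner-environment loop concrete, here is a minimal simulation sketch, assuming five Bernoulli arms and a placeholder uniformly random policy (the arm probabilities, round count, and random policy are illustrative assumptions; any of the bandit algorithms below can be plugged in as the policy).
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 5 Bernoulli arms with success probabilities unknown to the learner
true_probs = rng.random(5)
n_rounds = 1000
optimal_mean = true_probs.max()

total_reward = 0.0
total_regret = 0.0

for t in range(n_rounds):
    # The learner chooses an action (here: uniformly at random, as a placeholder policy)
    action = rng.integers(len(true_probs))

    # The environment responds with a real-valued reward (here: 0/1)
    reward = rng.binomial(1, true_probs[action])

    total_reward += reward
    # Expected per-round regret: gap between the best arm's mean and the chosen arm's mean
    total_regret += optimal_mean - true_probs[action]

print("Cumulative reward:", total_reward)
print("Cumulative (expected) regret:", total_regret)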

Exploration vs. Exploitation Tradeoff

  • Exploration vs. exploitation is at the heart of every bandit algorithm.

Exploitation

  • Exploitation is about recommending the optimal candidate given all the evidence and data that is currently available. Let’s look at a few strategies for exploitation:
    • Greedy exploit policy: always select the arm with the highest estimated reward and do not explore other arms. This assumes that the estimated rewards are accurate and that the agent has a good understanding of the underlying reward distribution. It aims to make the best use of current knowledge by selecting the arm with the highest expected reward at each time step, without allocating any plays to exploring other arms.
      • The image below (source) shows an overview of how Netflix uses greedy exploit policy in recommending a title.

      • When the member arrives at the homepage, there are four titles in their candidate pool; the member features and the title features are extracted and provided to four pre-trained models.
      • At serving time, each model scores its title based on the extracted features to produce a probability of play, and the title with the highest probability is selected; a hypothetical sketch of this greedy selection step is shown below.
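      • The snippet below is a hypothetical sketch of that greedy exploit step, not Netflix’s actual implementation; the stand-in logistic “models,” feature sizes, and random features are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_model(n_features=8):
    # A dummy "pre-trained" model: a logistic scorer over concatenated member + title features
    w = rng.random(n_features)
    return lambda features: float(1.0 / (1.0 + np.exp(-w @ features)))

def greedy_select(models, member_features, title_features):
    # Score every candidate title with its model, then exploit:
    # pick the title with the highest predicted probability of play
    scores = [model(np.concatenate([member_features, tf]))
              for model, tf in zip(models, title_features)]
    return int(np.argmax(scores))

models = [make_model() for _ in range(4)]            # four pre-trained models
member_features = rng.random(4)                      # extracted member features
title_features = [rng.random(4) for _ in range(4)]   # extracted title features
print("Selected title index:", greedy_select(models, member_features, title_features))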

Exploration

  • Exploration is about recommending other candidates to gather more feedback.
  • While exploration sounds like a sacrifice, it can be a good long-term strategy because it allows us to gather the information needed to make the best overall decisions. Let’s look at the different strategies of exploration:
    • Naive or random exploration: add noise to the greedy policy, as the \(\epsilon\)-greedy algorithm does. This approach explores without any specific strategy or information: the system selects arms or recommendations at random, without taking into account any knowledge or estimate of the options’ probabilities of success or quality.
    • Optimism in the Face of Uncertainty: prefer actions about which you have little information, thus gaining more information about unknown actions. Upper Confidence Bound (UCB) is a great example of this. Arms with higher uncertainty or unknown performance are given more optimistic estimates of their potential rewards. These strategies are designed to encourage exploration of uncertain arms in order to gather more information and potentially identify higher-rewarding arms.
    • Probability Matching: select each action in proportion to the probability that it is the best one, which is what Thompson Sampling does. It’s a strategy where an agent distributes its actions or choices across arms in proportion to their probability of being optimal, rather than consistently choosing the arm with the highest estimated reward.

Contextual Bandit Algorithms

Epsilon Greedy

  • Epsilon-greedy algorithm is a strategy that combines both exploitation and exploration in multi-armed bandit problems. It is not solely an exploitation or exploration algorithm, but rather a hybrid approach that balances both aspects.
  • The “exploitation” aspect of the epsilon-greedy algorithm refers to selecting the arm with the currently highest estimated reward, which happens with probability \(1 - \epsilon\) (the exploitation rate). This is done to maximize short-term rewards by choosing the arm that appears to be the best according to the current knowledge.
  • The “exploration” aspect of the epsilon-greedy algorithm refers to the selection of a random arm with a probability of \(\epsilon\). This is done to explore other arms and gather more information about their rewards, even if they are not currently estimated to be the best. This exploration allows the agent to potentially discover better arms that may have initially lower estimated rewards but could have higher rewards in the long run.
  • By balancing exploitation and exploration, the epsilon-greedy algorithm aims to find a trade-off between taking advantage of the current best arm and exploring other arms to improve the overall performance. The value of epsilon (\(\epsilon\)) controls the level of exploration, where higher epsilon values result in more exploration, and lower epsilon values result in more exploitation.
  • The pseudo-code below (source) shows how the exploitation and exploration process works for a given value of \(\epsilon\); a runnable Python version follows it.
inputs:
    machines :: [ GamblingMachine ]
    num_plays :: Map(GamblingMachine -> Int)
    total_reward :: Map(GamblingMachine -> Float)

select e // In the original post, e = 0.1
while True:
    x <- UNIFORM([0,1]) // A uniformly distributed random variable
    if x < e: // exploration phase
        let m = random_choice(machines)
        reward <- play_machine( m )
        num_plays[m] += 1
        total_reward[m] += reward
    else: // exploitation phase
        average_rewards = Map( (m, total_reward[m] / num_plays[m]) for m in machines )
        best_machine = argmax( average_rewards ) // Find the machine with the highest reward
        reward <- play_machine( best_machine )
        num_plays[best_machine] += 1
        total_reward[best_machine] += reward
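  • Here is the same \(\epsilon\)-greedy loop as a runnable Python sketch, mirroring the Thompson Sampling and UCB examples below; the five Bernoulli arms, \(\epsilon = 0.1\), and 1,000 rounds are illustrative assumptions.
import numpy as np

# Illustrative setup: 5 arms with success probabilities unknown to the algorithm
n_arms = 5
true_probs = np.random.rand(n_arms)
n_rounds = 1000
epsilon = 0.1

n_pulls = np.zeros(n_arms)
total_reward = np.zeros(n_arms)

for t in range(n_rounds):
    if np.random.rand() < epsilon:
        # Exploration phase: pick a random arm
        chosen_arm = np.random.randint(n_arms)
    else:
        # Exploitation phase: pick the arm with the highest average reward so far
        average_rewards = total_reward / np.maximum(n_pulls, 1)
        chosen_arm = np.argmax(average_rewards)

    # Simulate pulling the chosen arm and observe the reward
    reward = np.random.binomial(1, true_probs[chosen_arm])
    n_pulls[chosen_arm] += 1
    total_reward[chosen_arm] += reward

# Print the estimated average reward for each arm
print("Estimated average rewards:", total_reward / np.maximum(n_pulls, 1))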

Thompson Sampling (Bayesian)

  • This is a probabilistic algorithm that samples arms from their posterior distributions and selects the arm with the highest sampled reward. It uses Bayesian methods to update the posterior distributions based on observed data.
  • The key idea behind Thompson Sampling is to maintain a probability distribution over the expected rewards of each arm or action, and sample from these distributions to determine which arm to choose at each decision point.
  • The algorithm uses Bayesian methods to update the probability distributions based on observed rewards, which allows it to adaptively learn the optimal arm over time.
  • In the code below (source), we assume there are 5 arms with unknown probabilities of success (true_probs), and we want to run the Thompson Sampling algorithm for 1000 rounds (n_rounds). We initialize counters for each arm (n_pulls and n_successes) as zeros, and in each round, we sample from the beta distribution using the updated counters, choose the arm with the highest sampled probability, simulate pulling the chosen arm, and observe the reward.
import numpy as np

# Number of arms
n_arms = 5

# True unknown probability of success for each arm (unknown to the algorithm)
true_probs = np.random.rand(n_arms)

# Number of rounds
n_rounds = 1000

# Initialize counters for each arm
n_pulls = np.zeros(n_arms)
n_successes = np.zeros(n_arms)

# Main loop for Thompson Sampling
for t in range(n_rounds):
    # Sample from the probability distributions for each arm
    sampled_probs = np.random.beta(n_successes + 1, n_pulls - n_successes + 1)
    
    # Choose the arm with the highest sampled probability
    chosen_arm = np.argmax(sampled_probs)
    
    # Simulate pulling the chosen arm and observe the reward
    reward = np.random.binomial(1, true_probs[chosen_arm])
    
    # Update the counters for the chosen arm
    n_pulls[chosen_arm] += 1
    n_successes[chosen_arm] += reward
    
    # Print the chosen arm and reward for this round
    print("Round:", t+1)
    print("Chosen Arm:", chosen_arm)
    print("Reward:", reward)
    print("------")

# Print the estimated probabilities of success for each arm
estimated_probs = n_successes / np.maximum(n_pulls, 1)  # avoid division by zero for arms that were never pulled
print("Estimated Probabilities of Success:", estimated_probs)

Upper Confidence Bound

  • As we saw earlier, this strategy is based on the Optimism in the Face of Uncertainty principle.
  • This algorithm uses an upper confidence bound to balance exploration and exploitation. It selects arms with higher upper confidence bounds, which represent the uncertainty of the estimated rewards.
  • The UCB algorithm leverages the optimistic estimate of the expected rewards by using the upper confidence bound, which encourages exploration by choosing arms with uncertain or poorly estimated rewards. As the algorithm accumulates more data through the number of pulls, the confidence interval term shrinks, and the algorithm tends to exploit the arms with higher estimated rewards more often.
  • In the code below (source), we can see how UCB is implemented with 5 arms.
import numpy as np

# Number of arms
n_arms = 5

# True unknown probability of success for each arm (unknown to the algorithm)
true_probs = np.random.rand(n_arms)

# Number of rounds
n_rounds = 1000

# Initialize counters for each arm
n_pulls = np.zeros(n_arms)
n_successes = np.zeros(n_arms)

# Main loop for UCB
for t in range(n_rounds):
    # Calculate the upper confidence bound for each arm; the small constant in the
    # denominators avoids division by zero for arms that have not been pulled yet
    ucb_values = n_successes / (n_pulls + 1e-6) + np.sqrt(2 * np.log(t + 1) / (n_pulls + 1e-6))
    
    # Choose the arm with the highest upper confidence bound
    chosen_arm = np.argmax(ucb_values)
    
    # Simulate pulling the chosen arm and observe the reward
    reward = np.random.binomial(1, true_probs[chosen_arm])
    
    # Update the counters for the chosen arm
    n_pulls[chosen_arm] += 1
    n_successes[chosen_arm] += reward
    
    # Print the chosen arm and reward for this round
    print("Round:", t+1)
    print("Chosen Arm:", chosen_arm)
    print("Reward:", reward)
    print("------")

# Print the estimated probabilities of success for each arm
estimated_probs = n_successes / np.maximum(n_pulls, 1)  # avoid division by zero for arms that were never pulled
print("Estimated Probabilities of Success:", estimated_probs)

Offline Evaluation Replay

  • Offline evaluation replay is a method used to evaluate the performance of a recommendation or ranking system using historical data.
  • In offline evaluation replay, the system’s recommendations or rankings are generated by a given algorithm and then compared against the ground truth or actual outcomes recorded in historical interaction data. This approach is commonly used to evaluate contextual bandit algorithms; a minimal sketch of the replay estimator is shown below.
  • Offline evaluation replay has some limitations as well, such as its inability to capture the dynamics of real-time online interactions, user feedback, or system updates. It assumes that the historical dataset is representative of current user preferences and behaviors, which may not always be the case.
  • Nevertheless, offline evaluation replay can provide valuable insights and an initial assessment of the performance of recommendation or ranking algorithms before deploying them in live online systems.
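  • The snippet below is a minimal sketch of the standard replay estimator for bandit policies, assuming logged (context, action, reward) triples collected under a uniformly random logging policy; the log format, the synthetic reward model, and the policy callable are illustrative assumptions.
import numpy as np

def replay_evaluate(policy, logged_data):
    # Estimate a policy's average reward from logged (context, action, reward) triples.
    # Only rounds where the policy's chosen action matches the logged action are counted;
    # under a uniformly random logging policy this yields an unbiased estimate.
    matched_rewards = [reward for context, logged_action, reward in logged_data
                       if policy(context) == logged_action]
    return float(np.mean(matched_rewards)) if matched_rewards else float("nan")

# Example usage with a synthetic log of 10,000 rounds and 5 actions
rng = np.random.default_rng(0)
n_actions = 5
logged_data = []
for _ in range(10_000):
    context = rng.random(3)
    action = int(rng.integers(n_actions))              # uniformly random logging policy
    reward = int(rng.binomial(1, 0.1 + 0.1 * action))  # arbitrary synthetic reward model
    logged_data.append((context, action, reward))

always_last = lambda context: n_actions - 1            # a toy policy to evaluate
print("Replay estimate of average reward:", replay_evaluate(always_last, logged_data))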

Industry examples of bandits for recsys

  • Spotify: their approach to recommending explanations for music recommendations, “recsplanations,” is an example of epsilon-greedy.
  • Yahoo: their news recommendations utilize UCB.
  • Alibaba: also leverages UCB for item recommendations.
  • DoorDash: uses Thompson Sampling for cuisine recommendations.
  • Amazon: uses a bandit algorithm for real-time conversions on landing pages as well as click-through rates on search engine result pages.
  • Twitter: uses bandit strategies to supply personalized recommendations.

References