Chapter 23. MAB (multi-armed bandits)
Recommended Post: 【Algorithm】 Algorithm Table of Contents
1. Overview
2. UCB
3. Thompson Sampling
1. Overview
⑴ Overview
① Problem Definition: among multiple arms (choices), select the optimal arm, i.e., the one with the highest expected payoff
② Particularly necessary in the following situations
○ When it is difficult to calculate the objective function (black-box algorithm)
○ When there is no explicit analytic form
○ When outliers are present in the data
○ When the function is not differentiable
○ When the data is highly noisy
⑵ Trade-off of two strategies
① Exploitation: maximizing reward by choosing the arm with the highest empirical mean under the data collected so far
② Exploration: gathering additional observations so that each arm's empirical mean approaches its true mean
③ If the initial data happen to deviate from the true distribution and exploration is neglected, the resulting incorrect judgment is never corrected
⑶ Type 1. Bayesian Optimization
① Surrogate Model
○ A prior distribution (surrogate model) is placed over the reward distribution of each arm
○ A Gaussian Process is often used as the prior (surrogate) model
○ The posterior distribution is updated every time a reward is observed
② Acquisition Function: determines which point to sample next so as to maximize the expected benefit of the next observation
○ All acquisition functions are computed from the posterior mean μ(x) and variance σ²(x), and the exploration strategy is designed through these two quantities (see the formulas after this list)
○ Type 1. EI (expected improvement): the closed-form expression for EI(x) holds only when the surrogate is a Gaussian Process
○ Here Φ and 𝜙 are the CDF and PDF of the standard normal distribution, and Z is the standardized improvement at x
○ A higher λ leads to more exploration
○ Type 2. MPI (maximum probability of improvement)
○ Type 3. UCB (upper confidence bound)
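The formulas referenced in the list above can be sketched in their standard form as follows. This is written under the assumption that λ plays the role of the usual exploration margin, x⁺ denotes the best observation so far, and κ is the exploration weight of the UCB acquisition function; these symbols are not defined in the excerpt above.

Z = \frac{\mu(x) - f(x^{+}) - \lambda}{\sigma(x)}

\mathrm{EI}(x) = \mathbb{E}\left[\max\left(f(x) - f(x^{+}) - \lambda,\, 0\right)\right]
             = \left(\mu(x) - f(x^{+}) - \lambda\right)\Phi(Z) + \sigma(x)\,\phi(Z) \quad (\text{valid for } \sigma(x) > 0;\ \mathrm{EI}(x) = 0 \text{ otherwise})

\mathrm{MPI}(x) = P\left(f(x) \ge f(x^{+}) + \lambda\right) = \Phi(Z)

\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)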
③ Gaussian Process: during Bayesian optimization, the function values are assumed to be jointly normally distributed; the R simulation below fits a GP to 5 observations of cos(x)
# Gaussian Process regression on 5 observations of f(x) = cos(x)
# Note: varCov and x_star are not defined in this excerpt; the definitions below are
# assumptions (a squared-exponential kernel and a test grid) so that the code runs standalone.
library(plotly)

varCov <- function(x1, x2, l = 1) {      # assumed squared-exponential (RBF) kernel
  outer(x1, x2, function(a, b) exp(-(a - b)^2 / (2 * l^2)))
}
x_star <- seq(-5, 5, by = 0.05)          # assumed grid of test points where the GP is evaluated

x <- c(-3, -1, 0, 2, 3)                  # observed inputs
f <- data.frame(x = x, y = cos(x))       # observed outputs y = cos(x)
x <- f$x

k_xx   <- varCov(x, x)                   # covariance among observed points
k_xxs  <- varCov(x, x_star)              # covariance between observed and test points
k_xsx  <- varCov(x_star, x)
k_xsxs <- varCov(x_star, x_star)         # covariance among test points

f_star_bar <- as.vector(k_xsx %*% solve(k_xx) %*% f$y)     # posterior mean
cov_f_star <- k_xsxs - k_xsx %*% solve(k_xx) %*% k_xxs     # posterior covariance
sd_f_star  <- sqrt(pmax(diag(cov_f_star), 0))              # posterior sd (pmax guards against tiny negative values)

plot_ly(type = 'scatter', mode = "lines") %>%
  add_trace(x = x_star, y = f_star_bar, name = "Function Estimate") %>%
  add_trace(x = x_star, y = f_star_bar - 2 * sd_f_star,
            name = "Lower Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_trace(x = x_star, y = f_star_bar + 2 * sd_f_star,
            name = "Upper Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_markers(x = f$x, y = f$y, name = "Obs. Data", marker = list(size = 20)) %>%
  add_trace(x = x_star, y = cos(x_star), line = list(width = 5), name = "True f()=cos(x)") %>%
  layout(title = "Gaussian Process Simulation (5 observations)", legend = list(orientation = 'h'))
④ Hyperparameter Optimization
○ Find the kernel hyperparameters that maximize the marginal likelihood, i.e., the probability of observing the data X and y, via maximum likelihood estimation (MLE)
○ The marginal likelihood of a Gaussian Process and its hyperparameters are defined as shown after this list
○ Advantage: mitigates both underfitting and overfitting
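A sketch of the standard Gaussian Process log marginal likelihood referred to above; here θ denotes the kernel hyperparameters and K_θ the covariance matrix (including observation noise) that they induce. The exact parameterization in the original post may differ.

\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^{\top} K_{\theta}^{-1} y - \tfrac{1}{2} \log \lvert K_{\theta} \rvert - \tfrac{n}{2} \log 2\pi,
\qquad
\hat{\theta} = \arg\max_{\theta}\, \log p(y \mid X, \theta)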
⑤ References
○ Osband, Russo, and Van Roy 2013
○ Russo and Van Roy 2014, 2015, 2016
○ Bubeck and Liu 2013
○ Data Science and Predictive Analytics (UMich HS650) (PI: Ivo Dinov)
⑷ Type 2. Stochastic Multi-armed Bandit
① Feature 1. Online Decision: at each time t = 1, ···, T, select one arm i_t among the N arms
② Feature 2. Stochastic Feedback: The rewards provided by each arm follow a fixed but unknown distribution
③ Feature 3. Bandit Feedback: At each time step, only the reward for the chosen arm is visible
④ Objective: Maximize cumulative expected rewards over T steps
⑤ Maximizing reward is equivalent to minimizing regret, where the per-step regret is μ* − μ_{i_t} and μ* is the mean reward of the best arm (see the formulation after this list)
⑥ Bandit: the name alludes to a robber that takes your money when you lose, as in the slot machine's nickname, the "one-armed bandit"
⑦ Example: Thompson Sampling
○ Along with UCB, one of the most widely used bandit algorithms
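The setting above can be written compactly; this is the standard formulation of the stochastic bandit objective and its cumulative regret, with notation chosen to match the per-step regret in ⑤.

\max \; \mathbb{E}\Big[\sum_{t=1}^{T} r_{i_t, t}\Big],
\qquad
\mathrm{Regret}(T) = T\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{i_t}\Big],
\qquad
\mu^{*} = \max_{1 \le i \le N} \mu_{i}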
2. UCB
⑴ Along with Thompson Sampling, one of the most widely used bandit algorithms
⑵ Formulation
① Empirical mean of arm i up to time t (written out at the end of this subsection)
○ Numerator: total reward obtained from arm i
○ 𝕀{i_s = i}: indicator that equals 1 only when i_s = i, and 0 otherwise
○ n_{i,t}: number of times arm i has been chosen up to time t
② UCB (upper confidence bound)
○ Optimism: the bound is designed to lie, with high probability, above the true expected value
○ High probability: e.g., over 99.9%
③ Choose the arm with the maximum UCB at each step
④ Update empirical mean and UCB at every time step
○ Uncertainty represented by UCB decreases with each observation
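The empirical mean and the upper confidence bound described in ① and ② can be sketched in the standard UCB1 form; the exact constant inside the square root is an assumption, since the original formulas are not reproduced in this excerpt.

\hat{\mu}_{i,t} = \frac{\sum_{s=1}^{t} r_{s}\, \mathbb{I}\{i_s = i\}}{n_{i,t}},
\qquad
n_{i,t} = \sum_{s=1}^{t} \mathbb{I}\{i_s = i\}

\mathrm{UCB}_{i,t} = \hat{\mu}_{i,t} + \sqrt{\frac{2 \ln t}{n_{i,t}}},
\qquad
i_{t+1} = \arg\max_{i}\, \mathrm{UCB}_{i,t}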
⑶ Proof of Convergence
① Definition of Regret: for an arbitrary horizon T, the regret is the gap between Tμ* and the expected cumulative reward actually collected (as formalized in the stochastic bandit setting above)
② Convergence: the regret bound can be proven mathematically in a rigorous way
③ Conclusion
○ Per-time regret converges to 0: asymptotically, almost every choice is the best arm
○ In contrast, relying on the empirical mean alone does not guarantee selection of the optimal arm
○ and may even result in never trying the optimal arm
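A minimal R simulation of the UCB1 rule described above on a Bernoulli bandit; the arm means, the horizon, and the constant 2 inside the confidence radius are illustrative assumptions.

# Minimal UCB1 simulation on a Bernoulli bandit (illustrative arm means)
set.seed(1)
mu      <- c(0.3, 0.5, 0.7)      # true (unknown) mean rewards; arm 3 is optimal
N       <- length(mu)
T_steps <- 5000

n      <- rep(0, N)              # n_{i,t}: number of pulls of each arm
sum_r  <- rep(0, N)              # cumulative reward of each arm
regret <- numeric(T_steps)

for (t in 1:T_steps) {
  if (t <= N) {
    i <- t                                        # pull each arm once to initialize
  } else {
    ucb <- sum_r / n + sqrt(2 * log(t) / n)       # empirical mean + confidence radius
    i   <- which.max(ucb)                         # choose the arm with the maximum UCB
  }
  r         <- rbinom(1, 1, mu[i])                # bandit feedback: only the chosen arm's reward is seen
  n[i]      <- n[i] + 1
  sum_r[i]  <- sum_r[i] + r
  regret[t] <- max(mu) - mu[i]                    # per-step (pseudo-)regret
}

n                        # pull counts: the optimal arm dominates
sum(regret)              # cumulative regret; grows only logarithmically in T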
⑷ Comparison of UCB and Thompson Sampling
① Commonality 1. Total regret is O(log T): per-time regret converges to 0 as O(log T / T)
② Difference 1. Thompson Sampling often outperforms UCB: UCB is a slightly more conservative algorithm and converges more slowly
③ Difference 2. UCB is deterministic while Thompson Sampling is probabilistic (randomized)
3. Thompson Sampling
⑴ Overview
① Oldest bandit algorithm: Introduced by Thompson in 1933
② Natural and efficient heuristic algorithm
⑵ Process
① Maintain belief for parameters of each arm (e.g., mean reward)
○ Use prior distributions such as Beta distribution or Gaussian distribution
② Sample an estimated reward from each arm's current belief distribution
○ A single random draw is used, not the expected value of the distribution
○ Arms with fewer observations have higher variance, so optimistic values are sampled more easily, which drives exploration
③ Choose the arm with the maximum sampled reward and observe its actual reward
○ The posterior distribution of every arm must be kept up to date by the researcher
④ Update the prior belief of the chosen arm to its posterior using the Bayesian approach
⑤ Works well even when the reward distribution is bimodal (has two modes)
⑶ Example: Thompson Sampling using Beta Distribution
① Start with Beta(1, 1) as the prior belief for all arms
② For each time t,
○ Prior: Beta(α, β)
○ Independently sample θ_{i,t} ~ Beta(α_i, β_i) for each arm i
○ Choose i_t = argmax_i θ_{i,t} and observe a 0/1 reward for that arm
○ Posterior update for the chosen arm: Beta(α+1, β) when 1 is observed
○ Beta(α, β+1) when 0 is observed
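A minimal R sketch of the Beta-Bernoulli Thompson Sampling loop above; the true success probabilities and the horizon are illustrative assumptions.

# Thompson Sampling with Beta(1, 1) priors on a Bernoulli bandit (illustrative arm probabilities)
set.seed(1)
p       <- c(0.3, 0.5, 0.7)          # true (unknown) success probabilities
N       <- length(p)
T_steps <- 5000
a <- rep(1, N)                       # alpha parameters: Beta(1, 1) prior for every arm
b <- rep(1, N)                       # beta parameters

for (t in 1:T_steps) {
  theta <- rbeta(N, a, b)            # one independent draw per arm from its current belief
  i     <- which.max(theta)          # play the arm with the largest sampled value
  r     <- rbinom(1, 1, p[i])        # observe a 0/1 reward for the chosen arm only
  a[i]  <- a[i] + r                  # Beta(alpha + 1, beta) when 1 is observed
  b[i]  <- b[i] + (1 - r)            # Beta(alpha, beta + 1) when 0 is observed
}

round(a / (a + b), 2)                # posterior means concentrate near the true probabilities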
⑷ Example: Thompson Sampling using Gaussian Distribution
① Start with N(0, ν²) as the prior belief for all arms
② For each time t,
○ Prior: N(0, ν²)
○ Independently sample θ_{i,t} from each arm's current posterior
○ Choose i_t = argmax_i θ_{i,t}
○ Update the empirical mean μ̂ (the posterior mean) of the chosen arm, where n is the number of independent observations of that arm; the posterior variance shrinks as n grows
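A minimal R sketch of the Gaussian case above, under the common additional assumption of unit-variance Gaussian rewards; with an N(0, ν²) prior, the posterior of an arm's mean after n observations is then N(sum of rewards / (n + 1/ν²), 1 / (n + 1/ν²)). This concrete posterior form is an assumption, not taken from the original post.

# Gaussian Thompson Sampling, assuming unit-variance rewards and N(0, nu^2) priors
set.seed(1)
mu_true <- c(0.3, 0.5, 0.7)          # true (unknown) mean rewards (illustrative)
N       <- length(mu_true)
T_steps <- 5000
nu2     <- 1                         # prior variance nu^2

n     <- rep(0, N)                   # observation counts per arm
sum_r <- rep(0, N)                   # sum of observed rewards per arm

for (t in 1:T_steps) {
  post_prec <- 1 / nu2 + n                              # posterior precision of each arm's mean
  post_mean <- sum_r / post_prec                        # posterior mean (empirical mean shrunk toward 0)
  theta <- rnorm(N, post_mean, sqrt(1 / post_prec))     # one draw per arm from its posterior
  i     <- which.max(theta)                             # choose the arm with the largest draw
  r     <- rnorm(1, mu_true[i], 1)                      # observe a noisy reward for that arm only
  n[i]     <- n[i] + 1
  sum_r[i] <- sum_r[i] + r
}

n                                    # the best arm accumulates most of the pulls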
⑸ Convergence
① After enough steps, the posteriors of the arms become well separated, so the best arm is eventually identified and chosen almost always
② Beta Distribution
③ Gaussian Distribution: Based on Azuma-Hoeffding inequality
⑹ Comparison of UCB and Thompson Sampling
① Commonality 1. Total regret is O(log T): per-time regret converges to 0 as O(log T / T)
② Difference 1. Thompson Sampling often outperforms UCB: UCB is a slightly more conservative algorithm and converges more slowly
③ Difference 2. UCB is deterministic while Thompson Sampling is probabilistic (randomized)
Input: 2021.12.10 00:52
Edited: 2024.11.21 15:30