Chapter 23. MAB (multi-armed bandits)
Recommended Post: 【Algorithm】 Algorithm Table of Contents
1. Overview
2. UCB
3. Thompson Sampling
4. Early Stopping Problem
1. Overview
⑴ Overview
① Problem definition: selecting the arm with the highest expected payoff among several candidate arms
② Particularly necessary in the following situations
○ When it is difficult to calculate the objective function (black-box algorithm)
○ When there is no explicit analytic form
○ When singular points exist in the objective
○ When the function is not differentiable
○ When the data is highly noisy
⑵ Trade-off of two strategies
① Exploitation: maximizing reward by playing the arm with the best empirical mean from the data collected so far
② Exploration: sampling arms to improve the empirical means so that they approach the true means
③ If exploration is neglected, an empirical mean that deviates from the true distribution early on can never be corrected, so wrong judgments persist
⑶ Type 1. Bayesian Optimization
① Surrogate model
○ A prior distribution (the surrogate model) is placed over the reward distribution of each arm
○ A Gaussian Process is commonly used as the surrogate
○ The posterior distribution is updated every time a reward is observed
② Acquisition Function: the next point to sample is chosen so as to maximize the expected benefit of the next evaluation (common closed forms are sketched after this list)
○ All acquisition functions are computed from the posterior mean μ(x) and variance σ²(x), and the exploration strategy is designed through these two quantities
○ Type 1. EI (expected improvement): the closed-form expression for EI(x) holds only when a Gaussian Process surrogate is used
○ Here Φ and φ denote the CDF and PDF of the standard normal distribution, and Z is the standardized improvement at x
○ Higher λ leads to more exploration
○ Type 2. MPI (maximum probability of improvement)
○ Type 3. UCB
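Commonly used closed forms under a GP posterior, writing μ(x), σ(x) for the posterior mean and standard deviation, f(x⁺) for the best observation so far, λ ≥ 0 for the exploration margin, and κ for a tunable exploration weight (a sketch; the exact constants are assumptions):

$$\mathrm{EI}(x) = \mathbb{E}\!\left[\max\bigl(f(x) - f(x^{+}) - \lambda,\, 0\bigr)\right] = \bigl(\mu(x) - f(x^{+}) - \lambda\bigr)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f(x^{+}) - \lambda}{\sigma(x)}$$

$$\mathrm{MPI}(x) = \Phi(Z), \qquad \mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)$$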
③ Gaussian Process: Bayesian optimization assumes the objective values follow a joint normal distribution, as in the R example below
# Gaussian Process regression on 5 observations of f(x) = cos(x)
library(plotly)

# Squared-exponential covariance kernel (an assumed definition for varCov)
varCov <- function(a, b) {
  outer(a, b, function(u, v) exp(-0.5 * (u - v)^2))
}

x <- c(-3, -1, 0, 2, 3)
f <- data.frame(x = x, y = cos(x))
x_star <- seq(-5, 5, by = 0.1)   # grid of test inputs (assumed; not shown in the original)

k_xx   <- varCov(x, x)
k_xxs  <- varCov(x, x_star)
k_xsx  <- varCov(x_star, x)
k_xsxs <- varCov(x_star, x_star)

f_star_bar <- as.numeric(k_xsx %*% solve(k_xx) %*% f$y)        # posterior mean
cov_f_star <- k_xsxs - k_xsx %*% solve(k_xx) %*% k_xxs         # posterior covariance

plot_ly(type = 'scatter', mode = "lines") %>%
  add_trace(x = x_star, y = f_star_bar, name = "Function Estimate") %>%
  add_trace(x = x_star, y = f_star_bar - 2 * sqrt(diag(cov_f_star)),
            name = "Lower Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_trace(x = x_star, y = f_star_bar + 2 * sqrt(diag(cov_f_star)),
            name = "Upper Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_markers(x = f$x, y = f$y, name = "Obs. Data", marker = list(size = 20)) %>%
  add_trace(x = x_star, y = cos(x_star), line = list(width = 5), name = "True f()=cos(x)") %>%
  layout(title = "Gaussian Process Simulation (5 observations)", legend = list(orientation = 'h'))
④ Hyperparameter Optimization
○ Hyperparameters are found by maximizing the marginal likelihood, i.e., the probability that the data X and y occur, via Maximum Likelihood Estimation (MLE)
○ The marginal likelihood of a Gaussian Process and its hyperparameters take a standard form (a sketch follows this list)
○ Advantage: helps mitigate both underfitting and overfitting
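A commonly used form of this objective, assuming a zero-mean GP with kernel matrix K_θ (θ the kernel hyperparameters) and Gaussian noise variance σ_n² over n observations (X, y):

$$\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\,\mathbf{y}^{\top}\bigl(K_{\theta} + \sigma_n^{2} I\bigr)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\bigl|K_{\theta} + \sigma_n^{2} I\bigr| \;-\; \tfrac{n}{2}\log 2\pi$$

The first term rewards fit to the data and the second penalizes model complexity, which is why maximizing it discourages both underfitting and overfitting.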
⑤ References
○ Osband, Russo, and Van Roy 2013
○ Russo and Van Roy 2014, 2015, 2016
○ Bubeck and Liu 2013
○ Data Science and Predictive Analytics (UMich HS650) (PI: Ivo Dinov)
⑷ Type 2. Stochastic Multi-armed Bandit
① Feature 1. Online decision: at each time t = 1, ···, T, select one arm i_t among the N arms
② Feature 2. Stochastic Feedback: The rewards provided by each arm follow a fixed but unknown distribution
③ Feature 3. Bandit Feedback: At each time step, only the reward for the chosen arm is visible
④ Objective: Maximize cumulative expected rewards over T steps
⑤ Maximizing cumulative reward is equivalent to minimizing cumulative regret, where the per-step regret at time t is μ* − μ_{i_t} (formalized after this list)
⑥ Bandit: the name comes from the "one-armed bandit" slot machine, which takes the player's money like a robber
⑦ Example: Thompson Sampling
○ Along with UCB, one of the most widely used bandit algorithms
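A minimal formalization of ④ and ⑤, writing μ* for the expected reward of the best arm and μ_{i_t} for that of the arm chosen at time t:

$$\max\ \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right] \;\Longleftrightarrow\; \min\ \mathbb{E}\bigl[R(T)\bigr], \qquad R(T) = \sum_{t=1}^{T}\bigl(\mu^{*} - \mu_{i_t}\bigr)$$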
2. UCB
⑴ Along with Thompson Sampling, one of the most widely used bandit algorithms
⑵ Formulation
① Empirical mean of arm i up to time t (see the sketch after this list)
○ Numerator: Total reward obtained from arm i
○ I{i_s = i}: the indicator that equals 1 only when i_s = i, and 0 otherwise
○ n_{i,t}: the number of times arm i was chosen up to time t
② UCB (upper confidence bound)
○ Optimism: the index is constructed so that, with high probability, it is at least as large as the true expected value
○ High probability: Over 99.9%
③ Choose the arm with the maximum UCB at each step
④ Update empirical mean and UCB at every time step
○ Uncertainty represented by UCB decreases with each observation
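A common concrete form of the empirical mean in ① and the UCB index in ② (UCB1-style; the constant inside the square root is an assumption and may differ from the original formulas):

$$\hat{\mu}_{i,t} = \frac{\sum_{s=1}^{t} r_s\,\mathbb{1}\{i_s = i\}}{n_{i,t}}, \qquad n_{i,t} = \sum_{s=1}^{t} \mathbb{1}\{i_s = i\}$$

$$\mathrm{UCB}_{i,t} = \hat{\mu}_{i,t} + \sqrt{\frac{2\ln t}{n_{i,t}}}, \qquad i_{t+1} = \arg\max_{i}\ \mathrm{UCB}_{i,t}$$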
⑶ Proof of Convergence
① Definition of regret: for an arbitrary horizon T, the cumulative regret is the sum of the per-step gaps, R(T) = Σ_{t=1}^{T} (μ* − μ_{i_t})
② Convergence: Mathematically proven rigorously
③ Conclusion
○ The per-step regret converges to 0, meaning that in the long run almost every choice is the best choice (a small simulation sketch follows this list)
○ In contrast, selecting greedily by the empirical mean alone does not guarantee that the optimal arm is chosen
○ A purely greedy strategy may never try the optimal arm at all
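A minimal R sketch of the UCB1 rule on simulated Bernoulli arms; the arm means (0.3, 0.5, 0.7) and the horizon are arbitrary illustration values, not values from the original post.

# UCB1 on simulated Bernoulli arms: the average per-step regret shrinks over time
set.seed(1)
mu <- c(0.3, 0.5, 0.7)          # true (unknown) arm means; illustration values
N <- length(mu); T_max <- 5000
n <- rep(0, N)                  # pull counts n_{i,t}
s <- rep(0, N)                  # cumulative reward per arm
regret <- numeric(T_max)
for (t in 1:T_max) {
  if (t <= N) {
    i <- t                      # pull each arm once to initialize
  } else {
    ucb <- s / n + sqrt(2 * log(t) / n)
    i <- which.max(ucb)         # choose the arm with the largest UCB index
  }
  r <- rbinom(1, 1, mu[i])      # observe a Bernoulli reward from the chosen arm
  n[i] <- n[i] + 1; s[i] <- s[i] + r
  regret[t] <- max(mu) - mu[i]
}
sum(regret) / T_max             # average per-step regret; approaches 0 as T grows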
⑷ Comparison of UCB and Thompson Sampling
① Commonality 1. Total regret is O(log T): per-step regret converges to 0 as O(log T / T)
② Difference 1. Thompson Sampling often outperforms UCB in practice: UCB is a somewhat more conservative algorithm and tends to converge more slowly
③ Difference 2. UCB is deterministic while Thompson Sampling is probabilistic (randomized)
3. Thompson Sampling
⑴ Overview
① Oldest bandit algorithm: Introduced by Thompson in 1933
② Natural and efficient heuristic algorithm
⑵ Process
① Maintain belief for parameters of each arm (e.g., mean reward)
○ Use prior distributions such as Beta distribution or Gaussian distribution
② Draw an estimated reward from each arm's current belief distribution
○ A single random sample is drawn, rather than taking the expected value
○ Arms with fewer observations have higher variance, so they more easily produce large samples and therefore get explored
③ Choose the arm with the largest sampled estimate and observe its realized reward
○ The posterior distribution of every arm must be maintained by the researcher at all times
④ Update the prior belief with the posterior in the Bayesian manner
⑤ Works well even when the reward distribution is bimodal (has two modes)
⑶ Example: Thompson Sampling using the Beta distribution (an R sketch follows this list)
① Start with Beta(1, 1) as the prior belief for all arms
② For each time t,
○ Prior: Beta(α, β)
○ Independently sample θ_{i,t} from each arm's Beta distribution
○ Choose i_t = argmax_i θ_{i,t}
○ Beta(α+1, β): When 1 is observed
○ Beta(α, β+1): When 0 is observed
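A minimal R sketch of the Beta-Bernoulli procedure above; the true success probabilities and horizon are arbitrary illustration values.

# Thompson Sampling with Beta(1, 1) priors on Bernoulli arms
set.seed(1)
p <- c(0.3, 0.5, 0.7)           # true success probabilities; illustration values
N <- length(p); T_max <- 5000
a <- rep(1, N); b <- rep(1, N)  # Beta(1, 1) prior for every arm
for (t in 1:T_max) {
  theta <- rbeta(N, a, b)       # one sample from each arm's current Beta belief
  i <- which.max(theta)         # play the arm with the largest sampled value
  r <- rbinom(1, 1, p[i])       # observe a 0/1 reward
  a[i] <- a[i] + r              # Beta(a + 1, b) when 1 is observed
  b[i] <- b[i] + 1 - r          # Beta(a, b + 1) when 0 is observed
}
a / (a + b)                     # posterior means; these concentrate on the best arm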
⑷ Example: Thompson Sampling using Gaussian Distribution
① Start with N(0, ν²) as the prior belief for all arms
② For each time t,
○ Prior: N(0, ν²)
○ Independently sample θ_{i,t} from each arm's current normal distribution
○ Choose i_t = argmax_i θ_{i,t}
○ Update the posterior of the chosen arm using its empirical mean μ̂ and the number of independent observations n (a sketch of the update follows this list)
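A sketch of the update, assuming each reward is also observed with noise variance ν² (the standard conjugate-normal posterior used in Gaussian Thompson Sampling):

$$\theta_i \mid r_1, \dots, r_{n} \;\sim\; \mathcal{N}\!\left(\frac{\sum_{s=1}^{n} r_s}{n + 1},\ \frac{\nu^{2}}{n + 1}\right)$$

Arms that have been pulled more often therefore have tighter posteriors, so the sampled θ_{i,t} concentrates around the empirical mean.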
⑸ Convergence
① After enough steps, the posteriors of the arms become well separated, so the best arm is eventually identified
② Beta Distribution
③ Gaussian distribution: the proof is based on the Azuma-Hoeffding inequality (stated below)
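For reference, the Azuma-Hoeffding inequality for a martingale (X_k)_{k≥0} with bounded increments |X_k − X_{k−1}| ≤ c_k:

$$P\bigl(|X_n - X_0| \ge t\bigr) \;\le\; 2\exp\!\left(-\frac{t^{2}}{2\sum_{k=1}^{n} c_k^{2}}\right)$$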
⑹ Comparison of UCB and Thompson Sampling
① Commonality 1. Total regret is O(log T): Per time regret converges to 0 as O(log T / T)
② Difference 1. Thompson Sampling often outperforms UCB in practice: UCB is a somewhat more conservative algorithm and tends to converge more slowly
③ Difference 2. UCB is deterministic while Thompson Sampling is probabilistic (randomized)
4. Early Stopping Problem
⑴ Secretary Problem
The secretary problem is a puzzle that became popular among mathematicians in the 1940s and 50s. The setup: you need to hire one secretary, and there are N candidates lined up. You interview them one by one and must decide immediately whether to accept or reject each person. Once rejected, a candidate cannot be reconsidered, and once someone is hired, you see none of the remaining applicants. So, as the hiring manager, what is the best strategy for selecting the most qualified candidate? Surprisingly, there is a clear answer. Assuming the number of applicants is large enough, the optimal approach is to automatically reject roughly the first N/e ≈ 37% of candidates; with 100 applicants, you reject roughly the first 37. Then, among the remaining candidates, you hire the first person who is better than everyone you have seen so far. This strategy picks the very best candidate with probability converging to 1/e ≈ 36.8%, a beautiful bit of math magic. (A small simulation sketch follows.)
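A minimal R sketch that checks the 1/e result by simulation; the number of candidates and trials are arbitrary illustration values.

# Secretary problem: reject the first ~n/e candidates, then take the
# first candidate who beats everyone seen so far
set.seed(1)
n <- 100; trials <- 10000
cutoff <- round(n / exp(1))            # 37 for n = 100
hit <- replicate(trials, {
  ranks <- sample(n)                   # random arrival order; larger rank = better candidate
  best_seen <- max(ranks[1:cutoff])    # benchmark from the rejected phase
  later <- ranks[(cutoff + 1):n]
  pick <- later[which(later > best_seen)[1]]  # first candidate beating the benchmark
  !is.na(pick) && pick == n            # success = hired the overall best candidate
})
mean(hit)                              # close to 1/e ≈ 0.368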
⑵ Early stopping (optimal stopping) problem when Bayesian optimization is not applied
Posted: 2021.12.10 00:52
Edited: 2024.11.21 15:30