
Chapter 23. MAB (multi-armed bandits)

Recommended Post: 【Algorithm】 Algorithm Table of Contents


1. Overview

2. UCB

3. Thompson Sampling



1. Overview

⑴ Overview

① Problem Definition: Selecting the optimal arm, i.e., the arm with the highest expected payoff

② Particularly necessary in the following situations

○ When it is difficult to calculate the objective function (black-box algorithm)

○ When there is no explicit analytic form

○ When the data contain outliers (singular values)

○ When the function is not differentiable

○ When the data is highly noisy

⑵ Trade-off of two strategies

① Exploitation: maximizing reward by relying on the empirical mean estimated from the data collected so far

② Exploration: gathering additional data so that the empirical mean approaches the true mean

③ If exploration is neglected when the initial data deviate from the true distribution, the resulting incorrect judgment can never be corrected

Type 1. Bayesian Optimization

① Bayes’ theorem


$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$


○ Prior distribution (surrogate model) is given for the reward distribution of each arm

○ Gaussian Process is often used as the prior model

○ Posterior distribution is obtained every time a reward is observed

② Acquisition Function: the next sample point x_{t+1} is chosen to maximize an acquisition function a(x), i.e., the expected benefit of the next observation


$$x_{t+1} = \arg\max_{x}\ a(x \mid D_t)$$


○ All acquisition functions compute the posterior mean μ(x) and variance σ²(x) from the posterior distribution and design the exploration strategy from them (a small R sketch follows the three types below)

Type 1. EI (expected improvement): The second formula for EI(x) holds only when using a Gaussian Process

○ Here, Φ and φ are the CDF and PDF of the standard normal distribution, respectively, and Z is the standardized value of x

○ Higher λ leads to more exploration


$$EI(x) = \mathbb{E}\left[\max\left(f(x) - f(x^{+}) - \lambda,\ 0\right)\right] = \left(\mu(x) - f(x^{+}) - \lambda\right)\Phi(Z) + \sigma(x)\,\phi(Z),\qquad Z = \frac{\mu(x) - f(x^{+}) - \lambda}{\sigma(x)}$$


Type 2. MPI (maximum probability of improvement)


$$MPI(x) = P\left(f(x) \ge f(x^{+}) + \lambda\right) = \Phi\!\left(\frac{\mu(x) - f(x^{+}) - \lambda}{\sigma(x)}\right)$$


Type 3. UCB


$$UCB(x) = \mu(x) + \lambda\,\sigma(x)$$
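
The following is a minimal R sketch of the three acquisition functions above, assuming the surrogate model supplies the posterior mean μ(x) and standard deviation σ(x) at each candidate point; the function names (ei, mpi, ucb) and the toy values are illustrative and not from the original post.

# Acquisition functions, given vectors of posterior means/sds at candidate points
ei <- function(mu, sigma, f_best, lambda = 0.01) {     # expected improvement
  z <- (mu - f_best - lambda) / sigma
  (mu - f_best - lambda) * pnorm(z) + sigma * dnorm(z)
}
mpi <- function(mu, sigma, f_best, lambda = 0.01) {    # maximum probability of improvement
  pnorm((mu - f_best - lambda) / sigma)
}
ucb <- function(mu, sigma, lambda = 2) {               # upper confidence bound
  mu + lambda * sigma
}

# Toy example: three candidate points and the best value observed so far
mu     <- c(0.2, 0.5, 0.1)
sigma  <- c(0.3, 0.1, 0.4)
f_best <- 0.45
which.max(ei(mu, sigma, f_best))   # index of the next point to sample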


③ Gaussian process: the surrogate model used in Bayesian optimization, assuming the function values are jointly normally distributed


$$\bar{f}_* = K(X_*, X)\,K(X, X)^{-1}\,y,\qquad \mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,K(X, X)^{-1}\,K(X, X_*)$$


library(plotly)

# Squared-exponential (RBF) covariance function between two input vectors
varCov <- function(a, b, l = 1) {
  outer(a, b, function(u, v) exp(-(u - v)^2 / (2 * l^2)))
}

x <- c(-3, -1, 0, 2, 3)                 # 5 observed inputs
f <- data.frame(x = x, y = cos(x))      # noise-free observations of f(x) = cos(x)
x_star <- seq(-5, 5, by = 0.1)          # prediction grid

k_xx   <- varCov(x, x)
k_xxs  <- varCov(x, x_star)
k_xsx  <- varCov(x_star, x)
k_xsxs <- varCov(x_star, x_star)

# GP posterior at the prediction grid
f_star_bar <- as.vector(k_xsx %*% solve(k_xx) %*% f$y)   # Mean
cov_f_star <- k_xsxs - k_xsx %*% solve(k_xx) %*% k_xxs   # Var
sd_f_star  <- sqrt(pmax(diag(cov_f_star), 0))            # clamp tiny negative round-off values

plot_ly(type = 'scatter', mode = "lines") %>%
  add_trace(x = x_star, y = f_star_bar, name = "Function Estimate") %>%
  add_trace(x = x_star, y = f_star_bar - 2 * sd_f_star,
            name = "Lower Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_trace(x = x_star, y = f_star_bar + 2 * sd_f_star,
            name = "Upper Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_markers(x = f$x, y = f$y, name = "Obs. Data", marker = list(size = 20)) %>%
  add_trace(x = x_star, y = cos(x_star), line = list(width = 5), name = "True f()=cos(x)") %>%
  layout(title = "Gaussian Process Simulation (5 observations)", legend = list(orientation = 'h'))


④ Hyperparameter Optimization

○ Search for the hyperparameters that maximize the marginal likelihood, i.e., the probability of observing the data X and y, via Maximum Likelihood Estimation (MLE), as in the sketch after this list

○ Marginal likelihood and Gaussian Process hyperparameters are defined as follows


$$\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^{\top} K_{\theta}^{-1} y - \tfrac{1}{2}\log \lvert K_{\theta} \rvert - \tfrac{n}{2}\log 2\pi,\qquad \theta = (\sigma_f,\ \ell,\ \sigma_n)$$


○ Advantage: mitigates both underfitting and overfitting
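
As a rough illustration of the MLE step above, the sketch below fits the length-scale ℓ of a squared-exponential kernel by minimizing the negative log marginal likelihood with optim(); the kernel choice, the noise jitter, and the cos(x) toy data are assumptions made only for this example.

# Negative log marginal likelihood of a GP with a squared-exponential kernel
x <- c(-3, -1, 0, 2, 3)
y <- cos(x)

neg_log_marglik <- function(log_l, x, y, noise = 1e-6) {
  l <- exp(log_l)                                    # optimize on the log scale so that l > 0
  K <- outer(x, x, function(u, v) exp(-(u - v)^2 / (2 * l^2))) + noise * diag(length(x))
  0.5 * drop(t(y) %*% solve(K, y)) +
    0.5 * as.numeric(determinant(K, logarithm = TRUE)$modulus) +
    0.5 * length(y) * log(2 * pi)
}

fit <- optim(par = 0, fn = neg_log_marglik, x = x, y = y, method = "BFGS")
exp(fit$par)   # maximum-likelihood estimate of the length-scale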

⑤ References

○ Osband, Russo, and Van Roy 2013

○ Russo and Van Roy 2014, 2015, 2016

○ Bubeck and Liu 2013

○ Data Science and Predictive Analytics (UMich HS650) (PI: Ivo Dinov)

Type 2. Stochastic Multi-armed Bandit

① Feature 1. Online Decision: at each time t = 1, ···, T, select one arm i_t among the N arms

② Feature 2. Stochastic Feedback: the reward provided by each arm follows a fixed but unknown distribution

③ Feature 3. Bandit Feedback: at each time step, only the reward of the chosen arm is observed

④ Objective: Maximize cumulative expected rewards over T steps


$$\max\ \mathbb{E}\!\left[\sum_{t=1}^{T} r_{i_t,\,t}\right]$$


⑤ Maximizing reward is equivalent to minimizing the regret μ* − μ_{i_t} accumulated over time (a small numerical illustration follows the formula below)


$$R_T = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{i_t}\right] = \sum_{t=1}^{T}\left(\mu^{*} - \mathbb{E}\!\left[\mu_{i_t}\right]\right)$$
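
A tiny numerical illustration of this regret definition in R; the true means and the sequence of chosen arms are hypothetical values made up for the example.

mu     <- c(0.2, 0.5, 0.35)        # true (unknown) mean rewards of 3 arms
chosen <- c(1, 2, 3, 2, 2, 2)      # arms pulled at t = 1, ..., 6
regret_per_step <- max(mu) - mu[chosen]
cumsum(regret_per_step)            # cumulative regret R_t at each step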


⑥ Bandit: the name comes from slot machines ("one-armed bandits"), which take your money like a robber when you lose

⑦ Example: Thompson Sampling

○ Along with UCB, one of the most widely used bandit algorithms



2. UCB

⑴ Along with Thompson Sampling, one of the most widely used bandit algorithms

⑵ Formulation

① Empirical mean of arm i up to time t


$$\hat{\mu}_{i,t} = \frac{\sum_{s=1}^{t} r_{s}\,\mathbb{I}\{i_s = i\}}{n_{i,t}},\qquad n_{i,t} = \sum_{s=1}^{t}\mathbb{I}\{i_s = i\}$$


○ Numerator: total reward obtained from arm i

○ I{i_s = i}: equals 1 only when i_s = i, and 0 otherwise

○ n_{i,t}: number of times arm i has been chosen up to time t

② UCB (upper confidence bound)


$$UCB_{i,t} = \hat{\mu}_{i,t} + \sqrt{\frac{2\log t}{n_{i,t}}}$$


○ Optimism: with high probability, the UCB is taken to be higher than the true expected reward

○ High probability: e.g., over 99.9%

③ Choose the arm with the maximum UCB at each step


$$i_t = \arg\max_{i}\ UCB_{i,\,t-1}$$


④ Update empirical mean and UCB at every time step

○ The uncertainty term in the UCB shrinks with each additional observation (see the simulation sketch below)
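
A minimal UCB simulation sketch in R, assuming Bernoulli rewards and the bonus term √(2 log t / n_{i,t}) from the formula above; the arm means and horizon are illustrative values.

set.seed(1)
mu <- c(0.3, 0.5, 0.7)             # true (unknown) success probabilities
N  <- length(mu); T_steps <- 1000
n  <- rep(0, N)                    # number of pulls per arm
s  <- rep(0, N)                    # cumulative reward per arm

for (t in 1:T_steps) {
  if (t <= N) {
    i <- t                         # pull each arm once to initialize
  } else {
    ucb <- s / n + sqrt(2 * log(t) / n)
    i <- which.max(ucb)            # choose the arm with the largest UCB
  }
  r <- rbinom(1, 1, mu[i])         # bandit feedback: only the chosen arm's reward is seen
  n[i] <- n[i] + 1
  s[i] <- s[i] + r
}
n                                  # pull counts; most pulls should go to the best arm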

⑶ Proof of Convergence

① Definition of Regret: For arbitrary time T,


$$R_T = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{i_t}\right]$$


② Convergence: the logarithmic regret bound below has been proven rigorously


$$R_T \le O\!\left(\sum_{i:\,\Delta_i > 0} \frac{\log T}{\Delta_i}\right),\qquad \Delta_i = \mu^{*} - \mu_i$$


③ Conclusion

○ The per-time regret converges to 0, meaning that eventually (almost) every choice is the best arm

○ In contrast, choosing greedily based only on the empirical mean does not guarantee that the optimal arm is selected

○ It may even result in never trying the optimal arm at all

⑷ Comparison of UCB and Thompson Sampling

① Commonality 1. Total regret is O(log T): the per-time regret converges to 0 as O(log T / T)

② Difference 1. Thompson Sampling often outperforms UCB: UCB is a slightly more conservative algorithm and converges more slowly

③ Difference 2. UCB is deterministic, while Thompson Sampling is probabilistic (randomized)

3. Thompson Sampling

⑴ Overview

① Oldest bandit algorithm: Introduced by Thompson in 1933

② Natural and efficient heuristic algorithm

⑵ Process

① Maintain a belief (distribution) over the parameters of each arm (e.g., its mean reward)

○ Use prior distributions such as the Beta distribution or the Gaussian distribution

② Sample an estimated reward for each arm from its current belief distribution

○ A random draw is used, not the expected value

○ Arms with fewer observations have higher variance, so large values are drawn more easily, which encourages exploration

③ Choose the arm with the maximum sampled reward and observe its reward

○ The posterior distribution of every arm must always be maintained

④ Update the belief by replacing the prior with the posterior (Bayesian updating)

⑤ Works well even with two modes (e.g., bimodal reward distributions)

⑶ Example: Thompson Sampling using the Beta Distribution (see the R sketch after the steps below)

① Start with Beta(1, 1) as the prior belief for all arms

② For each time t,

○ Prior: Beta(α, β)

○ Independently sample θ_{i,t} for each arm

○ Choose i_t = argmax_i θ_{i,t}

○ Beta(α+1, β): posterior of the chosen arm when a reward of 1 is observed

○ Beta(α, β+1): posterior of the chosen arm when a reward of 0 is observed
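
A minimal Beta–Bernoulli Thompson Sampling sketch in R that follows the steps above; the true success probabilities and the horizon are illustrative assumptions.

set.seed(1)
mu <- c(0.3, 0.5, 0.7)                  # true (unknown) success probabilities
N  <- length(mu); T_steps <- 1000
alpha <- rep(1, N); beta <- rep(1, N)   # Beta(1, 1) prior for every arm

for (t in 1:T_steps) {
  theta <- rbeta(N, alpha, beta)        # one independent draw per arm from its belief
  i <- which.max(theta)                 # choose the arm with the largest sampled reward
  r <- rbinom(1, 1, mu[i])              # observe the chosen arm's reward
  alpha[i] <- alpha[i] + r              # Beta(α+1, β) when 1 is observed
  beta[i]  <- beta[i] + (1 - r)         # Beta(α, β+1) when 0 is observed
}
alpha / (alpha + beta)                  # posterior mean estimate for each arm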

⑷ Example: Thompson Sampling using the Gaussian Distribution

① Start with N(0, ν²) as the prior belief for all arms

② For each time t,

○ Prior: N(0, ν²)

○ Independently sample θ_{i,t} for each arm

○ Choose i_t = argmax_i θ_{i,t}

○ Update the empirical mean μ̂ (posterior) of the chosen arm, where n is the number of independent observations (see the sketch after the formula below)


$$\hat{\mu}_{i} = \frac{1}{n_i + 1}\sum_{s \le t:\; i_s = i} r_{s},\qquad \theta_{i,t} \sim N\!\left(\hat{\mu}_{i},\ \frac{\nu^{2}}{n_i + 1}\right)$$
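
A minimal Gaussian Thompson Sampling sketch in R, assuming unit-variance Gaussian rewards and the posterior N(μ̂ᵢ, ν²/(nᵢ + 1)) written above; the true means and the horizon are illustrative values.

set.seed(1)
mu <- c(0.0, 0.5, 1.0)                        # true (unknown) mean rewards
N  <- length(mu); T_steps <- 1000; nu <- 1    # prior N(0, nu^2)
n  <- rep(0, N)                               # observation counts
s  <- rep(0, N)                               # cumulative rewards

for (t in 1:T_steps) {
  mu_hat <- s / (n + 1)                          # posterior mean of each arm
  theta  <- rnorm(N, mu_hat, nu / sqrt(n + 1))   # one draw per arm from its posterior
  i <- which.max(theta)                          # choose the arm with the largest draw
  r <- rnorm(1, mu[i], 1)                        # observe a noisy reward for that arm
  n[i] <- n[i] + 1
  s[i] <- s[i] + r
}
n                                             # most pulls should concentrate on the best arm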


⑸ Convergence

① After a sufficient number of steps, the beliefs about the two arms become well separated, so the best arm eventually emerges


[Figure: as observations accumulate, the posterior beliefs of the two arms become well separated]


② Beta Distribution


[Regret bound for Thompson Sampling with Beta priors: cumulative regret grows as O(log T)]


③ Gaussian Distribution: Based on Azuma-Hoeffding inequality


[Regret bound for Thompson Sampling with Gaussian priors, obtained via the Azuma–Hoeffding inequality: cumulative regret grows as O(log T)]


⑹ Comparison of UCB and Thompson Sampling

① Commonality 1. Total regret is O(log T): the per-time regret converges to 0 as O(log T / T)

② Difference 1. Thompson Sampling often outperforms UCB: UCB is a slightly more conservative algorithm and converges more slowly

③ Difference 2. UCB is deterministic, while Thompson Sampling is probabilistic (randomized)



Posted: 2021.12.10 00:52

Edited: 2024.11.21 15:30
