
Chapter 11-1. Multi-armed bandits



1. Overview

2. ϵ-greedy Exploration

3. UCB

4. Thompson Sampling



1. Overview

⑴ Overview

① Problem Definition: Selecting the optimal arm with the highest payoff



Figure 1. MAB


② Particularly necessary in the following situations

○ When it is difficult to calculate the objective function (black-box algorithm)

○ When there is no explicit analytic form

○ When the function has singular points

○ When the function is not differentiable

○ When the data is highly noisy

⑵ Trade-off of two strategies

① Actions are classified as exploitation and exploration.

② Exploitation: Maximizing reward using the empirical mean estimated from the data collected so far.

③ Exploration: Improving the empirical mean so that it approaches the true mean.

④ If exploration is neglected when the initial data deviates from the true distribution, the resulting incorrect judgments are never corrected.

Type 1. Bayesian Optimization

① Bayes’ theorem


$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$


○ Prior distribution (surrogate model) is given for the reward distribution of each arm

○ Gaussian Process is often used as the prior model

○ Posterior distribution is obtained every time a reward is observed

② Acquisition Function: The next sample point is chosen so as to maximize the expected reward of the next evaluation


$$x_{t+1} = \underset{x}{\operatorname{argmax}}\; a(x \mid \mathcal{D}_{1:t})$$


○ All acquisition functions compute the mean μ(x) and variance σ²(x) from the posterior distribution and design their exploration strategies through them

Type 1. EI (expected improvement): The second formula for EI(x) holds only when using a Gaussian Process

○ Note that Φ and 𝜙 are the CDF and PDF of the standard normal distribution, and Z is the standardized value of x

○ Higher λ leads to more exploration


$$\mathrm{EI}(x) = \mathbb{E}\left[\max\left(f(x) - f(x^{+}) - \lambda,\; 0\right)\right]$$

$$\mathrm{EI}(x) = \begin{cases} \left(\mu(x) - f(x^{+}) - \lambda\right)\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}, \qquad Z = \frac{\mu(x) - f(x^{+}) - \lambda}{\sigma(x)}$$

where $f(x^{+})$ is the best objective value observed so far.


Type 2. MPI (maximum probability of improvement)


$$\mathrm{MPI}(x) = P\left(f(x) \ge f(x^{+}) + \lambda\right) = \Phi\!\left(\frac{\mu(x) - f(x^{+}) - \lambda}{\sigma(x)}\right)$$


Type 3. UCB


$$\mathrm{UCB}(x) = \mu(x) + \lambda\,\sigma(x)$$
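Under a Gaussian posterior, the three acquisition functions above have simple closed forms. A minimal Python sketch follows; the candidate points, `f_best`, and the λ values are illustrative assumptions, not values from the text.

```python
import math

def Phi(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """PDF of the standard normal distribution."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def ei(mu, sigma, f_best, lam=0.01):
    """Expected improvement; this closed form holds for a Gaussian posterior."""
    if sigma == 0.0:
        return 0.0
    z = (mu - f_best - lam) / sigma
    return (mu - f_best - lam) * Phi(z) + sigma * phi(z)

def mpi(mu, sigma, f_best, lam=0.01):
    """Maximum probability of improvement."""
    if sigma == 0.0:
        return 0.0
    return Phi((mu - f_best - lam) / sigma)

def ucb(mu, sigma, lam=2.0):
    """UCB acquisition: posterior mean plus lam posterior standard deviations."""
    return mu + lam * sigma

# Illustrative candidates as (posterior mean, posterior std) pairs
candidates = [(0.8, 0.05), (0.6, 0.40), (0.3, 0.90)]
f_best = 0.75  # best objective value observed so far (assumed)
best = max(candidates, key=lambda ms: ei(ms[0], ms[1], f_best))
```

With these numbers, EI prefers the high-variance candidate (0.3, 0.90) over the candidate whose mean barely exceeds the incumbent (0.8, 0.05), which is exactly the exploration behavior the σ(x) term is designed to produce.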


③ Gaussian process: Assuming normal distribution during Bayesian optimization


$$\bar{\mathbf{f}}_* = K(X_*, X)\,K(X, X)^{-1}\,\mathbf{y}$$

$$\mathrm{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,K(X, X)^{-1}\,K(X, X_*)$$


library(plotly)

# Squared-exponential (RBF) kernel with length scale l (an assumed choice)
varCov <- function(a, b, l = 1) {
  outer(a, b, function(u, v) exp(-(u - v)^2 / (2 * l^2)))
}

x <- c(-3, -1, 0, 2, 3)             # observed inputs
f <- data.frame(x = x, y = cos(x))  # observations of f(x) = cos(x)
x_star <- seq(-5, 5, by = 0.1)      # test inputs for prediction

k_xx   <- varCov(x, x)
k_xxs  <- varCov(x, x_star)
k_xsx  <- varCov(x_star, x)
k_xsxs <- varCov(x_star, x_star)

f_star_bar <- k_xsx %*% solve(k_xx) %*% f$y              # Posterior mean
cov_f_star <- k_xsxs - k_xsx %*% solve(k_xx) %*% k_xxs   # Posterior covariance

plot_ly(type='scatter', mode="lines") %>%
  add_trace(x=x_star, y=f_star_bar, name="Function Estimate") %>%
  add_trace(x=x_star, y=f_star_bar-2*sqrt(diag(cov_f_star)), 
              name="Lower Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_trace(x=x_star, y=f_star_bar+2*sqrt(diag(cov_f_star)), 
              name="Upper Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_markers(x=f$x, y=f$y, name="Obs. Data", marker=list(size=20)) %>%
  add_trace(x=x_star, y=cos(x_star),line=list(width=5), name="True f()=cos(x)") %>%
  layout(title="Gaussian Process Simulation (5 observations)", legend = list(orientation='h'))


④ Hyperparameter Optimization

○ Hyperparameters are found by maximizing the marginal likelihood, i.e., the probability that the data X and y occur, via maximum likelihood estimation (MLE)

○ Marginal likelihood and Gaussian Process hyperparameters are defined as follows


$$\log p(\mathbf{y} \mid X) = -\frac{1}{2}\mathbf{y}^{\top} K^{-1} \mathbf{y} - \frac{1}{2}\log\left|K\right| - \frac{n}{2}\log 2\pi$$

where $K$ is the kernel matrix, whose entries depend on the Gaussian Process hyperparameters (e.g., length scale, signal variance, noise variance).


○ Advantages: Mitigates both underfitting and overfitting

Type 2. Stochastic Multi-armed Bandit

Feature 1. Online Decision: Select an arm $i_t$ among N arms at each time t = 1, ···, T

Feature 2. Stochastic Feedback: The rewards provided by each arm follow a fixed but unknown distribution

Feature 3. Bandit Feedback: At each time step, only the reward for the chosen arm is visible

④ Objective: Maximize cumulative expected rewards over T steps


$$\max\; \mathbb{E}\left[\sum_{t=1}^{T} r_{i_t,\,t}\right]$$


⑤ Maximizing reward is equivalent to minimizing the per-step regret $\mu^* - \mu_{i_t}$. The regret over the entire $T$ steps is defined as follows.


$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{i_t}\right]$$


⑥ Bandit: The name comes from the slot machine, a “one-armed bandit” that takes your money like a robber

Example 1. Belief-state update equation when the $n$-th arm among N arms is chosen




Example 2. Thompson Sampling



2. ϵ-greedy Exploration

⑴ Procedure

Step 1. At the beginning, play all arms.

Step 2. With probability (1 - ϵ), choose the arm with the highest sample mean, and with probability ϵ, choose an arm at random.

Step 3. Update the sample mean for each arm and return to Step 2.
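The three steps above can be sketched for Bernoulli arms as follows; the arm means, horizon, and ϵ in the usage note are illustrative assumptions.

```python
import random

def epsilon_greedy(true_means, T=10000, eps=0.1, seed=0):
    """epsilon-greedy on Bernoulli arms whose true means are hidden from the agent."""
    rng = random.Random(seed)
    N = len(true_means)
    n = [0] * N        # pull counts per arm
    mean = [0.0] * N   # sample means per arm
    total = 0.0
    for t in range(T):
        if t < N:                         # Step 1: play every arm once
            i = t
        elif rng.random() < eps:          # explore with probability eps
            i = rng.randrange(N)
        else:                             # exploit: arm with the highest sample mean
            i = max(range(N), key=lambda a: mean[a])
        r = 1.0 if rng.random() < true_means[i] else 0.0
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]   # Step 3: incremental sample-mean update
        total += r
    return total, n
```

For example, `epsilon_greedy([0.2, 0.5, 0.8], T=5000)` keeps exploring an ϵ fraction of the steps forever, which is why the total regret grows linearly in the horizon.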

⑵ Characteristic

① The regret over the entire $T$ steps is $R_T = O(T)$, because an expected error of order ϵ accumulates at every step.



3. UCB

⑴ Overview

① Along with Thompson Sampling, one of the most widely used bandit algorithms

② UCB balances exploitation and exploration.

③ It applies the principle of optimism in the face of uncertainty.

⑵ Procedure

① Empirical mean for arm i during time t


$$\hat{\mu}_{i,t} = \frac{\sum_{s=1}^{t} r_s\,\mathbb{I}\{i_s = i\}}{n_{i,t}}, \qquad n_{i,t} = \sum_{s=1}^{t} \mathbb{I}\{i_s = i\}$$


○ Numerator: Total reward obtained from arm $i$

○ 𝕀{$i_s = i$}: Outputs 1 only when $i_s = i$, otherwise 0

○ $n_{i,t}$: Number of times i was chosen until time $t$

○ The true mean of arm $i$ is denoted $\mu_i$.

② Bonus term


$$B_{i,t} = \sqrt{\frac{2\log t}{n_{i,t}}}$$


○ It enables exploration.

○ It is linked to convergence as follows:


$$P\left(\left|\hat{\mu}_{i,t} - \mu_i\right| \le \sqrt{\frac{2\log t}{n_{i,t}}}\right) \ge 1 - \frac{2}{t^4}$$


③ Upper Confidence Bound (UCB)


$$\mathrm{UCB}_{i,t} = \hat{\mu}_{i,t} + \sqrt{\frac{2\log t}{n_{i,t}}}$$


Optimism: With high probability (over 99.9%), $\mathrm{UCB}_{i,t}$ is at least as large as the true expected value $\mu_i$

Step 1. Play each arm once at the beginning.

Step 2. At each step $t$, choose the arm $i_t$ with the maximum $\text{UCB}_{i,t}$.

Step 3. Update empirical mean and UCB at every time step: uncertainty represented by UCB decreases with each observation
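Steps 1–3 can be sketched for Bernoulli arms; the arm means and horizon are illustrative, and the bonus follows the $\sqrt{2\log t / n_{i,t}}$ form above.

```python
import math
import random

def ucb1(true_means, T=10000, seed=0):
    """UCB1 on Bernoulli arms: empirical mean plus sqrt(2 log t / n) bonus."""
    rng = random.Random(seed)
    N = len(true_means)
    n = [0] * N        # number of pulls per arm
    mean = [0.0] * N   # empirical means per arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= N:                        # Step 1: play each arm once
            i = t - 1
        else:                             # Step 2: choose the arm maximizing the UCB
            i = max(range(N),
                    key=lambda a: mean[a] + math.sqrt(2.0 * math.log(t) / n[a]))
        r = 1.0 if rng.random() < true_means[i] else 0.0
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]   # Step 3: update the empirical mean
        total += r
    return total, n
```

Because the bonus shrinks as $n_{i,t}$ grows, suboptimal arms are pulled only $O(\log T)$ times each, matching the regret bound proved below.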

⑶ Proof of Convergence

① Theorem

For all $N > 1$, if policy UCB is run on $N$ arms having arbitrary reward distributions $P_1$, …, $P_N$ with support in [0, 1], then its expected regret after any number $T$ of plays


$$\mu^* T - \sum_{i=1}^{N} \mu_i\, \mathbb{E}\left[n_{i,T}\right]$$


is at most


$$\left[\, 8 \sum_{i:\, \mu_i < \mu^*} \frac{\ln T}{\Delta_i} \,\right] + \left(1 + \frac{\pi^2}{3}\right) \left(\sum_{j=1}^{N} \Delta_j\right), \qquad \Delta_i = \mu^* - \mu_i
$$


where $\mu_1$, …, $\mu_N$ are the means of the distributions $P_1$, …, $P_N$.

② Proof

Let $c_{t,s} := \sqrt{2\log(t)/s}$. Also define $\bar{X}_n^i = \frac{1}{n}\sum_{t=1}^{n} X_t^i$, where $\{X_t^i\}_{t \in \mathbb{N}}$ is the random process of rewards for arm $i$ when it is played successively. For any arm $i$, we upper bound the number of plays of arm $i$ by a quantity $\mathrm{UCB}_i(T)$ on any sequence of plays. Let $I_t$ denote the arm played at time $t$; then, with $\ell$ being an arbitrary positive integer,


$$n_{i,T} = 1 + \sum_{t=N+1}^{T} \mathbb{I}\{I_t = i\} \le \ell + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=\ell}^{t-1} \mathbb{I}\left\{\bar{X}_s^* + c_{t,s} \le \bar{X}_{s_i}^i + c_{t,s_i}\right\} =: \mathrm{UCB}_i(T)$$


Now we observe that $\bar{X}_s^* + c_{t,s} \le \bar{X}_{s_i}^i + c_{t,s_i}$ implies that at least one of the following must hold: $\bar{X}_s^* \le \mu^* - c_{t,s}$, $\bar{X}_{s_i}^i \ge \mu_i + c_{t,s_i}$, $\mu^* < \mu_i + 2c_{t,s_i}$. Otherwise, there is a contradiction: $\bar{X}_s^* + c_{t,s} > \mu^* \ge \mu_i + 2c_{t,s_i} > \bar{X}_{s_i}^i + c_{t,s_i}$. We can bound the probability of the first two events by using a version of the Azuma–Hoeffding inequality, since both $s\bar{X}_s^* - s\mu^*$ and $s_i\bar{X}_{s_i}^i - s_i\mu_i$ are martingales null at 0 with increments in $[-\mu^*, 1-\mu^*]$ and $[-\mu_i, 1-\mu_i]$, respectively. This then yields the following


$$P\left(\bar{X}_s^* \le \mu^* - c_{t,s}\right) \le e^{-4\log t} = t^{-4}, \qquad P\left(\bar{X}_{s_i}^i \ge \mu_i + c_{t,s_i}\right) \le e^{-4\log t} = t^{-4}$$


Also note that we have


$$\mu^* - \mu_i - 2c_{t,s_i} = \mu^* - \mu_i - 2\sqrt{\frac{2\log t}{s_i}} \ge \mu^* - \mu_i - \Delta_i = 0$$


for $s_i \ge 8\log(T)/\Delta_i^2$ (since $T \ge t$). Thus, if we take $\ell = \lceil 8\log(T)/\Delta_i^2 \rceil$, then we cannot have $\mu^* < \mu_i + 2c_{t,s_i}$, and only the first two inequalities can hold. Thus, we get


$$\mathbb{E}\left[\mathrm{UCB}_i(T)\right] \le \left\lceil \frac{8\log T}{\Delta_i^2} \right\rceil + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=\ell}^{t-1} 2t^{-4} \le \frac{8\log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$$


Therefore, from $\mathbb{E}[n_{i,T}] \le \mathbb{E}[\mathrm{UCB}_i(T)] \le 8\log(T)/\Delta_i^2 + C$ with $C = 1 + \pi^2/3$, we get $\mathrm{Regret}(T) = \sum_i \Delta_i\, \mathbb{E}[n_{i,T}] \le \sum_i \left(8\log(T)/\Delta_i + C\Delta_i\right)$.

③ Conclusion

○ The per-step regret converges to 0, meaning that in the limit almost every choice is the best arm

○ In contrast, using only the empirical mean (without the bonus term) does not guarantee optimal arm selection

○ It may result in never attempting the optimal arm

Comparison: UCB vs ϵ-greedy Exploration

Difference 1. The total regret of UCB is $O(\log T)$, whereas that of $\epsilon$-greedy exploration is $O(T)$.

Difference 2. The average reward of UCB is higher than that of $\epsilon$-greedy exploration.



Figure 2. UCB vs ϵ-greedy exploration





4. Thompson Sampling

⑴ Overview

① Oldest bandit algorithm: Introduced by Thompson in 1933

② Natural and efficient heuristic algorithm

⑵ Process

① Maintain belief for parameters of each arm (e.g., mean reward)

○ Use prior distributions such as Beta distribution or Gaussian distribution

② Sample an estimated reward from each arm’s belief distribution

○ This is a random draw, not the expected value

○ Arms with fewer observations have a higher-variance belief, so optimistic samples are drawn more easily, which encourages exploration

③ Choose the arm with the maximum sampled reward and observe its realized reward

○ A posterior distribution must be maintained for every arm at all times

④ Update prior belief with the posterior using Bayesian approach

⑤ Works well even when the reward distribution has two modes

⑶ Example: Thompson Sampling using Beta Distribution

① Start with Beta(1, 1) as the prior belief for all arms

② For each time t,

○ Prior: Beta(α, β)

○ Independently sample θi,t for each arm

○ Choose it = argmaxi θi,t

○ Beta(α+1, β): When 1 is observed

○ Beta(α, β+1): When 0 is observed
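The Beta–Bernoulli procedure above can be sketched as follows; the arm means and horizon are illustrative assumptions.

```python
import random

def thompson_beta(true_means, T=10000, seed=0):
    """Thompson Sampling with a Beta(1, 1) prior on each Bernoulli arm."""
    rng = random.Random(seed)
    N = len(true_means)
    ab = [[1, 1] for _ in range(N)]   # (alpha, beta) per arm, starting at Beta(1, 1)
    n = [0] * N                       # pull counts per arm
    total = 0.0
    for _ in range(T):
        theta = [rng.betavariate(a, b) for a, b in ab]   # one sample per arm
        i = max(range(N), key=lambda a: theta[a])        # play the largest sample
        r = 1 if rng.random() < true_means[i] else 0
        ab[i][0] += r        # Beta(alpha + 1, beta) when 1 is observed
        ab[i][1] += 1 - r    # Beta(alpha, beta + 1) when 0 is observed
        n[i] += 1
        total += r
    return total, n
```

As each arm accumulates observations, its Beta posterior concentrates, so samples from clearly inferior arms rarely win the argmax and exploration tapers off on its own.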

⑷ Example: Thompson Sampling using Gaussian Distribution

① Start with N(0, ν²) as the prior belief for all arms

② For each time t,

○ Prior: N(0, ν²)

○ Independently sample θi,t for each arm

○ Choose it = argmaxi θi,t

○ Update the empirical mean µˆ (posterior) for the chosen arm, where n is the number of independent observations of that arm


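A minimal sketch of the Gaussian variant, assuming unit-variance rewards and the standard Normal–Normal conjugate update (posterior N(µˆ, 1/(n + 1/ν²))); the arm means in the test values are illustrative.

```python
import random

def thompson_gaussian(true_means, T=10000, nu=1.0, seed=0):
    """Thompson Sampling with an N(0, nu^2) prior per arm, unit-variance rewards assumed."""
    rng = random.Random(seed)
    N = len(true_means)
    n = [0] * N          # observation counts per arm
    mu_hat = [0.0] * N   # posterior means per arm
    total = 0.0
    c = 1.0 / nu ** 2    # prior precision
    for _ in range(T):
        # Each arm's posterior is N(mu_hat, 1 / (n + c)) under the conjugate update
        theta = [rng.gauss(mu_hat[a], (1.0 / (n[a] + c)) ** 0.5) for a in range(N)]
        i = max(range(N), key=lambda a: theta[a])
        r = rng.gauss(true_means[i], 1.0)            # unit-variance Gaussian reward
        n[i] += 1
        mu_hat[i] += (r - mu_hat[i]) / (n[i] + c)    # incremental posterior-mean update
        total += r
    return total, n
```

The posterior variance 1/(n + c) shrinks with each observation of an arm, so the sampled θ values concentrate around µˆ and the chosen arm stabilizes over time.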


⑸ Convergence

① After a sufficient number of steps, the posteriors of the arms become well separated, so the best arm is eventually identified




② Beta Distribution




③ Gaussian Distribution: Based on Azuma-Hoeffding inequality




⑹ Comparison of UCB and Thompson Sampling

Commonality 1. Total regret is O(log T): Per time regret converges to 0 as O(log T / T)

Difference 1. Thompson Sampling often outperforms UCB: UCB is a slightly more conservative algorithm, converging slower

Difference 2. UCB is deterministic while Thompson Sampling is probabilistic (randomized)



Input: 2021.12.10 00:52

Modified: 2026.03.13 16:09

Edited: 2024.11.21 15:30
