Chapter 11-1. Multi-armed bandits
1. Overview
2. ϵ-greedy Exploration
3. UCB
4. Thompson Sampling
1. Overview
⑴ Overview
① Problem Definition: Selecting the optimal arm with the highest payoff
Figure 1. MAB
② Particularly necessary in the following situations
○ When it is difficult to calculate the objective function (black-box algorithm)
○ When there is no explicit analytic form
○ When singularities exist in the function
○ When the function is not differentiable
○ When the data is highly noisy
⑵ Trade-off of two strategies
① Actions are classified as exploitation and exploration.
② Exploitation: Maximizing reward using the empirical mean derived from current data.
③ Exploration: Gathering more samples so that the empirical mean approaches the true mean.
④ If exploration is neglected when the initial data deviate from the true distribution, incorrect judgments can never be corrected.
⑶ Type 1. Bayesian Optimization
① Surrogate Model
○ A prior distribution (the surrogate model) is placed over the reward distribution of each arm
○ A Gaussian Process is often used as the surrogate model
○ The posterior distribution is updated every time a reward is observed
② Acquisition Function: Determines the next point to sample so as to maximize the expected gain in reward
○ All acquisition functions are computed from the posterior mean μ(x) and variance σ²(x), and design the exploration strategy through these two quantities
○ Type 1. EI (expected improvement): The second formula for EI(x) holds only when using a Gaussian Process
○ Note: Φ and 𝜙 are the CDF and PDF of the standard normal distribution, and Z is the standardized value of x
○ Higher λ leads to more exploration
○ Type 2. MPI (maximum probability of improvement)
○ Type 3. UCB
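The EI acquisition function described above can be sketched in Python. This is a minimal illustration, assuming a Gaussian posterior at each point; the function name `expected_improvement` and the exploration parameter `xi` (playing the role of λ above) are illustrative, not from the original.

```python
import math

def norm_pdf(z):
    # PDF of the standard normal distribution
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    # CDF of the standard normal distribution
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI at a point whose posterior is N(mu, sigma^2).

    f_best is the best value observed so far; larger xi shifts the
    balance toward exploration.
    """
    if sigma == 0.0:
        return 0.0  # no uncertainty -> no expected improvement
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

EI is large where the posterior mean already exceeds the incumbent best, or where the posterior variance is high, which is exactly the exploitation/exploration balance discussed above.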
③ Gaussian Process: Assumes a multivariate normal distribution over function values during Bayesian optimization
library(plotly)

# Covariance function (assumed squared-exponential; the original omits its definition)
varCov <- function(a, b) outer(a, b, function(p, q) exp(-0.5 * (p - q)^2))

x <- c(-3, -1, 0, 2, 3)                 # observed inputs
f <- data.frame(x = x, y = cos(x))      # noise-free observations of cos(x)
x_star <- seq(-5, 5, length.out = 200)  # prediction grid (assumed; not defined in the original)

k_xx   <- varCov(x, x)
k_xxs  <- varCov(x, x_star)
k_xsx  <- varCov(x_star, x)
k_xsxs <- varCov(x_star, x_star)

f_star_bar <- k_xsx %*% solve(k_xx) %*% f$y              # posterior mean
cov_f_star <- k_xsxs - k_xsx %*% solve(k_xx) %*% k_xxs   # posterior covariance

plot_ly(type = 'scatter', mode = "lines") %>%
  add_trace(x = x_star, y = f_star_bar, name = "Function Estimate") %>%
  add_trace(x = x_star, y = f_star_bar - 2 * sqrt(diag(cov_f_star)),
            name = "Lower Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_trace(x = x_star, y = f_star_bar + 2 * sqrt(diag(cov_f_star)),
            name = "Upper Confidence Band (Function Estimate)",
            line = list(width = 1, dash = 'dash')) %>%
  add_markers(x = f$x, y = f$y, name = "Obs. Data", marker = list(size = 20)) %>%
  add_trace(x = x_star, y = cos(x_star), line = list(width = 5), name = "True f() = cos(x)") %>%
  layout(title = "Gaussian Process Simulation (5 observations)",
         legend = list(orientation = 'h'))
④ Hyperparameter Optimization
○ Search for the hyperparameters that maximize the marginal likelihood, i.e., the probability of observing the data X and y, via Maximum Likelihood Estimation (MLE)
○ Marginal likelihood and Gaussian Process hyperparameters are defined as follows
○ Advantage: Mitigates both underfitting and overfitting
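The marginal-likelihood objective can be made concrete with a short Python sketch. This assumes an RBF kernel with a length-scale hyperparameter, a small noise term, and a simple grid search; in practice gradient-based optimization is used, and the names here are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential covariance matrix between point sets a and b
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def log_marginal_likelihood(x, y, length_scale, noise_var=1e-6):
    # log p(y | x) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2*pi)
    n = len(x)
    K = rbf_kernel(x, x, length_scale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))   # equals -1/2 log|K|
            - 0.5 * n * np.log(2 * np.pi))

# MLE over a small grid of candidate length-scales
x = np.array([-3.0, -1.0, 0.0, 2.0, 3.0])
y = np.cos(x)
grid = [0.1, 0.5, 1.0, 2.0, 5.0]
best = max(grid, key=lambda ls: log_marginal_likelihood(x, y, ls))
```

A length-scale that is too small overfits (interpolates noise) and one that is too large underfits; the marginal likelihood penalizes both, which is the advantage noted above.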
⑷ Type 2. Stochastic Multi-armed Bandit
① Feature 1. Online Decision: At each time $t = 1, \cdots, T$, select an arm $i_t$ among $N$ arms
② Feature 2. Stochastic Feedback: The rewards provided by each arm follow a fixed but unknown distribution
③ Feature 3. Bandit Feedback: At each time step, only the reward for the chosen arm is visible
④ Objective: Maximize cumulative expected rewards over T steps
⑤ Maximizing reward is equivalent to minimizing the per-step regret $\mu^* - \mu_{i_t}$. The cumulative regret over all $T$ steps is defined as $R_T = \sum_{t=1}^{T} (\mu^* - \mu_{i_t})$.
⑥ Bandit: The name comes from the slot machine, the "one-armed bandit," which takes the player's money like a robber
⑦ Example 1. Belief-state update equation when the $n$-th arm is chosen among $N$ arms
⑧ Example 2. Thompson Sampling
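The three features above (online decisions, stochastic feedback, bandit feedback) can be captured in a tiny simulator. The class and method names below are hypothetical, used only to make the setting concrete; rewards are Bernoulli for simplicity.

```python
import random

class BernoulliBandit:
    """Stochastic bandit: each arm pays 1 with a fixed but unknown probability."""

    def __init__(self, means, seed=0):
        self.means = means            # fixed, unknown to the learner
        self.rng = random.Random(seed)
        self.best_mean = max(means)   # mu* for regret accounting
        self.regret = 0.0             # cumulative expected regret

    def pull(self, i):
        # Bandit feedback: only the reward of the chosen arm is revealed.
        self.regret += self.best_mean - self.means[i]
        return 1.0 if self.rng.random() < self.means[i] else 0.0
```

For example, with `BernoulliBandit([0.2, 0.5, 0.8])`, repeatedly pulling arm 2 accrues zero regret, while every pull of arm 0 adds 0.6 to the cumulative regret.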
2. ϵ-greedy Exploration
⑴ Procedure
① Step 1. At the beginning, play all arms.
② Step 2. With probability (1 - ϵ), choose the arm with the highest sample mean, and with probability ϵ, choose an arm at random.
③ Step 3. Update the sample mean for each arm and return to Step 2.
⑵ Characteristic
① Regret $R_T$ = Ο($T$) over the entire $T$ steps, because random exploration contributes an expected regret proportional to ϵ at every step.
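The three steps above can be sketched in Python. The Bernoulli reward simulation and the function name are assumptions for illustration; the procedure itself follows the steps as written.

```python
import random

def epsilon_greedy(means, T=10000, eps=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms with the given true means."""
    rng = random.Random(seed)
    N = len(means)
    n = [0] * N            # pull counts per arm
    mean_hat = [0.0] * N   # sample means per arm
    for t in range(T):
        if 0 in n:
            i = n.index(0)                 # Step 1: play every arm once
        elif rng.random() < eps:
            i = rng.randrange(N)           # explore with probability eps
        else:
            i = max(range(N), key=lambda j: mean_hat[j])  # exploit
        r = 1.0 if rng.random() < means[i] else 0.0       # simulated reward
        n[i] += 1
        mean_hat[i] += (r - mean_hat[i]) / n[i]  # Step 3: incremental mean update
    return n, mean_hat
```

Because a fixed fraction ϵ of pulls remains random forever, the suboptimal arms keep being sampled, which is exactly why the total regret grows linearly in $T$.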
3. UCB
⑴ Overview
① Along with Thompson Sampling, one of the most widely used bandit algorithms
② UCB balances exploitation and exploration.
③ Provides optimism about the uncertainty.
⑵ Procedure
① Empirical mean for arm $i$ up to time $t$: $\hat{\mu}_{i,t} = \frac{1}{n_{i,t}} \sum_{s=1}^{t} r_s \, \mathbb{I}\{i_s = i\}$
○ Numerator: Total reward obtained from arm $i$
○ 𝕀{$i_s = i$}: Outputs 1 only when $i_s = i$, otherwise 0
○ $n_{i,t}$: Number of times i was chosen until time $t$
○ The true mean of arm $i$ is denoted $\mu_i$.
② Bonus term: $\sqrt{2 \log t \, / \, n_{i,t}}$
○ It enables exploration.
○ It is linked to convergence as follows:
③ Upper Confidence Bound (UCB): $\text{UCB}_{i,t} = \hat{\mu}_{i,t} + \sqrt{2 \log t \, / \, n_{i,t}}$
○ Optimism: With high probability (over 99.9%), this bound is at least the true expected value
④ Step 1. Play each arm once at the beginning.
⑤ Step 2. At each step $t$, choose the arm $i_t = \arg\max_i \text{UCB}_{i,t}$.
⑥ Step 3. Update empirical mean and UCB at every time step: uncertainty represented by UCB decreases with each observation
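The three steps above can be sketched in Python. The Bernoulli reward simulation and the function name are illustrative assumptions; the selection rule is the UCB index $\hat{\mu}_{i,t} + \sqrt{2 \log t / n_{i,t}}$ as defined above.

```python
import math
import random

def ucb1(means, T=10000, seed=0):
    """UCB1 on Bernoulli arms: empirical mean plus sqrt(2 log t / n_i) bonus."""
    rng = random.Random(seed)
    N = len(means)
    n = [0] * N            # pull counts
    mean_hat = [0.0] * N   # empirical means
    for t in range(1, T + 1):
        if t <= N:
            i = t - 1  # Step 1: play each arm once at the beginning
        else:
            # Step 2: optimism in the face of uncertainty --
            # pick the arm with the largest upper confidence bound.
            i = max(range(N),
                    key=lambda j: mean_hat[j] + math.sqrt(2 * math.log(t) / n[j]))
        r = 1.0 if rng.random() < means[i] else 0.0  # simulated reward
        n[i] += 1
        mean_hat[i] += (r - mean_hat[i]) / n[i]      # Step 3: update empirical mean
    return n, mean_hat
```

As an arm is pulled more, $n_{i,t}$ grows and its bonus shrinks, so the uncertainty represented by the UCB decreases with each observation, as Step 3 states.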
⑶ Proof of Convergence
① Theorem
For all $N > 1$, if policy UCB is run on $N$ arms having arbitrary reward distributions $P_1, \dots, P_N$ with support in $[0, 1]$, then its expected regret after any number $T$ of plays is at most
$$\text{Regret}(T) \le \sum_{i: \mu_i < \mu^*} \frac{8 \log T}{\Delta_i} + C \sum_{i=1}^{N} \Delta_i,$$
where $\mu_1, \dots, \mu_N$ are the means of the distributions $P_1, \dots, P_N$, $\Delta_i = \mu^* - \mu_i$, and $C$ is a universal constant.
② Proof
Let $c_{t,s} := \sqrt{2\log(t)/s}$. Also define $\bar{X}_n^i = \frac{1}{n}\sum_{t=1}^{n} X_t^i$, where $\{X_t^i\}_{t \in \mathbb{N}}$ is the random process of rewards for arm $i$ if it is played successively. For any arm $i$, we upper bound $\text{UCB}_i(T)$ on any sequence of plays. Let $I_t$ denote the arm played at time $t$; then, with $\ell$ being an arbitrary positive integer, we have
Now we observe that $\bar{X}_s^* + c_{t,s} \le \bar{X}_{s_i}^i + c_{t,s_i}$ implies that at least one of the following must hold: $\bar{X}_s^* \le \mu^* - c_{t,s}$, $\bar{X}_{s_i}^i \ge \mu_i + c_{t,s_i}$, or $\mu^* < \mu_i + 2c_{t,s_i}$. Otherwise, there is a contradiction: $\bar{X}_s^* + c_{t,s} > \mu^* \ge \mu_i + 2c_{t,s_i} > \bar{X}_{s_i}^i + c_{t,s_i}$. We can bound the probability of the first two events using a version of the Azuma-Hoeffding inequality, since both $s\bar{X}_s^* - s\mu^*$ and $s_i\bar{X}_{s_i}^i - s_i\mu_i$ are martingales null at 0 with increments in $[-\mu^*, 1-\mu^*]$ and $[-\mu_i, 1-\mu_i]$, respectively. This then yields the following.
Also note that we have
for $s_i \ge 8\log(T)/\Delta_i^2$ (since $T \ge t$). Thus, if we take $\ell = \lceil 8\log(T)/\Delta_i^2 \rceil$, then we cannot have $\mu^* < \mu_i + 2c_{t,s_i}$, and only the first two inequalities can hold. Thus, we get
Therefore, from $\mathbb{E}[n_{i,T}] \le \mathbb{E}[\text{UCB}_i(T)] \le 8\log(T)/\Delta_i^2 + C$, we get $\text{Regret}(T) \le \sum_i 8\log(T)/\Delta_i + C\sum_i \Delta_i$.
③ Conclusion
○ Per-step regret converges to 0, meaning the algorithm's choices become asymptotically optimal
○ In contrast, using only the empirical mean (pure greedy) does not guarantee selection of the optimal arm
○ and may result in never attempting the optimal arm at all
⑷ Comparison: UCB vs ϵ-greedy Exploration
① Difference 1. The total regret of UCB is $O(\log T)$, whereas that of $\epsilon$-greedy exploration is $O(T)$.
② Difference 2. The average reward of UCB is higher than that of $\epsilon$-greedy exploration.
Figure 2. UCB vs ϵ-greedy exploration
⑸ Comparison: UCB vs Thompson Sampling
① Commonality 1. Total regret is O(log T): Per time regret converges to 0 as O(log T / T)
② Difference 1. Thompson Sampling often outperforms UCB: UCB is a slightly more conservative algorithm, converging slower
③ Difference 2. UCB is deterministic while Thompson Sampling is probabilistic (randomized)
4. Thompson Sampling
⑴ Overview
① Oldest bandit algorithm: Introduced by Thompson in 1933
② Natural and efficient heuristic algorithm
⑵ Process
① Maintain belief for parameters of each arm (e.g., mean reward)
○ Use prior distributions such as Beta distribution or Gaussian distribution
② Sample an estimated reward for each arm from its current belief distribution
○ A single sample is drawn, not the expected value
○ Arms with fewer observations have higher variance, so high values are sampled more easily, which drives exploration
③ Choose the arm with the maximum sampled reward and observe its reward
○ A posterior distribution must be maintained for every arm at all times
④ Update prior belief with the posterior using Bayesian approach
⑤ Works well even for bimodal (multi-modal) reward distributions
⑶ Example: Thompson Sampling using Beta Distribution
① Start with Beta(1, 1) as the prior belief for all arms
② For each time t,
○ Prior: Beta(α, β)
○ Independently sample θi,t for each arm
○ Choose it = argmaxi θi,t
○ Update the chosen arm's posterior to Beta(α+1, β) when reward 1 is observed
○ or to Beta(α, β+1) when reward 0 is observed
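The Beta-Bernoulli procedure above can be sketched in Python. The simulated Bernoulli rewards and the function name are illustrative; the conjugate update is exactly the Beta(α+1, β) / Beta(α, β+1) rule described above.

```python
import random

def thompson_beta(means, T=10000, seed=0):
    """Thompson sampling with a Beta(1, 1) prior on each Bernoulli arm."""
    rng = random.Random(seed)
    N = len(means)
    ab = [[1, 1] for _ in range(N)]  # [alpha, beta] per arm, starting at Beta(1, 1)
    pulls = [0] * N
    for _ in range(T):
        # Independently sample a plausible mean from each arm's posterior.
        theta = [rng.betavariate(a, b) for a, b in ab]
        i = max(range(N), key=lambda j: theta[j])   # choose argmax of the samples
        r = 1 if rng.random() < means[i] else 0     # observe a Bernoulli reward
        # Conjugate update: success -> alpha + 1, failure -> beta + 1.
        ab[i][0] += r
        ab[i][1] += 1 - r
        pulls[i] += 1
    return pulls, ab
```

Arms with few observations keep wide posteriors and are still sampled occasionally, while the posterior of the best arm concentrates, so pulls shift toward it over time.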
⑷ Example: Thompson Sampling using Gaussian Distribution
① Start with N(0, ν2) as the prior belief for all arms
② For each time t,
○ Prior: N(0, ν2)
○ Independently sample θi,t for each arm
○ Choose it = argmaxi θi,t
○ Update the posterior mean $\hat{\mu}$ for the chosen arm, where $n$ is the number of independent observations of that arm
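The Gaussian variant above can be sketched in Python, assuming unit-variance Gaussian rewards (the original does not state the reward noise explicitly) and the standard conjugate update for a Gaussian mean; the function name is illustrative.

```python
import math
import random

def thompson_gaussian(means, T=10000, nu=1.0, seed=0):
    """Thompson sampling with N(0, nu^2) priors and unit-variance Gaussian rewards."""
    rng = random.Random(seed)
    N = len(means)
    n = [0] * N      # observation counts per arm
    s = [0.0] * N    # running reward sums per arm
    pulls = [0] * N
    for _ in range(T):
        theta = []
        for j in range(N):
            # Conjugate posterior for a Gaussian mean with known unit noise:
            # precision = 1/nu^2 + n, mean = sum(rewards) / precision.
            prec = 1.0 / nu ** 2 + n[j]
            theta.append(rng.gauss(s[j] / prec, math.sqrt(1.0 / prec)))
        i = max(range(N), key=lambda j: theta[j])  # choose argmax of the samples
        r = rng.gauss(means[i], 1.0)               # noisy reward from the chosen arm
        n[i] += 1
        s[i] += r
        pulls[i] += 1
    return pulls
```

As $n$ grows for an arm, its posterior variance $1/(1/\nu^2 + n)$ shrinks, so the samples concentrate around the empirical mean, mirroring the update rule stated above.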
⑸ Convergence
① After enough steps, the posteriors of the arms become well separated, so the best arm is eventually identified and chosen consistently
② Beta Distribution
③ Gaussian Distribution: Based on Azuma-Hoeffding inequality
Input: 2021.12.10 00:52
Modified: 2026.03.13 16:09
Edited: 2024.11.21 15:30