
Chapter 9-1. Robust MDP

Recommended reading: 【Control Theory】 Stochastic Control Theory


1. General Overview

2. Finite-horizon robust MDP

3. Infinite-horizon robust MDP

4. Examples


a. Garud Iyengar (2005)

b. Arnab Nilim and Laurent El Ghaoui (2003), (2004)



1. General Overview

⑴ Introduction

① Goal: choose a robust policy under model ambiguity (i.e., the transition probability P is ambiguous)

○ Uncertainty: the probability distribution is known, but outcomes are still random according to that distribution.

○ Ambiguity: The probability distribution itself is unknown. In other words, it’s unclear which distribution is correct.

② Problem Definition

○ Standard MDP


스크린샷 2025-10-29 오전 9 03 40


○ Robust MDP: the decision maker maximizes the worst-case value; for each policy π, the inner problem is to find the worst-case transition probability P within an uncertainty set (written out below).


스크린샷 2025-10-29 오전 9 03 59
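For reference, a hedged transcription of the two objectives in standard notation (the screenshots above may use slightly different symbols; the finite-horizon form is shown):

```latex
% Standard MDP: a single known transition law P
\[
  \max_{\pi}\; \mathbb{E}^{\pi}_{P}\Big[\sum_{t=0}^{N-1} r_t(s_t, a_t, s_{t+1}) + r_N(s_N)\Big]
\]
% Robust MDP: nature picks the worst-case law from the uncertainty set \mathcal{T}^{\pi}
\[
  \max_{\pi}\; \inf_{P \in \mathcal{T}^{\pi}}\; \mathbb{E}^{\pi}_{P}\Big[\sum_{t=0}^{N-1} r_t(s_t, a_t, s_{t+1}) + r_N(s_N)\Big]
\]
```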


Problem 1. The DP optimal policy is very sensitive to small changes in transition probabilities.

○ Rain probability 0.3 → optimal to carry an umbrella.

○ Rain probability 0.2 → optimal not to carry an umbrella.

○ A difference of only 0.1 in the rain probability, yet the optimal policy changes (a toy numerical illustration follows below).
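A tiny numerical illustration of this sensitivity (the costs below are hypothetical, chosen only so that the decision threshold falls between 0.2 and 0.3):

```python
# Hypothetical one-step umbrella problem: carrying costs 1 regardless of weather,
# getting caught in the rain without one costs 4 (both numbers are made up).
def best_action(p_rain):
    cost_carry = 1.0
    cost_skip = 4.0 * p_rain   # expected cost of not carrying an umbrella
    return "carry" if cost_carry < cost_skip else "skip"

print(best_action(0.3))  # carry  (1.0 < 1.2)
print(best_action(0.2))  # skip   (1.0 > 0.8)
```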

Problem 2. Each (s, a) adds an inner minimization inf_{p∈𝒫(s,a)} 𝔼_p[·], making it computationally heavier than a standard MDP.

Problem 3. Linear programming becomes inapplicable: depending on the choice of 𝒫, the problem can become non-convex.

○ Fixed transition matrix P: intersection of half-spaces → convex.

○ inf_P with a ≥ constraint: the epigraph of a concave roof → generally non-convex.

Summary: here the “non-convexity issue” refers to the fact that, when the overall uncertainty set 𝒫 is non-rectangular, nature’s worst-case choices become temporally coupled and the inner infimum loses its convenient convex structure. The rectangularity assumption resolves this by decomposing 𝒫 into convex sets for each stage and each state–action pair, so that the inner infimum in the robust Bellman equation is well defined and remains a convex optimization problem.


스크린샷 2025-10-29 오후 1 55 28


③ Solution

Rectangularity Assumption: the “choice” of transition probability at each time step (and at each state–action pair) is made independently of the choices at all other steps.


스크린샷 2025-10-29 오전 9 05 18


Purpose of introduction: To make DP decomposition (Bellman recursion) possible.

○ A stronger assumption than simple statistical independence. Usually holds for stationary policies.

○ ℳ(ℬ): Set of probability measures over discrete set ℬ.

○ 𝒯^{d_t}: set of conditional distributions consistent with a history-dependent decision rule d_t. To describe the conditional distribution, one must specify the joint distribution p_h(a, s_{t+1}) of the action a and the next state s_{t+1}.

○ ×: Cartesian product, so each element of 𝒯^π is a tuple of per-stage distributions such as (p_0, p_1, …, p_{N−1}).

○ π does not appear explicitly in 𝒯^π because it is defined via the d_t’s, but when π changes the d_t’s change, and hence 𝒯^π changes.

Result: computation becomes tractable (mitigating the curse of dimensionality), and if each 𝒫(s, a) is convex, the inner problem is also convex (a minimal backup sketch appears after this list).

Limitation: if the transition dynamics evolve rapidly, allowing nature to choose a different worst-case transition distribution at each visit (as rectangularity assumes) can be quite realistic. But if the transition model changes only slowly, it becomes unrealistic to assume that nature can adversarially switch to a completely new distribution every time the same (s, a) pair is encountered, especially in an infinite-horizon setting where such revisits are frequent.
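A minimal sketch of one robust Bellman backup under (s, a)-rectangularity, assuming each 𝒫(s, a) is given simply as a finite list of candidate distributions so that the inner infimum is an exact minimum over that list; the function name and data layout are illustrative, not taken from the papers.

```python
import numpy as np

def robust_backup(V, rewards, P_sets, lam=0.95):
    """One application of the robust Bellman operator under rectangularity.

    V       : (S,) current value estimate
    rewards : (S, A, S) array of r(s, a, s')
    P_sets  : P_sets[s][a] is a list of candidate next-state distributions
              (each an (S,) array), i.e. a finite stand-in for P(s, a)
    lam     : discount factor
    """
    S, A, _ = rewards.shape
    V_new = np.empty(S)
    for s in range(S):
        q_values = []
        for a in range(A):
            # Inner problem: nature picks the worst candidate independently for
            # this (s, a) pair -- exactly the decoupling rectangularity provides.
            worst = min(p @ (rewards[s, a] + lam * V) for p in P_sets[s][a])
            q_values.append(worst)
        V_new[s] = max(q_values)  # the decision maker best-responds
    return V_new
```

Because the inner minimum is taken separately for each (s, a), each backup reduces to |A| small inner problems per state, which is what makes the DP decomposition go through.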

Adversarial Setting

○ Robust MDP is a game between a decision maker, who chooses π, and an adversary, who finds the infimum (the worst-case transition law) for each π.

Question 1. Since game theory and optimization are conceptually different, is robust MDP closer to game theory or optimization theory?


|  | Optimization | Game |
| --- | --- | --- |
| Problem | min_{x∈X} f(x) | min_{x_i∈X_i} f_i(x_i, x_{−i}), i ∈ [N] |
| Regularity | (strong) convexity | (strong) monotonicity |
| Solution (cvx) | global minimizer | NE: f_i(x_i*, x_{−i}*) ≤ f_i(x_i, x_{−i}*) for all x_i, i ∈ [N] |
| Solution (ncvx) | B-/Clarke stationarity | quasi-/Clarke NE |

Table 1. Difference between optimization theory and game theory


Question 2. Doesn’t the fact that robust MDP is a single-agent system affect this?

Answer: under the rectangularity assumption, the game collapses to a single-agent optimization problem, since the adversary’s inner infimum can be solved separately for each (s, a).

Perfect Observation MDP: observe the state s_t and choose the action a_t.

Time-invariant: the reward r_t(s_t, a_t, s_{t+1}) is known and, in the infinite-horizon case, time-invariant.

⑵ Background Theory

① DP (Dynamic Programming)

② ϵ-optimal policy

③ Banach Fixed Point Theorem

④ Game Theory

⑤ Information Theory



2. Finite-horizon Robust MDP

⑴ Overview: for computational tractability, the state and action spaces are assumed to be finite.

⑵ Value function after policy π is fixed.


스크린샷 2025-10-29 오전 9 08 14


① Since s_{t+1} is ambiguous, we allow the reward at time t to depend on s_{t+1} as well: r_t(s_t, a_t, s_{t+1}) instead of r_t(s_t, a_t).

⑶ Bellman equation (DP)


스크린샷 2025-10-29 오전 9 08 31
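In standard notation, the recursion in the screenshot has the following shape under rectangularity (a hedged transcription; the papers’ exact symbols may differ):

```latex
\[
  V_N(s) = r_N(s), \qquad
  V_t(s) = \max_{a \in \mathcal{A}(s)} \;
           \inf_{p \in \mathcal{P}_t(s,a)} \;
           \mathbb{E}_{s' \sim p}\big[\, r_t(s, a, s') + V_{t+1}(s') \,\big],
  \quad t = N-1, \dots, 0.
\]
```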


① Simple proof


스크린샷 2025-12-14 오전 2 47 43


② Rigorous proof: use an ϵ-optimal policy and the equivalence method (show both LHS ≥ RHS and LHS ≤ RHS).



3. Infinite-horizon Robust MDP

⑴ Overview: Under discount factor λ < 1 and rectangular uncertainty, the robust Bellman operator is also contractive → value/policy iteration converges.

⑵ Bellman equation (DP)


스크린샷 2025-10-29 오전 9 09 36


⑶ Banach Fixed Point Theorem


스크린샷 2025-10-29 오전 9 09 55


① Corollary: taking inf and sup preserves the contraction property, so the conclusion of the Banach Fixed Point Theorem is unchanged (key result). Note that (b) is an NP-complete problem.


스크린샷 2025-10-29 오전 9 10 28


② Value iteration algorithm


스크린샷 2025-10-29 오전 9 10 50


○ Contractivity → error bound from current residual.


스크린샷 2025-10-29 오전 9 11 10


○ Error of one update step.


스크린샷 2025-10-29 오전 9 11 35


○ Therefore, to guarantee ||Ṽ − V*|| ≤ ϵ / 2, it suffices to stop when the residual satisfies the condition below.


스크린샷 2025-10-29 오전 9 12 09


○ The paper’s algorithm adds an extra 1/2 multiplier and stops when that condition is met.


스크린샷 2025-10-29 오전 9 12 28


○ This stricter threshold ensures that even after the update V ← Ṽ, the error remains small enough (an extra margin via the triangle inequality), so the greedy policy selected in the next step also satisfies the usual performance-loss bound (typically (2λ / (1 − λ)) ||V − V*||); it is a conservative, safety-margin choice (see the value-iteration sketch below).
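A minimal sketch of robust value iteration with the stopping rule discussed above; it reuses the finite-candidate inner minimum from the earlier backup sketch, and the threshold (including the extra 1/2 factor) follows the discussion here rather than the papers’ exact pseudocode.

```python
import numpy as np

def robust_value_iteration(rewards, P_sets, lam=0.9, eps=1e-3):
    """Robust value iteration for a finite, (s, a)-rectangular uncertainty set.

    rewards : (S, A, S) array of r(s, a, s')
    P_sets  : P_sets[s][a] = list of candidate next-state distributions
    Returns a value estimate and the corresponding greedy (robust) policy.
    """
    S, A, _ = rewards.shape
    V = np.zeros(S)
    # stop when ||V_new - V|| <= (1/2) * eps * (1 - lam) / (2 * lam);
    # the leading 1/2 is the extra safety factor mentioned in the text
    threshold = 0.5 * eps * (1.0 - lam) / (2.0 * lam)
    while True:
        V_new = np.array([
            max(min(p @ (rewards[s, a] + lam * V) for p in P_sets[s][a])
                for a in range(A))
            for s in range(S)
        ])
        done = np.max(np.abs(V_new - V)) <= threshold
        V = V_new
        if done:
            break
    # greedy robust policy extracted from the (near-)fixed point
    policy = np.array([
        max(range(A),
            key=lambda a: min(p @ (rewards[s, a] + lam * V) for p in P_sets[s][a]))
        for s in range(S)
    ])
    return V, policy
```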

③ Policy iteration algorithm


스크린샷 2025-11-30 오후 8 37 20


⑷ Asymmetry between Decision Maker and Adversary


스크린샷 2025-10-29 오전 9 12 55


① Under rectangularity, a discounted infinite horizon, and convex/compact 𝒫(s, a), if only stationary policies are considered, then even though the dynamic adversary may change its choice at every visit, its optimal response to a stationary policy is itself stationary (the same p at each (s, a)), yielding the same value.

② The dynamic adversary appears stronger, but under stationary policies, the actual worst-case response is static, giving equal value.

③ However, when the adversary is static, the decision maker’s stationary policy may not be optimal. A better option might be a non-stationary/dynamic (universal) policy that adjusts weights over time/history (Cover, 1991).



4. Examples

⑴ Overview

① The overall V-problem is hard to solve via linear programming due to the non-convexity induced by inf_p.

② The proposed approach handles the outer structure via DP (value/policy iteration) and solves the inner minimization quickly as a convex problem.

③ The following examples show that robust MDP is computationally heavier than plain MDP but does not increase the time complexity significantly.

④ Summary


| Uncertainty set | Plain DP (per state–action) | Robust DP (per state–action) |
| --- | --- | --- |
| KL confidence set | 𝒪(\|S\|) | 𝒪(\|S\|) (only a constant-factor increase) |
| L2 (χ²-type) set | 𝒪(\|S\|) | 𝒪(\|S\| log \|S\|) |
| L1 set | 𝒪(\|S\|) | 𝒪(\|S\| log \|S\|) |


Application 1. KL Confidence

① Overview: the inner problem reduces to a one-dimensional convex search (the solution has an exponential-tilting form).

② Data → KL divergence: from the observed frequencies at (s, a), build the MLE p̂_{sa}, and via large-sample theory (a chi-square approximation) define the confidence region (at confidence level ω) as 𝒫(s, a) = {p : D(p || p̂_{sa}) ≤ t_ω}.

③ Closed-form solution of the robust expectation + a 1-D convex problem: solving the Lagrangian of min_{p∈𝒫} 𝔼_p[v] gives f(γ) = γt + γ log(𝔼_q[exp(−v / γ)]), a one-dimensional convex function, with optimal distribution p(s) ∝ q(s) exp((μ − v(s)) / γ), i.e., exponential tilting. So once γ is found, p is obtained directly (a minimal numerical sketch appears after this list).

④ Computational complexity guarantee: f(γ) is convex, so f′(γ) is monotone → bisection finds an ϵ-approximation efficiently (each evaluation of f′(γ) costs 𝒪(|S|)).


스크린샷 2025-10-29 오전 9 14 47


⑤ Conclusion: modeling transition uncertainty in robust MDP with KL confidence sets yields a closed-form, fast 1-D convex optimization (exponential tilting), making the computation highly practical.
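A minimal numerical sketch of this inner problem, assuming the dual form stated above (minimize the one-dimensional convex f(γ) = γt + γ·log 𝔼_q[exp(−v/γ)] over γ > 0, with the robust value equal to −min_γ f(γ) and the worst-case distribution given by exponential tilting). The bracket [γ_lo, γ_hi] and the use of a simple ternary search instead of bisection on f′ are implementation choices, not the papers’ exact algorithm.

```python
import numpy as np

def kl_worst_case_expectation(v, q, t, gamma_lo=1e-6, gamma_hi=1e6, iters=200):
    """Worst-case expectation  inf { E_p[v] : D(p || q) <= t }  via the 1-D dual.

    v : (S,) next-state values (e.g. r + lam * V);  q : (S,) nominal MLE, q > 0
    t : radius of the KL confidence set.  Returns (worst_value, worst_p).
    """
    logq = np.log(q)

    def f(gamma):
        # f(gamma) = gamma*t + gamma*log E_q[exp(-v/gamma)], via log-sum-exp
        x = logq - v / gamma
        m = np.max(x)
        return gamma * t + gamma * (m + np.log(np.sum(np.exp(x - m))))

    lo, hi = gamma_lo, gamma_hi
    for _ in range(iters):          # ternary search: f is convex in gamma
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    gamma = 0.5 * (lo + hi)

    # exponential tilting: worst-case p(s) proportional to q(s)*exp(-v(s)/gamma)
    x = logq - v / gamma
    p = np.exp(x - np.max(x))
    p /= p.sum()
    return -f(gamma), p
```

Each evaluation of f costs 𝒪(|S|), so the inner problem stays linear in |S| up to the fixed number of 1-D search steps, matching the constant-factor claim in the summary table.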

Application 2. L2 Approximation (χ2-type) Sets

① Overview: once the values are sorted, a thresholding step enables fast computation → this makes the computationally heavy robust MDP practically solvable.

② The optimization inf_{p∈𝒫} 𝔼_p[v] can be solved in 𝒪(|𝒮| log |𝒮|) time.

Application 3. L1 Approximation Sets

① Pinsker’s inequality D(p || q) ≥ (1 / (2 ln 2)) ||p − q||₁² motivates the relaxed set 𝒫 = {p : ||p − q||₁ ≤ δ} with δ = √(2 t ln 2) (a minimal sketch of the corresponding inner problem appears after this list).

② L1- / L∞-type sets are convenient for modeling but weak as statistical confidence regions → in practice, L2-based sets are recommended instead.
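A minimal sketch of the corresponding inner problem inf { 𝔼_p[v] : p ∈ Δ, ||p − q||₁ ≤ δ }. The greedy mass-shifting rule below (push up to δ/2 of probability mass from the largest-v states onto the smallest-v state) is a standard way to solve it after one sort, written from scratch here rather than copied from the papers.

```python
import numpy as np

def l1_worst_case_expectation(v, q, delta):
    """Worst case over the L1 ball {p in the simplex : ||p - q||_1 <= delta}."""
    v = np.asarray(v, dtype=float)
    p = np.asarray(q, dtype=float).copy()
    i_min = int(np.argmin(v))
    # total mass we may move is delta/2 (each unit moved changes ||p - q||_1 by 2),
    # capped so that p[i_min] cannot exceed 1
    budget = min(delta / 2.0, 1.0 - p[i_min])
    p[i_min] += budget
    # drain the same amount from the largest-v states first
    for i in np.argsort(v)[::-1]:
        if budget <= 0.0:
            break
        if i == i_min:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
    return float(p @ v), p
```

The sort dominates the cost, giving the 𝒪(|S| log |S|) per-(s, a) figure quoted in the summary table.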



Posted: 2025.10.27 20:40
