
Chapter 9-1. Robust MDP

Recommended reading: 【Control Theory】 Stochastic Control Theory


1. General Overview

2. Finite-horizon robust MDP

3. Infinite-horizon robust MDP

4. Applications

5. Appendix


a. Garud Iyengar (2005)

b. Arnab Nilim and Laurent El Ghaoui (2003)

1. General Overview

⑴ Introduction

① Goal: choose a robust policy under model ambiguity (i.e., the transition probability P is not known exactly)

○ Uncertainty: The probability distribution is known; outcomes are simply random draws from that known distribution.

○ Ambiguity: The probability distribution itself is unknown. In other words, it’s unclear which distribution is correct.

② Problem Definition

○ Standard MDP


Screenshot 2025-10-29 9:03:40 AM


○ Robust MDP: For each policy π, evaluate the worst-case (infimum) transition law P; the robust problem then chooses the π that maximizes this worst-case value (both objectives are sketched below).


Screenshot 2025-10-29 9:03:59 AM
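For reference, hedged sketches of the two objectives above, written with the reward r(s_t, a_t, s_{t+1}) and horizon N that appear later in these notes (the screenshots may use slightly different notation):

```latex
% Standard MDP: a single known transition law p
\max_{\pi}\; \mathbb{E}^{p,\pi}\!\left[\sum_{t=0}^{N-1} r(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s\right]

% Robust MDP: maximize the worst case over the ambiguity set T^pi of laws consistent with pi
\max_{\pi}\; \inf_{p \in \mathcal{T}^{\pi}}\; \mathbb{E}^{p}\!\left[\sum_{t=0}^{N-1} r(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s\right]
```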


Problem 1. The DP optimal policy is very sensitive to small changes in transition probabilities.

○ Rain probability 0.3 → optimal to carry an umbrella.

○ Rain probability 0.2 → optimal not to carry an umbrella.

○ Only a 0.1 difference in the rain probability, yet the optimal policy flips (a small numeric sketch follows).
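A minimal numeric sketch of this flip. The payoffs are hypothetical and chosen only for illustration (they are not from the text): carrying an umbrella always costs 1, and getting rained on without one costs 4.

```python
# Hypothetical one-step payoffs, chosen only to illustrate the policy flip.
def expected_reward(carry_umbrella: bool, p_rain: float) -> float:
    if carry_umbrella:
        return -1.0                 # fixed inconvenience of carrying the umbrella
    return -4.0 * p_rain            # expected cost of getting soaked

for p_rain in (0.3, 0.2):
    carry = expected_reward(True, p_rain) > expected_reward(False, p_rain)
    print(f"p_rain = {p_rain}: {'carry' if carry else 'do not carry'} the umbrella")
# p_rain = 0.3 -> carry (-1.0 > -1.2);  p_rain = 0.2 -> do not carry (-1.0 < -0.8)
```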

Problem 2. Each (s, a) adds an inner minimization inf_{p∈𝒫(s,a)} 𝔼_p[·], making the problem computationally heavier than a standard MDP.

Problem 3. Linear programming may no longer apply: depending on the choice of 𝒫, the problem can become non-convex.

○ Fixed transition matrix P: intersection of half-spaces → convex.

○ inf over P with a ≥ constraint: the feasible region lies above a concave function (the epigraph of a concave "roof") → generally non-convex.


Screenshot 2025-10-29 1:55:28 PM
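A hedged sketch of why the LP formulation breaks (written for the discounted case; notation assumed rather than taken from the figure): the standard LP constraint is linear in V, while the robust counterpart has a right-hand side that is an infimum of linear functions of V, hence concave in V, so the constraint "V above a concave function" is generally non-convex.

```latex
% Standard LP constraint (linear in V)
V(s) \;\ge\; \sum_{s'} p(s' \mid s, a)\,\big[r(s,a,s') + \lambda V(s')\big]
% Robust counterpart (right-hand side is concave in V)
V(s) \;\ge\; \inf_{p \in \mathcal{P}(s,a)} \sum_{s'} p(s')\,\big[r(s,a,s') + \lambda V(s')\big]
```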


③ Solution

Adversarial Setting

○ Robust MDP is a game between a decision maker who chooses π and an adversary who picks the worst-case (infimum) transition law for that π.

Question: Since game theory and optimization are conceptually different, is robust MDP closer to game theory or optimization theory?

Rectangularity Assumption: The adversary's "choice" of transition probabilities at each time step can be made independently of the choices at other steps (a sketch of the product form follows this list).


Screenshot 2025-10-29 9:05:18 AM


Purpose of introduction: To make DP decomposition (Bellman recursion) possible.

○ A stronger assumption than simple statistical independence. Usually holds for stationary policies.

○ ℳ(ℬ): Set of probability measures over discrete set ℬ.

○ 𝒯^{d_t}: Set of conditional distributions consistent with a history-dependent decision rule d_t. To describe such a conditional distribution, one must specify, for each history h, the joint distribution p_h(a, s) of the action a and the next state s_{t+1}.

○ ×: Cartesian product, so each element of 𝒯^π is a tuple of per-stage distributions of the form (p_0, p_1, …, p_{N−1}).

○ π does not appear explicitly in the definition of 𝒯^π because it is built from the decision rules d_t; when π changes, the d_t change, and hence 𝒯^π changes as well.

Result: Computation becomes tractable (mitigates curse of dimensionality), and if 𝒫(s, a) is convex, the inner problem is also convex.

Perfect Observation MDP: The state s_t is observed, and then the action a_t is chosen.

Time-invariant: The reward r(s_t, a_t, s_{t+1}) is known and time-invariant.
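A hedged sketch of the product (rectangular) structure described above, consistent with the tuple interpretation of 𝒯^π (the exact definition in the paper also factorizes within each stage, which is omitted here):

```latex
% Rectangularity: per-stage choices of the transition law are made independently
\mathcal{T}^{\pi} \;=\; \mathcal{T}^{d_0} \times \mathcal{T}^{d_1} \times \cdots \times \mathcal{T}^{d_{N-1}}
```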

⑵ Background Theory

① DP (Dynamic Programming)

② ϵ-optimal policy

③ Banach Fixed Point Theorem

④ Game Theory

⑤ Information Theory



2. Finite-horizon Robust MDP

⑴ Overview: For computational purposes, the state and action spaces are assumed finite.

⑵ Value function after policy π is fixed.


Screenshot 2025-10-29 9:08:14 AM
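A hedged reconstruction of the worst-case value of a fixed policy (the screenshot may include a terminal reward and use slightly different symbols):

```latex
% Worst-case value of a fixed policy pi over the rectangular ambiguity set
V^{\pi}(s) \;=\; \inf_{p \in \mathcal{T}^{\pi}} \; \mathbb{E}^{p}\!\left[\sum_{t=0}^{N-1} r(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s\right]
```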


⑶ Bellman equation (DP)


Screenshot 2025-10-29 9:08:31 AM
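A hedged sketch of the backward recursion, consistent with the rectangularity assumption above (notation assumed; V_N is given by the terminal condition):

```latex
% Robust Bellman recursion, backward in time
V_t(s) \;=\; \max_{a \in A(s)} \; \inf_{p \in \mathcal{P}_t(s,a)} \; \sum_{s'} p(s')\,\big[r(s,a,s') + V_{t+1}(s')\big]
```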


① The proof proceeds by backward induction on t, using the following induction hypothesis.


Screenshot 2025-10-29 9:08:57 AM


② A rigorous proof uses an ϵ-optimal policy (since the infimum need not be attained) and establishes equality by showing both LHS ≥ RHS and LHS ≤ RHS.



3. Infinite-horizon Robust MDP

⑴ Overview: Under discount factor λ < 1 and rectangular uncertainty, the robust Bellman operator is also contractive → value/policy iteration converges.

⑵ Bellman equation (DP)


Screenshot 2025-10-29 9:09:36 AM
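A hedged sketch of the robust Bellman operator and the contraction property used below (assuming stationary rewards and ambiguity sets 𝒫(s, a)):

```latex
% Robust Bellman operator for the discounted infinite-horizon problem
(\mathcal{L}V)(s) \;=\; \max_{a \in A(s)} \; \inf_{p \in \mathcal{P}(s,a)} \; \sum_{s'} p(s')\,\big[r(s,a,s') + \lambda V(s')\big],
\qquad
\|\mathcal{L}V - \mathcal{L}W\|_{\infty} \;\le\; \lambda\,\|V - W\|_{\infty}
```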


⑶ Banach Fixed Point Theorem


Screenshot 2025-10-29 9:09:55 AM


① Corollary: taking the inf over 𝒫(s, a) and the sup over actions inside the operator does not break the contraction, so the conclusion of the Banach Fixed Point Theorem still holds (key result). Note that (b) is an NP-complete problem.


Screenshot 2025-10-29 9:10:28 AM


② Algorithm


Screenshot 2025-10-29 9:10:50 AM


○ Contractivity → error bound from current residual.


Screenshot 2025-10-29 9:11:10 AM


○ Error of one update step.


Screenshot 2025-10-29 9:11:35 AM


○ Therefore, to ensure ‖Ṽ − V*‖ ≤ ϵ/2, the residual must satisfy the following condition.


Screenshot 2025-10-29 9:12:09 AM


○ The paper's algorithm adds an extra factor of 1/2 to this threshold and stops when the resulting condition is met.


Screenshot 2025-10-29 9:12:28 AM


○ This stricter threshold ensures that even after the update V ← Ṽ the error remains small enough (extra margin via the triangle inequality), so the greedy policy selected in the next step also satisfies the usual performance-loss bound, typically of the form (2λ / (1 − λ)) ‖V − V*‖; it is a conservative (safety-margin) choice. A sketch of the resulting loop follows.
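A minimal sketch of the loop just described, under assumptions that are not from the papers: the rectangular ambiguity set for each (s, a) is represented here by a finite list of candidate next-state distributions, and the stopping rule is the standard residual test guaranteeing ‖Ṽ − V*‖ ≤ ϵ/2 (the paper tightens it further by the extra factor of 1/2 noted above). `robust_value_iteration` and its arguments are illustrative names.

```python
import numpy as np

def robust_value_iteration(r, P_cands, lam=0.9, eps=1e-4):
    """Robust value-iteration sketch (illustration only, not the paper's exact algorithm).

    r[s, a, s']   : reward r(s, a, s')
    P_cands[s][a] : list of candidate next-state distributions for (s, a)
                    (a finite stand-in for a rectangular ambiguity set)
    lam           : discount factor < 1
    """
    n_states, n_actions, _ = r.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            q = []
            for a in range(n_actions):
                # inner problem: adversary picks the worst candidate model for (s, a)
                q.append(min(p @ (r[s, a] + lam * V) for p in P_cands[s][a]))
            V_new[s] = max(q)                       # outer problem: best action
        # residual test: ||V_new - V*|| <= lam/(1-lam) * ||V_new - V|| <= eps/2
        # (the paper tightens this threshold by an extra factor of 1/2, as noted above)
        if np.max(np.abs(V_new - V)) <= eps * (1 - lam) / (2 * lam):
            return V_new
        V = V_new
```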

⑷ Asymmetry between Decision Maker and Adversary


Screenshot 2025-10-29 9:12:55 AM


① If only stationary policies are considered, then under rectangularity, infinite-horizon discounting, and convex/compact 𝒫(s, a), even a dynamic adversary that may change its choice on every visit has a worst-case response to a stationary policy that is itself stationary (the same p for each (s, a)), and so yields the same value.

② The dynamic adversary appears stronger, but under stationary policies, the actual worst-case response is static, giving equal value.

③ However, when the adversary is static, the decision maker’s stationary policy may not be optimal. A better option might be a non-stationary/dynamic (universal) policy that adjusts weights over time/history (Cover, 1991).



4. Applications

⑴ Overview

① The overall value problem is hard to solve via linear programming because of the non-convexity induced by the inner inf over P.

② The proposed approach handles the outer structure via DP (value/policy iteration) and solves the inner minimization quickly as a convex problem.

Application 1. KL Confidence Sets

① Overview: The inner problem reduces to a one-dimensional convex search (the solution has an exponential-tilting form).

Data → KL divergence: From the observed frequencies at (s, a), build the MLE p̂_{sa}, and via asymptotic statistics (a chi-square approximation of the likelihood ratio), define the confidence region at level ω as 𝒫(s, a) = {p : D(p ‖ p̂_{sa}) ≤ t_ω}.

Closed-form solution of the robust expectation + 1-D convex problem: Solving the Lagrangian of min_{p∈𝒫} 𝔼_p[v] yields the one-dimensional convex function f(γ) = γt + γ log(𝔼_q[exp(−v/γ)]), with optimal distribution p(s) ∝ q(s) exp((μ − v(s))/γ), i.e., an exponential tilting of q. Once γ is found, p is obtained directly (see the numerical sketch after this application).

Computational complexity guarantee: f is convex, so f′(γ) is monotone → bisection finds an ϵ-approximation efficiently (each evaluation of f′(γ) costs 𝒪(|S|)).


Screenshot 2025-10-29 9:14:47 AM


Conclusion: Modeling transition uncertainty in a robust MDP with KL confidence sets reduces the inner problem to a fast one-dimensional convex optimization with an exponential-tilting closed form, making the computation highly practical.
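A minimal numerical sketch of this inner problem. It computes min over {p : D(p ‖ q) ≤ t} of 𝔼_p[v] via the one-dimensional dual f(γ) above, but uses SciPy's bounded scalar minimizer instead of the paper's bisection on f′(γ); KL is taken in nats, q is assumed strictly positive, and `robust_expectation_kl` is an illustrative name.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize_scalar

def robust_expectation_kl(v, q, t):
    """Worst-case expectation  min_{p : D(p||q) <= t}  E_p[v]  (KL in nats, q > 0).

    Uses the 1-D convex dual f(gamma) = gamma*t + gamma*log E_q[exp(-v/gamma)];
    the robust value is -min_gamma f(gamma), attained by an exponential tilting of q.
    """
    v, q = np.asarray(v, float), np.asarray(q, float)
    if t <= 0:                                  # degenerate set: only p = q is feasible
        return float(q @ v), q

    def f(gamma):                               # convex in gamma > 0
        return gamma * t + gamma * logsumexp(np.log(q) - v / gamma)

    res = minimize_scalar(f, bounds=(1e-8, 1e8), method="bounded")
    gamma = res.x
    logp = np.log(q) - v / gamma                # tilted worst case: p*(s) ∝ q(s) e^{-v(s)/gamma}
    p = np.exp(logp - logsumexp(logp))
    return float(-res.fun), p

# Tiny usage example with made-up numbers:
# value, worst_p = robust_expectation_kl(v=np.array([0.0, 1.0, 2.0]),
#                                        q=np.array([0.2, 0.5, 0.3]), t=0.05)
```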

Application 2. L2 Approximation (χ²-type) Sets

① Overview: Once the value vector is sorted, a thresholding step solves the inner problem quickly → this makes the otherwise computationally heavy robust MDP practically solvable.

To be updated later.

Application 3. L1 Approximation Sets

① Pinsker's inequality D(p ‖ q) ≥ (1 / (2 ln 2)) ‖p − q‖₁² motivates using 𝒫 = {p : ‖p − q‖₁ ≤ δ} with δ = √(2t ln 2).

② L1 / L∞ type sets are convenient for modeling but weaker as statistical confidence regions → in practice, L2-based sets are generally preferable.

To be updated later.
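For completeness, a one-line sketch of the containment implied by ① (with D measured in bits, as the ln 2 factor suggests): any p in the KL confidence set automatically lies in the L1 ball of radius δ = √(2t ln 2).

```latex
% Pinsker-type bound turns the KL ball into (a superset given by) an L1 ball
D(p\,\|\,q) \le t
\;\Longrightarrow\;
\frac{1}{2\ln 2}\,\|p-q\|_1^2 \le t
\;\Longrightarrow\;
\|p-q\|_1 \le \sqrt{2t\ln 2}
```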



5. Appendix

⑴ Several typos were found, as shown below:


Screenshot 2025-10-29 9:16:49 AM

Screenshot 2025-10-29 9:17:06 AM



Input: 2025.10.27 20:40
