Chapter 9-1. Robust MDP
Recommended reading:
a. 【Control Theory】 Stochastic Control Theory
b. Arnab Nilim and Laurent El Ghaoui (2003)
1. General Overview
2. Finite-horizon Robust MDP
3. Infinite-horizon Robust MDP
4. Applications
5. Appendix
1. General Overview
⑴ Introduction
① Goal: Choosing a robust policy under model ambiguity (i.e., transition probability P is uncertain)
○ Uncertainty: The probability distribution is known; only the outcomes drawn from it are random.
○ Ambiguity: The probability distribution itself is unknown. In other words, it’s unclear which distribution is correct.
② Problem Definition
○ Standard MDP
○ Robust MDP: Each policy π is evaluated under the worst-case transition probability P ∈ 𝒫 (inner infimum), and the goal is to choose the policy that maximizes this worst-case value.
○ Problem 1. The DP optimal policy is very sensitive to small changes in the transition probabilities (see the numeric sketch after this list).
○ Rain probability 0.3 → carrying an umbrella is optimal.
○ Rain probability 0.2 → not carrying an umbrella is optimal.
○ A difference of only 0.1, yet the optimal policy flips.
○ Problem 2. Each (s, a) adds an inner minimization inf_{p∈𝒫(s,a)} 𝔼_p[·], making the problem computationally heavier than a standard MDP.
○ Problem 3. The linear-programming formulation breaks down: depending on the choice of 𝒫, the feasible set becomes non-convex.
○ Fixed transition matrix P: the constraints are linear in V, so the feasible set is an intersection of half-spaces → convex.
○ inf_P inside the ≥ constraint: the right-hand side becomes a pointwise infimum of linear functions of V, i.e. concave, and the epigraph of a concave function is generally non-convex.
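Numeric sketch of Problem 1. The payoffs below are hypothetical (not from the text), chosen only so that the decision threshold falls between rain probabilities 0.2 and 0.3; comparing expected rewards shows the optimal action flipping.
# Hypothetical payoffs: carrying costs -2 regardless of weather;
# not carrying costs 0 if dry and -8 if it rains.
def best_action(p_rain):
    expected = {
        "carry": -2.0,                 # carrying cost, independent of the weather
        "no_umbrella": -8.0 * p_rain,  # expected penalty for getting soaked
    }
    return max(expected, key=expected.get)

print(best_action(0.3))  # 'carry'        -> carrying the umbrella is optimal
print(best_action(0.2))  # 'no_umbrella'  -> skipping the umbrella is optimal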
③ Solution
○ Adversarial Setting
○ Robust MDP is a max-min game between a decision maker choosing π and an adversary choosing, for each π, the transition law that attains the infimum.
○ Question: Since game theory and optimization are conceptually different, is robust MDP closer to game theory or optimization theory?
○ Rectangularity Assumption: Means the “choice” of transition probabilities at each time step is independent of others.
○ Purpose of introduction: To make DP decomposition (Bellman recursion) possible.
○ A stronger assumption than simple statistical independence. Usually holds for stationary policies.
○ ℳ(ℬ): Set of probability measures over discrete set ℬ.
○ 𝒯^{d_t}: Set of conditional distributions consistent with a history-dependent decision rule d_t. To describe the conditional distribution, one must specify the joint distribution p_h(a, s_{t+1}) of the action a and the next state s_{t+1}.
○ ×: Cartesian product, so each element of 𝒯^π is a tuple of per-stage distributions (p_0, p_1, …, p_{N-1}).
○ π does not appear explicitly in 𝒯^π because it is defined via the d_t's, but when π changes, the d_t's change, hence 𝒯^π changes.
○ Result: Computation becomes tractable (mitigates curse of dimensionality), and if 𝒫(s, a) is convex, the inner problem is also convex.
○ Perfect Observation MDP: Observe the state s_t and choose a_t.
○ Time-invariant: The reward r(s_t, a_t, s_{t+1}) is known and time-invariant.
⑵ Background Theory
① DP (Dynamic Programming)
② ϵ-optimal policy
2. Finite-horizon Robust MDP
⑴ Overview: For computational tractability, the state and action spaces are assumed finite.
⑵ Value function once a policy π is fixed (the worst-case value over 𝒯^π).
⑶ Bellman equation (DP)
① The proof proceeds by backward induction, using the induction hypothesis at the next stage.
② For a rigorous proof, use an ϵ-optimal policy and show the two inequalities LHS ≥ RHS and LHS ≤ RHS separately.
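A minimal sketch of the backward induction this recursion describes, i.e. V_t(s) = max_a inf_{p∈𝒫(s,a)} Σ_{s'} p(s') [r(s, a, s') + V_{t+1}(s')] under (s, a)-rectangular uncertainty. The function name and the inner_min callback interface are my own; discounting and terminal rewards are omitted for brevity.
import numpy as np

def robust_backward_induction(n_states, n_actions, horizon, reward, inner_min):
    # reward[s, a, s_next] : known, time-invariant reward r(s, a, s')
    # inner_min(s, a, w)   : returns inf_{p in P(s,a)} sum_{s'} p(s') * w(s'),
    #                        i.e. the per-(s, a) inner problem (convex if P(s, a) is convex)
    V = np.zeros((horizon + 1, n_states))            # V[horizon] = 0 (terminal value)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in range(horizon - 1, -1, -1):             # backward in time
        for s in range(n_states):
            q_values = np.empty(n_actions)
            for a in range(n_actions):
                w = reward[s, a, :] + V[t + 1]       # w(s') = r(s, a, s') + V_{t+1}(s')
                q_values[a] = inner_min(s, a, w)     # adversary: worst-case expectation of w
            policy[t, s] = int(np.argmax(q_values))  # decision maker: maximize the worst case
            V[t, s] = q_values[policy[t, s]]
    return V, policy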
3. Infinite-horizon Robust MDP
⑴ Overview: Under discount factor λ < 1 and rectangular uncertainty, the robust Bellman operator is also contractive → value/policy iteration converges.
⑵ Bellman equation (DP)
⑶ Banach Fixed Point Theorem
① Corollary: Taking the inf (adversary) and sup (decision maker) inside the operator preserves the contraction, so the conclusion of the Banach Fixed Point Theorem still holds (key result). Note that (b) is an NP-complete problem.
② Algorithm
○ Contractivity gives an error bound from the current residual: ‖Ṽ - V*‖ ≤ (λ / (1 - λ)) ‖Ṽ - V‖.
○ ‖Ṽ - V‖ is the error (residual) of one update step V → Ṽ.
○ Therefore, stopping when ‖Ṽ - V‖ ≤ ϵ(1 - λ) / (2λ) ensures ‖Ṽ - V*‖ ≤ ϵ / 2.
○ The paper’s algorithm adds an extra 1/2 multiplier and stops when that condition is met.
○ This stricter threshold ensures that even after the update V ← Ṽ the error remains small enough (extra margin via the triangle inequality), so the greedy policy selected in the next step also satisfies the usual performance-loss bound (typically (2λ / (1 - λ)) ‖V - V*‖); a conservative, safety-margin setting.
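A minimal sketch of robust value iteration with the residual-based stopping rule above; it reuses the inner_min(s, a, w) callback interface from the finite-horizon sketch and does not reproduce the paper's extra safety-margin factor.
import numpy as np

def robust_value_iteration(n_states, n_actions, reward, inner_min, lam, eps, max_iter=10_000):
    # lam: discount factor (0 < lam < 1); inner_min(s, a, w) = inf_{p in P(s,a)} p . w
    V = np.zeros(n_states)
    tol = eps * (1.0 - lam) / (2.0 * lam)            # residual threshold derived above
    for _ in range(max_iter):
        V_new = np.array([
            max(inner_min(s, a, reward[s, a, :] + lam * V) for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) <= tol:         # then ||V_new - V*|| <= eps / 2
            return V_new
        V = V_new
    return V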
⑷ Asymmetry between Decision Maker and Adversary
① If only stationary policies are considered, then under rectangularity, a discounted infinite horizon, and convex/compact 𝒫(s, a), even a dynamic adversary that may change its choice on each visit has a worst-case response to a stationary policy that is itself stationary (the same p for each (s, a)), yielding the same value.
② The dynamic adversary appears stronger, but under stationary policies, the actual worst-case response is static, giving equal value.
③ However, when the adversary is static, the decision maker’s stationary policy may not be optimal. A better option might be a non-stationary/dynamic (universal) policy that adjusts weights over time/history (Cover, 1991).
4. Applications
⑴ Overview
① The overall value problem is hard to solve by linear programming because of the non-convexity induced by the inner inf_p.
② The proposed approach handles the outer structure via DP (value/policy iteration) and solves each inner minimization as a fast convex problem.
⑵ Application 1. KL Confidence
① Overview: The inner problem reduces to a one-dimensional convex search (exponential tilting form solution).
② Data → KL divergence: From the observed transition frequencies at (s, a), build the MLE p̂_{sa}, and via asymptotic statistics (the chi-square approximation to the likelihood-ratio statistic), define the confidence region at level ω as 𝒫(s, a) = {p : D(p ‖ p̂_{sa}) ≤ t_ω}.
③ Closed-form solution of the robust expectation + a 1-D convex problem: Solving the Lagrangian of min_{p∈𝒫} 𝔼_p[v] gives f(γ) = γt + γ log(𝔼_q[exp(-v / γ)]), a 1-D convex function of γ, with optimal distribution p(s) ∝ q(s) exp((μ - v(s)) / γ), i.e., exponential tilting of the nominal q. So once γ is found, p is obtained directly.
④ Computational complexity guarantee: f′(γ) is monotone (f is convex) → bisection finds an ϵ-approximation efficiently (each evaluation of f′(γ) costs 𝒪(|S|)).
⑤ Conclusion: Modeling transition uncertainty in robust MDP with KL confidence sets yields a closed-form and fast 1D convex optimization (exponential tilting), making computation highly practical.
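A minimal sketch of this inner KL problem, assuming numpy/scipy. The function name robust_expectation_kl is mine, the radius t is taken as a natural-log KL radius, and a bounded 1-D minimization stands in for the paper's bisection on f′(γ); it could serve as the inner_min routine in the earlier sketches.
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize_scalar

def robust_expectation_kl(v, q, t, gamma_max=1e3):
    # Worst-case expectation inf_{p : D(p||q) <= t} E_p[v] via the 1-D dual
    #   inf_p E_p[v] = - min_{gamma > 0} [ gamma*t + gamma*log E_q[exp(-v/gamma)] ].
    # Assumes the nominal q > 0 componentwise and t is a natural-log KL radius.
    v, q = np.asarray(v, dtype=float), np.asarray(q, dtype=float)
    log_q = np.log(q)
    def f(gamma):
        # f(gamma) = gamma*t + gamma*log E_q[exp(-v/gamma)], via a stable log-sum-exp
        return gamma * t + gamma * logsumexp(-v / gamma + log_q)
    res = minimize_scalar(f, bounds=(1e-8, gamma_max), method="bounded")
    gamma = res.x
    log_p = -v / gamma + log_q                   # exponential tilting of q by the value vector
    p = np.exp(log_p - logsumexp(log_p))         # normalize: p(s) ∝ q(s) exp(-v(s)/gamma)
    return -res.fun, p                           # (robust value, worst-case distribution)

# Hypothetical usage: v = next-state values, q = MLE transition estimate, t = KL radius
worst_value, worst_p = robust_expectation_kl(np.array([1.0, 2.0, 3.0]),
                                             np.array([0.2, 0.5, 0.3]), t=0.05)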
⑶ Application 2. L2 Approximation (χ2-type) Sets
① Overview: Once sorted, thresholding enables fast computation → makes computationally heavy robust MDPs practically solvable.
To be updated later.
⑷ Application 3. L1 Approximation Sets
① By Pinsker's inequality, D(p ‖ q) ≥ (1 / (2 ln 2)) ‖p - q‖₁², so the set 𝒫 = {p : ‖p - q‖₁ ≤ δ} with δ = √(2t ln 2) is used (an outer approximation of the KL ball).
② L1 / L∞ type sets are convenient for modeling but weak as statistical confidence regions → in practice, L2-based sets are more recommended.
To be updated later.
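For reference, a minimal sketch of the inner minimization over such an L1 ball. This is the standard greedy construction for a total-variation ball (shift mass from the highest-value states onto the lowest-value state), not necessarily the paper's algorithm, and the function name is mine.
import numpy as np

def inner_min_l1(v, q, delta):
    # inf_p p.v  subject to  ||p - q||_1 <= delta  and  p in the probability simplex.
    v, q = np.asarray(v, dtype=float), np.asarray(q, dtype=float)
    p = q.copy()
    s_min = int(np.argmin(v))                       # cheapest next state for the adversary
    budget = min(delta / 2.0, 1.0 - q[s_min])       # total mass we are allowed to move
    for s in np.argsort(v)[::-1]:                   # take mass from the largest v first
        if s == s_min or budget <= 0.0:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[s_min] += moved
        budget -= moved
    return float(p @ v), p                          # (worst-case expectation, worst-case p)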
5. Appendix
⑴ Found several typos as follows:
Input: 2025.10.27 20:40