Chapter 9-1. Robust MDP
Recommended reading:
a. 【Control Theory】 Stochastic Control Theory
b. Arnab Nilim and Laurent El Ghaoui (2003)
1. General Overview
2. Finite-horizon Robust MDP
3. Infinite-horizon Robust MDP
4. Applications
5. Appendix
1. General Overview
⑴ Introduction
① Goal: Choosing a robust policy under model ambiguity (i.e., transition probability P is uncertain)
○ Uncertainty: The probability distribution is known; only the outcomes drawn from it are random.
○ Ambiguity: The probability distribution itself is unknown. In other words, it’s unclear which distribution is correct.
② Problem Definition
○ Standard MDP
○ Robust MDP: Each policy π is evaluated under the worst-case transition probability P ∈ 𝒫 (inner infimum), and the goal is to choose the policy that maximizes this worst-case value.
○ Problem 1. The DP optimal policy is very sensitive to small changes in the transition probabilities (see the numeric sketch after this list).
○ Rain probability 0.3 → carrying an umbrella is optimal.
○ Rain probability 0.2 → not carrying an umbrella is optimal.
○ A difference of only 0.1, yet the optimal policy flips.
○ Problem 2. Each (s, a) adds an inner minimization inf_{p∈𝒫(s,a)} 𝔼_p[·], making the problem computationally heavier than a standard MDP.
○ Problem 3. The linear-programming formulation breaks down: depending on the choice of 𝒫, the feasible set becomes non-convex.
○ Fixed transition matrix P: the constraints are linear in V, so the feasible set is an intersection of half-spaces → convex.
○ inf_P inside the ≥ constraint: the right-hand side becomes a pointwise infimum of linear functions of V, i.e. concave, and the epigraph of a concave function is generally non-convex.
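Numeric sketch of Problem 1. The payoffs below are hypothetical (not from the text), chosen only so that the decision threshold falls between rain probabilities 0.2 and 0.3; comparing expected rewards shows the optimal action flipping.
# Hypothetical payoffs: carrying costs -2 regardless of weather;
# not carrying costs 0 if dry and -8 if it rains.
def best_action(p_rain):
    expected = {
        "carry": -2.0,                 # carrying cost, independent of the weather
        "no_umbrella": -8.0 * p_rain,  # expected penalty for getting soaked
    }
    return max(expected, key=expected.get)

print(best_action(0.3))  # 'carry'        -> carrying the umbrella is optimal
print(best_action(0.2))  # 'no_umbrella'  -> skipping the umbrella is optimal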
③ Solution
○ Adversarial Setting
○ Robust MDP is a max-min game between a decision maker choosing π and an adversary choosing, for each π, the transition law that attains the infimum.
○ Question: Since game theory and optimization are conceptually different, is robust MDP closer to game theory or optimization theory?
○ Rectangularity Assumption: Means the “choice” of transition probabilities at each time step is independent of others.
○ Purpose of introduction: To make DP decomposition (Bellman recursion) possible.
○ A stronger assumption than simple statistical independence. Usually holds for stationary policies.
○ ℳ(ℬ): Set of probability measures over discrete set ℬ.
○ 𝒯^{d_t}: Set of conditional distributions consistent with a history-dependent decision rule d_t. To describe the conditional distribution, one must specify the joint distribution p_h(a, s_{t+1}) of the action a and the next state s_{t+1}.
○ ×: Cartesian product, so each element of 𝒯^π is a tuple of per-stage distributions (p_0, p_1, …, p_{N-1}).
○ π does not appear explicitly in 𝒯^π because it is defined via the d_t's, but when π changes, the d_t's change, hence 𝒯^π changes.
○ Result: Computation becomes tractable (mitigates curse of dimensionality), and if 𝒫(s, a) is convex, the inner problem is also convex.
○ Perfect Observation MDP: Observe the state s_t and choose a_t.
○ Time-invariant: The reward r(s_t, a_t, s_{t+1}) is known and time-invariant.
⑵ Background Theory
① DP (Dynamic Programming)
② ϵ-optimal policy
2. Finite-horizon Robust MDP
⑴ Overview: For computational tractability, the state and action spaces are assumed finite.
⑵ Value function once a policy π is fixed (the worst-case value over 𝒯^π).
⑶ Bellman equation (DP)
① The proof proceeds by backward induction, using the induction hypothesis at the next stage.
② For a rigorous proof, use an ϵ-optimal policy and show the two inequalities LHS ≥ RHS and LHS ≤ RHS separately.
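A minimal sketch of the backward induction this recursion describes, i.e. V_t(s) = max_a inf_{p∈𝒫(s,a)} Σ_{s'} p(s') [r(s, a, s') + V_{t+1}(s')] under (s, a)-rectangular uncertainty. The function name and the inner_min callback interface are my own; discounting and terminal rewards are omitted for brevity.
import numpy as np

def robust_backward_induction(n_states, n_actions, horizon, reward, inner_min):
    # reward[s, a, s_next] : known, time-invariant reward r(s, a, s')
    # inner_min(s, a, w)   : returns inf_{p in P(s,a)} sum_{s'} p(s') * w(s'),
    #                        i.e. the per-(s, a) inner problem (convex if P(s, a) is convex)
    V = np.zeros((horizon + 1, n_states))            # V[horizon] = 0 (terminal value)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in range(horizon - 1, -1, -1):             # backward in time
        for s in range(n_states):
            q_values = np.empty(n_actions)
            for a in range(n_actions):
                w = reward[s, a, :] + V[t + 1]       # w(s') = r(s, a, s') + V_{t+1}(s')
                q_values[a] = inner_min(s, a, w)     # adversary: worst-case expectation of w
            policy[t, s] = int(np.argmax(q_values))  # decision maker: maximize the worst case
            V[t, s] = q_values[policy[t, s]]
    return V, policy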
3. Infinite-horizon Robust MDP
⑴ Overview: Under discount factor λ < 1 and rectangular uncertainty, the robust Bellman operator is also contractive → value/policy iteration converges.
⑵ Bellman equation (DP)
⑶ Banach Fixed Point Theorem
① Corollary: Taking the inf (adversary) and sup (decision maker) inside the operator preserves the contraction, so the conclusion of the Banach Fixed Point Theorem still holds (key result). Note that (b) is an NP-complete problem.
② Algorithm
○ Contractivity gives an error bound from the current residual: ‖Ṽ - V*‖ ≤ (λ / (1 - λ)) ‖Ṽ - V‖.
○ ‖Ṽ - V‖ is the error (residual) of one update step V → Ṽ.
○ Therefore, stopping when ‖Ṽ - V‖ ≤ ϵ(1 - λ) / (2λ) ensures ‖Ṽ - V*‖ ≤ ϵ / 2.
○ The paper’s algorithm adds an extra 1/2 multiplier and stops when that condition is met.
○ This stricter threshold ensures that even after the update V ← Ṽ the error remains small enough (extra margin via the triangle inequality), so the greedy policy selected in the next step also satisfies the usual performance-loss bound (typically (2λ / (1 - λ)) ‖V - V*‖); a conservative, safety-margin setting.
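A minimal sketch of robust value iteration with the residual-based stopping rule above; it reuses the inner_min(s, a, w) callback interface from the finite-horizon sketch and does not reproduce the paper's extra safety-margin factor.
import numpy as np

def robust_value_iteration(n_states, n_actions, reward, inner_min, lam, eps, max_iter=10_000):
    # lam: discount factor (0 < lam < 1); inner_min(s, a, w) = inf_{p in P(s,a)} p . w
    V = np.zeros(n_states)
    tol = eps * (1.0 - lam) / (2.0 * lam)            # residual threshold derived above
    for _ in range(max_iter):
        V_new = np.array([
            max(inner_min(s, a, reward[s, a, :] + lam * V) for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) <= tol:         # then ||V_new - V*|| <= eps / 2
            return V_new
        V = V_new
    return V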
⑷ Asymmetry between Decision Maker and Adversary
① If only stationary policies are considered, then under rectangularity, a discounted infinite horizon, and convex/compact 𝒫(s, a), even a dynamic adversary that may change its choice on each visit has a worst-case response to a stationary policy that is itself stationary (the same p for each (s, a)), yielding the same value.
② The dynamic adversary appears stronger, but under stationary policies, the actual worst-case response is static, giving equal value.
③ However, when the adversary is static, the decision maker’s stationary policy may not be optimal. A better option might be a non-stationary/dynamic (universal) policy that adjusts weights over time/history (Cover, 1991).
4. Applications
⑴ Overview
① The overall value problem is hard to solve by linear programming because of the non-convexity induced by the inner inf_p.
② The proposed approach handles the outer structure via DP (value/policy iteration) and solves each inner minimization as a fast convex problem.
⑵ Application 1. KL Confidence
① Overview: The inner problem reduces to a one-dimensional convex search (exponential tilting form solution).
② Data → KL divergence: From the observed transition frequencies at (s, a), build the MLE p̂_{sa}, and via asymptotic statistics (the chi-square approximation to the likelihood-ratio statistic), define the confidence region at level ω as 𝒫(s, a) = {p : D(p ‖ p̂_{sa}) ≤ t_ω}.
③ Closed-form solution of the robust expectation + a 1-D convex problem: Solving the Lagrangian of min_{p∈𝒫} 𝔼_p[v] gives f(γ) = γt + γ log(𝔼_q[exp(-v / γ)]), a 1-D convex function of γ, with optimal distribution p(s) ∝ q(s) exp((μ - v(s)) / γ), i.e., exponential tilting of the nominal q. So once γ is found, p is obtained directly.
④ Computational complexity guarantee: f′(γ) is monotone (f is convex) → bisection finds an ϵ-approximation efficiently (each evaluation of f′(γ) costs 𝒪(|S|)).
⑤ Conclusion: Modeling transition uncertainty in robust MDP with KL confidence sets yields a closed-form and fast 1D convex optimization (exponential tilting), making computation highly practical.
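A minimal sketch of this inner KL problem, assuming numpy/scipy. The function name robust_expectation_kl is mine, the radius t is taken as a natural-log KL radius, and a bounded 1-D minimization stands in for the paper's bisection on f′(γ); it could serve as the inner_min routine in the earlier sketches.
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize_scalar

def robust_expectation_kl(v, q, t, gamma_max=1e3):
    # Worst-case expectation inf_{p : D(p||q) <= t} E_p[v] via the 1-D dual
    #   inf_p E_p[v] = - min_{gamma > 0} [ gamma*t + gamma*log E_q[exp(-v/gamma)] ].
    # Assumes the nominal q > 0 componentwise and t is a natural-log KL radius.
    v, q = np.asarray(v, dtype=float), np.asarray(q, dtype=float)
    log_q = np.log(q)
    def f(gamma):
        # f(gamma) = gamma*t + gamma*log E_q[exp(-v/gamma)], via a stable log-sum-exp
        return gamma * t + gamma * logsumexp(-v / gamma + log_q)
    res = minimize_scalar(f, bounds=(1e-8, gamma_max), method="bounded")
    gamma = res.x
    log_p = -v / gamma + log_q                   # exponential tilting of q by the value vector
    p = np.exp(log_p - logsumexp(log_p))         # normalize: p(s) ∝ q(s) exp(-v(s)/gamma)
    return -res.fun, p                           # (robust value, worst-case distribution)

# Hypothetical usage: v = next-state values, q = MLE transition estimate, t = KL radius
worst_value, worst_p = robust_expectation_kl(np.array([1.0, 2.0, 3.0]),
                                             np.array([0.2, 0.5, 0.3]), t=0.05)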
⑶ Application 2. L2 Approximation (χ2-type) Sets
① Overview: Once sorted, thresholding enables fast computation → makes computationally heavy robust MDPs practically solvable.
To be updated later.
⑷ Application 3. L1 Approximation Sets
① By Pinsker's inequality, D(p ‖ q) ≥ (1 / (2 ln 2)) ‖p - q‖₁², so the set 𝒫 = {p : ‖p - q‖₁ ≤ δ} with δ = √(2t ln 2) is used (an outer approximation of the KL ball).
② L1 / L∞ type sets are convenient for modeling but weak as statistical confidence regions → in practice, L2-based sets are more recommended.
To be updated later.
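For reference, a minimal sketch of the inner minimization over such an L1 ball. This is the standard greedy construction for a total-variation ball (shift mass from the highest-value states onto the lowest-value state), not necessarily the paper's algorithm, and the function name is mine.
import numpy as np

def inner_min_l1(v, q, delta):
    # inf_p p.v  subject to  ||p - q||_1 <= delta  and  p in the probability simplex.
    v, q = np.asarray(v, dtype=float), np.asarray(q, dtype=float)
    p = q.copy()
    s_min = int(np.argmin(v))                       # cheapest next state for the adversary
    budget = min(delta / 2.0, 1.0 - q[s_min])       # total mass we are allowed to move
    for s in np.argsort(v)[::-1]:                   # take mass from the largest v first
        if s == s_min or budget <= 0.0:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[s_min] += moved
        budget -= moved
    return float(p @ v), p                          # (worst-case expectation, worst-case p)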
5. Appendix
⑴ Found several typos as follows:
Input: 2025.10.27 20:40