Chapter 9. Stochastic Control Theory
Recommended post: 【Control Theory】 Table of Contents for Control Theory
1. Sigma-algebra
⑵ Sigma-algebra (σ-algebra)
2. Terminology of Stochastic Control Theory
⑴ Variable definitions
① State (system state) xt: denotes a specific value or a random variable; same below
② Observation yt: in the perfect observation case, yt = xt
③ Noise (system noise), Disturbance, Error, Primitive random variable wt, vt
④ System state noise wt
⑤ Observation noise vt
⑥ Primitive random seed creating stochastic uncertainty x0
⑦ Control ut
⑧ Control strategy / Law / Policy gt
⑨ System state sequence xt+1 := ft(xt, ut, wt)
⑩ Observation sequence yt := ht(xt, vt)
⑪ Control input, Action ut := gt(y0:t, u0:t-1). Using all past information is called perfect recall.
⑵ Classification by system sequence
① DDS (deterministic dynamical system): xt+1 := ft(xt, ut, wt) = ft(xt, ut), yt := ht(xt, vt) = ht(xt). At any time t, both the state variable xt and the output variable yt are determined exactly once x0 and u0:t-1 are known.
② SDS (stochastic dynamical system): xt+1 := ft(xt, ut, wt), yt := ht(xt, vt), with wt, vt ≢ 0.
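To make the notation concrete, here is a minimal simulation sketch of an SDS under a simple feedback law, assuming a hypothetical scalar system (the coefficients, noise scales, and the policy ut = -0.5·yt are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar SDS: x_{t+1} = a*x_t + b*u_t + w_t, y_t = x_t + v_t
a, b, T = 0.9, 1.0, 20

x = rng.normal()                 # primitive random variable x_0
for t in range(T):
    w = rng.normal(scale=0.1)    # system noise w_t
    v = rng.normal(scale=0.1)    # observation noise v_t
    y = x + v                    # observation y_t = h_t(x_t, v_t)
    u = -0.5 * y                 # feedback policy u_t = g_t(y_t)
    x = a * x + b * u + w        # state update x_{t+1} = f_t(x_t, u_t, w_t)

print("state after", T, "steps:", x)
```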
⑶ Classification by control input
① Open loop control: ut := gt(y0:t, u0:t-1) = gt(u0:t-1).
② Feedback control: Cases where past outputs y0:t influence the control action u.
③ Centralized stochastic control: (1) Stochastic dynamical system + (2) One controller + (3) Controller with perfect recall
④ Multi-controller problem: team problem, competitive game, etc.
⑷ Classification by policy
① Decision process: a general framework for sequential decision-making problems in which state, action, and reward evolve over time as a process.
② Markov process: (regardless of whether it is a decision process) the future depends only on the current state
○ Markov chain: among Markov processes, it refers to those with a finite or countably infinite state space.
○ Controlled Markov chain: Markov chain + Decision process
③ MDP (Markov decision process): among decision processes, it refers to cases where the future depends only on the current state.
○ Dynamic programming: a recurrence relation that breaks the dependence across time. If MDP refers to the system framework, dynamic programming refers to the solution methodology.
○ POMDP (partially observed Markov decision process): an MDP system where only partial information rather than full state information can be used.
○ Constrained MDP, Constrained POMDP also exist.
④ Gaussian process: the state process {Xt} is such that any finite subset follows a joint Gaussian distribution.
⑤ Gaussian-Markov process
○ Condition 1. {Xt} is a Gaussian process.
○ Condition 2. Markov property: P(Xn+1 ∈ A ㅣ X0, ···, Xn) = P(Xn+1 ∈ A ㅣ Xn)
3. Laws of Stochastic Control Theory
⑴ Lemma 1. In open-loop control, xt is a function of x0, u0:t-1, w0:t-1, and yt is a function of x0, u0:t-1, w0:t-1, vt.
⑵ Lemma 2. Open-loop system vs. Feedback system
① Open loop control: ut := gt(y0:t, u0:t-1) = gt(u0:t-1).
② Feedback control: Cases where past outputs y0:t influence the control action u.
③ Under DDS, open-loop and feedback systems are equivalent.
○ Proof for open-loop → feedback: Given an open-loop control input sequence u, define a feedback policy that ignores the observations and simply returns the predetermined ut at each time t. From the initial state x0 it produces the same trajectory and cost. Thus, regardless of DDS/SDS, for any open-loop policy there exists a feedback policy that is equivalent at that initial state; that is, open-loop ⊂ feedback always holds.
○ Proof for feedback → open-loop: In a DDS, the trajectory generated by a feedback policy from x0 is uniquely determined (determinism), so the realized input sequence u is also determined in advance. Pre-specifying that same sequence as an open-loop policy yields the identical state evolution and hence the identical cost.
④ Under SDS, open-loop and feedback systems are not equivalent.
○ Counterexample 1.
○ In the above counterexample, the feedback system outperforms the open-loop system (i.e., it generates the lower cost).
⑶ Lemma 3. Policy independence: If Wt is independent of X0:t-1, U0:t-1, then ℙ(xt+1g ∈ A ㅣ x0:t, u0:t) = ℙ(xt+1g ∈ A ㅣ xt, ut) = ℙ(ft(xt, ut, wt) ∈ A ㅣ xt, ut) (Markov property), so dependence on policy g disappears.
① In a DDS, knowing the current state (and input) determines the next state immediately; in an SDS the next state is random, so conditional probabilities given the available history become important.
② That is, when wt is independent, system evolution follows natural laws + pure noise, so the policy is irrelevant; but if wt depends on the policy, the policy changes the noise distribution, so the future state distribution depends on the policy.
③ Philosophy: Philosophically, “policy independence” implies that diversified judgments based on individual value assessments are impossible, and that choices become constrained by factual determinations.
⑷ Lemma 4. Gaussian process (GP)
① Definition: the state process {Xt} is such that any finite subset of it follows a joint Gaussian distribution.
② 4-1. Even if each Xi is Gaussian, it does not imply {Xi}i∈ℕ is a GP.
○ Example: Let X1 ~ 𝒩(0, 1) and X2 = X1 I{ㅣX1ㅣ ≤ k} + (-X1) I{ㅣX1ㅣ > k}. Each of X1, X2 is Gaussian, but Y = (X1 + X2) / 2 = X1 I{ㅣX1ㅣ ≤ k} has an atom at 0 and hence is not Gaussian, so {X1, X2} is not a GP.
③ 4-2. For Xt+1 = AXt + BUt + GWt, X0 ~ 𝒩(0, ∑0), Wt ~ 𝒩(0, Q), {Xt} is a GP.
④ 4-3. Under a feedback policy, {Xt} is generally not a GP.
○ Example: If Ut := gt(Yt) = gt(Xt) = Xt2, then X1 = AX0 + BX02 + GW0, which is not Gaussian.
○ On the other hand, in a linear Gaussian SDS, for a general open-loop policy, the state process {Xt} is always Gaussian.
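A small sketch of the open-loop case in 4-3: for a hypothetical linear Gaussian system it propagates the mean and covariance of Xt exactly and checks them against Monte Carlo samples (the matrices A, B, G, Q, ∑0 and the input sequence are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear Gaussian system X_{t+1} = A X_t + B u_t + G W_t (open-loop u_t)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
G = np.eye(2)
Q = 0.05 * np.eye(2)            # W_t ~ N(0, Q)
Sigma0 = 0.1 * np.eye(2)        # X_0 ~ N(0, Sigma0)
u_seq = [np.array([0.1]) for _ in range(10)]   # fixed open-loop inputs

# Exact propagation: mean_{t+1} = A mean_t + B u_t, Sigma_{t+1} = A Sigma_t A^T + G Q G^T
mean, Sigma = np.zeros(2), Sigma0.copy()
for u in u_seq:
    mean = A @ mean + B @ u
    Sigma = A @ Sigma @ A.T + G @ Q @ G.T

# Monte Carlo check that the samples match the Gaussian with this mean/covariance
N = 20000
X = rng.multivariate_normal(np.zeros(2), Sigma0, size=N)
for u in u_seq:
    W = rng.multivariate_normal(np.zeros(2), Q, size=N)
    X = X @ A.T + (B @ u) + W @ G.T
print("analytic mean :", mean, "  empirical mean:", X.mean(axis=0))
print("analytic cov  :\n", Sigma)
print("empirical cov :\n", np.cov(X.T))
```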
⑤ (Note) MMSE (minimum mean-square estimator)
⑥ (Note) Orthogonality principle
⑦ (Note) LMMSE (linear minimum mean-square estimator)
⑧ If X and Y are jointly Gaussian, then LMMSE = MMSE holds.
⑸ Lemma 5. Multi-step prediction
① In general, ℙ(xt+2g ∈ A ㅣ xt, ut, ut+1) ≠ ℙ(xt+2g ∈ A ㅣ x0:t, u0:t+1)
○ Proof: Consider xt → yt → ut → xt+1 → yt+1 → ut+1 → xt+2. Since ut+1 = gt+1(y0:t+1, u0:t) depends on yt+1, and yt+1 depends on xt+1 = f(xt, ut, wt), conditioning on ut+1 carries information about wt and breaks the independence of wt from the past; here "past" means x0:t-1, u0:t-1.
○ Counterexample 1. In open-loop control, ut+1 = gt+1(u0:t), so ut+1 carries no information about wt, hence equality holds.
○ Counterexample 2. When wt is a constant
○ Counterexample 3. When ut is a memoryless (Markov) feedback, e.g., ut = μt(xt) under perfect observation yt = xt.
② Multi-step prediction with open-loop control
③ Chapman-Kolmogorov decomposition
⑹ Lemma 6. Linear Gaussian state-space model
① (Note) Gaussian-Markov process
○ Condition 1. {Xt} is a Gaussian process.
○ Condition 2. Markov property: P(Xn+1 ∈ A ㅣ X0, ···, Xn) = P(Xn+1 ∈ A ㅣ Xn)
② System definition
○ Markov property: applies even with feedback policy.
○ Multi-step Markov property
○ Mean propagation
○ Cross-covariance Cov(Xt+m, Xt)
○ Covariance propagation
③ DALE (discrete-time algebraic Lyapunov equation)
○ If the absolute values of all eigenvalues (including complex ones) of a square matrix A are less than 1, the matrix is called stable, because limk→∞ Ak = 0.
○ If A is stable, then ∑∞ = limt→∞ ∑t = limt→∞ 𝔼[(Xt - 𝔼[Xt])(Xt - 𝔼[Xt])ᵀ] exists uniquely.
○ Proof of uniqueness of ∑∞
○ Remark 1. Stability of A is a sufficient, but not necessary condition.
○ ∑∞ may still exist uniquely even if A is not stable.
○ A trivial example is given by ∑0 = 0, Q = 0, in which case ∑k ≡ 0 independent of A. (No noise in the first place.)
○ Remark 2. ∑∞ may not be strictly positive definite.
○ A trivial example is A = O, rank(GQGT) < n. (Noise does not touch all directions in the state.)
○ Remark 3. If the input disturbance wk affects all components of the state vector, then stability of A becomes necessary for the convergence of ∑k, and the limiting covariance ∑∞ is positive definite → this is where the concept of reachability enters (a numerical check of the DALE follows below).
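The DALE can be checked numerically. The sketch below iterates the covariance propagation ∑t+1 = A∑tAᵀ + GQGᵀ for a hypothetical stable A and compares the limit with SciPy's discrete Lyapunov solver (using scipy.linalg.solve_discrete_lyapunov is an implementation choice, not from the text):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical stable A (spectral radius < 1) and noise covariance
A = np.array([[0.5, 0.2], [0.0, 0.7]])
G = np.eye(2)
Q = 0.1 * np.eye(2)
GQGt = G @ Q @ G.T

# Iterate the covariance propagation Sigma_{t+1} = A Sigma_t A^T + G Q G^T
Sigma = np.zeros((2, 2))
for _ in range(500):
    Sigma = A @ Sigma @ A.T + GQGt

# Compare with the DALE fixed point Sigma_inf = A Sigma_inf A^T + G Q G^T
Sigma_inf = solve_discrete_lyapunov(A, GQGt)
print(np.allclose(Sigma, Sigma_inf, atol=1e-10))  # True when A is stable
```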
④ Reachability
○ Definition: Related to controllability and observability.
○ Theorem 1. The following are all equivalent: assuming w ∈ ℝs
○ In condition 3, the noise sequence w should be interpreted as a control input applied to the system; driven by it, the system can be steered from 0 to any given state x over n time steps.
○ Theorem 2. Lyapunov stability test
○ Note that in condition 2, it is PD (positive definite), not PSD (positive semidefinite).
⑺ Lemma 7. Graph Theory
① Strongly connected (= irreducible, communicating): a condition where from any node i in the graph one can reach any other node j.
Figure 1. Example of “irreducible”
Figure 2. Example of “reducible” (state 3 is a sink)
② Period: the period of a specific node i is the greatest common divisor of the lengths of all paths from i back to i
○ Example: two nodes A, B connected by the two edges A → B and B → A; the period of each node is 2.
○ A transition matrix with period m can be brought into the following block form by a state rearrangement QᵀPQ for a suitable permutation matrix Q.
Figure 3. Example of a transition matrix with period m
The blocks cycle as S1 → S2 → ··· → Sm → S1 → ···.
③ Aperiodic: all nodes have period 1.
○ Aperiodic ⊂ Irreducible
○ Example: if every node has a self-loop (a walk of length 1 back to itself), the chain is aperiodic.
④ Stationary (time-homogeneous): if the transition probability Pr(xn ㅣ xn-1) does not depend on n, the Markov process is called time-homogeneous (time-invariant).
⑤ Regular
○ Regular ⊂ Irreducible
○ For some natural number k, every entry of the power Mk of the transition matrix M is positive (i.e., nonzero).
⑥ Transition matrix
⑦ Markov policy: ut = gt(xt)
⑧ One can prove the second law of thermodynamics (law of increasing entropy) using a Markov process.
○ Because one can simulate the law of diffusion: provided a uniform stationary distribution is assumed.
○ Related concept: random walk
⑨ Perron-Frobenius theorem
○ Theorem 1. If a Markov chain with transition matrix P is strongly connected, there exists exactly one stationary distribution q.
○ The stationary distribution satisfies Pq = q.
○ Theorem 2. If a finite Markov chain with transition matrix P is strongly connected and aperiodic, it is called an Ergodic Markov chain and satisfies:
○ Convention here: Pij is the probability of transition from node j to node i, so each column sums to 1 (∑i Pij = 1). Note that in the other Lemmas below, Pij means the probability of transition from node i to node j.
○ 2-1. The (i,j) entry of Pk, Pij(k), converges to qi as k → ∞: note that it converges to the same value for fixed i regardless of j.
○ 2-2. Regardless of the initial state x0, the distribution of the k-th state xk converges to q as k → ∞.
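A minimal sketch of Theorems 1 and 2 for a hypothetical 3-state chain, using the column-stochastic convention Pq = q from the note above; both the eigenvector of eigenvalue 1 and power iteration recover the same stationary distribution (the matrix entries are illustrative):

```python
import numpy as np

# Hypothetical column-stochastic transition matrix: P[i, j] = P(j -> i), columns sum to 1
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])

# Method 1: eigenvector for eigenvalue 1 (Perron-Frobenius), normalized to sum to 1
vals, vecs = np.linalg.eig(P)
q = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
q = q / q.sum()

# Method 2: power iteration P^k x0 -> q for any initial distribution x0
x = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    x = P @ x

print(q, x)   # both approximate the unique stationary distribution
```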
⑻ Lemma 8. Value function following deterministic Markov property
① Expected cost and transition probability
② Recursive and backward iteration: Dynamic programming
○ JTg ∈ ℝ1×1
○ π0 ∈ ℝ1×n: initial distribution of the Markov chain
○ V0g ∈ ℝn×1: vector of state-wise value functions collecting expected cumulative cost at each state under policy g
○ When T = ∞, Jg generally becomes infinite, so the optimal policy g cannot be found by comparing costs directly; hence the (discounted) Bellman equation and the Cesàro limit are introduced.
③ Bellman equation: The below describes the discounted cost problem primarily.
○ (Note) Time-homogeneous: {xtg}t≥0 and {xtg}t≥τ,∀τ∈ℤ+ follow the same distribution. Also means strictly stationary.
○ Condition 1. Time-homogeneous transition: Pt(j ㅣ i, u) = P(j ㅣ i, u) ∀t
○ Condition 2. Time-homogeneous cost: Ct(x, y) = C(x, y) ∀t
○ Condition 3. Stationary policy: gt = g ∀t
○ If all the above hold, one can obtain the following fixed-point equation.
○ Jg: The present value of the cost; generally used in an economic context.
○ Since the eigenvalues of the stochastic matrix Pg have absolute value at most 1 and β ∈ (0, 1), every eigenvalue of βPg has absolute value less than 1, so det(I - βPg) = βn det((1/β)I - Pg) ≠ 0 and (I - βPg) is invertible.
○ Vg ∈ ℝn×1: vector of state-wise value functions collecting expected discounted cumulative cost at each state under policy g
○ Pg ∈ ℝn×n: transition matrix; the (i,j) entry is the probability of transition from i to j.
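Under the assumptions above, policy evaluation reduces to one linear solve, Vg = (I - βPg)⁻¹cg. A minimal sketch with a hypothetical row-stochastic Pg and per-state cost cg (illustrative numbers):

```python
import numpy as np

beta = 0.95
# Hypothetical row-stochastic P^g ((i, j) entry = transition probability i -> j) and per-state cost c^g
Pg = np.array([[0.8, 0.2, 0.0],
               [0.1, 0.7, 0.2],
               [0.0, 0.3, 0.7]])
cg = np.array([1.0, 0.5, 2.0])

# Fixed point of V = c^g + beta * P^g V  =>  V^g = (I - beta P^g)^{-1} c^g
Vg = np.linalg.solve(np.eye(3) - beta * Pg, cg)

# Total discounted cost from the initial distribution pi_0: J^g = pi_0 V^g
pi0 = np.array([1.0, 0.0, 0.0])
print(Vg, pi0 @ Vg)
```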
④ Cesàro limit: related to the long-term average cost problem.
⑤ Poisson equation: related to average cost.
○ Jg is unique.
○ Lg: relative value function. Lg is not unique (∵ Lg + α1 ∀α ∈ ℝ is also a solution to the Poisson equation)
○ Existence of solutions
⑼ Lemma 9. When not irreducible
① If Pg is not irreducible, the state space S splits into the transient states T and one or more recurrent communicating classes C1, ···
② Transient state: visited only finitely many times. Eventually the process leaves the transient states and enters a recurrent state.
○ Theorem: The stationary distribution πg of a finite-state Markov chain assigns probability 0 to all transient states.
○ (i) Proof: Pigeonhole principle. The finiteness is critical (and needs to be used).
○ Suppose some transient state i satisfies Qi = 0, where Qi is the probability of ever leaving the transient set (i.e., entering a recurrent class) starting from i. Since a recurrent communicating class is a closed set, no communication occurs with outside nodes once the process enters it; the assumption therefore means that, starting from i, the chain visits only transient states over any K transitions. By the pigeonhole principle, among the K+1 visited states (counting the start) at least one transient state is visited twice, yielding a directed cycle contained entirely in the transient set. Taking K large enough, the states of such a cycle that the chain can traverse forever contain a closed communicating class inside the transient set. In a finite Markov chain every closed communicating class is recurrent, so some supposedly transient states would be recurrent, a contradiction. Hence the assumption Qi = 0 is false, Qi > 0 for every transient state i, and consequently the stationary distribution assigns probability 0 to the transient states.
○ (ii) Proof
③ Recurrent state: since a recurrent communicating class is a closed set, no communication occurs with outside nodes.
○ i → j: means there exists a path with positive probability from i to j.
○ i ↔︎ j: means i → j and j → i; i and j communicate.
○ Positive recurrent: the mean return time to that state is finite.
○ An irreducible chain whose states are positive recurrent has a unique stationary distribution.
○ Null recurrent: the mean return time to that state is infinite. No stationary distribution exists.
○ Example: Xn+1 = Xn + ξn, X0 = 0, ℙ(ξn = +1) = ℙ(ξn = −1) = 0.5 → The probability of going back to the origin is 1 but the expected time is ∞.
○ Absorbing state: a state that, once entered, you remain in forever
④ Example 1. Stationary state set F
⑤ Example 2. Finite state space
Let S = {0, 1, ···, I}. Since Vg(0) = 0 and C(0, g(0)) = 0, we can focus only on Ś = S \ {0} = {1, ···, I}, the non-absorbing states. Let Ṽg be the value vector for states in Ś, and let Rg be the submatrix of Pg for transitions among states inside Ś. (i.e., the matrix that describes how the chain moves only among the non-absorbing states before hitting 0.) Then the system of equations for these states is Ṽg = c̃ + RgṼg. To show uniqueness of Ṽg, suppose there are two solutions Ṽ1g, Ṽ2g. Let their difference be Ug = Ṽ1g - Ṽ2g; subtracting the two equations yields Ug = RgUg = ⋯ = (Rg)nUg = ⋯ = 0 (∵ limn→∞ (Rg)n = 0, method of infinite descent) ⇔ Ṽ1g = Ṽ2g. Thus, in a finite state space where state 0 is absorbing and all other states can reach 0, the first-passage-time cost equation has a unique nonnegative solution.
⑥ Example 3. Countably infinite state space
Figure 4. Countably infinite state space diagram
Uniqueness is not trivial. Assuming the solution is bounded often allows one to show uniqueness. Consider the equation for the difference of two solutions Ug = RgUg. Connecting this to the diagram yields Ug(ℓ+1) - Ug(ℓ) = (λ - 1)(Ug(ℓ) - Ug(ℓ-1)). The consecutive differences Δ(ℓ) = Ug(ℓ+1) - Ug(ℓ) form a geometric sequence with ratio (λ - 1). If |λ - 1| < 1, these differences converge to 0, suggesting a bounded solution. If |λ - 1| ≥ 1, the differences may diverge, implying that uniqueness may fail.
⑽ Lemma 10. Martingale
① Doob’s theorem
○ σ(X1, X2, ···, Xn): the smallest σ-algebra that makes X1, X2, ···, Xn measurable.
○ Doob’s theorem (Doob-Dynkin lemma): a random variable is σ(X1, X2, ···, Xn)-measurable if and only if it can be written as g(X1, X2, ···, Xn) for some measurable function g; in this sense σ(X1, ···, Xn) corresponds to the collection of all such functions.
○ The larger the σ-algebra, the more functions are measurable with respect to it; i.e., the more information it contains.
② Filtration
○ A collection of σ-algebras ordered increasingly by inclusion.
○ Ordered by ⊆; if ℱ1 ⊆ ℱ2, then ℱ2 corresponds to a later time (more information) than ℱ1.
○ For convenience, let time index t = 0, 1, 2, ⋯; then the filtration is {ℱt}t∈ℤ+ and satisfies ℱs ⊆ ℱt for all s ≤ t.
○ Intuition: represents situations where information increases as observations accumulate over time.
③ Martingale
○ Property of conditional expectation
○ For any random variable Y, 𝔼[Y ㅣ X1, ···, Xn] = 𝔼[Y ㅣ σ(X1, ···, Xn)] holds.
○ Reason: because σ(X1, ···, Xn) is equivalent to the set of all functions generated by X1, ···, Xn.
○ Additionally, when σ(Y) ⊂ σ(Z), the tower property 𝔼[𝔼[X ㅣ Z] ㅣ Y] = 𝔼[𝔼[X ㅣ Y] ㅣ Z] = 𝔼[X ㅣ Y] holds.
○ Martingale: a stochastic process {Xt}t∈ℤ+ adapted to a filtration {ℱt}t∈ℤ+ that satisfies all of the following
○ Condition 1. Xt is ℱt-measurable for all t ∈ ℤ+.
○ If s ≤ t ≤ s′ and ℱs ⊆ ℱt ⊆ ℱs′, then Xt (which is ℱt-measurable) is ℱs′-measurable but in general not ℱs-measurable (insufficient information).
○ Condition 2. 𝔼[ㅣXtㅣ] is finite for all t ∈ ℤ+.
○ Condition 3. 𝔼[Xt ㅣ ℱs] = Xs, almost surely for all s ≤ t and all t ∈ ℤ+
○ Interpretation: Given only the information up to time s (ℱs), the optimal prediction of Xt equals Xs (i.e., the prediction is constrained to Xs; 𝔼[Xt ㅣ ℱs] is an orthogonal projection of Xt into the ℱs-measurable random variable space (‘best prediction’)).
○ Remark: The martingale property is needed only when predicting the future from the past. In particular, for s > t we have 𝔼[Xt ㅣ ℱs] = Xt regardless of whether Xt is a martingale (assuming integrability).
○ For s < t, 𝔼[Xs ㅣ ℱt] = Xs also holds, because Xs is ℱs-measurable and ℱs ⊆ ℱt, so ℱt already contains all the information about Xs.
○ Note: an i.i.d. process is generally not a martingale (except for the constant process).
○ Application: 𝔼[Ug(Xtg) ㅣ Xt-1g] = Ug(Xt-1g)
④ Martingale and stochastic control theory
⑾ Lemma 11. (Fully observed) ― Optimal Policy
① Problem definition: cost-to-go function under perfect observation. Since the control inputs U0, …, Ut are measurable with respect to σ(X0, …, Xt), the following holds:
② When the Markov property holds, the Bellman equation is obtained. Here, Jtg and VtgM(Xt) are cost-to-go functions from t onward.
③ Markovization theorem (Markov policy sufficiency, reduction to Markov policy)
○ Theorem: In a finite-horizon MDP, for any general (possibly history-dependent and randomized) policy g, there exists a behavioral Markov policy gM such that, under the same initial distribution μ, the joint distributions of (Xt, Ut) for all t = 0, …, T−1 and of XT are identical. Consequently, the performance (cost) Jg = 𝔼g[∑t=0 to T−1 Ct(Xt, Ut) + CT(XT)] equals JgM. Hence, without loss of optimality, one may restrict attention to randomized Markov policies.
○ Proof
④ Comparison principle
○ Theorem: By working backward from the objective and ensuring the Bellman inequality holds at each step, the initial value V0 serves as a lower bound for the performance of all possible policies. The set of actions that achieve equality at each stage collectively constitute the optimal policy. Thus, the optimality can be verified or constructed by combining locally optimal (stage-wise) choices.
○ Proof: Using mathematical induction on backward
○ Case 1. t = T: Since JTg = 𝔼g[CT(XTg) ㅣ XTg, …, X1g, X0] = CT(XTg) ≥ VT(XTg) (∵ (V1)), the induction hypothesis holds at ℓ = T.
○ Case 2. If the induction hypothesis holds for ℓ = t+1, …, T, then it also holds for ℓ = t, as verified below:
○ Corollary
⑤ Hamiltonian-Jacobi-Bellman (HJB) equation
○ Theorem: HJB is applicable to finite / countably infinite state / action spaces; it is the continuous-time version of the Bellman equation.
○ The proof of Theorem 1 was already given under the comparison principle, so the following explanations concern only Theorem 2.
○ p: A certain Markov policy gM = {gt} is optimal.
○ q: Given ∀x, t, gt(x) ∈ arg infu∈𝒰 {ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))]} (i.e., stepwise Bellman minimization is achieved)
○ Proof of sufficiency in Theorem 2 (q ⇒ p): For each stage, given xt, any policy achieving the infimum is optimal (∵ corollary). Since ut is then a measurable function of the current state xt only, the optimal policy is a Markov policy.
○ Proof of necessity in Theorem 2 (p ⇒ q): If a Markov policy is optimal, it must achieve the infimum at each stage (w.p.1); otherwise, on a set of positive probability we could construct a better policy g’, implying J(g’) < J(g), a contradiction.
○ Corollary
○ Application 1. If an optimal path from 1 to 6 is 1 → 2 → 3 → 6, then the optimal path from 2 to 6 must be 2 → 3 → 6 by the HJB equation.
○ Application 2. Policy evaluation: discussed below.
○ Application 3. Vt(x) obtained from the HJB equation is called the value function; it effectively reduces the size of the search space (a backward-induction sketch appears after this list).
○ Application 4. Q-value (state-action value function) : Qt(x, u) = Ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))], Vt(x) = infu∈𝒰 Qt(x, u)
○ Application 5. For given finite state / action space, inf = min, and the optimal policy is deterministic Markov policy.
○ Application 6. Value function at randomized Markov policy: For u ~ μt(u ㅣ i),
○ If Y is independent of the state and does not appear directly in the reward, then Y is irrelevant for decision-making: a decision rule that ignores y is always at least as good. In other words, having more information is not always better.
○ Example: Suppose a doctor has to decide on a treatment for a patient. Every morning, the doctor can observe the patient’s health status X, and in addition also knows what day of the week it is, Y. The available actions are “treat” and “do not treat.” If the day of the week is unrelated to both the expected reward (survival probability) and the health status, then making decisions based only on X yields the same expected survival probability as using both X and Y.
○ However, depending on technical assumptions such as the action space and measurability, the statement may need to be formulated in terms of ε-optimal policies, and when countability/non-countability and measurability traps are involved (e.g., the projection of a Borel set can be non-Borel), there are counterexamples in which global approximate dominance fails.
○ Conclusion: In dynamic programming problems, one can mathematically prove that memoryless strategies are sufficient to optimize the expected reward.
○ Application: In finite-horizon Markov decision problems, one can easily prove that Markov policies are optimal. (ref)
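As announced in Application 3, here is a backward-induction sketch of the finite-horizon Bellman/HJB recursion Vt(x) = minu Qt(x, u), Qt(x, u) = Ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))], written for a hypothetical finite MDP given directly by transition probabilities P(j ㅣ i, u) (the random P and C are illustrative):

```python
import numpy as np

# Hypothetical finite-horizon MDP: 3 states, 2 actions, horizon T
nS, nA, T = 3, 2, 5
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[i, u, j] = P(j | i, u)
C = rng.uniform(0.0, 1.0, size=(nS, nA))        # stage cost C_t(i, u) (time-homogeneous here)
CT = np.zeros(nS)                               # terminal cost C_T(x)

V = np.zeros((T + 1, nS))
g = np.zeros((T, nS), dtype=int)                # deterministic Markov policy g_t(i)
V[T] = CT
for t in range(T - 1, -1, -1):                  # backward induction (comparison principle)
    Q = C + P @ V[t + 1]                        # Q_t(i, u) = C(i, u) + sum_j P(j|i,u) V_{t+1}(j)
    g[t] = Q.argmin(axis=1)                     # stepwise Bellman minimization (HJB)
    V[t] = Q.min(axis=1)

print(V[0])     # optimal cost-to-go from each initial state
print(g[0])     # optimal first-stage actions
```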
⑿ Lemma 12. (Partially observed) ― Information state
① An information state is a process {zt}t=0,···,T satisfying the following in the partial-observation setting
○ Background: The history Ht := {y0, …, yt, u0, …, ut-1} grows with time, and its domain grows exponentially (curse of dimensionality).
○ Condition 1. Compression: zt = ℓt(Ht) ∀t
○ Condition 2. Policy-independent recursive update: zt+1 = 𝒯t(zt, yt+1, ut) (current information state zt, new information yt+1 and ut). That is, zt can be updated recursively from the current information state and the new information, without direct reference to the entire past history and without the update map 𝒯t depending on the policy.
○ Condition 3. Independence relative to g0:t-1: ∀t = 0, …, T-1
② Candidate 1. zt = Ht := {y0, …, yt, u0, …, ut-1}
○ π0(i) is as follows:
③ Candidate 2. Belief state: zt = πt s.t. πt(i) := ℙ(Xt = i ㅣ Ht) = ℙ(Xt = i ㅣ y0:t, u0:t-1) ∀i ∈ S
○ Condition 1 is satisfied: zt = ℓt(Ht) in time (compression)
○ Condition 2 is satisfied: we can obtain 𝒯t satisfying πt+1 = 𝒯t(πt, yt+1, ut) directly using Bayes’ rule. This is the update equation for the belief state, often called a nonlinear filter (see the sketch after this list).
○ Condition 3 is satisfied: using backward mathematical induction.
○ Case 1. t = T-1: All terms do not depend on g0:T-1.
○ Case 2. The backward induction can be established as follows:
○ Bellman equation of belief-MDP: The key message of this theorem is that, although the overall policy space 𝒢 of an MDP consists of all mappings from histories to actions and is therefore extremely large, the belief Πt is sufficient, so we lose no optimality by restricting attention to separated policies of the form “belief → action.”
○ Proof
○ Application 1. (One-way) Separation theorem
○ Application 2. Cost function in belief space
○ Application 3. The following is strategy-dependent, unlike the belief state, because it depends on gt-1.
○ Application 4. The following is strategy-dependent, unlike the belief state: if ut is not included as a condition, then πt+1 = 𝔼ut ~ p(· ㅣ Y0:t)[𝒯t(πt, yt+1, ut)], so the information-state transition is affected by the policy.
○ Application 5. Generally, Pg(Xt+1 ∈ A ㅣ Ht, ut) ≠ Pg(Xt+1 ∈ A ㅣ yt, ut), Ht = (Y0:t, U0:t-1)
○ Proof: Suppose t = 1 and consider x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. Given H1 = (y0:1, u0), we can infer the distribution of x0 via y0 → x0, and hence a sharper distribution of x1 via both y1 → x1 and (x0, u0) → x1; the distribution of x2 then follows via (x1, u1, w1). On the right-hand side, only y1 → x1 and the given u1 are available to determine the distribution of x1, leading to a less informative distribution of x2. Hence the two sides differ.
○ Application 6. Pg(Xt+1 ∈ A ㅣ Ht, ut) = Pg(Xt+1 ∈ A ㅣ y0:t, ut)
○ Proof: Consider x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. Given the policy g, on the right-hand side we can determine the distributions of u0 and x0 via y0 → u0 and y0 → x0, respectively, and hence the distribution of x1 via (x0, u0) → x1. By repeatedly applying ut = gt(y0:t, u0:t-1) and xt+1 = ft(xt, ut, wt), we can determine the distributions of all the variables, and in particular the distribution of xt+1. This is identical to the distribution obtained from the left-hand side.
○ Conclusion: The information state Ztg := (Y0:tg, U0:t-1g) is a function of Y0:tg.
○ Application 7. P(Xt+1 ∈ A ㅣ Ht, ut) ≠ P(Xt+1 ∈ A ㅣ y0:t, ut)
○ Proof: Let’s think of x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. In the right-hand side, we can determine the distribution of x0 via y0 → x0 but can only determine the distribution of u0 via y0 → u0 on the probability over policy set. However, u0 is exactly given in the left-hand side. Therefore, both sides differ. We can conclude that Pg(Xt+1 ∈ A ㅣ Ht, ut) is policy-independent, while Pg(Xt+1 ∈ A ㅣ y0:t, ut) is policy-dependent.
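As referenced in the Condition 2 discussion, here is a minimal sketch of the belief-state update (nonlinear filter) πt+1 = 𝒯t(πt, yt+1, ut) via Bayes' rule, for a hypothetical finite POMDP with action-dependent transition matrices and an observation likelihood matrix (all numbers are illustrative):

```python
import numpy as np

def belief_update(pi_t, u_t, y_next, P, O):
    """Nonlinear filter pi_{t+1} = T_t(pi_t, y_{t+1}, u_t) via Bayes' rule.
    P[u][i, j] = P(X_{t+1}=j | X_t=i, u), O[j, y] = P(Y=y | X=j)."""
    pred = pi_t @ P[u_t]                 # prediction: sum_i pi_t(i) P(j | i, u_t)
    post = O[:, y_next] * pred           # correction: multiply by observation likelihood
    return post / post.sum()             # normalize (Bayes' rule denominator)

# Hypothetical 2-state, 2-action, 2-observation POMDP
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # transitions under action 0
     np.array([[0.5, 0.5], [0.5, 0.5]])]   # transitions under action 1
O = np.array([[0.8, 0.2],                  # P(y | x = 0)
              [0.3, 0.7]])                 # P(y | x = 1)

pi = np.array([0.5, 0.5])                  # initial belief pi_0
for u, y in [(0, 1), (0, 0), (1, 1)]:      # a short history of (u_t, y_{t+1}) pairs
    pi = belief_update(pi, u, y, P, O)
print(pi)                                  # belief state pi_t, i.e., the information state z_t
```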
⒀ Lemma 13. Dynamic Program
① If Vt(i) = max{r(i), a + b∑j∈S ℙ(j ㅣ i) Vt+1(j)} = max{r(i), a + b𝔼[Vt+1(j) ㅣ i]}, Vt(i) ≥ Vt+1(i) is established.
② If Vt(x) = max{-c + p(x)(1 + Vt+1(x-1)) + (1 - p(x))Vt+1(x), Vt+1(x)}, VN+1(x) = 0, the following is established:
○ Monotonicity on time: Vt(x) ≥ Vt+1(x) (Proof)
○ Monotonicity on x: Vt(x) ≥ Vt(x-1) (Proof)
○ Supremum of marginal value: 1 ≥ Vt(x) - Vt(x-1) (Proof)
○ Concavity on x is not established: Vt(x) - Vt(x-1) ≤ Vt(x-1) - Vt(x-2) (Counterexamples exist)
○ Relationship between marginal value and time: Vt(x) - Vt(x-1) ≥ Vt+1(x) - Vt+1(x-1) (Proof)
○ Existence of threshold: Gt(x) = p(x)(1 - Δt+1(x)) is nondecreasing on x (Proof)
③ Convex
○ Lemma 1. Given two convex functions f1 and f2, max{f1, f2} is also convex.
Figure 5. Maximum of two convex functions is convex.
○ Lemma 2. Sum of two convex functions is convex.
○ Lemma 3. If V(x) is a non-decreasing convex function, then V(max{x, a}) is also convex for every a: it can be easily understood geometrically.
○ Lemma 4. If L(π) is a convex function, then it can be written as L(π) = supi∈I {αi π + βi} with αi, βi ∈ ℝ.
○ The above comes from the inequality f(x) ≥ f(x0) + f’(x0)(x - x0); it is not a claim that the formula holds for arbitrary choices of αi and βi.
○ Example: If L(π) = π2, then π2 = supx0∈ℝ {2x0π - x02}.
○ Practical point: Using transformations that preserve convexity in the linear (affine) representation, one can show that applying the same transformation to the original convex function also preserves convexity. An example is as follows.
⒁ Lemma 14. (Stochastic policy) DCOE (discounted cost optimality equation, infinite-horizon discounted cost Bellman equation)
① Contraction mapping theorem (Banach fixed-point theorem)
○ Theorem
○ Let F be a Banach space. Here, Banach space refers to a complete normed space, and a set being “complete” means every Cauchy sequence in the set converges to a certain element in the set. Let T: F → F be a transformation satisfying the following: ㅣㅣTx - Tyㅣㅣ ≤ βㅣㅣx - yㅣㅣ, ∃β ∈ (0,1), ∀x, y ∈ F. Then, the following is established:
○ There is a unique fixed point w ∈ F satisfying Tw = w.
○ For any x ∈ F, limn→∞ Tnx = w.
○ This essentially implies the following:
○ A transformation satisfying the above is called a contraction. Strictly speaking, the modulus must be uniform: supx≠y ㅣㅣTx - Tyㅣㅣ / ㅣㅣx - yㅣㅣ ≤ β < 1.
○ T is continuous, and specifically is Lipschitz continuous.
○ Proof
○ Let x ∈ F and α = ㅣㅣx - Txㅣㅣ. Then ㅣㅣTnx - Tn+1xㅣㅣ ≤ βnα. Considering the sequence {x, Tx, T2x, ···}, we show it is Cauchy: ∀ϵ > 0, ∃Nϵ s.t. ∀n, m ≥ Nϵ, ㅣㅣTnx - Tmxㅣㅣ < ϵ. Without loss of generality, take n > m. Then we have
○ Thus we can choose N such that αβN / (1 - β) < ϵ. Since F is a Banach space, the Cauchy sequence converges to some w, i.e., limn→∞ Tnx = w. By continuity of T, T(limn→∞ Tnx) = limn→∞ Tn+1x = w, so w is a fixed point. If w1, w2 are both fixed points of T, then ㅣㅣw1 - w2ㅣㅣ = ㅣㅣTw1 - Tw2ㅣㅣ ≤ βㅣㅣw1 - w2ㅣㅣ, which forces ㅣㅣw1 - w2ㅣㅣ = 0, so w1 = w2.
○ Intuition

② Bellman operator and contraction theorem
○ Theorem
○ Let F be the set of functions F = {z: S → ℝ}, where S = {1, 2, ···, I} and z := (z(1), ···, z(I))T, equipped with the norm ㅣㅣzㅣㅣ = maxi ㅣz(i)ㅣ (i.e., ㅣㅣ·ㅣㅣ∞). Define the operator T: F → F componentwise by Tz(i) = minu∈𝒰 [C(i, u) + β∑j∈S ℙ(j ㅣ i, u) z(j)] for all i ∈ S. Then T is a contraction mapping.
○ Proof
○ Let ℝI be a Banach space with a norm of ㅣㅣ·ㅣㅣ∞. If z, y ∈ F, then we can prove the theorem as follows:
○ Even if we replace it with a peculiar definition that maximizes the cost, the contraction theorem still holds with the same proof structure.
③ Corollary
○ ∀i ∈ S, W∞(i) = minu∈𝒰 [C(i, u) + β∑j∈S ℙ(j ㅣ i, u) W∞(j)] has a unique solution W∞.
○ Relationship between DCOE and geometric distribution
○ W∞(i) = infg∈𝒢 Jg(i) = infg∈𝒢 𝔼g[∑t=0 to ∞ βtc(xt, ut) ㅣ X0 = i], ∀i ∈ S
○ Definition of Jg(i) and bound
○ Definition of W∞: By the property of contraction mapping, the decreasing sequence {Wn} converges to W∞.
○ Jg(i) ≥ W∞ (lower bound)
○ W∞(i) ≥ infg Jg(i) (upper bound): it can be represented as J(X0, π*) ≤ ∑τ=0 to t-1 βτ Cτ(Xτ, πτ*) + βt𝔼[Vt(Xt, πt)].
○ Conclusion: W∞(i) = infg Jg(i)
○ The optimal stationary Markov policy: g*(i) ∈ argminu∈𝒰 [c(i, u) + β∑j∈S ℙ(j ㅣ i, u) W∞(j)]
○ ⇔ W∞ = cg* + βPg*W∞
○ ⇔ W∞ = (I - βPg*)-1 cg*
○ Optimal operator T is defined as follows:
○ By definition, TZ ≤ TgZ
○ Tg and T are both contraction mapping on ℓ∞ norm.
○ Tg and T both satisfy the monotonicity: if z ≤ y, Tgz ≤ Tgy and Tz ≤ Ty
○ Proof 1
○ Proof 2
○ If h, g ∈ 𝒢SMP and ThW∞g ≤ W∞g, then W∞h ≤ W∞g: by monotonicity we obtain ThnW∞g ≤ ⋯ ≤ ThW∞g ≤ W∞g, and letting n → ∞ gives the conclusion.
④ Algorithm 1. Value iteration
○ A method that updates the value function by repeatedly applying the Bellman optimality operator T as Vk+1 = TVk.
○ Because of the contraction mapping property, if β < 1 then Vk → V* converges to the unique fixed point, and the greedy policy with respect to V* is optimal.
○ A typical stopping criterion is ㅣㅣVk+1 - Vkㅣㅣ∞ < ε , etc.
○ Characteristic: It has the advantage of simple computation, but the disadvantage of requiring many iterations.
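A minimal value-iteration sketch matching the description above (hypothetical random MDP; the stopping tolerance is illustrative):

```python
import numpy as np

def value_iteration(P, C, beta, eps=1e-8, max_iter=10_000):
    """V_{k+1} = T V_k with (T V)(i) = min_u [C(i,u) + beta * sum_j P(j|i,u) V(j)].
    P[i, u, j] = P(j | i, u); returns the fixed point W_inf and a greedy policy."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    for _ in range(max_iter):
        Q = C + beta * (P @ V)               # Q(i, u)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < eps:  # stopping criterion ||V_{k+1} - V_k||_inf < eps
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=1)

# Hypothetical 3-state, 2-action discounted MDP
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(3), size=(3, 2))
C = rng.uniform(0, 1, size=(3, 2))
V_star, g_star = value_iteration(P, C, beta=0.9)
print(V_star, g_star)
```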
⑤ Algorithm 2. Policy iteration
○ Step 1. Choose any stationary Markov policy gn ∈ 𝒢SMP.
○ Step 2. Policy evaluation: compute W∞gn = (I - βPgn)-1Cgn.
○ Step 3. Stopping criterion: if TW∞gn = W∞gn, stop and take gn as the optimal policy; otherwise go to Step 4.
○ Step 4. Policy improvement: define gn+1 as follows.
○ Theorem: The sequence ({g0, g1, g2, ···}) reaches an optimal policy after finitely many iterations.
○ Proof: When the stopping condition W∞gn = TW∞gn holds, gn is already an optimal policy. Otherwise, in Step 4 we have Tgn+1 W∞gn ≤ W∞gn, with strict inequality in at least one state. Since the number of possible stationary Markov policies is finite (at most ㅣUㅣ^ㅣSㅣ) and each step strictly improves the cost in at least one state, the same policy cannot be revisited. Therefore, the algorithm reaches the stopping condition in finitely many steps and yields an optimal policy.
○ Characteristic: It has the advantage of needing far fewer iterations, but at each iteration it must alternate between two operators (policy evaluation (more expensive due to inverse matrix calculation) and policy improvement).
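A minimal policy-iteration sketch matching Steps 1-4 above, for a hypothetical random MDP; Step 2 is the linear solve W = (I - βPg)⁻¹cg and Step 3 checks TW∞gn = W∞gn:

```python
import numpy as np

def policy_iteration(P, C, beta):
    """Discounted-cost policy iteration. P[i, u, j] = P(j | i, u), C[i, u] = cost."""
    nS, nA, _ = P.shape
    g = np.zeros(nS, dtype=int)                           # Step 1: arbitrary stationary Markov policy
    while True:
        Pg = P[np.arange(nS), g]                          # transition matrix under g
        cg = C[np.arange(nS), g]
        W = np.linalg.solve(np.eye(nS) - beta * Pg, cg)   # Step 2: policy evaluation
        Q = C + beta * (P @ W)
        if np.allclose(Q.min(axis=1), W):                 # Step 3: T W^g = W^g  =>  optimal
            return W, g
        g = Q.argmin(axis=1)                              # Step 4: policy improvement

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(3), size=(3, 2))
C = rng.uniform(0, 1, size=(3, 2))
W_star, g_star = policy_iteration(P, C, beta=0.9)
print(W_star, g_star)
```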
⑥ Algorithm 3. Linear programming
○ Theorem
○ Proof
○ Implementation: Lagrange multiplier method, dual optimal variables.
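One common way to realize the LP approach is the standard discounted-cost formulation: maximize ∑i V(i) subject to V(i) ≤ C(i, u) + β∑j P(j ㅣ i, u)V(j) for all (i, u). The sketch below implements that formulation (not necessarily the exact form intended in the omitted theorem) with scipy.optimize.linprog on a hypothetical MDP:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 3-state, 2-action discounted MDP
rng = np.random.default_rng(5)
nS, nA, beta = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[i, u, j] = P(j | i, u)
C = rng.uniform(0, 1, size=(nS, nA))

# LP: maximize sum_i V(i)  s.t.  V(i) - beta * sum_j P(j|i,u) V(j) <= C(i,u) for all (i, u)
A_ub, b_ub = [], []
for i in range(nS):
    for u in range(nA):
        row = -beta * P[i, u]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(C[i, u])

res = linprog(c=-np.ones(nS),                 # maximize sum V  <=>  minimize -sum V
              A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nS, method="highs")
print(res.x)                                   # optimal value function W_inf
```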
⒂ Lemma 15. (Stochastic policy) ACOE (average cost optimality equation, infinite-horizon average cost Bellman equation)
① Overview
○ For a finite state space, finite action space, and bounded cost function, define Jg as follows:
○ In an irreducible Markov chain, consider the Poisson equation Jg1 + Wg = Cg + PgWg.
○ If Wg is a solution, then Wg + α1 is also a solution for any constant α, but Wg is uniquely determined up to this additive constant.
○ Derivation of the ACOE: the relative value WN(i) - WN(j) converges to W(i) - W(j) as N → ∞.
○ Significance 1. J* is the optimal cost.
○ Significance 2. The optimal SMP g* must satisfy the ACOE.
○ Suppose the optimal policy g* is given as follows, and assume that the equality does not hold for some state i.
○ Then a contradiction arises because g* loses optimality, and hence the optimal g* must satisfy the ACOE.
○ Significance 3. We can use policy iteration.
② Algorithm 1. Policy iteration algorithm
○ Step 1. Choose an arbitrary policy g0 ∈ 𝒢SMP.
○ Step 2. Policy evaluation: Given gn, solve the Poisson equation Jgn1 + Wgn = Cgn + PgnWgn to obtain (Jgn, Wgn). Since there are I equations but I+1 unknowns (Jgn, Wgn(1), ···, Wgn(I)), fix one component, e.g. set Wgn(I) = 0, to determine the solution uniquely.
○ Step 3. Stopping criterion: If gn satisfies the ACOE, then gn is an optimal SMP. Otherwise, go to Step 4. ACOE: Jgn + Wgn(i) = minu∈𝒰 {C(i, u) + ∑j∈S P(j ㅣ i, u)Wgn(j)}, ∀i
○ Step 4. Policy improvement: Define a new policy gn+1 by gn+1 ∈ arg minu∈𝒰 {C(i, u) + ∑j∈S P(j ㅣ i, u)Wgn(j)} and return to Step 2.
○ If gn does not satisfy the stopping criterion, then Jgn+1 < Jgn holds.
③ Algorithm 2. ACOE and relative value iteration
○ Assumption: For every g ∈ 𝒢SMP, the transition matrix Pg is irreducible and aperiodic.
○ Step 1. Choose an arbitrary h0 ∈ ℝI.
○ Step 2. For any k ≥ 1,
○ ∀i ∈ S, λk(i) = minu∈𝒰 {C(i, u) + ∑j∈S P(j ㅣ i, u)hk-1(j)}
○ μk = λk(I) for some fixed reference state I
○ ∀i ∈ S, hk(i) = λk(i) - μk
○ Step 3. Check convergence; if it does not converge, return to Step 2.
○ Then μk converges to J*, and hk converges to W (the relative value function).
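A minimal sketch of Steps 1-3 of the relative value iteration above, for a hypothetical MDP whose transition probabilities are all positive (hence irreducible and aperiodic under every policy, matching the stated assumption):

```python
import numpy as np

def relative_value_iteration(P, C, ref=-1, iters=2000, tol=1e-10):
    """ACOE via relative value iteration. P[i, u, j] = P(j | i, u), C[i, u] = cost.
    Returns (mu_k -> J*, h_k -> relative value function W)."""
    nS, nA, _ = P.shape
    h = np.zeros(nS)                          # Step 1: arbitrary h_0
    mu = 0.0
    for _ in range(iters):
        lam = (C + P @ h).min(axis=1)         # Step 2: lambda_k(i) = min_u {C(i,u) + sum_j P(j|i,u) h_{k-1}(j)}
        mu = lam[ref]                         # mu_k = lambda_k(I) for a fixed reference state
        h_new = lam - mu                      # h_k = lambda_k - mu_k
        if np.max(np.abs(h_new - h)) < tol:   # Step 3: convergence check
            h = h_new
            break
        h = h_new
    return mu, h

# Hypothetical irreducible, aperiodic MDP
rng = np.random.default_rng(6)
P = rng.dirichlet(np.ones(3), size=(3, 2))
C = rng.uniform(0, 1, size=(3, 2))
print(relative_value_iteration(P, C))
```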
④ Other algorithms
○ Linear programming: If there is no max condition, J* can go down to −∞ due to the ≤ inequality (which is a stronger condition than taking the minimum).
○ Successive approximation
○ Bertsekas’ algorithm
○ Puterman’s algorithm
⑤ MDS (Martingale difference sequence)
○ Overview: For concrete sample paths ω, we want to prove that for every g, lim infN→∞ ĴNg(ω) ≥ J* almost surely (a.s.), and that limN→∞ ĴNg*(ω) = J* a.s. (assuming g* is optimal).
○ Theorem 1. Let {Xk}k∈ℕ be adapted to a filtration {ℱk}k∈ℕ and satisfy 𝔼[Xk+1 ㅣ ℱk] = 0 almost surely (such a sequence is a martingale difference sequence). Then Xk and Yk = ∑j=1 to k Xj are ℱk-measurable, and {Yk} is a martingale.
○ Theorem 2. Martingale stability theorem (LLN)
○ Theorem 3. For any g ∈ 𝒢, lim infN→∞ (1/N) ∑t=0 to N-1 C(Xtg, Utg) ≥ J* a.s., with equality for an optimal g* satisfying the ACOE.
○ Let (J*, W) be a solution of the ACOE. Define Zk+1 = C(Xk, Uk) - J* + W(Xk+1) - W(Xk) - h(Xk, Uk) and h(i, u) = C(i, u) + ∑j∈S P(j ㅣ i, u)W(j) - J* - W(i) ≥ 0. Let ℱk = σ(X0, X1g, ···, Xkg). Then, given Ukg = gk(X0, X1g, ···, Xkg), we obtain 𝔼[Zk+1 ㅣ ℱk] = 0 as follows:
○ Hence {Zk}k∈ℕ is an MDS, and each term of Zk+1 is bounded, so ㅣZk+1ㅣ is bounded by some constant M̃. Therefore,
○ holds, and by the LLN we have limN→∞ (1/N)∑k=1 to N Zk = 0 a.s. But
holds and h(·, ·) ≥ 0, so we have lim infN→∞ (1/N) ∑t=0 to N-1 C(Xtg, Utg) ≥ J* a.s.
⑥ Theorem: The DCOE problem becomes equivalent to the ACOE problem as β → 1.
○ Reason why ㅣWβ(i) - Wβ(j)ㅣ < ∞ in the above proof.
○ Theorem
○ Upper bound
○ Lower bound
○ General proof
○ Significance 1. It is not necessarily true that the optimal policy gβ* converges to the optimal ACOE policy g*.
○ Significance 2. Blackwell optimality: If the same policy gβ* is optimal for all β with β̂ < β < 1, then gβ* is also optimal for the ACOE. Then, Jβ* = ACOE optimal J* also holds.
○ Proof 1.
○ Proof 2. In the classical Blackwell argument, you compare two stationary policies π and ν by looking at their value difference as a function of the discount factor, fπ,ν(γ) = Vγπ − Vγν. In a standard finite MDP this difference is a rational function of γ, and a nonzero rational function can have only finitely many zeros. That means there are only finitely many discount factors γ at which the two policies tie; beyond those points, their ranking cannot keep flipping infinitely often as γ → 1. Hence, for all γ sufficiently close to 1, the same policy remains optimal, and this policy is also optimal for the average-reward criterion; this is the Blackwell-optimal policy.
⒃ Lemma 16. Kalman filter
① Overview
○ Case 1. Pure prediction problem (Ut ≡ 0): Kalman filter. Given Y0, ···, Yt, the problem is to predict X0, ···, Xt.
○ Case 2. Pure control problem (Yt = Xt): LQR (linear quadratic regulator)
○ Case 3. Partial observation with quadratic cost: LQG (linear quadratic Gaussian).
② Review: linear Gaussian process
③ Case 1. ut ≡ 0 or Bt = 0
○ Step 1. Prediction
○ Step 2. Observation prediction
○ Step 3. Data update
○ Significance: From the form of Lt, we see that the filter gain and covariance recursion are deterministic (data-independent) but nonlinear in the model matrices (a predict/update sketch appears at the end of this case).
○ Application: Asymptotic behavior of the Kalman filter
○ Definition: In the time-invariant case At ≡ A, Gt ≡ G, Ct ≡ C, Ht ≡ H.
○ Background theory: Observability appears when discussing the asymptotic behavior of the Kalman filter because, in order to construct a long-term stable observer, the system must be observable or, more generally, detectable. If an unobservable mode exists, the state component in that direction cannot be corrected using measurements. In particular, if such a mode is unstable (its eigenvalue lies outside the unit circle), its contribution cannot be inferred from the outputs, and the estimation error in that direction grows without bound over time. Consequently, the error covariance Pk also diverges along that direction, the Riccati recursion does not converge to a finite limit P∞, and the observer error cannot be stabilized. Therefore, to guarantee good asymptotic behavior of the Kalman filter (convergence of the error covariance, a constant steady-state Kalman gain, and stable estimation), it is essential that all unstable modes be observable, i.e., that the pair (A, C) be detectable.
○ “Observable”, “detectable”, and “reachable” are closely related notions: observability of (A, C) is dual to reachability of the transposed pair, and detectability is the weaker version of observability.
○ ARE(algebraic Riccati equation)
○ If (A,S) is reachable, then the following are equivalent: A is stable. ⇔ The equation ∑ = A∑A* + SS* has a positive definite solution ∑. ⇔ Corollary
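As mentioned in the Significance note, here is a minimal predict/update sketch of the three Kalman-filter steps above for a hypothetical time-invariant system with a detectable (A, C) pair; iterating shows the error covariance settling toward the ARE fixed point (matrices and noise levels are illustrative, and the control input is omitted as in Case 1):

```python
import numpy as np

def kalman_step(x_hat, Sigma, y, A, C, GQGt, R):
    """One cycle of the Kalman filter (a sketch under the model
    X_{t+1} = A X_t + G W_t, Y_t = C X_t + V_t, W_t ~ N(0, Q), V_t ~ N(0, R))."""
    # Step 1. Prediction
    x_pred = A @ x_hat
    S_pred = A @ Sigma @ A.T + GQGt
    # Step 2. Observation prediction
    y_pred = C @ x_pred
    # Step 3. Data update (the gain L_t depends only on the model matrices, not on the data)
    S_yy = C @ S_pred @ C.T + R
    L = S_pred @ C.T @ np.linalg.inv(S_yy)
    x_new = x_pred + L @ (y - y_pred)
    S_new = (np.eye(len(x_hat)) - L @ C) @ S_pred
    return x_new, S_new

# Hypothetical time-invariant system
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
GQGt = 0.05 * np.eye(2)
R = 0.1 * np.eye(1)
rng = np.random.default_rng(7)

x_true = rng.normal(size=2)
x_hat, Sigma = np.zeros(2), np.eye(2)
for _ in range(50):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), GQGt)
    y = C @ x_true + rng.multivariate_normal(np.zeros(1), R)
    x_hat, Sigma = kalman_step(x_hat, Sigma, y, A, C, GQGt, R)
print(x_hat, "\n", Sigma)   # Sigma approaches the ARE fixed point when (A, C) is detectable
```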
④ Case 2. Arbitrary ut = gt(Ht) = gt(y0:t, u0:t-1), and arbitrary Ct(xt, ut)
○ Result 1. (y0:tg, u0:t-1g) and y0:t generate the same σ-algebra: each is a function of the other.
○ Proof that (y0:tg, u0:t-1g) → y0:t: u0g = g0(y0g) = g0(y0 + ȳ0g) and y0 = y0g - C0x̄0g = y0g - C0𝔼[x0]. Also x̄1g = A0x̄0g + B0u0g and ȳ1g = C1x̄1g, so y1 = y1g - ȳ1g. Continuing in this way, we can construct y0:t.
○ Proof that y0:t → (y0:tg, u0:t-1g): Consider x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. ȳ0g = C0x̄0g is known, so y0g = y0 + ȳ0g is known, and hence u0g = g0(y0g) is known. Then x̄1g = A0x̄0g + B0u0g is known, and so are ȳ1g = C1x̄1g and y1g = y1 + ȳ1g. Therefore u1g = g1(y0:1g, u0g) is known. The proposition follows by continuing in this way.
○ This is related to the former proposition: Pg(Xt+1 ∈ A ㅣ Ht, ut) = Pg(Xt+1 ∈ A ㅣ y0:t, ut).
○ Result 2. πt = ℙ(xtg ㅣ y0:tg, u0:t-1g) = ℙ(xtg ㅣ y0:t)
○ Result 3. ℙ(xt ㅣ y0:t) (= Kalman filter) is enough to understand the system.
○ Remark 1. Control affects only x̄tg (the mean), not ∑tㅣt (the covariance).
○ Remark 2. Unlike a general POMDP, no learning (active exploration) is required to decrease the uncertainty.
○ Remark 3. Separation principle: Kalman filter and control can be divided.
⑤ Dynamic programming
⑥ Quadratic cost
○ Assumption: Ct(xt, ut) = xt*Ptxt + ut*Ttut, CT(xT) = xT*PTxT
○ Let X ~ 𝒩(X̄, ∑) and S be a symmetric matrix, then we have 𝔼[X*SX] = X̄*SX̄ + Tr(S∑).
○ Solution for LQG problem
○ Proof
○ Remark 1. Certainty-equivalent control: the noise terms {wt} and {vt} do not appear in the optimal control policy.
○ Remark 2. Separation principle: forward estimation via the Kalman filter, and backward computation of the control action by solving the LQR problem (a backward Riccati sketch follows below).
○ Remark 3. If ∑tㅣt ≡ 0, st ≡ 0.
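As referenced at Remark 2, the control side is a backward Riccati recursion. Here is a sketch of the finite-horizon LQR gains for the quadratic cost xt*Ptxt + ut*Ttut (the system matrices are illustrative assumptions, and by certainty equivalence the gain is applied to the Kalman estimate x̂tㅣt):

```python
import numpy as np

def lqr_gains(A, B, Px, Tu, PT, horizon):
    """Backward Riccati recursion for the finite-horizon LQR part of the LQG problem
    (stage cost x* Px x + u* Tu u, terminal cost x* PT x). Returns gains K_t with u_t = -K_t x_t."""
    S = PT
    gains = []
    for _ in range(horizon):
        M = Tu + B.T @ S @ B
        K = np.linalg.solve(M, B.T @ S @ A)      # K_t = (Tu + B' S B)^{-1} B' S A
        S = Px + A.T @ S @ (A - B @ K)           # equivalent compact form of the Riccati update
        gains.append(K)
    return gains[::-1]                           # ordered K_0, ..., K_{T-1}

# Hypothetical double-integrator-like system; apply u_t = -K_t * x_hat_{t|t} (separation principle)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Px, Tu, PT = np.eye(2), np.eye(1), np.eye(2)
K = lqr_gains(A, B, Px, Tu, PT, horizon=10)
x_hat = np.array([1.0, 0.0])                     # Kalman-filter estimate (from the previous sketch)
print(K[0], -K[0] @ x_hat)
```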
⑦ Quadratic cost and ACOE
⒄ Lemma 17. MAB(multi-armed bandit)
① Formula
○ N stochastic processes (Xkn)k=0,1,···; n=1,2,···
○ Each stochastic process has a value in S = {1, 2, ···, I} or countably infinite set.
○ For each time k, uk ∈ {1, 2, ···, N}
○ If uk = n, then Xk+1m = Xkm for all m ∈ {1, 2, ···, N} \ {n} (the arms that are not played stay frozen)
○ For the played arm n = uk: ℙ(Xk+1n = j) = P(j ㅣ Xkn)
○ We observe (Xk1, Xk2, ···, XkN).
○ We want an optimal policy attaining supg∈𝒢 𝔼g[∑k=0 to ∞ βkR(Xkuk)].
○ Dynamic programming: W: SN → ℝ, W(x1, x2, ···, xN) = maxu∈{1, 2, ···, N} [R(xu) + β∑j=1 to I P(j ㅣ xu) W(x1, ···, xu-1, j, xu+1, ···, xN)], xi ∈ S
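A brute-force sketch of this dynamic program on the joint state space (tractable only for tiny instances, which is exactly the curse of dimensionality that the Gittins index avoids); the two-arm chain, its transition matrix, and the rewards are illustrative:

```python
import numpy as np
from itertools import product

# Hypothetical rested bandit: N = 2 arms, each with I = 3 states,
# the same per-arm row-stochastic transition matrix P and reward R(state).
I, N, beta = 3, 2, 0.9
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([0.0, 0.5, 1.0])

states = list(product(range(I), repeat=N))     # joint states (x^1, ..., x^N)
W = {s: 0.0 for s in states}

# Value iteration on W(x^1,...,x^N) = max_u [R(x^u) + beta * sum_j P(j | x^u) W(..., j, ...)]
for _ in range(500):
    W_new = {}
    for s in states:
        vals = []
        for u in range(N):                      # play arm u; the other arms stay frozen
            cont = sum(P[s[u], j] * W[s[:u] + (j,) + s[u + 1:]] for j in range(I))
            vals.append(R[s[u]] + beta * cont)
        W_new[s] = max(vals)
    W = W_new

print({s: round(W[s], 3) for s in states})
```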
② Example: Suppose there are N slot machines, denoted by M1, …, MN. The success probability of each slot machine Mi is θi, and the failure probability is 1 − θi. When we play once, we receive a reward of 1 on success and 0 on failure. The success probabilities θ1, …, θN are mutually independent random variables taking values in [0, 1], and their prior distributions are denoted by P1(dθ1), …, PN(dθN) (we assume these distributions admit densities). At each time step, we can choose and play only one of the N slot machines. Find an optimal policy that maximizes Eg[∑t=0 to ∞ βt rt].
③ Example: We consider a small network with J queues (nodes). At each node j ∈ {1, …, J}, there are initially nj jobs (customers) waiting in line. The time required to serve a single customer at node j is random, and its distribution has cumulative distribution function (CDF) Fj. When a customer completes service at node j, one of two things happens. With probability qjℓ, the customer moves to another node ℓ and joins the queue there; with the remaining probability 1 − ∑ℓ=1 to J qjℓ, the customer leaves the system entirely. We assume that whenever a customer completes service at node j, we obtain a reward rj > 0 (for example, rj can be interpreted as the revenue the company earns from completing one job at node j). There is only one server (worker) in the system. Thus, at each moment we must decide “which node’s queue should we serve next?” We assume that no new customers arrive from outside, and our goal is to maximize the total discounted reward (total discounted revenue) over time. Viewing this situation as a multi-armed bandit (MAB) problem, each node j corresponds to an arm. At each time step we choose one action from the set {1, 2, …, J} and serve one customer from the corresponding node’s queue. Where the customer goes next is determined by the probabilities qjℓ for that node, and as a result the overall queueing state (how many customers remain at each node, etc.) evolves to the next state. The joint state of all nodes is the “system state” in the bandit problem. To make the system look more like a standard bandit model, we can add a fictitious (J+1)-th node that represents the destination of customers who leave the system, thereby making the network “closed.” No actual server is assigned to this fictitious node, so it is never chosen as an action; it is just a device to represent the flow of customers leaving the system. In this way, a structure in which a single server chooses among several queues where to spend time, receives a reward whenever a service is completed, and sees the queueing state change accordingly is a canonical example of an MAB. If there are no arrivals at all, each queue shrinks only when we choose to serve it, so the problem is a relatively simple “rested” bandit (cf. the Gittins policy is optimal). However, if customers keep arriving from outside, then the queues change state even when they are not chosen, and the problem becomes an example of a “restless bandit” (cf. the Gittins index policy is no longer optimal, and one must resort to concepts such as the Whittle index).
Input: 2025.08.26 23:34