Stochastic Control Theory
Recommended post: 【Control Theory】 Table of Contents for Control Theory
1. Sigma-algebra
2. Terminology of Stochastic Control Theory
3. Laws of Stochastic Control Theory

1. Sigma-algebra
⑵ Sigma-algebra (σ-algebra)
2. Terminology of Stochastic Control Theory
⑴ Variable definitions
① state (system state) xt: may denote a specific value or a random variable depending on context; the same applies to the symbols below
② observation yt: in the perfect observation case, yt = xt
③ noise (system noise), disturbance, primitive random variable wt, vt
④ system state noise wt
⑤ observation noise vt
⑥ primitive random variable x0 seeding the stochastic uncertainty (the initial state)
⑦ control ut
⑧ control strategy / law / policy gt
⑨ system state sequence xt+1 := ft(xt, ut, wt)
⑩ observation sequence yt := ht(xt, vt)
⑪ control input, action ut := gt(y0:t, u0:t-1). Using all past information is called perfect recall.
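As a concrete illustration of ⑨-⑪, here is a minimal simulation sketch in Python (the particular ft, ht, gt below are illustrative placeholders of my own, not taken from the text): the state is rolled forward by ft, observed through ht, and controlled by a policy gt that may use the full observation and control history (perfect recall).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, w):          # system dynamics  x_{t+1} = f_t(x_t, u_t, w_t)
    return 0.9 * x + u + w

def h(x, v):             # observation      y_t = h_t(x_t, v_t)
    return x + v

def g(y_hist, u_hist):   # policy with perfect recall  u_t = g_t(y_{0:t}, u_{0:t-1})
    return -0.5 * y_hist[-1]   # here: simple feedback on the latest observation

T = 10
x = rng.normal()                        # primitive random variable x_0
y_hist, u_hist = [], []
for t in range(T):
    y = h(x, rng.normal(scale=0.1))     # observation noise v_t
    y_hist.append(y)
    u = g(y_hist, u_hist)               # control input u_t
    u_hist.append(u)
    x = f(x, u, rng.normal(scale=0.1))  # system noise w_t
print(np.round(u_hist, 3))
```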
⑵ Classification by system sequence
① DDS (deterministic dynamical system): xt+1 := ft(xt, ut, wt) = ft(xt, ut), yt := ht(xt, vt) = ht(xt). The case where, at any time t, both the state variable xt and the output variable yt are determined exactly (no randomness enters)
② SDS (stochastic dynamical system): xt+1 := ft(xt, ut, wt), yt := ht(xt, vt), with wt, vt ≢ 0.
⑶ Classification by control input
① open-loop control: ut := gt(y0:t, u0:t-1) = gt(u0:t-1).
② feedback control: the case where past outputs y0:t influence the control action ut
③ centralized stochastic control : (1) stochastic dynamical system + (2) one controller + (3) controller with perfect recall
④ multi-controller problem: team problem, competitive game, etc.
⑷ Classification by policy
① decision process: a general framework for decision-making problems in which state, action, and reward evolve together as a process
② Markov process: (regardless of whether it is a decision process) the future depends only on the current state
○ Markov chain: among Markov processes, refers to those with a finite or countably infinite state space
○ controlled Markov chain: Markov chain + decision process
③ MDP (Markov decision process): among decision processes, cases where the future depends only on the current state
○ dynamic programming: solves the problem through a recurrence relation (breaking the time dependence). Whereas MDP refers to the system framework, dynamic programming refers to the solution methodology.
○ POMDP (partially observed Markov decision process): an MDP system where only partial information rather than full state information can be used
○ constrained MDP, constrained POMDP also exist
④ Gaussian process: the state process {Xt} is such that any finite subset follows a joint Gaussian distribution
⑤ Gaussian-Markov process
○ Condition 1. {Xt} is a Gaussian process
○ Condition 2. Markov property: P(Xn+1 ∈ A | X0, ···, Xn) = P(Xn+1 ∈ A | Xn)
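A minimal sketch of a Gaussian-Markov process (an illustrative example of my own, assuming numpy): the scalar recursion Xt+1 = aXt + Wt with Gaussian X0 and i.i.d. Gaussian Wt satisfies both conditions, since every finite collection (X0, ···, Xn) is a linear image of independent Gaussians (Condition 1) and Xn+1 depends on the past only through Xn (Condition 2).

```python
import numpy as np

rng = np.random.default_rng(0)
a, q = 0.8, 1.0                      # |a| < 1, Var(W_t) = q  (illustrative values)

# simulate many independent paths of X_{t+1} = a X_t + W_t
n_paths, T = 100_000, 50
x = rng.normal(size=n_paths)         # X_0 ~ N(0, 1)
for t in range(T):
    x = a * x + rng.normal(scale=np.sqrt(q), size=n_paths)

# the stationary variance of this Gauss-Markov process is q / (1 - a^2)
print(x.var(), q / (1 - a**2))       # both ≈ 2.78
```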
3. Laws of Stochastic Control Theory
⑴ Lemma 1. In open-loop control, xt is a function of x0, u0:t-1, w0:t-1, and yt is a function of x0, u0:t-1, w0:t-1, vt
⑵ Lemma 2. open-loop system vs. feedback system
① Under a DDS, open-loop and feedback systems are equivalent: since the trajectory is uniquely determined by x0 and u0:t-1, each ut produced by a feedback law can be reproduced by an open-loop law
② Under SDS, open-loop and feedback systems are not equivalent
○ Counterexample 1.
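Since the counterexample itself is not spelled out here, the following Monte Carlo sketch (an illustrative system of my own, not necessarily the author's Counterexample 1) shows the general point: for the SDS xt+1 = xt + ut + wt, the feedback law ut = −xt keeps the terminal state variance at Var(wt), while no open-loop input sequence can reduce the accumulated noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, T = 100_000, 20

# SDS: x_{t+1} = x_t + u_t + w_t,  w_t ~ N(0, 1),  x_0 = 0
def terminal_variance(feedback):
    x = np.zeros(n_paths)
    for t in range(T):
        u = -x if feedback else 0.0   # open loop: u_t may not depend on y_{0:t}
        x = x + u + rng.normal(size=n_paths)
    return x.var()

print("open loop :", terminal_variance(feedback=False))   # ≈ T  (noise accumulates)
print("feedback  :", terminal_variance(feedback=True))    # ≈ 1  (only the last noise term remains)
```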
⑶ Lemma 3. policy independence: If Wt is independent of X0:t-1, U0:t-1, then ℙ(xt+1g ∈ A | x0:t, u0:t) = ℙ(xt+1g ∈ A | xt, ut) = ℙ(ft(xt, ut, wt) ∈ A | xt, ut) (Markov property), so dependence on policy g disappears
① In a DDS, if you know the current state (and input) you immediately know the next state; in an SDS you do not, so conditional probabilities given the history become important.
② That is, when wt is independent of the past, the system evolves by the dynamics plus pure noise, so the policy does not affect the transition law; but if wt depends on the policy, the policy changes the noise distribution, and the future state distribution then depends on the policy.
③ Sometimes written as Wt ⫫ x0, U0:t-1, W0:t-2 instead of Wt ⫫ x0, U0:t-1, but the former is a stronger claim.
⑷ Lemma 4. Gaussian process (GP)
① Definition: the state process {Xt} is such that any finite subset follows a joint Gaussian distribution
② 4-1. Even if each Xi is Gaussian, it does not imply {Xi}i∈ℕ is a GP.
○ Example: with X1 Gaussian (say 𝒩(0, 1)), let X2 = X1 I{|X1| ≤ k} + (-X1) I{|X1| > k}. Each of X1, X2 is Gaussian, but Y = (X1 + X2) / 2 = X1 I{|X1| ≤ k} is not Gaussian (it equals 0 with positive probability), so (X1, X2) is not a GP
③ 4-2. For Xt+1 = AXt + BUt + GWt with X0 ~ 𝒩(0, ∑0), Wt ~ 𝒩(0, Q), and deterministic (e.g., open-loop) inputs Ut, {Xt} is a GP
④ 4-3. Under a feedback policy, {Xt} is generally not a GP
○ Example: If Ut := gt(Yt) = gt(Xt) = Xt², then X1 = AX0 + BX0² + GW0, which is not Gaussian
⑤ (Note) MMSE (minimum mean-square estimator)
⑥ (Note) orthogonality principle
⑦ (Note) LMMSE (linear minimum mean-square estimator): If X and Y are jointly Gaussian, then LMMSE = MMSE holds
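For reference, the standard forms behind the three notes above (textbook facts, written out here because only the names appear in the text):

```latex
% MMSE: the conditional mean minimizes the mean-square error
\hat{X}_{\mathrm{MMSE}} = \mathbb{E}[X \mid Y]

% Orthogonality principle: the estimation error is orthogonal to every
% (square-integrable) function \varphi of the data Y
\mathbb{E}\big[(X - \hat{X}_{\mathrm{MMSE}})\,\varphi(Y)\big] = 0

% LMMSE: the best affine estimator; it coincides with the MMSE when (X, Y) is jointly Gaussian
\hat{X}_{\mathrm{LMMSE}} = \mathbb{E}[X] + \Sigma_{XY}\,\Sigma_{Y}^{-1}\big(Y - \mathbb{E}[Y]\big)
```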
⑸ Lemma 5. multi-step prediction
① In general, ℙ(xt+2g ∈ A | xt, ut, ut+1) ≠ ℙ(xt+2g ∈ A | x0:t, u0:t+1)
② Since ut+1 = gt+1(y0:t+1, u0:t) depends on yt+1, which carries information about wt through xt+1, conditioning on ut+1 breaks the independence of wt from the past; here "past" means x0:t-1, u0:t-1
○ Counterexample 1. In open-loop control, ut+1 = gt+1(u0:t), so ut+1 carries no information about wt, hence equality holds.
○ Counterexample 2. When wt is a constant
○ Counterexample 3. When ut is memoryless (Markov) feedback, e.g., ut = μt(xt) under perfect observation yt = xt
③ multi-step prediction with open-loop control
④ Chapman-Kolmogorov decomposition
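A hedged sketch of what the Chapman-Kolmogorov decomposition looks like in this setting (countable state space, with conditioning that is policy-independent as in ② and ③; the exact formula is not reproduced in the text above):

```latex
% two-step prediction decomposes through the intermediate state x_{t+1}
\mathbb{P}\big(x_{t+2} \in A \mid x_t, u_t, u_{t+1}\big)
  = \sum_{x_{t+1}} \mathbb{P}\big(x_{t+2} \in A \mid x_{t+1}, u_{t+1}\big)\,
                   \mathbb{P}\big(x_{t+1} \mid x_t, u_t\big)
```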
⑹ Lemma 6. linear Gaussian state-space model
① (Note) Gaussian-Markov process
○ Condition 1. {Xt} is a Gaussian process
○ Condition 2. Markov property: P(Xn+1 ∈ A | X0, ···, Xn) = P(Xn+1 ∈ A | Xn)
② System definition
○ Markov property: applies even with feedback policy
○ multi-step Markov property
○ mean propagation
○ cross-covariance Cov(Xt+m, Xt)
○ covariance propagation
③ DALE (discrete-time algebraic Lyapunov equation)
○ If the absolute values of all eigenvalues (including complex ones) of a square matrix A are less than 1, the matrix is defined as stable: in that case limt→∞ At = 0
○ If A is stable, then ∑∞ = limt→∞ ∑t = limt→∞ 𝔼[(Xt - 𝔼[Xt])(Xt - 𝔼[Xt])ᵀ] exists, is unique, and (for the linear system Xt+1 = AXt + BUt + GWt above) satisfies the DALE ∑∞ = A∑∞Aᵀ + GQGᵀ
○ Proof of uniqueness of ∑∞
○ stability of A is a sufficient, but not necessary condition: ∑∞ may still exist uniquely even if A is not stable
④ reachability
○ Definition
○ Theorem 1. The following are all equivalent: assume w ∈ ℝs
○ In condition 3, the noise sequence w should be interpreted as the control input applied to the system; by choosing these inputs, the system can be steered from 0 to a given state x in n time steps.
○ Theorem 2. Lyapunov stability test
○ Note that in condition 2 it is PD (positive definite), not PSD (positive semidefinite)
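Tying together the covariance propagation of ② and the DALE of ③, here is a minimal numerical sketch (illustrative matrices of my own, assuming numpy and scipy): iterating ∑t+1 = A∑tAᵀ + GQGᵀ for a stable A converges to the unique fixed point returned by scipy.linalg.solve_discrete_lyapunov.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.2],
              [0.0, 0.5]])           # stable: all eigenvalues inside the unit circle
G = np.eye(2)
Q = np.diag([0.1, 0.2])              # Cov(W_t)
S = np.zeros((2, 2))                 # Sigma_0

print(np.abs(np.linalg.eigvals(A)).max() < 1)   # stability check -> True

# covariance propagation: Sigma_{t+1} = A Sigma_t A^T + G Q G^T
for _ in range(500):
    S = A @ S @ A.T + G @ Q @ G.T

# DALE fixed point: Sigma_inf = A Sigma_inf A^T + G Q G^T
S_inf = solve_discrete_lyapunov(A, G @ Q @ G.T)
print(np.allclose(S, S_inf))         # True: the iteration converges to the DALE solution
```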
⑺ Lemma 7. Graph Theory
① strongly connected (= irreducible, communicable): a condition where from any node i in the graph one can reach any other node j
Figure 1. Example of irreducible
Figure 2. Example of reducible (state 3 is a sink)
② period: the period of a specific node i is the greatest common divisor of the lengths of all paths from i back to i
○ Example: with two nodes A, B connected by the two edges A → B and B → A, the period of each node is 2
Figure 3. Example of a matrix with period m
The states cycle as S1 → S2 → ··· → Sm → S1 → ···
③ aperiodic: all nodes have period 1
○ aperiodic ⊂ irreducible
○ Example: if every node has an edge to itself (a self-loop), the chain is aperiodic
④ stationary state: If Pr(xn | xn-1) is independent of n, the Markov process is stationary (time-invariant)
⑤ regular
○ regular ⊂ irreducible
○ For some natural number k, every entry of the power Mk of the transition matrix M is positive (i.e., nonzero)
⑥ transition matrix
⑦ Markov policy: ut = gt(xt)
⑧ One can prove the second law of thermodynamics (law of increasing entropy) using a Markov process
○ Because one can simulate the law of diffusion: provided a uniform stationary distribution is assumed
○ Related concept: random walk
⑨ Perron-Frobenius theorem
○ Theorem 1. If a Markov chain with transition matrix P is strongly connected, there exists exactly one stationary distribution q
○ The stationary distribution satisfies Pq = q
○ Theorem 2. If a Markov chain with transition matrix P is strongly connected and aperiodic, it is called an Ergodic Markov chain and satisfies:
○ Pij: probability of transition from node j to node i. ∑i Pij = 1
○ 2-1. The (i, j) entry Pij(k) of Pk converges to qi as k → ∞: note it converges to the same value for fixed i regardless of j
○ 2-2. Regardless of the initial state x0, the k-th state xk converges to q as k → ∞
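A minimal numerical sketch of Theorems 1 and 2 (an illustrative 3-state chain of my own, using the column convention Pij = probability of moving from j to i as above): the stationary distribution q is the eigenvector of P for eigenvalue 1, and every column of Pk converges to q.

```python
import numpy as np

# column-stochastic transition matrix: P[i, j] = P(next = i | current = j)
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
assert np.allclose(P.sum(axis=0), 1.0)

# stationary distribution: eigenvector of P for eigenvalue 1, normalized to sum to 1
vals, vecs = np.linalg.eig(P)
q = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
q = q / q.sum()
print(np.allclose(P @ q, q))                      # Pq = q  -> True

# ergodicity: every column of P^k converges to q as k grows
Pk = np.linalg.matrix_power(P, 50)
print(np.allclose(Pk, np.tile(q[:, None], 3)))    # True
```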
⑻ Lemma 8. value function
① recursive and backward iteration: for example, with present value (discounting future value to present),
○ JTg ∈ ℝ1×1
○ π0 ∈ ℝ1×n: initial distribution of the Markov chain
○ V0g ∈ ℝn×1: vector of state-wise value functions collecting expected cumulative cost at each state under policy g
② Bellman equation: related to the discounted cost problem
○ (Note) time-homogeneous: {xtg}t≥0 and {xtg}t≥τ,∀τ∈ℤ+ follow the same distribution
○ Condition 1. time-homogeneous transition: Pt(j | i, u) = P(j | i, u) ∀t
○ Condition 2. time-homogeneous cost: Ct(x, y) = C(x, y) ∀t
○ Condition 3. stationary policy: gt = g ∀t
○ If all the above hold, one can obtain the following fixed-point equation
○ Since Pg is a stochastic matrix, all of its eigenvalues have absolute value at most 1, while 1/β > 1; hence det(I - βPg) = βⁿ det((1/β)I - Pg) ≠ 0, so (I - βPg) is invertible and the fixed-point equation has a unique solution
○ Vg ∈ ℝn×1: vector of state-wise value functions collecting expected discounted cumulative cost at each state under policy g
○ Pg ∈ ℝn×n: transition matrix; the (i, j) entry is the probability of transition from i to j
③ Cesàro limit: related to the long-term average cost problem
④ Poisson equation: related to average cost
○ Jg is unique
○ Lg: relative value function. Lg is not unique (∵ Lg + α1 ∀α ∈ ℝ is also a solution to the Poisson equation)
○ Existence of solutions
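A minimal numerical sketch of the discounted fixed-point equation and the average-cost quantity above (the chain Pg, cost vector cg, and discount β below are illustrative assumptions, using the row convention for Pg stated above): Vg = cg + βPgVg is solved by Vg = (I − βPg)⁻¹cg, and Jg = πg·cg with πg the stationary distribution of Pg.

```python
import numpy as np

# row-stochastic P_g: entry (i, j) = P(next = j | current = i), per the convention above
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
c = np.array([1.0, 0.0, 2.0])        # per-stage cost c_g(i) under policy g (illustrative)
beta = 0.95                          # discount factor

# discounted value function: V_g = c_g + beta * P_g V_g  =>  V_g = (I - beta P_g)^{-1} c_g
V = np.linalg.solve(np.eye(3) - beta * P, c)
print(V)

# long-run average cost: J_g = pi_g . c_g, with pi_g the stationary row vector (pi_g P_g = pi_g)
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()
print(pi @ c)                        # average cost per stage
```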
⑼ Lemma 9. When not irreducible
① If Pg is not irreducible, the state space S splits into the transient states T and one or more recurrent communicating classes C1, ···
② transient state: visited only finitely many times. Eventually the process leaves the transient states and enters a recurrent state.
○ The stationary distribution πg assigns probability 0 to all transient states.
③ recurrent state: since a recurrent communicating class is a closed set, no communication occurs with outside nodes
○ i → j: means there exists a path with positive probability from i to j
○ i ↔︎ j: means i → j and j → i; i and j communicate
○ positive recurrent: the mean return time to that state is finite. A chain starting in a positive recurrent state has a unique stationary distribution.
○ null recurrent: the mean return time to that state is infinite. No stationary distribution exists.
○ absorbing state: a state that, once entered, you remain in forever
④ Example 1. Finite state space
Let S = {0, 1, ···, I}. Since Vg(0) = 0 and C(0, g(0)) = 0, we can focus only on Ś = S \ {0} = {1, ···, I}, the non-absorbing states. Let Ṽg be the value vector for states in Ś, and let Rg be the submatrix of Pg for transitions among states inside Ś. Then the system of equations for these states is Ṽg = c̃ + RgṼg. To show uniqueness of Ṽg, suppose there are two solutions Ṽ1g, Ṽ2g. Let their difference be Ug = Ṽ1g - Ṽ2g; subtracting the two equations yields Ug = RgUg = ⋯ = (Rg)nUg = ⋯ = 0 (∵ limn→∞ (Rg)n = 0, method of infinite descent) ⇔ Ṽ1g = Ṽ2g. Thus, in a finite state space where state 0 is absorbing and all other states can reach 0, the first-passage-time cost equation has a unique nonnegative solution.
⑤ Example 2. Countably infinite state space
Uniqueness is not trivial. Assuming the solution is bounded often allows one to show uniqueness. Consider the equation for the difference of two solutions Ug = RgUg. Connecting this to the diagram yields Ug(ℓ+1) - Ug(ℓ) = (λ - 1)(Ug(ℓ) - Ug(ℓ-1)). The consecutive differences Δ(ℓ) = Ug(ℓ+1) - Ug(ℓ) form a geometric sequence with ratio (λ - 1). If |λ - 1| < 1, these differences converge to 0, suggesting a bounded solution. If |λ - 1| ≥ 1, the differences may diverge, implying that uniqueness may fail.
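A minimal numerical sketch of Example 1 (illustrative chain and costs of my own, assuming numpy): with state 0 absorbing and reachable from every other state, the submatrix Rg has spectral radius below 1, so Ṽg = (I − Rg)⁻¹c̃ is the unique solution of Ṽg = c̃ + RgṼg.

```python
import numpy as np

# states {0, 1, 2, 3}; state 0 is absorbing, every other state eventually reaches 0
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.4, 0.6, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.3, 0.7]])
c = np.array([0.0, 1.0, 1.0, 1.0])    # per-stage cost; C(0, g(0)) = 0 at the absorbing state

R = P[1:, 1:]                          # R_g: transitions among the non-absorbing states
assert np.abs(np.linalg.eigvals(R)).max() < 1    # hence (R_g)^n -> 0

# unique solution of V = c + R V on the non-absorbing states
V = np.linalg.solve(np.eye(3) - R, c[1:])
print(V)   # expected cost (here: expected number of steps) until absorption in state 0
```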
⑽ Lemma 10. Martingale
① Doob’s theorem
○ σ(X1, X2, ···, Xn): the smallest σ-algebra that makes X1, X2, ···, Xn measurable
○ Doob’s theorem: a random variable is σ(X1, X2, ···, Xn)-measurable if and only if it can be written in the form g(X1, X2, ···, Xn) for some measurable function g
○ The larger the σ-algebra, the more functions are measurable with respect to it; i.e., the more information it contains.
② filtration
○ a collection of σ-algebras ordered increasingly by inclusion
○ Ordered by ⊆; if ℱ1 ⊆ ℱ2, then ℱ2 is later (contains at least as much information) than ℱ1
○ For convenience, let time index t = 0, 1, 2, ⋯; then the filtration is {ℱt}t∈ℤ+ and satisfies ℱs ⊆ ℱt for all s ≤ t
○ Intuition: represents situations where information increases as observations accumulate over time
③ martingale
○ Property of conditional expectation
○ For any random variable Y, 𝔼[Y | X1, ···, Xn] = 𝔼[Y | σ(X1, ···, Xn)] holds
○ Reason: because σ(X1, ···, Xn) is equivalent to the set of all functions generated by X1, ···, Xn
○ Martingale: a stochastic process {Xt}t∈ℤ+ adapted to a filtration {ℱt}t∈ℤ+ that satisfies all of the following
○ Condition 1. Xt is ℱt-measurable for all t ∈ ℤ+
○ If s ≤ t ≤ s′ and ℱs ⊆ ℱt ⊆ ℱs′, then Xt, being ℱt-measurable, is in general not ℱs-measurable (insufficient information) but is ℱs′-measurable.
○ Condition 2. 𝔼[|Xt|] is finite for all t ∈ ℤ+
○ Condition 3. 𝔼[Xt | ℱs] = Xs almost surely for all s ≤ t and all t ∈ ℤ+
○ Interpretation: Given only the information up to time s (ℱs), the optimal prediction of Xt equals Xs (i.e., the prediction is constrained to Xs).
○ Remark: The martingale property is needed only when predicting the future from the past. In particular, for s > t we have 𝔼[Xt | ℱs] = Xt regardless of whether (Xt) is a martingale (assuming integrability).
○ Likewise, for s < t, 𝔼[Xs | ℱt] = Xs holds, because Xs is ℱs-measurable and ℱs ⊆ ℱt implies Xs is ℱt-measurable.
○ Note: an i.i.d. process is generally not a martingale (except for the constant process)
○ Application: 𝔼[Ug(Xtg) | Xt-1g] = Ug(Xt-1g)
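A minimal worked instance of the martingale property (a standard example, not taken from the text): partial sums of independent zero-mean increments form a martingale with respect to the filtration they generate.

```latex
% Let W_1, W_2, \dots be independent with \mathbb{E}[W_k] = 0, and set
% \mathcal{F}_t = \sigma(W_1, \dots, W_t), \qquad X_t = \sum_{k=1}^{t} W_k .
\mathbb{E}[X_t \mid \mathcal{F}_s]
  = \mathbb{E}\Big[X_s + \sum_{k=s+1}^{t} W_k \,\Big|\, \mathcal{F}_s\Big]
  = X_s + \sum_{k=s+1}^{t} \mathbb{E}[W_k]
  = X_s \qquad (s \le t),
% using that X_s is \mathcal{F}_s-measurable and each W_k with k > s is independent of \mathcal{F}_s.
```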
Input: 2025.08.26 23:34