
Stochastic Control Theory

Recommended post: 【Control Theory】 Table of Contents for Control Theory


1. Sigma-algebra

2. Terminology of Stochastic Control Theory

3. Laws of Stochastic Control Theory


a. Reinforcement Learning

b. The Power of Belief



1. Sigma-algebra

Probability Space

Sigma-algebra(σ-algebra)



2. Terminology of Stochastic Control Theory

⑴ Variable definitions

① state (system state) xt: denotes a specific value or a random variable; same below

② observation yt: in the perfect observation case, yt = xt

③ noise (system noise), disturbance, error, primitive random variable wt, vt

④ system state noise wt

⑤ observation noise vt

⑥ primitive random seed creating stochastic uncertainty x0

⑦ control ut

⑧ control strategy / law / policy gt


스크린샷 2025-10-15 오전 10 06 54


system state sequence xt+1 := ft(xt, ut, wt)

observation sequence yt := ht(xt, vt)

control input, action ut := gt(y0:t, u0:t-1). Using all past information is called perfect recall.
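
As a minimal sketch of how these objects fit together (the specific maps ft, ht, gt and the Gaussian noise below are illustrative assumptions, not from the text), one rollout of the loop xt+1 = ft(xt, ut, wt), yt = ht(xt, vt), ut = gt(y0:t, u0:t-1) can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, w):          # system map x_{t+1} = f_t(x_t, u_t, w_t); a linear form is assumed for illustration
    return 0.9 * x + u + w

def h(x, v):             # observation map y_t = h_t(x_t, v_t)
    return x + v

def g(y_hist, u_hist):   # a perfect-recall feedback policy (here it happens to use only the latest observation)
    return -0.5 * y_hist[-1]

x = rng.normal()         # x_0: primitive random seed creating stochastic uncertainty
ys, us = [], []
for t in range(10):
    y = h(x, rng.normal(scale=0.1))      # observation noise v_t
    ys.append(y)
    u = g(ys, us)                        # control u_t = g_t(y_{0:t}, u_{0:t-1})
    us.append(u)
    x = f(x, u, rng.normal(scale=0.1))   # system noise w_t
print(ys[-1], us[-1])
```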

⑵ Classification by system sequence

DDS (deterministic dynamical system): xt+1 := ft(xt, ut, wt) = ft(xt, ut), yt := ht(xt, vt) = ht(xt). Case where at any time t both the state variable xt and the output variable yt are known

SDS (stochastic dynamical system): xt+1 := ft(xt, ut, wt), yt := ht(xt, vt), wt, vt ≢ 0.

⑶ Classification by control input

open loop control: ut := gt(y0:t, u0:t-1) = gt(u0:t-1).

feedback control: cases where past outputs y0:t influence the control action u

centralized stochastic control: (1) stochastic dynamical system + (2) one controller + (3) controller with perfect recall

multi-controller problem: team problem, competitive game, etc.

⑷ Classification by policy

decision process: a general framework for decision-making problems in which states, actions, and rewards evolve over time as a process

Markov process: (regardless of whether it is a decision process) the future depends only on the current state


image


Markov chain: among Markov processes, refers to those with a finite or countably infinite state space

controlled Markov chain: Markov chain + decision process


스크린샷 2025-10-05 오전 2 11 59


MDP (Markov decision process): among decision processes, cases where the future depends only on the current state

dynamic programming: solves the problem via a recurrence relation (breaking the time dependence). If MDP refers to the system framework, dynamic programming refers to the solution methodology.

POMDP (partially observed Markov decision process): an MDP system where only partial information rather than full state information can be used

constrained MDP, constrained POMDP also exist

Related algorithms

Gaussian process: the state process {Xt} is such that any finite subset follows a joint Gaussian distribution

Gaussian-Markov process

Condition 1. {Xt} is a Gaussian process

Condition 2. Markov property: P(Xn+1 ∈ A | X0, ···, Xn) = P(Xn+1 ∈ A | Xn)



3. Laws of Stochastic Control Theory

Lemma 1. In open-loop control, xt is a function of x0, u0:t-1, w0:t-1, and yt is a function of x0, u0:t-1, w0:t-1, vt


스크린샷 2025-10-05 오전 2 13 47


Lemma 2. open-loop system vs. feedback system

① open loop control: ut := gt(y0:t, u0:t-1) = gt(u0:t-1).

② feedback control: cases where past outputs y0:t influence the control action u

③ Under DDS, open-loop and feedback systems are equivalent.

○ open-loop → feedback (proof): Given an open-loop control input sequence u, define a feedback policy that ignores the state (i.e., a map returning the predetermined ut at each time t). Then, from the initial state x0, it produces the same trajectory and cost. Thus, regardless of DDS/SDS, for any open-loop policy there exists a feedback policy that is equivalent at that initial state. That is, open-loop ⊂ feedback always holds.

○ feedback → open-loop (proof): In a DDS, given the initial state, the inputs generated by a feedback policy are uniquely determined (determinism). Therefore, if you pre-specify that same input sequence u as an open-loop policy, the resulting state evolution is identical and so is the cost.

④ Under SDS, open-loop and feedback systems are not equivalent

Counterexample 1.


스크린샷 2025-10-05 오전 2 14 49


○ In the above counterexample, the feedback system outperforms the open-loop system (i.e., it generates the lower cost).


스크린샷 2025-10-11 오후 5 44 43
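
A hypothetical numerical example in the same spirit (not necessarily the one shown in the figure): with x0 uniformly ±1, perfect observation y0 = x0, dynamics x1 = x0 + u0, and cost |x1|, the best open-loop input cannot beat cost 1, while the feedback policy u0 = −y0 achieves cost 0.

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.choice([-1.0, 1.0], size=100_000)   # primitive uncertainty, observed perfectly (y0 = x0)

# open-loop: u0 is a constant chosen before seeing y0; the best constant still leaves cost 1
open_loop_cost = min(np.mean(np.abs(x0 + c)) for c in np.linspace(-1, 1, 21))

# feedback: u0 = g(y0) = -x0 cancels the uncertainty entirely
feedback_cost = np.mean(np.abs(x0 + (-x0)))

print(open_loop_cost, feedback_cost)   # ~1.0 vs 0.0
```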


Lemma 3. policy independence: If Wt is independent of X0:t-1, U0:t-1, then ℙ(xt+1g ∈ A | x0:t, u0:t) = ℙ(xt+1g ∈ A | xt, ut) = ℙ(ft(xt, ut, wt) ∈ A | xt, ut) (Markov property), so dependence on policy g disappears


스크린샷 2025-10-05 오전 2 16 25


① In a DDS, knowing the current state and input determines the next state exactly; in an SDS the next state is random, so conditional probabilities given the history become important.

② That is, when wt is independent of the past, the system evolves according to its dynamics plus pure noise, so the policy does not affect the transition law; but if wt depends on the policy, the policy changes the noise distribution, so the future state distribution depends on the policy.

Lemma 4. Gaussian process (GP)

① Definition: the state process {Xt} is such that any finite subset follows a joint Gaussian distribution

4-1. Even if each Xi is Gaussian, it does not imply {Xi}i∈ℕ is a GP.

○ Example: X2 = X1 I{|X1| ≤ k} + (-X1) I{|X1| > k}, Y = (X1 + X2) / 2 is not a GP

4-2. For Xt+1 = AXt + BUt + GWt, X0 ~ 𝒩(0, ∑0), Wt ~ 𝒩(0, Q), {Xt} is a GP

4-3. Under a feedback policy, {Xt} is generally not a GP

○ Example: If Ut := gt(Yt) = gt(Xt) = Xt², then X1 = AX0 + BX0² + GW0, which is not Gaussian

○ On the other hand, in a linear Gaussian SDS, for a general open-loop policy the state process {Xt} is always Gaussian.
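
A quick Monte Carlo sanity check of 4-2 and 4-3 (the scalar A, B, G and the SciPy normality test are illustrative assumptions): an open-loop input keeps X1 Gaussian, while the quadratic feedback u0 = x0² does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A, B, G = 0.8, 1.0, 1.0
n = 50_000
x0 = rng.normal(size=n)                      # X0 ~ N(0, 1)
w0 = rng.normal(size=n)                      # W0 ~ N(0, Q), Q = 1

x1_open = A * x0 + B * 0.3 + G * w0          # open-loop u0 = 0.3 (deterministic): still Gaussian
x1_fb   = A * x0 + B * x0**2 + G * w0        # feedback u0 = x0^2: no longer Gaussian

print(stats.normaltest(x1_open).pvalue)      # typically not small: consistent with Gaussian
print(stats.normaltest(x1_fb).pvalue)        # essentially 0: Gaussianity rejected
```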

⑤ (Note) MMSE (minimum mean-square estimator)


스크린샷 2025-10-05 오전 9 49 29


⑥ (Note) orthogonality principle


스크린샷 2025-10-05 오전 9 50 20


⑦ (Note) LMMSE (linear minimum mean-square estimator)


스크린샷 2025-10-05 오전 9 50 46


⑧ If X and Y are jointly Gaussian, then LMMSE = MMSE holds.


스크린샷 2025-10-13 오후 7 55 13
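
A minimal numerical sketch of ⑧ and the orthogonality principle, assuming a simple jointly Gaussian pair (X, Y) constructed for illustration: the LMMSE estimate 𝔼[X] + Cov(X, Y)Var(Y)⁻¹(Y − 𝔼[Y]) has an error orthogonal to the observation, and for jointly Gaussian data it coincides with the MMSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)      # (X, Y) jointly Gaussian by construction

# LMMSE: x_hat = E[X] + Cov(X, Y) / Var(Y) * (Y - E[Y])
k = np.cov(x, y)[0, 1] / np.var(y)
x_hat = x.mean() + k * (y - y.mean())

err = x - x_hat
print(np.mean(err * y))        # ~0: estimation error orthogonal to Y (orthogonality principle)
print(np.mean(err**2))         # LMMSE mean-square error; for jointly Gaussian (X, Y) this equals the MMSE
```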


Lemma 5. multi-step prediction

① In general, ℙ(xt+2g ∈ A | xt, ut, ut+1) ≠ ℙ(xt+2g ∈ A | x0:t, u0:t+1)


스크린샷 2025-10-05 오전 9 52 01


② Since ut+1 = gt+1(y0:t+1, u0:t) depends on the observation yt+1, it carries information about wt; hence conditioning on ut+1 breaks the independence of wt from the past, where “past” means x0:t-1, u0:t-1

Counterexample 1. In open-loop control, ut+1 = gt+1(u0:t) holds, so it carries no information about wt; hence equality holds.

Counterexample 2. When wt is a constant

Counterexample 3. When ut is a memoryless (Markov) feedback of the current state, e.g., ut = μt(xt); the example below is the case yt = xt = ut


스크린샷 2025-10-05 오전 9 54 51


③ multi-step prediction with open-loop control


스크린샷 2025-10-05 오전 9 55 21


Chapman-Kolmogorov decomposition


스크린샷 2025-10-05 오전 9 55 47
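
A small sketch of the Chapman-Kolmogorov decomposition under open-loop control (the one-step transition matrices are made up for illustration): the two-step transition probabilities are obtained by summing over the intermediate state, i.e., by multiplying the one-step transition matrices.

```python
import numpy as np

# one-step transition matrices P_t[i, j] = P(x_{t+1} = j | x_t = i, u_t) for two fixed open-loop inputs
P0 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
P1 = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# Chapman-Kolmogorov: P(x_{t+2} = j | x_t = i) = sum_k P_t[i, k] * P_{t+1}[k, j]
two_step = P0 @ P1
print(two_step)
print(two_step.sum(axis=1))    # each row still sums to 1
```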


Lemma 6. linear Gaussian state-space model

① (Note) Gaussian-Markov process

Condition 1. {Xt} is a Gaussian process

Condition 2. Markov property: P(Xn+1 ∈ A | X0, ···, Xn) = P(Xn+1 ∈ A | Xn)

② System definition


스크린샷 2025-10-05 오전 9 57 08


○ Markov property: applies even with feedback policy


스크린샷 2025-10-05 오전 9 57 35


○ multi-step Markov property


스크린샷 2025-10-05 오전 9 58 08


○ mean propagation


스크린샷 2025-10-05 오전 9 58 34


○ cross-covariance Cov(Xt+m, Xt)


스크린샷 2025-10-05 오전 9 59 18


○ covariance propagation


스크린샷 2025-10-05 오전 9 59 48


DALE (discrete-time algebraic Lyapunov equation)

○ If the absolute values of all eigenvalues (including complex ones) of a square matrix A are less than 1, the matrix is said to be stable: in that case limt→∞ At = 0

○ If A is stable, then ∑ = limt→∞ ∑t = limt→∞ 𝔼[(Xt - 𝔼[Xt])(Xt - 𝔼[Xt])ᵀ] exists and is unique


스크린샷 2025-10-05 오전 10 01 17


○ Proof of uniqueness of ∑


스크린샷 2025-10-11 오후 11 40 32


Remark 1. Stability of A is a sufficient, but not necessary condition

○ ∑ may still exist uniquely even if A is not stable.

○ A trivial example is given by ∑0 = 0, Q = 0, in which case ∑k ≡ 0 independent of A. (No noise in the first place.)

Remark 2. ∑ may not be strictly positive definite.

○ A trivial example is A = O, rank(GQGᵀ) < n. (Noise does not touch all directions of the state.)

Remark 3. If the input disturbance wk affects all the components of the state vector, then the stability of A would be necessary for the convergence of ∑k, and the limiting covariance ∑ would be positive definite → this motivates the concept of reachability
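
A numerical sketch of the covariance propagation ∑t+1 = A∑tAᵀ + GQGᵀ converging to the DALE solution when A is stable (the matrices below are illustrative assumptions):

```python
import numpy as np

A = np.array([[0.5, 0.2],
              [0.0, 0.3]])      # stable: all eigenvalues inside the unit circle
G = np.eye(2)
Q = np.diag([1.0, 0.5])

sigma = np.zeros((2, 2))        # Sigma_0
for _ in range(200):
    sigma = A @ sigma @ A.T + G @ Q @ G.T    # covariance propagation

# check the discrete-time algebraic Lyapunov equation Sigma = A Sigma A^T + G Q G^T
print(np.max(np.abs(sigma - (A @ sigma @ A.T + G @ Q @ G.T))))   # ~0
```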

reachability

○ Definition: Related to controllability and observability.


스크린샷 2025-10-05 오전 10 02 24


Theorem 1. The following are all equivalent: assume w ∈ ℝs


스크린샷 2025-10-05 오전 10 02 54


○ In condition 3, the noise sequence w should be interpreted as the control input applied to the system, through which the system can be steered from 0 to any given state x in n time steps.

Theorem 2. Lyapunov stability test


스크린샷 2025-10-05 오전 10 03 34


○ Note that in condition 2 it is PD (positive definite), not PSD (positive semidefinite)

Lemma 7. Graph Theory

① strongly connected (= irreducible, communicable): a condition where from any node i in the graph one can reach any other node j


스크린샷 2025-10-05 오전 10 04 05

Figure 1. Example of irreducible


스크린샷 2025-10-05 오전 10 04 55

Figure 2. Example of reducible (state 3 is a sink)


② period: the period of a specific node i is the greatest common divisor of the lengths of all paths from i back to i

○ Example: with two nodes A, B connected by the two edges A → B and B → A, the period of each node is 2

○ A transition matrix with period m can be brought to the following form by a state rearrangement QᵀPQ for a suitable permutation matrix Q.


스크린샷 2025-10-05 오전 10 05 44

Figure 3. Example of a transition matrix with period m

S1 → S2 → ··· → Sm → S1 → ··· has such a cycle


③ aperiodic: all nodes have period 1

○ aperiodic ⊂ irreducible

○ Example: if every node has a self-loop (an edge back to itself), the chain is aperiodic

④ stationary (time-invariant): If Pr(xn | xn-1) is independent of n, the Markov process is called stationary (time-invariant), i.e., time-homogeneous

⑤ regular

○ regular ⊂ irreducible

○ For some natural number k, every entry of the power Mᵏ of the transition matrix M is positive (i.e., nonzero)

⑥ transition matrix


스크린샷 2025-10-05 오전 10 06 49


⑦ Markov policy: ut = gt(xt)

⑧ One can prove the second law of thermodynamics (law of increasing entropy) using a Markov process

○ Because one can simulate the law of diffusion: provided a uniform stationary distribution is assumed

○ Related concept: random walk

Perron-Frobenius theorem

Theorem 1. If a Markov chain with transition matrix P is strongly connected, there exists exactly one stationary distribution q

○ The stationary distribution satisfies Pq = q

Theorem 2. If a finite Markov chain with transition matrix P is strongly connected and aperiodic, it is called an Ergodic Markov chain and satisfies:

○ Pij: probability of transition from node j to node i, so ∑i Pij = 1. Note that in the other lemmas Pij denotes the probability of transition from node i to node j.

2-1. The (i, j) entry Pij(k) of Pk converges to qi as k → ∞: note it converges to the same value for fixed i regardless of j

2-2. Regardless of the initial state x0, the k-th state xk converges to q as k → ∞


스크린샷 2025-10-12 오후 1 24 03
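
A small sketch of Theorems 1-2 for an assumed irreducible, aperiodic chain, using this lemma's column convention Pij = P(j → i) so that Pq = q: power iteration recovers the stationary distribution, and every column of Pᵏ converges to q.

```python
import numpy as np

# column-stochastic P: P[i, j] = probability of moving from node j to node i, so each column sums to 1
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

q = np.ones(3) / 3
for _ in range(1000):          # power iteration: q_{k+1} = P q_k converges to the stationary distribution
    q = P @ q

print(q)                               # stationary distribution, Pq = q
print(np.linalg.matrix_power(P, 50))   # every column converges to q, regardless of the starting node
```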


Lemma 8. Value function following deterministic Markov property

① Expected cost and transition probability


스크린샷 2025-10-12 오후 1 52 21


recursive and backward iteration: Dynamic programming


스크린샷 2025-10-05 오전 11 12 32


○ JTg ∈ ℝ1×1

○ π0 ∈ ℝ1×n: initial distribution of the Markov chain

○ V0g ∈ ℝn×1: vector of state-wise value functions collecting expected cumulative cost at each state under policy g

○ When T = ∞, Jg becomes infinite and the optimal policy g cannot be found by comparing costs directly; thus the Bellman equation (discounted cost) and the Cesàro limit (average cost) are introduced.

Bellman equation: related to the discounted cost problem

○ (Note) time-homogeneous: {xtg}t≥0 and {xtg}t≥τ,∀τ∈ℤ+ follow the same distribution. Also means strictly stationary.


스크린샷 2025-10-07 오전 1 34 29


Condition 1. time-homogeneous transition: Pt(j | i, u) = P(j | i, u) ∀t

Condition 2. time-homogeneous cost: Ct(x, y) = C(x, y) ∀t

Condition 3. stationary policy: gt = g ∀t

○ If all the above hold, one can obtain the following fixed-point equation


스크린샷 2025-10-05 오전 10 10 36


○ Jg: The present value of the cost; generally used in an economic context.

○ Since Pg is a stochastic matrix, all of its eigenvalues have absolute value at most 1; hence for β < 1 the matrix βPg has spectral radius at most β < 1, and det(I - βPg) = βⁿ det((1/β)I - Pg) ≠ 0 because 1/β > 1 cannot be an eigenvalue of Pg


스크린샷 2025-10-05 오전 10 12 01


○ Vg ∈ ℝn×1: vector of state-wise value functions collecting expected discounted cumulative cost at each state under policy g


스크린샷 2025-10-05 오전 10 12 57


○ Pg ∈ ℝn×n: transition matrix; the (i, j) entry is the probability of transition from i to j
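
Under the three time-homogeneity conditions above, the fixed-point equation Vg = cg + βPgVg can be solved as a linear system because I − βPg is invertible for β < 1; a minimal sketch with made-up numbers:

```python
import numpy as np

beta = 0.9
Pg = np.array([[0.8, 0.2],          # Pg[i, j]: probability of transition from i to j under policy g
               [0.3, 0.7]])
cg = np.array([1.0, 2.0])           # per-stage cost C(i, g(i))

# Vg = cg + beta * Pg @ Vg  =>  (I - beta * Pg) Vg = cg
Vg = np.linalg.solve(np.eye(2) - beta * Pg, cg)
print(Vg)                            # expected discounted cumulative cost from each state

# Jg = pi0 @ Vg for an initial distribution pi0
pi0 = np.array([0.5, 0.5])
print(pi0 @ Vg)
```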

Cesàro limit: related to the long-term average cost problem


스크린샷 2025-10-05 오전 10 13 39


Poisson equation: related to average cost


스크린샷 2025-10-05 오전 10 14 13


○ Jg: Value function. Jg is unique


스크린샷 2025-10-05 오전 10 14 45


○ Lg: relative value function. Lg is not unique ( Lg + α1 ∀α ∈ ℝ is also a solution to the Poisson equation)

○ Existence of solutions


스크린샷 2025-10-05 오전 10 15 31


Lemma 9. When not irreducible

① If Pg is not irreducible, the state space S splits into the transient states T and one or more recurrent communicating classes C1, ···

② transient state: visited only finitely many times. Eventually the process leaves the transient states and enters a recurrent state.

Theorem: The stationary distribution πg of a finite-state Markov chain assigns probability 0 to all transient states.


스크린샷 2025-10-07 오후 11 27 58


○ (i) Proof: Pigeonhole principle. The finiteness is critical (and needs to be used).

Suppose that a certain transient state i satisfies Qi = 0. Since a recurrent communicating class is a closed set, no communication occurs with outside nodes once the process enters the recurrent class. The assumption then implies that, starting from the transient state i, the process moves only among transient states for all K transitions. By the pigeonhole principle, at least one transient state is visited twice among the K+1 visited states (counting the start), which yields a directed cycle entirely contained in the transient set. This cycle is closed (no edge leaves it toward the recurrent class within these K steps), so it forms a closed communicating class disjoint from the given recurrent class. In a finite Markov chain, every closed communicating class is recurrent; thus we have produced a second recurrent class, contradicting the assumed uniqueness. Therefore the assumption Qi = 0 is false, and hence Qi > 0 for every transient state i.

○ (ii) Proof


스크린샷 2025-10-07 오후 11 28 44


③ recurrent state: since a recurrent communicating class is a closed set, no communication occurs with outside nodes

○ i → j: means there exists a path with positive probability from i to j

○ i ↔︎ j: means i → j and j → i; i and j communicate

○ positive recurrent: the mean return time to that state is finite.

○ A chain starting in a positive recurrent state has a unique stationary distribution.

○ null recurrent: the mean return time to that state is infinite. No stationary distribution exists.

○ Example: Xn+1 = Xn + ξn, X0 = 0, ℙ(ξn = +1) = ℙ(ξn = −1) = 0.5 → The probability of going back to the origin is 1 but the expected time is ∞.

○ absorbing state: a state that, once entered, you remain in forever

Example 1. Stationary state set F


스크린샷 2025-10-12 오후 3 34 50


Example 2. Finite state space

Let S = {0, 1, ···, I}. Since Vg(0) = 0 and C(0, g(0)) = 0, we can focus only on Ś = S \ {0} = {1, ···, I}, the non-absorbing states. Let Ṽg be the value vector for states in Ś, and let Rg be the submatrix of Pg for transitions among states inside Ś. (i.e., the matrix that describes how the chain moves only among the non-absorbing states before hitting 0.) Then the system of equations for these states is Ṽg = c̃ + RgṼg. To show uniqueness of Ṽg, suppose there are two solutions Ṽ1g, Ṽ2g. Let their difference be Ug = Ṽ1g - Ṽ2g; subtracting the two equations yields Ug = RgUg = ⋯ = (Rg)nUg = ⋯ = 0 (∵ limn→∞ (Rg)n = 0, method of infinite descent) ⇔ Ṽ1g = Ṽ2g. Thus, in a finite state space where state 0 is absorbing and all other states can reach 0, the first-passage-time cost equation has a unique nonnegative solution.

Example 3. Countably infinite state space


스크린샷 2025-10-05 오전 10 40 17

Figure 4. Countably infinite state space diagram


스크린샷 2025-10-14 오후 9 04 27


Uniqueness is not trivial. Assuming the solution is bounded often allows one to show uniqueness. Consider the equation for the difference of two solutions Ug = RgUg. Connecting this to the diagram yields Ug(ℓ+1) - Ug(ℓ) = (λ - 1)(Ug(ℓ) - Ug(ℓ-1)). The consecutive differences Δ(ℓ) = Ug(ℓ+1) - Ug(ℓ) form a geometric sequence with ratio (λ - 1). If |λ - 1| < 1, these differences converge to 0, suggesting a bounded solution. If |λ - 1| ≥ 1, the differences may diverge, implying that uniqueness may fail.

Lemma 10. Martingale

Doob’s theorem

○ σ(X1, X2, ···, Xn): the smallest σ-algebra that makes X1, X2, ···, Xn measurable

○ Doob’s theorem (Doob–Dynkin lemma): a random variable is σ(X1, X2, ···, Xn)-measurable if and only if it can be written as g(X1, X2, ···, Xn) for some measurable function g

○ The larger the σ-algebra, the more functions are measurable with respect to it; i.e., the more information it contains.

② filtration

○ a collection of σ-algebras ordered increasingly by inclusion

○ Ordered by ⊆; if ℱ1 ⊆ ℱ2, then ℱ2 comes later (contains more information) than ℱ1

○ For convenience, let time index t = 0, 1, 2, ⋯; then the filtration is {ℱt}t∈ℤ+ and satisfies ℱs ⊆ ℱt for all s ≤ t

○ Intuition: represents situations where information increases as observations accumulate over time

martingale

○ Property of conditional expectation

○ For any random variable Y, 𝔼[Y | X1, ···, Xn] = 𝔼[Y | σ(X1, ···, Xn)] holds

○ Reason: because every σ(X1, ···, Xn)-measurable random variable is a measurable function of X1, ···, Xn (Doob’s theorem above)

○ Additionally, when σ(Y) ⊂ σ(Z), the tower property 𝔼[𝔼[X | Z] | Y] = 𝔼[𝔼[X | Y] | Z] = 𝔼[X | Y] holds.

○ Martingale: a stochastic process {Xt}t∈ℤ+ adapted to a filtration {ℱt}t∈ℤ+ that satisfies all of the following

Condition 1. Xt is ℱt-measurable for all t ∈ ℤ+

○ If s ≤ t ≤ s′ and ℱs ⊆ ℱt ⊆ ℱs′, then Xt, being ℱt-measurable, is generally not ℱs-measurable (insufficient information) but is ℱs′-measurable.

Condition 2. 𝔼[|Xt|] is finite for all t ∈ ℤ+

Condition 3. 𝔼[Xt | ℱs] = Xs almost surely for all s ≤ t and all t ∈ ℤ+

Interpretation: Given only the information up to time s (ℱs), the best prediction of Xt equals Xs (𝔼[Xt | ℱs] is the orthogonal projection of Xt onto the space of ℱs-measurable random variables, i.e., the ‘best prediction’).

Remark: The martingale property is needed only when predicting the future from the past. In particular, for s > t we have 𝔼[Xt | ℱs] = Xt regardless of whether (Xt) is a martingale (assuming integrability).

○ For s < t, 𝔼[Xs | ℱt] = Xs also holds, because Xs is ℱs-measurable and ℱs ⊆ ℱt, so Xs is also ℱt-measurable.

○ Note: an i.i.d. process is generally not a martingale (except for the constant process)

○ Application: 𝔼[Ug(Xtg) | Xt-1g] = Ug(Xt-1g)
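
A quick numerical illustration of Condition 3, using the standard random-walk martingale (an assumed example, not from the text): conditioning on a fixed value of Xs and averaging the later value Xt over many sample paths returns approximately Xs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, s, t = 100_000, 5, 20
xs = 2.0                                        # condition on a fixed value of X_s

# X_t = X_s + sum of (t - s) further zero-mean i.i.d. increments (a random-walk martingale)
increments = rng.choice([-1.0, 1.0], size=(n_paths, t - s))
xt = xs + increments.sum(axis=1)

print(xt.mean())    # ~2.0 = X_s, illustrating E[X_t | F_s] = X_s
```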

④ Martingale and stochastic control theory


스크린샷 2025-10-15 오전 11 58 24


Lemma 11. Optimal Policy

① Problem definition: cost-to-go function under perfect observation. Since the control inputs {U0, …, Ut} are measurable with respect to {X0, …, Xt}, the following holds:


스크린샷 2025-10-07 오후 1 59 08


② Under the Markov property, the Bellman equation holds. Here, Jtg and VtgM(Xt) are cost-to-go functions from time t onward.


스크린샷 2025-10-07 오후 2 01 58


Markovization theorem (Markov policy sufficiency, reduction to Markov policy)

Theorem: In a finite-horizon MDP, for any general (possibly history-dependent and randomized) policy g, there exists a behavioral Markov policy gM such that, under the same initial distribution μ, the joint distributions of (Xt, Ut) for all t = 0, …, T−1 and of XT ​ are identical. Consequently, the performance Jg = 𝔼g[∑t=0 to T−1 Ct(Xt, Ut) + CT(XT)] equals JgM. Hence, without loss of optimality, one may restrict attention to randomized Markov policies.

Proof


스크린샷 2025-10-10 오전 1 54 36


Comparison principle

Theorem: By working backward from the objective and ensuring the Bellman inequality holds at each step, the initial value V0 serves as a lower bound for the performance of all possible policies. The set of actions that achieve equality at each stage collectively constitute the optimal policy. Thus, the optimality can be verified or constructed by combining locally optimal (stage-wise) choices.


스크린샷 2025-10-07 오후 2 03 17


○ Proof: by backward induction

Case 1. t = T: Since JTg = 𝔼g[CT(XTg) | XTg, …, X1g, X0] = CT(XTg) ≥ VT(XTg) (∵ (V1)) holds, the induction hypothesis holds at the final stage.

Case 2. If the induction hypothesis holds for ℓ = t+1, …, T, then it also holds for ℓ = t, as verified below:


스크린샷 2025-10-07 오후 2 06 19


○ Corollary


스크린샷 2025-10-07 오후 2 06 44


Hamilton-Jacobi-Bellman (HJB) equation

○ Theorem: HJB applies to finite or countably infinite state and action spaces. (The HJB equation is the continuous-time version of the Bellman equation.)


스크린샷 2025-10-07 오후 2 07 21


○ The proof of Theorem 1 was already given under the comparison principle, so the following discussion concerns Theorem 2 only.

○ p: A certain Markov policy gM = {gt} is optimal.

○ q: For all x and t, gt(x) ∈ arg infu∈𝒰 {ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))]} (i.e., stepwise Bellman minimization is achieved)

○ Proof of sufficiency in theorem 2 (q ⇒ p): At each stage, given xt, any policy achieving the infimum is optimal (∵ corollary). Since ut is then a measurable function of the current state xt, the optimal policy is a Markov policy.


스크린샷 2025-10-12 오후 5 29 43


○ Proof of necessity in theorem 2 (p ⇒ q): If a Markov policy is optimal, it must achieve the infimum at each stage (w.p. 1); otherwise, on a set of positive probability we could construct a better policy g’, implying J(g’) < J(g), a contradiction.


스크린샷 2025-10-14 오후 6 28 01

스크린샷 2025-10-14 오후 6 28 52


Application 1. If an optimal path from 1 to 6 is 1 → 2 → 3 → 6, then the optimal path from 2 to 6 must be 2 → 3 → 6, by the HJB equation (principle of optimality).

Application 2. Policy evaluation

Application 3. Vt(x) obtained from the HJB equation is called the value function. It effectively reduces the size of the search space.

Application 4. Q-value (state-action value function) : Qt(x, u) = Ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))], Vt(x) = infu∈𝒰 Qt(x, u)

Application 5. For finite state and action spaces, inf = min, and the optimal policy is a deterministic Markov policy.


스크린샷 2025-10-14 오후 6 32 30


Application 6. Value function under a randomized Markov policy: for u ~ μt(u | i),


스크린샷 2025-10-14 오후 6 32 46
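
A minimal backward dynamic-programming sketch of the recursion Vt(x) = infu Qt(x, u) with Qt(x, u) = Ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))] from Applications 4-5, for a small made-up finite MDP (transition probabilities and costs are illustrative, not from the text):

```python
import numpy as np

n_states, n_actions, T = 3, 2, 5
rng = np.random.default_rng(0)

# P[u, i, j] = P(x_{t+1} = j | x_t = i, u); C[i, u] = per-stage cost (time-homogeneous, made up)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
C = rng.uniform(0, 1, size=(n_states, n_actions))
C_T = np.zeros(n_states)                       # terminal cost

V = C_T.copy()
policy = np.zeros((T, n_states), dtype=int)
for t in reversed(range(T)):
    # Q_t(x, u) = C(x, u) + E[V_{t+1}(x')]; the expectation is a sum over next states
    Q = C + np.stack([P[u] @ V for u in range(n_actions)], axis=1)
    policy[t] = Q.argmin(axis=1)               # deterministic Markov policy achieving the minimum
    V = Q.min(axis=1)                          # value function V_t(x)

print(V)          # V_0(x): optimal cost-to-go from each initial state
print(policy[0])  # optimal first-stage action in each state
```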


Lemma 12. Information state

① In a partial-observation setting, an information state is a process {zt}t=0,···,T satisfying the following:

○ Background: the history Ht := {y0, …, yt, u0, …, ut-1} grows with time, and its domain grows exponentially.

Condition 1. Compression: zt = ℓt(Ht) ∀t

Condition 2. Recursive, policy-independent update: zt+1 = 𝒯t(zt, yt+1, ut) (current information state, new information). That is, zt can be updated recursively from the current information state and the new information, without direct reference to the entire past history.

Condition 3. Independence relative to g0:t-1: ∀t = 0, …, T-1


스크린샷 2025-10-23 오후 8 10 21


Candidate 1. zt = Ht := {y0, …, yt, u0, …, ut-1}

○ π0(i) is as follows:


스크린샷 2025-10-23 오후 8 11 24


Candidate 2. Belief state: zt = πt s.t. πt(i) := ℙ(Xt = i | Ht) = ℙ(Xt = i | y0:t, u0:t-1) ∀i ∈ S

Condition 1 is satisfied: zt = ℓt(Ht) for each t (compression)

Condition 2 is satisfied: we can obtain 𝒯t satisfying πt+1 = 𝒯t(πt, yt+1, ut) directly using Bayes’ rule. This is the update equation for the belief state, often called a nonlinear filter.


스크린샷 2025-10-23 오후 8 15 21
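
A minimal sketch of the belief-state update πt+1 = 𝒯t(πt, yt+1, ut) via Bayes' rule for a two-state POMDP (the transition and observation models below are illustrative assumptions):

```python
import numpy as np

# P[u][i, j] = P(x' = j | x = i, u); O[j, y] = P(y | x' = j)   (made-up models)
P = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def belief_update(pi, u, y):
    """Nonlinear filter: predict with the transition model, then correct with Bayes' rule."""
    predicted = pi @ P[u]                 # P(x_{t+1} = j | y_{0:t}, u_{0:t})
    unnormalized = predicted * O[:, y]    # multiply by the observation likelihood P(y_{t+1} | x_{t+1} = j)
    return unnormalized / unnormalized.sum()

pi = np.array([0.5, 0.5])                 # initial belief pi_0
pi = belief_update(pi, u=0, y=1)
print(pi)                                  # updated belief pi_1
```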


Condition 3 is satisfied: by backward mathematical induction

Case 1. t = T-1: None of the terms depends on g0:T-1.


스크린샷 2025-10-24 오전 9 58 09


Case 2. The backward induction can be established as follows:


스크린샷 2025-10-24 오전 10 22 35


○ An HJB-like theorem holds:


스크린샷 2025-10-24 오전 10 31 26


Application 1. (One-way) Separation theorem


스크린샷 2025-10-24 오전 11 10 57


Application 2. Cost function in belief space


스크린샷 2025-10-24 오전 11 11 17


Application 3. The following is strategy-dependent, unlike the belief state, because it depends on gt-1


스크린샷 2025-10-24 오전 11 11 34


Application 4. The following is strategy-dependent, unlike the belief state: if ut is not included in the conditioning, then πt+1 = 𝔼ut~p(·|y0:t)[𝒯t(πt, yt+1, ut)] holds, so the information-state transition is affected by the policy.


스크린샷 2025-10-24 오전 11 11 51



Posted: 2025.08.26 23:34
