Chapter 9. Stochastic Control Theory
Recommended post: 【Control Theory】 Table of Contents for Control Theory
1. Sigma-algebra
⑵ Sigma-algebra (σ-algebra)
2. Terminology of Stochastic Control Theory
⑴ Variable definitions
① State (system state) xt: denotes a specific value or a random variable; same below
② Observation yt: in the perfect observation case, yt = xt
③ Noise (system noise), Disturbance, Error, Primitive random variable wt, vt
④ System state noise wt
⑤ Observation noise vt
⑥ Primitive random seed creating stochastic uncertainty x0
⑦ Control ut
⑧ Control strategy / Law / Policy gt
⑨ System state sequence xt+1 := ft(xt, ut, wt)
⑩ Observation sequence yt := ht(xt, vt)
⑪ Control input, Action ut := gt(y0:t, u0:t-1). Using all past information is called perfect recall.
⑵ Classification by system sequence
① DDS (deterministic dynamical system): xt+1 := ft(xt, ut, wt) = ft(xt, ut), yt := ht(xt, vt) = ht(xt). At any time t, both the state variable xt and the output variable yt are determined exactly once x0 and u0:t-1 are known.
② SDS (stochastic dynamical system): xt+1 := ft(xt, ut, wt), yt := ht(xt, vt), with wt, vt ≢ 0.
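To make the notation concrete, here is a minimal simulation sketch of an SDS under a simple feedback law, assuming a hypothetical scalar system (the coefficients, noise scales, and the policy ut = -0.5·yt are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar SDS: x_{t+1} = a*x_t + b*u_t + w_t, y_t = x_t + v_t
a, b, T = 0.9, 1.0, 20

x = rng.normal()                 # primitive random variable x_0
for t in range(T):
    w = rng.normal(scale=0.1)    # system noise w_t
    v = rng.normal(scale=0.1)    # observation noise v_t
    y = x + v                    # observation y_t = h_t(x_t, v_t)
    u = -0.5 * y                 # feedback policy u_t = g_t(y_t)
    x = a * x + b * u + w        # state update x_{t+1} = f_t(x_t, u_t, w_t)

print("state after", T, "steps:", x)
```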
⑶ Classification by control input
① Open loop control: ut := gt(y0:t, u0:t-1) = gt(u0:t-1).
② Feedback control: Cases where past outputs y0:t influence the control action u.
③ Centralized stochastic control: (1) Stochastic dynamical system + (2) One controller + (3) Controller with perfect recall
④ Multi-controller problem: team problem, competitive game, etc.
⑷ Classification by policy
① Decision process: a general framework for sequential decision-making problems in which state, action, and reward evolve over time as a process.
② Markov process: (regardless of whether it is a decision process) the future depends only on the current state
○ Markov chain: among Markov processes, it refers to those with a finite or countably infinite state space.
○ Controlled Markov chain: Markov chain + Decision process
③ MDP (Markov decision process): among decision processes, it refers to cases where the future depends only on the current state.
○ Dynamic programming: a recurrence relation that breaks the dependence across time. If MDP refers to the system framework, dynamic programming refers to the solution methodology.
○ POMDP (partially observed Markov decision process): an MDP system where only partial information rather than full state information can be used.
○ Constrained MDP, Constrained POMDP also exist.
④ Gaussian process: the state process {Xt} is such that any finite subset follows a joint Gaussian distribution.
⑤ Gaussian-Markov process
○ Condition 1. {Xt} is a Gaussian process.
○ Condition 2. Markov property: P(Xn+1 ∈ A ㅣ X0, ···, Xn) = P(Xn+1 ∈ A ㅣ Xn)
3. Laws of Stochastic Control Theory
⑴ Lemma 1. In open-loop control, xt is a function of x0, u0:t-1, w0:t-1, and yt is a function of x0, u0:t-1, w0:t-1, vt.
⑵ Lemma 2. Open-loop system vs. Feedback system
① Open loop control: ut := gt(y0:t, u0:t-1) = gt(u0:t-1).
② Feedback control: Cases where past outputs y0:t influence the control action u.
③ Under DDS, open-loop and feedback systems are equivalent.
○ Proof for open-loop → feedback: Given an open-loop control input sequence u, define a feedback policy that ignores the observations and simply returns the predetermined ut at each time t. From the initial state x0 it produces the same trajectory and cost. Thus, regardless of DDS/SDS, for any open-loop policy there exists a feedback policy that is equivalent at that initial state; that is, open-loop ⊂ feedback always holds.
○ Proof for feedback → open-loop: In a DDS, the trajectory generated by a feedback policy from x0 is uniquely determined (determinism), so the realized input sequence u is also determined in advance. Pre-specifying that same sequence as an open-loop policy yields the identical state evolution and hence the identical cost.
④ Under SDS, open-loop and feedback systems are not equivalent.
○ Counterexample 1.
○ In the above counterexample, the feedback system outperforms the open-loop system (i.e., it generates the lower cost).
⑶ Lemma 3. Policy independence: If Wt is independent of X0:t-1, U0:t-1, then ℙ(xt+1g ∈ A ㅣ x0:t, u0:t) = ℙ(xt+1g ∈ A ㅣ xt, ut) = ℙ(ft(xt, ut, wt) ∈ A ㅣ xt, ut) (Markov property), so dependence on policy g disappears.
① In a DDS, knowing the current state (and input) determines the next state immediately; in an SDS the next state is random, so conditional probabilities given the available history become important.
② That is, when wt is independent, system evolution follows natural laws + pure noise, so the policy is irrelevant; but if wt depends on the policy, the policy changes the noise distribution, so the future state distribution depends on the policy.
③ Philosophy: Philosophically, “policy independence” implies that diversified judgments based on individual value assessments are impossible, and that choices become constrained by factual determinations.
⑷ Lemma 4. Gaussian process (GP)
① Definition: the state process {Xt} is such that any finite subset of it follows a joint Gaussian distribution.
② 4-1. Even if each Xi is Gaussian, it does not imply {Xi}i∈ℕ is a GP.
○ Example: Let X1 ~ 𝒩(0, 1) and X2 = X1 I{ㅣX1ㅣ ≤ k} + (-X1) I{ㅣX1ㅣ > k}. Each of X1, X2 is Gaussian, but Y = (X1 + X2) / 2 = X1 I{ㅣX1ㅣ ≤ k} has an atom at 0 and hence is not Gaussian, so {X1, X2} is not a GP.
③ 4-2. For Xt+1 = AXt + BUt + GWt, X0 ~ 𝒩(0, ∑0), Wt ~ 𝒩(0, Q), {Xt} is a GP.
④ 4-3. Under a feedback policy, {Xt} is generally not a GP.
○ Example: If Ut := gt(Yt) = gt(Xt) = Xt2, then X1 = AX0 + BX02 + GW0, which is not Gaussian.
○ On the other hand, in a linear Gaussian SDS, for a general open-loop policy, the state process {Xt} is always Gaussian.
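A small sketch of the open-loop case in 4-3: for a hypothetical linear Gaussian system it propagates the mean and covariance of Xt exactly and checks them against Monte Carlo samples (the matrices A, B, G, Q, ∑0 and the input sequence are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear Gaussian system X_{t+1} = A X_t + B u_t + G W_t (open-loop u_t)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
G = np.eye(2)
Q = 0.05 * np.eye(2)            # W_t ~ N(0, Q)
Sigma0 = 0.1 * np.eye(2)        # X_0 ~ N(0, Sigma0)
u_seq = [np.array([0.1]) for _ in range(10)]   # fixed open-loop inputs

# Exact propagation: mean_{t+1} = A mean_t + B u_t, Sigma_{t+1} = A Sigma_t A^T + G Q G^T
mean, Sigma = np.zeros(2), Sigma0.copy()
for u in u_seq:
    mean = A @ mean + B @ u
    Sigma = A @ Sigma @ A.T + G @ Q @ G.T

# Monte Carlo check that the samples match the Gaussian with this mean/covariance
N = 20000
X = rng.multivariate_normal(np.zeros(2), Sigma0, size=N)
for u in u_seq:
    W = rng.multivariate_normal(np.zeros(2), Q, size=N)
    X = X @ A.T + (B @ u) + W @ G.T
print("analytic mean :", mean, "  empirical mean:", X.mean(axis=0))
print("analytic cov  :\n", Sigma)
print("empirical cov :\n", np.cov(X.T))
```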
⑤ (Note) MMSE (minimum mean-square estimator)
⑥ (Note) Orthogonality principle
⑦ (Note) LMMSE (linear minimum mean-square estimator)
⑧ If X and Y are jointly Gaussian, then LMMSE = MMSE holds.
⑸ Lemma 5. Multi-step prediction
① In general, ℙ(xt+2g ∈ A ㅣ xt, ut, ut+1) ≠ ℙ(xt+2g ∈ A ㅣ x0:t, u0:t+1)
○ Proof: Consider xt → yt → ut → xt+1 → yt+1 → ut+1 → xt+2. Since ut+1 = gt+1(y0:t+1, u0:t) depends on yt+1, and yt+1 depends on xt+1 = f(xt, ut, wt), conditioning on ut+1 carries information about wt and breaks the independence of wt from the past; here "past" means x0:t-1, u0:t-1.
○ Counterexample 1. In open-loop control, ut+1 = gt+1(u0:t), so ut+1 carries no information about wt, hence equality holds.
○ Counterexample 2. When wt is a constant
○ Counterexample 3. When ut is a memoryless (Markov) feedback, e.g., ut = μt(xt) under perfect observation yt = xt.
② Multi-step prediction with open-loop control
③ Chapman-Kolmogorov decomposition
⑹ Lemma 6. Linear Gaussian state-space model
① (Note) Gaussian-Markov process
○ Condition 1. {Xt} is a Gaussian process.
○ Condition 2. Markov property: P(Xn+1 ∈ A ㅣ X0, ···, Xn) = P(Xn+1 ∈ A ㅣ Xn)
② System definition
○ Markov property: applies even with feedback policy.
○ Multi-step Markov property
○ Mean propagation
○ Cross-covariance Cov(Xt+m, Xt)
○ Covariance propagation
③ DALE (discrete-time algebraic Lyapunov equation)
○ If the absolute values of all eigenvalues (including complex ones) of a square matrix A are less than 1, the matrix is called stable, because limk→∞ Ak = 0.
○ If A is stable, then ∑∞ = limt→∞ ∑t = limt→∞ 𝔼[(Xt - 𝔼[Xt])(Xt - 𝔼[Xt])ᵀ] exists uniquely.
○ Proof of uniqueness of ∑∞
○ Remark 1. Stability of A is a sufficient, but not necessary condition.
○ ∑∞ may still exist uniquely even if A is not stable.
○ A trivial example is given by ∑0 = 0, Q = 0, in which case ∑k ≡ 0 independent of A. (No noise in the first place.)
○ Remark 2. ∑∞ may not be strictly positive definite.
○ A trivial example is A = O, rank(GQGT) < n. (Noise does not touch all directions in the state.)
○ Remark 3. If the input disturbance wk affects all components of the state vector, then stability of A becomes necessary for the convergence of ∑k, and the limiting covariance ∑∞ is positive definite → this is where the concept of reachability enters (a numerical check of the DALE follows below).
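The DALE can be checked numerically. The sketch below iterates the covariance propagation ∑t+1 = A∑tAᵀ + GQGᵀ for a hypothetical stable A and compares the limit with SciPy's discrete Lyapunov solver (using scipy.linalg.solve_discrete_lyapunov is an implementation choice, not from the text):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical stable A (spectral radius < 1) and noise covariance
A = np.array([[0.5, 0.2], [0.0, 0.7]])
G = np.eye(2)
Q = 0.1 * np.eye(2)
GQGt = G @ Q @ G.T

# Iterate the covariance propagation Sigma_{t+1} = A Sigma_t A^T + G Q G^T
Sigma = np.zeros((2, 2))
for _ in range(500):
    Sigma = A @ Sigma @ A.T + GQGt

# Compare with the DALE fixed point Sigma_inf = A Sigma_inf A^T + G Q G^T
Sigma_inf = solve_discrete_lyapunov(A, GQGt)
print(np.allclose(Sigma, Sigma_inf, atol=1e-10))  # True when A is stable
```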
④ Reachability
○ Definition: Related to controllability and observability.
○ Theorem 1. The following are all equivalent: assuming w ∈ ℝs
○ In condition 3, the noise sequence w should be interpreted as a control input applied to the system; driven by it, the system can be steered from 0 to any given state x over n time steps.
○ Theorem 2. Lyapunov stability test
○ Note that in condition 2, it is PD (positive definite), not PSD (positive semidefinite).
⑺ Lemma 7. Graph Theory
① Strongly connected (= irreducible, communicating): a condition where from any node i in the graph one can reach any other node j.
Figure 1. Example of “irreducible”
Figure 2. Example of “reducible” (state 3 is a sink)
② Period: the period of a specific node i is the greatest common divisor of the lengths of all paths from i back to i
○ Example: two nodes A, B connected by the two edges A → B and B → A; the period of each node is 2.
○ A transition matrix with period m can be brought into the following block form by a state rearrangement QᵀPQ for a suitable permutation matrix Q.
Figure 3. Example of a transition matrix with period m
The blocks cycle as S1 → S2 → ··· → Sm → S1 → ···.
③ Aperiodic: all nodes have period 1.
○ Aperiodic ⊂ Irreducible
○ Example: if every node has a self-loop (a walk of length 1 back to itself), the chain is aperiodic.
④ Stationary (time-homogeneous): if the transition probability Pr(xn ㅣ xn-1) does not depend on n, the Markov process is called time-homogeneous (time-invariant).
⑤ Regular
○ Regular ⊂ Irreducible
○ For some natural number k, every entry of the power Mk of the transition matrix M is positive (i.e., nonzero).
⑥ Transition matrix
⑦ Markov policy: ut = gt(xt)
⑧ One can prove the second law of thermodynamics (law of increasing entropy) using a Markov process.
○ Because one can simulate the law of diffusion: provided a uniform stationary distribution is assumed.
○ Related concept: random walk
⑨ Perron-Frobenius theorem
○ Theorem 1. If a Markov chain with transition matrix P is strongly connected, there exists exactly one stationary distribution q.
○ The stationary distribution satisfies Pq = q.
○ Theorem 2. If a finite Markov chain with transition matrix P is strongly connected and aperiodic, it is called an Ergodic Markov chain and satisfies:
○ Convention here: Pij is the probability of transition from node j to node i, so each column sums to 1 (∑i Pij = 1). Note that in the other Lemmas below, Pij means the probability of transition from node i to node j.
○ 2-1. The (i,j) entry of Pk, Pij(k), converges to qi as k → ∞: note that it converges to the same value for fixed i regardless of j.
○ 2-2. Regardless of the initial state x0, the distribution of the k-th state xk converges to q as k → ∞.
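A minimal sketch of Theorems 1 and 2 for a hypothetical 3-state chain, using the column-stochastic convention Pq = q from the note above; both the eigenvector of eigenvalue 1 and power iteration recover the same stationary distribution (the matrix entries are illustrative):

```python
import numpy as np

# Hypothetical column-stochastic transition matrix: P[i, j] = P(j -> i), columns sum to 1
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])

# Method 1: eigenvector for eigenvalue 1 (Perron-Frobenius), normalized to sum to 1
vals, vecs = np.linalg.eig(P)
q = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
q = q / q.sum()

# Method 2: power iteration P^k x0 -> q for any initial distribution x0
x = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    x = P @ x

print(q, x)   # both approximate the unique stationary distribution
```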
⑻ Lemma 8. Value function following deterministic Markov property
① Expected cost and transition probability
② Recursive and backward iteration: Dynamic programming
○ JTg ∈ ℝ1×1
○ π0 ∈ ℝ1×n: initial distribution of the Markov chain
○ V0g ∈ ℝn×1: vector of state-wise value functions collecting expected cumulative cost at each state under policy g
○ When T = ∞, Jg generally becomes infinite, so the optimal policy g cannot be found by comparing costs directly; hence the (discounted) Bellman equation and the Cesàro limit are introduced.
③ Bellman equation: The below describes the discounted cost problem primarily.
○ (Note) Time-homogeneous: {xtg}t≥0 and {xtg}t≥τ,∀τ∈ℤ+ follow the same distribution. Also means strictly stationary.
○ Condition 1. Time-homogeneous transition: Pt(j ㅣ i, u) = P(j ㅣ i, u) ∀t
○ Condition 2. Time-homogeneous cost: Ct(x, y) = C(x, y) ∀t
○ Condition 3. Stationary policy: gt = g ∀t
○ If all the above hold, one can obtain the following fixed-point equation.
○ Jg: The present value of the cost; generally used in an economic context.
○ Since the eigenvalues of the stochastic matrix Pg have absolute value at most 1 and β ∈ (0, 1), every eigenvalue of βPg has absolute value less than 1, so det(I - βPg) = βn det((1/β)I - Pg) ≠ 0 and (I - βPg) is invertible.
○ Vg ∈ ℝn×1: vector of state-wise value functions collecting expected discounted cumulative cost at each state under policy g
○ Pg ∈ ℝn×n: transition matrix; the (i,j) entry is the probability of transition from i to j.
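Under the assumptions above, policy evaluation reduces to one linear solve, Vg = (I - βPg)⁻¹cg. A minimal sketch with a hypothetical row-stochastic Pg and per-state cost cg (illustrative numbers):

```python
import numpy as np

beta = 0.95
# Hypothetical row-stochastic P^g ((i, j) entry = transition probability i -> j) and per-state cost c^g
Pg = np.array([[0.8, 0.2, 0.0],
               [0.1, 0.7, 0.2],
               [0.0, 0.3, 0.7]])
cg = np.array([1.0, 0.5, 2.0])

# Fixed point of V = c^g + beta * P^g V  =>  V^g = (I - beta P^g)^{-1} c^g
Vg = np.linalg.solve(np.eye(3) - beta * Pg, cg)

# Total discounted cost from the initial distribution pi_0: J^g = pi_0 V^g
pi0 = np.array([1.0, 0.0, 0.0])
print(Vg, pi0 @ Vg)
```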
④ Cesàro limit: related to the long-term average cost problem.
⑤ Poisson equation: related to average cost.
○ Jg is unique.
○ Lg: relative value function. Lg is not unique (∵ Lg + α1 ∀α ∈ ℝ is also a solution to the Poisson equation)
○ Existence of solutions
⑼ Lemma 9. When not irreducible
① If Pg is not irreducible, the state space S splits into the transient states T and one or more recurrent communicating classes C1, ···
② Transient state: visited only finitely many times. Eventually the process leaves the transient states and enters a recurrent state.
○ Theorem: The stationary distribution πg of a finite-state Markov chain assigns probability 0 to all transient states.
○ (i) Proof: Pigeonhole principle. The finiteness is critical (and needs to be used).
○ Suppose some transient state i satisfies Qi = 0, where Qi is the probability of ever leaving the transient set (i.e., entering a recurrent class) starting from i. Since a recurrent communicating class is a closed set, no communication occurs with outside nodes once the process enters it; the assumption therefore means that, starting from i, the chain visits only transient states over any K transitions. By the pigeonhole principle, among the K+1 visited states (counting the start) at least one transient state is visited twice, yielding a directed cycle contained entirely in the transient set. Taking K large enough, the states of such a cycle that the chain can traverse forever contain a closed communicating class inside the transient set. In a finite Markov chain every closed communicating class is recurrent, so some supposedly transient states would be recurrent, a contradiction. Hence the assumption Qi = 0 is false, Qi > 0 for every transient state i, and consequently the stationary distribution assigns probability 0 to the transient states.
○ (ii) Proof
③ Recurrent state: since a recurrent communicating class is a closed set, no communication occurs with outside nodes.
○ i → j: means there exists a path with positive probability from i to j.
○ i ↔︎ j: means i → j and j → i; i and j communicate.
○ Positive recurrent: the mean return time to that state is finite.
○ An irreducible chain whose states are positive recurrent has a unique stationary distribution.
○ Null recurrent: the mean return time to that state is infinite. No stationary distribution exists.
○ Example: Xn+1 = Xn + ξn, X0 = 0, ℙ(ξn = +1) = ℙ(ξn = −1) = 0.5 → The probability of going back to the origin is 1 but the expected time is ∞.
○ Absorbing state: a state that, once entered, you remain in forever
④ Example 1. Stationary state set F
⑤ Example 2. Finite state space
Let S = {0, 1, ···, I}. Since Vg(0) = 0 and C(0, g(0)) = 0, we can focus only on Ś = S \ {0} = {1, ···, I}, the non-absorbing states. Let Ṽg be the value vector for states in Ś, and let Rg be the submatrix of Pg for transitions among states inside Ś. (i.e., the matrix that describes how the chain moves only among the non-absorbing states before hitting 0.) Then the system of equations for these states is Ṽg = c̃ + RgṼg. To show uniqueness of Ṽg, suppose there are two solutions Ṽ1g, Ṽ2g. Let their difference be Ug = Ṽ1g - Ṽ2g; subtracting the two equations yields Ug = RgUg = ⋯ = (Rg)nUg = ⋯ = 0 (∵ limn→∞ (Rg)n = 0, method of infinite descent) ⇔ Ṽ1g = Ṽ2g. Thus, in a finite state space where state 0 is absorbing and all other states can reach 0, the first-passage-time cost equation has a unique nonnegative solution.
⑥ Example 3. Countably infinite state space
Figure 4. Countably infinite state space diagram
Uniqueness is not trivial. Assuming the solution is bounded often allows one to show uniqueness. Consider the equation for the difference of two solutions Ug = RgUg. Connecting this to the diagram yields Ug(ℓ+1) - Ug(ℓ) = (λ - 1)(Ug(ℓ) - Ug(ℓ-1)). The consecutive differences Δ(ℓ) = Ug(ℓ+1) - Ug(ℓ) form a geometric sequence with ratio (λ - 1). If |λ - 1| < 1, these differences converge to 0, suggesting a bounded solution. If |λ - 1| ≥ 1, the differences may diverge, implying that uniqueness may fail.
⑽ Lemma 10. Martingale
① Doob’s theorem
○ σ(X1, X2, ···, Xn): the smallest σ-algebra that makes X1, X2, ···, Xn measurable.
○ Doob’s theorem (Doob-Dynkin lemma): a random variable is σ(X1, X2, ···, Xn)-measurable if and only if it can be written as g(X1, X2, ···, Xn) for some measurable function g; in this sense σ(X1, ···, Xn) corresponds to the collection of all such functions.
○ The larger the σ-algebra, the more functions are measurable with respect to it; i.e., the more information it contains.
② Filtration
○ A collection of σ-algebras ordered increasingly by inclusion.
○ Ordered by ⊆; if ℱ1 ⊆ ℱ2, then ℱ2 corresponds to a later time (more information) than ℱ1.
○ For convenience, let time index t = 0, 1, 2, ⋯; then the filtration is {ℱt}t∈ℤ+ and satisfies ℱs ⊆ ℱt for all s ≤ t.
○ Intuition: represents situations where information increases as observations accumulate over time.
③ Martingale
○ Property of conditional expectation
○ For any random variable Y, 𝔼[Y ㅣ X1, ···, Xn] = 𝔼[Y ㅣ σ(X1, ···, Xn)] holds.
○ Reason: because σ(X1, ···, Xn) is equivalent to the set of all functions generated by X1, ···, Xn.
○ Additionally, when σ(Y) ⊂ σ(Z), the tower property 𝔼[𝔼[X ㅣ Z] ㅣ Y] = 𝔼[𝔼[X ㅣ Y] ㅣ Z] = 𝔼[X ㅣ Y] holds.
○ Martingale: a stochastic process {Xt}t∈ℤ+ adapted to a filtration {ℱt}t∈ℤ+ that satisfies all of the following
○ Condition 1. Xt is ℱt-measurable for all t ∈ ℤ+.
○ If s ≤ t ≤ s′ and ℱs ⊆ ℱt ⊆ ℱs′, then Xt (which is ℱt-measurable) is ℱs′-measurable but in general not ℱs-measurable (insufficient information).
○ Condition 2. 𝔼[ㅣXtㅣ] is finite for all t ∈ ℤ+.
○ Condition 3. 𝔼[Xt ㅣ ℱs] = Xs, almost surely for all s ≤ t and all t ∈ ℤ+
○ Interpretation: Given only the information up to time s (ℱs), the optimal prediction of Xt equals Xs (i.e., the prediction is constrained to Xs; 𝔼[Xt ㅣ ℱs] is an orthogonal projection of Xt into the ℱs-measurable random variable space (‘best prediction’)).
○ Remark: The martingale property is needed only when predicting the future from the past. In particular, for s > t we have 𝔼[Xt ㅣ ℱs] = Xt regardless of whether Xt is a martingale (assuming integrability).
○ For s < t, 𝔼[Xs ㅣ ℱt] = Xs also holds, because Xs is ℱs-measurable and ℱs ⊆ ℱt, so ℱt already contains all the information about Xs.
○ Note: an i.i.d. process is generally not a martingale (except for the constant process).
○ Application: 𝔼[Ug(Xtg) ㅣ Xt-1g] = Ug(Xt-1g)
④ Martingale and stochastic control theory
⑾ Lemma 11. (Fully observed) ― Optimal Policy
① Problem definition: cost-to-go function under perfect observation. Since the control inputs U0, …, Ut are measurable with respect to σ(X0, …, Xt), the following holds:
② When the Markov property holds, the Bellman equation is obtained. Here, Jtg and VtgM(Xt) are cost-to-go functions from t onward.
③ Markovization theorem (Markov policy sufficiency, reduction to Markov policy)
○ Theorem: In a finite-horizon MDP, for any general (possibly history-dependent and randomized) policy g, there exists a behavioral Markov policy gM such that, under the same initial distribution μ, the joint distributions of (Xt, Ut) for all t = 0, …, T−1 and of XT are identical. Consequently, the performance (cost) Jg = 𝔼g[∑t=0 to T−1 Ct(Xt, Ut) + CT(XT)] equals JgM. Hence, without loss of optimality, one may restrict attention to randomized Markov policies.
○ Proof
④ Comparison principle
○ Theorem: By working backward from the objective and ensuring the Bellman inequality holds at each step, the initial value V0 serves as a lower bound for the performance of all possible policies. The set of actions that achieve equality at each stage collectively constitute the optimal policy. Thus, the optimality can be verified or constructed by combining locally optimal (stage-wise) choices.
○ Proof: Using mathematical induction on backward
○ Case 1. t = T: Since JTg = 𝔼g[CT(XTg) ㅣ XTg, …, X1g, X0] = CT(XTg) ≥ VT(XTg) (∵ (V1)), the induction hypothesis holds at ℓ = T.
○ Case 2. If the induction hypothesis holds for ℓ = t+1, …, T, then it also holds for ℓ = t, as verified below:
○ Corollary
⑤ Hamiltonian-Jacobi-Bellman (HJB) equation
○ Theorem: HJB is applicable to finite / countably infinite state / action spaces; it is the continuous-time version of the Bellman equation.
○ The proof of Theorem 1 was already given under the comparison principle, so the following explanations concern only Theorem 2.
○ p: A certain Markov policy gM = {gt} is optimal.
○ q: Given ∀x, t, gt(x) ∈ arg infu∈𝒰 {ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))]} (i.e., stepwise Bellman minimization is achieved)
○ Proof of sufficiency in Theorem 2 (q ⇒ p): For each stage, given xt, any policy achieving the infimum is optimal (∵ corollary). Since ut is then a measurable function of the current state xt only, the optimal policy is a Markov policy.
○ Proof of necessity in Theorem 2 (p ⇒ q): If a Markov policy is optimal, it must achieve the infimum at each stage (w.p.1); otherwise, on a set of positive probability we could construct a better policy g’, implying J(g’) < J(g), a contradiction.
○ Corollary
○ Application 1. If an optimal path from 1 to 6 is 1 → 2 → 3 → 6, then the optimal path from 2 to 6 must be 2 → 3 → 6 by the HJB equation.
○ Application 2. Policy evaluation: discussed below.
○ Application 3. Vt(x) obtained from the HJB equation is called the value function; it effectively reduces the size of the search space (a backward-induction sketch appears after this list).
○ Application 4. Q-value (state-action value function) : Qt(x, u) = Ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))], Vt(x) = infu∈𝒰 Qt(x, u)
○ Application 5. For given finite state / action space, inf = min, and the optimal policy is deterministic Markov policy.
○ Application 6. Value function at randomized Markov policy: For u ~ μt(u ㅣ i),
○ If Y is independent of the state and does not appear directly in the reward, then Y is irrelevant for decision-making: a decision rule that ignores y is always at least as good. In other words, having more information is not always better.
○ Example: Suppose a doctor has to decide on a treatment for a patient. Every morning, the doctor can observe the patient’s health status X, and in addition also knows what day of the week it is, Y. The available actions are “treat” and “do not treat.” If the day of the week is unrelated to both the expected reward (survival probability) and the health status, then making decisions based only on X yields the same expected survival probability as using both X and Y.
○ However, depending on technical assumptions such as the action space and measurability, the statement may need to be formulated in terms of ε-optimal policies, and when countability/non-countability and measurability traps are involved (e.g., the projection of a Borel set can be non-Borel), there are counterexamples in which global approximate dominance fails.
○ Conclusion: In dynamic programming problems, one can mathematically prove that memoryless strategies are sufficient to optimize the expected reward.
○ Application: In finite-horizon Markov decision problems, one can easily prove that Markov policies are optimal. (ref)
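As announced in Application 3, here is a backward-induction sketch of the finite-horizon Bellman/HJB recursion Vt(x) = minu Qt(x, u), Qt(x, u) = Ct(x, u) + 𝔼Wt[Vt+1(ft(x, u, Wt))], written for a hypothetical finite MDP given directly by transition probabilities P(j ㅣ i, u) (the random P and C are illustrative):

```python
import numpy as np

# Hypothetical finite-horizon MDP: 3 states, 2 actions, horizon T
nS, nA, T = 3, 2, 5
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[i, u, j] = P(j | i, u)
C = rng.uniform(0.0, 1.0, size=(nS, nA))        # stage cost C_t(i, u) (time-homogeneous here)
CT = np.zeros(nS)                               # terminal cost C_T(x)

V = np.zeros((T + 1, nS))
g = np.zeros((T, nS), dtype=int)                # deterministic Markov policy g_t(i)
V[T] = CT
for t in range(T - 1, -1, -1):                  # backward induction (comparison principle)
    Q = C + P @ V[t + 1]                        # Q_t(i, u) = C(i, u) + sum_j P(j|i,u) V_{t+1}(j)
    g[t] = Q.argmin(axis=1)                     # stepwise Bellman minimization (HJB)
    V[t] = Q.min(axis=1)

print(V[0])     # optimal cost-to-go from each initial state
print(g[0])     # optimal first-stage actions
```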
⑿ Lemma 12. (Partially observed) ― Information state
① An information state is a process {zt}t=0,···,T satisfying the following in the partial-observation setting
○ Background: The history Ht := {y0, …, yt, u0, …, ut-1} grows with time, and its domain grows exponentially (curse of dimensionality).
○ Condition 1. Compression: zt = ℓt(Ht) ∀t
○ Condition 2. Policy-independent recursive update: zt+1 = 𝒯t(zt, yt+1, ut) (current information state zt, new information yt+1 and ut). That is, zt can be updated recursively from the current information state and the new information, without direct reference to the entire past history and without the update map 𝒯t depending on the policy.
○ Condition 3. Independence relative to g0:t-1: ∀t = 0, …, T-1
② Candidate 1. zt = Ht := {y0, …, yt, u0, …, ut-1}
○ π0(i) is as follows:
③ Candidate 2. Belief state: zt = πt s.t. πt(i) := ℙ(Xt = i ㅣ Ht) = ℙ(Xt = i ㅣ y0:t, u0:t-1) ∀i ∈ S
○ Condition 1 is satisfied: zt = ℓt(Ht) in time (compression)
○ Condition 2 is satisfied: we can obtain 𝒯t satisfying πt+1 = 𝒯t(πt, yt+1, ut) directly using Bayes’ rule. This is the update equation for the belief state, often called a nonlinear filter (see the sketch after this list).
○ Condition 3 is satisfied: using backward mathematical induction.
○ Case 1. t = T-1: All terms do not depend on g0:T-1.
○ Case 2. The backward induction can be established as follows:
○ Bellman equation of belief-MDP: The key message of this theorem is that, although the overall policy space 𝒢 of an MDP consists of all mappings from histories to actions and is therefore extremely large, the belief Πt is sufficient, so we lose no optimality by restricting attention to separated policies of the form “belief → action.”
○ Proof
○ Application 1. (One-way) Separation theorem
○ Application 2. Cost function in belief space
○ Application 3. The following is strategy-dependent, unlike the belief state, because it depends on gt-1.
○ Application 4. The following is strategy-dependent, unlike the belief state: if ut is not included as a condition, then πt+1 = 𝔼ut ~ p(· ㅣ Y0:t)[𝒯t(πt, yt+1, ut)], so the information-state transition is affected by the policy.
○ Application 5. Generally, Pg(Xt+1 ∈ A ㅣ Ht, ut) ≠ Pg(Xt+1 ∈ A ㅣ yt, ut), Ht = (Y0:t, U0:t-1)
○ Proof: Suppose t = 1 and consider x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. Given H1 = (y0:1, u0), we can infer the distribution of x0 via y0 → x0, and hence a sharper distribution of x1 via both y1 → x1 and (x0, u0) → x1; the distribution of x2 then follows via (x1, u1, w1). On the right-hand side, only y1 → x1 and the given u1 are available to determine the distribution of x1, leading to a less informative distribution of x2. Hence the two sides differ.
○ Application 6. Pg(Xt+1 ∈ A ㅣ Ht, ut) = Pg(Xt+1 ∈ A ㅣ y0:t, ut)
○ Proof: Consider x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. Given the policy g, on the right-hand side we can determine the distributions of u0 and x0 via y0 → u0 and y0 → x0, respectively, and hence the distribution of x1 via (x0, u0) → x1. By repeatedly applying ut = gt(y0:t, u0:t-1) and xt+1 = ft(xt, ut, wt), we can determine the distributions of all the variables, and in particular the distribution of xt+1. This is identical to the distribution obtained from the left-hand side.
○ Conclusion: The information state Ztg := (Y0:tg, U0:t-1g) is a function of Y0:tg.
○ Application 7. P(Xt+1 ∈ A ㅣ Ht, ut) ≠ P(Xt+1 ∈ A ㅣ y0:t, ut)
○ Proof: Let’s think of x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. In the right-hand side, we can determine the distribution of x0 via y0 → x0 but can only determine the distribution of u0 via y0 → u0 on the probability over policy set. However, u0 is exactly given in the left-hand side. Therefore, both sides differ. We can conclude that Pg(Xt+1 ∈ A ㅣ Ht, ut) is policy-independent, while Pg(Xt+1 ∈ A ㅣ y0:t, ut) is policy-dependent.
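As referenced in the Condition 2 discussion, here is a minimal sketch of the belief-state update (nonlinear filter) πt+1 = 𝒯t(πt, yt+1, ut) via Bayes' rule, for a hypothetical finite POMDP with action-dependent transition matrices and an observation likelihood matrix (all numbers are illustrative):

```python
import numpy as np

def belief_update(pi_t, u_t, y_next, P, O):
    """Nonlinear filter pi_{t+1} = T_t(pi_t, y_{t+1}, u_t) via Bayes' rule.
    P[u][i, j] = P(X_{t+1}=j | X_t=i, u), O[j, y] = P(Y=y | X=j)."""
    pred = pi_t @ P[u_t]                 # prediction: sum_i pi_t(i) P(j | i, u_t)
    post = O[:, y_next] * pred           # correction: multiply by observation likelihood
    return post / post.sum()             # normalize (Bayes' rule denominator)

# Hypothetical 2-state, 2-action, 2-observation POMDP
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # transitions under action 0
     np.array([[0.5, 0.5], [0.5, 0.5]])]   # transitions under action 1
O = np.array([[0.8, 0.2],                  # P(y | x = 0)
              [0.3, 0.7]])                 # P(y | x = 1)

pi = np.array([0.5, 0.5])                  # initial belief pi_0
for u, y in [(0, 1), (0, 0), (1, 1)]:      # a short history of (u_t, y_{t+1}) pairs
    pi = belief_update(pi, u, y, P, O)
print(pi)                                  # belief state pi_t, i.e., the information state z_t
```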
⒀ Lemma 13. Dynamic Program
① If Vt(i) = max{r(i), a + b∑j∈S ℙ(j ㅣ i) Vt+1(j)} = max{r(i), a + b𝔼[Vt+1(j) ㅣ i]}, Vt(i) ≥ Vt+1(i) is established.
② If Vt(x) = max{-c + p(x)(1 + Vt+1(x-1)) + (1 - p(x))Vt+1(x), Vt+1(x)}, VN+1(x) = 0, the following is established:
○ Monotonicity on time: Vt(x) ≥ Vt+1(x) (Proof)
○ Monotonicity on x: Vt(x) ≥ Vt(x-1) (Proof)
○ Supremum of marginal value: 1 ≥ Vt(x) - Vt(x-1) (Proof)
○ Concavity on x is not established: Vt(x) - Vt(x-1) ≤ Vt(x-1) - Vt(x-2) (Counterexamples exist)
○ Relationship between marginal value and time: Vt(x) - Vt(x-1) ≥ Vt+1(x) - Vt+1(x-1) (Proof)
○ Existence of threshold: Gt(x) = p(x)(1 - Δt+1(x)) is nondecreasing on x (Proof)
③ Convex
○ Lemma 1. Given two convex functions f1 and f2, max{f1, f2} is also convex.
Figure 5. Maximum of two convex functions is convex.
○ Lemma 2. Sum of two convex functions is convex.
○ Lemma 3. If V(x) is a non-decreasing convex function, then V(max{x, a}) is also convex for every a: it can be easily understood geometrically.
○ Lemma 4. If L(π) is a convex function, then it can be written as L(π) = supi∈I {αi π + βi} with αi, βi ∈ ℝ.
○ The above comes from the inequality f(x) ≥ f(x0) + f’(x0)(x - x0); it is not a claim that the formula holds for arbitrary choices of αi and βi.
○ Example: If L(π) = π2, then π2 = supx0∈ℝ {2x0π - x02}.
○ Practical point: Using transformations that preserve convexity in the linear (affine) representation, one can show that applying the same transformation to the original convex function also preserves convexity. An example is as follows.
⒁ Lemma 14. (Stochastic policy) DCOE (discounted cost optimality equation, infinite-horizon discounted cost Bellman equation)
① Contraction mapping theorem (Banach fixed-point theorem)
○ Theorem
○ Let F be a Banach space. Here, Banach space refers to a complete normed space, and a set being “complete” means every Cauchy sequence in the set converges to a certain element in the set. Let T: F → F be a transformation satisfying the following: ㅣㅣTx - Tyㅣㅣ ≤ βㅣㅣx - yㅣㅣ, ∃β ∈ (0,1), ∀x, y ∈ F. Then, the following is established:
○ There is a unique fixed point w ∈ F satisfying Tw = w.
○ For any x ∈ F, limn→∞ Tnx = w.
○ This essentially implies the following:
○ A transformation satisfying the above is called a contraction. Strictly speaking, the modulus must be uniform: supx≠y ㅣㅣTx - Tyㅣㅣ / ㅣㅣx - yㅣㅣ ≤ β < 1.
○ T is continuous, and specifically is Lipschitz continuous.
○ Proof
○ Let x ∈ F and α = ㅣㅣx - Txㅣㅣ. Then ㅣㅣTnx - Tn+1xㅣㅣ ≤ βnα. Considering the sequence {x, Tx, T2x, ···}, we show it is Cauchy: ∀ϵ > 0, ∃Nϵ s.t. ∀n, m ≥ Nϵ, ㅣㅣTnx - Tmxㅣㅣ < ϵ. Without loss of generality, take n > m. Then we have
○ Thus we can choose N such that αβN / (1 - β) < ϵ. Since F is a Banach space, the Cauchy sequence converges to some w, i.e., limn→∞ Tnx = w. By continuity of T, T(limn→∞ Tnx) = limn→∞ Tn+1x = w, so w is a fixed point. If w1, w2 are both fixed points of T, then ㅣㅣw1 - w2ㅣㅣ = ㅣㅣTw1 - Tw2ㅣㅣ ≤ βㅣㅣw1 - w2ㅣㅣ, which forces ㅣㅣw1 - w2ㅣㅣ = 0, so w1 = w2.
○ Intuition

② Bellman operator and contraction theorem
○ Theorem
○ Let F be the set of functions F = {z: S → ℝ}, where S = {1, 2, ···, I} and z := (z(1), ···, z(I))T, equipped with the norm ㅣㅣzㅣㅣ = maxi ㅣz(i)ㅣ (i.e., ㅣㅣ·ㅣㅣ∞). Define the operator T: F → F componentwise by Tz(i) = minu∈𝒰 [C(i, u) + β∑j∈S ℙ(j ㅣ i, u) z(j)] for all i ∈ S. Then T is a contraction mapping.
○ Proof
○ Let ℝI be a Banach space with a norm of ㅣㅣ·ㅣㅣ∞. If z, y ∈ F, then we can prove the theorem as follows:
○ Even if we replace it with a peculiar definition that maximizes the cost, the contraction theorem still holds with the same proof structure.
③ Corollary
○ ∀i ∈ S, W∞(i) = minu∈𝒰 [C(i, u) + β∑j∈S ℙ(j ㅣ i, u) W∞(j)] has a unique solution W∞.
○ Relationship between DCOE and geometric distribution
○ W∞(i) = infg∈𝒢 Jg(i) = infg∈𝒢 𝔼g[∑t=0 to ∞ βtc(xt, ut) ㅣ X0 = i], ∀i ∈ S
○ Definition of Jg(i) and bound
○ Definition of W∞: By the property of contraction mapping, the decreasing sequence {Wn} converges to W∞.
○ Jg(i) ≥ W∞ (lower bound)
○ W∞(i) ≥ infg Jg(i) (upper bound): it can be represented as J(X0, π*) ≤ ∑τ=0 to t-1 βτ Cτ(Xτ, πτ*) + βt𝔼[Vt(Xt, πt)].
○ Conclusion: W∞(i) = infg Jg(i)
○ The optimal stationary Markov policy: g*(i) ∈ argminu∈𝒰 [c(i, u) + β∑j∈S ℙ(j ㅣ i, u) W∞(j)]
○ ⇔ W∞ = cg* + βPg*W∞
○ ⇔ W∞ = (I - βPg*)-1 cg*
○ Optimal operator T is defined as follows:
○ By definition, TZ ≤ TgZ
○ Tg and T are both contraction mapping on ℓ∞ norm.
○ Tg and T both satisfy the monotonicity: if z ≤ y, Tgz ≤ Tgy and Tz ≤ Ty
○ Proof 1
○ Proof 2
○ If h, g ∈ 𝒢SMP and ThW∞g ≤ W∞g, then W∞h ≤ W∞g: by monotonicity we obtain ThnW∞g ≤ ⋯ ≤ ThW∞g ≤ W∞g, and letting n → ∞ gives the conclusion.
④ Algorithm 1. Value iteration
○ A method that updates the value function by repeatedly applying the Bellman optimality operator T as Vk+1 = TVk.
○ Because of the contraction mapping property, if β < 1 then Vk → V* converges to the unique fixed point, and the greedy policy with respect to V* is optimal.
○ A typical stopping criterion is ㅣㅣVk+1 - Vkㅣㅣ∞ < ε , etc.
○ Characteristic: It has the advantage of simple computation, but the disadvantage of requiring many iterations.
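A minimal value-iteration sketch matching the description above (hypothetical random MDP; the stopping tolerance is illustrative):

```python
import numpy as np

def value_iteration(P, C, beta, eps=1e-8, max_iter=10_000):
    """V_{k+1} = T V_k with (T V)(i) = min_u [C(i,u) + beta * sum_j P(j|i,u) V(j)].
    P[i, u, j] = P(j | i, u); returns the fixed point W_inf and a greedy policy."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    for _ in range(max_iter):
        Q = C + beta * (P @ V)               # Q(i, u)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < eps:  # stopping criterion ||V_{k+1} - V_k||_inf < eps
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=1)

# Hypothetical 3-state, 2-action discounted MDP
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(3), size=(3, 2))
C = rng.uniform(0, 1, size=(3, 2))
V_star, g_star = value_iteration(P, C, beta=0.9)
print(V_star, g_star)
```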
⑤ Algorithm 2. Policy iteration
○ Step 1. Choose any stationary Markov policy gn ∈ 𝒢SMP.
○ Step 2. Policy evaluation: compute W∞gn = (I - βPgn)-1Cgn.
○ Step 3. Stopping criterion: if TW∞gn = W∞gn, stop and take gn as the optimal policy; otherwise go to Step 4.
○ Step 4. Policy improvement: define gn+1 as follows.
○ Theorem: The sequence ({g0, g1, g2, ···}) reaches an optimal policy after finitely many iterations.
○ Proof: When the stopping condition W∞gn = TW∞gn holds, gn is already an optimal policy. Otherwise, in Step 4 we have Tgn+1 W∞gn ≤ W∞gn, with strict inequality in at least one state. Since the number of possible stationary Markov policies is finite (at most ㅣUㅣ^ㅣSㅣ) and each step strictly improves the cost in at least one state, the same policy cannot be revisited. Therefore, the algorithm reaches the stopping condition in finitely many steps and yields an optimal policy.
○ Characteristic: It has the advantage of needing far fewer iterations, but at each iteration it must alternate between two operators (policy evaluation (more expensive due to inverse matrix calculation) and policy improvement).
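A minimal policy-iteration sketch matching Steps 1-4 above, for a hypothetical random MDP; Step 2 is the linear solve W = (I - βPg)⁻¹cg and Step 3 checks TW∞gn = W∞gn:

```python
import numpy as np

def policy_iteration(P, C, beta):
    """Discounted-cost policy iteration. P[i, u, j] = P(j | i, u), C[i, u] = cost."""
    nS, nA, _ = P.shape
    g = np.zeros(nS, dtype=int)                           # Step 1: arbitrary stationary Markov policy
    while True:
        Pg = P[np.arange(nS), g]                          # transition matrix under g
        cg = C[np.arange(nS), g]
        W = np.linalg.solve(np.eye(nS) - beta * Pg, cg)   # Step 2: policy evaluation
        Q = C + beta * (P @ W)
        if np.allclose(Q.min(axis=1), W):                 # Step 3: T W^g = W^g  =>  optimal
            return W, g
        g = Q.argmin(axis=1)                              # Step 4: policy improvement

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(3), size=(3, 2))
C = rng.uniform(0, 1, size=(3, 2))
W_star, g_star = policy_iteration(P, C, beta=0.9)
print(W_star, g_star)
```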
⑥ Algorithm 3. Linear programming
○ Theorem
○ Proof
○ Implementation: Lagrange multiplier method, dual optimal variables.
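One common way to realize the LP approach is the standard discounted-cost formulation: maximize ∑i V(i) subject to V(i) ≤ C(i, u) + β∑j P(j ㅣ i, u)V(j) for all (i, u). The sketch below implements that formulation (not necessarily the exact form intended in the omitted theorem) with scipy.optimize.linprog on a hypothetical MDP:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 3-state, 2-action discounted MDP
rng = np.random.default_rng(5)
nS, nA, beta = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[i, u, j] = P(j | i, u)
C = rng.uniform(0, 1, size=(nS, nA))

# LP: maximize sum_i V(i)  s.t.  V(i) - beta * sum_j P(j|i,u) V(j) <= C(i,u) for all (i, u)
A_ub, b_ub = [], []
for i in range(nS):
    for u in range(nA):
        row = -beta * P[i, u]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(C[i, u])

res = linprog(c=-np.ones(nS),                 # maximize sum V  <=>  minimize -sum V
              A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nS, method="highs")
print(res.x)                                   # optimal value function W_inf
```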
⒂ Lemma 15. (Stochastic policy) ACOE (average cost optimality equation, infinite-horizon average cost Bellman equation)
① Overview
○ For a finite state space, finite action space, and bounded cost function, define Jg as follows:
○ In an irreducible Markov chain, consider the Poisson equation Jg1 + Wg = Cg + PgWg.
○ If Wg is a solution, then Wg + α1 is also a solution for any constant α, but Wg is uniquely determined up to this additive constant.
○ Derivation of the ACOE: the relative value WN(i) - WN(j) converges to W(i) - W(j) as N → ∞.
○ Significance 1. J* is the optimal cost.
○ Significance 2. The optimal SMP g* must satisfy the ACOE.
○ Suppose the optimal policy g* is given as follows, and assume that the equality does not hold for some state i.
○ Then a contradiction arises because g* loses optimality, and hence the optimal g* must satisfy the ACOE.
○ Significance 3. We can use policy iteration.
② Algorithm 1. Policy iteration algorithm
○ Step 1. Choose an arbitrary policy g0 ∈ 𝒢SMP.
○ Step 2. Policy evaluation: Given gn, solve the Poisson equation Jgn1 + Wgn = Cgn + PgnWgn to obtain (Jgn, Wgn). Since there are I equations but I+1 unknowns (Jgn, Wgn(1), ···, Wgn(I)), fix one component, e.g. set Wgn(I) = 0, to determine the solution uniquely.
○ Step 3. Stopping criterion: If gn satisfies the ACOE, then gn is an optimal SMP. Otherwise, go to Step 4. ACOE: Jgn + Wgn(i) = minu∈𝒰 {C(i, u) + ∑j∈S P(j ㅣ i, u)Wgn(j)}, ∀i
○ Step 4. Policy improvement: Define a new policy gn+1 by gn+1 ∈ arg minu∈𝒰 {C(i, u) + ∑j∈S P(j ㅣ i, u)Wgn(j)} and return to Step 2.
○ If gn does not satisfy the stopping criterion, then Jgn+1 < Jgn holds.
③ Algorithm 2. ACOE and relative value iteration
○ Assumption: For every g ∈ 𝒢SMP, the transition matrix Pg is irreducible and aperiodic.
○ Step 1. Choose an arbitrary h0 ∈ ℝI.
○ Step 2. For any k ≥ 1,
○ ∀i ∈ S, λk(i) = minu∈𝒰 {C(i, u) + ∑j∈S P(j ㅣ i, u)hk-1(j)}
○ μk = λk(I) for some fixed reference state I
○ ∀i ∈ S, hk(i) = λk(i) - μk
○ Step 3. Check convergence; if it does not converge, return to Step 2.
○ Then μk converges to J*, and hk converges to W (the relative value function).
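A minimal sketch of Steps 1-3 of the relative value iteration above, for a hypothetical MDP whose transition probabilities are all positive (hence irreducible and aperiodic under every policy, matching the stated assumption):

```python
import numpy as np

def relative_value_iteration(P, C, ref=-1, iters=2000, tol=1e-10):
    """ACOE via relative value iteration. P[i, u, j] = P(j | i, u), C[i, u] = cost.
    Returns (mu_k -> J*, h_k -> relative value function W)."""
    nS, nA, _ = P.shape
    h = np.zeros(nS)                          # Step 1: arbitrary h_0
    mu = 0.0
    for _ in range(iters):
        lam = (C + P @ h).min(axis=1)         # Step 2: lambda_k(i) = min_u {C(i,u) + sum_j P(j|i,u) h_{k-1}(j)}
        mu = lam[ref]                         # mu_k = lambda_k(I) for a fixed reference state
        h_new = lam - mu                      # h_k = lambda_k - mu_k
        if np.max(np.abs(h_new - h)) < tol:   # Step 3: convergence check
            h = h_new
            break
        h = h_new
    return mu, h

# Hypothetical irreducible, aperiodic MDP
rng = np.random.default_rng(6)
P = rng.dirichlet(np.ones(3), size=(3, 2))
C = rng.uniform(0, 1, size=(3, 2))
print(relative_value_iteration(P, C))
```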
④ Other algorithms
○ Linear programming: If there is no max condition, J* can go down to −∞ due to the ≤ inequality (which is a stronger condition than taking the minimum).
○ Successive approximation
○ Bertsekas’ algorithm
○ Puterman’s algorithm
⑤ MDS (Martingale difference sequence)
○ Overview: For concrete sample paths ω, we want to prove that for every g, lim infN→∞ ĴNg(ω) ≥ J* almost surely (a.s.), and that limN→∞ ĴNg*(ω) = J* a.s. (assuming g* is optimal).
○ Theorem 1. Let {Xk}k∈ℕ be adapted to a filtration {ℱk}k∈ℕ and satisfy 𝔼[Xk+1 ㅣ ℱk] = 0 almost surely (such a sequence is a martingale difference sequence). Then Xk and Yk = ∑j=1 to k Xj are ℱk-measurable, and {Yk} is a martingale.
○ Theorem 2. Martingale stability theorem (LLN)
○ Theorem 3. For any g ∈ 𝒢, lim infN→∞ (1/N) ∑t=0 to N-1 C(Xtg, Utg) ≥ J* a.s., with equality for an optimal g* satisfying the ACOE.
○ Let (J*, W) be a solution of the ACOE. Define Zk+1 = C(Xk, Uk) - J* + W(Xk+1) - W(Xk) - h(Xk, Uk) and h(i, u) = C(i, u) + ∑j∈S P(j ㅣ i, u)W(j) - J* - W(i) ≥ 0. Let ℱk = σ(X0, X1g, ···, Xkg). Then, given Ukg = gk(X0, X1g, ···, Xkg), we obtain 𝔼[Zk+1 ㅣ ℱk] = 0 as follows:
○ Hence {Zk}k∈ℕ is an MDS, and each term of Zk+1 is bounded, so ㅣZk+1ㅣ is bounded by some constant M̃. Therefore,
○ holds, and by the LLN we have limN→∞ (1/N)∑k=1 to N Zk = 0 a.s. But
holds and h(·, ·) ≥ 0, so we have lim infN→∞ (1/N) ∑t=0 to N-1 C(Xtg, Utg) ≥ J* a.s.
⑥ Theorem: The DCOE problem becomes equivalent to the ACOE problem as β → 1.
○ Reason why ㅣWβ(i) - Wβ(j)ㅣ < ∞ in the above proof.
○ Theorem
○ Upper bound
○ Lower bound
○ General proof
○ Significance 1. It is not necessarily true that the optimal policy gβ* converges to the optimal ACOE policy g*.
○ Significance 2. Blackwell optimality: If the same policy gβ* is optimal for all β with β̂ < β < 1, then gβ* is also optimal for the ACOE. Then, Jβ* = ACOE optimal J* also holds.
○ Proof 1.
○ Proof 2. In the classical Blackwell argument, you compare two stationary policies π and ν by looking at their value difference as a function of the discount factor, fπ,ν(γ) = Vγπ − Vγν. In a standard finite MDP this difference is a rational function of γ, and a nonzero rational function can have only finitely many zeros. That means there are only finitely many discount factors γ at which the two policies tie; beyond those points, their ranking cannot keep flipping infinitely often as γ → 1. Hence, for all γ sufficiently close to 1, the same policy remains optimal, and this policy is also optimal for the average-reward criterion; this is the Blackwell-optimal policy.
⒃ Lemma 16. Kalman filter
① Overview
○ Case 1. Pure prediction problem (Ut ≡ 0): Kalman filter. Given Y0, ···, Yt, the problem is to predict X0, ···, Xt.
○ Case 2. Pure control problem (Yt = Xt): LQR (linear quadratic regulator)
○ Case 3. Partial observation with quadratic cost: LQG (linear quadratic Gaussian).
② Review: linear Gaussian process
③ Case 1. ut ≡ 0 or Bt = 0
○ Step 1. Prediction
○ Step 2. Observation prediction
○ Step 3. Data update
○ Significance: From the form of Lt, we see that the filter gain and covariance recursion are deterministic (data-independent) but nonlinear in the model matrices (a predict/update sketch appears at the end of this case).
○ Application: Asymptotic behavior of the Kalman filter
○ Definition: In the time-invariant case At ≡ A, Gt ≡ G, Ct ≡ C, Ht ≡ H.
○ Background theory: Observability appears when discussing the asymptotic behavior of the Kalman filter because, in order to construct a long-term stable observer, the system must be observable or, more generally, detectable. If an unobservable mode exists, the state component in that direction cannot be corrected using measurements. In particular, if such a mode is unstable (its eigenvalue lies outside the unit circle), its contribution cannot be inferred from the outputs, and the estimation error in that direction grows without bound over time. Consequently, the error covariance Pk also diverges along that direction, the Riccati recursion does not converge to a finite limit P∞, and the observer error cannot be stabilized. Therefore, to guarantee good asymptotic behavior of the Kalman filter (convergence of the error covariance, a constant steady-state Kalman gain, and stable estimation), it is essential that all unstable modes be observable, i.e., that the pair (A, C) be detectable.
○ “Observable”, “detectable”, and “reachable” are closely related notions: observability of (A, C) is dual to reachability of the transposed pair, and detectability is the weaker version of observability.
○ ARE(algebraic Riccati equation)
○ If (A,S) is reachable, then the following are equivalent: A is stable. ⇔ The equation ∑ = A∑A* + SS* has a positive definite solution ∑. ⇔ Corollary
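As mentioned in the Significance note, here is a minimal predict/update sketch of the three Kalman-filter steps above for a hypothetical time-invariant system with a detectable (A, C) pair; iterating shows the error covariance settling toward the ARE fixed point (matrices and noise levels are illustrative, and the control input is omitted as in Case 1):

```python
import numpy as np

def kalman_step(x_hat, Sigma, y, A, C, GQGt, R):
    """One cycle of the Kalman filter (a sketch under the model
    X_{t+1} = A X_t + G W_t, Y_t = C X_t + V_t, W_t ~ N(0, Q), V_t ~ N(0, R))."""
    # Step 1. Prediction
    x_pred = A @ x_hat
    S_pred = A @ Sigma @ A.T + GQGt
    # Step 2. Observation prediction
    y_pred = C @ x_pred
    # Step 3. Data update (the gain L_t depends only on the model matrices, not on the data)
    S_yy = C @ S_pred @ C.T + R
    L = S_pred @ C.T @ np.linalg.inv(S_yy)
    x_new = x_pred + L @ (y - y_pred)
    S_new = (np.eye(len(x_hat)) - L @ C) @ S_pred
    return x_new, S_new

# Hypothetical time-invariant system
A = np.array([[0.9, 0.1], [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
GQGt = 0.05 * np.eye(2)
R = 0.1 * np.eye(1)
rng = np.random.default_rng(7)

x_true = rng.normal(size=2)
x_hat, Sigma = np.zeros(2), np.eye(2)
for _ in range(50):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), GQGt)
    y = C @ x_true + rng.multivariate_normal(np.zeros(1), R)
    x_hat, Sigma = kalman_step(x_hat, Sigma, y, A, C, GQGt, R)
print(x_hat, "\n", Sigma)   # Sigma approaches the ARE fixed point when (A, C) is detectable
```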
④ Case 2. Arbitrary ut = gt(Ht) = gt(y0:t, u0:t-1), and arbitrary Ct(xt, ut)
○ Result 1. (y0:tg, u0:t-1g) and y0:t generate the same σ-algebra: each is a function of the other.
○ Proof that (y0:tg, u0:t-1g) → y0:t: u0g = g0(y0g) = g0(y0 + ȳ0g) and y0 = y0g - C0x̄0g = y0g - C0𝔼[x0]. Also x̄1g = A0x̄0g + B0u0g and ȳ1g = C1x̄1g, so y1 = y1g - ȳ1g. Continuing in this way, we can construct y0:t.
○ Proof that y0:t → (y0:tg, u0:t-1g): Consider x0 → y0 → u0 → x1 → y1 → u1 → x2 → ⋯. ȳ0g = C0x̄0g is known, so y0g = y0 + ȳ0g is known, and hence u0g = g0(y0g) is known. Then x̄1g = A0x̄0g + B0u0g is known, and so are ȳ1g = C1x̄1g and y1g = y1 + ȳ1g. Therefore u1g = g1(y0:1g, u0g) is known. The proposition follows by continuing in this way.
○ This is related to the former proposition: Pg(Xt+1 ∈ A ㅣ Ht, ut) = Pg(Xt+1 ∈ A ㅣ y0:t, ut).
○ Result 2. πt = ℙ(xtg ㅣ y0:tg, u0:t-1g) = ℙ(xtg ㅣ y0:t)
○ Result 3. ℙ(xt ㅣ y0:t) (= Kalman filter) is enough to understand the system.
○ Remark 1. Control affects only x̄tg (the mean), not ∑tㅣt (the covariance).
○ Remark 2. Unlike a general POMDP, no learning (active exploration) is required to decrease the uncertainty.
○ Remark 3. Separation principle: Kalman filter and control can be divided.
⑤ Dynamic programming
⑥ Quadratic cost
○ Assumption: Ct(xt, ut) = xt*Ptxt + ut*Ttut, CT(xT) = xT*PTxT
○ Let X ~ 𝒩(X̄, ∑) and S be a symmetric matrix, then we have 𝔼[X*SX] = X̄*SX̄ + Tr(S∑).
○ Solution for LQG problem
○ Proof
○ Remark 1. Certainty-equivalent control: the noise terms {wt} and {vt} do not appear in the optimal control policy.
○ Remark 2. Separation principle: forward estimation via the Kalman filter, and backward computation of the control action by solving the LQR problem (a backward Riccati sketch follows below).
○ Remark 3. If ∑tㅣt ≡ 0, st ≡ 0.
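As referenced at Remark 2, the control side is a backward Riccati recursion. Here is a sketch of the finite-horizon LQR gains for the quadratic cost xt*Ptxt + ut*Ttut (the system matrices are illustrative assumptions, and by certainty equivalence the gain is applied to the Kalman estimate x̂tㅣt):

```python
import numpy as np

def lqr_gains(A, B, Px, Tu, PT, horizon):
    """Backward Riccati recursion for the finite-horizon LQR part of the LQG problem
    (stage cost x* Px x + u* Tu u, terminal cost x* PT x). Returns gains K_t with u_t = -K_t x_t."""
    S = PT
    gains = []
    for _ in range(horizon):
        M = Tu + B.T @ S @ B
        K = np.linalg.solve(M, B.T @ S @ A)      # K_t = (Tu + B' S B)^{-1} B' S A
        S = Px + A.T @ S @ (A - B @ K)           # equivalent compact form of the Riccati update
        gains.append(K)
    return gains[::-1]                           # ordered K_0, ..., K_{T-1}

# Hypothetical double-integrator-like system; apply u_t = -K_t * x_hat_{t|t} (separation principle)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Px, Tu, PT = np.eye(2), np.eye(1), np.eye(2)
K = lqr_gains(A, B, Px, Tu, PT, horizon=10)
x_hat = np.array([1.0, 0.0])                     # Kalman-filter estimate (from the previous sketch)
print(K[0], -K[0] @ x_hat)
```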
⑦ Quadratic cost and ACOE
⒄ Lemma 17. MAB(multi-armed bandit)
① Formula
○ N stochastic processes (Xkn)k=0,1,···; n=1,2,···
○ Each stochastic process has a value in S = {1, 2, ···, I} or countably infinite set.
○ For each time k, uk ∈ {1, 2, ···, N}
○ If uk = n, then Xk+1m = Xkm for all m ∈ {1, 2, ···, N} \ {n} (the arms that are not played stay frozen)
○ For the played arm n = uk: ℙ(Xk+1n = j) = P(j ㅣ Xkn)
○ We observe (Xk1, Xk2, ···, XkN).
○ We want an optimal policy attaining supg∈𝒢 𝔼g[∑k=0 to ∞ βkR(Xkuk)].
○ Dynamic programming: W: SN → ℝ, W(x1, x2, ···, xN) = maxu∈{1, 2, ···, N} [R(xu) + β∑j=1 to I P(j ㅣ xu) W(x1, ···, xu-1, j, xu+1, ···, xN)], xi ∈ S
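A brute-force sketch of this dynamic program on the joint state space (tractable only for tiny instances, which is exactly the curse of dimensionality that the Gittins index avoids); the two-arm chain, its transition matrix, and the rewards are illustrative:

```python
import numpy as np
from itertools import product

# Hypothetical rested bandit: N = 2 arms, each with I = 3 states,
# the same per-arm row-stochastic transition matrix P and reward R(state).
I, N, beta = 3, 2, 0.9
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([0.0, 0.5, 1.0])

states = list(product(range(I), repeat=N))     # joint states (x^1, ..., x^N)
W = {s: 0.0 for s in states}

# Value iteration on W(x^1,...,x^N) = max_u [R(x^u) + beta * sum_j P(j | x^u) W(..., j, ...)]
for _ in range(500):
    W_new = {}
    for s in states:
        vals = []
        for u in range(N):                      # play arm u; the other arms stay frozen
            cont = sum(P[s[u], j] * W[s[:u] + (j,) + s[u + 1:]] for j in range(I))
            vals.append(R[s[u]] + beta * cont)
        W_new[s] = max(vals)
    W = W_new

print({s: round(W[s], 3) for s in states})
```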
② Example: Suppose there are N slot machines, denoted by M1, …, MN. The success probability of each slot machine Mi is θi, and the failure probability is 1 − θi. When we play once, we receive a reward of 1 on success and 0 on failure. The success probabilities θ1, …, θN are mutually independent random variables taking values in [0, 1], and their prior distributions are denoted by P1(dθ1), …, PN(dθN) (we assume these distributions admit densities). At each time step, we can choose and play only one of the N slot machines. Find an optimal policy that maximizes Eg[∑t=0 to ∞ βt rt].
③ Example: We consider a small network with J queues (nodes). At each node j ∈ {1, …, J}, there are initially nj jobs (customers) waiting in line. The time required to serve a single customer at node j is random, and its distribution has cumulative distribution function (CDF) Fj. When a customer completes service at node j, one of two things happens. With probability qjℓ, the customer moves to another node ℓ and joins the queue there; with the remaining probability 1 − ∑ℓ=1 to J qjℓ, the customer leaves the system entirely. We assume that whenever a customer completes service at node j, we obtain a reward rj > 0 (for example, rj can be interpreted as the revenue the company earns from completing one job at node j). There is only one server (worker) in the system. Thus, at each moment we must decide “which node’s queue should we serve next?” We assume that no new customers arrive from outside, and our goal is to maximize the total discounted reward (total discounted revenue) over time. Viewing this situation as a multi-armed bandit (MAB) problem, each node j corresponds to an arm. At each time step we choose one action from the set {1, 2, …, J} and serve one customer from the corresponding node’s queue. Where the customer goes next is determined by the probabilities qjℓ for that node, and as a result the overall queueing state (how many customers remain at each node, etc.) evolves to the next state. The joint state of all nodes is the “system state” in the bandit problem. To make the system look more like a standard bandit model, we can add a fictitious (J+1)-th node that represents the destination of customers who leave the system, thereby making the network “closed.” No actual server is assigned to this fictitious node, so it is never chosen as an action; it is just a device to represent the flow of customers leaving the system. In this way, a structure in which a single server chooses among several queues where to spend time, receives a reward whenever a service is completed, and sees the queueing state change accordingly is a canonical example of an MAB. If there are no arrivals at all, each queue shrinks only when we choose to serve it, so the problem is a relatively simple “rested” bandit (cf. the Gittins policy is optimal). However, if customers keep arriving from outside, then the queues change state even when they are not chosen, and the problem becomes an example of a “restless bandit” (cf. the Gittins index policy is no longer optimal, and one must resort to concepts such as the Whittle index).
Input: 2025.08.26 23:34