
Reinforcement Learning Example [01-10]

Recommended article: 【Control Theory】 Stochastic Control Theory



Q4.

Consider a finite-state ergodic discrete time Markov chain {Xt}t∈Z+ on state-space 𝒮 = {1, 2, …, I} with transition matrix P and initial distribution μ (for X0). For (bounded) cost vector C let (L, J) be a (bounded) solution to the Poisson equation for (C, P), i.e., (L, J) satisfy (vector) equation


L + 1J = C + PL


Then show the following:

(a) The above equation can be written as


L(Xt) + J = C(Xt) + 𝔼[L(Xt+1) | Xt]


where L(Xt) = L(i) (the ith component of column vector L) if Xt = i.

(b) Then, using the Markov property show that the random process {Mt}t∈Z+ given by


M0 := L(X0), Mt+1 := L(Xt+1) + ∑s=0 to t C(Xs) - (t+1) J, t = 0, 1, 2, ···


is a martingale with respect to the filtration {ℱt}t∈Z+ where ℱt = σ(X0, X1, …, Xt). As 𝔼[|Mt|] < +∞ follows from the boundedness of C, L and J, you only need to show property (2), adaptedness, i.e., Mt is ℱt-measurable, and property (3), 𝔼[Mt+1 | ℱt] = Mt w.p. 1. For property (3) you will need to use part (a) of this question.

A4.

(a)

Let’s check the dimensionality of the given Poisson equation


L + 1J = C + PL, with L, C, 1 ∈ ℝI×1 (1 being the all-ones column vector), J ∈ ℝ, and P ∈ ℝI×I, so every term on both sides is an I×1 vector.


Here, P(i,j) (i-th row, j-th column) denotes the transition probability ℙ(Xt+1 = j | Xt = i). Thus, the i-th component of the equation is


L(i) + J = C(i) + ∑j=1 to I P(i,j)·L(j) = C(i) + 𝔼[L(j) | i]


Equivalently, given Xt = i, it can be represented as


L(Xt) + J = C(Xt) + 𝔼[L(Xt+1) | Xt]


Here L(j) is replaced by L(Xt+1) because, conditioned on Xt = i, the summation over j is exactly the expectation over the next state Xt+1.
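
To see the componentwise identity numerically, here is a minimal sketch in Python; the 3-state transition matrix P and cost vector C below are made-up illustration values, not part of the problem. It computes the stationary distribution π, sets J = π·C, solves for a bias vector L (pinning L(1) = 0, since L is determined only up to an additive constant), and checks L + 1J = C + PL.

```python
import numpy as np

# Toy 3-state ergodic chain; P and C are made-up illustration values.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
C = np.array([1.0, 4.0, 2.0])
I = P.shape[0]

# Stationary distribution pi: solve pi P = pi together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(I), np.ones(I)])
b = np.concatenate([np.zeros(I), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Average cost J = pi . C; the bias L solves (I - P) L = C - J*1.
# L is unique only up to an additive constant, so pin L[0] = 0.
J = pi @ C
M = np.eye(I) - P
M[0, :] = 0.0
M[0, 0] = 1.0
rhs = C - J
rhs[0] = 0.0
L = np.linalg.solve(M, rhs)

# Componentwise check of the Poisson equation L + 1J = C + PL.
print(np.allclose(L + J, C + P @ L))   # expected: True
```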

(b)

M0 = L(X0) (measurable by X0)

M1 = L(X1) + C(X0) - J (measurable by X0:1)

M2 = L(X2) + C(X0) + C(X1) - 2J (measurable by X0:2)

In general, given measurable spaces (Ω, ℱ) and (ℝ, 𝒢), a map X: Ω → ℝ is measurable when X⁻¹(A) = {ω ∈ Ω : X(ω) ∈ A} ∈ ℱ for all A ∈ 𝒢. Note that measurability is defined without reference to a measure ℙ, and 𝒢 is usually taken to be the Borel σ-algebra ℬ(ℝ). By the pattern above, Mt is a fixed measurable function of X0:t (a finite sum of L, C and constants evaluated along the path), and each Xs with s ≤ t is ℱt-measurable. Thus, Mt is ℱt-measurable.

Clearly ℱ0 ⊆ ℱ1 ⊆ ℱ2 ⊆ ⋯ (∵ σ(X0:t-1) ⊆ σ(X0:t)), so {Mt} is adapted to the filtration. The remaining question is whether 𝔼[Mt+1 | ℱt] = Mt holds; in words, the best prediction of Mt+1 given the information available at time t is the current value Mt. Using (a), we obtain


𝔼[Mt+1 | ℱt] = 𝔼[L(Xt+1) | ℱt] + ∑s=0 to t C(Xs) - (t+1)J
= 𝔼[L(Xt+1) | Xt] + ∑s=0 to t C(Xs) - (t+1)J   (Markov property)
= L(Xt) + J - C(Xt) + ∑s=0 to t C(Xs) - (t+1)J   (by part (a))
= L(Xt) + ∑s=0 to t-1 C(Xs) - tJ = Mt   w.p. 1


Here, the second equality uses the Markov property: by the Doob–Dynkin lemma, every σ(X0, …, Xt)-measurable random variable is of the form g(X0, …, Xt) for some measurable function g, and for a Markov chain conditioning on ℱt = σ(X0, …, Xt) reduces to conditioning on Xt. With the filtration structure of {ℱt}t∈ℤ+ and properties (1), (2) and (3) established, {Mt}t∈ℤ+ is a martingale.
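
As a sanity check of the martingale property, the following sketch continues the code from part (a) (it reuses the P, C, L, J computed there): it simulates many sample paths started from a fixed X0 and compares the empirical mean of MT with M0 = L(X0), which should agree for a martingale.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_M_T(P, C, L, J, T, x0, rng):
    """Return M_T = L(X_T) + sum_{s<T} C(X_s) - T*J along one sample path."""
    x, cost_sum = x0, 0.0
    for _ in range(T):
        cost_sum += C[x]
        x = rng.choice(len(C), p=P[x])
    return L[x] + cost_sum - T * J

# P, C, L, J are the arrays computed in the sketch for part (a).
# For a martingale started from a fixed X0 = x0, E[M_T] = M_0 = L(x0).
x0, T = 0, 20
samples = [simulate_M_T(P, C, L, J, T, x0, rng) for _ in range(5000)]
print(np.mean(samples), L[x0])   # the two numbers should be close
```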




Q5.

Let (Xt)t≥0 be a stochastic process with state space S. Let P be a transition operator on functions f : S → ℝ, and let I denote the identity operator. For every bounded function f : S → ℝ, define

Mtf = f(Xt) - f(X0) - ∑τ = 0 to (t-1) (P - I)f(Xτ),

and let ℱt = σ(X0, …, Xt) be the natural filtration generated by X. Show that the following two statements are equivalent:

  1. (Xt)t≥0 is a Markov chain with transition operator P.

  2. For every bounded function f, the process (Mtf)t≥0 is a martingale with respect to the filtration (ℱt)t≥0.

A5.

Proof of 1 → 2

Mt+1f - Mtf = f(Xt+1) - f(Xt) - (P-I)f(Xt) = f(Xt+1) - Pf(Xt)

If (Xt) is a Markov chain with transition operator P, then 𝔼[f(Xt+1) | ℱt] = Pf(Xt).

∴ 𝔼[Mt+1f - Mtf | ℱt] = 𝔼[f(Xt+1) - Pf(Xt) | ℱt] = 0

∴ 𝔼[Mt+1f | ℱt] = Mtf w.p. 1. Moreover, Mtf is bounded (hence integrable) because f is bounded, and it is a function of X0, …, Xt, hence ℱt-measurable. So (Mtf) is a martingale.


Proof of 2 → 1

Mt+1f - Mtf = f(Xt+1) - f(Xt) - (P-I)f(Xt) = f(Xt+1) - Pf(Xt)

Since Mtf is a martingale, 𝔼[Mt+1f - Mtf | ℱt] = 𝔼[f(Xt+1) - Pf(Xt) | ℱt] = 0

That is, 𝔼[f(Xt+1) | ℱt] = Pf(Xt) holds for every bounded f.

Since this holds for every bounded f, the conditional distribution of Xt+1 given the whole past ℱt depends only on Xt and is governed by P → the 1-step Markov property.

Applying this repeatedly with the tower property gives 𝔼[g(Xt+k) | ℱt] = Pkg(Xt) for every bounded g and every k ≥ 1, which is the full Markov property with transition operator P.
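
A quick numerical illustration of the key identity 𝔼[f(Xt+1) | ℱt] = Pf(Xt): for a toy 3-state chain and a bounded test function f (both made up for illustration), the martingale increment f(Xt+1) − (Pf)(Xt) averages to roughly zero conditioned on each current state.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-state chain and a bounded test function f (illustration values).
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, -2.0, 0.5])
Pf = P @ f                     # (Pf)(i) = E[f(X_{t+1}) | X_t = i]

# Monte Carlo: the increment f(X_{t+1}) - (Pf)(X_t) should average to ~0
# conditioned on each current state i.
n = 200_000
for i in range(3):
    x_next = rng.choice(3, size=n, p=P[i])
    print(i, np.mean(f[x_next] - Pf[i]))   # each value should be close to 0
```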



Q8.

An individual is offered 3 to 1 odds in a coin tossing game where she wins whenever a tail occurs. However, she suspects that the coin is biased and has an a priori probability distribution with CDF F(p) and pdf f(p), for the probability p that a head occurs at each toss. A maximum of T coin tosses is allowed. The individual’s objective is to determine a policy of deciding whether to continue or stop participating in the game, given the outcomes of the game so far, so as to maximize her earnings.

(i) Identify an information state for the problem and write down the equation determining its evolution.

(ii) Write down the dynamic program for this problem.

A8.

The information state can be defined as the posterior distribution of the head probability p given the outcomes observed so far, i.e., the prior f updated by Bayes' rule; equivalently, the pair (number of heads, number of tosses) observed up to time t is a sufficient statistic.

Case 1. For the general information state (the full posterior over p),


[equation image (screenshot) not reproduced]


Case 2. For a specific information state,


[equation image (screenshot) not reproduced]
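
Since the equation images above are not reproduced, here is a hedged sketch of one standard way to set up the problem, under two assumptions that the problem text does not fix: the prior f(p) is taken to be a Beta(a0, b0) density, so the information state reduces to the head/tail counts (h, k) and evolves by incrementing the observed count; and "3 to 1 odds" is read as a payoff of +3 on a tail and −1 on a head per toss. V(h, k) below is the optimal expected remaining earnings, with the stop action worth 0.

```python
from functools import lru_cache

# Assumptions (not fixed by the problem text): Beta(a0, b0) prior on p, and
# each toss pays +3 on a tail and -1 on a head. The information state is the
# pair (h, k) = (heads, tails) observed so far; it evolves by h -> h+1 on a
# head and k -> k+1 on a tail, and the posterior of p is Beta(a0 + h, b0 + k).
a0, b0 = 1.0, 1.0          # uniform prior, as an example
T = 10                     # maximum number of tosses allowed

def p_head(h, k):
    """Posterior probability that the next toss is a head."""
    return (a0 + h) / (a0 + b0 + h + k)

@lru_cache(maxsize=None)
def V(h, k):
    """Optimal expected remaining earnings after h heads and k tails."""
    if h + k == T:
        return 0.0
    ph = p_head(h, k)
    # stop now (earn 0 more) vs. toss once more and continue optimally
    cont = ph * (-1.0 + V(h + 1, k)) + (1.0 - ph) * (3.0 + V(h, k + 1))
    return max(0.0, cont)

print(V(0, 0))   # value of the game before any toss, under these assumptions
```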



Q9.

Show that W(i), i=1,…,I is the solution of the linear program: Maximize ∑i=1 to I z(i) subject to z(i) ≤ c(i,u) + β∑j=1 to I Pij(u) z(j), ∀u,i

A9.

Let a discount factor β ∈ (0, 1), transition matrices P(u), and a one-step cost c(i, u) be given. Define the Bellman operator T by


(Tv)(i) = minu { c(i, u) + β Σj Pij(u) v(j) }


W is the unique fixed point of this operator: W = TW. For all i and u, we have


W(i) = minu' { c(i, u') + β Σj Pij(u') W(j) } ≤ c(i, u) + β Σj Pij(u) W(j)


so W satisfies the inequality constraints, i.e., W is feasible for the LP. Conversely, for any feasible z the constraint holds for all u, so taking the minimum over u gives


z(i) ≤ c(i, u) + β Σj Pij(u) z(j) for all u ⇒ z(i) ≤ (Tz)(i)


Here we can see the monotonicity of the operator T:


x ≤ y ⇒ c(i, u) + β Σj Pij(u) x(j) ≤ c(i, u) + β Σj Pij(u) y(j) ⇒ (Tx)(i) ≤ (Ty)(i)


Therefore, z ≤ Tz ≤ T²z ≤ ··· ≤ limn→∞ Tⁿz = W. The uniqueness of the fixed point W and the convergence Tⁿz → W from any starting point are guaranteed by the Banach fixed-point theorem, since the Bellman operator T is a β-contraction in the sup norm. Moreover, any feasible z is bounded above:


maxi z(i) ≤ maxi,u c(i, u) + β maxj z(j) ⇒ maxi z(i) ≤ maxi,u c(i, u) / (1 − β)


Hence the objective is bounded and, since z ≤ W componentwise, Σi z(i) ≤ Σi W(i) for every feasible z. In summary, W is a feasible solution of the given linear program, and any feasible z satisfies z ≤ W, so the maximizer of Σi z(i) is z = W, and this W is unique.
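
The argument can also be checked numerically. The sketch below uses a made-up 2-state, 2-action discounted MDP, solves the LP max Σi z(i) s.t. z(i) ≤ c(i,u) + β Σj Pij(u) z(j) with scipy.optimize.linprog, and compares the result with value iteration; the two should coincide at W (up to solver tolerance).

```python
import numpy as np
from scipy.optimize import linprog

# Toy discounted MDP (2 states, 2 actions); P[u] and cost[u] are made up.
beta = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
cost = {0: np.array([2.0, 1.0]),
        1: np.array([0.5, 3.0])}
I, U = 2, 2

# LP: maximize sum_i z(i)  s.t.  z(i) - beta * sum_j Pij(u) z(j) <= c(i, u).
A_ub, b_ub = [], []
for u in range(U):
    for i in range(I):
        row = -beta * P[u][i]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(cost[u][i])
res = linprog(c=-np.ones(I), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * I)
z_lp = res.x

# Value iteration on the same data for comparison.
W = np.zeros(I)
for _ in range(2000):
    W = np.min([cost[u] + beta * P[u] @ W for u in range(U)], axis=0)

print(z_lp, W)   # the two vectors should agree up to solver tolerance
```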



Q10.

Show that the minimum cost is the solution of the linear program: maximize J* subject to J* + w(i) ≤ c(i,u) + ∑j=1 to I Pij(u)w(j), 1 ≤ i ≤ I, u ∈ U.

A10.

Let Jopt* denote the minimum long-run average cost. Together with a relative value (bias) function h, it satisfies the average-cost optimality equation Jopt* + h(i) = minu∈U {c(i,u) + ∑j Pij(u)h(j)}, ∀i. Setting w(i) = h(i), the pair (Jopt*, h) is LP feasible, since


Jopt* + h(i) = minu∈U {c(i, u) + ∑j Pij(u) h(j)} ≤ c(i, u) + ∑j Pij(u) h(j), ∀ i, u


For arbitrary i, u and any LP-feasible pair (J, w), we have


J + w(i) ≤ c(i, u) + ∑j Pij(u) w(j), i.e., J ≤ c(i, u) + ∑j Pij(u) w(j) − w(i)


Let g be an optimal stationary policy whose long-run average cost is Jopt*. Then


Choosing u = g(Xs) at each step and summing the inequality along the trajectory X0, X1, …, XT generated by g,

T·J ≤ ∑s=0 to T-1 c(Xs, g(Xs)) + ∑s=0 to T-1 (𝔼[w(Xs+1) | Xs] − w(Xs))

After dividing by T, the second sum equals (w(XT) − w(X0))/T plus the average of the bounded martingale differences 𝔼[w(Xs+1) | Xs] − w(Xs+1), both of which vanish as T → ∞ w.p. 1; hence

J ≤ limT→∞ (1/T) ∑s=0 to T-1 c(Xs, g(Xs)) = Jopt* w.p. 1


Here I used sample-path optimality (a stronger statement), which follows from the strong law of large numbers for bounded martingale difference sequences (MDS). Let J* := sup{J | ∃w s.t. (J, w) is LP feasible} be the optimal value of the LP. Since Jopt* ≥ J for every LP-feasible J, we have Jopt* ≥ J*; combined with the LP feasibility of (Jopt*, h), which gives J* ≥ Jopt*, the minimal cost Jopt* from the average-cost optimality equation is identical to the optimal value J* of: max J s.t. J + w(i) ≤ c(i, u) + ∑j Pij(u) w(j), ∀ i, u.
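
As a numerical illustration of the equivalence, the sketch below (again on a made-up 2-state, 2-action example) solves the average-cost LP max J s.t. J + w(i) ≤ c(i,u) + Σj Pij(u) w(j) with scipy.optimize.linprog and compares the optimal J with the smallest long-run average cost over all stationary deterministic policies, computed from their stationary distributions.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# Toy average-cost MDP (2 states, 2 actions); P[u] and cost[u] are made up.
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
cost = {0: np.array([2.0, 1.0]),
        1: np.array([0.5, 3.0])}
I, U = 2, 2

# LP over x = (J, w(1), ..., w(I)): maximize J subject to
# J + w(i) - sum_j Pij(u) w(j) <= c(i, u). Since w is determined only up to
# an additive constant, pin w(1) = 0 via the variable bounds.
A_ub, b_ub = [], []
for u in range(U):
    for i in range(I):
        row = np.zeros(1 + I)
        row[0] = 1.0
        row[1:] = -P[u][i]
        row[1 + i] += 1.0
        A_ub.append(row)
        b_ub.append(cost[u][i])
obj = np.zeros(1 + I)
obj[0] = -1.0                              # maximize J <=> minimize -J
bounds = [(None, None), (0.0, 0.0)] + [(None, None)] * (I - 1)
res = linprog(c=obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
J_lp = res.x[0]

# Brute force: long-run average cost of every stationary deterministic policy.
def avg_cost(policy):
    Pg = np.array([P[policy[i]][i] for i in range(I)])
    cg = np.array([cost[policy[i]][i] for i in range(I)])
    A = np.vstack([Pg.T - np.eye(I), np.ones(I)])
    b = np.concatenate([np.zeros(I), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi @ cg

J_opt = min(avg_cost(g) for g in product(range(U), repeat=I))
print(J_lp, J_opt)   # the two values should agree up to solver tolerance
```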



Input: 2025.11.21 01:30
