
Reinforcement Learning Example [01-10]

Recommended article: 【Control Theory】 Stochastic Control Theory



Q5.

Let (X_t)_{t≥0} be a stochastic process with state space S. Let P be a transition operator on functions f : S → ℝ, and let I denote the identity operator. For every bounded function f : S → ℝ, define

M_t^f = f(X_t) − f(X_0) − ∑_{τ=0}^{t−1} (P − I) f(X_τ),

and let ℱ_t = σ(X_0, …, X_t) be the natural filtration generated by X. Show that the following two statements are equivalent:

  1. (X_t)_{t≥0} is a Markov chain with transition operator P.

  2. For every bounded function f, the process (M_t^f)_{t≥0} is a martingale with respect to the filtration (ℱ_t)_{t≥0}.

A5.
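A minimal sketch of the argument (my own write-up, using only the definitions above; the key identity is that the Markov property with operator P reads 𝔼[f(X_{t+1}) | ℱ_t] = (Pf)(X_t)):

```latex
% Proof sketch for Q5 (not a full write-up). Uses only the definition of M_t^f.
% One-step increment:
%   M_{t+1}^f - M_t^f = f(X_{t+1}) - f(X_t) - (P - I)f(X_t) = f(X_{t+1}) - (Pf)(X_t).
\begin{align*}
\textbf{(1)} \Rightarrow \textbf{(2)}:\quad
  \mathbb{E}\!\left[ M_{t+1}^f - M_t^f \,\middle|\, \mathcal{F}_t \right]
  &= \mathbb{E}\!\left[ f(X_{t+1}) \,\middle|\, \mathcal{F}_t \right] - (Pf)(X_t) = 0
  \quad \text{(Markov property),} \\
\textbf{(2)} \Rightarrow \textbf{(1)}:\quad
  \mathbb{E}\!\left[ M_{t+1}^f - M_t^f \,\middle|\, \mathcal{F}_t \right] = 0
  &\;\Longrightarrow\;
  \mathbb{E}\!\left[ f(X_{t+1}) \,\middle|\, \mathcal{F}_t \right] = (Pf)(X_t)
  \quad \text{for every bounded } f.
\end{align*}
```

In the first direction, each M_t^f is bounded and ℱ_t-measurable (f is bounded and the sum has finitely many terms), so the vanishing conditional increment makes (M_t^f) a martingale. In the second, the displayed identity holding for every bounded f says that the conditional law of X_{t+1} given ℱ_t depends only on X_t and is given by P, which is exactly statement (1).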



Q8.

An individual is offered 3 to 1 odds in a coin-tossing game in which she wins whenever a tail occurs. However, she suspects that the coin is biased and holds an a priori distribution, with CDF F(p) and pdf f(p), over the probability p that a head occurs at each toss. A maximum of T coin tosses is allowed. The individual’s objective is to determine a policy for deciding, given the outcomes of the game so far, whether to continue or stop participating, so as to maximize her earnings.

(i) Identify an information state for the problem and write down the equation determining its evolution.

(ii) Write down the dynamic program for this problem.

A8.

The information state can be taken to be the posterior distribution of the head probability p given the outcomes observed so far; it is updated by Bayes' rule after each toss.

Case 1. For a general information state:


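A sketch of this case (my own reconstruction, assuming a unit stake per toss, i.e. the player gains 3 on a tail and loses 1 on a head). Write π_t for the posterior density of p after t tosses, with π_0 = f, and q(π) = ∫_0^1 p π(p) dp for the predictive probability of a head:

```latex
% Assumes a unit stake: gain 3 on a tail, lose 1 on a head.
% V_t(pi) = maximum expected further earnings with posterior pi after t tosses.
\begin{align*}
\text{(i) evolution of the information state } \pi_t:\quad
  \pi_{t+1}(p) &=
  \begin{cases}
    \dfrac{p\,\pi_t(p)}{q(\pi_t)}       & \text{if toss } t+1 \text{ is a head},\\[1.5ex]
    \dfrac{(1-p)\,\pi_t(p)}{1-q(\pi_t)} & \text{if toss } t+1 \text{ is a tail},
  \end{cases} \\[1ex]
\text{(ii) dynamic program:}\quad
  V_T(\pi) &= 0,\\
  V_t(\pi) &= \max\Big\{\, 0,\;
      3\big(1-q(\pi)\big) - q(\pi)
      + q(\pi)\,V_{t+1}\!\big(\pi^{H}\big)
      + \big(1-q(\pi)\big)\,V_{t+1}\!\big(\pi^{T}\big) \Big\},
\end{align*}
```

where π^H and π^T denote the head and tail Bayes updates of π from (i); stopping yields no further earnings.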


Case 2. For a specific information state:


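Under the same payoff assumption, the pair (t, h) — the number of tosses so far and the number of heads among them — is a sufficient information state, since the posterior depends on the history only through it. A sketch:

```latex
% Same payoff assumption as above.
% V_t(h) = maximum expected further earnings given t tosses so far, h of them heads.
\begin{align*}
\text{posterior and predictive head probability:}\quad
  \pi_t(p \mid h) &= \frac{p^{h}(1-p)^{t-h}\, f(p)}{\int_0^1 q^{h}(1-q)^{t-h}\, f(q)\,dq},
  \qquad
  q_{t,h} = \frac{\int_0^1 p^{h+1}(1-p)^{t-h}\, f(p)\,dp}{\int_0^1 p^{h}(1-p)^{t-h}\, f(p)\,dp}, \\[1ex]
\text{(i) evolution:}\quad
  (t, h) &\to
  \begin{cases}
    (t+1,\; h+1) & \text{with probability } q_{t,h} \ \text{(head)},\\
    (t+1,\; h)   & \text{with probability } 1 - q_{t,h} \ \text{(tail)},
  \end{cases} \\[1ex]
\text{(ii) dynamic program:}\quad
  V_T(h) &= 0, \qquad
  V_t(h) = \max\Big\{\, 0,\;
      3(1 - q_{t,h}) - q_{t,h}
      + q_{t,h}\, V_{t+1}(h+1)
      + (1 - q_{t,h})\, V_{t+1}(h) \Big\}.
\end{align*}
```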



Q9.

Show that W(i), i = 1, …, I, is the solution of the linear program: maximize ∑_{i=1}^{I} z(i) subject to z(i) ≤ c(i,u) + β ∑_{j=1}^{I} P_ij(u) z(j), ∀ u, i.

A9.

Let β ∈ (0, 1), a stochastic transition matrix P(u) for each action u, and the one-step cost c(i, u) be given. Define the Bellman operator T by


(Tv)(i) = min_u { c(i, u) + β ∑_j P_ij(u) v(j) }


W is the unique fixed point of T, that is, W = TW. For every i and u, we have


W(i) = min_{u'} { c(i, u') + β ∑_j P_ij(u') W(j) } ≤ c(i, u) + β ∑_j P_ij(u) W(j),


so W satisfies the given inequality constraints, i.e., W is LP feasible. Conversely, for any feasible z, since the constraint holds for all u, we have


z(i) ≤ c(i, u) + β ∑_j P_ij(u) z(j) for all u  ⇒  z(i) ≤ (Tz)(i)


Next, note that the operator T is monotone:


x ≤ y  ⇒  c(i, u) + β ∑_j P_ij(u) x(j) ≤ c(i, u) + β ∑_j P_ij(u) y(j) for all i, u  ⇒  (Tx)(i) ≤ (Ty)(i)


Therefore, z ≤ Tz ≤ T²z ≤ ··· ≤ lim_{n→∞} Tⁿz = W. The uniqueness of W and the convergence Tⁿz → W from any bounded starting point are guaranteed by the Banach fixed-point theorem, since T is a β-contraction in the sup norm. Moreover, any feasible z is bounded above:


max_i z(i) ≤ max_{i,u} c(i, u) + β max_j z(j)  ⇔  max_i z(i) ≤ max_{i,u} c(i, u) / (1 − β)


Hence ∑_i z(i) ≤ ∑_i W(i) for every feasible z. In summary, W itself satisfies the constraints of the given linear program, and since z ≤ W for every feasible z, the feasible point that maximizes ∑_i z(i) is z = W, and this maximizer is unique.
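The conclusion can be checked numerically. Below is a small sketch (my own illustration, not part of the original solution): the instance — the numbers of states and actions, the discount factor, and the random c and P — is made up, and the LP is solved with scipy.optimize.linprog and compared against value iteration.

```python
# Numerical sanity check for A9 (illustration only; the instance below is made up):
# the optimal discounted cost W from value iteration should match the solution of
#   maximize sum_i z(i)   s.t.   z(i) <= c(i,u) + beta * sum_j P_ij(u) z(j)  for all i, u.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
I, U, beta = 4, 3, 0.9                       # states, actions, discount factor (assumed values)

c = rng.uniform(0.0, 1.0, size=(I, U))       # one-step costs c(i, u)
P = rng.uniform(0.0, 1.0, size=(U, I, I))    # transition kernels P[u][i, j]
P /= P.sum(axis=2, keepdims=True)            # make each P(u) row-stochastic

# Value iteration: W = lim_n T^n W0 with (Tv)(i) = min_u { c(i,u) + beta * sum_j P_ij(u) v(j) }.
W = np.zeros(I)
for _ in range(100_000):
    W_new = np.min(c + beta * np.einsum("uij,j->iu", P, W), axis=1)
    done = np.max(np.abs(W_new - W)) < 1e-12
    W = W_new
    if done:
        break

# LP: linprog minimizes, so minimize -sum_i z(i); one constraint row per (i, u):
#   z(i) - beta * sum_j P_ij(u) z(j) <= c(i, u).
A_ub, b_ub = [], []
for i in range(I):
    for u in range(U):
        row = -beta * P[u, i, :]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(c[i, u])
res = linprog(-np.ones(I), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * I, method="highs")

print("value iteration:", np.round(W, 6))
print("linear program :", np.round(res.x, 6))  # the two vectors should agree
```

The explicit bounds=(None, None) are needed because linprog defaults to nonnegative variables, while the z(i) in the LP are unrestricted in sign.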



Q10.

Show that the minimum cost is the solution of the linear program: maximize J* subject to J* + w(i) ≤ c(i,u) + ∑_{j=1}^{I} P_ij(u) w(j), 1 ≤ i ≤ I, u ∈ U.

A10.

The average-cost Bellman equation defines Jopt* together with a relative value function h: Jopt* + h(i) = min_{u∈U} { c(i,u) + ∑_j P_ij(u) h(j) }, ∀i, and Jopt* is the minimum achievable (long-run average) cost. If we set w(i) = h(i), then (Jopt*, w) is an LP-feasible solution.


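One way to spell this step out (my own write-up of the feasibility argument): since the minimum over u is a lower bound for every fixed u, the Bellman equation gives

```latex
% Feasibility of (J_opt*, h) for the LP, using the average-cost Bellman equation.
\begin{align*}
J_{\mathrm{opt}}^* + h(i)
  = \min_{u \in U}\Big\{ c(i,u) + \sum_{j} P_{ij}(u)\, h(j) \Big\}
  \;\le\; c(i,u) + \sum_{j} P_{ij}(u)\, h(j)
  \qquad \text{for all } i \text{ and } u \in U,
\end{align*}
```

which is exactly the LP constraint with w = h, so (Jopt*, h) is feasible.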


For arbitrary i and u and any LP-feasible pair (J, w), the constraint J + w(i) ≤ c(i, u) + ∑_j P_ij(u) w(j) can be iterated along any trajectory of the controlled chain.


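A sketch of the iteration (my own write-up; x_0, x_1, … and u_0, u_1, … denote the states and actions generated by an arbitrary policy, and the expectation is over that trajectory): taking expectations of the constraint at each step, summing over t = 0, …, n−1, and telescoping the w-terms gives

```latex
% The LP constraint iterated along an arbitrary policy's trajectory.
\begin{align*}
J + w(x_t) &\;\le\; c(x_t, u_t) + \mathbb{E}\big[\, w(x_{t+1}) \,\big|\, x_t, u_t \,\big], \\
n\,J &\;\le\; \mathbb{E}\Big[ \sum_{t=0}^{n-1} c(x_t, u_t) \Big]
        + \mathbb{E}\big[ w(x_n) \big] - \mathbb{E}\big[ w(x_0) \big], \\
J &\;\le\; \liminf_{n \to \infty} \frac{1}{n}\,
        \mathbb{E}\Big[ \sum_{t=0}^{n-1} c(x_t, u_t) \Big],
\end{align*}
```

since w is bounded on the finite state space. Thus every LP-feasible J is a lower bound on the long-run average cost of every policy.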


Let g be an optimal stationary policy whose long-run average cost is Jopt*.


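Applying the bound above to the trajectory generated by g (a sketch; the long-run average cost of g is Jopt* by the choice of g):

```latex
% The iterated bound specialised to the optimal stationary policy g.
\begin{align*}
J \;\le\; \lim_{n \to \infty} \frac{1}{n}\,
      \mathbb{E}\Big[ \sum_{t=0}^{n-1} c\big(x_t, g(x_t)\big) \Big]
  \;=\; J_{\mathrm{opt}}^*,
\end{align*}
```

so every LP-feasible J satisfies J ≤ Jopt*.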


Here I used sample-path optimality (a stronger statement), which can be derived via a martingale difference sequence (MDS) argument. Let J* := sup{ J | ∃ w s.t. (J, w) is LP feasible } be the optimal value of the LP. Since Jopt* ≥ J for every LP-feasible J, we have Jopt* ≥ J*. Conversely, (Jopt*, h) is LP feasible, so J* ≥ Jopt*. Therefore, the minimal cost Jopt* from the Bellman optimality equation is identical to the optimal value J* of the linear program: max J* s.t. J* + w(i) ≤ c(i, u) + ∑_j P_ij(u) w(j).



Input: 2025.11.21 01:30
