
Reinforcement Learning Example [01-10]

Recommended article: 【Control Theory】 Stochastic Control Theory



Q4.

Consider a finite-state ergodic discrete time Markov chain {Xt}t∈Z+ on state-space 𝒮 = {1, 2, …, I} with transition matrix P and initial distribution μ (for X0). For (bounded) cost vector C let (L, J) be a (bounded) solution to the Poisson equation for (C, P), i.e., (L, J) satisfy (vector) equation


L + 1J = C + PL


Then show the following:

(a) The above equation can be written as


L(Xt) + J = C(Xt) + 𝔼[L(Xt+1) | Xt]


where L(Xt) = L(i) (the ith component of column vector L) if Xt = i.

(b) Then, using the Markov property show that the random process {Mt}t∈Z+ given by


M0 := L(X0), Mt+1 := L(Xt+1) + ∑s=0 to t C(Xs) - (t+1) J, t = 0, 1, 2, ···


is a martingale with respect to the filtration {ℱt}t∈Z+ where ℱt = σ(X0, X1, …, Xt). As 𝔼[|Mt|] < +∞ follows from the boundedness of C, L and J, you only need to show property (2), adaptedness, i.e., Mt is ℱt-measurable, and property (3), 𝔼[Mt+1 | ℱt] = Mt w.p. 1. For property (3) you will need to use part (a) of this question.

A4.

(a)

Let’s check the dimensionality of the given Poisson equation


L + 1J = C + PL, with L, C, 1 ∈ ℝI×1 (1 being the all-ones column vector), J ∈ ℝ, and P ∈ ℝI×I, so every term on both sides is an I×1 vector.


Here, P(i,j) (i-th row, j-th column) denotes the transition probability ℙ(Xt+1 = j | Xt = i). Thus, the i-th component of the equation is


L(i) + J = C(i) + ∑j=1 to I P(i,j)·L(j) = C(i) + 𝔼[L(j) | i]


Equivalently, given Xt = i, it can be represented as


L(Xt) + J = C(Xt) + 𝔼[L(Xt+1) | Xt]


Here L(j) is replaced by L(Xt+1) because, conditioned on Xt = i, the summation over j is exactly the expectation over the next state Xt+1.
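
To see the componentwise identity numerically, here is a minimal sketch in Python; the 3-state transition matrix P and cost vector C below are made-up illustration values, not part of the problem. It computes the stationary distribution π, sets J = π·C, solves for a bias vector L (pinning L(1) = 0, since L is determined only up to an additive constant), and checks L + 1J = C + PL.

```python
import numpy as np

# Toy 3-state ergodic chain; P and C are made-up illustration values.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
C = np.array([1.0, 4.0, 2.0])
I = P.shape[0]

# Stationary distribution pi: solve pi P = pi together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(I), np.ones(I)])
b = np.concatenate([np.zeros(I), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Average cost J = pi . C; the bias L solves (I - P) L = C - J*1.
# L is unique only up to an additive constant, so pin L[0] = 0.
J = pi @ C
M = np.eye(I) - P
M[0, :] = 0.0
M[0, 0] = 1.0
rhs = C - J
rhs[0] = 0.0
L = np.linalg.solve(M, rhs)

# Componentwise check of the Poisson equation L + 1J = C + PL.
print(np.allclose(L + J, C + P @ L))   # expected: True
```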

(b)

M0 = L(X0) (measurable by X0)

M1 = L(X1) + C(X0) - J (measurable by X0:1)

M2 = L(X2) + C(X0) + C(X1) - 2J (measurable by X0:2)

In general, given measurable spaces (Ω, ℱ) and (ℝ, 𝒢), a map X: Ω → ℝ is measurable when X⁻¹(A) = {ω ∈ Ω : X(ω) ∈ A} ∈ ℱ for all A ∈ 𝒢. Note that measurability is defined without reference to a measure ℙ, and 𝒢 is usually taken to be the Borel σ-algebra ℬ(ℝ). By the pattern above, Mt is a fixed measurable function of X0:t (a finite sum of L, C and constants evaluated along the path), and each Xs with s ≤ t is ℱt-measurable. Thus, Mt is ℱt-measurable.

Clearly ℱ0 ⊆ ℱ1 ⊆ ℱ2 ⊆ ⋯ (∵ σ(X0:t-1) ⊆ σ(X0:t)), so {Mt} is adapted to the filtration. The remaining question is whether 𝔼[Mt+1 | ℱt] = Mt holds; in words, the best prediction of Mt+1 given the information available at time t is the current value Mt. Using (a), we obtain


𝔼[Mt+1 | ℱt] = 𝔼[L(Xt+1) | ℱt] + ∑s=0 to t C(Xs) - (t+1)J
= 𝔼[L(Xt+1) | Xt] + ∑s=0 to t C(Xs) - (t+1)J   (Markov property)
= L(Xt) + J - C(Xt) + ∑s=0 to t C(Xs) - (t+1)J   (by part (a))
= L(Xt) + ∑s=0 to t-1 C(Xs) - tJ = Mt   w.p. 1


Here, the second equality uses the Markov property: by the Doob–Dynkin lemma, every σ(X0, …, Xt)-measurable random variable is of the form g(X0, …, Xt) for some measurable function g, and for a Markov chain conditioning on ℱt = σ(X0, …, Xt) reduces to conditioning on Xt. With the filtration structure of {ℱt}t∈ℤ+ and properties (1), (2) and (3) established, {Mt}t∈ℤ+ is a martingale.
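
As a sanity check of the martingale property, the following sketch continues the code from part (a) (it reuses the P, C, L, J computed there): it simulates many sample paths started from a fixed X0 and compares the empirical mean of MT with M0 = L(X0), which should agree for a martingale.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_M_T(P, C, L, J, T, x0, rng):
    """Return M_T = L(X_T) + sum_{s<T} C(X_s) - T*J along one sample path."""
    x, cost_sum = x0, 0.0
    for _ in range(T):
        cost_sum += C[x]
        x = rng.choice(len(C), p=P[x])
    return L[x] + cost_sum - T * J

# P, C, L, J are the arrays computed in the sketch for part (a).
# For a martingale started from a fixed X0 = x0, E[M_T] = M_0 = L(x0).
x0, T = 0, 20
samples = [simulate_M_T(P, C, L, J, T, x0, rng) for _ in range(5000)]
print(np.mean(samples), L[x0])   # the two numbers should be close
```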




Q5.

Let (Xt)t≥0 be a stochastic process with state space S. Let P be a transition operator on functions f : S → ℝ, and let I denote the identity operator. For every bounded function f : S → ℝ, define

Mtf = f(Xt) - f(X0) - ∑τ = 0 to (t-1) (P - I)f(Xτ),

and let ℱt = σ(X0, …, Xt) be the natural filtration generated by X. Show that the following two statements are equivalent:

  1. (Xt)t≥0 is a Markov chain with transition operator P.

  2. For every bounded function f, the process (Mtf)t≥0 is a martingale with respect to the filtration (ℱt)t≥0.

A5.

Proof of 1 → 2

Mt+1f - Mtf = f(Xt+1) - f(Xt) - (P-I)f(Xt) = f(Xt+1) - Pf(Xt)

If (Xt) is a Markov chain with transition operator P, then 𝔼[f(Xt+1) | ℱt] = Pf(Xt).

∴ 𝔼[Mt+1f - Mtf | ℱt] = 𝔼[f(Xt+1) - Pf(Xt) | ℱt] = 0

∴ 𝔼[Mt+1f | ℱt] = Mtf w.p. 1. Moreover, Mtf is bounded (hence integrable) because f is bounded, and it is a function of X0, …, Xt, hence ℱt-measurable. So (Mtf) is a martingale.


Proof of 2 → 1

Mt+1f - Mtf = f(Xt+1) - f(Xt) - (P-I)f(Xt) = f(Xt+1) - Pf(Xt)

Since Mtf is a martingale, 𝔼[Mt+1f - Mtf | ℱt] = 𝔼[f(Xt+1) - Pf(Xt) | ℱt] = 0

That is, 𝔼[f(Xt+1) | ℱt] = Pf(Xt) holds for every bounded f.

Since this holds for every bounded f, the conditional distribution of Xt+1 given the whole past ℱt depends only on Xt and is governed by P → the 1-step Markov property.

Applying this repeatedly with the tower property gives 𝔼[g(Xt+k) | ℱt] = Pkg(Xt) for every bounded g and every k ≥ 1, which is the full Markov property with transition operator P.
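
A quick numerical illustration of the key identity 𝔼[f(Xt+1) | ℱt] = Pf(Xt): for a toy 3-state chain and a bounded test function f (both made up for illustration), the martingale increment f(Xt+1) − (Pf)(Xt) averages to roughly zero conditioned on each current state.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-state chain and a bounded test function f (illustration values).
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, -2.0, 0.5])
Pf = P @ f                     # (Pf)(i) = E[f(X_{t+1}) | X_t = i]

# Monte Carlo: the increment f(X_{t+1}) - (Pf)(X_t) should average to ~0
# conditioned on each current state i.
n = 200_000
for i in range(3):
    x_next = rng.choice(3, size=n, p=P[i])
    print(i, np.mean(f[x_next] - Pf[i]))   # each value should be close to 0
```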



Q8.

An individual is offered 3 to 1 odds in a coin tossing game where she wins whenever a tail occurs. However, she suspects that the coin is biased and has an a priori probability distribution with CDF F(p) and pdf f(p), for the probability p that a head occurs at each toss. A maximum of T coin tosses is allowed. The individual’s objective is to determine a policy of deciding whether to continue or stop participating in the game, given the outcomes of the game so far, so as to maximize her earnings.

(i) Identify an information state for the problem and write down the equation determining its evolution.

(ii) Write down the dynamic program for this problem.

A8.

The information state can be defined as the posterior distribution of the head probability p given the outcomes observed so far, i.e., the prior f updated by Bayes' rule; equivalently, the pair (number of heads, number of tosses) observed up to time t is a sufficient statistic.

Case 1. For the general information state (the full posterior over p),


[equation image (screenshot) not reproduced]


Case 2. For a specific information state,


[equation image (screenshot) not reproduced]
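
Since the equation images above are not reproduced, here is a hedged sketch of one standard way to set up the problem, under two assumptions that the problem text does not fix: the prior f(p) is taken to be a Beta(a0, b0) density, so the information state reduces to the head/tail counts (h, k) and evolves by incrementing the observed count; and "3 to 1 odds" is read as a payoff of +3 on a tail and −1 on a head per toss. V(h, k) below is the optimal expected remaining earnings, with the stop action worth 0.

```python
from functools import lru_cache

# Assumptions (not fixed by the problem text): Beta(a0, b0) prior on p, and
# each toss pays +3 on a tail and -1 on a head. The information state is the
# pair (h, k) = (heads, tails) observed so far; it evolves by h -> h+1 on a
# head and k -> k+1 on a tail, and the posterior of p is Beta(a0 + h, b0 + k).
a0, b0 = 1.0, 1.0          # uniform prior, as an example
T = 10                     # maximum number of tosses allowed

def p_head(h, k):
    """Posterior probability that the next toss is a head."""
    return (a0 + h) / (a0 + b0 + h + k)

@lru_cache(maxsize=None)
def V(h, k):
    """Optimal expected remaining earnings after h heads and k tails."""
    if h + k == T:
        return 0.0
    ph = p_head(h, k)
    # stop now (earn 0 more) vs. toss once more and continue optimally
    cont = ph * (-1.0 + V(h + 1, k)) + (1.0 - ph) * (3.0 + V(h, k + 1))
    return max(0.0, cont)

print(V(0, 0))   # value of the game before any toss, under these assumptions
```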



Q9.

Show that W(i), i=1,…,I is the solution of the linear program: Maximize ∑i=1 to I z(i) subject to z(i) ≤ c(i,u) + β∑j=1 to I Pij(u) z(j), ∀u,i

A9.

Let a discount factor β ∈ (0, 1), transition matrices P(u), and a one-step cost c(i, u) be given. Define the Bellman operator T by


(Tv)(i) = minu { c(i, u) + β Σj Pij(u) v(j) }


W is the unique fixed point of this operator: W = TW. For all i and u, we have


W(i) = minu' { c(i, u') + β Σj Pij(u') W(j) } ≤ c(i, u) + β Σj Pij(u) W(j)


so W satisfies the inequality constraints, i.e., W is feasible for the LP. Conversely, for any feasible z the constraint holds for all u, so taking the minimum over u gives


z(i) ≤ c(i, u) + β Σj Pij(u) z(j) for all u ⇒ z(i) ≤ (Tz)(i)


Here we can see the monotonicity of the operator T:


x ≤ y ⇒ c(i, u) + β Σj Pij(u) x(j) ≤ c(i, u) + β Σj Pij(u) y(j) ⇒ (Tx)(i) ≤ (Ty)(i)


Therefore, z ≤ Tz ≤ T²z ≤ ··· ≤ limn→∞ Tⁿz = W. The uniqueness of the fixed point W and the convergence Tⁿz → W from any starting point are guaranteed by the Banach fixed-point theorem, since the Bellman operator T is a β-contraction in the sup norm. Moreover, any feasible z is bounded above:


maxi z(i) ≤ maxi,u c(i, u) + β maxj z(j) ⇒ maxi z(i) ≤ maxi,u c(i, u) / (1 − β)


Hence the objective is bounded and, since z ≤ W componentwise, Σi z(i) ≤ Σi W(i) for every feasible z. In summary, W is a feasible solution of the given linear program, and any feasible z satisfies z ≤ W, so the maximizer of Σi z(i) is z = W, and this W is unique.
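
The argument can also be checked numerically. The sketch below uses a made-up 2-state, 2-action discounted MDP, solves the LP max Σi z(i) s.t. z(i) ≤ c(i,u) + β Σj Pij(u) z(j) with scipy.optimize.linprog, and compares the result with value iteration; the two should coincide at W (up to solver tolerance).

```python
import numpy as np
from scipy.optimize import linprog

# Toy discounted MDP (2 states, 2 actions); P[u] and cost[u] are made up.
beta = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
cost = {0: np.array([2.0, 1.0]),
        1: np.array([0.5, 3.0])}
I, U = 2, 2

# LP: maximize sum_i z(i)  s.t.  z(i) - beta * sum_j Pij(u) z(j) <= c(i, u).
A_ub, b_ub = [], []
for u in range(U):
    for i in range(I):
        row = -beta * P[u][i]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(cost[u][i])
res = linprog(c=-np.ones(I), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * I)
z_lp = res.x

# Value iteration on the same data for comparison.
W = np.zeros(I)
for _ in range(2000):
    W = np.min([cost[u] + beta * P[u] @ W for u in range(U)], axis=0)

print(z_lp, W)   # the two vectors should agree up to solver tolerance
```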



Q10.

Show that the minimum cost is the solution of the linear program: maximize J* subject to J* + w(i) ≤ c(i,u) + ∑j=1 to I Pij(u)w(j), 1 ≤ i ≤ I, u ∈ U.

A10.

Let Jopt* denote the minimum long-run average cost. Together with a relative value (bias) function h, it satisfies the average-cost optimality equation Jopt* + h(i) = minu∈U {c(i,u) + ∑j Pij(u)h(j)}, ∀i. Setting w(i) = h(i), the pair (Jopt*, h) is LP feasible, since


Jopt* + h(i) = minu∈U {c(i, u) + ∑j Pij(u) h(j)} ≤ c(i, u) + ∑j Pij(u) h(j), ∀ i, u


For arbitrary i, u and any LP-feasible pair (J, w), we have


J + w(i) ≤ c(i, u) + ∑j Pij(u) w(j), i.e., J ≤ c(i, u) + ∑j Pij(u) w(j) − w(i)


Let g be an optimal stationary policy whose long-run average cost is Jopt*. Then


Choosing u = g(Xs) at each step and summing the inequality along the trajectory X0, X1, …, XT generated by g,

T·J ≤ ∑s=0 to T-1 c(Xs, g(Xs)) + ∑s=0 to T-1 (𝔼[w(Xs+1) | Xs] − w(Xs))

After dividing by T, the second sum equals (w(XT) − w(X0))/T plus the average of the bounded martingale differences 𝔼[w(Xs+1) | Xs] − w(Xs+1), both of which vanish as T → ∞ w.p. 1; hence

J ≤ limT→∞ (1/T) ∑s=0 to T-1 c(Xs, g(Xs)) = Jopt* w.p. 1


Here I used sample-path optimality (a stronger statement), which follows from the strong law of large numbers for bounded martingale difference sequences (MDS). Let J* := sup{J | ∃w s.t. (J, w) is LP feasible} be the optimal value of the LP. Since Jopt* ≥ J for every LP-feasible J, we have Jopt* ≥ J*; combined with the LP feasibility of (Jopt*, h), which gives J* ≥ Jopt*, the minimal cost Jopt* from the average-cost optimality equation is identical to the optimal value J* of: max J s.t. J + w(i) ≤ c(i, u) + ∑j Pij(u) w(j), ∀ i, u.
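
As a numerical illustration of the equivalence, the sketch below (again on a made-up 2-state, 2-action example) solves the average-cost LP max J s.t. J + w(i) ≤ c(i,u) + Σj Pij(u) w(j) with scipy.optimize.linprog and compares the optimal J with the smallest long-run average cost over all stationary deterministic policies, computed from their stationary distributions.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# Toy average-cost MDP (2 states, 2 actions); P[u] and cost[u] are made up.
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
cost = {0: np.array([2.0, 1.0]),
        1: np.array([0.5, 3.0])}
I, U = 2, 2

# LP over x = (J, w(1), ..., w(I)): maximize J subject to
# J + w(i) - sum_j Pij(u) w(j) <= c(i, u). Since w is determined only up to
# an additive constant, pin w(1) = 0 via the variable bounds.
A_ub, b_ub = [], []
for u in range(U):
    for i in range(I):
        row = np.zeros(1 + I)
        row[0] = 1.0
        row[1:] = -P[u][i]
        row[1 + i] += 1.0
        A_ub.append(row)
        b_ub.append(cost[u][i])
obj = np.zeros(1 + I)
obj[0] = -1.0                              # maximize J <=> minimize -J
bounds = [(None, None), (0.0, 0.0)] + [(None, None)] * (I - 1)
res = linprog(c=obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
J_lp = res.x[0]

# Brute force: long-run average cost of every stationary deterministic policy.
def avg_cost(policy):
    Pg = np.array([P[policy[i]][i] for i in range(I)])
    cg = np.array([cost[policy[i]][i] for i in range(I)])
    A = np.vstack([Pg.T - np.eye(I), np.ones(I)])
    b = np.concatenate([np.zeros(I), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi @ cg

J_opt = min(avg_cost(g) for g in product(range(U), repeat=I))
print(J_lp, J_opt)   # the two values should agree up to solver tolerance
```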



Input: 2025.11.21 01:30
