Dynamic Programming Example Problems [01-10]
Overview
○ Example 1. Unix diff for comparing two files
○ Example 2. Viterbi for hidden Markov model
○ Example 3. Smith-Waterman for genetic sequence alignment
○ Example 4. Bellman-Ford for shortest path routing in networks
○ Example 5. Cocke-Kasami-Younger for parsing context free grammars
Q1.
Consider a rooted binary tree network. Each internal node x has two children, denoted by x(lhs) and x(rhs). The cost of moving from x to its child x(a) is l_a, where a ∈ {lhs, rhs}. For each node x, let L_x be the minimum cost of any path from x to a leaf node. For each leaf node y, define L_y = 0. Show that for any internal node x the following holds:
L_x = min( l_lhs + L_{x(lhs)}, l_rhs + L_{x(rhs)} )
S1.
See this link. Applying this recursion, shortest-path algorithms (e.g., Dijkstra's algorithm, the Floyd-Warshall algorithm) were developed. (ref)
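To make the recursion concrete, here is a minimal bottom-up sketch in Python (the nested-tuple tree representation and the example edge costs are assumptions for illustration):

    # Minimal sketch of L_x = min_{a in {lhs, rhs}} ( l_a + L_{x(a)} ).
    # A node is either None (a leaf, with L = 0) or a tuple
    # (l_lhs, left_subtree, l_rhs, right_subtree).
    def min_cost_to_leaf(node):
        if node is None:                      # leaf node y: L_y = 0
            return 0
        l_lhs, left, l_rhs, right = node
        return min(l_lhs + min_cost_to_leaf(left),
                   l_rhs + min_cost_to_leaf(right))

    # Example: left edge costs 2 and leads to an internal node, right edge costs 5.
    tree = (2, (4, None, 1, None), 5, None)
    print(min_cost_to_leaf(tree))             # 3: go left (cost 2), then right (cost 1)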
Q2.
An investor has a fund. It is worth x_0 pounds at time zero, and the principal cannot be withdrawn. The fund pays interest at rate r × 100% per year for T years. At time t, the investor consumes a proportion a_t of the interest and reinvests the rest in the fund. What choice of {a_t} should the investor make in order to maximize total consumption?
S2.
See this link.
x_{t+1} = x_t + r x_t (1 − a_t)
W_{T−1}(x_{T−1}) = r x_{T−1}
W_{T−2}(x_{T−2}) = max_{0 ≤ a_{T−2} ≤ 1} { r x_{T−2} a_{T−2} + W_{T−1}(x_{T−1}) } = max_{0 ≤ a_{T−2} ≤ 1} { r x_{T−2} a_{T−2} + r (x_{T−2} + r x_{T−2}(1 − a_{T−2})) } = 2 r x_{T−2}
In general, writing W_t(x_t) = r ρ_t x_t, the recursion becomes
W_{T−s}(x_{T−s}) = r x_{T−s} · max_{0 ≤ a_{T−s} ≤ 1} { (1 + r) ρ_{T−s+1} + (1 − r ρ_{T−s+1}) a_{T−s} }
The optimal policy is to save all of the interest at first (a_t = 0) and then, once the remaining horizon is short enough that r ρ_{t+1} < 1, switch to consuming all of it (a_t = 1).
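Because W_t(x) = r ρ_t x is linear in x, it is enough to track ρ_t; a minimal backward-induction sketch follows (the numbers r = 0.2 and T = 10 are assumed for illustration), and it recovers the save-then-consume switch numerically:

    # Backward induction for W_t(x) = max_a { r*x*a + W_{t+1}(x + r*x*(1 - a)) },
    # tracking only rho_t where W_t(x) = r*rho_t*x.
    r, T = 0.2, 10                      # assumed interest rate and horizon
    rho = [0.0] * (T + 1)               # rho_T = 0: no consumption after time T
    policy = [0.0] * T
    for t in range(T - 1, -1, -1):
        # W_t(x) = r*x * max_a { (1+r)*rho_{t+1} + (1 - r*rho_{t+1}) * a }
        a = 1.0 if r * rho[t + 1] < 1 else 0.0   # consume all iff the coefficient of a is positive
        policy[t] = a
        rho[t] = (1 + r) * rho[t + 1] + (1 - r * rho[t + 1]) * a
    print(policy)   # a_t = 0 for t = 0..4, a_t = 1 for t = 5..9: save early, consume late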
Q3.
You want to sell a car. At each time (t = 0, …, T−1), you choose a price p_t, and then a customer comes and looks at the car. The probability that the customer buys the car at price p is D(p). If the car has not been sold by time T, it is sold for a fixed price W_T, where W_T < 1. Maximize the expected reward from selling the car, and find the recursion (dynamic programming / Bellman equation) for the optimal reward when D(p) = max(1 − p, 0).
S3.
See this link. The expected reward satisfies R_t = max_p { p D(p) + (1 − D(p)) R_{t+1} } = max_p { p(1 − p) + p R_{t+1} } with R_T = W_T. R_t is nonincreasing in t, and the maximizing price is p_t = (R_{t+1} + 1) / 2.
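A short numerical sketch of this recursion (the deadline T = 5 and scrap price W_T = 0.3 are assumed values); it shows the prices p_t falling toward the deadline and R_t nonincreasing in t:

    # R_t = max_p { p*(1 - p) + p*R_{t+1} } with maximizer p_t = (1 + R_{t+1}) / 2,
    # so R_t = ((1 + R_{t+1}) / 2)**2.  Assumed: T = 5, R_T = W_T = 0.3.
    T, W_T = 5, 0.3
    R = [0.0] * (T + 1)
    R[T] = W_T
    prices = [0.0] * T
    for t in range(T - 1, -1, -1):
        prices[t] = (1 + R[t + 1]) / 2
        R[t] = prices[t] ** 2
    print([round(p, 3) for p in prices])   # [0.807, 0.783, 0.753, 0.711, 0.65]: price drops over time
    print([round(r, 3) for r in R])        # R_t is nonincreasing in t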
Q4.
You own an expensive fish. Each day you are offered a price X for the fish, where X has probability density function f(x). You may either accept or reject the offer. With probability 1−p the fish dies on that day. Find the policy that maximizes the expected profit from selling the fish.
S4.
See this link. The system is time-invariant (every surviving day looks the same), so the Bellman equation is stationary:
V = 𝔼_X[ max(X, pV) ] = ∫ max(x, pV) f(x) dx,
where accepting the day's offer x yields x, and rejecting it yields the continuation value pV, since the fish survives to the next day with probability p. Writing a := pV, the optimal policy is a threshold rule: if x ≥ a, accept the offer; otherwise, reject it. Substituting back, we obtain the following fixed-point equation for the threshold:
0 = p ∫ max(x, a) f(x) dx − a.
The LHS is fixed at zero, while the RHS is strictly decreasing in a (its derivative is pF(a) − 1 < 0), so the fixed point is unique.
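A minimal numerical sketch of this fixed point (the survival probability p = 0.9 and the Uniform(0, 1) offer distribution are assumptions for illustration); the iteration converges because the right-hand side is a contraction in a:

    import numpy as np

    # Threshold a = p * E[max(X, a)]: accept an offer x iff x >= a.
    p = 0.9                                                         # assumed daily survival probability
    samples = np.random.default_rng(0).uniform(0.0, 1.0, 200_000)   # assumed offer distribution
    a = 0.0
    for _ in range(100):                                            # fixed-point iteration
        a = p * np.maximum(samples, a).mean()
    print(round(a, 3))   # ~0.627; for Uniform(0,1) the closed form is (1 - sqrt(1 - p^2)) / p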
Q5.
A gambler has i pounds and wants to reach N pounds. At each play, the gambler may stake any integer amount j with 1 ≤ j ≤ i. The gambler wins with probability p and loses with probability q = 1−p; if they win, they gain j pounds, and if they lose, they lose j pounds. The game ends when the gambler’s fortune reaches either 0 or N. Assuming that p > 1/2, argue that it is always optimal for the gambler to gamble 1 pound at a time.
S5.
See this link.
Note that R(i) = (1 − (q/p)^i) / (1 − (q/p)^N) is the solution of R(i) = p R(i+1) + q R(i−1), R(0) = 0, R(N) = 1, which is the success probability of always staking 1 pound. The given problem is more complex, but in the end we obtain V(i) = R(i): intuitively, when there is no time limit, the optimal policy in a favourable game (p > 1/2) is to stake the minimum unit so as to minimize the risk of ruin.
The functional form can be found by assuming a linear combination of exponential (geometric) terms, R(i) = A + B (q/p)^i. For uniqueness, suppose R_1 and R_2 both satisfy R(i) = p R(i+1) + q R(i−1) with the same boundary values. Their difference D = R_1 − R_2 satisfies the same recursion with D(0) = D(N) = 0; since each D(i) is a convex combination of its neighbours (like an internally dividing point in geometry), the maximum and minimum of D over the integer grid points {0, 1, …, N} are attained at the endpoints, where D = 0. Hence D is identically 0 on the whole interval, and the solution is unique.
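The claim can also be checked numerically by brute-force value iteration over all admissible stakes; the sketch below (p = 0.6 and N = 10 are assumed) compares the optimal value with the timid-play formula R(i):

    # Value iteration for the maximal probability of reaching N before 0 (p > 1/2).
    p, q, N = 0.6, 0.4, 10
    V = [0.0] * (N + 1)
    V[N] = 1.0
    for _ in range(5000):                                   # converges monotonically from below
        for i in range(1, N):
            V[i] = max(p * V[min(i + j, N)] + q * V[i - j] for j in range(1, i + 1))
    R = lambda i: (1 - (q / p) ** i) / (1 - (q / p) ** N)   # success probability of staking 1 pound
    print(max(abs(V[i] - R(i)) for i in range(N + 1)))      # ~0: staking 1 pound at a time is optimal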
Q6.
Let X_k denote the price of a given stock on day k, and suppose X_{k+1} = X_k + W_k, k = 0, 1, 2, …, where W_0, W_1, W_2, … are independent and identically distributed (i.i.d.) random variables with a given distribution P, having finite mean, and are also independent of X_0, the initial price. Suppose you have an option to buy one share of stock at a fixed price K. You have N days to exercise this option. You do not have to exercise the option, but if you do on a given day k when the price is x, your profit will be x − K. You can exercise the option at most once. Then, determine a strategy that maximizes your expected profit.
A6.
See this link. At any point of time, if you exercise the option at price x, the return is x − K. Hence, the stochastic dynamic program is:
V_N(x) = max(x − K, 0),
V_t(x) = max( x − K, 𝔼_{W_t}[ V_{t+1}(x + W_t) ] ),  t = 0, 1, …, N − 1.
We prove the following properties of the value function: (i) V_t(x) ≥ V_{t+1}(x) for all x, and (ii) V_t(x) − x is decreasing in x. Note that for t = N, V_N(x) ≥ 0. Therefore, V_{N−1}(x) = max(x − K, 𝔼_{W_{N−1}}[V_N(x + W_{N−1})]) ≥ max(x − K, 0) = V_N(x). Let's assume the assertion of part (i) is true for t + 1, that is, V_{t+1}(x) ≥ V_{t+2}(x) for all x. To complete the proof of (i) we must show that V_t(x) ≥ V_{t+1}(x) for all x. Because of the inductive assumption, we get
V_t(x) = max( x − K, 𝔼_W[V_{t+1}(x + W)] ) ≥ max( x − K, 𝔼_W[V_{t+2}(x + W)] ) = V_{t+1}(x).
Regarding the proof of (ii), the claim is true for t = N since V_N(x) − x = max(x − K, 0) − x = max(−K, −x), which is decreasing in x. Suppose the assertion of (ii) is true for t + 1, that is, V_{t+1}(x) − x is decreasing in x. To complete the proof of (ii) we must show that V_t(x) − x is decreasing in x. Write
V_t(x) − x = max( −K, ∫ [V_{t+1}(x + w) − (x + w)] P(dw) + 𝔼[W] ).
By the induction hypothesis, V_{t+1}(x + w) − (x + w) is decreasing in x for each fixed w. Since the integral is a nonnegative linear combination of such functions, it follows that ∫ [V_{t+1}(x + w) − (x + w)] P(dw) is decreasing in x. Consequently, the right-hand side of the above equation is decreasing in x, thus V_t(x) − x is decreasing in x. The proof of (ii) is complete.
Consider the equation for the indifference point: −K = 𝔼_W[V_{t+1}(x + W)] − x. Property (ii) of the value function implies that the indifference equation has at most one solution. Define p_t := the unique solution of the indifference equation if it exists, and +∞ otherwise. Property (i) implies V_t(x) ≥ V_{t+1}(x), hence 𝔼_W[V_t(x + W)] − x ≥ 𝔼_W[V_{t+1}(x + W)] − x, which implies p_{t−1} ≥ p_t (the sequence of thresholds is decreasing in t). It is also clear that p_N = K. From properties (i) and (ii), it follows that the optimal policy is to exercise the option at time t if and only if the current price satisfies x ≥ p_t.
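A backward-induction sketch of the thresholds (the numbers K = 30, N = 10, and the two-point shock W ∈ {−2, +1} with equal probability are assumptions; the negative drift makes early exercise relevant):

    import numpy as np

    # V_t(x) = max(x - K, E[V_{t+1}(x + W)]); p_t = smallest price at which exercising is optimal.
    K, N = 30, 10
    shifts, probs = [-2, 1], [0.5, 0.5]          # assumed distribution of W (mean -0.5)
    xs = np.arange(0, 81)                        # price grid, wide enough for 10 steps
    V = np.maximum(xs - K, 0.0)                  # V_N(x) = max(x - K, 0)
    thresholds = [K]                             # p_N = K
    for t in range(N - 1, -1, -1):
        cont = sum(pr * V[np.clip(xs + w, 0, 80)] for w, pr in zip(shifts, probs))
        thresholds.append(int(xs[np.argmax(xs - K >= cont)]))
        V = np.maximum(xs - K, cont)
    print(thresholds[::-1])                      # p_0 >= p_1 >= ... >= p_N = K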
Q7.
You own a call option with strike price p. This means you may buy one share at price p; if the share price at time t is X_t, your profit from exercising then is X_t − p. The option must be exercised no later than time T. The stock price X_t follows X_{t+1} = X_t + ε_t, where the ε_t are independent random shocks with the same distribution and with finite mean. Show that there exists a decreasing sequence {a_t}_{0≤t≤T} such that it is optimal to exercise the option exactly when X_s ≥ a_s occurs at time s.
A7.
See this link. W_t(x) can be shown to be a decreasing function of t. It can also be shown that W_t(x) − x is a decreasing function of x. Intuitively, the larger x is, the higher the immediate profit from exercising, so one tends to want to exercise the option. Hence we can show that at time s it is optimal to exercise the option exactly when x ≥ a_s. Furthermore, from the dynamic program and the monotone behaviour of W_t(x) in t, we can show that a_0 ≥ a_1 ≥ ··· ≥ a_T. Intuitively, the smaller t is (i.e., the earlier it is, with more time remaining), the higher the continuation value, so instead of exercising the option hastily one prefers to wait strategically.
Q8.
The detection period for a certain type of interstellar particle is N days (N fixed). A scientist may look for that type of particle on any given day or may stay home. When the particle count is x, the probability of detecting it is p(x), and the probability of detecting more than one is zero. The probability p(·) is a known non-decreasing function of x. A particle detected is worth one unit, and a day of searching costs c units (0 < c < 1). If the scientist stays home he/she does not incur any cost and receives no reward. It is assumed that:
(A1) The number x of particles on any given day is known to the scientist.
(A2) The number of particles decreases by one when the scientist is successful; otherwise it is unchanged from one day to the next.
The objective is to determine a detection policy to maximize the scientist’s expected reward during the detection period.
A8.
See this link. Let V_t(x) be the value function when the time is t and the number of particles is x. Then we obtain
V_t(x) = max{ V_{t+1}(x),  −c + p(x)(1 + V_{t+1}(x − 1)) + (1 − p(x)) V_{t+1}(x) },  with V_{N+1}(x) = 0,
where the first term corresponds to staying home and the second to searching.
V_t(x) is decreasing in t and increasing in x. Thus, at any given time t, the incentive to search becomes stronger as x grows, and one can show that on day t it is optimal to search only when x ≥ x_t*. If ties are broken in favour of searching (stay only if Q_t(x, stay) > Q_t(x, search)), then we obtain x_1* = ··· = x_N*: one shows that x_{N−1}* = x_N* and V_{N−1}(x) = V_N(x) for all x ≤ x_N*, and this inductive hypothesis is preserved as we move backward in time. If instead ties are broken in favour of staying (stay whenever Q_t(x, stay) ≥ Q_t(x, search)), then staying is weakly preferred and we obtain x_1* ≥ ··· ≥ x_N*.
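A small backward-induction sketch (the horizon N = 6, cost c = 0.3, and detection probability p(x) = 1 − 0.8^x are assumed); in this instance the printed threshold x_t* is the same on every day:

    # V_t(x) = max{ V_{t+1}(x), -c + p(x)*(1 + V_{t+1}(x-1)) + (1 - p(x))*V_{t+1}(x) }.
    N, c, xmax = 6, 0.3, 10                       # assumed horizon, search cost, particle-count range
    p = lambda x: 1 - 0.8 ** x                    # assumed non-decreasing detection probability
    V = [0.0] * (xmax + 1)                        # value after the last day is 0
    for t in range(N, 0, -1):
        newV, threshold = V[:], None
        for x in range(1, xmax + 1):
            stay = V[x]
            search = -c + p(x) * (1 + V[x - 1]) + (1 - p(x)) * V[x]
            newV[x] = max(stay, search)
            if threshold is None and search >= stay:
                threshold = x                     # smallest x for which searching is optimal
        V = newV
        print(t, threshold)                       # prints threshold 2 for every day t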
Q9.
An optimal stopping problem is a Markov decision process with two actions: a = 0 meaning "stop," and a = 1 meaning "continue." There are two types of costs: c(x, a = 0) = k(x), the stopping cost, and c(x, a = 1) = c(x), the continuation cost. This defines a stopping problem. Assuming that the time horizon is finite and writing C_n(x) for the optimal cost with n periods remaining, the Bellman equation is
C_n(x) = min{ k(x), c(x) + 𝔼[C_{n−1}(X_1) | X_0 = x] },
with boundary condition C_0(x) = k(x).
A9.
See this link. Assume that the cost functions are increasing in x. Then the value function C_t(x) is increasing in x and decreasing in the remaining time t. Hence the optimal policy is of threshold type: with t periods remaining, there exists a number x_t* such that it is optimal to stop if and only if x ≥ x_t*. Moreover, the thresholds are monotone in t, in the sense that x_t* ≤ x_{t+1}*, so that having more time left makes continuation more attractive and one stops only for larger values of x.
Q10.
(Secretary Problem) Suppose you are presented with n offers in sequential order. After looking at an offer you must decide either to accept it and terminate the process or to reject it. Once rejected, the offer is lost. Suppose at any time you know the relative rank of the present offer compared to the previous ones. Determine the strategy that maximizes the probability of selecting the best offer when all n! orderings of offers are assumed to be equally likely.
A10.
Let V_k(x) be the optimal success probability after observing that the k-th offer has relative rank x among the first k offers (the higher the better), and let Q_k(x, a), a ∈ 𝒜 = {accept, reject}, be the state-action value function. Then we have
Q_k(x, accept) = (k/n)·𝟙{x = k}   (only an offer that is the best so far can be the best overall),
Q_k(x, reject) = 𝔼[V_{k+1}(X_{k+1})] = (1/(k+1)) Σ_{x'=1}^{k+1} V_{k+1}(x'),
V_k(x) = max{ Q_k(x, accept), Q_k(x, reject) }.
Note that at the n-th offer, we already know the exact rank of the offer, so the terminal condition is as follows: V_n(x) = 𝟙{x = n}. From this Bellman equation, we obtain the following deterministic optimal policy: accept the k-th offer if and only if it is the best so far (x = k) and k/n ≥ 𝔼[V_{k+1}(X_{k+1})]. Since k/n is increasing in k while the continuation value is decreasing, this is a cutoff rule: reject the first k* − 1 offers and then accept the first offer that is the best seen so far.
Note that as n → ∞ the success probability converges to 1/e.
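A direct DP computation (n = 100 is an assumed instance) reproduces both the cutoff rule and the 1/e limit:

    import math

    # cont[k] = optimal success probability given that the first k offers have been rejected.
    n = 100
    cont = [0.0] * (n + 1)                        # cont[n] = 0: nothing left to accept
    cutoff = n
    for k in range(n - 1, -1, -1):
        accept = (k + 1) / n                      # P(best overall | the (k+1)-th offer is best so far)
        cont[k] = max(accept, cont[k + 1]) / (k + 1) + k * cont[k + 1] / (k + 1)
        if accept >= cont[k + 1]:
            cutoff = k + 1                        # smallest stage at which accepting can be optimal
    print(cutoff, round(cont[0], 3), round(1 / math.e, 3))   # 38 0.371 0.368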
A10-2.
Let p(m) = m/n be the probability that the m-th offer, given that it is the best among the first m, is the best overall, and let L(m) = 1 − m/n be the probability that, after rejecting the first m offers, the overall best still lies ahead. Define V(m) = max{p(m), L(m)}. As m increases, p(m) increases, while L(m) decreases. Therefore, the optimal strategy is to keep rejecting offers at first, and then, after a certain point, accept the first offer that is the best among all offers observed up to that time.
Q11.
You look for a parking space on a street; each space is free with probability p = 1−q. You can’t tell whether a space is free until you reach it. Once at a space you must decide to stop or continue. From position s (that is, s spaces from your destination), the cost of stopping is s. The cost of passing your destination without parking is D.
A11.
See this link. Let V_s be the expected cost when you are s spaces from the destination and about to inspect that space. Then we have the following Bellman equation:
V_s = p·min(s, V_{s−1}) + q·V_{s−1},  with V_{−1} = D (the cost of passing the destination without parking).
Whether, as in the given problem, the stopping cost decreases as you get closer to the destination or, conversely, increases, the optimal policy is a threshold rule: stop at the first free space with s ≤ s*.
Since we always stop when we are close enough to the destination, in the stopping region the recursion becomes V_s = (1 − q)s + qV_{s−1}, and V_s can then be written in a neat closed form (cf. homogeneous and particular solutions): V_s = s − (q/p) + (qD + q/p) q^s. It is optimal to stop at a free space s if and only if s ≤ V_{s−1}, which determines the threshold s*.
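A short numerical check of the recursion against the closed form (p = 0.3 and D = 20 are assumed numbers); the two agree for all s in the stopping region, and the threshold comes out as s* = 5:

    # V_s = p*min(s, V_{s-1}) + q*V_{s-1}, with V_{-1} = D; stop at a free space s iff s <= V_{s-1}.
    p, D = 0.3, 20.0                              # assumed probability of a free space and penalty
    q = 1 - p
    V_prev, s_star = D, None
    for s in range(0, 9):
        if s <= V_prev:
            s_star = s                            # still inside the stopping region
        V_s = p * min(s, V_prev) + q * V_prev
        closed = s - q / p + (q * D + q / p) * q ** s     # valid while s is in the stopping region
        print(s, round(V_s, 4), round(closed, 4))
        V_prev = V_s
    print("threshold s* =", s_star)               # 5: stop at the first free space with s <= 5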
Q12.
Consider a Markov chain (M.C.) that evolves in discrete time t = 0, 1, 2, … with state space X = S = {1, 2, …}. The transition probabilities are given by the matrix {P(j | i)}, for i, j = 1, 2, …. At each time t = 0, 1, 2, …, the decision-maker can perfectly observe the current state X_t of the Markov chain. Each observation incurs a cost c. At some time t, the decision-maker may choose to stop observing the evolution of the Markov chain. If they stop at time t, they receive a reward r(X_t). In particular, if X_t = i, then r(X_t) = r(i). The reward r(X_t) is terminal in the sense that, once the decision-maker stops at time t, the decision is final and no further costs or rewards are incurred. The decision-maker must make this final stopping decision no later than time T. The objective is to choose a strategy g := (g_0, g_1, …, g_{T−1}) that maximizes J^g = 𝔼^g[−c(τ(g)+1) + r(X_{τ(g)})], where τ(g) denotes the (random) time at which the decision-maker stops under strategy g. For every strategy g, we have 0 ≤ τ(g) ≤ T. Determine the optimal policy.
A12.
The dynamic program can be formulated as follows:
V_T(i) = r(i)
V_t(i) = max{ r(i), −c + Σ_j P(j | i) V_{t+1}(j) },  t = 0, 1, …, T − 1
From this we obtain V_t(i) ≥ V_{t+1}(i). Intuitively, if the reward at time t is large enough, one can stop and make the decision immediately; otherwise, one should wait and aim for the next transition. Since the value function is larger the earlier the time t is, the incentive to wait for a transition is stronger, and hence the stopping set at time t, S_t := {i ∈ X : V_t(i) = r(i)}, is a subset of the stopping set at time t + 1; that is, S_t ⊆ S_{t+1}. Now assume that the stopping set at time T − 1, S_{T−1}, has the closure property, i.e., for all (i, j) such that i ∈ S_{T−1} and j ∉ S_{T−1}, we have P(j | i) = 0. In this case one can show inductively that the stopping sets do not shrink any further, so S_t = S_{t+1} = ··· = S_{T−1} for all t. Therefore, the optimal policy at time t is "stop iff i ∈ S_t", and under the closure property it becomes "stop iff i ∈ S_t = S_{T−1}" (the one-step look-ahead rule).
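A small numerical instance of this dynamic program (the 4-state chain, rewards, and cost c = 0.5 below are assumptions) shows the stopping sets shrinking once and then stabilizing as t decreases:

    import numpy as np

    # V_T(i) = r(i);  V_t(i) = max{ r(i), -c + sum_j P(j|i) * V_{t+1}(j) }.
    P = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.3, 0.4, 0.3, 0.0],
                  [0.0, 0.3, 0.4, 0.3],
                  [0.0, 0.0, 0.5, 0.5]])          # assumed transition matrix, row i = P(. | i)
    r = np.array([1.0, 2.0, 4.0, 8.0])            # assumed terminal rewards
    c, T = 0.5, 6
    V = r.copy()                                  # V_T
    for t in range(T - 1, -1, -1):
        V = np.maximum(r, -c + P @ V)
        S_t = np.where(np.isclose(V, r))[0] + 1   # stopping set at time t (1-indexed states)
        print(t, S_t)                             # S_t ⊆ S_{t+1}; here S_t = {1, 2, 4} for all t < T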
Q13.
An individual wants to sell her house. An offer comes at the beginning of each day. It is assumed that successive offers are independent and an offer is x_j with probability p_j, j = 1, 2, …, n, where x_j, j = 1, 2, …, n, are non-negative scalars. Any offer that is not immediately accepted is not lost but may be accepted at any later date. A maintenance cost c is incurred for each day that the house remains unsold. There is a deadline to sell the house within N days. The objective is to maximize the price at which the house is sold minus the maintenance cost.
A13.
We can use the information state z_t = max{x_1, ···, x_t}, the best offer received so far (offers are not lost). Then we have the following dynamic program:
V_N(z) = z,
V_t(z) = max{ z,  −c + Σ_j p_j V_{t+1}(max(z, x_j)) },  t = 1, …, N − 1,
where the first term corresponds to selling at the best offer so far and the second to waiting one more day.
In the dynamic program one can show that V_t(z) ≥ V_{t+1}(z), that V_t(z) is increasing in z, and that V_t(z) is convex in z. The optimal policy is: sell iff z ≥ r_t. If ties are broken in favour of selling (remain only if Q_t(z, remain) > Q_t(z, sell)), we obtain r_1 = ··· = r_N, because the equality can be shown for one step and the same argument applies at every step by induction. If ties are broken in favour of remaining (remain whenever Q_t(z, remain) ≥ Q_t(z, sell)), then remaining is weakly preferred and we obtain r_1 ≥ ··· ≥ r_N.
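A small instance of this dynamic program (offers uniform on {1, …, 5}, c = 0.3, and N = 5 are assumed) computes the daily reservation price r_t, which here is the same on every day:

    # V_N(z) = z;  V_t(z) = max{ z, -c + sum_j p_j * V_{t+1}(max(z, x_j)) }.
    offers = [1, 2, 3, 4, 5]                      # assumed offer values, equally likely
    prob = 1 / len(offers)
    c, N = 0.3, 5
    V = {z: float(z) for z in offers}             # V_N(z) = z (must sell at the deadline)
    for t in range(N - 1, 0, -1):
        Q_remain = {z: -c + sum(prob * V[max(z, x)] for x in offers) for z in offers}
        r_t = min(z for z in offers if z >= Q_remain[z])   # smallest best-offer-so-far worth accepting
        V = {z: max(z, Q_remain[z]) for z in offers}
        print(t, r_t)                             # prints r_t = 4 for every day t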
Q14.
An object is located in one of n possible locations. The probability that the object is in location i is p_i (p_i is the prior probability, p_i > 0, Σ p_i = 1). A search in location i costs c_i, c_i > 0, and if the object is present in that location the probability that it will be discovered is α_i, i = 1, 2, …, n. Determine a policy that discovers the object at minimal cost.
Hint: Show that if (π_1(t), π_2(t), …, π_n(t)) is the information state at time t, t = 1, 2, …, then an optimal policy is to search the location that has the maximal value of α_i π_i(t) / c_i, i = 1, 2, …, n.
A14.
The information state is the posterior probability that the object is in each location given the search history: π_i(t) = ℙ(object is in location i | the first t − 1 searches were unsuccessful). After an unsuccessful search of location j, Bayes' rule gives π_j(t+1) = π_j(t)(1 − α_j) / (1 − α_j π_j(t)) and π_i(t+1) = π_i(t) / (1 − α_j π_j(t)) for i ≠ j. With this information state, the optimal policy is the index rule stated in the hint: always search the location with the maximal α_i π_i(t) / c_i (this can be established by an interchange argument).
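A minimal sketch of the index rule with the Bayes update (the three-location priors, detection probabilities, and costs below are assumed numbers):

    # Search the location maximizing alpha_i * pi_i / c_i; after an unsuccessful search of
    # location j, update the belief by Bayes' rule.
    pi = [0.5, 0.3, 0.2]                          # assumed prior probabilities p_i
    alpha = [0.2, 0.7, 0.9]                       # assumed detection probabilities
    cost = [1.0, 2.0, 1.5]                        # assumed search costs c_i
    for step in range(6):
        index = [alpha[i] * pi[i] / cost[i] for i in range(3)]
        j = index.index(max(index))               # location to search next
        print(step, "search location", j + 1, [round(v, 3) for v in index])
        miss = 1 - alpha[j] * pi[j]               # probability the search is unsuccessful
        pi = [p * ((1 - alpha[j]) if i == j else 1) / miss for i, p in enumerate(pi)]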
Q15.
An individual is offered 3 to 1 odds in a coin tossing game where she wins whenever a tail occurs. However, she suspects that the coin is biased and has an a priori probability distribution with CDF F(p) and pdf f(p), for the probability p that a head occurs at each toss. A maximum of T coin tosses is allowed. The individual’s objective is to determine a policy of deciding whether to continue or stop participating in the game, given the outcomes of the game so far, so as to maximize her earnings.
A15.
The information state is the posterior distribution of the head probability p given the outcomes observed so far; since the tosses are conditionally i.i.d., the numbers of heads and tails seen so far form a sufficient statistic.
For the general information state (the posterior density over p), the Bellman equation compares stopping with paying for one more toss and then updating the posterior with the observed outcome.
For the specific information state (h heads and k − h tails observed after k tosses), the posterior density is proportional to p^h (1 − p)^{k−h} f(p), so the pair (h, k) summarizes everything needed for the decision.
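A minimal sketch of the information-state update, tracking the posterior over p on a discrete grid (the Beta(2, 2) prior and the grid are assumptions for illustration); the posterior mean can then be plugged into the Bellman equation when deciding whether to keep playing:

    import numpy as np

    grid = np.linspace(0.001, 0.999, 999)         # discretized values of the head probability p
    posterior = grid * (1 - grid)                 # assumed prior f(p) ∝ p(1 - p), i.e. Beta(2, 2)
    posterior /= posterior.sum()

    def update(posterior, heads):
        """Bayes update of the information state after one toss."""
        lik = grid if heads else (1 - grid)
        new = posterior * lik
        return new / new.sum()

    for outcome in [True, True, False]:           # observe head, head, tail
        posterior = update(posterior, outcome)
    print(round(float((grid * posterior).sum()), 3))   # ~0.571, the Beta(4, 3) posterior mean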
Q16.
(Sequential binary hypothesis testing) Consider a binary state H ∈ {0, 1} with prior probability ℙ(H = 0) = p. At each time t = 0, 1, ···, T, the decision maker chooses one of the following actions: (1) stop and declare a decision u ∈ {0, 1}, or (2) continue (incurring a cost C) and obtain an observation Y_t. If the declaration is incorrect, a penalty K is incurred.
a. Write down the terminal value function V_T(π) at time t = T.
b. Write the Bellman equation for V_t(π) for 0 ≤ t ≤ T − 1.
c. Let f_0(y) and f_1(y) denote the probability mass functions of Y under H = 0 and H = 1, respectively. Derive the belief update rule π_{t+1} = 𝒯(π_t, y) after observing y.
d. Express the conditional expected cost of choosing "continue," 𝔼[V_{t+1}(π_{t+1}) | π_t = π], in integral form.
e. Show by mathematical induction that for all t, the function V_t(π) is concave in π.
f. Assuming V_t(π) is concave, show that the optimal policy is characterized by two threshold values ρ_t^{1,*} ≤ ρ_t^{0,*}. Write down the equations that determine these threshold values.
g. (Infinite-horizon fixed point) Let T → ∞. Write the fixed-point equation satisfied by the time-invariant value function V(π).
h. Suppose T = 1 (i.e., the current time is t = 0 and there is only one more opportunity to observe), K = 1, C = 0.1, and the observation takes discrete values y ∈ {a, b} with likelihoods f_0(a) = 0.7, f_1(a) = 0.3, f_0(b) = 0.3, and f_1(b) = 0.7. Given the initial belief π = ℙ(H = 0) = 0.45, determine whether it is optimal to continue or to stop.
A16.
a. V_T(π) = min( (1 − π)K, πK )
b. V_t(π) = min( (1 − π)K, πK, C + 𝔼[V_{t+1}(π_{t+1}) | π_t = π] )
c. π_{t+1} = f_0(y) π_t / ( f_0(y) π_t + f_1(y)(1 − π_t) ) = 𝒯(π_t, y)
d. 𝔼[V_{t+1}(π_{t+1}) | π_t = π] = ∫ V_{t+1}(𝒯(π, y)) p(y | π) dy = ∫ V_{t+1}(𝒯(π, y)) ( π f_0(y) + (1 − π) f_1(y) ) dy
e. The base case V_T(π) = min((1 − π)K, πK) is a minimum of linear functions of π, hence concave. If V_{t+1}(π) is concave (induction hypothesis), we can write it as inf_{i∈ℐ} {α_i π + β_i}, so that
𝔼[V_{t+1}(π_{t+1}) | π] = ∫ inf_i {α_i 𝒯(π, y) + β_i} p(y | π) dy = ∫ inf_i { α_i f_0(y) π + β_i ( f_0(y) π + f_1(y)(1 − π) ) } dy,
where we used 𝒯(π, y) p(y | π) = f_0(y) π. For each y the integrand is an infimum of affine functions of π, hence concave, and an integral of concave functions is concave. Taking the minimum with the linear stopping costs preserves concavity, so V_t(π) is concave.
f. Since the stopping costs πK and (1 − π)K are linear in π and the continuation cost g_t(π) := C + 𝔼[V_{t+1}(π_{t+1}) | π_t = π] is concave with g_t(0) ≥ C > 0 and g_t(1) ≥ C > 0, the region where continuing is optimal, {π : g_t(π) < min(πK, (1 − π)K)}, is an interval (ρ_t^{1,*}, ρ_t^{0,*}). The optimal policy is therefore: declare u = 1 if π ≤ ρ_t^{1,*}, declare u = 0 if π ≥ ρ_t^{0,*}, and continue otherwise, where the thresholds solve ρ_t^{1,*} K = g_t(ρ_t^{1,*}) and (1 − ρ_t^{0,*}) K = g_t(ρ_t^{0,*}) (if the continuation region is empty, one simply stops and compares the two stopping costs).
g. As T → ∞ the value function becomes time-invariant and satisfies the fixed-point equation
V(π) = min( (1 − π)K, πK, C + ∫ V(𝒯(π, y)) ( π f_0(y) + (1 − π) f_1(y) ) dy ).
The threshold values then also converge to time-invariant constants ρ^{1,*} ≤ ρ^{0,*}.
h. Stopping immediately costs min((1 − π)K, πK) = min(0.55, 0.45) = 0.45 (declare u = 1). Continuing costs C plus the expected terminal cost: ℙ(Y = a) = 0.48 with posterior 𝒯(0.45, a) = 0.65625, and ℙ(Y = b) = 0.52 with posterior 𝒯(0.45, b) ≈ 0.2596, giving 0.1 + 0.48·0.34375 + 0.52·0.2596 = 0.1 + 0.3 = 0.4. Since 0.4 < 0.45, it is optimal to continue.
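The arithmetic in part (h) can be checked with a few lines (only the numbers given in the problem are used):

    # One-step check: stop now vs pay C, observe Y once, and then stop.
    K, C, pi = 1.0, 0.1, 0.45
    f0 = {'a': 0.7, 'b': 0.3}
    f1 = {'a': 0.3, 'b': 0.7}
    stop_now = min((1 - pi) * K, pi * K)                    # 0.45 (declare u = 1)
    cont = C
    for y in ('a', 'b'):
        p_y = pi * f0[y] + (1 - pi) * f1[y]                 # P(Y = y)
        post = pi * f0[y] / p_y                             # updated belief T(pi, y)
        cont += p_y * min((1 - post) * K, post * K)         # terminal cost at the new belief
    print(stop_now, round(cont, 4))                         # 0.45 vs 0.4: continuing is optimal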
Input: 2025.11.21 00:36