Dynamic Programming Example Problems [01-10]
Overview
○ Example 1. Unix diff for comparing two files
○ Example 2. Viterbi for hidden Markov model
○ Example 3. Smith-Waterman for genetic sequence alignment
○ Example 4. Bellman-Ford for shortest path routing in networks
○ Example 5. Cocke-Kasami-Younger for parsing context free grammars
Q1.
Consider a rooted binary tree network. Each internal node x has two children, denoted by x(lhs) and x(rhs). The cost of moving from x to its child x(a) is l_a, where a ∈ {lhs, rhs}. For each node x, let L_x be the minimum cost of any path from x to a leaf node. For each leaf node y, define L_y = 0. Show that for any internal node x the following holds:
L_x = min( l_lhs + L_{x(lhs)}, l_rhs + L_{x(rhs)} )
S1.
See this link. Applying this recursion, shortest-path algorithms (e.g., Dijkstra's algorithm, the Floyd-Warshall algorithm) were developed. (ref)
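To make the recursion concrete, here is a minimal bottom-up sketch in Python (the nested-tuple tree representation and the example edge costs are assumptions for illustration):

    # Minimal sketch of L_x = min_{a in {lhs, rhs}} ( l_a + L_{x(a)} ).
    # A node is either None (a leaf, with L = 0) or a tuple
    # (l_lhs, left_subtree, l_rhs, right_subtree).
    def min_cost_to_leaf(node):
        if node is None:                      # leaf node y: L_y = 0
            return 0
        l_lhs, left, l_rhs, right = node
        return min(l_lhs + min_cost_to_leaf(left),
                   l_rhs + min_cost_to_leaf(right))

    # Example: left edge costs 2 and leads to an internal node, right edge costs 5.
    tree = (2, (4, None, 1, None), 5, None)
    print(min_cost_to_leaf(tree))             # 3: go left (cost 2), then right (cost 1)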
Q2.
An investor has a fund. It is worth x_0 pounds at time zero, and the principal cannot be withdrawn. The fund pays interest at rate r × 100% per year for T years. At time t, the investor consumes a proportion a_t of the interest and reinvests the rest in the fund. What choice of {a_t} should the investor make in order to maximize total consumption?
S2.
See this link.
x_{t+1} = x_t + r x_t (1 − a_t)
W_{T−1}(x_{T−1}) = r x_{T−1}
W_{T−2}(x_{T−2}) = max_{0 ≤ a_{T−2} ≤ 1} { r x_{T−2} a_{T−2} + W_{T−1}(x_{T−1}) } = max_{0 ≤ a_{T−2} ≤ 1} { r x_{T−2} a_{T−2} + r (x_{T−2} + r x_{T−2}(1 − a_{T−2})) } = 2 r x_{T−2}
In general, writing W_t(x_t) = r ρ_t x_t, the recursion becomes
W_{T−s}(x_{T−s}) = r x_{T−s} · max_{0 ≤ a_{T−s} ≤ 1} { (1 + r) ρ_{T−s+1} + (1 − r ρ_{T−s+1}) a_{T−s} }
The optimal policy is to save all of the interest at first (a_t = 0) and then, once the remaining horizon is short enough that r ρ_{t+1} < 1, switch to consuming all of it (a_t = 1).
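Because W_t(x) = r ρ_t x is linear in x, it is enough to track ρ_t; a minimal backward-induction sketch follows (the numbers r = 0.2 and T = 10 are assumed for illustration), and it recovers the save-then-consume switch numerically:

    # Backward induction for W_t(x) = max_a { r*x*a + W_{t+1}(x + r*x*(1 - a)) },
    # tracking only rho_t where W_t(x) = r*rho_t*x.
    r, T = 0.2, 10                      # assumed interest rate and horizon
    rho = [0.0] * (T + 1)               # rho_T = 0: no consumption after time T
    policy = [0.0] * T
    for t in range(T - 1, -1, -1):
        # W_t(x) = r*x * max_a { (1+r)*rho_{t+1} + (1 - r*rho_{t+1}) * a }
        a = 1.0 if r * rho[t + 1] < 1 else 0.0   # consume all iff the coefficient of a is positive
        policy[t] = a
        rho[t] = (1 + r) * rho[t + 1] + (1 - r * rho[t + 1]) * a
    print(policy)   # a_t = 0 for t = 0..4, a_t = 1 for t = 5..9: save early, consume late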
Q3.
You want to sell a car. At each time (t = 0, …, T−1), you choose a price p_t, and then a customer comes and looks at the car. The probability that the customer buys the car at price p is D(p). If the car has not been sold by time T, it is sold for a fixed price W_T, where W_T < 1. Maximize the expected reward from selling the car, and find the recursion (dynamic programming / Bellman equation) for the optimal reward when D(p) = max(1 − p, 0).
S3.
See this link. The expected reward satisfies R_t = max_p { p D(p) + (1 − D(p)) R_{t+1} } = max_p { p(1 − p) + p R_{t+1} } with R_T = W_T. R_t is nonincreasing in t, and the maximizing price is p_t = (R_{t+1} + 1) / 2.
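A short numerical sketch of this recursion (the deadline T = 5 and scrap price W_T = 0.3 are assumed values); it shows the prices p_t falling toward the deadline and R_t nonincreasing in t:

    # R_t = max_p { p*(1 - p) + p*R_{t+1} } with maximizer p_t = (1 + R_{t+1}) / 2,
    # so R_t = ((1 + R_{t+1}) / 2)**2.  Assumed: T = 5, R_T = W_T = 0.3.
    T, W_T = 5, 0.3
    R = [0.0] * (T + 1)
    R[T] = W_T
    prices = [0.0] * T
    for t in range(T - 1, -1, -1):
        prices[t] = (1 + R[t + 1]) / 2
        R[t] = prices[t] ** 2
    print([round(p, 3) for p in prices])   # [0.807, 0.783, 0.753, 0.711, 0.65]: price drops over time
    print([round(r, 3) for r in R])        # R_t is nonincreasing in t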
Q4.
You own an expensive fish. Each day you are offered a price X for the fish, where X has probability density function f(x). You may either accept or reject the offer. With probability 1−p the fish dies on that day. Find the policy that maximizes the expected profit from selling the fish.
S4.
See this link. The system is time-invariant (every surviving day looks the same), so the Bellman equation is stationary:
V = 𝔼_X[ max(X, pV) ] = ∫ max(x, pV) f(x) dx,
where accepting the day's offer x yields x, and rejecting it yields the continuation value pV, since the fish survives to the next day with probability p. Writing a := pV, the optimal policy is a threshold rule: if x ≥ a, accept the offer; otherwise, reject it. Substituting back, we obtain the following fixed-point equation for the threshold:
0 = p ∫ max(x, a) f(x) dx − a.
The LHS is fixed at zero, while the RHS is strictly decreasing in a (its derivative is pF(a) − 1 < 0), so the fixed point is unique.
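A minimal numerical sketch of this fixed point (the survival probability p = 0.9 and the Uniform(0, 1) offer distribution are assumptions for illustration); the iteration converges because the right-hand side is a contraction in a:

    import numpy as np

    # Threshold a = p * E[max(X, a)]: accept an offer x iff x >= a.
    p = 0.9                                                         # assumed daily survival probability
    samples = np.random.default_rng(0).uniform(0.0, 1.0, 200_000)   # assumed offer distribution
    a = 0.0
    for _ in range(100):                                            # fixed-point iteration
        a = p * np.maximum(samples, a).mean()
    print(round(a, 3))   # ~0.627; for Uniform(0,1) the closed form is (1 - sqrt(1 - p^2)) / p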
Q5.
A gambler has i pounds and wants to reach N pounds. At each play, the gambler may stake any integer amount j with 1 ≤ j ≤ i. The gambler wins with probability p and loses with probability q = 1−p; if they win, they gain j pounds, and if they lose, they lose j pounds. The game ends when the gambler’s fortune reaches either 0 or N. Assuming that p > 1/2, argue that it is always optimal for the gambler to gamble 1 pound at a time.
S5.
See this link.
Note that R(i) = (1 − (q/p)^i) / (1 − (q/p)^N) is the solution of R(i) = p R(i+1) + q R(i−1), R(0) = 0, R(N) = 1, which is the success probability of always staking 1 pound. The given problem is more complex, but in the end we obtain V(i) = R(i): intuitively, when there is no time limit, the optimal policy in a favourable game (p > 1/2) is to stake the minimum unit so as to minimize the risk of ruin.
The functional form can be found by assuming a linear combination of exponential (geometric) terms, R(i) = A + B (q/p)^i. For uniqueness, suppose R_1 and R_2 both satisfy R(i) = p R(i+1) + q R(i−1) with the same boundary values. Their difference D = R_1 − R_2 satisfies the same recursion with D(0) = D(N) = 0; since each D(i) is a convex combination of its neighbours (like an internally dividing point in geometry), the maximum and minimum of D over the integer grid points {0, 1, …, N} are attained at the endpoints, where D = 0. Hence D is identically 0 on the whole interval, and the solution is unique.
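The claim can also be checked numerically by brute-force value iteration over all admissible stakes; the sketch below (p = 0.6 and N = 10 are assumed) compares the optimal value with the timid-play formula R(i):

    # Value iteration for the maximal probability of reaching N before 0 (p > 1/2).
    p, q, N = 0.6, 0.4, 10
    V = [0.0] * (N + 1)
    V[N] = 1.0
    for _ in range(5000):                                   # converges monotonically from below
        for i in range(1, N):
            V[i] = max(p * V[min(i + j, N)] + q * V[i - j] for j in range(1, i + 1))
    R = lambda i: (1 - (q / p) ** i) / (1 - (q / p) ** N)   # success probability of staking 1 pound
    print(max(abs(V[i] - R(i)) for i in range(N + 1)))      # ~0: staking 1 pound at a time is optimal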
Q6.
Let X_k denote the price of a given stock on day k, and suppose X_{k+1} = X_k + W_k, k = 0, 1, 2, …, where W_0, W_1, W_2, … are independent and identically distributed (i.i.d.) random variables with a given distribution P, having finite mean, and are also independent of X_0, the initial price. Suppose you have an option to buy one share of stock at a fixed price K. You have N days to exercise this option. You do not have to exercise the option, but if you do on a given day k when the price is x, your profit will be x − K. You can exercise the option at most once. Then, determine a strategy that maximizes your expected profit.
A6.
See this link. At any point of time, if you exercise the option at price x, the return is x − K. Hence, the stochastic dynamic program is:
V_N(x) = max(x − K, 0),
V_t(x) = max( x − K, 𝔼_{W_t}[ V_{t+1}(x + W_t) ] ),  t = 0, 1, …, N − 1.
We prove the following properties of the value function: (i) V_t(x) ≥ V_{t+1}(x) for all x, and (ii) V_t(x) − x is decreasing in x. Note that for t = N, V_N(x) ≥ 0. Therefore, V_{N−1}(x) = max(x − K, 𝔼_{W_{N−1}}[V_N(x + W_{N−1})]) ≥ max(x − K, 0) = V_N(x). Let's assume the assertion of part (i) is true for t + 1, that is, V_{t+1}(x) ≥ V_{t+2}(x) for all x. To complete the proof of (i) we must show that V_t(x) ≥ V_{t+1}(x) for all x. Because of the inductive assumption, we get
V_t(x) = max( x − K, 𝔼_W[V_{t+1}(x + W)] ) ≥ max( x − K, 𝔼_W[V_{t+2}(x + W)] ) = V_{t+1}(x).
Regarding the proof of (ii), the claim is true for t = N since V_N(x) − x = max(x − K, 0) − x = max(−K, −x), which is decreasing in x. Suppose the assertion of (ii) is true for t + 1, that is, V_{t+1}(x) − x is decreasing in x. To complete the proof of (ii) we must show that V_t(x) − x is decreasing in x. Write
V_t(x) − x = max( −K, ∫ [V_{t+1}(x + w) − (x + w)] P(dw) + 𝔼[W] ).
By the induction hypothesis, V_{t+1}(x + w) − (x + w) is decreasing in x for each fixed w. Since the integral is a nonnegative linear combination of such functions, it follows that ∫ [V_{t+1}(x + w) − (x + w)] P(dw) is decreasing in x. Consequently, the right-hand side of the above equation is decreasing in x, thus V_t(x) − x is decreasing in x. The proof of (ii) is complete.
Consider the equation for the indifference point: −K = 𝔼_W[V_{t+1}(x + W)] − x. Property (ii) of the value function implies that the indifference equation has at most one solution. Define p_t := the unique solution of the indifference equation if it exists, and +∞ otherwise. Property (i) implies V_t(x) ≥ V_{t+1}(x), hence 𝔼_W[V_t(x + W)] − x ≥ 𝔼_W[V_{t+1}(x + W)] − x, which implies p_{t−1} ≥ p_t (the sequence of thresholds is decreasing in t). It is also clear that p_N = K. From properties (i) and (ii), it follows that the optimal policy is to exercise the option at time t if and only if the current price satisfies x ≥ p_t.
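A backward-induction sketch of the thresholds (the numbers K = 30, N = 10, and the two-point shock W ∈ {−2, +1} with equal probability are assumptions; the negative drift makes early exercise relevant):

    import numpy as np

    # V_t(x) = max(x - K, E[V_{t+1}(x + W)]); p_t = smallest price at which exercising is optimal.
    K, N = 30, 10
    shifts, probs = [-2, 1], [0.5, 0.5]          # assumed distribution of W (mean -0.5)
    xs = np.arange(0, 81)                        # price grid, wide enough for 10 steps
    V = np.maximum(xs - K, 0.0)                  # V_N(x) = max(x - K, 0)
    thresholds = [K]                             # p_N = K
    for t in range(N - 1, -1, -1):
        cont = sum(pr * V[np.clip(xs + w, 0, 80)] for w, pr in zip(shifts, probs))
        thresholds.append(int(xs[np.argmax(xs - K >= cont)]))
        V = np.maximum(xs - K, cont)
    print(thresholds[::-1])                      # p_0 >= p_1 >= ... >= p_N = K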
Q7.
You own a call option with strike price p. This means you may buy one share at price p; if the share price at time t is X_t, your profit from exercising then is X_t − p. The option must be exercised no later than time T. The stock price X_t follows X_{t+1} = X_t + ε_t, where the ε_t are independent random shocks with the same distribution and with finite mean. Show that there exists a decreasing sequence {a_t}_{0≤t≤T} such that it is optimal to exercise the option exactly when X_s ≥ a_s occurs at time s.
A7.
See this link. W_t(x) can be shown to be a decreasing function of t. It can also be shown that W_t(x) − x is a decreasing function of x. Intuitively, the larger x is, the higher the immediate profit from exercising, so one tends to want to exercise the option. Hence we can show that at time s it is optimal to exercise the option exactly when x ≥ a_s. Furthermore, from the dynamic program and the monotone behaviour of W_t(x) in t, we can show that a_0 ≥ a_1 ≥ ··· ≥ a_T. Intuitively, the smaller t is (i.e., the earlier it is, with more time remaining), the higher the continuation value, so instead of exercising the option hastily one prefers to wait strategically.
Q8.
The detection period for a certain type of interstellar particle is N days (N fixed). A scientist may look for that type of particle on any given day or may stay home. When the particle count is x, the probability of detecting it is p(x), and the probability of detecting more than one is zero. The probability p(·) is a known non-decreasing function of x. A particle detected is worth one unit, and a day of searching costs c units (0 < c < 1). If the scientist stays home he/she does not incur any cost and receives no reward. It is assumed that:
(A1) The number x of particles on any given day is known to the scientist.
(A2) The number of particles decreases by one when the scientist is successful; otherwise it is unchanged from one day to the next.
The objective is to determine a detection policy to maximize the scientist’s expected reward during the detection period.
A8.
See this link. Let V_t(x) be the value function when the time is t and the number of particles is x. Then we obtain
V_t(x) = max{ V_{t+1}(x),  −c + p(x)(1 + V_{t+1}(x − 1)) + (1 − p(x)) V_{t+1}(x) },  with V_{N+1}(x) = 0,
where the first term corresponds to staying home and the second to searching.
V_t(x) is decreasing in t and increasing in x. Thus, at any given time t, the incentive to search becomes stronger as x grows, and one can show that on day t it is optimal to search only when x ≥ x_t*. If ties are broken in favour of searching (stay only if Q_t(x, stay) > Q_t(x, search)), then we obtain x_1* = ··· = x_N*: one shows that x_{N−1}* = x_N* and V_{N−1}(x) = V_N(x) for all x ≤ x_N*, and this inductive hypothesis is preserved as we move backward in time. If instead ties are broken in favour of staying (stay whenever Q_t(x, stay) ≥ Q_t(x, search)), then staying is weakly preferred and we obtain x_1* ≥ ··· ≥ x_N*.
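A small backward-induction sketch (the horizon N = 6, cost c = 0.3, and detection probability p(x) = 1 − 0.8^x are assumed); in this instance the printed threshold x_t* is the same on every day:

    # V_t(x) = max{ V_{t+1}(x), -c + p(x)*(1 + V_{t+1}(x-1)) + (1 - p(x))*V_{t+1}(x) }.
    N, c, xmax = 6, 0.3, 10                       # assumed horizon, search cost, particle-count range
    p = lambda x: 1 - 0.8 ** x                    # assumed non-decreasing detection probability
    V = [0.0] * (xmax + 1)                        # value after the last day is 0
    for t in range(N, 0, -1):
        newV, threshold = V[:], None
        for x in range(1, xmax + 1):
            stay = V[x]
            search = -c + p(x) * (1 + V[x - 1]) + (1 - p(x)) * V[x]
            newV[x] = max(stay, search)
            if threshold is None and search >= stay:
                threshold = x                     # smallest x for which searching is optimal
        V = newV
        print(t, threshold)                       # prints threshold 2 for every day t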
Q9.
An optimal stopping problem is a Markov decision process with two actions: a = 0 meaning "stop," and a = 1 meaning "continue." There are two types of costs: c(x, a = 0) = k(x), the stopping cost, and c(x, a = 1) = c(x), the continuation cost. This defines a stopping problem. Assuming that the time horizon is finite and writing C_n(x) for the optimal cost with n periods remaining, the Bellman equation is
C_n(x) = min{ k(x), c(x) + 𝔼[C_{n−1}(X_1) | X_0 = x] },
with boundary condition C_0(x) = k(x).
A9.
See this link. Assume that the cost functions are increasing in x. Then the value function C_t(x) is increasing in x and decreasing in the remaining time t. Hence the optimal policy is of threshold type: with t periods remaining, there exists a number x_t* such that it is optimal to stop if and only if x ≥ x_t*. Moreover, the thresholds are monotone in t, in the sense that x_t* ≤ x_{t+1}*, so that having more time left makes continuation more attractive and one stops only for larger values of x.
Q10.
(Secretary Problem) Suppose you are presented with n offers in sequential order. After looking at an offer you must decide either to accept it and terminate the process or to reject it. Once rejected, the offer is lost. Suppose at any time you know the relative rank of the present offer compared to the previous ones. Determine the strategy that maximizes the probability of selecting the best offer when all n! orderings of offers are assumed to be equally likely.
A10.
Let V_k(x) be the optimal success probability after observing that the k-th offer has relative rank x among the first k offers (the higher the better), and let Q_k(x, a), a ∈ 𝒜 = {accept, reject}, be the state-action value function. Then we have
Q_k(x, accept) = (k/n)·𝟙{x = k}   (only an offer that is the best so far can be the best overall),
Q_k(x, reject) = 𝔼[V_{k+1}(X_{k+1})] = (1/(k+1)) Σ_{x'=1}^{k+1} V_{k+1}(x'),
V_k(x) = max{ Q_k(x, accept), Q_k(x, reject) }.
Note that at the n-th offer, we already know the exact rank of the offer, so the terminal condition is as follows: V_n(x) = 𝟙{x = n}. From this Bellman equation, we obtain the following deterministic optimal policy: accept the k-th offer if and only if it is the best so far (x = k) and k/n ≥ 𝔼[V_{k+1}(X_{k+1})]. Since k/n is increasing in k while the continuation value is decreasing, this is a cutoff rule: reject the first k* − 1 offers and then accept the first offer that is the best seen so far.
Note that as n → ∞ the success probability converges to 1/e.
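A direct DP computation (n = 100 is an assumed instance) reproduces both the cutoff rule and the 1/e limit:

    import math

    # cont[k] = optimal success probability given that the first k offers have been rejected.
    n = 100
    cont = [0.0] * (n + 1)                        # cont[n] = 0: nothing left to accept
    cutoff = n
    for k in range(n - 1, -1, -1):
        accept = (k + 1) / n                      # P(best overall | the (k+1)-th offer is best so far)
        cont[k] = max(accept, cont[k + 1]) / (k + 1) + k * cont[k + 1] / (k + 1)
        if accept >= cont[k + 1]:
            cutoff = k + 1                        # smallest stage at which accepting can be optimal
    print(cutoff, round(cont[0], 3), round(1 / math.e, 3))   # 38 0.371 0.368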
A10-2.
Let p(m) = m/n be the probability that the m-th offer, given that it is the best among the first m, is the best overall, and let L(m) = 1 − m/n be the probability that, after rejecting the first m offers, the overall best still lies ahead. Define V(m) = max{p(m), L(m)}. As m increases, p(m) increases, while L(m) decreases. Therefore, the optimal strategy is to keep rejecting offers at first, and then, after a certain point, accept the first offer that is the best among all offers observed up to that time.
Q11.
You look for a parking space on a street; each space is free with probability p = 1−q. You can’t tell whether a space is free until you reach it. Once at a space you must decide to stop or continue. From position s (that is, s spaces from your destination), the cost of stopping is s. The cost of passing your destination without parking is D.
A11.
See this link. Let V_s be the expected cost when you are s spaces from the destination and about to inspect that space. Then we have the following Bellman equation:
V_s = p·min(s, V_{s−1}) + q·V_{s−1},  with V_{−1} = D (the cost of passing the destination without parking).
Whether, as in the given problem, the stopping cost decreases as you get closer to the destination or, conversely, increases, the optimal policy is a threshold rule: stop at the first free space with s ≤ s*.
Since we always stop when we are close enough to the destination, in the stopping region the recursion becomes V_s = (1 − q)s + qV_{s−1}, and V_s can then be written in a neat closed form (cf. homogeneous and particular solutions): V_s = s − (q/p) + (qD + q/p) q^s. It is optimal to stop at a free space s if and only if s ≤ V_{s−1}, which determines the threshold s*.
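A short numerical check of the recursion against the closed form (p = 0.3 and D = 20 are assumed numbers); the two agree for all s in the stopping region, and the threshold comes out as s* = 5:

    # V_s = p*min(s, V_{s-1}) + q*V_{s-1}, with V_{-1} = D; stop at a free space s iff s <= V_{s-1}.
    p, D = 0.3, 20.0                              # assumed probability of a free space and penalty
    q = 1 - p
    V_prev, s_star = D, None
    for s in range(0, 9):
        if s <= V_prev:
            s_star = s                            # still inside the stopping region
        V_s = p * min(s, V_prev) + q * V_prev
        closed = s - q / p + (q * D + q / p) * q ** s     # valid while s is in the stopping region
        print(s, round(V_s, 4), round(closed, 4))
        V_prev = V_s
    print("threshold s* =", s_star)               # 5: stop at the first free space with s <= 5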
Q12.
Consider a Markov chain (M.C.) that evolves in discrete time t = 0, 1, 2, … with state space X = S = {1, 2, …}. The transition probabilities are given by the matrix {P(j | i)}, for i, j = 1, 2, …. At each time t = 0, 1, 2, …, the decision-maker can perfectly observe the current state X_t of the Markov chain. Each observation incurs a cost c. At some time t, the decision-maker may choose to stop observing the evolution of the Markov chain. If they stop at time t, they receive a reward r(X_t). In particular, if X_t = i, then r(X_t) = r(i). The reward r(X_t) is terminal in the sense that, once the decision-maker stops at time t, the decision is final and no further costs or rewards are incurred. The decision-maker must make this final stopping decision no later than time T. The objective is to choose a strategy g := (g_0, g_1, …, g_{T−1}) that maximizes J^g = 𝔼^g[−c(τ(g)+1) + r(X_{τ(g)})], where τ(g) denotes the (random) time at which the decision-maker stops under strategy g. For every strategy g, we have 0 ≤ τ(g) ≤ T. Determine the optimal policy.
A12.
The dynamic program can be formulated as follows:
V_T(i) = r(i)
V_t(i) = max{ r(i), −c + Σ_j P(j | i) V_{t+1}(j) },  t = 0, 1, …, T − 1
From this we obtain V_t(i) ≥ V_{t+1}(i). Intuitively, if the reward at time t is large enough, one can stop and make the decision immediately; otherwise, one should wait and aim for the next transition. Since the value function is larger the earlier the time t is, the incentive to wait for a transition is stronger, and hence the stopping set at time t, S_t := {i ∈ X : V_t(i) = r(i)}, is a subset of the stopping set at time t + 1; that is, S_t ⊆ S_{t+1}. Now assume that the stopping set at time T − 1, S_{T−1}, has the closure property, i.e., for all (i, j) such that i ∈ S_{T−1} and j ∉ S_{T−1}, we have P(j | i) = 0. In this case one can show inductively that the stopping sets do not shrink any further, so S_t = S_{t+1} = ··· = S_{T−1} for all t. Therefore, the optimal policy at time t is "stop iff i ∈ S_t", and under the closure property it becomes "stop iff i ∈ S_t = S_{T−1}" (the one-step look-ahead rule).
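A small numerical instance of this dynamic program (the 4-state chain, rewards, and cost c = 0.5 below are assumptions) shows the stopping sets shrinking once and then stabilizing as t decreases:

    import numpy as np

    # V_T(i) = r(i);  V_t(i) = max{ r(i), -c + sum_j P(j|i) * V_{t+1}(j) }.
    P = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.3, 0.4, 0.3, 0.0],
                  [0.0, 0.3, 0.4, 0.3],
                  [0.0, 0.0, 0.5, 0.5]])          # assumed transition matrix, row i = P(. | i)
    r = np.array([1.0, 2.0, 4.0, 8.0])            # assumed terminal rewards
    c, T = 0.5, 6
    V = r.copy()                                  # V_T
    for t in range(T - 1, -1, -1):
        V = np.maximum(r, -c + P @ V)
        S_t = np.where(np.isclose(V, r))[0] + 1   # stopping set at time t (1-indexed states)
        print(t, S_t)                             # S_t ⊆ S_{t+1}; here S_t = {1, 2, 4} for all t < T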
Q13.
An individual wants to sell her house. An offer comes at the beginning of each day. It is assumed that successive offers are independent and an offer is x_j with probability p_j, j = 1, 2, …, n, where x_j, j = 1, 2, …, n, are non-negative scalars. Any offer that is not immediately accepted is not lost but may be accepted at any later date. A maintenance cost c is incurred for each day that the house remains unsold. There is a deadline to sell the house within N days. The objective is to maximize the price at which the house is sold minus the maintenance cost.
A13.
We can use the information state z_t = max{x_1, ···, x_t}, the best offer received so far (offers are not lost). Then we have the following dynamic program:
V_N(z) = z,
V_t(z) = max{ z,  −c + Σ_j p_j V_{t+1}(max(z, x_j)) },  t = 1, …, N − 1,
where the first term corresponds to selling at the best offer so far and the second to waiting one more day.
In the dynamic program one can show that V_t(z) ≥ V_{t+1}(z), that V_t(z) is increasing in z, and that V_t(z) is convex in z. The optimal policy is: sell iff z ≥ r_t. If ties are broken in favour of selling (remain only if Q_t(z, remain) > Q_t(z, sell)), we obtain r_1 = ··· = r_N, because the equality can be shown for one step and the same argument applies at every step by induction. If ties are broken in favour of remaining (remain whenever Q_t(z, remain) ≥ Q_t(z, sell)), then remaining is weakly preferred and we obtain r_1 ≥ ··· ≥ r_N.
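A small instance of this dynamic program (offers uniform on {1, …, 5}, c = 0.3, and N = 5 are assumed) computes the daily reservation price r_t, which here is the same on every day:

    # V_N(z) = z;  V_t(z) = max{ z, -c + sum_j p_j * V_{t+1}(max(z, x_j)) }.
    offers = [1, 2, 3, 4, 5]                      # assumed offer values, equally likely
    prob = 1 / len(offers)
    c, N = 0.3, 5
    V = {z: float(z) for z in offers}             # V_N(z) = z (must sell at the deadline)
    for t in range(N - 1, 0, -1):
        Q_remain = {z: -c + sum(prob * V[max(z, x)] for x in offers) for z in offers}
        r_t = min(z for z in offers if z >= Q_remain[z])   # smallest best-offer-so-far worth accepting
        V = {z: max(z, Q_remain[z]) for z in offers}
        print(t, r_t)                             # prints r_t = 4 for every day t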
Q14.
An object is located in one of n possible locations. The probability that the object is in location i is p_i (p_i is the prior probability, p_i > 0, Σ p_i = 1). A search in location i costs c_i, c_i > 0, and if the object is present in that location the probability that it will be discovered is α_i, i = 1, 2, …, n. Determine a policy that discovers the object at minimal cost.
Hint: Show that if (π_1(t), π_2(t), …, π_n(t)) is the information state at time t, t = 1, 2, …, then an optimal policy is to search the location that has the maximal value of α_i π_i(t) / c_i, i = 1, 2, …, n.
A14.
The information state is the posterior probability that the object is in each location given the search history: π_i(t) = ℙ(object is in location i | the first t − 1 searches were unsuccessful). After an unsuccessful search of location j, Bayes' rule gives π_j(t+1) = π_j(t)(1 − α_j) / (1 − α_j π_j(t)) and π_i(t+1) = π_i(t) / (1 − α_j π_j(t)) for i ≠ j. With this information state, the optimal policy is the index rule stated in the hint: always search the location with the maximal α_i π_i(t) / c_i (this can be established by an interchange argument).
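A minimal sketch of the index rule with the Bayes update (the three-location priors, detection probabilities, and costs below are assumed numbers):

    # Search the location maximizing alpha_i * pi_i / c_i; after an unsuccessful search of
    # location j, update the belief by Bayes' rule.
    pi = [0.5, 0.3, 0.2]                          # assumed prior probabilities p_i
    alpha = [0.2, 0.7, 0.9]                       # assumed detection probabilities
    cost = [1.0, 2.0, 1.5]                        # assumed search costs c_i
    for step in range(6):
        index = [alpha[i] * pi[i] / cost[i] for i in range(3)]
        j = index.index(max(index))               # location to search next
        print(step, "search location", j + 1, [round(v, 3) for v in index])
        miss = 1 - alpha[j] * pi[j]               # probability the search is unsuccessful
        pi = [p * ((1 - alpha[j]) if i == j else 1) / miss for i, p in enumerate(pi)]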
Q15.
An individual is offered 3 to 1 odds in a coin tossing game where she wins whenever a tail occurs. However, she suspects that the coin is biased and has an a priori probability distribution with CDF F(p) and pdf f(p), for the probability p that a head occurs at each toss. A maximum of T coin tosses is allowed. The individual’s objective is to determine a policy of deciding whether to continue or stop participating in the game, given the outcomes of the game so far, so as to maximize her earnings.
A15.
The information state is the posterior distribution of the head probability p given the outcomes observed so far; since the tosses are conditionally i.i.d., the numbers of heads and tails seen so far form a sufficient statistic.
For the general information state (the posterior density over p), the Bellman equation compares stopping with paying for one more toss and then updating the posterior with the observed outcome.
For the specific information state (h heads and k − h tails observed after k tosses), the posterior density is proportional to p^h (1 − p)^{k−h} f(p), so the pair (h, k) summarizes everything needed for the decision.
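A minimal sketch of the information-state update, tracking the posterior over p on a discrete grid (the Beta(2, 2) prior and the grid are assumptions for illustration); the posterior mean can then be plugged into the Bellman equation when deciding whether to keep playing:

    import numpy as np

    grid = np.linspace(0.001, 0.999, 999)         # discretized values of the head probability p
    posterior = grid * (1 - grid)                 # assumed prior f(p) ∝ p(1 - p), i.e. Beta(2, 2)
    posterior /= posterior.sum()

    def update(posterior, heads):
        """Bayes update of the information state after one toss."""
        lik = grid if heads else (1 - grid)
        new = posterior * lik
        return new / new.sum()

    for outcome in [True, True, False]:           # observe head, head, tail
        posterior = update(posterior, outcome)
    print(round(float((grid * posterior).sum()), 3))   # ~0.571, the Beta(4, 3) posterior mean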
Q16.
(Sequential binary hypothesis testing) Consider a binary state H ∈ {0, 1} with prior probability ℙ(H = 0) = p. At each time t = 0, 1, ···, T, the decision maker chooses one of the following actions: (1) stop and declare a decision u ∈ {0, 1}, or (2) continue (incurring a cost C) and obtain an observation Y_t. If the declaration is incorrect, a penalty K is incurred.
a. Write down the terminal value function V_T(π) at time t = T.
b. Write the Bellman equation for V_t(π) for 0 ≤ t ≤ T − 1.
c. Let f_0(y) and f_1(y) denote the probability mass functions of Y under H = 0 and H = 1, respectively. Derive the belief update rule π_{t+1} = 𝒯(π_t, y) after observing y.
d. Express the conditional expected cost of choosing "continue," 𝔼[V_{t+1}(π_{t+1}) | π_t = π], in integral form.
e. Show by mathematical induction that for all t, the function V_t(π) is concave in π.
f. Assuming V_t(π) is concave, show that the optimal policy is characterized by two threshold values ρ_t^{1,*} ≤ ρ_t^{0,*}. Write down the equations that determine these threshold values.
g. (Infinite-horizon fixed point) Let T → ∞. Write the fixed-point equation satisfied by the time-invariant value function V(π).
h. Suppose T = 1 (i.e., the current time is t = 0 and there is only one more opportunity to observe), K = 1, C = 0.1, and the observation takes discrete values y ∈ {a, b} with likelihoods f_0(a) = 0.7, f_1(a) = 0.3, f_0(b) = 0.3, and f_1(b) = 0.7. Given the initial belief π = ℙ(H = 0) = 0.45, determine whether it is optimal to continue or to stop.
A16.
a. V_T(π) = min( (1 − π)K, πK )
b. V_t(π) = min( (1 − π)K, πK, C + 𝔼[V_{t+1}(π_{t+1}) | π_t = π] )
c. π_{t+1} = f_0(y) π_t / ( f_0(y) π_t + f_1(y)(1 − π_t) ) = 𝒯(π_t, y)
d. 𝔼[V_{t+1}(π_{t+1}) | π_t = π] = ∫ V_{t+1}(𝒯(π, y)) p(y | π) dy = ∫ V_{t+1}(𝒯(π, y)) ( π f_0(y) + (1 − π) f_1(y) ) dy
e. The base case V_T(π) = min((1 − π)K, πK) is a minimum of linear functions of π, hence concave. If V_{t+1}(π) is concave (induction hypothesis), we can write it as inf_{i∈ℐ} {α_i π + β_i}, so that
𝔼[V_{t+1}(π_{t+1}) | π] = ∫ inf_i {α_i 𝒯(π, y) + β_i} p(y | π) dy = ∫ inf_i { α_i f_0(y) π + β_i ( f_0(y) π + f_1(y)(1 − π) ) } dy,
where we used 𝒯(π, y) p(y | π) = f_0(y) π. For each y the integrand is an infimum of affine functions of π, hence concave, and an integral of concave functions is concave. Taking the minimum with the linear stopping costs preserves concavity, so V_t(π) is concave.
f. Since the stopping costs πK and (1 − π)K are linear in π and the continuation cost g_t(π) := C + 𝔼[V_{t+1}(π_{t+1}) | π_t = π] is concave with g_t(0) ≥ C > 0 and g_t(1) ≥ C > 0, the region where continuing is optimal, {π : g_t(π) < min(πK, (1 − π)K)}, is an interval (ρ_t^{1,*}, ρ_t^{0,*}). The optimal policy is therefore: declare u = 1 if π ≤ ρ_t^{1,*}, declare u = 0 if π ≥ ρ_t^{0,*}, and continue otherwise, where the thresholds solve ρ_t^{1,*} K = g_t(ρ_t^{1,*}) and (1 − ρ_t^{0,*}) K = g_t(ρ_t^{0,*}) (if the continuation region is empty, one simply stops and compares the two stopping costs).
g. As T → ∞ the value function becomes time-invariant and satisfies the fixed-point equation
V(π) = min( (1 − π)K, πK, C + ∫ V(𝒯(π, y)) ( π f_0(y) + (1 − π) f_1(y) ) dy ).
The threshold values then also converge to time-invariant constants ρ^{1,*} ≤ ρ^{0,*}.
h. Stopping immediately costs min((1 − π)K, πK) = min(0.55, 0.45) = 0.45 (declare u = 1). Continuing costs C plus the expected terminal cost: ℙ(Y = a) = 0.48 with posterior 𝒯(0.45, a) = 0.65625, and ℙ(Y = b) = 0.52 with posterior 𝒯(0.45, b) ≈ 0.2596, giving 0.1 + 0.48·0.34375 + 0.52·0.2596 = 0.1 + 0.3 = 0.4. Since 0.4 < 0.45, it is optimal to continue.
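The arithmetic in part (h) can be checked with a few lines (only the numbers given in the problem are used):

    # One-step check: stop now vs pay C, observe Y once, and then stop.
    K, C, pi = 1.0, 0.1, 0.45
    f0 = {'a': 0.7, 'b': 0.3}
    f1 = {'a': 0.3, 'b': 0.7}
    stop_now = min((1 - pi) * K, pi * K)                    # 0.45 (declare u = 1)
    cont = C
    for y in ('a', 'b'):
        p_y = pi * f0[y] + (1 - pi) * f1[y]                 # P(Y = y)
        post = pi * f0[y] / p_y                             # updated belief T(pi, y)
        cont += p_y * min((1 - post) * K, post * K)         # terminal cost at the new belief
    print(stop_now, round(cont, 4))                         # 0.45 vs 0.4: continuing is optimal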
Input: 2025.11.21 00:36