Chapter 11. Reinforcement Learning (RL)
Recommended Reading: 【Algorithm】 Algorithm Index
1. Overview
2. Markov Chain
3. Markov Decision Process (MDP)
4. Markov Chain Monte Carlo
1. Overview
⑴ Definition
① Supervised Learning
○ Data: (x, y) (where x is a feature, y is a label)
○ Goal: Compute the mapping function x → y
② Unsupervised Learning
○ Data: x (where x is a feature and there is no label)
○ Goal: Learn the underlying structure of x
③ Reinforcement Learning
○ Data: (s, a, r, s’) (where s is state, a is action, r is reward, s’ is the next state)
○ Goal: Maximize the total reward over multiple time steps
⑵ Characteristics: Different from supervised learning and unsupervised learning
① Implicitly receives correct answers: Provided in the form of rewards
② Needs to consider interaction with the environment: Delayed feedback can be an issue
③ Previous decisions influence future interactions
④ Actively gathers information: Reinforcement learning includes the process of obtaining data
⑶ Element 1. Reward
⑷ Element 2. Policy
① Definition: An agent’s behavior, mapping state input to action output
② Type 1. Deterministic Policy
③ Type 2. Stochastic Policy
○ Reason 1. To explore since the optimal action is unknown during learning
○ Reason 2. The optimal behavior itself may be stochastic (e.g., in rock-paper-scissors, a deterministic strategy can be exploited by the opponent)
⑸ Element 3. Value Function
① Definition: The expected value of future rewards, expressed as a lifetime value (LTV)
② Formula: Given state s, policy π, and discount factor γ that discounts future rewards to present value: Vπ(s) = E[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s, π ]
③ Provides a basis for evaluating whether a state is good or bad and choosing actions accordingly
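○ Python sketch: a minimal way to estimate Vπ(s) is to average discounted returns over sampled episodes. The reward sequences and γ = 0.9 below are hypothetical, purely for illustration.
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_value(episodes, gamma=0.9):
    """Average discounted return over episodes started from the same state s."""
    return np.mean([discounted_return(rs, gamma) for rs in episodes])

# Hypothetical reward sequences from three episodes starting at state s
episodes = [[1, 0, 0, 1], [0, 1, 1], [1, 1, 0, 0, 1]]
print(estimate_value(episodes, gamma=0.9))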
⑹ Element 4. Model
① Definition: The behavior of the environment
② Given a state and action, the model determines the next state and reward
③ Note that there are model-free and model-based methods
2. Markov Chain
⑴ Definition: A system where the future state depends only on the current state and not on past states
① Strongly Connected (= Irreducible): Every node i in the graph can reach every other node j
② Period: The greatest common divisor of the lengths of all walks returning to node i
○ Example: If two nodes A and B are connected only by the edges A → B and B → A, the period of each node is 2
③ Aperiodic: When all nodes have a period of 1
○ Aperiodic ⊂ Irreducible
○ Example: If every node has a self-loop (a length-1 walk back to itself), the chain is aperiodic
④ Stationary State: If Pr(xn | xn-1) is independent of n, the Markov process is stationary (time-invariant)
⑤ Regular
○ Regular ⊂ Irreducible
○ A chain is regular if there exists a natural number k such that every element of the k-th power M^k of the transition matrix is strictly positive
⑵ Two-State Markov Chain: Closely related to Graph Theory
Figure 1. Two-State Markov Chain
① M: Transformation causing state transition in one step
② M^n: Transformation causing state transition in n steps
③ Steady-State Vector: A vector q satisfying Mq = q, i.e., an eigenvector with eigenvalue 1
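○ Python sketch: the steady-state vector can be computed as the eigenvector of M with eigenvalue 1. The two-state, column-stochastic matrix M below is hypothetical (columns sum to 1, matching the Mq = q convention above).
import numpy as np

# Hypothetical two-state transition matrix (column-stochastic: columns sum to 1)
M = np.array([[0.9, 0.5],
              [0.1, 0.5]])

# Eigenvector of M associated with eigenvalue 1 gives the steady-state vector q
eigvals, eigvecs = np.linalg.eig(M)
q = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
q = q / q.sum()          # normalize so the entries sum to 1

print(q)                 # steady-state distribution
print(M @ q)             # Mq = q (up to numerical error)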
⑶ HMM (Hidden Markov Model)
① If χ = {Xi} is a Markov process and Yi = ϕ(Xi) (where ϕ is a deterministic function), then y = {Yi} is a Hidden Markov Model.
② Baum-Welch Algorithm
○ Purpose: Learning HMM parameters
○ Input: Observed data
○ Output: State transition probabilities and emission probabilities of HMM
○ Principle: A type of EM (Expectation Maximization) algorithm
○ Formula: with the expected counts below (computed by the forward-backward procedure), the parameters are re-estimated as akl = Akl / Σl′ Akl′ and ek(b) = Ek(b) / Σb′ Ek(b′)
○ Akl: Expected number of transitions from state k to state l
○ Ek(b): Expected number of emissions of observation b from state k
○ Bk: Initial probability of state k
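○ Python sketch: one Baum-Welch (EM) iteration for a discrete HMM, assuming integer-coded states and symbols. The matrices A, E, B and the observation sequence are hypothetical, with A row-stochastic so that akl = P(l | k), matching the Viterbi notation below.
import numpy as np

def baum_welch_step(obs, A, E, B):
    K, T = A.shape[0], len(obs)
    # E-step: forward (alpha) and backward (beta) probabilities
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = B * E[:, obs[0]]
    for t in range(1, T):
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ A)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (E[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[T - 1].sum()
    # Expected counts Akl (transitions) and Ek(b) (emissions)
    A_count = np.zeros_like(A)
    for t in range(T - 1):
        A_count += (alpha[t][:, None] * A *
                    (E[:, obs[t + 1]] * beta[t + 1])[None, :]) / likelihood
    gamma = alpha * beta / likelihood        # P(state = k at time t | observations)
    E_count = np.zeros_like(E)
    for t in range(T):
        E_count[:, obs[t]] += gamma[t]
    # M-step: akl = Akl / sum_l' Akl', ek(b) = Ek(b) / sum_b' Ek(b')
    A_new = A_count / A_count.sum(axis=1, keepdims=True)
    E_new = E_count / E_count.sum(axis=1, keepdims=True)
    B_new = gamma[0]
    return A_new, E_new, B_new

# Hypothetical 2-state, 2-symbol HMM and observation sequence (values are illustrative)
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # akl = P(l | k), rows sum to 1
E = np.array([[0.9, 0.1], [0.2, 0.8]])     # ek(b), rows sum to 1
B = np.array([0.5, 0.5])                   # initial probabilities
obs = [0, 1, 1, 0, 0]
print(baum_welch_step(obs, A, E, B))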
③ Viterbi Algorithm (ref)
○ Purpose: Find the most likely hidden state sequence given an HMM
○ Input: HMM parameters and observed data
○ N: Number of possible hidden states
○ T: Length of observed data
○ A: State transition probability, akl = probability of transitioning from state k to state l
○ E: Emission probability, ek(x) = probability of observing x in state k
○ B: Initial state probability
○ Output: Most probable state sequence
○ Principle: Uses dynamic programming to compute the optimal path
○ Step 1. Initialization
○ bk: Initial probability of state k, P(s0 = k)
○ ek(σ): Probability of observing the first observation σ in state k, P(x0 | s0 = k)
○ Step 2. Recursion
○ Compute maximum probability from the previous state at each time step i = 1, …, T
○ Compute backpointer (ptr) storing the most probable previous state
○ ptri(l) serves to store the previous state k that has the highest probability of transitioning to the current state l.
○ Step 3. Termination
○ Select the highest probability at the final time step
○ Determine the last state of the optimal sequence
○ vk(i - 1): Optimal probability at previous time step i - 1 in state k
○ akl: Probability of transitioning from state k to l
○ Step 4. Traceback
○ Trace back through ptr array from i = T, …, 1 to recover the optimal path
○ Example
Figure 2. Example of Viterbi Algorithm
○ Python Code
import numpy as np

class HMM(object):
    def __init__(self, alphabet, hidden_states, A=None, E=None, B=None):
        # A: transition probabilities, E: emission probabilities, B: initial probabilities
        self._alphabet = set(alphabet)
        self._hidden_states = set(hidden_states)
        self._transitions = A
        self._emissions = E
        self._initial = B

    def _emit(self, cur_state, symbol):
        return self._emissions[cur_state][symbol]

    def _transition(self, cur_state, next_state):
        return self._transitions[cur_state][next_state]

    def _init(self, cur_state):
        return self._initial[cur_state]

    def _states(self):
        for k in self._hidden_states:
            yield k

    def draw(self, filename='hmm'):
        # Visualization helper; 'graphviz' is an external drawing utility assumed to be
        # defined elsewhere (it is not part of this snippet).
        nodes = list(self._hidden_states) + ['β']

        def get_children(node):
            return self._initial.keys() if node == 'β' else self._transitions[node].keys()

        def get_edge_label(pred, succ):
            return (self._initial if pred == 'β' else self._transitions[pred])[succ]

        def get_node_shape(node):
            return 'circle' if node == 'β' else 'box'

        def get_node_label(node):
            if node == 'β':
                return 'β'
            else:
                return r'\n'.join([node, ''] + [
                    f"{e}: {p}" for e, p in self._emissions[node].items()
                ])

        graphviz(nodes, get_children, filename=filename,
                 get_edge_label=get_edge_label,
                 get_node_label=get_node_label,
                 get_node_shape=get_node_shape,
                 rankdir='LR')

    def viterbi(self, sequence):
        # Dynamic programming over log-probabilities to avoid numerical underflow
        trellis = {}
        traceback = []
        # Step 1. Initialization: v_k(0) = log b_k + log e_k(x_0)
        for state in self._states():
            trellis[state] = np.log10(self._init(state)) + np.log10(self._emit(state, sequence[0]))
        # Step 2. Recursion: v_l(i) = log e_l(x_i) + max_k (v_k(i-1) + log a_kl)
        for t in range(1, len(sequence)):
            trellis_next = {}
            traceback_next = {}
            for next_state in self._states():
                k = {}
                for cur_state in self._states():
                    k[cur_state] = trellis[cur_state] + np.log10(self._transition(cur_state, next_state))
                argmaxk = max(k, key=k.get)
                trellis_next[next_state] = np.log10(self._emit(next_state, sequence[t])) + k[argmaxk]
                traceback_next[next_state] = argmaxk   # ptr: most probable previous state
            trellis = trellis_next
            traceback.append(traceback_next)
        # Step 3. Termination: pick the best final state
        max_final_state = max(trellis, key=trellis.get)
        max_final_prob = trellis[max_final_state]      # log-probability of the best path
        # Step 4. Traceback: follow the stored pointers backwards
        result = [max_final_state]
        for t in reversed(range(len(sequence) - 1)):
            result.append(traceback[t][max_final_state])
            max_final_state = traceback[t][max_final_state]
        return result[::-1]
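○ Usage sketch: a hypothetical two-state 'Fair'/'Biased' coin model; all probabilities below are made up for illustration, not taken from the text.
# Hypothetical coin HMM: mostly stays in the same state, biased coin favors heads
A = {'Fair':   {'Fair': 0.9, 'Biased': 0.1},
     'Biased': {'Fair': 0.1, 'Biased': 0.9}}
E = {'Fair':   {'H': 0.5, 'T': 0.5},
     'Biased': {'H': 0.9, 'T': 0.1}}
B = {'Fair': 0.5, 'Biased': 0.5}

hmm = HMM(alphabet='HT', hidden_states=['Fair', 'Biased'], A=A, E=E, B=B)
print(hmm.viterbi('HTHHHHHHTH'))   # most probable hidden state sequence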
④ Type 1. PSSM (Position-Specific Scoring Matrix): A simpler model than a full HMM
⑤ Type 2. Profile HMM: Advantageous over PSSMs in the following respects:
○ Diagram of profile HMM
Figure 3. Diagram of profile HMM
○ M, I, and D represent match, insertion, and deletion, respectively.
○ Mi can transition to Mi+1, Ii, and Di+1.
○ Ii can transition to Mi+1, Ii, and Di+1.
○ Di can transition to Mi+1, Ii, and Di+1.
○ Advantage 1. The ability to model insertions and deletions
○ Advantage 2. Transitions are restricted to valid state traversals.
○ Advantage 3. Boundaries between states are better defined.
⑷ Property 1. Perron-Frobenius Theorem
① Theorem 1. If a Markov chain with transition matrix P is strongly connected, then a unique stationary distribution q exists
○ Stationary distribution satisfies Pq = q
② Theorem 2. If a Markov chain with transition matrix P is strongly connected and aperiodic,
○ Pij: Probability of transitioning from node j to node i. ∑i Pij = 1
○ 2-1. The i-th row, j-th column element Pij(k) of P^k converges to qi as k → ∞
○ 2-2. Regardless of the initial state x0, the k-th state xk converges to q as k → ∞
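○ Python sketch: a quick numerical check of 2-1 and 2-2 using a hypothetical column-stochastic P (values are illustrative only).
import numpy as np

# Hypothetical column-stochastic transition matrix (columns sum to 1)
P = np.array([[0.9, 0.5],
              [0.1, 0.5]])

Pk = np.linalg.matrix_power(P, 50)
print(Pk)                       # every column approaches the stationary q (2-1)

x0 = np.array([1.0, 0.0])       # arbitrary initial state distribution
print(Pk @ x0)                  # x_k approaches q regardless of x0 (2-2)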
⑸Property 2. The second law of thermodynamics (entropy increase) can be proven using Markov processes
① Because it can simulate the diffusion law: Assuming uniform stationary distribution
② Related Concept: Random Walk
3. Markov Decision Process (MDP)
⑴ Overview
① Definition: A decision process with Markovian transitions: the state at time t+1 is determined only by the state and action at time t
② Diagram
Figure 4. Agent-Environment Interaction in MDP
○ State: st ∈ S
○ Action: at ∈ A
○ Reward: rt ∈ R(st, at)
○ Policy: at ~ π(· | st)
○ Transition: (st+1, rt+1) ~ P(· | st, at)
③ Many real-world problems can be formulated as Markov decision processes
④ Existence of optimal solution V(s)
○ Premise 1. Markov property
○ Premise 2. Stationary assumption
○ Premise 3. No distributional shift
⑵ Type 1. Q-learning
① Definition: Learning the action-value function Q(s, a) for all state-action pairs, without a model of the environment
○ Transition probability P(s′ | s, a) is unknown
○ Reward function R(s, a) is unknown
② Method 1.
○ Initialize estimates for all state/action pairs: Q̂(s, a) = 0
○ Initial iteration scheme
○ Take a random action a
○ Observe the next state s’ and receive R(s, a) from the environment
○ Update Q̂(s, a) ← (1 − α) Q̂(s, a) + α (R(s, a) + γ maxa′ Q̂(s′, a′))
○ Later iteration scheme
○ Take a = arg maxa Q̂(s, a)
○ Observe the next state s’ and receive R(s, a)
○ Update Q̂(s, a)
○ Random restart
③ Method 2. ε-greedy (epsilon-greedy)
○ Initialize estimates for all state/action pairs: Q̂(s, a) = 0
○ Iteration scheme
○ With probability 1 - ε, take a = arg maxa Q̂(s, a), and with probability ε, take a random action
○ Observe the next state s’ and receive R(s, a)
○ Update Q̂(s, a)
○ Perform a random restart in later iterations
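○ Python sketch: a minimal tabular version of the ε-greedy scheme above. The environment interface (env.reset(), env.step(a) returning next state, reward, and a done flag) and all hyperparameters are hypothetical placeholders, not from the text.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    # Q-hat(s, a) initialized to 0 for all state/action pairs
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability eps, otherwise act greedily
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)                      # observe s' and R(s, a)
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            # Q-hat(s, a) <- (1 - alpha) Q-hat(s, a) + alpha (r + gamma * max_a' Q-hat(s', a'))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q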
④ DQN (Deep Q-Network)
○ Definition: Implementation of Q-learning using deep learning
○ Applied to many games: DQN on Atari 2600 (2013), AlphaGo (2016), etc.
⑶ Type 2. Policy Gradient
① Definition: Learning policy π(s) directly
② Characteristics
○ In Q-learning, states can be continuous, but actions must be discrete
○ In policy gradient, both states and actions can be continuous
③ Algorithm
○ Step 1. Receive frame
○ Step 2. Forward propagate to get P(action)
○ Step 3. Sample a from P(action)
○ Step 4. Play the rest of the game
○ Step 5. If the game is won, update the parameters θ in the +∇θ direction (making the sampled actions more likely)
○ Step 6. If the game is lost, update the parameters θ in the −∇θ direction (making the sampled actions less likely)
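○ Python sketch: a bare-bones REINFORCE-style version of the steps above with a linear softmax policy. The environment interface (env.reset(), env.step(a), env.won()) and the feature/parameter shapes are hypothetical assumptions for illustration.
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def run_episode(env, theta):
    """Steps 1-4: play one game, sampling actions from P(action) = softmax(theta^T x)."""
    states, actions = [], []
    x, done = env.reset(), False          # x: feature vector of the current frame
    while not done:
        p = softmax(theta.T @ x)          # forward pass -> action probabilities
        a = np.random.choice(len(p), p=p) # sample an action
        states.append(x)
        actions.append(a)
        x, done = env.step(a)
    return states, actions, env.won()     # hypothetical win/lose outcome

def reinforce_update(theta, states, actions, won, lr=0.01):
    """Steps 5-6: move theta along +grad if the game was won, -grad if lost."""
    sign = 1.0 if won else -1.0
    for x, a in zip(states, actions):
        p = softmax(theta.T @ x)
        grad = np.outer(x, -p)            # d log pi(a | x) / d theta for a softmax policy
        grad[:, a] += x
        theta += lr * sign * grad
    return theta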
⑷ Other Types
① Deep Q-Network (DQN)
② A3C (Asynchronous Advantage Actor-Critic)
③ Genetic Algorithm
④ SARSA (State-Action-Reward-State-Action)
4. Markov Chain Monte Carlo
⑴ Definition: A method for generating samples from a Markov chain following a complex probability distribution
⑵ Type 1. Metropolis-Hastings
① Generate a candidate sample from the current state → Accept or reject the candidate with the acceptance probability → If accepted, move to the candidate state; if rejected, stay at the current state
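○ Python sketch: a minimal Metropolis-Hastings sampler for a one-dimensional target. The target density (an unnormalized standard normal) and the Gaussian random-walk proposal are hypothetical choices for illustration.
import numpy as np

def metropolis_hastings(log_target, n_samples=10000, x0=0.0, step=1.0):
    samples = [x0]
    x = x0
    for _ in range(n_samples):
        x_new = x + np.random.normal(scale=step)        # symmetric random-walk proposal
        # accept with probability min(1, p(x_new) / p(x)) for a symmetric proposal
        if np.log(np.random.rand()) < log_target(x_new) - log_target(x):
            x = x_new                                    # accept: move to the candidate
        samples.append(x)                                # otherwise: stay at the current state
    return np.array(samples)

# Unnormalized standard normal as a toy target
samples = metropolis_hastings(lambda x: -0.5 * x ** 2)
print(samples.mean(), samples.std())                     # roughly 0 and 1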
⑶ Type 2. Gibbs Sampling
⑷ Type 3. Importance/Rejection Sampling
⑸ Type 4. Reversible Jump MCMC
① General MCMC methods like Type 1 and Type 2 sample from a probability distribution in a fixed-dimensional parameter space
② Reversible Jump MCMC operates in a variable-dimensional parameter space: The number of parameters dynamically changes during sampling
Posted: 2021.12.13 15:20
Updated: 2024.10.08 22:43