Chapter 11. Reinforcement Learning
Recommended Reading: 【Algorithm】 Algorithm Index
1. Overview
2. Markov Chain
a. Reinforcement Learning Problem Sets
I wrote this based on my experience taking UMich ECE567.
1. Overview
⑴ Definition
① Supervised Learning
○ Data: (x, y) (where x is a feature, y is a label)
○ Goal: Compute the mapping function x → y
② Unsupervised Learning
○ Data: x (where x is a feature and there is no label)
○ Goal: Learn the underlying structure of x
③ Reinforcement Learning
○ Data: (s, a, r, s’) (where s is state, a is action, r is reward, s’ is the next state)
○ Goal: Maximize the total reward over multiple time steps
⑵ Elements
① Element 1. State
② Element 2. Reward
○ Definition: A scalar feedback received as a result of a state transition.
○ Value function: The expected value of future rewards, expressed as a lifetime value (VLT).
○ Formulation: Defined for a state s, a policy π, and a discount factor γ that discounts future value to present value.
○ Provides the basis for choosing an action by evaluating whether a state is good or bad.
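The formulation above can be written out explicitly; the standard textbook form of the state-value function (supplied here for completeness, not taken from the original figure) is:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right], \qquad 0 \le \gamma < 1
```

The discount factor γ converts future rewards into present value, so the infinite sum converges whenever rewards are bounded.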
③ Element 3. Action
④ Element 4. Policy
○ Definition: The agent’s behavior; a mapping that takes a state as input and outputs an action.
○ Decision process: A general framework for decision-making problems in which states, actions, and rewards unfold over a process.
○ Type 1: Deterministic policy
○ Type 2: Stochastic policy
○ Reason 1: In learning, the optimal behavior is unknown, so exploration is needed.
○ Reason 2: The optimal situation itself may be stochastic (e.g., rock–paper–scissors, or when an opponent exploits determinism).
⑤ Element 5. Model
○ Definition: The behavior/dynamics of the environment.
○ Given a state and an action, the model determines the next state and the reward.
○ Note the distinction between model-free and model-based methods.
⑶ Characteristics: How it differs from supervised and unsupervised learning
① Implicitly receives correct answers: Provided in the form of rewards
② Needs to consider interaction with the environment: Delayed feedback can be an issue
③ Previous decisions influence future interactions
④ Actively gathers information: Reinforcement learning includes the process of obtaining data
2. Markov Chain
⑴ Overview
① Definition: A system where the future state depends only on the current state and not on past states
② A Markov chain refers to a Markov process whose state space is finite or countably infinite.
③ Lemma 1. Chapman-Kolmogorov decomposition
④ Lemma 2. Linear state-space model
① Strongly Connected (= Irreducible): A chain in which any node i in the graph can reach any other node j
② Period: The greatest common divisor of the lengths of all walks returning to node i
○ Example: If two nodes A and B are connected as A ⇄ B by two directed edges, the period of each node is 2
③ Aperiodic: When all nodes have a period of 1
○ Aperiodic ⊂ Irreducible
○ Example: If each node has a walk returning to itself, it is aperiodic
④ Stationary State: If $Pr(x_n \mid x_{n-1})$ is independent of $n$, the Markov process is stationary (time-invariant)
⑤ Regular
○ Regular ⊂ Irreducible
○ A chain is regular if there exists a natural number k such that all elements of the k-th power of the transition matrix, M^k, are positive (i.e., nonzero)
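The regularity condition above can be checked numerically; here is a minimal sketch (the function name, the max_power cutoff, and the example matrices are my own, for illustration):

```python
import numpy as np

def is_regular(M, max_power=100):
    """Check whether a stochastic matrix M is regular, i.e., whether
    some power M^k has all strictly positive entries."""
    P = np.eye(M.shape[0])
    for _ in range(max_power):
        P = P @ M
        if np.all(P > 0):
            return True
    return False

# A periodic two-state chain (A <-> B) is irreducible but not regular:
periodic = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
# Adding a self-loop makes the chain aperiodic, hence regular:
regular = np.array([[0.5, 1.0],
                    [0.5, 0.0]])
print(is_regular(periodic), is_regular(regular))  # prints: False True
```

The periodic chain alternates between the identity and the swap matrix, so no power is all-positive; the second chain already has all-positive entries at M².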
⑥ Lemma 1. Perron-Frobenius theorem
⑦ Lemma 2. Lyapunov equation
⑧ Lemma 3. Bellman equation
⑵ Type 1. Two-State Markov Chain
Figure 1. Two-State Markov Chain
① M: Transformation causing state transition in one step
② Mⁿ: Transformation causing state transition in n steps
③ Steady-State Vector: A vector q satisfying Mq = q, i.e., an eigenvector with eigenvalue 1
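The steady-state definition above can be checked with a small NumPy sketch (the transition probabilities are made-up numbers): q is found as the eigenvector of M for eigenvalue 1, normalized to sum to 1.

```python
import numpy as np

# Two-state chain with a column-stochastic transition matrix M (illustrative values):
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# Steady-state vector: eigenvector of M for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(M)
q = eigvecs[:, np.argmin(np.abs(eigvals - 1.0))].real
q = q / q.sum()

print(q)       # satisfies M q = q
print(M @ q)   # same vector again
```

For this M the stationary distribution is q = (2/3, 1/3): state 1 leaks probability 0.1 per step while state 2 leaks 0.2, so the chain spends twice as much time in state 1.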
⑶ Type 2. HMM (Hidden Markov Model)
① If χ = {Xi} is a Markov process and Yi = ϕ(Xi) (where ϕ is a deterministic function), then y = {Yi} is a hidden Markov model.
② Baum-Welch Algorithm
○ Purpose: Learning HMM parameters
○ Input: Observed data
○ Output: State transition probabilities and emission probabilities of HMM
○ Principle: A type of EM (Expectation Maximization) algorithm
○ Formula
○ Akl: Number of transitions from state k to l
○ Ek(b): Number of emissions of observation b from state k
○ Bk: Initial probability for state k
③ Viterbi Algorithm
○ Purpose: Find the most likely hidden state sequence given an HMM
○ Input: HMM parameters and observed data
○ N: Number of possible hidden states
○ T: Length of observed data
○ A: State transition probability, akl = probability of transitioning from state k to state l
○ E: Emission probability, ek(x) = probability of observing x in state k
○ B: Initial state probability
○ Output: Most probable state sequence
○ Principle: Uses dynamic programming to compute the optimal path
○ Step 1. Initialization
○ bk: Initial probability of state $k$, $P(s_0 = k)$
○ $e_k(σ)$: Probability of observing the first observation σ in state $k$, $P(x_0 \mid s_0 = k)$
○ Step 2. Recursion
○ Compute maximum probability from the previous state at each time step i = 1, …, T
○ Compute backpointer (ptr) storing the most probable previous state
○ ptri(l) serves to store the previous state k that has the highest probability of transitioning to the current state l.
○ Step 3. Termination
○ Select the highest probability at the final time step
○ Determine the last state of the optimal sequence
○ vk(i - 1): Optimal probability at previous time step i - 1 in state k
○ akl: Probability of transitioning from state k to l
○ Step 4. Traceback
○ Trace back through ptr array from i = T, …, 1 to recover the optimal path
○ Example

Figure 2. Example of Viterbi Algorithm
○ Python Code
import numpy as np

class HMM(object):
    def __init__(self, alphabet, hidden_states, A=None, E=None, B=None):
        self._alphabet = set(alphabet)
        self._hidden_states = set(hidden_states)
        self._transitions = A   # A[k][l]: transition probability from state k to l
        self._emissions = E     # E[k][x]: emission probability of symbol x in state k
        self._initial = B       # B[k]: initial probability of state k

    def _emit(self, cur_state, symbol):
        return self._emissions[cur_state][symbol]

    def _transition(self, cur_state, next_state):
        return self._transitions[cur_state][next_state]

    def _init(self, cur_state):
        return self._initial[cur_state]

    def _states(self):
        for k in self._hidden_states:
            yield k

    def draw(self, filename='hmm'):
        # `graphviz` is an external rendering helper defined elsewhere.
        nodes = list(self._hidden_states) + ['β']

        def get_children(node):
            return self._initial.keys() if node == 'β' else self._transitions[node].keys()

        def get_edge_label(pred, succ):
            return (self._initial if pred == 'β' else self._transitions[pred])[succ]

        def get_node_shape(node):
            return 'circle' if node == 'β' else 'box'

        def get_node_label(node):
            if node == 'β':
                return 'β'
            else:
                return r'\n'.join([node, ''] + [
                    f"{e}: {p}" for e, p in self._emissions[node].items()
                ])

        graphviz(nodes, get_children, filename=filename,
                 get_edge_label=get_edge_label,
                 get_node_label=get_node_label,
                 get_node_shape=get_node_shape,
                 rankdir='LR')

    def viterbi(self, sequence):
        # Dynamic programming in log space to avoid numerical underflow.
        trellis = {}
        traceback = []
        # Step 1. Initialization
        for state in self._states():
            trellis[state] = np.log10(self._init(state)) + np.log10(self._emit(state, sequence[0]))
        # Step 2. Recursion with backpointers
        for t in range(1, len(sequence)):
            trellis_next = {}
            traceback_next = {}
            for next_state in self._states():
                k = {}
                for cur_state in self._states():
                    k[cur_state] = trellis[cur_state] + np.log10(self._transition(cur_state, next_state))
                argmaxk = max(k, key=k.get)
                trellis_next[next_state] = np.log10(self._emit(next_state, sequence[t])) + k[argmaxk]
                traceback_next[next_state] = argmaxk
            trellis = trellis_next
            traceback.append(traceback_next)
        # Step 3. Termination
        max_final_state = max(trellis, key=trellis.get)
        max_final_prob = trellis[max_final_state]
        # Step 4. Traceback
        result = [max_final_state]
        for t in reversed(range(len(sequence) - 1)):
            result.append(traceback[t][max_final_state])
            max_final_state = traceback[t][max_final_state]
        return result[::-1]
④ Type 1. PSSM: Simpler HMM structure
⑤ Type 2. Profile HMM: It has the following advantages over PSSMs:
○ Diagram of profile HMM
Figure 3. Diagram of profile HMM
○ M, I, and D represent match, insertion, and deletion, respectively.
○ Mi can transition to Mi+1, Ii, and Di+1.
○ Ii can transition to Mi+1, Ii, and Di+1.
○ Di can transition to Mi+1, Ii, and Di+1.
○ Advantage 1. The ability to model insertions and deletions
○ Advantage 2. Transitions are restricted to valid state traversals.
○ Advantage 3. Boundaries between states are better defined.
⑷ Type 3. Markov chain Monte Carlo (MCMC)
① Definition: A method for generating samples from a Markov chain following a complex probability distribution
② Method 1. Metropolis-Hastings
○ Generate a new candidate sample from the current state → Accept or reject the candidate sample → If accepted, transition to the new state
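The generate → accept/reject → transition loop above can be sketched as a random-walk Metropolis-Hastings sampler (the target density, step size, and function name are illustrative choices, not from the original notes):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' ~ N(x, step^2),
    accept with probability min(1, p(x') / p(x))."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # Symmetric proposal, so the Hastings correction ratio cancels.
        if np.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal          # accept: transition to the new state
        samples.append(x)         # reject: stay at the current state
    return np.array(samples)

# Target: standard normal (known answer, so the sketch is checkable).
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=20000)
print(samples.mean(), samples.std())  # roughly 0 and 1
```

Only the unnormalized log-density is needed, which is exactly why MCMC is useful for complex distributions whose normalizing constant is intractable.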
③ Method 2. Gibbs Sampling
④ Method 3. Importance/Rejection Sampling
⑤ Method 4. Reversible Jump MCMC
○ General MCMC methods like Method 1 and Method 2 sample from a probability distribution in a fixed-dimensional parameter space
○ Reversible Jump MCMC operates in a variable-dimensional parameter space: The number of parameters dynamically changes during sampling
3. Markov Decision Process
⑴ Overview
① Definition: A decision process in which the future depends only on the current state.
② In practice, the vast majority of problem settings can be treated as a Markov decision process (MDP).
③ Schematic: For transitions, the state at t+1 is determined solely as a function of the state at t.
Figure 4. Agent-Environment Interaction in MDP
○ State: $s_t ∈ S$
○ Action: $a_t ∈ A$
○ Reward: $r_t ∈ R(s_t, a_t)$
○ Policy: $a_t \sim \pi(\cdot \mid s_t)$
○ Transition: $(s_{t+1}, r_{t+1}) \sim P(\cdot \mid s_t, a_t)$
④ Existence of optimal solution $V(s)$
○ Premise 1. Markov property
○ Premise 2. Stationary assumption
○ Premise 3. No distributional shift
⑵ Type 1. Q-learning (Watkins, 1989)
① Overview
○ Learn the Q-function (action-value function) directly from data.
○ Model-free (and off-policy): the transition dynamics $P(x' \mid x, u)$ are unknown.
○ The reward function $R(s, a)$ is known.
○ Example of value iteration.
② Formula: Because the observed next state $j$ is a single realized sample, the transition probability term drops out of the update.
○ Step 1. Initialize estimates for all state/action pairs: Q̂(x, u) = 0
○ Step 2. Take a random action $u$
○ Step 3. Observe the next state $x’$ and receive $R(x’, u)$ from the environment: Note that this is the realized reward, not the expected reward.
○ Step 4. Update $Q̂(x, u)$
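Steps 1 through 4 above can be sketched as tabular Q-learning on a tiny deterministic toy MDP (the transition/reward tables, function name, and hyperparameters are made up for illustration):

```python
import numpy as np

def q_learning(P, R, n_states, n_actions, discount=0.9,
               steps=20000, lr=0.1, seed=0):
    """Tabular Q-learning sketch (hypothetical toy MDP):
    P[s][a] -> next state (deterministic here), R[s][a] -> reward."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))     # Step 1: initialize Q-hat to 0
    s = 0
    for _ in range(steps):
        a = rng.integers(n_actions)         # Step 2: take a random action
        s_next, r = P[s][a], R[s][a]        # Step 3: realized next state/reward
        # Step 4: move Q(s, a) toward the TD target r + γ max_v Q(s', v)
        Q[s, a] += lr * (r + discount * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

# Two-state chain: action 1 moves toward state 1, whose actions pay reward 1.
P = [[0, 1], [0, 1]]               # next-state table
R = [[0.0, 0.0], [1.0, 1.0]]       # reward table
Q = q_learning(P, R, n_states=2, n_actions=2)
print(Q)  # in state 0, action 1 (toward the rewarding state) gets the higher value
```

Note the behavior policy is purely random, yet the learned greedy policy is still correct; this is the off-policy property discussed below.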
③ Supplements
○ Learning the full model requires $\left|\mathcal{S}\right|^2\left|\mathcal{A}\right|$ memory, but Q-learning only needs $\left|\mathcal{S}\right|\times\left|\mathcal{A}\right|$ memory.
○ It is also referred to as TD (temporal-difference) learning.
○ Proof on convergence of Q-learning: Pseudo-contraction mapping
⑶ Type 2. SARSA (State-Action-Reward-State-Action; ε-greedy, epsilon greedy) (Rummery & Niranjan, 1994)
① Overview
○ Depends on policy (on-policy).
○ When there is no data available, it chooses a policy and collects new data through interaction with the environment.
② Formula
○ Step 1. Initialize estimates for all state/action pairs: Q̂(s, a) = 0
○ Step 2. At each step $k$, with probability $1 - \varepsilon_k$, take $u \in \arg\max_v \hat{Q}(x, v)$; with probability $\varepsilon_k$, take a random action
○ Step 3. Observe the next state $x’$ and receive $R(x’, u)$: Note that this is the realized reward, not the expected reward.
○ Step 4. Update Q̂(x, a)
○ Step 5. As $k \to \infty$, drive $\varepsilon_k \to 0$; in this case, one can also use a Boltzmann (softmax) distribution with temperature annealing $(T \to 0)$.
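The steps above can be sketched as follows, on the same kind of hypothetical two-state toy MDP (the $1/\sqrt{k}$ epsilon schedule is one arbitrary choice satisfying Step 5):

```python
import numpy as np

def sarsa(P, R, n_states, n_actions, discount=0.9, steps=20000, lr=0.1, seed=0):
    """SARSA sketch: on-policy, epsilon-greedy behavior, decaying epsilon.
    P[s][a] -> next state (deterministic toy MDP), R[s][a] -> reward."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # Step 1: initialize Q-hat to 0

    def eps_greedy(s, eps):
        if rng.random() < eps:
            return int(rng.integers(n_actions))  # explore
        return int(Q[s].argmax())                # exploit

    s, a = 0, eps_greedy(0, 1.0)
    for k in range(1, steps + 1):
        eps = 1.0 / np.sqrt(k)                   # Step 5: epsilon_k -> 0
        s_next, r = P[s][a], R[s][a]             # Step 3: realized reward
        a_next = eps_greedy(s_next, eps)         # Step 2: epsilon-greedy action
        # Step 4: on-policy target uses the action actually taken next
        Q[s, a] += lr * (r + discount * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
    return Q

P = [[0, 1], [0, 1]]               # next-state table
R = [[0.0, 0.0], [1.0, 1.0]]       # state 1 pays reward
Q = sarsa(P, R, n_states=2, n_actions=2)
print(Q)
```

The only difference from the Q-learning sketch is the TD target: SARSA bootstraps from the action actually taken next ($a'$), not from the max.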
③ Comparison with Q-learning
○ Target policy: the policy to be learned
○ Behavior policy: the policy used to generate samples
○ Q-learning: target policy = optimal policy; behavioral policy = any policy under which each action is taken infinitely often
○ SARSA: target policy = ε-greedy policy; behavioral policy = ε-greedy policy
○ Supplement 1. Using an ε-greedy behavior policy doesn’t automatically make it SARSA; what matters is the target policy used in the update.
○ Supplement 2. “Off-policy” means that no matter which policy collected the data, the update uses a greedy target (i.e., the max over next actions).
○ Supplement 3. If the behavior policy is fully greedy (ε = 0), then in SARSA $a’ = \arg\max_{a} Q(s’, a)$, so the targets match and the update equations become essentially the same.
⑷ Type 3. Q-learning with Linear Function Approximation
① Purpose: The memory space of Q-learning is $|\mathcal{S}| \times |\mathcal{A}|$, while that of linear function approximation is $M \ll |\mathcal{S}| \times |\mathcal{A}|$.
② Formula: Note that $\delta_k^2$, the squared temporal difference, is always non-negative.
③ Significance: With a fixed policy and on-policy sampling, convergence is guaranteed (in an appropriate sense).
④ Example: Tetris game (code)
⑤ Problem: If those conditions are violated (e.g., off-policy learning or updating with incorrect weighting), the algorithm can diverge.
⑸ Type 4. DQN
① Q-learning with nonlinear function approximation
○ Formula: Constructs loss function based on Bellman equation.
○ Problems: Local optima; instability during training
② Experience Replay (Replay Buffer)
○ Definition: A replay buffer is a memory that stores the agent’s past experiences collected from the environment. A single time-step transition is usually stored as $(s, a, r, s’, \text{done})$.
○ Motivation: In reinforcement learning, data arrive as a sequence $(y_1,(x_1,u_1)) \rightarrow (y_2,(x_2,u_2)) \rightarrow (y_3,(x_3,u_3)) \rightarrow \cdots$, so samples are highly correlated and the data distribution keeps changing. Deep learning (SGD) is typically much more stable when samples are i.i.d. (independent and identically distributed). With highly correlated data, updates can “swing” in one direction, often leading to divergence or oscillation during training.
○ How it’s used in training: Instead of updating the network using only the most recent experience, randomly sample a batch of transitions from the buffer and train the network on that mini-batch. This makes the training data closer to i.i.d., which helps improve stability and can reduce the chance of getting stuck in poor local solutions (i.e., improves training stability and robustness).
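A minimal replay buffer along the lines described above can be sketched as follows (the class and method names are my own, not from a specific library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions and returns uniformly
    sampled mini-batches, decorrelating consecutive experiences."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # uniform, no replacement
        return list(zip(*batch))               # columns: states, actions, ...

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(s=t, a=t % 2, r=1.0, s_next=t + 1, done=False)
states, actions, rewards, next_states, dones = buf.sample(batch_size=32)
print(len(states))  # 32
```

Because the batch mixes transitions from many different time steps, the gradient updates see data that is much closer to i.i.d. than the raw stream.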
③ DQN (Deep Q-Network)
○ Definition: DQN implements Q-learning using deep neural networks. It is often implemented as an off-policy method.
○ Online network (Q-network): The network that is currently being trained. With parameters $\theta$, it outputs $Q(s,a;\theta)$. It is used both for action selection (e.g., $\epsilon$-greedy) and is updated continuously via SGD.
○ Target network: A slowly changing copy of the online network. With parameters $\theta^-$, it outputs $Q(s,a;\theta^-)$ and is used to compute the target value $y = r + \alpha \max_{a’} Q(s’, a’; \theta^-)$. The target network is updated either by periodically copying the online parameters ($\theta^- \leftarrow \theta$, hard update) or by slowly tracking the online network, which stabilizes the targets.
○ Comparison with experience replay: Both aim to stabilize training, but they are different mechanisms.
○ Experience replay: Reduces sample correlation and distribution shift by shuffling data (making it closer to i.i.d.) through random sampling from a replay buffer.
○ DQN (target network mechanism): If the $Q$ used to compute the target $y = r + \alpha \max Q(s’, a’)$ changes too rapidly, training can diverge. The target network updates slowly, reducing the “moving target” problem and improving stability.
○ SGD detail: To treat $y_k$ as a label, DQN ignores the fact that $y_k$ is actually a function of the network parameters $w$.
○ Mini-batch SGD: Training is typically performed using mini-batches sampled from the replay buffer.
○ Implementation in PyTorch
Figure 5. Implementation in PyTorch
○ Left (alternative form): forward(state, action) -> q_scalar, with shape (batch, 1). A drawback is that you must evaluate all possible actions separately, i.e., run multiple forward passes to compute $Q(s,1), Q(s,2), \ldots$ and then compare them.
○ Right (standard DQN form): forward(state) -> q_values, with shape (batch, num_actions). Since you only need a single forward pass followed by an argmax, action selection is simple and computationally efficient. Also, because the early hidden layers are shared, the learning signal for one action can change the shared representation and indirectly affect the Q-values of other actions (i.e., coupling).
random.sample(self.replay_buffer, batch_size)
next_state_values = target_net(next_states).max(1)[0]
target_values = rewards + gamma * next_state_values.detach()
loss = nn.MSELoss()(state_action_values, target_values)
optimizer.zero_grad(); loss.backward(); optimizer.step()
○ Applications: Applied to many games (e.g., Cartpole Problem (code), Atari 2600 (2013), AlphaGo (2016), etc.).
④ DDQN (Double Q-learning)
○ Thrun and Schwartz (1993): Q-learning often overestimates the action value (Q-value) under certain states.
○ Double Q-learning (van Hasselt, 2010)
○ Definition: Train two Q-functions (with parameters $w$ and $w’$), then take their average to mitigate the overestimation problem. In this setup, the online network is used to select the action, and the target network is used to evaluate its value.
○ Principle: The double estimator tends to underestimate rather than overestimate.
○ Formula
○ Q-learning vs. Double Q-learning (Weng et al., 2020): In Q-learning, the learning rate is $\beta_k = \frac{c}{k}$, whereas in Double Q-learning it is $\beta_k = \frac{2c}{k}$. Under linear function approximation, we can obtain the following. This also implies that Double Q-learning can converge faster because it uses a larger learning rate.
○ Example: DQN overestimation and DDQN in JAX (code)
○ Clipped double Q-learning (Fujimoto, van Hoof, David Meger, 2018)
⑤ Dueling DDQN (Wang et al., 2016)
Figure 6. Dueling DDQN
○ Definition: For the advantage function $A(s,a):=Q(s,a)-V(s),$ we build a model that predicts \(V\) and \(A\) separately using \(Q(x,u;w,w_v,w_a)=V(x;w,w_v)+A(x,u;w,w_a),\) and then combine them afterward.
○ Motivation: Except for the optimal action, not every action is equally important. \(A(s,a)\) can contrast actions under the same state \(s\).
○ Unidentifiability problem: The decomposition is not unique because \(Q(s,a)=V(s)+A(s,a)=(V(s)+c)+(A(s,a)-c),\quad c\in\mathbb{R},\) so infinitely many \((V,A)\) pairs yield the same \(Q\).
○ Solution 1 (Subtract max): \(Q(x,u;w)=V(x;w)+\big(A(x,u;w)-\max_v A(x,v;w)\big).\) Under a greedy policy, \(Q(x,a^*)=V(x),\) so it enforces the advantage to be \(0\) for the selected optimal action.
○ Solution 2 (Subtract mean): $$Q(x,u;w)=V(x;w)+\left(A(x,u;w)-\frac{1}{|\mathcal{A}|}\sum_v A(x,v;w)\right).$$
○ Pros: Improves optimization stability; resolves the unidentifiability issue.
○ Cons: With subtract-mean, \(V\) becomes a different baseline (the average \(Q\)) rather than \(V^*\), so the semantics of \(V\) and \(A\) no longer match the original \(V^*\) and \(A^*\) definitions.
○ Example: CartPole (code)
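The unidentifiability problem and the subtract-mean fix can be checked numerically with made-up V and A values:

```python
import numpy as np

V = np.array([1.0, 2.0])                     # V(s) for 2 states (illustrative)
A = np.array([[0.5, -0.5, 0.0],
              [1.0, 0.0, -1.0]])             # A(s, a) for 3 actions

# Unidentifiability: (V, A) and (V + c, A - c) give the same raw Q = V + A.
c = 3.7
Q_raw = V[:, None] + A
print(np.allclose(Q_raw, (V + c)[:, None] + (A - c)))  # True

# Subtract-mean combination pins the decomposition down:
Q = V[:, None] + (A - A.mean(axis=1, keepdims=True))
print(np.allclose(Q.mean(axis=1), V))  # True: V is recoverable as the mean of Q
```

After the subtract-mean combination, the advantage stream has zero mean by construction, so V and A can no longer trade a constant back and forth (at the cost of the changed semantics noted above).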
⑥ Multi-step target (Sutton & Barto, 1998)
○ \(Q(x_k,u_k)=\mathbb{E}\big[r(x_k,u_k) + \alpha\, r(x_{k+1},\pi^*(x_{k+1}))+\cdots+\alpha^{n-1} r(x_{k+n-1},\pi^*(x_{k+n-1}))+\alpha^n \max_v Q(x_{k+n},v)\big].\)
○ SARSA with a multi-step target (sliding-window style): Use the loss function \(\big(r_k^{(n)}+\alpha^n \max_v Q(x_{k+n},v;w)-Q(x_k,u_k;w)\big)^2.\)
⑦ Prioritized DDQN (Prioritized replay, Schaul et al., 2016)
○ Definition: Assign weights (priorities) to experiences using the information-theoretic concept of surprisal.
○ Step 1: When a new experience is added, assign it the maximum priority \(z_{\max}\).
○ Step 2: Using \(m\) experiences, compute the sampling probability \(p_k=\frac{z_k}{\sum_i z_i}.\)
○ Step 3: If experience \(k\) is sampled, define the update (change) in its weight \(z_k\) as follows:
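Steps 1 and 2 above can be sketched numerically (the priority values are made up; in practice priorities come from, e.g., the absolute TD error):

```python
import numpy as np

rng = np.random.default_rng(0)

# Existing priorities z_k (illustrative values).
z = np.array([0.1, 0.5, 0.2, 1.0])

# Step 1: a newly added experience gets the current maximum priority z_max.
z = np.append(z, z.max())

# Step 2: sampling probabilities proportional to priority.
p = z / z.sum()
print(p.sum())  # 1.0

# Sample a mini-batch of experiences in proportion to p.
idx = rng.choice(len(z), size=2, replace=False, p=p)
print(idx)
```

Assigning z_max to new experiences guarantees each fresh transition is replayed at least once before its priority is revised in Step 3.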
⑧ Distributional DQN (Bellemare et al., 2017)
⑨ Noisy DQN
⑩ Conclusion: Combining all of these improvements yields strong performance (Rainbow)
Figure 7. Rainbow
⑹ Type 5. Deep Q-network (DQN)
⑺ Type 6. Actor-critic
① Typical actor-critic
○ Definition: Generalized policy iteration; the policy π(s) is learned directly
○ Motivation
○ In Q-learning, states can be continuous, but actions must be discrete
○ In policy gradient, both states and actions can be continuous
○ Typically, DQN is used when there are up to a few dozen discrete actions, and policy gradient is used beyond that.
○ In DQN, the network takes \(x\) as input and outputs \(u\) (i.e., Q-values for each discrete action), whereas in actor–critic methods, the critic takes \(x\) and \(u\) as inputs and outputs \(Q(x,u)\) (because the action space is continuous/infinite).
○ Components
○ Critic: TD(λ), double-Q, clipped double-Q
○ Actor: ε-greedy based on the current Q-function, policy-gradient
○ An example process
○ Step 1. Receive frame
○ Step 2. Forward propagate to get P(action)
○ Step 3. Sample a from P(action)
○ Step 4. Play the rest of the game
○ Step 5. If the game is won, update in the ∇θ direction
○ Step 6. If the game is lost, update in the -∇θ direction
② A3C (Asynchronous Advantage Actor-Critic)
③ Policy gradient theorem
○ $\rho_w$: Discounted state distribution (= geometric distribution, discounted occupancy measure)
○ $\nabla_w \log \pi_w(u \mid x)$: Score function
○ Gibbs policy:
○ Gaussian policy:
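The two policy classes named above have standard forms; with a feature map φ(x, u), their densities and score functions are usually written as (supplied here for completeness):

```latex
\text{Gibbs (softmax): } \pi_w(u \mid x) = \frac{\exp\!\big(w^{\top}\phi(x,u)\big)}{\sum_{v}\exp\!\big(w^{\top}\phi(x,v)\big)},
\qquad
\nabla_w \log \pi_w(u \mid x) = \phi(x,u) - \mathbb{E}_{v \sim \pi_w(\cdot \mid x)}\big[\phi(x,v)\big]

\text{Gaussian: } \pi_w(u \mid x) = \mathcal{N}\!\big(u;\, \mu_w(x),\, \sigma^2\big),
\qquad
\nabla_w \log \pi_w(u \mid x) = \frac{\big(u - \mu_w(x)\big)\,\nabla_w \mu_w(x)}{\sigma^2}
```

The Gibbs form covers discrete actions; the Gaussian form covers continuous actions, which is why it appears in actor-critic methods.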
④ REINFORCE (Williams, 1988, 1992)
○ Use Monte Carlo methods to estimate the Q-function.
○ On-policy.
○ Drawback: high variance.
○ Example: CartPole (code)
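REINFORCE's Monte Carlo estimate can be sketched on a 2-armed bandit, which is a one-step episode so the sampled return is just the immediate reward (arm means, noise scale, and hyperparameters are made up):

```python
import numpy as np

def reinforce_bandit(rewards_mean, episodes=3000, lr=0.1, seed=0):
    """REINFORCE sketch with a softmax (Gibbs) policy over logits theta.
    Monte Carlo return G is the sampled one-step reward."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(rewards_mean))          # policy logits
    for _ in range(episodes):
        p = np.exp(theta) / np.exp(theta).sum()  # softmax policy
        a = rng.choice(len(p), p=p)              # sample an action on-policy
        G = rewards_mean[a] + 0.1 * rng.standard_normal()  # noisy sampled return
        grad_log = -p
        grad_log[a] += 1.0                       # score of the softmax policy
        theta += lr * G * grad_log               # ascend the PG estimate
    return theta

theta = reinforce_bandit(rewards_mean=[0.0, 1.0])
print(theta)  # the higher-reward arm ends up with the larger logit
```

Even in this trivial setting the single-sample gradient estimate is noisy, which is the high-variance drawback noted above and the motivation for the variance reduction methods that follow.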
⑤ Variance Reduction Theorem
○ Definition: For a given $F(x)$, a variance reduction method uses a function $\phi(x)$ satisfying $\mathbb{E}[\phi(x)] = 0$ and having high correlation with $F(x)$, so that $\mathrm{Var}(F(x) - \phi(x)) < \mathrm{Var}(F(x))$.
○ Application in the policy gradient theorem
○ $G_w(x_k)$ must be action-independent. It may depend on $x_{k-1}$, but if it depends on $x_{k+1}$, it implicitly contains information about $u_k$, which makes it inappropriate.
⑥ Natural Policy Gradient (NPG)
○ The policy gradient theorem yields an optimization algorithm that is plain (Euclidean) steepest ascent: without quadratic (curvature) terms, the maximum/minimum of the local model lies at ±∞.
○ NPG (Kakade et al., 2001)
○ Convergence of PG (Mei et al., 2020): under the tabular setting where the state and action sets are finite,
○ $\beta = (1-\alpha)^3 / 8$
○ $S$: the number of states. Since it is typically very large, it becomes an issue in convergence arguments for PG.
○ $\rho$: the discounted state distribution
○ $c$: a constant
○ $d$: the distribution of the initial condition
○ Convergence of NPG (Agarwal, 2020): under the tabular setting where the state and action sets are finite,
○ $\beta = (1-\alpha)^2 \log \left|\mathcal{A}\right|$
○ $t$: the iteration index (number of iterations)
○ NPG + entropy regularization
○ The policy $\pi$ is usually set as follows:
○ If we set $\beta = (1-\alpha)/\tau$, then the above policy becomes soft policy iteration ($\simeq$ a softmax policy).
○ As $\tau \to 0$, soft policy iteration reduces back to standard policy iteration.
⑦ TRPO (Trust Region Policy Optimization)
○ Performance difference lemma (Kakade & Langford, 2002)
○ Local approximation
○ TRPO (Schulman et al., 2015)
○ TRPO = NPG with a different learning rate.
○ $\tilde{\beta}$ is chosen with a backtracking line search to satisfy the constraint.
○ Monotonic improvement theorem: With carefully chosen learning rates, TRPO guarantees, even under general function approximation, that $V_{w_{k+1}}(x_0) \ge V_{w_k}(x_0)$.
⑧ Proximal Policy Optimization (PPO)
○ Rationale: NPG and TRPO require computing the inverse of the Fisher information matrix (i.e., they are second-order methods), which makes them computationally expensive.
○ PPO: First-order method
○ PPO-CLIP: Uses a min operation to conservatively discourage how much the model changes. The proportional coefficient in the second term is clipped so that if it deviates from 1 by more than $\epsilon$, it is set to $1 \pm \epsilon$.
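The clipped surrogate described above can be sketched as follows (function name and the sample ratios are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO-CLIP surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    The min keeps the pessimistic (lower) value, discouraging large policy changes."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# ratio r = pi_new(a|s) / pi_old(a|s); made-up values around 1.
ratios = np.array([0.5, 1.0, 1.5])

# Positive advantage: ratios beyond 1 + eps earn no extra credit (clipped to 1.2).
print(ppo_clip_objective(ratios, 1.0))
# Negative advantage: the min keeps whichever term is more pessimistic.
print(ppo_clip_objective(ratios, -1.0))
```

Because the objective is flat outside the [1 - ε, 1 + ε] band (in the direction that would improve it), the gradient provides no incentive to push the ratio further, which is what keeps updates proximal without a second-order trust-region constraint.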
⑻ Type 7. GA (Genetic Algorithm)
4. General Decision Process
⑴ MAB (Multi-Armed Bandit)
⑵ UCB
5. Advanced Topics
⑴ Unsupervised learning
① Benchmarks: URLB
② Baselines: Diversity is all you need, Forward Backward Representation
③ Envs: Dm_control, Maze, Hopper, Cheetah, Quadruped, Walker
⑵ Online goal-conditioned reinforcement learning
① Benchmarks: JaxGCRL
② Baselines: Contrastive Reinforcement Learning, SAC, PPO, TD3
③ Envs: Brax, Locomotion, Manipulation
⑶ Offline goal-conditioned reinforcement learning
① Benchmarks: OGBench
② Baselines: CRL, HIQL, QRL, implicit Q/V learning
③ Envs: Locomotion, Manipulation, powderworld
⑷ Continual reinforcement learning
① Benchmarks: CORA
③ Envs: Atari, Procgen, Minihack, CHORES, Nethack (codebase)
④ Envs without benchmark: AgarCL, Jelly Bean World
⑸ Open-ended reinforcement learning
① Benchmarks: Craftax baseline
③ Envs: Craftax
⑹ Safe reinforcement learning
① Benchmarks: Omnisafe
② Baselines: CPO, FOCOPS, PPO-Lagrangian and TRPO-Lagrangian
③ Envs: Safety-Gymnasium, Safe-Control-Gym
⑺ Multi-agent reinforcement learning
① Benchmarks: BenchMARL
③ Envs: PettingZoo, VMAS
⑻ Queueing network control via reinforcement learning
① Benchmarks: QGym
② Envs: DiffDiscreteEventSystem
Input: 2021.12.13 15:20
Updated: 2024.10.08 22:43