
Chapter 11. Reinforcement Learning (RL)

Recommended Reading: 【Algorithm】 Algorithm Index


1. Overview

2. Markov Chain

3. Markov Decision Process

4. Markov Chain Monte Carlo



1. Overview

⑴ Definition

① Supervised Learning

○ Data: (x, y) (where x is a feature, y is a label)

○ Goal: Compute the mapping function x → y

② Unsupervised Learning

○ Data: x (where x is a feature and there is no label)

○ Goal: Learn the underlying structure of x

③ Reinforcement Learning

○ Data: (s, a, r, s’) (where s is state, a is action, r is reward, s’ is the next state)

○ Goal: Maximize the total reward over multiple time steps

⑵ Characteristics: Differs from supervised and unsupervised learning in the following ways

① Correct answers are received only implicitly: they are provided in the form of rewards

② Needs to consider interaction with the environment: Delayed feedback can be an issue

③ Previous decisions influence future interactions

④ Actively gathers information: Reinforcement learning includes the process of obtaining data

Element 1. Reward

Element 2. Policy

① Definition: An agent’s behavior, mapping state input to action output

Type 1. Deterministic Policy

Type 2. Stochastic Policy

Reason 1. To explore since the optimal action is unknown during learning

Reason 2. The optimal behavior itself may be stochastic (e.g., rock-paper-scissors, where a deterministic strategy can be exploited by the opponent)

Element 3. Value Function

① Definition: The expected value of cumulative future rewards, analogous to a lifetime value (LTV)

② Formula: Given state s, policy π, and discount factor γ to adjust future value to present value,


$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_0 = s, \pi \right]$$


③ Provides a basis for evaluating whether a state is good or bad and choosing actions accordingly
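
○ For instance, the discounted sum in the formula above can be computed directly. A minimal sketch in Python (γ and the reward sequence are hypothetical):


gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical r_0, r_1, r_2, r_3

# Discounted return: sum of gamma^t * r_t
value = sum(gamma ** t * r for t, r in enumerate(rewards))
print(value)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349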

Element 4. Model

① Definition: The behavior of the environment

② Given a state and action, the model determines the next state and reward

③ Note that there are model-free and model-based methods



2. Markov Chain

⑴ Definition: A system where the future state depends only on the current state and not on past states


$$\Pr(X_{n+1} = x \mid X_n = x_n, \ldots, X_0 = x_0) = \Pr(X_{n+1} = x \mid X_n = x_n)$$


① Strongly Connected (= Irreducible): Any node i in the graph can reach any other node j

② Period: The greatest common divisor of the lengths of all paths returning to node i

○ Example: If two nodes A and B are connected by the two edges A → B and B → A, the period of each node is 2

③ Aperiodic: When all nodes have a period of 1

○ Aperiodic ⊂ Irreducible

○ Example: If each node has a walk returning to itself, it is aperiodic

④ Stationary State: If Pr(xn | xn-1) is independent of n, the Markov process is stationary (time-invariant)

⑤ Regular

○ Regular ⊂ Irreducible

○ If there exists a natural number k such that all elements of the k-th power of the transition matrix M^k are positive (i.e., nonzero)

⑵ Two-State Markov Chain: Closely related to Graph Theory


Figure 1. Two-State Markov Chain


① M: Transformation causing state transition in one step

② M^n: Transformation causing state transitions over n steps

③ Steady-State Vector: A vector q satisfying Mq = q, i.e., an eigenvector with eigenvalue 1
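
○ Example: the steady-state vector can be computed as the eigenvector of M with eigenvalue 1. A minimal NumPy sketch (the transition probabilities are hypothetical):


import numpy as np

# Column-stochastic transition matrix (hypothetical probabilities):
# column j holds the probabilities of moving out of state j.
M = np.array([[0.7, 0.4],
              [0.3, 0.6]])

# Eigenvector for eigenvalue 1, normalized to sum to 1
w, v = np.linalg.eig(M)
q = np.real(v[:, np.argmax(np.isclose(w, 1.0))])
q = q / q.sum()
print(q)          # steady-state vector, ~[0.571, 0.429]
print(M @ q - q)  # ~0: verifies Mq = q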

⑶ HMM (Hidden Markov Model)

① If χ = {Xi} is a Markov process and Yi = ϕ(Xi) (where ϕ is a deterministic function), then Y = {Yi} is a hidden Markov model.

② Baum-Welch Algorithm

○ Purpose: Learning HMM parameters

○ Input: Observed data

○ Output: State transition probabilities and emission probabilities of HMM

○ Principle: A type of EM (Expectation Maximization) algorithm

○ Formula

○ Akl: Expected number of transitions from state k to state l


$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$


○ Ek(b): Expected number of emissions of observation b from state k


$$e_{k}(b) = \frac{E_{k}(b)}{\sum_{b'} E_{k}(b')}$$


○ Bk: Initial probability for state k


$$b_{k} = \frac{B_{k}}{\sum_{k'} B_{k'}}$$
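
○ Given the expected counts from the E-step (the forward-backward pass, not shown here), the M-step updates above are simple normalizations. A minimal sketch with hypothetical counts:


import numpy as np

# Hypothetical expected counts from one E-step (rows = source state k)
A = np.array([[8.0, 2.0],   # transitions k -> l
              [1.0, 9.0]])
E = np.array([[6.0, 4.0],   # emissions of symbol b from state k
              [3.0, 7.0]])
B = np.array([3.0, 1.0])    # initial-state counts

# M-step: normalize counts into probabilities
a = A / A.sum(axis=1, keepdims=True)  # a_kl = A_kl / sum_l' A_kl'
e = E / E.sum(axis=1, keepdims=True)  # e_k(b) = E_k(b) / sum_b' E_k(b')
b = B / B.sum()                       # b_k = B_k / sum_k' B_k'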


③ Viterbi Algorithm (ref)

○ Purpose: Find the most likely hidden state sequence given an HMM

○ Input: HMM parameters and observed data

○ N: Number of possible hidden states

○ T: Length of observed data

○ A: State transition probability, akl = probability of transitioning from state k to state l

○ E: Emission probability, ek(x) = probability of observing x in state k

○ B: Initial state probability

○ Output: Most probable state sequence

○ Principle: Uses dynamic programming to compute the optimal path




Step 1. Initialization


$$v_{k}(0) = b_{k} \, e_{k}(\sigma)$$


○ bk: Initial probability of state k, P(s0 = k)

○ ek(σ): Probability of observing the first observation σ in state k, P(x0 | s0 = k)

Step 2. Recursion

○ Compute maximum probability from the previous state at each time step i = 1, …, T


$$v_{l}(i) = e_{l}(x_i) \max_{k} \left[ v_{k}(i-1) \, a_{kl} \right]$$


○ Compute backpointer (ptr) storing the most probable previous state


$$\mathrm{ptr}_{i}(l) = \arg\max_{k} \left[ v_{k}(i-1) \, a_{kl} \right]$$


○ ptri(l) stores the previous state k that has the highest probability of transitioning to the current state l.

Step 3. Termination

○ Select the highest probability at the final time step


$$P^{*} = \max_{k} v_{k}(T)$$


○ Determine the last state of the optimal sequence


$$\pi_{T}^{*} = \arg\max_{k} v_{k}(T)$$


○ vk(i - 1): Optimal probability at previous time step i - 1 in state k

○ akl: Probability of transitioning from state k to l

Step 4. Traceback

○ Trace back through ptr array from i = T, …, 1 to recover the optimal path


$$\pi_{i-1}^{*} = \mathrm{ptr}_{i}(\pi_{i}^{*})$$


○ Example


Figure 2. Example of Viterbi Algorithm


○ Python Code


import numpy as np

class HMM(object):
    def __init__(self, alphabet, hidden_states, A=None, E=None, B=None):
        self._alphabet = set(alphabet)
        self._hidden_states = set(hidden_states)
        self._transitions = A
        self._emissions = E
        self._initial = B
        
    def _emit(self, cur_state, symbol):
        return self._emissions[cur_state][symbol]
    
    def _transition(self, cur_state, next_state):
        return self._transitions[cur_state][next_state]
    
    def _init(self, cur_state):
        return self._initial[cur_state]

    def _states(self):
        for k in self._hidden_states:
            yield k

    def draw(self, filename='hmm'):
        # NOTE: relies on an external graphviz() rendering helper (not shown here)
        nodes = list(self._hidden_states) + ['β']

        def get_children(node):
            return self._initial.keys() if node == 'β' else self._transitions[node].keys()

        def get_edge_label(pred, succ):
            return (self._initial if pred == 'β' else self._transitions[pred])[succ]
        
        def get_node_shape(node):
            return 'circle' if node == 'β' else 'box'
            
        def get_node_label(node):
            if node == 'β':
                return 'β'
            else:
                return r'\n'.join([node, ''] + [
                    f"{e}: {p}" for e, p in self._emissions[node].items()
                ])

        graphviz(nodes, get_children, filename=filename,
                 get_edge_label=get_edge_label,
                 get_node_label=get_node_label,
                 get_node_shape=get_node_shape,
                 rankdir='LR')
        
    def viterbi(self, sequence):
        # Log-space dynamic programming to avoid numerical underflow
        trellis = {}    # v_k(i): best log-probability of a path ending in state k
        traceback = []  # back-pointers: most probable previous state at each step
        for state in self._states():
            # Initialization: v_k(0) = log b_k + log e_k(x_0)
            trellis[state] = np.log10(self._init(state)) + np.log10(self._emit(state, sequence[0]))
            
        for t in range(1, len(sequence)):
            trellis_next = {}
            traceback_next = {}

            for next_state in self._states():
                k = {}
                for cur_state in self._states():
                    # Candidate score: v_k(i-1) + log a_kl for each previous state k
                    k[cur_state] = trellis[cur_state] + np.log10(self._transition(cur_state, next_state))
                argmaxk = max(k, key=k.get)  # most probable previous state
                # Recursion: v_l(i) = log e_l(x_i) + max_k [v_k(i-1) + log a_kl]
                trellis_next[next_state] = np.log10(self._emit(next_state, sequence[t])) + k[argmaxk]
                traceback_next[next_state] = argmaxk
                
            trellis = trellis_next
            traceback.append(traceback_next)
            
        # Termination: select the most probable final state, then trace back
        max_final_state = max(trellis, key=trellis.get)
        max_final_prob = trellis[max_final_state]  # log-probability of the best path
                
        result = [max_final_state]
        for t in reversed(range(len(sequence)-1)):
            result.append(traceback[t][max_final_state])
            max_final_state = traceback[t][max_final_state]
            
        return result[::-1]
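
○ Usage example: a minimal sketch (all parameters are hypothetical) decoding a coin-flip sequence with a fair (F) / biased (B) two-state HMM:


A = {'F': {'F': 0.9, 'B': 0.1}, 'B': {'F': 0.1, 'B': 0.9}}    # transition probabilities
E = {'F': {'H': 0.5, 'T': 0.5}, 'B': {'H': 0.75, 'T': 0.25}}  # emission probabilities
B = {'F': 0.5, 'B': 0.5}                                      # initial probabilities

hmm = HMM(alphabet='HT', hidden_states='FB', A=A, E=E, B=B)
print(hmm.viterbi('HHHHTHHTTH'))  # most probable hidden-state path (list of 'F'/'B')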


Type 1. PSSM (Position-Specific Scoring Matrix): A simpler HMM structure

Type 2. Profile HMM: It is advantageous over PSSMs in the following respects:

○ Diagram of profile HMM


Figure 3. Diagram of profile HMM


○ M, I, and D represent match, insertion, and deletion, respectively.

○ Mi can transition to Mi+1, Ii, and Di+1.

○ Ii can transition to Mi+1, Ii, and Di+1.

○ Di can transition to Mi+1, Ii, and Di+1. (This topology is sketched in code after the list of advantages below.)

Advantage 1. The ability to model insertions and deletions

Advantage 2. Transitions are restricted to valid state traversals.

Advantage 3. Boundaries between states are better defined.
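
○ A minimal sketch of the transition topology described above (the state naming scheme is hypothetical):


# Allowed transitions in a profile HMM: from M_i, I_i, or D_i,
# only M_{i+1}, I_i, and D_{i+1} are reachable.
def profile_hmm_topology(length):
    allowed = {}
    for i in range(1, length):
        successors = [f'M{i+1}', f'I{i}', f'D{i+1}']
        for s in (f'M{i}', f'I{i}', f'D{i}'):
            allowed[s] = successors
    return allowed

print(profile_hmm_topology(3))
# {'M1': ['M2', 'I1', 'D2'], 'I1': ['M2', 'I1', 'D2'], 'D1': ['M2', 'I1', 'D2'], ...}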

Property 1. Perron-Frobenius Theorem

Theorem 1. If a Markov chain with transition matrix P is strongly connected, then a unique stationary distribution q exists

○ Stationary distribution satisfies Pq = q

Theorem 2. If a Markov chain with transition matrix P is strongly connected and aperiodic,

○ Pij: Probability of transitioning from node j to node i. ∑i Pij = 1

2-1. The element of P^k in row i and column j converges to qi as k → ∞

2-2. Regardless of the initial state x0, the distribution of the k-th state xk converges to q as k → ∞
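
○ A small numerical check of Theorem 2, using the column convention ∑i Pij = 1 above (the matrix is hypothetical):


import numpy as np

# Column-stochastic P: P[i, j] = probability of moving from node j to node i
P = np.array([[0.5, 0.2],
              [0.5, 0.8]])

Pk = np.linalg.matrix_power(P, 50)
print(Pk)   # every column ~ q: [[0.286, 0.286], [0.714, 0.714]]

x0 = np.array([1.0, 0.0])  # arbitrary initial distribution
print(Pk @ x0)             # ~ q, regardless of x0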

Property 2. The second law of thermodynamics (entropy increase) can be proven using Markov processes

① Because it can simulate the law of diffusion, assuming a uniform stationary distribution

② Related Concept: Random Walk

3. Markov Decision Process (MDP)

⑴ Overview

① Definition: A process involving transitions in which the state at time t+1 is determined solely by the state and action at time t

② Diagram


Figure 4. Agent-Environment Interaction in MDP


○ State: st ∈ S

○ Action: at ∈ A

○ Reward: rt ∈ R(st, at)

○ Policy: at ~ π(· | st)

○ Transition: (st+1, rt+1) ~ P(· | st, at)

③ Most real-world problems correspond to Markov decision processes

④ Existence of an optimal solution V*(s)


$$V^{*}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{*}(s') \right]$$


Premise 1. Markov property

Premise 2. Stationary assumption

Premise 3. No distributional shift
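
○ Under these premises, V*(s) can be found by value iteration, i.e., repeatedly applying the equation above until it converges to its fixed point. A minimal sketch on a hypothetical two-state, two-action MDP:


import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
R = np.array([[1.0, 0.0],   # R[s, a] (hypothetical rewards)
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2],  # P[s, a, s'] (hypothetical transitions)
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])

V = np.zeros(n_states)
for _ in range(500):
    # V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    V = np.max(R + gamma * (P @ V), axis=1)
print(V)  # approximate fixed point V*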

Type 1. Q-learning

① Definition: Learning the action-value function Q(s, a) for all state-action pairs, even when the model is unknown

○ Transition probability P(s’ | s, a) is unknown

○ Reward function R(s, a) is unknown

Method 1.

○ Initialize estimates for all state/action pairs: Q̂(s, a) = 0

○ Initial iteration scheme

○ Take a random action a

○ Observe the next state s’ and receive R(s, a) from the environment

○ Update Q̂(s, a)

○ Later iteration scheme

○ Take a = arg max_a Q̂(s, a)

○ Observe the next state s’ and receive R(s, a)

○ Update Q̂(s, a)


$$\hat{Q}(s, a) \leftarrow (1 - \alpha) \, \hat{Q}(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} \hat{Q}(s', a') \right]$$


○ Random restart

Method 2. ε-greedy (epsilon-greedy)

○ Initialize estimates for all state/action pairs: Q̂(s, a) = 0

○ Iteration scheme

○ With probability 1 - ε, take a = arg max_a Q̂(s, a); with probability ε, take a random action

○ Observe the next state s’ and receive R(s, a)

○ Update Q̂(s, a)


$$\hat{Q}(s, a) \leftarrow (1 - \alpha) \, \hat{Q}(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} \hat{Q}(s', a') \right]$$


○ Perform a random restart in later iterations
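
○ A minimal tabular sketch of the ε-greedy scheme above; the toy environment env, with reset() and step() methods, is a hypothetical stand-in:


import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1):
    Q = np.zeros((n_states, n_actions))  # initialize Q-hat(s, a) = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, else exploit
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = np.argmax(Q[s])
            s_next, r, done = env.step(a)  # observe s' and receive R(s, a)
            # Q-hat(s,a) <- (1-alpha) Q-hat(s,a) + alpha [r + gamma max_a' Q-hat(s',a')]
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
            s = s_next
    return Q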

④ DQN (Deep Q-Network)

○ Definition: Implementation of Q-learning using deep learning

○ Applied to many games: DQN on Atari 2600 (2013), AlphaGo (2016), etc.

Type 2. Policy Gradient

① Definition: Learning policy π(s) directly

② Characteristics

○ In Q-learning, states can be continuous, but actions must be discrete

○ In policy gradient, both states and actions can be continuous

③ Algorithm

Step 1. Receive frame

Step 2. Forward propagate to get P(action)

Step 3. Sample a from P(action)

Step 4. Play the rest of the game

Step 5. If the game is won, update in the ∇θ direction

Step 6. If the game is lost, update in the -∇θ direction
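
○ A minimal REINFORCE-style sketch of Steps 1-6 for a binary action with a logistic policy; the parameter vector theta and the play_rest_of_game helper are hypothetical:


import numpy as np

def policy_gradient_step(theta, frames, lr=0.01):
    """One update from a finished game: +gradient if won, -gradient if lost."""
    grads, actions = [], []
    for x in frames:                           # Step 1: receive frame x
        p = 1 / (1 + np.exp(-x @ theta))       # Step 2: forward propagate P(action=1)
        a = 1 if np.random.rand() < p else 0   # Step 3: sample action from P(action)
        grads.append((a - p) * x)              # gradient of log P(a|x) w.r.t. theta
        actions.append(a)
    won = play_rest_of_game(actions)           # Step 4 (hypothetical helper)
    sign = 1.0 if won else -1.0                # Steps 5-6: direction of the update
    theta += lr * sign * np.sum(grads, axis=0)
    return theta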

⑷ Other Types

① Deep Q-Network (DQN)

② A3C (Asynchronous Advantage Actor-Critic)

③ Genetic Algorithm

④ SARSA (State-Action-Reward-State-Action)



4. Markov Chain Monte Carlo

⑴ Definition: A method for drawing samples from a complex probability distribution by constructing a Markov chain whose stationary distribution is the target distribution

Type 1. Metropolis-Hastings

① Generate a new candidate sample from the current state → Accept or reject the candidate sample → If accepted, transition to the new state; if rejected, remain at the current state
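
○ A minimal sketch using a symmetric random-walk proposal, in which case the acceptance ratio reduces to p(x′)/p(x); the target density is a hypothetical unnormalized standard normal:


import numpy as np

def metropolis_hastings(log_p, x0, n_samples=10000, step=1.0):
    samples, x = [], x0
    for _ in range(n_samples):
        x_new = x + np.random.normal(0, step)  # propose a candidate
        # accept with probability min(1, p(x_new)/p(x)) for a symmetric proposal
        if np.log(np.random.rand()) < log_p(x_new) - log_p(x):
            x = x_new                          # accepted: move to the new state
        samples.append(x)                      # rejected: stay at the current state
    return np.array(samples)

# Hypothetical target: unnormalized standard normal, log p(x) = -x^2 / 2
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
print(samples.mean(), samples.std())  # ~0, ~1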

Type 2. Gibbs Sampling

Type 3. Importance/Rejection Sampling

Type 4. Reversible Jump MCMC

① General MCMC methods like Type 1 and Type 2 sample from a probability distribution in a fixed-dimensional parameter space

② Reversible Jump MCMC operates in a variable-dimensional parameter space: The number of parameters dynamically changes during sampling



Posted: 2021.12.13 15:20

Updated: 2024.10.08 22:43
