Chapter 11. Reinforcement Learning
Recommended Reading: 【Algorithm】 Algorithm Index
1. Overview
2. Markov Chain
a. Reinforcement Learning Problem Sets
I wrote this based on my experience taking UMich ECE567.
1. Overview
⑴ Definition
① Supervised Learning
○ Data: (x, y) (where x is a feature, y is a label)
○ Goal: Compute the mapping function x → y
② Unsupervised Learning
○ Data: x (where x is a feature and there is no label)
○ Goal: Learn the underlying structure of x
③ Reinforcement Learning
○ Data: (s, a, r, s’) (where s is state, a is action, r is reward, s’ is the next state)
○ Goal: Maximize the total reward over multiple time steps
⑵ Elements
① Element 1. State
② Element 2. Reward
○ Definition: A scalar feedback received as a result of a state transition.
○ Value function: The expected value of future rewards, expressed as a lifetime value (VLT).
○ Formulation: Defined for a state s, a policy π, and a discount factor γ that discounts future value to present value.
○ Provides the basis for choosing an action by evaluating whether a state is good or bad.
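The formulation above can be written out explicitly; the standard textbook form of the state-value function (supplied here for completeness, not taken from the original figure) is:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right], \qquad 0 \le \gamma < 1
```

The discount factor γ converts future rewards into present value, so the infinite sum converges whenever rewards are bounded.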
③ Element 3. Action
④ Element 4. Policy
○ Definition: The agent’s behavior; a mapping that takes a state as input and outputs an action.
○ Decision process: A general framework for decision-making problems in which states, actions, and rewards unfold over a process.
○ Type 1: Deterministic policy
○ Type 2: Stochastic policy
○ Reason 1: In learning, the optimal behavior is unknown, so exploration is needed.
○ Reason 2: The optimal situation itself may be stochastic (e.g., rock–paper–scissors, or when an opponent exploits determinism).
⑤ Element 5. Model
○ Definition: The behavior/dynamics of the environment.
○ Given a state and an action, the model determines the next state and the reward.
○ Note the distinction between model-free and model-based methods.
⑶ Characteristics: How it differs from supervised and unsupervised learning
① Implicitly receives correct answers: Provided in the form of rewards
② Needs to consider interaction with the environment: Delayed feedback can be an issue
③ Previous decisions influence future interactions
④ Actively gathers information: Reinforcement learning includes the process of obtaining data
2. Markov Chain
⑴ Overview
① Definition: A system where the future state depends only on the current state and not on past states
② A Markov chain refers to a Markov process whose state space is finite or countably infinite.
③ Lemma 1. Chapman-Kolmogorov decomposition
④ Lemma 2. Linear state-space model
① Strongly Connected (= Irreducible): A chain in which any node i in the graph can reach any other node j
② Period: The greatest common divisor of the lengths of all walks returning to node i
○ Example: If two nodes A and B are connected as A ⇄ B by two directed edges, the period of each node is 2
③ Aperiodic: When all nodes have a period of 1
○ Aperiodic ⊂ Irreducible
○ Example: If each node has a walk returning to itself, it is aperiodic
④ Stationary State: If $Pr(x_n \mid x_{n-1})$ is independent of $n$, the Markov process is stationary (time-invariant)
⑤ Regular
○ Regular ⊂ Irreducible
○ A chain is regular if there exists a natural number k such that all elements of the k-th power of the transition matrix, M^k, are positive (i.e., nonzero)
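The regularity condition above can be checked numerically; here is a minimal sketch (the function name, the max_power cutoff, and the example matrices are my own, for illustration):

```python
import numpy as np

def is_regular(M, max_power=100):
    """Check whether a stochastic matrix M is regular, i.e., whether
    some power M^k has all strictly positive entries."""
    P = np.eye(M.shape[0])
    for _ in range(max_power):
        P = P @ M
        if np.all(P > 0):
            return True
    return False

# A periodic two-state chain (A <-> B) is irreducible but not regular:
periodic = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
# Adding a self-loop makes the chain aperiodic, hence regular:
regular = np.array([[0.5, 1.0],
                    [0.5, 0.0]])
print(is_regular(periodic), is_regular(regular))  # prints: False True
```

The periodic chain alternates between the identity and the swap matrix, so no power is all-positive; the second chain already has all-positive entries at M².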
⑥ Lemma 1. Perron-Frobenius theorem
⑦ Lemma 2. Lyapunov equation
⑧ Lemma 3. Bellman equation
⑵ Type 1. Two-State Markov Chain
Figure 1. Two-State Markov Chain
① M: Transformation causing state transition in one step
② Mⁿ: Transformation causing state transition in n steps
③ Steady-State Vector: A vector q satisfying Mq = q, i.e., an eigenvector with eigenvalue 1
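The steady-state definition above can be checked with a small NumPy sketch (the transition probabilities are made-up numbers): q is found as the eigenvector of M for eigenvalue 1, normalized to sum to 1.

```python
import numpy as np

# Two-state chain with a column-stochastic transition matrix M (illustrative values):
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# Steady-state vector: eigenvector of M for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(M)
q = eigvecs[:, np.argmin(np.abs(eigvals - 1.0))].real
q = q / q.sum()

print(q)       # satisfies M q = q
print(M @ q)   # same vector again
```

For this M the stationary distribution is q = (2/3, 1/3): state 1 leaks probability 0.1 per step while state 2 leaks 0.2, so the chain spends twice as much time in state 1.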
⑶ Type 2. HMM (Hidden Markov Model)
① If χ = {Xi} is a Markov process and Yi = ϕ(Xi) (where ϕ is a deterministic function), then y = {Yi} is a hidden Markov model.
② Baum-Welch Algorithm
○ Purpose: Learning HMM parameters
○ Input: Observed data
○ Output: State transition probabilities and emission probabilities of HMM
○ Principle: A type of EM (Expectation Maximization) algorithm
○ Formula
○ Akl: Number of transitions from state k to l
○ Ek(b): Number of emissions of observation b from state k
○ Bk: Initial probability for state k
③ Viterbi Algorithm
○ Purpose: Find the most likely hidden state sequence given an HMM
○ Input: HMM parameters and observed data
○ N: Number of possible hidden states
○ T: Length of observed data
○ A: State transition probability, akl = probability of transitioning from state k to state l
○ E: Emission probability, ek(x) = probability of observing x in state k
○ B: Initial state probability
○ Output: Most probable state sequence
○ Principle: Uses dynamic programming to compute the optimal path
○ Step 1. Initialization
○ bk: Initial probability of state $k$, $P(s_0 = k)$
○ $e_k(σ)$: Probability of observing the first observation σ in state $k$, $P(x_0 \mid s_0 = k)$
○ Step 2. Recursion
○ Compute maximum probability from the previous state at each time step i = 1, …, T
○ Compute backpointer (ptr) storing the most probable previous state
○ ptri(l) serves to store the previous state k that has the highest probability of transitioning to the current state l.
○ Step 3. Termination
○ Select the highest probability at the final time step
○ Determine the last state of the optimal sequence
○ vk(i - 1): Optimal probability at previous time step i - 1 in state k
○ akl: Probability of transitioning from state k to l
○ Step 4. Traceback
○ Trace back through ptr array from i = T, …, 1 to recover the optimal path
○ Example

Figure 2. Example of Viterbi Algorithm
○ Python Code
import numpy as np

class HMM(object):
    def __init__(self, alphabet, hidden_states, A=None, E=None, B=None):
        self._alphabet = set(alphabet)
        self._hidden_states = set(hidden_states)
        self._transitions = A   # A[k][l]: transition probability from state k to l
        self._emissions = E     # E[k][x]: emission probability of symbol x in state k
        self._initial = B       # B[k]: initial probability of state k

    def _emit(self, cur_state, symbol):
        return self._emissions[cur_state][symbol]

    def _transition(self, cur_state, next_state):
        return self._transitions[cur_state][next_state]

    def _init(self, cur_state):
        return self._initial[cur_state]

    def _states(self):
        for k in self._hidden_states:
            yield k

    def draw(self, filename='hmm'):
        # `graphviz` is an external rendering helper defined elsewhere.
        nodes = list(self._hidden_states) + ['β']

        def get_children(node):
            return self._initial.keys() if node == 'β' else self._transitions[node].keys()

        def get_edge_label(pred, succ):
            return (self._initial if pred == 'β' else self._transitions[pred])[succ]

        def get_node_shape(node):
            return 'circle' if node == 'β' else 'box'

        def get_node_label(node):
            if node == 'β':
                return 'β'
            else:
                return r'\n'.join([node, ''] + [
                    f"{e}: {p}" for e, p in self._emissions[node].items()
                ])

        graphviz(nodes, get_children, filename=filename,
                 get_edge_label=get_edge_label,
                 get_node_label=get_node_label,
                 get_node_shape=get_node_shape,
                 rankdir='LR')

    def viterbi(self, sequence):
        # Dynamic programming in log space to avoid numerical underflow.
        trellis = {}
        traceback = []
        # Step 1. Initialization
        for state in self._states():
            trellis[state] = np.log10(self._init(state)) + np.log10(self._emit(state, sequence[0]))
        # Step 2. Recursion with backpointers
        for t in range(1, len(sequence)):
            trellis_next = {}
            traceback_next = {}
            for next_state in self._states():
                k = {}
                for cur_state in self._states():
                    k[cur_state] = trellis[cur_state] + np.log10(self._transition(cur_state, next_state))
                argmaxk = max(k, key=k.get)
                trellis_next[next_state] = np.log10(self._emit(next_state, sequence[t])) + k[argmaxk]
                traceback_next[next_state] = argmaxk
            trellis = trellis_next
            traceback.append(traceback_next)
        # Step 3. Termination
        max_final_state = max(trellis, key=trellis.get)
        max_final_prob = trellis[max_final_state]
        # Step 4. Traceback
        result = [max_final_state]
        for t in reversed(range(len(sequence) - 1)):
            result.append(traceback[t][max_final_state])
            max_final_state = traceback[t][max_final_state]
        return result[::-1]
④ Type 1. PSSM: Simpler HMM structure
⑤ Type 2. Profile HMM: It has the following advantages over PSSMs:
○ Diagram of profile HMM
Figure 3. Diagram of profile HMM
○ M, I, and D represent match, insertion, and deletion, respectively.
○ Mi can transition to Mi+1, Ii, and Di+1.
○ Ii can transition to Mi+1, Ii, and Di+1.
○ Di can transition to Mi+1, Ii, and Di+1.
○ Advantage 1. The ability to model insertions and deletions
○ Advantage 2. Transitions are restricted to valid state traversals.
○ Advantage 3. Boundaries between states are better defined.
⑷ Type 3. Markov chain Monte Carlo (MCMC)
① Definition: A method for generating samples from a Markov chain following a complex probability distribution
② Method 1. Metropolis-Hastings
○ Generate a new candidate sample from the current state → Accept or reject the candidate sample → If accepted, transition to the new state
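The generate → accept/reject → transition loop above can be sketched as a random-walk Metropolis-Hastings sampler (the target density, step size, and function name are illustrative choices, not from the original notes):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' ~ N(x, step^2),
    accept with probability min(1, p(x') / p(x))."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # Symmetric proposal, so the Hastings correction ratio cancels.
        if np.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal          # accept: transition to the new state
        samples.append(x)         # reject: stay at the current state
    return np.array(samples)

# Target: standard normal (known answer, so the sketch is checkable).
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=20000)
print(samples.mean(), samples.std())  # roughly 0 and 1
```

Only the unnormalized log-density is needed, which is exactly why MCMC is useful for complex distributions whose normalizing constant is intractable.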
③ Method 2. Gibbs Sampling
④ Method 3. Importance/Rejection Sampling
⑤ Method 4. Reversible Jump MCMC
○ General MCMC methods like Method 1 and Method 2 sample from a probability distribution in a fixed-dimensional parameter space
○ Reversible Jump MCMC operates in a variable-dimensional parameter space: The number of parameters dynamically changes during sampling
3. Markov Decision Process
⑴ Overview
① Definition: A decision process in which the future depends only on the current state.
② In practice, the vast majority of problem settings can be treated as a Markov decision process (MDP).
③ Schematic: For transitions, the state at t+1 is determined solely as a function of the state at t.
Figure 4. Agent-Environment Interaction in MDP
○ State: $s_t ∈ S$
○ Action: $a_t ∈ A$
○ Reward: $r_t ∈ R(s_t, a_t)$
○ Policy: $a_t \sim \pi(\cdot \mid s_t)$
○ Transition: $(s_{t+1}, r_{t+1}) \sim P(\cdot \mid s_t, a_t)$
④ Existence of optimal solution $V(s)$
○ Premise 1. Markov property
○ Premise 2. Stationary assumption
○ Premise 3. No distributional shift
⑵ Type 1. Q-learning (Watkins, 1989)
① Overview
○ Learn the Q-function (action-value function) directly from data.
○ Model-free (and off-policy): the transition dynamics $P(x' \mid x, u)$ are unknown.
○ The reward function $R(s, a)$ is known.
○ Example of value iteration.
② Formula: Because the observed next state $j$ is a single realized sample, the transition probability term drops out of the update.
○ Step 1. Initialize estimates for all state/action pairs: Q̂(x, u) = 0
○ Step 2. Take a random action $u$
○ Step 3. Observe the next state $x’$ and receive $R(x’, u)$ from the environment: Note that this is the realized reward, not the expected reward.
○ Step 4. Update $Q̂(x, u)$
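Steps 1 through 4 above can be sketched as tabular Q-learning on a tiny deterministic toy MDP (the transition/reward tables, function name, and hyperparameters are made up for illustration):

```python
import numpy as np

def q_learning(P, R, n_states, n_actions, discount=0.9,
               steps=20000, lr=0.1, seed=0):
    """Tabular Q-learning sketch (hypothetical toy MDP):
    P[s][a] -> next state (deterministic here), R[s][a] -> reward."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))     # Step 1: initialize Q-hat to 0
    s = 0
    for _ in range(steps):
        a = rng.integers(n_actions)         # Step 2: take a random action
        s_next, r = P[s][a], R[s][a]        # Step 3: realized next state/reward
        # Step 4: move Q(s, a) toward the TD target r + γ max_v Q(s', v)
        Q[s, a] += lr * (r + discount * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

# Two-state chain: action 1 moves toward state 1, whose actions pay reward 1.
P = [[0, 1], [0, 1]]               # next-state table
R = [[0.0, 0.0], [1.0, 1.0]]       # reward table
Q = q_learning(P, R, n_states=2, n_actions=2)
print(Q)  # in state 0, action 1 (toward the rewarding state) gets the higher value
```

Note the behavior policy is purely random, yet the learned greedy policy is still correct; this is the off-policy property discussed below.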
③ Supplements
○ Learning the full model requires $\left|\mathcal{S}\right|^2\left|\mathcal{A}\right|$ memory, but Q-learning only needs $\left|\mathcal{S}\right|\times\left|\mathcal{A}\right|$ memory.
○ It is also referred to as TD (temporal-difference) learning.
○ Proof on convergence of Q-learning: Pseudo-contraction mapping
⑶ Type 2. SARSA (State-Action-Reward-State-Action; ε-greedy, epsilon greedy) (Rummery & Niranjan, 1994)
① Overview
○ Depends on policy (on-policy).
○ When there is no data available, it chooses a policy and collects new data through interaction with the environment.
② Formula
○ Step 1. Initialize estimates for all state/action pairs: Q̂(s, a) = 0
○ Step 2. At each step $k$, with probability $1 - \varepsilon_k$, take $u \in \arg\max_v \hat{Q}(x, v)$; with probability $\varepsilon_k$, take a random action
○ Step 3. Observe the next state $x’$ and receive $R(x’, u)$: Note that this is the realized reward, not the expected reward.
○ Step 4. Update Q̂(x, a)
○ Step 5. As $k \to \infty$, drive $\varepsilon_k \to 0$; in this case, one can also use a Boltzmann (softmax) distribution with temperature annealing $(T \to 0)$.
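The steps above can be sketched as follows, on the same kind of hypothetical two-state toy MDP (the $1/\sqrt{k}$ epsilon schedule is one arbitrary choice satisfying Step 5):

```python
import numpy as np

def sarsa(P, R, n_states, n_actions, discount=0.9, steps=20000, lr=0.1, seed=0):
    """SARSA sketch: on-policy, epsilon-greedy behavior, decaying epsilon.
    P[s][a] -> next state (deterministic toy MDP), R[s][a] -> reward."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # Step 1: initialize Q-hat to 0

    def eps_greedy(s, eps):
        if rng.random() < eps:
            return int(rng.integers(n_actions))  # explore
        return int(Q[s].argmax())                # exploit

    s, a = 0, eps_greedy(0, 1.0)
    for k in range(1, steps + 1):
        eps = 1.0 / np.sqrt(k)                   # Step 5: epsilon_k -> 0
        s_next, r = P[s][a], R[s][a]             # Step 3: realized reward
        a_next = eps_greedy(s_next, eps)         # Step 2: epsilon-greedy action
        # Step 4: on-policy target uses the action actually taken next
        Q[s, a] += lr * (r + discount * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
    return Q

P = [[0, 1], [0, 1]]               # next-state table
R = [[0.0, 0.0], [1.0, 1.0]]       # state 1 pays reward
Q = sarsa(P, R, n_states=2, n_actions=2)
print(Q)
```

The only difference from the Q-learning sketch is the TD target: SARSA bootstraps from the action actually taken next ($a'$), not from the max.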
③ Comparison with Q-learning
○ Target policy: the policy to be learned
○ Behavior policy: the policy used to generate samples
○ Q-learning: target policy = optimal policy; behavioral policy = any policy under which each action is taken infinitely often
○ SARSA: target policy = ε-greedy policy; behavioral policy = ε-greedy policy
○ Supplement 1. Using an ε-greedy behavior policy doesn’t automatically make it SARSA; what matters is the target policy used in the update.
○ Supplement 2. “Off-policy” means that no matter which policy collected the data, the update uses a greedy target (i.e., the max over next actions).
○ Supplement 3. If the behavior policy is fully greedy (ε = 0), then in SARSA $a’ = \arg\max_{a} Q(s’, a)$, so the targets match and the update equations become essentially the same.
⑷ Type 3. Q-learning with Linear Function Approximation
① Purpose: The memory space of Q-learning is $|\mathcal{S}| \times |\mathcal{A}|$, while that of linear function approximation is $M \ll |\mathcal{S}| \times |\mathcal{A}|$.
② Formula: Note that $\delta_k^2$, the squared temporal difference, is always non-negative.
③ Significance: With a fixed policy and on-policy sampling, convergence is guaranteed (in an appropriate sense).
④ Example: Tetris game (code)
⑤ Problem: If those conditions are violated (e.g., off-policy learning or updating with incorrect weighting), the algorithm can diverge.
⑸ Type 4. DQN
① Q-learning with nonlinear function approximation
○ Formula: Constructs loss function based on Bellman equation.
○ Problems: Local optima; instability during training
② Experience Replay (Replay Buffer)
○ Definition: A replay buffer is a memory that stores the agent’s past experiences collected from the environment. A single time-step transition is usually stored as $(s, a, r, s’, \text{done})$.
○ Motivation: In reinforcement learning, data arrive as a sequence $(y_1,(x_1,u_1)) \rightarrow (y_2,(x_2,u_2)) \rightarrow (y_3,(x_3,u_3)) \rightarrow \cdots$, so samples are highly correlated and the data distribution keeps changing. Deep learning (SGD) is typically much more stable when samples are i.i.d. (independent and identically distributed). With highly correlated data, updates can “swing” in one direction, often leading to divergence or oscillation during training.
○ How it’s used in training: Instead of updating the network using only the most recent experience, randomly sample a batch of transitions from the buffer and train the network on that mini-batch. This makes the training data closer to i.i.d., which helps improve stability and can reduce the chance of getting stuck in poor local solutions (i.e., improves training stability and robustness).
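A minimal replay buffer along the lines described above can be sketched as follows (the class and method names are my own, not from a specific library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions and returns uniformly
    sampled mini-batches, decorrelating consecutive experiences."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # uniform, no replacement
        return list(zip(*batch))               # columns: states, actions, ...

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push(s=t, a=t % 2, r=1.0, s_next=t + 1, done=False)
states, actions, rewards, next_states, dones = buf.sample(batch_size=32)
print(len(states))  # 32
```

Because the batch mixes transitions from many different time steps, the gradient updates see data that is much closer to i.i.d. than the raw stream.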
③ DQN (Deep Q-Network)
○ Definition: DQN implements Q-learning using deep neural networks. It is often implemented as an off-policy method.
○ Online network (Q-network): The network that is currently being trained. With parameters $\theta$, it outputs $Q(s,a;\theta)$. It is used both for action selection (e.g., $\epsilon$-greedy) and is updated continuously via SGD.
○ Target network: A slowly changing copy of the online network. With parameters $\theta^-$, it outputs $Q(s,a;\theta^-)$ and is used to compute the target value $y = r + \alpha \max_{a’} Q(s’, a’; \theta^-)$. The target network is updated either by periodically copying the online parameters ($\theta^- \leftarrow \theta$, hard update) or by slowly tracking the online network, which stabilizes the targets.
○ Comparison with experience replay: Both aim to stabilize training, but they are different mechanisms.
○ Experience replay: Reduces sample correlation and distribution shift by shuffling data (making it closer to i.i.d.) through random sampling from a replay buffer.
○ DQN (target network mechanism): If the $Q$ used to compute the target $y = r + \alpha \max Q(s’, a’)$ changes too rapidly, training can diverge. The target network updates slowly, reducing the “moving target” problem and improving stability.
○ SGD detail: To treat $y_k$ as a label, DQN ignores the fact that $y_k$ is actually a function of the network parameters $w$.
○ Mini-batch SGD: Training is typically performed using mini-batches sampled from the replay buffer.
○ Implementation in PyTorch
Figure 5. Implementation in PyTorch
○ Left (alternative form): forward(state, action) -> q_scalar, with shape (batch, 1). A drawback is that you must evaluate all possible actions separately, i.e., run multiple forward passes to compute $Q(s,1), Q(s,2), \ldots$ and then compare them.
○ Right (standard DQN form): forward(state) -> q_values, with shape (batch, num_actions). Since you only need a single forward pass followed by an argmax, action selection is simple and computationally efficient. Also, because the early hidden layers are shared, the learning signal for one action can change the shared representation and indirectly affect the Q-values of other actions (i.e., coupling).
random.sample(self.replay_buffer, batch_size)
next_state_values = target_net(next_states).max(1)[0]
target_values = rewards + gamma * next_state_values.detach()
loss = nn.MSELoss()(state_action_values, target_values)
optimizer.zero_grad(); loss.backward(); optimizer.step()
○ Applications: Applied to many games (e.g., Cartpole Problem (code), Atari 2600 (2013), AlphaGo (2016), etc.).
④ DDQN (Double Q-learning)
○ Thrun and Schwartz (1993): Q-learning often overestimates the action value (Q-value) under certain states.
○ Double Q-learning (van Hasselt, 2010)
○ Definition: Train two Q-functions (with parameters $w$ and $w’$), then take their average to mitigate the overestimation problem. In this setup, the online network is used to select the action, and the target network is used to evaluate its value.
○ Principle: The double estimator tends to underestimate rather than overestimate.
○ Formula
○ Q-learning vs. Double Q-learning (Weng et al., 2020): In Q-learning, the learning rate is $\beta_k = \frac{c}{k}$, whereas in Double Q-learning it is $\beta_k = \frac{2c}{k}$. Under linear function approximation, we can obtain the following. This also implies that Double Q-learning can converge faster because it uses a larger learning rate.
○ Example: DQN overestimation and DDQN in JAX (code)
○ Clipped double Q-learning (Fujimoto, van Hoof, David Meger, 2018)
⑤ Dueling DDQN (Wang et al., 2016)
Figure 6. Dueling DDQN
○ Definition: For the advantage function $A(s,a):=Q(s,a)-V(s),$ we build a model that predicts \(V\) and \(A\) separately using \(Q(x,u;w,w_v,w_a)=V(x;w,w_v)+A(x,u;w,w_a),\) and then combine them afterward.
○ Motivation: Except for the optimal action, not every action is equally important. \(A(s,a)\) can contrast actions under the same state \(s\).
○ Unidentifiability problem: The decomposition is not unique because \(Q(s,a)=V(s)+A(s,a)=(V(s)+c)+(A(s,a)-c),\quad c\in\mathbb{R},\) so infinitely many \((V,A)\) pairs yield the same \(Q\).
○ Solution 1 (Subtract max): \(Q(x,u;w)=V(x;w)+\big(A(x,u;w)-\max_v A(x,v;w)\big).\) Under a greedy policy, \(Q(x,a^*)=V(x),\) so it enforces the advantage to be \(0\) for the selected optimal action.
○ Solution 2 (Subtract mean): $$Q(x,u;w)=V(x;w)+\left(A(x,u;w)-\frac{1}{|\mathcal{A}|}\sum_v A(x,v;w)\right).$$
○ Pros: Improves optimization stability; resolves the unidentifiability issue.
○ Cons: With subtract-mean, \(V\) becomes a different baseline (the average \(Q\)) rather than \(V^*\), so the semantics of \(V\) and \(A\) no longer match the original \(V^*\) and \(A^*\) definitions.
○ Example: CartPole (code)
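The unidentifiability problem and the subtract-mean fix can be checked numerically with made-up V and A values:

```python
import numpy as np

V = np.array([1.0, 2.0])                     # V(s) for 2 states (illustrative)
A = np.array([[0.5, -0.5, 0.0],
              [1.0, 0.0, -1.0]])             # A(s, a) for 3 actions

# Unidentifiability: (V, A) and (V + c, A - c) give the same raw Q = V + A.
c = 3.7
Q_raw = V[:, None] + A
print(np.allclose(Q_raw, (V + c)[:, None] + (A - c)))  # True

# Subtract-mean combination pins the decomposition down:
Q = V[:, None] + (A - A.mean(axis=1, keepdims=True))
print(np.allclose(Q.mean(axis=1), V))  # True: V is recoverable as the mean of Q
```

After the subtract-mean combination, the advantage stream has zero mean by construction, so V and A can no longer trade a constant back and forth (at the cost of the changed semantics noted above).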
⑥ Multi-step target (Sutton & Barto, 1998)
○ \(Q(x_k,u_k)=\mathbb{E}\big[r(x_k,u_k) + \alpha\, r(x_{k+1},\pi^*(x_{k+1}))+\cdots+\alpha^{n-1} r(x_{k+n-1},\pi^*(x_{k+n-1}))+\alpha^n \max_v Q(x_{k+n},v)\big].\)
○ SARSA with a multi-step target (sliding-window style): Use the loss function \(\big(r_k^{(n)}+\alpha^n \max_v Q(x_{k+n},v;w)-Q(x_k,u_k;w)\big)^2.\)
⑦ Prioritized DDQN (Prioritized replay, Schaul et al., 2016)
○ Definition: Assign weights (priorities) to experiences using the information-theoretic concept of surprisal.
○ Step 1: When a new experience is added, assign it the maximum priority \(z_{\max}\).
○ Step 2: Using \(m\) experiences, compute the sampling probability \(p_k=\frac{z_k}{\sum_i z_i}.\)
○ Step 3: If experience \(k\) is sampled, define the update (change) in its weight \(z_k\) as follows:
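Steps 1 and 2 above can be sketched numerically (the priority values are made up; in practice priorities come from, e.g., the absolute TD error):

```python
import numpy as np

rng = np.random.default_rng(0)

# Existing priorities z_k (illustrative values).
z = np.array([0.1, 0.5, 0.2, 1.0])

# Step 1: a newly added experience gets the current maximum priority z_max.
z = np.append(z, z.max())

# Step 2: sampling probabilities proportional to priority.
p = z / z.sum()
print(p.sum())  # 1.0

# Sample a mini-batch of experiences in proportion to p.
idx = rng.choice(len(z), size=2, replace=False, p=p)
print(idx)
```

Assigning z_max to new experiences guarantees each fresh transition is replayed at least once before its priority is revised in Step 3.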
⑧ Distributional DQN (Bellemare et al., 2017)
⑨ Noisy DQN
⑩ Conclusion: Combining all of these improvements yields strong performance (Rainbow)
Figure 7. Rainbow
⑹ Type 5. Deep Q-network (DQN)
⑺ Type 6. Actor-critic
① Typical actor-critic
○ Definition: Generalized policy iteration; the policy π(s) is learned directly
○ Motivation
○ In Q-learning, states can be continuous, but actions must be discrete
○ In policy gradient, both states and actions can be continuous
○ Typically, DQN is used when there are up to a few dozen discrete actions, and policy gradient is used beyond that.
○ In DQN, the network takes \(x\) as input and outputs \(u\) (i.e., Q-values for each discrete action), whereas in actor–critic methods, the critic takes \(x\) and \(u\) as inputs and outputs \(Q(x,u)\) (because the action space is continuous/infinite).
○ Components
○ Critic: TD(λ), double-Q, clipped double-Q
○ Actor: ε-greedy based on the current Q-function, policy-gradient
○ An example process
○ Step 1. Receive frame
○ Step 2. Forward propagate to get P(action)
○ Step 3. Sample a from P(action)
○ Step 4. Play the rest of the game
○ Step 5. If the game is won, update in the ∇θ direction
○ Step 6. If the game is lost, update in the -∇θ direction
② A3C (Asynchronous Advantage Actor-Critic)
③ Policy gradient theorem
○ $\rho_w$: Discounted state distribution (= geometric distribution, discounted occupancy measure)
○ $\nabla_w \log \pi_w(u \mid x)$: Score function
○ Gibbs policy:
○ Gaussian policy:
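The two policy classes named above have standard forms; with a feature map φ(x, u), their densities and score functions are usually written as (supplied here for completeness):

```latex
\text{Gibbs (softmax): } \pi_w(u \mid x) = \frac{\exp\!\big(w^{\top}\phi(x,u)\big)}{\sum_{v}\exp\!\big(w^{\top}\phi(x,v)\big)},
\qquad
\nabla_w \log \pi_w(u \mid x) = \phi(x,u) - \mathbb{E}_{v \sim \pi_w(\cdot \mid x)}\big[\phi(x,v)\big]

\text{Gaussian: } \pi_w(u \mid x) = \mathcal{N}\!\big(u;\, \mu_w(x),\, \sigma^2\big),
\qquad
\nabla_w \log \pi_w(u \mid x) = \frac{\big(u - \mu_w(x)\big)\,\nabla_w \mu_w(x)}{\sigma^2}
```

The Gibbs form covers discrete actions; the Gaussian form covers continuous actions, which is why it appears in actor-critic methods.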
④ REINFORCE (Williams, 1988, 1992)
○ Use Monte Carlo methods to estimate the Q-function.
○ On-policy.
○ Drawback: high variance.
○ Example: CartPole (code)
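REINFORCE's Monte Carlo estimate can be sketched on a 2-armed bandit, which is a one-step episode so the sampled return is just the immediate reward (arm means, noise scale, and hyperparameters are made up):

```python
import numpy as np

def reinforce_bandit(rewards_mean, episodes=3000, lr=0.1, seed=0):
    """REINFORCE sketch with a softmax (Gibbs) policy over logits theta.
    Monte Carlo return G is the sampled one-step reward."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(rewards_mean))          # policy logits
    for _ in range(episodes):
        p = np.exp(theta) / np.exp(theta).sum()  # softmax policy
        a = rng.choice(len(p), p=p)              # sample an action on-policy
        G = rewards_mean[a] + 0.1 * rng.standard_normal()  # noisy sampled return
        grad_log = -p
        grad_log[a] += 1.0                       # score of the softmax policy
        theta += lr * G * grad_log               # ascend the PG estimate
    return theta

theta = reinforce_bandit(rewards_mean=[0.0, 1.0])
print(theta)  # the higher-reward arm ends up with the larger logit
```

Even in this trivial setting the single-sample gradient estimate is noisy, which is the high-variance drawback noted above and the motivation for the variance reduction methods that follow.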
⑤ Variance Reduction Theorem
○ Definition: For a given $F(x)$, a variance reduction method uses a function $\phi(x)$ satisfying $\mathbb{E}[\phi(x)] = 0$ and having high correlation with $F(x)$, so that $\mathrm{Var}(F(x) - \phi(x)) < \mathrm{Var}(F(x))$.
○ Application in the policy gradient theorem
○ $G_w(x_k)$ must be action-independent. It may depend on $x_{k-1}$, but if it depends on $x_{k+1}$, it implicitly contains information about $u_k$, which makes it inappropriate.
⑥ Natural Policy Gradient (NPG)
○ The policy gradient theorem yields an optimization algorithm that is plain (Euclidean) steepest ascent: without quadratic (curvature) terms, the maximum/minimum of the local model lies at ±∞.
○ NPG (Kakade et al., 2001)
○ Convergence of PG (Mei et al., 2020): under the tabular setting where the state and action sets are finite,
○ $\beta = (1-\alpha)^3 / 8$
○ $S$: the number of states. Since it is typically very large, it becomes an issue in convergence arguments for PG.
○ $\rho$: the discounted state distribution
○ $c$: a constant
○ $d$: the distribution of the initial condition
○ Convergence of NPG (Agarwal, 2020): under the tabular setting where the state and action sets are finite,
○ $\beta = (1-\alpha)^2 \log \left|\mathcal{A}\right|$
○ $t$: the iteration index (number of iterations)
○ NPG + entropy regularization
○ The policy $\pi$ is usually set as follows:
○ If we set $\beta = (1-\alpha)/\tau$, then the above policy becomes soft policy iteration ($\simeq$ a softmax policy).
○ As $\tau \to 0$, soft policy iteration reduces back to standard policy iteration.
⑦ TRPO (Trust Region Policy Optimization)
○ Performance difference lemma (Kakade & Langford, 2002)
○ Local approximation
○ TRPO (Schulman et al., 2015)
○ TRPO = NPG with a different learning rate.
○ $\tilde{\beta}$ is chosen with a backtracking line search to satisfy the constraint.
○ Monotonic improvement theorem: With carefully chosen learning rates, TRPO guarantees, even under general function approximation, that $V_{w_{k+1}}(x_0) \ge V_{w_k}(x_0)$.
⑧ Proximal Policy Optimization (PPO)
○ Rationale: NPG and TRPO require computing the inverse of the Fisher information matrix (i.e., they are second-order methods), which makes them computationally expensive.
○ PPO: First-order method
○ PPO-CLIP: Uses a min operation to conservatively discourage how much the model changes. The proportional coefficient in the second term is clipped so that if it deviates from 1 by more than $\epsilon$, it is set to $1 \pm \epsilon$.
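The clipped surrogate described above can be sketched as follows (function name and the sample ratios are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO-CLIP surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    The min keeps the pessimistic (lower) value, discouraging large policy changes."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# ratio r = pi_new(a|s) / pi_old(a|s); made-up values around 1.
ratios = np.array([0.5, 1.0, 1.5])

# Positive advantage: ratios beyond 1 + eps earn no extra credit (clipped to 1.2).
print(ppo_clip_objective(ratios, 1.0))
# Negative advantage: the min keeps whichever term is more pessimistic.
print(ppo_clip_objective(ratios, -1.0))
```

Because the objective is flat outside the [1 - ε, 1 + ε] band (in the direction that would improve it), the gradient provides no incentive to push the ratio further, which is what keeps updates proximal without a second-order trust-region constraint.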
⑻ Type 7. GA (Genetic Algorithm)
4. General Decision Process
⑴ MAB (Multi-Armed Bandit)
⑵ UCB
5. Advanced Topics
⑴ Unsupervised learning
① Benchmarks: URLB
② Baselines: Diversity is all you need, Forward Backward Representation
③ Envs: Dm_control, Maze, Hopper, Cheetah, Quadruped, Walker
⑵ Online goal-conditioned reinforcement learning
① Benchmarks: JaxGCRL
② Baselines: Contrastive Reinforcement Learning, SAC, PPO, TD3
③ Envs: Brax, Locomotion, Manipulation
⑶ Offline goal-conditioned reinforcement learning
① Benchmarks: OGBench
② Baselines: CRL, HIQL, QRL, implicit Q/V learning
③ Envs: Locomotion, Manipulation, powderworld
⑷ Continual reinforcement learning
① Benchmarks: CORA
③ Envs: Atari, Procgen, Minihack, CHORES, Nethack (codebase)
④ Envs without benchmark: AgarCL, Jelly Bean World
⑸ Open-ended reinforcement learning
① Benchmarks: Craftax baseline
③ Envs: Craftax
⑹ Safe reinforcement learning
① Benchmarks: Omnisafe
② Baselines: CPO, FOCOPS, PPO-Lagrangian and TRPO-Lagrangian
③ Envs: Safety-Gymnasium, Safe-Control-Gym
⑺ Multi-agent reinforcement learning
① Benchmarks: BenchMARL
③ Envs: PettingZoo, VMAS
⑻ Queueing network control via reinforcement learning
① Benchmarks: QGym
② Envs: DiffDiscreteEventSystem
Input: 2021.12.13 15:20
Updated: 2024.10.08 22:43