An Introduction to the Fundamentals of Reinforcement Learning

Overview

Fundamentals of Reinforcement Learning

Background

Reinforcement Learning (RL) is a powerful and rapidly advancing branch of machine learning, inspired by behavioral psychology. It focuses on how agents should take actions in an environment to maximize some notion of cumulative reward. Unlike supervised learning, where the learning agent is given input-output pairs, reinforcement learning emphasizes learning from interaction.

Reinforcement Learning (RL) is pivotal across diverse applications. In robotics, it powers autonomous navigation and manipulation tasks, while in gaming, it excels in strategic decision-making as seen with AlphaGo. RL optimizes trading in finance, adapts healthcare treatments, and enhances capabilities in natural language processing and computer vision. Its expanding reach includes smart grids, recommendation systems, and virtual assistants, highlighting RL's transformative impact across various domains.

Mathematical Foundations

The RL problem is a straightforward framing of the problem of learning from interaction with an environment $\mathcal{E}$ over discrete time steps to achieve a goal. At each time step $t$, the agent observes a state $s_t$ in the environment's state space $\mathcal{S}$ and selects an action $a_t \in \mathcal{A}(s_t)$ according to a policy $\pi(a_t|s_t)$, where $\mathcal{A}(s_t)$ is the set of actions available in state $s_t$. The policy is a conditional probability $\pi(a|s)$ of taking action $a$ when the current state is $s$; that is, a mapping from states and actions to the probability of selecting each action. The agent then receives a scalar reward $r_t$ and stores the transition in its memory as experience. The process continues until the agent reaches a terminal state. The agent seeks to learn a policy $\pi^{\ast}$ that maximizes the expected discounted return $R_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$, where the discount factor $\gamma \in (0,1]$ trades off the importance of immediate and future rewards.
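
As a concrete illustration of this interaction loop, the following minimal Python sketch rolls out one episode against a hypothetical `env` object with a Gym-style `reset`/`step` interface and accumulates the discounted return; the environment and the policy here are placeholders rather than any specific library's API.

```python
def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and return the discounted return from t = 0.

    Assumed (hypothetical) interface:
      state = env.reset()
      next_state, reward, done = env.step(action)
      policy(state) returns an action sampled from pi(.|state)
    """
    state = env.reset()
    discounted_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                       # a_t ~ pi(.|s_t)
        next_state, reward, done = env.step(action)  # environment transition
        discounted_return += discount * reward       # accumulate gamma^k * r_{t+k}
        discount *= gamma
        state = next_state
    return discounted_return
```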

RL tasks that satisfy the Markov property can be described as Markov decision processes (MDPs), defined by the 5-tuple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$, where $\mathcal{R}$ is a reward function $\mathcal{R}(s,a)$ and $\mathcal{P}$ is a state transition probability $\mathcal{P}(s_{t+1}|s_t,a_t)$. The Markov property states that future states are conditionally independent of the past given the present, so in an RL task the decisions and values are assumed to be functions of the current state only. Formally, $p(s_{t+1}|s_1,a_1,\dots,s_t,a_t)=p(s_{t+1}|s_t,a_t)$. In an episodic task, the state resets after each episode of finite length, and the sequence of states, actions, and rewards in an episode constitutes a trajectory or rollout of the policy.
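
For readers who prefer code to notation, a tiny tabular MDP can be written down directly as the 5-tuple above. The two-state, two-action example below is entirely made up for illustration.

```python
# A toy 2-state, 2-action MDP written as the tuple (S, A, P, R, gamma).
# All numbers are illustrative only.
S = ["s0", "s1"]
A = ["left", "right"]

# P[s][a] maps each successor state s' to its transition probability P(s'|s, a).
P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}

# R[s][a] is the expected immediate reward R(s, a).
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.5, "right": -1.0},
}

gamma = 0.9
```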

Value Function

Value functions are a core component of RL systems: they construct a function approximator that estimates the long-term reward obtainable from any state. A value function estimates how good it is (in terms of expected return) for an agent to be in a given state, or to take a given action in a given state. In this way, the function approximator exploits the structure of the state space to efficiently learn the value of observed states and to generalize to the value of similar, unseen states. A typical form of value function can be defined as:

$$V^{\pi}(s)=\mathbb{E}[R|s,\pi]=\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\,\Big|\,s,\pi\right]$$

Normally we refer to $V^{\pi}(s)$ as the state-value function, which measures the expected discounted return when starting in a state $s$ and following a policy $\pi$. When actions follow the optimal policy $\pi^{\ast}$, the state-value function is optimal:

$${ V }^{ \ast }(s)=\max _{ \pi }{ { V }^{ \pi }(s) } \quad \forall s\in \mathcal{ S }$$
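For the tabular MDP sketched earlier, $V^{\pi}$ can be estimated with iterative policy evaluation, which repeatedly applies the Bellman expectation backup until the values stop changing; finding $V^{\ast}$ then amounts to improving the policy against these values. The snippet below is a minimal sketch under the assumption that the policy is given as `pi[s][a]` probabilities in the same dictionary format as the toy MDP.

```python
def policy_evaluation(S, A, P, R, gamma, pi, tol=1e-6):
    """Iteratively compute V^pi for a tabular MDP.

    `pi[s][a]` is the (assumed) probability of taking action a in state s.
    """
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman expectation backup:
            # V(s) = sum_a pi(a|s) * [R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')]
            v_new = sum(
                pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S))
                for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```
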

In addition to measuring the value of states, there is also a measure of the quality of action selection, denoted the state-action-value or quality function $Q^{\pi}(s,a)$. It gives the value of choosing an action $a$ in a given state $s$ and thereafter following a policy $\pi$.

$$Q^{\pi}(s,a)=\mathbb{E}[R|s,a,\pi]=\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\,\Big|\,s,a,\pi\right]$$

The state-action value is similar to the state value $V^{\pi}$, except that the initial action $a$ is provided and the policy $\pi$ is only followed from the succeeding state onwards. The optimal state-action-value function is denoted as:

$${ Q }^{ \ast }(s,a)=\max _{ \pi }{ { Q }^{ \pi }(s,a) } \quad \forall s\in \mathcal{ S } , \forall a\in \mathcal{ A } $$

${ Q }^{ \ast }(s,a)$ gives the maximum state-action value for state $s$ and action $a$ achievable by any policy. This action value function satisfies a recursive property, which is a fundamental property of value functions in the RL setting, and it expresses a relationship between the value of a state and its successor states:

$$Q^{\pi}(s,a)=\mathbb{E}_{s^{\prime}}\big[r+\gamma\,\mathbb{E}_{a^{\prime}\sim\pi(s^{\prime})}[Q^{\pi}(s^{\prime},a^{\prime})]\,\big|\,s,a,\pi\big]$$
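
In the tabular case this recursion can be evaluated directly: for each $(s,a)$ pair, back up the reward plus the discounted, policy-weighted value of the successor state-action pairs. The sketch below reuses the toy MDP and a `pi[s][a]` policy in the same (assumed) dictionary format as the earlier snippets.

```python
def q_backup(S, A, P, R, gamma, pi, Q):
    """One sweep of the Bellman expectation backup for Q^pi on a tabular MDP.

    `Q` maps (state, action) pairs to the current value estimates.
    """
    Q_new = {}
    for s in S:
        for a in A:
            # Q(s,a) = R(s,a) + gamma * E_{s'}[ E_{a'~pi}[ Q(s',a') ] ]
            expected_next = sum(
                P[s][a][s2] * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in A)
                for s2 in S
            )
            Q_new[(s, a)] = R[s][a] + gamma * expected_next
    return Q_new
```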

Unlike $Q^{\pi}$, which produces absolute state-action values, an advantage function represents relative state-action values: it measures whether an action is better or worse than the policy's default behavior. It is often easier to learn that one action yields a higher reward than another than it is to learn the actual return from taking a particular action. The advantage function expresses this relative advantage of actions through the simple relationship:

$${ A }^{ \pi }(s,a)={ Q }^{ \pi }(s,a)-{ V }^{ \pi }(s) $$
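
As a concrete, made-up illustration: if $Q^{\pi}(s,a_1)=3.0$, $Q^{\pi}(s,a_2)=1.0$, and $V^{\pi}(s)=2.5$, then $A^{\pi}(s,a_1)=0.5$ and $A^{\pi}(s,a_2)=-1.5$; the sign alone tells the agent that $a_1$ is better than the policy's average behavior in $s$, without comparing absolute returns.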

Many successful value-based RL algorithms rely on the idea of advantage updates.

Key Reinforcement Learning Algorithms

Deep Q-Network

Deep reinforcement learning (DRL) uses deep neural networks to represent the value functions within reinforcement learning methods. DRL algorithms have attained superhuman performance in several challenging task domains, which is attributable to the powerful function approximation and representation learning properties of deep learning. The DQN algorithm achieves human-level performance on Atari games from raw pixel input. It parameterizes the quality function $Q$ with a neural network $Q(s,a;\theta)$ that approximates the $Q$-values. The two main techniques that allow DQN to learn value functions in a stable and robust way are the target network and experience replay. At each iteration, the network's parameters are updated by minimizing the following loss function:

$$L_i(\theta_i)=\mathbb{E}_{s,a,r,s^{\prime}}\big[(y_i^{DQN}-Q(s,a;\theta_i))^{2}\big]$$

with

$$y_i^{DQN}=r+\gamma \max_{a^{\prime}}Q(s^{\prime},a^{\prime};\theta^{-})$$

in which $\theta^{-}$ denotes the parameters of the target network. The first stabilizing technique fixes the target network's parameters rather than computing the TD error from the network's own rapidly fluctuating $Q$-value estimates. The second, experience replay, uses a buffer that stores a fixed number of transitions $(s_t,a_t,s_{t+1},r_{t+1})$, which makes off-policy training possible and improves the efficiency of data sampling.
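
A condensed, PyTorch-flavored sketch of this update is shown below. The minibatch layout, network sizes, and hyperparameters are illustrative assumptions rather than the exact configuration from the original DQN work.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN gradient step on a minibatch sampled from the replay buffer.

    `batch` is assumed to hold tensors: states (N, obs_dim), actions (N,) as
    long integers, rewards (N,), next_states (N, obs_dim), dones (N,).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y = r + gamma * max_a' Q(s', a'; theta^-), with theta^- held fixed.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, targets)   # (y - Q(s, a; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```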

A series of improvements in the value-based RL setting followed after the DQN algorithm ignited this field. To reduce the overestimation of $Q$-values in DQN, van Hasselt et al. proposed the double DQN algorithm. Wang et al. presented a dueling Q-network architecture that estimates the state-value function $V(s)$ and the associated advantage function $A(s,a)$ separately. Tamar et al. proposed the value iteration network, which can effectively learn to plan and leads to better generalization in many RL tasks. Schaul et al. developed prioritized experience replay (PER) on top of double DQN, which makes the experience replay process more efficient and effective than replaying all transitions uniformly.

Dueling Network Architecture

Unlike the standard single-stream $Q$-network design, the dueling network structure consists of two streams of networks (an A-network and a V-network) that separately learn the action-advantage function and the state-value function. This construction decouples the value and advantage functions and combines the two streams with a special aggregating module to produce the estimate of the state-action value function. The two streams share a common feature extraction layer (or lower layers). A deep $Q$-network focuses on estimating the value of every state-action pair. The idea of the dueling network, however, is to estimate the action-independent state-value function and the action-dependent advantage function separately, because in many RL environments not every state is sensitive to the choice of action: in states that are largely independent of the action, the agent does not need to change its behavior, so estimating the value of every state-action pair there is wasteful and inefficient. The dueling network was first presented by Wang et al., and this change greatly improves training efficiency over single-stream $Q$-networks. According to Wang's work, the dueling network set a new state of the art for tasks with discrete action spaces. In short, the $Q$-values produced by the dueling network are more advantageous for performance improvement than those of a standard deep $Q$-network in an RL task.
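
The aggregating module is commonly implemented by subtracting the mean advantage so that the V and A streams remain identifiable. Below is a minimal PyTorch sketch of such a dueling head, with made-up layer sizes.

```python
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: shared features, then separate V and A streams."""

    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)              # V(s)
        self.advantage_stream = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, obs):
        h = self.features(obs)
        value = self.value_stream(h)
        advantage = self.advantage_stream(h)
        # Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)), a common aggregation rule.
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```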

Policy Gradient

The methods described above learn the policy $\pi(s)$ indirectly, through estimates of the value functions. These value-based approaches are effective for problems with discrete action spaces. However, for problems with continuous action spaces, such as physical control tasks, value-based approaches cannot be applied straightforwardly, and it is difficult to ensure convergence because the result relies on each action's $Q$-value. An obvious way to apply value-based algorithms such as DQN to continuous domains is to discretize the action space into a set of fixed actions. However, this has many drawbacks and limitations, such as discarding potentially essential information about the structure of the action domain.

Policy-based approaches avoid this concern, since the policy network outputs the agent's actions directly without estimating an action-value function. They directly parameterize the control policy $\pi(a|s;\theta)$ and update the parameters $\theta$ to optimize the cumulative reward; therefore, policy-based methods are more applicable than value-based methods to continuous control problems such as robotic control tasks.

Policy gradient (PG) is an appealing policy-based approach that optimizes the parametric policy $\pi_{\theta}(a|s)=\mathbb{P}[a|s;\theta]$ by following the gradient $\nabla_{\theta}J(\pi_{\theta})$ of its expected cumulative reward with respect to the policy parameters. Policy-gradient methods are effective in high-dimensional or continuous action spaces and can learn stochastic policies. In an RL task, the agent's goal is to find parameters $\theta$ that maximize the objective function $J(\pi)$. A typical performance objective is the expected return $J(\pi)=\mathbb{E}[R|\pi_{\theta}]$. The policy-gradient theorem provides the gradient of $J$ with respect to the parameters $\theta$ of policy $\pi$:

$$\nabla_{\theta}J(\pi_{\theta})=\int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)\,da\,ds$$

$$\quad\quad\quad\quad=\mathbb{E}_{s\sim\rho^{\pi},a\sim\pi_{\theta}}\big[\nabla_{\theta}\log\pi_{\theta}(a|s)\,Q^{\pi}(s,a)\big]$$

where $\rho^{\pi}(s)$ is the state distribution. The unknown part, $Q^{\pi}(s,a)$, is normally estimated by using the actual returns $R_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$ as an approximation for each $Q^{\pi}(s_t,a_t)$. Based on this theorem, Silver et al. proposed the deterministic policy-gradient (DPG) algorithm for estimating the gradient, which is more efficient than the usual stochastic policy-gradient method. O'Donoghue et al. introduced PGQL, which combines policy gradients with Q-learning, and discussed its practical implementation in the RL setting.
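
In practice, the expectation in the policy-gradient theorem is estimated from sampled trajectories. The snippet below sketches a REINFORCE-style surrogate loss in PyTorch, using sampled returns in place of $Q^{\pi}$; the rollout tensors and the `policy_net` module are assumed to be produced elsewhere and are purely illustrative.

```python
import torch

def policy_gradient_loss(policy_net, states, actions, returns):
    """REINFORCE-style surrogate loss: -E[ log pi(a|s) * R ].

    `policy_net(states)` is assumed to return action logits for a discrete
    action space; `actions` are long integers and `returns` holds sampled R_t.
    """
    logits = policy_net(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Pick log pi(a_t | s_t) for the actions that were actually taken.
    taken_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Minimizing this loss performs gradient ascent on the PG estimate.
    return -(taken_log_probs * returns).mean()
```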

Actor-Critic Algorithm

Regular policy-gradient methods often exhibit slow convergence due to the large variance of the gradient estimates. Actor-critic methods attempt to reduce this variance by adopting a critic network to estimate the value of the current policy, which is then used to update the actor's policy parameters in a direction of performance improvement.

The action-selection policy is known as the actor, $\pi_{\theta}:\mathcal{S}\rightarrow \mathcal{A}$, which makes decisions without requiring an optimization procedure over a value function, mapping representations of states to action-selection probabilities. The value function is known as the critic, $Q_{\phi}^{\pi}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, which estimates the expected return to reduce variance and accelerate learning, mapping states and actions to expected cumulative future reward.

The actor and critic are two separate networks that share a common observation. At each step, the action selected by the actor network is also fed as an input to the critic network. During policy improvement, the critic network estimates the state-action value of the current policy using a DQN-style update, and the actor network then updates its policy in a direction that improves the $Q$-value. Compared with earlier pure policy-gradient methods, which do not use a value function, evaluating the current policy with a critic network is more conducive to convergence and stability. The better the state-action value estimate, the lower the variance of the learning performance, so accurate policy evaluation in the critic network is important and helpful.

Policy-gradient-based actor-critic algorithms are useful in many real-life applications because they can search for optimal policies using low-variance gradient estimates. Lillicrap et al. presented the DDPG algorithm, which combines the actor-critic approach with insights from DQN, to solve simulated physics tasks, and it has been widely used in robotic control. It uses two neural networks: the actor network learns a deterministic policy and the critic network approximates the Q-function of the current policy.
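
A compressed sketch of this DDPG-style coupling between the two networks is given below; target networks, exploration noise, and the replay buffer are omitted, and all shapes and hyperparameters are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    """One DDPG-flavored update: fit the critic, then ascend the critic's Q.

    Batch tensors are assumed to be shaped (N, .), with rewards and dones
    stored as (N, 1) columns so they broadcast against the critic's output.
    """
    states, actions, rewards, next_states, dones = batch

    # Critic step: regress Q(s, a) toward the one-step bootstrapped target.
    with torch.no_grad():
        next_q = critic(next_states, actor(next_states))
        target_q = rewards + gamma * (1.0 - dones) * next_q
    critic_loss = F.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: move the policy in the direction that increases Q(s, pi(s)).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```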

Summary

Reinforcement Learning (RL) represents a powerful paradigm in machine learning, drawing inspiration from behavioral psychology to enable agents to make decisions that maximize cumulative rewards in complex environments. Formulated as Markov Decision Processes (MDPs), RL tasks involve states, actions, rewards, and transition probabilities. Algorithms such as Deep Q-Networks (DQN) leverage deep neural networks to approximate Q-values efficiently, facilitating decision-making in discrete action spaces.

Value-based RL methods, exemplified by DQN, estimate state-action values to optimize policies. Actor-critic approaches improve upon traditional policy-gradient methods by incorporating a critic network to estimate value functions, thereby reducing variance and enhancing learning stability. These advancements extend to continuous action spaces with algorithms like Deep Deterministic Policy Gradient (DDPG), which combines deterministic policies and Q-function approximations.

Policy-gradient methods directly optimize policies based on gradient estimates of expected rewards, proving effective in scenarios with continuous action spaces. The dueling network architecture further refines training efficiency by separating state-value and advantage functions, emphasizing action-dependent advantages.

Overall, RL continues to evolve through innovations in value estimation, policy optimization, and application across diverse domains such as robotics and game playing. Advances in neural network architectures and learning algorithms continue to drive progress in RL research and application. Recent trends focus on adapting RL to continuous action spaces, integrating with other domains like NLP and computer vision, and improving sample efficiency and training stability. Future directions include enhancing RL's applicability to real-world challenges through interdisciplinary collaborations and addressing ethical considerations in deployment.
