>This article briefly introduces reinforcement learning and its important concepts and terms, and focuses on the Q-Learning algorithm, SARSA, DQN and DDPG algorithms.


Reinforcement learning (RL) refers to a class of machine learning methods in which an agent receives a delayed reward at the next time step as an evaluation of its previous action. The approach has mostly been used in games such as Atari and Mario, where it performs on par with humans and sometimes surpasses them. More recently, combined with neural networks, these algorithms have developed rapidly and can solve more complex tasks, such as the pendulum problem. Although a large number of reinforcement learning algorithms already exist, there seems to be no article that compares them comprehensively, and I find it troublesome to decide which algorithm to apply to a specific task. This article tries to address that problem by briefly discussing the reinforcement learning setup and introducing some well-known algorithms.

1. Introduction to reinforcement learning

A reinforcement learning setup usually consists of two components: an agent and an environment.

Reinforcement learning diagram

The environment is the scene in which the agent acts (for example, the game itself in an Atari game), while the agent represents the reinforcement learning algorithm. The environment first sends a state to the agent, and the agent then takes an action in response, based on its knowledge. After that, the environment sends the next state and a reward back to the agent. The agent updates its knowledge with the reward returned by the environment, evaluating its previous action. This loop continues until the environment sends a terminal state, which ends the episode.

Most reinforcement learning algorithms follow this pattern. Below I will briefly introduce some terms in reinforcement learning to facilitate the discussion in the next section.

Definitions

1. Action (A): All possible actions that an agent can take.

2. State (S): The current situation returned by the environment.

3. Reward (R): An immediate return sent by the environment to evaluate the agent's previous action.

4. Policy (π): The strategy the agent uses to decide its next action based on the current state.

5. Value (V): The expected long-term return with discounting, as opposed to the short-term reward R. Vπ(s) is defined as the expected long-term return of the current state **s** under policy π.

6. Q value or action value (Q): The Q value is similar to V, except that it takes one extra parameter, the current action a. Qπ(s, a) refers to the long-term return of taking action a in the current state **s** under policy π. (A minimal code sketch of these terms follows this list.)
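As a rough, hypothetical illustration (not from the original article), the terms above map naturally onto a few Python structures; the state and action names below are invented:

```python
# A minimal, hypothetical sketch of the terms above: actions A, a tabular Q function,
# a greedy policy pi, and the value V it implies. Names and numbers are invented.
from collections import defaultdict

actions = ["left", "right"]              # A: all possible actions
q_table = defaultdict(float)             # Q: (state, action) -> estimated long-term return

def greedy_policy(state):
    """pi: choose the action with the highest estimated Q value in `state`."""
    return max(actions, key=lambda a: q_table[(state, a)])

def state_value(state):
    """V under the greedy policy: the best Q value available in `state`."""
    return max(q_table[(state, a)] for a in actions)
```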

Model-free vs. Model-based

The model here refers to a dynamic simulation of the environment, i.e. the model learns the transition probability T(s1 | (s0, a)) from the pair of current state s0 and action a to the next state s1. If the transition probability is learned successfully, the agent knows how likely it is to enter a specific state given the current state and action. However, model-based algorithms become impractical as the state space and action space grow, since a table-based setting would require S × S × A entries.

Model-free algorithms, on the other hand, rely on trial and error to update their knowledge. As a result, they do not require space to store all combinations of states and actions. All the algorithms discussed in the next section fall into this category.

On-policy vs. Off-policy

An on-policy agent learns the value based on its current action a, derived from the current policy, while an off-policy agent learns it based on the greedy action a* obtained from another policy. (We will discuss this further in the Q-learning and SARSA sections.)

2. Description of various algorithms

2.1 Q-learning algorithm

Q-learning is an off-policy, model-free reinforcement learning algorithm based on the Bellman equation:

$$V(s) = \mathbb{E}\big[\, r_{t+1} + \lambda\, V(s_{t+1}) \mid s_t = s \,\big]$$

Bellman equation

Here E denotes the expectation and λ is the discount factor. We can rewrite the equation in the form of the Q value:

$$Q^{\pi}(s, a) = \mathbb{E}\big[\, r_{t+1} + \lambda\, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s,\ a_t = a \,\big]$$

Bellman equation for the Q value

The optimal Q value can then be expressed as:

$$Q^{*}(s, a) = \mathbb{E}\big[\, r_{t+1} + \lambda \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s,\ a_t = a \,\big]$$

The optimal Q value

The goal is to maximize the Q value. Before diving into the methods of optimizing Q values, I would like to discuss two value update methods that are closely related to Q-learning.

Policy iteration

Policy iteration alternates between policy evaluation and policy improvement.

Policy iteration diagram

Policy evaluation estimates the value function V for the greedy policy obtained from the last policy improvement. Policy improvement, in turn, updates the policy with the action that maximizes V for each state. The update equations are based on the Bellman equation, and the two steps keep iterating until convergence.

Policy iteration pseudocode
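To make the alternation concrete, here is a minimal tabular sketch of policy iteration. It assumes a model exposed as `P[s][a] = [(prob, next_state, reward), ...]`, an interface invented for this illustration rather than given in the article:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Tabular policy iteration over an assumed model P[s][a] = [(prob, next_state, reward), ...].
    gamma is the discount factor (the λ in the equations above)."""
    policy = np.zeros(n_states, dtype=int)        # start from an arbitrary deterministic policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: estimate V for the current policy until it stops changing.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            stable = stable and best == policy[s]
            policy[s] = best
        if stable:                                # the policy no longer changes: converged
            return policy, V
```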

Value iteration

Value iteration has only one component: it updates the value function V based on the optimal Bellman equation.

$$V(s) \leftarrow \max_{a}\, \mathbb{E}\big[\, r_{t+1} + \lambda\, V(s_{t+1}) \mid s_t = s,\ a_t = a \,\big]$$

Optimal Bellman equation

Value iteration pseudocode

Once the iterations converge, the optimal policy is derived directly by applying an argmax over actions to the value function for every state.
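A matching sketch of value iteration, under the same assumed `P[s][a]` interface; note the single loop that applies the max-over-actions backup, followed by the greedy extraction of the policy:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Tabular value iteration over the same assumed model P[s][a] = [(prob, next_state, reward), ...]."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Optimal Bellman backup: maximize over actions instead of following a fixed policy.
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Derive the optimal policy by acting greedily with respect to the converged value function.
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return policy, V
```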

Note that both methods require knowing the transition probability p, which means they are model-based algorithms. However, as I mentioned earlier, model-based algorithms suffer from a scalability problem. How does Q-learning solve this?

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[\, r_{t+1} + \lambda \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \,\big]$$

Q-learning update equation

Here α is the learning rate (i.e. how fast we move toward the target). The idea behind Q-learning relies heavily on value iteration, but the update equation is replaced by the formula above, so we no longer need to worry about the transition probability.

Q-learning pseudocode

Note that the next action a' is chosen so as to maximize the Q value of the next state rather than by following the current policy. Q-learning is therefore an off-policy algorithm.
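A minimal tabular Q-learning sketch is shown below. It assumes a classic Gym-style environment where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done, info)`; the ε-greedy behaviour policy and the hyperparameters are arbitrary illustrative choices:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch; gamma is the discount factor (λ in the equations above)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy behaviour policy: explore occasionally, otherwise act greedily.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Off-policy target: bootstrap from the greedy (max) Q value of the next state.
            target = reward + gamma * max(Q[(next_state, a)] for a in range(n_actions)) * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```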

2.2 State-Action-Reward-State-Action (SARSA)

SARSA closely resembles Q-learning. The key difference is that SARSA is an on-policy algorithm, which means it learns the Q value based on the action actually performed by the current policy rather than on the greedy action.

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[\, r_{t+1} + \lambda\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \,\big]$$

SARSA update equation

The action a_(t+1) is the action actually taken in the next state s_(t+1) under the current policy.

SARSA pseudocode

From the pseudocode above, you may notice that two action selections are performed, and both always follow the current policy. In contrast, Q-learning places no constraint on the next action other than that it maximizes the Q value of the next state. SARSA is therefore an on-policy algorithm.
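For comparison, here is the same sketch adapted to SARSA under the same assumed environment interface; the only substantive change is that the bootstrapped action is chosen by the current ε-greedy policy and is then actually executed:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA sketch with the same assumed env interface as the Q-learning sketch."""
    Q = defaultdict(float)

    def select(state):
        # The single epsilon-greedy policy is used both to act and to build the target (on-policy).
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        action = select(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = select(next_state)           # chosen by the current policy, not by a max
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action    # the selected action really is executed next
    return Q
```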

2.3 Deep Q Network (DQN)

Q-learning is a very powerful algorithm, but its main drawback is a lack of generality. If you think of Q-learning as updating numbers in a two-dimensional array (action space × state space), it in fact resembles dynamic programming. This implies that a Q-learning agent does not know what action to take in a state it has never seen; in other words, it has no ability to estimate values for unseen states. To deal with this problem, DQN gets rid of the two-dimensional array by introducing a neural network.

DQN uses a neural network to estimate the Q-value function. The input of the network is the current state, and the output is the Q value for each action.

Playing Atari games with DQN

In 2013, DeepMind applied DQN to Atari games, as illustrated in the figure above. The input is the raw image of the current game screen; it passes through several layers, including convolutional and fully connected layers, and the network outputs the Q value of each action the agent can take.

The problem boils down to: How do we train the network?

The answer is that we train the network based on the Q-learning update equation. Recall that the target Q value for Q-learning is:

$$y = r + \lambda \max_{a'} Q(\phi', a';\, \theta)$$

Target Q value

Here ϕ is equivalent to the state s, and θ stands for the parameters of the neural network. Thus, the loss function of the network can be defined as the squared error between the target Q value and the Q value output by the network.

DQN pseudocode

Two other techniques are also important for training DQN:

1. Experience replay: The training samples in a typical reinforcement learning setup are highly correlated and data-inefficient, which makes the network harder to converge. One way to solve this sample-distribution problem is to use experience replay: sampled transitions are stored and later drawn at random from the "transition pool" to update the knowledge.

2. Separate target network: The target Q network has the same structure as the network used to estimate the Q value. Following the pseudocode above, every C steps the target network is reset to the parameters of the other network. The fluctuations thus become less severe, resulting in more stable training. (A minimal training-step sketch combining both techniques follows this list.)
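Below is a rough PyTorch sketch of a single DQN training step that combines both techniques. The article gives no code, so the network sizes, buffer capacity, batch size, and update period C are arbitrary illustrative choices:

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q network: state in, one Q value per action out."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

q_net, target_net = QNet(4, 2), QNet(4, 2)             # state_dim=4, n_actions=2: arbitrary example
target_net.load_state_dict(q_net.state_dict())          # separate target network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                            # experience replay buffer of (s, a, r, s', done)
gamma, batch_size, C, step = 0.99, 64, 1_000, 0

def train_step():
    global step
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)            # random sampling breaks the correlation
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target Q value: r + gamma * max_a' Q_target(s', a'), zeroed when the episode ended.
        target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)              # squared error between target and prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    step += 1
    if step % C == 0:                                     # every C steps, copy weights into the target net
        target_net.load_state_dict(q_net.state_dict())
```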

2.4 Deep Deterministic Policy Gradient (DDPG)

Although DQN has achieved great success on high-dimensional problems such as Atari games, its action space is still discrete. However, many interesting tasks, especially physical control tasks, have continuous action spaces. If you discretize the action space too finely to approximate a continuous space, the action space becomes too large. For example, suppose a system has 10 degrees of freedom; if you split each degree of freedom into 4 parts, you end up with 4¹⁰ = 1,048,576 actions. Convergence is extremely difficult for such a large action space.

DDPG relies on the actor-critic architecture. The actor is used to tune the parameters of the policy function, i.e. to decide the best action for a specific state.

Policy function

The critic, in turn, evaluates the policy function estimated by the actor, based on the temporal difference (TD) error.

$$\delta_t = r_t + \lambda\, Q^{v}(s_{t+1}, a_{t+1}) - Q^{v}(s_t, a_t)$$

Temporal difference (TD) error

Here, the lowercase v denotes the policy that the actor has decided on. Looks familiar, right? It resembles the Q-learning update equation. TD learning is a way of learning to predict a value based on the future values of a given state, and Q-learning is a special type of TD learning used to learn the Q value.

Actor-critic architecture

DDPG also borrows the ideas of experience replay and the separate target network from DQN. Another problem for DDPG is that it seldom explores its actions. One solution is to add noise to the parameter space or the action space.

Action noise (left), parameter noise (right)

An OpenAI blog post argues that adding noise to the parameter space is much better than adding it to the action space. One commonly used noise process is the Ornstein-Uhlenbeck stochastic process.
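A minimal sketch of the Ornstein-Uhlenbeck process as it is typically used to perturb DDPG's actions; the parameter values below are common defaults, not values given in the article:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.size, self.mu, self.theta, self.sigma, self.dt = size, mu, theta, sigma, dt
        self.reset()

    def reset(self):
        # Start each episode at the long-run mean.
        self.x = np.full(self.size, self.mu)

    def sample(self):
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(self.size)
        return self.x

# Usage sketch: action = actor(state) + noise.sample(), clipped to the valid action range.
```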

DDPG pseudocode
