Editor's introduction: Reinforcement learning is a branch of machine learning. Through continuous interaction with the environment and trial and error, it ultimately achieves a specific goal or maximizes the overall return from its actions. The author of this article summarizes and analyzes reinforcement learning; let's learn it together.


The concept of reinforcement learning first became known to the general public at the 2017 Wuzhen Go Summit, where AlphaGo defeated the world's top-ranked Go player. As reinforcement learning was applied to major games such as Honor of Kings, it became more and more familiar to the public. The Honor of Kings AI team has even published papers on applying reinforcement learning to Honor of Kings at AAAI, a top AI conference.

So what exactly is reinforcement learning, and how is it applied? Below I share my understanding of the whole reinforcement learning process and how it is currently applied in industry. Comments and exchanges are welcome; even readers without any computer science background should be able to follow along.

01 Introduction to reinforcement learning


Reinforcement learning is a branch of machine learning. I will not repeat the introduction to machine learning here; interested readers can read the machine learning article I wrote earlier: Must-Read Series for Strategy Product Managers - Lecture 1: Machine Learning.

1. What is reinforcement learning

[Figure: the four main machine learning methods]

Reinforcement learning is one of the main learning methods in machine learning (see the figure above for the four main machine learning methods).

The figure above does not mention deep learning because, viewed as a learning method, deep learning is used within the four methods above, whereas reinforcement learning exists independently. That is why the figure lists reinforcement learning separately but does not list deep learning.

The main difference between reinforcement learning and the other three learning methods is that during reinforcement learning training, the environment must provide feedback, together with a specific feedback value. It is not a classification task; it is not about distinguishing fraudulent customers from normal customers in a financial anti-fraud scenario. Reinforcement learning mainly guides the training subject's decision at every step: which action to take in order to achieve a specific goal or maximize the overall return.

Take AlphaGo playing Go as an example: AlphaGo is the training subject. Each move AlphaGo makes is not "right" or "wrong", but it can be "good" or "bad". If the resulting board position is favorable, it was a good move; if unfavorable, it was a bad move.

The basis for training is that every move AlphaGo makes can receive clear feedback: is it "good" or "bad", and how "good" or "bad" exactly? It can be quantified. The ultimate purpose of reinforcement learning in the AlphaGo scenario is to have the stones occupy more of the board and win the final victory.

To use a rough (and not entirely accurate) metaphor, it is a bit like a circus trainer training a monkey.

The trainer strikes a gong and trains the monkey to stand and salute. The monkey is our training subject. If the monkey completes the stand-and-salute, it receives a food reward; if it fails or does it wrong, there is no food, and there may even be the whip. Over time, whenever the trainer strikes the gong, the monkey naturally knows to stand and salute, because that action yields the greatest benefit in the current environment, while any other action brings no food and may even bring a whipping.

The inspiration for reinforcement learning comes from behaviorist theory in psychology:

  • All learning is a process of establishing a direct connection between stimulus and response through conditioning.
  • Reinforcement plays an important role in establishing the connection between stimulus and response. In the stimulus-response connection, what the individual learns is a habit, and habit is the result of repeated practice and reinforcement.
  • Once a habit is formed, whenever the original or a similar stimulating situation occurs, the acquired habitual response appears automatically.

Based on the theory above, reinforcement learning is about how the training subject, under the stimulation of rewards and punishments given by the environment, gradually forms expectations about those stimuli and develops the habitual behaviors that yield the greatest benefit.

2. Main features of reinforcement learning

1) Trial-and-error learning

Reinforcement learning requires the training subject to interact with the environment constantly and, through trial and error, work out the best behavioral decision at each step. There is no guidance in the process, only cold feedback: all learning is based on the environment's feedback, and the training subject adjusts its behavior accordingly.

2) Delayed feedback

During training, feedback on the training subject's "trial and error" behavior sometimes arrives only when the whole episode is over, such as Game Over or Win. Of course, in such cases we usually decompose the feedback during training and try to attribute it to every step.

3) Time is an important factor in reinforcement learning

The sequence of environment state changes and environment feedback in reinforcement learning is strongly tied to time. The whole training process unfolds over time, with states and feedback constantly changing, so time is an important factor in reinforcement learning.

4) Current behavior affects the data received later

The reason this feature is called out separately is to distinguish reinforcement learning from supervised and semi-supervised learning. In supervised and semi-supervised learning, each training sample is independent of the others. In reinforcement learning this is not the case: the current state and the action taken affect the next state received, so there is correlation between the data.

02 Detailed explanation of reinforcement learning

Below we will introduce reinforcement learning in detail:

1. Basic components

[Figure: a Pacman game screen]

This article uses the game Pacman to introduce the basic components of reinforcement learning. The goal of the game is simple: the Agent must eat all the beans on the screen while avoiding being touched by the ghosts; if a ghost touches it, the game ends. The ghosts themselves are constantly moving.

Each time the Agent takes a step, eats a bean, or gets hit by a ghost, the score at the top left of the screen changes; the current score in the screenshot is 435 points. This mini game is also the coursework (http://ai.berkeley.edu/project_overview.html) used by the University of California, Berkeley in its reinforcement learning course. Later articles will also use this mini game for hands-on explanations of reinforcement learning.

1) Agent

The object trained in reinforcement learning is the Agent, sometimes translated as "intelligent agent"; here we simply refer to it as the Agent. In Pacman, it is the yellow fan-shaped moving body with its mouth wide open.

2) Environment

The overall background of the whole game is the environment; in Pacman, the Agent, the ghosts, the beans, and the various walls together form the entire environment.

3) State

The State is the current state of the Environment and the Agent. Because the ghosts are moving, the number of beans keeps changing, and the Agent's position keeps changing, the whole State is constantly changing. Note in particular that the State includes the state of both the Agent and the Environment.

4) Action

Given the current State, the Action is what the Agent can do, such as moving left or right, up or down. Action is strongly tied to State: for example, there are walls in many positions in the screenshot above, so under that State the Agent obviously cannot go left or right and can only go up or down.

5) Reward

After the Agent takes a specific Action in the current State, it receives feedback from the environment; this is the Reward. Reward is used here as a general term: although the word translates as "reward", in reinforcement learning it simply means the "feedback" given by the environment, which may be a reward or a punishment. For example, in the Pacman game, when the Agent runs into a ghost the environment gives a punishment.

The above are the five basic components of reinforcement learning.
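To make the five components concrete, here is a minimal sketch of the Agent-Environment interaction loop in Python. The ToyEnv and RandomAgent classes are illustrative placeholders invented for this sketch, not the real Pacman game or any library API.

```python
import random

# A minimal, hypothetical sketch of the Agent-Environment interaction loop.
# ToyEnv and RandomAgent are toy stand-ins, not the real Pacman game.

class ToyEnv:
    """Toy Environment: the State is just a step counter, the Reward is random."""
    def reset(self):
        self.t = 0
        return self.t                        # initial State

    def step(self, action):
        self.t += 1
        reward = random.choice([-1, 0, 1])   # environment feedback (Reward)
        done = self.t >= 10                  # episode ends after 10 steps
        return self.t, reward, done          # next State, Reward, game over?

class RandomAgent:
    """Toy Agent: picks an Action at random from the allowed set."""
    actions = ["up", "down", "left", "right"]

    def choose_action(self, state):
        return random.choice(self.actions)

    def learn(self, state, action, reward, next_state):
        pass                                 # a real Agent would update its estimates here

env, agent = ToyEnv(), RandomAgent()
state = env.reset()
done = False
while not done:
    action = agent.choose_action(state)             # Agent acts based on the current State
    next_state, reward, done = env.step(action)     # Environment returns Reward and next State
    agent.learn(state, action, reward, next_state)  # Agent learns from the feedback
    state = next_state
```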

2. Reinforcement learning training process

Next we introduce the training process of reinforcement learning. The whole process rests on one premise: we assume that it conforms to a Markov decision process.

1) Markov Decision Process (MDP)

Markov was a Russian mathematician; the "Markov decision process" is named after him in recognition of his research on Markov chains. It is abbreviated MDP below.

[Figure: an MDP chain of alternating States and Actions: State1 → Action1 → State2 → Action2 → State3]

The core idea of MDP is that the next State depends only on the current State and the Action taken in that State; it only looks back one step. For example, State3 in the figure above is related only to State2 and Action2, and has nothing to do with State1 and Action1. If we know the current State and the Action to be taken, we can derive the next State without tracing further back to earlier States and Actions.

In practice, most scenarios are Markov decision processes. For example, when AlphaGo plays Go, given the current board position and where the next stone is about to be placed, we know exactly what the next board position will be.

Why do we require the whole training process to conform to an MDP? Because only then can we infer the next State from the current State and the Action to be taken, which lets us trace the State changes at every step of training. If we could not even work out how the State changes at each step, there would be no way to train.
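Stated formally (a standard formulation, not spelled out in the original article), the Markov property behind MDP says that the next State depends only on the current State and Action, not on the earlier history:

```latex
% Markov property: the next state depends only on the current state and action,
% not on the full history of earlier states and actions.
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)
```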

Next we look at how reinforcement learning algorithms guide the Agent's actions.

3. Reinforcement learning algorithm classification

Which algorithm should we choose to guide the Agent's actions? There are many kinds of reinforcement learning algorithms and many ways to classify them; here I pick three of the more common categories.

1) Value Based

Description: the Agent decides how to act in the current State based on all the Actions that can be taken in each State and the Value corresponding to each of those Actions. Note that this Value is not simply the Reward the environment gives for moving from the current State into the next State; that Reward is only one part of the Value.

When we actually train, we have to consider both the immediate return and the long-term return, so the Value is obtained through a calculation formula rather than being just the Reward the environment immediately feeds back for the state change. Because calculating Value is relatively involved, the Bellman equation is usually used; it will not be described in detail here.
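As a sketch of the usual form (the article does not spell it out), the Bellman equation writes the Value of taking action a in state s as the immediate Reward plus the discounted best Value achievable from the next state s'; here γ is a discount factor between 0 and 1, and the transition from s to s' is assumed deterministic for simplicity:

```latex
% Bellman optimality equation for the action value Q(s, a),
% assuming a deterministic transition from s to s' under action a.
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
```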

How the Action is selected: simply put, choose the Action with the largest Value in the current State, i.e., the Action that brings the biggest Value payoff. For example, in the StateA state in the figure below, three Actions can be taken, but Action2 brings the largest Value, so when the Agent reaches StateA it will select Action2.

It should be emphasized that the Value is unknown at the start of reinforcement learning training; we usually initialize it to 0. We then let the Agent keep trying various Actions, interact with the environment, and obtain Rewards, and we keep updating the Value according to its calculation formula. After many rounds of training, the Value converges to a stable number, and we then know which Action should be taken in a given State and what the corresponding Value is.

[Figure: StateA with three candidate Actions and their Values; Action2 has the largest Value]

Representative algorithms: Q-Learning, SARSA (State-Action-Reward-State-Action)
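As a hedged illustration of the Value-Based idea, here is a minimal tabular Q-Learning update in Python; the learning rate alpha, discount gamma, and the state/action encoding are assumptions made for this sketch, not values from the article.

```python
from collections import defaultdict

# Minimal tabular Q-Learning sketch; alpha, gamma and the action set
# are illustrative assumptions, not taken from the article.

ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)        # Q[(state, action)] -> estimated Value, initialised to 0
alpha, gamma = 0.1, 0.9       # learning rate and discount factor (assumed values)

def q_update(state, action, reward, next_state):
    """One Q-Learning step: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```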

Applicable scenarios: the Action space is discrete. In Pacman, for example, the Action space is basically "up, down, left, right". But some Agents have a continuous Action space, such as controlling a robotic arm, where the whole movement is continuous. A continuous Action space can be forced into discrete values, but the resulting dimensionality is too large, often growing exponentially, and is not suitable for training.

Also, a Value-Based method ultimately learns a single best Action for each State. In some scenarios, however, the optimal behavior is itself random; for example, in rock-paper-scissors the best strategy is to play scissors, rock, and paper each with probability 1/3.

2) Policy Based

The Policy Based approach is a complement to Value Based.

Description: the Action policy itself is modeled. For each State, the method learns the probability of each Action that can be taken in that State, and then selects an Action according to those probabilities. How the Reward is used to compute the probability of each Action involves a lot of derivative calculations; if you are interested in the details, see this article: https://zhuanlan.zhihu.com/p/54825295

How the Action is selected: feed the State into the learned policy function and obtain the Action.

Representative algorithm: Policy Gradients

Applicable scenarios: the Action space is continuous, or the best Action for each State is not necessarily fixed. Broadly, the scenarios where Policy Based applies complement those where Value Based applies. When the Action space is continuous, we usually assume the actions follow a Gaussian distribution and carry out the subsequent calculations on that basis.
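As a small sketch of the Policy-Based idea for a continuous Action space: the policy maps the State to the parameters of a Gaussian and the Action is sampled from that distribution. The mean and standard deviation below are arbitrary illustrative values, not a learned policy.

```python
import random

# Sketch of a Policy-Based action choice for a continuous action space:
# the policy outputs the parameters of a Gaussian distribution and the
# Action is sampled from it.

def gaussian_policy(state):
    """Hypothetical policy: map a State to the mean and std of the action distribution."""
    mean, std = 0.5, 0.1      # in a real policy these would be learned from the State
    return mean, std

def choose_action(state):
    mean, std = gaussian_policy(state)
    return random.gauss(mean, std)   # a continuous Action, e.g. a joint torque value
```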

3) Actor-Critic

The Actor-Critic family combines Value-Based and Policy-Based, and its algorithms draw on both of the two categories described above.

The above are the three common categories of reinforcement learning algorithms. For the Pacman game we can train with a Value-Based algorithm, because the optimal Action for each State is relatively fixed and the Reward function is easy to set.

4) Other categories

The three categories above are the common classification. Sometimes we also classify from other angles; the following classification partly overlaps with the one above:

Classification by whether the environment Model is learned: Model-based means the agent has learned how the entire environment works. When the agent knows the reward obtained and the next state reached for any action performed in any state, and can derive both through the model, the whole problem becomes a dynamic programming problem and can be solved directly with a greedy algorithm. Reinforcement learning methods that model the environment in this way are called Model-based.

Model-free means the optimal policy is found without modeling the environment. Although we cannot know the exact environmental returns, we can estimate them. Q(s,a) in Q-learning is an estimate of the sum of future returns obtained after executing action a in state s; after many rounds of training, the estimate of Q(s,a) becomes more and more accurate. A greedy algorithm is then also used to decide which action the agent takes in a given state.

How do we decide whether a reinforcement learning algorithm is Model-based or Model-free? Ask whether, before the agent performs action a in state s, it can accurately predict the next state and the return. If it can, the method is Model-based; if it cannot, it is Model-free.

4. EE (Explore & Exploit)

We have introduced the main families of reinforcement learning algorithms: Value-Based, Policy-Based, and Actor-Critic. In practice, however, we also run into the "EE" problem during training. The double E here is not "Electronic Engineering" but "Explore & Exploit": exploration versus exploitation.

Take Value-Based as an example. In the StateA state in the figure below, the Values corresponding to Action1, Action2, and Action3 are all 0 at the beginning, because before training we know nothing and all initial values are 0. Suppose Action1 is chosen at random the first time, StateA transitions to StateB, and a Value of 2 is obtained; the system then records that, in StateA, choosing Action1 corresponds to Value = 2.

If the Agent later returns to StateA and we always choose the action with the maximum Value, we will inevitably pick Action1, because the Values for Action2 and Action3 in StateA are still 0; the Agent has never tried them and has no idea what Value they would bring.

[Figure: StateA with Action1/2/3 all initialised to Value 0; taking Action1 leads to StateB and yields Value = 2]

So at the beginning of reinforcement learning training, the Agent tends to Explore: instead of always taking the Action with the maximum Value, it chooses Actions with a certain randomness, in order to cover more Actions and try every possibility. After many rounds of training, once the various Actions in the various States have basically all been tried, we significantly reduce the proportion of exploration and make the Agent lean toward Exploit: whichever Action returns the largest Value is the Action chosen.
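A common way to balance Explore and Exploit is an epsilon-greedy rule, sketched below; the value of epsilon and the idea of shrinking it over training are assumptions consistent with the description above, not a prescription from the article.

```python
import random

# Epsilon-greedy action selection: explore with probability epsilon,
# otherwise exploit the Action with the highest current Value estimate.
# Q is a mapping from (state, action) to the estimated Value.

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:                               # Explore: try a random Action
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # Exploit: best-known Action

# epsilon is typically large early in training (more exploration)
# and is reduced as training progresses (more exploitation).
```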

Explore & Exploit is a problem often encountered in machine learning, not only in reinforcement learning but also, for example, in recommendation systems: if a user is interested in a certain product or content, should the system keep pushing it to that user, or should it also mix in some random products or content?

5. Difficulties in the actual development of reinforcement learning

When we actually apply reinforcement learning, we often run into all kinds of problems. Reinforcement learning is powerful, but many problems are hard and it can be difficult to know where to start.

1) Reward settings

How to set the Reward function, that is, how to quantify the environment's feedback, is a very difficult problem. For example, in AlphaGo, how to measure the "goodness" or "badness" of each move and ultimately quantify it is very hard; in some scenarios the Reward function is simply difficult to design.
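In a game like Pacman the feedback is much easier to quantify than in Go; the sketch below is purely illustrative, and the event names and numbers are assumptions rather than the game's actual scoring.

```python
# Hypothetical Reward function for Pacman; event names and values are
# illustrative assumptions, not the game's real scoring rules.

def pacman_reward(event):
    rewards = {
        "ate_bean": 10,       # positive feedback for eating a bean
        "hit_ghost": -500,    # heavy punishment for being caught by a ghost
        "step": -1,           # small per-step cost to encourage finishing quickly
    }
    return rewards.get(event, 0)
```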

2) Sampling and training take too long, making real industrial application difficult

Reinforcement learning needs to explore every Action in every State and then learn from it. In practice, in some scenarios this is an enormous number of combinations, and the training time, compute, and cost are huge. In many cases other algorithms achieve the same result while saving a great deal of training time and compute. The ceiling of reinforcement learning is very high, but if training is not done well, its floor can be especially low.

3) It is easy to fall into a local optimum

In some scenarios, the Action the Agent takes may be only a local optimum rather than the global optimum. People often post screenshots of encountering Honor of Kings AI in a game where pushing the tower or the crystal is clearly the most reasonable move, yet the AI is farming minions, because it has adopted a locally optimal behavior. No matter how reasonable the Reward function is, the Agent may still fall into a local optimum.

03 Practical application of reinforcement learning

Although reinforcement learning still faces all kinds of hard problems, industry has begun trying to apply it to real scenarios. Besides AlphaGo, what other applications are there?

1. Autonomous driving

At present, Baidu in China uses certain reinforcement learning algorithms in the field of autonomous driving. However, because reinforcement learning requires trial and error with the environment, and the cost of that in the real world is too high, safety officers need to be present during real-world training to intervene and promptly correct wrong behaviors taken by the Agent.

2. Games

Games are probably where reinforcement learning is most widely used. Some MOBA games on the market already have reinforcement learning AI versions, the most famous being the Honor of Kings AI. In a game the Agent can interact and trial-and-error at will without any real-world cost, and the Reward is relatively easy to set because there is an obvious scoring mechanism.

3. Recommendation systems

At present, some major Internet companies, such as Baidu and Meituan, are also trying to add reinforcement learning to their recommendation systems, using it to improve the diversity of recommended results and to complement traditional collaborative filtering and CTR prediction models.

In short, reinforcement learning is a very popular research direction in machine learning with very broad application prospects. The next article will use Python to show how to train Pacman to eat beans with the Q-Learning algorithm. You are welcome to keep following along.

This article was originally published by @King James on Everyone is a Product Manager. Reproduction is prohibited without permission.

The title image is from Unsplash, based on the CC0 protocol.
