An approach to solving optimal control problems of nonlinear systems by introducing a detail-reward mechanism in deep reinforcement learning

Abstract: In recent years, dynamic programming and reinforcement learning theory have been widely used to solve nonlinear control system (NCS) problems. Many achievements have been made in the construction of network models and in system stability analysis, but there is little research on establishing the control strategy according to the detailed requirements of the control process. Spurred by this trend, this paper proposes a detail-reward mechanism (DRM), which constructs a reward function composed of individual detail evaluation functions in order to replace the utility function in the Hamilton-Jacobi-Bellman (HJB) equation. This method is then introduced into a wider range of deep reinforcement learning algorithms to solve optimization problems in NCSs. After a mathematical description of the relevant characteristics of the NCS, the stability of the iterative control law is proved with a Lyapunov function. With the inverted pendulum system as the experimental object, the dynamic environment is designed and the reward function is established by using the DRM. Finally, three deep reinforcement learning algorithm models, based on Deep Q-Networks, policy gradient and actor-critic, are designed in this dynamic environment, and the effects of different reward functions on the experimental accuracy are compared. The experimental results show that, in an NCS, using the DRM to replace the utility function in the HJB equation is more in line with the designer's detailed requirements for the whole control process. By observing the characteristics ...


Introduction
In the last two decades, the research and application of nonlinear systems have made considerable progress. Combined with the Lyapunov stability theory, there have been many research results in the control theory of nonlinear systems. Wu et al. [1] designed an adaptive quantitative control method for analyzing uncertain nonlinear strictly feedback systems by combining the backstepping technique and Lyapunov stability theory. Shatyrko et al. [2] studied the stabilization of Lur'e-type nonlinear indirect control systems with time-delay argument, and obtained the sufficient conditions for the absolute stability of the system by the Lyapunov method. Kai et al. [3] discussed a control method that can generate the desired limit-cycle-like behavior for a two-dimensional discrete-time nonlinear control system. Wei [4] discussed the boundedness of nonlinear nabla fractional-order systems, and by using the nabla Laplace transform, derived two stability criteria in the form of Lyapunov's theorem. Pole et al. [5] described progressively globally asymptotically stable nonlinear control systems through symbolic models, which enables one to utilize techniques from supervisory control and algorithms from game theory for controller synthesis purposes.
At the same time, the development of optimization theory for dynamic systems, together with the research and application of nonlinear control systems, has opened up a huge application space. Zhang et al. [6] introduced the control research of adaptive dynamic programming (ADP) in the synchronous generator and generator excitation of a multi-machine power system. Volckaert et al. [7] used iterative learning control (ILC) to solve the control problem of the cart-cable system with nonlinear characteristics. Gao et al. [8] used a global adaptive dynamic programming (GADP) method for cooperative adaptive cruise control (CACC) of connected vehicles with unknown nonlinear dynamics. Trélat [9] studied some common methods used to solve nonlinear optimal control problems in aerospace, such as Pontryagin's maximum principle and conjugate point theory.
At present, many scholars combine nonlinear systems and optimal control theory to carry out research on the optimal control of nonlinear systems. The main methods for solving the optimal control problem are the classical variational method, the maximum principle, and dynamic programming [10][11][12]. One of the main difficulties of these classical optimal control theories is determining the optimal control of a nonlinear system, which requires solving the Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) [13]. Due to the complexity of nonlinear systems and the various constraints of real systems, the nonlinear HJB equations and the associated nonlinear two-point boundary value problems generally cannot be solved analytically. Therefore, it is of great significance to study numerical methods or explicit approximations to the optimal solution for solving the optimal control problem.
With the development of various technologies in the field of artificial intelligence, many methods that combine artificial intelligence algorithms to solve nonlinear system problems have emerged [14,15]. Another powerful methodology for solving optimal control problems is the adaptive dynamic programming (ADP) algorithm, which was proposed by Werbos in 1991 [16]. Combining reinforcement learning and dynamic programming, ADP simulates human learning through environmental feedback and is considered a method very close to human intelligence. A single evaluation network has also been used to obtain approximate optimal control, which ensures a certain stability of uncertain nonlinear continuous-time systems. Liu et al. [32] constructed two kinds of neural networks to achieve approximate optimal control of continuous-time nonlinear systems with unknown structure and control constraints. Among them, a recurrent neural network (RNN) is used to identify the system model, and two feedforward neural networks are used as the execution network and the evaluation network, respectively, to approximate the optimal control strategy and the optimal performance index function. Zhao et al. [33] proposed an RL-based cyclic fixed finite-time algorithm to solve the HJB equation for the optimal control of continuous nonlinear finite-time systems with uncertain system structure. Zhao et al. [34] proposed a new reinforcement learning strategy consisting of two parts: online learning of the optimal control of a nominal system and a feedforward neural network (NN) compensation that handles uncertain input constraints. This strategy solves the optimal stabilization problem of unknown nonlinear systems subject to uncertain input constraints. Wang et al. [35] studied the approximate optimal control of continuous-time non-affine nonlinear systems using reinforcement learning, applied an effective precompensation technique for a proper system transformation, and combined it with global policy iteration to relax the requirements on the system dynamics. J. W. Kim et al. [36] proposed a model-based RL approach that uses deep neural networks (DNNs) to iteratively learn the solutions of the HJB equation and its associated equations, which can significantly improve the performance of learned policies under uncertain initial states and in the presence of state noise.
Algorithms based on dynamic programming and iterative strategies start from the performance index function in the state space of the control system, explore the control effect of the control sequence, and obtain the optimal control law [37][38][39][40]. The literature above has achieved good results in the design of network structures and in the convergence proofs of policy algorithms. In the study of the optimal control of nonlinear systems, this paper pays more attention to the control details of the dynamic programming process. The purpose is that the evolution of the state space during the whole control process meets the detailed requirements on the controlled system and finally achieves the control goal. This requires that the utility function in the performance index function be able to express the specific needs of users. Most of the above works use quadratic forms when defining utility functions and do not describe well the relationship between the utility function and the system states. In addition, there are great limitations on the setting of the parameters in the utility function. As for how to design a utility function that is closely related to the actual control objective in nonlinear systems, there is little literature, and this is the research direction of this paper. This paper proposes a detail-reward mechanism (DRM) based on detailed regulation of the control process. By designing evaluation functions for the control objectives, multiple evaluation functions are combined into a reward function that replaces the utility function in the performance index function, and this reward function is then introduced into deep reinforcement learning algorithms. We use three algorithms to verify the effectiveness of the DRM, based on the Deep Q-Network, policy gradient and actor-critic frameworks, respectively.
The remainder of this paper is organized as follows. Section 2 describes the properties of nonlinear systems and introduces the limitations of utility functions in iterative algorithms. In Section 3, an alternative approach to the utility function is proposed by combining Lyapunov stability and Q-learning. Taking the inverted pendulum system as the experimental object, Section 4 establishes the dynamic environment, constructs the reward function mechanism of the inverted pendulum system, analyzes the deep reinforcement learning algorithm models and designs experiments to verify the effectiveness of the algorithms. Finally, Section 5 concludes this work.

Problem statement
Consider a continuous-time nonlinear system given by

$$\dot{x}(t) = f\big(x(t), u(t)\big), \tag{1}$$

where $x(t) \in \mathbb{R}^{n}$ is the system state vector, $u(t) \in \Omega_u \subset \mathbb{R}^{m}$ is the control input vector, and $f(\cdot)$ is a continuously differentiable vector function. The boundary conditions are that the initial state is fixed and the terminal state is free. The performance index function is defined as

$$J\big(x(t), u(\cdot), t\big) = \Phi\big(x(t_f), t_f\big) + \int_{t}^{t_f} L\big(x(\tau), u(\tau), \tau\big)\, d\tau, \tag{2}$$

where the control vector $u(t)$ is unconstrained and continuous. On the interval $[t, t_f]$, the scalar functions $\Phi(\cdot)$ and $L(\cdot)$ are continuous and twice differentiable. The function $f(x, u)$ is continuous and has continuous first- and second-order partial derivatives with respect to $x$ and $u$. In the admissible control domain $\Omega_u$, $u(t) \in \Omega_u$ for $t \in [t, t_f]$, so the optimal performance index function is defined as

$$J^{*}\big(x(t), t\big) = \min_{u(\cdot) \in \Omega_u} J\big(x(t), u(\cdot), t\big). \tag{3}$$

The corresponding HJB equation is

$$-\frac{\partial J^{*}\big(x(t), t\big)}{\partial t} = \min_{u(t) \in \Omega_u}\left\{ L\big(x(t), u(t), t\big) + \left(\frac{\partial J^{*}\big(x(t), t\big)}{\partial x}\right)^{\!\top} f\big(x(t), u(t)\big) \right\}. \tag{4}$$

For the optimal control law $u^{*}(t)$, Eq (4) can be written as

$$-\frac{\partial J^{*}\big(x(t), t\big)}{\partial t} = L\big(x(t), u^{*}(t), t\big) + \left(\frac{\partial J^{*}\big(x(t), t\big)}{\partial x}\right)^{\!\top} f\big(x(t), u^{*}(t)\big). \tag{5}$$
According to Euler's discretization [41], the discrete NCS corresponding to the control system can be written as

$$x_{k+1} = F(x_k, u_k) = x_k + \Delta T\, f(x_k, u_k), \tag{6}$$

where $\Delta T$ is the sampling period.
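As a concrete illustration (not taken from the paper), the Euler step in Eq (6) can be written as a one-line update; the dynamics `f` used in the example below is a hypothetical placeholder system.

```python
import numpy as np

def euler_step(f, x, u, dt=0.02):
    """One Euler step: x_{k+1} = F(x_k, u_k) = x_k + dt * f(x_k, u_k)."""
    return x + dt * np.asarray(f(x, u))

# Example with a hypothetical scalar nonlinear system dx/dt = -x^3 + u.
f = lambda x, u: -x**3 + u
x_next = euler_step(f, np.array([0.5]), np.array([0.1]))
```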
The nonlinear system is expressed by the function $x_{k+1} = F(x_k, u_k)$, where $x_k \in \Omega_x \subset \mathbb{R}^{n}$ is the system state vector and $u_k \in \Omega_u \subset \mathbb{R}^{m}$ is the control input vector. $\Omega_x$ and $\Omega_u$ are defined as

$$\Omega_x = \{x_k \mid x_k \in \mathbb{R}^{n},\ \|x_k\| < \infty\}, \qquad \Omega_u = \{u_k \mid u_k \in \mathbb{R}^{m},\ \|u_k\| < \infty\}, \tag{7}$$

where $\|\cdot\|$ denotes the Euclidean norm. Let $\underline{u}_k = (u_k, u_{k+1}, \ldots)$ denote the control sequence from $k$ to $\infty$, and let $U(x_k, u_k)$ be the utility function. The performance index function is defined as

$$J(x_k, \underline{u}_k) = \sum_{i=k}^{\infty} U(x_i, u_i). \tag{8}$$

For a control sequence $\underline{u}_k$, the optimal performance index function can be defined as

$$J^{*}(x_k) = \min_{\underline{u}_k} J(x_k, \underline{u}_k). \tag{9}$$

So we have

$$J^{*}(x_k) = \min_{u_k}\big\{ U(x_k, u_k) + J^{*}(x_{k+1}) \big\}. \tag{10}$$
The optimal single control law $u_k^{*}$ can be expressed as

$$u_k^{*} = \arg\min_{u_k}\big\{ U(x_k, u_k) + J^{*}(x_{k+1}) \big\}. \tag{11}$$
The above is the basic principle of using dynamic programming to solve the optimal control of the two systems. The optimal control vector function and the optimal state trajectory can in principle be solved through the HJB equation, but the process is often very complicated or intractable. In addition, when the discrete system is of high order, the "curse of dimensionality" easily arises. A common alternative method is ADP. The performance index function in ADP is usually defined in a quadratic form, similar to the following expression:

$$J(x_k) = \sum_{i=k}^{\infty} \big( x_i^{\top} Q x_i + u_i^{\top} R u_i \big), \tag{12}$$

where $Q$ is an $n \times n$ nonnegative definite matrix and $R$ is an $m \times m$ positive definite matrix; the dimensions of $Q$ and $R$ depend on the dimensions of $x_k$ and $u_k$, respectively. The utility function is generally expressed in the following form:

$$U(x_k, u_k) = x_k^{\top} Q x_k + u_k^{\top} R u_k. \tag{13}$$
In most cases, $Q$ and $R$ are set as identity matrices of suitable dimensions, but this is often not consistent with the real control system and its control objectives. For example, in a specific control system, high-frequency control laws may damage electronic components and should therefore be avoided. The mechanical characteristics of some systems require the control law to conform to a flexible functional control as much as possible. Moreover, due to the specific operating environment, the state trajectory under the output of the control law may be required to satisfy specific requirements. This requires the utility function to have a more flexible nonlinear form. As far as we know, there is very little literature on how to assign appropriate values to $Q$ and $R$ so that the utility function satisfies the requirements of a real control system. In fact, it is difficult or even impossible to accurately determine these two matrices to match the real control system.
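For later comparison with the DRM, a quadratic utility of the form of Eq (13) can be evaluated as follows; the identity matrices used as defaults here are illustrative choices, not values from the paper.

```python
import numpy as np

def quadratic_utility(x, u, Q=None, R=None):
    """U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k, as in Eq (13)."""
    Q = np.eye(len(x)) if Q is None else Q
    R = np.eye(len(u)) if R is None else R
    return float(x @ Q @ x + u @ R @ u)

print(quadratic_utility(np.array([0.1, 0.0, 0.05, 0.0]), np.array([0.5])))
```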

Construction of detail-reward mechanism
In order to solve the above problems, we try to modify the utility function so that it not only satisfies the mathematical requirements of the optimal performance index, but also satisfies our requirements for the control process. Firstly, we define a new function $r(x_k, u_k)$ and use it as the mapping object of the utility function $U(x_k, u_k)$. Then, for a known NCS $\dot{x} = f(x, u)$, the discretization method is used to establish the function $F(x_k, u_k)$, so as to establish the dynamic environment. Using the dynamic environment, the next state $x_{k+1}$ under the input vectors $x_k$ and $u_k$ can be computed, and the reward $r(x_k, u_k)$ matching $U(x_k, u_k)$ can be output. Thus, how to design the dynamic environment and how to design the output function $r(x_k, u_k)$ corresponding to $U(x_k, u_k)$ becomes one of the key problems to solve in nonlinear control systems. For mathematical convenience, we make the following conventions and constraints on $r(x_k, u_k)$: 1) The discrete nonlinear control sequence can be defined as a finite Markov decision process (MDP) $(S, A, R)$, where $S = \{s_1, s_2, \ldots\} \subset \Omega_x$ is a finite set of states; $A = \{a_1, a_2, \ldots\} \subset \Omega_u$ is a finite set of control actions; and $R$ is the reward distribution, which can be written as $r: S \times A \times S \mapsto [0, r_{\max}]$, with $r(s, a, s') \in [0, r_{\max}]$ being the reward received upon taking $a$ in state $s$ and transitioning to state $s'$. The boundedness of the reward function $r(x_k, u_k)$ ensures that the performance index function composed of it is bounded. The intuitive explanation of the reward function is: the higher the compliance of the control effect, the closer the value is to $r_{\max}$, and the lower the compliance of the control effect, the closer the value is to 0.
2) A specific state space $S_a$ is a subspace of the state space $S$, i.e., $S_a \subset S$, and it is composed of specific components of the state space $S$. In this subspace there may be specific control requirements or the designer's objective evaluation of the state. We regard $S_a$ as an observation perspective of the state space. Therefore, we define an evaluation function $c_a(x_k, u_k) \in [0, 1]$ under the perspective of $S_a$. In this perspective, specific control can be carried out and the control effect $c_a(x_k, u_k)$ can be evaluated. As a component of the reward function, the evaluation function under a specific observation perspective is often helpful to guide the subsequent control closer to the control goal. As shown in Figure 1, in a gradually stabilizing MDP control trajectory, the control vector in state $s_k$ is $a_k$, the reward value is $r_k$, and the system transitions to the next state $s_{k+1}$. If there is a better control vector under the observation perspective $S_a$ at this time step, the state transitions directly to $s_{k+n}$, bypassing the intermediate $n-1$ control steps, and the MDP trajectory is shortened accordingly. Therefore, such a control law is suitable for being given an increased reward value. According to the different observation perspectives, the evaluation functions $c_j(x_k, u_k)$ related to the state-control pair $(x_k, u_k)$ are formulated to evaluate the control validity. When $c_j(x_k, u_k) \neq 0$, $c_j(\cdot)$ has a value range of $(0, 1]$ and is continuously differentiable.
Here $r_{\max} = 1$ means that the maximum value of the reward function is 1. When $c_j(\cdot) \neq 0$, $c_j(\cdot)$ should be continuous and twice differentiable in order to ensure that $r(\cdot)$ is also continuous and twice differentiable wherever $r(\cdot) \neq 0$. The evaluation function sequence is $c_1(x_k, u_k), c_2(x_k, u_k), \ldots, c_M(x_k, u_k)$, where $M$ is the number of evaluation functions, i.e., the effectiveness of control is evaluated from $M$ different observation perspectives.
A dynamic environment is constructed by the above formulas, as shown in Figure 2. The input parameters are the state-control pair $(x_k, u_k)$, and the outputs are the next state $x_{k+1}$ and the reward $r(x_k, u_k)$ of the system. If $r(\cdot)$ is regarded as the reward value of the dynamic environment, the transformation from the utility function of the NCS to the reward is completed, i.e., the utility function is substituted by the reward value. The intended effect is that the smaller the value of the utility function, the greater the reward value, with the reward confined to the interval from 0 to $r_{\max}$. In order to remain consistent with Eqs (8)-(11), we express the reward $r(x_k, u_k)$ corresponding to $U(x_k, u_k)$ as the product of the evaluation functions:

$$r(x_k, u_k) = \prod_{j=1}^{M} c_j(x_k, u_k). \tag{15}$$

It is worth mentioning that the evaluation functions and their respective observation perspectives should be designed consistently with the control objectives. The design of observation perspectives can mainly focus on two points: the first is to monitor the effect of the control law in a specific state during the control process; the second is that the design should help the system achieve an optimal control effect. The observation perspectives of the evaluation functions should not be repeated, and there should be no conflict in control requirements between the evaluation functions; that is, a high score of the control law under one observation perspective should make it easier for the control law to obtain high scores under the other observation perspectives. A high score under a certain observation perspective may not immediately lead to high scores under the others, but it should make it highly probable that the control law earns high scores under more observation perspectives in the following time steps.
According to the definition of the expected discounted return in reinforcement learning theory, let $r_k = r(x_k, u_k)$ and substitute it into the following equation:

$$G_k = \sum_{i=k+1}^{T} \gamma^{\,i-k-1}\, r_i,$$

where $\gamma$ is the discount rate, $0 \le \gamma \le 1$, $r_i$ is the reward of step $i$, $r_i \in R \subset [0, r_{\max}]$, $R$ is the reward set, and $T$ is the final time step. The advantages of using the reward function $r(x_k, u_k)$ instead of the utility function $U(x_k, u_k)$ are as follows: 1) The reward function reflects feasible control objectives from various observation perspectives. It may be a nonlinear function instead of the fixed quadratic expression of the utility function, so it can better satisfy the detailed control objectives.
2) The reward function $r(x_k, u_k)$ is composed of several evaluation functions, each from its own observation perspective. Each evaluation function $c_j(x_k, u_k)$ is specific, simple, and easy to calculate, which avoids the difficulty of specifying the weight matrices $Q$ and $R$ in the utility function $U(x_k, u_k)$. 3) For a specific real control platform, the reward function is easier to realize. According to the requirements of the control objectives and the control details, each evaluation function is easy to adjust, and the adjustment of one evaluation function does not affect the other evaluation functions. 4) After adding the reward function to the dynamic environment, more mature reinforcement learning algorithms can be used to solve the nonlinear control problem.
In Eq (15), the reward function takes the form of successive multiplication instead of successive addition or other combinations. The advantages of doing so are as follows: 1) If the reward function took the additive form of the evaluation functions, then changing one evaluation function would also require reconsidering the weight of every evaluation function, which makes the problem very complex and whose rationality would need to be carefully verified in experiments. In the multiplicative form, changing one evaluation function only changes its own calculation method and possible gain, and there is no need to change the weights of the other evaluation functions. In fact, the output value of each evaluation function can be understood as its own weight.
2) With successive multiplication, any evaluation function can exercise a "one-vote veto". For example, in an experiment controlling a walking robot, no matter how high a score is given by the evaluation function responsible for evaluating the robot's posture, as long as the robot touches the boundary, the reward should be zero. According to Eq (15), when the evaluation function responsible for boundary detection gives a zero score, the value of the reward function is zero. An additive combination of the evaluation functions would not produce such an effect. A minimal sketch of this multiplicative composition is given below.
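The following is a minimal sketch of the multiplicative composition of Eq (15); the three evaluation functions are hypothetical examples, not the ones designed for the inverted pendulum later in the paper.

```python
import numpy as np

def drm_reward(x, u, evaluators):
    """r(x_k, u_k) = prod_j c_j(x_k, u_k); each c_j returns a score in [0, 1]."""
    score = 1.0
    for c in evaluators:
        score *= float(np.clip(c(x, u), 0.0, 1.0))
        if score == 0.0:          # "one-vote veto": any zero evaluation zeroes the reward
            break
    return score

# Hypothetical evaluators for a generic regulation task.
within_bounds = lambda x, u: 1.0 if np.all(np.abs(x) < 1.0) else 0.0
near_origin   = lambda x, u: np.exp(-np.sum(x**2))
small_control = lambda x, u: np.exp(-np.sum(u**2))

r = drm_reward(np.array([0.2, -0.1]), np.array([0.05]),
               [within_bounds, near_origin, small_control])
```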
In order to comply with Eqs (3)-(5) and (11), we try to modify Eq (8) as

$$J(x_k, \underline{u}_k) = \sum_{i=k}^{\infty} \gamma^{\,i-k}\, r(x_i, u_i),$$

and we have

$$J(x_k, \underline{u}_k) = r(x_k, u_k) + \gamma\, J(x_{k+1}, \underline{u}_{k+1}).$$

Therefore, as long as $\underline{u}_k$ conforms to the system of Eq (6), it satisfies the optimality equation

$$J^{*}(x_k) = \max_{\underline{u}_k} J(x_k, \underline{u}_k).$$

Hence $J^{*}(x_k)$ can be obtained as

$$J^{*}(x_k) = \max_{u_k}\big\{ r(x_k, u_k) + \gamma\, J^{*}(x_{k+1}) \big\}.$$

So the optimal performance index function has another, action-value form:

$$Q^{*}(x_k, u_k) = r(x_k, u_k) + \gamma \max_{u_{k+1}} Q^{*}(x_{k+1}, u_{k+1}), \qquad J^{*}(x_k) = \max_{u_k} Q^{*}(x_k, u_k).$$

Recalling Eq (10), we get

$$J^{*}(x_k) = \min_{u_k}\big\{ U(x_k, u_k) + J^{*}(x_{k+1}) \big\} \;\Longleftrightarrow\; J^{*}(x_k) = \max_{u_k}\big\{ r(x_k, u_k) + \gamma\, J^{*}(x_{k+1}) \big\},$$

and the optimal single control law is

$$u_k^{*} = \arg\max_{u_k} Q^{*}(x_k, u_k).$$
Comparing Eqs (26) and (14), the problem of solving the optimal performance index function is transformed into the problem of maximizing the return function. Further, combined with Eq (18), we can try to use a wider range of reinforcement learning algorithms to solve the optimization problem of nonlinear systems.
For the convenience of analysis, the stability proof is made on the basis of the following assumptions 1 and 2.
Assumption 1: The discrete NCS represented by Eq (6) is controllable, and the system state $x_k = 0$ is an equilibrium state of the discrete NCS under the condition $u_k = 0$, i.e., $F(0, 0) = 0$. Assumption 2: The feedback control law $u_k = u(x_k)$ satisfies $u(0) = 0$ for $x_k = 0$; for all $x_k \in \Omega_x \subset \mathbb{R}^{n}$, $u(x_k)$ is continuous and stabilizes the discrete NCS on $\Omega_x$, and the performance index is finite. Assumption 1 ensures that the object to which we apply the deep reinforcement learning method is a controllable NCS, that the equilibrium point of the system is 0, and that the output of the system function $F(\cdot)$ is 0 there. Assumption 2 ensures that the control law is admissible and that its output is 0 at the equilibrium point, so as to meet the requirements of a controllable NCS, and that along the control sequence the performance index function is bounded. In this way, we have made restrictive assumptions about the system object and the control law.
According to the definition of $r(x_k, u_k)$, $r(x_k, u_k) \ge 0$, and $r(x_k, u_k)$ is a positive definite function for any $x_k$ and $u_k$. Let $\nu(x_k)$ be an arbitrary admissible control law. According to Q-learning [43], for $i = 0$, let $Q_0(x_k, u_k)$ be the initial iterative function constructed from $\nu(x_k)$, i.e.,

$$Q_0(x_k, u_k) = r(x_k, u_k) + \gamma\, Q_0\big(x_{k+1}, \nu(x_{k+1})\big).$$

Thus, the initial iterative function satisfies the following optimality equation:

$$V_0(x_k) = \max_{u_k} Q_0(x_k, u_k).$$

Then, we get the iterative control law as follows:

$$\nu_0(x_k) = \arg\max_{u_k} Q_0(x_k, u_k).$$

Let $Q_i(x_k, u_k)$ be the iterative function constructed from $\nu_{i-1}(x_k)$ for $i = 1, 2, \ldots$; then we obtain the following generalized optimality equation:

$$Q_i(x_k, u_k) = r(x_k, u_k) + \gamma \max_{u_{k+1}} Q_{i-1}(x_{k+1}, u_{k+1}).$$

Therefore, the iterative control law is updated by

$$\nu_i(x_k) = \arg\max_{u_k} Q_i(x_k, u_k).$$

Because the reward function $r(x_k, u_k)$ is a positive definite function of $x_k$ and $u_k$, under Assumptions 1 and 2 the iterative function $Q_i(x_k, u_k)$, $i = 0, 1, \ldots$, is positive definite in $x_k$ and $u_k$. Substituting $\nu_i(x_k)$ for $u_k$ and letting $V_i(x_k) = Q_i\big(x_k, \nu_i(x_k)\big)$ for $i = 0, 1, \ldots$, we get

$$V_i(x_k) = \max_{u_k} Q_i(x_k, u_k).$$

For $i = 0, 1, \ldots$, the function $V_i(x_k)$ is positive definite in $x_k$, so $V_i(x_k)$ is a Lyapunov function. Thus, $\nu_i(x_k)$ is a stable control law. As noted above, it is difficult to give a specific expression of the utility function matching the control objectives of an actual control system. Instead, we can reconstruct the reward function $r(x_k, u_k)$ to replace the utility function with the help of the DRM. Therefore, the key step in solving the nonlinear optimization problem is transformed into the problem of how to construct a reinforcement learning dynamic environment and design the reward function $r(x_k, u_k)$. According to Eqs (15) and (16), the determination of the evaluation functions is one of the important tasks of the DRM. Ng et al. [44] proposed the reward shaping theory: if a potential-based function is added to the reward function, the optimal strategy remains unchanged, so the reward can be modified without affecting the Markov decision process. The reward shaping method has been used in multi-agent reinforcement learning and is effective in robot collaborative-competitive operations and multi-objective tasks [45][46][47][48][49]. This approach gives us some inspiration, especially in the construction of the potential function and in how to avoid reward conflicts.
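To make the iterative scheme concrete, the sketch below runs a tabular version of the Q-update on a coarsely discretized state-action grid; the step and reward functions are hypothetical stand-ins for F(x_k, u_k) and the DRM reward r(x_k, u_k), and the one-dimensional example system is purely illustrative.

```python
import numpy as np

def q_iteration(states, actions, step, reward, gamma=0.95, sweeps=200):
    """Iterate Q_i(x,u) = r(x,u) + gamma * max_u' Q_{i-1}(x',u') on a grid."""
    states, actions = np.asarray(states), np.asarray(actions)
    Q = np.zeros((len(states), len(actions)))
    for _ in range(sweeps):
        Q_new = np.empty_like(Q)
        for i, x in enumerate(states):
            for j, u in enumerate(actions):
                x_next = step(x, u)
                i_next = int(np.argmin(np.abs(states - x_next)))  # nearest grid state
                Q_new[i, j] = reward(x, u) + gamma * Q[i_next].max()
        Q = Q_new
    return Q, Q.argmax(axis=1)   # greedy control law nu(x) = argmax_u Q(x, u)

# Hypothetical one-dimensional system x_{k+1} = 0.9 x_k + 0.1 u_k with a DRM-style reward.
Q, policy = q_iteration(np.linspace(-1, 1, 21), np.linspace(-1, 1, 5),
                        step=lambda x, u: 0.9 * x + 0.1 * u,
                        reward=lambda x, u: np.exp(-x**2) * np.exp(-0.1 * u**2))
```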

Experimental schemes based on reinforcement learning
Using the mathematical models above, the problem of obtaining a control method for a nonlinear system can be transformed into the problem of training the control strategies of one or more agents, establishing a value function, and obtaining the maximum reward. In this section, choosing the inverted pendulum system as the experimental object, we analyze its characteristics, establish the dynamic environment, design the reward function according to different control objectives, and then use three deep reinforcement learning algorithm models to conduct experiments. The experimental results are discussed at the end of the section.

Description of the inverted pendulum system model
In [50], a deep reinforcement learning algorithm model is used to balance the cart-pole in the inverted pendulum system, but the relevant control theory of the NCS is not involved, and the design of the reward function is not introduced in detail. The system function of the inverted pendulum system model in this paper is given in [51,52] by two coupled nonlinear differential equations, Eqs (33) and (34), for the cart position $x$ and the swing pole angle $\theta$, where $\dot{x}$ is the moving speed of the cart on the rail and $\dot{\theta}$ is the angular speed of the swing pole. The two nonlinear equations (33) and (34) of the inverted pendulum dynamic system represent a control-input-affine system and are rewritten in the following standard state-space form:

$$\dot{X} = f(X) + g(X)\, F, \qquad X = \big[x,\ \dot{x},\ \theta,\ \dot{\theta}\big]^{\top}, \tag{35}$$

where $F$ is the force applied to the cart.
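Because the concrete form and parameter values of Eqs (33)-(35) are those of [51,52], the environment sketch below instead uses the widely cited frictionless cart-pole dynamics with illustrative parameters (cart mass M, pole mass m, pole half-length l, gravity g, sampling period dt); it is an assumed stand-in for the paper's dynamic environment, not its exact model.

```python
import numpy as np

class CartPoleEnv:
    """Euler-discretized cart-pole dynamics for the state X = [x, x_dot, theta, theta_dot]."""
    def __init__(self, M=1.0, m=0.1, l=0.5, g=9.8, dt=0.02):
        self.M, self.m, self.l, self.g, self.dt = M, m, l, g, dt

    def step(self, X, F):
        x, x_dot, th, th_dot = X
        total = self.M + self.m
        tmp = (F + self.m * self.l * th_dot**2 * np.sin(th)) / total
        th_acc = (self.g * np.sin(th) - np.cos(th) * tmp) / \
                 (self.l * (4.0 / 3.0 - self.m * np.cos(th)**2 / total))
        x_acc = tmp - self.m * self.l * th_acc * np.cos(th) / total
        # Euler integration, consistent with Eq (6).
        return np.array([x + self.dt * x_dot,
                         x_dot + self.dt * x_acc,
                         th + self.dt * th_dot,
                         th_dot + self.dt * th_acc])

env = CartPoleEnv()
X_next = env.step(np.array([0.0, 0.0, 0.05, 0.0]), F=1.0)
```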

Reward function design
For Eq (35), let $X = [x,\ \dot{x},\ \theta,\ \dot{\theta}]^{\top}$, and set $\theta_{\max}$ as the maximum limit of the swing pole's deviation from the balance angle. Let $x_{\max}$ be the maximum distance the cart may deviate from the origin of the guide rail, $\dot{x}_{\max}$ the maximum speed of the cart, and $\dot{\theta}_{\max}$ the maximum rotation speed of the swing pole. The system state space and the first evaluation function $c_1(x_k, u_k)$ are defined by these limits: $c_1$ equals 1 while the state remains within the limits and 0 otherwise. The range of the output value of every evaluation function is [0, 1]. In order to describe the expected position of the cart on the rail and to guide the cart toward the equilibrium position of the rail with greater probability, we construct an evaluation function $c_2(x_k, u_k)$ of the cart displacement. From the perspective of control, at some positions of the rail the cart "has to" drive the swing pole into a better state space. We can therefore establish a further evaluation function to better guide the system into the equilibrium state: if the desired balance angle is zero in the equilibrium state, then at different positions of the rail we assign position-related dynamic balance angles to guide the movement of the cart, and the evaluation function $c_3(x_k, u_k)$ measures the deviation of the swing pole angle from this position-related balance angle. Notice that the output range of $c_3(\cdot)$ is [0, 1], and $\theta_{\max}$ is the maximum allowable deflection angle of the swing pole.
In the balancing process of the inverted pendulum system, not only is the system required to reach the equilibrium state within a limited time, but the energy consumed should also be considered: the energy consumed in the state transition between two adjacent time steps is expected to be minimal. Therefore, we construct an evaluation function $c_4(x_k, u_k)$ from the power consumption of the system, where $W(x_k, u_k)$ denotes the power consumption of the system under the control output $u_k$ in state $x_k$, and $\lambda(x_k)$ is a coefficient function of the state. Without considering external disturbances such as air resistance, the power consumption can be expressed in terms of the following energy changes: $\Delta E_0$ is defined as the energy loss after the system transitions to the next state when there is no control output in state $x_k$, and $\Delta E_{f0}$ is, in the same no-output case, the energy loss caused by system friction; $\Delta E_u$ is defined as the energy loss after the system transitions to the next state under the control output $u_k$ in state $x_k$, and $\Delta E_{fu}$ is the energy loss caused by system friction after the system transitions to the next state under the control output $u_k$ in state $x_k$. In the system represented by Eqs (33) and (34), friction is not considered, so $\Delta E_{f0} = \Delta E_{fu} = 0$. Therefore, in state $x_k$ the system energy is the sum of the kinetic energy of the cart, the kinetic energy of the swing pole, and the potential energy of the swing pole; according to the inverted pendulum model, the potential energy is proportional to $\cos\theta_k$, where $\theta_k$ is the swing pole angle in state $x_k$. Using these expressions, $\Delta E_0$ and $\Delta E_u$ are calculated, and then $W(x_k, u_k)$ is obtained. The coefficient function $\lambda(x_k)$ reflects the user's specific treatment of energy in combination with the current state: for example, the user may require the single-step energy to be minimized, or may establish the minimum theoretical value of the energy according to the current state and then require the single-step energy to approach this value as closely as possible, while ensuring that the final evaluation function lies between 0 and 1.
Note that in the control process of the system, $c_1(x_k, u_k)$ gives the maximum evaluation value 1 while the state remains within the allowed region, $c_2(x_k, u_k)$ is the evaluation function of the cart displacement in the state vector, $c_3(x_k, u_k)$ is the evaluation function of the cart position and the swing pole angle in the state vector, and $c_4(x_k, u_k)$ is the evaluation function built from the velocity and energy-related components of the state vector. The final evaluation effect is expressed as

$$r(x_k, u_k) = \prod_{j=1}^{4} c_j(x_k, u_k).$$

$r(x_k, u_k)$ is a nonlinear description that acts as a utility function, which is different from the description method of Eq (8). In other words, $r(x_k, u_k)$ reflects the user's specific control needs in a more detailed way. It is worth noting that users can add and modify specific control requirements via the evaluation functions, and the modification of one evaluation function does not affect the roles of the other evaluation functions. In the dynamic environment, the relationship between the evaluation functions and the reward function is that of multiple "referees": each referee is an evaluation function that gives a score between 0 and 1, and the reward function summarizes the referees' scores by multiplication. Each referee scores from its own viewing angle, and no two referees share the same viewing angle; in fact, these viewing angles are the observation perspectives shown in Figure 2. As mentioned above, no evaluation function may violate the control laws of the system, which means that each referee should guide and encourage the "performer", i.e., the control law: by following the guidance implied by one referee's high score, the control law can more easily achieve the final control effect, namely the effect recognized as best by all the referees.
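The analytic forms of c_1-c_4 are the authors' design choices; the sketch below only illustrates how four such "referees" with the ranges described above could be multiplied into a single reward. Every functional shape and constant in it (the exponential profiles, the limits X_MAX and TH_MAX, the hypothetical position-dependent balance angle) is an assumption for illustration.

```python
import numpy as np

X_MAX, TH_MAX = 0.2, 0.174        # illustrative rail limit (m) and angle limit (rad)

def c1(X, u):                      # rule check: 1 inside the allowed region, else 0 (veto)
    x, _, th, _ = X
    return 1.0 if abs(x) <= X_MAX and abs(th) <= TH_MAX else 0.0

def c2(X, u):                      # cart displacement: highest near the rail origin
    return np.exp(-(X[0] / X_MAX) ** 2)

def c3(X, u):                      # pole angle relative to a position-dependent balance angle
    balance = -0.1 * TH_MAX * (X[0] / X_MAX)     # hypothetical dynamic balance angle
    return np.exp(-((X[2] - balance) / TH_MAX) ** 2)

def c4(X, u):                      # energy: penalize large single-step control effort
    return np.exp(-0.01 * float(u) ** 2)

def reward(X, u):                  # product-form DRM reward of the four evaluations
    return c1(X, u) * c2(X, u) * c3(X, u) * c4(X, u)

print(reward(np.array([0.05, 0.0, 0.02, 0.0]), u=0.5))
```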

Deep reinforcement learning model analysis
In this paper, $r(\cdot)$ is regarded as the reward value of the dynamic environment, and deep reinforcement learning algorithms are designed to interact with the dynamic environment from the three perspectives of state value, policy, and the "Actor-Critic" framework. During the interaction process, the agent's execution strategy is optimized through the reinforcement learning algorithm to achieve the maximum reward value. Among them, the value-based deep reinforcement learning algorithm takes the Deep Q-Network (DQN) as an example, the policy-based algorithm takes the Monte Carlo policy gradient (REINFORCE) algorithm as an example, and the "Actor-Critic" framework takes the Deep Deterministic Policy Gradient (DDPG) as an example. The three deep reinforcement learning models are analyzed and described below.

Deep Q-learning
DQN is a value-based iterative reinforcement learning algorithm [53]. When the state and action spaces generated during the interaction between the agent and the environment are discrete and of low dimension, the Q-Learning method can be used to establish a state-action value table (Q table), whose Q values are updated according to the Bellman optimality equation. In the inverted pendulum system, the states of the cart and the swing pole are continuous and high-dimensional, so building a table with the Q-Learning method is very difficult and often causes the "curse of dimensionality". When the states and actions are high-dimensional and continuous, the table construction can be transformed into a function fitting problem: the Q value is generated by the fitted function instead of being read from a table, so that similar states obtain similar output actions. For function fitting with high-dimensional inputs, the advantage of deep neural networks in complex feature extraction can be used to map the input to the output. Therefore, the essence of the DQN method is to combine a deep neural network with Q-Learning, map the state-action value Q through the network, and update the agent's action strategy according to the Bellman optimality equation. Algorithm 1 shows the DQN algorithm flow.
In Algorithm 1, $s_t = (x_t, \theta_t, \dot{x}_t, \dot{\theta}_t)$ is the state of the inverted pendulum system at time $t$, where $x_t$ is the displacement of the cart, $\theta_t$ is the angle of the swing pole, $\dot{x}_t$ is the speed of the cart, and $\dot{\theta}_t$ is the angular speed of the swing pole. $r_t$ is the reward value obtained during the interaction with the dynamic environment at time $t$.
DQN records the tuple $(s_t, a_t, r_t, s_{t+1})$ generated by the agent in each state and stores it in the experience replay buffer. The deep neural network randomly samples tuples from the replay buffer as training data to update the network parameters. Randomly sampling previous experience from the replay buffer makes the network training more efficient and alleviates, to a certain extent, the problems of sample correlation and non-stationary distributions [54].
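A minimal replay buffer of the kind described above might look as follows; it is an illustrative sketch, not the paper's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) tuples and returns uniformly random mini-batches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        return map(list, zip(*batch))   # lists of states, actions, rewards, next states, dones

    def __len__(self):
        return len(self.buffer)
```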
DQN adopts two networks with the same structure but different parameters, namely the action-value network (Q-network) and the target action-value network (target Q-network). When only the action-value network is used to approximate the Q value, the update of the Q value is prone to oscillation, resulting in unstable learning behavior of the agent. The target action-value network reduces the correlation between the predicted Q value and the estimated Q value to a certain extent and thus improves the stability of the algorithm. During parameter updating, the Q-network always uses the latest parameters, and these parameters are copied to the target Q-network after a number of iterations. The agent maps the estimated Q values through the Q-network at state $s_t$, uses the epsilon-greedy strategy [55] to select an action according to these Q values, executes the selected action to interact with the environment, and stores the generated tuple $(s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer. The target Q-network takes $s_{t+1}$ as input and maps it to the predicted Q values; the sum of the maximum predicted Q value and the immediate reward is used as the TD target [56], and the mean square error between the TD target and the Q value is used as the loss function for updating the network parameters. The loss function between the target action-value function and the action-value function in DQN is

$$L(\theta) = \mathbb{E}\Big[\big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \big)^{2}\Big],$$

where $\theta$ and $\theta^{-}$ denote the parameters of the Q-network and the target Q-network, respectively.
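The TD target and loss just described can be sketched in PyTorch as follows; the network sizes, the 25-action output taken from Table 3, and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 25))        # 4 states -> 25 actions
target_q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 25))
target_q_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s_next, done):
    """One gradient step on L = (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s_t, a_t; theta)
    with torch.no_grad():
        td_target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```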

Reinforce: Monte Carlo Policy Gradient
In the DQN family of reinforcement learning algorithms, we mainly approximate the value function, learn the action-value function, and then adopt a greedy-like strategy to select the agent's output action according to the estimated action-value function. But the DQN family of algorithms has the following problems: 1) Algorithms based on value functions can only handle discrete actions and cannot handle continuous actions output by the agent.
2) Algorithms based on the value function cannot solve stochastic policy problems. The optimal strategy corresponding to the DQN family of methods is a deterministic strategy that takes the action with the maximum value as the optimal action. Aiming at these shortcomings of value-based reinforcement learning algorithms, a policy-based reinforcement learning algorithm, the policy gradient method, is introduced [57].
The following is the gradient formula of the state-value function under Monte Carlo sampling, where $\Pr(s \to s', k, \pi)$ is the probability of transitioning from state $s$ to state $s'$ in $k$ steps under policy $\pi$:

$$\nabla_{\theta} V^{\pi}(s) \propto \sum_{s'} \sum_{k=0}^{\infty} \Pr(s \to s', k, \pi) \sum_{a} \nabla_{\theta}\, \pi(a \mid s'; \theta)\, Q^{\pi}(s', a). \tag{50}$$

Eq (50) is converted as follows, where $\nabla_{\theta}$ is the gradient with respect to the policy parameters and $\mu(s)$ is the on-policy state distribution under $\pi$:

$$\nabla_{\theta} J(\theta) \propto \sum_{s} \mu(s) \sum_{a} \nabla_{\theta}\, \pi(a \mid s; \theta)\, Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[ Q^{\pi}(s, a)\, \nabla_{\theta} \ln \pi(a \mid s; \theta) \big].$$

The learning goal of the policy parameters is to maximize the value of $J(\theta)$, so the policy parameters are updated along the gradient of $J(\theta)$:

$$\theta \leftarrow \theta + \alpha\, \nabla_{\theta} J(\theta),$$

where $\alpha$ is the learning rate, and $w$ and $\theta$ denote the parameters of the action-value network and the policy network, respectively. In the Monte Carlo policy gradient, the sampled return $G_t$ is used to replace $Q^{\pi}(s_t, a_t)$ under the policy $\pi$. Therefore, the gradient can be written as

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\big[ G_t\, \nabla_{\theta} \ln \pi(a_t \mid s_t; \theta) \big],$$

where $G_t$ represents the return obtained from state $s_t$; it can be calculated by sampling, and its expected value equals the action value. Introducing this gradient into the parameter update formula, we get

$$\theta \leftarrow \theta + \alpha\, \gamma^{t}\, G_t\, \nabla_{\theta} \ln \pi(a_t \mid s_t; \theta).$$

Algorithm 2 shows the process of the Monte Carlo policy gradient (REINFORCE) algorithm. In Algorithm 2, $s_t$ represents the state of the inverted pendulum system (cart displacement, pole angle, cart speed, and pole angular speed), and $r_t$ represents the immediate reward obtained at the state transition, which is equivalent to $r(x_k, u_k)$.
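A compact PyTorch sketch of the resulting REINFORCE update; the two-action policy network, the learning rate, and the discounting details are illustrative assumptions.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """theta <- theta + alpha * G_t * grad log pi(a_t | s_t) over one sampled episode."""
    returns, G = [], 0.0
    for r in reversed(rewards):                 # Monte Carlo returns G_t
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(returns * chosen).mean()           # gradient ascent on E[G_t * log pi]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```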

Deep deterministic policy gradient
According to the output action characteristics of the agent in reinforcement learning, strategies can be divided into deterministic strategies and stochastic strategies. A deterministic policy means that the action output by the agent through the policy in a certain state is deterministic. A stochastic policy means that the agent outputs, through the policy, a probability over multiple actions in a certain state and chooses the action to execute according to this probability. In the DQN family of methods, the greedy strategy is used to select the action corresponding to the maximum Q value, which is a greedy deterministic strategy. When the action set output by the agent is continuous-valued or discrete-valued with very high dimensionality, using a stochastic strategy to represent the probability of each action greatly increases the amount of computation. The Deep Deterministic Policy Gradient (DDPG) algorithm is often used to solve the action selection problem in high-dimensional continuous action spaces [58]. The algorithm is built on the Actor-Critic framework and makes full use of DQN and policy gradients. The DDPG algorithm uses a training network and a target network in the design of the network structure. The target network reduces the correlation between the predicted value and the estimated value to a certain extent and improves the stability of the algorithm. Training the networks with samples from the experience replay buffer minimizes the correlation between samples.
The DDPG algorithm mainly includes the following parts: ① The network includes two parts, Actor and Critic, each of which is composed of a training network and a target network. ② An experience replay buffer is introduced to store the data of the interaction between the agent and the environment, and soft updates are used to update the parameters of the target networks to ensure the stability of the training networks [59]. ③ Random noise is added to the Actor network output to ensure that the agent has a certain exploration ability when selecting actions [60]. Algorithm 3 shows the process of the DDPG algorithm.
In Algorithm 3, $\mathcal{N}$ is random noise, $\mu(s \mid \theta^{\mu})$ is the Actor network, $Q(s, a \mid \theta^{Q})$ is the Critic network, $\mu'$ is the target Actor network and $Q'$ is the target Critic network. The agent obtains the execution action through the Actor network at state $s_t$, adds noise to the action, and interacts with the environment to generate an immediate reward $r_t$, after which the state is updated to $s_{t+1}$. The agent stores the tuple $(s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer. When the replay buffer holds a sufficient amount of data, batches are sampled from it to train the Actor network $\mu$. The Critic network $Q$ needs to evaluate the actions made by $\mu$; therefore, the Actor network parameters $\theta^{\mu}$ are optimized with the deterministic policy gradient method, and the network parameters are copied to the target networks by soft updates. The agent keeps repeating the above process until the policy network converges.
The target action value in the DDPG algorithm can be expressed as

$$y_t = r_t + \gamma\, Q'\big(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big).$$

The loss function of the Critic network is

$$L(\theta^{Q}) = \frac{1}{N} \sum_{i} \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^{2}.$$

Using the deterministic policy gradient to update the policy network parameters $\theta^{\mu}$ can be expressed as

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}.$$
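The target value, Critic loss, and soft update above can be sketched in PyTorch as follows; the network sizes, tau, the learning rates, and the one-dimensional action are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))    # input: state (4) + action (1)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.005

def ddpg_update(s, a, r, s_next, done):
    # Critic: minimize (y_t - Q(s_t, a_t))^2 with y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_critic(torch.cat([s_next, target_actor(s_next)], 1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], 1)).squeeze(1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: deterministic policy gradient, maximize Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft update of the target networks: theta' <- tau * theta + (1 - tau) * theta'.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
    return critic_loss.item(), actor_loss.item()
```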

Comparison and analysis of experimental results
This section describes and analyzes the experimental results. We use three reinforcement learning algorithms to train the inverted pendulum control model, and analyze the differences between the different algorithms based on their training data and test data (Section 4.4.1). Finally, we discuss the current research deficiencies and propose future research directions (Section 4.4.2).

Deep reinforcement learning algorithm validation and performance evaluation
This section compares the experimental results of the three deep reinforcement learning algorithms (DQN, policy gradient, DDPG) in the dynamic environment of the inverted pendulum system, including the impact of the DRM settings on the experimental accuracy and the differences between the algorithms. The experiment is divided into two parts: i) the influence of the detail-reward function settings on the experimental accuracy in the training and testing phases under the three deep reinforcement learning algorithms; ii) the differences between the three deep reinforcement learning algorithms under the same detail-reward function. In the dynamic environment of the inverted pendulum system, the initial states of the system in the training and testing phases are shown in Table 1. Tables 2 and 3 show the reinforcement learning parameters and network parameters of the DQN algorithm in the inverted pendulum system.

1) Experimental comparison of reward functions with different details under the same algorithm
In Table 3, we set the number of outputs of the action-value Q-network to 25. It can be seen from [61] that the richer the choice of control vector $u_k$, the higher the controllability, but if the action set is too large, the convergence of the network model in the training process becomes worse. Therefore, a reasonable choice of the number of action values output by the Q-network can improve the controllability of the inverted pendulum system without affecting the training effect. We first give an illustration of the agent training rules for the following experiments. An episode is one round of agent training, and in each episode the maximum number of steps the agent is allowed to learn is 300. The setting range of the cart position is [−0.2, 0.2], and the setting range of the swing pole angle is [−0.174, 0.174]. During the training of the agent, as long as the cart and the swing pole stay within the set ranges, the agent continues to learn in this episode: the step value is increased by 1 and the instant reward is added to the reward sum; otherwise the learning round is terminated. The sums of steps and rewards under each episode are recorded. Figure 4 shows the training and testing process of the inverted pendulum system using the DQN algorithm in the dynamic environment; Figure 4(a) shows the number of admissible control actions output by the agent under each episode in the training process. As can be seen from Figure 4, the setting of the DRM affects the reinforcement learning effect of the inverted pendulum. In Figure 4(a),(b), when a single-reward function (i.e., $c_1$ alone) is used and the episode number is less than 400, the number of steps meeting the rule is less than 50 and the reward value under each episode is less than 50. When the number of episodes is greater than 400, the reward value gradually increases, but the reward distribution curve fluctuates greatly, the generalization of the trained model is weak, and the agent cannot learn useful experience well. When the detail-reward function (i.e., $\prod_j c_j$) is used, the number of steps meeting the rule is proportional to the number of episodes. In Figure 4(b), when the episode number is greater than 200, the training process tends to converge, the reward value is close to 250, and the reward distribution curve is relatively flat.
In Figure 4(c),(d), at step 15 with the single-reward function, the position of the cart is 0.01 and the swing pole angle is 0.33, which means that the agent's control fails. After 40 steps with the detail-reward function, the position of the cart is close to zero, the swing pole angle is close to zero, and the position and angle curves are relatively gentle.
The above analysis of the distribution curves of steps, rewards, cart position, and swing pole angle in Figure 4 shows that when the DQN algorithm is used to train an agent in the dynamic environment of the inverted pendulum system, the experimental effect of the detail-reward function is better than that of a single-reward function. Tables 4 and 5 show the reinforcement learning parameters and network parameters of the policy gradient algorithm in the inverted pendulum system, and Figure 5 shows the corresponding training and testing process in the dynamic environment. As can be seen from Figure 5(a),(b), the training process converges faster when a single-reward function is used: when the episode number equals 3000, the step count and the reward both reach the maximum value of 300. When the detail-reward function is used, the training process converges more slowly and the reward rises more gently: when the number of episodes is 14,000, the step count and reward reach their maximum values of 300 and 250, respectively. According to Eq (37), a single-reward function means that only the evaluation function $c_1(\cdot)$ plays a role, and its values are only 0 and 1, whereas the final value range of the detail-reward function is [0, 1]; therefore, the maximum reward values in the two cases differ. In area A2 of Figure 5(c), when the detail-reward function is used, the position curve of the cart is relatively flat, with a minimum position of −0.05 and a maximum position of 0.04, while with a single-reward function the cart position curve oscillates greatly, with minimum and maximum positions of −0.077 and 0.05, respectively. It can be seen from area A1 of Figure 5(d) that when the detail-reward function is used, the swing pole angle varies within [−0.05, 0.05]; with the single-reward function, the range is [−0.07, 0.06]. The data illustrate that the swing angle range of the former is smaller than that of the latter.
By analyzing the distribution curves of step, reward, cart position and swing pole angle in Figure 5, we find that when an agent is trained with the policy-gradient algorithm, the single-reward function requires fewer training rounds than the detail-reward function, but in the test phase the experimental effect of the detail-reward function is better than that of the single-reward function. Tables 6 and 7 show the reinforcement learning parameters and network parameters of the DDPG algorithm in the inverted pendulum system, and Figure 6 shows the training and testing process using the DDPG algorithm in the dynamic environment. It can be seen from Figure 6(a),(b) that with the single-reward function, when the episode number is less than 180, the number of steps meeting the rule is less than 100 and the reward value under each episode is less than 50; as the episodes increase, the step and reward curves fluctuate greatly. In contrast, with the detail-reward function, when the episode number is greater than 180, the reward curve is relatively flat and the reward under each episode is around 210.
In Figure 6(c),(d), at step 15 with the single-reward function, the cart is located at 0.05 in the positive direction and the swing pole angle is 0.3, which means that the agent's control fails. After 200 steps with the detail-reward function, both the cart position and the swing pole angle are close to zero, and the position and angle curves are relatively gentle.
From the above analysis it can be concluded that, in the process of using the DDPG algorithm to train an agent, the reward obtained with the single-reward function is in most cases greater than that obtained with the detail-reward function. However, the reward curve under the single-reward function fluctuates greatly and the model stability is poor; in addition, the model trained with the single-reward function loses control during testing. The experimental effect of using the detail-reward function is better than that of using the single-reward function. Through the analysis and comparison of the above experimental results, the following conclusions can be drawn: i) using the DRM in the inverted pendulum system under deep reinforcement learning is effective and innovative; ii) a good design of the evaluation functions is crucial for the training and testing process, and under different algorithms the detail-reward function produces better experimental results than a single-reward function. Figures 7 and 8 compare, in the testing phase, the models generated by the different deep reinforcement learning algorithms trained under the same detail-reward function. Figure 7 shows the distribution curves of the cart position at each step of the inverted pendulum system when the DRM is applied. After 50 steps, the model generated by the DQN algorithm converges to a point close to 0, the model generated by the policy-gradient algorithm fluctuates in the range [−0.06, 0.05], and the model generated by the DDPG algorithm shows the cart position gradually converging to near zero as the number of steps increases. The right part of Figure 7 is an enlarged view of area B; it shows that the model generated by the policy-gradient algorithm fluctuates greatly, i.e., the cart position fluctuates wildly over a range of more than 0.06, while the models generated by the DQN and DDPG algorithms have a small fluctuation range, both meaning that the cart stays reasonably near the equilibrium point. Figure 8 shows the distribution curves of the swing pole angle at each step under the models generated by the three algorithms with the DRM applied. After 50 steps, the model generated by the DQN algorithm converges to near zero during testing, the model generated by the policy-gradient algorithm fluctuates in the range [−0.05, 0.05], and the model generated by the DDPG algorithm gradually converges to zero as the steps increase. The right part of Figure 8 is an enlarged view of area A: after 200 steps, the models generated by DDPG and DQN guide the swing pole angle to oscillate around 0, while the fluctuation range of the swing pole angle under the policy-gradient model is [−0.06, 0.06], larger than that of the other two models. By analyzing the distribution curves in Figures 7 and 8, the following conclusions can be drawn: i) the system stability of the policy-gradient algorithm during testing is weaker than that of DQN and DDPG, and the cart position and the swing pole angle are difficult to converge to zero; ii) during testing of the DQN algorithm, the cart position and the swing pole angle distributions are both close to zero; however, the output of the DQN algorithm is a discrete action.
In a continuous control system, the frequent output of discrete actions affects the stability of the system, and on a physical platform, large pulsed current outputs can greatly reduce the lifetime of electronic devices. iii) The DDPG algorithm outputs continuous actions. Compared with the DQN algorithm, its convergence rate is slower; during testing, as the number of steps increases, the cart position and the swing pole angle distributions gradually converge to zero. In the inverted pendulum system, the output of the DDPG model can ensure the stability of the system. Comparing the three deep reinforcement learning algorithms, DDPG, as a continuous-action-output algorithm, can be applied to the inverted pendulum control system and achieves good experimental results when the DRM is introduced into the algorithm.

Insufficient research and future work
According to the characteristics of the inverted pendulum NCS, deep reinforcement learning with the DRM can solve the optimal control problem of the system. Through the above experiments, the control accuracy of the inverted pendulum system under different reward functions has been compared. At present, this paper has the following deficiencies in its experimental details: 1) This paper only introduces the DRM into the inverted pendulum system and does not apply it to more complex nonlinear systems. 2) This paper presents a novel DRM method and establishes the evaluation functions and the reward function by taking the inverted pendulum system as an example, but unified rules for establishing the evaluation functions are not given, i.e., the design still depends on the designer's personal judgment.
3) The parameters of the inverted pendulum system used have limitations. For example, we only set the angle of the swing pole in the range [−0.174, 0.174] and did not study the control problem over a broader state space. 4) When designing the dynamic environment of the inverted pendulum system, the unstable disturbances of a real inverted pendulum experimental platform are not considered, which limits the deployment of the agent models on real physical platforms.
Future work will improve on the following aspects: 1) We will apply the DRM to more nonlinear systems and summarize the applicability of the method in a wider range of experimental settings. 2) The construction of an evaluation function is a reward realization for a specific observation perspective; in the course of experiments in more nonlinear environments, the validity of the evaluation functions and their design method will be summarized. 3) We will increase the control range of the swing pole angle in the inverted pendulum system and use deep reinforcement learning algorithms to solve the global optimal control problem. 4) We will design the target policy and the behavior policy separately according to the importance-sampling off-policy learning theory in deep reinforcement learning. The behavior policy is the policy optimized during training in the dynamic environment, and the target policy is the control policy of the real experimental platform; the behavior policy is used to sample data in the dynamic environment, and these data are then used to estimate the target policy for the real experimental platform, so as to realize the optimal control of the real system.

Conclusions
In this paper, a reward mechanism based on control details is proposed for the first time. According to the specific requirements of the control objectives, the state space is divided into different observation perspectives, and an evaluation function is designed for each of them. The final reward function is then composed of the evaluation functions in multiplicative form and replaces the utility function in the HJB equation. Based on Lyapunov stability theory and Q-learning, the stability of the control law is proved. Then, taking the nonlinear inverted pendulum system as the experimental object, the detail-reward mechanism is introduced into deep reinforcement learning algorithms. After designing the reward function according to the detail-reward mechanism, the effects of different reward functions on the experimental accuracy are compared under three algorithms: DQN, policy gradient and DDPG. The effectiveness of the method is demonstrated by the experiments, and the most suitable reward function and deep reinforcement learning algorithm model for the inverted pendulum system are determined. We will carry out the following work in the future: 1) apply the detail-reward mechanism to other deep reinforcement learning algorithms for verification, such as A3C, TD3 and PPO; 2) apply the detail-reward mechanism to more nonlinear control systems, such as multi-agent and robot motion control systems; 3) optimize the models generated by the deep reinforcement learning algorithms to improve the control effect.