Article

An Improved Proximal Policy Optimization Method for Low-Level Control of a Quadrotor

1 School of Electronic and Information, Jiangsu University of Science and Technology, Zhenjiang 212100, China
2 College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
* Author to whom correspondence should be addressed.
Actuators 2022, 11(4), 105; https://doi.org/10.3390/act11040105
Submission received: 4 March 2022 / Revised: 26 March 2022 / Accepted: 2 April 2022 / Published: 6 April 2022
(This article belongs to the Special Issue Intelligent Control of Flexible Manipulator Systems and Robotics)

Abstract

In this paper, a novel deep reinforcement learning algorithm based on Proximal Policy Optimization (PPO) is proposed to achieve fixed-point flight control of a quadrotor. The attitude and position information of the quadrotor is directly mapped to the PWM signals of the four rotors through neural network control. To constrain the size of policy updates, a PPO algorithm based on a Monte Carlo approximation is proposed to obtain the optimal penalty coefficient. A policy optimization method with a penalized point probability distance preserves policy diversity at each policy update. The new proxy objective function is introduced into the actor–critic network, which addresses the problem of PPO falling into local optima. Moreover, a compound reward function is presented to accelerate the gradient algorithm along the policy update direction by analyzing the various states that the quadrotor may encounter in flight, which improves the learning efficiency of the network. The simulation tests the generalization ability of the offline policy by changing the wing length and payload of the quadrotor. Compared with the PPO method, the proposed method has higher learning efficiency and better robustness.

1. Introduction

Over the past decade, quadrotor unmanned aerial vehicles (UAVs) have attracted considerable interest from both academic research and engineering applications. With features such as vertical take-off and landing, simple structure, and low cost, they have been successfully applied in military and civil fields such as military monitoring, agricultural services, industrial inspection, atmospheric measurement, and disaster aid [1,2,3,4,5]. However, the quadrotor UAV is an unstable, nonlinear, and highly coupled complex system. Furthermore, external disturbances and structural uncertainties always exist in practical quadrotors affected by wind gusts, sensor noise and unmodelled dynamics. Therefore, all these factors demand an accurate and robust controller for the quadrotor to achieve stable flight.
An autonomous GNC system includes three subsystems of guidance, navigation and control, and it undertakes all the motion control tasks of an aerial vehicle from take-off to return. The state vector of the quadrotor usually consists of position coordinates, a velocity vector and attitude angles. The navigation system is responsible for state perception and estimation. The guidance system generates state trajectory commands for the quadrotor, while the control system maintains stable control to follow the trajectory. Research on the quadrotor flight control system is usually divided into two levels: one is the low-level inner-loop control layer, which is mainly used for the simple motion control and stabilization of the quadrotor, and the other is the higher-level outer-loop coordination layer, which handles navigation, path planning and other strategic tasks. To achieve stable control and target tracking of the quadrotor, various control policies have been developed. Traditional control theory methods, such as PID control, often have very high requirements for parameter adjustment and precise preset models. Moreover, the accuracy of the controller is greatly affected by complex environments. Therefore, many advanced control policies have been proposed to solve control problems in complex environments, such as feedback linearization control [6], adaptive control [7], model predictive control [8], immersion and invariance control [9], sliding mode control [10], adaptive neural-network control [11,12], backstepping control [13], the active disturbance rejection method [14], and so on. However, the effectiveness and robustness of most of these techniques mainly depend on the accuracy of the dynamic model. Although some advanced algorithms have considered the uncertainty and disturbances of the quadrotor system, they are difficult to implement in real time due to complex control policies.
Reinforcement learning (RL) algorithms have been used with promising results in a large variety of decision-making tasks, including control problems [15]. Compared with classic control techniques, RL is a learning algorithm that directly learns from interaction with the system and improves policies without making any assumptions on the dynamic model [16]. Many complex quadrotor decision-making problems have been solved with RL technology. In [17], an obstacle avoidance RL method combined with a recurrent neural network with temporal attention is proposed to deal with cluttered environments. In [18], UAVs successfully navigate to static and dynamic goals through an RL method with a customized reward mechanism. In [19], line of sight and an artificial potential field are introduced in the reward function to guide the UAV to perform the target tracking task. In [20], an RL path planning method based on global situational information demonstrates the excellent performance of UAVs in radar detection and missile attack environments.
In addition to high-level guidance and navigation tasks, RL has also been used for low-level stable motion control. At this level, the complexity of the mission lies in the complex dynamics of the quadrotor and its vulnerability to unknown effects such as disturbances and sensor noise [21]. In this paper, we focus on low-level motion control of the quadrotor based on a fast-response, robust RL controller. In [22], a stochastic nonlinear model of helicopter dynamics was fitted and successfully applied to autonomous flight control through RL for the first time. In [23], the locally weighted linear regression method was first used to approximate the quadrotor model as a Markov Decision Process (MDP) in order to realize a continuous state-action space RL controller. In [24], deep neural networks were used in RL as a powerful value function approximator to deal with complex dynamics. In [25], a low-level controller generated by deep RL implements the basic hover control and tracking tasks on a real quadrotor firmware. In order to solve continuous state-action control problems, newly developed algorithms such as Asynchronous Advantage Actor-Critic (A3C), Twin Delayed Deep Deterministic Policy Gradient (TD3), Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) have been proposed [26,27,28,29] to optimize performance on high-dimensional continuous control problems. These algorithms have also been gradually extended to the flight control problem of quadrotors and other complex nonlinear systems [30,31,32,33].
PPO is an advanced policy gradient algorithm that effectively alleviates the low learning efficiency caused by the step-size sensitivity of traditional policy gradient algorithms. The main advantages of the PPO algorithm for training control policies are as follows. Firstly, in [34], the hyper-parameters of PPO were shown to be robust when training various tasks, and PPO can achieve an optimal balance between control accuracy and algorithm complexity. Secondly, in [35], a comparison of performance indicators showed that the control policy trained by PPO was superior to other RL algorithms on every metric; it is the best performing algorithm for controlling the attitude of a quadrotor. Then, in [36], position control of a “model-free” quadrotor was successfully realized through the PPO algorithm. In [37], considering the full six-DoF system dynamics of the UAV, PPO is used to train quadrotor control policies, achieving the basic control task of stable hovering. The RL integrated controller designed in [38] solved advanced tasks such as autonomous landing in actual flight for the first time. Moreover, some improved algorithms have been presented to improve the robustness and tracking accuracy of the controller. In [39], a state integrator was introduced into the actor–critic framework, and the PPO-IC algorithm was proposed to reduce the steady-state error of the system.
However, most RL methods are designed only for specific control environments. Further research is still needed to design an RL algorithm with fast response and a stable policy for the flight control system. As far as RL policies on quadrotors are concerned, many problems remain unsolved. They are summarized as follows. Firstly, the quadrotor UAV is an underactuated nonlinear system with multiple inputs and outputs. For such a complex system, PPO is prone to insufficient exploration and slow convergence, especially with poorly initialized policies [40]. Secondly, the reward function plays an important role in RL [41], and most reward function settings cannot achieve effective exploration when training the control policy.
Aiming at the above problems, an improved quadrotor control policy based on PPO is proposed in this paper. Firstly, in [35], PPO achieved the best performance on quadrotor attitude control among all baseline RL algorithms. Inspired by this, we introduce a penalized point probability distance to constrain the probability ratio between different policies, thereby improving exploration efficiency. Secondly, we verify that the improved PPO algorithm has better control performance and training speed for both attitude and position. Moreover, motivated by the exploration of new reward signals mentioned in [36] and [39], a compound reward function is proposed so that the training process converges faster to the control requirements and minimizes the steady-state error. The main contributions of this paper are summarized as follows:
In this paper, an improved quadrotor control strategy based on PPO is proposed.
(1) In the objective function of the PPO algorithm, a penalized point probability distance based on a Monte Carlo approximation is introduced to replace the KL divergence, eliminating the strict penalty incurred when the action probabilities do not match. This optimizes the decision-making of the quadrotor when training the control policy. The new policy optimization algorithm helps to stabilize the learning process of the quadrotor and promotes exploration, which makes it remarkably robust to model parameter variations.
(2) For actual flight control, a compound reward function is designed to replace the single reward function and prevent the training of the decision network from falling into a local optimum. With the defined reward function, the improved PPO is applied to the quadrotor environment to train the policy network.
The organization of this article is as follows. In Section 2, the nonlinear model of the quadrotor is established, and a theoretical overview of RL is provided. In Section 3, after analyzing the PPO algorithm, the policy optimization algorithm and the reward function are improved. The details and results of the simulation experiments are discussed in Section 4. The conclusion is given in Section 5.

2. System Statement

The purpose of this section is to develop an RL method that can solve the fixed-point flight control problem of the quadrotor. Moreover, the method can meet the requirements for pinpoint flying and hovering based on defined rewards.

2.1. Dynamic Model of Quadrotor

A dynamical model of the quadrotor is set up by the earth-frame I(Oxyz) and the body-frame B(Oxyz) as illustrated in Figure 1.
The position and attitude of the quadrotor expressed in the inertial frame are defined as P = [x, y, z]^T and Θ = [φ, θ, ψ]^T, where φ, θ, ψ are the roll, pitch and yaw angles, respectively. $\dot{p}$ and $\ddot{p}$ denote the velocity and acceleration of the quadrotor.
The transformation matrix R is used to transform the thrust force from the body-fixed coordinate system to the inertial coordinate system, which is
$$R = \begin{bmatrix} C_\psi C_\theta & C_\psi S_\theta S_\phi - S_\psi C_\phi & C_\psi S_\theta C_\phi + S_\psi S_\phi \\ S_\psi C_\theta & S_\psi S_\theta S_\phi + C_\psi C_\phi & S_\psi S_\theta C_\phi - C_\psi S_\phi \\ -S_\theta & C_\theta S_\phi & C_\theta C_\phi \end{bmatrix},$$
where $S_{(\cdot)}$ and $C_{(\cdot)}$ denote $\sin(\cdot)$ and $\cos(\cdot)$, respectively.
The thrust generated by the four motors is defined as Ti. In the body coordinate system, the thrust of the body is vertical-upward, which can be expressed as:
$$T_i = b u_i, \quad i = 1, 2, 3, 4,$$
where b is the thrust gain, and ui is the normalized control input.
The quadrotor dynamics model is established from the dynamic characteristics of torque-driven rotational motion and force-driven translational motion. For the rotational motion, with Euler's equation for a rigid body, the sum of the torques applied to the quadrotor can be expressed as
$$M = M_\tau + M_c + M_f = I\dot{w} + w \times I w,$$
where I is the diagonal inertia matrix of the quadrotor, $w = [\dot{\phi}, \dot{\theta}, \dot{\psi}]^T$ is the angular velocity of the quadrotor, and $M_\tau = [\tau_\phi, \tau_\theta, \tau_\psi]^T$ is the control torque given by:
$$M_\tau = \begin{bmatrix} \tau_\phi \\ \tau_\theta \\ \tau_\psi \end{bmatrix} = \begin{bmatrix} L(T_2 - T_4) \\ L(T_1 - T_3) \\ k(T_1 - T_2 + T_3 - T_4) \end{bmatrix},$$
where L is the distance from the center of mass to each rotor, the control torque $\tau_\psi$ along the z-axis is the sum of the reaction torques generated by the four rotors, and k is the damping coefficient. The gyroscopic effect of the four rotors is $M_c = [I_p\dot{\theta}\Omega, I_p\dot{\phi}\Omega, 0]^T$, where $I_p$ is the rotor moment of inertia and Ω is the disturbance effect from each rotor. The drag torque in quadrotor flight is $M_f = [d_\phi\dot{\phi}, d_\theta\dot{\theta}, d_\psi\dot{\psi}]^T$, where $d_\phi$, $d_\theta$ and $d_\psi$ are the drag coefficients about the three axes.
For translational motions, the motion equation can be obtained from Newton’s second law:
$$\ddot{p} = \frac{R F_l}{m} - \frac{F_d}{m} - [0, 0, g]^T,$$
where $F_l = [0, 0, T_1 + T_2 + T_3 + T_4]^T$ is the thrust vector and $T_z = T_1 + T_2 + T_3 + T_4$; $F_d = [d_x\dot{x}, d_y\dot{y}, d_z\dot{z}]^T$ is the aerodynamic drag, where $d_x$, $d_y$ and $d_z$ are the resistance coefficients; m is the mass of the quadrotor; and g is the acceleration of gravity. Finally, the quadrotor dynamics can be expressed as:
$$\begin{aligned} \ddot{x} &= [T_z(C_\phi S_\theta C_\psi + S_\phi S_\psi) - d_x\dot{x}]/m \\ \ddot{y} &= [T_z(C_\phi S_\theta S_\psi - S_\phi C_\psi) - d_y\dot{y}]/m \\ \ddot{z} &= [T_z C_\phi C_\theta - d_z\dot{z} - mg]/m \\ \ddot{\phi} &= [\tau_\phi - I_p\dot{\theta}\Omega - d_\phi\dot{\phi} + \dot{\theta}\dot{\psi}(I_y - I_z)]/I_x \\ \ddot{\theta} &= [\tau_\theta - I_p\dot{\phi}\Omega - d_\theta\dot{\theta} + \dot{\phi}\dot{\psi}(I_z - I_x)]/I_y \\ \ddot{\psi} &= [\tau_\psi - d_\psi\dot{\psi} + \dot{\phi}\dot{\theta}(I_x - I_y)]/I_z \end{aligned}$$
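To make the model concrete, the following minimal Python sketch integrates the dynamics of Eq. (6) with a simple forward-Euler step. It is an illustrative implementation only: the function name, the state ordering and all default numeric parameter values are assumptions, not the values used in the paper.

```python
import numpy as np

def dynamics_step(state, u, dt, m=1.0, g=9.81, b=10.0, L=0.31, k=0.05,
                  I=(0.02, 0.02, 0.04), Ip=0.0, Omega=0.0,
                  d_pos=(0.01, 0.01, 0.01), d_att=(0.01, 0.01, 0.01)):
    """One forward-Euler step of the quadrotor dynamics in Eq. (6).

    state: [x, y, z, vx, vy, vz, phi, theta, psi, dphi, dtheta, dpsi]
    u:     normalized rotor commands u1..u4 in [0, 1]
    All default parameter values are illustrative, not the paper's.
    """
    x, y, z, vx, vy, vz, phi, th, psi, dphi, dth, dpsi = state
    S, C = np.sin, np.cos
    Ix, Iy, Iz = I
    T = b * np.asarray(u, dtype=float)           # rotor thrusts T_i = b * u_i
    Tz = T.sum()                                 # total thrust
    tau_phi = L * (T[1] - T[3])                  # roll torque
    tau_th = L * (T[0] - T[2])                   # pitch torque
    tau_psi = k * (T[0] - T[1] + T[2] - T[3])    # yaw (reaction) torque
    # translational accelerations
    ax = (Tz * (C(phi) * S(th) * C(psi) + S(phi) * S(psi)) - d_pos[0] * vx) / m
    ay = (Tz * (C(phi) * S(th) * S(psi) - S(phi) * C(psi)) - d_pos[1] * vy) / m
    az = (Tz * C(phi) * C(th) - d_pos[2] * vz - m * g) / m
    # rotational accelerations
    aphi = (tau_phi - Ip * dth * Omega - d_att[0] * dphi + dth * dpsi * (Iy - Iz)) / Ix
    ath = (tau_th - Ip * dphi * Omega - d_att[1] * dth + dphi * dpsi * (Iz - Ix)) / Iy
    apsi = (tau_psi - d_att[2] * dpsi + dphi * dth * (Ix - Iy)) / Iz
    deriv = np.array([vx, vy, vz, ax, ay, az, dphi, dth, dpsi, aphi, ath, apsi])
    return np.asarray(state, dtype=float) + dt * deriv
```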

2.2. Quadrotor Control Based on Reinforcement Learning

The goal of RL is to find an optimal policy for an agent interacting with a certain environment so as to maximize the total reward over time. It uses the formal framework of the Markov Decision Process (MDP) to define the interactions between the learning agent and the environment [42]. The environment is usually modelled as an MDP described by a four-tuple (S, A, P, R), where S and A are the state set and action set, respectively, and P and R are the state transition probability function and the reward function.
According to the interaction between the agent and the environment, the policy πθ is updated as:
$$L(\pi_\theta) = \mathbb{E}_{s_0, s_1, \ldots}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k}\right] = \int_{s} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)\, ds,$$
where θ is the policy parameter, γ ∈ [0, 1) is the discount factor, $\pi_\theta$ is a stochastic policy, $d^{\pi_\theta}(s)$ is the discounted state distribution under $\pi_\theta$, and $V^{\pi_\theta}(s) = \mathbb{E}_{s_{t+1}, s_{t+2}, \ldots}\left[\sum_{t'=t}^{\infty}\gamma^{t'-t} r_{t'} \mid s_t = s, \pi_\theta\right]$.
In RL, the expected reward function of the state-action (st,at) generated by policy π is called the action-value function, which is determined as:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[R(s_t, a_t) + \gamma V^{\pi_\theta}(s_{t+1})\right].$$
Its output represents the value of taking a specific action in a specific state and following this policy thereafter. Based on the baseline function $V^{\pi_\theta}(s)$, the policy gradient can be written as:
$$\nabla_\theta L(\pi_\theta) = \mathbb{E}_{s\sim\rho^{\pi_\theta}}\left[\nabla_\theta \pi_\theta(s)\, \nabla_a A^{\pi_\theta}(s, a)\big|_{a=\pi_\theta(s)}\right],$$
where $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$ is called the advantage function, whose value represents the advantage of taking a certain action over the current policy $\pi_\theta(s)$, and $\rho^{\pi_\theta}$ is the state distribution induced by the policy $\pi_\theta$.
For the quadrotor control problem, the main goal is to seek an appropriate control policy that drives the quadrotor to a predefined state stably and rapidly. The quadrotor dynamics are converted into MDP form, and appropriate states and actions are selected to satisfy the Markov property. The quadrotor control structure based on RL is shown in Figure 2. S is the current position and attitude information of the quadrotor, A is the control input of the quadrotor, π is the policy distribution of RL, and R is the reward function set for the task requirements. Through the interaction between the controller and the environment, the RL algorithm combined with the reward function can finally obtain the optimal policy for the quadrotor.
Among the basic RL algorithms, the policy gradient method is the most suitable because it is compatible with continuous states and actions. The parameterized stochastic policy πθ(a|s) directly generates the control action, i.e., the probability of taking action a in the given state s with parameters θ. We need to adjust the parameters to optimize the policy according to the gradient of the performance measure J(πθ):
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s\sim\rho^{\pi_\theta},\, a\sim\pi_\theta}\left[\nabla_\theta \log\pi_\theta(a\mid s)\, Q^{\pi_\theta}(s, a)\right],$$
The policy gradient algorithm can find the optimal control policy without considering the accuracy of the model. However, basic RL algorithms have difficulty converging effectively to the optimal policy in the continuous state-action space of a complex environment. Many advanced RL algorithms, such as PPO, have improved on this policy optimization. PPO uses a Kullback–Leibler divergence (KLD) to limit the update range of the policy, thereby improving the learning efficiency of the algorithm.
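As a minimal illustration of how the gradient estimator above is used in practice, the following PyTorch-style sketch builds a surrogate loss whose gradient is the sampled policy gradient, with the action value replaced by an advantage estimate formed from the value function discussed above. The function names are illustrative and not taken from the paper.

```python
import torch

def one_step_advantage(rewards, values, next_values, gamma=0.99):
    """One-step advantage estimate A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards + gamma * next_values - values

def policy_gradient_loss(log_probs, advantages):
    """Surrogate loss whose gradient is the sampled policy gradient
    E[ grad log pi_theta(a|s) * A(s, a) ] (negated for minimization).

    log_probs:  log pi_theta(a_t | s_t) of the sampled actions (requires grad)
    advantages: advantage estimates, treated as constants
    """
    return -(log_probs * advantages.detach()).mean()
```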

3. Proposed Approach

In this section, a policy optimization with penalized point probability distance (PPO-PPD) is firstly proposed for quadrotor control. Then, a compound reward function is adopted to promote the algorithm convergence to the desired direction.

3.1. The PPO-PPD Algorithm

In the PPO method, our goal is to maximize the following surrogate objective function $L^{CPI}$ (conservative policy iteration) proposed in [43], while constraining the size of the policy update:
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}\hat{A}_t\right],$$
where $\theta_{old}$ is the vector of policy parameters before the update. The objective function is maximized subject to the constraint:
$$\hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot\mid s_t),\, \pi_\theta(\cdot\mid s_t)\right]\right] \le \delta,$$
where δ is the upper limit of the KLD. By applying a linear approximation of the objective function and a quadratic approximation of the constraint, the conjugate gradient algorithm can solve this problem efficiently. In the continuous domain, the KLD can be defined as:
$$D_{KL}\left(\pi_{\theta_{old}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\right) = \int_a \pi_{\theta_{old}}(a\mid s)\ln\frac{\pi_{\theta_{old}}(a\mid s)}{\pi_\theta(a\mid s)}\, da,$$
where s is a given state. The asymmetry between $D_{KL}(\pi_{\theta_{old}}\|\pi_\theta)$ and $D_{KL}(\pi_\theta\|\pi_{\theta_{old}})$ results in a difference that cannot be ignored. PPO limits the update range of the policy $\pi_\theta$ through the KLD. Assume that the distribution of $\pi_{\theta_{old}}$ is a mixture of two Gaussians and that $\pi_\theta$ is a single Gaussian. As learning converges, the distribution of $\pi_\theta$ should approximate $\pi_{\theta_{old}}$, i.e., $D_{KL}(\pi_{\theta_{old}}\|\pi_\theta)$ or $D_{KL}(\pi_\theta\|\pi_{\theta_{old}})$ should be minimized. Figure 3a shows the effect of minimizing $D_{KL}(\pi_{\theta_{old}}\|\pi_\theta)$: when $\pi_{\theta_{old}}$ has multiple peaks, $\pi_\theta$ blurs these peaks together and eventually lies between the two peaks of $\pi_{\theta_{old}}$, resulting in invalid exploration. When the other direction is chosen, as shown in Figure 3b, $\pi_\theta$ ends up fitting a single peak of $\pi_{\theta_{old}}$.
By comparing forward and reverse KL, we argue that the KLD is neither an approximation of nor an ideal limit on the expected discounted cost. Even if $\pi_\theta$ assigns the same high probability to the correct action as $\pi_{\theta_{old}}$, it is still penalized for the probability mismatch on other, non-critical actions.
To address the above issues, a point probability distance based on a Monte Carlo approximation is introduced into the PPO objective function as a penalty on the surrogate objective. When taking action a, the point probability distance between $\pi_{\theta_{old}}(\cdot\mid s)$ and $\pi_\theta(\cdot\mid s)$ is defined as:
$$D_{PP}\left(\pi_{\theta_{old}}(\cdot\mid s),\, \pi_\theta(\cdot\mid s)\right) = \left[\ln\frac{\pi_{\theta_{old}}(a\mid s) + 1}{\pi_\theta(a\mid s) + 1}\right]^2.$$
In this penalty, the distance is measured by the point probability, which emphasizes the mismatch of the sampled action in a specific state. Compared with $D_{KL}$, $D_{PP}$ is symmetric, that is, $D_{PP}(\pi_{\theta_{old}}, \pi_\theta) = D_{PP}(\pi_\theta, \pi_{\theta_{old}})$; hence, when the policy is updated, $D_{PP}$ is more conducive to helping the agent converge to the correct policy and avoids the invalid sample learning caused by the KLD. Furthermore, by deriving the relationship between $D_{PP}$ and $D_{KL}$, it can be shown that $D_{PP}$ is a lower bound of $D_{KL}$.
Theorem 1.
Assume that $a_i$ and $b_i$ are two policy distributions over K values; then $D_{PP}(a_i\|b_i) \le D_{KL}(a_i\|b_i)$ holds.
Proof of Theorem 1.
The total variation distance is introduced as a reference, and can be written as follows:
$$D_{TV} = \sum_i \frac{|a_i - b_i|}{2}.$$
From [44], $D_{TV}^2$ is a lower bound of $D_{KL}$, that is, $D_{TV}^2 \le D_{KL}$. Assume that $a_x = c_1$ is an arbitrary component of $a_i$ and $b_x = c_2$ is an arbitrary component of $b_i$, where $c_1, c_2 \in [0, 1]$. Then it can be derived that:
$$D_{TV}^2 = \frac{1}{4}\left(\sum_{i=1}^{K}|a_i - b_i|\right)^2 = \frac{1}{4}\left(\sum_{i=1, i\ne x}^{K}|a_i - b_i| + |a_x - b_x|\right)^2 \ge \frac{1}{4}\left(\left|\sum_{i=1, i\ne x}^{K} a_i - \sum_{i=1, i\ne x}^{K} b_i\right| + |a_x - b_x|\right)^2 = \frac{1}{4}\left(|1 - c_1 - (1 - c_2)| + |c_1 - c_2|\right)^2 = (c_1 - c_2)^2$$
For any real numbers $c_1, c_2$ between 0 and 1, $(c_1 - c_2)^2 \ge \left[\ln(1 + c_1) - \ln(1 + c_2)\right]^2 = D_{PP}$. Therefore $D_{KL} \ge D_{TV}^2 \ge D_{PP}$. □
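The inequality chain used in the proof can also be checked numerically. The short script below (an illustrative sketch, not part of the paper) samples random categorical policy pairs and a random action index and verifies that $D_{PP} \le D_{TV}^2 \le D_{KL}$ on every sample.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    K = rng.integers(2, 10)                    # number of discrete actions
    a = rng.dirichlet(np.ones(K))              # old policy pi_theta_old
    b = rng.dirichlet(np.ones(K))              # new policy pi_theta
    x = rng.integers(K)                        # sampled action index
    d_kl = np.sum(a * np.log(a / b))           # forward KL divergence
    d_tv = 0.5 * np.sum(np.abs(a - b))         # total variation distance
    d_pp = np.log((a[x] + 1.0) / (b[x] + 1.0)) ** 2   # point probability distance
    assert d_pp <= d_tv ** 2 + 1e-12
    assert d_tv ** 2 <= d_kl + 1e-12
print("D_PP <= D_TV^2 <= D_KL held on all samples")
```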
Compared with $D_{KL}$, $D_{PP}$ is less sensitive to the dimension of the action space. The optimized algorithm aims to remedy the shortcomings of the KLD: the probability ratio $r_t(\theta)$ only involves the probability of the given action a, the probabilities of all other actions are not activated, and this no longer leads to long backpropagation paths. Based on $D_{PP}$, a new proxy objective can be obtained as:
$$\max_\theta\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}\hat{A}_t - \beta D_{PP}\left(\pi_{\theta_{old}}(\cdot\mid s),\, \pi_\theta(\cdot\mid s)\right)\right],$$
where β is the penalty coefficient. Algorithm 1 shows the complete iterative process. The optimized algorithm avoids the difficulty, inherent in the fixed-KLD penalty baseline of PPO, of selecting the optimal penalty coefficient for different environments. We implement it on the quadrotor control problem.
Algorithm 1 PPO-PPD
1: Input: max iterations L, actors N, epochs K, time steps T
2: Initialize: initialize the weights of the policy networks θi (i = 1, 2, 3, 4) and of the critic network; load the quadrotor dynamic model
3: for iteration = 1 to L do
4:  Randomly initialize the states of the quadrotor
5:  Load the desired states
6:  Observe the initial state of the quadrotor s1
7:  for actor = 1 to N do
8:   for time step = 1 to T do
9:    Run policy $\pi_\theta$ to select action at
10:    Run the quadrotor with control signals at
11:    Generate reward rt and new state st+1
12:    Store st, at, rt, st+1 into the mini-batch-sized buffer, then
13:    Run policy $\pi_{\theta_{old}}$
14:    Compute advantage estimations $\hat{A}_t$
15:   end for
16:  end for
17:  for epoch = 1 to K do
18:   Optimize the loss target with mini-batch size M ≤ NT, then update θold ← θ
19:   Update θ w.r.t. $J^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t - \beta D_{PP}\left(\pi_{\theta_{old}}(\cdot \mid s),\, \pi_\theta(\cdot \mid s)\right)\right]$
20:  end for
21: end for
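As a complement to Algorithm 1, a PyTorch-style sketch of the penalized surrogate objective is given below; the tensor names and the fixed value of β are assumptions made for illustration, not the paper's implementation.

```python
import torch

def ppo_ppd_loss(log_prob_new, log_prob_old, advantages, beta=0.5):
    """Negative of the penalized surrogate objective used in Algorithm 1.

    log_prob_new: log pi_theta(a_t | s_t) under the current policy (with grad)
    log_prob_old: log pi_theta_old(a_t | s_t) under the old policy
    advantages:   advantage estimates A_t
    beta:         penalty coefficient (illustrative value)
    """
    log_prob_old = log_prob_old.detach()
    advantages = advantages.detach()
    ratio = torch.exp(log_prob_new - log_prob_old)        # pi_theta / pi_theta_old
    # penalized point probability distance between old and new policies
    d_pp = torch.log((log_prob_old.exp() + 1.0) / (log_prob_new.exp() + 1.0)) ** 2
    surrogate = ratio * advantages - beta * d_pp
    return -surrogate.mean()                              # minimized by the optimizer
```

In a training loop, this loss would be minimized over the stored mini-batches with a standard stochastic gradient optimizer, after which θold is synchronized with θ as in Algorithm 1.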

3.2. Network Structure

The actor–critic network structure of the algorithm is shown in Figure 4. The system is trained by a critic neural network (CNN) and a policy neural network (PNN) θi (i = 1, 2, 3, 4), which is formed by four policy sub-networks. The weights of the PNN can be optimized by training.
The input to the two neural networks is the new quadrotor state $[\dot{\phi}, \dot{\theta}, \dot{\psi}, \phi, \theta, \psi, \dot{x}, \dot{y}, \dot{z}, x, y, z]$ from the replay buffer. When the PNN collects a single state vector, its parameters are copied to the old PNN $\pi_{\theta_{old}}$. In the next batch of training, the parameters of $\pi_{\theta_{old}}$ remain fixed until new network parameters are received. The outputs of the PNN are $\pi_\theta$ and $\pi_{\theta_{old}}$. The penalty $D_{PP}$ is obtained by calculating the point probability distance between the two policies. When the state vector enters the CNN, a batch of advantage values is generated according to the reward function to evaluate the quality of the actions taken. Through the gradient descent method, the CNN minimizes these values to update its parameters. Finally, the policies $\pi_\theta$ and $\pi_{\theta_{old}}$, the penalized point probability distance $D_{PP}$ and the advantage value $\hat{A}_t$ are provided for the update of the PNN. After the PNN is updated, its outputs μi and δi (i = 1, 2, 3, 4) correspond to the means and variances of a Gaussian distribution. A set of 4-dimensional action vectors ai (i = 1, 2, 3, 4), used as the normalized control signals of the four rotors, is randomly sampled from this Gaussian distribution.
Based on the multilayer perceptron (MLP) structure in [45], the actor-critic network structure of our algorithm is shown in Figure 5. This structure maintains a balance between training speed and the control performance of the quadrotor. Both networks share the same input, consisting of a 12-dimensional state vector. The PNN has two fully connected hidden layers, each containing 64 nodes with the tanh activation function. The output layer is a 4-dimensional Gaussian distribution with mean μ and variance δ. The 4-dimensional action vector ai (i = 1, 2, 3, 4) is obtained by random sampling and normalization, and is used as the control signal of the quadrotor rotors. The structure of the CNN is similar to that of the PNN: it also has two fully connected hidden layers with the tanh activation function, each containing 64 hidden nodes. The difference is that its output is an evaluation of the advantage value of the current action, which is determined by the value of the reward function.
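Following this description, the PNN and CNN could be sketched in PyTorch as two small MLPs with two 64-unit tanh hidden layers; the class names, the state-dependent log-standard-deviation head and the sigmoid squashing used for normalization are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """PNN sketch: 12-D state -> 4-D Gaussian over normalized rotor commands."""
    def __init__(self, state_dim=12, action_dim=4, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, action_dim)       # means mu_i
        self.log_std_head = nn.Linear(hidden, action_dim)  # log std-dev (delta_i)

    def forward(self, state):
        h = self.body(state)
        mu = self.mu_head(h)
        std = self.log_std_head(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(mu, std)

class CriticNetwork(nn.Module):
    """CNN sketch: 12-D state -> scalar state value used to form advantages."""
    def __init__(self, state_dim=12, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

# sampling a normalized rotor command (one possible squashing to [0, 1]):
# dist = PolicyNetwork()(state); action = torch.sigmoid(dist.sample())
```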

3.3. Reward Function

The goal of an RL algorithm is to obtain the largest cumulative reward [46]. Existing RL reward function settings are relatively simple, most of which are of the form:
$$r = -\left(x^2 + y^2 + z^2 + \psi^2\right),$$
where r is the single-step reward value, (x, y, z) is the position observation of the quadrotor, and ψ is the heading angle. Relying on a single reward function of this kind is not enough to evaluate the pros and cons of the actions chosen by the quadrotor. If only this single reward function is used, the action-space updates become too large and ineffective exploration increases, making convergence slower. A new reward function that combines multiple reward policies is therefore introduced to solve this problem.
The quadrotor explores through a random policy. When the mainline event is triggered with a certain probability, the corresponding mainline reward should be given. Because the probability of triggering the mainline reward is very low over the entire flight control task, we need to design corresponding reward functions according to all possible states of the quadrotor. Therefore, in this paper, a navigation reward, a boundary reward and a goal reward are designed. As the mainline reward, the navigation reward directly acts on the position and attitude information of the quadrotor by observing the continuous state space.
  • Navigation Reward;
    (a) Position Reward
    In order to drive the quadrotor to fly to the target point, the position reward is defined as a penalty for the distance between the quadrotor and the target point. When the quadrotor is close to the target point, the penalty should be small, otherwise the penalty should be large. Therefore, the definition of position reward is as follows:
    $$r_P = -k_P\left(x_e^2 + y_e^2 + z_e^2\right) - k_V\left(\dot{x}_e^2 + \dot{y}_e^2 + \dot{z}_e^2\right),$$
    where $x_e = x - x_d$, $y_e = y - y_d$, $z_e = z - z_d$ are the position errors relative to the target state, $\dot{x}_e$, $\dot{y}_e$ and $\dot{z}_e$ are the linear velocity errors along the x, y and z axes, and $k_P, k_V \in (0, 1]$.
    (b) Attitude Reward
    The attitude reward is designed to stabilize the quadrotor while it flies to the target point, since large angle deflections are not conducive to the flight control of the quadrotor.
    Although a simple reward function that penalizes $\phi^2 + \theta^2 + \psi^2$ aims to make the attitude angles tend to 0, the quadrotor will trade the position reward off against the attitude reward and find a locally optimal policy, which is not the best control policy for quadrotor fixed-point flight. When the position is closer to the target point, the transformation function of the attitude angles also tends to 0; without considering ψ, φ and θ can also be inversely solved to be 0. Therefore, replacing the attitude angles themselves by their transformation function in the reward function will not affect the judgment of the quadrotor during position control, and can increase the stability of the inner- and outer-loop control. The attitude reward is defined as:
    $$r_A = -k_A\left[\left(C_\phi S_\theta C_\psi + S_\phi S_\psi\right)^2 + \left(C_\phi S_\theta S_\psi - S_\phi C_\psi\right)^2\right],$$
    where (φ, θ, ψ) are the attitude observations and $k_A \in (0, 1]$.
    (c) Position-Attitude Reward
    The farther the quadrotor is from the target point, the larger the weight of the position reward. As the quadrotor flies closer to the target point, the weight of the position reward decreases and the weight of the attitude reward gradually increases. The specific reward function is set as follows:
    $$r_{PA} = -k_{PA}\frac{a_{\phi,\theta}^2}{\max(e_p, 0.001)},$$
    where $e_p$ is the position error relative to the target state, $a_\phi$ and $a_\theta$ are the normalized roll and pitch actions between 0 and 1, $k_{PA} \in (0, 1]$, and $a_{\phi,\theta}^2$ is the sum of the squared roll and pitch actions. It is weighted by the reciprocal of $e_p$ to minimize the oscillation of the quadrotor near the target position; the lower bound in the denominator is therefore set to 0.001.
  • Boundary Reward;
    In many earlier roll-outs, when the roll angle or pitch angle of the quadrotor is over 40°, the motor will receive an emergency stop command to minimize damage [47]. In order to maintain stability, we set a boundary restriction and failure penalty on the attitude angles to prevent the quadrotor from crashing due to excessive vibration. The specific restriction is as follows:
    $$r_{BA} = \begin{cases} 0, & R_{At} \le R_{\max\,attitude} \\ -\zeta_{penalty}, & R_{At} > R_{\max\,attitude} \end{cases}$$
    where $R_{At}$ is the error between the attitude angle and the target attitude at time t, $R_{\max\,attitude}$ is the maximum safe attitude angle, and the boundary penalty $\zeta_{penalty}$ is a positive constant.
    For position control, the random states sampled may differ by several orders of magnitude in different flying spaces. In order to reduce the exploration time of the quadrotor, we will set a safe flight range with the target point as the center, so that the quadrotor can reduce unnecessary invalid exploration. The reward is determined as:
    $$r_{BP} = \begin{cases} 0, & R_{Pt} \le R_{boundary} \\ -(R_{Pt} - R_{boundary}), & R_{Pt} > R_{boundary} \end{cases}$$
    where $R_{Pt}$ is the distance between the quadrotor and the target point at time t, and $R_{boundary}$ is the safe flight range that we set for the quadrotor.
  • Goal Reward;
    The mainline event of the quadrotor is to reach the target point, so in order to prompt the quadrotor to move to the target as soon as possible, a goal reward is designed. Unlike other rewards, when the quadrotor triggers a mainline event, it should be given a positive reward. When the distance between the quadrotor and the target point is less than Rreach, it is determined that the quadrotor has reached the target point. The specific reward definition is as follows:
    $$r_G = \begin{cases} \zeta_{goal}, & R_{Pt} \le R_{reach} \\ 0, & R_{Pt} > R_{reach} \end{cases}$$
    These rewards may affect the training performance of the policy network. In this paper, when designing the quadrotor controller, all these rewards are set in combination with the corresponding tasks, and the final comprehensive reward is defined as the sum of them as follows:
    $$r = r_P + r_A + r_{PA} + r_{BA} + r_{BP} + r_G.$$
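A minimal sketch of this compound reward is given below, assuming a hover target (zero desired velocity and attitude); all gains and thresholds are illustrative placeholders rather than the tuned values used in the paper, and the signs follow the penalty interpretation described above.

```python
import numpy as np

def compound_reward(pos, vel, att, act_roll_pitch, goal,
                    k_p=0.5, k_v=0.2, k_a=0.3, k_pa=0.1,
                    r_max_att=np.deg2rad(45.0), r_boundary=2.5,
                    r_reach=0.05, zeta_penalty=10.0, zeta_goal=10.0):
    """Sketch of r = r_P + r_A + r_PA + r_BA + r_BP + r_G for a hover target.
    All gains and thresholds are illustrative, not the paper's tuned values."""
    phi, theta, psi = att
    S, C = np.sin, np.cos
    e_p = np.linalg.norm(np.asarray(pos) - np.asarray(goal))
    # navigation rewards: position, attitude, position-attitude
    r_P = (-k_p * np.sum(np.square(np.asarray(pos) - np.asarray(goal)))
           - k_v * np.sum(np.square(vel)))
    r_A = -k_a * ((C(phi) * S(theta) * C(psi) + S(phi) * S(psi)) ** 2 +
                  (C(phi) * S(theta) * S(psi) - S(phi) * C(psi)) ** 2)
    r_PA = -k_pa * np.sum(np.square(act_roll_pitch)) / max(e_p, 0.001)
    # boundary rewards: attitude limit and flight-range limit
    r_BA = -zeta_penalty if max(abs(phi), abs(theta)) > r_max_att else 0.0
    r_BP = -(e_p - r_boundary) if e_p > r_boundary else 0.0
    # goal reward when the target point is reached
    r_G = zeta_goal if e_p <= r_reach else 0.0
    return r_P + r_A + r_PA + r_BA + r_BP + r_G
```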

4. Simulation

In this section, we use the proposed PPO-PPD algorithm to evaluate the neural-network-based quadrotor flight controller. The simulations are performed in comparison with a controller based on the standard PPO algorithm.

4.1. Simulation Settings

The quadrotor model in the simulation is constructed based on the dynamics given in (6). The parameters of the quadrotor are listed in Table 1.
The parameter settings of the simulation model match the body parameters of the real quadrotor shown in Figure 6, so as to reproduce the flight state of the real quadrotor as closely as possible. Considering the safety factors in actual flight, we define a safe range of the state. The range of the attitude angles φ and θ is −45° to 45°, and the range of the angular velocities $\dot{\phi}$ and $\dot{\theta}$ is −4.5 rad/s to 4.5 rad/s, which meets the limitation of the gyroscope sensor. The quadrotor is specified to operate within a range of −2.5 m to 2.5 m in the x direction, −2.4 m to 2.4 m in the y direction, and 0 m to 2.4 m in the z direction.
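The paper does not spell out how these limits are enforced during training; the helper below is only a sketch that records the stated safe ranges in a dictionary (the state keys are hypothetical) and flags an episode when any bound is violated.

```python
import numpy as np

SAFE_RANGES = {                      # bounds quoted in the simulation settings
    "phi":    (-np.deg2rad(45.0), np.deg2rad(45.0)),
    "theta":  (-np.deg2rad(45.0), np.deg2rad(45.0)),
    "dphi":   (-4.5, 4.5),           # rad/s, gyroscope limit
    "dtheta": (-4.5, 4.5),           # rad/s
    "x": (-2.5, 2.5), "y": (-2.4, 2.4), "z": (0.0, 2.4),   # metres
}

def out_of_safe_range(obs):
    """Return True if any monitored state leaves its safe range, e.g. to
    terminate a training episode. `obs` maps the (hypothetical) state names
    above to their current values."""
    return any(not (lo <= obs[key] <= hi) for key, (lo, hi) in SAFE_RANGES.items())
```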

4.2. Training Evaluation

In the offline learning phase, the PPO-PPD is applied. The training parameters are given in Table 2.
In order to verify the performance of the PPO-PPD policy, we compare PPO and PPO-PPD on multiple motion tasks in OpenAI Gym [48]. The two algorithms use the same network structure and environment parameters. The motion tasks are selected from discrete action-space tasks (such as Acrobot, CartPole and Pendulum) and continuous tasks (such as Ant, Half-Cheetah and Walker2D [49]). Both PPO-PPD and PPO are initialized randomly and run five times. The comparison results are shown in Figure 7.
For an intuitive comparison of algorithm performance, Table 3 shows the best performance of PPO-PPD and PPO in different tasks. It can be observed from Figure 7 that the PPO-PPD has a faster and more accurate control policy than PPO. We then evaluate both algorithms in a quadrotor system with randomly initialized states.
In order to train a flight policy with generalization ability, the initial state of the quadrotor is random during training. The target point is set at [0, 0, 1.2]. When the policy converges, the quadrotor should be able to complete the control task of taking off and hovering to the target point at any position. We use the average cumulative reward and average value loss to measure the effect of learning and training. In each step, the greater the reward value of the feedback, the smaller the error for the desired state. The training of the quadrotor should also be carried out in the direction of smaller and smaller errors. A faster and more accurate control policy is reflected in a larger and more stable cumulative reward. In this study, we perform a calculation after every 50 sets of data are recorded, and the average cumulative reward and value loss are evaluated as the average of the 50 evaluation sets. Based on the same network and training parameters, we compare the PPO and PPO-PPD.
Under the initial network parameters, we conduct ten independent experiments with each of the two algorithms. The standard deviation over these ten experiments is indicated by the shaded region. It is shown that in the initial stage of training, both policies have obvious errors. As the agent continues to train, the errors of both algorithms are gradually reduced towards zero. In Figure 8a, it is very clear that the steady-state error is nearly eliminated by the PPO-PPD policy after 1000 training iterations. Although the PPO policy converges after 3000 training iterations, it is always affected by a steady-state error, and the error does not show any reduction in the subsequent training iterations.
It can be seen from the learning progress in Figure 8b that PPO-PPD has a higher convergence rate and obtains a higher reward than PPO. In terms of standard deviation, PPO-PPD is more consistent and requires less training time. In addition, the policy begins to converge gradually when the reward value reaches 220. Therefore, a predefined threshold of 220 is set to further observe the training steps of the algorithms.
To further verify the effectiveness of the compound reward function in the process of training policies, we compare the performance of PPO-PPD with the compound reward, PPO-PPD with the single reward, and PPO with the single reward. The single reward function is taken from (17) and the compound reward function from (24). Table 4 lists the training steps required for the three algorithms to reach the threshold.
In Table 4, PPO-PPD with the compound reward function takes the fewest time steps in the flight task, because the compound reward function accelerates the convergence to correct actions and reduces the blind exploration of the quadrotor. Comparing PPO-PPD with a single reward function against PPO, the algorithmic structure of PPO-PPD still provides better learning efficiency.
As shown in Figure 9, 60 groups of training data are sampled to obtain the final landing position of the quadrotor after the 100th, 500th and 800th training iterations of the three algorithms.
It can be seen that none of the algorithms trains a good policy before the 100th iteration. Owing to its exploration efficiency, PPO-PPD is able to sample several more rounds of good control policies than PPO, and the advantage is especially noticeable after the 500th training iteration. Finally, PPO-PPD with the compound reward successfully trains the control policy after the 800th training iteration. Because of the multi-objective reward, PPO-PPD with the compound reward can stabilize the quadrotor at the target point after completing the mainline event, whereas PPO-PPD with the single reward reaches the target point with a probability of deflection due to its single reward. It is obvious that the quadrotor under the PPO controller has not obtained a good control policy by the 800th iteration. It is concluded that PPO-PPD with the compound reward is superior to the other two methods.
The attitude control of the quadrotor at a fixed position is conducted first. This test does not consider the position information of the quadrotor and only uses the three attitude angles as the observation space. The attitude angles of the quadrotor model are initialized to [30, 20, 10]°, and the target attitude is set to [0, 0, 0]°. It can be seen from Figure 10a that both the PPO and PPO-PPD policies achieve stable control. However, PPO-PPD gives smoother control performance and higher control accuracy than the PPO algorithm, whereas the PPO response has a relatively large steady-state error. Moreover, it can be observed that the quadrotor under both control strategies reaches the steady state after 0.5 s. Comparing the mean absolute steady-state error of the two algorithms, as shown in Figure 10b, the PPO-PPD policy achieves higher control accuracy.
Then we test the two controllers in the fixed-point flight task under the same number of training iterations. The observation space for the test consists of the motion of the quadrotor along the x-, y- and z-axes and the changes of the roll and pitch angles, giving a total of five observations. In order to maximize the demonstration of its flight performance, the initial position of the quadrotor is set near the boundary at the coordinates [2.4, 1.2, 0] and the desired position is [0, 0, 1.2], which is taken as the center of the training environment. Figure 11a shows the performance of the two control policies.
It can be seen from the comparison that, although both PPO-PPD and PPO converge, the PPO algorithm does not learn an effective control policy when taking off from a relatively unsafe boundary area. In terms of position control, the control policy learned by the PPO algorithm converges slowly with a certain steady-state error. In terms of attitude control, both policies maintain good convergence and stability, but due to the instability of the PPO policy in the position loop, there is still a slight attitude error under PPO control. Furthermore, to compare the training results more directly, we calculate the mean absolute steady-state error of the position control loop for the two policies in steady state at 7 s, and the comparison results are shown in Figure 11b.
In this test, both algorithms converge to a stable policy, but PPO-PPD has a smaller steady-state error and a faster convergence rate. Next, we conduct more tests to observe the performance of the control policy trained by PPO-PPD.

4.3. Robustness Test

The main purpose of quadrotor offline learning is to learn a stable and robust control policy. In this section, we test the generalization ability of the trained model, and the test is performed on the same quadrotor. In order to conduct a comprehensive robustness test of the learned policy, we designed three different cases.
1. Case 1: Model generalization test under random initial state.
In different initial states of the quadrotor, the PPO-PPD algorithm is used to test its performance. The test is again divided into two parts. We first observe the attitude changes in the fixed-point state; that is, the control task is for the quadrotor to hover at a fixed position, the state is randomly initialized within a safe range, and the attitude must be adjusted from the random state to the required steady state. We conduct the experiment 20 times, and each experiment lasts 8 s. As shown in Figure 12a, the three attitude angles start at different initial values, and the control policy successfully converges their states.
The policy learned by the PPO-PPD algorithm can make the quadrotor stable in different states with few errors, which is enough to prove the good generalization ability of the offline policy. Next, we give the quadrotor a random initialization position within a safe range and observe its position change to test the generalization ability of the RL control policy on fixed-point flight tasks. The experiment is performed 20 times, and the duration of each group is 8 s. The results are shown in Figure 12b.
It can be seen from the results that the control policy learned by PPO-PPD has very good generalization ability. No matter what the initial position of the quadrotor is, the control policy can quickly control the quadrotor to fly to the desired target point, which is enough to prove the stability of the offline policy.
2. Case 2: Model generalization test under different sizes.
In order to verify the robustness and generalization ability of the offline learned control strategy, the attitude control task is carried out on quadrotor models of different sizes. The policy is tested by starting at [−15°, −10°, −5°] and then flying to the attitude [0, 0, 0] within 10 s. Furthermore, a PID controller is introduced to benchmark the robustness of the RL control policy. In the same way as for RL, the PID gains are selected by observing the system output response through trial and error. To measure the dynamic performance of the control policies, the sum of errors during the flight is calculated as a metric, which is the absolute tracking error accumulated over the three attitude angles at each step. As a cascade control, the initial PID parameters are selected as follows: position loop kp = 0.15, ki = 0.001, kd = 0.5; attitude loop kp = 0.25, ki = 0.001, kd = 0.4.
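For reference, a minimal sketch of such a cascaded PID baseline with the quoted gains is shown below; the per-axis wiring between the outer position loop and the inner attitude loop is an assumption made for illustration.

```python
class PID:
    """Textbook PID term used by the cascaded baseline (illustrative sketch)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, None

    def step(self, err, dt):
        self.integral += err * dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# gains quoted in the text
pos_pid = PID(kp=0.15, ki=0.001, kd=0.5)   # outer position loop
att_pid = PID(kp=0.25, ki=0.001, kd=0.4)   # inner attitude loop

def cascade_step(pos_err, att_meas, dt):
    """Per-axis cascade: the outer loop turns a position error into an
    attitude command, and the inner loop tracks that command (assumed wiring)."""
    att_cmd = pos_pid.step(pos_err, dt)
    return att_pid.step(att_cmd - att_meas, dt)
```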
To demonstrate the control performance of PPO-PPD on models of different specifications, we conducted the following simulation. The distance from each rotor of the quadrotor model to the center of mass is 0.31 m, which is defined as the standard radius. We then test a set of models with radii from 0.2 m (about 35% smaller) to 1.1 m (about 250% larger). For these models, the maximum thrust and mass of the quadrotor remain unchanged.
It can be seen from Figure 13 that the two RL controllers show stable performance at radii of 0.31 m and 0.5 m, whereas the attitude under the PID controller already shows slight oscillation. When the radius increases to 0.7 m, the PID controller has poor stability and robustness because of the parameter uncertainty. When the radius is larger than 0.9 m, the PPO policy cannot stabilize the model, while the PPO-PPD policy still obtains stable performance up to 1.1 m. Figure 14 shows the sum of attitude errors of the PPO-PPD and PPO algorithms at steady state. After comparison, the PPO-PPD algorithm always maintains stable, consistent and accurate control over a large range of radii.
In addition, the robustness of the quadrotor with different masses is tested through a fixed-point flight mission. The mass of the quadrotor gradually increases due to the weight of the payload, which is not added in the training phase but is tested directly with the learned offline policy. The payloads range from 20% to 80% of the mass of the quadrotor, which also affects the moment of inertia of the quadrotor. After a simple test with offline training, we reduce the difficulty of the fixed-point flight task to better observe the effect of the load on quadrotor flight. A total of five tests are carried out. In each test, only the mass of the quadrotor is changed. The quadrotor starts from the initial point [0, 0, 0], and the desired position is [1.2, 1.0, 1.2].
The position curves of the five tests are shown in Figure 15. The existing PID gains can no longer meet the control requirements when the payload reaches 40%. The PPO policy completes the task only when the total mass is below 120% of the nominal mass. When the mass is increased to 140%, there is a large position steady-state error, although the quadrotor under the PPO controller is still stable. This is mainly because most of the thrust balances the gravity added by the payload, so the thrust available for position control becomes small. When the payload reaches 60% to 80%, PPO cannot maintain the stability of the quadrotor. However, PPO-PPD can quickly reach the target position without steady-state error under different payloads. As shown in Figure 16, the sum of position errors is compared between the PPO-PPD and PPO policies. From the comparison results, the PPO-PPD control policy shows great robustness on quadrotor models with different sizes and payloads.
3. Case 3: Anti-disturbance ability test.
The actual quadrotor system is vulnerable to disturbances such as wind gusts and sensor noise. To verify the anti-disturbance ability of the PPO-PPD control policy, Gaussian white noise is added to the quadrotor rotation system. The test is carried out through the control task of the quadrotor hovering at a fixed point. The quadrotor flies from [0, 0, 0] to [1.2, 1.2, 1.2] using the PPO-PPD offline policy. The RL controller runs continuously for 32 s. For the first 4 s, the quadrotor takes off from the starting point and hovers at the desired position; then a noise signal is applied to the roll motion channel from 4 s.
The flight performance of the quadrotor is shown in Figure 17. Due to the influence of the noise, the roll channel and position of the quadrotor fluctuate slightly. The quadrotor immediately returns to the stable state when the noise disappears at t = 12 s. The noise signal is then applied to both the roll and pitch channels at t = 16 s, and the quadrotor remains close to stable although there are slight oscillations. When the noise signal is increased by 150% at the 24th second, the quadrotor exhibits a large attitude oscillation and position deviation. In general, the PPO-PPD control policy can successfully deal with these disturbances.
From the results of all the cases, the control policy learned by PPO-PPD in the offline stage shows strong robustness to quadrotor models of different sizes and payloads. Although the PPO controller also has some generalization ability, the proposed PPO-PPD method proves superior in convergence and robustness.

5. Conclusions

An improved proximal policy optimization algorithm is proposed to train the quadrotor to complete the low-level control tasks of take-off, precise flight and hovering. A policy optimization method with a penalized point probability distance preserves policy diversity. Together with the proposed compound reward function, the new RL controller effectively reduces the training time of the control policy and improves learning efficiency. By varying the radius and mass of the quadrotor in the tests, the offline control policy is shown to have good robustness. In addition, compared with the off-the-shelf PPO algorithm, the control policy learned by the proposed algorithm reduces the steady-state errors of position and attitude and improves the control accuracy. In future work, we will focus on exploring the role of neural networks in complex nonlinear system task environments, and combine more traditional control techniques with RL to optimize the control performance of the quadrotor.

Author Contributions

Methodology, W.X. and H.Y.; software, H.W. and W.X.; validation, H.W.; formal analysis, H.Y.; investigation, H.Y.; resources, S.S.; data curation, W.X. and H.W.; writing—original draft preparation, H.W.; writing—review and editing, W.X.; supervision, H.Y. and S.S.; project administration, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China (Grant 61903163), Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grants 19KJB510023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Levulis, S.J.; DeLucia, P.R.; Kim, S.Y. Effects of touch, voice, and multimodal input, and task load on multiple-UAV monitoring performance during simulated manned-unmanned teaming in a military helicopter. Hum. Factors 2018, 60, 1117–1129. [Google Scholar] [CrossRef] [PubMed]
  2. Zhou, X.; Lee, W.S.; Ampatzidis, Y.; Chen, Y.; Peres, N.; Fraisse, C. Strawberry maturity classification from UAV and near-ground imaging using deep learning. Smart Agric. Technol. 2021, 1, 100001. [Google Scholar] [CrossRef]
  3. Jiao, Z.; Jia, G.; Cai, Y. A new approach to oil spill detection that combines deep learning with unmanned aerial vehicles. Comput. Ind. Eng. 2019, 135, 1300–1311. [Google Scholar] [CrossRef]
  4. Wetz, T.; Wildmann, N.; Beyrich, F. Distributed wind measurements with multiple quadrotor UAVs in the atmospheric boundary layer. Atmos. Meas. Tech. Discuss. 2021, 2021, 3795–3814. [Google Scholar] [CrossRef]
  5. Estrada, M.A.R.; Ndoma, A. The uses of unmanned aerial vehicles–UAV’s-(or drones) in social logistic: Natural disasters response and humanitarian relief aid. Procedia Comput. Sci. 2019, 149, 375–383. [Google Scholar] [CrossRef]
  6. Martins, L.; Cardeira, C.; Oliveira, P. Feedback linearization with zero dynamics stabilization for quadrotor control. J. Intell. Robot. Syst. 2021, 101, 7. [Google Scholar] [CrossRef]
  7. Pliego-Jiménez, J. Quaternion-based adaptive control for trajectory tracking of quadrotor unmanned aerial vehicles. Int. J. Adapt. Control. Signal Process. 2021, 35, 628–641. [Google Scholar] [CrossRef]
  8. Hossny, M.; El-Badawy, A.; Hassan, R. Fuzzy model predictive control of a quadrotor unmanned aerial vehicle. In Proceedings of the 2020 International Conference on Unmanned Aircraft Systems (ICUAS), IEEE, Piscataway, NJ, USA, 1–4 September 2020; pp. 1704–1713. [Google Scholar]
  9. Aslan, F.; Yalçın, Y. Immersion and invariance control for Euler angles of a fixed-wing unmanned aerial vehicle. Asian J. Control. 2021, 1–12. [Google Scholar] [CrossRef]
  10. Xue, W.; Zhu, X.; Yang, X.; Ye, H.; Chen, X. A moving target tracking control of quadrotor UAV based on passive control and super-twisting sliding mode control. Math. Probl. Eng. 2021, 894–907. [Google Scholar] [CrossRef]
  11. Ren, Y.; Zhao, Z.; Zhang, C.; Yang, Q.; Hong, K.S. Adaptive neural-network boundary control for a flexible manipulator with input constraints and model uncertainties. IEEE Trans. Cybern. 2020, 51, 4796–4807. [Google Scholar] [CrossRef]
  12. Zhao, Z.; Ren, Y.; Mu, C.; Zou, T.; Hong, K.S. Adaptive neural-network-based fault-tolerant control for a flexible string with composite disturbance observer and input constraints. IEEE Trans. Cybern. 2021, in press. [Google Scholar] [CrossRef] [PubMed]
  13. Jiang, T.; Lin, D.; Song, T. Finite-time backstepping control for quadrotors with disturbances and input constraints. IEEE Access 2018, 6, 62037–62049. [Google Scholar] [CrossRef]
  14. Yuan, Y.; Cheng, L.; Wang, Z.; Sun, C. Position tracking and attitude control for quadrotors via active disturbance rejection control method. Sci. China Inf. Sci. 2019, 62, 10201.
  15. Schreiber, T.; Eschweiler, S.; Baranski, M.; Müller, D. Application of two promising Reinforcement Learning algorithms for load shifting in a cooling supply system. Energy Build. 2020, 229, 110490.
  16. Wang, Y.; Sun, J.; He, H.; Sun, C. Deterministic policy gradient with integral compensator for robust quadrotor control. IEEE Trans. Syst. Man Cybern. Syst. 2019, 50, 3713–3725.
  17. Singla, A.; Padakandla, S.; Bhatnagar, S. Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge. IEEE Trans. Intell. Transp. Syst. 2019, 22, 107–118.
  18. Bouhamed, O.; Ghazzai, H.; Besbes, H.; Massoud, Y. Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, Piscataway, NJ, USA, 12–14 October 2020; pp. 1–5.
  19. Li, B.; Wu, Y. Path planning for UAV ground target tracking via deep reinforcement learning. IEEE Access 2020, 8, 29064–29074.
  20. Yan, C.; Xiang, X.; Wang, C. Towards real-time path planning through deep reinforcement learning for a UAV in dynamic environments. J. Intell. Robot. Syst. 2020, 98, 297–309.
  21. Azar, A.T.; Koubaa, A.; Ali Mohamed, N.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A.; et al. Drone deep reinforcement learning: A review. Electronics 2021, 10, 999.
  22. Kim, H.; Jordan, M.; Sastry, S.; Ng, A. Autonomous helicopter flight via reinforcement learning. Adv. Neural Inf. Process. Syst. 2003, 16, 1–8.
  23. Waslander, S.L.; Hoffmann, G.M.; Jang, J.S.; Tomlin, C.J. Multi-agent quadrotor testbed control design: Integral sliding mode vs. reinforcement learning. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Piscataway, NJ, USA, 2–6 August 2005; pp. 3712–3717.
  24. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  25. Pi, C.H.; Hu, K.C.; Cheng, S.; Wu, I.C. Low-level autonomous control and tracking of quadrotor using reinforcement learning. Control Eng. Pract. 2020, 95, 104222.
  26. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Westminster, UK, 19–24 June 2016; pp. 1928–1937.
  27. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, PMLR, Westminster, UK, 10–15 July 2018; pp. 1587–1596.
  28. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Westminster, UK, 10–15 July 2018; pp. 1861–1870.
  29. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
  30. Lee, S.; Bang, H. Automatic gain tuning method of a quad-rotor geometric attitude controller using A3C. Int. J. Aeronaut. Space Sci. 2020, 21, 469–478.
  31. Shehab, M.; Zaghloul, A.; El-Badawy, A. Low-level control of a quadrotor using twin delayed deep deterministic policy gradient (TD3). In Proceedings of the 2021 18th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), IEEE, Piscataway, NJ, USA, 10–12 November 2021; pp. 1–6.
  32. Barros, G.M.; Colombini, E.L. Using Soft Actor-Critic for low-level UAV control. arXiv 2020, arXiv:2010.02293.
  33. Chen, D.; Qi, Q.; Zhuang, Z.; Wang, J.; Liao, J.; Han, Z. Mean field deep reinforcement learning for fair and efficient UAV control. IEEE Internet Things J. 2020, 8, 813–828.
  34. Bøhn, E.; Coates, E.M.; Moe, S.; Johansen, T.A. Deep reinforcement learning attitude control of fixed-wing UAVs using proximal policy optimization. In Proceedings of the 2019 International Conference on Unmanned Aircraft Systems (ICUAS), IEEE, Piscataway, NJ, USA, 11 June 2019; pp. 523–533.
  35. Koch, W.; Mancuso, R.; West, R.; Bestavros, A. Reinforcement learning for UAV attitude control. ACM Trans. Cyber-Phys. Syst. 2019, 3, 1–21.
  36. Lopes, G.C.; Ferreira, M.; da Silva Simões, A.; Colombini, E.L. Intelligent control of a quadrotor with proximal policy optimization reinforcement learning. In Proceedings of the 2018 Latin American Robotic Symposium, 2018 Brazilian Symposium on Robotics (SBR) and 2018 Workshop on Robotics in Education (WRE), IEEE, Piscataway, NJ, USA, 6 November 2018; pp. 503–508.
  37. Jiang, Z.; Lynch, A.F. Quadrotor motion control using deep reinforcement learning. J. Unmanned Veh. Syst. 2021, 9, 234–251.
  38. Rodriguez-Ramos, A.; Sampedro, C.; Bavle, H.; De La Puente, P.; Campoy, P. A deep reinforcement learning strategy for UAV autonomous landing on a moving platform. J. Intell. Robot. Syst. 2019, 93, 351–366.
  39. Hu, H.; Wang, Q. Proximal policy optimization with an integral compensator for quadrotor control. Front. Inf. Technol. Electron. Eng. 2020, 21, 777–795.
  40. Wang, Y.; He, H.; Tan, X.; Gan, Y. Trust region-guided proximal policy optimization. arXiv 2019, arXiv:1901.10314.
  41. Jagodnik, K.M.; Thomas, P.S.; van den Bogert, A.J.; Branicky, M.S.; Kirsch, R.F. Training an actor-critic reinforcement learning controller for arm movement using human-generated rewards. IEEE Trans. Neural Syst. Rehabil. Eng. 2017, 25, 1892–1905.
  42. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
  43. Kakade, S.; Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, San Francisco, CA, USA, 8–12 July 2002.
  44. Chu, X. Policy optimization with penalized point probability distance: An alternative to proximal policy optimization. arXiv 2018, arXiv:1807.00442.
  45. Hwangbo, J.; Sa, I.; Siegwart, R.; Hutter, M. Control of a quadrotor with reinforcement learning. IEEE Robot. Autom. Lett. 2017, 2, 2096–2103.
  46. Xu, X.; Cai, P.; Ahmed, Z.; Yellapu, V.S.; Zhang, W. Path planning and dynamic collision avoidance algorithm under COLREGs via deep reinforcement learning. Neurocomputing 2022, 468, 181–197.
  47. Lambert, N.O.; Drew, D.S.; Yaconelli, J.; Levine, S.; Calandra, R.; Pister, K.S. Low-level control of a quadrotor with deep model-based reinforcement learning. IEEE Robot. Autom. Lett. 2019, 4, 4224–4230.
  48. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540.
  49. Todorov, E.; Erez, T.; Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Piscataway, NJ, USA, 7–12 October 2012; pp. 5026–5033.
Figure 1. Quadrotor helicopter and the body-fixed frame.
Figure 2. Quadrotor closed-loop control system.
Figure 3. Comparison of forward and reverse KL. (a) Forward KL distribution. (b) Reverse KL distribution.
Figure 4. Schematic diagram of the policy update.
Figure 5. The structure of the neural network.
Figure 6. The structure of the real quadrotor.
Figure 7. The average accumulated rewards obtained by PPO-PPD and PPO in different tasks. The shaded area represents the mean ± standard deviation.
Figure 8. (a) Average steady-state error in the evaluation of policies learned by PPO-PPD and PPO. (b) Average accumulated reward in the evaluation of policies learned by PPO-PPD and PPO.
Figure 9. Performance comparison of control policies trained by PPO-PPD with the compound reward, PPO-PPD with the single reward, and PPO with the single reward. A filled circle indicates that the quadrotor reached the target point and remained stable; a non-filled circle indicates that the quadrotor was stable but did not reach the target point; a triangle indicates that the quadrotor was unstable.
Figure 10. (a) Attitude control responses of the policies learned by PPO-PPD and PPO. (b) Average absolute steady-state error of PPO-PPD and PPO.
Figure 11. (a) Position and attitude control responses of the policies learned by PPO-PPD and PPO. (b) Average absolute steady-state error of PPO-PPD and PPO.
Figure 12. (a) PPO-PPD attitude control performance test from 20 different initial states. (b) PPO-PPD position control performance test from 20 different initial states.
Figure 13. Attitude comparison among the PID, PPO, and PPO-PPD controllers for different quadrotor sizes.
Figure 14. Sum of errors for different quadrotor sizes with the PPO-PPD and PPO algorithms.
Figure 15. Position comparison among the PID, PPO, and PPO-PPD controllers with different payloads.
Figure 16. Sum of errors with different payloads for the PPO-PPD and PPO algorithms.
Figure 17. Position and attitude of the quadrotor under noise disturbance.
Table 1. Parameters of the Quadrotor Simulator.

Parameter | Description | Value
m  | Mass | 0.2 kg
L  | Wing length | 0.31 m
g  | Acceleration of gravity | 9.81 m/s²
b  | Thrust gain | 5.723
k  | Reaction torque gain | 0.172
Ix | X-axis moment of inertia | 0.008 kg·m²
Iy | Y-axis moment of inertia | 0.008 kg·m²
Iz | Z-axis moment of inertia | 0.03 kg·m²
dx | X-axis air resistance coefficient | 0.001
dy | Y-axis air resistance coefficient | 0.001
dz | Z-axis air resistance coefficient | 0.001
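
For readers who wish to reproduce the simulation environment, the constants in Table 1 can be collected into a single configuration object. The Python sketch below is only a minimal illustration of that idea; the class name QuadrotorParams and its field names are our own and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class QuadrotorParams:
    """Hypothetical container for the simulator constants listed in Table 1."""
    m: float = 0.2      # mass [kg]
    L: float = 0.31     # wing (arm) length [m]
    g: float = 9.81     # acceleration of gravity [m/s^2]
    b: float = 5.723    # thrust gain
    k: float = 0.172    # reaction torque gain
    Ix: float = 0.008   # X-axis moment of inertia [kg·m^2]
    Iy: float = 0.008   # Y-axis moment of inertia [kg·m^2]
    Iz: float = 0.03    # Z-axis moment of inertia [kg·m^2]
    dx: float = 0.001   # X-axis air resistance coefficient
    dy: float = 0.001   # Y-axis air resistance coefficient
    dz: float = 0.001   # Z-axis air resistance coefficient


if __name__ == "__main__":
    p = QuadrotorParams()
    # At hover the four rotors together balance the weight m*g,
    # so each rotor supplies roughly m*g/4 of thrust.
    print(f"hover thrust per rotor ≈ {p.m * p.g / 4:.3f} N")
```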
Table 2. Training parameters.

Parameter | Value
Reward discount factor γ | 0.97
Learning rate | 0.00025
Value function coefficient | 0.01
Entropy coefficient | 0.5
Mini-batch size M | 128
Number of actors N | 4
Maximum number of iterations L | 1000
Simulation sampling time per step | 0.02 s
Penalty coefficient β | 0.5
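
Similarly, the training settings in Table 2 might be kept in a plain dictionary and handed to whatever PPO-style training loop is used. The sketch below is illustrative only; the key names (and the trainer class mentioned in the comments) are hypothetical and are not taken from the paper's implementation.

```python
# Hypothetical hyperparameter dictionary mirroring Table 2; the key names
# are our own and do not come from the authors' code.
TRAIN_CONFIG = {
    "gamma": 0.97,            # reward discount factor γ
    "learning_rate": 2.5e-4,  # optimizer learning rate
    "value_coef": 0.01,       # value function loss coefficient
    "entropy_coef": 0.5,      # entropy bonus coefficient
    "minibatch_size": 128,    # mini-batch size M
    "num_actors": 4,          # number of parallel actors N
    "max_iterations": 1000,   # maximum number of iterations L
    "dt": 0.02,               # simulation sampling time per step [s]
    "beta": 0.5,              # penalty coefficient β on the point probability distance
}

# A training driver would typically consume these values, e.g.:
#   trainer = PPOPPDTrainer(env, **TRAIN_CONFIG)   # hypothetical class
#   trainer.learn()
```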
Table 3. The performance comparison between PPO-PPD and PPO.

Task | PPO-PPD | PPO | Comparison
Acrobot | −77.540 | −150.458 | +48%
CartPole | 408.245 | 341.732 | +19%
Pendulum | −38.512 | −52.455 | +27%
Ant | 5943.459 | 4268.432 | +39%
HalfCheetah | 6337.017 | 5422.827 | +17%
Walker2D | 3672.855 | 3146.045 | +17%
Table 4. Training steps to reach the 220 reward threshold.

Algorithm | Training Steps
PPO-PPD with compound reward | 614
PPO-PPD with single reward | 1347
PPO with single reward | 2875
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

