Reinforcement Learning-Based Backstepping Control for Container Cranes

A novel backstepping control scheme based on reinforcement fuzzy Q-learning is proposed for the control of container cranes. In this scheme, a modified backstepping controller handles the underactuated dynamics of the container crane, and its gain is tuned by a reinforcement fuzzy Q-learning mechanism that automatically searches for the optimal fuzzy rules so as to decrease the value of the Lyapunov function. The effectiveness of the proposed control scheme was verified by simulation in Matlab, and its performance was compared with that of a conventional sliding mode controller designed for container cranes. The simulation results indicate that the proposed control scheme achieves satisfactory performance for step-signal tracking with an uncertain rope length.


Introduction
A robotic container crane is a robot that lifts cargo off the ground with ropes and then carries it to a designated location. While the robot carries the cargo to its destination, the swing angle of the rope must be stabilized, because significant swinging may cause the cargo to fall or even roll over. Therefore, an appropriate control strategy is necessary to ensure that the robot responds quickly to the commanded position while suppressing the amplitude of the sway angle of the rope. The uncertainties of the system, which result from uncertain system parameters, make the design of the controller challenging. Consequently, earlier control techniques that rely on exact models exhibit certain limitations [1][2][3]. Many control strategies aimed at uncertain systems have been proposed for this problem, such as sliding mode control [4][5][6][7][8], fuzzy control [9,10], adaptive control [11,12], and fuzzy PID control [13,14].
Reinforcement learning (RL) is a learning method that gradually explores the optimal policy by interacting with the environment [15]. The target is usually to maximize the cumulative reward or to minimize the cumulative cost over the entire learning process. A typical reinforcement learning process can be described as follows. The agent first takes an action in the initial state according to the current policy, and this action transfers the system from the current state to the next state with a certain probability. The agent then repeatedly takes an action and moves the system to the next state until the end of learning. Each action that transfers the system from one state to the next is evaluated with a reward or cost, also called the instant reward or cost. The instant rewards or costs collected for the actions taken in all visited states are used to dynamically explore the policy that achieves the maximum reward or the minimum cost over the entire process, which can be done with temporal difference (TD) methods such as Q-learning [16] and SARSA [17].
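The temporal difference update mentioned above can be illustrated with a minimal tabular Q-learning sketch. The table layout (a dictionary of dictionaries), the function name, and the numerical values of the learning rate and discount factor are illustrative assumptions and are not taken from the cited works.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning temporal-difference update to the table q in place.

    q maps each state to a dict {action: value}. alpha is the learning rate and
    gamma the discount factor. Returns the TD error used for the update.
    """
    best_next = max(q[next_state].values())           # greedy value of the next state
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Tiny illustrative example with two states and two actions (hypothetical values).
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
td = q_learning_update(q_table, "s0", "right", reward=1.0,
                       next_state="s1", alpha=0.5, gamma=0.9)
```

SARSA differs only in that the value of the action actually taken in the next state replaces the greedy maximum `best_next`.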
When reinforcement learning is applied to the control of continuous systems, it inevitably encounters the "curse of dimensionality": the number of discrete states that the agent is supposed to visit grows without bound. Fuzzy logic can therefore be used to fuzzify the system states, which allows reinforcement learning methods originally developed for discrete systems to be applied [18].
In recent years, fuzzy reinforcement learning theory has increasingly been used to solve control problems of nonlinear systems. In a recent approach to the coordinated control of multiple manipulators [19], a reinforcement learning method was used to deal with the uncertainties of the dynamic models. This approach minimized both the trajectory tracking errors and the control effort of each robot, thereby resolving the inconsistencies between different manipulators. In [20], the control law was output by a reinforcement learning mechanism in which the actions corresponding to each state were set to satisfy the stability requirement, and a neural network was used to address the curse of dimensionality. In this paper, a modified backstepping controller is proposed for the underactuated system of robotic container cranes.
The control gain of this controller is important because it influences the convergence of the tracking errors. However, it is difficult for the designer to obtain appropriate values of the control gain empirically, because experience about appropriate control gains is expensive or even unavailable in practice. Therefore, fuzzy Q-learning is applied to automatically search for the optimal fuzzy rules that output appropriate values of the control gain. More precisely, the value of the Lyapunov function is used to judge the applied actions: control gains that decrease the value of the Lyapunov function are given a high reward, and vice versa. After an appropriate learning process, the fuzzy Q-learning mechanism thus obtains the optimal fuzzy rules, which output control gains that reduce the value of the Lyapunov function and thereby achieve the convergence of the tracking errors.
The rest of this paper is organised as follows. In Section 2, a nonlinear dynamics model of robotic container cranes is established by the Lagrangian method. In Section 3, a reinforcement fuzzy Q-learning-based backstepping control scheme is detailed to control the position of the load and stabilize the sway angle of the rope; the stability proof is also presented in this section. In Section 4, a simulation is conducted to verify the effectiveness of the applied controller, and its performance is compared with that of the conventional sliding mode controller. The conclusion is given in Section 5.

Dynamical Model of the Robot
The robotic container crane model is shown in Figure 1, where x is the horizontal displacement of the robot; θ is the load swing angle; $m_1$ and $m_2$ are the masses of the robot body and the load, respectively; and L and F are the length of the rope and the driving force of the robot, respectively.
Assuming that the entire system is friction-free and that the ropes have no mass and undergo no elastic deformation, the kinetic energy T and the potential energy U of the robot system are expressed in terms of x, θ, and their time derivatives, with g denoting the local gravitational acceleration. According to the Lagrangian equation,

$$\frac{d}{dt}\left(\frac{\partial (T - U)}{\partial \dot{q}}\right) - \frac{\partial (T - U)}{\partial q} = \tau,$$

where $\tau = [F, 0]^T$ is the vector of control inputs of the system and $q = [x, \theta]^T$, the dynamics equations of the container crane are obtained. These equations can be rewritten in state-space form (equation (4)). The dynamics of the container crane given by equation (4) can also be presented as the block diagram shown in Figure 2.
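As a point of reference, the textbook trolley-pendulum model under the same assumptions (no friction, massless and inextensible rope) can be sketched as follows. It is assumed, not confirmed, to coincide with the dynamics used in this paper; the function name, argument names, and the default value of g are illustrative. The two equations of motion solved here are $(m_1 + m_2)\ddot{x} + m_2 L \ddot{\theta}\cos\theta - m_2 L \dot{\theta}^2 \sin\theta = F$ and $L\ddot{\theta} + \ddot{x}\cos\theta + g\sin\theta = 0$.

```python
import math


def crane_accelerations(theta, theta_dot, force, m1, m2, rope_len, g=9.81):
    """Return (x_ddot, theta_ddot) for the standard trolley-pendulum model.

    theta is the sway angle measured from the downward vertical (rad),
    force is the horizontal driving force on the trolley (N).
    """
    s, c = math.sin(theta), math.cos(theta)
    # Trolley acceleration after eliminating theta_ddot from the two equations.
    x_ddot = (force + m2 * s * (rope_len * theta_dot ** 2 + g * c)) / (m1 + m2 * s ** 2)
    # Sway acceleration from the second (unactuated) equation.
    theta_ddot = -(x_ddot * c + g * s) / rope_len
    return x_ddot, theta_ddot
```

The second equation contains no input term, which is the underactuation that the controller design has to handle: the sway angle can only be influenced through the trolley acceleration.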

The Design of Reinforcement Learning-Based Backstepping Controller
In this section, a backstepping controller for a crane robot is designed. A fuzzy reinforcement Q-learning mechanism is applied to determine the appropriate parameters of the controller to achieve stability. The control scheme is shown in Figure 3. First, the state-space form of the system dynamics is rewritten step by step to obtain the dynamics model of equation (7). This model is then written in a form to which the backstepping control method can be conveniently applied. The first Lyapunov function of the backstepping control (equation (9)) is designed in terms of the tracking errors with respect to $x_d$ and $\theta_d$, the desired values of the trolley position along the X-axis and of the sway angle, respectively. Taking the first derivative of equation (9) shows that $\dot{V}_1 \le 0$ can be obtained by an appropriate choice of the virtual control $X_2^*$. Consequently, the second Lyapunov function (equation (11)) is designed using the error $e_2 = X_2 - X_2^*$. Taking the first derivative of equation (11) yields equation (12), which involves the terms $E_x$ and $E_\theta$. It is worth noting that the crane robot system (equation (7)) is an underactuated system that has 2 controlled variables (position and sway angle) and only 1 control input (the force F). As a result, the conventional backstepping control law for fully actuated systems is not applicable in this case. Moreover, the control law shown in equation (13), which would necessarily ensure a negative derivative of the Lyapunov function $V_2$, would result in impractically big control signals when its denominator is close to zero (i.e., when $E_x$ and $E_\theta$ are small).
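The two Lyapunov steps above follow the usual backstepping pattern, which can be sketched generically as below. The sketch assumes $X_1 = [x, \theta]^T$, $X_2 = \dot{X}_1$, $X_{1d} = [x_d, \theta_d]^T$, $e_1 = X_1 - X_{1d}$, and quadratic Lyapunov functions; these are standard choices and may differ in detail from equations (9)-(12). Only the gain $\beta_1$ and the error $e_2$ are names taken from the text.

```latex
\begin{aligned}
V_1 &= \tfrac{1}{2}\, e_1^{T} e_1, &
\dot{V}_1 &= e_1^{T}\bigl(X_2 - \dot{X}_{1d}\bigr), \\
X_2^{*} &= \dot{X}_{1d} - \beta_1 e_1, &
e_2 &= X_2 - X_2^{*}
\;\Rightarrow\;
\dot{V}_1 = -\beta_1\, e_1^{T} e_1 + e_1^{T} e_2, \\
V_2 &= V_1 + \tfrac{1}{2}\, e_2^{T} e_2, &
\dot{V}_2 &= -\beta_1\, e_1^{T} e_1 + e_2^{T}\bigl(e_1 + \dot{X}_2 - \dot{X}_2^{*}\bigr).
\end{aligned}
```

In a fully actuated system, $\dot{X}_2$ could be assigned freely to make the last term negative. Here $\dot{X}_2$ has two components driven by the single input F, so the term can only be shaped along one direction, which is why a modified control law is needed.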
Consequently, a novel control law for F that avoids the issue caused by a zero value in the denominator is applied (equation (14)), where ψ ≥ 0 is the control gain. The negative derivative of the Lyapunov function (equation (11)) is maintained as long as the parameter ψ in equation (14) satisfies equation (15). It is difficult to design the parameter δ in a deterministic way to satisfy equation (15), because a small value of $|E_x - (\cos\theta/L)E_\theta|$ would result in an impractically big value of ψ. Consequently, reinforcement fuzzy Q-learning is applied to search for appropriate values of ψ.
In equations (12) and (15), to ensure the stability of the system, ψ is adjusted based on the values of $|E_x - (\cos\theta/L)E_\theta|$ and δ. A reasonable linguistic adjustment rule maps linguistic descriptions of these two quantities to a linguistic conclusion for the gain, for example "ψ is large". The terms small, medium, and large in such rules are all linguistic descriptions. The actual numerical output is obtained by fuzzy reasoning based on the numerical values of the actions and the parameters of the fuzzy structure. The group of actions corresponding to the linguistic description "large" is $\psi_{\text{large}} = \{u_1, u_2, \ldots, u_l\}$. The numerical value of $u_i$ directly affects the performance of the controller. The action within this group is selected based on $|E_x - (\cos\theta/L)E_\theta|$ and δ using the fuzzy Q-learning method so that the Lyapunov function converges. In this control scheme, $E_x - (\cos\theta/L)E_\theta$ and δ are used as the inputs for fuzzy reasoning, i.e., as the state in Q-learning. $\Gamma_1 = E_x - (\cos\theta/L)E_\theta$ and $\Gamma_2 = \delta$ are fuzzified by the triangular membership functions shown in Figure 4, with the fuzzy sets given by equations (16) and (17):

$$\mathrm{Lin}\,\Gamma_1 = \{\tau_{1,1}, \ldots, \tau_{1,a}, \ldots, \tau_{1,A}\}, \quad a = 1, 2, \ldots, A, \tag{16}$$

$$\mathrm{Lin}\,\Gamma_2 = \{\tau_{2,1}, \ldots, \tau_{2,b}, \ldots, \tau_{2,B}\}, \quad b = 1, 2, \ldots, B, \tag{17}$$

where A is the number of fuzzy sets ($\tau_{1,a}$) for $\Gamma_1 = E_x - (\cos\theta/L)E_\theta$ and B is the number of fuzzy sets ($\tau_{2,b}$) for $\Gamma_2 = \delta$.
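The fuzzification of one input by triangular membership functions can be sketched as follows. The sketch assumes that each fuzzy set is a triangle centred on one of the sorted values in `centers`, that neighbouring triangles overlap so the degrees sum to 1, and that inputs outside the covered range saturate at the outermost set; the function name and these layout choices are illustrative and not taken from Figure 4.

```python
def fuzzify(x, centers):
    """Return the membership degree of x in each triangular fuzzy set.

    centers is a sorted list of at least two distinct peak positions. Between
    two neighbouring peaks the two adjacent sets share the membership
    linearly; all other sets get degree 0.
    """
    x = min(max(x, centers[0]), centers[-1])   # saturate outside the covered range
    degrees = [0.0] * len(centers)
    for i in range(len(centers) - 1):
        lo, hi = centers[i], centers[i + 1]
        if lo <= x <= hi:
            w = (x - lo) / (hi - lo)
            degrees[i] = 1.0 - w
            degrees[i + 1] = w
            break
    return degrees


# Example with three sets, e.g. "small", "medium", "large" (illustrative peaks).
example_degrees = fuzzify(0.25, [0.0, 0.5, 1.0])
```

With two inputs, the firing strength of a rule is typically taken as the product (or minimum) of the degrees of its two antecedent sets, which gives A × B rules in total.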
We define the n-th fuzzy rule $R_n$ in the fuzzy Q-learning as follows: $R_n$: if $s_1^k$ is $L_1^n$ and $s_2^k$ is $L_2^n$ and ... and $s_m^k$ is $L_m^n$, then ψ = $U_{nL}$ (ψ = $u_{1L}$ with $q_{1L}(n, 1)$, or ψ = $u_{2L}$ with $q_{2L}(n, 2)$, or ψ = $u_{3L}$ with $q_{3L}(n, 3)$, or ... or ψ = $u_{pL}$ with $q_{pL}(n, p)$), where $(u_{1L}, \ldots, u_{pL})$ is the set of candidate values of the parameter ψ under the n-th rule in the state s at the moment k. For the input state vector $s_k = \{s_1^k, s_2^k, \ldots, s_m^k\}$, the rules yield a set of membership values μ($s_k$) at a given time. Each u in the set $U_{nL}$ has a corresponding q value. The reinforcement learning therefore continually updates the q value of each action in all rules, based on the membership values and the rewards, in order to reach the optimal policy for selecting the actions $u_{iL}$ in all rules. The rewards are given according to the change in the value of the Lyapunov function ($\dot{V} < 0$). First, the Q-learning mechanism selects, in each fuzzy rule, the action u according to its q value. To prevent the selection of u from being trapped in a local optimum, an ε-greedy mechanism with mutation probability ε is introduced (equation (19)): when exploration is triggered, the choice of u is random, which makes the reinforcement learning more globally exploratory during training. The numerical value of the parameter ψ is obtained by defuzzifying the actions selected by the rules (equation (20)). The Q value of the state vector at time k and the target value under the state $s_k$ are obtained by defuzzification in the same way. When the state vector $s_k$ of the system enters the next state $s_{k+1}$ under the action $u_{iL}$, the generated reward information is $c_k$, and the temporal difference error of the process is given by equation (23), where η ∈ (0, 1) is the discount factor that reflects the consideration of future rewards and $c_k$ is the reward given at instant k.
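The selection, defuzzification, and update steps described above can be sketched with the commonly used fuzzy Q-learning formulation. This is a minimal sketch under stated assumptions: every rule shares the same action list, greedy selection takes the largest q value, outputs are firing-strength-weighted averages, and the update is distributed over the rules in proportion to their firing strengths. All function and variable names are illustrative, and the expressions may differ in detail from equations (18)-(25).

```python
import random


def select_actions(q, epsilon, rng):
    """Pick one action index per rule: random with probability epsilon, else greedy."""
    chosen = []
    for q_row in q:
        if rng.random() < epsilon:
            chosen.append(rng.randrange(len(q_row)))
        else:
            chosen.append(max(range(len(q_row)), key=q_row.__getitem__))
    return chosen


def weighted_mean(weights, values):
    """Defuzzify by a firing-strength-weighted average."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def control_gain(actions, mu, chosen):
    """Numerical gain psi obtained from the actions selected in each rule."""
    return weighted_mean(mu, [actions[j] for j in chosen])


def update_q(q, mu, chosen, reward, mu_next, eta, lam):
    """Apply one TD update to the q values of the selected actions; return the TD error."""
    q_now = weighted_mean(mu, [q[n][j] for n, j in enumerate(chosen)])
    v_next = weighted_mean(mu_next, [max(row) for row in q])
    td_error = reward + eta * v_next - q_now
    total = sum(mu)
    for n, j in enumerate(chosen):
        q[n][j] += lam * td_error * mu[n] / total
    return td_error
```

A control step would call `select_actions`, apply the gain returned by `control_gain`, observe the reward and the next firing strengths `mu_next`, and then call `update_q`. Passing a seeded `random.Random` instance as `rng` keeps the exploration reproducible.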
In this case, the change in the Lyapunov function (equation (11)) is used as the reward. More precisely, if the value of the Lyapunov function decreases during the period from instant k − 1 to instant k, a large reward is given; otherwise, a small reward is given. The reward function is designed so that a larger value (no greater than 1) is attained with a more dramatic decrease of the Lyapunov function.
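One candidate reward with the properties just described is the relative decrease of the Lyapunov function between two instants. This is a hypothetical choice made only for illustration, not the reward function of this paper; the function name and the handling of a non-positive previous value are likewise assumptions.

```python
def lyapunov_reward(v_prev, v_now):
    """Relative decrease of the Lyapunov function from instant k-1 to instant k.

    The value never exceeds 1 (since v_now >= 0), grows with the size of the
    decrease, and is negative when the Lyapunov function increases.
    """
    if v_prev <= 0.0:
        # Assumption: the relative decrease is undefined once V(k-1) is zero.
        raise ValueError("v_prev must be positive")
    return 1.0 - v_now / v_prev
```

For example, halving the Lyapunov function gives a reward of 0.5, while an increase from 2 to 3 gives −0.5.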
The largest reward value of 1 is achieved only if V(k) = 0, which means that the system has reached the desired steady state. The final q value is then updated iteratively by equation (25), where λ is the learning coefficient between 0 and 1. The parameters of the proposed controller that must be determined and tuned by users consist of the parameter in the backstepping part (the parameter $\beta_1$ in the terms $E_\theta$ and $E_x$ in equation (14)) and the parameters in the reinforcement learning part (the parameters of the fuzzy sets $\tau_{1,a}$ in equation (16) and $\tau_{2,b}$ in equation (17), the mutation probability ε in equation (19), the discount factor η in equation (23), the learning rate λ in equation (25), and the action group of the control gain ψ). Several rules for tuning these parameters to achieve a satisfactory control performance are given in Remarks 1-6.

Remark 1.
In the backstepping part, $\beta_1$ is the proportional gain of the first virtual control of the backstepping controller. A big value of $\beta_1$ can be chosen to make the Lyapunov function decrease quickly, which means fast convergence of the errors of both the trolley position and the sway angle. However, an excessively big value of $\beta_1$ risks amplifying the measurement noise in the tracking errors ($e_1$) and in their derivatives ($\dot{e}_1$), which would degrade the control performance. In other words, there is a trade-off between fast convergence of the system errors and immunity to measurement noise. Hence, we suggest that the selection of $\beta_1$ should start at a small value (e.g., 0.01) and then gradually increase until a satisfactorily fast decrease of the Lyapunov function is achieved.

Remark 2.
In the reinforcement learning part, the fuzzy sets of the inputs used for fuzzy reasoning ($\tau_{1,a}$ and $\tau_{2,b}$ in equations (16) and (17)) are important because they transform the numerical inputs $\Gamma_1$ and $\Gamma_2$ into the group of firing strengths corresponding to the linguistic descriptions (e.g., small, medium, and big), to which fuzzy reasoning can be applied. Hence, we suggest choosing big values for $\tau_{1,A}$ and $\tau_{2,B}$ and small values for $\tau_{1,1}$ and $\tau_{2,1}$ in order to cover the range of $\Gamma_1$ and $\Gamma_2$ during the control task. We also suggest that the fuzzy sets $\tau_{1,a}$ and $\tau_{2,b}$ should be distributed evenly over the selected range so that the dynamics of the second Lyapunov function are well represented in the fuzzy inference.

Remark 3.
In the reinforcement learning part, the mutation probability ε reflects the trade-off between the exploration of potentially better solutions and the exploitation of good solutions already learnt. It is generally agreed that good solutions are likely to be obtained during the later stage of learning; therefore, we suggest giving ε a big value (e.g., 0.6) during the initial stage of control and a small value (e.g., 0) during the later stage of control.

Remark 4.
In the reinforcement learning part, the discount factor η expresses how much attention is paid to the control performance in future steps. In our case, although the change in the Lyapunov function is influenced by the previous control signals, the current change is mainly determined by the current control signal. The contribution of the current control gain to stabilizing the system should therefore be judged mainly by the current performance (the current change in the Lyapunov function). Therefore, we suggest that the discount factor η should be given a small value (e.g., 0.1).

Remark 5.
In the reinforcement learning part, the learning rate λ reflects the efficiency of remembering new knowledge and forgetting old knowledge. A big value of λ can achieve a fast convergence of q, which means a high learning efficiency. However, it is desirable for reinforcement learning to keep the old knowledge to a certain extent because of the risk of learning false knowledge (e.g., when the data used for learning are contaminated by measurement noise or unknown disturbances). Therefore, we suggest that the learning rate λ should be given a medium value (e.g., 0.4∼0.6).

Remark 6.
The action group of control gains ($u_{1L}, \ldots, u_{pL}$) can be regarded as the most important part in determining the parameters of the controller, because the actual control gains are calculated by equation (20) from the members of this group. The minimum member ($u_{1L}$) of this group should be small enough and the maximum member ($u_{pL}$) should be big enough to ensure that the optimal control gains satisfying equation (15) (meaning the convergence of the Lyapunov function) lie inside the set of control gains that can be calculated. However, an excessively big maximum value can cause chattering. Hence, we suggest that the minimum member ($u_{1L}$) should be very small (e.g., 0) and that the selection of the maximum member ($u_{pL}$) should start at a big value and then gradually decrease until the chattering effect is insignificant. Moreover, the remaining members ($u_{2L}, \ldots, u_{(p-1)L}$) of the action group are suggested to be evenly distributed between the minimum value $u_{1L}$ and the maximum value $u_{pL}$ in order to calculate smooth control gains.
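The evenly distributed action group suggested in Remark 6 can be generated as in the following sketch. The function name and the example bounds are illustrative assumptions; in particular, the maximum value is a placeholder that would be found by the trial procedure described in the remark.

```python
def action_group(u_min, u_max, count):
    """Return count candidate gains evenly spaced from u_min to u_max inclusive."""
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (u_max - u_min) / (count - 1)
    return [u_min + i * step for i in range(count)]


# Example: five candidates between 0 and a hypothetical maximum of 10.
candidates = action_group(0.0, 10.0, 5)   # [0.0, 2.5, 5.0, 7.5, 10.0]
```

Reducing `u_max` while keeping `count` fixed also shrinks the spacing between candidates, so the gain produced by defuzzification changes in smaller steps.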

Simulation Result
A simulation was run to verify the effectiveness of the proposed controller. The control target was to accurately control the position of the load ($m_2$) with as little sway of the rope as possible. In other words, the load ($m_2$) is supposed to reach the designated position, and the angle of the rope is supposed to be stabilized around 0 by the control scheme.
After the parameters were carefully selected according to the rules given in Remarks 1-5, the detailed parameters of the controller were determined. The parameters of the robot system and of the controller used in the simulation are shown in Table 1.
The numbers of linguistic variables (the fuzzy sets in equations (16) and (17)) used to describe $\Gamma_1$ and $\Gamma_2$ are both set to 10.
The number of action candidates in the action group for calculating the control gain ψ is set to 50 for each fuzzy rule. The membership functions are set after carefully selecting their parameters based on Remark 2. The desired trolley position is set as a constant, $x_d$ = 0.1 m, while the sway angle is to be minimized, so that $\theta_d$ = 0 rad. The length of the rope is the crucial element influencing the stability of the crane system [19], and an uncertainty of ±20% is applied to the rope length to show the ability of the control scheme to handle rope-length uncertainty. The probability of exploring potentially optimal fuzzy rules is set as ε = 0.5 during the initial 60 s, ε = 0.3 from 60 s to 100 s, and ε = 0 after 100 s.
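The exploration schedule stated above can be written as a simple piecewise-constant function of the simulation time. The function name is illustrative, and assigning the boundary instants t = 60 s and t = 100 s to the later interval is an assumption, since the text does not say which interval they belong to.

```python
def exploration_probability(t):
    """Return the mutation probability epsilon at simulation time t (seconds)."""
    if t < 60.0:
        return 0.5    # initial stage: frequent exploration
    if t < 100.0:
        return 0.3    # intermediate stage
    return 0.0        # late stage: purely greedy selection
```

After 100 s the rules are exploited without further random actions, which is consistent with the suggestion in Remark 3 to use a small ε in the later stage of control.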
In the simulation, the proposed controller is compared with the conventional sliding mode controller (SMC) described in [21]. The performance of the proposed reinforcement learning-based BSC and of the conventional SMC in tracking a constant trolley position is shown in Figure 5. With the proposed reinforcement learning-based BSC, significant overshoot is observed during the initial 20 s, and the trolley takes longer to reach the target position than with the conventional SMC, which could result from the exploration of bad fuzzy rules during the initial stage of reinforcement learning. However, compared with the conventional SMC, the designed controller achieves a smaller steady-state error (SSE) after learning the optimal rules during the later stage of control (shown in the subplot on the right part of Figure 5). In other words, the designed controller tracks the trolley position more accurately at the expense of fluctuations during the initial stage of control, which reflects the nature of reinforcement learning: optimal solutions are obtained after trying many bad solutions. To avoid the observed overshoot and fluctuations, good fuzzy rules obtained from the experience of designers and from prior knowledge of the crane system could be used as the initial rules of reinforcement learning. The performance of the designed controller and of the conventional SMC in stabilizing the sway angle at 0 degrees is compared in Figure 6. Compared with the conventional SMC, the fluctuation of the sway angle under the reinforcement learning-based BSC lasts longer (the sway angle takes longer to reach 0), but the sway angle then stays around 0 degrees with less chattering and smaller steady-state errors during the late period of control, as shown in the 2 subplots of Figure 6.
In other words, the applied controller starts with poorer sway-angle stabilization (a longer settling time, 0∼60 s) and then achieves better performance than the conventional SMC during the late stage of control (60 s∼200 s). The reason is the same as for the trolley position: by the nature of reinforcement learning, optimal solutions are obtained at the cost of trying many bad solutions. In addition, the proposed controller shows less overshoot of the sway angle during the initial stage than the conventional SMC. Figure 7 shows the control forces generated by the reinforcement learning-based BSC and the conventional SMC. The reinforcement learning-based BSC produces markedly less chattering and smaller control forces than the conventional SMC over the entire period of control. Figure 7 also shows that the control force generated by the reinforcement learning-based BSC chatters more in the initial stage of control (0-80 s) than in the late stage (80-200 s), which agrees with Figures 5 and 6. The reason is again the one explained in the discussion of Figures 5 and 6.
[...] the second Lyapunov function are described by a limited number of linguistic variables (fuzzy sets), so there are not enough fuzzy rules to correctly determine the control gain that achieves convergence when the value of the Lyapunov function is close to zero. For example, the linguistic variables describing the fuzzy inputs are "zero" when $|\Gamma_1|$ ≤ 0.0002 and $|\Gamma_2|$ ≤ 0.002, which means that only one fuzzy rule corresponds to all the states inside the range ($|\Gamma_1|$ ≤ 0.0002, $|\Gamma_2|$ ≤ 0.002), although those states would need different values of the control gain to achieve the convergence of the Lyapunov function.
A more detailed example illustrates this issue: the state ($|\Gamma_1|$ = 0.0002, $|\Gamma_2|$ = 0.002) and the state ($|\Gamma_1|$ = 0.00001, $|\Gamma_2|$ = 0.0001) have the same linguistic description "$\Gamma_1$ is zero and $\Gamma_2$ is zero" in the fuzzy reasoning. Consequently, the same control gain is generated for those 2 states, although they would need different control gains to achieve the convergence of the Lyapunov function. Therefore, the Lyapunov function continuously fluctuates, because the single fuzzy rule that corresponds to the state "$\Gamma_1$ is zero and $\Gamma_2$ is zero" adaptively tries different action candidates for different states inside a small range (e.g., $|\Gamma_1|$ ≤ 0.0002 and $|\Gamma_2|$ ≤ 0.002). The outputs of the fuzzy logic inference at different times are shown in Figure 9. The output function of the fuzzy logic inference differs from time to time because the fuzzy rules on which the inference depends differ at different instants of learning. More precisely, the initial fuzzy rules are set randomly, so the output function of the fuzzy inference is random at the beginning. The output function then varies significantly during the medium period of learning (t = 20 s; t = 100 s). After that, the output function tends to be stable, with less variation in its shape (t = 100 s, t = 160 s, and t = 180 s), which means that the optimal rules have been learnt.

Conclusions
In this paper, we proposed a reinforcement learning-based backstepping control (BSC) scheme that can handle the underactuated system of container cranes. In this scheme, the control gain of the BSC, which influences the stability, is tuned by reinforcement fuzzy Q-learning, which automatically searches for the optimal fuzzy rules that generate appropriate control gains to decrease the Lyapunov function. The simulation results show the effectiveness of the applied control scheme: compared with the conventional SMC, it achieves less chattering and a smaller steady-state error under uncertainty in the rope length.

Data Availability
The data used to support the findings of this study are included within the article and can be used for other research studies.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments
[...] preparation of this manuscript and on improving the quality of the figures used in the manuscript. This work was financially supported by major national planning projects of China's Ministry of Industry and Information (Z135060009002-50).