Fuzzy Neural Network Q-Learning Method for Model Disturbance Change: A Deployable Antenna Panel Application



Introduction
To achieve long-term continuous meteorological observations, a 5-meter deployable antenna has been placed in geosynchronous orbit [1, 2]. The 5-meter deployable antenna panel can be divided into five units: the upper left plate, the upper right plate, the lower left plate, the lower right plate, and the retainer plate, as shown in Figure 1. An active adjustment system was adopted to meet the observation requirements of the 5-meter deployable antenna, as shown in Figure 2.
(1) Each reflecting panel is supported by more than one active actuator, so the control model of the active adjustment system is a typical multiple-input multiple-output (MIMO) system. (2) Strong coupling effects arise because the active actuators interact with each other during the adjustment process. The control model of the active adjustment system changes because of disturbances arising from working conditions, model simplifications, and external interferences of the 5-meter deployable antenna. Moreover, the disturbance, and particularly a change in the disturbance, can significantly degrade the observation accuracy of the deployable antenna. To improve the observation accuracy, various control methods are worth considering for the control process of the active adjustment system.
In the past few years, a great deal of progress has been made on the control of MIMO systems, and numerous control strategies have been introduced [3-5]. Intelligent control methods for MIMO systems have been put forward by researchers combining fuzzy control theory with artificial intelligence and adaptive control [6-8]. Hamdy et al. proposed a novel inverted fuzzy decoupling scheme [9]; moreover, a Smith predictor (SP) based on a fuzzy decoupling scheme was proposed for a MIMO chemical process with multiple time delays. The schemes were realized using fuzzy logic, thereby avoiding reliance on model-based analyses. Furthermore, Hamdy and Ramadan proposed another fuzzy decoupling-based PI controller [10]. Fuzzy decoupling schemes are simple to design and easy to implement, and they demonstrate good decoupling capabilities. However, such methods are only applicable to cases in which the output is a state quantity without disturbances.
A discrete linear quadratic regulator (LQR) controller for the active adjustment system model ensured the observation accuracy requirement of the deployable antenna [11]. However, the LQR controller depended on an accurate control model and was designed without considering disturbances.
In [12], an adaptive reduced-dimension fuzzy decoupling control strategy based on fusion functions was proposed to eliminate the effect of uncertainties due to coupled subsystems by designing and incorporating a fuzzy decoupling disturbance observer. However, the method is no longer applicable when the disturbances exceed the set threshold.
According to [13], Q-learning is considered to have a good capability for dealing with uncertainties. Thus, the Q-learning algorithm was applied to mobile robot navigation in a dynamic environment by limiting the number of states based on a new definition of the state space [14]. The proposed method could reduce the dimension of the Q value table and improve the efficiency of the system. A novel Q-learning method was described for unknown discrete-time (DT) linear quadratic regulation problems [15]. In that work, Q-learning was proved to be an effective scheme for unknown dynamical systems without requiring a deep understanding of the system dynamics. These two methods could deal with uncertainties from the dynamic environment, but they were unable to cope with disturbances from model simplifications and external interferences.
A backward Q-learning strategy was presented based on the combination of the Sarsa algorithm and Q-learning [16]. The proposed method was found to converge more quickly to the true Q values and to improve final performance. A heuristic accelerated Q-learning method was developed in which a heuristic function affects action selection [17], and simulation results showed that a simple heuristic function could enhance the performance of the Q-learning method. These two strategies could enhance the performance of the control system, but the Q value table was built for an individual control model and could not be reused for other control models.
Therefore, the key to handling disturbance change is to make the parameters of Q-learning adapt to the change in the control model. In the traditional Q-learning algorithm, however, the parameters are difficult to adapt because the environment in which the agent is located is spatially continuous, and discretizing continuous state-action spaces can have a huge impact on the dimension of the Q value table. One solution to this problem is to introduce a fuzzy inference system (FIS) into Q-learning, which maps discrete-time states to continuous state-actions without affecting the dimension of the Q value table. During the last decade, a great deal of progress has been made on combining the Q-learning method with the FIS.
Numerous existing strategies have been widely applied to robot navigation, but few mature methods have been developed for the active adjustment system of a deployable antenna panel. In this paper, a high-precision active control method named the fuzzy neural network Q-learning (FNNQL) control strategy is proposed to overcome the model disturbance change of the active adjustment system of the deployable antenna panel. The main idea of the FNNQL controller is that the FIS is introduced into Q-learning and the input of Q-learning is fuzzified. The parameters of Q-learning are then learned online by the fuzzy RBF neural network, which makes the Q value table suitable for disturbance changes in the control model. At the same time, the robustness of the control method is improved while ensuring the precision of the deployable antenna.
The FNNQL controller has the following advantages: (1) The error of the model disturbance is reduced because the input of Q-learning is fuzzified, and the accuracy of state transition is improved. (2) A fuzzy rule base of the coupled MIMO system can be constructed by trial-and-error with Q-learning, so that an accurate relationship between the premise and consequent parameters of the fuzzy system can be obtained. (3) The outputs of the fuzzy system are the Q value and the action selection of Q-learning, respectively, and the value function and action can be modified by the RBF neural network as the state environment changes.

Problem Statement
The mathematical model of the active adjustment mechanism of the deployable antenna can be described as follows [11]:

Z_c(k+1) = Z_c(k) + B U(k),  (1)

where Z_c(k) stands for the displacement before the k-th adjustment, U(k) is the external force of the k-th adjustment, and B represents the transfer matrix between the external force and the actual displacement.
To study the effect of disturbance change on the antenna precision, the model disturbance can be described by an uncertainty matrix ΔB, randomly generated within a range of τ% of the 2-norm of the control matrix B:

‖ΔB‖₂ ≤ (τ/100) ‖B‖₂,  (2)

where τ is a positive constant belonging to [1, 100]. Based on Equations (1) and (2), the discrete multivariable control model with disturbance can be described as

Z_c(k+1) = Z_c(k) + (B + ΔB) U(k).  (3)

Since the active adjustment mechanism of the deployable antenna reflector consists of several active actuators that affect each other during the panel adjustment process, it is necessary to convert the complex control system into a simpler one. For any MIMO control system with a full-rank transfer matrix, a compensator can be designed to decouple it [23].
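The generation of the uncertainty matrix ΔB can be sketched as follows. The paper does not state the sampling scheme, so uniform random entries rescaled to the τ% norm bound are assumed here; only the 2-norm constraint of Equation (2) comes from the text.

```python
import numpy as np

def random_disturbance(B, tau, rng=None):
    """Random disturbance matrix whose 2-norm is at most tau percent of
    the 2-norm of B (Equation (2)). The uniform sampling and exact
    rescaling to the bound are assumptions, not taken from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    dB = rng.uniform(-1.0, 1.0, size=B.shape)        # raw random matrix
    bound = (tau / 100.0) * np.linalg.norm(B, 2)     # tau% of ||B||_2
    return dB * (bound / np.linalg.norm(dB, 2))      # rescale to the bound
```

The disturbed model of Equation (3) then advances as `Z_next = Z + (B + dB) @ U`.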
Assuming that W(s) is the transfer function matrix of Equation (1), where C and A are unit diagonal matrices and W(s) is full rank, a compensator can be designed to decouple the MIMO system of Equation (1).
Based on the unit diagonal matrix decoupling principle, Equation (4) can be rewritten in terms of W_p(s), the transfer function matrix of the feed-forward compensator (Equation (5)), from which Equation (6) is derived. From Equation (6), supposing the output matrix of the feed-forward compensator is B⁻¹, the output equation of the feed-forward compensator can be written as Equation (7). Rearranging Equation (7) and combining it with Equation (3), Equation (9) is obtained. Multiplying both sides of Equation (9) by B⁻¹, the decoupled model can be expressed as

X(k+1) = X(k) + (I + B⁻¹ΔB) U(k),  (10)

where X(k) = [x₁(k), x₂(k), ⋯, x_N(k)] is the state vector, N is the number of actuators, and B⁻¹ΔB is an unknown matrix representing the disturbance. From Equation (10), the discrete multivariable decoupled system can be decomposed into many independent single-variable state equations. In this section, the middle plate of the deployable antenna panel is taken as an example to test the validity of Q-learning, as shown in Figure 3; the 25 actuators distributed on the middle plate are also shown in the figure.
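A one-step update of the decoupled model can be sketched as below. With the feed-forward compensator B⁻¹ in place, the nominal plant reduces to independent integrator channels and only the term B⁻¹ΔB re-couples them; the identity-plus-perturbation structure follows the decoupled form of Equation (10).

```python
import numpy as np

def decoupled_step(X, U, B, dB):
    """One step of the decoupled plant: X(k+1) = X(k) + (I + B^-1 dB) U(k).
    With dB = 0 each actuator channel evolves independently; B^-1 dB is
    the residual coupling left by the model disturbance."""
    return X + U + np.linalg.inv(B) @ dB @ U
```

For the undisturbed case (dB = 0) every channel simply integrates its own input, which is what makes per-actuator single-variable control possible.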
In this section, a simulation analysis is performed to test the validity of the Q-learning controller. During the simulation, the control matrix B was obtained using the finite element software ANSYS (version 14.5), and the raw data from the tests are shown in Table 1; the units are mm. The disturbance matrices ΔB were obtained for the cases τ = 15, τ = 30, and τ = 90 according to Equation (2). In the simulation, the values in Table 2 were introduced as disturbances into Equation (10) to assess the ability of the Q-learning controller. The adjusted displacement is shown in Figure 4. In the figure, the curves converge at the 25th step after large fluctuations, which indicates that the system reached stability. Figure 5 shows the root mean square (RMS) error curve for the 25 actuators; the displacements of the 25 actuators are below 10⁻³ mm at the 8th step. The RMS is on the micron scale and shows a narrowing tendency. This proves that the Q value table and the action selection obtained by continuous trial-and-error can adapt to a disturbance of 15%.
In the next simulation, the Q value table obtained with a disturbance of 15% was used to deal with disturbances of 30% and 90%, respectively. The results are shown in Figures 6 and 7. As shown in the figures, since the disturbances of the control system changed, excessive overshooting occurred.
This shows that a Q value table obtained under a given disturbance cannot adapt to other disturbance changes. If the system is to remain stable under other disturbances, the Q values corresponding to each disturbance must be obtained through continuous trial-and-error, which increases the workload.
To keep the system stable under disturbance change, the FNNQL control strategy is proposed in the following section. The proposed controller comprises two parts. In Part I (Fuzzy Parameter Learning), the fuzzy neural network is introduced into Q-learning and the input of Q-learning is fuzzified; meanwhile, the trial-and-error method is used in the Q-learning system to obtain the fuzzy rule base. In Part II (Fuzzy Parameter Adjustment), as the temporal-difference (TD) error decreases, the parameters of the membership function are updated through the fuzzy RBF neural network [24], and the Q value and the state selection of Q-learning can be obtained to adjust the active adjustment system.

Fuzzy Neural Network Q-Learning Controller Design
3.1. Part I: Fuzzy Parameter Learning. In this part, the FIS is introduced into Q-learning, and the optimal parameters of the FNNQL controller are obtained. The structure of the FNNQL controller is shown in Figure 8 and can be divided into five layers. The 1st layer is the system input; the 2nd layer performs fuzzification; the 3rd layer is the activation operation; and the 4th layer is the action selection.
Finally, the output results are obtained in the 5th layer. A detailed description of each layer is given in the following paragraphs.
3.1.1. Input Layer. In this layer, to reflect the characteristics of the disturbance change in real time, the error of the control state and the rate of change of the error are taken as the inputs of the FNNQL controller. According to Equations (1) and (10), the error of the control state can be described as in Equation (12), and based on Equation (12), the rate of change of the error can be described as in Equation (13). According to Figure 8, the inputs in this layer are also the previous parameters of the neural network FIS (NNFIS).
3.1.2. Fuzzy Layer. The inputs of the fuzzy layer are the outputs of the 1st layer, and they are fuzzified in this layer. The quantization factors k_e and k_ec are introduced into the fuzzification equations to map the inputs to the fuzzy domain accurately and effectively. The fuzzification equations for e(k) and ec(k) are defined in Equation (14). In the process of fuzzification, three fuzzy subsets are defined for e(k) and ec(k): e(k) = {B, N, S} and ec(k) = {B, N, S}. The membership function is Gaussian.
Table 6: Initial values of the error change rate mc.
Suppose F_e can be divided into p regions and F_ec into q regions, as defined in Equations (15) and (16). Based on Equations (15) and (16), the number of fuzzy rules is given by Equation (17). In this paper, the fuzzy membership function is Gaussian:

μ(x) = exp(−(x − m)² / (2σ²)),  (18)

where σ is the width of the membership function and m is its centre. The membership functions of the two fuzzy inputs e(k) and ec(k) are then defined as in Equation (19), where i represents the rule number, i ∈ {1, 2, ⋯, n}; m_i(k) and σ_i(k) are the centre and width of μ_i, respectively; and mc_i(k) and σc_i(k) are the centre and width of μc_i, respectively. The detailed inner structure of the i-th node in the fuzzy layer can be seen in Figure 9.
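The fuzzification step can be sketched as follows. The linear scaling by the quantization factor is an assumption (the paper only says k_e and k_ec map the inputs to the fuzzy domain), while the Gaussian membership follows the standard form with centre m and width σ.

```python
import numpy as np

def fuzzify(x, k):
    """Map a raw input to the fuzzy domain with its quantization factor
    (Equation (14) sketch; a simple linear scaling k*x is assumed)."""
    return k * x

def gauss_mu(x, m, sigma):
    """Gaussian membership value with centre m and width sigma; the
    usual form exp(-(x - m)^2 / (2 sigma^2)) is assumed for Eq. (18)."""
    return float(np.exp(-((x - m) ** 2) / (2.0 * sigma ** 2)))
```

For example, the scaled error fuzzify(e, k_e) is evaluated against each rule's (m_i, σ_i) pair to produce the membership vector consumed by the activation layer.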
In this layer, the inputs of the control system are fuzzified. In the next step, the activation rate of each fuzzy rule should be calculated to obtain the accurate output of the FNNQL control system.
Table 8: Initial values of the error change rate σc.
3.1.3. Activation Calculation Layer. First, the i-th fuzzy rule can be written as in Equation (20), where a_i is the optimal action selection of the FNNQL control.
To measure the usage of fuzzy rules in the NNFIS, according to Equations (19) and (20), the activation rate of the fuzzy rules is defined in Equation (21), where i represents the rule number and φ_i is the activation rate of the i-th fuzzy rule. Based on φ_i, the optimal action selection is described in the following layer, taking the i-th fuzzy rule as an example.
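The activation rate can be sketched as below. Taking φ_i as the product of the two input memberships, normalised so the rates sum to one, is the common choice for two-input fuzzy rules; the exact normalisation in Equation (21) is an assumption.

```python
import numpy as np

def activation_rates(mu_e, mu_ec):
    """Firing rate of each fuzzy rule: the product of the memberships of
    e(k) and ec(k), normalised over all rules (sketch of Eq. (21))."""
    w = np.asarray(mu_e, dtype=float) * np.asarray(mu_ec, dtype=float)
    return w / w.sum()
```

The resulting vector weights every rule's consequent in the output layer, so rules whose premises best match the current error dominate the control action.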
3.1.4. Action Selection Layer. In this layer, the accurate output of the FNNQL control system is obtained based on the premise parameters, and each fuzzy input of the control system obtains its optimal action. According to the ε-greedy policy, the action selection is defined in Equation (23), where Rq_i ∈ A, A = {a₁, a₂, ⋯, a_c}, is a randomly chosen action; Bq_i is the optimal action selection; ε is the action selection probability; and i is the number of the fuzzy rule.
To assess the action selection, an additional penalty function is used in this layer, as shown in Equation (24), where ξ → 0 and λ ∈ [0, 1]. If the state error is within the accuracy range, a positive reward is obtained and the action selection is reinforced correspondingly; otherwise, a negative or zero reward is obtained and the action selection is weakened. Using multiple trial-and-error runs and the ε-greedy policy, the mapping between the fuzzy inputs and the optimal action selection can be obtained, and the fuzzy rule base can thereby be constructed. Since the state error is caused by model disturbance, the fuzzy rules obtained by trial-and-error are consistent with the disturbance change. The structure of this layer is shown in Figure 10.
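A minimal sketch of the penalty-style reward is given below. The paper only fixes ξ → 0 and λ ∈ [0, 1] and the sign behaviour (positive inside the accuracy band, negative or zero outside), so the specific constants and the symmetric ±λ shape are assumptions.

```python
def reward(e, xi=1e-3, lam=0.5):
    """Penalty-function reward (sketch of Eq. (24)): positive lam when
    the state error is inside the accuracy band xi, otherwise -lam.
    xi and the symmetric shape are illustrative assumptions."""
    return lam if abs(e) <= xi else -lam
```

With this shape, actions that drive the actuator displacement below the 10⁻³ mm accuracy band are reinforced, and all others are weakened.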
3.1.5. Output Layer. In this layer, the Q value table corresponding to the optimal continuous action selection is obtained as the consequent parameters of the NNFIS, based on the fuzzy rule base constructed in Layer IV. According to the approximate Q-learning principle [25] and Equation (23), the value of the optimal continuous action selection is defined in Equation (25). Similarly, the Q value function corresponding to the optimal continuous action selection can be represented as in Equation (26), where Q_j is the total of all Q value functions and j is the number of control states. Based on Equations (10), (21), and (25), the equation of the control system can be rewritten as Equation (27), where U(k) = [e(k), ec(k)]ᵀ is the input vector of the FNNQL controller.
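The defuzzified output can be sketched as the activation-rate-weighted sum of per-rule quantities, which is the usual approximate-Q-learning composition behind Equations (25) and (26); the same weighting is assumed here for both the continuous action and the global Q value.

```python
import numpy as np

def fuzzy_output(phi, per_rule):
    """Defuzzified global quantity: the firing-rate-weighted sum of the
    per-rule values (sketch of Eqs. (25)-(26)). Pass per-rule actions to
    get the continuous control action, or per-rule Q values to get Q(s)."""
    return float(np.dot(phi, per_rule))
```

Because φ sums to one, the output is a convex combination of the rule consequents, so the continuous action always stays inside the range spanned by the discrete action set.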
In Part I (Fuzzy Parameter Learning), the Q value function Q(j) was obtained; it is used to adjust the cluster centre and cluster width of the membership function in Part II (Fuzzy Parameter Adjustment).

3.2. Part II: Fuzzy Parameter Adjustment. According to Equation (27), the new displacement X_P(k+1) can be obtained, and the new error e(k+1) and error rate ec(k+1) can then be derived from Equations (12) and (13). The reward vector r is obtained from Equation (24). According to the Bellman optimality theory [26], the temporal-difference error of TD(0) can be described as

δ(k) = r(k) + γ max_a Q(k+1, a) − Q(k, a),  (28)

where γ is the discount rate. The root mean square (RMS) error function, used to measure the deformation accuracy of the control system, is defined as

RMS(k) = √((1/N) Σ_{j=1}^{N} e_j(k)²).  (29)

In the FNNQL controller, since the membership function is a Gaussian RBF, the centre of an ideal RBF should coincide with the location of the fuzzy input that occurs most frequently, and the widths should match the boundary values of the highest-frequency region of the fuzzy input. Therefore, it is necessary to find the most suitable centre and width by adjusting the parameters so as to capture the disturbance distribution accurately.
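The two error measures of this part can be sketched directly; the TD(0) backup is the standard Bellman-optimality form and the RMS is computed over the N actuator errors at step k.

```python
import numpy as np

def td_error(r, q_next_max, q_curr, gamma=0.8):
    """TD(0) error delta = r + gamma * max_a Q(s', a) - Q(s, a),
    with gamma = 0.8 as in the simulation section."""
    return r + gamma * q_next_max - q_curr

def rms(errors):
    """Root mean square of the actuator errors, the deformation-accuracy
    measure of the control system."""
    e = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))
```

As training proceeds, a shrinking TD error indicates that the learned Q values are becoming self-consistent, while the RMS tracks how close the panel deformation is to the 10⁻³ mm accuracy target.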
According to Equation (29), the centre and width can be updated by gradient descent [27, 28]. The gradient-descent update for e(k) is given in Equations (30) and (31), where ∂RMS(k)/∂m_ij(k) is the gradient, β_mij(k) is the step size, and m_ij(k) is the centre of the j-th input variable of the i-th fuzzy rule. The gradient can be expanded via the chain rule as Equation (32); the individual factors follow from Equation (29) (Equation (33)), Equation (28) (Equation (34)), and Equations (20), (21), and (22) (Equations (35) and (36)). Substituting Equations (33), (34), (35), and (36) into Equation (32) yields Equation (37), and substituting Equation (37) into Equations (30) and (31) gives the full gradient-descent update for e(k) (Equations (38) and (39)). The gradient-descent update for ec(k) is given in Equation (40), where β_mcij(k) is the step size and mc_ij(k) is the cluster centre of the j-th input variable of the i-th fuzzy rule.
Table 11: Error change rate mc.
Similarly, the gradient-descent update for σ_ij(k) is given in Equations (41) and (42), where ∂RMS(k)/∂σ_ij(k) is the gradient and β_σij(k) is the step size. The gradient can be expanded as Equation (43). Based on Equations (20), (21), and (22), Equation (44) is obtained. Substituting Equations (33), (34), (35), and (44) into Equation (43) yields Equation (45), and substituting Equation (45) into Equations (41) and (42) gives the full gradient-descent update for σ_ij(k). The width of the error change rate, σc_ij(k), can be updated with an analogous gradient-descent relation. Using the above equations, the most suitable centres and widths of the membership functions are obtained so as to capture the disturbance distribution accurately.
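The core of the chain-rule updates can be sketched as follows. The partials of the Gaussian membership μ = exp(−(x − m)²/(2σ²)) with respect to its centre and width are exact; `grad_out` stands for the remaining chained factor ∂RMS/∂μ propagated back through the network, which in the paper is assembled from Equations (33)-(36) and is treated as a given scalar here.

```python
import numpy as np

def update_gauss_params(m, sigma, x, grad_out, beta=0.4):
    """One gradient-descent step on the Gaussian centre m and width sigma
    (sketch of Eqs. (30)-(46)); beta = 0.4 as in the simulation section.
    grad_out is the backpropagated factor dRMS/dmu, assumed given."""
    mu = np.exp(-((x - m) ** 2) / (2.0 * sigma ** 2))
    dmu_dm = mu * (x - m) / sigma ** 2             # d mu / d m
    dmu_dsigma = mu * (x - m) ** 2 / sigma ** 3    # d mu / d sigma
    m_new = m - beta * grad_out * dmu_dm
    sigma_new = sigma - beta * grad_out * dmu_dsigma
    return m_new, sigma_new
```

When the TD error (and hence `grad_out`) shrinks to zero, the centres and widths stop moving, which is the stopping behaviour exploited in Part II.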

Simulation
In this section, the middle plate of the 5-meter deployable antenna panel is taken as an example to test the FNNQL strategy. The initial value of X(k), the transfer matrix B, and the disturbance uncertainties are the same as in Section 2. The parameters of the FNNQL controller are given as follows: the step size of the gradient descent is β = 0.4; the quantization factors are k_e = 0.1 and k_ec = 0.05; and the discount rate is γ = 0.8. The initial values of the centre m, the error change rate of the centre mc, the width σ, and the error change rate of the width σc can be found in Tables 5-8.
In the case of μc = {B, N, S} and μ = {B, N, S}, 9 fuzzy sets for fuzzy control are obtained through their combinations. To assess the effectiveness of the FNNQL controller in dealing with the disturbance change, the value functions Q(k) are trained from the initial values given in Equation (11); the training curve is shown in Figure 12.
As seen in Figure 12, after 5000 random trial-and-error runs, the curves of all 25 nodes show a narrowing tendency and settle within ±0.03. This further proves that the Q-learning system has learned the correspondence between control correction and disturbance change. Moreover, the following parameters are obtained for the stable system: the value functions of the 25 actuators, of which the value function Q(k) of the 21st actuator is taken as an example in Table 9. The centres and their error change rates, and the widths and their error change rates, for the 25 actuators are given in Tables 10-13, respectively.
Next, the parameters found in Tables 9-13 were applied to the deployable antenna active adjustment system with different disturbances to verify the adaptability of the FNNQL controller. For the active adjustment system with the disturbances given in Table 3, the deformation was adjusted by the FNNQL controller; the adjustment curves are shown in Figures 13 and 14. Contrasting Figure 13 with Figure 6, after adjustment by the FNNQL controller, the displacements of all adjustment actuators were below 10⁻³ mm after 40 adjustments, and the RMS was below 10⁻³ mm when the system was regarded as stable.
Similarly, for the active adjustment system with the different disturbances given in Table 4, the deformation was adjusted by the FNNQL controller, and the adjustment curves are shown in Figures 15 and 16, respectively. Contrasting Figure 15 with Figure 7, after adjustment by the FNNQL controller, the displacements of all adjustment actuators were also below 10⁻³ mm after 50 adjustments.
As seen in Figures 14-16, the actuators' displacements were adjusted by the FNNQL controller under different disturbances, and the RMS of the active adjustment system was below 10⁻³ mm when the system was regarded as stable. The simulation results indicate that the proposed method can handle the model disturbance change.

Conclusions
In this paper, the FNNQL control method was studied for the active deformation adjustment mechanism of a deployable antenna panel. First, the error of the model disturbance was reduced because the input of Q-learning is fuzzified, and the accuracy of state transition was improved; a fuzzy rule base of the coupled MIMO system was constructed by trial-and-error with Q-learning, so that an accurate relationship between the premise and consequent parameters of the fuzzy system was obtained. Second, the width and centre of the membership functions were modified by the RBF neural network as the state environment changed. Finally, the effectiveness and superiority of the proposed approach were demonstrated using the middle plate of the 5 m deployable antenna panel as an example. The simulation results indicate that the proposed method is an effective tool for active deformation adjustment under disturbance change of the 5 m deployable antenna panel.
However, in this paper the value function was obtained after 5000 random trial-and-error runs, so the computing time of the control method is long because the number of trial-and-error runs is large. Reducing the time complexity of the method is the subject of our future research.

Figure 3: Middle plate of the deployable antenna panel.

Figure 7: Q-learning adjustment curve with a disturbance of 90%.

Figure 8: Structure of fuzzy neural network Q-learning based on RBF.

Figure 9: The inner structure of the i-th node.

Figure 10: Structure of the action selection layer.

Figure 12: Training curve of the value function.

Table 1: Transfer matrix B of the middle plate.

Table 5: Initial values of the clustering centre m.

Table 7: Initial values of the clustering width σ.

Table 10: Clustering centre m of 25 actuators.