Neural Computing Enhanced Parameter Estimation for Multi-Input and Multi-Output Total Non-Linear Dynamic Models

In this paper, a gradient descent algorithm is proposed for the parameter estimation of multi-input and multi-output (MIMO) total non-linear dynamic models. Firstly, the MIMO total non-linear model is mapped to a non-completely connected feedforward neural network, that is, the parameters of the total non-linear model are mapped to the connection weights of the neural network. Then, based on the minimization of the network error, a weight-updating algorithm, that is, an estimation algorithm for the model parameters, is proposed together with convergence conditions for the non-completely connected feedforward network. To further determine the terms of the model set, a method of model structure detection is proposed for selecting a group of important terms from the whole candidate set. In order to verify the usefulness of the parameter identification procedure, we provide a virtual bench test example for the numerical analysis and user-friendly instructions for potential applications.


Introduction
Because a total non-linear model can provide a very concise representation of complex non-linear systems and has good extrapolation characteristics, it has attracted attention in both academic research and applications. Compared with the polynomial non-linear auto-regressive moving average with exogenous input (NARMAX) model, the total non-linear model is an extension of the polynomial model and can be defined as the ratio of two polynomial expressions [1][2][3]. The introduction of denominator polynomials makes the model non-linear in both the parameters and the regression terms. Therefore, compared with the polynomial model, the model identification and the controller design of the total non-linear model are much more challenging [4,5]. In view of the difficulty of parameter estimation for a total non-linear model, simple and effective algorithms and machine learning should be considered for extracting the information from measurement data.

Literature Survey
At present, a variety of model structure detection techniques and parameter estimation algorithms have been developed for non-linear models. The authors of [19] presented a thorough analysis that included two kernel components: SISO rational model identification and orthogonal algorithms for the parameter estimation of polynomial non-linear models, such as predictive and back-propagation computation. Since then, rational model identification has branched into diverse directions, such as more theoretical considerations of a non-linear least squares algorithm [4], a maximum likelihood estimation [3], and a bias-compensated recursive least squares algorithm [2]. It has been noted that MIMO rational model identification has seldom attracted research, probably due to the complexity of the algorithm formulation and the coupling effects. However, MIMO rational model identification should be on the research agenda now because of recent applications and increasing computing facilities.
The total non-linear system model, which is relatively new, is the alternative name of the NARMAX rational model, which was defined by a survey paper on rational model identification [19]. The total non-linear model emphasizes the non-linearity in both the parameters and the control inputs, and it has been taken as a challenging structure for designing non-linear dynamic control systems [1]. The rational model has received more consideration than expanded polynomials with regard to mathematics, structure detection, and parameter estimation in the field of system identification [2,3]. Therefore, the main contribution of the new study is to use neural computing algorithms for MIMO model parameter estimation. The new study is a complement to those classical NARMAX approaches.
The rest of the paper is organized as follows. The total non-linear model is described in Section 2. Section 3 presents the gradient descent calculation of parameter estimation. Next, model structure detection is discussed in Section 4. A convergence analysis of an algorithm is presented in Section 5.
Simulation results and discussions are demonstrated in Section 6. Finally, Section 7 presents the conclusions and directions for future work.

Total Non-Linear Model
In mathematics, the dynamic total non-linear model of a MIMO system with error can be defined as

y_i(t) = a_i(u_1, u_2, ..., u_J, y_1, y_2, ..., y_I, e_1, e_2, ..., e_I) / b_i(u_1, u_2, ..., u_J, y_1, y_2, ..., y_I, e_1, e_2, ..., e_I) + e_i(t),  i = 1, 2, ..., I  (1)

where y(t) = [y_1(t), y_2(t), ..., y_I(t)] ∈ R^I and ŷ(t) = [ŷ_1(t), ŷ_2(t), ..., ŷ_I(t)] ∈ R^I are the measured output and the model output, respectively; u(t) = [u_1(t), u_2(t), ..., u_J(t)] ∈ R^J is the input; e(t) = [e_1(t), e_2(t), ..., e_I(t)] ∈ R^I is the model error; and t = 1, 2, ..., T is the sampling time index. The numerator a_i(t) ∈ R and the denominator b_i(t) ∈ R are represented by polynomials whose regression terms p_k^n(t) and p_k^d(t) are products of past inputs, outputs, and errors. The task of parameter estimation is to extract the relevant parameter values from the measured input and output data for a given model structure. To form a regression expression for parameter estimation, both sides of Formula (1) are multiplied by the denominator b_i(t). To apply the neuro-computing approach to parameter estimation, the total non-linear model is expressed as a non-completely connected feedforward neural network, as shown in Figure 1.
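To make the model form of Formula (1) concrete, the following sketch simulates a single output of a rational dynamic model. The particular regression terms and parameter values below are illustrative assumptions, not the paper's example system.

```python
import numpy as np

def simulate_rational(theta_n, theta_d, u, T, noise_std=0.0, rng=None):
    """Simulate one output of a rational dynamic model
        y(t) = a(t) / b(t) + e(t),
    where a(t) and b(t) are linear in the parameters theta_n and theta_d,
    and the regression terms are products of past inputs and outputs.
    The regressor choices below are purely illustrative.
    """
    rng = rng or np.random.default_rng(0)
    y = np.zeros(T)
    for t in range(2, T):
        # numerator regressors p^n_k(t): past inputs/outputs (example choice)
        p_n = np.array([u[t - 1], y[t - 1], u[t - 1] * y[t - 2]])
        # denominator regressors p^d_k(t): a constant term plus y^2(t-1),
        # which keeps b(t) bounded away from zero for these parameters
        p_d = np.array([1.0, y[t - 1] ** 2])
        a_t = theta_n @ p_n
        b_t = theta_d @ p_d
        e_t = noise_std * rng.standard_normal()
        y[t] = a_t / b_t + e_t
    return y
```

Note the design point raised later in the paper: when b(t) approaches zero the output deviation blows up, which is why the denominator here includes a constant term of 1.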

We define the network with an input layer, a hidden layer, and an output layer, where:
(i) The input layer consists of the regression terms p_k^n(t) (k = 1, ..., N) and p_k^d(t) (k = 1, ..., D); a neuron in the hidden layer is not connected to all the neurons in the input layer, that is, the network is a non-completely connected feedforward neural network.
(ii) The activation function of the neurons in the hidden layer is linear, and the output of a hidden-layer neuron is a_i(t) or b_i(t).
(iii) The activation function of the output-layer neurons is linear, and the output of the ith output-layer neuron is b_i(t)e_i(t).
(iv) The connection weights between the input-layer neurons and the hidden-layer neurons are the parameters θ_k^n and θ_k^d of the model.
(v) The connection weights between the hidden-layer neurons and the ith output-layer neuron are −1 and the observed output y_i(t).
Leung and Haykin proposed a rational function neural network [20] but did not define a generalized total non-linear model structure or consider the relevant errors. Therefore, their parameter estimation algorithm could not provide an unbiased estimate for noise-corrupted data; it was essentially a special implementation of Zhu and Billings's methods [7,8] for the noise-free case. The method proposed in this paper is a further development of the method in Zhu [18]. The characteristics of the total non-linear model (1) are as follows: (i) By setting i = 1, Zhu's model [18] becomes a special case of the model in Formula (1).
(ii) The model is non-linear in the parameters and the regression terms, which is caused by the denominator polynomials. (iii) When the denominator b_i(t) of the model is close to 0, the output deviation becomes large. Considering this point, the division operation is avoided in the activation functions of the neurons when the neural network model is built. (iv) The structure of the neural network corresponding to the total non-linear model is a non-completely connected, or partially connected, feedforward neural network. The convergence of such a network is therefore a major problem, and it is the main difficulty addressed in this paper. (v) The model has a wide range of application prospects. In many non-linear system modeling and control applications, the total non-linear model has gradually been adopted. Some non-linear models, such as the exponential model e^x, which describes the change of a dynamic rate constant with temperature, cannot be used directly; the exponential model can first be transformed into a rational non-linear model, and then system identification can be implemented [19,21,22].

Gradient Descent Calculation of Parameter Estimation
For the convenience of the following derivations, the output of neuron i in the output layer of the neural network is first defined, and the error measure function of one iteration of the network is then given. The Lyapunov method is often used to analyze the stability of a neural network [23]; similarly, the network parameters are estimated here by minimizing the network error based on the Lyapunov method. It should be noted that when the total non-linear model is represented in the neural network structure of Figure 1, the parameter estimation of the model can be described as the training of the neural network weights by minimizing the error E(t) in Formula (5).
In order to train the weights of the network, the learning algorithm based on gradient descent is given by Formulas (6) and (7), where η_n and η_d are the learning rates. Differentiating Formula (4) with respect to θ_k^n on both sides gives Formula (8); substituting Formula (8) into Formula (6) gives Formula (9), from which Formula (10) follows. Differentiating Formula (4) with respect to θ_k^d on both sides gives Formula (11); substituting Formula (11) into Formula (7) gives Formula (12), from which Formula (13) follows. The gradient descent algorithm for the parameter estimation of a total non-linear model is summarized in Algorithm 1.

Algorithm 1. Gradient Descent Algorithm
1: Initialization: set the weights of the neural network (the parameters of the total non-linear model) to small random numbers drawn from a uniform distribution with zero mean and small variance. Set the maximum number of iterations T, the minimum error ε, and the maximum number of samples P.
2: Generate the training sample set {X, Y} of the neural network according to Formula (1).
3: Input a training sample p to the neural network.
4: Calculate the outputs a_i(t), y_i(t)e_i(t), and f_i(t) of the neurons in the hidden layer and the output layer according to Formulas (2), (3), and (4), respectively.
5: Adjust the weights of the neural network according to Formulas (10) and (13).
6: Calculate the error E(t) according to Formula (5) and calculate the total error according to Formula (14).
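A minimal sketch of the weight update is given below. Since Formulas (2)-(13) are not reproduced here, this fragment uses a simplified noise-free network error f(t) = b(t)y(t) − a(t) (the model multiplied through by its denominator) with plain gradient descent on E(t) = f(t)²/2; the regressor choice and the convention of fixing the first denominator weight at 1 (to remove the scale ambiguity of rational models) are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def gd_step(theta_n, theta_d, p_n, p_d, y_t, eta_n, eta_d):
    """One gradient-descent weight update (a sketch of Algorithm 1)."""
    a_t = theta_n @ p_n
    b_t = theta_d @ p_d
    f_t = b_t * y_t - a_t           # residual of b(t) y(t) = a(t)
    # gradients of E = f^2/2:  dE/dtheta^n = -f p^n,  dE/dtheta^d = f y p^d
    theta_n = theta_n + eta_n * f_t * p_n
    grad_d = f_t * y_t * p_d
    grad_d[0] = 0.0                 # fix theta^d_0 = 1: rational models are
                                    # only identifiable up to a common scale
    theta_d = theta_d - eta_d * grad_d
    return theta_n, theta_d, f_t

# Usage: noise-free data from an assumed true system, then training.
T = 400
u = 2 * np.sin(0.1 * np.arange(T))
tn_true, td_true = np.array([0.5, 0.3]), np.array([1.0, 0.2])
y = np.zeros(T)
for t in range(1, T):
    p_n = np.array([u[t - 1], y[t - 1]])
    p_d = np.array([1.0, y[t - 1] ** 2])
    y[t] = (tn_true @ p_n) / (td_true @ p_d)

tn, td = np.zeros(2), np.array([1.0, 0.0])
errs = []
for epoch in range(50):
    tot = 0.0
    for t in range(1, T):
        p_n = np.array([u[t - 1], y[t - 1]])
        p_d = np.array([1.0, y[t - 1] ** 2])
        tn, td, f = gd_step(tn, td, p_n, p_d, y[t], 0.02, 0.02)
        tot += f * f
    errs.append(tot)
```

The per-epoch squared error `errs` should decrease as the weights approach the true parameters.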

Model Structure Detection
Model structure detection selects the important terms from a rather large model set (usually called the whole term set) and determines the sub-model composed of those important terms [18]. Because of the powerful self-learning and associative memory functions of an artificial neural network [24], it is a first-choice tool for identifying the model structure. When identifying systems with unknown structure, it is important to avoid losing the important terms in the final model. For the structure detection of a total non-linear model, the connection weight estimation in the neural network, that is, the parameter estimation of the total non-linear model, can be used to select the significant terms.
To separate the important from the unimportant terms in the whole model term set, a knock-out algorithm is adopted. First, remove the terms whose removal does not increase the network error; then knock out the terms with smaller weights according to the required significance level; finally, test the error of the non-linear model composed of the remaining terms. The specific procedure is summarized in Algorithm 2.

Algorithm 2. Knock-Out Algorithm
1: Using the network structure shown in Figure 1, take all the terms contained in the whole term set as the inputs of the network.
2: Train the network with the algorithm of Section 3 and obtain the network error E_1.
3: Obtain a new network structure by randomly removing one network input. Train the new network with the algorithm of Section 3 and obtain the network error E_2. If E_2 ≤ E_1, set E_1 = E_2; otherwise, the operation is invalid (the input is retained).
4: Select another input and execute step 3 again until every input term has been tried once.
5: Sort the N connection weights between the input layer and the hidden layer in descending order. Select the first n weights such that the significance reaches 95% and Formulas (15) and (16) are met, and retain the network input terms corresponding to the first n weights.

In the above process, the neural network is used not only to estimate the parameters of the model but also to detect the structure of the model and to analyze the significance of the regression terms.
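A minimal sketch of the knock-out procedure follows, with two stated substitutions: ordinary least squares stands in for the network training of Section 3 (so "weight" below means an OLS coefficient), and a small tolerance replaces the strict error comparison, since a least-squares residual can only grow when a column is dropped. Both substitutions, and the 95% "weight mass" reading of step 5, are assumptions for illustration.

```python
import numpy as np

def knock_out(X, y, names, sig=0.95, tol=1e-9):
    """Sketch of the knock-out structure-detection procedure (Algorithm 2).

    X     : (T, N) matrix whose columns are the candidate regression terms
    y     : target vector
    names : labels of the candidate terms
    """
    keep = list(range(X.shape[1]))

    def fit(cols):
        # OLS stands in for the gradient-descent training of Section 3.
        w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        r = y - X[:, cols] @ w
        return float(r @ r), w

    e1, w = fit(keep)
    scale = tol * max(float(y @ y), 1.0)
    # Steps 3-4: try removing each input once; accept the removal only if
    # the training error does not (materially) increase.
    for col in list(keep):
        trial = [c for c in keep if c != col]
        if not trial:
            break
        e2, w2 = fit(trial)
        if e2 <= e1 + scale:
            keep, e1, w = trial, e2, w2
    # Step 5: sort the remaining weights in descending magnitude and retain
    # the leading group whose cumulative weight mass reaches `sig`.
    order = np.argsort(-np.abs(w))
    total = float(np.abs(w).sum())
    mass, chosen = 0.0, []
    for idx in order:
        chosen.append(keep[idx])
        mass += abs(float(w[idx]))
        if total > 0 and mass / total >= sig:
            break
    chosen = sorted(chosen)
    return chosen, [names[c] for c in chosen]
```

On data generated from two significant columns out of five candidates, the procedure knocks out the three spurious columns and retains the two true terms.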

Convergence Analysis of the Algorithm
Convergence proof: assume that one connection weight of the neural network shown in Figure 1 is changed and that this weight can take any value. When the weight θ_k^n corresponds to a regression-term parameter of the numerator of the total non-linear model, the resulting change of the network error is derived as follows (the subscripts in Formula (2) are dropped for convenience of the proof). Substituting Formula (2) into Formula (1) gives Formula (18). When θ_k^n is updated, Formula (18) becomes Formula (19), in which ẽ(t) denotes the new error of the neural network after the weight update. Subtracting Formula (18) from Formula (19) gives Formulas (20) and (21). To ensure ẽ(t)² ≤ e(t)², Formula (22) must hold, and solving Formula (22) gives Formula (23). When the changed weight θ_k^d corresponds to a regression parameter of the denominator of the total non-linear model, the resulting change of the network error is derived analogously: subtracting Formula (24) from Formula (25) gives the expression in terms of b̃(t), the new denominator of the neural network after the weight update.
To satisfy ẽ(t)² ≤ e(t)² in the denominator case as well, an analogous bound is required. Because too large a learning coefficient makes the training of the network ineffective, we take 0 ≤ η_n ≤ 1 to get b(t)b̃(t) > 0, which yields the corresponding bound on η_d. To sum up, the network is convergent when both learning-rate conditions are met: the condition on η_n involves b(t)² and p_k^n(t)², and the condition on η_d involves a(t)² and p_k^d(t)². Under these two conditions, the algorithm provides a convergent estimate of the parameters of the total non-linear model.

Simulation Results and Discussions
Consider a representative example of a total non-linear model. Because disturbances of the input data interfere with the estimation of the parameters [25], in this section, parameter estimation was performed for different inputs. Firstly, for the simulation system without noise, 2000 pairs of input/output data, sampled uniformly over 20 cycles, were used as the data set, and the learning rate was designed as a linearly decaying sequence (over 50 iterations, the learning rate decreases from η_0 = 0.5 to η_end = 0.02). The algorithm in this paper was used to estimate 10 parameters simultaneously. Table 1 shows the estimated values and the mean square deviations of the parameters after 50 iterations. The inputs u_1(t) and u_2(t) of the system are either a sine wave or a square wave with an amplitude of 2. Figure 2 shows the difference between the measured value y_1(t) (the real output value of Formula (6.1)) and the output value ŷ_1(t) obtained using the parameter estimator when inputs u_1(t) and u_2(t) are both sine waves. In the same way, Figure 3 shows the difference between measured value y_2(t) and output value ŷ_2(t) when inputs u_1(t) and u_2(t) are both sine waves. Figure 4 shows the difference between measured value y_1(t) and output value ŷ_1(t) when input u_1(t) is a sine wave and u_2(t) is a square wave.
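The experimental setup described above (a learning rate decaying linearly from 0.5 to 0.02 over 50 iterations; sine or square inputs of amplitude 2, 2000 samples covering 20 cycles) can be reproduced as follows. The waveform frequency and phase are not specified in the paper and are chosen here arbitrarily.

```python
import numpy as np

def lr_schedule(k, n_iter=50, eta0=0.5, eta_end=0.02):
    """Linearly decaying learning rate: eta0 at iteration 0, eta_end at the last."""
    return eta0 + (eta_end - eta0) * k / (n_iter - 1)

# Excitation signals: 2000 samples uniformly covering 20 cycles,
# sine or square waves with amplitude 2.
t = np.linspace(0.0, 20 * 2 * np.pi, 2000, endpoint=False)
u_sine = 2.0 * np.sin(t)
u_square = 2.0 * np.sign(np.sin(t))
```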
Figure 2. Error of y_1(t) with sine-sine input.
Figure 3. Error of y_2(t) with sine-sine input.
Figure 5 shows the difference between measured value y_2(t) and output value ŷ_2(t) when input u_1(t) is a sine wave and u_2(t) is a square wave. Figure 6 shows the difference between measured value y_1(t) and output value ŷ_1(t) when inputs u_1(t) and u_2(t) are both square waves. Figure 7 shows the difference between measured value y_2(t) and output value ŷ_2(t) when inputs u_1(t) and u_2(t) are both square waves.
It can be seen that when the inputs are both sine waves, the accuracy of parameter estimation is the highest, while when the inputs are square waves, the estimation accuracy of the parameters is relatively lower. This is because when the inputs are square waves, the output of the system has an overshoot, that is to say, the observed value of the system itself has an error. Training the network with erroneous data will inevitably lead to an error in the parameter estimation; in particular, the estimation error of θ_2 is the highest. Here, θ_2 is the parameter of y_2^3(t − 2); because y_2^3(t − 2) is the third power of the output, it further amplifies the error. Substituting the value of y_2^3(t − 2) into the update of θ_2 inevitably leads to an estimation error.
Figure 4. Error of y_1(t) with sine-square input.
Figure 5. Error of y_2(t) with sine-square input.

Figure 6. Error of y_1(t) with square-square input.
Figure 7. Error of y_2(t) with square-square input.
For a system with noise interference, it is more difficult to estimate the parameters because of the error of the measured value itself [26,27]. In order to further verify the performance of the estimator, a system with noise was selected for parameter estimation, that is, a noise signal was added to the original system. The noises r_1(t) and r_2(t) were random uniform sequences, each with zero mean, a variance of 0.1, and a signal-to-noise ratio of 10. We repeated the previous experiment for the system with noise, and the experimental results are shown in Table 2. The difference between the measured value and the estimated value of the system output is shown in Figures 8-13.
In order to detect the structure of the model, 20 terms, including the 10 terms in Formula (6.1), were used as the whole term set of the model. The newly added numerator terms are of order 1, the denominator terms are of order 2, and the input and output lags are both of order 1. Using the knock-out algorithm in Section 4, the final 10 terms of the model are in good agreement with those in Formula (6.1).
From the above experimental results, the estimation accuracy of the algorithm proposed in this paper is acceptable, and the mean square deviations are all less than 0.003.
Figure 8. Error of y_1(t) with noise and sine-sine input.
Figure 9. Error of y_2(t) with noise and sine-sine input.
Figure 10. Error of y_1(t) with noise and sine-square input.
Figure 11. Error of y_2(t) with noise and sine-square input.
Figure 12. Error of y_1(t) with noise and square-square input.
Figure 13. Error of y_2(t) with noise and square-square input.

Conclusions
In this paper, the parameter estimation of a SISO rational model was extended to that of a MIMO total non-linear model. A method of parameter estimation for a MIMO non-linear rational model based on a gradient descent algorithm was proposed, and the convergence conditions were derived for the asymmetry of the network. The effectiveness of the estimator was demonstrated by mathematical derivation and simulation. This estimation method has a strong generalization property and could be widely used in many fields, such as non-linear system modeling and control applications. Some systems that could not directly use this method, such as the exponential model describing the change of a kinetic rate constant with temperature, could first be converted into a rational model and then estimated with the developed method. Future work could include (1) estimating the parameters of a state space model based on an artificial neural network, (2) estimating the parameters of a MIMO state space model, (3) estimating the parameters of a non-linear state space model, and (4) estimating the parameters of total non-linear state space models.

Funding: This research was funded by Prince Sultan University, Riyadh, Kingdom of Saudi Arabia. Special acknowledgement to the Robotics and Internet-of-Things Lab (RIOTU), Prince Sultan University, Riyadh, Saudi Arabia.