Brief paper: Forward and backward least angle regression for nonlinear system identification☆
Introduction
A large class of nonlinear dynamic systems can be described by a nonlinear autoregressive model with exogenous input (NARX) (Chen, Billings, & Luo, 1989): $y(t)=f(y(t-1),\ldots,y(t-n_y),u(t-1),\ldots,u(t-n_u))+e(t)$, where the set $\{u(t),y(t)\}_{t=1}^{N}$ represents the real system input and output at time interval $t$, $N$ being the size of the training data set. Their largest input and output lags are $n_u$ and $n_y$, respectively. $e(t)$ denotes the error. The pair $\{x(t),y(t)\}$ is the model input vector and output at time interval $t$. For simplicity, the model input is rewritten as $x(t)=[y(t-1),\ldots,y(t-n_y),u(t-1),\ldots,u(t-n_u)]^{T}$ with the dimension $n=n_y+n_u$. $f(\cdot)$ is some unknown function.
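As an illustration of how such regression vectors are assembled from recorded data, here is a minimal NumPy sketch; the function name and default lag values are illustrative choices, not taken from the paper:

```python
import numpy as np

def build_narx_regressors(u, y, n_u=2, n_y=2):
    """Build NARX input vectors x(t) = [y(t-1..t-n_y), u(t-1..t-n_u)]^T
    and the matching targets y(t), for all t with a full lag window."""
    lag = max(n_u, n_y)
    rows, targets = [], []
    for t in range(lag, len(y)):
        past_y = [y[t - i] for i in range(1, n_y + 1)]   # lagged outputs
        past_u = [u[t - i] for i in range(1, n_u + 1)]   # lagged inputs
        rows.append(past_y + past_u)
        targets.append(y[t])
    return np.array(rows), np.array(targets)
```

Each row of the returned matrix is one regression vector $x(t)$ of dimension $n_y+n_u$, paired with the target $y(t)$.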
Constructing such a NARX model involves three steps (Ljung, 1999): (1) model input selection. More specifically, the unknown lags $n_u$ and $n_y$ need to be determined. Statistical tests and regression methods are among the popular approaches (Haber & Unbehauen, 1990; Lind & Ljung, 2005; Lind & Ljung, 2008); (2) choice of the mapping function $f(\cdot)$. Polynomials (Billings & Chen, 1989), radial basis function (RBF) networks (Chen, Cowan, & Grant, 1991), neuro-fuzzy networks (Harris et al., 2002; Wang & Mendel, 1992) and wavelet networks (Billings & Wei, 2005; Zhang, 1997) are popular options. Though some suggestions have been made on the function selection (Sjöberg et al., 1995), no unified framework is available; (3) parameter identification in the function $f(\cdot)$. This requires a specific expression for the model (1). One popular NARX model structure is a linear combination of nonlinear functions whose parameters are given a priori, formulated as (Ljung, 1999) $y(t)=\sum_{k=1}^{M}\theta_k\varphi_k(x(t);w_k)+e(t)$, where $\varphi_k$ is some nonlinear function with a pre-fixed nonlinear parameter vector $w_k$, and $\theta_k$, $k=1,\ldots,M$, are the linear coefficients to be optimized. The model (2) is also called a linear-in-the-parameters model for pre-fixed nonlinear parameters $w_k$. However, the number of nonlinear functions $M$ is often large, the fixed values of these nonlinear parameters are not optimized, and some nonlinear functions are redundant. This is often referred to as an over-parametrization problem: not all nonlinear functions need to be included in the final model, and a good subset is desirable (De Nicolao & Trecate, 1999). Within this context, building a linear-in-the-parameters model becomes a model reduction or selection problem. This paper focuses on the model selection issue.
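A linear-in-the-parameters model of this kind can be sketched as follows: a pool of Gaussian RBF terms with pre-fixed (not optimized) centres and width is evaluated on the model inputs, and only the linear coefficients are estimated by least squares. The toy data, the number of centres, and the width value are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))        # model input vectors x(t)
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]      # toy target output

# Candidate pool: Gaussian RBFs with pre-fixed centres/width (a priori, not tuned)
centres = rng.uniform(-1, 1, size=(30, 2))
width = 0.5
Phi = np.exp(-((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1) / width**2)

# Only the linear coefficients theta are optimized: y ≈ Phi @ theta
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

With 30 candidate terms for a 2-dimensional input, the pool is already over-parametrized in the sense discussed above; model selection would pick a subset of the 30 columns of `Phi`.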
The main objective of model selection is to build a parsimonious model with good generalization performance. Exhaustive search over all possible subsets is only suitable for a very small number of candidate model terms (nonlinear functions), as it is computationally too demanding when the number is large (Lind & Ljung, 2008; Mao & Billings, 1997); the problem is known to be NP-hard. To reduce the computational effort, stepwise forward selection methods (Miller, 2002), such as forward orthogonal selection (Chen et al., 1991), the fast recursive algorithm (Li, Peng, & Irwin, 2005) and orthogonal matching pursuit (Pati, Rezaiifar, & Krishnaprasad, 1993), start from an empty model and add one term at a time until the model performance is satisfactory. The alternative is stepwise backward selection, which begins with the full model using all the candidates and then deletes one term at a time. All these methods are fast but greedy, and therefore suboptimal (Kump et al., 2012; Sherstinsky & Picard, 1996). A parsimonious model with the smallest model size is hence always desirable.
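The greedy forward idea can be sketched as follows: at each step the candidate most correlated with the current residual is added, and the selected subset is refit by least squares. This is a generic illustration (assuming roughly unit-norm candidate columns for the correlation ranking), not the exact implementation of any one published algorithm:

```python
import numpy as np

def forward_select(Phi, y, m):
    """Greedy forward selection: repeatedly add the candidate term most
    correlated with the current residual, then refit by least squares."""
    selected, residual, theta = [], y.copy(), None
    for _ in range(m):
        scores = np.abs(Phi.T @ residual)       # correlation with residual
        scores[selected] = -np.inf              # never re-pick a term
        selected.append(int(np.argmax(scores)))
        theta, *_ = np.linalg.lstsq(Phi[:, selected], y, rcond=None)
        residual = y - Phi[:, selected] @ theta
    return selected, theta
```

The backward alternative mirrors this: start from all candidates and repeatedly delete the term whose removal degrades the fit least.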
To improve the model compactness and the generalization performance, the combination of forward selection and backward selection has been proposed in Li, Peng, and Bai (2006) and Zhang, Li, Bai, and Wang (2012), where the backward selection is used to reselect and replace those insignificant terms produced by the forward selection. Alternatively, a number of hybrid methods combining forward selection and backward elimination (instead of backward replacement) have been reported (Haugland, 2007, Soussen et al., 2011, Zhang, 2011), where the backward elimination removes insignificant terms. The elimination scheme is also referred to as model pruning. For example, term clustering (Aguirre & Billings, 1995) and simulation error (Farina and Piroddi, 2010, Piroddi and Spinelli, 2003) based pruning methods have been studied for constructing polynomial NARX models.
It is noted that the subset selection may fail in the following scenarios:
- The candidate terms are highly correlated and redundant, which may lead to the ill-conditioning problem (Moussaoui, Brie, & Richard, 2005). The forward selection can avoid selecting highly correlated terms but is not entirely immune to the ill-conditioning problem. The backward selection easily suffers from this problem as it has to deal with the inversion of all the terms at the beginning.
- If the training data is severely polluted by noise, these subset selection methods may fit the models to the noise, which leads to the over-fitting problem (Chen et al., 2010; Poggio & Girosi, 1990). Pre-filtering and cross validation are useful to provide a tradeoff between the training accuracy and generalization performance, but they incur additional computations.
- If the training data does not contain sufficient information, a model with no or low bias but high variance may not have satisfactory prediction accuracy. A biased model obtained using a good bias/variance trade-off technique may be more desirable (Johansen, 1997; Poggio & Girosi, 1990).
- If small changes in the data can result in a very different model, then the model is less robust and its prediction accuracy is reduced (Tibshirani, 1996).
Given the above considerations, regularization methods are popular techniques to build sparse, robust and biased models by imposing additional penalties or constraints on the solution. A general regularization algorithm is Tikhonov regression, which adds a penalty term to the sum squared error (SSE) cost function (Bishop, 1997; Johansen, 1997; Moussaoui et al., 2005): $J=\sum_{t=1}^{N}(y(t)-\hat{y}(t))^{2}+\lambda\sum_{k}\|D^{k}f\|^{2}$, where the regularization parameter $\lambda$ controls the fitting smoothness and the model size, and $D^{k}$ denotes the function derivatives of different orders. However, this may be computationally too demanding. More recently, ridge regression and the least absolute shrinkage and selection operator (LASSO) use additional $\ell_2$ norm and $\ell_1$ norm penalties, respectively. The cost functions become $J_{\mathrm{ridge}}=\|y-\Phi\theta\|_{2}^{2}+\lambda\|\theta\|_{2}^{2}$ and $J_{\mathrm{lasso}}=\|y-\Phi\theta\|_{2}^{2}+\lambda\|\theta\|_{1}$.
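A minimal sketch of the two penalized estimators: ridge has a closed-form solution via the regularized normal equations, while LASSO generally needs an iterative solver; in the special case of an orthonormal term pool its solution reduces to soft-thresholding, shown here for illustration:

```python
import numpy as np

def ridge(Phi, y, lam):
    """Ridge estimate: argmin ||y - Phi @ theta||_2^2 + lam * ||theta||_2^2,
    solved in closed form via (Phi^T Phi + lam I) theta = Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def soft_threshold(z, lam):
    """For an orthonormal Phi, the LASSO solution is soft-thresholding of
    the least squares estimate z = Phi^T y:
    theta_k = sign(z_k) * max(|z_k| - lam/2, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)
```

The sketch shows the qualitative difference discussed next: ridge shrinks every coefficient but none to exactly zero, whereas soft-thresholding sets small coefficients exactly to zero.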
These two methods use simplified penalty terms on the weights: they minimize the sum of the SSE and a norm of the model weights. Ridge regularization can shrink large weights towards zero but has little effect on small weights (Kump et al., 2012). Unlike ridge regression, LASSO has the potential to shrink some weights to exactly zero and can be interpreted as a Bayesian estimator (Tibshirani, 1996). More recently, some modifications of the penalty term have been proposed, such as using the differences between adjacent coefficients (Ohlsson, Ljung, & Boyd, 2010) or the differences among all the coefficients (Ohlsson & Ljung, 2013). The difficulty is to give an explicit mathematical solution for the optimal regularization parameter $\lambda$. It can be determined by cross validation, or alternatively estimated within a Bayesian framework under Gaussian prior distributions. Though a number of algorithms have been proposed (Osborne et al., 2000; Rosset & Zhu, 2007; Tibshirani, 1996), most of them are computationally inefficient compared to the forward selection.
A promising regularization scheme, least angle regression (LAR), has been proposed and widely studied (Efron, Hastie, Johnstone, & Tibshirani, 2004). It is a variant of the forward selection: it begins with an empty model containing no regressors and then selects one term at a time until a stop criterion is satisfied. Unlike the forward selection, where the model weights (coefficients) are identical to the least squares solution, the least angle scheme is used to determine the weights. LAR has a few distinctive advantages. First, it is computationally as fast as the forward selection and more efficient than LASSO methods due to its complete piecewise linear solution path. Further, it can easily be modified to produce solutions for the LASSO estimator. However, LAR is still a local method and may not produce a sparser model than the forward selection or LASSO methods.
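The LAR iteration of Efron et al. (2004) can be sketched compactly. This is a didactic version for generic data, assuming a centred response and unit-norm candidate columns, not the authors' exact implementation: at each step the fit advances along the direction equiangular to all active terms until the next term attains the same correlation; run to completion, it reaches the full least squares fit.

```python
import numpy as np

def lar(X, y, steps=None):
    """Least angle regression sketch (centred y, unit-norm columns of X).
    Returns the fitted response mu and the order in which terms entered."""
    n, p = X.shape
    steps = p if steps is None else steps
    mu, active = np.zeros(n), []
    for _ in range(steps):
        c = X.T @ (y - mu)                       # current correlations
        C = np.max(np.abs(c))
        if not active:
            active = [int(np.argmax(np.abs(c)))]
        s = np.sign(c[active])
        Xa = X[:, active] * s                    # sign-adjusted active matrix
        Ginv1 = np.linalg.solve(Xa.T @ Xa, np.ones(len(active)))
        A = 1.0 / np.sqrt(np.ones(len(active)) @ Ginv1)
        u = Xa @ (A * Ginv1)                     # equiangular direction
        a = X.T @ u
        inactive = [j for j in range(p) if j not in active]
        if not inactive:
            gamma = C / A                        # final step: reach the OLS fit
        else:
            # smallest positive step at which an inactive term joins
            cands = [(val, j) for j in inactive
                     for val in ((C - c[j]) / (A - a[j]), (C + c[j]) / (A + a[j]))
                     if val > 1e-12]
            gamma, j_next = min(cands)
            active.append(j_next)
        mu = mu + gamma * u
    return mu, active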
The main objective of this paper is to improve the model sparsity and generalization performance of the LAR algorithm. This is achieved by introducing a replacement scheme as an additional refinement stage. The new method has a forward LAR stage and a backward LAR stage. The forward LAR is the same as the original LAR. The backward stage compares the significance of each term in the initial model with the remaining terms in the candidate pool and then replaces insignificant ones, leading to an improved model in terms of compactness and performance. The main difference from our previous work on forward and backward methods is that the least angle scheme, rather than the least squares approach, is employed to determine the model coefficients. Unlike other existing model pruning methods, the proposed method employs a replacement scheme instead of elimination. Further, LAR is a computationally efficient regularization method that requires no additional computational effort to determine the regularization parameter. A more detailed analysis of these differences is given in Section 3. Extensive numerical simulations on the construction of four NARX models, including the polynomial, RBF, neuro-fuzzy and wavelet models, are presented to demonstrate that the new method is able to produce a sparser model than the original LAR algorithm and some alternatives.
NARX models
The matrix form of the linear-in-the-parameters NARX model shown in Eq. (2) is given as $y=\Phi\theta+e$, where $y=[y(1),\ldots,y(N)]^{T}$ is the output vector, $\theta=[\theta_1,\ldots,\theta_M]^{T}$ is the weight vector and $e$ is the residual vector. The matrix $\Phi=[\varphi_1,\ldots,\varphi_M]$ is the whole candidate term pool; it is an $N$-by-$M$ matrix with $M\le N$. The ultimate goal of the model selection is to select a subset $\Phi_s=[\varphi_{i_1},\ldots,\varphi_{i_m}]$, where $i_1,\ldots,i_m$ denote the indexes of the selected terms, and to estimate the associated weights.
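A candidate subset is scored by the SSE of its least squares fit; the following helper (an illustrative name, not from the paper) evaluates one index set against the pool matrix:

```python
import numpy as np

def subset_sse(Phi, y, idx):
    """Sum squared error of the least squares fit using only columns idx of
    the candidate term pool Phi, i.e. ||y - Phi_s @ theta_s||^2."""
    theta, *_ = np.linalg.lstsq(Phi[:, idx], y, rcond=None)
    r = y - Phi[:, idx] @ theta
    return float(r @ r)
```

Model selection then amounts to finding a small index set whose SSE is close to that of the full pool.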
Forward and backward least angle regression
Although LAR is a computationally efficient regularized forward algorithm, it is still a local method, and the resultant model can be further improved in terms of the model compactness and generalization performance. This paper introduces a backward refinement stage to replace those insignificant terms in the initial model built by the forward LAR and thus improves the model generalization performance and compactness.
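The replacement idea behind the backward stage can be illustrated with a brute-force least squares sweep. Note the hedge: the paper's method determines coefficients with the least angle scheme and uses efficient updates, whereas this sketch refits every trial subset from scratch purely to show the replace-if-better logic:

```python
import numpy as np

def backward_refine(Phi, y, selected, sse):
    """One backward sweep: try to replace each selected term with an unused
    candidate from the pool whenever that lowers the least squares SSE."""
    def fit_sse(idx):
        theta, *_ = np.linalg.lstsq(Phi[:, idx], y, rcond=None)
        r = y - Phi[:, idx] @ theta
        return float(r @ r)

    selected = list(selected)
    for pos in range(len(selected)):
        pool = [j for j in range(Phi.shape[1]) if j not in selected]
        for j in pool:
            trial = selected[:pos] + [j] + selected[pos + 1:]
            trial_sse = fit_sse(trial)
            if trial_sse < sse:                 # replace only if strictly better
                selected, sse = trial, trial_sse
    return selected, sse
```

Because terms are replaced rather than eliminated, the model size is preserved while insignificant terms chosen early by the forward stage can be swapped out.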
A numerical example
Consider the following benchmark nonlinear system (Hong, Harris, & Wilson, 1999), where the noise-free system input and output are $u(t)$ and $y(t)$ at time interval $t$, respectively. The output is corrupted by a Gaussian noise sequence with a signal-to-noise ratio of 25 dB; the equivalent noise variance is 0.0031. The noise is only added to the system output and has no effect on the system input.
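Output noise at a prescribed SNR is generated by scaling the noise variance to the signal power: $\sigma^2 = P_{\text{signal}}/10^{\mathrm{SNR}/10}$. The sketch below uses a stand-in sine signal (not the benchmark system, whose equation is given in the paper), so the resulting variance differs from the 0.0031 quoted above:

```python
import numpy as np

rng = np.random.default_rng(1)
y_clean = np.sin(np.linspace(0, 8 * np.pi, 1000))   # stand-in noise-free output

snr_db = 25.0
signal_power = np.mean(y_clean ** 2)
noise_var = signal_power / 10 ** (snr_db / 10)       # noise 25 dB below signal
y_noisy = y_clean + rng.normal(0.0, np.sqrt(noise_var), size=y_clean.shape)
```

Only the measured output is corrupted; the input sequence is left untouched, matching the simulation setup described above.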
Conclusion
A forward and backward least angle regression (LAR) algorithm has been proposed for constructing a range of NARX models. The backward stage can replace the insignificant terms previously selected by the forward LAR with the most significant ones left in the candidate term pool, leading to a more compact model with improved generalization performance. Numerical results have confirmed the effectiveness of the proposed method.
References (47)
- Prediction and simulation errors in parameter estimation for nonlinear systems. Mechanical Systems and Signal Processing (2010).
- Structure identification of nonlinear dynamic systems: a survey on input/output approaches. Automatica (1990).
- On Tikhonov regularization, bias and variance in nonlinear system identification. Automatica (1997).
- Variable selection via rival (removing irrelevant variables amidst lasso iterations) and its application to nuclear material detection. Automatica (2012).
- A two-stage algorithm for identification of non-linear dynamic systems. Automatica (2006).
- Regressor selection with the analysis of variance method. Automatica (2005).
- Regressor and structure selection in NARX models using a structured ANOVA approach. Automatica (2008).
- Regularization aspects in continuous-time model identification. Automatica (2005).
- Identification of switched linear regression models using sum-of-norms regularization. Automatica (2013).
- Segmentation of ARX-models using sum-of-norms regularization. Automatica (2010).
- Initialization by selection for wavelet network training. Neurocomputing.
- Nonlinear black-box modeling in system identification: a unified overview. Automatica.
- Improved structure selection for nonlinear models based on term clustering. International Journal of Control.
- A new look at the statistical model identification. IEEE Transactions on Automatic Control.
- Discussion of least angle regression. Annals of Statistics.
- Identification of non-linear rational systems using a prediction error estimation algorithm. International Journal of Systems Science.
- A new class of wavelet networks for nonlinear system identification. IEEE Transactions on Neural Networks.
- Some relations between the various criteria for autoregressive model order determination. IEEE Transactions on Acoustics, Speech and Signal Processing.
- Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control.
- Orthogonal least squares algorithm for radial basis function networks. IEEE Transactions on Neural Networks.
- Sparse kernel regression modeling using combined locally regularized orthogonal least squares and D-optimality experimental design. IEEE Transactions on Automatic Control.
- Consistent identification of NARX models via regularization networks. IEEE Transactions on Automatic Control.
Long Zhang received his B.Eng. and M.Eng. degrees as an outstanding graduate in Electrical Engineering and Automation from Harbin Institute of Technology, Harbin, China, in 2008 and 2010, respectively, and the Ph.D. degree in Electronics, Electrical Engineering and Computer Science from Queen’s University Belfast, UK, in 2013. He is currently a Research Associate at the Department of Automatic Control and System Engineering, University of Sheffield, UK. His research interest includes system identification, neural networks, statistical regression and fault diagnosis in both time and frequency domains.
Kang Li received the B.Sc. degree from Xiangtan University, Xiangtan, China, in 1989, the M.Sc. degree from the Harbin Institute of Technology, Harbin, China, in 1992, and the Ph.D. degree from Shanghai Jiaotong University, Shanghai, China, in 1995. He is currently a Professor of Intelligent Systems and Control with the School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK. He is involved in bioinformatics with applications on food safety and biomedical engineering. He is also a Visiting Professor with the Harbin Institute of Technology, Shanghai University, Shanghai, China, and the Ningbo Institute of Technology, Zhejiang University, Zhejiang, China. He held a Visiting Fellowship or Professorship with the National University of Singapore, Singapore, the University of Iowa, Iowa City, IA, USA, the New Jersey Institute of Technology, Newark, NJ, USA, Tsinghua University, Beijing, China, and the Technical University of Bari, Taranto, Italy. He has authored more than 200 papers in his areas of expertise, and edited 12 conference proceedings (Springer). His current research interests include nonlinear system modeling, identification and control, bio-inspired computational intelligence, and fault-diagnosis and detection, with recent applications in power systems and renewable energy, and polymer extrusion processes. He serves in the editorial boards of Neurocomputing, the Transactions of the Institute of Measurement and Control, Cognitive Computation, and International Journal of Modelling, Identification and Control.
☆ This work was supported in part by the U.K. Research Councils under Grants EP/G042594/1 and EP/L 001063/1, in part by the Chinese Scholarship Council, in part by the National Natural Science Foundation of China under Grants 61271347 and 61273040, and by the Science and Technology Commission of Shanghai Municipality under Grant 11JC1404000. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Antonio Vicino under the direction of Editor Torsten Söderström.