Automatica, Volume 53, March 2015, Pages 94-102

Brief paper
Forward and backward least angle regression for nonlinear system identification

https://doi.org/10.1016/j.automatica.2014.12.010

Abstract

A forward and backward least angle regression (LAR) algorithm is proposed to construct the nonlinear autoregressive model with exogenous inputs (NARX) that is widely used to describe a large class of nonlinear dynamic systems. The main objective of this paper is to improve the model sparsity and generalization performance of the original forward LAR algorithm. This is achieved by introducing a replacement scheme using an additional backward LAR stage. The backward stage replaces insignificant model terms selected by the forward LAR with more significant ones, leading to an improved model in terms of model compactness and performance. A numerical example constructing four types of NARX models, namely polynomials, radial basis function (RBF) networks, neuro fuzzy and wavelet networks, is presented to illustrate the effectiveness of the proposed technique in comparison with some popular methods.

Introduction

A large class of nonlinear dynamic systems can be described by a nonlinear autoregressive model with exogenous input (NARX) (Chen, Billings, & Luo, 1989)

$$y(t)=f\big(y(t-1),\ldots,y(t-n_y),u(t-1),\ldots,u(t-n_u)\big)+\xi(t)=f(x(t))+\xi(t) \qquad (1)$$

where the set $\{u(t),y(t)\}$ represents the real system input and output at time interval $t$, $t=1,2,\ldots,N$, with $N$ being the size of the training data set. Their largest input and output lags are $n_u$ and $n_y$, respectively. $\xi(t)$ denotes the error. The set $\{x(t),y(t)\}$ is the model input vector and output at time interval $t$. For simplicity, the model input $x(t)=[y(t-1),\ldots,y(t-n_y),u(t-1),\ldots,u(t-n_u)]$ is rewritten as $x(t)=[x_1(t),\ldots,x_r(t)]$ with dimension $r=n_y+n_u$. $f(\cdot)$ is some unknown function.
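To make the notation concrete, here is a minimal sketch (in Python, assuming measured sequences u and y and the hypothetical lag choice n_u = n_y = 2) of how the NARX input vectors x(t) can be stacked from lagged outputs and inputs.

```python
import numpy as np

def narx_regressors(u, y, n_u=2, n_y=2):
    """Stack lagged outputs and inputs into NARX input vectors x(t).

    x(t) = [y(t-1), ..., y(t-n_y), u(t-1), ..., u(t-n_u)],
    formed for t = max(n_u, n_y), ..., N-1 (0-based indexing).
    """
    start = max(n_u, n_y)
    rows = []
    for t in range(start, len(y)):
        lagged_y = [y[t - i] for i in range(1, n_y + 1)]
        lagged_u = [u[t - i] for i in range(1, n_u + 1)]
        rows.append(lagged_y + lagged_u)
    X = np.asarray(rows)            # shape (N - start, n_y + n_u)
    target = np.asarray(y[start:])  # y(t) aligned with each row of X
    return X, target
```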

Constructing such a NARX model involves three steps (Ljung, 1999): (1) model input selection. More specifically, the unknown lags $n_y$ and $n_u$ need to be determined. Statistical tests and regression methods are among the popular approaches (Haber and Unbehauen, 1990, Lind and Ljung, 2005, Lind and Ljung, 2008); (2) choice of the mapping function $f(\cdot)$. Polynomials (Billings & Chen, 1989), radial basis function (RBF) networks (Chen, Cowan, & Grant, 1991), neuro fuzzy networks (Harris et al., 2002, Wang and Mendel, 1992) and wavelet networks (Billings and Wei, 2005, Zhang, 1997) are popular options. Though some suggestions have been made on the function selection (Sjöberg et al., 1995), no unified framework is available; (3) parameter identification in the function $f(\cdot)$. This requires a specific expression of the model (1). One popular NARX model structure is a linear combination of nonlinear functions whose parameters are given a priori, which is formulated as (Ljung, 1999)

$$y(t)=\sum_{i=1}^{M}p_i(x(t),v_i)\,\theta_i+\xi(t) \qquad (2)$$

where $p_i$ is some nonlinear function with a pre-fixed nonlinear parameter vector $v_i$, and $\theta_i$, $i=1,\ldots,M$, are the linear coefficients to be optimized. The model (2) is also called the linear-in-the-parameters model for pre-fixed nonlinear parameters $v_i$. However, the number of nonlinear functions $M$ is often large, the fixed values for these nonlinear parameters are not optimized, and some nonlinear functions are redundant. This is often referred to as an over-parametrization problem; not all nonlinear functions are necessarily included in the final model, and a good subset is desirable (De Nicolao & Trecate, 1999). Within this context, building a linear-in-the-parameters model becomes a model reduction or selection problem. This paper focuses on the model selection issue.
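As an illustration of one such linear-in-the-parameters structure, the sketch below builds a candidate pool of Gaussian RBF terms with pre-fixed centres and a common width; the choice of centres (e.g. a subset of the training inputs) and the width value are assumptions for illustration, and any of the model classes above could be substituted.

```python
import numpy as np

def rbf_candidate_pool(X, centres, width=1.0):
    """Build the N-by-M candidate matrix P for a Gaussian RBF model.

    Each column p_i(x) = exp(-||x - v_i||^2 / (2*width^2)) is one candidate
    term with pre-fixed nonlinear parameters (centre v_i and width).
    """
    # squared distances between every input vector and every centre
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

# A common choice is to use (a subset of) the training inputs as centres,
# so the number of candidate terms M equals the number of chosen centres.
```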

The main objective of model selection is to build a parsimonious model with good generalization performance. Exhaustive search over all possible subsets is only feasible for a very small number of candidate model terms (nonlinear functions) and becomes computationally prohibitive when the number is large (Lind and Ljung, 2008, Mao and Billings, 1997). This is known to be an NP-hard problem. To reduce the computational effort, stepwise forward selection methods (Miller, 2002), such as forward orthogonal selection (Chen et al., 1991), the fast recursive algorithm (Li, Peng, & Irwin, 2005) and orthogonal matching pursuit (Pati, Rezaiifar, & Krishnaprasad, 1993), start from an empty model and add one term at a time until the model performance is satisfactory. The alternative is stepwise backward selection, which begins with the full model using all the candidates and then deletes one term at a time. All these methods are fast but greedy and therefore suboptimal (Kump et al., 2012, Sherstinsky and Picard, 1996). Hence, a parsimonious model with the smallest model size is always desirable.
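A minimal sketch of the greedy forward idea, written in the spirit of orthogonal matching pursuit rather than any one of the cited algorithms: at each step the candidate most correlated with the current residual is added, and the selected subset is refitted by least squares.

```python
import numpy as np

def forward_selection(P, y, n_terms):
    """Greedy forward (OMP-style) subset selection sketch.

    At each step, pick the unselected column of P most correlated with the
    current residual, then refit the selected subset by least squares.
    """
    selected, residual = [], y.copy()
    for _ in range(n_terms):
        scores = np.abs(P.T @ residual)
        scores[selected] = -np.inf            # do not re-pick chosen terms
        selected.append(int(np.argmax(scores)))
        Pm = P[:, selected]
        theta, *_ = np.linalg.lstsq(Pm, y, rcond=None)
        residual = y - Pm @ theta
    return selected, theta
```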

To improve the model compactness and the generalization performance, the combination of forward selection and backward selection has been proposed in Li, Peng, and Bai (2006) and Zhang, Li, Bai, and Wang (2012), where the backward selection is used to reselect and replace those insignificant terms produced by the forward selection. Alternatively, a number of hybrid methods combining forward selection and backward elimination (instead of backward replacement) have been reported (Haugland, 2007, Soussen et al., 2011, Zhang, 2011), where the backward elimination removes insignificant terms. The elimination scheme is also referred to as model pruning. For example, term clustering (Aguirre & Billings, 1995) and simulation error (Farina and Piroddi, 2010, Piroddi and Spinelli, 2003) based pruning methods have been studied for constructing polynomial NARX models.

It is noted that the subset selection may fail in the following scenarios:

  • The candidate terms are highly correlated and redundant, which may lead to an ill-conditioning problem (Moussaoui, Brie, & Richard, 2005). Forward selection can avoid choosing highly correlated terms but is not entirely immune to ill-conditioning. Backward selection suffers from this problem more readily, as it has to invert the full candidate set at the outset.

  • If the training data is severely polluted by noise, these subset selection methods may fit the model to the noise, which leads to over-fitting (Chen et al., 2010, Poggio and Girosi, 1990). Pre-filtering and k-fold cross validation can provide a trade-off between training accuracy and generalization performance, but they incur additional computation.

  • If the training data does not contain sufficient information, a model with little or no bias but high variance may not achieve satisfactory prediction accuracy. A biased model obtained through a good bias/variance trade-off technique may be more desirable (Johansen, 1997, Poggio and Girosi, 1990).

  • If small changes in the data can result in a very different model, then the model is less robust and its prediction accuracy is reduced (Tibshirani, 1996).

Given the above considerations, regularization methods are popular techniques to build sparse, robust and biased models by imposing additional penalties or constraints on the solution. A general regularization approach is Tikhonov regularization, which adds a penalty term to the sum squared error (SSE) cost function (Bishop, 1997, Johansen, 1997, Moussaoui et al., 2005):

$$CF_{\mathrm{Tikhonov}}=\sum_{t=1}^{N}\xi^2(t)+\lambda\sum_{i=1}^{M}DF_i$$

where the regularization parameter $\lambda$ controls the fitting smoothness and the model size, and $DF_i$ denotes function derivatives of different orders. However, this may be computationally too demanding. More recently, ridge regression and the least absolute shrinkage and selection operator (LASSO) use additional $l_2$-norm and $l_1$-norm penalties, respectively. The cost functions become

$$CF_{\mathrm{ridge}}=\sum_{t=1}^{N}\xi^2(t)+\lambda\sum_{i=1}^{M}\theta_i^2 \qquad \text{and} \qquad CF_{\mathrm{lasso}}=\sum_{t=1}^{N}\xi^2(t)+\lambda\sum_{i=1}^{M}|\theta_i|.$$
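For illustration, the ridge cost has the closed-form minimizer $\Theta=(P^TP+\lambda I)^{-1}P^Ty$, whereas the LASSO cost requires an iterative solver; the sketch below uses scikit-learn's Lasso for the latter and notes how its alpha parameter relates to $\lambda$ above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ridge_weights(P, y, lam):
    """Closed-form ridge solution: minimise sum(xi^2) + lam * sum(theta_i^2)."""
    M = P.shape[1]
    return np.linalg.solve(P.T @ P + lam * np.eye(M), P.T @ y)

# LASSO (l1 penalty) has no closed form, so an iterative solver is needed.
# scikit-learn's Lasso minimises (1/(2N))*SSE + alpha*sum(|theta_i|),
# so its alpha corresponds to lambda/(2N) in the cost function above.
lasso = Lasso(alpha=0.01, fit_intercept=False)   # alpha value is arbitrary here
# lasso.fit(P, y); sparse weights are then available in lasso.coef_
```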

These two methods use simplified penalty terms on the weights and aim to minimize the sum of the SSE and a norm of the model weights. Ridge regularization can shrink large weights towards zero but has little effect on small weights (Kump et al., 2012). Unlike ridge regression, LASSO has the potential to shrink some weights to exactly zero and can be interpreted as a Bayesian estimator (Tibshirani, 1996). More recently, modifications of the penalty term have been proposed, such as penalizing the differences between adjacent coefficients (Ohlsson, Ljung, & Boyd, 2010) or among all the coefficients (Ohlsson & Ljung, 2013). The difficulty is that no explicit mathematical solution exists for the optimal regularization parameter $\lambda$. It can be determined by cross validation, or estimated within a Bayesian framework under Gaussian prior distributions. Though a number of algorithms have been proposed (Osborne et al., 2000, Rosset and Zhu, 2007, Tibshirani, 1996), most of them are computationally inefficient compared to forward selection.
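A minimal sketch of choosing $\lambda$ by k-fold cross validation, illustrated with the closed-form ridge estimator; the $\lambda$ grid and fold count are arbitrary choices, and the same loop applies to any penalized estimator by swapping the inner solver.

```python
import numpy as np

def select_lambda_cv(P, y, lambdas, k=5):
    """Pick the regularization parameter by k-fold cross validation.

    Each candidate lambda is scored by the out-of-fold squared error of the
    closed-form ridge estimator fitted on the remaining folds.
    """
    N, M = P.shape
    folds = np.array_split(np.random.permutation(N), k)
    cv_error = []
    for lam in lambdas:
        err = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(N), fold)
            theta = np.linalg.solve(P[train].T @ P[train] + lam * np.eye(M),
                                    P[train].T @ y[train])
            err += np.sum((y[fold] - P[fold] @ theta) ** 2)
        cv_error.append(err / N)
    return lambdas[int(np.argmin(cv_error))]
```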

A promising regularization scheme, the least angle regression (LAR), has been proposed and widely studied (Efron, Hastie, Johnstone, & Tibshirani, 2004). It is a variant of forward selection, as it begins with an empty model containing no regressors and then selects one term at a time until a stopping criterion is satisfied. Unlike forward selection, where the model weights (coefficients) are identical to the least squares solution, the least angle scheme is used to determine the weights. LAR has a few distinctive advantages. First, it is computationally as fast as forward selection and more efficient than LASSO methods owing to its completely piecewise-linear solution path. Further, it can easily be modified to produce solutions for the LASSO estimator. However, LAR is still a local method and may not produce a sparser model than forward selection or LASSO methods.
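As one readily available implementation (not the authors' code), scikit-learn's Lars can fit a LAR model of a fixed size to the candidate pool P; the full piecewise-linear path is also obtainable via lars_path if the whole model sequence is of interest.

```python
import numpy as np
from sklearn.linear_model import Lars

def fit_lar(P, y, n_terms):
    """Fit a LAR model with a fixed number of selected terms.

    LAR adds one term per step and moves the coefficients along the
    equiangular direction, producing the piecewise-linear solution path.
    """
    model = Lars(n_nonzero_coefs=n_terms, fit_intercept=False)
    model.fit(P, y)
    selected = np.flatnonzero(model.coef_)   # indexes of active terms
    return selected, model.coef_
```

Here `n_terms` plays the role of the stopping criterion; other criteria (e.g. an information criterion on the path) could be substituted.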

The main objective of this paper is to improve the model sparsity and generalization performance of the LAR algorithm. This is achieved by introducing a replacement scheme through an additional refinement stage. The new method has a forward LAR stage and a backward LAR stage. The forward LAR is the same as the original LAR. The backward stage compares the significance of each term in the initial model with the remaining terms in the candidate pool and replaces insignificant ones, leading to an improved model in terms of compactness and performance. The main difference from our previous work on forward and backward methods is that the least angle scheme rather than the least squares approach is employed to determine the model coefficients. Unlike other existing model pruning methods, the proposed method employs a replacement scheme instead of elimination. Further, LAR is a computationally efficient regularization method that requires no additional computational effort to determine the regularization parameter. A more detailed analysis of these differences is given in Section 3. Extensive numerical simulations on the construction of four NARX models, including the polynomial, RBF, neuro fuzzy and wavelet models, are presented to demonstrate that the new method is able to produce sparser models than the original LAR algorithm and some alternatives.

Section snippets

NARX models

The matrix form of the linear-in-the-parameters NARX model shown in Eq. (2) is given as

$$y=P\Theta+\xi$$

where $y=[y(1),\ldots,y(N)]^T$ is the output vector, $\Theta=[\theta_1,\ldots,\theta_M]^T$ is the weight vector, and $\xi=[\xi(1),\ldots,\xi(N)]^T$ is the residual vector. The matrix $P$ is the whole candidate term pool, given by $P=[p_1,\ldots,p_M]$. It is an $N$-by-$M$ matrix with $p_i=[p_i(x(1),v_i),\ldots,p_i(x(N),v_i)]^T$. The ultimate goal of the model selection is to select a subset given by $P_m=[p_{i_1},\ldots,p_{i_m}]$, where $[i_1,\ldots,i_m]$ denote the indexes of the selected terms, and to
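Once a subset has been selected, the corresponding weights follow from ordinary least squares on the selected columns; a minimal sketch:

```python
import numpy as np

def subset_least_squares(P, y, idx):
    """Least-squares weights for a selected subset Pm = P[:, idx].

    Returns the weight vector Theta_m and the residual sum of squares, the
    quantities a subset-selection criterion is typically based on.
    """
    Pm = P[:, idx]
    theta, *_ = np.linalg.lstsq(Pm, y, rcond=None)
    residual = y - Pm @ theta
    return theta, float(residual @ residual)
```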

Forward and backward least angle regression

Although LAR is a computationally efficient regularized forward algorithm, it is still a local method, and the resultant model can be further improved in terms of the model compactness and generalization performance. This paper introduces a backward refinement stage to replace those insignificant terms in the initial model built by the forward LAR and thus improves the model generalization performance and compactness.
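The sketch below conveys only the general backward-replacement idea under a plain least-squares (SSE) criterion; it is not the authors' algorithm, which determines the coefficients with the least angle scheme and uses efficient updates rather than repeatedly refitting from scratch.

```python
import numpy as np

def sse(P, y, idx):
    """Sum squared error of the least-squares fit on the subset idx."""
    Pm = P[:, idx]
    theta, *_ = np.linalg.lstsq(Pm, y, rcond=None)
    r = y - Pm @ theta
    return float(r @ r)

def backward_replacement(P, y, selected):
    """Replace insignificant terms of an initial model with better candidates.

    For every position in the selected subset, try each unselected candidate
    and accept a swap whenever it reduces the SSE; stop when no swap helps.
    """
    selected = list(selected)
    improved = True
    while improved:
        improved = False
        for pos in range(len(selected)):
            best = sse(P, y, selected)
            for j in range(P.shape[1]):
                if j in selected:
                    continue
                trial = selected.copy()
                trial[pos] = j
                cand = sse(P, y, trial)
                if cand < best:
                    best, selected, improved = cand, trial, True
    return selected
```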

A numerical example

Consider the following benchmark nonlinear system (Hong, Harris, & Wilson, 1999):

$$u(t)=\sin\!\left(\frac{\pi}{20}t\right)$$
$$z(t)=\frac{2.5\,z(t-1)\,z(t-2)}{1+z^2(t-1)+z^2(t-2)}+0.5\,u(t-1)+0.3\cos\big(0.5(z(t-1)+z(t-2))\big)$$
$$y(t)=z(t)+\xi(t)$$

where the noise-free system input and output are $u(t)$ and $z(t)$ at time interval $t$, respectively. The output $y(t)$ is corrupted by a Gaussian noise sequence $\xi(t)$ with a signal-to-noise ratio of 25 dB. The equivalent noise variance is 0.0031. The noise is only added to the system output and has no effect on the system
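A minimal simulation sketch of this benchmark for generating training data; the data length and random seed are assumptions, and the noise variance is set to the stated value 0.0031 (about 25 dB SNR).

```python
import numpy as np

def simulate_benchmark(N=500, noise_var=0.0031, seed=0):
    """Simulate the benchmark system and return (u, y) training data.

    The noise-free output z(t) is driven by u(t) = sin(pi*t/20); Gaussian
    noise with the given variance is added to the output only.
    """
    rng = np.random.default_rng(seed)
    u = np.sin(np.pi * np.arange(N) / 20.0)
    z = np.zeros(N)
    for t in range(2, N):
        z[t] = (2.5 * z[t - 1] * z[t - 2] / (1 + z[t - 1] ** 2 + z[t - 2] ** 2)
                + 0.5 * u[t - 1]
                + 0.3 * np.cos(0.5 * (z[t - 1] + z[t - 2])))
    y = z + rng.normal(0.0, np.sqrt(noise_var), N)
    return u, y
```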

Conclusion

A forward and backward least angle regression (LAR) algorithm has been proposed for constructing a range of NARX models. The backward stage can replace the insignificant terms previously selected by the forward LAR with the most significant ones left in the candidate term pool, leading to a more compact model with improved generalization performance. Numerical results have confirmed the effectiveness of the proposed method.

References (47)

  • Oussar, Y., et al. (2000). Initialization by selection for wavelet network training. Neurocomputing.
  • Sjöberg, J., et al. (1995). Nonlinear black-box modeling in system identification: a unified overview. Automatica.
  • Aguirre, L.A., et al. (1995). Improved structure selection for nonlinear models based on term clustering. International Journal of Control.
  • Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control.
  • Berwin, A.T. (2004). Discussion of least angle regression. Annals of Statistics.
  • Billings, S.A., et al. (1989). Identification of non-linear rational systems using a prediction error estimation algorithm. International Journal of Systems Science.
  • Billings, S.A., et al. (2005). A new class of wavelet networks for nonlinear system identification. IEEE Transactions on Neural Networks.
  • Bishop, C.M. (1997). Neural networks for pattern recognition.
  • Burshtein, D., et al. (1985). Some relations between the various criteria for autoregressive model order determination. IEEE Transactions on Acoustics, Speech and Signal Processing.
  • Chen, S., et al. (1989). Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control.
  • Chen, S., et al. (1991). Orthogonal least squares algorithm for radial basis function networks. IEEE Transactions on Neural Networks.
  • Chen, S., et al. (2010). Sparse kernel regression modeling using combined locally regularized orthogonal least squares and D-optimality experimental design. IEEE Transactions on Automatic Control.
  • De Nicolao, G., et al. (1999). Consistent identification of NARX models via regularization networks. IEEE Transactions on Automatic Control.

Long Zhang received his B.Eng. and M.Eng. degrees as an outstanding graduate in Electrical Engineering and Automation from Harbin Institute of Technology, Harbin, China, in 2008 and 2010, respectively, and the Ph.D. degree in Electronics, Electrical Engineering and Computer Science from Queen’s University Belfast, UK, in 2013. He is currently a Research Associate at the Department of Automatic Control and System Engineering, University of Sheffield, UK. His research interests include system identification, neural networks, statistical regression and fault diagnosis in both time and frequency domains.

Kang Li received the B.Sc. degree from Xiangtan University, Xiangtan, China, in 1989, the M.Sc. degree from the Harbin Institute of Technology, Harbin, China, in 1992, and the Ph.D. degree from Shanghai Jiaotong University, Shanghai, China, in 1995. He is currently a Professor of Intelligent Systems and Control with the School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK. He is involved in bioinformatics with applications in food safety and biomedical engineering. He is also a Visiting Professor with the Harbin Institute of Technology, Shanghai University, Shanghai, China, and the Ningbo Institute of Technology, Zhejiang University, Zhejiang, China. He has held Visiting Fellowships or Professorships with the National University of Singapore, Singapore, the University of Iowa, Iowa City, IA, USA, the New Jersey Institute of Technology, Newark, NJ, USA, Tsinghua University, Beijing, China, and the Technical University of Bari, Taranto, Italy. He has authored more than 200 papers in his areas of expertise and edited 12 conference proceedings (Springer). His current research interests include nonlinear system modeling, identification and control, bio-inspired computational intelligence, and fault diagnosis and detection, with recent applications in power systems and renewable energy, and polymer extrusion processes. He serves on the editorial boards of Neurocomputing, the Transactions of the Institute of Measurement and Control, Cognitive Computation, and the International Journal of Modelling, Identification and Control.

This work was supported in part by the U.K. Research Councils under Grants EP/G042594/1 and EP/L001063/1, in part by the Chinese Scholarship Council, in part by the National Natural Science Foundation of China under Grants 61271347 and 61273040, and by the Science and Technology Commission of Shanghai Municipality under Grant 11JC1404000. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Antonio Vicino under the direction of Editor Torsten Söderström.
