On evolutionary system identiﬁcation with applications to nonlinear benchmarks

This paper presents a record of the participation of the authors in a workshop on nonlinear system identiﬁcation held in 2016. It provides a summary of a keynote lecture by one of the authors and also gives an account of how the authors developed identiﬁcation strategies and methods for a number of benchmark nonlinear systems presented as challenges, before and during the workshop. It is argued here that more general frameworks are now emerging for nonlinear system identiﬁcation, which are capable of addressing substantial ranges of problems. One of these frameworks is based on evolutionary optimisation (EO); it is a framework developed by the authors in previous papers and extended here. As one might expect from the ‘no-free-lunch’ theorem for optimisation, the methodology is not particularly sensitive to the particular (EO) algorithm used, and a number of different variants are presented in this paper, some used for the ﬁrst time in system iden-tiﬁcation problems, which show equal capability. In fact, the EO approach advocated in this paper succeeded in ﬁnding the best solutions to two of the three benchmark problems which motivated the workshop. The paper provides considerable discussion on the approaches used and makes a number of suggestions regarding best practice; one of the major new opportunities identiﬁed here concerns the application of grey-box models which combine the insight of any prior physical-law based models (white box) with the power of machine learners with universal approximation properties (black box). (cid:1) 2018 The Authors. Published by Elsevier Ltd. ThisisanopenaccessarticleundertheCCBY


Introduction
In March of 2016, an interesting meeting on the subject of nonlinear system identification (NLSI) took place at Vrije Universiteit Brussel (VUB) in the Belgian capital.The meeting was interesting for two reasons; in the first case, it was organised with the intention of bringing together experts from the disciplines of electrical engineering, mechanical engineering and machine learning, in order to draw out common elements of best practice for nonlinear system modelling/identification, and also to exploit any potential synergies.The second feature of interest was that the meeting was organised around the discussion of three benchmarks for NLSI, each designed in such a way as to challenge theory and practice in specific ways.
The three benchmark problems were as follows: A Bouc-Wen Hysteretic System.The NLSI challenges of this benchmark were associated with the fact that the system of interest had an unmeasurable state in its equations of motion, and the fact that the model form was not linear in the parameters.The system equations were encoded in a Matlab p-file, which allowed participants complete freedom in choosing the form of the excitation used for identification.The data were thus generated by computer simulation, although noise was added to the response in the p-file to give an element of realism.
A Wiener-Hammerstein System with Process Noise.The main challenge associated with this benchmark was that the system was a block-structured system where significant noise was added to an internal state.The system was encoded in an electronic circuit and was thus experimental (at least from an electrical engineering point of view).Although participants did not have the freedom to completely experiment with excitation signals, they were allowed to propose signals, which were then used in a number of measurement campaigns in order to collect data for the benchmark exercise.
A Cascaded Tanks System.This was an experimental liquid level system, in which fluid passed between two tanks.The main challenges were that an unmeasured state was present again, but mainly that the record of data for NLSI was very short.Further challenges arose due to the overflow of the tanks, which introduced a hard saturation nonlinearity and some uncertainty in the form of process noise.Participants were not given any control of the experiment in this particular case.
This paper is a record of the participation of a team of University of Sheffield (UoS) academics and researchers.It comprises a summary of a keynote presentation by one of the authors, followed by detailed descriptions of how the team attempted to solve the benchmark problems.It is argued here that general frameworks are beginning to emerge for NLSI, which are capable of addressing ranges of disparate problems.Two of the main candidates for such a general framework are those based on evolutionary optimisation (EO) and Bayesian inference.In fact, the algorithms applied here were taken from the EO approach developed by the authors over a number of years and extended in order to address the benchmarks.The power of the EO framework is clearly evidenced by the fact that it provided the best solutions to two out of the three benchmark problems at the focus of the workshop.As one might expect from the 'no-free-lunch' theorem for optimisation [1], it would be surprising if a single variant of the EO algorithm stood out as the overall best choice, so the viewpoint here has been to present a number of possibilities (reflecting the slightly different tastes of the authors and illustrating the range).Although the EO approach is favoured in this paper, the Bayesian framework for NLSI is also very powerful and is being pursued by the authors; however, there is simply not room here to compare the two frameworks.If the reader is interested in seeing how modern Bayesian methods can contribute to NLSI they can consult the references in the following section.
One of the main contributions of this paper is to highlight and develop the idea of using grey-box models for NLSI.Grey box models combine the insight of a physics-based (white box) model with the explanatory power of machine learners (black box) which have universal approximation/representation properties.In fact, the grey-box model presented here for Benchmark Three also combines the power of the EO and Bayesian approaches by using a Gaussian process NARX (Nonlinear AutoRegressive with eXogeneous inputs) model to capture behaviour missed by the physical model and to thus substantially improve predictions.
It is important to note two facts.The first is that the paper has been formed in order to give an honest account of the identification results, as presented at the workshop; it deliberately does not contain any results which exploit lessons learned during or after the meeting, although those lessons have led to a great deal of progress for the participants since.The second fact to note, is that four separate studies are presented here, each carried out by separate subgroups of the UoS team; this means that the studies may reflect slightly different views on the practice of NLSI -the authors are all in general agreement about the aims, objectives and importance of the subject.The amount of ground covered here also means that the paper is rather lengthy; this is sadly unavoidable if it is to reflect properly the weight of the work conducted.
The layout of the paper is as follows: the following section summarises one of the workshop keynotes, and discusses the question of whether NLSI can be reduced to a problem in machine learning.The three subsequent sections, outline in turn, how the authors approached the three VUB benchmark problems.

Introduction
The material of this section was originally the subject of a keynote at the VUB Workshop; as a result, its remit is broader than the material which follows, which presents studies of the specific benchmark exercises.However, the discussion will touch on various issues which surface during the detailed studies and will also attempt to capture aspects of current thinking in terms of NLSI within the Engineering Dynamics community.In order to faithfully cover what was discussed in the keynote, it will be necessary to go over a little previously-published ground; however, this will also help to make the paper more selfcontained.
Historically, one could argue that the main developments in the general theory of linear SI have come from the electrical engineering and control communities.This work resulted in a comprehensive and rigorous body of material which is visible through classic texts and monographs like [2,3].Although general ideas from SI were certainly adopted by the Engineering Dynamics community, the main developments there are associated with a specific methodmodal analysis [4].Modal analysis arose naturally as a result of the fact that linear engineering dynamics is concerned with a specific system of secondorder differential equations, derived from Newton's second law, and usually expressed in the matrix form, where xðtÞ is the excitation force and yðtÞ is the displacement response of the system of interest; M; C and K are, respectively, the so-called system mass, damping and stiffness matrices.(Throughout this paper, capitals will denote matrices and underlines will denote vectors; overdots will indicate differentiation with respect to time.)Modal analysis works by using matrix diagonalisation to reduce an N-Degree-of-Freedom system to N Single-Degree-of-Freedom (SDOF) systems; it has proved overwhelmingly successful in the context of linear engineering dynamics.
The main limitations of the linear approaches discussed (and they are serious limitations) is that they do not adequately address nonlinearity, nonstationarity and uncertainty.The issue of nonstationarity in SI deserves an article of its own (in fact, a recent special issue of MSSP was dedicated to this matter1 ) and is not considered further here.In terms of nonlinearity, a great deal of work has been carried out over the last fifty years in particular, but it is fair to say that no panacea has emerged.Instead, at least as far as structural dynamics is concerned, a toolbox philosophy has evolved.Quite a wide variety of approaches to nonlinear SI (NLSI) have been developed, each with its own optimal domain of applicability, as summarised in [5][6][7].At the moment, there is the prospect of more generally applicable methods emerging, and this will be discussed in more detail a little later.
In terms of uncertainty, there are a number of interesting matters to discuss.It is fair to say that Engineering Dynamics has largely assumed throughout its history that deterministic models are appropriate for system modelling and prediction; recent (and not so recent) developments suggest otherwise.For example, the detailed dynamical modelling of biomechanical systems is becoming more commonplace, and this faces the immediate problem that the mechanical properties of tissue vary considerably from individual to individual and even within a single individual; this presents serious issues in terms of the construction of predictive models able to generalise from person to person.Because of the problem of uncertainty, deeper probabilistic reasoning is becoming much more common in the analysis of dynamical problems.(Of course, probability theory is only one of a large range of possible uncertainty theories [8]; however it is by far the most highly developed and pervasive).Many of the lessons learned recently have come from the field of machine learning.
In some areas of engineering dynamics, uncertainty has been (at least partially), accommodated in theory and practice for a long time, and it is interesting to note that SI is a good example of this -at least in the linear case.It has long been recognised, that to identify a parametric model from measured data, one has to allow for the fact that noise may be present in any measurements in order that the identified parameters for the model are meaningful.Here, meaningful has largely been taken in the past to mean unbiased, which is to say that the parameters truly allow modelling of the underlying structure or system of interest, rather than capturing aspects of the noise inherent in the individual set of data used for learning the model.However, quite independently of Bayesian approaches, developments in regularised methods of linear SI have forced a reevaluation of the meaning and import of bias [9,10].In general, the inclusion of noise models in the linear and nonlinear approaches has often been considered sufficient for the removal of bias; possibly the most principled approach in the nonlinear case is the NARMAX approach pioneered by Billings and co-workers [11].Despite some progress, one might argue that the probabilistic reasoning that underlies SI was often hidden until recently.In particular, despite the fact that many leastsquares estimators used for SI are maximum-likelihood estimators under given assumptions, SI users in structural dynamics would often implement algorithms in linear algebra and treat the resulting crisp parameters as constituting 'the model'; the main objective of noise models has been to ensure that there is no systematic bias in the estimated parameters.Furthermore, even if a covariance matrix for the estimates was found, it was usually only used to provide confidence intervals or 'error bars' on the parameters; predictions would still be made using crisp parameter estimates.Such approaches are powerful, but do not fully accommodate the fact that the data may be consistent with a number of different parametric models, and thus only go part way to recognising and accommodating general aspects of uncertainty.
Recent advances in machine learning [12,13] have not only offered more principled and holistic means of addressing the issue of uncertainty, but have also offered the prospect of a more general paradigm for NLSI.In fact, another more general approach has emerged in recent years which is also rooted in the broader discipline of soft or natural computing [14] -one based on evolutionary optimisation.Both of these new developments for NLSI will be discussed in more detail here; however, it is the evolutionary methods which will dominate the detailed investigation of the VUB benchmarks later in the paper.Before discussing the general approaches in detail, it will prove useful to define a little terminology; it will be useful to divide predictive models into two classes: white and black-box models.
A white-box model here is one where the equations of motion have been derived from the underlying physics of the problem and the model parameters have direct physical meanings; e.g.finite element models or the lumped-mass model represented by Eq. (1).A black-box model is formed by taking a class of models with some universal approximation property and learning the parameters from data; in such a model, like a neural network, the parameters will not generally be physical.SI or learning from data in machine learning terms, is essential to a black-box approach; for the white-box model, parameters may be estimated from data or fixed by physical laws.

Evolutionary optimisation
As far as the current authors are concerned, the original motivation for developing optimisation-based methods for NLSI was not so much rooted in uncertainty analysis, but with other technical problems associated with identifying nonlinear systems.These technical problems are most easily discussed with respect to specific systems; as the Bouc-Wen system was the original system of interest [15] and also appears later as one of the VUB benchmarks, it makes good sense to introduce it here.
The general Bouc-Wen (BW) model [16,17] is a nonlinear hysteretic restoring force model, where the total restoring force is composed of a polynomial non-hysteretic and a hysteretic component based on the displacement yðtÞ and velocity _ yðtÞ time-histories.The general Single-Degree-of-Freedom (SDOF) hysteretic system described in the terms of Wen [17] is, m€ y þ rðy; _ yÞþzðy; _ yÞ¼xðtÞ ð2Þ where rðy; _ yÞ is the polynomial part of the restoring force and zðy; _ yÞ the hysteretic; m is the mass of the system and xðtÞ is the excitation force.The polynomial component may be assumed linear if desired (and justified), but is essentially a static nonlinear function of y and _ y.In contrast, the hysteretic component is defined via an additional equation of motion [17], for n odd, or, for n even.
The parameters a; b and n govern the shape and smoothness of the hysteresis loop (this will be elaborated later).As a system identification problem, this set of equations presents a number of difficulties, foremost are: The variables available from measurement will generally be the input x and some form of response: displacement, velocity or acceleration.Even if all the response variables mentioned are available, the state z is not measurable and therefore it is not possible to use Eqs.( 3) or ( 4) directly in a least-squares formulation.The parameter n enters the state Eqs.( 3) and ( 4) in a nonlinear way; this means that a linear least-squares approach is not applicable to the estimation of the full parameter set; at the least, some iterative nonlinear least-squares approach is needed as in [18].
Both of these difficulties can be addressed by adopting an (evolutionary) optimisation approach.The SI problem is simply framed as a minimisation problem with the objective/cost function defined as a normalised mean-square error between the 'measured' data and that predicted using a given parameter estimate, i.e., Jðm; c; k; a; bÞ¼ where the error is framed in terms of displacement response, N is the number of measured points and the caret denotes a quantity predicted by the model.In general, any metric measuring the 'distance' between measured data and predictions can be used; the authors usually use a normalised version of (5), where r 2 y is the variance of the measured displacements.This cost function has the following useful property; if the mean of the output signal is used as the model i.e. ŷi ¼ y for all i, the cost function is 100.0 (and can be thought of as a percentage).This definition of cost could just as easily be used with velocity or acceleration data.The clear advantage of the optimisation approach is that it does not require measurements of the latent variable z and is insensitive to whether the model is linear in the parameters.
So far, this type of problem could be approached using any optimisation method e.g. a Gauss-Newton approach was adopted in [18].However, evolutionary approaches offer various advantages, and in order to show this, it is useful to give a concrete example.As variants on the Differential Evolution (DE) algorithm were used to analyse two of the VUB benchmarks, it makes sense to describe the basic DE algorithm here.
The standard DE algorithm of [19] attempts to transform a randomly-generated initial population of parameter vectors into an optimal solution through repeated cycles of evolutionary operations, in this case: mutation, crossover and selection.In order to assess the suitability of a certain solution, a cost or fitness function is needed; the cost function in Eq. ( 6) is the one used here.Fig. 1 shows a schematic for the DE procedure for evolving between populations.The following process is repeated with each vector within the current population being taken as a target vector; each of these vectors has an associated cost taken from Eq. (6).Each target vector is pitted against a trial vector in a competition which results in the vector with lowest cost advancing to the next generation.
The mutation procedure used in basic DE proceeds as follows.Two vectors A and B are randomly chosen from the current population to form a vector differential A À B.Amutated vector is then obtained by adding this differential, multiplied by a scaling factor F, to a further randomly chosen vector C to give the overall expression for the mutated vector: C þ FðA À BÞ.The scaling factor, F, is often found to have an optimal value between 0.4 and 1.0.
The trial vector is the child of two vectors: the target vector and the mutated vector, and is obtained via a crossover process; in this work uniform crossover is used.Uniform crossover decides which of the two parent vectors contributes to each chromosome of the trial vector by a series of D À 1 binomial experiments.Each experiment is mediated by a crossover parameter C r (where 0 6 C r 6 1).If a random number generated from the uniform distribution on [0, 1] is greater than C r , the trial vector takes its parameter from the target vector, otherwise the parameter comes from the mutated vector.In order to ensure that all trial vectors differ from their associated target vector, even if C r ¼ 0, a single chromosome in the trial vector is randomly chosen to take the corresponding value from the mutated vector.
This process of evolving through the generations is repeated until the population becomes dominated by only a few low cost solutions, any of which would be suitable.Like the vast majority of optimisation algorithms, convergence to the global minimum is not guaranteed; however, one of the benefits of the evolutionary approach is that it is more resistant to finding a local minimum.In fact, this usually proves to be the main benefit.The other advantage of the approach is that the algorithm does not need estimates of the gradients or Hessians of the parameters.
As discussed above, evolutionary optimisation provides a useful framework for NLSI that overcomes a number of technical problems that one encounters in trying to use standard least-squares methods; in fact, some variant or other has been used in order to address each of the benchmark studies discussed later.Where the evolutionary approaches can be found wanting, is in their accommodation of uncertainty.In fact, it is possible to estimate confidence intervals for parameters [20], but this does not amount to a comprehensive treatment of uncertainty, as observed earlier.One aspect of the evolutionary approaches which is almost never exploited, is that they return a population of solutions at every generation -all the information obtained throughout this process could be used to account for uncertainty and could result in more robust predictions.Generally speaking, the evolutionary approaches are probably best applied for white-box models.This is because the size of the population required is a function of the number of parameters estimated; one usually chooses the number of individuals to be five or ten times the number of model parameters.Larger populations will lead to more computational cost, and if evaluation of the objective function is not fast, the cost may be prohibitive.As white-box models are almost always more parsimonious in terms of parameters, they are singled out for evolutionary modelling.This observation has not stopped various people attempting to use evolutionary algorithms to train neural networks, for example, but the evidence shows clearly that this is not usually a good idea.
Unlike the evolutionary approaches, the other general framework for NLSI which is emerging, is almost entirely motivated by a desire to understand and account for uncertainty, and this will be discussed next.

Bayesian inference
As supported by recent work in the machine learning community, a more robust approach to parameter estimation, and also model selection, can be formulated on the basis of Bayesian principles [12,21,13].It will be shown that, among the potential advantages offered by a Bayesian formulation are: The estimation procedure will return parameter distributions rather than point estimates of parameters.Predictions can (in principle) be made by integrating over all possible models consistent with the data, weighted by their probabilities.
Evidence for a given model structure can be computed, leading to a principled means of model selection.
The Bayesian approaches to NLSI/system models can be applied to both white-and black-box models; in fact, methods for black-box models arguably emerged first e.g.Bayesian learning algorithms for Multi-Layer Perceptron (MLP) neural networks [22].In terms of white-box models, Bayesian methods are not new to structural dynamics, as evidenced by over 20 years of work by Jim Beck and colleagues [23][24][25][26][27]; however, they have by no means been fully exploited.More recently, Bayesian ID methods for white-box differential equation models have emerged in the context of systems biology [28,29].
In order to discuss the advantages of a Bayesian approach, it is useful to re-state what the problem of SI is, i.e. given measured data from a structure, how does one infer the equations of motion which 'generated' the data.Although the problem can be stated simply, it has a number of technical difficulties and is generally not at all easy to solve.At the base of the issues is the fact that SI is an inverse problem of the second kind and can be extremely ill-posed even if the underlying equations are assumed to be linear in the parameters of interest; the 'solution' may not be unique.Furthermore, if the equations of motion of the system of interest are not linear in the parameters, the difficulties multiply.
Much of the difficulty can be blamed on uncertainty again; even if the issue is simply that the measurements or data from a system will almost always be contaminated by some form of random noise.For notation, assume data D ¼fðx i ; y i Þ; i ¼ 1; ...; Ng of sampled inputs x i and outputs y i .If there is no noise, then the identification algorithm of choice should give a deterministic estimate of any system parameters w; however, if noise ðtÞ is present, w will become a random variable conditioned on D. In this context one arguably no longer wishes an estimate of w, but to specify ones belief in its value, expressed through some appropriate uncertainty framework.If the framework of choice is probability theory, the problem becomes one of estimating the probability density function of the parameters pðwjD; MÞ, where the density is conditioned on the training or estimation data D, but also on the choice of model M. Thus, in the presence of noise, the most one can learn from any data is the probability density function of the parameters; however, in a probabilistic context, this is everything. 2ven so, the uncertainty in any estimated parameters is by no means the only uncertainty of interest.The usual objective of SI is to provide some sort of predictive model i.e. a mathematical model which can estimate or predict system outputs if a different system input is given.If a crisp parameter estimate is all one has, the best one can do is a crisp prediction; however, if a parameter distribution is known, one can potentially do better and form a predictive distribution.Suppose the response to a new input sequence x Ã were desired, one could in principle find the density for the predicted outputs, and then the mean of this distribution would give 'best' estimates for predictions, and the covariance would allow one to establish confidence intervals.Note that the prediction assumes the presence of a w.In practice, one might use the mean or the mode of the parameter distribution, but these are still point estimates, not reflecting the uncertainty in the parameters.
A fully Bayesian prediction strategy would, again, in principle, allow one to marginalise over parameter estimates, i.e., This is a very powerful idea: allowing for a fixed model structure, one makes predictions using an entire set of parameters consistent with the training data; each point in parameter space weighted according to the likelihood of the given data.In practice, there are problems in implementing the full Bayesian approach due to the difficulty of evaluating the integral in (8).
Another advantage offered by the Bayesian approach is that it can potentially weigh the evidence for competing model forms.Suppose the true model structure must be one of a finite number fM i ; i ¼ 1; ...; Mg; one can imagine computing the probability of observing the data PðDjM i Þ and selecting the model with highest probability.Even more in the Bayesian spirit, one could marginalise over all possible model structures e.g. for prediction, Unfortunately, the posterior over models PðM i jDÞ is difficult to compute.If one appeals to Bayes theorem in the form, and assumes equal priors on models, one arrives at the Bayes factor, which weights the evidence for two models in terms of marginal likelihoods of the data given the models.Sadly, the marginal likelihoods are usually intractable integrals.
In summary, assuming one can overcome some of the computational difficulties involved (e.g.high-dimensional numerical integrals), the Bayesian framework for NLSI is very general indeed.Like the evolutionary approaches, the Bayesian ones have no technical problems with unmeasured states or models which are nonlinear in the parameters.All of this suggests that NLSI has been reduced to the problem of finding appropriate computational algorithms for machine learning; that the problem of NLSI has been reduced to one of machine learning.The next part of the discussion here will argue that this is not the case.

Is NLSI just machine learning?
To recap, it appears that NLSI can be formulated in terms of a machine learning approach; the problems raised so far relate only to difficulties in numerical calculations.The argument here is going to be that NLSI is more than this, and the first argument will be based on going back to the idea of uncertainty.
The previous discussion has highlighted the importance of considering uncertainty; however, in the reality of modelling engineering systems and structures, one needs to go a little further and address the issue that there are two main types of uncertainty.
Aleatory Uncertainty: is essentially randomness.Examples are measurement noise superimposed on data or the behaviour of truly stochastic systems (i.e.Brownian motion).This is uncertainty which cannot be removedirreducible uncertainty.Epistemic Uncertainty: is essentially ignorance.It commonly arises because all of the underlying causes (physics) of a problem are not known.This type of uncertainty can be removed by designing experiments to learn the missing physics -i ti sreducible.The case will be presented here that even the Bayesian approach discussed earlier is not adequate to address the full issues of uncertainty because it is only really formulated to deal with aleatory uncertainty. 3In reality, there may be ignorance of the form of the model, even of the underlying physical principles of the processes at work.
Dealing with epistemic uncertainty leads one naturally to the idea of a Grey-Box model.A grey-box model is one for which only some of the underlying physics is specified i.e. it has a white-box component; one can then attempt to reduce any model error by adding a nonparametric component and learning its behaviour from data.This observation in turn leads to the idea of two types of grey-box models: A grey-box model will be said to be of Type A if the nonparametric component is a true black-box model.A grey-box model will be said to be of Type B if the nonparametric component is motivated in some way by physics rather than simple possession of a universal approximation property.
Type B models are arguably the result of physics and creativity and cannot be found by learning from data alone.Two examples will be considered here.

Friction models
Friction is dynamically the resistance to motion produced by interfacial contacts between two bodies in relative motion.The phenomenon has a microstructural origin, and the detailed physics is the subject of the discipline of tribology.A true white-box model of friction would be prohibitively costly for most purposes of structural dynamics, so simplified effective models are usually assumed.The most simplistic semi-physical representation is via the Coulomb model which simply reverses the action of a constant force when the direction of motion reverses; in the context of an SDOF oscillator, one has, The Coulomb model is very limited, but is conceptually simple and very convenient for SI.Among the immediate limitations of the model is the fact that it does not distinguish between static and dynamic friction and that it does not account for the hysteresis loops which are observed in reality.In order to accommodate the hysteresis effect, the more sophisticated Dahl model was introduced in 1968; the basic model has the form, where the z is a state variable interpreted as the elastic deformation of surface asperities of adjacent bodies (note the resemblance to the Bouc-Wen model in this respect; this is a characteristic of hysteresis models) and r 0 represents a sort of average asperity stiffness and d D determines the shape of the hysteresis but, in the literature, is often set to unity.The Dahl model is a Type B grey-box model; the white-box component is simply an SDOF oscillator, while the black-box component is constructed by considering a model of the average movement/displacement of microscopic asperities.The SI problem has become more difficult than the Coulomb case (the model has an unmeasured state and is nonlinear in the parameters) but gives a better representation of the dynamics, respecting better as it does, the underlying physics of the problem.The Dahl model motivated the construction of a better model -the LuGre model [31].While the Dahl model captures the difference between static and dynamic friction (sometimes called predisplacement) and hysteresis, it is unable to account for stick-slip behaviour and the so-called Stribeck effect (decrease of friction with velocity over a certain velocity regime; visible in Fig. 2(a)).A simplistic view of the LuGre model is that it adds an effective damping component for the average asperity movement (Fig. 3).
The equations of motion for the LuGre model incorporated into the motion of an SDOF oscillator are, where F S and F C are the static and Columb friction coefficients respectively, sð _ yÞ is the Stribeck curve, v s is the Stribeck velocity and d vs is the Stribeck shape factor (d vs ¼ 2 is often used in the literature).A parametrised model of the Stribeck curve can be included in the overall identification/learning problem.Note that, although some of the parameters do not have completely clear physical interpretations, others -like F C and F S -encode simple, direct physics.Other friction models have evolved in turn from the LuGre model, including the Leuven integrated friction model which also accounts for presliding hysteresis [33].However, the point has been made for now; Type B grey-box models like the Dahl and LuGre models are based on physical and engineering insight and rely on the introduction of model terms which are very problem specific and capture behaviour in a parsimonious manner, that would be difficult for black-box basis terms to capture in such a small number of terms.While the grey-box structures are more challenging from an NLSI viewpoint, they are addressable using the powerful methods discussed earlier; for a study on the issues associated with estimating grey-box friction models more generally, the reader may consult [34].

Hysteresis models
The second example of a class of Type B grey-box models is provided by hysteretic systems.Setting aside friction, hysteretic or memory-dependent phenomena are observed in many areas of physics and engineering such as electromagnetism, phase transitions and elastoplasticity of solids [35].As in the case of friction, the exact physics at play is often complex and a simplified effective model structure can be sufficient for the purposes of engineering dynamics.As introduced in Section 2.2, one of the most commonly-used effective hysteresis models is the Bouc-Wen (BW) model.As shown earlier, the BW model incorporates an unmeasured system state specified by an additional equation of motion, which allows a versatile representation of a family of hysteresis loops (Fig. 4).Unlike the friction models, the BW model does not have a direct physical interpretation; however, it has been constructed in order to give the aforementioned versatility.
Sadly, the BW model is not generally versatile enough.Various effects commonly occur in hysteretic systems which cannot be captured by the basic model.One example is given in Fig. 5, which shows pinching of the hysteresis loops for a nailed sheathing connection in a wooden frame [36].
As in the case of the friction models, new terms need to be added in order to capture the missing physics.As before, these are designed as effective terms intended to capture the required behaviour without detailed modelling of the microstructural physics which cause it.One of the most successful extensions of the BW model is the Bouc-Wen-Baber-Noori (BWBN) Model [37][38][39].The BWBN model is designed to capture the pinching effect illustrated earlier, and also to model the strength and stiffness degradation observed in many real hysteretic structural systems.The form of the model is, where gðÞ; mðÞ and hðzÞ are parameters associated with the strength, stiffness and pinching, degradation effects; gðÞ; mðÞ and AðÞ are increasing functions of the absorbed hysteretic energy , The pinching function hðzÞ is specified as,  hðzÞ¼1 where and z u is the ultimate value of z, specified by, Fig. 5. Illustration of a nailed sheathing connection in a wooden frame and the corresponding pinching hysteresis curve [36].
Explaining how the extra terms are motivated is beyond the scope of this paper, the reader should consult the original references.The point here is that the BWBN model is a Type B grey-box model, informed by basic physics (like the absorbed energy) but not attempting to capture it in all its microstructural detail.As in the case of the friction models, the result is a parsimonious model which will capture behaviour better than a black-box model with a comparable number of parameters would.
Once one has the model (and some data), machine learning can take over in order to estimate parameters (or to select between candidate models), but getting the model form is another matter.One might argue about whether extraction of a model of this complexity is SI, or whether it is fundamental physics; the current authors would argue that it is SI -it is not intended as an exploration of basic physics, but as a means of providing an effective predictive model.

Conclusions
The local conclusions for this section are as follows: The Bayesian and evolutionary viewpoints on nonlinear SI offer quite general frameworks for the estimation of model parameters.They offer advantages over point parameter estimation (populations in the case of the evolutionary approach, distributions in the case of the Bayesian).In terms of uncertainty analysis, the Bayesian approach is likely to be advantageous even when evolutionary schemes allow estimates of parameter confidences.Many of the insights here have come from machine learning work, along with very powerful parameter estimation and model structure detection techniques from the Mechanical and Electrical Engineering communities; this synergy is precisely what the VUB workshop was intended to expose.Machine learning is not everything.System identification needs physical insight and expertise in order to overcome the problem of model form uncertainty.This is just as true for grey-box models as white-box models.Although it has not been discussed yet, the problems of developing an optimal test or data collection strategy is still not completely possible using automated analysis (this will be discussed in the context of the Benchmark One results).
The paper next moves on to the VUB benchmark problems.Each problem was addressed by subgroups of the overall team.As discussed earlier, each subgroup separately adopted an evolutionary optimisation approach.In two cases, the algorithm adopted was an extension or variant of the Differential Evolution algorithm described earlier; in the other, a type of swarming algorithm motivated by the behaviour of Antarctic krill was adopted.In terms of Benchmark One and the whitebox component of Benchmark Three, it would have been a straightforward matter to adopt a Bayesian white-box approach; in fact, some of the current authors have recently applied Approximate Bayesian Computation (ABC) techniques to the problem of hysteretic (Bouc-Wen) system identification, with a great deal of success [40,41].However, applying the methods in order to make a comparison here, would have lengthened the paper considerably.In terms of Benchmark Two, it is probably fair to say that there is no comprehensive Bayesian approach to the identification of Wiener-Hammerstein models agreed upon in the research community; although MCMC methods could clearly be applied if the computational burden did not prohibit it.

Introduction
With this section, begins the study of the VUB benchmarks.In each case, the participants in the workshop were presented with a detailed specification of the problem, including how to access or generate the required data.Only those details which are necessary for understanding the results here will be included in the current paper.
This section is concerned with Benchmark One, which required the system identification of a simulated SDOF Bouc-Wen (BW) hysteresis system.As discussed earlier, the general BW model [16,17] is one of the most commonly-used mathematical models for describing hysteretic behaviour.It was used as a benchmark because of the specific challenges associated with the system possessing an unmeasurable internal variable and the dynamic nature of the nonlinearity.As the current authors had already developed a successful evolutionary approach to BW system identification [15], the opportunity was taken to focus on two methodological issues: benchmarking the identification with a linear model, and choosing appropriate (if not optimal) excitation signals for the generation of training data.
In this case, the detailed specifications of the benchmark can be found in [42].Restating the equation of motion for the BW oscillator in the form and notation of [42] gives, m L € yðtÞþrðy; _ yÞþzðy; _ yÞ¼xðtÞ ð20Þ where m L is the system mass, yðtÞ is the displacement and xðtÞ is the force input.The static nonlinear term, rðy; _ yÞ, which only depends upon the current values of displacement and velocity, is assumed to be linear, where k L and c L are the linear stiffness and linear viscous damping parameters.The history-dependent (hysteretic) nonlinear term, zðy; _ yÞ, obeys the first-order differential equation, where a; b; c; d and t are the Bouc-Wen parameters which control the shape/structure of the hysteresis loop.
For the purposes of the benchmark study, the BW system to be identified was encoded within a Matlab p-file.This gave the user complete freedom over the choice of force input.This freedom was critical to the system identification strategy which will be discussed in the next section.The BW p-file made use of a Newmark numerical integration scheme and detailed guidelines were given regarding upsampling, filtering and decimation for generation of data.The p-file also added a constant level of Gaussian band-limited noise to the output displacement; the force input was assumed to be free of noise.In order to allow comparison of different system identification strategies, two fixed test datasets were provided.One of these datasets was a random phase multisine dataset whilst the other was a sine-sweep dataset.The test data sets were not to be used during the parameter estimation phase of the work but were to allow reporting of a Figure-of-Merit (FoM) for each of the two test datasets, which could be used to compare identification results across participants.The FoM was an unnormalised Root-Mean-Square error, defined by, where N t is the number of points in the given test set, i is the sample index and the caret denotes quantities estimated by the model.
It is important to note that there can be two important modes of prediction using models generally; although the distinction is most marked when a pure discrete-time model is used rather than a continuous-time one.Suppose the model output at sampling instant i is a function of previous samples of input and output.For the training data, measured inputs and outputs are available.This means that one can estimate the current output by substituting measured inputs and outputs into the model function.Somewhat confusingly, some sectors of the SI community refer to this mode of estimation as prediction.A more stringent test of the model is to start with measured initial conditions and to recursively feed back estimated outputs into the model function in order to generate the next estimate; this mode is called simulation.For the benchmarks, the participants were asked to report simulation errors, where the distinction was possible.For the continuous-time model of Benchmark One; forward predictions are made by using an initial-value solver, and the mode is naturally simulation.Furthermore, in all cases throughout this paper, the final error measures reported are for an independent testing set, unseen throughout the training process.
In terms of training, all system parameters were to be identified using alternative training data generated using the supplied p-file.

System identification strategy and focus of study
As stated above, the focus of the current study will be concerned with presenting the authors' views regarding two important factors, which are sometimes overlooked, related to the success of a nonlinear system identification procedure.The first of these factors is that the choice of forcing input, in terms of both the signal type and amplitude, plays a critical role in any NLSI procedure.The second factor relates to the idea that it can be difficult to gauge the success of a nonlinear identification procedure without the use of a reference or baseline model.It is held that, before conducting a nonlinear system identification procedure, the best linear system should first be identified.
Although this was not a focal point of the study, the optimisation algorithm employed in this study was partially chosen as it was a good opportunity to showcase a relatively new, robust optimisation technique, which may be unfamiliar to many.The algorithm chosen was the JADE [43] algorithm, a relatively simple, adaptive variant of the better known Differential Evolution (DE) discussed in the introduction here [19].Whilst the basic DE algorithm is simple and relatively robust, it does require that the hyperparameters, F (in the mutation operation) and C r (in the crossover operation) be specified before commencing.Poor choices for these hyperparameters may result in lack of convergence or premature convergence.Furthermore, it is often the case that the best values for the hyperparameters may vary during the optimisation process.JADE seeks to address these issues whilst retaining simplicity.
There are a number of key differences between JADE [43] and basic DE [19].The first occurs during the mutation operation, whereby the current best solution is combined with a differential of two randomly chosen vectors.The scaling factor within the mutation operation is drawn from a probability distribution rather than kept at a fixed value as in DE.The crossover operation is implemented in the same way as in DE, except that the crossover ratio (which decides the probability of a child element being drawn from the original or the mutation matrix) is also drawn from a probability distribution rather than being a fixed constant.The final difference is during the selection phase whereby, in addition to updating the matrix for use in the next generation, the mean values for the hyperparameter distributions are adjusted to take into account previous successful values.Within the selection phase, 'losing' solutions are saved within an archive and may be selected later to form the random differential within the mutation operation.The purpose of this archive is to counteract the greedy nature of using the current best solution within the mutation phase, thereby preventing premature convergence.Whilst there is an extra computational cost incurred due to these extra operations in JADE when compared to standard DE, this cost is generally dwarfed by the cost of function evaluations which will be the same in DE and JADE.
In order to verify the algorithm code and to illustrate the potential benefits of JADE compared with basic DE, a standard optimisation benchmark problem, namely a ten-dimensional variant of Rosenbrock's function, was investigated.The results are shown in Fig. 6.Although both the standard DE and JADE identify the best solution to within machine precision, the JADE algorithm only requires around one-sixth of the number of generations.In the authors' experience, this level of saving is typical.

Results of linear system identification
As stated previously, it is the authors firm opinion that, before conducting a nonlinear system identification, a baseline linear system identification is first required.Apart from serving as a baseline and allowing the practitioner to decide if the extra complexity of a nonlinear ID is justified, the linear SI process is capable of highlighting the point at which a linear system approximation is no longer capable of reproducing the actual system behaviour to within some degree of error.Identification of the point at which the nonlinearity begins to have a significant bearing on the response ensures that the nonlinear SI process is being conducted at an appropriate level of forcing.In the current work, the linear SI will also serve to highlight the influence of the added noise and will help to narrow the search range for the linear parameters within the nonlinear SI.
The aforementioned approach was conducted on the BW system for both a broadband random input and a linear chirp input.For each input type, a number of forcing levels was investigated.For the broadband random input, signal variances of 0:1; 0:5; 1; 5; 10; 50; 100; 500; 1000; 5000; 1 Â 10 4 ; 5 Â 10 4 ; 1 Â 10 5 ; 5 Â 10 5 ; 1 Â 10 6 ; 5 Â 10 6 and 1 Â 10 7 N 2 were considered, whilst the amplitudes chosen for the chirp input were 0:1; 0:5; 1; 5; 10; 50; 100; 500; 1000 and 5000 N. The forcing levels were chosen to ensure that similar ranges of input signal power were being applied for the random and chirp inputs.Fig. 7 shows an example of each type of input; both each had a duration of 2 s and were sampled at 15 kHz.The frequency of the chirp input varied linearly from 2 Hz to 75 Hz.
The Bouc-Wen system was simulated in Simulink with a 4th-order Runge-Kutta numerical integration scheme.For each of the input types and each of the forcing levels, three runs of the JADE optimisation algorithm were conducted; in each case a population size of 30 was used and the algorithm was run for a total of 100 generations.The parameter bounds for the mass, damping and linear stiffness were set at one order of magnitude below to one order above the parameters given in the problem specification; these bounds are given in Table 1.
On completion of each run, both the lowest FoM (Eq.( 23)) and NMSE (Eq.( 6)) were stored along with the corresponding estimates for the mass, damping and linear stiffness parameters.Fig. 8 shows the NMSE values for each of the runs for the various forcing levels of both the random and chirp inputs.Fig. 8(a) shows the results from the broadband random input whilst Fig. 8(b) shows the results from the chirp input.There is consistency across the three optimisation runs at each forcing level in that there is limited scatter in the three points for a particular forcing level.The next observation is of the similarity in the overall trend of the results, with both the random and chirp results exhibiting a 'U' shape.This is explained quite easily and, in many ways, is the justification for conducting the linear SI.At low levels of input power, the effect of the constant level of output noise has a large bearing on the error value.At high levels of input power, the increased error is due to the nonlinear effect becoming more pronounced and a linear system being incapable of reproducing the behaviour.Coincidentally, both the random input and chirp input produce the same minimum value of NMSE of 1.6% although this occurs at an input power of 1000 N 2 for the random input and 50 N 2 for the chirp input.It may also be noted that the chirp input results in greater NMSE values than the random input for equivalent power.This information will be revisited in the nonlinear SI section.
Fig. 9 shows the results of the parameter estimation for each of the three JADE runs for each of the forcing levels for the random input.At low forcing levels, the noise effect results in inconsistent predictive capability.At higher forcing levels, the nonlinear effect is absorbed into the linear parameter estimates; in particular, the viscous damping term attempts to absorb the hysteretic effect.At the highest power levels, the viscous damping term reaches its upper bound constraint.Fig. 10 shows the parameter estimates for the three JADE runs for each of the forcing levels for the chirp input.The results show similar behaviour to the random input, but the linear SI loses consistent predictive capability at a lower input power.
The linear system identification has served several purposes.It has shown the levels of input power required for the random and chirp input to result in responses containing a significant nonlinear component, and it also provides the opportunity to reduce the linear parameter range for the nonlinear SI.After the SI, it was decided to alter the permissible mass range to be between 1.5 kg and 2.5 kg and the permissible viscous damping parameter to be between 5 N/(m/s) and 30 N/(m/s).It was also decided that the linear stiffness parameter range should not be reduced from the bounds given in Table 1 due to the linear stiffness effect being divided across the two terms (K L and a) in the BW system.

Results of nonlinear system identification
Now that the linear SI has been conducted, the nonlinear SI process may commence, informed by the previous results.The identification followed a similar pattern to the linear identification process.As before, broadband random and chirp inputs were investigated over a range of different forcing levels; however, on this occasion, some of the lower levels of excitation were not investigated.In order to eliminate redundancy in the BW model, the b term was combined with the c and d terms.Furthermore, after some preliminary tests, it was decided that the t term could be fixed to a value of 1. 4 For each forcing level, three runs of the JADE optimisation algorithm were conducted.Because of the greater complexity of the nonlinear identification problem, the population size was set to 70 and the optimisation was run for 200 generations.(A rule of thumb for evolutionary optimisation is that the number of individuals should be set to 5-10 times the number of parameters.)The lower and upper bounds for the parameters of interest are shown in Table 2.
The lowest FoM and NMSE were stored along with the corresponding nonlinear parameter estimates after each run.

Broadband random training input
Fig. 11 shows the NMSE values for each of the JADE runs for the various forcing levels of the random input, superimposed onto the equivalent results from the previously presented linear SI results.The plot clearly shows the forcing levels at which the nonlinear identification provides added value.The improvement only really occurs above an input power of 1000 N 2 (the point at which the linear SI gave the lowest error).As the input power increases, so the NMSE of the nonlinear identification decreases and the gap between the linear and nonlinear SI widens.The lowest NMSE was 7:3 Â 10 À3 % for one of the runs with an input power of 5 Â 10 6 N 2 .Fig. 12 shows the parameter estimates for each of the runs for the broadband input.Most immediately noticeable, is the significant amount of scatter and inconsistency in the parameter estimates at all but the higher forcing levels; this is likely to be a result of insensitivity to the nonlinear terms at low forcing.There is not only scatter between the three runs at a particular forcing level, but also a lack of consistency between the estimates at different forcing levels.A lack of variability between the three JADE runs and consistency between forcing levels is only observable for the highest five input powers (1 Â 10 5 N 2 and greater).The only parameter which still lacks some consistency, even at the higher forcing levels is c L , the linear damping parameter, whose estimates generally increase with increasing input power.

Chirp training input
Fig. 13 shows the NMSE values for each of the JADE runs for the various forcing levels of the chirp input, superimposed on the equivalent results from the previously-presented linear SI results.The added value provided by the nonlinear identification over linear SI is significantly more pronounced than for the random input and commences at a lower level of input power.At the highest power level, there is an increase in NMSE values and inconsistency between runs -this may be due to a mismatch between the Runge-Kutta (used for model prediction) and Newmark (used for generating training data) numerical integration algorithms.The lowest NMSE obtained was 3:9 Â 10 À4 % for one of the runs with an input power of 5 Â 10 5 N 2 -this is around 1/20th of the lowest NMSE using random inputs.Fig. 14 shows the parameter estimates for each of the JADE runs for the chirp input.On this occasion, there is some scatter between runs and inconsistency between forcing levels for the first three or four forcing levels; after that (i.e. when the nonlinearity has significant effect on the response), the estimates become more consistent.At the highest level of forcing, the scatter between the three runs returns.The consistency of these parameter estimates, when viewed in combination with the NMSE plots of Fig. 13, give a large degree of confidence that, by returning a nonlinear model NMSE that shows vast improvement over the corresponding linear model NMSE, the true system has been identified.This should be contrasted with the situation where a nonlinear SI (with no prior linear SI baseline) is conducted using a relatively low level of input forcing and a low value for the NMSE is incorrectly interpreted as meaning the true system has been identified.Table 3 shows parameter estimates for the chirp input run that resulted in the lowest NMSE of 3:9 Â 10 À4 %.Two of the three JADE optimisation runs for the 1000 N (equivalent to an Input Power of 5 Â 10 5 N 2 ) amplitude chirp forcing input returned exactly the same parameter values, whilst the other run returned values extremely close to those shown.Fig. 15 shows the predicted displacement response for the estimate that resulted in the lowest NMSE.As would be expected from such a low NMSE value, the actual and predicted responses are indistinguishable.Of greater interest is the high degree of nonlinear behaviour present in the responses, especially from between around 0.4 s to 1.1 s; it is easy to understand why linear SI would struggle to return anything other than a high NMSE value.

Model test and figure-of-merit
Once the BW system parameters had been identified using the process detailed in the previous two sections, the predicted system could then be tested against the two fixed test datasets, namely the random phase multi-sine and the sinesweep signal.The test datasets were sampled at 750 Hz and the data that were generated using the estimated system parameters were sampled at 15 kHz then downsampled by a factor of 20 to also give a 750 Hz sampling frequency.

Random phase multi-sine testing set
Fig. 16 shows the actual and predicted displacement response plots for the random phase multi-sine test dataset.At first glance, it may seem that, on this timescale, there is a near-identical match.Closer examination reveals that there is a slight mismatch of responses at the start of the signals.Fig. 17 shows a zoomed version of Fig. 16 for the first second of time, and the mismatch is a little clearer.The mismatch is likely to be a result of issues relating to the transient response.It would be possible to improve the prediction by fine-tuning the initial conditions on this test dataset, whilst retaining the previously identified model parameters.Even with the slight initial mismatch, the estimated model returned a very creditable FoM of 4:75 Â 10 À5 m; this corresponds to an NMSE value of 0.51%.

Sine-sweep testing set
Fig. 18 shows the actual and predicted displacement response plots for the sine-sweep test dataset.As with the random phase multi-sine test signal, there initially appears to be a near-identical match.Closer examination reveals that there is a slight mismatch of responses just after they peak at around 145 s.Fig. 19 shows a zoomed version of Fig. 18 for the time period between 145 s and 160 s.The FoM for the sine-sweep is 1:02 Â 10 À5 m, which corresponded to a very impressive NMSE value of 0.024%.This success may, in some part, be due to the use of a chirp input for the identification of the system parameters.In fact, the EO approach here gave the lowest cost solution for Benchmark One at the time of the workshop.

Discussion
The work on the first benchmark, presented in this part of the paper, focussed on highlighting the role played by the force input in the nonlinear SI process and the need for performing a linear system identification before undertaking its nonlinear counterpart.By conducting SI, the added value of the nonlinear identification can be demonstrated and nonlinear behaviour can be highlighted.
It is hoped that the work has demonstrated that, whilst random inputs are entirely appropriate for linear SI, they should be used with caution for nonlinear identification due to their tendency to linearise.It was shown here, that chirp inputs of corresponding power resulted in significant reduction in NMSE along with reduced variability in identified parameters.
The results from the two fixed test datasets found that the system was identified with 0.51% NMSE for the random phase multi-sine data and 0.024% MSE; the corresponding FoMs were 4:75 Â 10 À5 m and 1:02 Â 10 À5 m.These results were obtained using JADE optimisation in combination with a high-amplitude linear chirp input excitation.Note that the results presented here do not include confidence intervals for the parameter estimates.As discussed earlier in the paper, this is not a weakness of the evolutionary approach, as witnessed by the study in [20] which specifically deals with a Bouc-Wen system.The confidence intervals were not computed here because the objective of the benchmarking exercise was simply to obtain the FoM for the various approaches and report those.
Finally, if the results of this section had been presented as a scientific study in other circumstances, they would be considered lacking in a very important respect, i.e. the method and the results are not compared with any competing methods and the approach selected is not evaluated on other datasets.These considerations are vital in a machine learning study in almost all contexts except the one here; the need for local comparison is obviated by the fact that the overall workshop aim was to provide comparisons between the approaches of different participants; the consideration of other datasets is clearly not necessary given the specific focus on the VUB benchmarks.
As discussed earlier, although not covered in any detail here, the discipline of Bayesian inference also offers the possibility of a general framework for NLSI.As well as general references given in the previous section, the reader may wish to consult the recent [44,40,41], which illustrate the use of Approximate Bayesian Computation, which can simultaneously estimate parameters and weigh the evidence for competing model forms.Two of the papers also focus on a BW system (including an experimental dataset), so can form a useful basis for comparison with the current software.

Introduction
This benchmark presents a Wiener-Hammerstein (WH) electronic circuit where the main challenge results from high process noise, which is the dominant noise distortion [45].The problem is to identify a Wiener-Hammerstein model, which is a block-structured system, as shown in Fig. 20, comprising two linear time-invariant (LTI) blocks in series, separated by a static nonlinear function.
There are precursors for the uses of EO approaches in WH identification.The approach proposed here is very different from the one reported in [46], where the evolutionary optimisation is used only for the pole-zero allocation problem reported in [47].The approaches that are presented in [48,49] are more similar to the method presented in this paper.However, [48] only considers the problem where the LTI blocks are represented by a FIR model, and a simplified differential evolution algorithm is used.The method presented in [49] uses a biosocial culture algorithm.It is comparable to the approach presented here, although it requires more hyperparameters to be selected by the user.
In brief, the benchmark data have been generated from an electronic circuit.The first LTI block can be described well with a third-order lowpass filter.The second LTI subsystem is designed as an inverse Chebyshev filter which has a transmission zero within the excited frequency range, making the inversion of the filter difficult.The static nonlinearity f ðxÞ has been realised with a diode-resistor network, resulting in a saturation nonlinearity.The problem is difficult because the high level of process noise e x will potentially bias the parameters.
The LTI blocks are represented in notation by RðZ À1 ; w 1 Þ and SðZ À1 ; w 2 Þ, where Z À1 represents the backward shift operator or Z-transform variable and w 1 and w 2 are the model parameters for the blocks.
For the identification approach chosen here, each linear block will be represented by an ARX model of the form, and the problem will be to estimate the parameters w i ¼ða i ; b i Þ for each block.The nonlinear function will be represented here by a sum over sigmoids, with an option to use a polynomial representation included in the code.One of the issues with identifying WH models is that the representation is not unique.For example, the system representation is invariant under an exact two-parameter group of transformations, and this presents a type of 'gauge' that can (if so desired) be used to fix scales by setting b 1 ¼ 1 in both of the relevant ARX models.Another source of non-uniqueness is generated by the fact that a delay s on one of the linear blocks, can be compensated by a lead Às in the other [50].1. rand1 2. current-to-best2: In the strategy current-to-rand, K is defined as a coefficient of combination and would generally be assumed in the range [À0.5, 1.5]; however, in the implementation of [52] and the one used here, the prescription K ¼ F is used to essentially restrict the number of tunable parameters.The SADE algorithm here uses the standard crossover approach, except that at least one crossover is forced in each operation on the vectors.If mutation moves a parameter outside its allowed (predefined) bounds, it is pinned to the boundary.Selection is performed exactly as in DE; if the trial vector has smaller (or equal) cost to the target, it replaces the target in the next generation.
The adaption strategy must now be defined.First, a set of probabilities are defined: fp 1 ; p 2 ; p 3 ; p 4 g, which are the probabilities that a given mutation strategy will be used in forming a trial vector.These probabilities are initialised to be all equal to 0.25.When a trial vector is formed during SADE, a roulette wheel selection is used to choose the mutation strategy on the basis of the probabilities (initially, all equal).At the end of a given generation, the numbers of trial vectors successfully surviving to the next generation from each strategy are recorded as: fs 1 ; s 2 ; s 3 ; s 4 g; the numbers of trial vectors from each strategy which are discarded are recorded as: fd 1 ; d 2 ; d 3 ; d 4 g.At the beginning of a SADE run, the survival and discard numbers are established over the first generations, this interval is called the learning period (and is another example of a hyperparameter).At the end of the learning period, the strategy probabilities are updated by, After the learning period, the probabilities are updated every generation but using survival and discard numbers established over a moving window of the last N L generations.The algorithm thus adapts the preferred mutation strategies.SADE also incorporates adaption or variation on the hyperparameters F and C r .The scaling factor F mediates the convergence speed of the algorithm, with large values being appropriate to global search early in a run and small values being consistent with local search later in the run.The implementation of SADE used here largely follows [51] and differs only in one major aspect, concerning the adaption of F. Adaption of the parameter C r is based on accumulated experience of the successful values for the parameter over the run.It is assumed that the crossover probability for a trial is normally distributed about a mean C r with standard deviation 0.1.At initiation, the parameter C r is set to 0.5 to give equal likelihood of each parent contributing a chromosome.The crossover probabilities are then held fixed for each population index for a certain number of generations and then resampled.In a rather similar manner to the adaption of the strategy probabilities, the C r values for trial vectors successfully passing to the next generation are recorded over a certain greater number of generations and their mean value is adopted as the next C r .The record of successful trials is cleared at this point in order to avoid long-term memory effects.The version of the algorithm here adapts F in essentially the same manner as C r but uses the Gaussian Nð0:5; 0:3Þ for the initial distribution.At this point, the reader might legitimately argue that SADE has simply replaced one set of hyperparameters (F; C r ) with another (duration of the learning period, etc.).In fact, because DE and SADE are heuristic algorithms, there is no analytical counter to this argument.However, the transition to SADE is justified by the fact that the algorithm appears to be very robust with respect to the new hyperparameters.
A benchmark of the SADE algorithm against the standard DE on a Bouc-Wen identification can be found in [15];i ti s shown there that SADE offers a very large speedup in terms of convergence, very similar to the JADE algorithm used for Benchmark One.

Algorithm verification
Unlike the other two VUB benchmarks, the current authors had not attempted to identify block-structured systems before, and thus considered that the developed algorithm code required verification on a known problem.The system chosen was taken from [53] and comprised the two LTI blocks, and static nonlinearity, Data were generated using a white noise input.In order to fit the model, the correct number of lags n x and n y were specified and ten SADE runs were carried out with a population of 90 individuals; 1000 generations were used.The exercise was rather time-consuming and took of the order of three hours; however, the best solution reached an NMSE of 1:12 Â 10 À28 and arrived at the model, No parameters were fixed, and the resulting non-uniqueness of the model is clearly visible in the numerator parameters.The denominator parameters were estimated very accurately (the bracketed quantities after the parameters indicate the number of following zeroes).Multiplying up the numerator parameters and the linear term in the nonlinear function showed that the 'linear gain' corresponds perfectly with the original system.Fig. 21 shows the evolution of the SADE cost function across the ten runs; in five of the runs, SADE converged to the 'global' minimum, in one of the runs it did not quite converge, and in four of the runs it arrived at a local minimum.

Benchmark identification
Having established that the SADE algorithm worked on WH identification problems, attention moved to the actual VUB benchmark.As part of the benchmark exercise, participants were invited to design input signals for the identification, which could then be run through the benchmark circuit in order to generate training data.The current authors did not design an input, but decided to simply choose an appropriate dataset from among those created.The data set chosen was that stored as 'WH CombinedZeroMultisineSinesweep:mat' which used a period of a random multisine followed by a swept-sine excitation.The input from the whole dataset is shown in Fig. 22.
The total dataset contained 57,346 points of input and output.A subset comprising the first 10,000 points of each record was chosen for the training data, which meant that the excitation was only the random multisine.The data were sampled at 78,125 Hz; Fig. 23 shows the frequency domain information pertaining to the training data selected, up to the Nyquist frequency.
Because of the amount of time needed to run SADE on a 10,000 point training set, it was impractical to determine the orders of the ARX models in the linear blocks via cross-validation; for this reason input and output orders of 6 lags were used as it was anticipated that this should be adequate to accurately capture the behaviour of the third-order filters in the circuit.As discussed with respect to Benchmark One, it is good practice to establish a baseline for the identification by fitting a linear model.Of course, the linear modelling did not require SADE, an ordinary least-squares approach was used.So that the linear model captured the same temporal range as the nonlinear model, 12 input and output lags were used. 5n order to judge the results of the Benchmark Two identification, two noise-free test datasets were provided: a random multi-sine and a chirp; the idea was to report the FoM on the two test sets.In the case of the (12,12) linear ARX fit, the FoM values on the two test sets were found to be 0.057 and 0.022 (simulation results: multi-sine and chirp, respectively).These results proved a little inferior to results from a Best Linear Estimate (BLE) model (see [54], for an explanation of the concept), fitted by other participants in the workshop; Fig. 24 shows comparisons between the true and predicted results on the two test sets.
The SADE algorithm was then used to identify the system.As discussed above, ARX(6, 6) models were used for the linear blocks and four sigmoid functions were used in order to estimate the static nonlinearity (sigmoids were chosen as the static nonlinearity was known to have a saturation characteristic).Altogether, this gave a model with 37 tunable parameters.Because SADE is a nonlinear optimisation algorithm, it can be sensitive to the initial ranges chosen for the parameters; because of this, a number of runs were carried out in order to guide the choice of the initial ranges; these were chosen to be [À1, 1] for all ARX parameters and [À100, 100] for all sigmoid parameters.Because of the size of the problem, only one final SADE run was carried out; this used a population of 370 individuals and ran for 1000 generations; this took around 12 h.The converged model gave FoM values of 0.044 on the multi-sine test set and 0.020 on the chirp test set.The predictions on the test sets for the nonlinear SADE model are shown in Fig. 25.
The results for the full nonlinear model on the multi-sine test set are better than a naive linear least-squares, but not as good as the BLE; the results on the chirp test set are better than both the naive linear model and the BLE.However, the improvements over a linear model are not at all marked.This begs the question: does the SADE model actually make use of the static nonlinearity?One can answer this question by making a forward run through the full WH model and plotting the input to the static nonlinearity, z 1 , against the output, z 2 .The results of this exercise are shown in Fig. 26.It is clear that the SADE WH model is actually nonlinear; furthermore, the saturation characteristic is clearly captured.

Discussion
The results for Benchmark Two are provided here as a first attempt at identification of a block-structured system by the authors.Although the results are far from perfect, it is hoped that the proposed method is of interest.Given that the algorithm performed very well on a noise-free simulation, it is a fair assumption that the poorer results on the actual benchmark are because of the very high level of internal process noise added.There is room for improvement in several areas.One issue here is that the ARX model orders are hyperparameters and should have been determined in a principled manner; however, the very slow nature of the algorithm precluded that here.One of the reasons for the slow convergence was that the algorithm generated individuals for the initial population completely randomly.It was subsequently realised that this generated many unstable linear blocks, and the corresponding models were essentially wasted in terms of genetic material.As mentioned in the introduction, this paper has been written as a record of the authors' participation in the benchmark exercise, and only shows the results of the original submissions.However, a more rigorous variant of the algorithm was developed later and among the advances in that algorithm was a constraint that the initial population contained only stable models  [55].When the improved SADE algorithm was applied to the 'Silverbox' benchmark (much less process noise than the VUB benchmark) [56], it achieved results consistent with the previous best attempts.
Because the optimisation method here only uses 'forward' runs through the WH model, the transmission zero in the second LTI block has no effect on the identification; this is a strength of the approach.One of the other possibilities for the SADE approach is to use more general objective functions without too much disruption to the core algorithm.For example it may be possible to implement a maximum likelihood variant [57,58].

Benchmark three: cascaded tanks
The third of the VUB benchmarks was based on the identification of a physical experiment involving the vertical flow of water between tanks [59].The system was considered challenging because it contained an unmeasured state, but mainly because only a small training set (1024 points) was given.
A combination of physical modelling and machine learning techniques were employed here to address the cascaded tanks problem.This form of grey-box (see Introduction) model involved first fitting a physical white-box model.Then, the outputs of the white-box were used to provide additional feature-rich inputs to a black-box model -in this case a Gaussian Process (GP) regression model.
For the cascaded tanks system, the physics of the problem is understood, but not fully.Because of the detailed behaviour of the fluid -particularly under the overflow condition -it is not possible to fully capture the behaviour of the system in a physical model.However, the partial physics models which are available are good approximations to the true behaviour of the system.This makes this form of grey-box modelling a sensible choice, the white-box model can almost be thought of as a strong prior for the subsequent Bayesian method.

White-box modelling
Two different white-box models were established to investigate the effect of a more sophisticated model on the predictive performance.When white-box models are formed from known physical behaviour they should perform well in extrapolation in addition to any interpolation capability.If large discrepancies are seen between training and testing errors, this could indicate that the physics of the system chosen fails to capture the true behaviour.
When the system is not in overflow, and assuming Bernoulli's principle holds, the state equations of the system can be written as, In both these cases, xðtÞ is the input to the system, k n represents the nth parameter of the system and w n and e are the noise terms.z 1 and z 2 are state variables representing the liquid levels in the two tanks; only z 2 is measured, giving the observable output y; z 1 is thus an unmeasured state.The noise terms are assumed zero in the white-box model, allowing the black-box to compensate for noise.It is important to form a white-box model without noise, as this can distort the residuals which the black box is trying to fit.The equations in (32) define the first physical model to be fitted: Model 1.This model was extended to include losses in the system due to friction or geometry; this involved adding additional terms, which introduced two more parameters which required fitting.The extended physics model is shown below in Eq. ( 33) (from this point on, explicit time dependence in the variables will be omitted), The equations shown in (33) define the extended physics model, referred to as Model 2. As in the other benchmarks considered here, the process of fitting the parameters of these models is cast as a multidimensional optimisation with respect to a cost function.The normalised mean-squared error (NMSE), (6), between the predictions and the true system outputs is once more used as the cost function.
The latter two methods represent more recent approaches in optimisation which will be briefly described.The QPSO method is based on the same principles as the regular particle swarm but the dynamics of each particle is changed from the classic formulation to one where every particle is treated in a quantum manner.As DE has been described in a little detail, Particle Swarm Optimisation deserves a brief explanation as it is the basis of two of the methods here [60].In many ways, PSO is one of the simplest iterative population-based optimisation algorithms.The PSO algorithm is motivated by the flocking behaviour of flights of birds in search of food.Each particle (bird) i in the population (flock) is represented by a vector of its position and velocity ðy i ; v i Þ; the update rules for a generation are simply, where y Ã i is the best position experienced so far by particle i, measured in terms of some cost function Jðy i Þ, and y Ã is the best position of any particle.c 1 and c 2 are learning factors (hyperparameters) and X 1 and X 2 are randomly generated.The particles are thus steered according to their individual experience and that of the flock.The basic algorithm encoded in Eq. ( 34) belongs to a simpler class of biologically-inspired algorithms than the DE-inspired family, which are essentially motivated by real-coded genetic algorithms; the PSO variants are not dependent on operations like mutation and crossover.The PSO variants are thus interesting to consider alongside of the DE variants.In the 'quantum' version of the PSO algorithm, the trajectory of the particles is dependent upon, both, an attractor that is a random combination of the global best and previous best position for each particle, and a term relating to the particles' potential fields.This change in the position update procedure arguably encourages better exploration properties for the QPSO over the PSO.
The krill herd is another population-based optimiser which aims to mimic the behaviour of Antarctic krill, it has been shown that the krill move according to three factors: the movement of neighbouring krill, a foraging action, and physical diffusion [64].This leads to the updates of the particle positions in the KH algorithm being a combination of vectors modelling these three factors.The particles are affected by: the motion of neighbouring particles and the global best particle, a weighted average of the global best solution and the centre of mass of all individuals (food location), which relates to the foraging motion; the physical diffusion is modelled by a random perturbation on particle positions. 6ach of the optimisation schemes was run 50 times, with 1500 generations and a population size of 500.The best run and average run results were compared using the NMSE of Eq. ( 6); this gives a value of zero for a perfect model or 100 if the predictions, ŷ, are set to the mean of the true outputs.The results of the optimisation are shown in Fig. 27.It can be seen that the first model initialises with a lower NMSE; this is likely to be a result of the lower-dimensional parameter space.However, the mean and best errors in training are significantly improved with the addition of the loss terms in Model 2. As the dimensionality of the parameter space increases, the differentiation between the optimisation schemes becomes more apparent; for this particular problem, the QPSO method not only provides the best overall training and testing error, it also exhibits very fast convergence speeds.
Model 1 shows good performance, with a NMSE below 5; in this case the ability of both the PSO and QPSO methods to find a good set of parameters is clearly seen (Table 4).
The addition of the loss parameters in Model 2 allowed a significant improvement of over three percentage points in the training error and over four points in the model prediction error, as shown in Table 5.Since the additional model terms have physical meaning, an improvement in model behaviour was expected, but not to the extent seen, given the relatively small effect the friction and geometry losses have on the system.As stated earlier, the model was considered unlikely to overfit in the traditional sense of an overcomplicated model, due to the additional terms being directly related to physical phenomena.If the terms relating to the losses had been negligible, this should have led to the optimisation scheme returning very small   values for these parameters and little improvement in NMSE would be seen.The more accurate the model of the physics used, the better the performance of that model, provided the model incorporates true physical behaviour.

Black box modelling
For the black-box model used to augment the physical white box, a Gaussian Process (GP) regression model [12,65] has been used; in particular, the GP-NARX model of [66].The GP-NARX model is a combination of the standard GP regression model with a nonlinear auto-regressive framework.That is, combinations of lags in the input and output dimensions are used to form a higher-dimensional input to the GP model; in addition to this, a lag multiplier is introduced to increase the spacing between data points.For the sake of completeness, a very condensed description of the GP-NARX model is provided here.

Gaussian process NARX models
The basic premise of a Gaussian process (GP) is to perform inference over functions directly, as opposed to inference over parameters of a function.In short, a GP is a distribution over functions, which is conditioned on training data so that the most probable functions are the best fits to the data.Let X ¼½x 1 ; x 2 ...x N T denote a matrix of multivariate training inputs, and y denote the corresponding vector of training outputs.The input vector for a testing point will be denoted by the column vector x Ã and the corresponding (unknown) output by y Ã .A Gaussian process prior is formed by assuming a (Gaussian) distribution over functions, f ðxÞ$GP mðxÞ; kðx; xÞ ðÞ ð35Þ where mðxÞ is the mean function and kðx; x 0 Þ is a positive-definite covariance function.
One of the defining properties of the GP is that the density of a finite number of outputs from the process, both observed and unobserved, is multivariate normal.This property, combined with standard results for Gaussian distributions, can be used to condition unobserved points on observed training points: this mechanism effectively fits the GP to the training data.
Following a Bayesian approach, the prior mean can be assumed to be zero (see [65] for a discussion).Assuming a Gaussian noise model with variance r 2 n , the joint distribution for training and testing values is, where KðX; XÞ is a matrix whose i; jth element is equal to kðx i ; x j Þ.Similarly, KðX; x Ã Þ is a column vector whose ith element is equal to kðx i ; x Ã Þ, and Kðx Ã ; XÞ is the transpose of the same.
In order to make use of the above, it is necessary to re-arrange the joint distribution pðy; y Ã Þ into a conditional distribution pðy Ã jyÞ.Using standard results for the conditional properties of a Gaussian reveals [65], where m Ã ðx Ã Þ¼Kðx Ã ; XÞ½kðX; XÞþr 2 n I À1 y ð38Þ is the posterior mean of the GP and, k Ã ðx Ã ; x 0 Þ¼kðx Ã ; x 0 ÞÀKðx Ã ; XÞ½KðX; XÞþr 2 n I À1 KðX; x 0 Þð 39Þ is the posterior variance.Thus the GP model provides a posterior distribution for the unknown quantity y Ã .The mean from Eq. ( 37) can then be used as a 'best estimate' for a regression problem, and the variance can also be used to define confidence intervals.The covariance function used here is the squared-exponential function, and its hyperparameters, augmented by the noise parameter r 2 n , can be readily found through maximum likelihood estimation.For considerably more details on GPs than this short description allows, see [65].
To apply GPs in the NARX framework, one simply performs a GP regression of the form, y i ¼ Fðy iÀ1 ; ...; y iÀny ; x i ; ...; i.e. regress the current system output on a range of past system inputs and outputs.
and this test can be conducted on testing data as well as training data, which is an important consideration in the more general context of machine learning.The introduction of the NARX structure to the GP model presents two key challenges; the first is the addition of hyperparameters associated with the number of lags in the model and the lag multiplier; the second is that the action of feeding outputs back onto the input in the GP NARX model breaks one of the basic assumptions of the GP -that there is no noise on the input.The issue of identifying the number of lags in the model would normally be addressed by the use of a validation set or a method such as leave-one-out (LOO) cross-validation [22].In the case of the cascaded tanks system, the small dataset (1024 points) and the lack of a validation set meant that an alternative approach had to be taken.One of the key stages of fitting a GP model is the minimisation of the negative log marginal likelihood of the process through optimisation of the hyperparameters from the covariance function [65]; since the number of options for the lag hyperparameters was small here, a manual search based on the negative log marginal likelihood of the training data allowed selection of the lag hyperparameters.This manual search led to the identification of 15 lags in both input and output with a lag multiplier of two.From this, the inputs to the GP were taken as x t ; x tÀ3 ; ...; x tÀ30 and at y tÀ1 ; y tÀ3 ; ...; y tÀ31 .A squared-exponential kernel with automatic relevance determination (ARD) was used for the GP covariance structure; the hyperparameters for this can be optimised using a conjugate gradient descent [67,68] or an evolutionary search.On the other matter, concerning input noise; in the case of the GP-NARX models here, the noise levels estimated are very small and so should have little effect when fed back onto the input.The limit being that, if there is no noise on the output and the predictions have zero error there is no difference between OSA and MPO predictions (and the true outputs).Current work is progressing on incorporating methods to propagate the uncertainty introduced in this step, where it is thought that the noise level is significant.

Results: black box
Figs. 28(a) and (b) show the ability of the GP NARX model to perform very well in the OSA (prediction) case with an NMSE of only 0.0565 and all observations lying within the 3r confidence intervals.The results in the MPO (simulation) case, however, have an NMSE of 4.6174.This performance is better than the initial white-box model (Model 1), but with the addition of the loss terms in Model 2, the white-box outperforms the GP-NARX black box.The MPO predictions of the GP-NARX model also have a number of test points lying outside the 3r confidence intervals, which indicates that the model is overconfident in its predictions.This is likely to be a result of the fact that, in training, true values for the output data are fed into the model, this effectively trains the model to make OSA predictions; this would not be an issue if the OSA prediction error was approximately zero; however, in the MPO case, the effect of feeding back noise/error on the outputs to the inputs can be clearly seen to be detrimental to model performance.

Grey box modelling
The aim of building a grey-box model as discussed in the introduction (also, see [69]), is to make use of all the prior knowledge as to the system structure which is encoded in the white-box models and then improve predictive performance of the model with the addition of a black-box component.One way to do this would be to treat the white box as a mean function of the process and attempt to fit the residuals of the model using a black-box algorithm, as in, The alternative to this is to use the information that is encoded in the white box model as an additional strongly correlated input to the black box, as in Eq. ( 44); this allows a nonlinear relationship to be established between the white-box outputs and the true system outputs.This incorporation of the white-box outputs as inputs to the black-box model retains a good signal-to-noise ratio which can be lost when only fitting the residuals.
Since the error in the white-box models was low, it was found that the formulation of the grey box in (44) was a far more effective model.The grey box was tested with outputs from both white-box models to establish if the quality of the whitebox model fit affected the performance of the grey box significantly.It was hypothesised that if the white box were sufficiently accurate, such that it had predictions which were on the noise floor of the data, the addition of a black-box component would be unable to either improve the predictive performance or would be susceptible to overfitting, despite the Bayesian formulation of the GP models.
It was again necessary to determine the lag hyperparameters for the GP-NARX model, and this was accomplished as before, with the result being two lags on the inputs, ten on the outputs, and a lag multiplier of two.These lag hyperparameters were used for both grey-box models to allow comparison; since the lags relate to the dynamic behaviour of the model there should be minimal difference in the dynamics of the two white-box models due to their similar formulation.When checking that this assumption was valid, it was found that the same set of lags were optimal in both cases with respect to the marginal likelihood.
Using the outputs of the Model 1 white box as informative inputs to the GP-NARX model yields the results seen in Fig. 29.As before, very good performance is seen in the OSA case, with an NMSE of 0.0529.There is also a significant improvement over both white-box Model 1, and the black-box model, with the MPO error lowered to 2.4621.This indicates that the whitebox model is able to provide useful information about the input space to the GP-NARX model, which is not explicitly present in the training dataset.It should be noted that this model, although offering good improvement in terms of the NMSE metric, has a large overestimation of water level in the MPO prediction around time point 750; there, the model clearly predicts a tank level over 10.0, despite this being physically impossible.This error demonstrates the importance of clear engineering interpretations of model outputs to ensure model validity.
Fig. 30 shows the outputs of the second grey-box model in the OSA and MPO case.Here the inputs of white-box Model 2 are used as informative inputs.The structure reduces the NMSE in the OSA case further to 0.0442, and in the MPO case the NMSE is lowered to 0.8178.The key improvement is seen around time point 450 where the second grey-box model shows much better performance than the GP-NARX black box, or the first grey box.It is expected that it is this area of the input space which is not well explored in the training data; however, the accuracy of the white-box Model 2 is able to provide information in this area.A summary of the best MPO performance of every model tested is shown in Table 6.The performance of the grey-box models shows significant improvement over both the white-box models which provide the additional informative inputs, and the use of just a black-box model.In fact the grey-box model incorporating white box model 2 provided the lowest FoM achieved in the workshop.

Discussion
Although the results shown here establish this form of grey-box modelling as a powerful approach to the NLSI problem, there are a number of caveats that must be addressed before it can be implemented.The first of these is the requirement for well-understood physics that can be accurately described (at least up to a point) in a series of state equations.There are many systems considered across engineering in which the governing equations may not be as easily elicited as in the cascaded tanks; this is the main reason why NLSI is not simply a matter of machine learning.
It is worth considering the complexity of a model required to allow good performance of a system; for example, one could ask if the use of a linear dynamic model as a white box would be sufficient for fitting the grey box with good predictive performance.The authors believe that, provided the white box is not incorrect in such a way as it confounds the structure of the residuals, even a simple white box will aid the performance of the grey box.It remains to be seen if the use of models which simplify the physics of a system (e.g.finite element methods) will leave enough structure in the residuals for the machine learning method to fit.It should be noted that the simulation of the white-box component of the model was computationally inexpensive for the cascaded tanks system.If the white-box portion of the model is computationally very expensive, there must be some investigation as to the cost-benefit of increasing model fidelity when the machine learning method will attempt to fit the more complex physics that is not modelled.
In summary, when conducting NLSI in the presence of well-understood physics, the incorporation of this prior knowledge in any machine learning technique is a powerful tool.The grey-box framework presented here is an intuitive and modular approach that has several advantages, these are: retention of engineering insight in the processes that are well understood in the white box and ability to compare differing black-box methods without the requirement to refit the white-box component (this is especially valuable where modelling the physical system is computationally expensive).

Conclusions
It would be unduly repetitive to summarise the local conclusions from the various studies contained in this paper.The overall conclusions can therefore be quite brief.Following an initial discussion on the nature of nonlinear system identification in general, the results from three case studies on benchmark problems are presented here.In all cases, an evolutionary optimisation scheme of some flavour provides a good, if not excellent, solution.In fact, the approaches presented here gave the best results of the workshop on two of the three benchmarks (One and Three).This supports the contention, expressed in the paper's introduction, that general frameworks for nonlinear system identification are beginning to emerge, potentially moving the discipline forward from its 'toolbox' phase.Some general elements of good practice are highlighted along the way, among them is the fact that careful design of an 'optimal' input excitation can make a serious difference to the results obtained.Another important lesson is that a grey-box model combining a physical white-box and a nonparametric machine learning algorithm can provide a predictive model superior to white-or black-box models alone.While the evolutionary approaches have succeeded in all cases here, it is important to note that equally general frameworks are emerging based on Bayesian machine learning.Although not explored here for reasons of space, such frameworks offer the possible advantage of combining parameter estimation with model selection.In order to see how such models can be applied to the benchmarks here, the reader has only to consult some of the other papers in this special issue.

Fig. 2 .
Fig. 2. Results from experimental friction force measurement taken from [32]: (a) shows the Stribeck effect and (b) shows the hysteresis loop.

Fig. 6 .
Fig. 6.Plot comparing results of Differential Evolution and JADE optimisation when applied to a 10 dimensional Rosenbrock function.

Fig. 7 .
Fig. 7. Examples of forcing input signals.(a) Shows broadband random input with variance of 10,000 N 2 and (b) shows linear chirp input of magnitude 100 N with frequency varying from 2 Hz to 75 Hz.Both inputs were sampled at one frequency of 15 kHz and signal duration was 2 s.

Fig. 8 .
Fig. 8. Plots of NMSE vs. Input Power for linear SI.(a) Shows results of linear identification results using broadband random input and (b) shows linear identification results using chirp input.

Fig. 9 .
Fig. 9. Plots of Parameter Estimates vs. Input Power for linear SI for broadband random input: (a) mass, (b) viscous damping, and (c) linear stiffness.

Fig. 12 .
Fig. 12. Plots of Parameter Estimates vs. Input Signal Power for nonlinear SI for broadband random input: Top left is the mass estimate, top right shows viscous damping prediction, left middle plot shows linear stiffness prediction, right middle plot shows a, bottom left shows bc prediction and bottom right plot shows bd prediction.

Fig. 13 .
Fig. 13.Plot of Normalised Mean Square Error vs. Input Power for linear and nonlinear SI for chirp input.

Fig. 14 .
Fig. 14.Plots of Parameter Estimates vs. Input Signal Power for nonlinear SI for chirp input: Top left is the mass estimate, top right shows viscous damping prediction, left middle plot shows linear stiffness prediction, right middle plot shows a, bottom left shows bc prediction and bottom right plot shows bd prediction.

Fig. 15 .
Fig. 15.Actual displacement response plotted with best model predicted displacement response for a 1000 N chirp forcing input.

Fig. 16 .
Fig. 16.Actual displacement response plotted with best model predicted displacement response for random phase multi-sine test dataset.

Fig. 17 .Fig. 18 .
Fig. 17.Zoomed (first second) actual displacement response plotted with best model predicted displacement response for random phase multi-sine test dataset.

Fig. 21 .
Fig. 21.SADE cost functions across the ten runs used for the verification problem (red -lowest cost per generation; blue average cost per generation).(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 23 .
Fig. 23.Frequency domain representation of Benchmark Two identification/training data: (a) input and output spectra; (b) FRF magnitude and phase.
Best convergence curve of Model 2 for the four given optimisers

Fig. 27 .
Fig.27.Comparison of mean and best convergence curves for both the physical models.Tested with the four different optimisers: differential evolution, particle swarm, quantum particle swarm, and krill herd.

Table 1
Lower and upper bounds for the linear model parameters for use within the JADE optimisation.

Table 2
Lower and upper bounds for the nonlinear model parameters for use within the JADE optimisation.
3 /m Fig. 11.Plot of Normalised Mean Square Error vs. Input Power for linear and nonlinear SI for broadband random input.K. Worden et al. / Mechanical Systems and Signal Processing 112 (2018) 194-232

Table 3
Nonlinear parameters giving the lowest NMSE from the JADE optimisation using a 1000 N chirp input.
t 1 4. Benchmark two: a Wiener-Hammerstein system with process noise

Table 4
Normalised mean-square errors for each optimisation scheme during its best run in training and the model prediction error on the test set, for that best set of parameters, for Model 1.

Table 5
Normalised mean square errors for each optimisation scheme during its best run in training and the model prediction error on the test set for that best set of parameters using Model 2.
then compute an error measure.However, a more demanding test is to compute the Model Predicted Output (MPO), which uses predicted y Ã values instead of observed y values (and thus corresponds to simulation in EE/control terminology).
7After fitting the GP-NARX model, one way to test it is to compute one step ahead (OSA) predictions, which exclusively use the training data up to that time (as discussed earlier in Section 3.1, this is prediction in the electrical engineering/control community), i.e. y Ã i ¼ Fðy iÀ1 ; ...; y iÀny ; x i ; ...; x iÀnx Þ ð41Þ and i ; ...; x iÀnx Þ

Table 6
Best NMSE and FoM scores for all five models tested in the MPO case.