Journal of Process Control

This paper introduces set-membership nonlinear regression (SMR), a new approach to nonlinear regression under uncertainty. The problem is to determine the subregion in parameter space enclosing all (global) solutions to a nonlinear regression problem in the presence of bounded uncertainty on the observed variables. Our focus is on nonlinear algebraic models. We investigate the connections of SMR with (i) classical statistical inference methods, and (ii) the usual set-membership estimation approach, where the model predictions are constrained within bounded measurement errors. We also develop a computational framework to describe tight enclosures of the SMR regions using semi-infinite programming and complete-search methods, in the form of likelihood-contour and polyhedral enclosures. A case study of a parameter estimation problem in microbial growth is presented to illustrate various theoretical and computational aspects of the SMR approach.


Introduction
Mathematical models capable of accurate prediction of physical phenomena have proved to be invaluable tools for engineers and scientists. In the area of process systems engineering, they routinely support the design, control and optimization of production processes, as a means of improving their economic profitability and reducing their environmental footprint. A majority of these models are nonlinear and contain adjustable parameters that must be estimated from available experimental data, or else from other, more fundamental, mathematical descriptions. In this context, parameter estimation turns out to be a key step in the verification, and subsequent use, of the mathematical models.
Most commonly, parameter estimation in nonlinear models is cast as a nonlinear regression exercise, where selected parameter values are adjusted so that the model predictions match the available observations as closely as possible, for instance in the least-squares or maximum-likelihood sense [1][2][3][4]. To avoid bias in the resulting parameter estimates, one can account for measurement errors in all of the variables, both independent and dependent variable observations, by following the so-called errors-in-variables approach [5,6]. This problem has been widely studied from a computational standpoint over the past decades, including the development of rigorous global optimization approaches for overcoming convergence to local optima [7,8].
* Corresponding author. E-mail address: b.chachuat@imperial.ac.uk (B. Chachuat).
Of course, there is more to model identification than just determining values for the unknown parameters. Systematic procedures have been devised to support the development and statistical verification of process models, which include testing structural identifiability, designing experiments for improved parameter precision, and inferring parameter confidence [9][10][11][12]. The focus in this paper is on the latter aspect, namely characterizing subregions in parameter space wherein the parameter values can be expected to lie. Other applications of such parameter confidence regions are in design under uncertainty [13,14], robust model predictive control [15][16][17], robust monitoring [18,19], and robust optimal design of experiments [20][21][22], to name but a few. For the scope of this paper, the emphasis is on models described by algebraic equations, but these ideas can be extended to dynamic or distributed models described by differential equations too.
Accounting for model mismatch and uncertain observations within the regression problem has spawned several schools of thought. Statistical approaches can be broadly classified as frequentist or Bayesian. The former seek to determine confidence regions around the regressed parameter values, typically a maximum-likelihood estimate, considered as the 'true' parameter values [1,2,4]. By construction, a 100(1 − α)% frequentist confidence region comprises 100(1 − α)% of the parameter values that would be obtained upon repetition of the parameter estimation using (hypothetical) new observations, considered as random variables. Approximate confidence regions, for instance based on the Wald test or the likelihood-ratio (LR) test, are known to converge to the exact confidence region in the limit of an infinite number of observations under certain conditions. Process modeling environments such as gPROMS and Aspen Custom Modeler have been relying on linear approximation and the Wald test to determine ellipsoidal confidence regions, a computationally efficient procedure for problems having several dozen unknown parameters, but one which may produce inaccurate results with large measurement errors, model mismatch, or few measurement points. Confidence regions based on the LR test have been shown to yield superior approximations, but they are computationally more involved since the corresponding parameter regions are complex sets in general (e.g., nonconvex, not simply connected) [23,24].
In practice, the term 100(1 − α)% confidence region is often misused to refer to the range of parameter values that includes 100(1 − α)% of their probability distribution [25]. This description corresponds instead to so-called 100(1 − α)% credible regions, which are defined in the Bayesian inference approach [26]. Bayesian estimation uses the available observations to construct a probability distribution of the parameters, called the posterior distribution, based on a likelihood function and a prior probability distribution of the same parameters. In essence, this approach thus considers the unknown parameter values as random variables. Sampling-based techniques such as Markov-chain Monte Carlo (MCMC) [27,28] provide a means of constructing (approximate) credible regions, although the computational effort can become prohibitive for problems having upwards of 10 parameters [29]. A most probable estimate can be determined from the posterior distribution, which also corresponds to a maximum-likelihood estimate for a flat prior. Although classical frequentist and Bayesian inference regions can be reconciled in special cases, no equivalence can be drawn in general, since Bayesian inference incorporates problem-specific contextual information from the prior distribution, whereas frequentist inference is based solely on the data; see, e.g., [30, Chapter 5]. The debate on whether to use frequentist or Bayesian statistical inference continues to this day [25,31], but its intricacies are beyond the scope of this paper.
Regardless of whether a mathematical model's structure is correct or not, a frequentist confidence region will normally converge to the maximum-likelihood estimate as the number of observations increases. Likewise, a Bayesian posterior will normally converge to a point mass corresponding to a most probable estimate, i.e., a point that maximizes the probability of the data given the (possibly wrong) model. An interesting alternative to these statistical approaches is set-membership estimation (SME). The traditional SME setting, also called guaranteed parameter estimation (GPE), seeks to determine the set of all possible parameter values for which a model's predictions are consistent with a set of observations subject to bounded errors [32][33][34]. The fact that this approach requires only bounds on the observation errors, rather than a full statistical description, makes it not only less demanding but also more realistic in many practical applications, including biological systems where measurements are often scarce and subject to large errors [21]. Besides parameter estimation, the distinctive yes-or-no answer provided by set-membership techniques can also be used for model inconsistency detection [35,36]. One caveat here is that the set of feasible parameter values may be empty in the presence of measurement outliers or due to an inadequate description of the measurement noise, thus calling for remedial strategies [37,38].
Another key challenge in nonlinear set-membership estimation is describing the feasible parameter set accurately while remaining computationally tractable. This challenge is in fact similar to the one faced by the aforementioned statistical inference methods for describing parameter confidence sets, and it may explain why set-membership estimation has not reached a wider diffusion to this day. Existing computational strategies are limited to problems with at most a dozen parameters. They range from approximation using sampling-based methods, including stochastic search [39], support vector machines (SVM) [40] and MCMC [41]; to rigorous complete-search methods based on interval analysis and other set arithmetics [42][43][44]; and to semidefinite relaxation techniques for semi-algebraic problems [45,46]. This paper introduces set-membership regression (SMR), a new approach to nonlinear regression. The SMR problem seeks to determine the subregion in parameter space enclosing all (global) solutions to a nonlinear regression problem in the presence of bounded uncertainty on the observed variables. By contrast with the traditional SME setting, which seeks parameter values satisfying certain feasibility constraints, the SMR approach seeks parameter values satisfying an optimality condition. To the best of the authors' knowledge, this problem has not been investigated in the general nonlinear setting so far. Milanese [47] studied optimality and convergence properties of least-squares estimates in the presence of unknown bounded disturbances, but this theoretical work is limited to linear problems. This paper sets out to investigate the connections of SMR with both statistical inference and set-membership estimation approaches for nonlinear algebraic models. Another principal contribution is a computational framework to describe tight enclosures of the SMR regions using complete-search methods.
The rest of the paper is organized as follows. Section 2 starts by reviewing classical results from both areas of statistical and setmembership estimation. Section 3 introduces the SMR approach and analyzes its properties, after which numerical solution strategies are developed in Section 4. A simple case study is used throughout Sections 2-4 to illustrate the main concepts and results. Section 5 presents a more challenging estimation problem in microbial growth to demonstrate the SMR approach. Finally, Section 6 concludes the paper and discusses future research opportunities.

Background
Our focus throughout this paper is on explicit models of the form y = g(p, u), where p ∈ R^{n_p} is the vector of unknown parameters; and (u, y) ∈ R^{n_u} × R^{n_y} is the vector of observed variables, denoted collectively by x := (u, y) ∈ R^{n_x} for convenience. Notice that u and y often correspond to (either controlled or uncontrolled) input and output variables, respectively, in a practical setup. It is also worth pointing out that many of the concepts and methods presented herein can be applied to models described by implicit equation systems, such as f(p, x) = 0, and to models comprised of differential equations too.
Suppose that n_m observations x_k^m := (u_k^m, y_k^m) of the input-output variables are available, and assume that all of the observation errors are independent and described by probability density functions with a given parameterization. In the errors-in-variables approach [6], the reconciled values u_1, ..., u_{n_m} for the observations are estimated alongside the unknown model parameters p. The joint probability of the prediction-observation mismatch in all data points for the parameter values θ := (p, u_1, ..., u_{n_m}) ∈ R^{n_θ} is described by the likelihood function (1), with δu_k := u_k − u_k^m and δy_k := g(p, u_k) − y_k^m. The error-in-equation approach instead considers the input measurements u_k^m to be error-free; that is, the parameter vector θ reduces to p, and the likelihood function simplifies accordingly (2). Nonlinear regression in the maximum-likelihood sense seeks to determine values for θ that maximize L or, equivalently, maximize log L. In the errors-in-variables approach, this estimation entails the solution of an optimization problem of the form (3). If the parameters describing the error distribution are also unknown, one may either approximate their values using an ad hoc estimator, or consider them as additional variables in problem (3) [1].
In the special case of Gaussian-distributed errors with zero mean and variances v_{k,i}, the maximum-likelihood problem (3) is equivalent to the weighted least-squares problem (4). While least-squares (ℓ2) regression is optimal amongst minimum-variance mean-unbiased estimators for normally distributed observation errors, outliers can greatly distort the least-squares estimates. As an alternative, least-absolute-values (ℓ1) fitting may be preferable in the presence of outliers or if little is known about the distribution of the errors [48,49]. The ℓ1 regression problem reads as in (5), where standard tricks can be used to reformulate or approximate the nonsmooth absolute-value terms in the objective function. The solutions to the ℓ1 regression problem (5) can also be viewed as maximum-likelihood estimates if the observation errors follow a Laplace distribution with zero mean and variance v_{k,i}. An ℓ∞ regression problem can be constructed in a similar way [49].
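As a minimal illustration of the two fitting criteria above, the sketch below contrasts a weighted least-squares (ℓ2) fit with a least-absolute-values (ℓ1) fit on synthetic data containing one outlier. The exponential model, data, and variances are hypothetical stand-ins, not taken from the paper, and a local Nelder-Mead solver replaces a rigorous global one.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical error-in-equation setup: inputs u assumed error-free,
# outputs y subject to noise; model g(p, u) = p[0] * (1 - exp(-p[1] * u)).
rng = np.random.default_rng(0)
u = np.linspace(0.5, 8.0, 10)
p_true = np.array([20.0, 0.5])
y = p_true[0] * (1.0 - np.exp(-p_true[1] * u)) + rng.normal(0.0, 1.0, u.size)
y[3] += 8.0           # inject one outlier
v = np.ones_like(y)   # error variances v_k

def g(p, u):
    return p[0] * (1.0 - np.exp(-p[1] * u))

# Weighted least-squares (l2) objective, cf. problem (4)
def wls(p):
    return np.sum((g(p, u) - y) ** 2 / v)

# Least-absolute-values (l1) objective, cf. problem (5)
def lav(p):
    return np.sum(np.abs(g(p, u) - y) / np.sqrt(v))

p0 = np.array([10.0, 1.0])
p_l2 = minimize(wls, p0, method="Nelder-Mead").x
p_l1 = minimize(lav, p0, method="Nelder-Mead").x
# The l1 fit is typically less distorted by the outlier than the l2 fit.
```

The nonsmooth ℓ1 objective is handled here by a derivative-free solver; in an optimization model, the standard trick is instead to split each absolute-value term into two nonnegative slack variables.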

Statistical inference
Classical frequentist confidence inference proceeds in two steps: (i) solve a regression problem, e.g., to determine a most-likely parameter estimate as described above; and (ii) construct confidence regions around this estimate.
Under the assumption that θ̂ matches the (unique) 'true' value of the model parameters, both the likelihood-ratio statistic −2 log[L(θ | x^m)/L(θ̂ | x^m)] and the Wald statistic (θ − θ̂)^T V_θ̂^{-1} (θ − θ̂) asymptotically follow a chi-squared distribution with n_θ degrees of freedom as the sample size n_m → ∞ [4]. The covariance matrix V_θ̂ ∈ S_+^{n_θ × n_θ} of the parameters at θ̂ can be approximated in various, asymptotically equivalent ways [50]; for instance via (6) [1, §7-5], where V_e ∈ S_+^{n_x n_m × n_x n_m} stands for the covariance matrix of the observation noise and Ĥ is the Hessian matrix of the log-likelihood at θ̂.
These asymptotic results can be used to obtain (approximate) 100(1 − α)% confidence regions, with the usual frequentist interpretation that the probability for a random confidence region to cover the true value of θ is, in large samples, equal to 1 − α [24]: • the 100(1 − α)% likelihood-based confidence region (7); • the 100(1 − α)% normal-theory (Wald) confidence region (8); where Θ_0 ⊆ R^{n_θ} denotes the allowable (prior) parameter set, and χ²_{n_θ}(1 − α) is the 1 − α quantile of the chi-squared distribution with n_θ degrees of freedom. At this point, we note that confidence intervals can be inferred from any confidence region by bounding the range of values for each parameter θ_i. In the case of the Wald approximation, explicit confidence bounds are obtained in closed form. A classical result in statistical inference is that the confidence regions (7) and (8) are asymptotically equivalent [51,52], with a convergence rate ∝ n_m^{-1}. However, unlike the likelihood-based confidence regions, the Wald confidence regions are not invariant to a model reparameterization because of the (approximate) covariance term V_θ̂. Conversely, computing a Wald confidence region is straightforward, whereas describing a likelihood-based confidence region for a nonlinear model is generally a hard task, since this region may be nonconvex or not even simply connected.
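The Wald region (8) is straightforward to evaluate once a covariance approximation is available. The sketch below tests membership in a Wald ellipsoid and extracts per-parameter confidence bounds, assuming a Gauss-Newton covariance approximation V_θ̂ = (JᵀV_e⁻¹J)⁻¹; the Jacobian J and estimate θ̂ are illustrative numbers, not values from the paper.

```python
import numpy as np
from scipy.stats import chi2

# Sketch of a Wald region membership test with an assumed Gauss-Newton
# covariance approximation V = (J^T Ve^{-1} J)^{-1}; theta_hat and the
# model sensitivities J (one row per data point) are made-up numbers.
theta_hat = np.array([20.0, 0.5])
J = np.array([[0.22, 3.1], [0.39, 4.9], [0.63, 5.4], [0.86, 3.8]])
Ve = np.eye(4)   # observation-noise covariance
V = np.linalg.inv(J.T @ np.linalg.inv(Ve) @ J)

def in_wald_region(theta, alpha=0.10):
    """True if theta lies in the 100(1-alpha)% Wald ellipsoid, cf. (8)."""
    d = theta - theta_hat
    return bool(d @ np.linalg.solve(V, d) <= chi2.ppf(1.0 - alpha, df=theta_hat.size))

# Explicit per-parameter confidence half-widths from the ellipsoid bounds
half_width = np.sqrt(chi2.ppf(0.90, df=2) * np.diag(V))
```

The per-parameter bounds θ̂_i ± half_width[i] illustrate how confidence intervals can be read off a confidence region by bounding each coordinate range.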
Unlike the frequentist view, Bayesian estimation treats the parameters as random variables, whose (posterior) probability distribution p(θ | x^m) can be inferred from Bayes' theorem, where p(θ) is the so-called prior density of the parameters. Any subset of Θ_0 carrying a posterior probability mass of 1 − α is called a 100(1 − α)% credible set. One particular kind of credible set is the highest posterior density (HPD) set, given by (10), where the density threshold is the largest value for which (10) holds. When a sampling approach is applied to estimate the posterior, for instance an MCMC sampler, this threshold can be estimated by a procedure that examines all available samples of p(θ | x^m) [28]. It is also worth mentioning that complete-search approaches to enclosing credible sets have been proposed as well [53,54]. The connections between Bayesian and non-Bayesian statistical inference have been studied since the 1960s, for instance with regard to matching credible and confidence intervals [55,56], or, more recently, in order to reconcile Bayesian and frequentist higher-order asymptotic expansions for predictive probability densities [57]. In linear regression problems with normally distributed measurement errors, the Bayesian posterior takes the form of a multivariate Gaussian centered at the maximum-likelihood estimate and with covariance matrix V_θ̂ for non-informative priors, so the HPD credible regions match their frequentist counterparts. More generally, such a matching can be made in cases where the Bayesian prior is invariant to model reparameterization, as is the case for Jeffreys or reference priors [58]. For simplicity, our focus in this paper is limited to uniform prior distributions with compact supports. Although such priors fail to be invariant under reparameterization, the resulting HPD sets correspond to contour levels of the likelihood function, similar to the likelihood-based confidence regions.
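The sample-based HPD procedure mentioned above can be sketched as follows: given posterior draws together with their (possibly unnormalized) density values, the HPD density threshold is estimated as the α-quantile of those density values, so that roughly 100(1 − α)% of the draws lie above it. The standard-normal posterior here is purely illustrative.

```python
import numpy as np

# Sketch of a sample-based HPD estimate: draws from an (illustrative)
# standard-normal posterior, with the unnormalized density at each draw.
rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=50_000)
dens = np.exp(-0.5 * samples**2)

def hpd_threshold(density_values, alpha=0.10):
    # Largest density level such that ~100(1-alpha)% of the draws lie above it
    return np.quantile(density_values, alpha)

c = hpd_threshold(dens, alpha=0.10)
inside = samples[dens >= c]   # draws falling in the estimated HPD set
# For a standard normal, the 90% HPD set is approximately [-1.645, 1.645].
```
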

Set-membership estimation
The usual GPE problem in set-membership estimation seeks to determine a parameter subregion such that the predicted input-output observations are consistent with their matching measurements within given error bounds [32,33]. Here the error set E ⊂ R^{n_x n_m} may be any compact set and does not need a statistical description of the uncertainty. In the usual scenario where independent error bounds ±e_1^u, ±e_1^y, ..., ±e_{n_m}^u, ±e_{n_m}^y are given for each of the measurements, the set-membership estimation problem reads as in (12). If statistical information about the observation errors is nonetheless available, for instance a uniform or q-Gaussian probability distribution with compact support, one may take E directly as this support set. Even when the distribution support is not compact, one could decide to exclude those scenarios having a probability lower than a given threshold and use the corresponding HPD credible region as the error set E; see, e.g., [59]. It is not difficult to imagine a situation whereby no parameter value in Θ_0 can be found such that the model predictions are consistent with the observations for a given error set E, i.e., the guaranteed parameter region (12) is empty. This may happen in the presence of measurement outliers, or could be caused by a large model mismatch. The former situation is common with experimental data, e.g., due to a failing or drifting sensor. Methods have been developed for robustifying set-membership estimation against outliers [37,38], alongside classical approaches to detecting outliers [60]. Moreover, one can take advantage of the latter situation, for instance to invalidate candidate models that present a systematic offset with respect to a certain set of observations [35,36], typically after checking for outliers [38].
Another appeal of set-membership estimation lies in its ability to detect a lack of identifiability in parametric models, that is, when model responses corresponding to distinct parameter values are indistinguishable [9].
The vast majority of computational studies in set-membership estimation use exhaustive-search techniques based on interval analysis or other set arithmetics to describe the parameter regions (12) [42][43][44][61]. A current bottleneck of these approaches is their applicability to problems having no more than 5-10 parameters. However, if one is ready to abandon guarantees, sampling-based techniques such as SVM or MCMC can be used to approximate the parameter regions, and these remain applicable to black-box models too [40,41].
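A naive, non-rigorous stand-in for the set-inversion techniques cited above is a grid screen of the GPE region (12) under independent output-error bounds; the model form, data, and bound below are hypothetical, chosen to mimic the BOD example introduced next.

```python
import numpy as np

# Grid-based (inner, sampled) approximation of a GPE region under
# independent error bounds; model, data, and e_bnd are illustrative.
t = np.array([1.0, 3.0, 5.0, 7.0])
c_m = np.array([8.4, 15.6, 18.5, 19.3])
e_bnd = 1.645   # +/- bound on each output error

def g(theta, t):
    return theta[0] * (1.0 - np.exp(-theta[1] * t))

th1 = np.linspace(0.0, 50.0, 201)
th2 = np.linspace(0.0, 2.0, 201)
feasible = [
    (a, b)
    for a in th1
    for b in th2
    if np.all(np.abs(g((a, b), t) - c_m) <= e_bnd)
]
# 'feasible' is a sampled approximation of the guaranteed parameter region;
# rigorous interval methods would instead bracket the region with boxes.
```
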
Illustrative example. We use a simple estimation problem adapted from [3] to illustrate the main approaches described in this background section, and we use the same problem to illustrate the main properties of the SMR framework developed later on in Sections 3 and 4. The model (13) describes the dynamic evolution of the biological oxygen demand (BOD) concentration c in a wastewater sample, with parameters (θ_1, θ_2) ∈ [0, 50] × [0, 2], and time t ≥ 0. For this problem, data points (t_k^m, c_k^m) have been generated by simulating the model (13) for the parameter values θ_1 = 20 and θ_2 = 0.5, and corrupting these values with a Gaussian white noise with variance σ_c² = 1. These data are reported in Appendix B for the sake of reproducibility.
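Since the expression of model (13) and the data of Appendix B are not reproduced here, the sketch below regenerates comparable synthetic data, assuming the classical two-parameter BOD form c(t) = θ1(1 − exp(−θ2 t)) from [3]; the random seed and sampling times are illustrative.

```python
import numpy as np

# Data generation for the illustrative example, assuming the classical
# two-parameter BOD form (the exact expression of model (13) should be
# taken from the paper); additive Gaussian noise with variance 1.
rng = np.random.default_rng(42)
theta_true = (20.0, 0.5)

def bod(theta, t):
    return theta[0] * (1.0 - np.exp(-theta[1] * np.asarray(t)))

t_m = np.arange(0.5, 8.5, 0.5)   # 16 points, twice a day over 8 days
c_m = bod(theta_true, t_m) + rng.normal(0.0, 1.0, t_m.size)
# Subsampling gives the n_m = 8 (daily) and n_m = 4 (every other day) sets.
t8, c8 = t_m[1::2], c_m[1::2]
t4, c4 = t_m[3::4], c_m[3::4]
```
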
Both 90% confidence regions and 90% HPD credible regions are compared in Fig. 1, in the case of an ℓ2-regression problem. Various sets of measurements are considered, namely n_m = 4 measurement points (every other day), 8 measurement points (every day), and 16 measurement points (twice a day). The asymptotic convergence of the Wald and likelihood-based confidence regions with an increasing number of measurements is clearly visible. The HPD credible sets shown on these plots are generated from a flat prior, and are consistently smaller than their confidence counterparts; HPD credible sets constructed from a non-informative Jeffreys prior (not shown on the plots) would be identical to the likelihood-based confidence regions.
A comparison between guaranteed parameter regions for the same three sets of measurements, but corresponding to different measurement error sets in (12), is shown in Fig. 2. The first measurement error set corresponds to the usual assumption of independent error bounds on each measurement, here for 90% confidence bounds, so that χ²_1(0.9) σ_c² ≈ 2.706. Notice how the corresponding guaranteed parameter sets shrink when more measurements are added, as it becomes more challenging for the model predictions to match a larger measurement set in the presence of measurement noise. Such guaranteed parameter regions could even be empty, which happens for instance with (e_k^c)² ≤ 1 in (14), corresponding to 68% (1-sigma) confidence bounds. Also notice that the real parameter value (20, 0.5) lies outside the guaranteed regions due to the large measurement noise. The other measurement error set in Fig. 2 is chosen as the HPD set of a joint Gaussian distribution, again for a 90% confidence limit. Guaranteed parameter sets so constructed do not shrink significantly as more measurements are added to the estimation problem, and they are thus more resilient to measurement noise than their counterpart sets constructed with independent error bounds on each measurement. This higher resilience is essentially due to an enlarged, and hence more flexible, measurement error set E_2 compared with E_1.

Set-membership nonlinear regression
The developed set-membership regression (SMR) approach seeks to describe the subregion Θ_R in parameter space enclosing all (global) solutions to a nonlinear regression problem under all possible measurement uncertainty scenarios. Given a bounded uncertainty set E ⊂ R^{n_x n_m} on the observation errors, the SMR region Θ_R is mathematically defined by (16). In the context of the ℓ2-regression problem (4), SMR specializes to (17); and in the context of the ℓ1-regression problem (5), to (18). Notice that the constraint feasibility condition in the traditional SME formulation (12) is replaced with an optimality condition in the SMR problem (16), making the parameter regions in SMR expectedly more difficult to characterize. Numerical solution strategies for describing enclosures of an SMR region are presented later on in Section 4. The remainder of this section investigates connections between SMR and the well-established set-membership and statistical inference approaches, respectively in Sections 3.1 and 3.2.

Set-membership interpretation
By contrast with the usual approach to set-membership estimation (Section 2.2), SMR comes with a guarantee that the set Θ_R is always non-empty, no matter how large the model mismatch or the observation errors might be, since the regression problems in (16) are all feasible by construction. Therefore, the SMR formulation is inherently resilient to the presence of outlying observations, and it does not require such outliers to be detected or removed from the observation set before computing the parameter regions [38]. In other words, outlying observations can be dealt with directly within the SMR problem (16) via an appropriate likelihood function.
The following inclusion result holds between SMR and GPE under mild assumptions: Theorem 1. Suppose that the probability density functions participating in the likelihood function (1) are all maximal at 0. Then, for a given error set E, the SMR region (16) contains the GPE region (12). Proof. Let θ ∈ Θ_G, so that the model predictions match the corrected observations x^m + e exactly for some e := (e_1^u, e_1^y, ..., e_{n_m}^u, e_{n_m}^y) ∈ E. Since the probability density functions in L are all maximal at 0 by assumption, the log-likelihood function log L(· | x^m + e) is (globally) maximal at θ, and therefore θ ∈ Θ_R. □ Remark 1. The assumption on the likelihood function L in Theorem 1 is not very restrictive in practice. For instance, it is satisfied by both the ℓ2- and ℓ1-regression problems in (17) and (18), so we have Θ_G ⊆ Θ_R^{ℓ2} and Θ_G ⊆ Θ_R^{ℓ1}. It is also satisfied when the probability density functions are uniform on a compact support, as is the case with ℓ∞-regression problems [49].
Illustrative example (continued). A comparison between GPE and SMR regions for both ℓ1- and ℓ2-regression is presented in Fig. 3, in the case of 8 measurements. The same measurement error sets E_1 and E_2 as introduced earlier in (14) and (15) are used in this comparison. For simplicity, we have applied a simple sampling procedure to inner-approximate the SMR regions: 20,000 error vectors e^(i) are generated within the multi-dimensional error sets E_1 and E_2, here using Sobol quasi-random sampling; then, the corresponding nonlinear regression problem is solved to global optimality for each sample to obtain a point of the SMR region. We start by noting that the inclusion result in Theorem 1 is indeed satisfied for both measurement error sets and both regression types. Moreover, the SMR regions obtained for either measurement error set are comparable in size. In the case of independent error bounds on the measurements (set E_1, left plot), the SMR regions do not shrink much when more measurements are added, unlike the corresponding GPE regions; compare Fig. 2. This also illustrates the higher resilience of SMR to noisy or outlying measurements compared with GPE. For both measurement error sets, the SMR-ℓ2 regions are consistently smaller than their SMR-ℓ1 counterparts. Interestingly, this observation is consistent with the classical Gauss-Markov theorem, which states that the least-squares estimator has the lowest variance among linear unbiased estimators in linear regression.
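The sampling procedure just described can be sketched as follows, with a local Nelder-Mead solver standing in for the global solver used in the paper, a reduced sample count, and illustrative data: each Sobol point in the box-shaped error set defines a perturbed dataset whose ℓ2-regression solution is one sample of the SMR region.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import qmc

# Sampled inner approximation of an SMR-l2 region; the BOD-like model,
# data, and error bound are illustrative stand-ins for the paper's setup.
t = np.arange(1.0, 9.0)   # 8 daily sampling times
c_m = np.array([7.9, 12.8, 15.3, 17.2, 18.1, 19.0, 19.2, 19.5])
e_bnd = np.sqrt(2.706)    # box bound, from chi2_1(0.9) * sigma_c^2

def g(theta, t):
    return theta[0] * (1.0 - np.exp(-theta[1] * t))

sampler = qmc.Sobol(d=t.size, scramble=True, seed=0)
errors = (2.0 * sampler.random(256) - 1.0) * e_bnd   # Sobol points in the box

smr_points = []
for e in errors:
    obj = lambda th, e=e: np.sum((g(th, t) - (c_m + e)) ** 2)
    sol = minimize(obj, x0=np.array([20.0, 0.5]), method="Nelder-Mead")
    smr_points.append(sol.x)
smr_points = np.asarray(smr_points)   # sampled points of the SMR-l2 region
```

Note that a local solver only yields valid SMR samples if each regression problem is solved to its global optimum, which is why the paper relies on global optimization.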

Statistical interpretation
Whenever statistical information is available for the observation errors, for instance in the form of a joint probability distribution, one may choose the error set E as the corresponding HPD region for a given credibility level 1 − α. In the case of independent and Gaussian-distributed observation errors, such as those leading to the ℓ2-regression problem (4), the 100(1 − α)% HPD region is given by (19), with the diagonal error covariance matrix V_e := diag(v_1^u, v_1^y, ..., v_{n_m}^u, v_{n_m}^y). Likewise, for Laplacian-distributed errors as in the ℓ1-regression problem (5), the 100(1 − α)% HPD region comes in the form (20), where the threshold is the counterpart of the chi-squared value for a joint Laplacian distribution. Notice that with the error sets in (19) and (20), the SMR regions Θ_R may not converge to a singleton (or a finite set) as more observations are added to the regression problem, since the HPD limits in (19) and (20) are themselves increasing with n_m for a given confidence level 1 − α. The SMR regions derived from such error sets are thus unrelated to their confidence and credible region counterparts in classical statistical inference (Section 2.1), which both shrink to a singleton as n_m → ∞ (under certain regularity conditions). But while one would indeed expect convergence to some 'true' parameter value when a model's structure is correct, such an idea of 'true' parameter values becomes meaningless in the presence of structural model mismatch. By contrast, SMR does not make any assumption about the correctness of a model's structure, and a 100(1 − α)% SMR region is comprised of those parameter values which are equally credible under the observation error set E, in the sense of the regression problem at hand: a clear and unambiguous statistical interpretation.
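The non-shrinking behavior discussed above is easy to check numerically: the joint-HPD limit χ²_{n_x n_m}(1 − α) in (19) grows with the number of observations, so the error set E does not contract as data accumulate.

```python
from scipy.stats import chi2

# The 90% joint-HPD limit chi2_n(0.90) increases with the number n of
# (scalar) observation errors, which is why SMR regions built on such
# error sets need not shrink as more observations are added.
limits = {n: float(chi2.ppf(0.90, df=n)) for n in (4, 8, 16, 32)}
```
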
To sum up, the convergence of an SMR region Θ_R to a singleton depends on the choice of the measurement error set E, but is unrelated to whether or not the model's structure is correct. A follow-up question is then to identify scenarios under which SMR regions would be asymptotically equivalent to classical confidence regions. The following result establishes one simple connection with the Wald confidence regions (8) under certain regularity conditions.

Theorem 2.
Let the error set in the SMR problem (16) be given by (21) for some confidence level 1 − α and covariance matrix V_e ∈ S_+^{n_x n_m × n_x n_m}. Assume that the likelihood function in (16) is twice continuously differentiable and that the regression problems for e ∈ E all have a unique, strict global optimum. Then, the SMR region Θ_R is asymptotically equivalent to the 100(1 − α)% Wald confidence region Θ_W in (6) and (8), in the sense that the Hausdorff distance d_H(Θ_R, Θ_W) vanishes as n_m → ∞.
Proof. Let e ∈ E, and denote by θ(e) ∈ Θ_R the corresponding solution to the regression problem max_θ log L(θ | x^m + e), so that (22) holds, with Ĥ the Hessian matrix of the log-likelihood at θ̂. The image of the error set (21) under the affine transformation (23) is an ellipsoid with center θ̂ and shape matrix V_θ̂ as in (6), so that θ(e) ∈ Θ_W. Conversely, let θ be any point in Θ_W, and let e be any point in E satisfying (23). Clearly, the point θ(e) ∈ Θ_R is such that θ(e) − θ ∈ O(diam(E)²) by (22). □

Remark 2.
In the special case of a linear regression, the equivalence between the SMR and Wald confidence regions in Theorem 2 turns out to be exact, not merely asymptotic. For an ℓ2-regression and the model y = Fθ, the SMR region coincides with the Wald region, which itself matches the likelihood-ratio confidence region Θ_L, as well as the Bayesian HPD credible region Θ_B in (11) for a uniform/non-informative prior. Both the frequentist and Bayesian inference regions are thus implied by the SMR framework in linear regression problems.
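A numerical check of this exactness claim, under the stated assumptions (linear model y = Fθ with i.i.d. Gaussian errors, and an illustrative random F): the least-squares map sends the ball-shaped error set onto precisely the Wald ellipsoid, so every mapped boundary point satisfies the Wald inequality, with near-equality in the directions retained by the map.

```python
import numpy as np
from scipy.stats import chi2

# For y = F theta with iid N(0, sigma^2) errors, the image of the ball
# {||e||^2 <= sigma^2 chi2_2(1-alpha)} under e -> (F^T F)^{-1} F^T e is
# exactly the Wald ellipsoid {d : d^T (F^T F) d / sigma^2 <= chi2_2(1-alpha)}.
rng = np.random.default_rng(3)
F = rng.normal(size=(12, 2))
sigma, alpha = 1.0, 0.10
r2 = sigma**2 * chi2.ppf(1.0 - alpha, df=2)

M = np.linalg.solve(F.T @ F, F.T)   # least-squares sensitivity matrix
e = rng.normal(size=(5000, 12))
e *= np.sqrt(r2) / np.linalg.norm(e, axis=1, keepdims=True)   # boundary of E
d = e @ M.T                          # parameter deviations theta(e) - theta_hat
q = np.einsum("ij,jk,ik->i", d, F.T @ F, d) / sigma**2
# Every q value satisfies the Wald inequality q <= chi2_2(1-alpha).
```
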

Remark 3.
The key difference between the error set (21) in Theorem 2 and the 100(1 − α)% HPD region (19) is that the HPD limit in the former, namely χ²_{n_θ}(1 − α), is independent of the number of observations. This is also the reason why the error set (21) shrinks to the origin, and therefore Θ_R converges to the singleton set {θ̂} as n_m → ∞ (under the assumptions of Theorem 2). Conversely, a 100(1 − α)% confidence region may be regarded as the asymptotic equivalent of an SMR region with confidence level 100(1 − β)% on the jointly Gaussian-distributed observation errors in (19), such that χ²_{n_x n_m}(1 − β) = χ²_{n_θ}(1 − α). For instance, a 90% confidence region in a two-parameter regression problem is asymptotically equivalent to an SMR region with 67%, 20% and 0.26% joint confidence for 4, 8 and 16 observations, respectively.

Illustrative example (continued). A comparison between 90% likelihood-ratio confidence regions and two SMR-ℓ2 regions corresponding to different measurement error sets is shown in Fig. 4, in the case of 4 and 8 measurement points. The SMR regions are inner-approximated using the same sampling strategy as previously.
The first error sets correspond to 90% HPD regions in (19) for the jointly Gaussian-distributed measurement errors or, equivalently, the set E_2 in (15). These SMR regions are found to be significantly larger than their 90% likelihood-ratio (or Wald) confidence counterparts. Also recall that, by Theorem 1, these SMR regions always enclose the GPE regions shown in Fig. 3 for the same error sets E_2.
The second error sets are constructed per (21), in order to illustrate the asymptotic equivalence with classical confidence regions established through Theorem 2; they correspond to 67% and 20% HPD regions for jointly Gaussian-distributed measurement errors with 4 and 8 measurements, respectively, as discussed in Remark 3. Such asymptotic convergence with an increasing number of measurements is clearly visible in Fig. 4, where the small discrepancy observed in the left plot for n_m = 4 can no longer be seen in the right plot for n_m = 8. The SMR framework is thus capable of providing confidence information equivalent to that of classical statistical inference, with the attendant advantage of being able to switch seamlessly between alternative error-set descriptions or likelihood functions.
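As a quick numerical check of the equivalence stated in Remark 3, the joint-confidence level 100(1 − β)% satisfying χ²_{n_x n_m}(1 − β) = χ²_{n_θ}(1 − α) can be computed in a few lines of Python. This sketch uses scipy.stats and recovers the 67%, 20% and 0.26% figures quoted above:

```python
from scipy.stats import chi2

def equivalent_joint_confidence(alpha, n_theta, n_err):
    """Solve chi2_{n_err}(1 - beta) = chi2_{n_theta}(1 - alpha) for 1 - beta,
    i.e. the joint confidence level on the observation errors that makes an
    SMR region asymptotically match a 100(1-alpha)% confidence region."""
    q = chi2.ppf(1.0 - alpha, df=n_theta)   # chi-square quantile, n_theta dof
    return chi2.cdf(q, df=n_err)            # = 1 - beta

# two-parameter regression, 90% confidence region, 4/8/16 scalar observations
for n in (4, 8, 16):
    print(n, equivalent_joint_confidence(0.10, 2, n))
# -> approximately 0.67, 0.20 and 0.0026, as quoted in Remark 3
```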

Numerical solution and approximation
Describing the SMR region R as defined in (16) is a difficult task in general. A simple approach to enclosing R by a set of algebraic constraints, which would then allow the application of the same set-inversion techniques as for GPE (Section 2.2; Appendix A), entails substituting the regression problems with their optimality conditions: every element θ in (the interior of) R must satisfy the first- and second-order optimality conditions (24) for some observation error e ∈ E. However, since the optimality conditions (24) hold for local maxima and saddle points of the likelihood function as well as global maxima, this inclusion could end up being very conservative for nonlinear regression problems in general. Another important caveat with this approach is the computational penalty of applying a set-inversion algorithm in the (n_θ + n_x n_m)-dimensional domain Θ₀ × E, not merely in the original n_θ-dimensional domain Θ₀. The following subsections set out to develop more tractable, yet still conservative, bounding strategies to alleviate the computational burden of SMR, both in the form of confidence-like regions (Section 4.1) and polyhedral regions (Section 4.2).

Likelihood-contour enclosure
We consider the problem of enclosing the SMR region R within a confidence-like region of the form (25), for some nonnegative constant threshold. Notice that the computational complexity of describing, or closely approximating, this relaxed region is then comparable to describing either a likelihood-based confidence region (7) or a GPE region (12), for instance by applying a set-inversion algorithm in the original n_θ-dimensional domain Θ₀.
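Set inversion of the kind alluded to here can be sketched with a SIVIA-style bisection scheme. The sketch below is purely illustrative: a toy quadratic contour {θ : θ1² + θ2² ≤ 1} stands in for a likelihood contour, and the interval bounds are hand-coded for this particular function; a rigorous implementation would rely on an interval-arithmetic library, as in the CRONOS library used later in the paper.

```python
import math

def sq_bounds(lo, hi):
    """Interval bounds of x**2 for x in [lo, hi]."""
    mn = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    return mn, max(lo * lo, hi * hi)

def sivia(box, c=1.0, eps=0.05):
    """Bisect 'box' into boxes proved inside {x1^2 + x2^2 <= c}, proved
    outside (discarded), or undecided boundary boxes of width <= eps."""
    inside, boundary, stack = [], [], [box]
    while stack:
        ((x1, x2), (y1, y2)) = b = stack.pop()
        l1, u1 = sq_bounds(x1, x2)
        l2, u2 = sq_bounds(y1, y2)
        if u1 + u2 <= c:
            inside.append(b)                    # box entirely inside the set
        elif l1 + l2 > c:
            continue                            # box entirely outside: discard
        elif max(x2 - x1, y2 - y1) <= eps:
            boundary.append(b)                  # undecided but small enough
        elif x2 - x1 >= y2 - y1:                # otherwise bisect widest edge
            m = 0.5 * (x1 + x2)
            stack += [((x1, m), (y1, y2)), ((m, x2), (y1, y2))]
        else:
            m = 0.5 * (y1 + y2)
            stack += [((x1, x2), (y1, m)), ((x1, x2), (m, y2))]
    return inside, boundary

def area(boxes):
    return sum((x2 - x1) * (y2 - y1) for (x1, x2), (y1, y2) in boxes)

inside, boundary = sivia(((-2.0, 2.0), (-2.0, 2.0)))
# inner and outer area estimates bracket the true disc area pi
print(area(inside), area(inside) + area(boundary))
```

The inner (inside boxes) and outer (inside plus boundary boxes) estimates bracket the true set, which is exactly the guarantee a set-inversion enclosure of a confidence-like region provides.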
The following theorem provides a systematic means of computing a value * such that R(*) is a tight enclosure of R, upon specializing ϕ(θ) := log L(θ | x_m). This situation is depicted in Fig. 5.

Theorem 3. Given any continuous function
Moreover, the enclosure with * is tight in the sense that the two sets share one or more boundary points.
Solving this SIP problem is hard in general, since both the semi-infinite constraint and the objective function are generally nonconvex for a nonlinear regression problem. Existing solution approaches to SIP rely on one of two key ideas [62,63]. In local reduction methods, a semi-infinite constraint is represented locally by a finite number of instances of the constraint, upon invoking the implicit function theorem. Alternatively, discretization (and exchange) methods replace the uncertain parameter set with a finite discretization so as to create a relaxation of the SIP, and then iteratively refine this discretization until convergence. The focus in the remainder of this paper is on the second type of methods, for which global optimality certificates can be provided upon solving the nonlinear programming (NLP) subproblems to global optimality using complete-search methods [64][65][66].
More specifically, we apply the cutting-plane SIP algorithm of Blankenship and Falk [67] to construct a nonincreasing sequence of upper bounds on the value * given by (27), and thereby a nested sequence of enclosures of R(*) ⊇ R. Within the SMR framework, this algorithm iterates between: (i) the finite-dimensional nonlinear programming (NLP) subproblems (28), where Θ₀ᵏ := {θ⁰, . . ., θᵏ} is a finite subset of Θ₀; and (ii) the feasibility subproblems (29). The subset Θ₀ᵏ at iteration k = 1 may be initialized as the empty set or, better, as a singleton containing the maximum-likelihood estimate θ̂ (see Section 2).
Under the assumptions that the likelihood function L is jointly continuous in (θ, e) and that the parameter set Θ₀ and the error set E are both compact, any accumulation point θ* of the sequence {θᵏ} corresponds to the optimal value * in (27) [67, Theorem 2.1]. In practice, the iterations may be interrupted when the termination criterion (30) is satisfied for a prescribed tolerance. Naturally, such a convergence property of the cutting-plane algorithm hinges on the ability to solve all of the nonconvex subproblems (28) and (29) to global optimality. Otherwise, the resulting threshold values * could be underestimated, leading to likelihood contours that exclude parts of the corresponding SMR regions. The practical applicability of this approach may thus be hindered by its computational complexity. One way to expedite convergence of the cutting-plane algorithm is via the addition of redundant constraints, namely constraints that do not alter the optimal solution set of the SIP (27) yet tighten the relaxations in (28); see, e.g., [68,69] for more details about KKT-based tightening in SIP. Provided that the likelihood function is sufficiently smooth, one can add the first- and second-order optimality cuts (24) as redundant constraints in the subproblem (28), yielding the modified discretized subproblem (31).

(Footnote: given that most NLP solvers do not currently support constraints in the form of linear matrix inequalities (LMI), the LMI constraint in (31) can always be substituted with an equivalent algebraic reformulation.)

In the case that none of the regression problems max_{θ ∈ Θ₀} log L(θ | x_m + e) have local (suboptimal) solutions for any e ∈ E, enforcing the semi-infinite constraint in (27) is of course equivalent to satisfying the optimality conditions (24), and the cutting-plane algorithm will trivially terminate after a single iteration. Otherwise, the intermediate solution points θᵏ of the NLP subproblems (31) might correspond to local optima of the regression problems for eᵏ ∈ E, and the algorithm keeps iterating, adding cutting planes until all of these local optima have been excluded. At that point, satisfying both the discretized semi-infinite and optimality constraints in (31) becomes equivalent to enforcing the original semi-infinite constraint in (27), and the algorithm terminates exactly, with optimality gap 0 in (30), at the next iteration. This behavior will be illustrated for the case study problem in Section 5.1.
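The discretization loop of the Blankenship and Falk algorithm can be sketched in a few lines of Python. This is an illustrative toy only: a one-dimensional SIP with a quadratic semi-infinite constraint replaces the SMR subproblems (28) and (29), a local NLP solver (scipy's SLSQP) stands in for the global solver required by the theory, and the inner feasibility problem is solved by grid search rather than to certified global optimality.

```python
import numpy as np
from scipy.optimize import minimize

def blankenship_falk(f, g, x0, y_grid, tol=1e-6, max_iter=50):
    """Cutting-plane method for min_x f(x) s.t. g(x, y) <= 0 for all y in Y,
    with Y sampled on y_grid. Returns the solution and iteration count."""
    cuts = [y_grid[len(y_grid) // 2]]          # initial finite subset of Y
    for k in range(max_iter):
        # discretized NLP relaxation over the current finite cut set
        cons = [{'type': 'ineq', 'fun': (lambda x, y=y: -g(x, y))}
                for y in cuts]
        xk = minimize(f, x0, constraints=cons, method='SLSQP').x
        # feasibility subproblem: find the most violated constraint instance
        viol = np.array([g(xk, y) for y in y_grid])
        j = int(np.argmax(viol))
        if viol[j] <= tol:                     # termination criterion met
            return xk, k + 1
        cuts.append(y_grid[j])                 # add a new cutting plane
    raise RuntimeError('cutting-plane algorithm did not converge')

# toy SIP: min x  s.t.  y**2 - x <= 0  for all y in [-1, 1]; solution x* = 1
x_opt, iters = blankenship_falk(lambda x: x[0],
                                lambda x, y: y * y - x[0],
                                x0=np.array([0.0]),
                                y_grid=np.linspace(-1.0, 1.0, 201))
```

The same loop structure carries over to the SMR subproblems, with the grid search replaced by a global solve of the feasibility problem over the error set E.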

Polyhedral enclosure
Applying a set-inversion approach to describe (an enclosure of) the SMR region R can prove computationally expensive, if at all tractable, especially for the estimation problems encountered in real-life applications. A computationally less demanding task entails the computation of a simple (axis-aligned) box enclosure of an SMR region, for instance by solving the pair of optimization problems (32) for each parameter θᵢ, i = 1 . . . n_θ. Clearly, these bounds may be computed by applying a similar cutting-plane algorithm as in Section 4.1 above, whereby the discretization subproblem (28) is now replaced with a min/max counterpart, possibly supplemented with the redundant optimality cuts (24) as in (31).
As an alternative to the direct solution of the SIP problems in (32), one can also use the likelihood-contour enclosure R(*) in (25), with the threshold * from (27), in order to construct an NLP relaxation of the SIP problem. A conservative box enclosure can be computed in this way by solving the auxiliary (potentially nonconvex) NLP problems (33). Of course, the presence of several disconnected subsets in an SMR region cannot be detected by a simple box enclosure, and information about correlations between the parameters θᵢ in the actual SMR region is also lost. Part of this information can nonetheless be recovered by constructing a polyhedral enclosure of the SMR region, expressed in the form (34) for a set of direction vectors n₁, . . ., n_m ∈ ℝ^{n_θ} and scalars δ₁, . . ., δ_m, δ̄₁, . . ., δ̄_m ∈ ℝ. Specializing the function ϕ(θ) := nₖᵀθ in Theorem 3 provides a means of constructing such non-axis-aligned polyhedral cuts. Herein, the directions nₖ are chosen so that the cuts correspond to (face or interior) diagonals of the box enclosure [θ, θ̄], with direction entries in {−1, 0, 1} and at least two nonzero entries, as in (35). Further, the limits δₖ, δ̄ₖ in (34) making the polyhedral cuts tight can be computed via the solution of the auxiliary SIP problems (36),
possibly supplemented with the redundant optimality cuts (24) once again. Similar to the box enclosure (33) earlier, conservative, yet computationally less demanding, polyhedral cuts can be derived from the likelihood-contour enclosure R(*) by solving the auxiliary NLP problems (37). Notice that the spans (δ̄ₖ − δₖ) are bounded in [0, 1] by construction. The case of a 2-dimensional face diagonal, where two direction entries equal to 1 are the only nonzero elements in (35), is shown in Fig. 5 for illustration. Enumerating all such pairs of parameters (θᵢ, θⱼ), 1 ≤ i < j ≤ n_θ, calls for the solution of 2n_θ(n_θ − 1) auxiliary optimization problems. More generally, with p ≥ 2 nonzero elements in the direction vector (writing p for their number), the number of optimization problems equals 2^p (n_θ choose p). To manage this combinatorial complexity when the number of parameters n_θ is large, it is of course possible to include only those cuts involving combinations of 2 or 3 parameters, at the price of a more conservative polyhedral enclosure. A simple way of detecting correlations among any parameter pair (θᵢ, θⱼ), 1 ≤ i < j ≤ n_θ, is to calculate the shortest-to-longest ratio between the spans (δ̄ₖ − δₖ) obtained with both direction entries equal to 1 on the one hand, and with entries of opposite signs on the other hand. A ratio close to 0 indicates an elongated set projection onto (θᵢ, θⱼ) in one of the diagonal directions, and therefore a large correlation between θᵢ and θⱼ; whereas a ratio close to 1 indicates a more spherical set projection onto (θᵢ, θⱼ). This approach is the counterpart of the shortest-to-longest axis ratio in an ellipsoidal (Wald) confidence region, which is also the basis for the so-called modified E-optimality criterion in experimental design [11]. More generally, shortest-to-longest-span ratios can be computed with more than two nonzero direction entries in order to likewise unravel correlations among more than two parameters.
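The span-ratio diagnostic can be sketched for a sampled 2-D region. This is a schematic stand-in: the point cloud below plays the role of a sampled SMR region projection, coordinates are first scaled to the unit box (an assumption matching the normalization that keeps spans in [0, 1]), and the projection onto each diagonal is divided by the number of nonzero direction entries.

```python
import numpy as np

def diagonal_span_ratio(points):
    """Shortest-to-longest ratio between the spans of a 2-D point cloud along
    the two diagonal directions n = (1, 1) and n = (1, -1), after scaling the
    coordinates to the unit box so each span lies in [0, 1]."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    z = (points - lo) / (hi - lo)               # scale to the unit box
    spans = []
    for n in (np.array([1.0, 1.0]), np.array([1.0, -1.0])):
        proj = z @ n / np.abs(n).sum()          # normalized diagonal projection
        spans.append(proj.max() - proj.min())
    return min(spans) / max(spans)

# elongated "region": an ellipse with semi-axes 1.0 and 0.1 along the diagonals
phi = np.linspace(0.0, 2.0 * np.pi, 721)
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
v = np.array([1.0, -1.0]) / np.sqrt(2.0)
pts = np.outer(np.cos(phi), u) + 0.1 * np.outer(np.sin(phi), v)
ratio = diagonal_span_ratio(pts)   # ~0.1: elongated set, strong correlation
```

A ratio of about 0.1 flags a strongly elongated projection, i.e., a large correlation between the two parameters, exactly as in the interpretation above.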
Other classical criteria, such as the A-optimality and D-optimality criteria, also have counterparts in the SMR framework, given by the sum of the parameter ranges (θ̄ᵢ − θᵢ), i = 1 . . . n_θ, and the volume of the polytope (34), respectively. To conclude this subsection, it is worth mentioning that the construction of such polyhedral enclosures is also relevant to the approximation of classical inference regions, for instance the likelihood-ratio confidence regions (7).
Illustrative example (continued). Various enclosures of SMR-ℓ2 regions are presented in Fig. 6 for the BOD case study, here with either 4 or 8 measurement points. The measurement error set E is constructed based on (21) at the confidence level 1 − α = 0.9. The threshold values * in the likelihood-contour enclosures R(*) (25) are computed using the cutting-plane SIP algorithm described in Section 4.1, with first-order optimality cuts as in the discretized subproblem (31). When the subsets Θ₀ᵏ are initialized with the corresponding maximum-likelihood estimate θ̂, the cutting-plane algorithm finds the exact solutions * during the first iteration, irrespective of the number of measurement points. Even for such a simple estimation problem though, solving the discretized subproblem (31) to global optimality using GAMS-BARON proves computationally challenging as the number of measurements increases, taking 404 CPU-sec for 4 measurement points and 3810 CPU-sec for 8 measurement points, and failing to close the gap within 7200 CPU-sec for 16 measurement points. The GAMS code is provided as part of the Supplementary Information (see Appendix C) for the sake of reproducibility. The likelihood-contour enclosures R(*) are found to provide a very close approximation of the SMR-ℓ2 regions in Fig. 6; these enclosures are computed using the set-inversion algorithm described in Appendix A. This is expected given the fast convergence between the SMR and likelihood-based confidence regions already observed in Fig. 4, and confirmed by the comparison in Table 1 between the thresholds defining these two confidence regions.

Fig. 6. Comparison of outer-approximation strategies to enclose the SMR-ℓ2 regions for the BOD example with 4 (left) and 8 (right) measurement points: enclosures based on likelihood-contour cuts (25) and polyhedral cuts (34). The error set E is constructed based on (21) at the confidence level 1 − α = 0.9. The triangles represent the real parameter values.

Table 1. Comparison between the thresholds * and log L(θ̂ | x_m) − ½ χ²₂(0.9), corresponding to the SMR-ℓ2 region (16) and the likelihood-based confidence region (7), respectively, for the BOD example with 4, 8 and 16 measurement points.
For simplicity, the polyhedral cuts in Fig. 6 are constructed from the likelihood-contour enclosures R(*) rather than from the actual SMR-ℓ2 regions R. The numerical solution of the auxiliary NLP subproblems (33) and (37) to global optimality using GAMS-BARON is fast in comparison with the SIP problems, taking less than 1 CPU-sec.
Finally, the shortest-to-longest-span ratios in the polyhedral enclosures of the SMR-ℓ2 regions for 4, 8 and 16 measurement points are 0.284/0.965 ≈ 0.294, 0.249/0.970 ≈ 0.257 and 0.256/0.967 ≈ 0.265, respectively. These small ratios (compared to 1) indicate that the SMR regions are 3- to 4-times flatter in one diagonal direction than in the other, revealing a strong correlation between θ₁ and θ₂ in (13), in agreement with the visual impression from Fig. 6.

Case study in temperature-dependent microbial growth
We now apply the SMR framework to a more challenging estimation problem in microbial growth, emphasizing the properties of the SMR regions and drawing comparisons with other set-membership and statistical inference methods. Two models describing the effect of the culture temperature, T, on the growth rate, μ, of a microbial population, each comprising four parameters, are considered:

(i) The Ratkowsky model [71], where T_min and T_max (K) represent the minimal and maximal growth temperatures, respectively, while b (K^-1 h^0.5) and c (K^-1) are extra parameters adding flexibility to the shape of the growth model.

(ii) The cardinal temperature model [72], where T_min and T_max (K) also represent the minimal and maximal growth temperatures, respectively; T_opt (K) corresponds to the optimal growth temperature; and μ_opt (h^-1) is the maximal growth rate, attained at T_opt.

Experimental data used in the regression are from [71] for the bacterium E. coli. This data set comprises 15 measurement pairs (T_k, μ_k) within the temperature range 294-320 K, and it is reproduced in Appendix B for completeness. The standard deviation of the growth rate measurements is taken as σ = 0.1 h^-1 throughout. Results of a maximum-likelihood estimation with constant-variance, Gaussian-distributed errors, or, equivalently, a standard least-squares regression, are presented in Fig. 7. Both models' predictions are found to be in good agreement with the experimental data, yet with a higher likelihood for the cardinal temperature model. Note also that errors are only taken into account for the growth rate measurements (outputs) herein, i.e. the temperature measurements (inputs) are considered exact.
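The displayed model equations were lost from the text above; for reference, the standard published forms of the two growth models can be coded as follows. The parameter values in the usage example are purely hypothetical placeholders, not the estimates obtained from the E. coli data in [71].

```python
import math

def ratkowsky(T, T_min, T_max, b, c):
    """Extended Ratkowsky model:
    sqrt(mu) = b*(T - T_min)*(1 - exp(c*(T - T_max))),
    so the growth rate mu vanishes at both T_min and T_max."""
    root = b * (T - T_min) * (1.0 - math.exp(c * (T - T_max)))
    return root * root

def cardinal_temperature(T, T_min, T_max, T_opt, mu_opt):
    """Cardinal temperature model (Rosso-type): mu equals mu_opt at T_opt
    and vanishes at the cardinal temperatures T_min and T_max."""
    num = (T - T_max) * (T - T_min) ** 2
    den = (T_opt - T_min) * ((T_opt - T_min) * (T - T_opt)
                             - (T_opt - T_max) * (T_opt + T_min - 2.0 * T))
    return mu_opt * num / den

# hypothetical parameter values, for illustration only
print(ratkowsky(300.0, 277.0, 320.0, 0.02, 0.3))
print(cardinal_temperature(314.0, 277.0, 320.0, 314.0, 2.0))  # = mu_opt at T_opt
```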

Computational procedure and performance
For both candidate models we use the cutting-plane SIP algorithm of Section 4.1 to compute the threshold values * (27), and we then describe tight likelihood-contour enclosures R(*) (25) of the SMR regions using a set-inversion algorithm (see Appendix A). We apply a similar cutting-plane SIP algorithm to determine the box and polyhedral enclosures based on (32) and (36), with two nonzero direction entries, as described in Section 4.2. First-order optimality cuts are added in the discretized NLP subproblems of all the SIP problems, as in (31), in order to expedite the convergence of the cutting-plane algorithm, and the sets Θ₀ᵏ are initialized with the maximum-likelihood estimates θ̂ at iteration k = 1. All of the NLP subproblems in the SIP algorithm are solved with the global solver GAMS-BARON; these GAMS codes are provided as part of the Supplementary Information (see Appendix C) for reproducibility. Lastly, the set-inversion computations are carried out using our in-house library CRONOS [44], which is available from https://github.com/omega-icl/cronos.
In the case of the cardinal temperature model, a single iteration is needed to solve all of the SIP problems exactly, with optimality gap 0 in (30). This behavior indicates that none of the regression problems for this model exhibit local, suboptimal solutions for the measurement error sets of interest in Sections 5.2 and 5.3 below. In the case of the Ratkowsky model, the SIP problems are also solved exactly, but the cutting-plane algorithm terminates after several iterations due to the presence of local optima; for instance, computing the solution value * of (27) for the SMR problem in Section 5.2 takes 10 iterations, as shown in Fig. 8.
Even though the cutting-plane SIP algorithms terminate exactly (after a single iteration or several), certifying global optimality for most discretized NLP subproblems is currently intractable with the state-of-the-art global solvers BARON [73] and ANTIGONE [66]. As already discussed in Section 4.1, this lack of guarantees could result in the likelihood contours or polyhedral cuts excluding parts of the actual SMR regions. The odds of missing a global optimum in a discretized NLP subproblem are nonetheless mitigated by letting BARON or ANTIGONE run up to a time limit of 7200 CPU-sec here.

SMR with jointly Gaussian-distributed errors
We consider SMR-ℓ2 regions, where the error set E corresponds to the HPD region of a joint Gaussian distribution, as in (21). In order to draw on the asymptotic equivalence with a 95% confidence region in classical frequentist inference (Theorem 2), we select a 15% HPD region for the joint Gaussian distribution of the measurement errors here (see Remark 3). The likelihood-contour and polyhedral enclosures of these SMR regions are compared in Figs. 9 and 10 for the Ratkowsky and cardinal temperature models, respectively. The results of a random sampling, which by construction lie inside the actual SMR regions, are also shown in these plots.
Since the polyhedral cuts are tight by construction (Theorem 3), the seemingly large discrepancy between these cuts and the sampled SMR regions in Figs. 9 and 10 is mainly attributable to the sampling not being sufficiently exhaustive. Moreover, the comparisons between the polyhedral and likelihood-contour enclosures in these figures show that the conservatism introduced by the latter remains small for both models in the present case of jointly Gaussian-distributed measurement errors.
Reported above each plot in Figs. 9 and 10 are the shortest-to-longest-span ratios in the polyhedral enclosure for the various parameter pairs (see Section 4.2). With the Ratkowsky model, all of these ratios happen to be smaller than 0.4, and even lower than 0.25 for the parameter pair (T_min, b), thereby suggesting strong correlations in the parameter set (T_min, T_max, b, c). With the cardinal temperature model, by contrast, most of the ratios are close to or above 0.5, suggesting much weaker correlations among the parameters (T_min, T_max, T_opt, μ_opt). Moreover, the SMR intervals for the parameters T_min and T_max, which appear with the same interpretation in both models, are much larger for the Ratkowsky model than for the cardinal temperature model. On the basis of these results, a modeler would normally retain the cardinal temperature model over the Ratkowsky model.
Although the Wald confidence ellipsoids in Figs. 9 and 10 differ significantly from the SMR region enclosures, similar conclusions can nonetheless be drawn with regard to parameter precision and correlation for both the Ratkowsky and cardinal temperature models, based on the main axes of the projected Wald ellipsoids. One can also compare the threshold of a likelihood-contour enclosure with its likelihood-based confidence region counterpart (7): for the Ratkowsky model, we find * ≈ 5.9, close to the corresponding value of log L(θ̂ | x_m) − ½ χ²₄(0.95). The two thresholds being quite close to each other for both models provides yet another illustration of the asymptotic equivalence between classical statistical inference approaches and SMR for such choices of the error set (viz. Section 3.2). Also notice the higher likelihood threshold of the cardinal temperature model compared with the Ratkowsky model, which provides yet another indication of a much more confident estimation.

SMR with independently-distributed errors
We consider alternative SMR-ℓ2 regions, where the error set E now comprises independent, 1-sigma error bounds on the measurements, E := { e ∈ ℝ¹⁵ : |e_μ,k| ≤ σ, k = 1 . . . 15 }. (38)
Similar to the jointly Gaussian-distributed case in Section 5.2 above, we compare various approximations of such an SMR region for the cardinal temperature model in Fig. 11; namely, the tight likelihood-contour and polyhedral enclosures, and an inner approximation using random sampling. Despite the error set E in (38) now being significantly different from a Gaussian HPD region, the polyhedral enclosures turn out to be comparable in shape and size to those in Fig. 10, and the shortest-to-longest-span ratios for the various parameter pairs are similar too. The likelihood-contour enclosure in Fig. 11 also describes a rather close approximation of the SMR region, albeit more conservative than in the jointly Gaussian-distributed case of Fig. 10. A similar behavior is obtained for the Ratkowsky model (results not shown).
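The random-sampling inner approximation used throughout can be sketched as follows. This is a schematic stand-in only: a hypothetical two-parameter exponential model and made-up nominal data replace the actual growth models and E. coli data set, and each regression is solved with a local least-squares solver rather than to guaranteed global optimality.

```python
import numpy as np
from scipy.optimize import least_squares

# hypothetical model y = th1*(1 - exp(-th2*t)) with made-up nominal data;
# the paper's case studies use the BOD and microbial growth models instead
t = np.array([1.0, 2.0, 3.0, 4.0])
theta_ref = np.array([20.0, 0.5])
y_nom = theta_ref[0] * (1.0 - np.exp(-theta_ref[1] * t))
sigma = 0.5                                   # half-width of the error bounds
rng = np.random.default_rng(0)

def refit(y):
    """Least-squares regression for one perturbed data set."""
    res = least_squares(lambda th: th[0] * (1.0 - np.exp(-th[1] * t)) - y,
                        x0=np.array([15.0, 1.0]))
    return res.x

# one regression per sampled error e in the box E = [-sigma, sigma]^4;
# the resulting estimates lie inside the SMR region by definition
samples = np.array([refit(y_nom + rng.uniform(-sigma, sigma, size=t.size))
                    for _ in range(200)])
```

Scatter-plotting the rows of `samples` gives the kind of inner approximation shown in Figs. 9-11; since sampling cannot be exhaustive, the cloud generally under-covers the true SMR region, as noted above.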
In addition to the SMR region approximations, Fig. 11 displays the guaranteed parameter region given by (12) for the same error set (38). One can check that the inclusion property established in Theorem 1 holds. The guaranteed parameter region turns out to be much smaller than the SMR region here, due both to model mismatch and to underestimation of the measurement noise. For the Ratkowsky model, the guaranteed parameter region even happens to be empty for these data and error sets. Therefore, unlike SMR regions, guaranteed parameter regions do not provide a reliable means of detecting parameter correlations in the present case.

Conclusions and future research directions
This paper has introduced set-membership nonlinear regression (SMR), a new approach to parameter estimation which seeks to determine the subregion in parameter space enclosing all (global) solutions to a nonlinear regression problem subject to uncertain observations. An SMR region is thus understood as comprising those parameter values that are equally credible under the selected observation error set, in the sense of that regression problem. In particular, this interpretation is not conditional upon the model's structure being correct. Another distinctive feature of SMR is its ability to consider likelihood functions and error sets other than those corresponding to jointly Gaussian-distributed errors, including least-absolute-error (ℓ1) regression, and independent error distributions or simple error bounds when the underlying error statistics are unknown.
In a bounded-error context, SMR provides a means of robustifying existing guaranteed parameter estimation methods. By drawing on the principles of maximum-likelihood estimation, an SMR region encloses the corresponding guaranteed parameter set and, unlike the latter, it does not become empty in the presence of large model mismatch or measurement errors and outliers. From a statistical inference viewpoint, SMR has been shown to be asymptotically equivalent to the Wald confidence regions for specific choices of the measurement error set. It will be important to keep developing the underlying SMR theory as part of future work, so as to better grasp the links with both frequentist and Bayesian statistical inference.
Another important contribution of this paper is a computational framework for describing tight enclosures of the SMR regions, in the form of likelihood-contour and polyhedral enclosures. These enclosures can be described via the solution of auxiliary optimization problems, which are typically nonconvex and embed semi-infinite constraints. While tractable in principle using global optimization techniques based on complete search, our experience with such optimization problems is that they challenge state-of-the-art global solvers such as BARON and ANTIGONE, even for small-scale estimation problems, as exemplified with the BOD and microbial growth case studies. Tackling larger-scale problems, including errors-in-variables formulations, is a clear call for improved global search techniques, e.g., by exploiting problem structure or creating redundancy to strengthen the relaxations, or by combining with effective heuristics to increase the likelihood of finding a solution early on during the search [65].
One straightforward extension of the SMR methodology includes parameter estimation problems with other sources of uncertainty than just measurement errors. In principle, any set of nuisance parameters could be accounted for in the regression framework based on a description of the corresponding uncertainty set, similar to the measurement error set E in (16).
Lastly, it is worth reiterating that the SMR framework can also be extended to parameter estimation in dynamic systems. The main bottleneck in doing so is computational rather than conceptual, since limited work has been published to date on SIP with differential equations embedded [74]. For instance, applying the cutting-plane SIP algorithm of Section 4 to the dynamic case would rely on efficient complete-search methods for global optimization and constraint satisfaction in dynamic optimization problems [44,75,76].

Data statement
No new data was collected in the course of this research.