Learning non-stationary and discontinuous functions using clustering, classification and Gaussian process modelling

Surrogate models have shown to be an extremely efficient aid in solving engineering problems that require repeated evaluations of an expensive computational model. They are built by sparsely evaluating the costly original model and have provided a way to solve otherwise intractable problems. A crucial aspect in surrogate modelling is the assumption of smoothness and regularity of the model to approximate. This assumption is however not always met in reality. For instance in civil or mechanical engineering, some models may present discontinuities or non-smoothness, e.g., in case of instability patterns such as buckling or snap-through. Building a single surrogate model capable of accounting for these fundamentally different behaviors or discontinuities is not an easy task. In this paper, we propose a three-stage approach for the approximation of non-smooth functions which combines clustering, classification and regression. The idea is to split the space following the localized behaviors or regimes of the system and build local surrogates that are eventually assembled. A sequence of well-known machine learning techniques are used: Dirichlet process mixtures models (DPMM), support vector machines and Gaussian process modelling. The approach is tested and validated on two analytical functions and a finite element model of a tensile membrane structure.


Introduction
Computational models, which allow scientists and engineers to accurately simulate complex systems and predict their behaviour in various contexts, are nowadays a key tool present in 1 virtually all fields of applied sciences and engineering.Cast as computer experiments, they are able to predict with high fidelity the behaviour of the studied system in replacement of, or as a complement to laboratory experiments.The downside of such high-fidelity models is however that they are computationally demanding.This is even more relevant in the context of uncertainty quantification or design optimization, where the models need to be evaluated multiple times.
Surrogate models have become paramount in such fields as they allow for an efficient solution of otherwise computationally intractable problems.They are inexpensive proxies that can be used in lieu of expensive computational models.Examples of such surrogates include Gaussian process models also known as Kriging (Santner et al., 2003;Rasmussen and Williams, 2006), polynomial chaos expansions (Xiu and Karniadakis, 2002;Blatman and Sudret, 2011), support vector machines (Vapnik, 1995), polynomial response surfaces (Myers and Montgomery, 2002), etc.These methods have been applied in various problems pertaining to uncertainty quantification or design optimization.The use of surrogate models in such fields are now mature as shown by the recent reviews in reliability analysis (Teixeira et al., 2021;Moustapha et al., 2022), Bayesian inversion (Yan and Zhang, 2017) or design optimization (Chatterjee et al., 2019;Moustapha and Sudret, 2019a).
In most of these applications, it is assumed that the computational models to approximate feature some accommodating properties such as smoothness, differentiability or stationarity.
Yet there exists cases when these assumptions do not hold.In mechanical engineering, this may happen for instance when solving non-linear problems involving instability such as snap-through or bifurcations in the solution path, e.g., crash simulation.In computational fluid dynamics, simulations of compressive flows that involve shocks also belong to this category.In other cases, the underlying phenomenon may present different localized features or extreme regime variations which are strongly dependent on the inputs.
Various methods have been developed in the field of uncertainty quantification to tackle such problems.The first class of methods borrows from digital signal processing and image detection to identify discontinuities or strong gradients of the function to approximate using techniques such as polynomial annihilation (Le Maître et al., 2004;Gorodetsky, 2012).Such approaches however rely on uniformly sampled grids and are often limited to two-dimensional problems.Sargsyan et al. (2012) proposed a technique combining Bayesian inference and polynomial chaos expansions that does not require using a regular grid and hence allowing for a reduced number of samples.However, their approach was also developed for two-dimensional problems and the authors did not investigate how well it scales with dimensionality.
Another class of methods relies on Gaussian process (GP) regression where the irregularities on the model to approximate are tackled by introducing non-stationary covariance functions or 2 kernels.Indeed, such kernels allow one to capture heterogeneous variations or heteroscedastic noise while keeping the computational budget low.The direct approach to build such kernels is to consider the noise variance, signal variance and/or characteristic length scale to be input-dependent, such as in Paciorek and Schervish (2003).Heinonen et al. (2016) proposed an approach where all three parameters are considered latent variables and inferred as hyperparameters of the GP.Such an approach has shown increased efficiency compared to vanilla GP but it also comes with an increased inference cost due to the fact that there are no more closedform solution and the hyperparameters need to be calibrated using sampling based techniques (See Rasmussen and Ghahramani (2001)).Furthermore, they do not allow to tackle problems with discontinuities.
A more sensible approach based on non-stationary GP consists in splitting the input space using for instance treed Gaussian processes or a mixture of experts (Tresp, 2000;Rasmussen and Ghahramani, 2001;Meeds and Osindero, 2005).Similarly, it is also possible to define nonstationary Gaussian process models by partitioning the training data into smaller subsets using clustering techniques, such as in Zhang et al. (2019) and Konomi et al. (2019), where K-means and nearest-neighbors clustering are used.Such approaches also have the advantage of offering faster training and testing of the model as the experimental design is divided into smaller and more computationally manageable subsets.Finally, another popular way to define non-stationary kernels is by warping the input, and sometimes the output, space.By doing so, one may find a latent space where the function to approximate is smoother.Examples of such techniques include warped GP (Marmin, 2018) or manifold GP regression (Calandra et al., 2016;Kuleshov et al., 2018).
In this work, we will focus on multi-stage techniques where the problem is solved by using a sequence of well-known machine learning techniques.More specifically, we consider the class of methods based on the following three-stage approach: clustering, classification and regression (Boroson and Missoum, 2017;Dupuis et al., 2018).Basudhar and Missoum (2008); Serna and Bucher (2009) were the first to propose decomposing the problem of identifying multiple failure domains of mechanical systems using support vector machines.However, they do not include the regression step as they are only concerned with an optimization problem where only the state of a sample is of interest (i.e., whether it belongs to the failure domain or not).Moustapha (2016); Moustapha and Sudret (2019b) extended the approach to the prediction of the model responses by building local Kriging surrogates in each identified domain.However in all these approaches, it was assumed that the clusters were identified either using expert knowledge or by only considering the model responses which span different ranges.Niutta et al. (2018) proposed identifying the clusters by detecting jumps in the model responses for relatively close samples.However, this technique works only in low-dimensional problems and when the response of different clusters are disjoint.This is a strong limitation and was to some extent overcome by using joint clustering of both the inputs and outputs in Bernholdt et al. (2019).In that work, they use K-means clustering to identify the clusters and multi-layer perceptrons for classification and regression tasks.The number of clusters is defined here using the elbow approach, which is a visual technique requiring user interaction.Furthermore it is not robust w.r.t. the initialization of the K-means algorithm and noise in the data.More generally, an important limitation in the contributions presented above is that the three steps are disconnected and the prediction uncertainty in one step is not accounted for in the subsequent ones.
In this paper, we propose an approach that aims at solving these two limitations.First, to automatically identify the number of clusters in a robust way, we consider a non-parametric Bayesian technique, namely Dirichlet process mixture models (DPMM).The interest in using DPMM are three-fold: i. they automatically estimate the optimal number of clusters according to patterns identified in the data, ii.they offer a probabilistic framework that allows one to propagate the epistemic uncertainty related to this clustering task to both the subsequent classification and regression steps, and iii.they are flexible enough and their complexity can grow as new data is observed (for instance in an active learning scheme, where new regimes of the model could be identified).
In the remainder of this paper, we first present the three-stage methodology and how the steps are connected in Section 2. In Section 3, we present in details the three methods used in each step, namely, Dirichlet process mixture models, support vector machines for classification and Gaussian process modelling.We finally illustrate the proposed approach in Section 4 using two analytical examples and an engineering application related to the design of a tensile membrane structure (Valdés-Vázquez et al., 2020, 2021).
2 Problem set-up and three-stage approach Let us consider a set of N data points (X , Y) where X = x (i) ∈ X ⊂ R M , i = 1, . . .N is a set of M -dimensional inputs and Y are corresponding scalar outputs such that is only accessible through an evaluation over a finite set of input points.We further assume in this setting that the model is non-smooth, i.e., it exhibits sharp localized features and, most noticeably, discontinuities.As the model can only be evaluated on a finite set of samples, discontinuities in the current work is assumed when the model presents extreme variations in the outputs for seemingly close input points.
The goal of the analysis is to learn the input-output relationship of the model M through the limited set of training data D = (X , Y), also known as experimental design.This ultimately leads to a cheaper-to-evaluate surrogate model that can be used to predict the response of the model for any new point.Generally, this type of problems is tackled using regression techniques where a class of parameterized models are assumed and then their hyper-parameters are calibrated so as to minimize a generalization error.Such models would however fail when there are discontinuities or heterogeneous variations associated to limited observations.In this work, we consider tackling this problem by splitting the space along the discontinuities and building local regression models in each of the obtained subdomains.To achieve this, we consider a three-stage framework which is illustrated in Figure 1 and summarized as follows:

Clustering
Classification Regression 1. Clustering: The first learning step aims at identifying patterns in the data that hint to subdomains separated by discontinuities.To achieve this, we cluster the joint input-output data points.This is an unsupervised learning problem for which numerous techniques have been developed (Pham and Afify, 2017).K-means clustering (Lloyd, 1982) is probably the most common approach thanks to its simplicity.However, it assumes that the number of clusters is known and further fails when the clusters are of disproportionate sizes.Another approach that partially overcomes difficulties related to K-means clustering are Gaussian mixture models which offer a probabilistic framework for clustering (Rokach and Maimon, 2005).They hence allow for a more nuanced clustering of the data by providing soft cluster memberships, i.e., each data point is assigned with a probability of belonging to a given cluster.This feature allows one to solve more complex problems, e.g., when the clusters are partially overlapping.However, similarly to K-means, they assume that the number of clusters is known in advance.In general, trial-and-errors approaches are used to define the optimal number of clusters for such problems, which is not optimal.
We therefore consider in this work a more holistic approach where the number of clusters is also inferred from the data using a non-parametric Bayesian model, more specifically Dirichlet process mixture models (Li et al., 2019) as described in Section 3.1.
At the end of this step, the experimental design is split into K subsets C k , k = 1, . . ., K.
2. Classification: Assuming that the data have been clustered, we can now place labels on them and turn to supervised learning.More specifically, let us assume K clusters are identified in the previous step.We thus define the labels {ℓ 1 , . . ., ℓ K } and the labelled training data X × L where each couple The goal of this step is then to partition the input space such that any new sample can be mapped to at least one of the clusters C k .This will ultimately allow us to select the appropriate local regression model(s) to evaluate the new point.
This task is carried out in this work by using support vector machines (SVM) for binary and multi-class classification (Vapnik, 1995).The probabilistic framework is introduced by considering Platt's approach to computing posterior probabilities given a binary SVM prediction (Platt, 2000).For multi-class problems, binary classifiers are appropriately combined to provide both class membership and posterior probabilities.

Regression:
In this final step, Gaussian process (GP) models (Rasmussen and Williams, 2006) are employed to make the final prediction.We further investigate the use of three different approaches for combining the various GP models built in this stage.In the first two approaches, local surrogate models M k are built for each of the K identified clusters.
When it comes to prediction, the recombination is made as follows: Hard recombination: In this approach, the surrogate model which corresponds to the cluster predicted by the classifier is solely used to make the final prediction, i.e., where ℓ k and 0 otherwise; Soft recombination: In this approach, the prediction for each point is obtained as a weighted combination of all the local surrogate models, i.e., where the weight w k (x) ∈ [0, 1] with K k=1 w k (x) = 1 may be related to the actual probability that the point x belongs to the cluster C k as defined by the classifier.
Categorical recombination: Contrary to the previous two approaches, a single Gaussian process model is built here.This is achieved by using an additional variable which is a categorical parameter indicating which cluster a given point belongs to, i.e., the training set is the couple {X , L} × Y where L = ℓ (i) , i = 1, . . ., N are the labels of the training set identified in the clustering stage.The surrogate model is therefore built on a space of dimension M + 1: M (x) = M cat x = x, ℓ (x) , where the categorical variable is given by the SVC prediction, i.e., ℓ (x) = M SVC (x).
The following section describes in details each of the ingredients introduced in the proposed framework.
3 Description of the components of the proposed method 3.1 Clustering using Dirichlet process mixture models

Gaussian mixture models
Let us now consider the set of available data W = w (i) , i = 1, . . ., N , where w (i) = x (i) , y (i) is a vector gathering both inputs and outputs, and let us assume that they are associated to some latent variables z.In a clustering set-up, say using a Gaussian mixture, the latent variables would be z = {π, µ, Σ} where π are mixing coefficients and µ and Σ are the mean and covariance of multivariate normal random variables.The goal is then to find the posterior distribution p(z|w) of the latent variables given the data and using Bayes rules, i.e., where p (w|z) is the data likelihood, p (z) = p (π, µ, Σ) is the prior over the latent variables and p (w) is the evidence.
The prior can be fully factorized into p (π) p (µ) p (Σ) since the three parameters are considered mutually independent.The prior on the mixing coefficients p (π) is usually chosen as a Dirichlet distribution with parameters α/K where α is a positive scaling parameter and K is the predefined number of clusters: where Γ is the Gamma function.
The Dirichlet distribution is chosen precisely because it is the conjugate distribution to the multinomial distribution, which is used for clusters membership assignment, later denoted by c.
The generative model for data derived from a Gaussian mixture model can therefore be cast as where µ k and Σ k are respectively the mean and covariance parameters of each local Gaussian distribution in the mixture.
It is generally assumed in such a model that K << N , which in other words means that samples from all clusters have been observed.However, there may exist cases when K is in the same order or even larger than N .An alternative view to such cases is that at any moment all clusters have not yet been observed and drawing more data from the generative model will reveal new clusters.This naturally leads to extending this finite mixture model into an infinite one using non-parametric Bayesian models whose complexity can grow as more data are observed.
This is precisely what a Dirichlet process mixture model does.It generalizes the generative model described in Eq. ( 5) by assuming an infinite number of clusters, i.e., that K → ∞.This corresponds to choosing a Dirichlet process (Ferguson, 1973) as prior for the mixing coefficients, as explained in the sequel.

Dirichlet process
A Dirichlet process (DP) is a distribution over distributions defined by a base distribution G 0 and a positive scaling parameter α.The output from a Dirichlet process is therefore a discrete distribution.It is however not possible to directly draw from G considering the formal definition of a Dirichlet process.Other alternative views such as the Chinese restaurant process (Aldous, 1985), the Pólya urn scheme (Blackwell and MacQueen, 1973) or the stick-breaking representation (Sethuraman, 1994) have been proposed instead.
In this work, we consider the latter approach.More specifically, let us consider an infinite collection of two random variables V k ∼ Beta(1, α) and η * k ∼ G 0 with k = {1, 2, . ..}.The stick-breaking representation of G is then defined as follows: where δ is the Kronecker symbol.This representation is illustrated in Figure 2 where the η * k are location parameters also known as atoms and π k are corresponding weights.
In a DP, there is a countably infinite number of atoms and the weights sum up to 1, making G a discrete distribution.This infinite set of atoms lends itself to modelling priors in infinite mixture models.More specifically, the DP is used in Dirichlet process mixture models as a non-parametric prior in a hierarchical Bayesian model specified as follows (Antoniak, 1974;Blei Given a dataset W, each data point w (i) is assumed to be generated by first drawing a component label c (i) = {1, 2, . ..} with probability distribution p(c (i) = k|V ) = π k (v) and then drawing w (i)   from p(w (i) |η k ).In this work, p is chosen as a distribution from the exponential family for which G 0 is a conjugate prior, which turns out to also belong to the exponential family and hence making inference easier.

Posterior estimation
The latent variables in this setting are therefore z = {v, η, c}.The goal of the analysis is then to find the posterior distribution of these latent variables given the observed data W, which is denoted by p (z|W, θ).There is no closed-form solution to this problem and typical solution schemes rely on Markov Chain Monte Carlo (MCMC).MCMC algorithms allow one to obtain an approximation of the posterior using Markov chains whose stationary distribution is the sought posterior.The usual approach in Dirichlet process mixture models is Gibbs sampling which is particularly suited to this task as one can have access to the conditional distributions of the latent variables analytically (Neal, 2000;Ishwaran and James, 2001).However, the difficulty with MCMC algorithms is that they are expensive, as they require a large number of samples, often generated sequentially, and their convergence is difficult to monitor.
An alternative approach to circumvent these issues is variational inference, where the esti-mation of the posterior is replaced by an optimization problem (Wainwright and Jordan, 2003).
More specifically, the intractable posterior is replaced by a parametric family of variation distributions denoted here by q ν (z|ν).In this paper, we consider the approach proposed by Blei and Jordan (2006) which relies on the mean-field approximation, i.e., the variational distribution is fully factorized (all the latent variables are mutually independent).The optimization problem then consists in finding within the selected family of variational distributions the values of the hyperparameters ν that will minimize the Kullback-Liebler (KL) divergence between the true posterior and its approximation.This quantity reads By noting that the divergence is always positive (or using Jensen's inequality), it can be shown that minimizing Eq. ( 8) is equivalent to maximizing a lower bound of the marginal log likelihood, also referred to as ELBO and denoted by By appropriately choosing the family of variational distributions for each latent variable, it is possible to make the computation of the ELBO tractable.In the approach proposed by Blei and Jordan (2006) considered here, the factorized variational distribution is cast as where q γt (v t ) are Beta distributions, q τt (η t ) are exponential family distributions and q Φ k (c k ) are multinomial distributions.In this equation, the infinite samples is truncated to T terms by setting q(v T = 1) = 1, which implies that π t (v) = 0 for t ≥ T .The solution to this problem is eventually obtained using a coordinate ascent algorithm for which the incremental updates can be computed analytically (Ghahramani and Beal, 2000).The reader is referred to Blei and Jordan (2006) for further details.
3.2 Classification using support vector machines

Binary classification
Support vector machines are a popular supervised learning algorithm developed by Vapnik (1995).They were developed for binary classification and were later extended to account for multiple classes.Let us first consider the binary case (i.e., assuming only two clusters were identified) and denote the dataset by x (i) , ℓ (i) , i = 1, . . ., N where ℓ (i) = {−1, 1} are the labels of the training points.
Given this training set, the support vector classifier (SVC) prediction for any yet-to-be observed sample reads (Smola and Schölkopf, 2004) where {α, b} are parameters to calibrate.The coefficients α i , some of which are the so-called support vectors, and the offset parameter b are obtained by solving a quadratic optimization problem min where h = {−1, . . ., −1} is a column vector of size N and K = K + 1/CI N with C > 0 being a penalty term.The matrix K is the so-called Gram matrix built by evaluating the parameterized kernel function on pairs of points of the training data set, such that K ij = k x (i) , x (j) ; θ , i, j ∈ {1, . . ., N }.Multiple kernels have been used in SVM.In this work, we consider the Gaussian kernel defined by The

Extension to multi-class classification
Let us now consider the case when the classification task aims at categorizing data with a set of K > 2 labels, where each label is defined as ℓ (i) = ℓ k if the original training pair x (i) , y (i)   belongs to the cluster C k .
The most popular approach to tackle this multi-class problem is to reduce it to a series of binary classification problems that can be solved using a standard SVM algorithm.The two most popular approaches are the one-against-all and the one-vs-one decomposition schemes (Hastie and Tibshirami, 1997;Moreira and Mayoraz, 1998).In the former, one binary problem is derived for each class k by assigning one label, say the positive one, to all samples such that ℓ (i) = ℓ k and the negative label to all the other samples.In the one-vs-one approach, binary classifiers considering all pairs of labels and ignoring all other samples are built.This leads to a total of K(K − 1)/2 classifiers, which is larger than the K classifiers required by the one-against-all approach.However, such classifiers are trained on a noticeably smaller subset of the training samples making the overall procedure computationally efficient despite the larger number of classifiers to build.
Both approaches can be generalized, or somehow combined, using concepts of the error correcting output codes (ECOC) (Dieterich and Bakiri, 1995).The recombination of the binary classifiers into a final one can be achieved either by a simple voting system or by considering the posterior probabilities derived from each classifier.In this work, we consider the one-vs-one approach with a final voting system thanks to its simplicity and efficiency.We note that in case of equal voting between two classes, we heuristically choose the class that was predicted with the classifier that considered the two classes of interest.

Posterior probabilities
As mentioned in Section 2, the soft recombination of the final predictor requires some weights which are proportional to the probability that the sample belongs to a given class.In case of SVM, such weights can be derived by computing posterior probabilities derived from the classifier.In practice, this can be achieved by post-processing the output of the classifier using a sigmoid function as proposed by Platt (2000): where the coefficients A and B are calibrated by solving a regularized maximum likelihood problem.In this work, we use an efficient numerical implementation proposed by Lin et al. (2007).
There have been many attempts to extend these probabilities to multi-class problems (Hastie and Tibshirami, 1997;Moreira and Mayoraz, 1998;Wu et al., 2004;Wang, 2008).Let us denote by the posterior probability provided by the classifier that discriminates between the classes C i (positive) and C j (negative).Note however that we are interested in the overall probability of belonging to a class given all possible classes, i.e. p i = P (x ∈ C i ).Moreira and Mayoraz (1998) proposed estimating this probability by combining the partial ones, i.e., This value is however flawed, as it accounts for spurious probabilities defined by classifiers discriminating two classes, none of which being the true one.
Using Bayes theorem, it can however be noted that which, by averaging over all combinations of i and j, leads to the following system of equations: since P (x ∈ C i ∪ C j ) = (p i + p j ).Wu et al. ( 2004) noted that this system of equations can be written in a matrix form where p = {p 1 , . . ., p K } T and T is a K × K matrix whose elements read (20) Wu et al. ( 2004) then noted that there exists a finite Markov chain whose transition matrix is T , since K j=1 T ij = 1 and 0 ≤ T ij ≤ 1.Further assuming that p ij > 0 for any i, j ∈ {1, . . ., K} implies that T ij > 0, which ensures that the Markov chain is irreducible and aperiodic.In fine, these conditions guarantee that Eq. ( 19) defines a Markov chain whose stationary distribution exists and is unique.
Taking advantage of the fact that T is a transition kernel and p is the stationary distribution of the corresponding Markov chain, we cast Eq. ( 18) in an iterative scheme where the initial values p (0) i , i = {1, . . .K} using the estimate in Eq. ( 16) and p ij are the partial probabilities obtained by the binary one-vs-one classifiers using Eq. ( 14) .This chain eventually converges after a few iterations, generally with t < 100 in our examples, to the posterior probabilities estimates.

Basics of Kriging
The final ingredient considered in the proposed framework is Kriging a.k.a.Gaussian process model.It is used here to build local surrogates in the different regions identified by the clustering step.
A Kriging model assumes that the model to approximate is of the form (Santner et al., 2003;Rasmussen and Williams, 2006) where the first summand represents the trend written here in a polynomial form using p regressors f j with corresponding coefficients β j .The second summand is a zero-mean stationary covariance process defined by an auto-covariance function Cov [Z (x) , Z (x ′ )] = σ 2 R (x, x ′ ; θ) where σ 2 is the process variance and R is an auto-correlation function parameterized by the vector θ.In this work, we consider an anisotropic Matérn 5/2 auto-correlation function defined by The calibration of the model is performed by estimating the regression coefficients of the trend and the hyperparameters of the selected kernel that minimize a generalization error, herein using a maximum likelihood approach (Santner et al., 2003;Bachoc, 2013;Lataniotis et al., 2018).
Following this step, Kriging assumes that any unknown sample actually follows a normal distribution N µ M , σ 2 M where the mean is the actual prediction, while the standard deviation informs about the local accuracy of the prediction.The two quantities respectively read where is the estimate of the process variance, F = f j x (i) , j = 1, . . ., p, i = 1, . . ., n 0 is the Vandermonde matrix, R is the correlation matrix with R ij = R x (i) , x (j) ; θ , r (x) is a vector gathering the correlation between the unknown sample x and the experimental design points and i) , i = 1, . . ., n 0 is the vector of available model responses.
To account for the categorical variable, the compound symmetry kernel defined by Pelematti is considered.The parameter r is computed here by embedding this kernel within a usual auto-correlation function for continuous variables with a tunable parameter θ cat that can be calibrated in the same setting than the continuous parameters.More precisely, we consider a Gaussian kernel which then reads: R ℓ (i) , ℓ (j) ; where S ℓ (i) ,ℓ (j) = 0 if ℓ (i) = ℓ (j) and 1 otherwise.The final auto-correlation function is obtained by multiplying the M + 1 one-dimensional auto-correlation functions i.e., where θ = {θ, θ cat } and x (i) = x (i) , ℓ (i) .

Examples
The proposed algorithm is illustrated in this section with two analytical toy functions and an engineering problem related to a tensile membrane structure design.To assess its accuracy, we estimate the following two generalization errors using a validation set of size N val = 10 4 : Normalized mean-square error: Mean absolute error: Furthermore, each analysis is repeated 20 times in order to assess the robustness of the proposed algorithm with respect to the statistical uncertainty associated with the experimental designs.

Manhattan function
For this first validation example, we consider a two-dimensional function proposed by Rai (2015).
The function consists of three global regions, one of which is a checkerboard, and reads The checkerboard is made of smaller rectangular regions alternating the values of 0 and 1 as illustrated in Figure 3.
In this section, we will illustrate each of the three steps of the proposed algorithm.We first start by showing how the clustering algorithm splits the data.Figure 4 shows the clusters identified using three experimental designs of different sizes.The original model is built assuming 10 regions where each of the squares in the checkerboard is considered as one region on its own.However, regardless of the experimental design, the clustering algorithm reduces the checkerboard into two regions, one with y = 1 and the other with y = 0.This results in disconnected subdomains but as we will see in the next paragraph, this does not affect the overall prediction capability of the algorithm.Another important observation from the partitions in Figure 4 is that the more data points, the more clusters are identified.For small datasets, the partition is quite sensitive to the data.However, the partition becomes more stable and robust as the data size is increased.
Once the clusters are identified (4 different ones in the case of medium-size experimental design, and in the sequel), the inputs are labelled accordingly and binary classification is performed on each pair of classes.Figure 5 shows the resulting classifiers for one realization of the experimental design.The blue and orange points correspond to the positive and negative labels respectively, while the support vectors are highlighted in green.The thick line is the classifier, whereas the dashed ones delimit the margin.Finally, the gray triangles represent the data points that were ignored by the illustrated classifier.As expected, support vector machines are appropriately calibrated for the problem at hand.However, the choice of the Gaussian kernel may not be appropriate for the classification of C 3 against C 4 (Figure 5f) as it produces smooth boundaries whereas the original boundary results from a checkerboard with sharp edges.This does not substantially affect the results.However, better prediction could have been obtained by including the choice of the kernel in the model selection.
The next step is then to recombine those predictions into a final one.In the hard reconstruction, a vote is carried out and the class that wins is the final prediction.The resulting partition of the input space is shown in Figure 6. Figure 7 shows the soft reconstruction approach where each tile represents the probabilities of a given point to belong to a given class.
The resulting classification is in accordance with the regions defined by the original model except for the boundaries of the checkerboard which present some slight deviations.Also, the boundary between the two regions where M is smooth (i.e. , polynomial or sines) is not exactly the line This partition of the input space is eventually used to build local Kriging surrogates to provide the final prediction.For this example, we repeat the analysis 20 times where each repetition starts with a randomly sampled Sobol' sequence.Figure 8 shows boxplots of the resulting errors for increasing sizes of the experimental design.For any ED size, both recombination techniques yield improved N M SE and M AE.In general, the soft reconstruction also yields better prediction than the hard one.This is even more clear when considering the M AE error.For this example, the prediction with categorical Kriging is not included, since it does not lead to good results.This is due to the fact that each region is fundamentally different from the other, hence using a single Kriging model, even with categorical variables, is not appropriate.

Snap-through instability problem
This example is a mechanical problem related to the snap-through instability of a two-bar truss structure.The structure is loaded at its tip and responds linearly with small displacements until a critical point is reached.Past that point, the structure suddenly snaps through a new equilibrium point and resumes its small displacements.In this example, we consider as quantity of interest the displacement w of the tip of the structure as illustrated in Figure 9 .
The corresponding displacement of the tip of the truss can then be computed as follows: In this example, we assume that the length of the bar l 0 = 5 m and the initial inclination angle α 0 = 10 • are deterministic.In contrast, the load, the Young's modulus and the cross section areas are assumed random and characterized by the distributions shown in We run the analysis using the proposed method and considering three different experimental design sizes and 20 repetitions.The resulting errors are summarized as boxplots shown in Figure 10.The first observation is that the difference between the results obtained by the proposed method and a direct Kriging model (i.e. a single Kriging model built using the entire data set) is much more important than in the previous case, often by orders of magnitude.This is due to the fact that the two regimes of non-linear structure behaviours are prominently different as shown in Figure 11.Furthermore in this example, categorical Kriging performs quite well.It is not clear however which recombination approach is the best.When looking at the normalized mean square error, the hard recombination is slightly better.This is the opposite when looking at the mean absolute error, i.e., the soft and categorical recombination are slightly better.

Tensile fabric structure
In this final example, we investigate a model that simulates the behaviour of a tensile membrane structure (TMS) under extreme loading (Valdés-Vázquez et al., 2020, 2021).TMS are flexible lightweight structures made of composite fabric spanning long distances.They have many advantages in terms of architectural sophistication but are yet challenging to design.By their very nature, they are unable to carry out-of-plane moments and shear forces that may result from the extreme wind loads they are expected to withstand.They further require careful pre-stressing to keep a stable form.
Special codes are designed to simulate the response of complex tensile membrane structures.
Comet is one such in-house finite element code developed at the University of Gua (Valdés-Vázquez et al., 2021).In this work, we consider a hypar (hyperbol-paraboloid), which is one of the most common shapes for TMS, designed using Comet and illustrated in Figure 12.The probabilistic model is described using the random variables presented in   building a single surrogate model to account for both leads to inaccurate results.We consider then the three-stage approach proposed in this paper, with an experimental design of size 500 and a validation set of size 1, 000.The experimental design is split into five different subsets of sizes 100, 200, 300, 400 and 500.In each of these, the DPMM clustering rightly identifies that there are two sets of responses.
Figure 14 shows the resulting NMSE and MAE for each experimental design size.As ex-

Conclusion
Surrogate modelling is now a well-established method that allows one to reduce the computational burden of simulation intensive methods that require multiple evaluations of a costly computational model.Building an accurate surrogate model with limited data generally requires that the functions to approximate are smooth and regular.This is however not always the case in many applications, e.g.crash simulation or computational fluid dynamics.
In this paper, we propose a three-stage approach for the approximation of non-smooth functions for systems exhibiting multiple behaviours and/or discontinuities.The problem is tackled by dividing the task into three complementary parts: i. a joint input-output clustering stage that identifies the different patterns exhibited by the system using a non-parametric Bayesian approach, namely a Dirichlet process mixture model, ii. a partition of the input space according to the identified clusters using support vector machines, and eventually iii. the construction of local surrogates, herein Kriging models, using data from each of the partitions.For any new point, the prediction is made by appropriately recombining the predictions made by each of the Kriging models, according to the assigned class of the new point.
The proposed approach is validated on two analytical examples and an engineering application (FE-based tensile membrane structure).It is shown to be both accurate and efficient compared to a traditional surrogate modelling approach ignoring the non-smoothness.
The three methods selected for each stage all provide probabilistic predictions.While the posterior probabilities of the support vector machines classifiers have been used within the soft reconstruction scheme, the ones provided by the Dirichlet process mixture models have not been exploited yet.However, as seen in the examples, mislabelling the initial data leads to large errors.These could be reduced by accounting for the uncertainties in the clustering stage.In a future work, we intend to account for the latter so as to provide a fully probabilistic prediction scheme that propagates the epistemic uncertainties from one step to the next.

Figure 1 :
Figure 1: Illustration of the three-stage approach.

Figure 2 :
Figure 2: Illustration of a Dirichlet process: G 0 is the base distribution from which the atoms η * k hyperparameters of this model are the penalty term C which controls the penalty incurred for misclassifying a training point and the kernel parameter θ which controls, among others, the smoothness of the separating hyperplane.They are both estimated in this work by minimizing the span estimate of the leave-one-out error(Vapnik and Chapelle, 2000;Chapelle et al., 2002) using the covariance-matrix adaptation evolution scheme (CMA-ES) (See Arnold and Hansen (2012);Moustapha et al. (2018Moustapha et al. ( , 2021) ) for details).

Figure 3 :
Figure 3: Example 1 -Three-dimensional representation of the Manhattan function.

Figure 4 :
Figure 4: Example 1 -Clustering of the data by DPMM considering two repetitions of three experimental designs of increasing sizes.

( a )Figure 5 :
Figure 5: Example 1 -Pairwise classification of the data (with 4 clusters identified in Step 1 for the medium-size experimental design).

Figure 6 :
Figure 6: Example 1 -Partition of the space in the 4 regions using hard reconstruction.

Figure 7 :Figure 8 :
Figure 7: Example 1 -Partition of the space in the 4 regions using soft reconstruction.

Figure 9 :
Figure 9: Illustration of the two-bar truss structure subject to snap-through.

Figure 10 :
Figure 10: Example 2: Boxplots of the computed errors for various methods and experimental design sizes.

Figure 11
Figure11shows the original vs. predicted vertical displacement for the four approximations using a random subset of the validation set of size 200.The left panel of this figure shows how a single model (called "direct") spans the entire range between the two regimes of the truss and leads to huge errors.In contrast, the multi-stage approaches properly detect the discontinuities.It is also clear from this figure how the recombination scheme affects the final prediction when there are classification errors.The soft recombination reduces the error for those cases when there is uncertainty in the classification.Note that the same outlier points are observed in Figures11a and 11bwhen hard reconstruction and categorical Kriging are used: these outliers only stem from classification error.
Figure 11: Example 2: Original vs. predicted vertical displacement for different approximation techniques.

Figure 12 :
Figure 12: Hypar structure considered in this study.

Figure 13 :
Figure 13: Example 3: Kernel smoothing density of the maximum reaction force of the hypar.

Figure 14 :
Figure 14: Example 3: Computed errors for the hypar structure for increasing experimental design sizes.