Gaussian process approach within a data-driven POD framework for fluid dynamics engineering problems

This work describes the implementation of a data-driven approach for the reduction of the complexity of parametrical partial differential equations (PDEs) employing Proper Orthogonal Decomposition (POD) and Gaussian Process Regression (GPR). This approach is applied initially to a literature case, the simulation of the stokes problems, and in the following to a real-world industrial problem, inside a shape optimization pipeline for a naval engineering problem.


Introduction and motivations
Reduced Order Modeling (ROM) offers a simplification for very complex computational models [22], keeping by high accuracy for the solution we want to compute but, at the same time, reducing the computational cost to reach it.In a parametric context, this means that we are able to approximate the solution for any new unforseen parameter in a very efficient and rapid way.Especially for complex phenomena of which numerical solutions are expensive to computee.g.solve a Partial Differential Equations (PDEs) -, ROM makes a huge difference in terms of the global computational load, allowing for a quick (or even real-time) response and for a multiple queries approach, e.g. an optimization scenario, otherwise not viable.
In the Reduced Basis (RB) [12] method, a set of truth solutions for some selected parameters are computed using a numerical solver over the full-order model.Once these snapshots are collected, we combine them to extract the basis which spans the low-dimensional space, thus we can exploit it by using the basis as global test functions in a Galerkin method.Thanks to the low dimensionality of the reduced order model, the related system results smaller and faster to be solved.Despite this advantage, RB meets some difficulties for applications in industrial field: the necessity to extract the discrete operators precludes the use of commercial software, as well as the effort required to derive the equations onto the reduced space discourages its exploitation.
Data-driven reduced order model are then gaining popularity to tackle industrial problems, mainly for its easiness of application and for the increasing quantity of available scientific data.Within the data-driven approach, it is possible avoiding to project equations and operators onto the reduced spacethat can be, depending on the equations, very demanding and complex -and approximating the solution maninfold in the reduced space with an interpolation/regression.In this contribution, we coupled the Proper Orthogonal Decomposition (POD), a well-known technique for computation of reduced space basis [2], with the Gaussian Process Regression (GPR) for a probabilistic prediction of the new solutions.While the POD technique is pretty standard for projection based snapshots method in several fields of application -e.g.structural engineering [19], fluid dynamics [26,13,27] -, the GPR has been adopted only in [11] for a structural problem.The POD is involved to reduce the large dimension of the input snapshots and the GPR is applied as a probabilistic response surface for the approximation of the parametric solution manifold.The novelty of this work is in fact the creation of a computational framework by combining model order reduction and probabilistic prediction, in order to increase the accuracy of the state-of-the-art (data-based) metodologies for a very limited amount of snapshots.This framework results then particularly suited for industrial setting, where simulations can last days or even weeks, making unfeasible the computation of large database of solutions.Similar methods have be proposed in the last year to data-based modeling of parametric problems, using interpolation to approximate the solution manifold.Among all contributions in literature, we cite [24,6,29] for a good overview.An alternative to interpolation or regression for the online prediction is constituted by the Artificial Neural Network (ANN), as described in [33].Even if such approach results very accurate, it requires larger datasets, resulting particularly ineffective for limiting the number of full-order snapshots.Approaches that exploit the Active Subspaces (AS) property for the reduction of the number of required input snapshots are presented in [30,8].
The paper is structured as follows: the POD and GPR methods are described more in details in Section 2 and Section 3, respectively, while in Section ?? we describe their integration into a single framework.The numerical results are presented in Section 4 and they are subdivided as following: first we applied the POD-GPR framework to a simply toy problem, a Stokes flow passing around a cylinder, then we move to an industrial problem where turbulent biphase Navier Stokes flow is simulated around a parametric cruise ship geometry.Finally, we summarize some conclusion and further future perspectives in Section 5.
introduce the reader to the general idea behind POD and give some inshights on the mathematical theory supporting it.
We now begin with an introduction to the class of Reduced Basis (RB), presenting POD in the following as a special case.Finally, we discuss a modification of the standard POD formulation, which is equation free and it will be the one used in this work.
Reduced Basis.We begin with a formal definition of a system of discretized parametric PDEs in the form: where µ ∈ P ⊂ R p is the parameter, u N is the truth solution to our problem, a(•, •; µ) is the parametric bilinear form and X N is a finite dimensional space of dimension N .We introduce the notion of the solution manifold, that is the set of all possible solutions of our parametric problem under the variation of the parameter: The final goal of RB is to approximate any element of this manifold using a low number of basis functions, or modes, {χ i (x)} N i=1 by setting what we call the reduced basis.These N functions are globally defined over the computational domain and are obtained using some pre-computed truth solutions for particular parameter values.
The reduced solution u N N ≈ u N is hence composed by a suitable linear combination of these modes: and the reduced formulation of the problem (1) becomes: where X N N = span({χ i (x)} N i=1 ).Hence, while the degrees of freedom associated with the truth solution N are typically high, the degrees of freedom associated with the RB approximation are only N , where N N .
Proper Orthogonal Decomposition.POD refers to a particular RB which considers a hierarchical orthonormal basis generated using energetic considerations, exploting them as global basis function for a Galerkin framework [2].The mathematical tool behind POD is Singular Value Decomposition (SVD), as we will briefly discuss now.Let us begin with some notation.We will call a possible outcome of our system y a snapshot and denote with N its dimensionality (y ∈ R N ).We will then denote with n the number of snapshots collected, and gather them in the snapshots' matrix Y ∈ R N ×n : We then call N the number of POD modes, i.e. the dimension of the reduced basis, and assume N ≤ n < N .
A formal definition of the POD basis can be then given: ., n} is the POD basis of Y if and only if it is a solution of: This can be read as: the POD basis is the one that maximizes the similarity (as measured by the square of the scalar product) between the snapshots matrix and its elements, under the constraint of orthonormality.In this sense, when we obtain the N -rank POD basis, we have the set of dimension N capable of optimally express the variance in the snapshots.
The link between POD and SVD is stated in following theorem, that can be proven using Lagrangian penalization techniques (see for example [32]): given by the set of the first N left singular vectors {ψ i } N i=1 .As a last remark on POD, we highlight the fact that, as already shown, to compute the reduced basis we need a set of what can be called, using machine learning nomenclature, training data, that is a set of truth solutions for particular values of the parameters computed using the high fidelity solver.This is usually the most expensive phase in the ROM pipeline, and the applicability of ROMs techniques is often related to the size of the sampling needed to obtain certain performances.
Connected to this last aspect, we mention here another important feature of ROMs, that is the offline-online decomposition.By this, we mean that the use of RB techniques can be split, as we have already seen, in two phases: • an offline-phase, in which a set of high fidelity solutions are collected and the reduced basis is computed by combining them; • an online-phase, in which we project the problem onto the reduced space and the solutions for new parameters are computed in an efficient manner.
The first phase requires typically more computational time, having to rely on the high fidelity solver, and is usually executed in high performance computing clusters using parallel programming techniques.The second phase, however, is quite inexpensive from the computational point of view and could be done in low-end terminal or even tablet/smartphones, widening the field of applicability of this method.
Data-driven Proper Orthogonal Decomposition.Data-driven POD is an alternative to standard POD that is more used in the industrial setting because of its ease of use and its flexibility.Within this method, the first phase, the generation of the reduced basis, is performed in the same way as the standard approach.A significant difference, however, is that while in intrusive POD-Galerkin we need to rely upon an open-source software to compute the truth solutions (since in the online phase we will need to have access to the source code in order to project the equations).In the data-driven approach, this is not necessary, and we can use commercial software or even experimental data to train our model.The reason why in data-driven POD methods we have this freedom in the offline phase is that the online phase is completely different from the standard POD.
We recall here that the assumption of RB-ROMs is that the truth solution of our problem u N can be approximated by the reduced solution u N N composed by linear combination of spatial modes χ i (x) multiplied by coefficients ξ i (µ), that is: Then we build an interpolation, or a regression, using high-fidelity data coming from offline phase, approximating the function that associates the value of the parameter µ to the modal coefficients of the related solution {ξ i (µ)} N i=1 .We can use the interpolator to infer the value of the coefficients associated with new parameters.The values of the coefficients are then used to reconstruct the approximated truth solution using (7).
This approach is entirely data-driven and is independent both on the equations and on the physics of the problem.This has its own advantages and disadvantages.The ease of use and the complete freedom in the generation of the snapshots, that are crucial in an industrial setting in which commercial software are widely spread, correspond to a lower accuracy associated with the reduced model.

Gaussian process regression model
We introduce in this section the Gaussian Process Regression (GPR), the supervised learning technique we adopt to approximate the modal coefficients.We provide in the following lines a summary of the method, in order to let the reader understand the entire pipeline, but avoiding to touch more details; for a complete description of such framework, we suggest [21].
A Gaussian Process (GP) is a stochastic process whose finite-dimensional distributions are Gaussian.In supervised learning, this process is usually adopted for regression manner, by constructing a stochastic model from a set of input data capable to predict quantities of interest.We define the set D = {(x i , y i )} for i = 1, 2, . . ., M as the set containing all the input-output pairs, where x i ∈ P ⊂ R m is the input parameter and y i ∈ R is the corresponding output.We assume the output follows an unknown regression function f : P → R we want to approximate (we avoid in this simple overview to deal with noise), such that y i = f (x i ) for i = 1, 2, . . ., M .Thus, we define a GP such that: where µ(x) refers to the mean of the distribution and K(x) to its covariance, defined as K ij = K(x i , x j ).The covariance function, also called kernel, K : P × P → R should be properly selected depending on the problem, for this work we use the squared esponential: The prior joint Gaussian distribution for the outputs y results then where y ∈ R M is the output vector and X ∈ R m×M is the matrix whose columns are the input parameters.We specify that, for sake of compactness, we adopt a compress notation for the covariance matrices, such that K XX = (K(x i , x j )) with i, j = 1, . . ., M , where x i and x j are the i-th and j-th columns of X.We are now interested in predicting the output for new test input X by exploiting the joint distribution based on the above prior, that is: Now, the ouputs ȳ = f ( X) can be expressed in probabilistic terms by simply using the conditional Gaussian distribution: It is important to remark that, for a good result regarding the prediction, we need to optimize the hyperparameters within the covariance function.In this work, since a squared exponential kernel (Eq.9) has been adopted, we optimize the hyperparameters σ and l, respectively the variance and the lenghtscale, by maximizing the likelihood through a standard multi-start grandient-based optimization algorithm.
In this work, concerning the GPR, we use the Python open source package called GPy [10].

Numerical results
We present in this section the numerical results obtained by applying the described method to two different computational fluid dynamics simulations.To emphasize the versatility of the proposed framework with respect to the highfidelity solver, we firstly apply it to a finite element (FE) problem then to a finite volume (FV) one.The coupling between the original model and the reduced one is in fact based only on data, resulting in a complete modular pipeline.In the next sections, we briefly introduce both the full-order models before providing a discussion about the reduction accuracy, even if in this case the original model has not a strong impact in the generation of the reduced order space, in order to give to the reader the possibility to reproduce the results.

Parametric Stokes flow around cylinder
The first numerical experiment where we use the POD-GPR framework is the simulation of a parametric Stokes flow passing around a circular cylinder.We As we can see from Eq. 12, we consider two physical parameters µ = (µ 0 , µ 1 ) ∈ P ⊂ R 2 controlling the viscosity µ of the fluid and the velocity of the fluid at the inlet boundary Γ IN .The 2-dimensional domain Ω (sketched in Figure 1) is a rectangle with a circular hole, on which we impose on the physical walls (Γ S ∪ Γ N ∪ Γ CYL ) no-slip boundary condition, on the inlet the parametric flow velocity µ 1 and on the outlet a null pressure.Starting from Eq. 12, we can obtain the weak formulation of the Stokes equations by multiplying these equations to the test functions where a(•, •; µ) and b(•, •; µ) are the parametric bilinear forms.
To solve the Stokes problem described above, we adopt a finite element (FE) approach.We define the finite dimensional spaces indicates the piecewise polynomial space defined on the triangulation T of the domain Ω.For this work, we have chosen a second order polynomial for the velocity space and a first order one for the pressure space, i.e. (P 2 , P 1 ).The use of the so-called Hood-Taylor scheme [28] is sufficient for ensuring the infsup condition; for the mathematical proof of the stability of such scheme, we refer to [3].Moreover, the Reynolds number is very low (Re < 1) for all the combinations of parameters, making possible to avoid additional stabilization to numerically solve the problem.
The discretized parametric Stokes problem reads as follow: where u h (µ) ∈ V h and p h (µ) ∈ Q h are the finite dimensional unknowns we want to compute.The discretized problem has been solved using the FEniCS framework [1].
We initially create the high-fidelity database by computing the pressure and velocity fields for 50 samples, randomly generated in the parameter space P = [.1, 10] × [1e−4, 1e−6].We then apply the described method to these snapshots, using initially the POD algorithm to reduce the dimensionality of the solutions, aka extract the modal coefficients.
In this case, we use 12 modes for the construction of the reduced space and apply a GP -we repeat that the used kernel is the squared exponential oneto the distribution of the modal coefficients.To test the accuracy, we create a test dataset containing 20 random samples selected in the parameter space, that of course are different to the input database.For all these test parameters, we approximate the reduced solutions and compute the realtive error between the reduced test solutions and the truth ones.Figure 2 reports the average relative error for all the analized fields in relation to the number of input snapshots, starting from only 2 snapshots.We specify that, using less than 12 snapshots, the number of POD modes is limited to the number of snapshots.As we can see, the error results very small with even few snapshots, especially for both the velocity components that are beyond the 5% with only 8 snapshots.The trend for the pressure error is less steep (we need 16 snapshots to reach an error of 5%), but it is able to reach at the end a precision of about 2.5%.A graphical visualization of the test solutions obtained using only 3 input samples is provided in Figure 3.We underline the error becomes pretty much constant using many snapshots, due to the Bayesian online phase, making the usage of GP during the online stage profitable especially in the cases where the offline stage has a limited computational budget.

Multiphase turbulent Navier-Stokes flow in an industrial parametric domain
The second numerical experiment we set up to test the POD-GPR framework is the simulation of the flow around the hull of a cruise ship.The reduction is then employed inside a shape optimization pipeline, enabling the use of a global optimization procedure on the design space.We now proceed to discuss the main ingredients of the optimization pipeline in order to understand the role of the POD-GPR reduction.We avoid going into many technical details and refer the reader to [18,5] for a more extensive presentation of all the optimization pipeline.
The first step of the optimization pipeline consists of the generation of the parametric design space, where each possible shape is associated to a particular value of a pre-defined parameter.This is done by considering an initial undeformed configuration and applying to it a technique called Free Form Deformation (FFD) (for more details see [25,15,23]).FFD is a deformation technique in which the object to be deformed is initially put inside a lattice of points; then some of these points are moved, in a parametric way, and the space enclosed is deformed smoothly according to these motions.
In our case, we define the lattice shown in Figure 4.This lattice is enclosing the bottom-frontal part of the boat, being the part we want to deform, and is composed of a total of around 500 points, of which only a small part will be displaced.
In particular we define a set of movements connected to 6 independent parameters.The way in which these parameters are defined and the ranges they take values from are related to the kind of shapes we want to consider and the various constraints those shapes are subjected to.For example, the most simple and straightforward constraint we imposed is that of volume, i.e we impose that the hulls must not have a total volume that reaches too small values.Another constraint we impose is that the deformations applied on the hull needs to connect continuously and smoothly to the undeformed parts of the boat.This latter constraint can be eimposed by keeping fixed some points in the lattice.The FFD step is performed using PyGEM, an open-source Python library (see [20]).
The next step consists of the evaluation of the performance of the newly generated hull.This is done via numerical simulation of the flow of water and air around it.In particular, we consider a 3D biphase (water and air) incompressible flow.The physical model is based on the well known Incompressible Navier-Stokes equations, with modellation of turbulence based on a RANS approach (for more details see [4]), Finite Volume discretization (see [16]) and using the Volume of Fluid (VOF) methodology (see [14]) to take into account the biphase nature of the fluid.
More specifically, the equations considered are the following: where u, p denote the fluid velocity and pressure fields, ρ, ν and g denote respectively density, dynamic viscosity and gravity acceleration, R is the Reynolds stress tensor and α is the fraction of water.
The first equation in Eq. ( 15) expresses the momentum balance in the infinitesimal volume and is essentially a reformulation of Newton'st 2 nd law of dynamics expressing the Eulerian fluid's acceleration ( ∂u ∂t ) as a result of convection ((u•∇)u), a gradient of pressure ( 1 ρ ∇p) and diffusion (∇•ν∇u).The second equation instead expresses the mass conservation, emerging as solenoidality of the velocity field because of the incompressibility hypothesis.
We note that both the velocity and pressure fields in this equations represent only the mean part, decoupled from the fluctuating part using a RANS approach.All the effects of the fluctuating part are contained in the Reynolds stress tensor R, which will be in then modeled as a function of the mean velocity field u in order to close the equations.
The third equation is coming from the VOF modelling of the biphase flow.Volume of fluid is a free-surface modeling technique that allows to describe a multiphase fluid composed of two incompressible, isothermal immiscible fluids (water and air, in our case).For a more in-depth discussion on VOF, we refer to [14].
This method is based on a phase-fraction technique: a scalar variable, denoted by α, is added to N-S equations, representing the fraction of water contained in the infinitesimal volume (or in the finite volume, when we will discretize the equations).This variable belongs to the interval [0, 1], where α = 1 represents a point in which water is present, α = 0 a point in which air is present and α ∈ (0, 1) represents interface points.By its nature, α is a discontinuous variable, and so the discretization method must ensure that the interface is captured a small number of cells.
In addition to the momentum and mass balance equations, we have two equations where the density ρ and the kinematic viscosity ν are defined using an algebraic formula expressing them as a convex combination of the corresponding water and air properties: Finally, the third equation in the system (15) is the equation required to close the system after the addition of the variable α.In fact, this equation is simply a transport equation for the fraction of fluid in which only the convective term is considered.
As for the discretization and solution of the above equations, we used the implementation contained in the open-source CFD software OpenFOAM (see [17]).
The POD-GPR framework described above is then employed in order to reduce the computational time required for the resolution of the numerical problem associated to a new shape.Going into the details of the reduction, we consider as a snapshot, and hence as a solution, not the whole volumetric flow field but the the field of total resistance (viscous+pressure) over the hull.
The first step consists of the collection of the snapshots matrix, i.e. a set of solutions of the Full Order Model for selected parameter values.In particular we sampled the parameters space described above considering the vertices (2 6 = 64 points) and other 36 random points inside the domain, for a total of 100 snapsohts.Other 20 random snapshots have been generated for purpose of testing.
The extraction of the modes and approximation step have then been conducted using EZyRB [7], an open-source Python library for data-driven model order reduction.We now show some graphs reporting the relative L 2 error between FOM and ROM: In particular, in Figure 5 (left), the error is expressed as a function of the number of modes used and as a function of the number of snapshots used.In both the figures we report in different colors different regression techniques used in the approximation of the modal coefficients.We consider the GPR approach, the classical n-dimensional linear interpolation and two approximation techniques based on radial basis functions (RBF), implemented using the Python package Scipy (see [31]).The former two methods differ one from the other for the value of one particular parameter that defines the method, called "smoothness", which quantifies how close the approximation technique is to interpolation, where a value of 0 correspond to a plain interpolation while higher values correspond to more smooth approximations.In particular, we considered values of the smoothness equal to 0 and 0.01.
We can see from Figure 5 that, using the GPR for the regression of the modal coefficients, we are able to use a very low number of snapshots obtaining an error which is comparable to the performances of different regression techniques for higher number of snapshots.This is really interesting since, as argued in the previous section, the offline phase, i.e. the generation of the snapshots, is indeed the most expensive from the computational point of view.However, when the number of snapshots increases, the performances of GPR and the other methods tend to converge to a same value.
The final step of the shape optimization pipeline consists on the exploration of the design space in order to find the optimum.To perform this, we used a evolutionary optimization algorithm, implemented inside DEAP, an open-source Python library for evolutionary optimization (see [9]).
From the point of view of the speed-up in the computational time obtained with the use of the POD-GPR framework, we mention that while in order to run the above mentioned optimization procedure using the FOM we would have needed around 140 weeks of CPU time, the use of the ROM approach restricted this time to around 5 weeks.Of these 5 weeks, the biggest partition is associated to the offline phase, i.e. the generation of the 100 training simulations, being the time associated to the evaluation of the online phase almost negligible.It is then clear that, in light of the considerations made regarding the error as a funcion of the number of snapshots for the POD-GPR framework, we could have augmented this speed-up of at least a factor 10 by generating fewer snapshots.
Finally we report, in Figure 6 the optimal shape found, which correspond to a reduction of drag of 3% from the initial hull.The figure represents sections of the hull on the longitudinal and transversal plane, in blue for the undeformed hull and red for the optimal deformed hull.

Conclusions and perspectives
In this contribution, we applied the POD-GPR framework to two completely different parametric problems.To emphasize the equation-free nature of the algorithm, we demonstrate the applicability to a Stokes problem discretized using finite element method, then to a turbulent multiphase industrial problem discretized using finite volume method.In both cases, it emerges that the GPR method for the online prediction of the modal coefficients (for untested parameters) is able to make a prediction with a relative small error by using very few input snapshots.The advantage of such method is then maximazed when the offline phase results quite expensive due to the high-fidelity model.
The adoption of this method open several possibilities for future developments, by using the GPR not only for the prediction of new modal coefficients, but also for the quantification of the error, or for a greedy approach for new snapshots selection by exploiting the variance of the Gaussian distribution.Additionaly, the use of different kernels can be investigated.

Figure 1 :
Figure 1: The domain Ω for the parametric Stokes flow simulation.

u y 2 :
Relative average error between the truth solutions and the solutions obtained with the POD-GPR framework, evaluated in 20 test parameters, varying the number of snapshots.

Figure 3 :
Figure3: Graphical visualization of the reconstructed fields for a test parameter using only 3 input samples.The plots are organized as following: the three columns represent, from left to right, pressure, velocity among x direction and velocity among y direction.The first row shows the truth solutions, the second row the solutions approximated using the POD-GPR method and the third one represents the absolute error between them.

Figure 4 :
Figure 4: View of the the lattice of points over the undeformed hull: the blue dots are the FFD control points, while the red arrows represent the displacements.

Figure 5 :
Figure 5: Average relative L 2 error between the FOM and the ROM as a function of the number of modes, keeping fixed the number of snapshots to 80, (left) and the number of snapshots used, keeping fixed the number of modes to 20 (right).

Figure 6 :
Figure 6: View of the x (right), y (upper left) and z (lower left) sections for the optimal hull shape; in blue the undeformed hull, in red the deformed hull.