1 Introduction

In a 2009 compilation of essays, the fourth paradigm of science was first described [1]. After centuries of science based on observation—the empirical period of the first paradigm—came a period based on the establishment of scientific laws—the second paradigm, think of Newton—and, much more recently, a period in which simulation took on an important role—the third paradigm. The authors of that compilation argue that we have very recently entered a period in which data plays a prominent role in scientific discovery, and in which theory and experiments, symbiotically, help data to achieve higher goals. Scientific Machine Learning is precisely a new field in which data coming from scientific experiments is used massively to unveil new, still unknown scientific laws. Some authors have begun to think of an even more recent fifth paradigm of science, in which data is obtained not from experiments, but from simulations [2]. This approach is helping scientists to look for the origin of dark matter [3] or the structure of protein folding [4], to name but two of the most relevant examples. Other data-driven approaches do not look for a closed-form scientific expression describing a particular phenomenon, but instead replace phenomenological laws of low epistemic value (typically, constitutive or closure laws) with data [5,6,7,8,9,10]. Of course, this approach is somewhat more limited in terms of extrapolation capabilities. Although it largely exceeds the scope of this paper, the interested reader can find excellent reviews on machine learning in [11,12,13], for instance.

There is no doubt that this data-intensive approach is revolutionizing science. But, in a much more applied context, scientific machine learning is also revolutionizing industry. In the same way that these techniques can look for general laws of physics, they can also look for rigorous descriptions of the functioning of technical apparatus, thus helping us to develop digital twins in a very efficient manner [14,15,16].

However, distilling scientific laws for any physical phenomenon, regardless of its practical importance, has profound implications. Galileo, for instance, in his book on two new sciences—one of which was the mechanics of materials—failed completely in his description of beam bending. He assumed that rectangular sections of a beam rotate due to bending around an axis passing through the lower surface of the beam, and not through the center of mass, as is well known today [17]. However, he was intelligent enough to think of rotating sections of the beam, which is actually the cornerstone of the celebrated Euler-Bernoulli-Navier bending theory. Thus, even if his theory was not entirely true, his choice of variables was correct. Or at least it was at that macroscopic scale of description. Of course, at that time the atomic structure of matter was not known, but this extremely fine scale of description is equally valid, albeit much less efficient. So is that of the theory of elasticity within a continuum mechanics framework. Or that of the Cosserat brothers, which includes rotations of the material point—beam bending theory could itself be considered a Cosserat theory on a one-dimensional continuum [18]. Any of these scales is in principle valid, yet Galileo chose the right one, as did Euler, Bernoulli and Navier. Here, by “right” we mean the most useful one in terms of engineering practice. In Sect. 2 we briefly review the statistical mechanics aspects related to the choice of an appropriate level of description for a given physical phenomenon.

But the use of machine learning is by no means restricted to unveiling unknown physical laws. Still today, and despite the impressive interest in big data approaches in the social sciences, the biggest supercomputers in the world continue to be devoted to the simulation of complex phenomena [19]. Full-field, high-resolution simulations continue to be challenging, and the possibility of developing a promising family of learned simulators is appealing [20,21,22]. These learned simulators share the characteristic of being considerably faster than traditional simulation techniques such as finite elements or finite volumes (once trained, of course—this does not take into account the considerable effort invested in training these networks). While traditional techniques require a non-negligible effort to construct each model, learned simulators, in general, employ reusable architectures that can be applied to different, but related, problems. Additionally, the results of these learned simulators are, in general, as accurate as the data employed to train the network, and do not depend on models whose validity could be compromised by extreme parameter values, for instance. Another advantage is the possibility of employing these simulators for inverse problems and optimization procedures, since they are based on differentiable networks [23,24,25,26].

Despite the above-mentioned advantages, neural networks are not very popular among the scientific community in general, and in the computational mechanics community in particular [27, 28]. Common pitfalls include large deviations from the expected result arising as a consequence of small perturbations in the data, or the well-known phenomenon of overfitting (basically, learning the noise in the data). This has motivated a strong research effort towards the inclusion of prior knowledge about the physics at play in the learning procedure. This knowledge can be included in the form of inductive biases, for instance [29], or of the particular form of the differential equation governing the physics.

Recently, very popular methods based on neural networks have been developed for the solution of Partial Differential Equations (PDEs) from data [30] that try to overcome these limitations. These methods, collectively coined Physics-Informed Neural Networks (PINNs), allow one to solve PDEs from data by following a scheme that, in some sense, resembles a collocation method. Thus, if the PDE governing the problem is known—hence the “physics-informed” character—its particular form can be unveiled from the available data. We devote Sect. 3 to this family of techniques.

There is still another family of machine learning techniques that consider the learning process as a sort of regression procedure operating on an Ordinary Differential Equation (ODE) governing the dynamics of the system. These techniques can be seen as a particular instance of the proposal by Weinan E [31]. They operate when the particular form of the PDE governing the physics is neither known nor sought. Instead, as will be seen, this way of posing the learning problem leverages the vast corpus of knowledge on dynamical systems to enforce the right structure of the system (conserved quantities, symmetries, etc.). To this family of techniques we devote Sect. 4.

As will be seen throughout this paper, the use of neural networks for learning physical phenomena is an active field of research that has produced hundreds of papers in a very limited period of time. Recently, methods have been developed that are able to unveil constitutive laws from displacement data alone—with global force but, notably, no stress data—which constitutes a great breakthrough in this discipline, see [32,33,34]. Nevertheless, the topic is still in its infancy, and much more research is expected before we fully understand and can predict the behavior of machine learning techniques, particularly neural networks, in this fascinating discipline. The paper ends with some conclusions about these and other considerations in Sect. 5.

2 Statistical Mechanics of Coarse Graining

The simplest approach to learning physics, at least conceptually, consists in going down to the molecular dynamics scale, where Newton's laws apply, labeling every molecule and following each of them along its trajectory. This is not a useful approach, obviously, but conceptually it is the simplest one. At this scale, everything is reversible. If we denote the state of the system by \(\varvec{z}=\{\varvec{q}_1, \ldots , \varvec{q}_N, \varvec{p}_1, \ldots , \varvec{p}_N\}\), where N is the number of molecules (on the order of \(10^{23}\) to obtain meaningful results) and \(\varvec{q}_i\), \(\varvec{p}_i\), \(i=1, \ldots , N\), represent the position and momentum of every molecule, the evolution in time of the system is well known to obey a Hamiltonian evolution, i.e.,

$$\begin{aligned} \dot{\varvec{z}} = \frac{\partial \varvec{z}}{\partial t} = \varvec{L}\nabla \mathcal H = \varvec{L}\nabla E, \end{aligned}$$
(1)

with \(\varvec{L}\) the so-called symplectic (skew-symmetric) matrix and \(\mathcal H\) the Hamiltonian of the system, i.e., its total energy E. Learning the system at this scale thus means learning the particular form of \(\varvec{L}\) and E by regression (a form of supervised learning). Distilling these from data ensures that predictions obtained through Eq. (1) will conserve energy, thanks to the symplectic structure of the learned description.
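To fix ideas, the following minimal sketch (in Python with NumPy; the one-degree-of-freedom harmonic oscillator and all numerical values are illustrative assumptions) shows the structure of Eq. (1) at work: once \(\varvec{L}\) and E are available, a symplectic integrator preserves the energy of the predicted trajectory up to a small bounded drift.

```python
import numpy as np

# Symplectic matrix L and energy E for a 1-DOF harmonic oscillator,
# an illustrative stand-in for the learned quantities of Eq. (1).
L = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def grad_E(z):
    # E(q, p) = (q**2 + p**2) / 2, so its gradient is simply z.
    return z

def symplectic_euler(z, dt):
    # Semi-implicit Euler: update p with the old q, then q with the new p.
    q, p = z
    p = p - dt * q   # dp/dt = -dE/dq
    q = q + dt * p   # dq/dt =  dE/dp
    return np.array([q, p])

z = np.array([1.0, 0.0])           # initial condition (q0, p0)
E0 = 0.5 * z @ z
for _ in range(10_000):
    z = symplectic_euler(z, 1e-3)
E_final = 0.5 * z @ z
print(abs(E_final - E0))           # energy drift stays tiny
```

Note that the right-hand side of Eq. (1) is recovered as `L @ grad_E(z)`, which for this system gives \((\dot q, \dot p) = (p, -q)\).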

Since this approach is by no means practical, we need to coarse-grain the description of the system. In other words, we need to represent it with fewer degrees of freedom than it actually has. This process is described with particular elegance in a paper by Pep Español, whose title we borrow for this section [35]. Alternative approaches can be found, for instance, in [36]. In essence, the process consists in eliminating those degrees of freedom whose evolution in time is fastest, keeping those whose evolution takes place over longer periods of time, thus allowing us to describe the system over longer intervals with less computational effort. At the other end of the spectrum, (equilibrium) thermodynamics is a description of the system that takes into account only invariants, magnitudes that do not evolve in time (typically, mass, momentum, energy).

Each possible level of description involves a set of variables, which we denote by \(\varvec{z}\) regardless of the particular scale, when there is no risk of confusion. Coarser levels of description provide us with less information, but involve smaller sets \(\varvec{z}\). The molecular dynamics scale given by Eq. (1) has a typical time scale on the order of the collision time (again, impractical for our purposes). In any case, every description is governed by some equation that, very much like Eq. (1), predicts the evolution of the variables governing the system. It is also important to realize that, if a clear separation of scales exists between the retained and the eliminated degrees of freedom, the resulting description will be Markovian. In other words, the coarse-grained model will not depend on history, but only on the present value of its variables.

Even when the description is Markovian, it is worth noting that many microscopic states can lead to the same macroscopic state. This produces uncertainty in the future evolution of the coarse-grained description of the system, see [35, 37, 38]. This uncertainty manifests itself as fluctuations in the evolution, and fluctuation is equivalent to dissipation by the celebrated fluctuation-dissipation theorem [39, 40].

Español takes a system with two clearly separated scales to elaborate on this theory: a colloidal suspension [35]. The finest description is obtained, as discussed before, by taking the positions and momenta of both the colloidal particles, \(\varvec{Q}_i, \varvec{P}_i\), and the solvent molecules, \(\varvec{q}_j, \varvec{p}_j\), with \(i=1, \ldots ,N_\text {col}\) and \(j=1, \ldots , N_\text {sol}\) the numbers of colloidal particles and solvent molecules, respectively. At this scale, the energy of the system is composed of the potential and kinetic energies of the particles, and the typical time scale is on the order of picoseconds.

If we are not interested in describing the movement of every solvent molecule, we can substitute a continuous hydrodynamic description for them, in which extensive magnitudes such as the mass density, momentum density and energy density fields take their place. In this framework we lose information about the precise location of each molecule, but we gain information about how many of them are located within a given region of the fluid. Particles entering and leaving the considered region of the fluid produce the fluctuations mentioned before [41]. These fluctuations are in turn responsible for the Brownian motion of the colloidal particles. Dissipative Particle Dynamics, for instance, could be employed for the description of the solvent at this scale [42,43,44].

If the colloidal particles are massive compared to those of the solvent, their change in position and momentum will be slow compared to the scale of hydrodynamic interactions. Therefore, one can envisage the elimination of the hydrodynamic field so as to keep the positions and momenta of the colloidal particles as the sole variables of our system, \(\varvec{z} = \{\varvec{P}_i, \varvec{Q}_i\}\), \(i=1, \ldots , N_\text {col}\). This description gives rise to a Fokker-Planck (FP) equation. The general FP equation works with a probability density function \(\psi (\varvec{z},t)\) that reflects the probability of finding a particle at a given position at a given time instant. Its evolution in time is given by

$$\begin{aligned} \frac{\partial \psi (\varvec{z},t)}{\partial t} = -\frac{\partial }{\partial \varvec{z}}\cdot \left( \varvec{A}(\varvec{z},t)\psi (\varvec{z},t)\right) + \frac{1}{2}\frac{\partial }{\partial \varvec{z}}\frac{\partial }{\partial \varvec{z}}:\left( \varvec{D}(\varvec{z},t)\psi (\varvec{z},t) \right) . \end{aligned}$$
(2)

\(\varvec{A}\) is a term that takes into account deterministic effects related to the macroscopic velocity drift, while \(\varvec{D}\) is a diffusion tensor related to Brownian effects [45, 46]. The FP equation has an equivalent Itô stochastic differential equation of the form

$$\begin{aligned} d\varvec{z} = \varvec{A}(\varvec{z},t)dt + \varvec{B}(\varvec{z},t)\cdot d\varvec{W}, \end{aligned}$$
(3)

where \(\varvec{D}= \varvec{B}\cdot \varvec{B}^\top \) and \(\varvec{W}\) is a multidimensional Wiener process. In the case of our colloidal suspension, the influence of the hydrodynamic field is taken into account by introducing a friction tensor \(\varvec{\zeta }_{ij}\) that depends on the relative position of colloidal particles i and j. The equivalent Itô stochastic differential equation has the form

$$\begin{aligned} d\varvec{Q}_i=\frac{\varvec{P}_i}{M_i}dt,\;\; d\varvec{P}_i = \varvec{F}_i^{CC}dt -\sum _j \varvec{\zeta }_{ij}\cdot \varvec{V}_jdt + d\tilde{\varvec{F}}_i, \end{aligned}$$

with \(\varvec{V}_i = \varvec{P}_i/M_i\), \(\varvec{F}_i^{CC}\) a force of interaction among colloidal particles and \(\tilde{\varvec{F}}_i\) a stochastic force described by a Wiener process satisfying \(d\tilde{\varvec{F}}_id\tilde{\varvec{F}}_j= 2k_BT\varvec{\zeta }_{ij}dt\) [47].
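The structure of Eq. (3) can be sketched in a few lines of NumPy. Here a scalar Ornstein-Uhlenbeck process plays the role of the suspension dynamics (the drift \(A(z)=-\theta z\), the noise amplitude and all discretization parameters are illustrative assumptions), integrated with the standard Euler-Maruyama scheme:

```python
import numpy as np

# Euler-Maruyama discretization of Eq. (3), dz = A dt + B dW, for a
# scalar Ornstein-Uhlenbeck process: A(z) = -theta*z, B = sigma.
# All parameter values here are illustrative assumptions.
rng = np.random.default_rng(0)
theta, sigma = 1.0, 1.0
dt, n_steps, n_paths = 0.01, 2_000, 5_000

z = np.zeros(n_paths)
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Wiener increments
    z = z + (-theta * z) * dt + sigma * dW

# The stationary variance of this process is sigma**2 / (2*theta) = 0.5,
# which the empirical variance over the ensemble of paths approaches.
print(z.var())
```

The fluctuating term `sigma * dW` is what encodes the eliminated (hydrodynamic) degrees of freedom, in the spirit of the fluctuation-dissipation theorem mentioned above.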

A still coarser description can be obtained if the colloidal particles are located far from each other. In that case, we can assume that \(\varvec{\zeta }_{ij} = \delta _{ij} \varvec{I} \zeta \), where \(\zeta \) is a friction coefficient. This gives rise to the set of so-called Langevin equations. All this works well only if the density of the colloidal particles is much bigger than that of the solvent. Otherwise, we cannot isolate the positions and momenta of the colloidal particles as the sole variables of our model.

Coarse-graining this model even further, one can note that the positions of the colloidal particles change much more slowly than their momenta, and try to keep \(\varvec{Q}_i\) alone as the governing variables of our model. This gives rise to the so-called Smoluchowski equation [48]. One can even decide that we are not interested in knowing the position of each and every colloidal particle in our suspension, but rather in their density over a particular region. This is accomplished by introducing a concentration field as the sole variable of the system, which gives rise, in turn, to the well-known Fick equation.

This whole process ends by considering that we are only interested in the system at equilibrium. The description that takes into account only the invariants of the system is the scale of thermodynamics. For such a simple system we have thus mentioned six different possible scales at which its dynamics can be described. As mentioned before, at each of these scales a given degree of uncertainty appears, as a consequence of our lack of knowledge about the precise microscopic state (at the molecular dynamics scale). This uncertainty appears as stochasticity. In general, at each level of description, the least biased distribution is the one that maximizes the entropy functional at that scale.

It is therefore evident that by performing experiments at a given scale—something that usually depends on the available instruments and not on our choice—we are choosing a particular description of the phenomenon under scrutiny that may heavily influence the way in which the system reveals itself to us. In the absence of prior knowledge of the system, we are not even in a position to guarantee the Markovian character of the description. Recently, however, data analysis methods have been devised that help in determining the precise number of internal, phenomenological variables needed for an accurate description of the history of the system [38].

Under this prism, we can ask ourselves whether a suitable description of our system is already available in the literature, or whether the phenomenon is completely unknown to us. In the former case, this description will frequently be given in the form of a partial differential equation. The general form of this PDE may be known, but not the precise values of its coefficients nor the boundary conditions applying under the laboratory conditions. Should this be the case, the formalism of physics-informed neural networks (PINNs) becomes the natural way to tackle the problem, see Sect. 3. If, on the other hand, we do not have any information on the particular form of the law being sought—think of the equations governing dark matter, for instance—a supervised learning approach based on the dynamical systems equivalence should be preferred. These approaches are analyzed in depth in Sect. 4 below.

3 Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) have been a great success in computational mechanics and mathematics and are, perhaps, responsible for a very intense research activity in the field [30, 49,50,51,52,53,54,55,56]. PINNs assume that a general, nonlinear PDE of the form

$$\begin{aligned} \dot{ \varvec{z}} + \mathcal N [\varvec{z};\,\varvec{\mu }]=\varvec{0}, \end{aligned}$$
(4)

is known to govern the physics taking place. Here, \(\mathcal N\) is a nonlinear differential operator and \(\varvec{\mu }\) represents a set of parameters. Of course, \(\varvec{z}(\varvec{x},t)\), \(\varvec{x}\in \Omega \), represents the governing variables of the problem, defined in some open set \(\Omega \subset \mathbb R^n\). The problem is then posed so as to find, given some measurements on the system, the value of the variables governing it, \(\varvec{z}(\varvec{x},t)\), and the fittest parameters \(\varvec{\mu }\) that reproduce these measurements.

PINNs somehow resemble a collocation method. Given some measurements \(\varvec{z}^i(t_{\varvec{z}}^i, \varvec{x}_{\varvec{z}}^i )\), the residual of Eq. (4) is defined as

$$\begin{aligned} \mathcal R = \dot{\varvec{z}} + \mathcal N [\varvec{z};\,\varvec{\mu }]. \end{aligned}$$

By approximating \(\varvec{z}\) through a deep neural network, and applying automatic differentiation, one arrives at a neural network approximation to \(\mathcal R\). By minimizing the mean squared error defined as

$$\begin{aligned} MSE = MSE_{\varvec{z}} + MSE_{\mathcal R}, \end{aligned}$$
(5)

with

$$\begin{aligned} MSE_{\varvec{z}} = \frac{1}{N_{\varvec{z}}} \sum _{i=1}^{N_{\varvec{z}}} \left| \varvec{z}(t_{\varvec{z}}^i, \varvec{x}_{\varvec{z}}^i )- \varvec{z}^i \right| ^2, \end{aligned}$$

and

$$\begin{aligned} MSE_{\mathcal R} = \frac{1}{N_{\mathcal R}} \sum _{i=1}^{N_{\mathcal R}} \left| \mathcal R ( t_{\mathcal R}^i, \varvec{x}_{\mathcal R}^i )\right| ^2, \end{aligned}$$

where \(\{t_{\varvec{z}}^i, \varvec{x}_{\varvec{z}}^i, \varvec{z}^i\}_{i=1}^{N_{\varvec{z}}}\) denotes measurements of the initial and boundary values and \(\{ t_{\mathcal R}^i, \varvec{x}_{\mathcal R}^i \}_{i=1}^{N_{\mathcal R}}\) represents the collocation points for \(\mathcal R\).
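The structure of the two-term loss (5) can be illustrated with a minimal sketch. Here the closed-form solution \(u(x,t)=e^{-t}\sin x\) of the 1D heat equation \(u_t - u_{xx} = 0\) stands in for the network output, and hand-coded derivatives stand in for automatic differentiation (the equation, the sampling and the point counts are illustrative assumptions):

```python
import numpy as np

# Sketch of the PINN loss of Eq. (5) for the 1D heat equation
# u_t - u_xx = 0.  The closed form u(x,t) = exp(-t)*sin(x) plays the
# role of the network output, and analytic derivatives play the role
# of automatic differentiation; both are illustrative assumptions.
def u(x, t):
    return np.exp(-t) * np.sin(x)

def residual(x, t):
    u_t = -np.exp(-t) * np.sin(x)   # time derivative of u
    u_xx = -np.exp(-t) * np.sin(x)  # second space derivative of u
    return u_t - u_xx

rng = np.random.default_rng(1)

# "Measurements": the initial condition u(x, 0) = sin(x).
x_data = rng.uniform(0.0, np.pi, 50)
z_data = np.sin(x_data)
mse_z = np.mean((u(x_data, 0.0) - z_data) ** 2)

# Collocation points for the residual, scattered in space-time.
x_col = rng.uniform(0.0, np.pi, 200)
t_col = rng.uniform(0.0, 1.0, 200)
mse_r = np.mean(residual(x_col, t_col) ** 2)

mse = mse_z + mse_r   # the total loss of Eq. (5)
print(mse)            # vanishes for the exact solution
```

During training, a real PINN would minimize `mse` over the network weights; the sketch only verifies that the loss vanishes when the candidate field solves the PDE and matches the data.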

When learning the physics governing some given phenomenon, knowing in advance the PDE that best describes it may seem too stringent a condition, but the fact is that we already know many details about much of the physics surrounding us. In fact, in 1929, Paul Dirac stated that [57],

The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.

It seems therefore reasonable to think that some form of PDE is conceivable for the phenomena under scrutiny. Even if this is not completely true—think again of an engineer trying to model a beam from observation: should he or she consider an Euler-Bernoulli-Navier model or a Timoshenko one?—PINNs constitute nowadays a very popular method for solving PDEs from data.

As discussed in Sect. 2 above, learning physics from data is even more difficult than that. The approach just presented does not guarantee, of course, that the resulting prediction made by such a learned PDE will be consistent with the principles of thermodynamics which, as justified before, act as a restriction of a very high epistemic level.

If some symmetries of the system are known in advance (remember that, thanks to Noether’s theorem, for each symmetry of the system there is a conserved quantity, see [58] or its recent translation into English [59]), these can be imposed beforehand. This is precisely the path followed in [60] for Hamiltonian systems of the form given by Eq. (1). The PDE (4) is then restricted to an ordinary differential equation (ODE). We will come back to this approach in Sect. 4 below.

3.1 Thermodynamics-Based Artificial Neural Networks

In a series of recent papers, I. Stefanou and coworkers presented a technique that employs thermodynamics as a restriction for learning constitutive laws [61,62,63,64]. They coined the technique Thermodynamics-based Artificial Neural Networks (TANNs). TANNs encode the two laws of thermodynamics by making use of automatic differentiation, thus ensuring by construction the fulfillment of these laws, without the need to learn them from data. This also avoids any violation of these laws for previously unseen data.

Conservation of energy can be expressed as a PDE of the form

$$\begin{aligned} \rho \dot{e} = \varvec{\sigma }\cdot \nabla ^s \varvec{v} - \nabla \cdot \varvec{q} + \rho h, \end{aligned}$$
(6)

with \(\rho \) the density, \(\varvec{\sigma }\) the stress tensor, \(\varvec{v}\) the velocity field, \(\varvec{q}\) the heat flux and h the energy supply per unit mass. In turn, the second principle of thermodynamics can be expressed as

$$\begin{aligned} \rho (\theta \dot{s}-\dot{e}) + \varvec{\sigma }\cdot \nabla ^s \varvec{v} -\frac{\varvec{q}\cdot \nabla \theta }{\theta } \ge 0, \end{aligned}$$
(7)

with s the specific entropy and \(\theta \) the temperature.

TANNs try to avoid black-box ANNs by imposing the fulfillment of Eqs. (6) and (7). When fed with the current state of the material, \(\varvec{z}_t =\{ \varvec{\varepsilon }_t, \Delta \varvec{\varepsilon }, \varvec{\sigma }_t, \theta _t, \varvec{\zeta }_t, \Delta t\}\), with \(\varvec{\zeta }\) a set of internal variables and t the time, they produce an output composed of \(\{\Delta \varvec{\zeta }, \Delta \theta ,F_{t+\Delta t},\Delta \varvec{\sigma }\}\), where \(F=E-S\theta \) is the Helmholtz free energy, \(S=\rho s\) and E is an energy potential (assumed to be rate-independent, such that \(\varvec{\sigma }= \frac{\partial E}{\partial \varvec{\varepsilon }}\) and \(\theta = \frac{\partial E}{\partial S}\)).

TANNs actually learn from data the values of two scalars: the Helmholtz free energy F and the dissipation \(D= \rho (\theta \dot{s}-\dot{e}) + \varvec{\sigma }\cdot \nabla ^s \varvec{v}\). For this to be possible, the increments in the internal variables and the temperature are also learned. The rest of the ingredients of Eqs. (6) and (7) are computed by automatic differentiation.
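The thermodynamic structure that TANNs enforce can be illustrated on a hand-written one-dimensional viscoelastic model (all constitutive choices and parameter values below are assumptions for the sake of illustration, not the models of [61,62,63,64]); the free energy and its derivatives are coded analytically where a TANN would employ a network plus automatic differentiation:

```python
import numpy as np

# Illustrative 1D model with internal variable zeta:
#   free energy      F(eps, zeta) = E_mod*(eps - zeta)**2 / 2
#   flow rule        zeta_dot = (eps - zeta) / tau
#   dissipation      D = -dF/dzeta * zeta_dot = E_mod*(eps - zeta)**2 / tau
# By construction D >= 0, which is the second-law structure TANNs encode.
E_mod, tau = 1.0, 0.5            # assumed stiffness and relaxation time

def free_energy(eps, zeta):
    return 0.5 * E_mod * (eps - zeta) ** 2

dt, n_steps = 1e-3, 4_000
eps, zeta = 0.0, 0.0
min_dissipation = np.inf
for k in range(n_steps):
    eps = 1e-3 * k               # slow monotonic loading program
    zeta_dot = (eps - zeta) / tau            # internal-variable evolution
    dissipation = E_mod * (eps - zeta) ** 2 / tau
    min_dissipation = min(min_dissipation, dissipation)
    zeta += dt * zeta_dot

sigma = E_mod * (eps - zeta)     # stress = dF/deps
print(min_dissipation)           # never negative along the whole path
```

The point of the sketch is structural: because the dissipation is derived from the free energy and the flow rule, it stays non-negative for any loading path, with no need to check the second law a posteriori.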

A somewhat related approach has recently been presented in [65]. An alternative approach based on exterior calculus is also developed in [66].

3.2 Variational Onsager Neural Networks

Huang and coworkers recently developed a technique based on the application of the Onsager variational principle [67, 68], which they coined Variational Onsager Neural Networks, VONNs [69]. Again, their technique is based on the learning of two potentials: the free energy F and the dissipation potential D. The resulting NN strongly enforces the fulfillment of the second law of thermodynamics.

For systems at constant temperature and free of inertial effects, the Onsager principle takes the form

$$\begin{aligned} \min _{\varvec{\zeta }} \mathcal R [\varvec{z}, \varvec{\zeta }], \end{aligned}$$
(8)

where \(\mathcal R\) is the so-called Rayleighian and \(\varvec{z}\) and \(\varvec{\zeta }\) represent the state variables and the process (internal) variables, respectively. The Rayleighian takes the form

$$\begin{aligned} \mathcal R [\varvec{z}, \varvec{\zeta }] = \dot{F} [\varvec{z}, \varvec{\zeta }] + D[\varvec{z}, \varvec{\zeta }] + P[\varvec{z}, \varvec{\zeta }], \end{aligned}$$
(9)

with \(P[\varvec{z}, \varvec{\zeta }]\) the power supplied by external forces.

From the variational principle, Eq. (8), we arrive at

$$\begin{aligned} \frac{\delta \dot{F}}{\delta \varvec{\zeta }} + \frac{\delta D}{\delta \varvec{\zeta }} + \frac{\delta P}{\delta \varvec{\zeta }} =0. \end{aligned}$$

F and D are each learned through a different NN, by minimizing a loss function à la PINN, see Eq. (5). The convexity of the dissipation potential D must be imposed explicitly, by making use of the input convex neural network paradigm [70]. The same may apply to the free energy potential F.
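The variational structure of Eqs. (8) and (9) can be sketched for a single internal variable (a toy problem with assumed quadratic potentials, not the VONN architecture itself): minimizing the Rayleighian over the rate recovers the expected evolution equation.

```python
import numpy as np

# Minimal sketch of the Onsager principle, Eq. (8), for one internal
# variable.  With F = k*zeta**2/2, D = c*v**2/2 and P = -f*v (where
# v = zeta_dot), the Rayleighian is R(v) = k*zeta*v + c*v**2/2 - f*v,
# whose minimizer gives the evolution equation v = (f - k*zeta)/c.
# All potentials and parameter values are illustrative assumptions.
k, c, f, zeta = 2.0, 0.5, 1.0, 0.25

def rayleighian(v):
    F_dot = k * zeta * v          # dF/dt = (dF/dzeta) * zeta_dot
    D = 0.5 * c * v ** 2          # quadratic dissipation potential
    P = -f * v                    # power of the external force
    return F_dot + D + P

# Brute-force minimization over a grid, checked against the closed form.
v_grid = np.linspace(-5.0, 5.0, 100_001)
v_star = v_grid[np.argmin(rayleighian(v_grid))]
v_exact = (f - k * zeta) / c
print(v_star, v_exact)
```

In a VONN, the quadratic potentials above are replaced by (convex) networks, and the stationarity condition \(\delta \dot F/\delta \varvec{\zeta} + \delta D/\delta \varvec{\zeta} + \delta P/\delta \varvec{\zeta} = 0\) plays the role the grid search plays here.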

VONNs require data for both \(\varvec{z}\) and \(\varvec{\zeta }\). This means that, very much like TANNs, this technique needs data on internal variables that are often impossible to measure. If these networks work, it is thanks to the availability of synthetic data coming from simulations. We will discuss this limitation further in Sect. 4. In any case, both techniques are able to impose in a soft manner the fulfillment of the second law of thermodynamics.

In the next section we analyze a different family of techniques based on dynamical systems, ODEs, instead of PDEs, and discuss their relative advantages and disadvantages.

4 Learning Based on a Dynamical Systems Analogy

In a somewhat different setting, Weinan E showed that the learning problem has the same structure as a dynamical system [31, 37, 71]. In these works, it is suggested that supervised learning shares the structure of a dynamical system of the form

$$\begin{aligned} \frac{d \varvec{z}}{dt} = f(\varvec{z},t),\;\; \varvec{z}(0)=\varvec{z}_0, \end{aligned}$$
(10)

so that the flow map

$$\begin{aligned} \varvec{z}_0 \rightarrow \varvec{z}(T,\varvec{z}_0), \end{aligned}$$

is produced by a nonlinear function f whose precise form is sought. This function can be found by employing classical regression methodologies such as linear regression, support vector machines [72], or others. Of course, given the universal approximation theorem for neural networks, these appear as an appealing choice [73,74,75]. But the most interesting part of this approach is the recognition that, once the problem is given the form of a dynamical system, all our previous knowledge in the field can be advantageously exploited to impose known structures on the system. The simplest structures one can think of are, of course, the Hamiltonian structure given by Eq. (1), if the system is known to be conservative, or the gradient flow structure, for instance [76]. For non-conservative variables, their evolution can be established through some potential \(\mathcal R\) in the form [77]

$$\begin{aligned} \frac{d\varvec{z}}{dt}= -\frac{\partial \mathcal R}{\partial \varvec{z}}. \end{aligned}$$
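A minimal sketch of this gradient-flow structure (the potential \(\mathcal R(\varvec{z}) = \frac{1}{2}\Vert \varvec{z}\Vert ^2\), the initial condition and the step size are illustrative assumptions) shows the evolution decaying toward the minimizer:

```python
import numpy as np

# Explicit-Euler discretization of the gradient flow dz/dt = -dR/dz
# for the assumed potential R(z) = ||z||**2 / 2, whose gradient is z.
def grad_R(z):
    return z

z = np.array([2.0, -1.0])       # assumed initial condition
dt = 0.1
for _ in range(200):
    z = z - dt * grad_R(z)      # one explicit Euler step of the flow

print(np.linalg.norm(z))        # decays toward the minimizer z = 0
```

For a convex potential, this structure guarantees that \(\mathcal R\) decreases monotonically along trajectories, which is precisely the kind of prior knowledge the dynamical-systems view lets us build into the learned model.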

As can be readily noticed, this framework assumes no known form for the problem at hand. In sharp contrast to the framework of PINNs, where the governing PDE is assumed to be known, this dynamical systems equivalence assumes no previous knowledge of the physics from which the data are obtained. We explore this formalism in more detail in the following sections.

4.1 Neural Networks for Conservative Systems

This framework has attracted the interest of many researchers in recent times. For instance, in [60] a method is developed for systems of the form given by Eq. (1). To better explain this method, let us introduce some notation first.

A differentiable map \(\phi : U\subset \mathbb R^{2N}\rightarrow \mathbb R ^{2N}\) is said to be symplectic if

$$\begin{aligned} \left( \frac{\partial \phi }{\partial \varvec{x}} \right) ^\top \varvec{L}^\top \left( \frac{\partial \phi }{\partial \varvec{x}} \right) = \varvec{L}^\top . \end{aligned}$$

In particular, if \(\phi _t(\varvec{z}_0)\) is the flow map of a Hamiltonian system, it is a symplectic map,

$$\begin{aligned} \left( \frac{\partial \phi _t}{\partial \varvec{z}_0} \right) ^\top \varvec{L}^\top \left( \frac{\partial \phi _t}{\partial \varvec{z}_0} \right) = \varvec{L}^\top . \end{aligned}$$

Assume that we are in the position of obtaining experimental data about the system at hand in the form

$$\begin{aligned} \mathcal D=\{ \varvec{z}_i,\varvec{y}_i=\phi _h(\varvec{z}_i) \}_{i=1}^{\texttt {n}_{\text {samples}}}, \end{aligned}$$

with h the time step size of an adequate numerical integration scheme for the problem. With these data we can think of constructing a feedforward neural network whose loss is of the form

$$\begin{aligned} MSE = MSE_d+\lambda \cdot MSE_s, \end{aligned}$$

with

$$\begin{aligned} MSE_d = \frac{1}{\texttt {n}_{\text {samples}}} \sum _{i=1}^{\texttt {n}_{\text {samples}}} \Vert \phi _h(\varvec{z}_i) -\varvec{y}_i \Vert ^2 \end{aligned}$$

the typical loss on the accuracy of the predictions and

$$\begin{aligned} MSE_s = \frac{1}{\texttt {n}_{\text {samples}}} \sum _{i=1}^{\texttt {n}_{\text {samples}}} \left\| \left[ \left( \frac{\partial \phi _t}{\partial \varvec{z}} \right) ^\top \varvec{L}^\top \left( \frac{\partial \phi _t}{\partial \varvec{z}} \right) \right] (\varvec{z}_i) -\varvec{L}^\top \right\| ^2. \end{aligned}$$
(11)

\(\lambda \) is a parameter that takes into account the different relative sizes of both losses. The loss (11) enforces the symplectic structure of the sought matrix \(\varvec{L}\), thus enforcing in turn the Hamiltonian character of the resulting approximation and its inherently conservative behavior. The resulting network actually works as a time integrator with time step h.
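The role of the structural term (11) can be checked on explicit flow maps (an illustrative sanity check in NumPy, not the trained network of [60]): the exact flow of a harmonic oscillator, a rotation in phase space, makes the symplecticity residual vanish, while a damped variant of the same map does not.

```python
import numpy as np

# Residual of the symplecticity condition J^T L^T J = L^T used in the
# loss term (11), evaluated on two explicit flow-map Jacobians.
L_T = np.array([[0.0, -1.0],
                [1.0,  0.0]])   # transpose of the symplectic matrix

def mse_s(J):
    # Squared Frobenius norm of the symplecticity residual.
    return np.sum((J.T @ L_T @ J - L_T) ** 2)

h = 0.3
# Exact flow of the harmonic oscillator over a step h: a rotation.
J_hamiltonian = np.array([[np.cos(h),  np.sin(h)],
                          [-np.sin(h), np.cos(h)]])
# A damped (non-Hamiltonian) variant of the same map.
J_dissipative = 0.9 * J_hamiltonian

print(mse_s(J_hamiltonian))     # ~0: the Hamiltonian flow is symplectic
print(mse_s(J_dissipative))     # > 0: dissipation breaks symplecticity
```

During training, the network's own Jacobian with respect to the input plays the role of `J`, so driving this residual to zero steers the learned map toward the symplectic family.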

Many researchers have followed similar strategies. For instance, a year or so before Jin et al. [60], Greydanus et al. introduced the so-called Hamiltonian Neural Networks for canonical, discrete systems, in which the loss term is of the form [78]

$$\begin{aligned} MSE = \left\| \frac{\partial \mathcal H}{\partial \varvec{p}} - \frac{\partial \varvec{q}}{\partial t} \right\| _2 + \left\| \frac{\partial \mathcal H}{\partial \varvec{q}} + \frac{\partial \varvec{p}}{\partial t} \right\| _2, \end{aligned}$$
(12)

where \(\varvec{q}\) and \(\varvec{p}\) represent, respectively, the positions and momenta of discrete particles. A similar approach is followed in [79]. The same loss function, Eq. (12), is employed in [80]. In that work, however, the authors solve parametric problems, so the input data set is collected for different values of these parameters, thus generalizing this type of network while keeping its conservative character.
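The loss of Eq. (12) can be evaluated in closed form for candidate Hamiltonians (the data set and the candidates below are illustrative assumptions, with analytic gradients standing in for the network):

```python
import numpy as np

# Sketch of the Hamiltonian Neural Network loss of Eq. (12), with
# closed-form candidate Hamiltonians standing in for the network.
rng = np.random.default_rng(2)
q = rng.normal(size=100)
p = rng.normal(size=100)
# Trajectory data generated by the true Hamiltonian H = (q**2 + p**2)/2:
q_dot = p          #  dq/dt =  dH/dp
p_dot = -q         #  dp/dt = -dH/dq

def hnn_loss(dH_dq, dH_dp):
    # The two-term loss of Eq. (12), with 2-norms over the data set.
    return (np.linalg.norm(dH_dp - q_dot)
            + np.linalg.norm(dH_dq + p_dot))

loss_true = hnn_loss(dH_dq=q, dH_dp=p)        # gradients of the true H
loss_wrong = hnn_loss(dH_dq=2*q, dH_dp=2*p)   # gradients of H' = q**2 + p**2
print(loss_true, loss_wrong)
```

The loss vanishes only when the candidate's gradients reproduce the canonical equations, which is what drives the trained network toward the true Hamiltonian (up to an additive constant).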

Even more sophisticated loss functions can be considered. For instance, Bertalan et al. assume that their data include not only positions and momenta, but also their derivatives. Their loss term is therefore of the form [81]

$$\begin{aligned} \text {Loss} = \sum _{k=1}^4 c_kf_k, \end{aligned}$$

where each one of the four terms looks like

$$\begin{aligned} f_1&= \left\| \frac{\partial \mathcal H}{\partial \varvec{p}} - \frac{\partial \varvec{q}}{\partial t} \right\| _2,\\ f_2&= \left\| \frac{\partial \mathcal H}{\partial \varvec{q}} + \frac{\partial \varvec{p}}{\partial t} \right\| _2,\\ f_3&= (\mathcal H(\varvec{q}_0, \varvec{p}_0) - \mathcal H_0)^2, \end{aligned}$$

(this term serves for disambiguation only, since the Hamiltonian can be determined only up to a constant; it is therefore assumed to be known at a point \((\varvec{q}_0, \varvec{p}_0)\)), and

$$\begin{aligned} f_4 = \left\| \frac{\partial \mathcal H}{\partial \varvec{q}}\dot{\varvec{q}} + \frac{\partial \mathcal H}{\partial \varvec{p}}\dot{\varvec{p}} \right\| _2. \end{aligned}$$

The number of works that employ related approaches is huge, and they date back to at least 1993 [82]. In [83], for instance, an improved training method, based on the symplectic character of the equations, is developed for this type of network. Finzi et al. suggest a modification of the above techniques by working in Cartesian coordinates rather than in phase space [84]. To improve the expressiveness of these networks, Tong et al. introduce a method that adds Taylor series expansions designed with a symmetric structure [85].

Always with the same Hamiltonian structure in mind, Chen et al. introduce it into the realm of recurrent neural networks (RNNs) [86]. Even generative neural networks exist under this Hamiltonian prism [87]. Following a similar rationale, Di Pietro et al. improve the results by introducing a fourth-order time integration scheme into the learning scheme [88]. Choudhary et al. focus their attention on the transition from ordered to chaotic systems, and show that Hamiltonian Neural Networks (HNNs) perform much better than classical, black-box NNs [89]. Other works also confirm the superiority of HNNs over classical NNs [90, 91].

But these architectures are interesting not only for their inductive biases, which force them to follow conservative dynamics. Galimberti and coworkers have also demonstrated that HNNs eliminate by construction the problem of vanishing gradients [92], present in many NN architectures [93]. A very recent survey paper has even been written on this particular family of techniques [94].

An alternative route to follow in this vast family of conservative phenomena is to employ Lagrangian, instead of Hamiltonian, formalisms. One advantage of doing so is the possibility of employing arbitrary coordinates instead of canonical coordinates, which must satisfy a set of rules in terms of the so-called Poisson bracket (although we will see later on that this does not constitute a true limitation for many systems of practical interest). The Lagrangian formalism assumes a data set composed of pairs \(\varvec{z}_i = \{ \varvec{q}_i, \dot{\varvec{q}}_i \}\), \(i=1,\ldots ,\texttt {n}_{\text {samples}}\) (positions and velocities of the particles). As is well known, the Lagrangian formalism defines the so-called action functional as

$$\begin{aligned} S= \int _{t_0}^{t_1} \left( T (\varvec{q}, \dot{\varvec{q}}) - V(\varvec{q}) \right) dt, \end{aligned}$$

with T the kinetic energy and V the potential energy, so that a dynamical system will follow the path that renders S stationary. This path is obtained by imposing the so-called Euler-Lagrange equations,

$$\begin{aligned} \frac{d}{dt} \frac{\partial \mathcal L}{\partial \dot{\varvec{q}}} = \frac{\partial \mathcal L}{\partial {\varvec{q}}}, \end{aligned}$$

with \(\mathcal L = T-V\) the Lagrangian of the system. This Lagrangian is precisely the objective to be reconstructed from data. By standard algebraic manipulations, we arrive at

$$\begin{aligned} \ddot{\varvec{q}} = (\nabla _{\dot{\varvec{q}}} \nabla _{\dot{\varvec{q}}}^\top \mathcal L)^{-1} \left[ \nabla _{{\varvec{q}}} \mathcal L - (\nabla _{{\varvec{q}}} \nabla _{\dot{\varvec{q}}}^\top \mathcal L) \dot{\varvec{q}}\right] . \end{aligned}$$

From this expression it is therefore straightforward to define a loss function

$$\begin{aligned} \text {Loss} = \Vert \ddot{\varvec{q}}^{\mathcal L} - \ddot{\varvec{q}}^{\text {true}} \Vert _2, \end{aligned}$$

so as to obtain an approximation to \(\mathcal L\). This is the approach followed in [95,96,97,98,99,100,101], among others. Again, the number of works devoted to this approach, and their recent dates, proves the interest of the community in it.
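The acceleration formula above is easy to exercise on a system whose Lagrangian is known in closed form. The following numpy sketch (our own toy check, not code from the cited works) evaluates it for a one-degree-of-freedom harmonic oscillator, \(\mathcal L = \frac{1}{2}m\dot{q}^2 - \frac{1}{2}kq^2\), for which all the required derivatives are analytic:

```python
import numpy as np

def qddot_from_lagrangian(hess_qdot, grad_q, mixed, qdot):
    """Acceleration predicted by a Lagrangian:
    qddot = H^{-1} (grad_q L - (grad_q grad_qdot L) qdot),
    with H the Hessian of L with respect to qdot."""
    return np.linalg.solve(hess_qdot, grad_q - mixed @ qdot)

# Toy check: L = 0.5*m*qdot^2 - 0.5*k*q^2 (harmonic oscillator)
m, k = 2.0, 8.0
q = np.array([0.3])
qdot = np.array([1.0])
hess = np.array([[m]])          # d2L/dqdot2
grad_q = -k * q                 # dL/dq
mixed = np.zeros((1, 1))        # d2L/(dq dqdot): zero here
print(qddot_from_lagrangian(hess, grad_q, mixed, qdot))  # [-k/m * q] = [-1.2]
```

In a Lagrangian Neural Network the same formula is applied to the learned \(\mathcal L\), with the derivatives obtained by automatic differentiation.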

4.2 Neural Networks for Dissipative Phenomena

Despite the success in the development of network architectures that impose conservation of energy (achieved mainly in the last two years), many researchers have realized that, following the reasoning in Sect. 2, dissipation is present in nearly every phenomenon of interest. It is therefore of the utmost importance to develop techniques able to satisfy the principles of thermodynamics in the presence of dissipation. We give an overview of these techniques in this section.

4.2.1 Deconstructing Inductive Biases

Some authors, aware of the need to include dissipation in the formulation, have opted for its direct inclusion by simply relaxing the fulfillment of the restrictions (inductive biases) introduced in the previous section. This is the approach followed in Symplectic ODE-Nets (SymODEN) [102, 103], inspired by the literature on the control of dynamical systems. It is also the approach followed in a very recent work [104]. Actually, both are heavily influenced by the port-Hamiltonian approach to dynamical systems, which has a strong tradition of introducing dissipation and control into the formulation of dynamical systems [105,106,107]. Essentially, port-Hamiltonian systems consider an evolution of the system of the form

$$\begin{aligned} \begin{bmatrix} \dot{\varvec{q}} \\ \dot{\varvec{p}} \end{bmatrix} = \left( \begin{bmatrix} \varvec{0} & \varvec{I} \\ -\varvec{I} & \varvec{0} \end{bmatrix}-\varvec{D}(\varvec{q}) \right) \begin{bmatrix} \frac{\partial \mathcal H}{\partial \varvec{q}} \\ \frac{\partial \mathcal H}{\partial \varvec{p}} \end{bmatrix} + \begin{bmatrix} \varvec{0}\\ \varvec{g}(\varvec{q}) \end{bmatrix}\varvec{u}, \end{aligned}$$
(13)

where, as can be noticed, dissipation is included through a symmetric, positive semi-definite matrix \(\varvec{D}\), and control is considered through an actuation \(\varvec{u}\). The formulation in Eq. (13) recovers the Hamiltonian structure if neither dissipation nor control is present. However, as will be demonstrated later, this formulation does not guarantee the fulfillment of the principles of thermodynamics and can be considered, in some sense, phenomenological.
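The qualitative effect of the matrix \(\varvec{D}\) is easy to see numerically. The toy numpy sketch below (our own construction, not code from the cited works) integrates Eq. (13) for a damped harmonic oscillator with no control, and shows that the energy decays:

```python
import numpy as np

def port_hamiltonian_step(z, dt, c=0.5):
    """One explicit-Euler step of Eq. (13) for a damped oscillator:
    H = (q^2 + p^2)/2, D = diag(0, c), and no control (u = 0)."""
    q, p = z
    dH = np.array([q, p])                       # gradient of H at z
    J = np.array([[0.0, 1.0], [-1.0, 0.0]])     # canonical structure matrix
    D = np.array([[0.0, 0.0], [0.0, c]])        # symmetric, PSD dissipation
    return z + dt * (J - D) @ dH

z = np.array([1.0, 0.0])
energy0 = 0.5 * z @ z
for _ in range(1000):
    z = port_hamiltonian_step(z, dt=0.01)
energy1 = 0.5 * z @ z
print(energy0, energy1)  # the final energy is far below the initial one
```

With \(c = 0\) the same code recovers the (approximately) energy-conserving Hamiltonian flow, up to the error of the explicit integrator.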

Of course, neural networks based upon this formulation have been developed in recent years [108,109,110,111,112,113]. They all share more or less the same ingredients and are based upon the formulation just discussed.

An alternative route is the one followed by Wang and coworkers [114]. Given the equivalence between symmetries and conservation laws for different magnitudes, the authors choose to begin by employing equivariant neural networks [115,116,117]. These invariance requirements are then relaxed. But again, the lack of thermodynamic foundations of this method does not guarantee, in principle, the fulfillment of the principles of thermodynamics.

4.2.2 Metriplectic Neural Networks

As mentioned above, the port-Hamiltonian and relaxed-equivariance formalisms guarantee only a phenomenological fulfillment of the laws of thermodynamics. There is no guarantee that, once faced with previously unseen data, or when trained with noisy data, these networks will produce an output with the right amount of dissipation.

In order to develop a consistent formulation, let us assume that, at a given level of description, as discussed in Sect. 2, the invariants of the system \(I(\varvec{z})\) can be expressed as a function of the resolved variables,

$$\begin{aligned} I(\varvec{z}) = \mathcal I(\varvec{z}), \end{aligned}$$

for some suitable function \(\mathcal I\). Let us also assume that the Hamiltonian of the system can be expressed as a function of these variables,

$$\begin{aligned} \mathcal H (\varvec{z}) = E(\varvec{z}), \end{aligned}$$

with E the actual energy of the system. Under these circumstances, the Fokker-Planck equation (2) and, more particularly, its equivalent Itô stochastic differential equation (3), take the form [42]

$$\begin{aligned} d\varvec{z} = \left[ \varvec{L}(\varvec{z}) \frac{\partial E}{\partial \varvec{z}} + \varvec{M}(\varvec{z})\frac{\partial S}{\partial \varvec{z}} + k_B\nabla \varvec{M}(\varvec{z}) \right] dt + d\tilde{\varvec{z}}, \end{aligned}$$
(14)

where \(k_B\) is the Boltzmann constant, \(\varvec{M}(\varvec{z})\) is a symmetric, positive semi-definite dissipation matrix, S is a second potential (the so-called Massieu potential, entropy at this level of description) and \(d\tilde{\varvec{z}}\) is a Wiener process that satisfies

$$\begin{aligned} d\tilde{\varvec{z}} = \varvec{B}(\varvec{z})d\varvec{W}(t), \end{aligned}$$

with \(\varvec{B}\) a non-square matrix satisfying

$$\begin{aligned} \varvec{B}(\varvec{z}) \varvec{B}(\varvec{z})^\top = 2k_B\varvec{M}(\varvec{z}). \end{aligned}$$

The importance of thermal fluctuations is controlled by the relative value of the Boltzmann constant \(k_B\) with respect to the average value of the entropy. Given that E, S, \(\varvec{L}\) and \(\varvec{M}\) do not depend on \(k_B\), if these effects are of low importance, we can take the limit \(k_B\rightarrow 0\), resulting in

$$\begin{aligned} \dot{\varvec{z}} = \varvec{L}(\varvec{z}) \frac{\partial E}{\partial \varvec{z}} + \varvec{M}(\varvec{z})\frac{\partial S}{\partial \varvec{z}}. \end{aligned}$$
(15)

The same assumption, \(k_B\rightarrow 0\), has two additional consequences (we omit the proof; the interested reader can consult [42]):

$$\begin{aligned} \varvec{L}(\varvec{z})\frac{\partial S}{\partial \varvec{z}} = \varvec{0}, \end{aligned}$$
(16)

and

$$\begin{aligned} \varvec{M}(\varvec{z})\frac{\partial E}{\partial \varvec{z}} = \varvec{0}, \end{aligned}$$
(17)

which constitute the ingredients of the celebrated General Equation for the Non-Equilibrium Reversible-Irreversible Coupling (GENERIC) [118,119,120,121,122]. This type of formulation is also known as metriplectic, since it combines metric and symplectic terms [123, 124]. However, in GENERIC, Eqs. (16) and (17), known as the degeneracy conditions, play a fundamental role. They are key ingredients in the demonstration of the a priori satisfaction of the two laws of thermodynamics: conservation of energy in closed systems and non-negative entropy production. Indeed, given Eqs. (16) and (17), it is straightforward to prove that, given the antisymmetry of \(\varvec{L}\),

$$\begin{aligned} \dot{E}(\varvec{z}) = \frac{\partial E}{\partial \varvec{z}}\dot{\varvec{z}} =0, \end{aligned}$$

and

$$\begin{aligned} \dot{S} = \frac{\partial S}{\partial \varvec{z}}\dot{\varvec{z}} = \frac{\partial S}{\partial \varvec{z}}\varvec{M}(\varvec{z}) \frac{\partial S}{\partial \varvec{z}}\ge 0, \end{aligned}$$

given the positive semi-definiteness of \(\varvec{M}\). Therefore, the GENERIC structure appears to be much more interesting than the port-Hamiltonian one: it guarantees the satisfaction of the laws of thermodynamics by construction. This makes GENERIC a very appealing choice for the construction of inductive biases in the learning of physical phenomena.
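To see these two properties at work numerically, the following numpy sketch (our own toy construction, not the implementation of the works cited below) integrates Eq. (15) for a damped oscillator augmented with an entropy variable, \(\varvec{z} = (q, p, S)\), using an \(\varvec{L}\), \(\varvec{M}\) pair that satisfies the degeneracy conditions exactly:

```python
import numpy as np

# Toy metriplectic (GENERIC) system: z = (q, p, S), E = (q^2 + p^2)/2 + S
# (unit temperature, so dE/dS = 1); c is a friction coefficient.
c = 0.5
L = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])            # Poisson (antisymmetric) part

def M(z):
    q, p, S = z
    v = np.array([0.0, 1.0, -p])           # chosen so that M @ grad_E = 0
    return c * np.outer(v, v)              # symmetric, positive semi-definite

def grads(z):
    q, p, S = z
    grad_E = np.array([q, p, 1.0])
    grad_S = np.array([0.0, 0.0, 1.0])
    return grad_E, grad_S

z = np.array([1.0, 0.0, 0.0])
E0 = 0.5 * (z[0]**2 + z[1]**2) + z[2]
dt = 0.001
for _ in range(1000):
    gE, gS = grads(z)
    # Degeneracy holds by construction: L @ gS == 0 and M(z) @ gE == 0
    z = z + dt * (L @ gE + M(z) @ gS)
E1 = 0.5 * (z[0]**2 + z[1]**2) + z[2]
print(E0, E1, z[2])  # total energy (almost) conserved; entropy z[2] grew
```

Mechanical energy is dissipated by the friction term, but it reappears as entropy, so that the total energy E is conserved up to the integrator's error while S never decreases.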

We assume that each data set \(\mathcal {D}_i\) contains labelled pairs formed by a state vector \(\varvec{z}_t\) and its one-step evolution in time \(\varvec{z}_{t+1}\),

$$\begin{aligned} \mathcal {D}=\{\mathcal {D}_i\}_{i=1}^{N_{\text {sim}}},\quad \mathcal {D}_i =\{(\varvec{z}_t,\varvec{z}_{t+1})\}_{t=0}^{T}, \end{aligned}$$
(18)

so that a neural network can be constructed by means of two loss terms: a data loss term that takes into account the correct prediction of the time evolution of the state vector using the GENERIC integrator, defined as

$$\begin{aligned} \mathcal {L}^{\text {data}}_n=\left\| \frac{d\varvec{z}^{\text {GT}}}{dt} -\frac{d\varvec{z}^{\text {net}}}{dt}\right\| ^2_2, \end{aligned}$$

where \(\Vert \cdot \Vert _2\) denotes the L2-norm. The time derivative is chosen instead of the state vector itself in order to keep the global loss function at a uniform order of magnitude with respect to the degeneracy terms.

A second loss term takes into account the fulfillment of the degeneracy equations,

$$\begin{aligned} \mathcal {L}^{\text {deg}}_n=\left\| \varvec{L}\frac{\partial S}{\partial \varvec{z}_n} \right\| ^2_2+\left\| \varvec{M}\frac{\partial E}{\partial \varvec{z}_n}\right\| ^2_2. \end{aligned}$$
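In code, this degeneracy penalty takes only a few lines. The following numpy sketch (a schematic version of ours, with a toy \(\varvec{L}\), \(\varvec{M}\) pair of our own choosing, not the networks of the works cited below) evaluates it at one snapshot:

```python
import numpy as np

def degeneracy_loss(L, M, grad_E, grad_S):
    """Soft penalty for the GENERIC degeneracy conditions at one snapshot:
    || L dS/dz ||^2 + || M dE/dz ||^2 (both should vanish)."""
    return float(np.sum((L @ grad_S)**2) + np.sum((M @ grad_E)**2))

# Toy check, z = (q, p, S): canonical L, and M = v v^T built to annihilate dE/dz
q, p = 0.4, -1.2
grad_E = np.array([q, p, 1.0])            # E = (q^2 + p^2)/2 + S
grad_S = np.array([0.0, 0.0, 1.0])
L = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
v = np.array([0.0, 1.0, -p])
M = np.outer(v, v)

print(degeneracy_loss(L, M, grad_E, grad_S))          # 0.0: both conditions hold
print(degeneracy_loss(L, np.eye(3), grad_E, grad_S))  # > 0: M = I violates them
```

During training, the gradients of the learned E and S are obtained by automatic differentiation, and the penalty is averaged over the batch.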

This formulation gave rise to the so-called structure-preserving neural networks [125] and thermodynamics-informed neural networks [126,127,128]. These networks have recently been employed in the development of physics perception with the help of computer vision techniques [129, 130].

The global loss term is a weighted mean of the two terms over the \(N_{\text {batch}}\) shuffled, batched snapshots,

$$\begin{aligned} \mathcal {L}=\frac{1}{N_{\text {batch}}}\sum _{n=0}^{N_{\text {batch}}}(\lambda \mathcal {L}^{\text {data}}_n+\mathcal {L}^{\text {deg}}_n). \end{aligned}$$
(19)

Note that the energy and entropy are learned from data about their gradients, so they are determined only up to an additive constant. Note also that the activation functions must have a sufficient degree of continuity to allow for this.

Recently, alternative approaches have been developed to impose these degeneracy restrictions in hard form, instead of soft form. For instance, [131] employs a particular parametrization of the \(\varvec{L}\) and \(\varvec{M}\) matrices. Zhang and coworkers [132] employ skew-symmetric matrices to enforce orthogonality. Notably, in this last work a universal approximation theorem is provided for this class of GENERIC-based networks, thus proving their expressivity.

An approach has also been developed in which inductive biases for the thermodynamic structure of the problem are used in conjunction with a graph structure in the network [128]. This approach has proved very convenient for the development of learned simulators trained from finite element data. The result is a network for the problem of interest in which previously unseen geometric modifications can be introduced in the domain, as well as the remeshings associated with these changes, without any decrease in accuracy.

Alternative versions based on GENERIC, but employing classical piecewise-linear regression instead of neural networks, also exist; see [133,134,135,136,137]. More details on the geometric and thermodynamic structure of information can be found in [138].

Something remarkable about this GENERIC approach is that it works whenever the set of state variables \(\varvec{z}\) is able to adequately represent the energy of the system [42]. This opens the possibility of constructing reduced-order models and applying thermodynamics-informed neural networks (TINNs) to learn the evolution equations on this reduced-order manifold. This is the approach followed in [126]. A sketch of this architecture is depicted in Fig. 1. First, a sparse autoencoder is trained [139] on the full-field variables \(\varvec{z}\). The autoencoder, which employs an L1-norm to enforce sparsity, then finds a latent representation \(\varvec{x}\) of the phenomenon. It is precisely in this latent space that the method seeks a GENERIC representation.
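The sparsity mechanism itself is simple. A schematic loss (our own minimal version, not the exact network of [126]) combines the reconstruction error with an L1 penalty on the latent code, so that superfluous latent coordinates are driven to zero:

```python
import numpy as np

def sparse_ae_loss(z_full, z_recon, x_latent, lam=1e-3):
    """Reconstruction MSE plus an L1 penalty on the latent code: the
    penalty pushes most latent coordinates to zero, revealing the
    intrinsic dimensionality of the data."""
    recon = float(np.mean((z_full - z_recon)**2))
    sparsity = lam * float(np.sum(np.abs(x_latent)))
    return recon + sparsity

z = np.random.default_rng(0).normal(size=100)
dense_code = np.ones(10)
sparse_code = np.zeros(10)
sparse_code[0] = 1.0
# For the same (perfect) reconstruction, the sparser code scores a lower loss:
print(sparse_ae_loss(z, z, sparse_code) < sparse_ae_loss(z, z, dense_code))  # True
```

The GENERIC structure (matrices, potentials and degeneracy penalties) is then learned on the surviving latent coordinates rather than on the full field.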

Fig. 1 Sketch of the architecture employed in [126] to learn reduced-order models of a given phenomenon

It is worth highlighting that the reduced-order modeling must follow this path: first reduce, then learn the GENERIC representation. The opposite (first learn the GENERIC model, then reduce) would make the ingredients of this formalism, \(\varvec{L}\), \(\varvec{M}\), E and S, lose their required properties.

4.2.3 Generalized GENERIC Structures

One of the most frequent criticisms of the use of GENERIC as an inductive bias when learning physical phenomena is that metriplectic formalisms are only able to consider quadratic dissipation potentials. However, this is not entirely true. While the early descriptions of metriplectic formalisms, and of GENERIC itself, included only quadratic dissipation potentials, see [118, 119, 123, 124], this is no longer the case for modern descriptions of the GENERIC formalism [121, 140].

To develop such a theory, we first introduce the conjugate variables

$$\begin{aligned} \varvec{z}^*= \frac{\partial S}{\partial \varvec{z}}=S_{\varvec{z}}. \end{aligned}$$

We also introduce the so-called dissipative thermodynamic forces \(\mathcal X(\varvec{z}^*)\). The third ingredient of the model is a dissipation potential

$$\begin{aligned} \Xi = \Xi (\varvec{z}, \mathcal X), \end{aligned}$$

with the following properties:

1. \(\Xi \) is a real-valued and regular function of \((\varvec{z}, \mathcal X)\).
2. \(\Xi (\varvec{z}, 0)=0\).
3. \(\Xi (\varvec{z}, \mathcal X)\) reaches its minimum at \(\mathcal X=\varvec{0}\).
4. \(\Xi (\varvec{z}, \mathcal X)\) is convex in a neighborhood of \(\mathcal X=\varvec{0}\).
5. \(\varvec{z}^*\frac{\partial \Xi }{\partial \varvec{z}^*} = K(\mathcal X, \frac{\partial \Xi }{\partial \mathcal X})>0\).

With these conditions, we have that

$$\begin{aligned} \dot{\varvec{z}} = \varvec{L}(\varvec{z}) \frac{\partial E}{\partial \varvec{z}} +\frac{\partial \Xi }{\partial \varvec{z}^*}. \end{aligned}$$
(20)

Solutions to Eq. (20) satisfy the conservation of energy and non-negative entropy production,

$$\begin{aligned} \dot{S} = \frac{\partial S}{\partial \varvec{z}} \frac{\partial \Xi }{\partial \frac{\partial S}{\partial \varvec{z} }} = K\Big (\mathcal X, \frac{\partial \Xi }{\partial \mathcal X}\Big )>0. \end{aligned}$$

Equation (20) reduces to the standard GENERIC equation if \(\mathcal X(\varvec{z}^*) = \varvec{z}^*\) and \(\Xi (\varvec{z}, \mathcal X) = \frac{1}{2} \varvec{z}^* \varvec{M} \varvec{z}^*\), with \(\varvec{M}\) the usual dissipation matrix, symmetric and positive semi-definite.
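This reduction is easy to verify numerically: for a quadratic potential the dissipative term \(\partial \Xi /\partial \varvec{z}^*\) must coincide with \(\varvec{M}\varvec{z}^*\). The sketch below (our own check, using a finite-difference gradient) confirms it for an arbitrary symmetric \(\varvec{M}\):

```python
import numpy as np

# Quadratic dissipation potential Xi = 0.5 * z*^T M z* should reproduce
# the standard GENERIC dissipative term M z* (Eq. (20) -> Eq. (15)).
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric, positive definite

def Xi(z_star):
    return 0.5 * z_star @ M @ z_star

def grad_fd(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

z_star = np.array([0.3, -0.7])
print(grad_fd(Xi, z_star))  # ~ M @ z_star, i.e. the usual dissipative bracket
print(M @ z_star)
```

Non-quadratic choices of \(\Xi \) satisfying properties 1-5 above yield more general, yet still thermodynamically admissible, dissipative terms.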

Therefore, the GENERIC description of complex solids is composed of

1. the state variables \(\varvec{z}\),
2. the kinematics, expressed by the Poisson bracket \(\{a,b\}\),
3. the dissipative forces \(\mathcal X\), and
4. three potentials, \(E(\varvec{z})\), \(S(\varvec{z})\) and \(\Xi (\varvec{z},\mathcal X)\).

To the best of the authors' knowledge, no learning strategy has yet been developed on top of this last theory, which is nevertheless worth exploring in order to extend the capabilities of metriplectic formalisms.

4.3 Strategies for Open Systems

The variational Onsager strategy arising from Eq. (9) presents one important advantage over strategies based upon metriplectic (GENERIC) formalisms: it is valid for externally driven systems. GENERIC, on the contrary, assumes by construction a closed system. Very recently, however, the metriplectic approach inherent in GENERIC has been extended to open systems in a port-Hamiltonian-like style [141]. This new formalism can be seen as a sort of port-metriplectic approach in which dissipative, open systems can be tackled within the metriplectic framework by simply adding ports to the formulation, through which energy is exchanged with the environment. This opens the possibility of learning systems of systems, i.e., systems composed of sub-systems, possibly of a different nature, while ensuring the fulfillment of the right thermodynamic structure of the problem. Under the umbrella of the port-metriplectic formalism, the evolution of the open system can be analyzed as

$$\begin{aligned} \dot{\varvec{z}}&= \lbrace \varvec{z},E\rbrace _{\text {bulk}} + [\varvec{z},S]_{\text {bulk}} \nonumber \\&= \lbrace \varvec{z},E\rbrace + [\varvec{z},S] - \lbrace \varvec{z},E\rbrace _{\text {boun}} - [\varvec{z},S]_{\text {boun}}, \end{aligned}$$
(21)

where \(\lbrace \cdot ,\cdot \rbrace \) is the Poisson bracket and \([\cdot ,\cdot ]\) is the dissipative bracket; these constitute, respectively, the conservative and dissipative terms of GENERIC. In fact, the reader can notice that both brackets are decomposed additively into bulk and boundary contributions. The degeneracy conditions, Eqs. (16) and (17), are satisfied by the bulk operators only.
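The essence of the port idea in Eq. (21) is that each sub-system's bulk energy may change, while whatever leaves through a boundary port enters the neighboring sub-system, so the total is conserved. A deliberately minimal numpy sketch of ours (a hypothetical toy, not the formalism of [141] in full) illustrates this bookkeeping with two scalar energy reservoirs:

```python
import numpy as np

# Two "energy reservoirs" exchanging power through a shared port; each bulk
# energy changes, but the boundary contributions cancel pairwise, so the
# total energy stays constant.
dt, steps = 0.01, 500
E_a, E_b = 1.0, 0.0
history = []
for _ in range(steps):
    power = 0.3 * (E_a - E_b)     # flux through the port, driven by imbalance
    E_a -= dt * power             # boundary term drains sub-system A...
    E_b += dt * power             # ...and feeds sub-system B
    history.append(E_a + E_b)
print(history[0], history[-1])    # total energy remains 1.0 throughout
```

In the port-metriplectic setting the same cancellation is carried by the boundary brackets of Eq. (21), with each sub-system keeping its own (bulk) GENERIC structure.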

5 Conclusions

We have explored the recent interest in developing learning strategies for physical phenomena that take into account inductive biases originating in non-equilibrium thermodynamic theory. The growing interest in this field is motivated by the interest of the mechanics community in the development of strategies that avoid, as much as possible, black-box approaches to the problem. For our community, but also for industrialists, credible methods are necessary. These days somewhat resemble the early days of finite elements. In those days, industrialists did not understand how finite elements worked. This motivated a huge research effort within the applied mathematics and engineering communities that soon dissipated any doubt. If we want these data-driven techniques to be adopted by industry in the near future, we should provide responses very much like the ones given for finite elements. This will require a joint effort from the applied mathematics and engineering communities, but also from the physics and thermodynamics communities, as we have tried to demonstrate in this paper.

In any case, these techniques have already demonstrated a substantial increase in robustness and credibility with respect to black-box approaches. It is also worth noting that we have already learnt that the more physics knowledge we add to the learning process as an inductive bias, the less data is necessary for the same level of accuracy and, for the same amount of data, the lower the approximation error will be.

In general, we strongly believe that these techniques, and those that will be developed in the future, will change the way we think of simulation, paving the way for instantaneous responses to parametric problems for designers and analysts.