Pseudo-Hamiltonian neural networks for learning partial differential equations

Pseudo-Hamiltonian neural networks (PHNN) were recently introduced for learning dynamical systems that can be modelled by ordinary differential equations. In this paper, we extend the method to partial differential equations. The resulting model comprises up to three neural networks, modelling terms representing conservation, dissipation and external forces, and discrete convolution operators that can either be learned or be given as input. We demonstrate numerically the superior performance of PHNN compared to a baseline model that models the full dynamics by a single neural network. Moreover, since the PHNN model consists of three parts with different physical interpretations, these can be studied separately to gain insight into the system, and the learned model is applicable also if external forces are removed or changed.


Introduction
The field called physics-informed machine learning combines the strengths of physics-based models and data-driven techniques to achieve a deeper understanding and improved predictive capabilities for complex physical systems [33,55]. The rapidly growing interest in this interdisciplinary approach is largely motivated by the increasing capabilities of computers to store and process large quantities of data, along with the decreasing costs of sensors and computers that capture and handle data from physical systems. Machine learning for differential equations can broadly be divided into two categories: the forward problem, which involves predicting future states from an initial state, and the inverse problem, which entails learning a system or parts of it from data. A wealth of recent literature exists on machine learning for the forward problem in the context of partial differential equations (PDEs). The proposed methods include neural-network-based substitutes for numerical solvers [19,20,53,48], but also methods that can aid the solution process, e.g. by optimizing the discretization to be used in a solver [2]. The focus of this paper is however on the inverse problem, and much of the foundation for our proposed model can be found in recent advances in learning neural network models for ordinary differential equations (ODEs). Greydanus et al. introduced Hamiltonian neural networks (HNN) in [25], for learning finite-dimensional Hamiltonian systems from data. They assume that the data q ∈ R^n, p ∈ R^n is obtained from a canonical Hamiltonian system
\[
\begin{pmatrix} \dot{q} \\ \dot{p} \end{pmatrix} = \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix} \begin{pmatrix} \partial H / \partial q \\ \partial H / \partial p \end{pmatrix},
\]
and aim to learn a neural network model H_θ of the Hamiltonian H : R^n × R^n → R. This approach has since been further explored and expanded in a number of directions, which include considering control input [59], dissipative systems [58,24], constrained systems [23,9], port-Hamiltonian systems [15,17,18], and metriplectic system formulations [34,28].
A similar approach considering a Lagrangian formulation instead of a Hamiltonian is presented in [12]. Modifications of the models that focus on improved training from sparse and noisy data have been proposed in [11,31,14,10].
In [22], we proposed pseudo-Hamiltonian neural networks (PHNN), which are models for learning a formulation that allows for damping of the Hamiltonian and external forces acting on the system. In addition, we allow for the Hamiltonian system to be non-canonical, and thus consider the formulation
\[
\dot{x} = (S(x) - R(x)) \nabla H(x) + f(x, t), \qquad (1)
\]
where S(x) = −S(x)^T, and y^T R(x) y ≥ 0 for all y. That is, S(x) ∈ R^{d×d} can be any skew-symmetric matrix and R(x) ∈ R^{d×d} can be any positive semi-definite matrix. Since we put no restrictions on the external forces, a pseudo-Hamiltonian formulation can in principle be obtained for any first-order ODE, which in turn can be obtained from any arbitrary-order ODE by a variable transformation.
To get an efficient model, certain restrictions should be put on it, so that we consider systems less general than (1). Then one can learn models that can be separated into the internal dynamics (S_θ(x) − R_θ(x))∇H_θ(x) and the external forces f_θ(x, t). This makes it possible to learn a model of the system as if under ideal conditions, even if data is sampled from a system affected by disturbances, provided one assumes that the undisturbed system is closed and thus given only by the internal dynamics. The motivating idea behind the present paper is to extend the framework of [22] to PDEs. In principle, one could always treat the spatially discretized PDE as a system of ODEs and apply the PHNN models of [22] to that. But that would disregard certain structures we know to be present in the discretized PDE, and would lead to inefficient models. Compared to the ODE case, we will in the PDE setting impose some restrictions on the form of the external forces, in that we will not allow them to depend on spatial derivatives of the solution. On the other hand, we will consider a more general form of the internal dynamics, where dissipation can result from a separate term and not just damping of the Hamiltonian. To our knowledge, the only prior work on extending HNN to PDEs is that of Matsubara et al. in [42]. In addition to Hamiltonian PDEs, exemplified by the Korteweg-de Vries equation, they also propose an extension to dissipative PDEs and demonstrate this for the Cahn-Hilliard equation. That paper has been a major inspiration for our work, but we believe our generalization to a wider class of PDEs, together with allowing for external forces, largely expands the utility of this learning approach.
The extension of PHNN to PDEs we propose here should naturally also be put in context with other recent advances in learning of PDEs from data. Long et al. introduced PDE-Net in [38], and together with [42] this may be the work that is most comparable to what we present here. Their model is similar to the baseline model we will compare PHNN to in this paper, albeit somewhat less general. Their approach has two components: learning neural network models for the nonlinear terms in the PDE and identifying convolution operators that correspond to appropriate finite difference approximations of the spatial derivative operators present. They do however make considerable simplifying assumptions in their numerical experiments, e.g. only considering linear terms and a forward Euler discretization in time. Other works that have received significant attention are those that have focused on identifying coefficients of terms of the PDE, both in the setting where one assumes that the terms are known and approximate them by neural network models [48] and in the setting where one also identifies the terms present from a search space of candidates using sparse regression [49,51,32]. There has also been considerable recent research on learning operators associated with the underlying PDEs, where two prominent methods are (Fourier) neural operators [1,36] and DeepONet [39]. These operators can e.g. map from a source term to solution states or from an initial state to future states; in the latter case, learning the operator equates to solving the forward problem of the PDE. The review paper [5] summarizes the literature on operator learning and system identification of PDEs, as well as recent developments on learning order reductions.
As will be demonstrated theoretically and experimentally in this paper, assuming a pseudo-Hamiltonian structure when solving the inverse problem for PDEs has both qualitative and quantitative advantages. The latter is shown by numerical comparisons to a baseline model on four test cases. The main qualitative feature of PHNN is that it is composed of up to six trainable models, which after training can each be studied for an increased understanding of the system we are modelling. Moreover, we can train on a system affected by external forces and remove these from the model after training, so that we obtain a model unaffected by these disturbances. The code for this paper is built on the code for PHNN for ODEs developed for [22], and we have updated the GitHub repository https://github.com/SINTEF/pseudo-hamiltonian-neural-networks and the Python package phlearn with this extension to the PDE case.
The rest of this paper is organised as follows. In the next section, we explore the theoretical foundations upon which our method is based. Then the pseudo-Hamiltonian formulation and the class of PDEs we will learn are presented and discussed in Section 3. Section 4 is the centerpiece of the paper, as it is here we present the PHNN method for PDEs. We then dedicate a substantial portion of the paper to presenting and evaluating numerical results for various PDEs, in Section 5. The penultimate section is devoted to analysis of the results and our model, and a discussion of open questions to address in future research. We summarize the main results and draw some conclusions in the last section.
2 Background: Derivatives, discretizations and neural networks

Before delving into the pseudo-Hamiltonian formulation and the model we propose based on this, we will review and discuss some requisites for making efficient neural network models of systems governed by PDEs.

Learning dynamical systems
Consider the general first-order PDE
\[
u_t = g(u^J, x, t),
\]
where by u^J we mean u itself and its partial derivatives up to an arbitrary order p with respect to the spatial variables x_1, ..., x_d. We seek to train a model ĝ_θ of g so that solving u_t = ĝ_θ(u^J, x, t) leads to accurate predictions of the future states of the system. The universal approximation theorem [30,13] states that g can be approximated arbitrarily well by a neural network. In practice, we have to assume an abundance of observations of u_t and u^J at t and x to actually find a neural network approximation of g. This brings us straight to one of the fundamental challenges of machine learning of differential equations: we would normally not expect to have data on the derivatives, neither temporal nor spatial. Thus we will have to depend on approximations obtained from discrete data. In this paper we will use sub- and superscripts to denote discrete solution points in space resp. time.
That is, u^j(x) = u(x, t_j) and u_i(t) = u(x_i, t), and we will suppress the arguments when they are not necessary. Let us consider the issue of time discretization first, an issue shared by ODEs and PDEs alike, and defer the second issue to the next subsection.
In several of the papers introducing the most prominent recent methods for learning finite-dimensional dynamical systems, e.g. the original HNN paper [25] and the first paper by Brunton et al. on system identification [5], the derivatives of the solution are assumed to be known or approximated by finite differences. Approximating the time derivative by the forward finite difference operator is equivalent to training using the forward Euler integrator, which is also what is done in the PDE-Net papers [38,37]. There have however been several recent papers proposing more efficient training methods that incorporate other numerical integration schemes, see e.g. [11,31,14]. We follow [22,43] and set up the training in such a way that we can use any mono-implicit integrator; that is, any integrator that relies explicitly on the solution at the times it integrates from and to. For the majority of the experiments in this paper, we use the implicit midpoint method, which is second-order, symplectic and symmetric. That is, we train the model ĝ_θ by identifying the parameters θ that minimize the loss function
\[
\mathcal{L}_\theta(u^j, u^{j+1}) = \left\| \frac{u^{j+1} - u^j}{\Delta t} - \hat{g}_\theta\Big(\frac{u^j + u^{j+1}}{2}, t_j + \frac{\Delta t}{2}\Big) \right\|^2,
\]
given here for one training point u^j and barring regularization for now. This yields a considerable improvement over the forward Euler method at next to no additional computational cost, since the model in both cases is evaluated at only one point at each iteration of training. The option to use other integrators, including symmetric methods of order four and six, is readily implemented in the phlearn package, and we do demonstrate the need and utility of a fourth-order integrator in Section 5.3. For a thorough study of integrators especially suited for training neural network models of dynamical systems, we refer the reader to [43,44].
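This training setup can be illustrated with the implicit midpoint rule. The following is our own sketch, not code from phlearn, and the helper name is hypothetical; it uses a linear test system, for which the exact implicit midpoint step makes the residual vanish:

```python
import numpy as np

def midpoint_residual(g, u0, u1, t, dt):
    """Residual of the implicit midpoint rule for u_t = g(u, t)."""
    return (u1 - u0) / dt - g(0.5 * (u0 + u1), t + 0.5 * dt)

# Toy check on a linear system u_t = A u: the exact midpoint step
# u1 = (I - dt/2 A)^{-1} (I + dt/2 A) u0 gives a zero residual.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
u0 = rng.standard_normal(4)
dt = 0.1
I = np.eye(4)
u1 = np.linalg.solve(I - 0.5 * dt * A, (I + 0.5 * dt * A) @ u0)
res = midpoint_residual(lambda u, t: A @ u, u0, u1, 0.0, dt)
print(np.max(np.abs(res)))  # zero up to round-off
```

In training, this residual is evaluated on observed pairs (u^j, u^{j+1}) and its squared norm is minimized with respect to the model parameters.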

Spatial derivatives and convolution operators
Moving from finite-dimensional systems to infinite-dimensional systems introduces the issue of spatial derivatives to neural network models. Thankfully, a proposed solution to this issue can be found in recent literature, as several works have noted the connection between finite difference schemes for differential equations and the convolutional neural network models originally developed for image analysis [8,16,50,2]. Given a function u and a kernel or filter w, a discrete convolution is defined by
\[
(w * u)_i = \sum_{j=-m}^{m} w_j u_{i+j}.
\]
Here * is called the convolution operator, and the kernel w is a tensor containing trainable weights: w = [w_{−m}, w_{−m+1}, ..., w_0, ..., w_{m−1}, w_m]. If the function u is periodic, so that u_0 = u_M, we obtain a circular convolution, which can be expressed by a circulant matrix applied to the vector u = [u_0, ..., u_{M−1}]^T. A convolutional layer in a neural network can be represented as
\[
y_k(u_i) = \phi\Big( \sum_{j=-m}^{m} w_{kj} u_{i+j} + b_k \Big),
\]
where y_k(u_i) is the output of the k-th feature map at point u_i, w_{kj} are the weights of the kernel w_k, b_k is the bias term, and φ(·) is an activation function. The width of the layer is 2m + 1, and this is usually referred to as the size of the convolution kernel, or filter. Training the convolutional layer of a neural network consists of optimizing the weights and biases, which we collectively denoted by θ in the previous subsection.
Similarly, a finite difference approximation of the n-th order derivative of u at a point x_i can also be expressed as applying a discrete convolution:
\[
\frac{\partial^n u}{\partial x^n}(x_i) \approx \sum_{j=-m}^{m} a_j u_{i+j},
\]
where the finite difference weights a_j depend on the spatial grid. If we assume the spatial points to be equidistributed and let h := x_{i+1} − x_i, we have e.g.
\[
u_x(x_i) \approx \frac{u_{i+1} - u_{i-1}}{2h}, \qquad u_{xx}(x_i) \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2}.
\]
Since we require the kernel size to be odd, a kernel size of 3, or m = 1, is sufficient and necessary to obtain second order approximations of first and second derivatives, while m ≥ 2 is needed to approximate the third derivative. As noted by [50], higher order derivatives can be approximated either by increasing the kernel size or applying multiple convolution operations. In our models, we have designed neural networks where only the first layer is convolutional, and thus the kernel size restricts the order of the derivative we can expect to learn, while it also restricts the order of the approximations of these derivatives.
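For instance, the second-order stencil for u_xx is exactly a size-3 convolution (m = 1) with kernel [1, −2, 1]/h². The following small NumPy sketch, our own and not from the paper's code, applies it with circular padding to a periodic test function:

```python
import numpy as np

# Second-derivative stencil as a size-3 convolution with periodic padding.
M, P = 100, 2 * np.pi
h = P / M
x = np.linspace(0.0, P, M, endpoint=False)
u = np.sin(x)

kernel = np.array([1.0, -2.0, 1.0]) / h**2       # [w_-1, w_0, w_1]
u_pad = np.r_[u[-1], u, u[0]]                    # circular padding, m = 1
u_xx = np.convolve(u_pad, kernel, mode='valid')  # (w * u)_i, length M

# For u = sin(x) the exact second derivative is -sin(x);
# the stencil reproduces it to second order in h.
err = np.max(np.abs(u_xx + np.sin(x)))
```

Here the symmetric kernel makes NumPy's flipped-kernel convention irrelevant; for an antisymmetric stencil such as central differencing of u_x, the kernel would need to be reversed.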

Variational derivative
Given the function H depending on u, x and the first derivative u_x, let \mathcal{H} be the integral of H over the spatial domain:
\[
\mathcal{H}[u] = \int_\Omega H(x, u, u_x) \, dx. \qquad (5)
\]
The variational derivative, or functional derivative, δ\mathcal{H}/δu [u] of \mathcal{H} is defined by the property
\[
\frac{d}{d\epsilon} \mathcal{H}[u + \epsilon v] \Big|_{\epsilon = 0} = \Big\langle \frac{\delta \mathcal{H}}{\delta u}[u], v \Big\rangle \quad \text{for all } v. \qquad (6)
\]
When H as here only depends on first derivatives, the variational derivative can be calculated explicitly by the relation
\[
\frac{\delta \mathcal{H}}{\delta u} = \frac{\partial H}{\partial u} - \frac{d}{dx} \frac{\partial H}{\partial u_x},
\]
assuming enough regularity in H.
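This Euler-Lagrange-type formula can be checked symbolically. The sketch below, our own and using SymPy's euler_equations, applies it to the example density H = u_x²/2, for which the variational derivative is −u_xx:

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
u = sp.Function('u')(x)

# Density H(x, u, u_x) = u_x^2 / 2; expected variational derivative: -u_xx
H = u.diff(x)**2 / 2
eqs = euler_equations(H, [u], [x])  # [Eq(dH/du - d/dx(dH/du_x), 0)]
print(eqs[0].lhs)
```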

Pseudo-Hamiltonian formulation of PDEs
In this paper we consider the class of PDEs that can be written on the form
\[
u_t = S(u^J, x) \frac{\delta \mathcal{H}}{\delta u}[u] - R(u^J, x) \frac{\delta \mathcal{V}}{\delta u}[u] + f(u^J, x, t), \qquad (7)
\]
where S(u^J, x) and R(u^J, x) are operators that are skew-symmetric resp. positive semidefinite with respect to the L² inner product, \mathcal{H} and \mathcal{V} are integrals of the form (5), and f is a function that may depend on u^J, x and t. The naming of this class is a challenge, since similar but not identical classes have been called by a myriad of names in the literature. Ignoring the term f, the class could be referred to as metriplectic PDEs, where the name is a portmanteau of metric and symplectic [27,4]. The formulation is also similar to an infinite-dimensional variant of the General Equation for Non-Equilibrium Reversible-Irreversible Coupling (GENERIC) formalism from thermodynamics [26,45], except for f and the fact that R(u^J, x) is positive instead of negative semidefinite. To be consistent with our previous work [22] and to make clear the connection to the vast recent literature on Hamiltonian neural networks, we say that (7) is the class of pseudo-Hamiltonian PDEs. This marks a generalization of the definition used in [22], in addition to the extension to infinite-dimensional systems, in that we here allow for \mathcal{H} and \mathcal{V} to be two different integrals. In the case \mathcal{V} = 0 and f(u^J, x, t) = 0, we have the class of (non-canonical) Hamiltonian PDEs, or integral-preserving PDEs [35]. That is, given the appropriate boundary conditions, e.g. periodic, the PDE will preserve the integral \mathcal{H}, usually labeled a Hamiltonian of the system. This follows from the skew-symmetry of S:
\[
\frac{d}{dt} \mathcal{H} = \Big\langle \frac{\delta \mathcal{H}}{\delta u}, u_t \Big\rangle = \Big\langle \frac{\delta \mathcal{H}}{\delta u}, S \frac{\delta \mathcal{H}}{\delta u} \Big\rangle = 0.
\]
Similarly, if \mathcal{H} = 0 and f(u^J, x, t) = 0 but \mathcal{V} ≥ 0, the PDE (7) will dissipate the integral \mathcal{V}, and \mathcal{V} may be called a Lyapunov function.
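In a discretized setting, the same conservation argument reduces to the fact that z^T S z = 0 for any skew-symmetric matrix S. A quick numerical sanity check of this, our own and not from the paper:

```python
import numpy as np

# dH/dt = <dH/du, S dH/du> vanishes for any skew-symmetric S.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
S = B - B.T                       # skew-symmetric: S = -S^T
grad_H = rng.standard_normal(6)   # stand-in for a discretized delta H/delta u
dH_dt = grad_H @ (S @ grad_H)
print(dH_dt)  # zero up to round-off
```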

Example 1. Consider the KdV-Burgers (or viscous KdV) equation [54]
\[
u_t + \eta u u_x - \nu u_{xx} + \gamma u_{xxx} = 0. \qquad (8)
\]
This can be written on the form (14) with A and R both being the identity operator I, S = ∂/∂x and f(u, x, t) = 0, and
\[
\mathcal{H}[u] = \int_\Omega \Big( -\frac{\eta}{6} u^3 + \frac{\gamma}{2} u_x^2 \Big) \, dx \qquad (9)
\]
and
\[
\mathcal{V}[u] = \int_\Omega \frac{\nu}{2} u_x^2 \, dx. \qquad (10)
\]
We see this connection by deriving the variational derivatives
\[
\frac{\delta \mathcal{H}}{\delta u} = -\frac{\eta}{2} u^2 - \gamma u_{xx} \qquad (11)
\]
and
\[
\frac{\delta \mathcal{V}}{\delta u} = -\nu u_{xx}. \qquad (12)
\]
Inserting (11) and (12) into u_t = S δ\mathcal{H}/δu − R δ\mathcal{V}/δu, we recover (8).
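Example 1 can be verified symbolically. In the sketch below, which is our own, the densities for \mathcal{H} and \mathcal{V} are our reading of the KdV-Burgers decomposition rather than code from the paper; applying the variational-derivative formula to them reproduces the right-hand side of the equation:

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
eta, nu, gamma = sp.symbols('eta nu gamma', positive=True)
u = sp.Function('u')(x)
ux = u.diff(x)

# Candidate densities for the KdV-Burgers integrals (our reconstruction)
H = -eta / 6 * u**3 + gamma / 2 * ux**2
V = nu / 2 * ux**2

dH = euler_equations(H, [u], [x])[0].lhs   # delta H / delta u
dV = euler_equations(V, [u], [x])[0].lhs   # delta V / delta u

# u_t = d/dx(dH) - dV, with S = d/dx and R = I
u_t = sp.diff(dH, x) - dV
print(sp.simplify(u_t))
```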

Spatial discretization
In this section and this section only, we will use boldface notation for vectors, to distinguish continuous functions and parameters from their spatial discretizations. Assume that values of u are obtained at grid points x = (x_0, ..., x_M). Following [21], we interpret these as quadrature points with non-zero quadrature weights κ_0, ..., κ_M, and approximate the L² inner product by a weighted discrete inner product:
\[
\langle u, v \rangle_\kappa := \sum_{i=0}^{M} \kappa_i u_i v_i.
\]
Let p denote the discretization parameters that consist of x and the associated κ_i. Then, assuming that there exists a consistent approximation H_p(u) to \mathcal{H}[u] that depends on u evaluated at x, we define the discretized variational derivative by the analogue to (6),
\[
\frac{d}{d\epsilon} H_p(u + \epsilon v) \Big|_{\epsilon = 0} = \Big\langle \frac{\delta H_p}{\delta u}(u), v \Big\rangle_\kappa \quad \text{for all } v.
\]
Thus, as shown in [21], we have a relationship between the discretized variational derivative and the gradient:
\[
\frac{\delta H_p}{\delta u}(u) = \mathrm{diag}(\kappa)^{-1} \nabla H_p(u).
\]
Furthermore, we approximate S(x, u^J) and R(x, u^J) by matrices S_d(u) and R_d(u) that are skew-symmetric resp. positive semidefinite with respect to ⟨·, ·⟩_κ. Then a spatial discretization of (7) is given by
\[
u_t = S_d(u) \frac{\delta H_p}{\delta u}(u) - R_d(u) \frac{\delta V_p}{\delta u}(u) + f(u, x, t),
\]
which may equivalently be written as
\[
u_t = S_p(u) \nabla H_p(u) - R_p(u) \nabla V_p(u) + f(u, x, t), \qquad (13)
\]
where S_p(u) := S_d(u) \mathrm{diag}(\kappa)^{-1} and R_p(u) := R_d(u) \mathrm{diag}(\kappa)^{-1} are skew-symmetric resp. positive semidefinite by the standard definitions for matrices. Thus, upon discretizing in space, we obtain a system of ODEs (13) that is on a form quite similar to the generalized pseudo-Hamiltonian formulation considered in [22]. In fact, if \mathcal{V} = \mathcal{H}, we obtain the system
\[
u_t = (S_p(u) - R_p(u)) \nabla H_p(u) + f(u, x, t).
\]
Still, we do not recommend applying the PHNN method of [22] on this directly without taking into consideration what we know about H_p; specifically, that it is a discrete approximation of the integral (5), and can thus be expected to be given by a sum of M terms that each depend in the same way on u_i and the neighbouring points u_{i−1} and u_{i+1}.
Example 2. Consider again the KdV-Burgers equation (8), on the domain Ω = [0, P] with periodic boundary conditions u(0, t) = u(P, t). We assume that the M + 1 grid points are equidistributed and define h := x_{i+1} − x_i = P/M. We approximate the integrals (9) and (10) by
\[
H_p(u) = h \sum_{i=0}^{M-1} \Big( -\frac{\eta}{6} u_i^3 + \frac{\gamma}{2} (\delta_f u_i)^2 \Big), \qquad V_p(u) = h \sum_{i=0}^{M-1} \frac{\nu}{2} (\delta_f u_i)^2,
\]
where the operator δ_f denotes the forward difference, i.e. δ_f u_i = (u_{i+1} − u_i)/h. Furthermore, we approximate ∂_x by the matrix S_d corresponding to the central difference approximation δ_c defined by
\[
\delta_c u_i = \frac{u_{i+1} - u_{i-1}}{2h},
\]
where the first and last rows are adjusted according to the periodic boundary conditions. To obtain (13), we have that S_p = (1/h) S_d and R_p = (1/h) I, with I being the identity matrix, and take the gradients of the approximated integrals to find
\[
\frac{1}{h} \nabla H_p(u) = -\frac{\eta}{2} u^2 - \gamma \delta_c^2 u, \qquad \frac{1}{h} \nabla V_p(u) = -\nu \delta_c^2 u,
\]
where u² and u³ denote the element-wise square and cube of u, and δ_c² := δ_f δ_b denotes the second-order difference operator approximating the second derivative by δ_c² u_i = (u_{i+1} − 2u_i + u_{i−1})/h². These are consistent discrete approximations of (11) and (12).
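The periodic central-difference matrix of Example 2 can be written out explicitly. The following sketch, our own, builds it and checks that it is skew-symmetric and second-order accurate on a smooth periodic function:

```python
import numpy as np

M, P = 100, 20.0
h = P / M

# Periodic central-difference matrix approximating d/dx,
# with wrap-around entries in the first and last rows.
S_d = np.zeros((M, M))
for i in range(M):
    S_d[i, (i + 1) % M] = 1.0 / (2 * h)
    S_d[i, (i - 1) % M] = -1.0 / (2 * h)

x = h * np.arange(M)
u = np.sin(2 * np.pi * x / P)
du = S_d @ u

skew = np.allclose(S_d, -S_d.T)  # skew-symmetry, as required of S_d
err = np.max(np.abs(du - 2 * np.pi / P * np.cos(2 * np.pi * x / P)))
```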

Restricting the class by imposing assumptions
Without imposing any further restrictions, the formulation (7) can be applied to any first-order PDE, and it will not be unique for any system. Any contribution from the first two terms on the right-hand side could also be expressed in f, and even if we restrict this term, the operators S and R are generally not uniquely defined for a corresponding integral. In the remainder of this paper, we will consider the cases where the operators S and R are linear and independent of x and u, we assume R to be symmetric, and we will not let f depend on derivatives of u. Furthermore, we apply the symmetric positive semidefinite operator A to the equation and require that this commutes with R and S. We redefine S := AS, R := AR and f(u, x, t) := Af(u, x, t), and thus get
\[
A u_t = S \frac{\delta \mathcal{H}}{\delta u}[u] - R \frac{\delta \mathcal{V}}{\delta u}[u] + f(u, x, t), \qquad (14)
\]
where the new S is still skew-symmetric and the new R is still symmetric and positive semidefinite. In the following we will denote the identity operator by I and the zero operator by 0, so that Iv = v and 0v = 0 for any v ∈ L². We note that the zero operator is positive semidefinite, symmetric and skew-symmetric, while the identity operator is symmetric and positive definite, but not skew-symmetric.

The PHNN model for PDEs
Since we assume that the operators A, S and R are independent of x and u, the discretizations of these operators will necessarily result in circulant matrices, given that u is periodic. That is, they can be viewed as discrete convolution operators. We thus set Â_θ^{[k_1]}, Ŝ_θ^{[k_2]} and R̂_θ^{[k_3]} to be trainable convolution operators, where k_1, k_2 and k_3 denote the kernel sizes, and we impose symmetry on Â_θ. Furthermore, we let Ĥ_θ and V̂_θ be two separate neural networks that take input vectors of length M, the number of spatial discretization points, and output a scalar. The neural network f̂_θ can take input vectors of u, x and t, and outputs a vector of length M.
The full pseudo-Hamiltonian neural network model for PDEs is then given by
\[
\hat{A}_\theta^{[k_1]} \hat{g}_\theta(u, x, t) = \hat{S}_\theta^{[k_2]} \nabla \hat{H}_\theta(u) - \hat{R}_\theta^{[k_3]} \nabla \hat{V}_\theta(u) + k_4 \hat{f}_\theta(u, x, t), \qquad (15)
\]
where we have also introduced k_4, which should be 1 or 0 depending on whether or not we want to learn a force term. Given a set of N training points {(u^{j_n}, u^{j_n+1}, t_{j_n})}_{n=1}^{N} varying across time and different stochastic realizations of initial conditions, we let the loss function be defined as
\[
\mathcal{L}_\theta = \frac{1}{N} \sum_{n=1}^{N} \left\| \frac{u^{j_n+1} - u^{j_n}}{\Delta t} - \hat{g}_\theta\Big( \frac{u^{j_n} + u^{j_n+1}}{2}, t_{j_n} + \frac{\Delta t}{2} \Big) \right\|^2
\]
if the implicit midpoint integrator is used. For the experiments in Section 5.3 we use the fourth-order symmetric integrator introduced in [22], and the loss function is amended accordingly.
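To make the structure of such a model concrete, here is a minimal sketch of our own, with hypothetical names that do not come from phlearn. Matrices stand in for the learned convolution operators (with A = I for simplicity), and an implicit midpoint step is shown to conserve H(u) = ½‖u‖² when dissipation and external forces are absent:

```python
import numpy as np

def phnn_rhs(u, grad_H, grad_V, f_ext, S, R):
    """Sketch of a pseudo-Hamiltonian right-hand side (A = I for simplicity).
    S and R are matrices standing in for the learned convolution operators."""
    return S @ grad_H(u) - R @ grad_V(u) + f_ext(u)

rng = np.random.default_rng(2)
M = 10
B = rng.standard_normal((M, M))
S = B - B.T                        # skew-symmetric "conservation" operator
R = np.zeros((M, M))               # no dissipation
zero = lambda u: np.zeros_like(u)  # no external force
grad_H = lambda u: u               # H(u) = 0.5 * ||u||^2

u0 = rng.standard_normal(M)
rhs = phnn_rhs(u0, grad_H, grad_H, zero, S, R)  # reduces to S @ u0 here

# One implicit midpoint step for the linear system u_t = S u, solved exactly:
dt = 0.01
I = np.eye(M)
u1 = np.linalg.solve(I - 0.5 * dt * S, (I + 0.5 * dt * S) @ u0)
print(abs(u0 @ u0 - u1 @ u1))  # H conserved up to round-off
```

The conservation follows because the midpoint map (I − dt/2 S)^{-1}(I + dt/2 S) is orthogonal whenever S is skew-symmetric.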

Implementation
PHNN comprises up to six trainable models; the framework is very flexible, and assumptions may be imposed so that one or more of the parts do not have to be learned. Moreover, careful considerations should be made on how to best model the different parts. In the following, we explain how we have set up the models in our code.

Modelling H and V
The networks Ĥ_θ and V̂_θ consist of one convolutional layer with kernel size three, followed by linear layers corresponding to convolutional layers with kernel size one, where the last layer performs a summation of the M inputs to one scalar. Following the first two layers we apply the tanh activation function. To impose the periodic boundary conditions, we pad the input by adding u(−h) = u(P − h) and u(P + h) = u(h) before the convolutional layer, as was also suggested in [42]. If we wanted to be able to learn derivatives of order three or four in the integrals, or if we wanted to learn third- or fourth-order approximations of the derivatives, we would need the kernel size of the convolutional layer to be five, and to pad the input by two elements on each side instead of one. This adjustment can easily be made in our code.
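The forward pass of such a network might look as follows. This is our own minimal NumPy sketch with hypothetical names and weight shapes, not the phlearn implementation:

```python
import numpy as np

def H_theta(u, w_conv, W1, b1, W2, b2):
    """Sketch of the H network: one kernel-size-3 convolution with periodic
    padding, tanh activations, pointwise (kernel-size-1) layers, then a sum."""
    u_pad = np.r_[u[-1], u, u[0]]  # periodic padding, one element each side
    # Convolution layer producing C feature maps of length M
    # (kernels reversed so each row computes sum_j w_j u_{i+j}):
    z = np.stack([np.convolve(u_pad, w[::-1], mode='valid') for w in w_conv])
    z = np.tanh(z)
    z = np.tanh(W1 @ z + b1[:, None])  # kernel-size-1 "linear" layer
    z = W2 @ z + b2[:, None]           # final pointwise layer
    return z.sum()                     # summation of the M outputs to a scalar

rng = np.random.default_rng(0)
M, C = 50, 8
u = rng.standard_normal(M)
w_conv = rng.standard_normal((C, 3))
W1, b1 = rng.standard_normal((C, C)), rng.standard_normal(C)
W2, b2 = rng.standard_normal((1, C)), rng.standard_normal(1)
H = H_theta(u, w_conv, W1, b1, W2, b2)
```

In the actual model, the gradient of this scalar with respect to the input vector u supplies the discretized variational derivative, via automatic differentiation.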
For the examples in this paper, we would not gain anything by increasing the kernel size, because we only have up to first derivatives in the integrals and because the training data is generated using second order spatial discretizations.

Modelling A, S, R and f
In (15), k = (k_1, k_2, k_3, k_4) are hyperparameters that determine the expressiveness of the model. Setting k_1 = k_2 = k_3 = M and k_4 = 1 means that we can approximate the general system (14), while setting k_1 = k_3 = 1, k_2 = 3 and k_4 = 0 would be sufficient to learn a model for the discretized KdV-Burgers system (8). In fact, if we set k_2 = 3, the skew-symmetric operator Ŝ_θ^{[k_2]} is uniquely defined up to a constant, since we require w_0 = 0 and w_{−1} = −w_1. Moreover, since this constant would only amount to a scaling between the operator and the discrete variational derivative it is applied to, Ŝ_θ^{[k_2]} does not have to be trained in this case; determining w_1 would just lead to a scaling of the second-order approximation of the first derivative in space that could be compensated for by a scaling of H. Similarly, if the kernel size of A or R is 3, we could set w_0 = 1 and learn a single parameter w_1 = w_{−1} for each of these when training the model. This corresponds to learning a linear combination of the identity and the second-order approximation of the second derivative in space.
We model f by the neural network f̂_θ, which may take any of the variables u, x and t as input. This network has three linear layers, i.e. convolutional layers with kernel size one, with the tanh activation function after the first two. If f̂_θ depends on x, periodicity on the domain [0, P] is imposed in a similar fashion as suggested in [56,41,40] for hard-constraining periodic boundary conditions in physics-informed neural networks and DeepONet. That is, we replace the input x by the first two Fourier basis functions, sin(2πx/P) and cos(2πx/P), which is sufficient for expressing any x-dependent periodic function.
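The periodic embedding of x can be sketched as follows; the helper name is our own:

```python
import numpy as np

def periodic_embedding(x, P):
    """Replace the raw coordinate x by the first two Fourier basis functions,
    so that any network taking these as inputs is automatically P-periodic."""
    return np.stack([np.sin(2 * np.pi * x / P), np.cos(2 * np.pi * x / P)])

P = 20.0
x = np.linspace(0.0, P, 11)
emb = periodic_embedding(x, P)
# The embeddings of x = 0 and x = P coincide, enforcing periodicity.
```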
In the numerical experiments of the next section, we do not consider systems where A and R are anything other than linear combinations of the identity and the second derivative in space, or where S is anything other than the first derivative in space. Thus we set k = [3, 3, 3, 1] in our most general model, which is expressive enough to learn all the systems we consider. We also consider what we call informed PHNN models, where we assume prior knowledge of the operators, which affects k, and also of which variables f depends on.

Leakage of a constant
There may be leakage of a constant between the two last terms of the PHNN model: the dissipation term and the external force term. Hence we must make some assumptions about these terms to separate them as desired. If we want the external force term to be small, we may use regularization and penalize large values of f̂_θ during training. The option to do this is implemented in the phlearn package. However, for the numerical experiments in the next section, we have instead opted to assume that the dissipative term should be zero for the zero solution, and thus correct the two terms in question after training is done so that the model adheres to this without changing the full model. That is, if V̂_θ^pre and f̂_θ^pre denote the trained networks when the last training step is performed, we shift the dissipation term by its value at the zero solution and add the same quantity to the force term, to obtain our final model (15) with the sum of the terms, and thus the full model, unchanged. Then we may remove the dissipation or external forces from the model simply by setting k_3 = 0 or k_4 = 0. Note, however, that this correction may not work as expected if the zero solution is far outside the domain of the training data, as the neural network V̂_θ^pre, like most neural networks, generally extrapolates poorly. In that case, regularization would be preferred.

Algorithms
We refer to Algorithm 1 and Algorithm 2 for the training of the PHNN and baseline models, respectively.
• Baseline: A model consisting of one neural network that takes u, x and t as input, where the output is of the same dimension as u. The network consists of two parts: first a five-layer deep neural network with a hidden dimension of 20 and the tanh activation function, then a convolutional layer with kernel size five and activation function tanh, followed by two additional layers with hidden dimension 100 and tanh activation function. The first five layers are meant to approximate any non-linear function (say u ↦ ½u² in the case of Burgers' equation), while the convolutional layer is supposed to represent a finite-difference approximation of the spatial derivatives. The baseline model needs a kernel size of five to approximate the third and fourth derivatives present in the first resp. last example we consider.

The KdV-Burgers equation
Consider again the KdV-Burgers equation from Examples 1 and 2, but with external forces:
\[
u_t + \eta u u_x - \nu u_{xx} + \gamma u_{xxx} = f(x, t). \qquad (18)
\]
We consider the domain [0, P] and assume periodic solutions u(0, t) = u(P, t), with P = 20. We let η = 6, ν = 0.3, γ = 1 and f(x, t) = sin(2πx/P). We generate training data from initial conditions on a two-soliton form (19), where c_1, c_2 and d_1, d_2 are randomly drawn from the uniform distributions U(1/2, 2) and U(0, 1) respectively. That is, the initial states are two waves of height 2c_1² and 2c_2², centered at d_1 P and d_2 P, with periodicity imposed.
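Such two-soliton initial states can be sketched as follows. Since the exact expression (19) is not reproduced here, this is a hypothetical reading of our own, in which each wave is a sech² profile of height 2c², wrapped periodically:

```python
import numpy as np

def two_soliton_ic(x, P, c1, c2, d1, d2):
    """Hypothetical two-soliton initial state: waves of height 2*c^2 centered
    at d*P, with periodicity imposed by wrapping distances onto [-P/2, P/2)."""
    def wave(c, d):
        s = (x - d * P + P / 2) % P - P / 2  # periodic distance to the center
        return 2 * c**2 / np.cosh(c * s)**2
    return wave(c1, d1) + wave(c2, d2)

P, M = 20.0, 200
x = np.linspace(0.0, P, M, endpoint=False)
u0 = two_soliton_ic(x, P, 1.0, 0.6, 0.25, 0.75)  # peaks of heights 2 and 0.72
```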
We compare the general and informed PHNN models to the baseline model on (18). Three models were trained of each type for 20 000 epochs using training sets consisting of 410 states, obtained from integrating 10 random initial states on the form (19) and evaluating the solution at every time step ∆t = 0.05 until t = 2. At every epoch, the models were validated by integrating them to time t = 1 starting at three initial states and calculating the average mean squared error (MSE) from these. The model with the lowest validation score after the last epoch was saved as the final model. Figure 1 shows the solutions obtained by integrating the different models of the system. The models shown here are the best of three trained models of each type, as evaluated by the average MSE at t = 1 from 10 random initial conditions (19). From time t = 2 all models start deviating from the ground truth, because the model of the force term is not able to predict how this behaves beyond the end time t = 2 in the training data. Figure 2 gives a demonstration of one of the main qualitative features of the PHNN models: we can remove the force and dissipation from the model and still get an accurate solution of the system without these. We also extract the external force part from the baseline model by evaluating that model at u = 0. This works here, since the external force is independent of the solution states and the integrals are zero when u = 0, but it would not be an option in general. Moreover, the same trick applied to the general PHNN model would give a better model of the external force than the one displayed in Figure 2; this is due to the advantage of knowing that f is not dependent on the state. Furthermore, we note that there is no way to separate the conservation and dissipation terms of the baseline model. The general PHNN model is more sensitive to variations in the initialization of the learnable parameters of the model, an issue we discuss further in Section 6.1.
Hence we get a large average MSE from these models, as evident in Table 1 and Figure 3, although the best models perform quite well, as seen in Figure 2. We note also that it might be slightly ill-advised to judge the performance of models of wave equations purely by the MSE, since this will often be dominated by the waves being out of phase, and will not capture how well the shape of the waves is kept.

The forced BBM equation
The Benjamin-Bona-Mahony (BBM) equation was introduced as an improvement on the KdV equation for modelling waves on a shallow surface [46,3]. We consider this equation with a time- and state-dependent source term:
\[
u_t - u_{xxt} + u_x + u u_x = f(u, t), \qquad (22)
\]
which can be written on the form (14) with A = 1 − ∂²/∂x², S = ∂/∂x and R = 0. This requires
\[
\mathcal{H}[u] = -\int_\Omega \Big( \frac{u^2}{2} + \frac{u^3}{6} \Big) \, dx. \qquad (23)
\]
As for the KdV-Burgers equation, we train the model on a forced system starting with a two-soliton initial condition. In this case, the initial states are two waves on the form (19), of amplitude 3(c_1 − 1) and 3(c_2 − 1) centered at d_1 P and d_2 P, where c_1, c_2 and d_1, d_2 are randomly drawn from the uniform distributions U(1, 4) and U(0, 1) respectively, and with periodicity imposed on [0, P]. We set P = 50 in the numerical experiments. Furthermore, we let
\[
f(u, t) = \frac{1}{10} \sin(t) \, u. \qquad (24)
\]
In addition to the three models described in the introduction of this section, we also test a model that is identical to the most general PHNN model except that it does not include a dissipation term. We do this because it is not clearly defined whether or how (24) should be separated into a term that is constantly dissipative and one that is not. The most general model does learn that the system has a non-zero dissipative term; however, this term added to the learned force term is close to the ground truth force term. This is due to a leakage of a term αu, for some random constant α, between the terms, similar to the constant leakage described in Section 4.1.3, so that we learn an approximated integral V̂_θ with discrete variational derivative αu. We could have countered this by penalizing the mean absolute value of the dissipation term, but we decided against doing that and instead opted to also learn a model without the dissipation term.
For every type of model, we trained 10 distinct models with random initializations for a total of 50 000 epochs, using data sets comprising 260 states obtained from integrating 10 randomly drawn initial states with time step ∆t = 0.4 from time t = 0 to time t = 10. A validation score was computed at each epoch by integrating the models to time t = 1 from three initial states and calculating the mean MSE, and the model with the lowest validation score was kept. After training was done, we integrated the models from 10 new arbitrary initial conditions to determine the average MSE at t = 2. From the errors reported in Table 2, we see that the informed PHNN model performs best, while the baseline model has a lower average MSE than the more general PHNN models, which are highly sensitive to how the neural networks in the model are initialized. We note however that the best PHNN models all greatly outperform the best baseline model, as seen in Figure 5. Thus it would be advisable to train several PHNN models with different initializations of the neural networks and disregard those models that behave vastly differently from the others. Moreover, the PHNN models are more stable and behave better with increasing time compared to the baseline model. Beyond t = 10, the accuracy of all models quickly deteriorates. This is due to the poor extrapolation abilities of neural networks; the models are not able to learn how the time-dependent f behaves beyond the temporal domain t ∈ [0, 10] of the training data. Figure 6 shows the external forces learned by the PHNNs, and how well they predict the system with these forces removed. For the general PHNN model, we have in this case also removed the dissipation term.

Figure 3: Similar plots as in Figure 2, except the results are obtained from the average of three models. The line and the shaded area are the mean resp. standard deviation of the three models of each type.

The Perona-Malik equation
In addition to modelling physical systems, PDEs can be used for image restoration and denoising. For instance, if the heat equation is applied to a greyscale digital image, where the state u gives the intensity of each pixel, it will smooth out the image with increasing time. The Perona-Malik equation for so-called anisotropic diffusion is designed to smooth out noise but not the edges of an image [47]. Several variations of the equation exist. We consider the one-dimensional case with a space-dependent force term, given by (25). This is a PDE of the type (14). Note that the equation can also be written on a form involving a function φ[u], but this φ[u] is not the variational derivative of any integral. We consider (25) on the domain [0, P] with P = 6, with the force term fixed for the following experiments. The initial conditions (27) are given by random parameters a ∈ U(−5, 5), b ∈ U(20, 40), c ∈ U(0.05, 0.15), d_l ∈ U(0.3, 3), h_l ∈ U(0.5, 1.5), r ∈ U(0.5, 3) and s ∈ U(10, 20).

Figure 5: Predictions of the forced BBM system (22) obtained from the best of 10 models of each model type, as evaluated by the mean MSE at t = 2 on predictions from 10 random initial states.
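The anisotropic-diffusion behaviour of the Perona-Malik equation, smoothing noise while limiting diffusion across steep gradients, can be sketched numerically. The diffusivity g(s) = 1/(1 + (s/κ)²) used below is a common choice from the original paper [47]; the exact variant and space-dependent force term of (25) are not reproduced, so this is a sketch rather than the paper's model.

```python
import numpy as np

def perona_malik_rhs(u, dx, kappa=3.0):
    """Right-hand side of a standard 1D Perona-Malik equation,
    u_t = d/dx( g(u_x) u_x ) with g(s) = 1 / (1 + (s/kappa)^2),
    on a periodic grid with staggered (half-point) gradients.
    A common variant, not the paper's exact form (25)."""
    ux = (np.roll(u, -1) - u) / dx          # gradients at half-points
    flux = ux / (1 + (ux / kappa) ** 2)     # g(u_x) * u_x
    return (flux - np.roll(flux, 1)) / dx   # divergence back at nodes

# Smoothing a noisy bump with forward Euler steps
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 128, endpoint=False)
u = np.exp(-(x - 3) ** 2) + 0.02 * rng.standard_normal(x.size)
u0 = u.copy()
dx, dt = x[1] - x[0], 1e-4
for _ in range(100):
    u = u + dt * perona_malik_rhs(u, dx)
# the total variation drops as the noise is diffused away
```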
Five models of each type are trained for 1000 epochs, on 10 pairs of data at times t = 0 and t = 0.02. This corresponds to the original noisy image and an image where the noise is almost completely removed, as judged by visual inspection. Because the step between these states is quite large, a high-order integrator is required to get an accurate approximation of the time derivative. Indeed, models trained with the second-order implicit midpoint method fail to remove the noise as fast or as accurately as the ground truth (25). Thus we instead use the fourth-order symmetric Runge-Kutta method (SRK4) introduced in [22]. This requires roughly four times the computational cost per epoch compared to the midpoint method, but gives considerably improved performance, as demonstrated in Figure 7. Table 3 and Figure 8 report the result of applying the learned models to an original noisy state (27) with a = 1, b = 30, c = 0.15, d_1 = 1, d_2 = 2, h_1 = h_2 = 1, r = 2 and s = 15. We observe that the general PHNN model is remarkably good at removing the noise, even if at a slightly slower speed than the true model or the informed PHNN. The baseline model does not remove the noise satisfactorily by the end time t = 0.02. The external force is learned quite accurately by all models.
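The role of the integrator order can be made concrete with a sketch of the implicit-midpoint consistency residual that such integrator-based training minimizes for a state pair separated by a step dt. The SRK4 scheme of [22] is not reproduced here; the point of the sketch is only that the midpoint residual is second order in dt, which is what makes it inadequate for the large step t = 0 to t = 0.02 used above.

```python
import numpy as np

def midpoint_residual(u0, u1, dt, rhs):
    """Implicit-midpoint consistency residual for a state pair (u0, u1):
    r = (u1 - u0)/dt - rhs((u0 + u1)/2). Training minimizes ||r||^2;
    the residual on exact data is only O(dt^2), which is why a
    fourth-order scheme is preferred for large dt. (Sketch; the SRK4
    method of [22] is not reproduced here.)"""
    return (u1 - u0) / dt - rhs((u0 + u1) / 2)

# For the linear ODE u' = -u with exact data, the residual shrinks as
# O(dt^2): halving dt cuts it by roughly a factor of four.
rhs = lambda u: -u
u0 = np.array([1.0])
errs = []
for dt in (0.2, 0.1, 0.05):
    u1 = u0 * np.exp(-dt)                  # exact flow over one step
    errs.append(abs(midpoint_residual(u0, u1, dt, rhs)[0]))
```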

The Cahn-Hilliard equation
The Cahn-Hilliard equation was originally developed for describing phase separation [7], but has applications also in image analysis, and specifically image inpainting [6,52]. Machine learning of pattern-forming PDEs, which include the Cahn-Hilliard and Allen-Cahn equations, has been studied in [57]. Results on applying PHNN to the Allen-Cahn equation are included in our GitHub repository. Here, however, we only consider the Cahn-Hilliard equation with an external force, given by (28). This is a dissipative PDE if the external force is zero, and it can be written on the form (14) with A = I, S = 0 and R = −∂²/∂x². In the experiments, we have used ν = −1, α = 1 and µ = −1/1000. The initial conditions (29) of the training data are given on the domain [0, P] with P = 1, where a_l, b_l, c_l and d_l are random parameters from the uniform distributions U(0, 1/5), U(0, 1/20), U(1, 6) and U(1, 6), respectively. In addition to the models described in the introduction of this section, we also train a "lean" model, with k = [1, 0, 3, 1] but no prior knowledge of how R looks. For each model type, 10 randomly initialized models were trained for 5000 epochs on different randomly drawn data sets consisting of a total of 300 states, at times t = 0, t = 0.004 and t = 0.08. At each epoch, the model was evaluated by comparing to the ground truth solution of five states at t = 0.01, and the model with the lowest MSE on this validation set was kept. The resulting 10 models were then evaluated on 10 random initial conditions, and the mean MSE at the last time step was calculated from this. The mean and standard deviation over the 10 models of each type are given in Table 4, and the prediction of the model of each type with the lowest mean MSE is shown in Figure 9. In Figure 10 we give the results of the average of all models, including the standard deviation. Here we also show the learned external force and the prediction when this is removed from the model.
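An unforced Cahn-Hilliard right-hand side with the stated parameter values can be sketched on a periodic grid. The free-energy terms and the sign conventions of the paper's form (14) are assumed here rather than copied, since the display equations are not reproduced above; one well-known property that the sketch does respect is that the Cahn-Hilliard flow conserves the total mass ∫u dx.

```python
import numpy as np

def cahn_hilliard_rhs(u, dx, nu=-1.0, alpha=1.0, mu=-1e-3):
    """Right-hand side of an unforced Cahn-Hilliard equation,
    u_t = d^2/dx^2 (nu*u + alpha*u^3 + mu*u_xx), on a periodic grid.
    Parameter values follow the text (nu = -1, alpha = 1, mu = -1/1000);
    the sign conventions of form (14) are an assumption, so this is a
    sketch rather than the exact model (28)."""
    def lap(v):
        return (np.roll(v, -1) - 2 * v + np.roll(v, 1)) / dx**2
    return lap(nu * u + alpha * u**3 + mu * lap(u))

# Small-amplitude evolution with forward Euler; the fourth-order term
# forces a very small stable time step for an explicit method.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 64, endpoint=False)
u = 0.1 * rng.standard_normal(x.size)
dx, dt = x[1] - x[0], 1e-6
mass0 = u.sum()
for _ in range(50):
    u = u + dt * cahn_hilliard_rhs(u, dx)
# total mass u.sum() is conserved by the flow
```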
The initial state of the plots in Figures 9 and 10 is (29) with a_1 = 0.1, a_2 = 0.06, b_1 = 0.01, b_2 = 0.02, c_1 = 2. We see from Figure 9 that the most general PHNN model may model the system moderately well, but it is highly sensitive to variations in the training data and the initialization of the neural networks in the model; from Figure 10 and Table 4 we see that this model may produce unstable predictions. In any case, the PHNN models struggle to learn the external force of this problem accurately without knowing R, which we see by comparing the predictions of PHNN (lean) and PHNN (informed) in Figure 10, where the difference between the models is that the former has to learn an approximation R̂_θ of R and is not informed that f is not explicitly time-dependent.

Analysis of the models and further work
Here we provide some preliminary analysis of the PHNN models, which lays the groundwork for further analysis to be performed in the future.

Table 4: Mean and standard deviation of the MSE at t = 0.02, for 10 models of each type trained on the forced Cahn-Hilliard system (28) and evaluated on predictions from 10 random initial states.

Figure 9: Predictions of the forced Cahn-Hilliard system obtained from the best of 10 models of each model type, as evaluated by the mean MSE at t = 0.02 on predictions from 10 random initial states.

Stability with respect to initial neural network
The training of neural networks is often observed to be quite sensitive to the initial guesses for the weights and biases of the network. Here we test this sensitivity for both the PHNN model and the baseline model on the KdV-Burgers experiment in Section 5.1. We keep the training data fixed, re-generate the initial weights for the neural networks, and rerun the training procedure described in Algorithms 1 and 2. In Figure 11 we plot the solution at the final time together with the standard deviation and the pointwise maximum and minimum values for both the baseline model and the PHNN approach, where the standard deviation and maximum/minimum are computed across an ensemble of different initial weights for the deep neural network. In Figure 12 we plot the L² error at the final time step against the exact solution for a varying number of epochs, where the shaded areas represent the maximum and minimum values over an ensemble of varying initial weights of the neural network.
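The ensemble statistics plotted in Figures 11 and 12 can be sketched as follows; the `predictions` array here is random stand-in data, not the paper's results.

```python
import numpy as np

# Pointwise ensemble statistics across models trained from different
# random initializations: rows = seeds, columns = grid points.
rng = np.random.default_rng(0)
predictions = rng.standard_normal((10, 100))  # stand-in data

mean = predictions.mean(axis=0)
std = predictions.std(axis=0)
lo, hi = predictions.min(axis=0), predictions.max(axis=0)
# the shaded band in the figures is [mean - std, mean + std], while
# lo and hi give the pointwise envelope across the ensemble
```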

Spatial discretization and training data
We will strive to develop PHNN further to make the models discretization invariant. For now, we settle for noting that this is already a property of our model in certain cases: a sufficiently well-trained informed PHNN model will be discretization invariant if the involved integrals do not depend on derivatives.
Of the examples considered in this paper, this applies to the BBM equation, the inviscid Burgers' equation, and the Cahn-Hilliard equation if µ = 0 in (28). Figure 13 shows how the learned BBM system can be discretized and integrated on spatial grids different from the grid the training data was obtained on.
For the experiments in the previous section, we generated training data using second-order finite difference operators to approximate the spatial derivatives. Then we trained our models on the same spatial grid, thus making it possible to learn convolution operators of kernel size three that perfectly capture the operators in the data. In a real-world scenario this is unrealistic, as we would have to deal with the discretization of a continuous system in space as well as in time. We chose to disregard this issue in the experiments to give a clearer comparison between PHNN and the baseline model, not clouded by the error from the spatial discretization that would affect both. However, we have also tested our models on data generated on a spatial grid with four times as many discretization points, so that the data is generated from a more accurate approximation of the differential operators than can be captured by the convolution operators. As we see in Figure 14, the PHNN models do appear to work well even with the introduction of approximation error in the spatial discretization. A more thorough study of this issue is required to understand how best to handle the spatial discretization.
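The correspondence between kernel-size-3 convolutions and second-order finite differences can be sketched directly: the kernel [1, -2, 1]/∆x² reproduces the standard central difference for ∂²/∂x². Note that the weights depend on the grid spacing ∆x, which is the crux of the discretization-invariance issue discussed above. (The grid and test function below are illustrative.)

```python
import numpy as np

def conv3(u, kernel):
    """Apply a periodic kernel-size-3 convolution, the class of discrete
    operators used in the PHNN models."""
    return (kernel[0] * np.roll(u, 1) + kernel[1] * u
            + kernel[2] * np.roll(u, -1))

# A size-3 kernel representing the second-order central difference
# approximation of the second derivative on this particular grid.
N = 64
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
dx = x[1] - x[0]
u = np.sin(x)
kernel = np.array([1.0, -2.0, 1.0]) / dx**2
uxx = conv3(u, kernel)                 # approximates u'' = -sin(x)
err = np.max(np.abs(uxx + np.sin(x)))  # O(dx^2) truncation error
```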

Proof of convergence in the idealized case
In this section we show a simplified error estimate for learning the right-hand side of an ODE. Consider thus a model ODE of the form (30), where u : [0, T) → R^M and g : R^M → R^M. Note that the spatially discretized equation (13) can be cast into this form. In a certain sense, both the baseline model and the PHNN model try to identify g by minimizing the L^p-norm of the difference between the observations u(t_j) and the predictions u_θ(t_j). On a high level, this gives us a sequence u_θ → u in L^p, but as is well known, this would not be enough to conclude anything about the convergence of g_θ → g, since L^p convergence in general does not imply convergence of the derivatives. However, by utilizing the fact that we have a certain control over the discretized temporal derivatives in the learning phase, we can show that g_θ → g in the same L^p norm, provided the training loss is small enough. The following theorem makes this precise.

Figure 12: Convergence of the solution with respect to the number of training epochs. Here, the shaded area represents the maximum and minimum errors obtained, and the solid middle line represents the mean value across different initial weights.

Remark 1.
We note that (31) implies that the sequence (u_j)_j is a forward Euler approximation of (30), while (32) means that the sequence (ũ_j)_j is the learning data obtained by either the baseline or PHNN algorithm using the forward Euler integrator.
Remark 2. In the above theorem, ( ∑_{j=1}^N ∆t ‖u_j − ũ_j‖^p )^{1/p} is proportional to the training loss.
In other words, the error in the approximation of g is bounded by 1/∆t · (training loss). Hence, to achieve an accuracy of ε in g, we need to train to a training loss of order ε∆t.
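The estimate can be checked numerically on a toy problem. In the construction below (an illustrative example, not the paper's proof), the observations are generated by forward Euler with the true g as in (31), the one-step predictions use a perturbed g_θ as in (32), and the L^p discrepancy in g then equals the training loss divided by ∆t exactly.

```python
import numpy as np

# Toy check of the bound: data from the true g, predictions from a
# perturbed g_theta, both via forward Euler. (Illustrative only.)
g = lambda u: -u
g_theta = lambda u: -u + 0.01          # model off by a constant 0.01

dt, N, p = 0.1, 50, 2
u = np.empty(N + 1)
u_tilde = np.empty(N + 1)
u[0] = 1.0
for j in range(N):
    u[j + 1] = u[j] + dt * g(u[j])              # observations, cf. (31)
    u_tilde[j + 1] = u[j] + dt * g_theta(u[j])  # predictions, cf. (32)

loss = np.sum(dt * np.abs(u[1:] - u_tilde[1:]) ** p) ** (1 / p)
g_err = np.sum(dt * np.abs(g(u[:-1]) - g_theta(u[:-1])) ** p) ** (1 / p)
# here u_{j+1} - u~_{j+1} = dt * (g - g_theta)(u_j), so g_err = loss / dt
```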

Conclusions
One of the advantages of PHNN is that it facilitates incorporating prior knowledge and assumptions into the models. The advantage of this is evident from the experiments in Section 5: the informed PHNN model consistently performs very well. We envision that our models can be used in an iterative process, where one starts with the most general model and imposes priors as one learns more about the system. As discussed in Section 6.1, the PHNN models are highly sensitive to variations in the initialized parameters of the neural networks. Judging by the numerical results, the general PHNN model in particular seems to be more sensitive to this than the baseline model. However, the best trained PHNN models outperform the baseline models across the board. In a practical setting where the ground truth is missing, we could train a number of models with different initializations of the neural networks and disregard those that deviate greatly from the others.
Building on [29], we will develop methods for identifying analytic terms for one, several, or all of the parts of the pseudo-Hamiltonian model (15), and compare the performance to existing system identification methods like [49,51,32]. We are especially intrigued by the possibility of identifying the integrals of (14), while the external forces might be best modelled by a neural network.