Exact conservation laws for neural network integrators of dynamical systems

The solution of time-dependent differential equations with neural networks has attracted a lot of attention recently. The central idea is to learn the laws that govern the evolution of the solution from data, which might be polluted with random noise. However, in contrast to other machine learning applications, usually a lot is known about the system at hand. For example, for many dynamical systems physical quantities such as energy or (angular) momentum are exactly conserved. Hence, the neural network has to learn these conservation laws from data, and they will only be satisfied approximately due to finite training time and random noise. In this paper we present an alternative approach which uses Noether's Theorem to inherently incorporate conservation laws into the architecture of the neural network. We demonstrate that this leads to better predictions for three model systems: the motion of a non-relativistic particle in a three-dimensional Newtonian gravitational potential, the motion of a massive relativistic particle in the Schwarzschild metric and a system of two interacting particles in four dimensions.


Introduction
Traditionally, machine learning has been applied to areas where there are no known underlying laws that describe the observed data. For example, deep convolutional neural networks (CNNs) have been very successful in computer vision applications (see [1,2,3] for some famous examples), even though it is unclear how the lower-dimensional manifold of "valid" images is parametrised. This manifold has to be discovered in a purely data-driven way by feeding the network vast amounts of data. Of course, domain-specific prior knowledge is built into the structure of the network in this case as well: deep CNNs are so successful because the network structure supports the hierarchical representation of higher level features and encapsulates the fact that the input images are not random, but nearby pixels are likely to have similar values. In contrast, other fields of science such as physics are governed by well known fundamental laws. There appears to be little value in applying machine learning in this case since these laws allow the construction of an exact mathematical description, which can then be translated into an algorithm for making predictions with traditional techniques from numerical analysis. However, often the dynamics of a system is only partially constrained by these underlying laws. For example, the motion of a particle in an external potential might conserve energy and angular momentum, while the exact form of the potential or the expression for the kinetic energy is still unknown and needs to be inferred from observations. In fact, a very successful approach to constructing new theories in modern physics is to write down the most general Lagrangian which is invariant under certain symmetry transformations and to then constrain the remaining parameters through a fit to experimental data. Famously, general relativity [4] is constructed by demanding that the theory is invariant under local coordinate transformations, but this still allows the existence of a cosmological constant which needs to be constrained by experimental evidence. In particle physics a similar method is used for the construction of Chiral Perturbation Theory [5,6] and other effective theories for which the Lagrangian can be written as an infinite sum of terms that are invariant under certain symmetry transformations. The expansion coefficients have to be inferred from experiments.
In this work we study simpler systems arising in classical mechanics of point particles. The key idea is to represent the Lagrangian by a neural network in such a way that it exactly conserves quantities like linear and angular momentum. The weights of the network are then learned from training data which consists of trajectories that follow the true dynamics and are perturbed by random noise. While in the present paper we use synthetic training data that is obtained by integrating the exact equations of motion numerically for a set of model systems, it should be stressed that the data could also come from the observation of a real physical system. Drawing a historical analogy, our neural network can be seen as the virtual version of an astronomer who tries to derive the celestial equations of motion from observations of the planets in the night sky. In contrast to its medieval predecessors, our virtual astronomer has read Noether's paper [7] and takes care not to violate fundamental physical laws such as angular momentum conservation.

Related work
Several approaches for solving time-dependent differential equations with neural networks have been pursued in the literature. In [8] the authors train an LSTM [9] based neural network to predict the solution at the next time step, given the solution at $s$ previous times. Hence, this can be seen as an extension of classical linear multistep methods (see e.g. [10]), and the hope is that with a suitably trained network it is possible to substantially increase the timestep size without losing accuracy, thereby making the method more efficient than traditional methods. For Hamiltonian systems a completely different approach is pursued in [11]: a neural network is trained to learn a scalar-valued function $H_{\mathrm{NN}}(q, p)$ which approximates the true Hamiltonian $H(q, p)$ as a function of the generalised coordinates $q \in \mathbb{R}^d$ and conjugate momenta $p \in \mathbb{R}^d$; here and in the following $d$ is the dimension of the dynamical system. It is well known that symplectic integrators preserve a so-called shadow Hamiltonian $\widetilde{H}$, i.e. the numerical solution generated with the true Hamiltonian $H$ is the exact solution of $\widetilde{H}$. For this reason, the numerical solution approximately preserves energy. While the authors of [11] do not exploit this since they use a non-symplectic fourth order Runge-Kutta integrator, in [12] it is argued that Symplectic Recurrent Neural Networks (SRNNs) show superior performance. SRNNs represent both the kinetic and potential energy by a neural network and then use a symplectic Verlet method [13] to propagate the solution. Instead of working in the Hamiltonian formulation, it might be more convenient to represent the Lagrangian by a neural network as in [14]. As argued there, working in the Lagrangian framework is more flexible since the system can be formulated in any coordinate system, not just canonical coordinates, which are often not easy to find.
The methods discussed so far only preserve some physical quantities such as the total energy. It can be shown that energy conservation follows directly from time invariance of the system, i.e. the fact that the Lagrangian does not explicitly depend on time. Noether's Theorem [7] generalises this result and shows that invariance of the Lagrangian under any infinitesimal transformation results in a corresponding conservation law. Hence, if the neural network that represents the system's Lagrangian can be constructed such that it is exactly invariant under a certain infinitesimal symmetry transformation, this will guarantee the conservation of a physical quantity. The role of symmetries in neural networks has been explored in [15]. The authors use so-called "hub neurons" to construct a neural network which represents a symmetric function $f : \mathbb{R} \to \mathbb{R}$ satisfying $f(x) = f(-x)$. They then show that this gives a better fit to ground-truth training data which is polluted by non-symmetric noise. They use a similar technique to construct a symplectic neural network for the solution of time-dependent systems, but it is not clear whether their network is truly symplectic. The central role of symmetries is also exploited in [16]. Here the aim is to use neural networks to represent closures for the Reynolds-averaged Navier-Stokes equations. Such a closure allows computing the shear stress in the fluid from the non-turbulent component of the velocity field. Crucially, if the flow is isotropic, the shear stress can only depend on combinations of the velocity field which are invariant under rotations. The authors of [16] achieve this by systematically constructing rotationally invariant scalars from the vector field and restricting the input of their neural network to these scalars.

Aim of this paper
In this paper we show how neural networks can be used to solve Lagrangian dynamical systems while exactly preserving physical quantities. In contrast to [15] we consider invariance under continuous transformations. For this, we follow [14] and represent the Lagrangian as a neural network $L_{\mathrm{NN}}(q, \dot{q})$ which takes as input the coordinates $q$ and velocities $\dot{q}$. As in [16], the first layer of the network reduces $q, \dot{q}$ to a set of scalar variables which do not change under the symmetry transformations that correspond to the conserved quantities. Noether's Theorem then guarantees exact conservation, and during training the network learns the finer details of the dynamics of the system that are not constrained by the conservation laws. Since the network has to find solutions in a smaller sub-manifold of the entire solution space, training is likely to be more efficient. The main achievements of this paper are as follows:

1. We show how invariance under continuous symmetries can be built into Lagrangian neural network models by passing the input through a so-called symmetry-enforcing layer; according to Noether's Theorem this leads to the exact conservation of a corresponding physical quantity.
2. To demonstrate this, we consider three example systems:
   (a) The motion of a single particle in three dimensions under the influence of a central force field, which is invariant under rotations and thus conserves angular momentum.
   (b) The motion of a massive particle in the rotationally invariant Schwarzschild metric, which is the relativistic pendant of the first problem. Again, the three-dimensional angular momentum is conserved.
   (c) The motion of two interacting particles in D-dimensional space, where the interaction potential is invariant under rotations and translations. This results in the conservation of the D-dimensional linear momentum vector and the $\frac{1}{2}D(D-1)$ independent components of the antisymmetric $D \times D$ angular momentum tensor.
3. For all systems we demonstrate that the generated trajectories are more realistic and that they conserve angular momentum and (in the case of the two-particle system) linear momentum to a high degree of accuracy.
4. We further show numerically that enforcing invariance under continuous symmetry transformations makes the trajectories more stable with respect to small perturbations of the initial conditions.
Structure. This paper is organised as follows: in Section 2 we review Noether's Theorem and explain the construction of Lagrangian neural networks that are invariant under continuous symmetry transformations. We write down the explicit form of the symmetry-enforcing input layers for the three model systems considered in this work. Numerical experiments for the model systems are described in Section 3, where we present results that demonstrate the superior performance of our approach. Section 4 contains our conclusions and ideas for future work. Some more technical details are relegated to the appendices: we discuss the construction of scalar invariants under the special orthogonal group SO(D) in Appendix A and present loss histories for the different neural networks in Appendix B.

Continuous symmetries and conservation laws
For completeness and further reference, we start by writing down Noether's Theorem here and refer the reader to [17] for a proof and further details.

Noether's Theorem. Let $h_s : \mathbb{R}^d \to \mathbb{R}^d$ be a differentiable one-parameter family of maps with $h_0 = \mathrm{id}$, and assume that the Lagrangian is invariant under $h_s$, i.e.

$$L\left(h_s(q), \frac{d}{dt} h_s(q)\right) = L(q, \dot{q}) \quad \text{for all } s, \qquad (1)$$

then the following quantity is a constant of motion in the sense that $dI/dt = 0$:

$$I = \frac{\partial L}{\partial \dot{q}^j} \left.\frac{d h_s^j(q)}{ds}\right|_{s=0}. \qquad (2)$$

As an example, consider a particle moving in three-dimensional space $\mathbb{R}^3$ and assume that the Lagrangian is invariant under rotations, i.e. the action of the special orthogonal group SO(3). In this case,

$$h_s(q) = e^{s\Gamma} q, \qquad (3)$$

where $\Gamma$ is an element of the fundamental representation of the Lie algebra $\mathfrak{so}(3)$ of SO(3). The three matrices corresponding to rotations around the x-, y- and z-axes are

$$\Gamma^{(x)} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}, \quad \Gamma^{(y)} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix}, \quad \Gamma^{(z)} = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}. \qquad (4)$$

Further, we have that

$$\left.\frac{d h_s(q)}{ds}\right|_{s=0} = \Gamma q \qquad (5)$$

and hence there are three conserved quantities, namely

$$L^{(\rho)} = \frac{\partial L}{\partial \dot{q}^j} \left(\Gamma^{(\rho)} q\right)^j \quad \text{for } \rho \in \{x, y, z\}. \qquad (6)$$

For a non-relativistic particle such as the one discussed in Section 2.3.1, the kinetic energy in the Lagrangian is $\frac{m}{2}\dot{q}^2$ (with $q, \dot{q} \in \mathbb{R}^3$) and hence $\partial L/\partial \dot{q}^j = m\dot{q}^j$. In this case, the three conserved quantities are the components of the angular momentum vector

$$L = m\, q \times \dot{q}. \qquad (7)$$
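As a quick numerical illustration of Noether's Theorem, the following Python snippet (a minimal sketch we add for concreteness; it is not part of the paper's released code) constructs the generators in Eq. (4) and checks that the quantities in Eq. (6) are conserved along a small timestep of the Kepler dynamics of Section 2.3.1:

```python
# Minimal sketch (assumed setup): check that I^(rho) = m qdot . (Gamma^(rho) q)
# is conserved for the Lagrangian L = m/2 |qdot|^2 + alpha/|q|.
import numpy as np

# Generators of infinitesimal rotations around the x-, y- and z-axes, Eq. (4).
Gamma = {
    "x": np.array([[0, 0, 0], [0, 0, -1], [0, 1, 0]], dtype=float),
    "y": np.array([[0, 0, 1], [0, 0, 0], [-1, 0, 0]], dtype=float),
    "z": np.array([[0, -1, 0], [1, 0, 0], [0, 0, 0]], dtype=float),
}

m, alpha = 1.0, 1.0
q, qdot = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.1])

def acc(q, qdot):
    """Exact acceleration for L = m/2 |qdot|^2 + alpha/|q|."""
    return -alpha * q / (m * np.linalg.norm(q) ** 3)

# One forward Euler step; I^(rho) changes only at O(dt^2) since dI/dt = 0.
dt = 1.0e-4
q_new, qdot_new = q + dt * qdot, qdot + dt * acc(q, qdot)
for rho, G in Gamma.items():
    I_old, I_new = m * qdot @ (G @ q), m * qdot_new @ (G @ q_new)
    print(f"I^({rho}): {I_old:+.8f} -> {I_new:+.8f}")
```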
It is worth pointing out that the quantities in Eq. (6) are conserved in more general cases, such as for the Lagrangian that governs the motion of a massive relativistic particle in the rotationally invariant Schwarzschild metric discussed in Section 2.3.2.

Figure 1: Structure of the function $\Phi$ which maps the inputs $q, \dot{q}$ to the predicted acceleration $\ddot{q} = \Phi(q, \dot{q})$. The symmetry-enforcing layers T, R and R∘T for the specific systems studied in this work are discussed in the main text. A network without any symmetry constraints can be obtained by removing S, as represented by an undecorated downward arrow ↓.

Neural networks with built-in conservation laws
The central idea of this paper is as follows: we wish to construct a mapping $\Phi : \mathbb{R}^{2d} \to \mathbb{R}^d$ which predicts the acceleration $\ddot{q} = \Phi(q, \dot{q})$ from the position $q$ and velocity $\dot{q}$ such that the dynamics of this system preserves certain quantities exactly. To achieve this, we use the Lagrangian formalism and represent the Lagrangian $L_{\mathrm{NN}}$ by a neural network as in [11]. The equations of motion are obtained by finding the stationary points of the action (= the time integral of the Lagrangian), which implies that

$$\frac{d}{dt} \frac{\partial L_{\mathrm{NN}}}{\partial \dot{q}^j} - \frac{\partial L_{\mathrm{NN}}}{\partial q^j} = 0. \qquad (8)$$

Following [14, Eq. (6)] the acceleration can then be obtained by taking the total time derivative of Eq. (8):

$$\ddot{q} = \left(J_{\dot{q},\dot{q}}\right)^{-1} \left(\frac{\partial L_{\mathrm{NN}}}{\partial q} - J_{q,\dot{q}}\, \dot{q}\right), \qquad (9)$$

where the components of the matrices $J_{q,\dot{q}}$, $J_{\dot{q},\dot{q}}$ are given by

$$\left(J_{q,\dot{q}}\right)_{jk} = \frac{\partial^2 L_{\mathrm{NN}}}{\partial q^k\, \partial \dot{q}^j}, \qquad \left(J_{\dot{q},\dot{q}}\right)_{jk} = \frac{\partial^2 L_{\mathrm{NN}}}{\partial \dot{q}^k\, \partial \dot{q}^j}. \qquad (10)$$

The crucial point is now that the Lagrangian is constructed such that it is exactly invariant under the symmetry transformations $h_s$ which correspond to the conserved quantities. To guarantee that $L_{\mathrm{NN}}$ satisfies Eq. (1), we write

$$L_{\mathrm{NN}}(q, \dot{q}) = (D \circ S)(q, \dot{q}), \qquad (11)$$

where $D$ consists of standard, non-linear dense layers. The symmetry-enforcing first layer $S : \mathbb{R}^{2d} \to \mathbb{R}^M$ is a function that combines the inputs $q, \dot{q}$ into a set of $M$ invariants such that

$$S\left(h_s(q), \frac{d}{dt} h_s(q)\right) = S(q, \dot{q}) \quad \text{for all } s.$$

For example, if we consider a three-dimensional system which is invariant under the rotations in Eq. (3), then the simplest three invariants that can be constructed from the position $q$ and the velocity $\dot{q}$ are $q^2$, $\dot{q}^2$ and $q \cdot \dot{q}$. The structure of the entire function $\Phi$ is shown in Fig. 1. Removing the symmetry-enforcing layer S results in the standard architecture already introduced in [11]. The explicit form of symmetry-enforcing layers for specific systems is discussed in Section 2.3.
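The computation in Eqs. (9)-(10) maps directly onto automatic differentiation. The sketch below shows one possible TensorFlow implementation (our own illustration under assumed conventions, not the released code); it works for any callable Lagrangian, including one that contains a symmetry-enforcing layer:

```python
# Sketch of Eqs. (9)-(10): acceleration from an arbitrary Lagrangian L(q, qdot)
# via nested gradient tapes. q, qdot are tensors of shape (d,).
import tensorflow as tf

def acceleration(lagrangian, q, qdot):
    """Solve J_{qdot,qdot} qddot = dL/dq - J_{q,qdot} qdot for qddot."""
    with tf.GradientTape(persistent=True) as outer:
        outer.watch([q, qdot])
        with tf.GradientTape() as inner:
            inner.watch([q, qdot])
            L = lagrangian(q, qdot)
        grad_q, grad_qdot = inner.gradient(
            L, [q, qdot], unconnected_gradients=tf.UnconnectedGradients.ZERO)
    # Second derivatives of the Lagrangian, cf. Eq. (10).
    J_qdotqdot = outer.jacobian(grad_qdot, qdot)
    J_qqdot = outer.jacobian(
        grad_qdot, q, unconnected_gradients=tf.UnconnectedGradients.ZERO)
    rhs = grad_q - tf.linalg.matvec(J_qqdot, qdot)
    return tf.linalg.solve(J_qdotqdot, rhs[:, None])[:, 0]

# Example: Kepler Lagrangian with m = alpha = 1; expect qddot = -q/|q|^3.
q = tf.constant([1.0, 0.0, 0.0])
qdot = tf.constant([0.0, 1.0, 0.0])
L_kepler = lambda q, qdot: 0.5 * tf.reduce_sum(qdot**2) + 1.0 / tf.norm(q)
print(acceleration(L_kepler, q, qdot))  # ~ [-1, 0, 0]
```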
The weights of the dense layers $D$ are learned by training the neural network on measurements of the physical system. As already pointed out above, the training data could consist of (noisy) observations of a real physical system. However, to demonstrate the principle and to not be restricted by a lack of training data, here we use synthetic, simulated data instead. More specifically, we compare the output of the neural network $\Phi_{\mathrm{NN}}$ to (noisy) ground truth data. The ground truth predictions $\hat{y}$ are generated by adding random noise to the true acceleration $\ddot{q} = \Phi_{\mathrm{true}}(q, \dot{q})$ for some choice of $q, \dot{q}$; the function $\Phi_{\mathrm{true}}$ is obtained by replacing $L_{\mathrm{NN}}$ with the true Lagrangian $L_{\mathrm{true}}$ in Eqs. (9) and (10). For each prediction $\hat{y}$ the input $X$ of the neural network is obtained by also perturbing the corresponding $q, \dot{q}$ with random noise, thus simulating measurement errors for a real physical system.

Figure 2: Layer R enforcing invariance under the group SO(3) of rotations for a single particle moving in three-dimensional space as discussed in Section 2.3.1.
During training we minimise the standard mean square error (MSE) loss function, which can be written as

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{j=1}^{N} \left\| \Phi_{\mathrm{NN}}\left(X^{(j)}\right) - \hat{y}^{(j)} \right\|_2^2 \qquad (12)$$

for a set of $N$ training samples $(X^{(j)}, \hat{y}^{(j)})$; as described above, both $X^{(j)}$ and $\hat{y}^{(j)}$ are perturbed by the random noise on the data.
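The data generation and loss described above can be summarised in a few lines. The sketch below is our paraphrase for an arbitrary true acceleration $\Phi_{\mathrm{true}}$; the helper names and the RNG seed are our own choices, not taken from the paper's code:

```python
# Sketch of the noisy training-data generation and the MSE loss of Eq. (12).
import numpy as np

rng = np.random.default_rng(seed=2417)  # seed is our choice, not from the paper

def training_sample(phi_true, q, qdot, sigma=1.0e-3):
    """Return one noisy pair (X, y_hat) for a true acceleration phi_true."""
    d = q.shape[0]
    X = np.concatenate([q + sigma * rng.standard_normal(d),
                        qdot + sigma * rng.standard_normal(d)])
    y_hat = phi_true(q, qdot) + sigma * rng.standard_normal(d)
    return X, y_hat

def mse(phi_nn, samples):
    """MSE loss of Eq. (12) over a list of (X, y_hat) pairs."""
    d = samples[0][0].shape[0] // 2
    return np.mean([np.sum((phi_nn(X[:d], X[d:]) - y_hat) ** 2)
                    for X, y_hat in samples])
```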

Model systems
We now discuss the three physical model systems, their symmetries and the corresponding conserved quantities in more detail.

Motion of single particle in a gravitational potential
First, we consider the dynamics of a particle of mass $m$ moving in a central force field with a potential energy that is inversely proportional to the distance from the origin. The true Lagrangian is

$$L_{\mathrm{true}}(q, \dot{q}) = \frac{m}{2} \dot{q}^2 + \frac{\alpha}{|q|} \qquad (13)$$

for some positive constant $\alpha > 0$. It is well known that for negative total energy $E = \frac{m}{2}\dot{q}^2 - \alpha/|q| < 0$ the trajectory is an ellipse where the origin of the coordinate system coincides with one focal point. Kepler studied this kind of motion in the solar system and derived three laws from observations, without knowing the analytical relationship between position, velocity and acceleration. The Lagrangian in Eq. (13) is invariant under the rotations in Eq. (3), which leads to the conservation of the vector-valued angular momentum $L$ in Eq. (7). As a consequence, the entire trajectory lies in the plane spanned by the initial position $q(0)$ and initial velocity $\dot{q}(0)$; this plane is orthogonal to the vector $L$.
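For reference, the Euler-Lagrange equations for Eq. (13) give $m\ddot{q} = -\alpha\, q/|q|^3$; a one-line sketch of the corresponding true acceleration (our own helper, used in the data-generation sketch above):

```python
# Exact acceleration implied by Eq. (13): m qddot = -alpha q / |q|^3.
import numpy as np

def phi_true(q, qdot, m=1.0, alpha=1.0):
    """True acceleration of a particle in the potential -alpha/|q|.

    qdot is unused here but kept to match the general signature Phi(q, qdot).
    """
    return -alpha * q / (m * np.linalg.norm(q) ** 3)
```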
Our objective is to learn the neural network Lagrangian $L_{\mathrm{NN}}(q, \dot{q})$, and hence the relationship between $q, \dot{q}$ and $\ddot{q}$, from data under the assumption that the dynamics is invariant under the action of the special orthogonal group SO(3). To achieve this, recall that the simplest invariant scalars that can be constructed from position $q \in \mathbb{R}^3$ and velocity $\dot{q} \in \mathbb{R}^3$ are $q^2$, $\dot{q}^2$ and $q \cdot \dot{q}$. Hence, the symmetry-enforcing layer S in Fig. 1 should take as input $q$ and $\dot{q}$ and output these three invariant quantities; in the following we denote this layer as R (for "rotationally invariant"), see Fig. 2. Adopting the Einstein sum convention of summing over pairs of identical upper and lower indices, the three quantities that are conserved for the dynamics generated with the neural network Lagrangian $L_{\mathrm{NN}}$ can be written as

$$L_{\mathrm{NN}}^{(\rho)} = \varepsilon^{\rho}{}_{jk}\, q^j\, \frac{\partial L_{\mathrm{NN}}}{\partial \dot{q}_k}, \qquad (14)$$

where $\varepsilon$ is the three-dimensional Levi-Civita symbol defined in Eq. (38). We stress that by construction the angular momenta in Eq. (14) are exactly conserved, independent of the weights of the neural network. The quantity $L_{\mathrm{NN}}^{(\rho)}$ can be likened to the shadow Hamiltonian, which is conserved for symplectic integrators of time-independent Hamiltonian systems. For the true Lagrangian in Eq. (13) the "neural network" angular momentum $L_{\mathrm{NN}}^{(\rho)}$ reduces to the three components of the "true" angular momentum vector in Eq. (7), namely

$$L_{\mathrm{true}}^{(\rho)} = m\, \varepsilon^{\rho}{}_{jk}\, q^j \dot{q}^k. \qquad (15)$$

For future reference we combine the quantities defined in Eqs. (14) and (15) into two vectors:

$$L_{\mathrm{NN}} = \left(L_{\mathrm{NN}}^{(x)}, L_{\mathrm{NN}}^{(y)}, L_{\mathrm{NN}}^{(z)}\right)^T, \qquad L_{\mathrm{true}} = \left(L_{\mathrm{true}}^{(x)}, L_{\mathrm{true}}^{(y)}, L_{\mathrm{true}}^{(z)}\right)^T. \qquad (16)$$
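A possible implementation of the layer R is shown below (a sketch under our own conventions for the input layout $(q, \dot{q})$, not the reference code), together with a quick numerical check of its rotational invariance:

```python
# Sketch of the symmetry-enforcing layer R of Fig. 2: maps (q, qdot) to the
# three SO(3)-invariant scalars (q.q, qdot.qdot, q.qdot).
import tensorflow as tf

class RotationallyInvariantLayer(tf.keras.layers.Layer):
    """Input of shape (..., 6) = (q, qdot); output of shape (..., 3)."""

    def call(self, inputs):
        q, qdot = inputs[..., :3], inputs[..., 3:]
        return tf.stack(
            [
                tf.reduce_sum(q * q, axis=-1),        # q . q
                tf.reduce_sum(qdot * qdot, axis=-1),  # qdot . qdot
                tf.reduce_sum(q * qdot, axis=-1),     # q . qdot
            ],
            axis=-1,
        )

# Invariance check: outputs are unchanged when q and qdot are rotated.
layer = RotationallyInvariantLayer()
x = tf.random.normal([5, 6])
G = tf.constant([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
R = tf.linalg.expm(0.7 * G)  # rotation by 0.7 rad around the z-axis
x_rot = tf.concat([x[:, :3] @ tf.transpose(R), x[:, 3:] @ tf.transpose(R)],
                  axis=-1)
print(tf.reduce_max(tf.abs(layer(x) - layer(x_rot))))  # ~ rounding error
```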
Motion of a massive relativistic particle in the Schwarzschild metric
Next, we consider the motion of a massive particle in the Schwarzschild metric [18,19], which is the relativistic equivalent of the Kepler problem in Section 2.3.1. With the space-time dependent metric tensor $g(q)$ the Lagrangian¹ can be written as

$$L(q, \dot{q}) = g_{\mu\nu}(q)\, \dot{q}^\mu \dot{q}^\nu, \qquad (17)$$

where $q^\mu$ are the components of the four-dimensional contravariant time-position vector $q = (x, \tau) = (x^1, x^2, x^3, \tau)$. $\dot{q}^\mu = dq^\mu/dt$ denotes the derivative with respect to the proper time $t$ experienced by a moving observer, and $\tau$ is the time measured by a static observer far away from the origin, where $g \to \mathrm{diag}(+1, +1, +1, -1)$ tends to the constant metric of flat space-time.

Figure 3: Layer R enforcing rotational invariance of the spatial part of the Schwarzschild problem discussed in Section 2.3.2.
Expressing the spatial coordinate $x$ in spherical coordinates $(r, \theta, \varphi)$, the true Lagrangian in Eq. (17) can be written as

$$L_{\mathrm{true}} = \left(1 - \frac{r_s}{r}\right)^{-1} \dot{r}^2 + r^2\left(\dot{\theta}^2 + \sin^2\theta\, \dot{\varphi}^2\right) - \left(1 - \frac{r_s}{r}\right) \dot{\tau}^2, \qquad (18)$$

where $r_s$ is the Schwarzschild radius and $r = |x|$ is the distance from the origin. It is easy to see² that Eq. (18) is invariant under rotations in three-dimensional space and can be expressed entirely in terms of the three scalars $x^2$, $\dot{x}^2$, $x \cdot \dot{x}$ and the time derivative $\dot{\tau}$. As a consequence, the dynamics conserves the (specific) three-dimensional angular momentum, which turns out to be

$$L = x \times \dot{x}. \qquad (19)$$
Note that this expression differs from the one in Eq. (7) only through scaling by the (constant) mass $m$. The dynamics generated by the neural network Lagrangian conserves the quantities

$$L_{\mathrm{NN}}^{(\rho)} = \varepsilon^{\rho}{}_{jk}\, x^j\, \frac{\partial L_{\mathrm{NN}}}{\partial \dot{x}_k} \qquad (20)$$

exactly, and in analogy to Eq. (15) we can also write down the three components of the true angular momentum vector as

$$L_{\mathrm{true}}^{(\rho)} = \varepsilon^{\rho}{}_{jk}\, x^j \dot{x}^k. \qquad (21)$$

Again, we want to learn the Lagrangian $L_{\mathrm{NN}}(q, \dot{q})$ which is represented by a neural network. We drop the dependency on $\tau$, since we found that including it leads to instabilities during training. The reason for this is that for the true solution $\tau$ grows (approximately linearly) with time. While the neural network will learn that the coefficient that multiplies $\tau$ in the Lagrangian is small, it will always remain non-zero, thus leading to a large, unphysical contribution for $\tau \gg 1$. Note that dropping $\tau$ from the inputs is consistent with the true Lagrangian in Eq. (18), which only depends on $\dot{\tau}$ but not on $\tau$ itself. The symmetry-enforcing layer takes as input the two vectors $q, \dot{q} \in \mathbb{R}^4$ and returns the four scalars $\dot{\tau}$, $x^2$, $\dot{x}^2$ and $x \cdot \dot{x}$, as shown in Fig. 3. With this symmetry-enforcing layer the dynamics generated by the neural network Lagrangian conserves the three quantities in Eq. (20) exactly.
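The corresponding layer for the Schwarzschild problem could look as follows (again a sketch with an assumed input layout $q = (x, \tau)$, not the released code); note that $\tau$ itself is read but discarded, as discussed above:

```python
# Sketch of the layer R of Fig. 3: returns (taudot, x^2, xdot^2, x.xdot)
# from q = (x, tau) and qdot = (xdot, taudot), dropping tau itself.
import tensorflow as tf

class SchwarzschildInvariantLayer(tf.keras.layers.Layer):
    """Input of shape (..., 8) = (q, qdot); output of shape (..., 4)."""

    def call(self, inputs):
        q, qdot = inputs[..., :4], inputs[..., 4:]
        x, xdot, taudot = q[..., :3], qdot[..., :3], qdot[..., 3]
        return tf.stack(
            [
                taudot,
                tf.reduce_sum(x * x, axis=-1),        # x . x
                tf.reduce_sum(xdot * xdot, axis=-1),  # xdot . xdot
                tf.reduce_sum(x * xdot, axis=-1),     # x . xdot
            ],
            axis=-1,
        )
```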

Two interacting particles in D dimensions
Finally, we consider a system of two non-relativistic particles with masses $m_1$ and $m_2$ that move in $D$-dimensional space and interact via a potential that is invariant under translations and rotations. Setting $q = (x^{(1)}, x^{(2)})$ with $x^{(1)}, x^{(2)} \in \mathbb{R}^D$, the true Lagrangian can be written as

$$L_{\mathrm{true}}(q, \dot{q}) = \frac{m_1}{2} \left(\dot{x}^{(1)}\right)^2 + \frac{m_2}{2} \left(\dot{x}^{(2)}\right)^2 - V\left(\left\|x^{(1)} - x^{(2)}\right\|_2\right), \qquad (22)$$

with the interaction potential $V(r)$ given by a double-well function in the inter-particle distance for some positive constants $\mu, \kappa > 0$. As usual, $\|\cdot\|_2$ denotes the Euclidean two-norm.

¹ The geodesic line element is $ds^2 = g_{\mu\nu}\, dq^\mu dq^\nu$, which would imply that the Lagrangian is $ds/dt = \sqrt{g_{\mu\nu}(q)\, \dot{q}^\mu \dot{q}^\nu}$. However, since the square root is a monotonous function, the Lagrangian in Eq. (17) has the same stationary points and thus generates the same dynamics.

² Observe that $r^2\left(\dot{\theta}^2 + \sin^2\theta\, \dot{\varphi}^2\right) = \dot{x}^2 - (x \cdot \dot{x})^2/x^2$ and $\dot{r}^2 = (x \cdot \dot{x})^2/x^2$.

Figure 4: Pictorial representation of the symmetry-enforcing layers T, R and R∘T.
Again, the goal is to learn the neural network Lagrangian

$$L_{\mathrm{NN}} = L_{\mathrm{NN}}\left(x^{(1)}, x^{(2)}, \dot{x}^{(1)}, \dot{x}^{(2)}\right) \qquad (25)$$

which approximates the dynamics of the system, while taking into account the symmetries of the problem. Although $L_{\mathrm{NN}}$ in Eq. (25) depends on $4D$ unknowns, it can be restricted considerably by assuming that the system is invariant under translations and/or rotations. For this, we consider the following two continuous transformations of the entire system:

$$\text{translations:} \quad x^{(i)} \mapsto x^{(i)} + s\Delta, \qquad \text{rotations:} \quad x^{(i)} \mapsto e^{s\Gamma} x^{(i)} \quad \text{for } i = 1, 2,$$

where $\Delta \in \mathbb{R}^D$ and the antisymmetric $D \times D$ matrix $\Gamma$ belongs to the fundamental representation of the Lie algebra $\mathfrak{so}(D)$ of the special orthogonal group SO(D).
Assuming that the physics of the system is invariant under translations and/or rotations dramatically restricts the possible form of the Lagrangian $L_{\mathrm{NN}}$. For example, for a rotationally invariant (but not necessarily translationally invariant) system the Lagrangian can only depend on dot-products of pairs of the four dynamical variables $x^{(1)}, x^{(2)}, \dot{x}^{(1)}, \dot{x}^{(2)}$, of which there are $\binom{4}{2} + 4 = 10$, and on full contractions with the $D$-dimensional Levi-Civita symbol; for $D = 4$,

$$\varepsilon_{\alpha\beta\rho\sigma}\, x^{(1)\alpha} x^{(2)\beta} \dot{x}^{(1)\rho} \dot{x}^{(2)\sigma}$$

is the only non-vanishing contraction in this case. With this notation, the set of scalar invariants that are output by the symmetry-enforcing layer S can be written down as in Tab. 1. While the most general Lagrangian is a function of $4D$ unknowns, translational invariance reduces it to a function which only depends on $3D$ variables. A Lagrangian which is invariant under rotations (but not necessarily translations) depends on $10 + \binom{4}{D}$ unknowns; assuming both translational and rotational invariance reduces this further to only $6 + \binom{3}{D}$ variables. To derive the conserved quantities according to Eq. (2), observe that the $D$ generators of the translation group are the vectors $\Delta^{(\alpha)} \in \mathbb{R}^D$ with components

$$\Delta^{(\alpha)}_j = \delta^{\alpha}_{j},$$

where $\delta_{jk}$ is the Kronecker-$\delta$. The $\frac{1}{2}D(D-1)$ generators $\Gamma^{(\rho,\sigma)}$ of the rotation group are the antisymmetric $D \times D$ matrices with entries

$$\Gamma^{(\rho,\sigma)}_{jk} = \delta^{\rho}_{j}\, \delta^{\sigma}_{k} - \delta^{\sigma}_{j}\, \delta^{\rho}_{k}.$$

Using Noether's Theorem, this leads to the following $\frac{1}{2}D(D+1)$ conserved quantities:

$$M_{\mathrm{NN}}^{(\alpha)} = \sum_{i=1,2} \frac{\partial L_{\mathrm{NN}}}{\partial \dot{x}^{(i)}_\alpha}, \qquad L_{\mathrm{NN}}^{(\rho,\sigma)} = \sum_{i=1,2} \left( x^{(i)}_\rho\, \frac{\partial L_{\mathrm{NN}}}{\partial \dot{x}^{(i)}_\sigma} - x^{(i)}_\sigma\, \frac{\partial L_{\mathrm{NN}}}{\partial \dot{x}^{(i)}_\rho} \right). \qquad (32)$$

For the true Lagrangian in Eq. (22) these reduce to the total linear momentum and the angular momentum tensor,

$$M_{\mathrm{true}}^{(\alpha)} = m_1 \dot{x}^{(1)}_\alpha + m_2 \dot{x}^{(2)}_\alpha, \qquad L_{\mathrm{true}}^{(\rho,\sigma)} = \sum_{i=1,2} m_i \left( x^{(i)}_\rho \dot{x}^{(i)}_\sigma - x^{(i)}_\sigma \dot{x}^{(i)}_\rho \right). \qquad (33)$$

For future reference we collect the conserved quantities in the following vectors:

$$M_{\mathrm{NN}} = \left(M_{\mathrm{NN}}^{(\alpha)}\right)_{\alpha=1,\ldots,D}, \quad M_{\mathrm{true}} = \left(M_{\mathrm{true}}^{(\alpha)}\right)_{\alpha=1,\ldots,D}, \quad L_{\mathrm{NN}} = \left(L_{\mathrm{NN}}^{(\rho,\sigma)}\right)_{\rho<\sigma}, \quad L_{\mathrm{true}} = \left(L_{\mathrm{true}}^{(\rho,\sigma)}\right)_{\rho<\sigma}. \qquad (34)$$
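To make the construction concrete, the sketch below (our illustration, with the hypothetical helper name rot_trans_invariants) builds the six invariants for the translationally and rotationally invariant layer R∘T; note that with only the three vectors $r = x^{(1)} - x^{(2)}$, $\dot{x}^{(1)}$, $\dot{x}^{(2)}$ and $D = 4$ no full contraction with the Levi-Civita symbol exists, consistent with $\binom{3}{4} = 0$:

```python
# Sketch of the layer R∘T for the two-particle system: 6 = binom(3,2) + 3
# dot products of r = x1 - x2, xdot1 and xdot2.
import itertools
import tensorflow as tf

D = 4

def rot_trans_invariants(q, qdot):
    """q = (x1, x2) and qdot = (xdot1, xdot2), each of shape (..., 2*D)."""
    r = q[..., :D] - q[..., D:]
    vectors = [r, qdot[..., :D], qdot[..., D:]]
    # All dot products v_i . v_j with i <= j.
    return tf.stack(
        [tf.reduce_sum(v * w, axis=-1)
         for v, w in itertools.combinations_with_replacement(vectors, 2)],
        axis=-1,
    )
```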

Results

Implementation and hyperparameters
The code that was used to obtain the numerical results reported in this section has been implemented in TensorFlow [20] and is freely available at the following URL, which also contains instructions on how to install and run the code:

https://github.com/eikehmueller/mlconservation_code
The results reported in this paper were generated with the release published at [21]. In all cases the Lagrangian is represented by two hidden dense layers $D_1$, $D_2$ with $n_h = 128$ output units each and a final dense layer $D_3$ with a single output unit. This unit does not have a bias term since adding a constant to the Lagrangian does not change the equations of motion. A softplus activation function is used in all cases. The layer weights are initialised with a random normal distribution similarly to [11]: the standard deviation of the normal distribution is set to $\sqrt{2/n_h}$ for the first hidden layer $D_1$, $\sqrt{1/n_h}$ for the second hidden layer $D_2$ and $1/n_h$ for the output layer $D_3$; the biases of all layers are initialised to zero. Single precision arithmetic is used in all numerical experiments. For all three systems discussed in Section 2.3 the networks are trained over 2500 epochs with 100 steps per epoch and a batch size of $B = 128$ using the Adam optimiser. The training schedule is a cosine decay, starting with a learning rate of $10^{-3}$ which is reduced to $10^{-5}$ at the end of the training cycle. We find that this reduces the MSE to $10^{-5}$–$10^{-6}$ for the single-particle problem described in Section 2.3.1, to $10^{-6}$ for the relativistic particle in Section 2.3.2 and to approximately $10^{-4}$ for the two-particle system in Section 2.3.3. This is consistent with the chosen random noise on the training data, which is $\sigma = 10^{-3}$ in all cases and therefore limits the minimal achievable MSE to the order of $\sigma^2 \sim 10^{-6}$. The full loss histories can be found in Appendix B.

Figure 5: Trajectories for the motion of a particle in a gravitational potential described in Section 2.3.1 without (top) and with (bottom) constraints on the neural network to enforce rotational invariance of the Lagrangian. In each case, a trajectory that is obtained by perturbing the initial conditions by $\sim 10^{-3}$ is also shown as a dashed curve. The ellipse that represents the true solution is marked with red dots.
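In Keras the network and the training configuration described above might be assembled as follows (a sketch; the custom weight initialisation is omitted and the layer names are our own):

```python
# Sketch of the network and optimiser configuration described above.
import tensorflow as tf

n_h, n_invariants = 128, 3  # 3 invariants for the single-particle problem

dense = tf.keras.Sequential([
    tf.keras.Input(shape=(n_invariants,)),
    tf.keras.layers.Dense(n_h, activation="softplus"),
    tf.keras.layers.Dense(n_h, activation="softplus"),
    # No bias: a constant shift of the Lagrangian does not change the dynamics.
    tf.keras.layers.Dense(1, use_bias=False),
])

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1.0e-3,
    decay_steps=2500 * 100,  # 2500 epochs x 100 steps per epoch
    alpha=1.0e-2,            # final rate 1e-5 = alpha * initial rate
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```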

Motion of a single particle in a gravitational potential
Both the mass of the particle and the strength of the gravitational potential in Eq. (13) are set to $m = \alpha = 1$. To train the model, we generate pairs of inputs $X^{(j)} = \left(q_{\mathrm{exact}}(\phi^{(j)}) + \sigma\xi_1,\; \dot{q}_{\mathrm{exact}}(\phi^{(j)}) + \sigma\xi_2\right) \in \mathbb{R}^6$ and ground truth $\hat{y}^{(j)} = \ddot{q}_{\mathrm{exact}}(\phi^{(j)}) + \sigma\xi_3 \in \mathbb{R}^3$, where the components of $\xi_{1,2,3} \sim \mathcal{N}(0, 1)$ are drawn from a normal distribution with mean zero and unit variance. We assume that the motion takes place entirely in the $x$-$y$ plane and the true trajectory is an ellipse with eccentricity $\varepsilon_{\mathrm{ecc}} = 0.8$. The initial conditions are chosen such that the vertical component of the angular momentum is $L_z = 1$. Fig. 5 shows the trajectories predicted with the trained neural network Lagrangian. The equations of motion are integrated up to the final time $T = 128$ with a fourth order Runge-Kutta (RK4) method and a timestep size of $\Delta t = 10^{-2}$. The figure also shows a trajectory for which both initial position and initial velocity are perturbed by normal random noise with a standard deviation of $10^{-3}$. As can be seen from the top figure, the trajectory obtained with the neural network Lagrangian deviates significantly from the true solution (the red ellipse) for the unconstrained Lagrangian and the two trajectories with perturbed initial conditions diverge. In fact, the neural network trajectories become unstable. Radically different behaviour is observed for the neural network with built-in rotational invariance (bottom plot in Fig. 5). Visually the trajectories obtained with the neural network Lagrangian cannot be distinguished from the true solution. In particular, the trajectories appear to be confined to the $x$-$y$ plane, which implies that the two corresponding components of the angular momentum indeed remain close to zero. To investigate the conservation of angular momentum further, we compute the time evolution of the deviation of the angular momentum from its original value. For this we consider the relative deviations

$$\delta L(t) = \frac{\left\| L(t) - L(0) \right\|_2}{\left\| L(0) \right\|_2} \qquad (35)$$

of the angular momentum vectors defined by Eqs. (14), (15) and (16) from their initial values at time $t = 0$. Exact conservation of angular momentum would correspond to $\delta L_{\mathrm{true}}(t) = 0$ for all times $t$. As Fig. 6 (top) shows, for the unconstrained network $\delta L_{\mathrm{true}}(t)$ is small initially, since the network has learned some degree of angular momentum conservation from the data, but then increases to around 1 by time $t \approx 50$ and diverges shortly after that. For the rotationally invariant network, on the other hand, the deviation $\delta L_{\mathrm{NN}}(t)$ remains at the level of single precision rounding errors. More importantly, for the rotationally invariant neural network the deviation $\delta L_{\mathrm{true}}(t)$ of the true angular momentum never exceeds $10^{-3}$. Hence, even though mathematically only the conservation of $L_{\mathrm{NN}}$ can be guaranteed, numerically this appears to also help with the conservation of $L_{\mathrm{true}}$. This behaviour is similar to the approximate conservation of the true energy observed for symplectic integrators of time-independent Hamiltonian systems, which is related to the exact conservation of the shadow Hamiltonian. Finally, in Fig. 7 we show the distance $\delta q_{\mathrm{NN}}(t) = \|q_{\mathrm{NN}}(t) - q^{(\mathrm{perturbed})}_{\mathrm{NN}}(t)\|_2$ between the position vectors of the two neural network trajectories that only differ by a $\sim 10^{-3}$ perturbation of the initial condition. While, as expected, the trajectories diverge over time, for the rotationally invariant neural network they stay much closer together and for the considered time interval their distance stays below 0.1. Intuitively, the reason for this is that rotational invariance limits the allowed trajectories to a smaller sub-manifold, so there is less "room" for nearby trajectories to diverge.
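The diagnostics in this section only require a standard RK4 integrator on top of the learned acceleration. A minimal sketch (assumed helper names, not the released code):

```python
# Sketch: integrate a learned acceleration phi(q, qdot) with classical RK4 and
# monitor the drift of the true angular momentum L = m q x qdot.
import numpy as np

def rk4_step(phi, q, qdot, dt):
    """One RK4 step for the first-order system y = (q, qdot)."""
    def f(y):
        q_, qdot_ = y[:3], y[3:]
        return np.concatenate([qdot_, phi(q_, qdot_)])
    y = np.concatenate([q, qdot])
    k1 = f(y)
    k2 = f(y + 0.5 * dt * k1)
    k3 = f(y + 0.5 * dt * k2)
    k4 = f(y + dt * k3)
    y = y + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return y[:3], y[3:]

def angular_momentum_drift(phi, q, qdot, dt=1.0e-2, n_steps=12800, m=1.0):
    """Return max_t ||L_true(t) - L_true(0)||_2 up to T = n_steps * dt."""
    L0 = m * np.cross(q, qdot)
    drift = 0.0
    for _ in range(n_steps):
        q, qdot = rk4_step(phi, q, qdot, dt)
        drift = max(drift, np.linalg.norm(m * np.cross(q, qdot) - L0))
    return drift
```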

Motion of a massive relativistic particle in the Schwarzschild metric
For the motion of a massive relativistic particle in the Schwarzschild metric as defined in Section 2.3.2 we set the Schwarzschild radius to $r_s = 0.1$. In this case, the "true" trajectory $q_{\mathrm{exact}}(t)$ is obtained by integrating the exact equations of motion, which can be obtained from the true Lagrangian in Eq. (18), with a RK4 integrator with a timestep size of $\Delta t = 10^{-2}$. As above, normally distributed random noise with a standard deviation of $\sigma = 10^{-3}$ is added to obtain training samples $(X^{(j)}, \hat{y}^{(j)})$ with $X^{(j)} = \left(q_{\mathrm{exact}}(t^{(j)}) + \sigma\xi_1,\; \dot{q}_{\mathrm{exact}}(t^{(j)}) + \sigma\xi_2\right) \in \mathbb{R}^8$ and ground truth $\hat{y}^{(j)} = \ddot{q}_{\mathrm{exact}}(t^{(j)}) + \sigma\xi_3$, where $\xi_{1,2,3} \sim \mathcal{N}(0, 1)$ and $t^{(j)}$ are sample times along the true trajectory. Fig. 8 (top) shows that for the unconstrained network the trajectories do not become unstable. However, they still diverge strongly from the true solution (shown as a dotted red line) and they oscillate about the $x$-$y$ plane, which indicates that angular momentum is not conserved. Perturbing the initial conditions also leads to very different trajectories at later times. Fig. 8 (bottom) demonstrates that the picture is fundamentally different for the rotationally invariant Lagrangian neural network. Here the trajectory generated with the neural network stays very close to the true solution (which is indeed almost completely hidden beneath the solid blue curve) and the two trajectories with slightly differing initial conditions only diverge slowly at later times. Furthermore, the motion appears to be completely confined to the $x$-$y$ plane, which indicates that the two corresponding components of the angular momentum are conserved. Again, we investigate the conservation of angular momentum quantitatively by plotting the time evolution of the two-norms $\delta L_{\mathrm{NN}}(t)$ and $\delta L_{\mathrm{true}}(t)$ in Eq. (35), now using the expressions for $L_{\mathrm{NN}}$ and $L_{\mathrm{true}}$ defined via Eqs. (20) and (21). As Fig. 9 shows, the unconstrained network is able to learn the conservation of angular momentum to some degree, but the true angular momentum $L_{\mathrm{true}}(t)$ deviates from its initial value by around 10% at later times. This is consistent with the fact that the trajectories in Fig. 8 (top) are visibly not constrained to the $x$-$y$ plane, but do not stray too far from it either. The neural network with built-in rotational invariance, on the other hand, is able to reduce the relative deviation $\delta L_{\mathrm{true}}(t)$ from the initial angular momentum to less than $10^{-3}$. As expected, the quantity $\delta L_{\mathrm{NN}}(t)$, which would be zero in exact arithmetic and for an exact time integrator, is very small and never exceeds a value of around $10^{-5}$, which is consistent with (accumulated) rounding errors in single precision arithmetic. Fig. 10 shows the distance $\delta q_{\mathrm{NN}}(t) = \|q_{\mathrm{NN}}(t) - q^{(\mathrm{perturbed})}_{\mathrm{NN}}(t)\|_2$ between the two neural network trajectories with slightly perturbed initial conditions. The plot confirms that the rotationally invariant Lagrangian is much more robust under perturbations of the initial conditions. Compared to the unconstrained Lagrangian neural network, the distance between the perturbed and the unperturbed trajectories is around one order of magnitude smaller and grows only moderately over time.

Two interacting particles in D = 4 dimensions
Finally, we consider the system of two interacting particles described in Section 2.3.3.

Figure 8: Trajectories for the motion of a relativistic massive particle in the Schwarzschild metric as defined in Section 2.3.2 without (top) and with (bottom) constraints on the neural network to enforce rotational invariance of the spatial part of the Lagrangian. In each case, a trajectory that is obtained by perturbing the initial conditions by $\sim 10^{-3}$ is also shown as a dashed curve. The true solution is shown as a dotted red curve.

Since (except in the first case) the input to the neural network is first passed through a symmetry-enforcing layer, the number of inputs to the first hidden layer $D_1$ depends on which symmetries we assume for the Lagrangian, as shown in Tab. 2. Note that as we do not enforce invariance under reflections (i.e. we only consider the SO(4) subgroup instead of the full O(4) group), in the case of a rotationally (but not necessarily translationally) invariant Lagrangian the contraction $\varepsilon_{\alpha\beta\rho\sigma}\, x^{(1)\alpha} x^{(2)\beta} \dot{x}^{(1)\rho} \dot{x}^{(2)\sigma}$ with the Levi-Civita symbol is included in the set of invariants constructed from $\{x^{(1)}, x^{(2)}, \dot{x}^{(1)}, \dot{x}^{(2)}\}$. As in the other two cases, we generate synthetic training samples $(X^{(j)}, \hat{y}^{(j)}) \in \mathbb{R}^{16} \times \mathbb{R}^8$ by integrating the true equations of motion obtained from the Lagrangian in Eq. (22) with a RK4 integrator ($\Delta t = 10^{-2}$) and adding normally distributed noise with a standard deviation of $\sigma = 10^{-3}$. Fig. 11 shows a projection of the four-dimensional trajectories onto the first two dimensions for each of the four considered setups. The true solution (dashed) and the neural network solution (solid) are shown for both particles with the final positions at time $T = 8$ marked by circles (•/○). Visually, the completely unconstrained network gives the worst solution. While enforcing rotational or translational invariance improves this somewhat, the best qualitative agreement is achieved with the neural network Lagrangian that is both rotationally and translationally invariant. Although even in this case the final positions have a distance of order 1, this appears to be mainly attributable to the fact that the neural network solution lags behind the true solution. Ignoring this phase error, the trajectories generated with the neural network Lagrangian show good agreement with the true solution. To quantify the conservation of the linear and angular momentum, we again plot the time evolution of $\delta L_{\mathrm{NN}}(t)$ and $\delta L_{\mathrm{true}}(t)$ as defined in Eq. (35), together with the corresponding deviations $\delta M_{\mathrm{NN}}(t)$ and $\delta M_{\mathrm{true}}(t)$, where $M_{\mathrm{NN}}$ and $M_{\mathrm{true}}$ are the four-dimensional linear momentum vectors given in Eq. (34). Since the initial conditions are chosen such that the total linear momentum is zero, we cannot normalise by $\|M_{\mathrm{NN}}(0)\|_2$ and $\|M_{\mathrm{true}}(0)\|_2$, but instead divide by the two-norm of the linear momentum of the first particle. While the unconstrained network is able to learn the conservation of $L_{\mathrm{true}}$ and $M_{\mathrm{true}}$ to some degree, at the final time both quantities have deviated from their initial values by 10%–100%. The picture is somewhat better once either translational or rotational invariance is enforced. In particular, rotational invariance limits the relative deviation of both the linear and angular momentum to below ≈ 10%. As expected, the linear and angular momenta $M_{\mathrm{NN}}$ and $L_{\mathrm{NN}}$ derived from the neural network Lagrangian (see Eq. (32)) are conserved up to rounding errors caused by single precision arithmetic. Enforcing both rotational and translational invariance improves the conservation of the true linear and angular momentum substantially, with both now not deviating by more than 1% from their initial values. In contrast to the other cases, there also does not appear to be any upwards trend for $\delta M_{\mathrm{true}}(t)$ or $\delta L_{\mathrm{true}}(t)$ as time increases. Finally, we study the stability of the neural network trajectories under small perturbations of the initial conditions. For this, in each of the four cases the initial position and velocity were perturbed by the same normal noise with standard deviation $10^{-3}$. The distance $\delta q_{\mathrm{NN}}(t) = \|q_{\mathrm{NN}}(t) - q^{(\mathrm{perturbed})}_{\mathrm{NN}}(t)\|_2$ between the trajectories with these two different initial conditions is visualised in Fig. 13. As can be seen from this plot, enforcing rotational and translational invariance reduces the growth of the perturbation significantly. However, enforcing only either rotational or translational invariance seems to be almost as good, with the latter performing even slightly better at large times $t$.

Conclusion
In this paper we showed how the Lagrangian neural network approach in [14] can be combined with Noether's Theorem to enforce conservation laws while learning the dynamics of a physical system from noisy data. As the numerical results demonstrate, this generates significantly more realistic trajectories since the network does not have to learn the underlying conservation laws from data. As expected, for each symmetry of the Lagrangian the neural network integrators conserve a quantity up to rounding errors from single precision arithmetic. While the corresponding invariant quantity of the true Lagrangian is not conserved exactly, its deviations from the initial value are significantly reduced. Using a Lagrangian with built-in symmetries also improves the stability of the solution in the sense that the system is less sensitive to small perturbations of the initial conditions.

Figure 13: Distance $\delta q_{\mathrm{NN}}(t) = \|q_{\mathrm{NN}}(t) - q^{(\mathrm{perturbed})}_{\mathrm{NN}}(t)\|_2$ between two trajectories obtained with slightly different initial conditions for the two-particle problem. The initial conditions differ by $\delta q_{\mathrm{NN}}(0) \sim 10^{-3}$.

There are several avenues for extending the work presented here. To demonstrate the principal ideas behind our approach we assumed that essentially unlimited (synthetic) data with relatively small errors is available for training. It would be interesting to instead train the Lagrangian neural networks on measured experimental data for real physical systems. While we already assumed that the training data is perturbed by random noise, in real-life situations measured data might be corrupted or only projections of the state space vectors might be available. For example, in celestial dynamics usually only motion perpendicular to the line of sight of the observer can be measured. More generally, it would be interesting to further explore how the symmetry-enforcing layers improve the solution for limited data with larger errors. So far, we considered Lagrangians that describe the motion of point particles. However, the Lagrangian formalism can also be extended to continuous systems such as (classical) field theories and it would be interesting to see how our approach works in this case. For example, in [14] a discretised one-dimensional wave equation is studied and the authors show that the Lagrangian approach leads to the approximate conservation of energy.
A Construction of scalar invariants under the special orthogonal group SO(D)

Consider a set of $n$ vectors $v^{(1)}, v^{(2)}, \ldots, v^{(n)} \in \mathbb{R}^D$. The $D$-dimensional Levi-Civita symbol is defined as

$$\varepsilon_{\alpha_1 \alpha_2 \ldots \alpha_D} = \begin{cases} \mathrm{sgn}(\sigma) & \text{if } (\alpha_1, \ldots, \alpha_D) = (\sigma(1), \ldots, \sigma(D)) \text{ for a permutation } \sigma, \\ 0 & \text{otherwise,} \end{cases} \qquad (38)$$

where $\mathrm{sgn}(\sigma) \in \{-1, +1\}$ denotes the sign of the permutation $\sigma$. There are $\binom{n}{2} + n = \frac{1}{2}n(n+1)$ different contractions with the metric tensor, namely the dot products

$$v^{(i)} \cdot v^{(j)} = \delta_{\alpha\beta}\, v^{(i)\alpha} v^{(j)\beta} \quad \text{for } 1 \le i \le j \le n. \qquad (39)$$

To simplify notation, we have adopted the Einstein sum convention of summing over pairs of identical upper and lower indices. The number of full contractions with the Levi-Civita symbol is $\binom{n}{D}$. Each distinct full contraction is given by a choice of indices $\alpha_1, \alpha_2, \ldots, \alpha_D$ with $1 \le \alpha_1 < \alpha_2 < \cdots < \alpha_D \le n$ and it can be written down explicitly as

$$\varepsilon_{\beta_1 \beta_2 \ldots \beta_D}\, v^{(\alpha_1)\beta_1} v^{(\alpha_2)\beta_2} \cdots v^{(\alpha_D)\beta_D}. \qquad (40)$$

There are no full contractions with the Levi-Civita symbol if $n < D$ and there is only one full contraction for $n = D$. At this point it is also worth pointing out that here we only assume invariance under the special orthogonal group SO(D) ⊂ O(D). In addition to rotations, the orthogonal group O(D) also contains reflections, which change the sign of the contractions in (40). Hence, if we were to enforce this additional constraint, only even powers of these terms could appear in the Lagrangian. Since the product of two Levi-Civita symbols can be expressed in terms of Kronecker-δs, such even powers can be rewritten entirely in terms of the dot products in (39).
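The enumeration of invariants described in this appendix is straightforward to implement; the sketch below (a hypothetical helper of our own) exploits the fact that each full contraction in Eq. (40) equals the determinant of the matrix whose rows are the selected vectors:

```python
# Sketch: enumerate all SO(D)-invariant scalars of n vectors, i.e. the
# 1/2 n(n+1) dot products of Eq. (39) and the binom(n, D) contractions of
# Eq. (40).
import itertools
import numpy as np

def so_invariants(vectors):
    """vectors: list of n arrays of shape (D,). Returns all invariant scalars."""
    D = vectors[0].shape[0]
    dots = [v @ w
            for v, w in itertools.combinations_with_replacement(vectors, 2)]
    # Each full contraction is det(v_{a_1}, ..., v_{a_D}) with a_1 < ... < a_D;
    # itertools.combinations yields nothing if n < D, as expected.
    epsilons = [np.linalg.det(np.stack(subset))
                for subset in itertools.combinations(vectors, D)]
    return dots + epsilons

vs = [np.random.default_rng(0).standard_normal(4) for _ in range(4)]
print(len(so_invariants(vs)))  # 10 dot products + 1 epsilon contraction = 11
```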

B Loss histories
The following figures show the loss histories for the MSE loss defined in Eq. (12) for all three problems considered in this paper.