An algorithm for discovering Lagrangians automatically from data

An activity fundamental to science is building mathematical models. These models are used to both predict the results of future experiments and gain insight into the structure of the system under study. We present an algorithm that automates the model building process in a scientifically principled way. The algorithm can take observed trajectories from a wide variety of mechanical systems and, without any other prior knowledge or tuning of parameters, predict the future evolution of the system. It does this by applying the principle of least action and searching for the simplest Lagrangian that describes the system's behaviour. By generating this Lagrangian in a human interpretable form, it also provides insight into the working of the system.


I. INTRODUCTION
Modern science is, in many senses, highly automated. Experiments are frequently run under computer control, with data often recorded by the computer directly. Computerised data analysis and visualisation are widely used to process the resulting large volumes of data. Indeed, the ability to collect and analyse massive data sets is opening up an entirely new measure-first-ask-questionslater approach to science: the Square Kilometer Array radio telescope is expected to collect approximately one exabyte of data per day [1]; over 10 14 collisions from the ATLAS detector were analysed in the search for the Higgs boson [2]; and state-of-the-art whole-genome sequencers can currently sequence 600 gigabases per day [3]. In each of these examples the scientific questions are not fully formulated in advance of taking the data, and the question of how to best extract knowledge from the dataset is of great interest. This motivates the study of how to scale up the processes of scientific reasoning to take advantage of the wealth of available data.
Thus far, scientific reasoning has largely resisted automation. Hypothesising and refining models is still on the whole carried out by humans, with little direct support from computers. It has long been a desire of artificial intelligence researchers to automate this part of science, and with the growing volume of data available from experiments the motivation for this desire comes ever more sharply into focus. In this paper we present a step in this direction: an algorithm that automates finding a mathematical model for a system, in a scientifically principled way, by examining only its observed behaviour.
Early attempts to automatically model physical systems searched for simple mathematical regularities in observed quantities. Langley's BACON system was able to re-discover many simple laws -the ideal gas law, Ohm's law, Coulomb's law and others -from experimental data [4]. Dzeroski and Todorovski went beyond simple static laws with their LAGRANGE system which was able to search for differential equations that governed observed time series [5]. They extended this work to the LAGRAMGE system which additionally allowed an expert user to provide domain knowledge, improving the quality of the results [6]. The PRET system, developed by Bradley and collaborators, brings to bear a variety of advanced AI techniques on the problem of identifying system differential equations [7]. It has a sophisticated method for representing qualitative observations, and allows expert-user domain knowledge to be combined with automatic search very effectively. Schmidt and Lipson [8] used a genetic programming approach to automatically evolve invariant mathematical expressions from observed data [9]. In the context of engineering, there is a significant body of work on 'system identification', with techniques ranging from very general ad hoc fitting methods to fitting detailed physical models representing important classes of system [10,11].
In this work we take a different approach than those described above, the essence of which is that we embed a simple, general physical principle -the principle of least action -and very little else into our algorithm. While we are embedding the domain knowledge of a physicist in our algorithm, we are not embedding information about any particular physical system or class thereof. Rather we are capturing a deep understanding that has been distilled by physicists over the past 270 years, and packaging it into an algorithm that can be applied by non-experts. We find the algorithm to be surprisingly powerful, given its simplicity, but this power comes not from the ingenuity of its construction, rather from the broad applicability of the physical principle embedded in it.

II. THE PRINCIPLE OF LEAST ACTION
The principle of least action is one of the most fundamental and most celebrated principles in physics. First proposed by Maupertuis [12] and Euler [13] it states that the problem of predicting the behaviour of many physical systems can be cast as finding the behaviour that minimises the expenditure of some cost function. The total expenditure of the cost function by the system is known as the action. It is a remarkable fact that the behaviour of a very wide range of physical systems -including those studied in classical mechanics, special and general relativity, quantum field theory, and optics -can all have their behaviour explained in terms of minimising a cost function.
Each physical system has its own cost function, and once this function is known it is possible to predict exactly what the system will do in the future. The exercise of determining the cost function -often known as Lagrange's function, or just the Lagrangian -for a particular physical system is central to physics. Feynman described this process well [14] as "some kind of trial and error" advising students that "You just have to fiddle around with the equations that you know and see if you can get them into the form of the principle of least action." In this paper we present an algorithm that does this "fiddling" automatically, without requiring the user to have any expertise in physics.
The Lagrangian is ideal as the output of an automated modelling algorithm. It has the desirable property that it is a single, scalar expression that contains everything necessary to predict the system's future evolution. Consider, in contrast, finding the Hamiltonian where it would also be necessary to find the corresponding conjugate momenta. The Lagrangian has the additional quality that it is coordinate-independent and as a result can be written in any coordinate system. This is useful in the case of an automated algorithm where it is not obvious in which coordinate system the data might be presented.
The problem of taking a Lagrangian and automatically calculating the resulting motion of the system has been widely studied and applied. To the best of our knowledge, this work is the first that solves the inverse problem of taking the observed motion and calculating the Lagrangian for non-trivial systems.

III. THE ALGORITHM
To find a model for a system we search over a space of possible Lagrangians. To do this we need three elements: an objective to guide the search, which will take the form of a score function; a representation of the possible Lagrangians; and an algorithm to execute the search over the possible Lagrangians, working to improve the score. We will first describe the score function, which is the central idea of the algorithm.

A. Score function
The objective of our algorithm is to find a Lagrangian that, when integrated along the system's observed trajectory, yields a smaller total (action) than when integrated along any neighbouring trajectory. It would be possible to implement this definition of the least action principle directly in an algorithm, but instead we take an indirect approach that is more computationally efficient. For a Lagrangian L(θ, φ, . . . ,θ,φ, . . .) and a trajectory (θ(t), φ(t), . . . ,θ(t),φ(t), . . . ) it is possible to write down a condition, in the form of a set of differential equations, that must be satisfied if the action is to be stationary along the trajectory. These differential equations are known as Euler-Lagrange's equations, where q ∈ {θ, φ, . . .}. It is to be understood that the partial derivatives are taken symbolically with respect to the coordinates and velocities, which are then replaced with the time-dependent functions from the trajectory before the time-derivative is taken. We can define a score function based on these conditions, which is zero if the Euler-Lagrange equations are exactly satisfied. We note that Hillar and Sommer first proposed using a (different) score function derived from the Euler-Lagrange equations in [15], but did not apply it to finding Lagrangians from data. In practice our observations of the system are not functions (θ(t), φ(t), . . . ,θ(t),φ(t), . . . ) but discretely sampled time-series of the coordinates and generalized velocities. The algorithm operates with a dataset which is a time-series of samples where time runs from t = 1 . . . N . We divide this timeseries into two portions, a training set, comprising samples 1 . . . M , and a validation set of samples M + 1 . . . N . The algorithm will conduct its search using only the training set, reserving the validation set for out-of-sample measurement of the prediction error. In this way we can truly test the algorithm's ability to predict the future dynamics of the system. In all of the examples in this paper the sampling times will be evenly spaced, but this is not a requirement. We can discretize the Euler-Lagrange score function (1) to work with these sampled datasets, giving where the subscript on the score indicates that it is taken with respect to the dataset D. The square-bracketed quantities in this expression are time-series, and the subscript indicates taking the element in this time-series at the given time. So, for instance, the first term in (2) is to be calculated, in principle, by: first differentiating the candidate Lagrangian L symbolically with respect to the appropriate generalized velocity; evaluating this quantity at every time-step in the dataset to yield a new timeseries; taking the discrete derivative of this new timeseries with respect to time; and finally finding the element at time t in this time-derivative time-series. In practice, as we shall see below, a more computationally efficient implementation may be used.
The function EL D is the basis of the score function, capturing the principle of least action, but it is not sufficient on its own. While it is true that the Lagrangian we seek minimises EL D , the converse is not true as there are other functions which minimise EL D but are not physically meaningful Lagrangians. The first class of functions that we wish to avoid are those which are numerically tiny, for instance L = 10 −100 θ. We deal with these by introducing a normalisation score for each candidate Lagrangian, We will compose our final score from the scores EL D and N D in such a way, to be detailed below, that to score well a candidate Lagrangian must simultaneously have a low score for EL D and a score of around one for N D . The target value of one for N D is chosen arbitrarily. We can always arrange for the normalisation score to be approximately one, as the least-action trajectory is unchanged if the Lagrangian is scaled by a constant.
There is a second, more interesting class, of unwanted expressions that minimise the Euler-Lagrange score EL D . Consider, for instance, the candidate Lagrangian L = θ nθ . This Lagrangian satisfies the Euler-Lagrange equations trivially, in a way that does not depend on the trajectory. Such path-independent least-action Lagrangians are interesting from a physics point-of-view, being closely related to gauge invariance, but here they are a nuisance. To guide the search away from these expressions we introduce a second 'control' trajectory, C. This trajectory is unrelated to the behaviour of the system under study and serves solely to eliminate path-independent Lagrangians. We reason that the Lagrangian that we are seeking will score well with EL D but should score poorly on EL C , which is the Euler-Lagrange score evaluated along the control trajectory. The exact form of the control trajectory is unimportant so long as it not a valid trajectory of the system under study. In this work we use a control trajectory which is uniform motion in each coordinate, with velocity arbitrarily chosen to be 0.1, for all experiments.
We combine the three parts described above to give the search score function, where U (x) = ln(x + ) 2 + 1 is a function that is minimised, with value approximately one, when the argument is one. The small constant , typically set to be 10 −10 , ensures that the score function has the desired asymptotic behaviour for small values of the numerator and denominator, even when faced with errors from finite precision machine numbers. The factor U (EL C (L)) prevents the search algorithm from driving towards Largrangians that perform badly on the real dataset, but even worse on the control data. Overall, the score function drives the search to find Lagrangians that simultaneously minimise the action along the observed trajectory while having a non-zero action along the control trajectory, and a normalisation score close to one. Note that the score function, S does not in any way consider whether the prediction of the candidate Lagrangian agrees with the training data. It only considers whether the trajectory satisfies a least action principle for the candidate Lagrangian. The fact that this, on the face of it unrelated, objective leads to successful predictions is the insight from physics that we have embedded in our algorithm.

B. Representation and search
We have experimented with two representations of candidate Lagrangians. The first, a restricted polynomial representation, allows a fast search algorithm to be implemented. It is limited in the Lagrangians it can represent exactly, although through Taylor's theorem it can find approximations to any Lagrangian. This representation was used to generate the bulk of the results in this paper, and we describe it in detail in this section. The second representation lifts some of the constraints of the restricted polynomial model, at the expense of vastly increased computational cost. We describe it in section V.
The restricted polynomial representation assumes that the Lagrangian can be represented by a polynomial in the coordinates and velocities. The model is a sum of monomial terms, parametrised by coefficients multiplying every term. We restrict this polynomial in two ways: we limit the maximum power of any coordinate or velocity to be m; and we limit the maximum degree of any combination of coordinates and velocities to p. In addition, we remove any terms from the model that can have no physical significance, that is terms that are constant or of the form q nq . So, for instance, for one variable θ with m = 3 and p = 4 the resulting model would be For a given restricted polynomial model the score function is minimised by adjusting the parameters c i . We conduct this optimisation using the Nelder-Mead simplex algorithm [16] using the modified parameters of Gao and Han [17] which improve the efficiency in high dimensions. The coefficients are bounded between -1 and 1, enforced by a penalty function. We use a tight convergence tolerance, usually one part in 10 10 , to encourage the search to break out of local minima. We impose a maximum iteration limit, usually 5 × 10 6 , on the search to ensure that it is bounded in time.
We do not know in advance what values of m and p are needed to accurately represent the Lagrangian of the system under study. What's more, we wish to find the simplest Lagrangian such that the trajectory satisfies Sketches of the five test systems that we consider.
the principle of least action. We approach this using a simple heuristic algorithm. We start with the smallest non-trivial model (m = 2, p = 2) and optimise the parameters with the simplex search. We then make an insample prediction of how well the optimised Lagrangian predicts the dynamics of the system in the training sample. This is done by generating equations of motion from the optimised Lagrangian and numerically solving them, using initial conditions derived from the first sample in the dataset. If this in-sample prediction fits better than a specified tolerance then we stop and return the Lagrangian. If it does not fit then we generate a larger model (i.e. with larger values of m and/or p) and try again. The models are stepped through in increasing number of monomial terms. This proceeds until either a model is found that fits or a maximum bound on model complexity is reached. This heuristic algorithm only crudely captures the notion of mathematical complexity of the model, but it seems to work adequately well. Note that, for a given polynomial model, it is possible to partially pre-calculate the score function EL D for a given dataset, yielding a function quadratic in the coefficients c i . This is possible because the form of the model is fixed and it is possible to calculate its derivatives in advance. As a result, after the initial simplification of the score function, optimisation iterations are fast, and have a run-time independent of the number of data points.
Code for the score function, search algorithms and the datasets we use below can be downloaded from [18].

IV. RESULTS
We will consider five test systems, illustrated in figure 1. The first is the unforced Duffing oscillator, a textbook non-linear system. The second, a simple pendulum, is interesting because its Lagrangian cannot be represented exactly in the restricted polynomial representation. The third system, two masses on a frictionless surface joined by three springs to each other and two immovable walls, has two coupled degrees of freedom. The fourth system is the double pendulum, a coupled, two degree-of-freedom non-linear system capable of chaotic motion. As with the simple pendulum, the double pendulum cannot have its Lagrangian represented exactly by a finite degree polynomial. The fifth and final system is the Penning-type ion-trap, a three degree-of-freedom system with magnetic and electrostatic forces, that is of considerable experimental relevance. Figure 2 shows the result of applying the algorithm to simulated data sets for these systems. It can be seen that the algorithm is able to successfully predict the future dynamics of all of the test systems. Let us look in detail at the progress of the algorithm, and the resulting learned models, for two of the example systems.
In the case of the Duffing oscillator the algorithm tried seven, increasingly complex, polynomial models to arrive at the prediction shown, which was generated by the model with m = 4 and p = 4. The final model has 11 free parameters, and required 2160 Nelder-Mead iterations to optimise. The complete search working through all seven models, with a single-threaded implementation, executes in under five seconds on a 2012 2.0GHz Intel Core i7-3667U powered MacBook Air. The optimised Lagrangian, where we have removed terms with coefficients less than 10 −5 and displayed the remaining coefficients to two decimal places for clarity, was L = −0.30 x 2 + 0.14 x 4 + 0.20ẋ 2 . This is exactly the expression, apart perhaps from overall scaling, that would be written by a human physicist. The coefficients yielded by the search are found to match the correct coefficients to the 6th decimal place, limited by the convergence tolerance that we set. By generating a model in this form the algorithm gives insight into the system directly from the data.
The case of the simple pendulum is also interesting to consider. Here the search algorithm tried three models, where the third, with m = 2 and p = 4, converged in 480 Nelder-Mead iterations. The search in this case took around 0.6 seconds. The generated model, multiplied by 100 to make it more readable, was L = 0.049 x+8.6 x 2 −3.0ẋ 2 −0.0030 xẋ 2 −0.41 x 2ẋ2 . (4) It can be noted that this is not a straightforward Taylor expansion of the simple pendulum's Lagrangian, and it is We see that the algorithm is robust to noise, finding a model that accurately predicts the future evolution.
not obvious how to relate it to the standard form. Experimenting with removing terms and solving the resultant equations of motion indicates that the terms proportional to x and xẋ 2 are unimportant, but the relatively small term in x 2ẋ2 is essential. Despite being in an unexpected form, this Lagrangian does make successful predictions.
We shall see in section V that it is in fact a local approximation of a true Lagrangian around the region of configuration space that the training trajectory explored. Real world measurements are inevitably noisy and so to be practically useful it is important that the algorithm is able to converge even in the presence of imperfections in the data. We took the data for our third test systemthe coupled harmonic oscillators -and added normally distributed noise, with standard deviation 0.1 (about 5% of the oscillation amplitude) to the position, velocity, and acceleration. Figure 3 shows the result of running the algorithm on this noisy data set. We see that the algorithm is robust to this noise, finding a model that describes the future evolution well. The algorithm's convergence was slowed slightly taking 2130 iterations, compared to 1740 for the noise-free case, to converge on the final m = 2, p = 2 model.

V. GENERALISATION
We have shown that the algorithm can find models which successfully predict the future evolution of the system's behaviour. However, a good physical model does not just capture the behaviour of a particular time-series, corresponding to a particular set of initial conditions. Rather, it should be able to predict the behaviour of the system over a range of initial conditions. It is perhaps this ability to generalise that sets a true physical model apart from a mere fit or interpolation of the system's behaviour. It is interesting, therefore, to study whether the models found by our algorithm have this property.
We have seen in the case of the Duffing oscillator that the discovered model is indeed the correct model, and we would expect that this model will correctly predict the dynamics of the system for any initial conditions. We test this by simulating the behaviour of the system for a wide range of initial conditions, and comparing the results to the predictions of the model. We find, as expected, that the learned model for the Duffing oscillator does make correct predictions for all initial conditions.
Applying this procedure to the other test systems we find that the coupled harmonic oscillators and the Penning-type ion trap models also generalise well, making successful predictions for all initial conditions. This indicates that our algorithm is not merely a sophisticated curve fitting routine, but rather is finding the underlying physical truth behind the system dynamics to make its predictions.
The pendulum and double pendulum models do not generalise well, as we might have anticipated from the form of the Lagrangian in (4). Figure 4 compares the prediction of the learned simple pendulum model against the true behaviour, for a variety of swing amplitudes. We see that while the prediction is accurate for the amplitude at which the model was trained, it deviates at other amplitudes. These results are perhaps to be expected, and could well be the same as generated by a human physicist given the same data. The algorithm has found a mathematically simple approximation that works well for the data it has available to it, but does not have enough to go on to determine the true underlying model.
We consider two approaches to generating models that generalise better for these systems, inspired by the approaches a human physicist might take. The first method is simply to train the models with more data, corresponding to a wider range of initial conditions. The second is to introduce new mathematical constructs which allow a simpler model to be found, reasoning that this model is more likely to generalise well.
For the first approach we follow exactly the same procedure as before except we generate a number of trajectories, corresponding to a range of initial conditions, and use a score that is the sum of the scores for the individual trajectories. We applied this procedure to the simple pendulum system. The resulting search takes approximately 15 times longer to converge than the singletrajectory search. We find that the algorithm is unable to converge on an m = 2, p = 4 model, as it did before, and has to continue its search until it finds an m = 4, p = 4 model whose predictions fit all of the trajectories adequately. Figure 5 compares the predictive ability of this model with the 'single-trajectory' model of the previous section. We see that, as shown in figure 4, the singletrajectory model makes good predictions for the initial condition it was trained at, but makes poor predictions for other initial conditions. The 'multi-trajectory' model, though, is much improved. It makes good predictions at all of the initial conditions it was trained at, and further makes good predictions at other, unseen initial conditions as well. We have found similar results for the doublependulum system, although the computational difficulty of the problem constrained the experiment to a limited region of initial-condition-space.
Our second approach to generalisation is to expand the representation of the Lagrangians to encompass a wider range of mathematical expressions. We reason that, with a wider palette of mathematics at its disposal, the algorithm may be able to find a model of simpler form that works well. History has shown, although this may be tautological, that often systems of interest to physics can be described by remarkably simple mathematical models. We hope that by allowing the algorithm to generate structurally simpler models, it may be more likely to discover the underlying physical truth.
We have developed a proof-of-principle implementation of a richer representation, and a corresponding search algorithm, detailed in the appendix. Briefly, we compose mathematical expressions as trees with leafnodes corresponding to the system variables, simple functions (sine, cosine, square) of these variables, and numerical constants. Branch-nodes of the tree are arithmetic operators +, −, ×. This structure can represent a much wider range of mathematical forms than our polynomial representation. We search over this tree-structured representation using a genetic programming technique that simultaneously tries to optimise the score and minimise the size of the trees. Thus, this search algorithm tries explicitly to find simple expressions that score well on the data.
Repeating the search on large-amplitude (±0.95π) simple pendulum data using the tree-based representation highlights the relative strength of this approach. The generated model, which makes a successful prediction, is the same as would be written by a human physicist. Naturally, this model makes correct predictions over the full range of initial conditions. There are two reasons that the tree-based expression search is able to converge on this model. First it is only because the representation of possible models is richer that this model can be directly represented at all. Second, the notion of mathematical complexity in this representation more closely models that of a human physicist. This allows the search algorithm to do more work driving the result towards an expression that we recognize as canonical. It must be cautioned, though, that this is only a proof-of-principle demonstration. To reach this result we had to bias the search algorithm, as described in the appendix, and even then the run time is significantly longer, often taking many hours with a multi-threaded implementation on the hardware described above. We were not able to get results for the double pendulum system at all with the computing resources at our disposal. Nonetheless, we present this result as the technique shows potential for learning models that are both better able to generalise, and in a format more suitable for communication to human physicists.

VI. CONCLUSION
We have demonstrated an algorithm that can predict the future dynamics of a physical system directly from observed data. We have shown that the algorithm generates models that can be communicated to a human physicist, sometimes even finding models in textbook form. The model corresponding to the solid (red) curve was trained with multiple trajectories, indicated by the other (red) arrows. The original trajectory was also included in the training set for this model. We see that the 'multi-trajectory' model makes better predictions across a wide range of initial conditions, including conditions that it was not trained on. The multi-trajectory model is able to make successful predictions up to surprisingly large amplitudes, well beyond those it has seen in training.
We have further shown that the models generated generalise well to unseen data, and are not merely fits or interpolations, but are truly capturing the physical essence of the system under study.
One might ask what the use of such an algorithm is. As a first point, we find the question of whether a computer can do science to be fascinating in itself. Investigating the limits of a computer's ability in this regard educates us as to the strengths and weaknesses of our current scientific processes, and invites us to consider a different perspective on our scientific work.
But perhaps a more practical answer is that tools such as this could assist humans in their work. We see this assistance as coming in two forms. The first is simply automating the actions of a scientist so they can be applied to more data. Techniques that can be automatically applied to datasets, scanning for scientifically interesting features -in the case of the algorithm in this paper, for example, finding that there is a least action principle at work -may come to be a fruitful approach to generating unexpected scientific leads as we head into a data-dominated era. The second is opening up the techniques of science to non-specialists. By capturing the idea of searching for least action models in an algorithm we make it available to anyone, including those without the necessary skills to do it by hand. It is interesting in this regard to consider popular online natural language translation software. While no-one would consider these tools suitable for translating poetry, they nonetheless are exceedingly useful to many people in the common case where a 'good-enough' translation will do. While we do not imagine computers will replace expert human physicists in the near term, we envisage the availability of tools to automate scientific reasoning will empower nonspecialists to take better advantage of the discoveries and insights of physics.

ACKNOWLEDGMENTS
We acknowledge many useful discussion with Jeremy Mabbitt, Mike Tarbutt, Jack Devlin, Iain Barr, Ben Sauer, and Ed Hinds. DJAH and AMG contributed equally to this work.

Appendix: Tree-based representation and search
Here we describe in detail the tree-structured symbolic representation of mathematical expressions and the corresponding search routine. The expressions are built following a grammar designed to bias the search towards expressions that might be Lagrangians for simple physical systems. The terminals of these expression trees are one of: numerical constants, randomly generated; the coordinate variables and velocities; the squares of the coordinates and velocities; and for coordinates which represent angles, the sine and cosine of the coordinates and velocities. The non-terminal nodes of the trees are the operators +, −, ×. An example expression tree is shown in figure 6, representing the expression θ 2 − 3 cos(θ).
To search through these tree-structured expressions we take a genetic programming approach [19], explicitly optimising both the least-action score (3) and also a complexity score, using the SPEA2 multi-objective optimisation algorithm [20]. This biologically-inspired evolutionary algorithm maintains a population of candidate expressions and breeds, reproduces, and mutates them to try and simultaneously optimise the least-action and complexity scores.
In detail, we first construct a population of randomly generated expressions, usually numbering 100. We score these expressions using the least-action score and also assign a complexity score which is simply the number of nodes in the expression tree. The SPEA2 algorithm takes the current population, and an initially empty set of elite expressions, representing the best that have been seen so far. It has a rather complex selection mechanism [20] that produces a new set of elite expressions, plus a set of expressions, the breeding pool, which are candidates for reproduction. A new generation is created from the breeding pool by mutation (10%) and pair-wise crossover (90%). Mutation is effected by replacing a random subtree of the given expression with a randomly generated subtree. The crossover operation takes two expressions, selects a random point in each of the two trees, and swaps the sub-trees at these points to generate two new expressions. The evolutionary process is repeated starting from this new generation, and we iterate for a large number of generations, typically many thousand. To improve the convergence speed of the numeric constants in the expressions we also incorporate a small amount of hilldescent into each evolutionary iteration: a subset (20%) of the expressions have their numeric constants randomly adjusted by a small amount, and if this improves their least-action score, the modification is kept. We also impose a maximum size of expression (50 nodes) and trim expressions that exceed it each generation to ensure that the run-time is bounded. The final elite set is a set of expressions that represent the trade-off between leastaction score and complexity. We select from this set the simplest expression that has a least-action error below a specified threshold as the output of the run.
In this tree-based method, the structure of the candidate Lagrangians varies during the search, so it is not possible to partially pre-calculate the score function, as it was in the restricted polynomial technique. Rather it must be calculated in full for each expression in the population. Further, the search space of possible expressions is exceedingly large, and the score is not very smooth with respect to the genetic operations. As a result, the search must run for many generations and is extremely computationally expensive. A naïve implementation might calculate the partial derivative time-series in (3) by symbolically differentiating the candidate expression and then calculating the value of the derivative. This, however, can be exponentially expensive in the depth of the expression, in terms of both memory and runtime. A better approach, that we adopt in this work, is to simultaneously evaluate the expression's value and its derivatives using automatic differentiation [21]. This method avoids calculating an expression for the derivative, and has runtime proportional to the size of the expression.