Probabilistic Grammars for Equation Discovery

Equation discovery, also known as symbolic regression, is a type of automated modeling that discovers scientific laws, expressed in the form of equations, from observed data and expert knowledge. Deterministic grammars, such as context-free grammars, have been used to limit the search spaces in equation discovery by providing hard constraints that specify which equations to consider and which to exclude. In this paper, we propose the use of probabilistic context-free grammars in the context of equation discovery. Such grammars encode soft constraints on the space of equations, specifying a prior probability distribution over the space of possible equations. We show that probabilistic grammars can be used to elegantly and flexibly formulate the parsimony principle, which favors simpler equations, through the probabilities attached to the rules of the grammar. We demonstrate that the use of probabilistic, rather than deterministic, grammars in the context of a Monte-Carlo algorithm for grammar-based equation discovery leads to more efficient equation discovery. Finally, by specifying prior probability distributions over equation spaces, probabilistic grammars lay the foundations for Bayesian approaches to equation discovery.


Introduction
Large amounts of scientific data are increasingly available to scientists from different disciplines for building models in the form of equations. This has recently triggered a sharp increase in interest in equation discovery, also known as symbolic regression. Equation discovery aims at the automated discovery of quantitative laws, expressed in the form of equations, in collections of measured data [9,24]. The history of science offers plenty of instances of this task. Take, for example, Newton's second law of motion, which relates the acceleration a of an observed moving object, its mass m and the force F applied to it, in the well-known equation F = m · a. As part of an equation discovery task, the collected data would include measurements of m, a and F.
The task of equation discovery is closely related to the task of supervised regression. Machine learning methods for supervised regression assume a fixed class of models, e.g., linear or piece-wise linear, and find the one that provides the best fit to the training data. Equation discovery methods typically consider broader classes of mathematical equations. These classes may be vast and many (often infinitely many) equations can be found that provide excellent fit to the training data. The challenge of symbolic regression is therefore twofold. On one hand, one can easily overfit the training data with an unnecessarily complex equation. On the other hand, the space of candidate equations is huge and grows exponentially as equation complexity increases, posing serious computational issues to equation discovery methods.
Based on the way they address these challenges, equation discovery methods can be clustered into two general paradigms. The first paradigm deploys a general and powerful search algorithm to the unconstrained space of candidate equations. Most commonly, symbolic regression methods employ genetic algorithms [24,12]. Recently, alternative approaches based on sparse regression [6] or neural networks [29] have been proposed. The second paradigm, commonly referred to as equation discovery, focuses on the use of knowledge for constraining the space of candidate equations. The different types of knowledge being considered include measurement units [30], grammars [27,23], cross-domain knowledge [2] and domain-specific knowledge [15,3].
A notable limitation of the existing methods following the equation-discovery paradigm is that they can only use hard constraints, i.e., they can only include or exclude a certain equation from the list of candidates. In this paper, we extend the equation discovery paradigm and propose probabilistic context-free grammars (PCFGs) as a formalism for constraining the space of candidate equations. In addition to hard constraints, probabilistic grammars allow for soft constraints that express preferences over the space of candidates by assigning a prior probability to each of them.
The conjecture of this paper is that the use of probabilistic context-free grammars holds promise for a significant impact on the computational efficiency of equation discovery methods. To test the validity of this conjecture, we consider one hundred equations from the Feynman data set [29]. For each of them, we observe the expected number of candidate equations that have to be considered by different approaches to reconstruct the correct equation. We perform a theoretical comparison of the expected number of candidates for a deterministic grammar and different variants of a probabilistic context-free grammar for arithmetic expressions. We also analyze the impact of the prior distribution over the space of candidate equations on the reconstruction performance of a simple Monte-Carlo sampling algorithm for equation discovery.
The remainder of the paper is organized as follows. Section 2 introduces probabilistic context-free grammars, in general, and their specific instances for arithmetic expressions that we are going to use in our analysis, in particular. In Section 3, we perform a theoretical analysis of the expected number of candidate equations that have to be considered for reconstructing the Feynman equations. Section 4 reports the results of the empirical comparison of the reconstruction performance of different sampling procedures for equation discovery. Section 5 discusses the analysis results and puts them in the context of related work on symbolic regression and equation discovery. Section 6 provides a summary of our work and a number of directions for further research.

Grammars for specifying structural priors
This section introduces the notion of a grammar. Originating from computational linguistics, grammars are used as formal specifications of languages and use a set of production rules to derive the valid strings of the language of interest. Our paper argues for the use of context-free grammars, a subclass of general grammars that is expressive enough to specify the language of arithmetic expressions. In equation discovery, we are interested in using grammars as generative models, as opposed to their common use for parsing, i.e., discriminating between legal and illegal strings of a language. In the rest of this section, we first introduce context-free grammars and their probabilistic counterparts through formal definitions and illustrative examples. We then specify a PCFG for arithmetic expressions and define the task of (probabilistic) grammar-guided equation discovery.

Context-free grammars
Formally, a context-free grammar [21] is defined as a tuple G = (N, T, R, S). The set T contains terminal symbols, i.e., words for composing sentences in the language. Non-terminal symbols in the set N represent higher-order terms in the language, such as noun or verb phrases. The production rules in the set R are rewrite rules of the form A → α, where the left-hand side is a non-terminal, A ∈ N, while the right-hand side is a string of non-terminals and terminals, α ∈ (N ∪ T)*. In natural language, a rule NP → A N specifies that a noun phrase NP is a sequence of an adjective A and a noun N. These two non-terminals represent the subsets of terminals corresponding to adjectives and nouns, respectively. When deriving a sentence, a grammar starts with a string containing a single non-terminal S ∈ N and recursively applies production rules to replace non-terminals in the current string with the strings on the right-hand sides of the rules. The final string, containing only terminal symbols, belongs to the language defined by the grammar G.
Consider a simple example of a context-free grammar G_L = (N_L, T_L, R_L, S_L) deriving linear expressions from two variables x and y, with N_L = {E, V}, T_L = {x, y, +}, S_L = E and the production rules

E → E + V
E → V
V → x
V → y     (1)

We usually write multiple production rules with the same non-terminal on the left-hand side using a compact, single-line representation:

E → E + V | V
V → x | y

The process of deriving a string in the language of a context-free grammar is depicted with a parse tree. A rooted, labeled and ordered tree ψ is a parse tree, derived by grammar G, if its root is labeled S, the labels of the leaf nodes are terminal symbols from T, and the immediate children of an internal node labeled A ∈ N correspond to a left-to-right sequence of the symbols in α, where A → α ∈ R [21]. The string derived by ψ is the left-to-right sequence of its leaves. Figure 1 provides two example parse trees for the grammar G_L above, ψ_1^L and ψ_2^L, for the expressions x + y and x + y + y, respectively. We measure the height of a parse tree as the number of edges on the longest path from the root node (S) to a leaf. The heights of the parse trees ψ_1^L and ψ_2^L are three and four, respectively. The same mathematical expression will in general have a different parse tree (and height) when derived by using a different grammar. Nevertheless, the height of a parse tree is closely related to the complexity of the derived mathematical expression, with more complex expressions requiring higher parse trees.

Probabilistic context-free grammars
A grammar can be turned into a probabilistic grammar by assigning probabilities to each of its productions, such that for each A ∈ N:

Σ_{α : A→α ∈ R} P(A → α) = 1.

In other words, we impose a probability distribution on the productions with the same symbol on the left-hand side. As each parse tree ψ, derived by a grammar G, is characterized by a sequence of productions, its probability is simply the product of the probabilities of all productions associated with it [11]:

P(ψ) = Π_{A→α ∈ R} P(A → α)^{f(A→α, ψ)},     (2)

where f(A → α, ψ) denotes a function counting the number of instances of the production A → α in the parse tree ψ. Furthermore, it can be shown that for any proper PCFG the probabilities of all parse trees derived by the grammar sum up to one [7]:

Σ_{ψ ∈ Ψ} P(ψ) = 1.     (3)

We can extend the example context-free grammar from Equation 1 to a probabilistic context-free grammar by assigning a probability to each of the four productions, given below in brackets after each production:

E → E + V [p]     E → V [1 − p]
V → x [q]         V → y [1 − q]     (4)

where we have parametrized the probability distributions over the productions for E and V with parameters 0 < p < 1 and 0 < q < 1, respectively. We can now calculate the probabilities of the two example parse trees ψ_1^L = ψ^L("x + y") and ψ_2^L = ψ^L("x + y + y") from Figure 1:

P(ψ_1^L) = p (1 − p) q (1 − q),     P(ψ_2^L) = p² (1 − p) q (1 − q)².

Longer and more complex mathematical expressions are composed by using more productions, and their associated parse trees thus have lower probabilities. Throughout the paper, where the distinction between grammars and their probabilistic counterparts is important, we refer to the former as deterministic (context-free) grammars.
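The probability of a parse tree under Equation 2 can be computed directly from the counts of the productions used in its derivation. The following Python sketch (our own illustration; the string encoding of the rules and the parameter values p = 0.4 and q = 0.5 are arbitrary choices) evaluates the two example parse trees for "x + y" and "x + y + y":

```python
# Probability of a parse tree as the product of rule probabilities raised to
# their usage counts f(A -> alpha, psi), as in Equation 2.
def tree_probability(rule_probs, rule_counts):
    prob = 1.0
    for rule, count in rule_counts.items():
        prob *= rule_probs[rule] ** count
    return prob

p, q = 0.4, 0.5
rule_probs = {
    "E -> E + V": p,
    "E -> V": 1 - p,
    "V -> x": q,
    "V -> y": 1 - q,
}

# Derivation of "x + y": one recursive step, one E -> V, one x, one y.
psi1 = {"E -> E + V": 1, "E -> V": 1, "V -> x": 1, "V -> y": 1}
# Derivation of "x + y + y": two recursive steps and an additional y.
psi2 = {"E -> E + V": 2, "E -> V": 1, "V -> x": 1, "V -> y": 2}

P1 = tree_probability(rule_probs, psi1)   # p * (1-p) * q * (1-q)
P2 = tree_probability(rule_probs, psi2)   # p^2 * (1-p) * q * (1-q)^2
```

As expected, the longer derivation receives a strictly smaller probability, since it multiplies in an extra factor for each additional production used.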
Algorithm 1 Sample a sentence from a probabilistic context-free grammar.
Require: Probabilistic grammar G = (N, T, R, S), non-terminal A ∈ N
Ensure: Sentence s corresponding to a randomly sampled parse tree ψ from G with root node A, probability p of ψ
1: procedure generate_sample(G, A)
2:   s ← empty sequence; p ← 1
3:   sample a production A → α from R according to the distribution P(A → ·)
4:   p ← p · P(A → α)
5:   for each symbol B in α do
6:     if B ∈ T then append B to s
7:     else
8:       (s′, p′) ← generate_sample(G, B)
9:       append s′ to s
10:      p ← p · p′
11: return (s, p)

Grammars as generators
While grammars were invented to formally describe the structure of natural-language sentences, in computer science they are mainly used for specifying the syntactic structure of programming and markup languages [21]. In this context, grammars allow for the implementation of efficient parsers of source code and annotated documents: for a given string and a grammar, the parser determines whether the string can be derived using the grammar and, if so, returns the appropriate parse tree. The latter explicates the syntactic structure of the string.
In equation discovery, context-free grammars have been used as generators of arithmetic expressions [27]. The deterministic algorithm for generating expressions from a given grammar performs a systematic enumeration of parse trees from simpler (lower) to more complex (higher) ones. To this end, the algorithm employs a refinement operator that, for a given parse tree, generates its minimal refinements by replacing simpler production rules with more complex ones [27]. The deterministic generator requires a user-specified maximal height of the generated parse trees. Note that, in such a setting, each generated parse tree is assumed to be equally probable, i.e., a uniform distribution over the space of parse trees with limited height is assumed.
Probabilistic grammars, on the other hand, provide a much more flexible generation mechanism, based on sampling the space of parse trees that can be derived by a grammar, as described in Algorithm 1. The function generate_sample takes two arguments: the probabilistic context-free grammar G and a non-terminal A serving as the root node of the parse tree of the generated expression. The procedure recursively expands its current sequence of symbols s until it contains only terminal symbols. In each recursive step, it randomly samples a production rule for the current non-terminal, according to the probability distribution prescribed by the grammar (line 3). The sequence of symbols on the right-hand side of the chosen production rule is then used to expand the current sentence s: the extension is simple for terminals (line 6) and involves recursion for non-terminals (lines 8 to 10). When called with a grammar and its start symbol, the algorithm returns a randomly sampled sentence from the grammar, along with its probability.
In contrast to the deterministic systematic generator of parse trees from a given grammar, Algorithm 1 does not need a specification of a maximal tree height. Also, each generated sentence (tree) is assigned a probability, based on the probabilities of the production rules used in its derivation. Therefore, the distribution over the generated parse trees is not necessarily uniform. As we will show in the next section, the more complex (higher) parse trees are less probable than the simpler ones.
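A minimal Python implementation of this sampling procedure might look as follows (our own sketch for the linear grammar G_L; the dictionary encoding of the grammar and the probability values are illustrative assumptions, not the paper's exact code):

```python
import random

# Illustrative encoding of the linear grammar G_L: each non-terminal maps to a
# list of (right-hand side, probability) pairs; all other symbols are terminals.
GRAMMAR = {
    "E": [(["E", "+", "V"], 0.4), (["V"], 0.6)],
    "V": [(["x"], 0.5), (["y"], 0.5)],
}

def generate_sample(grammar, symbol, rng):
    """Recursively sample a sentence and its probability, as in Algorithm 1."""
    rhs, prob = rng.choices(grammar[symbol],
                            weights=[w for _, w in grammar[symbol]])[0]
    sentence = []
    for sym in rhs:
        if sym in grammar:                       # non-terminal: recurse
            sub_sentence, sub_prob = generate_sample(grammar, sym, rng)
            sentence.extend(sub_sentence)
            prob *= sub_prob
        else:                                    # terminal: append directly
            sentence.append(sym)
    return sentence, prob

rng = random.Random(0)
sentence, prob = generate_sample(GRAMMAR, "E", rng)
print(" ".join(sentence), prob)
```

With the recursive rule probability set to 0.4, roughly 60% of the sampled sentences consist of a single variable, matching the 1 − p probability of the lowest parse trees.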

Number and probabilities of parse trees with limited height
For a given deterministic grammar G = (N, T, R, S), the number of parse trees n_G(A, h) with a height of exactly h and the root symbol A ∈ N ∪ T can be computed using a recursive formula. Let N_G(A, h) represent the number of parse (sub)trees of G with a root node A and height up to (and including) h:

N_G(A, h) = Σ_{A→A_1...A_k ∈ R} Π_{i=1}^{k} N_G(A_i, h − 1),
n_G(A, h) = N_G(A, h) − N_G(A, h − 1),     (6)

where N_G(a, h) = 1 for every terminal a ∈ T and h ≥ 0, and N_G(A, h) = 0 for every non-terminal A ∈ N and h < 1. To compute the number of parse trees derived by a grammar, we use the above formula with A = S. Whenever the root symbol is not specified, we assume n_G(h) = n_G(A = S, h). Due to the simplicity of the linear grammar G_L, we can use the above formula to find an analytical expression for the number of parse trees at a given height: n_{G_L}(h) = n_V^{h−1}, ∀h ≥ 2, where n_V denotes the number of variables. Details are provided in the Appendix.

Equation 6 introduces a recursive approach to counting the number of parse trees with a particular height derived by a grammar. The number of parse trees N_G(A = S, h) does not change if we extend a context-free grammar to a probabilistic context-free grammar. However, we can introduce another related quantity that helps characterize the properties of a PCFG, called coverage. We define it as the sum of probabilities (Equation 2) of all the parse trees with height up to (and including) h:

C_G(A, h) = Σ_{h_i ≤ h} Σ_{ψ ∈ Ψ_{A,h_i}} P(ψ),

where Ψ_{A,h_i} ⊆ Ψ denotes the set of all parse trees with the root symbol A and height h_i. For a proper grammar, we can now express Equation 3 in terms of coverage as lim_{h→∞} C_G(S, h) = 1. We can compute the coverage of a PCFG at height h using a similar set of recursive equations as in Equation 6:

C_G(A, h) = Σ_{A→α ∈ R} P(A → α) Π_{i=1}^{k} C_G(A_i, h − 1),     (9)

where α = A_1 A_2 . . . A_k is the string of symbols A_i ∈ N ∪ T on the right-hand side of the production rule A → α, and C_G(a, h) = 1 for every terminal a ∈ T and h ≥ 0.
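These counting and coverage recursions can be implemented and cross-checked in a few lines of Python (our own sketch, again using the linear grammar G_L with the illustrative probabilities p = 0.4 for the recursive rule and a uniform choice of variable):

```python
from math import prod

# (right-hand side, probability) pairs per non-terminal; other symbols terminal.
RULES = {
    "E": [(["E", "+", "V"], 0.4), (["V"], 0.6)],
    "V": [(["x"], 0.5), (["y"], 0.5)],
}

def N(sym, h):
    """Number of parse trees with root `sym` and height up to h."""
    if sym not in RULES:                  # terminal: a single leaf of height 0
        return 1 if h >= 0 else 0
    if h < 1:
        return 0
    return sum(prod(N(s, h - 1) for s in rhs) for rhs, _ in RULES[sym])

def n_exact(sym, h):
    """Number of parse trees with root `sym` and height exactly h."""
    return N(sym, h) - N(sym, h - 1)

def coverage(sym, h):
    """Total probability of parse trees with root `sym` and height up to h."""
    if sym not in RULES:
        return 1.0 if h >= 0 else 0.0
    if h < 1:
        return 0.0
    return sum(p * prod(coverage(s, h - 1) for s in rhs)
               for rhs, p in RULES[sym])
```

For this grammar, n_exact("E", h) reproduces the analytical count n_V^{h−1} = 2^{h−1}, and the coverage approaches 1 as h grows, as required for a proper PCFG.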
To illustrate the benefit of introducing the probabilities of the production rules, let us reconsider the simple linear grammar G L from Equation 1. There are two probability distributions for the production rules we have to assign to G L : one for the production rules with E on the left-hand side and the other for the ones with V on the left-hand side. The latter is trivial, since one would assume that all the variables are equally probable, leading to a uniform distribution of probabilities of the production rules V → v, i.e., P (V → v) = 1/n V , where n V denotes the number of variables (n V = 2 in the example grammar from Equation 1).
The other probability distribution is over the two productions E → E + V and E → V. If we denote the probability of the first (recursive) rule with p, the probability of the other rule is 1 − p. Following the recursive formula for coverage from Equation 9, one can easily show that

C_{G_L}(h) = 1 − p^{h−1}, ∀h ≥ 2,

so the probability of sampling a parse tree with height exactly h equals (1 − p) p^{h−2}. This result is important, since it reveals the impact of the probability p of the recursive production rule E → E + V on the probability of sampling a parse tree with a certain height, depicted in Figure 2. The probability of sampling one of the simplest/lowest parse trees, at height 2, equals 1 − p. For small values of p, the probability of sampling one of the simplest trees is high. These simplest trees correspond to the expressions x and y, i.e., to expressions including a single linear term. Increasing p increases the likelihood of sampling higher parse trees corresponding to more complex expressions. Thus, p gives us direct and intuitive control over the parsimony principle in equation discovery, which introduces a preference for simpler expressions. Some equation discovery systems introduce regularization [6] with the same purpose. The regularization constant, however, usually has an unlimited range and does not provide an intuitive, probabilistic interpretation of the parameter controlling the preference for simpler expressions.
Furthermore, the probability distribution over the parse trees of G_L with a given height is uniform, following the uniform distribution over the production rules for V. Since the number of parse trees with height h increases exponentially with h, the probability of sampling any single tree with height h decreases exponentially with h. This is in contrast with the behavior of deterministic grammars, where the sampling probability is uniformly distributed across all trees: since the number of trees increases exponentially with height, it is much more probable to sample a higher tree than a lower one. The use of deterministic grammars for equation discovery therefore requires an external (regularization) mechanism to implement the parsimony principle [27].

PCFGs for arithmetic expressions
Context-free grammars are typically used to parse sentences. Probabilistic context-free grammars enhance this use, for instance, by providing an estimate of the probability of a sentence in addition to its parse tree. However, probabilistic context-free grammars also allow for another type of application: the stochastic generation of sentences or, in our case, mathematical expressions, as described in Algorithm 1. The probabilities assigned to the productions provide a great amount of control over the probability distribution of individual parse trees. In our example in Equation 4, the parameter p controls the probability of a higher number of terms, while the parameter q tunes the ratio between the numbers of occurrences of the variables x and y.
An important concept to consider when working with grammars is ambiguity. A grammar is formally ambiguous if there exist sentences that can be described by more than one parse tree generated by the grammar. Grammars for arithmetic expressions can exhibit another type of ambiguity, called semantic ambiguity. All but the simplest arithmetic expressions can be written in many mathematically equivalent, but grammatically distinct ways. This is mainly due to the distributivity, associativity and commutativity properties of the basic operations. For example, consider the many ways the expression x² + xy can be written: x × x + x × y = x × y + x × x = x × (x + y) = x × (y + x) = (x + y) × x = (y + x) × x. Each results in a distinct parse tree. It is therefore useful to adopt a canonical representation into which each generated equation is converted. This allows us to compare expressions to each other and check whether they are mathematically equivalent, in addition to comparing their parse trees. In our work, we use the Python symbolic mathematics library SymPy [20] to simplify expressions and convert them into a canonical form, as well as to compare expressions symbolically.
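The canonicalization step can be sketched with SymPy directly (assuming SymPy is installed; this is our own illustration of the equivalence check, not the paper's exact code):

```python
import sympy

x, y = sympy.symbols("x y")

# Grammatically distinct but mathematically equivalent forms of x**2 + x*y.
variants = [x*x + x*y, x*y + x*x, x*(x + y), (x + y)*x, (y + x)*x]

# Expanding each variant yields a single canonical form for all of them.
canonical = {sympy.expand(v) for v in variants}
assert len(canonical) == 1

# Equivalence can also be tested by simplifying the difference to zero.
assert sympy.simplify(variants[0] - variants[2]) == 0
```

Comparing canonical forms is how sampled candidate structures can be deduplicated before the (expensive) constant-fitting step.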
For equation discovery, we are going to use a formally unambiguous context-free grammar for arithmetic expressions:

1. E → E + F | E − F | F
2. F → F × T | F / T | T
3. T → ( E ) | V
4. S = E,     (11)

where the non-terminal V derives the variables of interest.
In this paper, we illustrate a number of useful properties of PCFGs for equation discovery on the example of this universal arithmetic grammar. The grammar in Equation 11 can be extended to a PCFG by imposing a probability distribution on each set of production rules with the same non-terminal on the left-hand side. In addition, we include four mathematical functions in the grammar: sin, cos, √ and exp. The resulting PCFG is defined in Equation 12. We treat the production probabilities in the PCFG as parameters of the grammar that define the prior distribution over expressions and shape the search space for equation discovery. The specific probabilities are our default values, a choice based on intuition and preliminary experiments. In Equation 12, we have also added another concept to the basic version in Equation 11: the constant parameter c. In this paper, we discuss PCFG properties in the context of generating expressions for the task of equation discovery. Broadly speaking, there are three components to an arithmetic expression in the context of equation discovery: variables, constants and operators. Numerical constants are typically treated as free parameters to be optimized when evaluating an equation for its fit against the given data. We thus include the terminal symbol c in the PCFG to stand for a constant parameter of an expression with an as-yet-undetermined value.

Grammar-guided equation discovery
After introducing probabilistic context-free grammars for arithmetic expressions, we can now define the task of grammar-guided equation discovery, an extension of the task considered by Todorovski and Džeroski in 1995 [27], as follows.

Given:
• a data set D of measured values of a set of variables, including a designated target (dependent) variable v, and
• a (probabilistic) context-free grammar G,

Find an equation v = e, such that:
• e can be derived using G, and
• e minimizes the discrepancy between the observed values of v from D and the values of v calculated by using the equation.
The measure of discrepancy used in this paper is the relative root mean squared error,

rRMSE(v = e, D) = sqrt( (1/|D|) Σ_{o∈D} (v_o − e_o)² / σ_v² ),     (13)

where |D| denotes the number of observations in the data set D, v_o and e_o are the values of v and e for a specific observation o ∈ D, respectively, and σ_v² denotes the variance of the variable v in the data.
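A direct Python implementation of this error measure (our own sketch of the relative root mean squared error, normalizing the squared errors by the population variance of v):

```python
from math import sqrt
from statistics import pvariance

def rrmse(v_obs, e_pred):
    """Relative root mean squared error of predictions e_pred against v_obs."""
    n = len(v_obs)
    mse = sum((v - e) ** 2 for v, e in zip(v_obs, e_pred)) / n
    return sqrt(mse / pvariance(v_obs))   # normalize by the variance of v
```

A perfect fit gives 0, while predicting the mean of v for every observation gives exactly 1, which makes values easy to interpret across data sets.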

Theoretical analysis
After introducing context-free grammars and illustrating their use for generating arithmetic expressions, we turn our attention to evaluating the benefits of using PCFGs for equation discovery. We first formulate the task of equation discovery based on PCFGs. Then, we analyze the expected number of parse trees we have to sample from a given (deterministic or probabilistic) grammar in order to reconstruct a specific equation. Finally, we evaluate the impact of the probabilities of production rules on the expected number of sampled trees.
The analysis is based on the one hundred equations from the Feynman Symbolic Regression Database [29]. This database includes a diverse sample of equations from the three-volume set of physics textbooks by Richard P. Feynman [10] and has been previously used as a benchmark for equation discovery [29]. The equations are listed in the appendix and contain between one and nine variables, the four basic operations, the functions exp, √, sin, cos, tanh, arcsin and log, as well as a variety of constants, mostly rational, but also e and π.

Expected number of equations
The metric used for analysis in this section is the expected number of parse trees that we need to sample from a given grammar, so that the sample includes a parse tree corresponding to a given (target) equation. The value of the metric depends on the target and the grammar. We are going to model the number of sampled parse trees as a random variable N , being interested in its expected value E[N ].
We model the sampling of parse trees from a given PCFG as a Bernoulli process of consecutive trials. In each trial, we sample a parse tree from the PCFG and check whether it matches the target equation. If we denote the probability of the parse tree corresponding to the target equation with p, then N follows the geometric distribution:

P(N = n) = (1 − p)^{n−1} p.

In other words, the Bernoulli process includes a sequence of n − 1 unsuccessful trials, where the sampled parse tree does not correspond to the target (hence, the probability of each such outcome is 1 − p), followed by a single successful trial with probability p. In the Appendix, we derive the formula for E[N], i.e., the expected number of trials until (and including) the first success:

E[N] = 1/p,

which corresponds to the general intuition that the more probable the target equation is, the fewer samples are needed to reconstruct it.

When dealing with a deterministic grammar, we derive the expected number of sampled trees as follows. First, let us denote the height of the parse tree corresponding to the target equation with h. Note, however, that the generator does not know the exact value of h in advance. Thus, it has to systematically generate all the parse trees with heights of at most h − 1 to find out that the tree corresponding to the target equation is not among them. The generation then proceeds at height h, where the generator will eventually identify the target tree. On average, due to the uniform distribution of parse trees, half of the parse trees with height h will be considered in the last iteration, making the total number of trees considered by a deterministic grammar G

E[N] = N_G(h − 1) + n_G(h)/2.

To calculate the probability p and the height h for a specific target equation, we parse the expression on its right-hand side with the grammar of interest.
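The geometric expectation E[N] = 1/p is easy to confirm by simulation (our own sketch; the success probability 0.25 and the number of simulated runs are arbitrary choices):

```python
import random

def expected_samples(p):
    """Expected number of Bernoulli trials until the first success."""
    return 1.0 / p

def sample_trial_count(p, rng):
    """Simulate one run of the Bernoulli process: trials until first success."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n

rng = random.Random(42)
p = 0.25
runs = [sample_trial_count(p, rng) for _ in range(20_000)]
mean = sum(runs) / len(runs)   # close to expected_samples(0.25) = 4
```

The same simulation viewpoint underlies the empirical analysis in Section 4, with "success" meaning that a sampled parse tree matches the target equation.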

Probabilistic vs deterministic
Using the formulas derived in the previous subsection, we calculate the expected number of sampled parse trees necessary to reconstruct each of the one hundred equations from the Feynman database. Then, for different values of n, we are interested in how many of the target equations from the Feynman database we can expect to reconstruct by taking at most n sample parse trees from a given grammar, i.e.,

ratio(n) = (1/100) Σ_{i=1}^{100} I(E[N_i] ≤ n),

where n denotes the number of sampled parse trees, E[N_i] is the expected number of sampled trees necessary to reconstruct the i-th equation from the Feynman database, and I(b) is a function indicating the truthfulness of the Boolean expression b, having the value of 1 if b is true, and 0 otherwise. We compare these ratios for the deterministic and probabilistic grammars for generating arithmetic expressions from Equations 11 and 12, respectively. The comparison, depicted in Figure 3, shows that the expected number of samples for the deterministic grammar is many orders of magnitude above the expected number for the probabilistic grammar. In order to reconstruct all the equations in the Feynman database, we need to sample over 10^4000 parse trees from the deterministic grammar. In contrast, the probabilistic grammar would require, on average, less than 10^50 samples. Half of the Feynman equations can be successfully reconstructed when sampling around 10^10 parse trees from the probabilistic grammar, while roughly 10^60 samples are required for the same achievement using the deterministic version of the grammar.
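The ratio(n) curve itself is a one-liner over the per-equation expectations (our own sketch with made-up expectation values, not the actual Feynman results):

```python
def reconstruction_ratio(expected_counts, n):
    """Fraction of target equations whose expected sample count is at most n."""
    return sum(1 for e in expected_counts if e <= n) / len(expected_counts)

# Hypothetical E[N_i] values for four target equations.
expectations = [1e2, 1e8, 1e12, 1e55]
curve = [reconstruction_ratio(expectations, 10 ** k) for k in (3, 9, 60)]
```

Plotting such a curve over a logarithmic range of n produces comparisons like the one in Figure 3.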
The number of distinct parse trees for a context-free grammar increases super-exponentially with parse tree height. This is true for both probabilistic and deterministic grammars. Yet, the differences in the expected number of sampled expressions observed here span tens, even hundreds of orders of magnitude. This is because the probability distribution over the parse trees generated by a deterministic grammar is uniform, while probabilistic grammars introduce a bias towards simpler parse trees, as we have shown in Section 2. The results in Figure 3 show that the bias introduced by probabilistic grammars matches the equations included in the Feynman database well, suggesting that it also matches equations used in science more generally. In the next section, we will show how a careful adjustment of the probabilities assigned to individual production rules in the grammar can further shift the bias towards the equations in the Feynman database.

Biased vs. unbiased probabilistic grammar
Here, we focus our attention on probabilistic grammars and explore the impact of tuning the prior distribution over the parse trees, by changing the probabilities of production rules, on the expected number of sampled trees. Recall that the analysis in Figure 2 shows that decreasing the probabilities of the recursive production rules introduces a bias towards simpler equations. We now extend that analysis by exploring the impact of the probabilities of the other production rules on the expected number of sampled trees. To keep the comparative analysis simple and clear, we vary the following four parameters:

1. the ratio between the probabilities of summation and subtraction: r_add = P(E → E + F) / P(E → E − F),
2. the ratio between the probabilities of multiplication and division: r_mul = P(F → F × T) / P(F → F / T),
3. the ratio between the probabilities of a variable and a constant: r_const = P(F → V) / P(F → c),
4. the ratio between the total probability of the four special functions and the probability of applying no special function (r_fun).
The first grammar in our analysis sets all of these ratios to one. We refer to this grammar as the uniform grammar, since it does not introduce any preferences among the operators, functions or types of atomic terms (variables or constants) in the equation. In contrast, the biased grammar introduces intuitive preferences often used by scientists and engineers. Table 1 reports the exact parameter settings for the two grammars. The results of the comparison are depicted in Figure 4. The histogram on the left-hand side of the figure shows that introducing the bias into the probabilistic grammar leads to a nontrivial reduction in the expected number of samples, albeit smaller than the one observed when comparing the probabilistic and the deterministic grammar. The biased grammar reduces the expected number of samples for 86 of the one hundred Feynman equations, whereas an increase is observed for merely ten equations (for the remaining four equations, the expected number of samples is the same for both grammars). Of the 86 cases with a reduction, 58 see between a 25% and a 90% reduction in the expected number of parse trees. For 28 target equations, the expected number of samples is reduced by more than one order of magnitude, with the largest reduction being on the scale of seven orders of magnitude.
The biased grammar seems to exploit deeper aspects of the basic parsimony principle. One might expect, then, that the biased grammar would lead to a greater reduction in the number of expected samples for simpler target equations. Note, however, that the scatter plot on the right-hand side of Figure 4 shows the opposite trend: the reduction tends to increase with the increasing complexity of the equation, with a Spearman correlation coefficient of 0.5 between the reduction rate and complexity. In this figure, we measure the complexity of a target equation as the length (the number of alphanumeric characters) of its string representation.

In sum, the comparison between a uniform and a biased universal grammar demonstrates that varying the production probabilities of a probabilistic grammar can reduce the expected number of samples by more than one order of magnitude. Furthermore, the effect is greater for more complex equations. Probabilistic context-free grammars thus provide a powerful and flexible way of expressing prior beliefs and domain knowledge for equation discovery. In the following section, we provide empirical support for this claim.

Empirical analysis
To accompany the preceding theoretical analysis, we have developed a simple algorithm, implemented in Python, for performing equation discovery with probabilistic context-free grammars. We take a Monte-Carlo approach to sampling the probability distribution over the space of all possible equations. An equation discovery task includes simulated or measured data, as well as the names of the variables for which data are provided. In our study of performance, each task is accompanied by the correct equation, to which we can compare the results of the algorithm (the discovered equations). For each equation discovery task, a universal arithmetic PCFG is constructed with terminal symbols corresponding to the measured variables, as well as the constant symbol. A large number of candidate equation structures are randomly sampled from the grammar and evaluated against the data in the Feynman database. This involves the selection of values of the constants in the equation to obtain the best fit. The equation fitting the data best is returned as the result. Alternatively, equation structures could be sampled until a satisfactory candidate was found, where a satisfactory degree of fit would be specified in advance.

Monte-Carlo sampling algorithm
Algorithm 2 presents an outline of the Monte-Carlo approach to grammar-guided equation discovery. Candidate expressions on the right-hand side of the equations are sampled independently by the procedure generate_sample from Algorithm 1 (line 4). We implemented the algorithm in Python using the support for working with grammars provided by the Python Natural Language Toolkit, NLTK [1]. Each sampled expression e is then processed using the Python library for symbolic mathematics, SymPy [20], to obtain its canonical form e_c and identify the list of constant parameters it includes (line 5). We then estimate the values of these constant parameters by minimizing the relative root mean squared error of the equation v = e_c on D, as defined in Equation 13 (line 6). To this end, we employ differential evolution (DE) [25] as an efficient and proven method of global optimization, with DE parameters set to values similar to those reported in [17]. Following the formula from Equation 13, the algorithm computes the final score of the candidate equation (line 7). Finally, the algorithm reports the list of the sampled equations sorted by increasing error (lines 8 and 9).
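The steps above can be sketched in condensed form as follows. This is an illustrative version, not the paper's implementation: the function names and the toy dataset are ours, a hard-coded candidate list stands in for grammar sampling, and rermse is only a stand-in for Equation 13:

```python
import numpy as np
import sympy as sp
from scipy.optimize import differential_evolution

def rermse(y_true, y_pred):
    # Relative root mean squared error (a stand-in for Equation 13).
    return np.sqrt(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

def fit_and_score(expr_str, x_data, y_data):
    x = sp.Symbol("x")
    ec = sp.sympify(expr_str)                  # canonical form via SymPy
    consts = sorted(ec.free_symbols - {x}, key=str)
    f = sp.lambdify([x] + consts, ec, "numpy")
    if not consts:
        return ec, rermse(y_data, f(x_data) * np.ones_like(x_data))
    # Estimate the constant parameters by differential evolution.
    loss = lambda c: rermse(y_data, f(x_data, *c))
    res = differential_evolution(loss, [(-10, 10)] * len(consts), seed=0)
    return ec.subs(dict(zip(consts, res.x))), res.fun

x_data = np.linspace(1, 5, 50)
y_data = 3.0 * x_data ** 2                     # "measured" data for y = 3 x^2
candidates = ["x", "C0*x", "C0*x**2"]          # stand-ins for sampled expressions
results = sorted((fit_and_score(c, x_data, y_data) for c in candidates),
                 key=lambda t: t[1])           # sort by increasing error
best_expr, best_err = results[0]
print(best_expr, best_err)
```

On this toy task, the candidate with the right structure (C0*x**2) fits the data best once its constant is estimated.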

Empirical setup
In order to confirm our theoretical analysis empirically, we apply the described Monte-Carlo algorithm for equation discovery to the Feynman database of one hundred equations from physics. We compare the uniform and the biased versions of the universal arithmetic grammar from Equation 12, with production probability distributions parametrized as in Table 1. For each Feynman equation, Algorithm 2 generates 10^5 candidate expressions with each of the studied grammars. Due to practical concerns and time constraints, parameter estimation is performed only for equations with at most five constant parameters. Equations with more than five constant parameters are simply assumed to be inadmissible.
We further reduce the computational cost of the equation discovery process by checking the generated expressions for duplicates, i.e., expressions with identical canonical forms; the parameters for all the expressions sharing the same canonical form are estimated only once. Moreover, since the data in the Feynman database are noise free, correct equations tend to have low error. In the experiments, we use 10^-9 as a threshold on the value of ReRMSE to decide whether a candidate is a correct reconstruction of the target equation in the Feynman database: a target equation from the Feynman database is successfully reconstructed with a given grammar if the corresponding sample contains at least one matching candidate equation. We perform three independent runs of the Monte-Carlo sampling for each grammar to account for the algorithm's stochastic nature. Table 3 summarizes the empirical results. Detailed results for each Feynman equation are reported in the Appendix. The first column in the table reports the number of unique expressions sampled per target equation, averaged over all the problems from the Feynman database. Because of the semantic ambiguity of the grammars for arithmetic expressions, the reported numbers are much lower than the total number of sampled expressions: only about 30% of the samples correspond to unique canonical expressions, for both the uniform and the biased grammar.
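The duplicate check can be sketched with SymPy: sampled expression strings are grouped by their canonical (simplified) form, so that constant parameters are fitted only once per group. The sample strings below are illustrative:

```python
import sympy as sp

# Group sampled expression strings by their canonical SymPy form.
samples = ["x + x", "2*x", "x*2", "x + y - y", "x*y/y"]
by_canonical = {}
for s in samples:
    # srepr gives a stable string key for the simplified expression.
    key = sp.srepr(sp.simplify(sp.sympify(s)))
    by_canonical.setdefault(key, []).append(s)

print(len(samples), "sampled,", len(by_canonical), "unique canonical forms")
```

Here five sampled strings collapse to two canonical expressions (2*x and x), mirroring the roughly 30% unique-expression ratio observed in the experiments.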

Results
The second column in the table reports the achieved coverage of the space of all candidate equations in terms of the total probability of the sampled expressions (i.e., the sum of the probabilities of the corresponding parse trees). Note that sampling with the biased grammar covers around half of the space of candidate equations (in terms of total probability), while the uniform universal grammar reaches a coverage of only around 0.35. We attribute this difference to changes in the structure of the search space, brought about by the different probability distributions over production rules. The probability distributions for the biased grammar are generally more varied: some rules have higher probabilities and others lower, as opposed to equal probabilities for all. The effect is similar when considering the probability distributions over parse trees. For the biased grammar, a few parse trees contribute most of the covered total probability, while the majority of parse trees make only a minuscule contribution. Furthermore, parse trees with a higher contribution to coverage are also more likely to be sampled. In this way, it is possible to interpret coverage as a measure of inequality among the parse trees, which is related to the amount of information in the prior distribution. These observations correspond to the idea that uniform priors are the least informative ones, a concept often used in Bayesian statistics [16].
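The interpretation of coverage as a measure of inequality among parse trees can be illustrated with a small simulation. The two probability distributions below are synthetic stand-ins, not actual parse-tree probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(probs, n_samples):
    # Sample tree indices and sum the probabilities of the distinct
    # trees that appear in the sample (the paper's notion of coverage).
    seen = np.unique(rng.choice(len(probs), size=n_samples, p=probs))
    return probs[seen].sum()

n_trees = 10_000
flat = np.full(n_trees, 1.0 / n_trees)            # uniform-like distribution
skewed = 1.0 / np.arange(1, n_trees + 1) ** 2     # a few trees dominate
skewed /= skewed.sum()
print(coverage(flat, 3000), coverage(skewed, 3000))
```

With the same sample size, the skewed distribution reaches much higher coverage than the flat one, because the trees most likely to be sampled also carry most of the total probability.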
Finally, the third column reports the number of successfully reconstructed target equations from the Feynman database. Each of the two grammars reconstructs roughly 36 of the 100 Feynman equations after 10^5 samples. When measured in terms of successful reconstructions, the performance of the two grammars is thus seemingly identical. This is in contrast with our theoretical analysis in Section 3, which suggests that the biased grammar is superior.
We can compare the results in Table 3 to those reported by Udrescu [28]. AI Feynman was able to discover all one hundred equations from the Feynman database, while the symbolic regression method Eureqa [24] was able to reconstruct 71% of the equations. In comparison, probabilistic grammar-based equation discovery reconstructed 37% of the equations in our experiments. The lower performance is not unexpected, as the method presented in this paper serves mainly an illustrative purpose and relies on a very simple algorithm. We will return to this comparison in the analysis of the results.

Connection to the theoretical analysis
The results in Table 3 cannot be immediately matched with the theoretical results, as they summarize performance at only a fixed number (10^5) of sampled expressions. To analyze the impact of the number of samples on the performance, we re-sample the expressions in each experiment, as detailed in the Appendix. Figure 5 depicts the results of the re-sampling: the y-axis shows the percentage of successfully reconstructed Feynman equations, which increases as the number of sampled equations, shown on the x-axis, increases. The shaded areas depict the range between the minimum and the maximum average success rate obtained with the three runs of the Monte-Carlo algorithm. These results reveal the difference in performance between the uniform and the biased universal grammar. The performance of the biased grammar increases faster and remains well above the performance of the uniform grammar for sample sizes of up to 23,000 expressions. The two full lines in the same figure represent the success rates as predicted by the theoretical analysis in Section 3: note that they are much lower than the empirically obtained success rates. The reason for this discrepancy is the semantic ambiguity of the universal grammar. Although we know that the grammars generate many expressions that are mathematically identical, we approximate the sum of their probabilities with the probability of only a single parse tree. We infer the ratio between these two values from our experiments as the ratio between the number of unique expressions (first column in Table 3) in the sample and the sample size (10^5), averaged over the three independent samplings. We use these ratios to correct the estimated probability of generating a parse tree that simplifies to a canonical expression corresponding to the solution of the studied problem. This correction reduces the theoretical expectation of the number of necessary samples for each grammar, which leads to the dashed lines in Figure 5.
The corrected theoretical predictions of the expected number of samples and the respective success rates for each grammar correspond well to the empirical curves, calculated by re-sampling the empirical results. Note that the corrected predictions provide solid confirmation of our theoretical analysis.

Analysis of the results
Overall, the performance of the Monte-Carlo algorithm for grammar-based equation discovery is modest. In our experiments, the method was able to solve around 37% of the equation discovery tasks from the Feynman dataset. To understand the achieved performance level, we have to take a closer look at the results for some of the tasks.
In Figure 6, we examine one of the successfully solved equation discovery tasks and one of the unsuccessful ones. We visualize the distributions of the probability of an expression and its error as a scatter plot on a logarithmic scale. We see that, for both equation discovery tasks, the majority of sampled expressions are found in a cluster with moderate probability and high error. For the solved problem, a number of points are scattered across more than 15 orders of magnitude in error, with many of them falling below the error threshold for a correct expression. In the case of the unsolved problem, there are no points at all below the main cluster. That the solved problem represents an easier task than the unsolved one is clear from the corresponding information in the table in Appendix A.7 (equations 20 and 39). Equation #39 is less complex than #20 in all measures except the number of parameters. With an estimated generation probability of 3·10^-4, at least a few correct solutions for task #39 are expected in a sample of 10^5 expressions, on average. In contrast, the estimated probability of 9·10^-15 for equation #20 means that we would need to be extremely lucky to stumble upon the correct solution in our samplings. Another high-level view of the results is presented in Figure 7, where we compare the distributions of target expression complexity for the sets of solved and unsolved problems from the Feynman database. We see that the majority of problems the Monte-Carlo algorithm was unable to solve are more complex than the majority of solved problems, although there is some overlap in the tails of the distributions.
The Monte-Carlo algorithm presented in this work is intended to serve as a demonstration of the concepts and ideas behind using probabilistic grammars as a method of generating expressions for equation discovery. Furthermore, it provides a baseline for the performance that can be achieved with arguably the simplest sampling procedure possible. Our complexity analysis reveals where simple Monte-Carlo sampling fails, hinting towards a path for more efficient approaches. The nature of sampling a probabilistic grammar, as well as our chosen prior distributions, focuses the search toward simpler expressions. This bias allows the procedure to successfully discover the solutions to the majority of less complex problems from the Feynman database. During the sampling procedure, however, we ignore all information on the errors of the previously generated expressions. In order for a method, based on sampling probabilistic grammars, to be able to discover complex equations, intelligent and adaptive data-driven sampling algorithms will have to be developed.
Note, finally, that the complexity of a parse tree deriving a particular arithmetic expression is not directly related to the complexity of that expression, as observed in Figure 7. The complexity of the parse tree depends on the grammar used for equation discovery: for one grammar, the expression can be derived using a parse tree of depth three, while for another grammar, a parse tree of depth ten might be needed. In this paper, we have used only the universal grammar for general arithmetic expressions from Equation 12. Using alternative probabilistic grammars (for example, grammars derived from the deterministic grammars considered by Todorovski and Džeroski [27], which employ domain-specific or cross-domain knowledge about mathematical modeling to constrain the space of candidate equations) would allow for much more efficient reconstruction of the target equations from the Feynman database. These alternative grammars might need simpler parse trees to derive the target equations, which would decrease the number of samples required to reconstruct them.

Discussion
The theoretical and empirical analysis show that the formalism of probabilistic grammars provides a flexible and powerful framework for specifying the inductive bias for equation discovery and symbolic regression. In particular, probabilistic grammars allow for concise specification of the prior distribution over the space of candidate equations. This is achieved by the probability distributions over the production rules for each non-terminal symbol in the grammar. For example, this provides the user of the equation discovery algorithm with an immediate and transparent control over the parsimony principle by specifying the probabilities of the recursive production rules that allow the grammar to generate complex arithmetic expressions. As shown in Section 3, lowering the probability of the recursion rules increases the probability of simpler equations.
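The effect of the recursion probability can be computed directly. For a grammar with a single recursive production chosen with probability p, as in the grammar G_L analyzed in Section 3 and Appendix A.2, a parse tree of height h requires h - 2 recursive choices followed by one non-recursive choice:

```python
# P(height = h) for a grammar with one recursive production of probability p:
# h - 2 recursive choices, then one non-recursive choice.
def height_prob(p, h):
    return p ** (h - 2) * (1 - p)

# Lowering p shifts probability mass toward shallow parse trees,
# i.e., toward simpler equations.
for p in (0.6, 0.3):
    print(p, [round(height_prob(p, h), 3) for h in range(2, 6)])
```

Comparing the two rows of output shows that the smaller recursion probability concentrates most of the probability on the shallowest trees, which is exactly the transparent control over the parsimony principle described above.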
This is an important result, as the parsimony principle, following Occam's razor in favoring simpler explanations over complex ones, is of crucial importance in algorithms for equation discovery and symbolic regression [31]. It has so far been addressed by using various techniques, including the minimum-description-length formalism [31,22], the Akaike and Bayesian information criteria [18], regularization methods for sparse regression [6] and complexity-related penalty terms in the fitness function of evolutionary approaches [24]. These techniques for following the parsimony principle often require an appropriate setting of a (regularization) parameter that sets the degree of trade-off between the error of the equation and its complexity. Finding the optimal parameter setting might require computationally expensive trial-and-error experiments. In contrast, the framework presented in Section 3 provides a basis for setting the probabilities of the recursive rules analytically, based on the probabilities of simpler expressions corresponding to shallower parse trees.
Grammars can support different types of inductive bias. In the experiment with the biased grammar, we have shown that the probabilities of production rules involving different arithmetic operators can also be finely tuned, which is equivalent to (if not more general than) setting the probabilities of the arithmetic operators [13]. Grammars can not only specify soft constraints that increase or decrease the probability of (classes of) expressions, but can also be used to specify hard constraints on the space of candidate equations. As shown by Todorovski and Džeroski [27], such hard constraints allow for precise tailoring of the space of candidate equations to the knowledge and basic principles in the domain of interest. Other aspects of cross-domain knowledge [2], including dimensional analysis [30], can also be encoded as grammars [23].
Incorporating background domain-specific knowledge in equation discovery is closely related to the communicability of the obtained equations to scientists in the domain of interest. Mathematical models and equations that are in line with the knowledge in the domain of use allow humans to interpret them as explanations of the observed phenomena [4]. However, deterministic grammars, as well as other types of constraints [3], as vehicles for integrating knowledge, cut entire regions out of the space of candidate equations. While these cuts improve computational efficiency, they introduce the risk of cutting off potentially valid models. Probabilistic grammars allow for soft constraints that reduce this risk by specifying a probability distribution over the space of candidate equations. At the same time, probabilistic grammars can still improve the computational complexity of the equation discovery process and lead to communicable models. The inner nodes of grammar parse trees correspond to non-terminals and potentially explanatory higher-order expressions, much like processes in explanatory process-based models [15,4].
The empirical results from Section 4 show that the Monte-Carlo algorithm for grammar-based equation discovery is limited to the reconstruction of simple equations and fails to reconstruct a major fraction of more complex equations from the Feynman database. This result is expected and can be explained with the simplicity of the proposed algorithm, where more elaborate algorithms are expected to lead to better results [29]. Approaches of Bayesian optimization [14] can be applied to guide the simple sampling algorithm towards the candidate equations with small error and therefore towards the target equation in the case of reconstruction. We expect that these algorithms will improve the efficiency of the simple sampling procedure presented here and make the sampling-based approaches competitive with current probabilistic and Bayesian approaches to symbolic regression [13]. In addition, Bayesian optimization allows for explicit control over the trade-off between exploitation and exploration in the search space of candidate equations [14], which has a significant impact on the efficiency of search algorithms for equation discovery [26].
Probabilistic grammars can also be employed in the context of learning bias for equation discovery and process-based modeling [5]. In that setting, constraints on the space of candidate equations are inferred from the results of an equation discovery algorithm: we learn constraints that discriminate between accurate and inaccurate equations. Similarly, the Bayesian scientist [13] learns or estimates the probabilities of different arithmetic operators from a corpus of equations from Wikipedia. In both cases, the induced bias for equation discovery can then be transferred from one discovery task or domain to another. The transfer can significantly improve the efficiency of symbolic regression without harming the accuracy of the discovered equations. However, the formalism of process-based modeling requires complex learning mechanisms in logic [5].
When using probabilistic context-free grammars as inductive bias, the latter can be learned using a number of standard algorithms for inferring grammars [8].

Conclusion
The results presented in this paper confirm the validity of our conjecture that the use of probabilistic grammars has a remarkable impact on the computational efficiency of methods for equation discovery and symbolic regression. First, our theoretical analysis shows that probabilistic grammars can significantly reduce the expected number of equations needed to reconstruct a known equation from data. Second, probabilistic grammars provide an intuitive, probabilistic parametrization of the parsimony principle, that favors simpler equations over more complex ones, through the probabilities of the recursive production rules in the grammar. Third, the computational experiments with a simple Monte-Carlo sampling algorithm for equation discovery provide an empirical confirmation of the theoretical results.
The significance of the contributions presented in this paper is related to the opportunities for further development of algorithms for equation discovery and symbolic regression. The integration of probabilistic grammars into approaches to equation discovery enables the development of truly Bayesian approaches to equation discovery. In particular, this development will focus on advanced and more efficient sampling procedures that will guide the search for proper equations towards more promising parts of the space of candidate equations specified by the grammar. The computational efficiency of equation discovery algorithms can be further improved by encoding background, domain-specific knowledge into probabilistic grammars. This encoding can be pursued with manual labor of transforming domain knowledge into grammars or by employing computational approaches to grammar inference from examples of equations in the particular domain of interest.

Appendix A. Appendix
Appendix A.1. Analytical expression for the number of parse trees with a given height for grammar G_L

The grammar in Equation 1 is simple enough to derive the number of parse trees with height h analytically, using Equation 6. First, observe that there are only two trees with V as the root, and their height is one. Consequently, N_{G_L}(V, h) = n_V for all h ≥ 1. We now make use of this while considering Equation 6 for h ≥ 2 and E as the root node. The resulting relation is simply the recursive form of a geometric sequence with ratio n_V, whose general form is N_{G_L}(E, h) = n_V^{h-1}. The number of parse trees with height up to and including h is then the sum of the first h terms of the geometric series. In our computations we use a G_L with n_V = 2, for which the expression simplifies to N_{G_L}(E, h) = 2^h - 2.
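The closed form can be checked by brute-force enumeration. The grammar shape used below, E -> E '+' V | V with V -> 'x' | 'c' (so n_V = 2), is our assumed reading of the grammar from Equation 1:

```python
# Brute-force check that the number of parse trees with height up to h,
# rooted at E, equals 2^h - 2 for the assumed linear grammar
# E -> E '+' V | V,  V -> 'x' | 'c'.
def trees(symbol, max_height):
    # Enumerate (tree, height) pairs derivable from `symbol` within max_height.
    if max_height < 1:
        return []
    if symbol == "V":
        return [(t, 1) for t in ("x", "c")]               # V -> 'x' | 'c'
    out = [(("V", v), h + 1) for v, h in trees("V", max_height - 1)]  # E -> V
    for e, he in trees("E", max_height - 1):              # E -> E '+' V
        for v, hv in trees("V", max_height - 1):
            out.append((("+", e, v), max(he, hv) + 1))
    return out

for h in range(2, 9):
    assert len(trees("E", h)) == 2 ** h - 2
print("N(E, h) = 2^h - 2 verified for h = 2..8")
```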
Appendix A.2. Analytical expression for coverage for grammar G_L

The probability of generating a parse tree with height h with grammar G_L is p^{h-2}(1 - p). This is easy to see by considering the underlying Bernoulli process. Recall that one of the productions with E on the left-hand side represents recursion, with probability p, and the other ends the process of generation, with probability 1 - p. Starting from E, the generator must choose the recursive option h - 2 times before choosing the non-recursive production in order to generate a parse tree with height h. The coverage is then the sum of these probabilities over all heights up to and including h.

To demonstrate the usage of the recursive Equation 9, we can use it to compute the above result. First, we observe the coverage for parse trees with V as their root node. From there, we can derive a recursive relation for the coverage of parse trees with E as the root node. Here it is not as easy to see the general expression for the h-th term in the series. Instead, we consider the first three terms: 1 - p, 1 - p^2 and 1 - p^3. From this sequence it is easy to guess the general form 1 - p^{h-1}. For proof by induction, we now assume the relation holds for h and use the recursive relation to prove it for h + 1.

Appendix A.3. Resampling procedure

In this section of the Appendix, we elaborate on the details of the post-processing we employed in order to create Figure 5. In the experiment described in Section 4, we employ the developed Monte-Carlo algorithm for grammar-based equation discovery to generate a large number of candidate expressions using the universal grammar with either a uniform or a biased prior distribution. Using each grammar, we perform three independent samplings of 10^5 expressions with unique parse trees. Due to semantic ambiguity, some of these expressions simplify to the same canonical expression. We end up with approximately 30,000 unique canonical expressions for each of the equation discovery tasks from the Feynman database.
Analyzing the sampled expressions and drawing meaningful conclusions is not a simple task. In our analysis, we characterize each sampled expression by two values: its generation probability and the relative root mean squared error (ReRMSE) it produces when evaluated with the optimal parameter values. We set ReRMSE = 10^-9 as the threshold for a correct solution. We consider an equation discovery task successful if the Monte-Carlo sample of expressions contains at least one expression with error under the threshold. When aggregating the results of the three independent samplings, the number of successes for a particular problem can range from zero to three. In Table A.8 we report this value in the columns #S_U and #S_B.
The results obtained in this way only inform us of the performance of the algorithm at a particular number of samples, 10^5 in our case. A more general presentation of the results would show a performance curve, in other words, performance versus sample size. The easiest way to obtain one would be to simply take a sliding minimum across the errors of the sampled expressions in the order they were sampled. However, that order is arbitrary: if we repeated the experiment, we would obtain the sampled expressions in an entirely different order (based on their probabilities) and a different performance curve. To alleviate the stochasticity of the algorithm, we employ resampling on the sample of expressions. The measure of success we work with is the success rate, which we define as the portion of successfully reconstructed equations from the Feynman dataset. We treat the sample of approximately 30,000 unique canonical expressions, with their corresponding probabilities, as a probability distribution. From this distribution we then sample 30,000 expressions without replacement. In other words, we randomly reorder the set of sampled expressions, while taking into account the probability of each expression. By repeating this many times (100 in our case), we can average the success rate at each sample size. This value is the average success rate depicted in Figure 5. It can be interpreted as the expected success rate at a given sample size: if the Monte-Carlo algorithm were run many times, each time sampling N canonical expressions for each problem from the Feynman database, the average portion of solved problems would be close to the average success rate we report.
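The resampling step can be sketched as follows. All quantities below are synthetic stand-ins for the sampled expressions, their generation probabilities and their fitted errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reorder the unique canonical expressions by weighted sampling without
# replacement, then locate the first sub-threshold error in each reordering.
n_expr = 1000
probs = rng.dirichlet(np.ones(n_expr))            # generation probabilities
errors = rng.uniform(1e-6, 1.0, size=n_expr)      # fitted equation errors
errors[rng.integers(n_expr)] = 1e-12              # one "correct" reconstruction
threshold = 1e-9

def mean_first_success(n_reorders=100):
    sizes = []
    for _ in range(n_reorders):
        order = rng.choice(n_expr, size=n_expr, replace=False, p=probs)
        first_hit = int(np.argmax(errors[order] < threshold))
        sizes.append(first_hit + 1)               # sample size at first success
    return float(np.mean(sizes))

print("expected sample size until first success:", mean_first_success())
```

Averaging the success indicator over many such reorderings, at each prefix length, yields the average success rate as a function of sample size.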
In our experiment, we perform three independent Monte-Carlo samplings for each of the two grammars, resulting in six sets of sampled expressions. The resampling procedure is performed separately for each of the six sample sets. Figure 5 depicts the minimum and maximum of the average success rate across the three samplings at each sample size for each of the two grammars.

Appendix A.4. Theoretical expectation of success rate
In the theoretical analysis in Section 3, we use an inside chart parser algorithm [19] to parse the target expression of each problem from the Feynman database, using either the uniform or the biased universal grammar. We take the parse tree probabilities as an approximation for the probability that a randomly sampled expression corresponds to the correct solution for a given problem. We can use these probabilities to calculate a theoretical expectation of the success rate (dependent on the sample size) for each grammar. The probability of finding the correct solution of a problem with index i in a sample of N expressions, generated with grammar G, is 1 - (1 - p_{G,i})^N, where p_{G,i} is the probability of randomly generating the correct solution to problem i using grammar G. We approximate this probability with the probability of a single parse tree. Finally, we consider the complete Feynman dataset of one hundred equations to arrive at the expected success rate E[success rate](N) = (1/100) Σ_{i=1}^{100} (1 - (1 - p_{G,i})^N). Once we perform our sampling experiment, we can directly compare the empirical success rate with its theoretical expectation, which is depicted as the pair of full lines in Figure 5.
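The expected success rate can be computed directly from the per-problem probabilities. The probability values below are illustrative stand-ins, not the parsed probabilities from our experiments:

```python
import numpy as np

# Expected success rate after N samples: the probability of at least one
# correct draw in N attempts is 1 - (1 - p_i)^N, averaged over the problems.
def expected_success_rate(p, N):
    return float(np.mean(1.0 - (1.0 - np.asarray(p)) ** N))

p = [3e-4, 9e-15, 1e-5, 2e-3]        # stand-in per-problem probabilities
for N in (10**3, 10**5):
    print(N, expected_success_rate(p, N))
```

Problems with very low generation probability (such as the 9·10^-15 stand-in) contribute essentially nothing to the expected success rate at any practical sample size.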
Appendix A.5. Feynman database

The Feynman database was constructed by Udrescu and Tegmark [29] to facilitate the development and testing of algorithms for symbolic regression. The database is composed of one hundred important equations from physics and serves as a good playground and benchmark dataset for equation discovery. The following table presents the one hundred equations from the Feynman database, along with their file names, corresponding to the presentation in [29].