1 Introduction

Symbolic regression (also known as equation discovery) aims at discovering closed-form equations in collections of measured data (Schmidt & Lipson, 2009; Todorovski, 2017). Methods for symbolic regression explore the vast space of candidate equations to find those that fit the given data well. They often employ modeling knowledge from the domain of use to constrain the search space of candidate equations. The knowledge is usually formalized into grammars (Brence et al., 2021) or libraries of model components, such as entities and processes (Bridewell et al., 2008). Knowledge-based equation discovery methods have successfully solved practical modeling problems in various domains (Radinja et al., 2021; Simidjievski et al., 2020).

Grammars and libraries of model components are used to generate candidate expressions that might appear in the discovered equations. However, they must be manually crafted, which is a severe obstacle to their broader use. The central aim of this article is to develop a novel generative model of mathematical expressions that can be used for efficient symbolic regression. The model can be trained from a corpus of mathematical expressions from the domain of interest, thus automatically tailoring the space of candidate equations to the application at hand. The developed generative model must have two essential properties to be applicable in such a scenario. First, it should be trainable from a small number of mathematical expressions, e.g., collected from a textbook or from scientific literature in the application domain. Second, the model should encode the expressions in a low-dimensional latent space. The latter space can then be efficiently explored by optimization methods to solve the task of symbolic regression. Lowering the dimensionality of the latent space will significantly increase the efficiency of symbolic regression.

Recently, several variational autoencoders (VAEs) have been shown to be efficient generative models. CVAE (Gómez-Bombarelli et al., 2018) employs a VAE based on recurrent neural networks to encode discrete expressions into a continuous latent space and then decode points from the latent space back into discrete mathematical expressions. This decoder can be used to generate expressions. However, CVAE still generates invalid sequences and requires extensive training data to reduce the likelihood of generating invalid expressions (Kusner & Hernández-Lobato, 2016). The grammar variational autoencoder, GVAE (Kusner et al., 2017), and its successor, SD-VAE (Dai et al., 2018), employ a context-free grammar to ensure the syntactic validity of the generated expressions. Instead of directly training models on sequences, they model the distribution of parse trees that are produced by the grammar while deriving syntactically (and, in the case of SD-VAE, semantically) valid expressions.

We claim that grammars are an unnecessarily powerful and too general formalism for generating mathematical expressions. Grammars add syntactic categories to the expression symbols rendering the parse trees, i.e., the structures modeled with the autoencoder, more complex than the original sequences. This overhead on training expressions inevitably translates to a requirement for more extensive training data and a latent space with larger dimensionality, reducing the efficiency of optimization methods for symbolic regression operating in that latent space.

We propose a novel variational autoencoder for hierarchical data objects, HVAE, to address these issues. It builds upon the ideas of variational autoencoders for hierarchical data (Jin et al., 2021) and gated recursive convolutional neural networks (Cho et al., 2014a). HVAE combines simple atomic units with shared weights to encode and decode the individual nodes in the hierarchy. The atomic units are extensions of the standard gated recurrent unit (GRU) cells. The encoding units are stacked into a tree that follows the hierarchy of the training object, and they encode the hierarchy bottom-up, compiling the codes of the descendants to encode the ancestor nodes. The decoding units proceed top-down and use the decoded symbols of the ancestor nodes to decide upon the need to extend the hierarchy with descendant nodes. We claim that HVAE can be efficiently trained to generate valid mathematical expressions from a training set of modest size, while using a low-dimensional latent space.

We exploit these expected properties of our HVAE to implement a novel approach for symbolic regression, EDHiE. It performs an evolutionary search through the latent space of a HVAE trained on mathematical expression trees, as shown in Fig. 1. The genetic operations utilize the HVAE encoder to obtain the expressions’ latent codes, generate new individuals with crossover and mutation in the latent space, and decode the latter back to mathematical expressions. EDHiE can then evaluate the fit of the obtained expressions against the measurements. We conjecture that the performance of EDHiE on standard benchmarks (Udrescu & Tegmark, 2020; Uy et al., 2011) compares favorably to that of state-of-the-art symbolic regression methods (Mundhenk et al., 2021). The results of our empirical evaluation of HVAE and EDHiE confirm our conjectures. HVAE achieves better reconstruction of the training expressions with an order of magnitude fewer training examples while using latent spaces with fewer dimensions. EDHiE outperforms alternative methods for symbolic regression on the task of reconstructing the ten equations in the Nguyen benchmark.

Fig. 1: A schematic representation of the EDHiE approach. In the first step, we train a HVAE model. In the second step, we explore the latent space of the HVAE model with an evolutionary algorithm. The red dot represents the best expression in a given iteration (Color figure online).

We can summarize the contributions of this work as follows:

  • We propose HVAE, a variational autoencoder for hierarchical data, that can be efficiently trained to generate mathematical expressions from modest amounts of data, while using a low-dimensional latent space.

  • We introduce EDHiE, a symbolic regression approach that exploits HVAE to efficiently search through the space of candidate equations.

The remainder of the paper is organized as follows. Section 2 reviews related work on generative models and symbolic regression. We introduce the hierarchical variational autoencoder HVAE and the symbolic regression approach EDHiE in Sect. 3. Section 4 presents the results of the empirical evaluation of HVAE and EDHiE. Finally, Section 5 summarizes and discusses the contributions of the presented work and outlines directions for further research.

2 Related work

Most of the early successful applications of generative models have been in the domains of text, speech, images, and video, i.e., they have been mainly used for generating unstructured data objects composed of continuous data elements. The discrete data structures that generative models have most often tackled are strings or sequences of characters, where the data elements are discrete symbols. The models that use strings as input (and output) usually do so by training a recurrent neural network (Sherstinsky, 2020), most commonly using seq2seq autoencoders (Sutskever et al., 2014).

A major problem of sequence-to-sequence autoencoders is that they do not guarantee the syntactic correctness of the generated expressions. One way to solve this problem is to learn an additional validation model for checking the correctness of the generated sequence (Janz et al., 2017). Grammar variational autoencoders (GVAE) (Kusner et al., 2017) use context-free grammars for specifying the space of valid structured data objects. Each data object can then be represented as a sequence of grammar productions (rewrite rules) that derives it. In turn, GVAE encode sequences of rewrite rules that derive objects instead of the objects themselves. The structure of the decoder is constrained to generate valid sequences of rewrite rules that are then used together with the grammar to generate valid expressions.

Dai et al. (2018) propose the use of attribute grammars, i.e., context-free grammars that attach attributes to the grammar’s syntactic categories. By prescribing properties and relationships between the attributes, such grammars can also encode semantic constraints on the derived data objects. The attribute grammars, together with SD-VAE, i.e., syntax-directed VAE, can generate expressions that are consistent with a set of both syntactic and semantic constraints. Alternative generalizations of grammars have been used for generative modeling of program source code in high-level languages (Bielik et al., 2016).

Most of the above approaches can also generate mathematical expressions. However, they need the complex formalism of grammars to generate more complex data structures, most often molecular structures (Dai et al., 2018; Gómez-Bombarelli et al., 2018). Since mathematical expressions can be represented as simpler structures, i.e., binary trees, our work concerns generative models for hierarchical (tree-structured) data.

Hierarchical data have been tackled by generative models in several ways. By making a node depend on its parent and previous sibling, DRNN (Alvarez-Melis & Jaakkola, 2017) combines representations obtained from the depth-wise and width-wise recurrent cells to generate new nodes, which proves useful for recovering the structure of a tree. On the other hand, Tree-LSTM (Tai et al., 2015) and JT-VAE (Jin et al., 2021) focus on adapting equations for recurrent cells to encode (and decode) hierarchical structures more efficiently. Tree-LSTM proposes a generalization of the LSTM cell for encoding trees into a representation that proves effective for classification tasks and semantic relatedness of sentence pairs. JT-VAE adapts recurrent cells for tree message passing. Trees are used as scaffolding for the graph that represents molecules. Encoding and decoding are thus split into four parts: encoding of the graph, encoding of the tree, decoding of the tree, and decoding of the graph. While these adaptations are similar to the ones presented in our work, their focus is on encoding more general structures that are unrelated to mathematical expressions.

Note that our model falls into the general framework of gated recursive convolutional neural networks (Cho et al., 2014a) that combine atomic units with shared parameters in a hierarchy. The output of the root node produces a fixed-length encoding of a data object with an arbitrarily complex structure. Another model, marginally related to ours, is the one of equivalence neural networks (Allamanis et al., 2017). The encoding produced by these networks follows the expressions’ semantic similarity and equivalence, in contrast to their syntactic similarity, which is followed by all the other approaches, including ours.

Finally, our work is also related to algorithms for equation discovery and symbolic regression. Most of them generate candidate expressions for equations first and then estimate the values of their constant parameters by matching the equations against data in the second phase. Classical symbolic regression approaches (Cranmer, 2023; Guimerà et al., 2020; Schmidt & Lipson, 2009), based on evolutionary algorithms, use stochastic generators of expression trees: At the beginning, the expression trees are randomly sampled, and later on, they are transformed using the evolutionary stochastic operations of mutation and cross-over. In contrast, process-based modeling approaches (Bridewell et al., 2008) generate equations by following domain-specific knowledge (provided by the user) that specifies a set of entities (variables) and processes (interactions among entities). Grammar-based approaches to equation discovery employ user-specified context-free grammars (which can also be based on domain knowledge), deterministic (Todorovski & Dzeroski, 1997) or probabilistic (Brence et al., 2021), as efficient generators of expressions.

Recently, many symbolic regression approaches based on neural networks have been proposed (Biggio et al., 2021; d’Ascoli et al., 2022; Kamienny et al., 2022; Martius & Lampert, 2016; Petersen et al., 2021; Udrescu & Tegmark, 2020). In particular, Deep Symbolic Optimization, DSO approaches symbolic regression [among other optimization tasks (Mundhenk et al., 2021)] by combining neural networks and reinforcement learning with evolutionary algorithms. The neural networks are used to sample the individuals in the initial population of the evolutionary algorithm and are retrained at each iteration to focus on expressions leading to better fit. It is closely related to our work, since it combines similar methods. Yet our focus here is on efficient neural networks for generating mathematical expressions that are trained before the beginning of the evolutionary process.

3 Methodology

We start this section by briefly introducing the task of symbolic regression and the search space of mathematical expressions (Sect. 3.1). After this, we introduce variational autoencoders and the structure of the hierarchical variational autoencoder, HVAE (Sect. 3.2). We finish the section by explaining how to use HVAE for generating mathematical expressions and how to combine it with an evolutionary algorithm for symbolic regression (Sect. 3.3).

3.1 Symbolic regression and expression trees

Symbolic regression (SR) is the machine learning task of discovering equations in collections of measured data. Symbolic regression methods take a data set S consisting of multiple measurements of a set of real-valued variables \(V = \{ x_1, x_2, \ldots , x_p, y \}\), where y is a designated target variable. The output of SR is an equation of the form \(y = f(x_1, x_2, \ldots , x_p)\), where the right-hand side of the equation is a closed-form mathematical expression. The equation should provide an optimal fit against the measurements from S, i.e., minimize the discrepancy between the observed values of the target variable y and values calculated by using the equation. Symbolic regression methods usually follow the parsimony principle, preferring simpler expressions over more complex ones.

Symbolic regression methods search through the space of candidate mathematical expressions for the right-hand side of the equation to find the one that optimally fits the measurements. Mathematical expressions can be represented in different ways. We commonly use the infix notation, where operators are placed between two sub-expressions they operate on, e.g., \(A + B\), where A and B are sub-expressions. Infix notation uses parentheses to indicate the order in which the operations need to be performed. Prefix (Polish) or postfix (reverse Polish) notations do not need parentheses since the operators are written before or after the two sub-expressions, e.g., \(+ A B\) or \(A B +\). The three notations correspond to different traversals of the nodes in an expression tree. The latter is a hierarchical data structure, where the inner nodes correspond to mathematical operators and functions, while the leaf nodes correspond to variables and constants.

In symbolic regression, the constants’ values are fitted against the measured data from S, while the variables are elements of V other than the target variable. We assume binary expression trees, since the standard arithmetic operators are binary. For inner nodes that correspond to single-argument functions, we take the second descendant node to be null. We define the height of an expression tree as the number of nodes on the longest path from the root node to one of the leaves. Figure 2 depicts an example expression tree with a height of four, along with the corresponding mathematical expression in different notations.

Fig. 2: An expression tree with a height of four and three sequence-based representations of the corresponding mathematical expression.

Our model generates expression trees, as they have several advantages over sequences (strings). Firstly, syntactic correctness is easy to achieve, since operators and functions appear only in the inner nodes, while variables and constants appear only in the leaves. Secondly, information needs to travel only up to the tree’s height, which is on the order of \(\log n\) for balanced trees, instead of up to n steps along the sequence (the length of the sequence). Lastly, sub-expressions can be encoded independently of each other during the encoding process.
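To make the tree representation concrete, the following minimal Python sketch (our own illustration; the class and function names are not taken from the HVAE implementation) builds the expression tree of \(x + \cos x\) and derives the prefix, infix, and postfix notations by traversing it.

```python
class Node:
    """A node in a binary expression tree; `right` is None for unary functions and leaves."""
    def __init__(self, symbol, left=None, right=None):
        self.symbol = symbol
        self.left = left
        self.right = right

def prefix(t):
    if t is None:
        return []
    return [t.symbol] + prefix(t.left) + prefix(t.right)

def postfix(t):
    if t is None:
        return []
    return postfix(t.left) + postfix(t.right) + [t.symbol]

def infix(t):
    if t is None:
        return ""
    if t.left is None and t.right is None:      # variable or constant
        return t.symbol
    if t.right is None:                         # single-argument function, e.g. cos
        return f"{t.symbol}({infix(t.left)})"
    return f"({infix(t.left)} {t.symbol} {infix(t.right)})"  # binary operator

# The expression x + cos(x) as a tree of height three.
tree = Node("+", Node("x"), Node("cos", Node("x")))
print(prefix(tree))   # ['+', 'x', 'cos', 'x']
print(postfix(tree))  # ['x', 'x', 'cos', '+']
print(infix(tree))    # (x + cos(x))
```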

3.2 Hierarchical variational autoencoder

In recent years, variational autoencoders (Kingma & Welling, 2014) have emerged as one of the most popular generative models. The reason for this is that, when trained correctly, variational autoencoders map the observed data with an unknown distribution into a latent representation with a known distribution. This results in a continuous latent space, from which one can sample and synthesize new data. In contrast to a (deterministic) autoencoder, where the encoder outputs a latent representation z that is directly fed into the decoder, the encoder in the variational autoencoder outputs the parameters for an approximate posterior distribution, e.g., \(\mu\) and \(\sigma\) in the case of a latent space parameterized by a multivariate Gaussian distribution.

Thus, a representation z that is fed into the decoder is sampled from the underlying distribution with the learned parameters (\(\mu\), \(\sigma\)). The loss of the variational autoencoder is the reconstruction error, i.e., the difference between the input to the encoder and the output of the decoder. Additionally, variational autoencoders typically use Kullback–Leibler (KL) divergence (Kullback & Leibler, 1951) as the regularization term for the loss. The loss can thus be calculated as:

$$\begin{aligned} J(x, \mu _z, \sigma _z) = J_{\text {rec}}(x) + \lambda \cdot \textsc {KL}(\mathcal {N}(\mu _z,\sigma _z) \, \Vert \, \mathcal {N}(0, I)), \end{aligned}$$
(1)

where \(J_{\text {rec}}(x)\) is the reconstruction loss of x and \(\lambda \ge 0\) the regularization cost parameter. In case the underlying distribution is Gaussian, KL divergence to an isotropic unit Gaussian can be estimated as

$$\begin{aligned} \textsc {KL}(\mathcal {N}(\mu _z,\sigma _z) \, \Vert \, \mathcal {N}(0, I))=\frac{1}{2}\left( \mu ^2_z + \sigma ^2_z - 1 - \log \sigma ^2_z\right) . \end{aligned}$$
(2)

We use cost annealing (Bahuleyan, 2018) to focus on the reconstruction error (i.e., use small values of \(\lambda\)) at the beginning and then gradually shift the focus towards the smoothness of the latent space by increasing the value of \(\lambda\).
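The following PyTorch-style sketch illustrates the loss in Eqs. (1) and (2) with cost annealing. The function names and the annealing constants (midpoint, scale) are illustrative assumptions on our part; the schedule actually used in our experiments is given in Sect. 4.1.1.

```python
import math
import torch

def kl_to_unit_gaussian(mu, logvar):
    """Eq. (2): KL(N(mu, sigma) || N(0, I)) for a diagonal Gaussian, summed over latent dimensions."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=-1)

def annealed_lambda(step, midpoint=1000, scale=200.0):
    """Illustrative tanh-based annealing: lambda grows smoothly from ~0 towards 1."""
    return 0.5 * (math.tanh((step - midpoint) / scale) + 1.0)

def vae_loss(reconstruction_loss, mu, logvar, step):
    # Eq. (1): reconstruction term plus lambda-weighted KL regularization.
    lam = annealed_lambda(step)
    return reconstruction_loss + lam * kl_to_unit_gaussian(mu, logvar).mean()
```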

3.2.1 Model overview

Our approach uses a variational autoencoder structure that consists of an encoder and a decoder. The encoder takes tree-structured data as input and outputs a distribution in the latent vector space, represented with the mean (\(\mu _z\)) and the logarithm of the variance (\(\log \sigma _z\)) vectors. The decoder works in the opposite direction, sampling a point from the latent vector space as input and transforming it into a binary expression tree. To make the backward propagation possible, we sample points with the reparametrization trick.

Fig. 3: The processes of (a) encoding and (b) decoding the expression tree of \(x+\cos x\). The acronyms EC and DC stand for “encoding cell” (introduced in Sect. 3.2.2) and “decoding cell” (introduced in Sect. 3.2.3).

Trees are encoded recursively, starting from the leaves and ending at the root node. To encode a subtree with a root in n, we first encode its left and right subtrees. We then pass their codes, along with the symbol in the node n, as inputs to the encoding cell (further described in Sect. 3.2.2). This cell outputs the code of the subtree rooted in n. At the beginning of the recursion, in each leaf node, the codes corresponding to the (missing) children are assumed to be vectors of zeros. Once the root of the tree is encoded, its code is passed through two fully connected layers that give the mean and log-variance vectors that form the latent representation of the tree. Figure 3a illustrates the recursive encoding process on the expression \(x+\cos x\).

The first layer of the decoder transforms the sampled point from the latent space into the code of the hierarchy. After this, the tree is generated recursively by passing the code of the current node (subtree) through the decoding cell (further described in Sect. 3.2.3). This cell takes the code of the node (subtree) as input and generates a node symbol, along with the codes of the two child nodes. There are three possible symbol types. If we encounter an operator, both child nodes are generated recursively. On the other hand, if the symbol represents a function, we only generate the left child. Lastly, if the symbol is either a variable or a constant, no further child nodes are generated in this branch. This process is shown in Fig. 3b, where the expression \(x+\cos x\) is decoded.
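The following sketch (an illustrative simplification, reusing the Node class from the sketch in Sect. 3.1 and treating the encoding and decoding cells as black-box callables) summarizes the two recursions: bottom-up encoding and top-down decoding, where the predicted symbol type determines whether zero, one, or two children are generated.

```python
def encode_tree(node, encoding_cell, zero_code):
    # Bottom-up recursion: missing children contribute a vector of zeros.
    if node is None:
        return zero_code
    h_left = encode_tree(node.left, encoding_cell, zero_code)
    h_right = encode_tree(node.right, encoding_cell, zero_code)
    return encoding_cell(node.symbol, h_left, h_right)

def decode_tree(code, decoding_cell, is_operator, is_function):
    # Top-down recursion: the predicted symbol determines how many children to generate.
    symbol, h_left, h_right = decoding_cell(code)
    node = Node(symbol)
    if is_operator(symbol):        # binary operator: generate both children
        node.left = decode_tree(h_left, decoding_cell, is_operator, is_function)
        node.right = decode_tree(h_right, decoding_cell, is_operator, is_function)
    elif is_function(symbol):      # single-argument function: generate only the left child
        node.left = decode_tree(h_left, decoding_cell, is_operator, is_function)
    # variable or constant: leaf node, the recursion stops here
    return node
```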

During training, we follow the structure of the encoded tree and try to predict the correct node symbols. In turn, we jointly learn to predict the structure of the expression tree and the symbols inside the node, since the structure is determined by the symbols. We calculate the loss using cross-entropy on a sequence of symbols obtained with the in-order traversal of the expression tree. Some additional implementation details are explained in Appendix 4.

3.2.2 Encoder

The encoding proceeds in two phases. The first follows the hierarchy of the input and applies the encoding cell to each node of the hierarchy as described above. In the second phase, the code of the root node is transformed into the mean and log-variance vectors of the input’s latent representation.

The encoding cell comprises a GRU21 cell, our adaptation of the standard GRU cell (Cho et al., 2014b). The (output) code h in GRU21 is computed from the input vector x and the codes \(h_l\) of the left and \(h_r\) of the right child with the following equations:

$$\begin{aligned} r&= \varphi _S (W_{ir}x + b_{ir} + W_{hr}(h_l \nparallel h_r) + b_{hr}) \end{aligned}$$
(3)
$$\begin{aligned} u&= \varphi _S (W_{iu}x + b_{iu} + W_{hu}(h_l \nparallel h_r) + b_{hu}) \end{aligned}$$
(4)
$$\begin{aligned} n&= \text {tanh} (W_{in}x + b_{in} + r*(W_{hn}(h_l \nparallel h_r) + b_{hn})) \end{aligned}$$
(5)
$$\begin{aligned} h&= (1-u)*n + \frac{u}{2}*h_l + \frac{u}{2}*h_r, \end{aligned}$$
(6)

where \(\varphi _S\) denotes the sigmoid activation function. In these expressions, r, u, and n represent the standard reset gate, update gate, and candidate activation vectors of a GRU cell. Compared to the original GRU equations, Eqs. (3), (4), and (5) exhibit two differences. First, instead of the code of the previous symbol in the sequence, they use the concatenation of the codes \(h_l\) and \(h_r\) of the child nodes (denoted by \((h_l \nparallel h_r)\)). Second, the input dimension of the weight matrices \(W_{hr}, W_{hu}, W_{hn}\) must be \(\dim (h_l)+\dim (h_r)\) instead of \(\dim (h)\). While Eq. (6) remains similar to the original one, we change its second term (from the usual form \(u*h_{t-1}\)) to \(\frac{u}{2}*h_l + \frac{u}{2}*h_r\) to retain information from the codes of both child nodes. Note that \(*\) denotes the element-wise multiplication of vectors.

In the second phase, the model transforms the code of the root node into the latent representation of the input expression through two fully-connected layers.
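For illustration, a possible PyTorch realization of the GRU21 cell defined by Eqs. (3)–(6) is sketched below; the grouping of the weight matrices into two linear layers and the class and parameter names are our own choices and may differ from the actual implementation.

```python
import torch
import torch.nn as nn

class GRU21(nn.Module):
    """Encoding cell: combines a node's input vector x with the codes of its two children."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # W_ir, W_iu, W_in (with biases b_ir, b_iu, b_in) grouped into one linear layer.
        self.W_i = nn.Linear(input_dim, 3 * hidden_dim)
        # W_hr, W_hu, W_hn act on the concatenation (h_l || h_r), hence the 2 * hidden_dim input.
        self.W_h = nn.Linear(2 * hidden_dim, 3 * hidden_dim)

    def forward(self, x, h_l, h_r):
        h_cat = torch.cat([h_l, h_r], dim=-1)
        i_r, i_u, i_n = self.W_i(x).chunk(3, dim=-1)
        g_r, g_u, g_n = self.W_h(h_cat).chunk(3, dim=-1)
        r = torch.sigmoid(i_r + g_r)                        # Eq. (3): reset gate
        u = torch.sigmoid(i_u + g_u)                        # Eq. (4): update gate
        n = torch.tanh(i_n + r * g_n)                       # Eq. (5): candidate activation
        return (1 - u) * n + (u / 2) * h_l + (u / 2) * h_r  # Eq. (6): code of the subtree
```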

3.2.3 Decoder

The decoding also comprises two phases. In the first, a fully-connected layer transforms a point from the latent vector space into the code of the root node. In the second phase, the decoding cell is recursively deployed to decode each of the nodes in the expression tree.

Fig. 4: The structure of the decoding cell.

Figure 4 depicts the structure of the decoding cell. The cell is composed of a fully-connected layer, a softmax layer, and the GRU12 cell, an adaptation of the original GRU cell. The input code is first passed through the fully-connected layer followed by the softmax layer. The latter produces a vector of probabilities, from which the most probable output symbol is chosen. If the output symbol is either a constant or a variable, the decoding stops. Otherwise, the output vector is also used as an input to the GRU12 cell, together with the code that is given as input to the decoding cell. The GRU12 cell produces two codes, one for the left and one for the right child.

GRU12 computes the two codes \(h_l\) and \(h_r\) for the child nodes using the input vector x and the code h with the following equations:

$$\begin{aligned} r&= \varphi _S (W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \end{aligned}$$
(7)
$$\begin{aligned} u&= \varphi _S (W_{iu} x + b_{iu} + W_{hu} h + b_{hu}) \end{aligned}$$
(8)
$$\begin{aligned} n&= \text {tanh} (W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \end{aligned}$$
(9)
$$\begin{aligned} d&= (1-u) * n + u * (h \nparallel h) \end{aligned}$$
(10)
$$\begin{aligned} d&\equiv h_l \nparallel h_r \end{aligned}$$
(11)

There are two major differences between GRU12 and the original GRU cell. First, the vectors r, u, and n in Eqs. (7), (8), and (9) are of dimension \(2\cdot \dim (h)\) instead of \(\dim (h)\). Consequently, all bias vectors are of dimension \(2\cdot \dim (h)\), and all weight matrices have an output dimension of \(2\cdot \dim (h)\). Second, in Eq. (10), the code h is concatenated with itself to make the dimensions in the equation match. The vector d is then split in half in Eq. (11): the first half is used as the code for the left child, while the second half is used as the code for the right child.
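Analogously, a possible PyTorch realization of the GRU12 cell defined by Eqs. (7)–(11) is sketched below; again, the concrete grouping of parameters is our illustrative choice, and the symbol-prediction layers shown in Fig. 4 are omitted.

```python
import torch
import torch.nn as nn

class GRU12(nn.Module):
    """Decoding cell core: produces the codes of the left and right child from (x, h)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        # All gates have output dimension 2 * dim(h), as required by Eqs. (7)-(9).
        self.W_i = nn.Linear(input_dim, 3 * 2 * hidden_dim)
        self.W_h = nn.Linear(hidden_dim, 3 * 2 * hidden_dim)

    def forward(self, x, h):
        i_r, i_u, i_n = self.W_i(x).chunk(3, dim=-1)
        g_r, g_u, g_n = self.W_h(h).chunk(3, dim=-1)
        r = torch.sigmoid(i_r + g_r)                        # Eq. (7)
        u = torch.sigmoid(i_u + g_u)                        # Eq. (8)
        n = torch.tanh(i_n + r * g_n)                       # Eq. (9)
        d = (1 - u) * n + u * torch.cat([h, h], dim=-1)     # Eq. (10)
        h_l, h_r = d.split(self.hidden_dim, dim=-1)         # Eq. (11): split into child codes
        return h_l, h_r
```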

3.3 Generating expressions for symbolic regression

Recall that the goal of symbolic regression is to efficiently search through the space of mathematical expressions and find the one that, when used on the right-hand side of an equation, fits the given measurements well. In this section, we explain how to use HVAE for generating expressions.

3.3.1 HVAE as a generative model

We can generate expressions in two ways, corresponding to two different symbolic regression scenarios. The first way, which aims at discovering equations from data, samples random vectors from the standard Gaussian distribution \(\mathcal {N}(0, I)\) in the latent space and passes them through the decoder.

On the other hand, we might want to generate expressions in a scenario that corresponds to the revision of existing equations to fit newly gathered data. Here, we want to generate mathematical expressions that are similar to the one given as input. Our approach achieves this by encoding an expression and sampling its immediate neighborhood in the latent space. We expect these points to be decoded into expressions similar to the one given as input. We will show that HVAE meets this expectation in Sect. 4.1.4.

3.3.2 Evolutionary algorithm operators

Finally, we can search the latent space spanned by our model with evolutionary algorithms (Koza, 1994), one of the most commonly used paradigms for symbolic regression. Evolutionary algorithms explore the search space by first randomly sampling individuals for the initial population. Then they repeatedly generate new populations by combining pairs of individuals from the current population with the genetic operators of mutation and crossover.

An individual in a population is in our case a real-valued vector z, corresponding to the code of an expression tree in the latent vector space. Using the HVAE model, z can be decoded into an expression tree. To calculate the individual’s fit against the training data, we first fit the values of the constant parameters in the decoded expression tree and then measure the error of the equation with the resulting expression on the right-hand side (with respect to the training data). We generate the initial population by randomly sampling individuals from the Gaussian distribution \(\mathcal {N}(0, I)\).

Crossover combines two individuals, referred to as parents \(z_A\) and \(z_B\), into an offspring \(z_O\). We generate the latter as a convex combination of \(z_A\) and \(z_B\), i.e. \(z_{\text {O}}= (1-a) \cdot z_A + a \cdot z_B\), where a is sampled from the uniform distribution on the interval [0, 1]. For values of a close to 0 and 1, the offspring is close to one of the parents, while values of a close to 0.5 lead to an offspring equally dissimilar to both parents.

The mutation operator transforms an individual z into a mutated individual \(z_M\). To perform a mutation, we first decode z into an expression tree and immediately encode it back into its latent-space representation to obtain the values of \(\mu _z\) and \(\sigma _z\). Now, we can mutate z into an individual with a syntactically similar expression by sampling from \(\mathcal {N}(\mu _z, \sigma _z)\), or into a random individual by sampling \(z_M\) from \(\mathcal {N}(0, I)\). Similarly to the case of crossover, we interpolate between these two extremes by sampling from \(\mathcal {N}(a \cdot \mu _z + (1 - a) \cdot 0, a \cdot \sigma _z + (1 - a) \cdot I) = \mathcal {N}(a \cdot \mu _z, a \cdot \sigma _z + (1 - a) \cdot I)\), where a is randomly sampled from the uniform distribution on the interval [0, 1]. When a is close to 0, the mutated individual \(z_M\) is close to a random sample (see the first paragraph of Sect. 3.3.1). On the other hand, when a is close to 1, \(z_M\) decodes into an expression syntactically similar to the one encoded by z (second paragraph of Sect. 3.3.1).
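The two operators can be sketched as follows; here, encode and decode stand in for the trained HVAE encoder and decoder, and the sketch is an illustration of the operators described above rather than the exact EDHiE implementation.

```python
import numpy as np

rng = np.random.default_rng()

def crossover(z_a, z_b):
    # Offspring as a convex combination of the two parent codes.
    a = rng.uniform(0.0, 1.0)
    return (1 - a) * z_a + a * z_b

def mutate(z, encode, decode):
    # Decode z and re-encode the resulting tree to obtain (mu_z, sigma_z), then sample
    # between N(mu_z, sigma_z) (syntactically similar) and N(0, I) (random).
    mu_z, sigma_z = encode(decode(z))
    a = rng.uniform(0.0, 1.0)
    return rng.normal(a * mu_z, a * sigma_z + (1 - a) * np.ones_like(sigma_z))
```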

We implement the EDHiE (Equation Discovery with Hierarchical variational autoEncoders) approach for symbolic regression by combining HVAE with evolutionary algorithms using these operators. Our implementation uses pymoo (Blank & Deb, 2020) for evolutionary algorithms and ProGED (Brence et al., 2021) functionality for evaluating the fit of a candidate equation.

4 Evaluation

In this section, we investigate the validity of our hypothesis that the hierarchical variational autoencoder is a more efficient generator of mathematical expressions than the alternative VAEs for sequences by conducting two series of computational experiments. In the first series, we evaluate the performance and efficiency of HVAE on the task of generating mathematical expressions. In the second series, we evaluate the performance of EDHiE on the downstream task of symbolic regression.

4.1 The performance of HVAE

We start this section by introducing the experimental setup (Sect. 4.1.1). We then report the results of evaluating HVAE with respect to the reconstruction error (Sect. 4.1.2), the efficiency in terms of the amount of training data needed and the dimensionality of the latent space (Sect. 4.1.3), and finally the smoothness of the latent space (Sect. 4.1.4). In Appendix 2, we further justify our claim that points close in the latent space of HVAE are decoded into similar expressions.

4.1.1 Experimental setup

Data sets We estimate the reconstruction error of the variational autoencoders on a collection of six synthetic data sets, ranging from small ones, including simple expressions, to large ones, including complex expressions. The data sets are as follows:

  • AE4-2k, AE5-15k, and AE7-20k contain 2, 15, and 20 thousand mathematical expressions, respectively, with expression trees of maximum height four, five, and seven. These expressions can contain constants, variables, and the operators +, −, \(\cdot\), /, and ^ (exponentiation).

  • Trig4-2k, Trig5-15k, and Trig7-20k are the same as above, but the expressions also contain the sine and cosine functions.

We create these data sets with the ProGED (Brence et al., 2021) system by randomly sampling mathematical expressions from a given probabilistic context-free grammar. The generated expressions are simplified using the Python library SymPy (Meurer et al., 2017). The context-free grammars that constrain the output of GVAE and the ones used to generate the data sets are documented in Appendix 1.

Parameter setting We train GVAE and CVAE for 150 epochs with the following values of their hyper-parameters: latent dimension \(=128\), hidden dimension \(=128\), batch size \(=64\), kernel sizes of the convolution layers \(= 2,3,4\), and the ADAM optimizer (Kingma et al., 2015). For HVAE, the hyper-parameters are: latent size \(=128\), batch size \(=32\), and the ADAM optimizer with the default learning rate. For the first 1,800 iterations i, we calculate the regularization cost parameter \(\lambda\) as \(\lambda _i = 0.5\cdot (\tanh \frac{i-4,500}{2}+1)\); after this, we set \(\lambda _i\) to \(\lambda _{1,800}\).

Estimating the reconstruction error The Levenshtein distance (Levenshtein, 1965) (often referred to as the edit distance) quantifies the dissimilarity of two strings in terms of the number of insertion, removal, and substitution operations that are needed to transform one string into the other. We use this distance to test how well our autoencoder recreates expressions.

We first pass the expression through the VAE to get the predicted expression. If needed, we check the syntactic correctness of the latter and transform it into an expression tree. We then traverse the input and the output trees in post-order (left child, right child, node symbol) to obtain the input and the output expressions in the postfix notation, which does not require parentheses and is hence better suited for calculating the distance between expressions. Finally, we calculate the edit distance between these two strings.
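For illustration, the edit distance between two postfix token sequences can be computed with the textbook dynamic-programming algorithm sketched below; the token sequences in the example are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two symbol sequences (insertions, removals, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # removal
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (sa != sb)))  # substitution
        prev = curr
    return prev[-1]

# Compare an input and a decoded expression, both as postfix token sequences.
print(levenshtein(["x", "x", "cos", "+"], ["x", "c", "cos", "+"]))  # 1
```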

To estimate the reconstruction error on unseen expressions, we use five-fold cross-validation with the same splits across all methods. GVAE and CVAE sometimes produce invalid expressions, which we discard from the evaluation. Because of this, the results in Sects. 4.1.2 and 4.1.3 might be biased in favor of CVAE due to many syntactically incorrect expressions being discarded. Note that GVAE has fixed-size input (and output) that might be too short for encoding all the grammar rules needed to derive an expression. In those cases, GVAE returns empty strings, which we consider invalid expressions. CVAE, on the other hand, produces syntactically incorrect expressions such as \(x c (/x)c)\sin \sin ( c ) )\), \(\cdot \cdot x-c\cdot /\sin (x))\), or \((/x(-x)c)\) (presented here in infix notation).

4.1.2 Out-of-sample reconstruction error

Table 1 compares the out-of-sample reconstruction error and the ratio of invalid expressions for the three variational autoencoders. Our hierarchical VAE significantly outperforms the other two methods on all data sets. An interesting observation is that GVAE works consistently better on expressions involving trigonometric functions, while HVAE and CVAE perform worse on them. The likely reason for this opposite effect is the following: for GVAE, a trigonometric function is just another production rule in the grammar, while for HVAE and CVAE it drastically changes the structure of the expression (tree). GVAE additionally benefits from the fact that expressions with trigonometric functions are usually shorter, since the nodes corresponding to trigonometric functions have only one descendant instead of the usual two.

Table 1 The out-of-sample reconstruction error and the percentages of syntactically incorrect expressions generated by the three variational autoencoders

The percentages of invalid expressions show that our approach always produces syntactically correct expressions, while GVAE and CVAE sometimes fail to produce valid outputs. The fraction of such expressions is quite small for GVAE (see the explanation above) but quite significant for CVAE. Lastly, we can notice that, as expected, longer expressions are harder to recreate and thus lead to a higher edit distance and a higher percentage of invalid expressions than shorter ones, provided enough training data is used.

4.1.3 Training efficiency and the latent space dimensionality

We proceed to test our conjectures about the efficiency of training the generators of mathematical expressions. We expect HVAE to require less training data and a lower-dimensional latent space than the other approaches to achieve the same level of reconstruction error. The latter is especially important for symbolic regression, since exploring the latent space is more efficient when its dimensionality is low.

Fig. 5: The impact of (a) the training data set size and (b) the dimensionality of the latent space on the reconstruction error of the three autoencoders.

Figure 5a depicts the impact of the number of expressions in the training set on the reconstruction error for the three different generative models. Again, HVAE significantly outperforms the other two VAEs. Its reconstruction error is estimated to be consistently lower than 0.25, even when trained on 2 thousand examples only. This error is an order of magnitude lower than the lowest error of 1.5 achieved by the second best model, GVAE, when trained on the whole data set of 12 thousand examples.

Figure 5b shows the impact of the dimensionality of the latent space on the reconstruction error across different VAEs. In line with the previous results, HVAE significantly outperforms both CVAE and GVAE. HVAE with a latent space of dimension 16 performs on par with or better than GVAE and CVAE with latent spaces of 256 dimensions. We can see that the reconstruction error rises quickly when the latent space dimension drops below 32, but otherwise remains consistently low. Even with a latent space of dimension 16, our approach is still comparable to the other two methods with a latent space of dimension 256. This allows us to reduce the dimensionality of the latent space by more than an order of magnitude, which makes HVAE an excellent candidate for generating expressions for symbolic regression.

The reason for the superior efficiency of HVAE is the use of expression trees, as subexpressions are always encoded into the same code, regardless of their position in the expression. This significantly reduces the space of possible codes and allows for training the model in a way that better generalizes to the repetitive subexpressions (subpatterns) it encounters.

4.1.4 Latent space smoothness

Finally, we expect the latent space of HVAE to be smooth in the sense that samples close to the latent representation of the input expression are decoded into expressions similar to the one given as input. We investigate the validity of this conjecture by applying linear interpolation (performing a homotopic transformation) between two expressions in the latent space. Assume that we are given two expressions, A and B. Using the model, we encode them into their latent representations \(z_A\) and \(z_B\). We can then generate new latent representations \(z_\alpha\) by combining the two representations with the formula \(z_\alpha = (1-\alpha )\cdot z_A + \alpha \cdot z_B\), where \(\alpha \in \{i / n: i \in \{0, 1, \ldots , n\}\}\). Decoding the latent representations \(z_\alpha\) in a smooth latent space should produce intermediate expressions that represent a smooth transition from A to B in n steps.
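A minimal sketch of the interpolation step, assuming the latent representations are numpy vectors:

```python
import numpy as np

def interpolate(z_a, z_b, n=4):
    # Latent codes z_alpha on the segment between z_A and z_B, with alpha = i/n for i = 0, ..., n.
    return [(1 - i / n) * z_a + (i / n) * z_b for i in range(n + 1)]

# Each z_alpha is then passed through the decoder to obtain the intermediate expressions.
```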

Table 2 shows the results of such a linear interpolation in the latent spaces of the different VAEs. HVAE and GVAE produce continuous latent spaces where the transition from expression A to expression B is indeed smooth. CVAE also produces a smooth transition, but some of the intermediate expressions might be syntactically incorrect. The second interpolation in the lower part of the table is an example of a smooth transition in the HVAE latent space. We can see that at each step only a few expression symbols change and that these changes are rarely redundant. Appendix 2 provides further examples of interpolations with visualizations of the expression trees.

Table 2 Examples of linear interpolation between two expressions in the latent spaces of the three VAEs. Expressions that are colored red are syntactically incorrect. Here, we set \(n=4\) and \(\alpha = i / 4, 0\le i\le 4\)

4.2 Evaluating EDHiE

In the second series of experiments, we evaluate the performance of EDHiE. We start the section by introducing the experimental setup (Sect. 4.2.1). We then report on the impact of the dimensionality of the latent space on the performance of symbolic regression (Sect. 4.2.2). Furthermore, we compare the performance of EDHiE with that of other methods for symbolic regression on the Nguyen benchmark (Sect. 4.2.3) and report the performance of EDHiE on the Feynman benchmark (Sect. 4.2.4).

4.2.1 Experimental setup

Data sets The Nguyen (Uy et al., 2011) benchmark contains eight equations with one non-target variable and two equations with two non-target variables. The right-hand sides of these equations are shown in the second column of Table 3. We generate two data sets (train and test) with five thousand simulated measurements for each equation. We use the train set to select the best expressions and the test set to evaluate their performance with the metrics described below. We sample points from the interval \([-20, 20]\) for equations 1–6, the interval [0, 40] for equation 7, [0, 80] for equation 8, and [0, 20] for equations 9 and 10.

We further evaluate our approach on the 16 equations involving up to two variables from the Feynman benchmark (Udrescu & Tegmark, 2020). The right-hand sides of these equations are shown in the last column of Table 3. Because the equations in the Feynman benchmark represent real-world equations, each of the equations FM-3, FM-4, FM-5, and FM-7 contains two entries, each with its own variables and data sets of measurements.

Table 3 Expressions from the Nguyen (first two columns on the left-hand side) and Feynman (last three columns on the right-hand side) benchmarks

Evaluation process We compare the performance of EDHiE on the Nguyen benchmark equations to the performance of three other symbolic regression systems. ProGED (Brence et al., 2021) uses probabilistic grammars as generators of mathematical expressions. DSO (Petersen et al., 2021) combines deep neural networks with evolutionary algorithms. PySR (Cranmer, 2023) employs evolutionary optimization with operators applied directly to the expression trees. We run each system ten times on each equation and evaluate at most 100,000 sampled expressions. All approaches use the same library of tokens and/or grammars, further described in Appendix 1. When running PySR, we allow fitting the values of the constant parameters, since this cannot be turned off in the implementation. The dimensionality of the latent space of HVAE is set to 32; the ADAM optimizer uses the default learning rate. We elaborate on the choice of the dimensionality of the latent space in the next section. Appendix 3 gives the complete report on the experiments in latent spaces with varying dimensions.

Estimating the performance We use three metrics: the number of successful reconstructions, i.e., runs leading to an equation equivalent to the original one; the mean \(R^2\) of the best equation; and the number of expressions sampled before the reconstruction is achieved. We consider a run successful if we find an expression for which the RMSE between the target and predicted values is lower than \(10^{-10}\). To guarantee accurate reporting, we manually check whether the original equation and the expression with the lowest RMSE are equivalent. In each run, we use the expression with the lowest RMSE to calculate the bounded \(R^2\) metric on the test set using the formula

$$\begin{aligned} R^2(\hat{y}, y) = \max \left( 0, 1 - \frac{\sum _i (y_i - \hat{y}_i)^2}{\sum _i (y_i - \overline{y})^2}\right) , \end{aligned}$$
(12)

where \(\hat{y}_i\) denotes the predicted value of the target variable (calculated by using the equation), \(y_i\) is the measured value of the target variable, and \(\overline{y}\) is the mean value of y in the training data set. Lastly, for all the methods that report this information, we show the number of unique expressions (averaged across the ten runs) that the symbolic regression system considered before encountering the reconstructed equation in the generation process.
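For completeness, a direct numpy implementation of Eq. (12):

```python
import numpy as np

def bounded_r2(y_hat, y):
    """Eq. (12): R^2 clipped at zero, so that arbitrarily poor equations score 0.
    Note: in the paper, the mean of y is taken over the training set."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return max(0.0, 1.0 - ss_res / ss_tot)
```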

4.2.2 The impact of the dimensionality of the latent space

Let us start with a series of computational experiments exploring the latent space for encoding mathematical expressions with random sampling. Here, we perform symbolic regression by taking randomly sampled points in the latent space and decoding them into expressions that are then evaluated on the measurements/data. The expression that fits the data best is selected as the candidate for the discovered equation.

Table 4 The performance of symbolic regression (number of successful reconstructions) by randomly sampling with CVAE, GVAE, and HVAE on the Nguyen benchmark

Table 4 shows the number of successful runs of the random sampling approaches based on the three VAEs, CVAE, GVAE, and HVAE. In the further discussion of results, we use the name HVAR for HVAE with random sampling. We can see here a typical example of the curse of dimensionality at work. When the symbolic regression algorithm explores high-dimensional latent spaces, it can easily slip into parts of those spaces that do not lead to optimal equations. This shows that the ability of HVAE to encode mathematical expressions in low-dimensional latent spaces is crucial for the performance of symbolic regression with HVAR.

Based on the results of the experiments in Table 4 and Appendix 3, we use a 32-dimensional latent space for EDHiE in the remainder of this section.

4.2.3 Comparison on the Nguyen equations

Table 5 Comparison of the performance of symbolic regression (number of successful reconstructions, \(R^2\), and number of evaluated equations) with random sampling on the Nguyen benchmark. We compare sampling from a manually-crafted probabilistic grammar (ProGED) with sampling using a trained HVAE (HVAR)

In the next series of experiments, we compare the performance of HVAR, the random sampling method using HVAE, to that of ProGED, which samples mathematical expressions from a manually crafted probabilistic grammar. Table 5 reports the results of the comparison. They show that the generator learned by HVAE is at least as good as the probabilistic grammar; in fact, to our surprise, HVAR outperforms ProGED significantly. First, it successfully reconstructs five (of the ten) equations from the Nguyen benchmark in ten runs, one more than ProGED. Second, for the three equations NG-2, NG-6, and NG-9, the reconstruction is achieved faster, i.e., by evaluating fewer candidate expressions.

Furthermore, we check the contribution of the evolutionary approach in EDHiE over the random sampling method HVAR. To this end, we compare the last three columns of Table 5 with the last three columns of Table 6. EDHiE successfully reconstructs all ten equations from the Nguyen benchmark in at least one of the ten runs. In three cases, the equations are reconstructed in every run. Note also that the successful reconstructions with EDHiE require fewer evaluations of candidate equations than the random sampling approaches.

Table 6 Comparison of the performance of the symbolic regression systems EDHiE, DSO, and PySR on the Nguyen benchmark

Table 6 compares EDHiE with PySR, which applies evolutionary operators to expression trees directly (i.e., without embedding them into a latent space), and DSO, which, similarly to our approach, combines deep learning with evolutionary optimization. Overall, EDHiE performs better than the other two methods across all metrics: it achieves the highest total number of successful reconstructions. EDHiE has more successful reconstructions than PySR for five equations and fewer for a single equation, NG-9. The superior performance of EDHiE relative to PySR indicates that evolutionary optimization is more efficient in the latent space than in the space of expression trees. For four equations, EDHiE achieves successful reconstruction more often than DSO. For two equations, NG-3 and NG-4, DSO achieves success twice as often as our method.

Finally, note that the experiments on the Nguyen benchmark were performed on noise-free synthetic data. The results of the experiments on synthetic data with added noise, reported in Appendix 3, show that EDHiE is robust to noise: increasing the noise level has little effect on the reconstruction success rate, while it significantly increases the rank of the successfully reconstructed equation in the list of evaluated equations, sorted with respect to increasing RMSE. Appendix 3 also includes additional results on the Nguyen benchmark obtained by randomly sampling the CVAE, GVAE, and HVAE latent spaces with varying numbers of dimensions.

Table 7 Results of EDHiE on the 16 equations from the Feynman database that include at most two non-target variables

4.2.4 Results on the Feynman equations

In this section, we evaluate the ability of EDHiE to reconstruct real equations from the domain of physics included in the Feynman database. Table 7 presents the results of symbolic regression on a subset of 16 equations from the database with up to two non-target variables. EDHiE successfully reconstructs 13 equations in all the runs. Most of these equations are simple; thus, EDHiE explores small search spaces comprising fewer than two hundred evaluated expressions. The more complex equation FM-10 is reconstructed in five out of ten runs, exploring more than 20 thousand expressions on average. The equation FM-6 could not be reconstructed in any of the runs, since it includes the function \(\arcsin\), which is not included in our token library. Finally, EDHiE fails to reconstruct the most complex equation, FM-2.

5 Discussion and conclusion

We introduce a novel variational autoencoder for hierarchies, HVAE, that can be efficiently trained to generate valid mathematical expressions represented as expression trees. Compared to generators based on variational autoencoders for sequences, HVAE has three significant advantages. First, it consistently generates valid expressions. Second, its performance is robust even for small training sets: HVAE trained on only two thousand expressions achieves a much lower reconstruction error than sequential VAEs trained on 12 thousand examples. Third, HVAE operating in a 32-dimensional latent space has a lower reconstruction error than sequential VAEs with comparable latent spaces.

The ability of HVAE to encode mathematical expressions in a low-dimensional latent space makes it an excellent proxy for exploring the search space of candidate expressions in symbolic regression. Indeed, when performing a random search through the latent space, we achieve comparable performance with a random search through the space of candidate expressions defined by a manually crafted probabilistic grammar. EDHiE, a symbolic regression system that performs evolutionary optimization in the latent space of the HVAE, significantly outperforms methods based on random search and achieves performance comparable to the state-of-the-art symbolic regression system DSO based on a similar combination of evolutionary algorithms and deep learning. The comparison of EDHiE with PySR, a genetic programming approach operating on expression trees directly, shows the benefit of performing evolutionary optimization in the latent space.

HVAE has been used here for symbolic regression, but its potential to efficiently generate and encode hierarchies makes it useful in many different contexts, e.g., generating molecular structures or more general symbolic expressions. Analysis of its performance in these application domains is a promising direction for further research. Moreover, the ability of HVAE to learn from small corpora of expressions might prove helpful in retraining the generator after each generation of the evolutionary search, much like the iterative learning in DSO. This will narrow its focus to generating better expressions, leading to more accurate equations. In general, training the generator on expressions involved in mathematical models that have proved useful in a domain of interest will enable seamless integration and transfer of background knowledge in symbolic regression.