Class Symbolic Regression: Gotta Fit ’Em All

We introduce “Class Symbolic Regression” (Class SR), the first framework for automatically finding a single analytical functional form that accurately fits multiple data sets—each realization being governed by its own (possibly) unique set of fitting parameters. This hierarchical framework leverages the common constraint that all the members of a single class of physical phenomena follow a common governing law. Our approach extends the capabilities of our earlier Physical Symbolic Optimization (Φ-SO) framework for symbolic regression, which integrates dimensional analysis constraints and deep reinforcement learning for unsupervised symbolic analytical function discovery from data. Additionally, we introduce the first Class SR benchmark, comprising a series of synthetic physical challenges specifically designed to evaluate such algorithms. We demonstrate the efficacy of our novel approach by applying it to these benchmark challenges and showcase its practical utility for astrophysics by successfully extracting an analytic galaxy potential from a set of simulated orbits approximating stellar streams.


INTRODUCTION
Since the beginning of the scientific revolution, researchers have tried to find repeatable regularities in experiments and observations.Mathematical structures were used in this exploration, and many new ones including functions and differential equations were developed to respond to this need to model nature.Perhaps because of shared symmetries between nature and mathematics, these abstract structures have often been found to work exceedingly well in reproducing and predicting properties of the world, to the point where some have even considered whether the universe is actually mathematical at heart (Tegmark 2008).
The Symbolic Regression (SR) that the present contribution is concerned with has a long pedigree.Perhaps its most famous application was by Kepler to planetary ephemerides, thereby finding the fitting law that bears his name (Kepler 1609).This empirical law gave the observational basis upon which Newton was able to build the physical theories developed in his Principia Mathematica (Newton 1687).
Modern SR (Schmidt & Lipson 2009, 2011;Kommenda et al. 2020;Kammerer et al. 2020;Bartlett et al. 2023b;Brence et al. 2021;Jin et al. 2019;Luo et al. 2022;Tohme et al. 2023;Udrescu & Tegmark 2020;Udrescu et al. 2020;Kamienny et al. 2022;Biggio et al. 2020Biggio et al. , 2021;;Vastl et al. 2022;Kamienny et al. 2023;Martius & Lampert 2017;Brunton et al. 2016;Zheng et al. 2022;Sahoo et al. 2018;Petersen et al. 2021a;Landajuela et al. 2022;Holt et al. 2023;Scholl et al. 2023;Sousa et al. 2024;Fiorini et al. 2024;Shojaee et al. 2024;Zhang & Lei 2024;Cheng & Alkhalifah 2024;He et al. 2024a;Makke & Chawla 2022;Angelis et al. 2023;Faris et al. 2024;He et al. 2024b;Tian et al. 2024;Michishita 2024;Melching et al. 2024;Meidani et al. 2024;Li et al. 2024a,b; Figure 1.Class Symbolic Regression framework sketch: searching for a unique functional form simultaneously fitting multiple datasets.The process starts at the left hand side, a batch of trial class analytical expressions are generated using our Φ-SO framework (Tenachi et al. 2023a).The free parameters appearing in those expressions are then optimized in a dataset-specific manner i.e. allowing each dataset to have its own unique associated values for each parameter.The neural network used to generate the trial expressions is then reinforced based on the fit quality of the trial symbolic functions.This process is repeated until convergence.Chen et al. 2024) aims to use the immense computational resources at our disposal to search through possible analytic descriptions in terms of a set of functions and operators (e.g.x, +, −, ×, /, sin, cos, exp log, ...) to best fit some numerical dataset (x, y) we wish to model.Concretely, one seeks some analytic function f : R n −→ R that fits y = f (x) given those data.It is worth pointing out here that the search space becomes exponentially larger the longer the analytic expression is that we seek to find.Hence the key to SR is to develop efficient schemes to search through the possibilities, and most importantly, to prune out poor choices.
Our modern computational abilities have allowed us to examine nature in unprecedented quantitative detail, with cameras, spectrographs and other detectors amassing vast quantities of numerical data.It is likely that the clues to next-generation physics and understanding lie therein, and so we are tasked to devise methodologies capable of handling this wealth of information and translating it into coherent, interpretable and intelligible physical models.The promise of SR is that it may allow us in part to answer this need to find accurate and intelligible empirical laws in giant datasets to best capitalize on the community's observational investments.
While SR has been extensively applied in scientific research, its focus has largely been on single dataset analysis, overlooking the rich potential in examining multiple datasets linked to a singular physical phenomenon.
The present article extends our Φ-SO framework (presented in Tenachi et al. 2023a,b) further by allowing the search for a functional form that can simultaneously fit several datasets at once, each realization having (possibly) different fitting parameters.This opens up the new possibility of implementing a functional search on the properties of a class of objects.This approach is relevant across various natural sciences, but it particularly shines in astrophysics, where multiple observations of a single phenomenon are often available, providing a rich multi-dataset setup enabling us to devise 'universal' laws that apply to a range of celestial objects of interest.
In particular, we apply this new framework to the recovery of a Milky Way-like analytic galactic potential from simulated orbits that can be inferred from stellar streams.Specifically, our approach recovers a single analytical form for the energy of stellar stream members, incorporating a 'universal' term that encapsulates the dark matter distribution alongside a nuisance term that accounts for the specifics of individual streams -containing parameters allowed to have objectspecific values.Unlike traditional black-box deep learning methods, such as auto-encoders, our method generates a physically meaningful, low-dimensional model in the form of an analytical model.
The layout of the paper is as follows: In Section 2, we present the methodology of our approach.Section 3 details a first benchmark for Class SR, consisting of a series of physics problems designed to assess the performance of Class SR systems; here, we also evaluate our method against these benchmarks.In Section 4, we illustrate the practical application of our method in the more complex scenario of a Milky Way-like potential recovery from orbits.Finally, Section 5, offers a discussion and a conclusion.

METHOD
We build our "Class Symbolic Regression" (Class SR) framework on the Physical Symbolic Optimization (Φ-SO) framework for SR.This framework combines deep reinforcement learning with in situ dimensional analysis constraints to construct solutions that avoid physically nonsensical combinations of units.This algorithm currently achieves state-of-the-art performance on physics datasets, and significantly outperforms competitors on the standard Feynman SR benchmark La Cava et al. (2021) in exact symbolic recovery in the presence of even slight levels of noise (exceeding 0.1%).
Figure 1 gives an overview of our Class SR framework.Using Φ-SO we generate a batch of analytical expressions via a recurrent neural network (RNN).In these expressions, class-parameters (c) -which are shared across the entire class and have consistent values across all datasets -can appear alongside realizationspecific parameters (k).Subsequently, we optimize the free parameters appearing in each expression (c, k), assigning unique values to realization-specific parameters {k i } i<Nr for each of the N r datasets.
This optimization is conducted using the L-BFGS nonlinear optimization routine Zhu et al. (1997).Encoding our mathematical symbols with PyTorch Paszke et al. ( 2019), enables us to use PyTorch's implementation of the L-BFGS routine, which benefits from PyTorch's auto-differentiation capabilities to efficiently and simultaneously optimize both class and realization-specific parameters employing a mean squared error (MSE) cost function: MSE = 1 Nr.
2 .Where x ij are the input variables, y ij are the target values and N (i) is the number of samples which depends on the dataset.
We then use reinforcement learning to update the RNN's parameters following a risk-seeking gradient policy (Petersen et al. 2021a), as detailed in Tenachi et al. (2023a).This update is based on a reward R = (1+NRMSE) −1 that is representative of the fit quality of the trial functional form f across all datasets -evaluated using a normalized root mean squared error (NRMSE): where σ y is the standard deviation of target values evaluated across all datasets.We repeat this process until the RNN converges to a unique high quality expression and its associated parameter values simultaneously fitting all datasets.
Furthermore, the sequential nature of expression generation in our Φ-SO framework enables the incorporation of various priors regarding the resulting expressions as demonstrated in (Tenachi et al. 2023a;Bartlett et al. 2023a;Petersen et al. 2021b;Kim et al. 2021).This allows for customized constraints on the generated expressions such as adherence to the rules of dimensional analysis (which was one of the focal points of our previous study Tenachi et al. 2023a) but also simpler priors such as constraints on the number of occurrences of given parameters, the length of the expression and more.

MULTI-DATASET SR CHALLENGES
Despite existing research efforts to establish benchmarks for SR (La Cava et al. 2021;Matsubara et al. 2022;Marinescu et al. 2023;Graham et al. 2013), a benchmark tailored specifically for Class SR has yet to be developed, reflecting the innovative nature of this approach.To address this, we introduce our own Class SR challenges, designed to evaluate a system's ability to analyze multiple datasets.These datasets represent varied observations of a similar phenomenon occurring at different scales but governed by a consistent functional form.Table 1 outlines these challenges, each focusing on accurately recovering the symbolic expression from synthetic datasets having varied scale parameter values.To heighten the challenge, we include multiple scenarios incorporating class parameters that are common to all realizations in addition to other realization-specific parameters.
We evaluate our algorithm by randomly sampling 10 datasets of 10 2 samples for each of the 8 challenges described in Table 1 and allowing a maximum of 200, 000 expressions to be explored during each run.In order to ensure robustness, for each challenge, the procedure was repeated 5 times, each time with a unique random seed, and the recovery rates were subsequently averaged.The whole benchmark tests were conducted across four noise levels: 0%, 0.1%, 1% and 10% with noise being added individually to each dataset as per the SRBench (La Cava et al. 2021) standardized SR benchmarking protocol : where γ is the level of noise.We conduct runs having access to a single dataset (SR) and having access to all # Challenge Formula Variables Realization-specific free parameters Class Symbolic Regression challenges.Each row details a distinct challenge, with the objective being the exact symbolic recovery of the designated target expression using multiple synthetic datasets.Each dataset being generated using unique realization-specific parameter sets sampled from the given parameter ranges by sampling from the target expression within the given variable ranges.
We run our algorithm using the hyper-parameters detailed in Tenachi et al. (2023a), but with dimensional analysis disabled to ensure a fair comparison with other algorithms (as a consequence the batch size is lowered to 2000).This adjustment allows future comparisons with our system to be focused solely on the machine learning technique used (here reinforcement learning), rather than the problem simplification achieved through dimensional analysis.We allow the use of the following operations: {+, −, ×, /, 1/□, √ □, □ 2 , −□, exp, log, cos, sin}, a constant equal to one {1}, two adjustable realization specific free constants k = {k 1 , k 2 } allowed to have datasetspecific values and one adjustable class free constant c = {c 1 }.The recovery rate is evaluated by examining each expression in the Pareto front, which contains optimum expressions found in conciseness / accuracy i.e. : best fitting expressions at each level of complexity generated by our algorithm.Successful recovery is noted if an expression on the Pareto front is symbolically equivalent to the target expression.Exact symbolic recovery is assessed by formally comparing these expressions with the target expression using the SymPy library for symbolic mathematics Meurer et al. (2017), following a methodology similar to the one in the SRBench (La Cava et al. 2021).Specifically, expressions are deemed equivalent if their fraction is symbolically equivalent to 1 or a constant or if their difference is symbolically equivalent to 0 or a constant.
Figure 2 presents a comparison of exact symbolic recovery rates between our Class SR framework and the traditional SR approach under both noiseless and noisy conditions using an SRBench-style benchmarking pipeline, with detailed challenge-by-challenge results published online (see Section 5).Our results demonstrate the superiority of Class SR over traditional SR in exact symbolic recovery, particularly in noisy scenarios where noise overfitting is generally an important concern (La Cava et al. 2021).
While one might consider employing traditional SR individually on each dataset and subsequently aggregating the results, this approach would not only be substantially more computationally demanding, but it would also fail to differentiate class constants from realization-specific scale parameters, thus yielding a less interpretable model.Furthermore, our analysis uncovers several instances where traditional SR did not successfully identify the correct expression in any of the 5 attempts but in which Class SR effectively discovered the correct expressions.This concerns Problem #3 and #6 at 10% noise level scenarios, as well as Problem #5 across all noise levels.These findings highlight the superior robustness and efficiency of Class SR over traditional methods.
Following the SRBench protocol, we also include, on Figure 2, the rate of accurate expressions (having R 2 > 0.999) with the R 2 metric defined as We evaluate fit quality by refitting all constants of candidate expressions on newly generated previously unseen test datasets.This approach ensures a fair comparison between Class SR expressions, whose numerical parameters must accommodate multiple observations, and expressions derived from traditional SR, which only fit a single observation.Our results demonstrate that Class SR is not only more efficient at recovering the exact expressions but also more effective at deriving accurate approximations than traditional SR, in scenarios with noise levels exceeding 0.1%.

RECOVERING AN ANALYTIC POTENTIAL FORM STELLAR STREAMS
We now turn to an astrophysical application of the algorithm: to try to find the underlying potential of a gravitational system from a set of orbit segments within it.This could be practically applicable for finding an analytic potential model of a galaxy from a set of stellar streams.These linear structures form from the tidal dissolution of globular clusters and dwarf satellite galaxies.When their progenitors are of low mass, the escaping stars have similar energy to the progenitor, and therefore follow a similar orbit.Hence stellar streams approximate orbits in the host galaxy.As has recently been shown by Ibata et al. (2023), for many real streams one can calculate a "correction function" to convert an orbit model into a stream track, and these functions are relatively insensitive to the adopted potential.This procedure can be inverted to give the orbit from the stream.
For this test we imagine having access to full 6dimensional phase-space measurements of a sample of streams.For each structure i, the kinetic energy per unit mass E i,kin (x) is simply: The total energy per unit mass E i t , which is constant, but different, for each stream, can be considered to be nuisance terms in our search for the underlying potential Φ.
We run our algorithm with the objective of recovering the analytic form for E i,kin (x).
We use the the hyper-parameters detailed in Tenachi et al. (2023a), allowing the use of the following operations: {+, −, ×, /, 1/□, √ □, □ 2 , −□, exp, log, }, a constant equal to one {1}, one adjustable realization specific free constant (having units of energy) and three adjustable class free constants (one having units of energy, one having length units and the other being dimensionless).
Again we conduct runs at four noise levels (0%, 0.1%, 1% and 10%), having access to a single orbit (SR), 25% of the orbits, 50% of the orbits and 100% of the orbits (Class SR), repeating experiments 16 times with different random seeds and allowing a maximum of 250, 000 expressions to be explored during each run, leading to the total evaluation of 64, 000, 000 expressions through 256 runs.
For the present analysis we generated a sample of artificial orbit data (shown in Figure 3) that approximates the sample of 29 thin and long streams studied by Ibata  2023).To this end we used the present day progenitor positions estimated by Ibata et al. (2023), and integrated orbits within a universal (NFW) dark matter halo model (Navarro et al. 1997) that very roughly approximates the large-scale mass distribution in the Milky Way.The adopted potential ( Lokas & Mamon 2001) is where M 200 is the virial mass of the halo, g ≡ (ln(1 + c) − c/(1 + c)) −1 is a function of the halo concentration c and R is the scale radius.We chose M 200 = 10 12 M ⊙ , c = 10 and R = 20.0 kpc.The orbits consist of 100 phase space points at locations between ±1 Gyr from the current progenitor location.
Figure 4 presents the results of our analysis in terms of exact symbolic recovery and fit quality, evaluated using the R 2 metric.This metric was determined by refitting candidate expressions on noiseless test data and computing the median across various random seeds.
As anticipated, our results underscore that utilizing more realizations during the SR process significantly enhances model accuracy and the likelihood of exact symbolic recovery.This trend is particularly evident as noise levels rise, reinforcing our findings of Section 3. Notably, at a 1% noise level, none of the 16 runs that analyzed stellar stream individually succeeded in recovering the correct functional form.In contrast, when all 29 stellar streams were utilized, the correct functional form was identified nearly half of the time, showcasing the advantages of Class SR under noisy conditions.
We observe that the inability of our algorithm to recover the exact symbolic expression in the presence of 10% noise can be attributed to the fact that, under such high noise conditions, the difference in fit quality between the expressions typically identified by our algorithm and the true solution yields only a minimal improvement in terms of reward, ∆R ∼ 10 −5 .This minute improvement, which corresponds to a difference in R 2 of approximately (10 −6 ), is the sole metric available to guide the algorithm, as it operates on a trial-and-error basis.Unfortunately, such a small difference often remains undetected due to it falling below the tolerance threshold of the free constants optimization procedure.This scenario highlights a known intrinsic limitation of purely empirical SR, where degeneracies in the space of functional forms can go undetected.
Excluding scenarios where noise levels render the numerically found expression indistinguishable from the true solution, our Class SR algorithm typically converges toward the correct functional form by exploring under 250, 000 expressions, despite the presence of multiple alternative functional forms that provide a nearperfect fit to individual streams.Φ-SO identifies an offset parameter specific to each stream (corresponding to E i t ) and a functional form parameterized by classparameters common to all streams corresponding to Φ N F W .These results show that our algorithm can effectively recover a concise intepretable model for a Milky-Way like potential in the form of an analytic expression based solely on stellar positions and velocities without any prior information about the system.

DISCUSSION AND CONCLUSIONS
We presented a first framework for discovering symbolic analytical functions that simultaneously fit multiple datasets by allowing for (possibly) unique datasetspecific parameter values.This new framework which we dub "Class Symbolic Regression" is built upon our earlier Φ-SO framework which already delivers state-ofthe-art performances in symbolic recovery in the presence of noise.We demonstrated the efficacy of Class SR through simple textbook physics examples which we compiled into a first Class SR benchmark, finding better performance in exact symbolic recovery over traditional SR, especially in noisy situations.Additionally, we applied our method to a more complex astrophysical scenario, successfully rediscovering an NFW galaxy potential model from orbits approximating stellar streams.
Regular SR, when applied to a single dataset, often risks overfitting to specific characteristics of an observation, such as observational biases or transient events, and noise.In contrast, our Class SR framework should facilitate the finding of universal analytical laws that apply to a range of observations within a single class of physical phenomena.This enables our framework to model the underlying physics rather than the specifics of individual observations, with dataset-specific free parameters modeling the unique aspects of each observation.For instance, an application within galactic dynamics that we intend to explore in a future contribution is the analysis of galactic rotation curves.Here, a universal law derived through Class SR could provide insights into the general behavior of dark matter, whereas traditional SR, if applied to a single galaxy, might merely find the specific attributes of that galaxy.
It should be noted that while Class SR might superficially resemble regular SR applied to unbalanced datasets with dataset-specific parameters being akin to additional input variables, this comparison is not entirely accurate.In Class SR, these additional degrees of freedom represent unknown values that must be determined, differentiating it as a distinct problem with its own unique challenges.
A persistent issue in SR is model selection as the correct expression can often be overlooked in favor of those that fit better or are less complex (these concerns led to e.g., the development of single objective criterion Bartlett et al. 2023b).Our framework, by searching for expressions that fit multiple datasets, effectively utilizes information about the physical phenomena's class structure.This approach significantly mitigates model selection challenges, helping avoid incorrect model choices influenced by dataset-specific peculiarities.In addition, exploiting multiple datasets with regular SR techniques would require fitting the individual datasets independently, and then identifying the solutions in common between the objects, which may not be possible if the measurements are uncertain, would be computationally inefficient and would result in lower performances in exact symbolic recovery and fit quality alike in the presence of noise.
Finally, we note that after the first submission of our paper, another Class SR approach built on Operon (Kommenda et al. 2020) -a genetic algorithm approach to SR -was applied to supernovae photometry in Russeil et al. (2024).
In future work, we intend to improve on the machine learning aspects of our method to more effectively leverage multiple datasets.As each dataset might distinctly highlight certain symbolic terms or subexpressions more prominently than others, a promising strategy could be to periodically shift the neural network's training emphasis between datasets.This technique could potentially refine the performance of Class SR by sequentially learning different segments of the expression, rather than attempting to learn the entire expression simultaneously, thereby facilitating the learning process.

CODE AVAILABILITY
The documented code for the Φ-SO algorithm, along with demonstration notebooks, benchmark and results analysis pipelines is accessible on GitHub at github.com/WassimTenachi/PhySO , complete with comprehensive documentation.A frozen version related to this work on Class Symbolic Regression is released under tag v1.1.0and deposited on zenodo: 10.5281/zenodo.11663147.
We offer to the community a convenient interface for using our Class SR benchmark, running: pb = ClassProblem(i) will instantiate challenge i ∈ {0, 1, ..., 7} of the Class benchmark presented in Table 1.This interface offers simple ways to generate data points (via pb.generate data points) and compare a candidate expression to the target (via pb.get sympy).
In addition, we include challenge-by-challenge and run-by-run performances results tables: see PhySO/ benchmarking/ClassBenchmark/results for results pertaining to the Class SR benchmark and PhySO/ demos/demos_class_sr/demo_milky_way_streams/ results for results pertaining to the stellar stream problem.
Finally, for the sake of result reproducibility, we offer a straightforward method to replicate the outcomes presented in Figure 2 by simply executing the following command: python classbench run.py--equation i --noise n --n reals Nr.This command will run PhySO on challenge number i ∈ {0, 1, ..., 7} of the Class benchmark presented in Table 1, employing a noise level of n ∈ [0, 1] and exploiting Nr ∈ N realizations.We also include the script we used to estimate performances post-run : classbench results analysis.py Similarly, we offer a straightforward method to replicate the outcomes presented in Figure 3 by simply executing the following command: python MW streams run.py--noise n --frac real fr.This command will run PhySO on the stellar stream problem described in Section 4, employing a noise level of n ∈ [0, 1] and exploiting a fraction of fr ∈ [0, 1] realizations.Again, we include our results analysis script: MW streams results analysis.py . ', ), Φ e.g.+ = -./ .cos()5 + Φ)

Figure 3 .
Figure3.Synthetic stellar stream data utilized by our algorithm to recover the galactic potential.The left and middle panels display the spatial positions of stream members relative to the Milky Way, while the right panel illustrates the kinetic energy of these members as a function of their radial distance from the galactic center.etal. (2023).To this end we used the present day progenitor positions estimated byIbata et al. (2023), and integrated orbits within a universal (NFW) dark matter halo model(Navarro et al. 1997) that very roughly approximates the large-scale mass distribution in the Milky Way.The adopted potential( Lokas & Mamon 2001) is

Figure 4 .
Figure 4.This figure presents the exact symbolic recovery rate and median R 2 achieved by our Class SR algorithm in the task of recovering an NFW dark matter halo model (Navarro et al. 1997) from synthetic datasets of stellar stream positions and velocities.The performance metrics are displayed as functions of noise levels and the number of realizations exploited.The edge case, in which a single realization is used, corresponds to the conditions of traditional SR.The results distinctly demonstrate that Class SR substantially outperforms traditional SR, particularly in noisy environments.