Ensemble-SINDy: Robust sparse model discovery in the low-data, high-noise limit, with active learning and control

Sparse model identification enables the discovery of nonlinear dynamical systems purely from data; however, this approach is sensitive to noise, especially in the low-data limit. In this work, we leverage the statistical approach of bootstrap aggregating (bagging) to robustify the sparse identification of the nonlinear dynamics (SINDy) algorithm. First, an ensemble of SINDy models is identified from subsets of limited and noisy data. The aggregate model statistics are then used to produce inclusion probabilities of the candidate functions, which enables uncertainty quantification and probabilistic forecasts. We apply this ensemble-SINDy (E-SINDy) algorithm to several synthetic and real-world datasets and demonstrate substantial improvements to the accuracy and robustness of model discovery from extremely noisy and limited data. For example, E-SINDy uncovers partial differential equations models from data with more than twice as much measurement noise as has been previously reported. Similarly, E-SINDy learns the Lotka Volterra dynamics from remarkably limited data of yearly lynx and hare pelts collected from 1900 to 1920. E-SINDy is computationally efficient, with similar scaling as standard SINDy. Finally, we show that ensemble statistics from E-SINDy can be exploited for active learning and improved model predictive control.


Introduction
Data-driven model discovery enables the characterization of complex systems where first principles derivations remain elusive, such as in neuroscience, power grids, epidemiology, finance and ecology. A wide range of data-driven model discovery methods exist, including equationfree modelling [1], normal form identification [2][3][4], nonlinear Laplacian spectral analysis [5], Koopman analysis [6,7] and dynamic mode decomposition (DMD) [8][9][10], symbolic regression [11][12][13][14][15], sparse regression [16,17], Gaussian processes [18], combining machine learning and data assimilation [19,20], and deep learning [21][22][23][24][25][26][27]. Limited data and noisy measurements are fundamental challenges for all of these model discovery methods, often limiting the effectiveness of such techniques across diverse application areas. The sparse identification of nonlinear dynamics (SINDy) [16] algorithm is promising, because it enables the discovery of interpretable and generalizable models that balance accuracy and efficiency. Moreover, SINDy is based on simple sparse linear regression that is highly extensible and requires significantly less data in comparison with, for instance, neural networks. In this work, we unify and extend innovations of the SINDy algorithm by leveraging classical statistical bagging methods [28] to produce a computationally efficient and robust probabilistic model discovery method that overcomes the two canonical failure points of model discovery: noise and limited data.
Model discovery algorithms are sensitive to noise because they rely on the accurate computation of derivatives, which is especially challenging for PDEs where noise can be strongly amplified when computing higher-order spatial derivatives. There have been two key innovations to improve the noise robustness of SINDy: control volume formulations and ensemble methods. The integral formulation of SINDy [50] has proven powerful, enabling the identification of PDEs in a weak form that averages over control volumes, which significantly improves its noise tolerance. This approach has been used to discover a hierarchy of PDE models for fluids and plasmas [37,[51][52][53][54]64,65]. Several works have begun to explore ensemble methods to robustify data-driven modelling, including the use of bagging for DMD [66], ensemble-Lasso [67], subsample aggregating for improved discovery [61,68], statistical learning of PDEs to select model coefficients with high importance measures [69] and improved discovery using ensembles based on subsampling of the data [51,52,61,65]. Also, symbolic regression methods [11][12][13] and spectral proper orthogonal decomposition (SPOD) [70]  building blocks (library terms) and then recombining building blocks (equations or library terms) that best model the experimental data. In SPOD, a modal decomposition method closely related to DMD, optimally averaged DMD modes are obtained from an ensemble DMD problem. Thus, both these methods naturally include ensembling ideas.
When dealing with noise-compromised data, it is also critical to provide uncertainty estimates of the discovered models. In this direction, recent innovations of SINDy use sparse Bayesian inference for probabilistic model discovery [57][58][59][60]. Such methods employ Markov Chain Monte Carlo, which is extremely computationally intensive. These extensions have all improved the robustness of SINDy for high-noise data, although they have been developed largely in isolation and they have not been fully characterized, exploited and/or integrated.
In this work, we unify and extend recent innovations in ensembling and the weak formulation of SINDy to develop and characterize a more robust ensemble-SINDy (E-SINDy) algorithm. Furthermore, we show how this method can be useful for active learning and control. In particular, we apply b(r)agging 1 to SINDy to identify models of nonlinear ODEs of the form with state u ∈ R n and dynamics f(u), and for nonlinear PDEs of the form 2) with N(·) a system of nonlinear functions of the state u(x, t), its derivatives and parameters μ; partial derivatives are denoted with subscripts, such that u t := ∂u/∂t. We show that b(r)agging improves the accuracy and robustness of SINDy. The method also promotes interpretability through the inclusion probabilities of candidate functions, enabling uncertainty quantification. Importantly, the ensemble statistics are useful for producing probabilistic forecasts and can be used for active learning and nonlinear control. We also demonstrate library E-SINDy, which subsamples terms in the SINDy library. E-SINDy is computationally efficient compared with probabilistic model identification methods based on Markov Chain Monte Carlo sampling [60], which take several hours of CPU time to identify a model. By contrast, our method identifies models and summary statistics in seconds by leveraging the sparse regression of SINDy with statistical bagging techniques. Indeed, E-SINDy has similar computational scaling to standard SINDy. This method applies under the same conditions as the standard SINDy algorithm, where it is assumed that all relevant variables are measured at a sufficient temporal and spatial resolution so as to approximate derivatives. We investigate different ensemble methods, apply them to several synthetic and real-world datasets, and demonstrate that E-SINDy outperforms existing sparse regression methods, especially in the low-data and high-noise limit. A schematic of E-SINDy is shown in figure 1. We first describe SINDy for ODEs and PDEs in §2, before introducing the E-SINDy extension in §3, and discussing applications to challenging model discovery, active learning and control problems in §4.

Background
Here, we describe SINDy [16], a data-driven model discovery method to identify sparse nonlinear models from measurement data. First, we introduce SINDy to identify ODEs, followed by its generalization to identify PDEs [17,29].

(a) Sparse identification of nonlinear dynamics
The SINDy algorithm [16] identifies nonlinear dynamical systems from data, based on the assumption that many systems have relatively few active terms in the dynamics f in equation (   (a-c) Schematic of the E-SINDy framework. E-SINDy exploits the statistical method of bootstrap aggregating (bagging) to identify ordinary and partial differential equations that govern the dynamics of observed noisy data. First, sparse regression is performed on bootstraps of the measured data, or on the library terms in case of library bagging, to identify an ensemble of SINDy models. The mean or median of the coefficients are then computed, coefficients with low inclusion probabilities are thresholded, and the E-SINDy model is aggregated and used for forecasting. (Online version in colour.) parsimonious models that automatically balance model complexity with accuracy [16]. We first measure m snapshots of the state u in time and arrange these into a data matrix Next, we compute the library of D candidate nonlinear functions Θ(U) ∈ R m×D This library is constructed to include any functions that might describe the data, and this choice is crucial. The underlying dynamical system is unknown and we cannot guarantee that the dynamics are well described by the span of the library. Therefore, the recommended strategy is to start with a basic choice, such as low-order polynomials, and then increase the complexity and order of the library until sparse and accurate models are obtained. We must also compute the time derivatives of the state U t = [u 1u2 · · ·u m ] T , typically by numerical differentiation. We therefore need a suitable data sampling time that allows for the computation of the time derivatives, which may limit the applicability of the SINDy algorithm for certain datasets with coarse or uneven sampling in time. The system in equation (1.1) may then be written in terms of these data matrices Each entry in Ξ ∈ R D×n is a coefficient corresponding to a term in the dynamical system. Many dynamical systems have relatively few active terms in the governing equations. Thus, we may employ sparse regression to identify a sparse matrix of coefficients Ξ signifying the fewest active terms from the library that result in a good model fit The regularizer R(Ξ ) is chosen to promote sparsity in Ξ . For example, sequentially thresholded least-squares (STLS) [16] uses R(Ξ ) = λ||Ξ || 0 with a single hyperparameter λ, whereas sequentially thresholded ridge regression (STRidge) [17] uses R(Ξ ) = λ 1 ||Ξ || 0 + λ 2 ||Ξ || 2 with two hyperparameters λ 1 and λ 2 . STLS was first introduced to discover ODEs and STRidge was introduced to discover PDEs where data can be highly correlated and STLS tends to perform poorly. There are several other recently proposed regularizers and optimization schemes [49,71,72]. We illustrate STRidge in pseudo code algorithm 1, noting that STRidge reduces to STLS for λ 2 = 0.

(b) Discovering PDEs
SINDy was recently generalized to identify PDEs [17,29] in the partial differential equation functional identification of nonlinear dynamics (PDE-FIND) algorithm. PDE-FIND is similar to SINDy, but with the library including partial derivatives. Spatial time-series data are arranged into a column vector U ∈ R mn , with data collected over m time points and n spatial locations. Thus, for PDE-FIND, the library of candidate terms is Θ(U) ∈ R mn×D . The PDE-FIND implementation of Rudy et al. [17] takes derivatives using finite difference for clean data or polynomial interpolation for noisy data. The library of candidate terms can then be evaluated: The time derivative U t is reshaped into a column vector and the system in equation (1.2) is written as For most PDEs, Ξ is sparse and can be identified with a similar sparsity-promoting regression STRidge improves model identification with highly correlated data that is common in PDE regression problems. PDE-FIND is extremely prone to noise, because noise is amplified when computing high-order partial derivatives for Θ. To make PDE-FIND more noise robust, integral [50] and weak formulations [51,54] were introduced. Instead of discovering a model based on equation (1.2), the PDE can be multiplied by a weight w j (u, t) and integrated over a domain Ω k . This can be repeated for a number of combinations of w j (u, t) and Ω k . Stacking the results of the integration over different domains using different weights leads to a linear system with q 0 and Q = [q 1 , . . . , q D ] the integrated left-hand side and integrated library of candidate terms, which replace U t and the library of nonlinear functions Θ(U). As with PDE-FIND, sparse regression can be employed to identify a sparse matrix of coefficients Ξ , using STLS, STRidge or other regularizers. For all of our results, we use this weak formulation as a baseline and for the basis of ensemble models.

Ensemble SINDy
In this work, we introduce E-SINDy, which incorporates ensembling techniques into datadriven model discovery. Ensembling is a well-established machine learning technique that combines multiple models to improve prediction. A range of ensembling methods exist, such as bagging (bootstrap aggregation) [28], bragging (robust bagging) [73,74] and boosting [75,76]. Structure learning techniques such as cross-validation [77] or stability selection [78] can also be considered ensembling methods, because they combine and use the information of a collection of learners or models. For model discovery, ensembling improves robustness and naturally provides inclusion probabilities and uncertainty estimates for the identified model coefficients, which enable probabilistic forecasting and active learning.
Here, we propose two new ensemble model discovery methods: the first method is called b(r)agging E-SINDy, and the second method is called library E-SINDy. A general schematic of E-SINDy is shown in figure 1, and a schematic of the sparse regression problems for b(r)agging and library E-SINDy is shown in figure 2. Our first method, b(r)agging E-SINDy, uses data bootstraps to discover an ensemble of models that are aggregated by taking the mean of the identified model coefficients in case of bagging, and taking the median in the case of bragging. Bootstraps are data samples with replacement. Applied to SINDy to identify ODEs, we first build a library of candidate terms Θ(U) and derivatives U t . From the m rows of the data matrices U t and Θ(U), corresponding to m samples in time, we select q bootstrapped data samples and generate q SINDy models in the ensemble. For each of these q data bootstraps, m new rows are sampled with replacement from the original m rows of the data matrices. On average, each data bootstrap will have around 63% of the entries of the original data matrices, with some of these entries being represented multiple times in the bootstrap; for large m this quantity converges to 1 − e −1 ≈ 0.632, which is the limit of 1 − (1 − (1/m)) m for m → ∞. In this way, randomness and subsampling is inherent to the bootstrapping procedure. From the q identified SINDy models in the ensemble, we can either directly aggregate the identified models, or first threshold coefficients with low inclusion probability. The procedure is illustrated in algorithm 2 for bagging E-SINDy using STRidge. The same procedure applies for bragging, taking the median instead of the mean, and using other regularizers than STRidge. Note that there are other random data subsampling approaches that may be used, such as generating q models based on q random subsamples of p < m rows of the data without replacement, of which there are m p . However, boostrapping based on selection with replacement is the most standard procedure.
Schematic of SINDy and E-SINDy with b(r)agging and library bagging. Shown is a single model of the ensemble. In the case of b(r)agging, data bootstraps (data samples with replacement) are used to generate an ensemble of SINDy models.
The E-SINDy model is aggregated by taking the mean of the identified coefficients for bagging, and the median for bragging. In case of library bagging, instead of data bootstraps, library term bootstraps are sampled without replacement. Library terms with low inclusion probability are discarded and the E-SINDy model can be identified on the smaller library using standard SINDy or b(r)agging E-SINDy. (Online version in colour.) The second method proposed, library bagging E-SINDy, samples library terms instead of data pairs. We sample l out of D library terms without replacement. In case of sampling library terms, replacement does not affect the sparse regression problem. However, using smaller libraries can drastically speed up model identification, as the complexity of the least-squares algorithm is O(ml 2 ). Library bagging with small l can therefore help counteract the increased computational cost of solving multiple regression problems in the ensemble. As with bagging E-SINDy, we obtain an ensemble of models and model coefficient inclusion probabilities. We can directly aggregate the models and threshold coefficients with low inclusion probabilities to get a library E-SINDy model. We can also use the inclusion probabilities to threshold the library, only keeping relevant terms, and run bagging E-SINDy using the smaller library. This can be particularly useful if we start with a large library: we first identify and remove all library terms that are clearly not relevant and then run bagging E-SINDy on the smaller library. However, the library bagging inclusion probability threshold needs to be selected carefully to not remove relevant terms from the library. We show a pseudo code of library bagging E-SINDy in algorithm 3. E-SINDy provides inclusion probabilities and uncertainty estimates for the discovered model coefficients, thus connecting to Bayesian model identification techniques. The identified ensemble of model coefficients can be used to compute coefficient probability density functions, which form a posterior distribution p(Ξ |X). In terms of forecasting, we can either use the aggregated mean or median of the identified coefficients to forecast, or we can draw from multiple identified SINDy models to generate ensemble forecasts that represent posterior predictive distributions p(x(t)|X) that provide prediction confidence intervals.

Results
We now apply E-SINDy to challenging synthetic and real-world datasets to identify ODEs and PDEs. We apply library bagging E-SINDy to a real-world ecological dataset, showing its performance in the very low data limit. For PDEs, we use the recent weak-SINDy (WSINDy) [54] as a baseline and show the improved noise robustness when using E-SINDy for identifying a range of PDEs. Trends for the noise and data length sensitivity of bagging, bragging and library bagging to identify the chaotic Lorenz system dynamics are presented in appendix A.

(a) ODEs
We apply E-SINDy to a challenging real-world dataset from the Hudson Bay Company, which consists of the yearly number of lynx and hare pelts collected from 1900 to 1920. These pelt counts are thought to be roughly proportional to the population of the two species [79]. Lynx are predators whose diet depends on hares. The population dynamics of the two species should, therefore, be well approximated by a Lotka-Volterra model. There are several challenges in identifying a SINDy model from this dataset: there are only 21 data points available, and there is large uncertainty in the measurements arising from weather variability, consistency in trapping and other changing factors over the years measured. In figure 3, we show that E-SINDy correctly identifies the Lotka-Volterra dynamics, providing model coefficient and inclusion probabilities and confidence intervals for the reconstructed dynamics. We use library bagging, followed by   x 2 x 1 inclusion probability model reconstruction bagging using a library of polynomials up to third order, to identify a sparse model in this very low data limit with only 21 data points per species. Similar results for the lynx-hare dataset were recently published using a probabilistic model discovery method [60] based on sparse Bayesian inference. This approach employed Markov Chain Monte Carlo, for which the computational effort to generate a probabilistic model is comparably high, taking several hours of CPU time. By contrast, E-SINDy takes only seconds to identify a model and its coefficient and inclusion probabilities.

(b) PDEs
In this section, we present results applying E-SINDy to discover PDEs from noisy data. We use the recent WSINDy implementation [54] as the baseline model for ensembling. WSINDy was successfully applied to identify models in the high-noise regime using large libraries. We perform library bagging on the system of equation (2.8) instead of equation (2.6), and refer to the resulting method as ensemble weak SINDy (E-WSINDy).
We apply E-WSINDy to identify PDEs from synthetic data for the inviscid Burgers, Korteweg de Vries, nonlinear Schroedinger, Kuramoto-Sivashinsky and reaction-diffusion equations. Details on the numerical methods for creating the data are discussed in appendix B and in our E-SINDy data and code repository. We quantify the accuracy and robustness of the model identification by assessing the success rate and model coefficient errors for a number of noise realizations. The success rate is defined as the rate of identifying the correct non-zero and zero terms in the library, averaged over all realizations. The model coefficient error E c quantifies how much the identified coefficientsΞ deviate from the true parameters Ξ that we use to generate the data: The results are summarized in figure 4.

(c) Exploiting ensemble statistics for active learning
We now present results exploiting the ensemble statistics for active learning [80,81]. Active learning is a machine learning method that can reduce training data size while maintaining accuracy by actively exploring regions of the feature space that maximally inform the learning process [82,83]. This can be particularly effective for systems with large feature spaces that are expensive to explore, such as in biological systems or high-dimensional systems with control. In biological systems, collecting samples can be time consuming and expensive, but it may be possible to initialize specific new initial conditions of the system. Active learning can inform the selection of these initial conditions for improved data efficiency of the learning process and model discovery. For control problems, similarly, exploration of large feature spaces may be expensive, such as repeatedly performing robotics experiments. Active learning can enable data-efficient exploration by the guided collection of relevant and descriptive data that optimally supports the model discovery process of the controlled robotic system. Here, we leverage the ensemble statistics of E-SINDy to identify and sample high-uncertainty regions of phase space that maximally inform the sparse regression. In E-SINDy, we collect data from a single initial condition or from multiple randomly selected initial conditions and identify a model in one shot. Instead, we can successively identify E-SINDy models and exploit their ensemble statistics to identify new initial conditions with high information content, which can improve the data efficiency of the model discovery process. The basic idea is to compute ensemble forecasts from a large set of initial conditions using successively improved E-SINDy models and only explore regions with high ensemble forecast variance. Our simple but effective active E-SINDy approach iteratively identifies models in three steps: (1) collecting a small amount of randomly selected data to identify an initial E-SINDy model; (2) selecting a number of random initial conditions and computing the ensemble forecast variance for each initial condition using the current E-SINDy model; and (3) sampling the true system with the initial condition with highest variance. Finally, we concatenate the newly explored data to the existing dataset to identify a new E-SINDy model, and continue the model identification until the model accuracy and/or variance of the identified model coefficients converge. Here, we test active E-SINDy on the Lorenz system dynamics introduced in figure 1 and appendix A and show the results in figure 5. In figure 5a, we show five illustrative ensemble forecasts from different initial conditions after initializing the algorithm. In total, at each iteration, we compute ensemble forecasts at 200 different initial conditions. We found that at each initial condition, integrating a single time step forward in time is informative enough to compute ensemble forecasting variance. In figure 5b, we show the probability density functions of the identified model coefficients at initialization have wide distributions, and after 80 active learning steps the variance of the distributions is significantly reduced. Figure 5c also shows the improved data efficiency of the model discovery using active learning E-SINDy compared with E-SINDy. Through active E-SINDy, we reduce the variance of the identified model coefficients, increase the success rate of identifying the correct model structure and reduce the model coefficient error compared with standard E-SINDy.

(d) Ensemble SINDy model predictive control
It is also possible to use E-SINDy to improve model predictive control (MPC) [84][85][86]. MPC is a particularly compelling approach that enables control of strongly nonlinear systems with constraints, multiple operating conditions and time delays. The major challenge of MPC lies in the development of a suitable model. Deep neural network models have been increasingly used for deep MPC [87,88]; however, they often rely on access to massive datasets, have limited ability to generalize, do not readily incorporate known physical constraints and are computationally expensive. Kaiser et al. [46] showed that sparse models obtained via SINDy perform nearly as well with MPC, and may be trained with relatively limited data compared to a neural network. Here, we show that E-SINDy can further reduce the training data requirements compared to SINDy, enabling the control of nonlinear systems in the very low data limit.
We use E-SINDy to identify a model of the forced Lorenz system dynamics and use MPC to stabilize one of the two unstable fixed points (± √ 72, ± √ 72, 27). The Lorenz system is introduced in figure 1 and we add a control input u to the first state of the dynamics:ẋ = σ (y − x) + u. The  control problem is based on Kaiser et al. [46]. We describe the MPC problem in more detail in appendix C. In figure 6, we show the performance of MPC based on E-SINDy models for different training data length and noise = 0.01. In figure 6a, we show the sensitivity of the mean MPC cost to training data length. We run 1000 noise realizations and average the mean MPC cost of all runs. Figure 6b shows trajectories of the controlled Lorenz system for models trained with E-SINDy and SINDy, using 50 and 150 time-step data points. E-SINDy significantly improves the MPC performance in the low data limit compared with SINDy.

Discussion
This work has developed and demonstrated a robust variant of the SINDy algorithm based on ensembling. The proposed E-SINDy algorithm significantly improves the robustness and accuracy of SINDy for model discovery, reducing the data requirements and increasing noise tolerance. E-SINDy exploits foundational statistical methods, such as bootstrap aggregating, to identify ensembles of ODEs and PDEs that govern the dynamics from noisy data. From this ensemble of models, aggregate model statistics are used to generate inclusion probabilities of candidate functions, which promotes interpretability in model selection and provides probabilistic forecasts. We show that ensembling may be used to improve several standard SINDy variants, including the integral formulation for PDEs. Combining ensembling with the integral formulation of SINDy enables the identification of PDE models from data with more than twice as much measurement noise as has been previously reported. These results are promising for the discovery of governing equations for complex systems in neuroscience, power grids, epidemiology, finance or ecology, where governing equations have remained elusive. Importantly, the computational effort to generate probabilistic models using E-SINDy is low. E-SINDy produces accurate probabilistic models in seconds, compared with existing Bayesian inference methods that take several hours. Library bagging has the additional advantage of making the least-squares computation more efficient by sampling only small subsets of the library. E-SINDy has also been incorporated into the open-source PySINDy package [62,63] to promote reproducible research.
We also present results exploiting the ensemble statistics for active learning and control. Recent active exploration methods [89] and active learning of nonlinear system identification [90] suggest exploration techniques using trajectory planning to efficiently explore high uncertainty regions of the feature space. We use the uncertainty estimates of E-SINDy to explore high uncertainty regions that maximally inform the learning process. Active E-SINDy reduces the variance of the identified model coefficients, increases the success rate of identifying the correct model structure and reduces the model coefficient error compared with standard E-SINDy in the low data limit. Finally, we apply E-SINDy to improve nonlinear MPC. SINDy was recently used to generate models for real-time MPC of nonlinear systems. We show that E-SINDy can significantly reduce the training data required to identify a model, thus enabling control of nonlinear systems with constraints in the very low data limit. An exciting future extension of the computationally efficient probabilistic model discovery is to combine the active learning and MPC strategies based on E-SINDy. An important avenue of future work may also explore active sampling that is constrained by the physical limitations of a given simulation or experiment. Highly efficient exploration and identification of nonlinear models may also enable learning task-agnostic models that are fundamental components of model-based reinforcement learning.
Data accessibility. This paper contains no experimental data. The E-SINDy method is implemented in the opensource software PySINDy, which is available at https://github.com/dynamicslab/pysindy. Additionally, the data and code to reproduce the results are available at https://github.com/urban-fasel/EnsembleSINDy.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration. We declare we have no competing interests.

Appendix A. E-SINDy trends compared with standard SINDy
Here, we evaluate noise and data length sensitivity of E-SINDy compared with the standard SINDy implementation with no innovations for noise robustness. This case does not demonstrate the maximum noise robustness, but rather provides a simple and intuitive introduction to how E-SINDy compares with a baseline SINDy algorithm. E-SINDy can be used with most SINDy variants, and similar performance increases are found. For example, using integral SINDy or computing derivatives using the total variation derivative will dramatically improve the baseline robustness and therefore the E-SINDy robustness, as shown in §4b.
We define three commonly used metrics of success for identifying sparse nonlinear models: (1) the structure of the identified model, i.e. identifying the correct zero and non-zero terms in library; (2) the error of the identified model coefficients; and (3) the accuracy of predicted trajectories. The accuracy of a predicted trajectory is highly sensitive to small differences in the model coefficients, especially for chaotic systems. Therefore, we assess the structure of the identified model and the error of the identified coefficients quantitatively, and assess the accuracy of the forecast qualitatively. In figure 7, we compare the noise level and data length sensitivity of SINDy with bagging, bragging and library bagging E-SINDy, for data from the Lorenz system. We generate trajectories from the Lorenz system with system parameters shown in figure 1 and initial condition u 0 = [− 8,7,27], for a range of data length. We then add Gaussian white noise with zero mean and variance σ/||u|| rms , where ||u|| rms is the root mean squared value of the trajectory data. We test each noise level and data length case with 1000 random realizations of noise, compute the success rate of identifying the correct model structure in Ξ and calculate the mean error of the identified model coefficients. The sparsity promoting hyperparameter is set to λ = 0.2 for all methods. The number of models in the ensemble is set to q = 100. The inclusion  probability used to threshold model coefficients is set to tol = 0.6 for b(r)agging and tol = 0.4 for library bagging. We choose a smaller threshold for library bagging because we perform bagging after library bagging. Thus, the goal is not to aggressively threshold the identified coefficients, but only to reduce the initial library size to simplify the complexity of the subsequent bagging E-SINDy. We found that this strategy can improve the success rate and reduce the model error.
Additionally, in the case of library bagging, we define the number of library terms in each sample to be l = 0.6D. The results for the different ensemble methods over noise level and data length are shown in figure 7. We see that ensembling clearly improves the accuracy and robustness of the model discovery, enabling model identification with less data and more noise than standard SINDy. As expected, bragging further robustifies bagging, and library bagging also improves accuracy and robustness. Additionally, library bagging has the advantage that we can consider larger libraries, discarding irrelevant terms in the library before applying b(r)agging E-SINDy. In the low data limit with little or no noise, we see that bagging does not significantly improve accuracy, and can even perform worse than standard SINDy. This is caused by outliers in the ensemble of identified models that decrease the accuracy of the aggregated model. We therefore suggest bragging, which robustifies bagging and improves the accuracy of the identified model by taking the median instead of the mean of the ensemble.
In terms of the accuracy of the E-SINDy forecast, we show library bagging for 2.5% noise and data length T = 10s in the bottom of figure 1. Library bagging E-SINDy (coloured solid line) is able to forecast several chaotic lobe switches, compared with SINDy (grey dotted line), which deviates much earlier from the true dynamics (dashed line). We also show the 95% confidence intervals for different ensemble forecasting strategies, where multiple models are drawn from the bootstrap and used for prediction. Additionally, figure 1 shows the model coefficient distribution and inclusion probabilities of the identified library bagging E-SINDy model. The variance of the identified coefficients is comparably low, and the inclusion probabilities for the non-zero terms are clearly higher than for the zero terms.
Alternative resampling strategies such as jackknife [91] or using out-of-bag error estimates [92] may also improve the robustness of model discovery. In jackknife, we sequentially remove one observation from the data and identify an ensemble of models. This method may efficiently be used with library bagging, however, we found that bootstrapping outperforms jackknife, which is also reported in the literature [91]. We may also use out-of-bag error estimates to improve bagging. In bagging, each bootstrap sample leaves out approximately 37% of the data. This data can be used to give error estimates of the bagged predictor. We tested out-of-bag bagging E-SINDy, where we only aggregate the 10% most accurate models in terms of out-of-bag error. We found improved performance compared with bagging, however, we found lower accuracy and success rate compared with bragging. The more recent stability selection method [78] is another alternative to bagging and bragging. We perform the same sensitivity analysis to noise level and data length using the stability selection method that was recently used for PDE model discovery in [69]. Stability selection can be used to automatically determine the level of regularization in SINDy. The method randomly samples data subsets of size m/2 without replacement, identifies SINDy models for each subset, computes an importance measure and thresholds library terms, and solves a final linear least-squares regression on the thresholded library. The method in its original form performs slightly worse than b(r)agging E-SINDy. However, the success rate of stability selection SINDy can be improved, achieving comparable performance with bragging E-SINDy, by sampling subsets of size m with replacement instead of m/2 without replacement. In figure 8, we show the noise level sensitivity of the different ensemble SINDy methods including jackknife, out-of-bag bagging and stability selection of the Lorenz system dynamics for 4s data length.