Neural ordinary differential equations with irregular and noisy data

Measurement noise is an integral part of collecting data of a physical process. Thus, noise removal is necessary to draw conclusions from these data, and it often becomes essential to construct dynamical models using these data. We discuss a methodology to learn differential equation(s) using noisy and irregularly sampled measurements. In our methodology, the main innovation can be seen in the integration of deep neural networks with the neural ordinary differential equations (ODEs) approach. Precisely, we aim at learning a neural network that provides (approximately) an implicit representation of the data and an additional neural network that models the vector fields of the dependent variables. We combine these two networks by constraints using neural ODEs. The proposed framework to learn a model describing the vector field is highly effective under noisy measurements. The approach can handle scenarios where dependent variables are unavailable at the same temporal grid. Moreover, a particular structure, e.g. second order with respect to time, can easily be incorporated. We demonstrate the effectiveness of the proposed method for learning models using data obtained from various differential equations and present a comparison with the neural ODE method that does not make any special treatment to noise. Additionally, we discuss an ensemble approach to improve the performance of the proposed approach further.


Introduction
Uncovering dynamical models that explain physical phenomena and dynamic behaviours has been an active area of research for centuries. When a model describing the underlying dynamics is available, it can be used for several engineering studies such as process design, optimization, prediction and control. Conventional approaches based on physical laws and empirical knowledge are often used to derive dynamical models. However, this is intractable for many complex systems, e.g. Arctic ice pack dynamics, sea ice, power grids, neuroscience or finance, to name only a few applications. Data-driven methods to discover models have enormous potential to better understand transient behaviours in such cases. Furthermore, data acquired using imaging devices or sensors are contaminated with measurement noise. Therefore, systematic approaches to learning dynamical models with proper noise treatment are required.
In this work, we consider learning autonomous nonlinear differential equations of the form
$$\dot{x}(t) = g(x(t)) \quad \text{and} \quad x(0) = x_0, \tag{1.1}$$
where $x(t) \in \mathbb{R}^n$ denotes the solution at time $t$, $\dot{x}(t)$ is the time-derivative of $x$ at time $t$, and the continuous function $g(\cdot) : \mathbb{R}^n \to \mathbb{R}^n$ defines the vector field. We aim to learn the vector field $g(\cdot)$ using noisy measurements. Towards this aim, the initial work [1] proposes a framework that explicitly incorporates the noise into a numerical time-stepping method, namely a Runge-Kutta method. Though the approach has shown promising directions, its scalability remains unclear, as the approach explicitly needs noise estimates and aims to decompose the signal explicitly into noise and ground truth. Moreover, it requires that the Runge-Kutta method gives a reasonable estimate at the next step. Additionally, irregular sampling (e.g. when dependent variables are not collected or not available on the same time grid) cannot be handled, although it can be highly relevant when information is gathered from various sources, e.g. in medical applications. This work discusses a deep learning-based approach to learning dynamical models by enhancing neural networks with adaptive numerical integration. This allows learning models that represent the vector field accurately without estimating noise explicitly, even when dependent variables are sampled arbitrarily and irregularly.

Our contributions
Our work introduces a framework to learn dynamical models from noisy and irregular measurements by blending neural networks and numerical integration methods. Precisely, we aim at learning two networks: one that approximately represents the given measurement data implicitly, and a second one that approximates the vector field. We connect these two networks by enforcing an integral form of the ordinary differential equation (ODE) as depicted in figure 1. The appeal of the approach is that we do not require an explicit noise estimate to learn a model. Furthermore, the proposed approach is applicable even if each dependent variable is collected on a different time grid, which can be irregular.
The remaining structure of the paper is as follows. In the next section, we present a summary of relevant work. In §3, we present our deep learning-based framework for learning dynamics from noisy measurements by combining two networks. In §4, we demonstrate the effectiveness of the proposed methodology using various synthetic data with increasing noise levels. Section 5 discusses the application to learning second-order dynamical models. Moreover, in §6, we discuss how to handle irregular sampling of measurements. We also discuss an ensemble approach [2–6] to improve our approach further by taking a mean of the ensemble models. We conclude the paper with a summary and future research directions.

Relevant work
Data-driven methods to learn dynamical models have been studied for several decades (e.g. [7–9]). Learning linear models from input-output data goes back to Ho & Kálmán [10]. There have been several algorithmic developments for linear systems, for example, the eigensystem realization algorithm [11,12] and Kalman filter-based approaches [13–15]. Dynamic mode decomposition has also emerged as a promising approach to construct models from input-output data and has been widely applied in fluid dynamics applications (e.g. [16–18]). Furthermore, there has been a series of developments to learn nonlinear dynamical models. This includes, for example, equation-free modelling [19], nonlinear regression [20], dynamical modelling [21] and automated inference of dynamics [22–24]. Using symbolic regression and an evolutionary algorithm [25,26], learning compact nonlinear models becomes possible. Moreover, several approaches that leverage sparsity (also known as sparse regression) have been proposed [27–32]. We also mention the work [33] that learns models using Gaussian process regression. All these methods have particular approaches to handling noise in the data. For example, sparse regression methods (e.g. [27,28,32]) often use smoothing methods before identifying models, and the work [33] handles measurement noise by representing the data as a Gaussian process.
Even though the aforementioned nonlinear modelling methods are appealing and powerful in providing analytic expressions for models, they are often built upon model hypotheses. For example, the success of sparse regression techniques relies on the fact that the nonlinear basis functions describing the dynamics lie in a candidate feature library. For many complex dynamics, the utilization of these methods is not trivial. Thus, machine learning techniques, particularly deep learning-based ones, have emerged as powerful methods capable of expressing any complex function in a black-box manner given enough training data. Neural network-based approaches in the context of dynamical systems were already discussed decades ago in [34–37]. A particular type of neural network, namely recurrent neural networks, intrinsically models sequences and is often used for forecasting [38–42] but does not explicitly learn the corresponding vector field. Deep learning is also used to identify a coordinate transformation so that the dynamics in the transformed coordinates are almost linear or sparse in a high-dimensional feature basis (e.g. [43–46]). Furthermore, we mention that classical numerical schemes are combined with feed-forward neural networks to obtain discrete-time steppers for predictions (see [36,47–49]). The approaches in [36,47] can be interpreted as nonlinear autoregressive models [9]. A crucial feature of deep learning-based approaches that integrate numerical integration schemes is that vector fields are estimated using neural networks, while time-stepping is done using a numerical integration scheme. Furthermore, in recent times, neural ordinary differential equations (neural ODEs), in which neural networks define the vector fields, have been proposed in [50], where it is shown how to compute gradients with respect to network parameters efficiently using adjoint sensitivities. As a result, one can use efficient black-box numerical solvers to solve ODEs in a given time span using any adaptive time-stepping method.
However, measurement data are often corrupted with noise, and these approaches do not perform any specific noise treatment. The work in [1] proposes a framework that explicitly incorporates the noise into a numerical time-stepping method. Though the approach has shown a promising direction, its scalability remains unclear. The approach explicitly needs noise estimates by learning the decomposition of the signal into noise and ground truth. Also, it relies on a Runge-Kutta scheme that can accurately estimate the variable at the next step. In the context of sparse regression, several attempts have been made to reduce the effect of noise on the discovered sparse models, for example, WSINDy [51] and Ensemble-SINDy [52]. However, these techniques rely on sparse regression assumptions and assume that all dependent variables are collected at the same time. Furthermore, for scenarios where the data are collected on an irregular time grid, the work [53] discussed a methodology combining a gated recurrent unit (GRU) and neural ODEs. In that approach, an estimate for the initial condition of (latent) ODEs is learned, and an ODE for the vector field is then integrated using the estimated initial condition. However, for long sequences, it is quite challenging to estimate the initial condition from measurements at future times. Although in [53] the measurements can be collected on an irregular time grid, it still requires that all dependent variables are measured on the same time grid. When each dependent variable is collected on a different time grid, the approach [53] is not applicable. Gaussian processes have recently been combined with neural ODEs to deal with noisy measurements and irregular measurement sampling [54]. In this approach, each dependent variable is represented as a Gaussian process, and a probabilistic model describing the underlying dynamics is learned. The approach, however, depends on the modelling assumption for each dependent variable and yields a probabilistic model rather than a deterministic model. Furthermore, it does not focus on recovering the clean data from the noisy measurements.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 221475

Proposed methodology for learning dynamics: implicit networks combined with neural ODEs
This section discusses our framework for learning dynamical models using noisy measurements without explicit noise estimation. To achieve this goal, we use the powerful approximation capabilities of deep neural networks and their automatic differentiation feature together with the neural ODEs approach [50].
Neural ODEs allow one to integrate a function defining the vector field with any desired method and accuracy, and to compute derivatives with respect to the parameters efficiently. For details, refer to Chen et al. [50]. Consider the nonlinear dynamical system of the form (1.1). Note that the solution $x(t_j)$ can be given as
$$x(t_j) = x(t_i) + \int_{t_i}^{t_j} g(x(t))\,dt. \tag{3.1}$$
Next, we discuss our framework to learn dynamical models from noisy measurements. The approach involves two networks. The first network implicitly represents the variable as shown in figure 1b, and the second network approximates the vector field, i.e. the function $g(\cdot)$. These two networks are related by connecting the dependent variables at times $t_i$ and $t_j$, as given in (3.1). That is, the output of the implicit network is not only in the vicinity of the given noisy measurement data, but its time-evolution can also be defined by $g(x)$, as in (3.1).
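To be mathematically precise, let us denote noisy measurement data at time $t_i$ by $y(t_i)$. Furthermore, we consider a feed-forward neural network, denoted by $\mathcal{N}^{\text{Imp}}_\theta$ and parameterized by $\theta$, that approximately yields an implicit representation of measurement data, i.e.
$$\mathcal{N}^{\text{Imp}}_\theta(t_i) =: x(t_i) \approx y(t_i), \tag{3.2}$$
where $i \in \{1, \ldots, m\}$ with $m$ being the total number of measurements. Additionally, let us denote another neural network by $\mathcal{N}^{\text{Dyn}}_\phi$, parameterized by $\phi$, that approximates the vector field $g(\cdot)$. We connect these two networks by enforcing that the time-evolution of the output of the network $\mathcal{N}^{\text{Imp}}_\theta$ can be described by
$$x(t_j) = x(t_i) + \int_{t_i}^{t_j} \mathcal{N}^{\text{Dyn}}_\phi(x(t))\,dt, \tag{3.3}$$
where $x(t)$ is defined in (3.2). As a result, our goal becomes to determine the network parameters $\{\theta, \phi\}$ such that the following loss is minimized:
$$\mathcal{L} = \lambda_{\text{MSE}}\,\mathcal{L}_{\text{MSE}} + \lambda_{\text{Integral}}\,\mathcal{L}_{\text{Integral}} + \lambda_{\text{Grad}}\,\mathcal{L}_{\text{Grad}}, \tag{3.4}$$
where
- $\mathcal{L}_{\text{MSE}}$ denotes the mean square error between the output of the network $\mathcal{N}^{\text{Imp}}_\theta$ and the noisy measurements, i.e.
$$\mathcal{L}_{\text{MSE}} = \frac{1}{m}\sum_{i=1}^{m} \big\| x(t_i) - y(t_i) \big\|^2, \tag{3.5}$$
where $y(t_i)$ is measurement data. The loss enforces measurement data to be in the vicinity of the output of the implicit network, and $\lambda_{\text{MSE}}$ is its weighting parameter.
- The term $\mathcal{L}_{\text{Integral}}$ links the two networks by comparing the prediction, i.e.
$$\mathcal{L}_{\text{Integral}} = \frac{1}{m}\sum_{i} \Big\| x(t_j) - x(t_i) - \int_{t_i}^{t_j} \mathcal{N}^{\text{Dyn}}_\phi(x(t))\,dt \Big\|^2, \tag{3.6}$$
with $t_j > t_i$ as in (3.3), and the parameter $\lambda_{\text{Integral}}$ defines its weight in the total loss.
- The vector field at the output of the implicit network can be computed directly using automatic differentiation, but it can also be computed using the network $\mathcal{N}^{\text{Dyn}}_\phi$. The term $\mathcal{L}_{\text{Grad}}$ penalizes their mismatch as follows:
$$\mathcal{L}_{\text{Grad}} = \frac{1}{m}\sum_{i=1}^{m} \Big\| \frac{d}{dt}x(t_i) - \mathcal{N}^{\text{Dyn}}_\phi(x(t_i)) \Big\|^2, \tag{3.7}$$
and $\lambda_{\text{Grad}}$ is its corresponding weighting parameter.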
The total loss $\mathcal{L}$ can be minimized using a gradient-based optimizer such as Adam [55]. Once the networks are trained and parameters that minimize the loss have been found, we can generate the denoised variables using the implicit network $\mathcal{N}^{\text{Imp}}_\theta$, and the vector field using the network $\mathcal{N}^{\text{Dyn}}_\phi$. In the rest of the paper, we denote the proposed methodology by implicit-neural ODEs (in short, Imp-NODEs).
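To make the roles of the three loss terms concrete, the following is a minimal NumPy sketch and not the PyTorch implementation used in this work. It rests on several assumptions: the implicit network is replaced by a hypothetical callable `x_fn`, the vector-field network by a hypothetical callable `g_fn`, the time derivative is approximated by central finite differences rather than automatic differentiation, the integral over a window of `k` grid steps is evaluated with classical RK4 steps on a uniform grid, and the normalization of each term is chosen for illustration.

```python
import numpy as np

def rk4_step(g, x, dt):
    """One classical fourth-order Runge-Kutta step for x' = g(x)."""
    k1 = g(x)
    k2 = g(x + 0.5 * dt * k1)
    k3 = g(x + 0.5 * dt * k2)
    k4 = g(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def imp_node_losses(x_fn, g_fn, t, y, k=10, eps=1e-5):
    """Evaluate sketched versions of the MSE, integral and gradient terms.

    x_fn: hypothetical stand-in for the implicit network, times (m,) -> states (m, n)
    g_fn: hypothetical stand-in for the vector-field network, states (m, n) -> (m, n)
    t, y: uniform time grid (m,) and noisy measurements (m, n)
    k:    number of grid steps spanned by the integral term
    """
    dt = t[1] - t[0]
    x = x_fn(t)
    # Data-fidelity term: the implicit output should stay near the measurements.
    l_mse = np.mean(np.sum((x - y) ** 2, axis=1))
    # Integral term: integrate g from x(t_i) over k steps, compare with x(t_{i+k}).
    x_pred = x[:-k]
    for _ in range(k):
        x_pred = rk4_step(g_fn, x_pred, dt)
    l_int = np.mean(np.sum((x_pred - x[k:]) ** 2, axis=1))
    # Gradient term: finite-difference derivative of x_fn versus g_fn(x).
    dxdt = (x_fn(t + eps) - x_fn(t - eps)) / (2.0 * eps)
    l_grad = np.mean(np.sum((dxdt - g_fn(x)) ** 2, axis=1))
    return l_mse, l_int, l_grad

def total_loss(losses, lam_mse=1.0, lam_int=1.0, lam_grad=1e-2):
    """Weighted sum of the three terms, in the spirit of the total loss."""
    l_mse, l_int, l_grad = losses
    return lam_mse * l_mse + lam_int * l_int + lam_grad * l_grad
```

For a pair `x_fn`, `g_fn` that exactly satisfies the ODE and noise-free data, all three terms are close to zero; with noisy data, only the data-fidelity term stays away from zero, which is the behaviour the weighting parameters are meant to balance.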

Numerical experiments
We now investigate the performance of the approach discussed in §3 to denoise measurement data and to learn a model for estimating the vector field by means of an example. To that aim, we consider data obtained by solving a differential equation, which are then corrupted using additive Gaussian white noise of varying level. For a given percentage, we determine the noise as follows:
$$\nu \sim \mathcal{N}(0, \sigma^2), \quad \text{with} \quad \sigma = \frac{\text{Noise}\%}{100}.$$
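The noise model above can be sketched in a few lines of NumPy. The function name `add_noise` and the seeding are illustrative assumptions; the standard deviation follows the formula as stated, i.e. the noise percentage divided by 100.

```python
import numpy as np

def add_noise(x, noise_percent, seed=42):
    """Add zero-mean Gaussian white noise with sigma = noise_percent / 100."""
    rng = np.random.default_rng(seed)
    sigma = noise_percent / 100.0
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)

# Example: corrupt a smooth stand-in signal with 20% noise.
clean_signal = np.sin(np.linspace(0.0, 25.0, 2500))
noisy_signal = add_noise(clean_signal, 20)
```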

Training set-up
We have implemented our framework using the deep learning library PyTorch [56] and have optimized both networks simultaneously using the Adam optimizer [55]. We have used torchdiffeq [50], a Python package, with its default settings to integrate ODEs and to perform back-propagation to determine gradients. Since the parameters of the neural networks are initialized randomly and are therefore far from the optimized values at the start of training, it is not required to solve the integral term in (3.6) very accurately. Therefore, we can approximate it using the fourth-order Runge-Kutta (RK4) method at the beginning of the training. Consequently, we can expect to gain computational advantages because the RK4 method requires only four calls of the function defining the vector field. Hence, we first train using this approximation of the integral for 5000 epochs, followed by training using an adaptive ODE integration scheme for 10 000 epochs. We also make use of a learning-rate scheduler, for which we reduced the learning rate by one-tenth after every 4000 epochs. Furthermore, for the implicit networks, we map the input data to [−1, 1]. Note that, in our experiments, we report the results obtained from one attempt by setting the random seed to 42, except in §4.2.5, where we discuss an ensemble approach. The neural network architecture design and hyper-parameters are discussed in appendix A, and we have run all our experiments on an NVIDIA P100 GPU.
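The claim that an RK4 step needs only four evaluations of the vector field can be seen directly from a plain-Python sketch of the classical scheme. The helper names `rk4_step` and `integrate_rk4` are illustrative and are not part of torchdiffeq.

```python
import math

def rk4_step(g, x, dt):
    """One classical RK4 step for x' = g(x); g is evaluated exactly four times."""
    k1 = g(x)
    k2 = g(x + 0.5 * dt * k1)
    k3 = g(x + 0.5 * dt * k2)
    k4 = g(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate_rk4(g, x0, dt, n_steps):
    """Approximate the integral form of the ODE over n_steps fixed steps."""
    x = x0
    for _ in range(n_steps):
        x = rk4_step(g, x, dt)
    return x

# Example: x' = -x over a window of 10 steps with dt = 0.01.
x_end = integrate_rk4(lambda x: -x, 1.0, 0.01, 10)
```

An adaptive solver would instead choose its step sizes from an error estimate, which typically costs more function evaluations per window; this is the trade-off behind using the fixed-step approximation early in training.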

Cubic damped model
For illustration purposes, we consider a simple damped cubic system in two variables. It has been one of the benchmark examples in discovering models using data (e.g. [32,57]), but there it is assumed that the dynamics can be given sparsely in a high-dimensional feature dictionary. Here, we do not make such an assumption but instead learn the vector field using a neural network. For this example, we take 2500 data points in the time interval [0, 25] by simulating the model using the initial condition [2, 0]. We then corrupt the clean data by adding mean-zero Gaussian white noise with {5%, …, 30%} noise levels to obtain noisy measurements synthetically.
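The data generation described above can be sketched as follows. The right-hand side used here is an assumption: it is the cubic damped oscillator commonly used as a benchmark in the sparse-regression literature, and the coefficients are not taken from the text above. The fixed-step RK4 simulator and the four noise levels (those used in the training set-up below) are likewise illustrative choices.

```python
import numpy as np

def cubic_damped(x):
    """Assumed benchmark form (coefficients are an assumption):
    x1' = -0.1 x1^3 + 2 x2^3,  x2' = -2 x1^3 - 0.1 x2^3."""
    x1, x2 = x
    return np.array([-0.1 * x1**3 + 2.0 * x2**3,
                     -2.0 * x1**3 - 0.1 * x2**3])

def simulate(g, x0, t):
    """Integrate x' = g(x) on the grid t with classical RK4 steps."""
    xs = np.empty((len(t), len(x0)))
    xs[0] = x0
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        k1 = g(xs[i])
        k2 = g(xs[i] + 0.5 * dt * k1)
        k3 = g(xs[i] + 0.5 * dt * k2)
        k4 = g(xs[i] + dt * k3)
        xs[i + 1] = xs[i] + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return xs

t = np.linspace(0.0, 25.0, 2500)                      # 2500 points in [0, 25]
clean = simulate(cubic_damped, np.array([2.0, 0.0]), t)
rng = np.random.default_rng(42)
# One noisy copy per noise level, with sigma = level / 100.
noisy = {p: clean + rng.normal(0.0, p / 100.0, clean.shape) for p in (5, 10, 20, 30)}
```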

Training and results
Next, we aim to obtain a denoised signal and a model defining its vector field using the proposed methodology. Thus, we construct neural networks for the implicit representation and the vector field with the parameters given in table 2.
To train the implicit network and the neural network for ODEs, we set $\lambda_{\text{Integral}} = 1.0$ and $\lambda_{\text{Grad}} = 10^{-2}$ in the loss function (3.4); we choose $\lambda_{\text{MSE}} = 1.0$ for 5% noise, $\lambda_{\text{MSE}} = 0.5$ for {10%, 20%} noise, and $\lambda_{\text{MSE}} = 0.2$ for 30% noise to avoid over-fitting of noisy data by the implicit network. Moreover, to integrate the ODEs, we consider the time span of $10 \cdot dt$ with $dt = 10^{-2}$. We compare our methodology with the neural ODE framework [50], which also focuses on learning a neural network that defines the underlying vector field. We note that the neural ODE framework does not have any special treatment to handle noise. For this methodology, we train the model for 1000 epochs only, since we shall later illustrate that it is prone to over-fitting when trained longer. Furthermore, it is trained with the same neural network architecture and the same training data as our approach. Having the trained models, we compare the vector fields in the domain [−2, 2] × [−2, 2] by taking 25 points in each direction. We plot the results in figure 2, where the learned models of the vector fields obtained from the proposed method (Imp-NODEs) are compared with neural ODE [50] (Std-NODEs).
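The grid-based comparison of vector fields can be sketched as follows. The helper names and the use of the Euclidean norm of the pointwise difference as the error measure are assumptions made for illustration; `g_true` and `g_learned` stand for any two callables that map an array of points to vector-field values.

```python
import numpy as np

def grid_points(lo=-2.0, hi=2.0, n=25):
    """Return the n*n points of a uniform grid on [lo, hi] x [lo, hi]."""
    a = np.linspace(lo, hi, n)
    X1, X2 = np.meshgrid(a, a)
    return np.column_stack([X1.ravel(), X2.ravel()])

def field_errors(g_true, g_learned, pts):
    """Mean and median of the pointwise Euclidean error between two fields."""
    err = np.linalg.norm(g_true(pts) - g_learned(pts), axis=1)
    return err.mean(), np.median(err)

pts = grid_points()   # 25 points in each direction, 625 in total
```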
It is clear from the figures that Imp-NODEs is able to learn the underlying vector field faithfully, whereas Std-NODEs fails to identify the vector fields correctly, which becomes particularly evident for higher noise levels. Our approach consists of an implicit network, aiming to generate denoised data in the vicinity of noisy data, whose dynamics is defined by a neural network. Thus, we plot the denoised data obtained from the implicit network in figure 3. We note that despite not employing any data-filtering scheme, we can obtain denoised data close to the ground truth clean data even for a high noise level, which is otherwise not possible by employing solely Std-NODEs. We note that the proposed method takes 0.11 s for one epoch, and a similar order of computational time is taken for Std-NODEs.
Next, we study the effect of hyper-parameters on the performance of the proposed methodology. To that end, we first note that for Imp-NODEs, the loss function is given by a weighted sum of three terms; see (3.4). Here, we aim to study the influence of one of the hyper-parameters, namely $\lambda_{\text{MSE}}$, on the performance of the learned model using Imp-NODEs, by keeping the other two hyper-parameters fixed. They are set to $\lambda_{\text{Integral}} = 1.0$ and $\lambda_{\text{Grad}} = 10^{-2}$, as in the previous subsection. Recall that $\lambda_{\text{MSE}}$ determines how well the given data are approximated using an implicit network. We take the cases of {5%, 20%} noise levels. We train different models by varying $\lambda_{\text{MSE}}$. We then plot its effect on the performance of the method in figure 4.
The figure shows that for low noise levels, the method is robust with respect to changes of the parameter $\lambda_{\text{MSE}}$, but it is rather sensitive for high noise levels. This can be explained by the fact that when the data are highly noisy and more weight is given to fitting them, the implicit neural network learns the noise by over-fitting. Moreover, when $\lambda_{\text{MSE}}$ is very low, the implicit network does not learn enough information from the data; hence, the underlying vector field cannot be expected to be identified accurately. Therefore, finding a good value of the hyper-parameters is important to obtain a good fit for the model defining the vector field.
To determine a good region for $\lambda_{\text{MSE}}$, we borrow an idea from solving ill-conditioned least-squares problems (e.g. [58]). In light of this, we solve the underlying optimization problem for different values of $\lambda_{\text{MSE}}$ and observe the data-fidelity term $\mathcal{L}_{\text{MSE}}$, given in (3.5). We then plot these quantities, namely $\lambda_{\text{MSE}}$ and the data-fidelity term, as shown in figure 4b. Such a plot often exhibits an L-type curve, and a promising region for the value of $\lambda_{\text{MSE}}$ lies at the corner. These kinds of studies are often carried out to determine hyper-parameters for solving Tikhonov-regularized least-squares problems [58]. This is also what we observe in our case, and such a hyper-parameter search can provide us with a hint about a suitable parameter region.
In the context of neural networks, it is widely known that longer training with many iterations (or epochs) can over-fit the model, particularly for noisy measurements. Here, we study the effect of longer training on the performance of Imp-NODEs and Std-NODEs. Keeping the same setting as in §4.2.1 and taking data with 20% noise, we learn vector fields using Imp-NODEs and Std-NODEs by varying the number of epochs. We plot the results in figure 5, which shows that Imp-NODEs is quite robust with respect to the number of epochs, and it does not start over-fitting when trained longer. A potential reason could be that our approach Imp-NODEs is augmented with an implicit neural network, which can act as an adequate regularizer to avoid over-fitting. By contrast, we observe that Std-NODEs starts learning noise as the training progresses beyond a certain number of epochs. Hence, early stopping is quite an important factor in obtaining good performance for Std-NODEs.

Employing low-pass filters as a pre-processing step
We have already observed in §4.2.1 that Imp-NODEs is capable of yielding denoised data, which are close to the clean data, without any pre-processing step. However, one might argue that employing classical methods such as low-pass filters can be beneficial, and they provide a computationally cheap yet powerful tool to remove a major part of the noise. This is what we study next, that is, how the performances of Imp-NODEs and Std-NODEs in learning models are affected when a pre-processing step is employed. We have the same setting as in §4.2.1, except that we now employ a low-pass filter to smooth the data. This is achieved using a third-order digital Butterworth filter with critical frequency 0.1, implemented using scipy. We present the quality of the learned vector fields using Imp-NODEs and Std-NODEs in figure 6. We note that the performance of Imp-NODEs is only slightly affected, as compared to the case when no pre-processing step was employed (compare figures 2 and 6). On the other hand, for Std-NODEs, we observe a substantial improvement, especially for high noise levels. A reason is that Std-NODEs does not have an inherent capability of handling noise; hence, any filtering approach would be highly beneficial, in contrast to our approach Imp-NODEs, which implicitly aims to yield denoised data as well. We explicitly report this behaviour by means of a bar plot; see figure 7. Despite the improved performance of Std-NODEs by means of pre-processing, our approach Imp-NODEs still outperforms Std-NODEs in terms of the quality of the learned models for the vector field. Furthermore, we note that such filtering approaches are not straightforward to employ, especially in the case of irregular data. In those cases, employing our approach, namely Imp-NODEs, could be beneficial since it does not require any pre-processing step yet yields good models and denoised data.
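A pre-processing step of this kind can be sketched with `scipy.signal`. The filter order and critical frequency follow the text; applying the filter forwards and backwards with `filtfilt` (zero-phase filtering) is an assumption, since the text does not state how the filter was applied, and the function name `lowpass` is illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(y, order=3, wn=0.1):
    """Smooth y along the time axis with a digital Butterworth low-pass filter.

    wn is the critical frequency as a fraction of the Nyquist frequency.
    Zero-phase filtering via filtfilt is an assumption made here.
    """
    b, a = butter(order, wn)
    return filtfilt(b, a, y, axis=0)

# Example on a stand-in signal with 20% noise.
t = np.linspace(0.0, 25.0, 2500)
clean = np.sin(t)
rng = np.random.default_rng(42)
noisy = clean + rng.normal(0.0, 0.2, clean.shape)
smoothed = lowpass(noisy)
```

Note that such a filter presumes uniformly sampled data, which is one reason why it is not straightforward to use for irregular measurements.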

An ensemble approach to improve performance
Ensemble approaches are widely employed machine learning techniques to improve model predictions. The main principle is to combine predictions of many possibly independent models, for example, by taking an average. Several methods exist in this direction, such as bragging, bagging and boosting (e.g. [2–4,6,59]). In this work, we take inspiration from bagging and propose the following to obtain an ensemble of models for predicting the vector field.
In bagging, bootstrap samples of the data are drawn with replacement, followed by learning a model for each sample to form an ensemble. In our framework, the data are very limited; thus, we focus on using all of them in some form. However, we need to build an ensemble of independent models. For this purpose, we propose modifying the terms (3.5)-(3.7) using a weighting vector $\omega$ as follows:
$$\mathcal{L}^{\omega}_{\text{MSE}} = \frac{1}{m}\sum_{i=1}^{m} \omega_i \big\| x(t_i) - y(t_i) \big\|^2,$$
$$\mathcal{L}^{\omega}_{\text{Integral}} = \frac{1}{m}\sum_{i} \omega_i \Big\| x(t_j) - x(t_i) - \int_{t_i}^{t_j} \mathcal{N}^{\text{Dyn}}_\phi(x(t))\,dt \Big\|^2,$$
$$\mathcal{L}^{\omega}_{\text{Grad}} = \frac{1}{m}\sum_{i=1}^{m} \omega_i \Big\| \frac{d}{dt}x(t_i) - \mathcal{N}^{\text{Dyn}}_\phi(x(t_i)) \Big\|^2,$$
where $\omega_i$ is the $i$th entry of $\omega$ and is sampled randomly from a uniform distribution on [0, 1]. Using the modified loss terms as above, we can define a new weighted total loss as in (3.4). We can expect a different solution for every random vector $\omega$, as the underlying optimization problem is highly nonlinear and non-convex. Moreover, an interpretation of $\omega$ or $\omega_i$, in the context of the classical bagging philosophy, can be given as follows: it defines a probability of drawing the sample $y_i$ (or $x_i$). Consequently, we can obtain an ensemble of models by randomly selecting $\omega$. For Imp-NODEs, we build 20 models to predict the vector field for 5% and 20% noise levels and plot the mean of the ensemble models in figure 8. We also show the standard deviation among these 20 models. These figures indicate that we have a good approximation of the vector field in the region of the collected data and can estimate the confidence by means of the obtained standard deviation. To further quantify the performance of the ensemble approach, in table 1 we report the mean and median errors of the vector fields of the mean-ensemble model and of the best and worst models among the 20 trained models. Interestingly, we find that the mean-ensemble model can even outperform the best model obtained with a single attempt. This illustrates the powerful capability of an ensemble approach. However, we note that ensemble approaches come with computational disadvantages: not only are we required to train several models, but during inference we also need to evaluate all of them to take their average.
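The sample weighting and the ensemble statistics can be sketched with NumPy. This is an illustration under assumptions: the function names are hypothetical, only the weighted data-fidelity term is shown, its normalization by the number of samples is a choice made here, and `fields` stands for any list of trained vector-field models given as callables.

```python
import numpy as np

def weighted_mse(x, y, omega):
    """Weighted data-fidelity term: sample i is scaled by omega_i."""
    return np.mean(omega * np.sum((x - y) ** 2, axis=1))

def ensemble_mean_std(fields, pts):
    """Mean and standard deviation of vector-field predictions across models."""
    preds = np.stack([f(pts) for f in fields])   # (n_models, n_pts, n_dim)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(0)
m, n_models = 100, 20
# One random weight vector per ensemble member, entries uniform on [0, 1).
omegas = rng.uniform(0.0, 1.0, size=(n_models, m))
```

Each row of `omegas` would define one weighted training problem; after training, `ensemble_mean_std` gives the mean-ensemble prediction together with a pointwise spread that can serve as a rough confidence indicator.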

Second-order neural ODEs for noisy data
Several dynamics observed in engineering processes, particularly in electrical and mechanical systems, are of second order, which can be given as follows:
$$\ddot{x}(t) = f(x(t), \dot{x}(t)), \tag{5.1}$$
where $\dot{x}(t)$ and $\ddot{x}(t)$ denote the first and second derivatives of $x(t)$ with respect to $t$, respectively. As discussed in [60], it is advantageous to consider the companion first-order system of (5.1), which is as follows:
$$\dot{z}_1(t) = f(z_2(t), z_1(t)), \qquad \dot{z}_2(t) = z_1(t), \tag{5.2}$$
where $z_1(t) = \dot{x}(t)$ and $z_2(t) = x(t)$, and it inherently preserves the second-order behaviour. The above system can be seen as a first-order system with a constraint. The method proposed in the previous section can be readily applied to learn second-order neural ODEs for noisy measurements by incorporating implicit networks.
[Displaced caption fragment] We constructed an ensemble of 20 models and have taken the mean of these to predict the vector field.

Numerical example: pendulum dynamics
To illustrate learning second-order dynamics, we consider the nonlinear pendulum model
$$\ddot{x}(t) = -\sin(x(t)) - 0.05 \cdot \dot{x}(t). \tag{5.3}$$
We collect data using the initial condition $[\dot{x}(0), x(0)] = [-0.5, 2.0]$ with time steps of $dt = 0.05$, which are then corrupted by adding Gaussian white noise of {5%, 10%, 20%, 30%} noise levels. Here, as well, we do not apply any pre-processing step, in order to observe the performance of the proposed methodology on the raw data. By imposing the second-order structure, we employ the proposed scheme by combining an implicit network and neural ODEs. We train the networks with parameters $\lambda_{\text{Integral}} = 1.0$, $\lambda_{\text{Grad}} = 10^{-2}$, and $\lambda_{\text{MSE}} = 1.0$ in (3.4) for 5% noise, $\lambda_{\text{MSE}} = 0.5$ for 10% and 20% noise, and $\lambda_{\text{MSE}} = 0.1$ for 30% noise to avoid over-fitting. We also use early stopping for standard neural ODEs to avoid over-fitting, as discussed in §4.2.1; we train them for 200 epochs. For numerical integration, we take a time span of $5 \cdot dt$.
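The second-order structure can be imposed by wrapping the acceleration function in a first-order companion system, as the following plain-Python sketch shows for the pendulum model. The helper names, the fixed-step RK4 integrator and the simulation horizon of 400 steps are assumptions for illustration; in the learning setting, `pendulum_accel` would be replaced by a neural network.

```python
import math

def pendulum_accel(x, v):
    """Right-hand side of the pendulum model: x'' = -sin(x) - 0.05 x'."""
    return -math.sin(x) - 0.05 * v

def companion(f):
    """Turn x'' = f(x, x') into a first-order system in z = (z1, z2) = (x', x)."""
    def rhs(z):
        z1, z2 = z
        return (f(z2, z1), z1)   # z1' = f(z2, z1), z2' = z1
    return rhs

def rk4_step(g, z, dt):
    """One classical RK4 step for a system given as a tuple of floats."""
    k1 = g(z)
    k2 = g(tuple(zi + 0.5 * dt * ki for zi, ki in zip(z, k1)))
    k3 = g(tuple(zi + 0.5 * dt * ki for zi, ki in zip(z, k2)))
    k4 = g(tuple(zi + dt * ki for zi, ki in zip(z, k3)))
    return tuple(zi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for zi, a, b, c, d in zip(z, k1, k2, k3, k4))

rhs = companion(pendulum_accel)
z = (-0.5, 2.0)               # [x'(0), x(0)] as in the text
traj = [z]
for _ in range(400):          # horizon of 400 steps is an assumption
    z = rk4_step(rhs, z, 0.05)
    traj.append(z)
```

Because the second component of `rhs` is always `z1`, only the acceleration needs to be learned, which is how the companion form preserves the second-order behaviour.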
We compare our results with neural ODEs for second-order systems, the approach proposed in [60]; we denote it by SO-Std-NODEs. We plot the learned vector fields from both methods in figure 9, where we see a better performance for the proposed method than for SO-Std-NODEs. This is particularly apparent for higher noise levels, where SO-Std-NODEs fails to capture the vector field (see figure 9, third and fourth rows). Moreover, in figure 10, we plot the denoised data, which is the output of the trained implicit network, indicating the faithful recovery of the data without performing any prior pre-processing step.

Data at irregular sampling
Lastly, we illustrate the ready applicability of the proposed method (Imp-NODEs) when the data are collected on an irregular time grid, especially when dependent variables are not even measured at the same time instances. This is of particular interest in medical applications, where data often come at quite irregular time intervals or when the sources of information are different.
We here present the framework for two-dimensional problems; however, it readily extends to dynamics of arbitrary dimension. Let us consider a dynamical model as follows:
$$\dot{x}(t) = g(x(t)), \qquad x(t) = [x_1(t), x_2(t)]^\top.$$
To learn the vector field representing the dynamics for $x$ using measurements on an irregular time grid, we construct an implicit representation for $x$ so that both variables can be estimated on the same time grid (let us denote it by $T = \{t_1, \ldots, t_p\}$) but with a constraint using measurements. Assume the implicit network and the neural ODE defining the vector field are denoted by $\mathcal{N}^{\text{Imp}}_\theta$ and $\mathcal{N}^{\text{Dyn}}_\phi$. To train the networks, we define a loss function analogous to (3.4), in which the data-fidelity term compares the $k$th element of the implicit network output with the measurements of the $k$th dependent variable at its own time points; here, $[\,\cdot\,]_k$ denotes the $k$th element.
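One way to write such a loss, consistent with the weighted sum in (3.4), is sketched below. The notation is introduced here for illustration and is an assumption: $t^{(k)}_i$, $i = 1, \ldots, m_k$, denote the time points at which the $k$th dependent variable is measured, $y_k$ denotes the corresponding measurements, and the integral and gradient terms are understood to be evaluated on the common grid $T$. The normalization is likewise a choice made for this sketch.

```latex
\mathcal{L} =
  \lambda_{\text{MSE}} \sum_{k=1}^{2} \frac{1}{m_k} \sum_{i=1}^{m_k}
    \Big( \big[ \mathcal{N}^{\text{Imp}}_{\theta}\big(t^{(k)}_i\big) \big]_k
          - y_k\big(t^{(k)}_i\big) \Big)^2
  + \lambda_{\text{Integral}} \, \mathcal{L}_{\text{Integral}}
  + \lambda_{\text{Grad}} \, \mathcal{L}_{\text{Grad}}
```

Since the implicit network can be queried at any time, the first term never requires the two variables to share time points, whereas the last two terms only need the network output on $T$.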

Numerical example: linear 2D
We illustrate the considered scenario using a linear 2D example. We collect data using an initial condition $[x_1, x_2] = [2, 0]$ with a time step $dt = 0.05$ in the time interval [0, 20]. We randomly and independently select 60% of the samples for each of the first and second dependent variables, followed by corrupting them using Gaussian white noise for {5%, 10%, 20%, 30%} noise levels. Consequently, we obtain data that are not only noisy but irregular as well, as shown in figure 11. Furthermore, we take the time grid for prediction of the output of the implicit network as the uniform grid with $dt = 0.05$ on the time interval [0, 20] so that it can be used to evaluate the integral terms. For learning models for the vector field, we set $\lambda_{\text{Integral}} = 1.0$, $\lambda_{\text{Grad}} = 10^{-2}$, and $\lambda_{\text{MSE}} = 1.0$ for the 5% noise level, $\lambda_{\text{MSE}} = 0.5$ for the 10% noise level, and $\lambda_{\text{MSE}} = 0.2$ for {20%, 30%} noise levels. For time integration, we consider the time span $5 \cdot dt$.
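The independent subsampling of each dependent variable can be sketched as follows. The function name, the seeding and the stand-in trajectory (a plain circle instead of the linear example) are assumptions for illustration; only the 60% sampling fraction, the time grid and the noise model follow the text.

```python
import numpy as np

def irregular_samples(t, x, frac=0.6, noise_percent=5.0, seed=0):
    """Independently subsample each dependent variable and add Gaussian noise.

    Returns one (times, values) pair per dependent variable.
    """
    rng = np.random.default_rng(seed)
    m, n = x.shape
    keep = int(round(frac * m))
    out = []
    for k in range(n):
        # A separate random subset of time indices for every variable.
        idx = np.sort(rng.choice(m, size=keep, replace=False))
        y_k = x[idx, k] + rng.normal(0.0, noise_percent / 100.0, size=keep)
        out.append((t[idx], y_k))
    return out

t = np.linspace(0.0, 20.0, 401)                 # dt = 0.05 on [0, 20]
x = np.column_stack([np.cos(t), np.sin(t)])     # hypothetical stand-in trajectory
samples = irregular_samples(t, x)
```

Each variable then comes with its own time grid, which is the situation that the implicit representation is meant to handle.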
We show the estimates of the learned vector fields using the proposed methodology and compare them with the ground truth in figure 12, illustrating that the dynamics are captured faithfully. Moreover, we can recover the clean signal without any prior information about the noise and without any pre-processing of the data, even for irregular data, as shown in figure 13.

Discussion and conclusion
This work has presented a new approach for learning dynamical models from noisy time-series data and for obtaining denoised data. Our framework blends the universal approximation capabilities of deep neural networks with neural ODEs. The proposed scheme involves two networks to learn (approximately) an implicit representation of the measurement data and of the vector field. These networks are combined by enforcing that an ODE can explain the dynamics of the output of the implicit network. We also discussed its extension to second-order neural ODEs to learn second-order dynamical models using corrupted data. Furthermore, we have shown that the proposed approach can readily handle arbitrarily sampled points in time. The dependent variables need not be collected on the same time grid. This is possible because of the construction of an implicit representation of the data in our framework, which does not require data to be on a particular grid. We also discussed an ensemble approach, inspired by bagging, to improve the quality of the models by taking an average of an ensemble of models. We have also discussed a scheme based on an L-curve analysis to determine a good regime for hyper-parameters.
In the future, we will focus on using the encoder-decoder framework combined with an implicit network to learn latent ODEs and explain even richer dynamics. Moreover, when the data are high-dimensional (e.g. coming from partial differential equations), applying neural ODEs becomes computationally intractable. However, it is known that the dynamics often lie in a low-dimensional manifold. Therefore, in our future work, we aim to use the concept of low-dimensional embedding to make learning computationally more efficient for high-dimensional data. Furthermore, it would be interesting to use expert knowledge and physical laws to have physics-constrained neural ODEs so that the generalizability and extrapolation capabilities of models can be further improved.

Figure 1. The figure illustrates the framework for denoising the data and learning a model describing the underlying dynamics. For this, we determine an implicit representation of the noisy data (approximately) by a network $N^{\mathrm{Imp}}_{\theta}$ and another network for the vector field $N^{\mathrm{Dyn}}_{\phi}$. These two networks are connected by enforcing that the dynamics of the output of the implicit representation can be given by $N^{\mathrm{Dyn}}_{\phi}$. Once the objective function (shown in c) is minimized, we obtain an implicit network for denoised data and a model for the vector field $N^{\mathrm{Dyn}}_{\phi}(x)$.
.4) where $L_{\mathrm{MSE}}$ denotes the mean square error between the output of the network $N^{\mathrm{Imp}}_{\theta}$ and the noisy measurements, i.e.

Figure 2. Cubic2D example.A comparison of vector fields of the ground truth and learned models for various noise levels.

Figure 3. Cubic2D example. The plots show the noisy data and the denoised (recovered) data from the implicit network. The clean (reference) data are indicated using the black dashed lines. The third and fourth columns indicate absolute and relative errors between ground truth and denoised data, respectively.

Figure 4. Cubic2D example. (a) The effect of the hyper-parameter $\lambda_{\mathrm{MSE}}$ on the performance of Imp-NODEs. (b) We demonstrate how an L-curve study can help to determine a suitable regime for the hyper-parameter $\lambda_{\mathrm{MSE}}$. Note that the corner of the L-curve is often a good region for $\lambda_{\mathrm{MSE}}$, which is also observed in this case.

Figure 5. Cubic2D example. The figure illustrates the effect of longer training on the performance of Imp-NODEs and Std-NODEs.

Figure 6. Cubic2D example. Having employed a low-pass filter as a pre-processing step, we here present a comparison of vector fields of the ground truth and learned models for various noise levels.

Figure 7. Cubic2D example. The plot shows the effect of employing a pre-processing step using a low-pass filter on both approaches, Imp-NODEs and Std-NODEs. It illustrates that the pre-processing step does not have a major impact on the performance of Imp-NODEs, in contrast to Std-NODEs.

Figure 8. Cubic2D example. The figure demonstrates the performance of the ensemble approach combined with Imp-NODEs. We constructed an ensemble of 20 models and have taken the mean of these to predict the vector field.

Figure 9. Pendulum example. A comparison of the learned vector fields for second-order dynamical models using the proposed methodology and SO-Std-NODEs for various noise levels. It illustrates the robustness of the proposed approach with respect to various noise levels.

Figure 10. Pendulum example. The figure shows the noisy measurements and the denoised data obtained from the implicit network. The black dashed lines show the ground truth reference. The third and fourth columns indicate absolute and relative errors between ground truth and denoised data, respectively.

Figure 11. Linear 2D example. An illustration of collected noisy and irregular data. In the plot, it is clearly visible that the variables $x_1$ and $x_2$ are not collected at the same time frame.

Figure 12. Linear 2D example. A comparison of the learned vector fields for second-order dynamical models using the proposed methodology and Std-NODEs for various noise levels. It illustrates the capability of the proposed method to learn dynamic models from highly irregular data and its robustness with respect to various noise levels.

Figure 13. Linear 2D example. The figure shows the ability to recover the clean data by means of the implicit network, even for irregular data.

whereas the variable $x_2$ is collected on the time grid $T_2 = \{t^{(2)}_1, \ldots, t^{(2)}_m\}$ with $T_1 \neq T_2$. To learn a model for

Table 1. Cubic2D example. A comparison of the mean-ensemble, the best and the worst models is presented by comparing the mean and median of the error of the vector field. It shows that the mean-ensemble model can outperform the best model obtained using a single attempt. The best performing model using mean and median measures is highlighted in bold.

Table 2. The table shows the information about network architectures and learning rates.

royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 221475