Learning dynamical systems from data: A simple cross-validation perspective, part III: Irregularly-Sampled Time Series

A simple and interpretable way to learn a dynamical system from data is to interpolate its vector field with a kernel. In particular, this strategy is highly efficient (both in terms of accuracy and complexity) when the kernel is data-adapted using Kernel Flows (KF) [34], which uses gradient-based optimization to learn a kernel based on the premise that a kernel is good if there is no significant loss in accuracy when half of the data is used for interpolation. Despite its previous successes, this strategy (based on interpolating the vector field driving the dynamical system) breaks down when the observed time series is not regularly sampled in time. In this work, we address this problem by incorporating the time differences between observations into the (KF) data-adapted kernels. We compare our approach with the classical one on different benchmark dynamical systems and show that it significantly improves forecasting accuracy while remaining simple, fast, and robust.


Introduction
The ubiquity of time series in many domains of science has led to the development of diverse statistical and machine learning forecasting methods. Examples include ARIMA [10], GARCH [5] or LSTM [39]. Most of these methods require the time series to be regularly sampled in time. Yet, this requirement is not met in many applications. Indeed, irregularly sampled time series commonly arise in healthcare [29], finance [16] and physics [40] among other fields.
While adaptations have been proposed, these workarounds tend to treat irregular sampling as a missing-values problem, leading to poor performance when the resulting missing rate is very high. Such approaches include (1) the imputation of the missing values (e.g. with exponential smoothing [23,42] or with a Kalman filter [17]), and (2) fast Fourier transforms or Lomb-Scargle periodograms [16,2]. This issue has motivated the development of several recent deep learning-based algorithms such as VS-GRU [27], GRU-ODE-Bayes [28,15] and ODE-RNN [18].
Amongst various learning-based approaches, kernel-based methods hold potential for considerable advantages in terms of theoretical analysis, numerical implementation, regularization, guaranteed convergence, automatization, and interpretability [11,32]. Indeed, reproducing kernel Hilbert spaces (RKHS) [14] have provided strong mathematical foundations for analyzing dynamical systems [6,21,19,20,4,24,25,1,26,7,8,9] and surrogate modeling (we refer the reader to [38] for a survey). Yet, the accuracy of these emulators depends on the kernel, and the problem of selecting a good kernel has received less attention. Recently, the experiments by Hamzi and Owhadi [22] show that when the time series is regularly sampled, Kernel Flows (KF) [34] (an RKHS technique) can successfully reconstruct the dynamics of some prototypical chaotic dynamical systems. KFs have subsequently been applied to complex large-scale systems, including climate data [30,41]. The nonparametric version of KFs has been extended to dynamical systems in [35]. A KFs version for SDEs can be found in [36].
Despite its recent successes, we show in this paper that this strategy (based on approximating the vector field of the dynamical system) cannot directly be applied to irregularly sampled time series. Instead, we propose a simple adaptation of the original method that significantly improves forecasting performance when the sampling is irregular: the time differences between observations are added to the delay embedding used to feed the method. We demonstrate the benefits of our approach on three prototypical chaotic dynamical systems: the Hénon map, the Van der Pol oscillator, and the Lorenz system. For all three, our approach shows significantly improved forecasting accuracy compared to the original approach.
Specifically, our contributions are as follows:
• We show that learning the kernel in kernel ridge regression using our modified approach significantly improves the prediction performance for irregularly sampled time series of dynamical systems.
• Using a delay embedding, we adapt the KF-adapted kernel method to make multistep predictions.
The outline of this paper is as follows. In Section 2, we review kernel methods for regularly sampled time series and propose an extension of Kernel Flows to irregularly sampled time series. Section 3 contains a description of our experiments with the Hénon, Van der Pol, and Lorenz systems, along with a discussion. The appendix provides a summary of the theory of reproducing kernel Hilbert spaces (RKHS).

Statement of the problem and proposed solution
2.1. The problem. Let $x_1, x_2, \dots, x_n$ be observations from a deterministic dynamical system in $\mathbb{R}^d$, along with a vector $t = (t_1, \dots, t_n)$ containing the observation times; that is, $x_k$ is observed at time $t_k$. Importantly, the time differences between observations, $t_{k+1} - t_k$, are not necessarily constant. Our goal is to predict $x_{n+1}, x_{n+2}, \dots$ given the future sampling times $t_{n+1}, t_{n+2}, \dots$ and the history of the irregularly observed time series ($x_1, \dots, x_n$ and $t_1, \dots, t_n$).

2.2. A reminder on kernel methods for regularly sampled time series. The simplest approach to forecasting the time series (employed in [22]) is to assume that $x_1, x_2, \dots$ is the solution of a discrete dynamical system of the form

$$x_{k+1} = f^\dagger(x_k, \dots, x_{k-\tau+1}), \quad (1)$$

with an unknown vector field $f^\dagger$ and time delay $\tau \in \mathbb{N}^*$ (which we will call the delay or delay embedding), to approximate $f^\dagger$ with a kernel interpolant $f$ of the past data (a kernel ridge regression model [13]), and to use the resulting surrogate model $x_{k+1} = f(x_k, \dots, x_{k-\tau+1})$ to predict future states. Given $\tau \in \mathbb{N}^*$ (see [22] for how $\tau$ can be learned in practice), the approximation of the dynamical system can then be recast as that of interpolating $f^\dagger$ from the pointwise measurements

$$Y_k = f^\dagger(X_k), \quad k = 1, \dots, N, \quad (2)$$

with $X_k := (x_k, \dots, x_{k+\tau-1})$, $Y_k := x_{k+\tau}$ and $N = n - \tau$. Given a reproducing kernel Hilbert space $\mathcal{H}$ of candidates for $f^\dagger$, and using the relative error in the RKHS norm $\|\cdot\|_{\mathcal{H}}$ as a loss, the regression of the data $(X_k, Y_k)$ with the kernel $K$ associated with $\mathcal{H}$ provides a minimax optimal approximation [33] of $f^\dagger$ in $\mathcal{H}$. In the presence of measurement noise of variance $\lambda$, this regressor is

$$f(x) = k(x, X)\,(k(X, X) + \lambda I)^{-1} Y, \quad (3)$$

where $X = (X_1, \dots, X_N)$, $Y = (Y_1, \dots, Y_N)$, $k(X, X)$ is the $N \times N$ matrix with entries $k(X_i, X_j)$, $k(x, X)$ is the $N$-vector with entries $k(x, X_i)$, and $I$ is the identity matrix. This regressor also has a natural interpretation in the setting of Gaussian process (GP) regression: (3) is the conditional mean of the centered GP $\xi \sim \mathcal{N}(0, K)$ with covariance function $K$ conditioned on $\xi(X_k) + \sqrt{\lambda} Z_k = Y_k$, where the $Z_k$ are i.i.d. centered normal random variables of unit variance.
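As an illustration, the regressor (3) together with the delay embedding (2) can be sketched in a few lines of NumPy. The Gaussian kernel, the function names, and the parameter values below are illustrative choices for this sketch, not the ones used in the experiments.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_predict(X, Y, x_new, lam=1e-6, sigma=1.0):
    # Kernel ridge regressor f(x) = k(x, X) (k(X, X) + lam I)^{-1} Y, as in (3).
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)
    return gaussian_kernel(x_new, X, sigma) @ alpha

def delay_embed(x, tau):
    # Delay embedding of a scalar series: X_k = (x_k, ..., x_{k+tau-1}), Y_k = x_{k+tau}.
    X = np.stack([x[k:k + tau] for k in range(len(x) - tau)])
    Y = x[tau:]
    return X, Y
```

Forecasting then amounts to feeding the last `tau` observed values to `fit_predict` and iterating on the model's own outputs.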

2.3. A reminder on the Kernel Flows (KF) algorithm. The accuracy of any kernel-based method depends on the kernel $K$, and [22] proposed (in the setting of Subsec. 2.2) to also learn that kernel from the data $(X_k, Y_k)$ with the Kernel Flows (KF) algorithm [34,44,12], which we now recall.
To describe this algorithm, let $K_\theta(x, x')$ be a family of kernels parameterized by $\theta$. Using the notations of Subsection 2.2, the interpolant of the data $(X, Y)$ (with $X = (X_1, \dots, X_N)$ and $Y = (Y_1, \dots, Y_N)$) obtained with the kernel $K_\theta$ (and a nugget $\lambda > 0$) admits the representer formula

$$f_\theta(x) = K_\theta(x, X)\,(K_\theta(X, X) + \lambda I)^{-1} Y. \quad (4)$$

A fundamental question is then: which $\theta$ should be chosen in (4)? KF answers that question by learning $\theta$ from data, based on the simple premise that a kernel $K_\theta$ is good if the interpolant (4) does not change much under subsampling of the data. This cross-validation concept is turned into an iterative algorithm as follows: at each iteration, (1) select a random batch $(X^b, Y^b)$ of the data, (2) select a random half sub-batch $(X^c, Y^c)$ of that batch, (3) compute the loss

$$\rho(\theta) = 1 - \frac{(Y^c)^T K_\theta(X^c, X^c)^{-1} Y^c}{(Y^b)^T K_\theta(X^b, X^b)^{-1} Y^b}, \quad (5)$$

which is the squared relative RKHS error between the interpolants obtained from the batch and from the sub-batch, and (4) update the parameters with a gradient step $\theta \leftarrow \theta - \eta \nabla_\theta \rho(\theta)$.
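A minimal sketch of one KF iteration is given below, assuming a hypothetical one-parameter Gaussian kernel family and a finite-difference gradient; the actual experiments use a richer kernel family and stochastic gradient descent, so this is only a toy illustration of the loss $\rho$.

```python
import numpy as np

def k_theta(A, B, theta):
    # Hypothetical one-parameter kernel family: Gaussian with bandwidth theta.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * theta ** 2))

def rho(theta, Xb, Yb, idx_c, lam=1e-8):
    # KF loss: rho = 1 - Yc^T K(Xc,Xc)^{-1} Yc / (Yb^T K(Xb,Xb)^{-1} Yb),
    # comparing the interpolant on a batch with the one on a half sub-batch.
    def quad(X, Y):
        K = k_theta(X, X, theta) + lam * np.eye(len(X))
        return Y @ np.linalg.solve(K, Y)
    return 1.0 - quad(Xb[idx_c], Yb[idx_c]) / quad(Xb, Yb)

def kf_step(theta, X, Y, rng, batch=100, lr=0.1, eps=1e-4):
    # One stochastic step: sample a batch, then half of it, then take a
    # central finite-difference gradient step on rho.
    b = rng.choice(len(X), size=min(batch, len(X)), replace=False)
    c = rng.choice(len(b), size=len(b) // 2, replace=False)
    g = (rho(theta + eps, X[b], Y[b], c) - rho(theta - eps, X[b], Y[b], c)) / (2 * eps)
    return theta - lr * g
```

Since the batch objective dominates the sub-batch objective term by term, $\rho(\theta)$ always lies in $[0, 1]$.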
2.4. The problem with irregularly sampled time series. The model (1) fails to be accurate for irregularly sampled series because it discards the information contained in the $t_k$. When the $x_k$ are obtained by sampling a continuous dynamical system, one could consider the following alternative model:

$$x_{k+1} = x_k + (t_{k+1} - t_k)\, f^\dagger(x_k, \dots, x_{k-\tau+1}). \quad (6)$$

While this approach may succeed if the time intervals $t_{k+1} - t_k$ are small enough, it will break down as these intervals get larger. In our experiments section, we refer to this approach as the Euler approach, as it amounts to learning the Euler discretization of the vector field.
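Regressing the Euler model amounts to using finite-difference estimates of the vector field as targets. A sketch of this data construction (function name and layout are illustrative) could read:

```python
import numpy as np

def euler_regression_data(x, t, tau):
    # Inputs: delay embeddings (x_k, ..., x_{k+tau-1}).
    # Targets: finite-difference estimates of the vector field,
    # Y_k = (x_{k+tau} - x_{k+tau-1}) / (t_{k+tau} - t_{k+tau-1}).
    X = np.stack([x[k:k + tau] for k in range(len(x) - tau)])
    dt = np.diff(t)[tau - 1:]
    Y = (x[tau:] - x[tau - 1:-1]) / dt
    return X, Y
```

When the true dynamics are linear in time, these finite-difference targets recover the slope exactly; for chaotic dynamics and large time gaps, they become poor estimates of the vector field, which is the failure mode described above.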

2.5. The proposed solution. To address this issue, we consider the model

$$x_{k+1} = f^\dagger(x_k, \Delta_k, \dots, x_{k-\tau+1}, \Delta_{k-\tau+1}), \quad (7)$$

which incorporates the time differences $\Delta_k = t_{k+1} - t_k$ between observations. That is, we employ a time-aware time series representation obtained by interleaving observations and time differences. The proposed strategy is then to construct a surrogate model of (7) by regressing $f^\dagger$ from past data with a kernel $K_\theta$ learned with Kernel Flows as described in Subsec. 2.3. Note that the past data takes the form (2) with $X_k := (x_k, \Delta_k, \dots, x_{k+\tau-1}, \Delta_{k+\tau-1})$, $Y_k := x_{k+\tau}$ and $N = n - \tau$.
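The time-aware representation above can be sketched as follows (function name and array layout are illustrative):

```python
import numpy as np

def time_aware_embedding(x, t, tau):
    # Build X_k = (x_k, dt_k, ..., x_{k+tau-1}, dt_{k+tau-1}) and Y_k = x_{k+tau},
    # interleaving observations with the time differences dt_k = t_{k+1} - t_k.
    dt = np.diff(t)
    rows = []
    for k in range(len(x) - tau):
        feats = []
        for j in range(tau):
            feats.extend([x[k + j], dt[k + j]])
        rows.append(feats)
    return np.array(rows), x[tau:]
```

Note that the last interleaved time difference, $\Delta_{k+\tau-1} = t_{k+\tau} - t_{k+\tau-1}$, is the gap to the predicted sample, which is known at prediction time since the future sampling times are given.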

Experiments
We conduct numerical experiments on three well-known dynamical systems: the Hénon map, the Van der Pol oscillator, and the Lorenz system. We generate irregularly sampled time series from these dynamical systems using numerical integration and split the resulting time series into training and test subsets. The time series are irregularly sampled according to the following scheme: the time interval $\Delta_k$ between consecutive observations is taken to be a multiple of the smallest integration time step $\delta_t$ used to generate the data. That is, $\Delta_k = \alpha_k \delta_t$, where $\alpha_k$ is a random integer between 1 and $\alpha$. We train the kernel on the training part of the time series and evaluate the forecasting performance of the model. We report both the mean squared error (MSE) and the coefficient of determination ($R^2$).
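The sampling scheme can be sketched as follows, assuming a finely integrated trajectory with step $\delta_t$ (the function name is an illustrative choice):

```python
import numpy as np

def irregular_sample(traj, delta_t, alpha, rng):
    # traj: finely integrated trajectory with constant step delta_t.
    # Keep samples separated by alpha_k * delta_t, with alpha_k a random
    # integer drawn uniformly from {1, ..., alpha}.
    idx, times, k = [], [], 0
    while k < len(traj):
        idx.append(k)
        times.append(k * delta_t)
        k += rng.integers(1, alpha + 1)
    return traj[np.array(idx)], np.array(times)
```

Setting `alpha=1` recovers the regularly sampled case, so the same pipeline covers both settings.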
Given test samples $x_{n+1}, x_{n+2}, \dots, x_N$ and predictions $\hat{x}_{n+1}, \hat{x}_{n+2}, \dots, \hat{x}_N$, the MSE and the coefficient of determination are computed as

$$\mathrm{MSE} = \frac{1}{N-n} \sum_{k=n+1}^{N} \|x_k - \hat{x}_k\|^2, \qquad R^2 = 1 - \frac{\sum_{k=n+1}^{N} \|x_k - \hat{x}_k\|^2}{\sum_{k=n+1}^{N} \|x_k - \bar{x}\|^2},$$

where $\bar{x}$ is the mean of the test samples. The MSE should be as low as possible and the $R^2$ as high as possible. Note that $R^2$ can be negative if the predictor performs worse than the mean of the samples.
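For concreteness, both metrics can be computed in a few lines (shown here for scalar series; the multivariate case replaces squares by squared norms):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error over the test samples.
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A predictor that always outputs the test mean gets $R^2 = 0$, and anything worse than that gets a negative $R^2$, which is why negative values appear for poor models.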
To showcase the importance of learning the kernel parameters and of including the time differences between subsequent observations, we proceed in three stages. First, we report the results of our method when the kernel parameters are not learned but sampled at random from a uniform $\mathcal{U}(0, 1)$ distribution, and when the time differences are not encoded in the input data. In this setup, we distinguish the original KF case from the Euler version discussed in Subsection 2.4. Second, to assess the importance of learning the kernel parameters, we report the model performance when the parameters are learned but the time differences are still not encoded in the input data. Lastly, we report the performance of our approach when the kernel parameters are learned and the time differences are included.
For all model variants and dynamical systems, we use the training procedure described in [22]: we use a mini-batch size of 100 temporal observations and minimize $\rho(\theta)$, as in Equation 5, using stochastic gradient descent. To provide a notion of uncertainty in the reported metrics, all our experiments are repeated five times with five randomly chosen kernel initializations.
In all of our examples, we used a kernel that is a linear combination of the triangular, Gaussian, Laplace, locally periodic, and quadratic kernels.

Table 1. Test performance on the different datasets. We report the means and standard deviations of the mean squared error (MSE) and coefficient of determination ($R^2$) on the forecasting task. As the Hénon map is not time-continuous, the Euler version of KF is not applicable in this case. For readability, we abstain from reporting the exact numbers when the MSE is larger than one or the $R^2$ is lower than 0.

Method.
For a horizon $h$ and a delay embedding with delay $d$, we split the test time series into chunks of length $h + d$. For each of these chunks, we use the first $d$ samples as input to our model and predict over the $h$ remaining samples. We then aggregate the predictions over all chunks to compute the reported metrics.
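This chunked multistep evaluation can be sketched as follows, where `predict_one` stands for a hypothetical one-step predictor (e.g. the fitted kernel regressor applied to the last $d$ values):

```python
import numpy as np

def chunked_forecast(series, d, h, predict_one):
    # Split the test series into chunks of length d + h; in each chunk the
    # first d samples seed the model and the next h are predicted recursively,
    # feeding the model its own outputs.
    preds, targets = [], []
    L = d + h
    for start in range(0, len(series) - L + 1, L):
        window = list(series[start:start + d])
        for _ in range(h):
            window.append(predict_one(window[-d:]))
        preds.extend(window[d:])
        targets.extend(series[start + d:start + L])
    return np.array(preds), np.array(targets)
```

Re-seeding the model at the start of every chunk prevents errors from accumulating over the whole test series, so the metrics measure $h$-step (rather than full-trajectory) forecasting skill.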
Overview. Recapping, we will compare five approaches: (A) Regressing model (7) with a kernel learned using KF (which we call irregular KF).
(B) Regressing model (1) with a kernel learned using KF (which we call regular KF).
(C) Regressing model (6) with a kernel learned using KF (which we call the Euler version).
(D) Regressing model (6) without learning the kernel.
(E) Regressing model (1) without learning the kernel. Table 1 summarizes the results obtained in the following sections.

Hénon map.
Consider the Hénon map with $a = 1.4$, $b = 0.3$:

$$x_{n+1} = 1 - a x_n^2 + y_n, \qquad y_{n+1} = b x_n.$$

We repeated our experiments five times with a delay embedding of 1, a learning rate $\eta$ of 0.1, a prediction horizon $h$ of 5, and a maximum time difference $\alpha$ of 3, and trained the model on 600 points to predict the next 400 points. Fig. 1i shows that approach (E) cannot reconstruct the attractor because it makes no attempt at learning the kernel and ignores the time differences in the sampling. Fig. 1ii shows that embedding the time differences in the kernel (approach (A)) significantly improves the reconstruction of the attractor of the Hénon map. Table 1 displays the forecasting performance of the different methods. We observe that if the kernel is not learned (i.e., not data-adapted), the underlying method is unable to learn an accurate representation of the dynamical system. However, if the parameters of the kernel are learned, our proposed approach (A) clearly outperforms the regular KF (approach (B)). As for the Euler version, it is not applicable in this case since the Hénon map is not time-continuous.
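Generating a Hénon trajectory for these experiments is a two-line recurrence; a minimal sketch (function name and initial condition are illustrative choices):

```python
import numpy as np

def henon(n, a=1.4, b=0.3, x0=0.0, y0=0.0):
    # Iterate the Hénon map x_{n+1} = 1 - a x_n^2 + y_n, y_{n+1} = b x_n,
    # storing the n iterates after the initial condition (x0, y0).
    xs = np.empty(n)
    ys = np.empty(n)
    x, y = x0, y0
    for i in range(n):
        x, y = 1.0 - a * x * x + y, b * x
        xs[i], ys[i] = x, y
    return xs, ys
```

Since the Hénon map is a discrete-time system, there is no integration step to subsample: the "time differences" are multiples of the map's unit step, which is why the Euler variant does not apply here.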
Van der Pol oscillator.
Our second example is the Van der Pol oscillator, described by

$$\dot{x} = y, \qquad \dot{y} = \mu (1 - x^2) y - x.$$

Here, we have used a prediction horizon $h$ of 10, a learning rate $\eta$ of 0.01, a maximum time difference $\alpha$ of 5, and a delay embedding of 1. As evident from Table 1, our proposed approach (A) again outperforms the alternatives.

Lorenz.
Our third example is the Lorenz system, described by the following system of differential equations:

$$\dot{x} = \sigma (y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z,$$

with standard parameter values $\sigma = 10$, $\rho = 28$, $\beta = 8/3$. Our parameters include a delay embedding of 2, a learning rate $\eta = 0.01$, a prediction horizon $h = 20$, a maximum time difference $\alpha = 5$, 5000 points used for training and the next 5000 for testing. Figures 7, 8 and 9 show that not learning the kernel or not including the time differences leads to poor reconstructions of the attractor of the Lorenz system, even when the time horizon is 1. However, as observed in Table 1, the Euler version of KF leads to satisfying results, close to (but not as good as) those obtained with our proposed approach (A).

Remark (Real-time learning and Newton basis): It is possible to include new measurements when approximating the dynamics from data without repeating the learning process. This can be done by working in the Newton basis as in [37] (see also Section 4 of [38]). The Newton basis is just another basis for the space spanned by the kernel on the data points, i.e., $\mathrm{span}\{k(\cdot, x_i)\}_{i=1}^{N}$ (and the basis is orthonormal in the RKHS inner product).
If we add new points $x_{N+1}, \dots, x_{N+m}$, we obtain corresponding elements $v_{N+1}, \dots, v_{N+m}$ of the Newton basis, still orthonormal to the previous ones. The new interpolant $f_{\mathrm{new}}(x) = \sum_{i=1}^{N+m} b_i v_i(x)$ can then be rewritten in terms of the old interpolant as $f_{\mathrm{new}}(x) = f(x) + \sum_{i=N+1}^{N+m} b_i v_i(x)$, where $f_{\mathrm{new}}$ can still be written in terms of the kernel basis $K(\cdot, x_i)$, but with different coefficients $c'$. If $A$ is the kernel matrix on the first $N$ points, one can compute a Cholesky factorization $A = L L^T$ with $L$ lower triangular. Let $B := L^{-T}$; then $v_j(x) = \sum_{i=1}^{N} B_{ij} K(x, x_i)$. When we add new points, we obtain an updated kernel matrix $A'$, and the Cholesky factor of $A$ can be easily updated to that of $A'$.
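The Cholesky update behind this remark is a standard block computation: appending one point adds one row to $L$. A minimal sketch (appending a single point at a time; the function name is illustrative, and `np.linalg.solve` is used for simplicity even though a triangular solve would be cheaper):

```python
import numpy as np

def cholesky_append(L, new_col, new_diag):
    # Given lower-triangular L with L L^T = A (N x N), return the Cholesky
    # factor of the (N+1) x (N+1) matrix [[A, c], [c^T, d]], where
    # c = new_col is the kernel column for the new point and d = new_diag.
    w = np.linalg.solve(L, new_col)       # solves L w = c
    s = np.sqrt(new_diag - w @ w)         # Schur complement of the new point
    N = L.shape[0]
    L_new = np.zeros((N + 1, N + 1))
    L_new[:N, :N] = L
    L_new[N, :N] = w
    L_new[N, N] = s
    return L_new
```

Each update costs $O(N^2)$ instead of the $O(N^3)$ of refactoring from scratch, which is what makes real-time inclusion of new measurements cheap.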

Conclusion
Our numerical experiments demonstrate that embedding the time differences between the observations in the kernel considerably improves the forecasting accuracy with irregular time series. Though we have focused on a few examples, the success of our proposed approach (A) has raised the question of whether it can be extended to other systems, including those described by partial and stochastic differential equations, as well as complex real-world data.

Reproducing Kernel Hilbert Spaces (RKHS).
We give a brief overview of reproducing kernel Hilbert spaces as used in statistical learning theory [14]. Early work developing the theory of RKHS was undertaken by N. Aronszajn [3].
Definition 5.1. Let $\mathcal{H}$ be a Hilbert space of functions on a set $\mathcal{X}$. Denote by $\langle f, g \rangle$ the inner product on $\mathcal{H}$ and let $\|f\| = \langle f, f \rangle^{1/2}$ be the norm in $\mathcal{H}$, for $f, g \in \mathcal{H}$. We say that $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) if there exists a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that
i. $K_x := K(x, \cdot) \in \mathcal{H}$ for all $x \in \mathcal{X}$.
ii. $K$ spans $\mathcal{H}$: $\mathcal{H} = \overline{\mathrm{span}\{K_x \mid x \in \mathcal{X}\}}$.
iii. $K$ has the reproducing property: $\forall f \in \mathcal{H}$, $f(x) = \langle f, K_x \rangle$.
$K$ will be called a reproducing kernel of $\mathcal{H}$. $\mathcal{H}_K$ will denote the RKHS $\mathcal{H}$ with reproducing kernel $K$ where it is convenient to note this dependence explicitly.
The important properties of reproducing kernels are summarized in the following proposition.
Proposition 5.1. If $K$ is a reproducing kernel of a Hilbert space $\mathcal{H}$, then
i. $K(x, y)$ is unique.
ii. $\forall x, y \in \mathcal{X}$, $K(x, y) = K(y, x)$ (symmetry).
iii. $\sum_{i,j=1}^{q} \alpha_i \alpha_j K(x_i, x_j) \ge 0$ for any $\alpha_i \in \mathbb{R}$ and $x_i \in \mathcal{X}$ (positive definiteness).
Theorem 5.1. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric and positive definite function. Then there exists a Hilbert space of functions $\mathcal{H}$ defined on $\mathcal{X}$ admitting $K$ as a reproducing kernel. Conversely, let $\mathcal{H}$ be a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$ satisfying: $\forall x \in \mathcal{X}$, $\exists \kappa_x > 0$ such that $|f(x)| \le \kappa_x \|f\|_{\mathcal{H}}$ for all $f \in \mathcal{H}$. Then $\mathcal{H}$ has a reproducing kernel $K$.
Theorem 5.2. Let $K(x, y)$ be a positive definite kernel on a compact domain or a manifold $\mathcal{X}$. Then there exists a Hilbert space $\mathcal{F}$ and a function $\Phi : \mathcal{X} \to \mathcal{F}$ such that $K(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{F}}$ for $x, y \in \mathcal{X}$.
$\Phi$ is called a feature map, and $\mathcal{F}$ a feature space.

Function Approximation in RKHSs:
An Optimal Recovery Viewpoint. In this section, we review function approximation in RKHSs from the point of view of optimal recovery, as discussed in [33]. Problem P. Given input/output data $(x_1, y_1), \dots, (x_N, y_N) \in \mathcal{X} \times \mathbb{R}$, recover an unknown function $u^*$ mapping $\mathcal{X}$ to $\mathbb{R}$ such that $u^*(x_i) = y_i$ for $i \in \{1, \dots, N\}$.
In the setting of optimal recovery [33], Problem P can be turned into a well-posed problem by restricting candidates for $u$ to belong to a Banach space of functions $\mathcal{B}$ endowed with a norm $\|\cdot\|$, and identifying the optimal recovery as the minimizer of the relative error

$$\min_{v} \max_{u} \frac{\|u - v\|^2}{\|u\|^2}, \quad (12)$$

where the max is taken over $u \in \mathcal{B}$ and the min is taken over candidates $v \in \mathcal{B}$ such that $v(x_i) = u(x_i) = y_i$. For the validity of the constraints $u(x_i) = y_i$, the dual space $\mathcal{B}^*$ of $\mathcal{B}$ must contain the Dirac delta functions $\phi_i(\cdot) = \delta(\cdot - x_i)$. This problem can be stated as a game between Players I and II. If $\|\cdot\|$ is quadratic, i.e. $\|u\|^2 = [Q^{-1} u, u]$, where $[\phi, u]$ stands for the duality product between $\phi \in \mathcal{B}^*$ and $u \in \mathcal{B}$ and $Q : \mathcal{B}^* \to \mathcal{B}$ is a positive symmetric linear bijection (i.e. such that $[\phi, Q\phi] \ge 0$ and $[\psi, Q\phi] = [\phi, Q\psi]$ for $\phi, \psi \in \mathcal{B}^*$), then the optimal solution of (12) has the explicit form

$$v^*(\cdot) = \sum_{i,j=1}^{N} y_i A_{i,j}\, Q\phi_j(\cdot), \quad (14)$$

where $A = \Theta^{-1}$ and $\Theta \in \mathbb{R}^{N \times N}$ is the Gram matrix with entries $\Theta_{i,j} = [\phi_i, Q\phi_j]$.
To recover the classical representer theorem, one defines the reproducing kernel $K$ as $K(x, y) = [\delta(\cdot - x), Q\delta(\cdot - y)]$. In this case, $(\mathcal{B}, \|\cdot\|)$ can be seen as an RKHS endowed with this kernel, and (14) corresponds to the classical representer theorem

$$v^*(\cdot) = y^T A K(x, \cdot), \quad (15)$$

using the vectorial notation $y^T A K(x, \cdot) = \sum_{i,j=1}^{N} y_i A_{i,j} K(x_j, \cdot)$ with $y_i = u(x_i)$, $A = \Theta^{-1}$ and $\Theta_{i,j} = K(x_i, x_j)$. Now, let us consider the problem of learning the kernel from data. As introduced in [34], the method of KFs is based on the premise that a kernel is good if there is no significant loss in accuracy in the prediction error when the number of data points is halved. This led to the introduction of

$$\rho = \frac{\|v^* - v^s\|^2}{\|v^*\|^2}, \quad (16)$$

the relative error between $v^*$, the optimal recovery (15) of $u^*$ based on the full dataset $X = \{(x_1, y_1), \dots, (x_N, y_N)\}$, and $v^s$, the optimal recovery of both $u^*$ and $v^*$ based on half of the dataset $X^s = \{(x_i, y_i) \mid i \in S\}$ ($\mathrm{Card}(S) = N/2$), which admits the representation $v^s = (y^s)^T A^s K(x^s, \cdot)$ with $y^s = \{y_i \mid i \in S\}$, $x^s = \{x_i \mid i \in S\}$, $A^s = (\Theta^s)^{-1}$, $\Theta^s_{i,j} = K(x^s_i, x^s_j)$. The quantity $\rho$ is directly related to the minimax problem (12), in which one minimizes the relative error of $v^*$ versus $v^s$. Instead of using the entire dataset $X$, one may use random subsets $X^{s_1}$ (of $X$) for $v^*$ and random subsets $X^{s_2}$ (of $X^{s_1}$) for $v^s$. Writing

$$\sigma^2(x) = K(x, x) - K(x, X^f) K(X^f, X^f)^{-1} K(X^f, x),$$

we have the pointwise error bound

$$|v^*(x) - u^*(x)| \le \sigma(x)\, \|u^*\|_{\mathcal{H}}. \quad (18)$$

Local error estimates such as (18) are classical in kriging [43] (see also [31, Thm. 5.1] for applications to PDEs). Furthermore, $\|u^*\|_{\mathcal{H}}^2$ is bounded from below by (and, with sufficient data, can be approximated by) $Y^{f,T} K(X^f, X^f)^{-1} Y^f$, i.e., the squared RKHS norm of the interpolant $v^*$.
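The conditional standard deviation $\sigma(x)$ appearing in the bound (18) is straightforward to compute; a minimal sketch, assuming a generic kernel function `k` passed as a callback (all names below are illustrative):

```python
import numpy as np

def pointwise_std(x_query, Xf, k, jitter=1e-10):
    # sigma^2(x) = K(x, x) - K(x, Xf) K(Xf, Xf)^{-1} K(Xf, x),
    # the conditional (kriging) variance at each query point; a small jitter
    # is added to the Gram matrix for numerical stability.
    Kff = k(Xf, Xf) + jitter * np.eye(len(Xf))
    Kxf = k(x_query, Xf)
    var = k(x_query, x_query).diagonal() - (Kxf * np.linalg.solve(Kff, Kxf.T).T).sum(axis=1)
    return np.sqrt(np.maximum(var, 0.0))
```

As expected from (18), $\sigma(x)$ vanishes at the data points (the interpolant is exact there) and approaches $\sqrt{K(x,x)}$ far away from the data, where the bound carries no information.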

Code.
All the relevant code for the experiments can be found at: https://github.com/jlee1998/Kernel-Flows-for-Irregular-Time-Series