Prediction of Dynamical time Series Using Kernel Based Regression and Smooth Splines

Prediction of dynamical time series with additive noise using support vector machines or kernel based regression has been proved to be consistent for certain classes of discrete dynamical systems. Consistency implies that these methods are effective at computing the expected value of a point at a future time given the present coordinates. However, the present coordinates themselves are noisy, and therefore, these methods are not necessarily effective at removing noise. In this article, we consider denoising and prediction as separate problems for flows, as opposed to discrete time dynamical systems, and show that the use of smooth splines is more effective at removing noise. Combination of smooth splines and kernel based regression yields predictors that are more accurate on benchmarks typically by a factor of 2 or more. We prove that kernel based regression in combination with smooth splines converges to the exact predictor for time series extracted from any compact invariant set of any sufficiently smooth flow. As a consequence of convergence, one can find examples where the combination of kernel based regression with smooth splines is superior by even a factor of $100$. The predictors that we compute operate on delay coordinate data and not the full state vector, which is typically not observable.


Introduction
The problem of time series prediction is to use knowledge of a signal x(t) for 0 ≤ t ≤ T and infer its value at a future time t = T + t f , where t f is positive and fixed.A time series is not predictable if it is entirely white noise.Any prediction scheme has to make some assumption about how the time series is generated.A common assumption is that the observation x(t) is a projection of the state of a dynamical system with noise superposed [8].Since the state of the dynamical system can be of dimension much higher than 1, delay coordinates are used to reconstruct the state.Thus, the state at time t may be captured as (x(t), x(t − τ ), . . ., x(t − (D − 1)τ )) (1.1) where τ is the delay parameter and D is the embedding dimension.Delay coordinates are (generically) effective in capturing the state correctly provided D ≥ 2d + 1, where d is the dimension of the underlying dynamics [21].
Farmer and Sidorowich [8] used a linear framework to compute predictors applicable to delay coordinates.It was soon realized that the nonlinear and more general framework of support vector machines would yield better predictors [15,18,19].Detailed computations demonstrating the advantages of kernel based predictors were given by Müller et al [19] and are also discussed in the textbook of Schölkopf and Smola [22].Kernel methods still appear to be the best, or among the best, for the prediction of stationary time series [16,20].
A central question in the study of noisy dynamical time series is how well that noise can be removed to recover the underlying dynamics.Lalley, and later Nobel, [12,14,13] have examined hyperbolic maps of the form x n+1 = F (x n ), with F : R d → R d .It is assumed that observations are of the form y n = x n + n , where n is iid noise.They proved that it is impossible to recover x n from y n , even if the available data y n is for n = 0, −1, −2, . . .and infinitely long, if the noise is normally distributed.However, if the noise satisfies | n | < ∆ for a suitably small ∆, the underlying signal x n can be recovered.The recovery algorithm does not assume any knowledge of F .The phenomenon of unrecoverability is related to homoclinic points.If the noise does not have compact support, with some nonzero probability, it is impossible to distinguish between homoclinic points.
Lalley [14] suggested that the case of flows could be different from the case of maps.In discrete dynamical systems, there is no notion of smoothness across iteration.In the case of flows, the underlying signal will depend smoothly on time but the noise, which is assumed to be iid at different points in time, will not.Lalley's algorithm for denoising relies on dynamics and, in particular, on recurrences.In the case of flows, we rely solely on smoothness of the underlying signal for denoising.As predicted by Lalley, the case of flows is different.Denoising based on smoothness of the underlying signal alone can handle normally distributed noise or other noise models.Thus, our algorithms are split into two parts: first the use of smooth splines to denoise, and second the use of kernel based regression to compute the predictor.Only the second part relies on recurrences.
Prediction of discrete dynamics, within the framework of Lalley [12], has been considered by Steinwart and Anghel [24] (also see [3]).Suppose x n = F n (x 0 ) and xn = x n + n is the noisy state vector.The risk of a function f is defined as where ν is the distribution of the noise and µ is a probability measure invariant under F and with compact support.Thus, the risk is a measure of how well the noisy future state vector can be predicted given the noisy current state vector.It is proved that kernel based regression is consistent with respect to this notion of risk for a class of rapidly mixing dynamical systems.Although the notion of risk does not require denoising, consistency of empirical risk minimization is proved for additive noise n of compact support as in [12].In the case of empirical risk minimization, compactness of added noise is not a requirement imposed by the underlying dynamics but is assumed to make it easier to apply universality theorems.Our results differ in the following ways.We consider flows and not discrete time maps.In addition, we work with delay coordinate embedding [21] and do not require the entire state vector to be observable.Finally, we prove convergence to the exact predictor, which goes beyond consistency.The convergence theorem we prove is not uniform over any class of dynamical systems.However, we do not assume any type of decay in correlations or rapid mixing.Non-uniformity in convergence is an inevitable consequence of proving a theorem that is applicable to any compact invariant set of a generic finite dimensional dynamical system [1,9,25].This point is further discussed in section 2, which presents the main algorithm as well as a statement of the convergence theorem.Section 3 presents a proof of the convergence theorem.
In section 4, we present numerical evidence of the effectiveness of combining spline smoothing and kernel based regression.The algorithm of section 2 is compared to computations reported in [19] and the spline smoothing step is found to improve accuracy of the predictor considerably.The numerical examples bring up two points that go beyond either consistency or convergence.First, we explain heuristically why it is not a good idea to iterate 1-step predictor k-times to predict the state k steps ahead.Rather, it is a much better idea to learn the k-step predictor directly.Second, we point out that no currently known predictor splits the distance vector between stable and unstable directions, a step which was argued to be essential for an optimal predictor by Viswanath et al [28].The heuristic explanation for why iterating a 1-step predictor k times is not a good idea relies on the same principle.
The concluding discussion in section 5 points out connections to related lines of current research in parameter inference [16,17] and optimal consistency estimates for stationary data [11].

Prediction algorithm and statement of convergence theorem
, define a flow that may be limited to an open subset of R d with compact closure.Let F t (U 0 ) be the time-t map with initial data U 0 .It is assumed that U (t; U 0 ), t ∈ R, is a trajectory of the flow whose initial point Let μ be a compactly supported invariant probability measure of the flow-map F t for t > 0 and let X be its support.It is assumed that the initial point ω is drawn from the measure μ.For ω ∈ X, the trajectory U (t; ω) exists for all t ∈ R and is unique.In addition, the flow is assumed to be ergodic with respect to the measure μ.
As a consequence of the C r embedding, there is a measure µ compactly supported in R D that corresponds to μ.The measure µ is ergodic and invariant under the flow lifted via the embedding.Denote the compact support of µ by X.For every point ω in X, there corresponds a unique point ω in X and vice versa.Because the prediction algorithm is based on delay coordinates and not the state vector, it is more convenient to work in the embedding space R D and in terms of ω and µ.Therefore, we will rely on the bijective correspondence between X and X and use the notation u(t; τ ; ω) instead of u(t; τ ; ω) and u(t; ω) instead of u(t; ω).With these conventions, u(t; τ ; ω) can be thought of as the path in R D with u(0; τ ; ω) = ω.Similarly, u(t; ω) can be thought of as a real-valued signal with u(0; ω) = ω 1 , where ω 1 is the first component of ω ∈ R D .In later arguments, the assumption that ω is µ-distributed will be significant, and so will be the ergodicity of the flow with respect to µ.
Given the signal u(t; ω), it is assumed that the recorded observations are u η (jh; ω) = u(jh; ω) + j , where j is iid noise.Following Eggermont and LaRiccia [5,6], we assume that E j = 0 and E | j | κ < ∞ for some κ > 3. To avoid inessential technicalities it is assumed that τ /h ∈ Z + so that the delay is an integral multiple of the time step h.In particular, we set τ = nh.Similarly, we assume t f = n f τ , n f ∈ Z + , where t f is the look-ahead into the future.The noisy delay coordinates u η (jh; τ ; ω) are assumed to be available for j = 0, . . ., (N + n f ) n, which implies that the observation interval of The exact predictor F : R D → R is a C r function such that F (u(t; τ ; ω)) = u(t + t f ; ω) for ω ∈ X. Lemma 3 proves uniqueness and existence of the exact predictor F .The exact predictor F corresponds to a fixed t f > 0, but that dependence is not shown in the notation.The problem as considered by Müller et al [19] is to recover the exact predictor F from the noisy observations u η (jh; ω).Let |•| denote Vapnik's -loss function.The algorithm of Müller et al computes f m such that the functional is minimized for f = f m in the reproducing kernel Hilbert space H Kγ corresponding to the kernel K γ .The kernel K γ is assumed to be given by K . The kernel bandwidth parameter γ and the Lagrange multiplier Λ are both determined using crossvalidation.This method approximates the exact predictor the approximation is iterated n f times.We will compare our predictor against that of Müller et al using some of the same examples and the same framework as they do in section 4.
In our algorithm, the first step is to apply spline smoothing.In particular, we apply cubic spline smoothing [4] to compute a function u s (t; ω), t ∈ [−(D − 1)τ, N τ + t f ] such that the functional The parameter λ is determined using five-fold cross-validation.The minimizer u s (t; ω) depends upon the noise-free signal u(t; ω) as well as the instantiation of the iid noise in u η (jh; ω) for −(D − 1)n ≤ j ≤ (N + n f )n.However, the dependence on the iid noise is not shown in the notation.
The second step of our algorithm is similar to the method of Müller et al.The predictor f 1 is computed as Both the parameters γ and Λ are determined using five-fold cross-validation.Here n f and therefore t f are fixed because we seek to approximate the exact predictor with lookahead fixed at t f .As explained in section 4, it is significant that the predictor directly optimizes with a lookahead of t f .Iterating a τ -step predictor n f times gives worse predictions.
The second step (2.3) differs from the algorithm of Müller et al in using the spline smoothed signal u s (t; ω) in place of the noisy signal u η (t; ω).Our algorithm relies mainly on spline smoothing to eliminate noise.Yet another difference is that we use the least squares loss function in place of the -loss function.This difference is a consequence of relying on spline smoothing to eliminate noise.As explained by Christmann and Steinwart [3], the -loss function, Huber's loss, and the L 1 loss function are used to handle outliers.However, spline smoothing eliminates outliers, and we choose the L 2 loss function because of its algorithmic advantages.
We now turn to a discussion of the convergence of the predictor f 1 to the exact predictor F .The first step is to assess the accuracy of spline smoothing.We quote the following lemma, which is a convenient restatement of a result of Eggermont and LaRiccia [5,6] (see pages 132 and 133 of [6]).In the lemma, , where h = τ /n and where j are iid random variables.It is further assumed that be the spline that minimizes the functional .
Let p = P (n, N, ∆, ω) be the probability that where the ∞-norm is over the interval Some remarks about the connection of this lemma to the algorithm given by (2.2) and (2.3) follow.First, the lemma assumes a fixed choice of λ (the relevant theorem in [5,6] in fact allows λ to lie in an interval).In our algorithm, λ is determined using cross-validation because of its practical effectiveness [29].
Second, the probability P(n, N, ∆, ω) (which may be interpreted as the probability that spline smoothing fails to denoise effectively) depends on ω and therefore on the particular trajectory.If P(n, N, ∆, ω) depends on ω only though a bound on the m-th derivative of u(t; ω), t ∈ [−(D − 1)n, nN ], the bound would be uniform for all trajectories on the compact invariant set X.The achievability part of Stone's optimality result [26] gives such a bound but the algorithm in that proof does not appear practical.Proving a similar result for smooth splines based on the existing literature does not appear entirely straightforward.In the L 2 norm, some uniform bounds have been proved for smooth splines by Györfi et al [10].A bound on the L 2 norm can be combined with a bound on the the m-th derivative using a Sobolev inequality to obtain an ∞-norm bound.Although the rate of convergence would be slightly sub-optimal, it would suffice for our purposes.However, the result of Györfi et al is for expectations and not for convergence in probability, and an argument using Chebyshev's inequality does not give strong bounds.
The convergence analysis of the second half of the algorithm also alters the algorithm slightly.In particular, the use of cross-validation to choose parameters is not a part of the analysis.To state the convergence theorem, we first fix > 0. By the universality theorem of Steinwart [23], we may choose F ∈ H Kγ such that ||F − F || ∞ < in a compact domain that has a non-empty interior and contains the invariant set X.The convergence theorem also makes the technical assumption 2 / ||F || 2  Kγ < 1, which may always be satisfied by taking small enough.
The choice of the kernel-width parameter γ is important in practice.In the convergence proof, the choice of γ is not explicitly considered.However, γ still plays a role because ||F || Kγ depends upon γ.
The parameter Λ in (2.3) is fixed as Λ = 2 / ||F || 2 Kγ for the proof.Next we pick δ = 1/2 and ∈ Z + such that the covering of the invariant set X using boxes of dimension 2 − ensures that the variation of F (as well as that of the exact predictor F and f 3 , which is defined later) within each box is bounded by δ/4.
Suppose A 1 , . . ., A L are boxes of dimension 2 − that cover X in the manner hinted above.We next choose T * such that the measure of the trajectories (with respect to the ergodic measure µ) that sample each one of the boxes A j adequately (in a sense that will be explained) is greater than 1 − if the time interval of the trajectory exceeds T * .
The parameter ∆ is a bound on the infinite norm accuracy of the smooth spline as in Lemma 1. Choose ∆ > 0 small enough that where B 1 is a constant specified later.The main purpose of increasing n is to make spline smoothing accurate.However, the following condition requiring n to be large enough is assumed in the proof: Within this set-up, we have the following convergence theorem.
Nonuniform bounds implying a form of weak consistency are considered by Steinwart, Hush, and Scovel [25].However, the algorithm of (2.2) and (2.3) does not fit into the framework of [25].The application of spline smoothing to produce u s (t; ω) means that u s (t; ω) may not be stationary, and our method of analysis does not rely on verifying a weak law of large numbers as in [25].The analysis summarized above and given in detail in the following section relies on ∞-norm bounds.

Proof of convergence
We begin the proof with a more complete account of how the embedding theorem is applied.is a C r diffeomorphism between Ṽ and its image in R D .This assumption is generically true [21].This map is called the delay embedding.Denote the image of Ṽ under the delay embedding by V .
The invariant measures μ and µ as well as X, X, ω, ω, u(t; ω), and u(t; τ ; ω) are as defined earlier.It is assumed that X ⊂ Ṽ , which implies X ⊂ V .
by U 0,τ so that U 0,τ ∈ V .There exists a unique and well-defined C r function F : V → R, called the exact predictor, such that for all U 0,τ ∈ V .In particular, F (u(t; τ ; ω)) = u(t + t f ; ω) for all t ∈ R and all ω ∈ X.
Proof.To map U 0,τ ∈ V to πF t f (U 0 ), first invert the delay map to obtain the point U 0 in Ṽ , advance that point by t f by applying F t f , and finally project using π.Each of the three maps in this composition is C r or better.The predictor must be unique because F t f is uniquely determined by the flow.
Remark.The embedding theory of Sauer et al [21] may be applied to the compact invariant set X without enclosing it in the open set Ṽ .Indeed, if the box counting dimension of X is d , the embedding dimension need only satisfy D ∈ Z + and D > 2d .That can be advantageous because we may have d much smaller than d.However, there are two difficulties if X is a fractal set.First, tangent spaces cannot be defined and we cannot assert the delay map to be a diffeomorphism although it will be one-one generically.Second, we will need to extend F to the closure of an open neighborhood of X in R D to apply the universality theorem, and such an extension cannot be made from X if X is a fractal set.Both these difficulties go away if we take Ṽ to be a submanifold that contains X.If d is the dimension of Ṽ , we would only require D > 2d .For simplicity, we have assumed Ṽ to be an open set.
The following convexity lemma is an elementary result of convex analysis [7].It is stated and proved for completeness.Lemma 4. Let L 1 (f ) and L 2 (f ) be convex and continuous in f , where f ∈ H and H is a Hilbert space.If w ∈ ∇L i (f ), the subgradient at f , assume that for ||f || ≤ r, and assume that ||f 1 || < r and ||f 2 || < r.Then, Proof.Because f 1 minimizes L 1 (f ), we have 0 ∈ ∇L 1 (f 1 ).Thus, By adding the two inequalities, we have proving the lemma.This last step relies on is the noise-free signal, our arguments are phrased under the assumption that |u(t; ω) − u s (t; ω)| ≤ ∆.This assumption is realized with probability 1 − P(n, N, ∆, ω), which tends to 1 as n increases (by Lemma 1).For convenience, we denote P(n, N, ∆, ω) by p.The probability that u η (t; ω) is successfully denoised by smooth splines so that |u(t; ω) − u s (t; ω)| ≤ ∆ is then 1 − p.
In general, a C r function defined on an embedded submanifold can be extended to an open neighborhood of the submanifold using a partition of unity.Because V ⊂ R D is an embedded submanifold, X ⊂ V , and the exact predictor F is defined on V , it follows that there exists M > 0 such that F can be extended to Y , where We will always assume ∆ < M so that the spline-smoothed signal maps to Y under delay embedding with probability greater than 1 − p.Without loss of generality, we assume M ≤ 1.The convergence proof will assess the approximation to F with respect to the measure µ.Therefore, the manner in which the extension is carried out is not highly relevant.The sole purpose of the extension is to facilitate an application of the universality theorem for Gaussian kernels. Let Thus, B is a bound on the size of the embedded invariant set with ample allowance for error in spline smoothing.Let u s (t; ω) denote the spline-smoothed signal and u(t; ω) the noise-free signal with ω ∈ X. Define where t f = n f τ , n f ∈ Z + , and K is any smooth and positive kernel defined over Y × Y .The kernel K will be specialized to the Gaussian kernel K γ when applying the universality theorem.Define using the noise-free signal u(t; ω).Let T = N τ and define For Λ > 0, all three functionals are strictly convex and have a unique minimizer.The unique minimizers of W 1 , W 2 , and W 3 are denoted by f 1 , f 2 , and f 3 , respectively.The functional W 1 is the same as in (2.3), the second step of the algorithm.Thus, f 1 is the computed approximation to the exact predictor F .
The following lemma bounds the minimizers of Lemma 5.The minimizer and the stated bound for ||f 1 || K follows.The bounds for f 2 and f 3 are proved similarly.
Here B 1 depends only on B and the kernel K.The kernel K is assumed to be C 2 .
Proof.First, we note that ||f || ∞ ≤ c 0 ||f || K and ||∂f || ∞ ≤ c 1 ||f || K , where ∂ is the directional derivative of f in any direction.By a result of Zhou (part (c) of Theorem 1 of [30]), we may take , where D is the embedding dimension and the ∞-norm is over x, y ∈ Y .If we define B 1 using it follows that both ||f || ∞ and ||∂f || ∞ (where ∂ is a directional derivative in any direction) are bounded above by B 1 /Λ 1/2 .We may write The proof is completed by utilizing these bounds in (3.3) and defining B 1 as Proof.Follows from Lemmas 5, 6, and 4. Lemma 4 is applied with Λ , and λ = 2Λ.The choice of r is justified by Lemma 5 and the choice of δ is justified by Lemma 6.To justify the choice of λ, note that W 1 (f ) and W 2 (f ) can both be written as K .Thus, if w ∈ ∇W i (f ) (the subgradient of W i is unique and may be obtained explicitly), we must have Λ .Proof.We will argue as in Lemma 6 and assume that ||f || ∞ , and ||∂f || ∞ are bounded by we may apply the mean value theorem to the integral and argue as in Lemma 6 to upper bound the difference by (B 1 ) 2 h/Λ.The proof is completed by summing the differences from k = 0 to k = N n − 1 and dividing by N n.
Proof.Follows from Lemmas 5, 8, and 4. Lemma 4 is applied with Λ , and λ = 2Λ.The choices of r, δ, and Λ are justified using Lemmas 5 and 8 and an additional argument as in the proof of Lemma 7.
Choose > 0. At this point, we specialize K to a kernel for which the universality theorem of Steinwart applies.For example, K = K γ .We may then find F ∈ H K such that ||F − F || ∞ ≤ , where the ∞-norm is over Y .In fact, we will need the difference |F (x) − F (x)| to be bounded by only for x ∈ X.The larger compact space Y is needed to apply the universality theorem and for other RKHS arguments.
In addition, Proof.We have is the minimizer, and This last inequality uses ´(F (u(t; τ The proof of the first part of the lemma is completed by combining the inequalities.To prove the second part, we argue similarly after noting Consider half-open boxes in R D of the form with ∈ Z + and j i ∈ Z.The whole of R D is a disjoint union of such boxes.Because X is compact, we can assume that X ⊂ ∪ L j=1 A j , where the union is disjoint, each A j is a half-open box of the form above, and A j ∩ X = φ for 1 ≤ j ≤ L.
We will pick to be so large, that each box has a diameter that is bounded as follows: Here δ > 0 is determined later, and Lemma 10 tells us that ||f 3 || K ≤ √ 2 ||F || K , and therefore (by part (c) of Theorem 1 of [30]) As a consequence of our choice of , x, y ∈ A j implies that |f 3 (x) − f 3 (y)| < δ/4, (3.4) bounding the variation of f 3 within a single cell A j .Because the exact predictor F is C r , r ≥ 2, and X is compact, we may also assert that for x, y ∈ A j by taking larger if necessary.The next lemma is about taking a trajectory that is long enough that each of the sets A j is sampled accurately.By assumption X is the support of µ.However, we may still have µ(A j ) = 0 for some j.In the following lemma and later, it is assumed that all A j with µ(A j ) = 0 are eliminated from the list of boxes covering X. Lemma 11.Let χ A j denote the characteristic function of the set A j .There exist T * > 0 and a Borel measurable set S ,T * ⊂ X such that ω ∈ S ,T * implies that for all T ≥ T * and j = 1, . . ., L and with µ (S ,T * ) > 1 − .Proof.To begin with, consider the set A 1 .By the ergodic theorem, lim for ω ∈ S ⊂ X with µ(S) = 1.Let ) for some T ≥ s .
The sets A s, shrink with increasing s.Then the measure of ∩ ∞ s=1 A s, under µ is zero.Therefore, there exists We can find s 2 , . . ., s L similarly by considering the sets A 2 , . . ., A L .The lemma then holds with T * = max(s 1 , . . ., s L ).
Lemma 12. Suppose that ω ∈ S ,T * , T ≥ T * , and Λ = 2 / ||F || 2 K ≤ 1. Suppose that f 3 minimizes W 3 (f ), which is defined using u(t; ω), T , and Λ .Then For ω ∈ S ,T * , we have where the first inequality holds because A j are disjoint, the second inequality holds because |f 3 (y) − F (y)| > δ/2 follows from (3.6) for y = u(t; τ ; ω) ∈ A j with j ∈ J, and the final inequality is a consequence of Lemma 11 and ω ∈ S ,T * .embedding dimension used for delay coordinates were τ = 6 and D = 6, as in [19].The size of the training set was N = 1000.For cross-validation, the γ/2D parameter was varied over {0.1, 1.5, 10.0, 50.0, 100.0}, and the Λ parameter was varied over 10 −8.5 , 10 −8 , . . ., 10 −0.5 for least squares with or without spline smoothing but over 10 −10 , 10 −6 , 10 −2 , 10 2 for the more expensive support vector regression.For support vector regression, the was varied over {0.01, 0.05, 0.25}.The phenomenon we will demonstrate is far more pronounced than the slight gains obtained using more extensive cross-validation.For support vector regression, we were able to reproduce the relevant results reported in [19].4.2, and 4.3, each point is an average over 480 independent datasets in the case of least squares with or without spline smoothing and over 48 data sets in the case of support vector regression.In all cases, using half as many datasets does not change the picture.
A t f = n f τ predictor can be obtained by iterating a τ -step predictor n f times, and this strategy is sometimes used to save cost [19].This is not a good idea as explained in [28] and as shown in Figure 4.2.An optimal predictor would need to roughly split the distance to the nearest training sample such that the component of the distance along unstable directions is small and with the component along stable directions allowed to be much larger.The balance between the two components depends upon t f , and therefore, iterating a one-step predictor is not a good strategy.
In Figure 4.1, we see that spline smoothing becomes more and more advantageous as noise   increases.The situation in Figure 4.3 is a little different.When t f is small, spline smoothing does help more for the noisier SNR of 0.4 compared to 0.2.However, for larger t f , even though spline smoothing helps, it does not help more when the noise is higher.This could be because as t f increases capturing the correct geometry of the predictor becomes more and more difficult, and this difficulty may be constraining the accuracy of the predictor.The MacKey-Glass example is a delay-differential equation and does not come under the purview of our convergence theorem.The Lorenz example, ẋ = 10(y −x), ẏ = 28x−y −xz, ż = −8z/3 + xy, is a dynamical system with a compact invariant set and comes under the purview of the convergence theorem.The Lorenz signal has a standard deviation of 7.9.For the Lorenz plots of

Discussion
For the prediction of dynamical time series, we have shown that flows are quite different from maps.In the case of flows, the time series can be denoised by relying solely on the smoothness of the underlying flow.The predictor can be derived by applying kernel-based regression to the denoised signal.The resulting predictor converges to the exact predictor under conditions described by Theorem 2.
As far as dynamical time series are concerned, the parameter estimation problem [16,17] is complementary to prediction.Much of the existing theory is for maps and with the assumption of rapid mixing.For flows, smooth splines or a similar technique may prove an effective method to denoise in the context of parameter estimation as well.
The convergence theorem given here does not give rates and is not uniform.Obtaining rates with uniformity over a class of flows will probably require rapid mixing assumptions as in the case of maps [11,24].Rapid mixing results for flows may be found in [2] for example.
With respect to rates and uniformity, there are two more issues that would need to be considered.First, convergence of smooth splines in the ∞-norm must be proved with explicit bounds that depend only on the norm of the m-th derivative.A more significant point is that rates of convergence for a given lookahead t f may not be the best direction.As pointed out in [28], the question of how large t f can be given a signal of length T appears to have implications for the prediction algorithm and not just to its analysis.There is no evidence that existing algorithms including the one in this paper are capable of predicting as far into the future as an optimal algorithm should.
The smooth spline idea is primarily local and so are the optimality results of Stone [26].Stone's algorithm for achievability is to find a local scale and to fit a polynomial using linear least squares within that local region.It is perhaps worth noting that the same idea has a dynamical analog.In its dynamical version [27], the noisy dynamical time series is embedded within Euclidean space using delay coordinates.The embedding will be necessarily noisy.However, the embedded manifold can be smoothed locally using linear techniques.

Figure 4 . 1 :
Figure 4.1: Root mean square errors in the prediction of the Mackey-Glass signal with t f = 1 as a function of the signal to noise ratio.The superiority of the method using smooth splines is evident.

Figure 4 .
1 demonstrates that (2.1) produces predictors that are corrupted by errors in the inputs or delay coordinates.The method with spline smoothing is more accurate and deteriorates less with increasing SNR.For the Mackey-Glass plots in Figures 4.1 ,

Figure 4 . 2 :
Figure 4.2: Comparison of the 1-step least squares predictor (without spline smoothing) iterated t f times with the t f -step predictor (without spline smoothing).The latter is seen to be superior.

Figure 4 . 3 :
Figure 4.3: The plot on the left uses SNR of 0.2 and the plot on the right uses 0.4.The method using smooth splines does better in all instances.

Figure 4 . 4 :
Figure 4.4: The advantage of spline smoothing for Lorenz is much less on the right with h = 0.1 than on the left with h = 0.01.

Figure 4 . 4 ,
each point is an average over 160 datasets each with N = 1000.The picture did not change even with many fewer datasets.

Figure 4 .
4 compares h = .01and h = .1 for Lorenz.In both cases, the embedding dimension is d = 10, the delay parameter is τ = 1, and the lookahead is t f = h.It may be seen that spline smoothing is less effective when h = 0.1 as compared to h = 0.01.A typical Lorenz oscillation has a period of about 0.75, and when h = 0.1 the resolution is too low causing too much discretization error.Smooth splines are less effective in reconstructing the noise-free signal if the grid on the time axis does not have sufficient resolution.The left half of Figure4.4 shows an example where prediction using spline smoothing improves accuracy by a factor of 100 with h = 0.01.