ERROR BOUNDS FOR KERNEL-BASED APPROXIMATIONS OF THE KOOPMAN OPERATOR

ABSTRACT. We consider the data-driven approximation of the Koopman operator for stochastic differential equations on reproducing kernel Hilbert spaces (RKHS). Our focus is on the estimation error if the data are collected from long-term ergodic simulations. We derive both an exact expression for the variance of the kernel cross-covariance operator, measured in the Hilbert-Schmidt norm, and probabilistic bounds for the finite-data estimation error. Moreover, we derive a bound on the prediction error of observables in the RKHS using a finite Mercer series expansion. Further, assuming Koopman-invariance of the RKHS, we provide bounds on the full approximation error. Numerical experiments using the Ornstein-Uhlenbeck process illustrate our results.


INTRODUCTION
The Koopman operator [28] has become an essential tool in the modeling process of complex dynamical systems based on simulation or measurement data. The philosophy of the Koopman approach is that for a (usually non-linear) dynamical system on a finite-dimensional space, the time-evolution of expectation values of observable functions satisfies a linear differential equation. Hence, after "lifting" the dynamical system into an infinite-dimensional function space of observables, linear methods become available for its analysis. The second step is then to notice that traditional Galerkin approximations of the Koopman operator can be consistently estimated from simulation or measurement data, establishing the fundamental connection between the Koopman approach and modern data science. Koopman methods have found widespread application in system identification [5], control [29,49,30,22,56], sensor placement [38], molecular dynamics [57,51,42,43,23,64], and many other fields. We refer to [24,40,6] for comprehensive reviews of the state of the art.
The fundamental numerical method for the Koopman approach is Extended Dynamic Mode Decomposition (EDMD) [62], which allows one to learn a Galerkin approximation of the Koopman operator from finite (simulation or measurement) data on a subspace spanned by a finite set of observables, often called a dictionary. An appropriate choice of said dictionary is a challenging problem. In light of this issue, representations of the Koopman operator on large approximation spaces have been considered in recent years, including deep neural networks [36,39], tensor product spaces [26,44], and reproducing kernel Hilbert spaces (RKHS) [63,12,25]. See also [3,8,15,16,17,31,32] for recent studies on the use of reproducing kernels in the context of dynamical systems. In Reference [25] it was shown that by means of the integral operator associated to an RKHS, it is possible to construct a type of Galerkin approximation of the Koopman operator. The central objects are (cross-)covariance operators, which can be estimated from data using only evaluations of the feature map. Due to the relative simplicity of the resulting numerical algorithms on the one hand, and the rich approximation properties of reproducing kernels on the other hand, kernel methods have emerged as a promising candidate to overcome the fundamental problem of dictionary selection.
A key question is the quantification of the estimation error for (compressed) Koopman operators. For finite dictionaries and independent, identically distributed (i.i.d.) samples, error estimates were provided in [33,45], see also [66] for the ODE case and [56] for an extension to control-affine systems. The estimation error for cross-covariance operators on kernel spaces was considered in [41], where general concentration inequalities were employed. The data were also allowed to be correlated, and mixing coefficients were used to account for the lack of independence. In this article, we take a different route and follow the approach of our previous paper [45], where we, in addition, also derived error estimates for the Koopman generator and operator for finite dictionaries and data collected from long-term, ergodic trajectories. This setting is relevant in many areas of science, where sampling i.i.d. from an unknown stationary distribution is practically infeasible, e.g., in fluid or molecular dynamics. The centerpiece of our results was an exact expression for the variance of the finite-data estimator, which can be bounded by an asymptotic variance. The asymptotic variance by itself is a highly interesting dynamical quantity, which can also be described in terms of Poisson equations for the generator [34, Section 3].
We consider the Koopman semigroup (K^t)_{t≥0} generated by a stochastic differential equation on the space L^2_µ, where µ is a probability measure which is invariant w.r.t. the associated Markov process. We study the action of K^t on observables in an RKHS H which is densely and compactly embedded in L^2_µ. If this action is considered through the "lens" of the kernel integral operator E : L^2_µ → H (see Section 2.2), we arrive at a family of operators C^t_H = EK^tE^* (cf. Figure 1). The action of C^t_H : H → H is that of a cross-covariance operator with respect to the kernel k(·, ·) generating the RKHS H. These operators possess canonical empirical estimators based on finite simulation data, which only require evaluations of the feature map.

FIGURE 1. Diagram illustrating the different operators involved.
Our contribution, illustrated in Figure 2, is two-fold. In our first main result, Theorem 3.1, we provide an exact formula for the Hilbert-Schmidt variance of the canonical empirical estimator Ĉ^{m,t}_H of the cross-covariance operator C^t_H, for m data points sampled from a long ergodic simulation. This result holds under the very mild assumption that λ = 1 is a simple isolated eigenvalue of K^t (which does not exclude deterministic systems), extends the findings in [45] to the kernel setting, and no longer depends on the dictionary size (which would be infinite, at any rate). Furthermore, the result allows for probabilistic estimates for the error ∥Ĉ^{m,t}_H − C^t_H∥_HS, see Proposition 3.5. As a second main result, we propose an empirical estimator for the restriction of the Koopman operator K^t to H, truncated to finitely many terms of its estimated Mercer series expansion, and prove a probabilistic bound for the resulting estimation error in Theorem 4.1, measured in the operator norm for bounded linear maps from H to L^2_µ. This result can be seen as a bound on the prediction error for the RKHS-based Koopman operator due to the use of finite data. In the situation where the RKHS is invariant under the Koopman operator, we complement the preceding error analysis with a bound on the full approximation error in Theorem 4.5.
Finally, we illustrate our results for a one-dimensional Ornstein-Uhlenbeck (OU) process. For this simple test case, all quantities appearing in our error estimates are known analytically and can be well approximated numerically. Therefore, we are able to provide a detailed comparison between the error bound obtained from our results and the actual errors observed for finite data. Our experiments show that our bounds for the estimation error of the cross-covariance operator are accurate, and that the corrections we introduced to account for the inter-dependence of the data are indeed required. Concerning the prediction error, we find our theoretical bounds still far too conservative, which reflects the problem of accounting for the effect of inverting the mass matrix in traditional EDMD. This finding indicates that additional research is required on this end.

FIGURE 2. Illustration of main results
The paper is structured as follows: the setting is introduced in Section 2. The result concerning the variance of the empirical cross-covariance operator, Theorem 3.1, is presented and proved in Section 3, while our bound for the prediction error is part of Theorem 4.1 in Section 4. Numerical experiments are shown in Section 5, and conclusions are drawn in Section 6.

PRELIMINARIES
In this section, we review the required background on stochastic differential equations (Section 2.1), reproducing kernel Hilbert spaces (Section 2.2), Koopman operators (Section 2.3), their representations on an RKHS (Section 2.4), and the associated empirical estimators (Section 2.5). The results in this section can all be found in the literature, but we list them here at any rate to achieve a self-contained presentation. Selected proofs are also shown in the appendix for the reader's convenience.

Stochastic differential equations. Let a stochastic differential equation with drift vector field b : X → R^d and diffusion matrix field σ : X → R^{d×d} be given, i.e.,

dX_t = b(X_t) dt + σ(X_t) dW_t,    (2.1)

where W_t is d-dimensional Brownian motion. We assume that both b and σ are Lipschitz-continuous and satisfy a linear growth condition. Then [46, Theorem 5.2.1] guarantees the existence of a unique solution (X_t)_{t≥0} to (2.1). The solution (X_t)_{t≥0} constitutes a continuous-time Markov process whose transition kernel will be denoted by ρ_t : X × B_X → R, where B_X denotes the Borel σ-algebra on X. Then ρ_t(x, ·) is a probability measure for all x ∈ X, and for each A ∈ B_X, ρ_t(·, A) is a representative of the conditional probability of the event {X_t ∈ A} given X_0, where P_{X_0} denotes the marginal distribution of X_0.
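An SDE of the form (2.1) can be approximated numerically with the Euler-Maruyama scheme. The following is a minimal sketch (not part of the paper; all names and constants below are our own), checked against the deterministic case σ = 0 with linear drift:

```python
import numpy as np

def euler_maruyama(b, sigma, x0, t_end, n_steps, rng=None):
    """Simulate one path of dX_t = b(X_t) dt + sigma(X_t) dW_t
    with the Euler-Maruyama scheme (fixed step size, illustrative only)."""
    rng = rng or np.random.default_rng(0)
    d = len(x0)
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=d)  # Brownian increment
        x = x + b(x) * dt + sigma(x) @ dw
        path.append(x.copy())
    return np.array(path)

# Deterministic sanity check: with sigma = 0 and drift b(x) = -x,
# the scheme approximates x0 * exp(-t).
path = euler_maruyama(lambda x: -x,
                      lambda x: np.zeros((1, 1)),
                      [1.0], t_end=1.0, n_steps=1000)
print(path[-1])  # close to exp(-1) ≈ 0.3679
```

For stochastic dynamics one would average many such paths; the step size controls the discretization bias.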
Throughout, we will assume the existence of an invariant (Borel) probability measure µ for the Markov process (X_t)_{t≥0}, i.e., ∫_X ρ_t(x, A) dµ(x) = µ(A) for all A ∈ B_X and all t ≥ 0.
In addition to invariance, we assume that µ is ergodic, meaning that for any t > 0 every ρ_t-invariant set A (that is, ρ_t(x, A) = 1 for all x ∈ A) satisfies µ(A) ∈ {0, 1}. In this case, the Birkhoff ergodic theorem [20, Theorem 9.6] (see also (D.1)) and its generalizations apply and allow us to calculate expectations w.r.t. µ using long-time averages over simulation data.

Reproducing kernel Hilbert spaces.
In what follows, let k : X × X → R be a continuous and symmetric positive definite kernel, that is, we have k(x, y) = k(y, x) for all x, y ∈ X and ∑_{i,j=1}^m k(x_i, x_j) c_i c_j ≥ 0 for all choices of x_1, ..., x_m ∈ X and c_1, ..., c_m ∈ R. It is well known that k generates a so-called reproducing kernel Hilbert space (RKHS) [1,7,47] (H, ⟨·, ·⟩) of continuous functions such that for ψ ∈ H the reproducing property ψ(x) = ⟨ψ, Φ(x)⟩ holds, where Φ : X → H denotes the so-called feature map corresponding to the kernel k, i.e., Φ(x) = k(x, ·). In the sequel, we shall denote the norm on H by ∥·∥ and the kernel diagonal by φ, i.e., φ(x) = k(x, x). Then for x ∈ X we have ∥Φ(x)∥² = ⟨Φ(x), Φ(x)⟩ = k(x, x) = φ(x). We shall frequently make use of the following estimate:

|ψ(x)| = |⟨ψ, Φ(x)⟩| ≤ φ(x)^{1/2} ∥ψ∥.

In particular, applied to ψ = Φ(y), it shows that k is bounded if and only if its diagonal φ is bounded. By L^p_µ(X), p ∈ [1, ∞), we denote the space of all functions (not equivalence classes) on X with a finite p-norm ∥·∥_p. Henceforth, we shall impose the following Compatibility Assumptions:

(A1) φ ∈ L^2_µ(X).
(A2) If ψ ∈ L^2_µ(X) is such that ∫∫ k(x, y) ψ(x) ψ(y) dµ(x) dµ(y) = 0, then ψ = 0.
(A3) If ψ ∈ H is such that ψ(x) = 0 for µ-a.e. x ∈ X, then ψ(x) = 0 for all x ∈ X.
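Positive definiteness of a kernel can be checked numerically on finite samples: every Gram matrix [k(x_i, x_j)] must be symmetric positive semidefinite. A small sketch (our own illustration, using a standard Gaussian RBF kernel, not necessarily the paper's choice of k):

```python
import numpy as np

def gram(xs, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel k(x, y) = exp(-(x-y)^2 / (2 sigma^2))."""
    xs = np.asarray(xs, dtype=float)
    return np.exp(-(xs[:, None] - xs[None, :]) ** 2 / (2 * sigma ** 2))

xs = np.linspace(-2.0, 2.0, 25)
K = gram(xs)
eigs = np.linalg.eigvalsh(K)  # all eigenvalues must be >= 0 (up to round-off)
print(K.shape, eigs.min() > -1e-10)  # (25, 25) True
```

Numerically, the smallest eigenvalues of such a Gram matrix decay extremely fast, so a small negative tolerance accounts for floating-point round-off.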
Many of the statements in this subsection can also be found in [59, Chapter 4]. However, as we aim to present the contents in a self-contained way, we provide the proofs in A.
Lemma 2.1. Under the assumption that φ ∈ L^1_µ(X) (in particular, under assumption (A1)), we have that H ⊂ L^2_µ(X) with ∥ψ∥_µ ≤ ∥φ∥_1^{1/2} ∥ψ∥ for all ψ ∈ H, and assumption (A2) is equivalent to the density of H in L^2_µ(X). We have meticulously distinguished between functions and equivalence classes, as there might be distinct functions ϕ and ψ in H which are equal µ-almost everywhere, i.e., ϕ = ψ in L^2_µ(X). The compatibility assumption (A3) prohibits this situation, so that H can in fact be seen as a subspace of L^2_µ(X), which is then densely and continuously embedded. Remark 2.2. (a) Condition (A1) implies k ∈ L^4_{µ⊗µ}(X × X), where µ ⊗ µ is the product measure on X × X.
It immediately follows from ∫ ∥ψ(x)Φ(x)∥ dµ(x) = ∫ |ψ(x)| φ(x)^{1/2} dµ(x) ≤ ∥φ∥_1^{1/2} ∥ψ∥_µ for ψ ∈ L^2_µ(X) that the linear operator E : L^2_µ(X) → H, defined by

Eψ = ∫ ψ(x) Φ(x) dµ(x),

is well defined (as a Bochner integral in H) and bounded with operator norm not larger than ∥φ∥_1^{1/2}. Remark 2.3. The so-called kernel mean embedding E_k, mapping probability measures ν on X to the RKHS H, is defined by E_k ν = ∫ Φ(x) dν(x), see, e.g., [58]. Hence, we have Eψ = E_k ν with dν = ψ dµ.
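In practice, the integral defining Eψ is replaced by a Monte Carlo average over samples x_i ∼ µ, so that (Eψ)(y) ≈ (1/m) ∑_i ψ(x_i) k(x_i, y). A sketch under illustrative assumptions (µ = N(0,1), Gaussian kernel with unit bandwidth, ψ = 1, none of which are specified in the paper), where the integral has the closed form (E1)(y) = exp(−y²/4)/√2:

```python
import numpy as np

# Monte Carlo estimate of (E psi)(y) = ∫ psi(x) k(x, y) dmu(x)
# for mu = N(0, 1) and k(x, y) = exp(-(x - y)^2 / 2).
rng = np.random.default_rng(1)
xs = rng.normal(size=200_000)  # samples from mu

def E_psi(y, psi):
    return np.mean(psi(xs) * np.exp(-(xs - y) ** 2 / 2))

approx = E_psi(0.0, lambda x: np.ones_like(x))
exact = 1 / np.sqrt(2)  # Gaussian integral in closed form
print(abs(approx - exact) < 5e-3)  # True
```

The Monte Carlo error decays like m^{-1/2}, which is the same sampling-error mechanism the estimation bounds in this paper quantify for the operator-valued case.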
Note that the operator E is not an embedding in strict mathematical terms. The terminology embedding rather applies to its adjoint E^*. Indeed, the operator E enjoys the simple but important property

⟨Eψ, η⟩ = ⟨ψ, η⟩_µ  for ψ ∈ L^2_µ(X) and η ∈ H.

This implies that the adjoint operator E^* : H → L^2_µ(X) maps a function η ∈ H to its equivalence class in L^2_µ(X). We shall further define the covariance operator C_H := EE^* : H → H. Recall that a linear operator T ∈ L(H) on a Hilbert space H is trace class if for some (and hence for each) orthonormal basis (e_j)_{j∈N} of H we have that ∑_{j=1}^∞ ⟨(T^*T)^{1/2} e_j, e_j⟩ < ∞. A linear operator S ∈ L(H, K) between Hilbert spaces H and K is said to be Hilbert-Schmidt [13, Chapter III.9] if S^*S is trace class, i.e., ∥S∥²_HS := ∑_{j=1}^∞ ∥Se_j∥² < ∞ for some (and hence for each) orthonormal basis (e_j)_{j∈N}. Lemma 2.4 ([59, Theorem 4.27]). Let the Compatibility Assumptions (A1)-(A3) be satisfied. Then the following hold.
(a) The operator E is an injective Hilbert-Schmidt operator with ∥E∥²_HS = ∥φ∥_1. (b) The space H is densely and compactly embedded in L^2_µ(X). (c) The operator C_H is an injective non-negative self-adjoint trace class operator.
The next theorem is due to Mercer and can be found in, e.g., [52]. It shows the existence of a particular orthonormal basis (e_j)_{j=1}^∞ of L^2_µ(X) composed of eigenfunctions of E^*E, which we shall henceforth call the Mercer basis corresponding to the kernel k. Again for the sake of self-containedness, we give a short proof in A.

Theorem 2.5 (Mercer's Theorem).
There exists an orthonormal basis (e_j)_{j=1}^∞ of L^2_µ(X) consisting of eigenfunctions of E^*E with corresponding eigenvalues λ_j > 0 such that ∑_{j=1}^∞ λ_j = ∥φ∥_1 < ∞. Furthermore, (f_j)_{j=1}^∞ with f_j = λ_j^{1/2} e_j constitutes an orthonormal basis of H consisting of eigenfunctions of C_H with corresponding eigenvalues λ_j. Moreover, for all x, y ∈ X,

k(x, y) = ∑_j f_j(x) f_j(y) = ∑_j λ_j e_j(x) e_j(y),

where the series converges absolutely.
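The Mercer eigenvalues can be approximated from samples via the Nyström method: the eigenvalues of the scaled Gram matrix (1/m)[k(x_i, x_j)] with x_i ∼ µ approximate the λ_j, and their sum matches the trace identity ∑_j λ_j = ∥φ∥_1 (which equals 1 for a Gaussian RBF kernel, whose diagonal is constant). A sketch with assumed constants (unit-bandwidth Gaussian kernel, µ = N(0,1)):

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(size=400)  # samples from mu
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 2)
# Nystroem approximation of the Mercer eigenvalues, sorted descending:
lam = np.sort(np.linalg.eigvalsh(K / len(xs)))[::-1]
# Trace identity: sum of eigenvalues equals the mean kernel diagonal (= 1 here).
print(np.isclose(lam.sum(), 1.0), lam[0] > lam[1] > 0)  # True True
```

The fast decay of the computed spectrum mirrors the trace-class property of C_H stated in Lemma 2.4.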
2.3. The Koopman semigroup. The Koopman semigroup (K^t)_{t≥0} associated with the SDE (2.1) is defined by

(K^t ψ)(x) = E[ψ(X_t) | X_0 = x] = ∫ ψ(y) ρ_t(x, dy)

for ψ ∈ B(X), the set of all bounded Borel-measurable functions on X, where ρ_t(x, dy) = dρ_t(x, ·)(y). It is easy to see that the invariance of µ is equivalent to the identity ∫ K^t ψ dµ = ∫ ψ dµ for all t ≥ 0 and ψ ∈ B(X) (which easily extends to functions ψ ∈ L^1_µ(X), see Proposition 2.7). Remark 2.6. Note that in the case σ = 0 the SDE (2.1) reduces to the deterministic ODE ẋ = b(x). Then (2.8) implies ∫ |ψ(ϕ(t, x))|² dµ(x) = ∫ |ψ(x)|² dµ(x) for all t ≥ 0 and all ψ ∈ B(X), where ϕ(·, x) is the solution of the initial value problem ẏ = b(y), y(0) = x. Hence, the composition operator ψ ↦ ψ ∘ ϕ(t, ·) extends to an isometry on L^2_µ(X). The proofs of the following two propositions can be found in A.
Proposition 2.7. For each p ∈ [1, ∞] and t ≥ 0, K^t extends uniquely to a bounded operator from L^p_µ(X) to itself with operator norm one. By C_b(X) we denote the set of all bounded continuous functions on X. As the measure µ is finite, we have C_b(X) ⊂ L^p_µ(X) for every p. The infinitesimal generator of the C_0-semigroup (K^t)_{t≥0} is the (in general unbounded) operator L in L^2_µ(X), defined by

Lψ = lim_{t↓0} (K^t ψ − ψ)/t,

whose domain dom L is the set of all ψ ∈ L^2_µ(X) for which the above limit exists. By Proposition 2.8 and the Lumer-Phillips theorem (see [35]), the operator L is densely defined, closed, dissipative (i.e., Re⟨Lψ, ψ⟩_µ ≤ 0 for all ψ ∈ dom L), and its spectrum is contained in the closed left half-plane.
Lemma 2.9. The constant function 1 is contained in dom L and L1 = 0. Moreover, both M := span{1} and M^⊥ are invariant under L and all K^t, t ≥ 0.
Representation of Koopman Operators on the RKHS. Using the integral operator E, it is possible to represent the Koopman operator with the aid of a linear operator on H, which is based on kernel evaluations. This construction mimics the well-known kernel trick used frequently in machine learning. To begin with, for any x, y ∈ X define the rank-one operator C_xy : H → H by

C_xy ψ = ⟨Φ(y), ψ⟩ Φ(x) = ψ(y) k(x, ·).

For t ≥ 0 and ψ ∈ H we further define the cross-covariance operator C^t_H : H → H by

C^t_H ψ = ∫∫ (C_xy ψ) ρ_t(x, dy) dµ(x).

Thus, we have

C^t_H = EK^tE^*.    (2.10)

In other words, the cross-covariance operator C^t_H represents the action of the Koopman semigroup through the lens of the RKHS integral operator E (see [25] for details). Being the product of the two Hilbert-Schmidt operators EK^t and E^*, the operator C^t_H is trace class for all t ≥ 0 (cf. [21, p. 521]). Note that due to ρ_0(x, ·) = δ_x, for t = 0 this reduces to the already introduced covariance operator C_H = EE^*. The identity (2.10) shows that for all η, ψ ∈ H we have

⟨C^t_H ψ, η⟩ = ⟨K^t ψ, η⟩_µ,

which shows that the role of C^t_H is analogous to that of the stiffness matrix in a traditional finite-dimensional approximation of the Koopman operator. In this analogy, the covariance operator C_H plays the role of the mass matrix.
2.5. Empirical estimators. Recall that the resolvent set ρ(T) of a bounded operator T, mapping from a Hilbert space H into itself, is the set consisting of all λ ∈ C such that T − λI is boundedly invertible. It is the complement of the spectrum σ(T) of T.
Next, we introduce empirical estimators for C^t_H based on finite data (x_k, y_k), k = 1, ..., m. We consider two sampling scenarios for fixed t > 0.
Assumptions on the sampling scheme and the Koopman operator: (1) The x_k are drawn i.i.d. from µ, and each y_k is sampled from the conditional distribution ρ_t(x_k, ·); by invariance, then also y_k ∼ µ. For example, y_k can be obtained by simulating the SDE (2.1) starting from x_k until time t.
(2) µ is ergodic, and both x_k and y_k are obtained from a single (usually long-term) simulation of the dynamics X_t at discrete integration time step t > 0, using a sliding-window estimator, i.e., x_k = X_{(k−1)t} and y_k = X_{kt}. In this case, we assume that

1 ∈ ρ(K^t_0),    (2.12)

where K^t_0 is the restriction of the Koopman operator K^t to the orthogonal complement of span{1}. Remark 2.10. (a) The condition (2.12) means that λ = 1 is an isolated simple eigenvalue of K^t. It is satisfied if K^t is compact. Then σ(K^t) consists of zero and a sequence of eigenvalues converging to zero, and ergodicity ensures that the eigenvalue λ = 1 is simple, cf. Proposition D.1. Another case where (2.12) holds is when the semigroup (K^t_0)_{t≥0} is exponentially stable, i.e., there exist M ≥ 1 and ω > 0 such that ∥K^t_0∥ ≤ M e^{−ωt} for all t ≥ 0. Then ∥K^{nt}_0∥^{1/n} ≤ M^{1/n} e^{−ωt}, so that the spectral radius r = lim_{n→∞} ∥K^{nt}_0∥^{1/n} of K^t_0 is at most e^{−ωt} < 1. (b) We would like to point out that the condition (2.12) does not exclude deterministic systems, i.e., autonomous ODEs ẋ = b(x), in which case the operator K^t is unitary on L^2_µ(R^d).
Recall that the joint distribution of two random variables X and Y is determined by the marginal distribution P_X and the conditional distribution of Y given X. Set X = x_k and Y = y_k. Then, in both cases (1) and (2), we have P_X = µ, and the conditional distribution of Y given X = x is ρ_t(x, ·). In other words, the joint distribution µ_{0,t} of x_k and y_k is given by dµ_{0,t}(x, y) = ρ_t(x, dy) dµ(x).

Now, since C^t_H = ∫∫ C_xy dµ_{0,t}(x, y), for the empirical estimator for C^t_H we choose the expression

Ĉ^{m,t}_H = (1/m) ∑_{k=1}^m C_{x_k y_k}.    (2.13)
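Such empirical operators are handled in practice purely through kernel evaluations: for rank-one sums Ĉ = (1/m) ∑_k Φ(x_k) ⊗ Φ(y_k), all Hilbert-Schmidt inner products reduce to Gram-matrix sums via ⟨Φ(a) ⊗ Φ(b), Φ(c) ⊗ Φ(d)⟩_HS = k(a, c) k(b, d). A sketch of this mechanism (our own illustration with a unit-bandwidth Gaussian kernel and synthetic data):

```python
import numpy as np

def k(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between two 1D sample sets."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def hs_inner(data1, data2):
    """<C1, C2>_HS for two empirical cross-covariance operators
    C = (1/m) sum_k Phi(x_k) ⊗ Phi(y_k), via Gram matrices only."""
    (x1, y1), (x2, y2) = data1, data2
    return np.sum(k(x1, x2) * k(y1, y2)) / (len(x1) * len(x2))

rng = np.random.default_rng(3)
x, y = rng.normal(size=100), rng.normal(size=100)
norm_sq = hs_inner((x, y), (x, y))      # ||Chat||_HS^2
cross = hs_inner((x, y), (y, x))
# Cauchy-Schwarz in the Hilbert-Schmidt inner product must hold:
print(norm_sq > 0, cross ** 2 <= norm_sq * hs_inner((y, x), (y, x)) + 1e-12)
```

This is exactly why the estimators above only require evaluations of the feature map, never the (infinite-dimensional) operators themselves.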

VARIANCE OF THE EMPIRICAL ESTIMATOR
In case (1), the law of large numbers [4, Theorem 2.4] and, in case (2), ergodicity [2] ensure the expected behavior Ĉ^{m,t}_H → C^t_H as m → ∞. However, this is a purely qualitative result, and nothing is known a priori about the rate of this convergence.
The main result of this section, Theorem 3.1, yields probabilistic estimates for the estimation error ∥Ĉ^{m,t}_H − C^t_H∥_HS. Here, our focus is on the estimation from a single ergodic trajectory, i.e., case (2) above. While the broader line of reasoning partially resembles that of our previous paper [45], we require additional steps due to the infinite-dimensional setting introduced by the RKHS.
Theorem 3.1. The Hilbert-Schmidt variance of the empirical estimator can be written as in (3.1), where Q denotes the orthogonal projection onto the orthogonal complement of span{1}. Proof. Let us prove (3.1). Expanding the squared Hilbert-Schmidt norm and taking expectations yields (3.1).
Case (1). Since z_k and z_ℓ are independent for k ≠ ℓ, the cross terms vanish. Hence, the statement of the theorem for case (1) follows.
Case (2). Here, the cross terms no longer vanish and need to be computed explicitly; the last term can be expressed as a double sum over i, j. For a justification of (∗), see Lemma C.1.
Let P be the orthogonal projection in L^2_µ(R^d) onto span{1}, i.e., P = I − Q. Then, by Lemma 2.9, the corresponding terms simplify accordingly. At this point, we would like to remark for later use that in a similar way we obtain the analogous expression for the double sum over i, j. We have thus shown the claimed identity, which concludes the proof. □ Remark 3.2. Let us compute the variance in the case where the generator L is self-adjoint with discrete spectrum. Then L = ∑_{ℓ=0}^∞ µ_ℓ ⟨·, ψ_ℓ⟩_µ ψ_ℓ with eigenvalues µ_ℓ ≤ 0 and eigenfunctions ψ_ℓ. We let µ_0 = 0 and ψ_0 = 1. Then, setting q_ℓ = e^{µ_ℓ t}, we get K^t_0 = ∑_{ℓ=1}^∞ q_ℓ ⟨·, ψ_ℓ⟩_µ ψ_ℓ. It is now easy to see (note that g^*_{ji} = g_{ji} in this case) that the variance takes a simplified form. In the following, we denote the resulting limit quantity by σ²_∞. We can therefore interpret σ²_∞ as the asymptotic variance of the estimator Ĉ^{m,t}_H, similar to our previous results in [45, Lemma 6].
An upper bound on the variance can be obtained as follows.

BOUND ON THE KOOPMAN PREDICTION ERROR
The kernel cross-covariance operator C^t_H can also be used to approximate the predictive capabilities of the Koopman operator for observables in H. Approximating the full Koopman operator involves the inverse of the covariance operator, which becomes an unbounded operator on a dense domain of definition in the infinite-dimensional RKHS case. Moreover, its empirical estimator Ĉ^m_H is finite-rank and thus not even injective. While Fukumizu et al. tackle this problem in [11] by means of a regularization procedure, we choose to use pseudo-inverses instead (cf. Remark 4.2). We truncate the action of the Koopman operator using N terms of the Mercer series expansion and derive a bound for the prediction error for fixed truncation parameter N. While we use similar ideas as presented in [12], we heavily rely on our new results on the cross-covariance operator, cf. Section 3. Afterwards, we deal with the case of Koopman-invariance of the RKHS [27]. Here, we establish an estimate for the truncation error, which then yields a bound on the deviation from the full Koopman operator. We emphasize that this error bound is extremely useful in comparison to its prior counterparts, which are based on the assumption that the space spanned by a finite number of so-called observables (the dictionary) is invariant under the Koopman operator. The latter essentially requires employing only Koopman eigenfunctions as observables, see, e.g., [30,19].
Let (e_j) be the Mercer orthonormal basis of L^2_µ(X) corresponding to the kernel k and let λ_j = ∥Ee_j∥_µ as well as f_j := λ_j^{1/2} e_j (cf. Theorem 2.5). We arrange the Mercer eigenvalues in a non-increasing way, i.e., λ_1 ≥ λ_2 ≥ ⋯, and define the truncated operator

K^t_N ψ = ∑_{j=1}^N ⟨C^t_H ψ, e_j⟩ e_j.    (4.1)

4.1. Prediction error. In the next theorem, we estimate the probabilistic error between K^t_N ψ = ∑_{j=1}^N ⟨C^t_H ψ, e_j⟩ e_j, ψ ∈ H, and its empirical estimator, which is of the form ∑_{j=1}^N ⟨Ĉ^{m,t}_H ψ, ê_j⟩ ê_j with approximations ê_j of the e_j.
Theorem 4.1. Assume that the eigenvalues λ_j of C_H are simple, i.e., λ_{j+1} < λ_j for all j. Fix an arbitrary N ∈ N and let δ_N be defined as in (4.2). Further, let ε ∈ (0, δ_N) and δ ∈ (0, 1) be arbitrary, and fix some sufficiently large m. Arrange the eigenvalues λ̂_1, ..., λ̂_m of Ĉ^m_H in descending order and let ê_1, ..., ê_m be corresponding eigenfunctions, respectively, such that ∥ê_j∥ = λ̂_j^{−1/2} for j = 1, ..., m. If we define the empirical estimator K̂^{m,t}_N as in (4.3), then, with probability at least 1 − δ, the error bound (4.4) holds. All of the above statements equally apply to case (1) upon replacing σ_m by E_0(t). Remark 4.2. (a) Expanding in terms of the ê_j, one finds K̂^{m,t}_N = Q̂_N (Ĉ^m_H)^† Ĉ^{m,t}_H, where Q̂_N = ∑_{j=1}^N ⟨·, f̂_j⟩ f̂_j is the orthogonal projector onto the span of the first N eigenfunctions of Ĉ^m_H in H. In particular, for N = m we have K̂^{m,t}_N = (Ĉ^m_H)^† Ĉ^{m,t}_H, which surely is one of the first canonical choices for an empirical estimator of K^t.
(b) The functions ê_j have unit length in the empirical L^2_µ-norm, i.e., (1/m) ∑_{i=1}^m ê_j(x_i)² = 1. Therefore, projecting onto the first N empirical Mercer features is the whitening transformation commonly used in traditional EDMD [24].
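The whitening property in (b) can be verified numerically. A sketch (our own notation and constants): if (1/m) K_X = V Λ V^⊤ is the spectral decomposition of the Gram matrix K_X = [k(x_i, x_j)], candidate eigenfunctions with ∥ê_j∥ = λ̂_j^{−1/2} are ê_j(x) = (√m λ̂_j)^{−1} ∑_k V[k, j] k(x, x_k); at the data points this reduces to ê_j(x_i) = √m V[i, j], so each ê_j has unit empirical L²-norm:

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(size=200)
m = len(xs)
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 2)  # unit-bandwidth RBF Gram matrix
lam, V = np.linalg.eigh(K / m)
lam, V = lam[::-1], V[:, ::-1]  # descending order

def ehat(j, x):
    """Empirical Mercer feature, normalized so that ||ehat_j||_H = lam_j^{-1/2}."""
    return (V[:, j] @ np.exp(-(x - xs) ** 2 / 2)) / (np.sqrt(m) * lam[j])

emp_norm_sq = np.mean([ehat(0, xi) ** 2 for xi in xs])
print(np.isclose(emp_norm_sq, 1.0))  # True
```

The unit empirical norm holds exactly (up to round-off) because K V[:, j] = m λ̂_j V[:, j] at the data points.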
Proof of Theorem 4.1. By Proposition 3.5, each of the events ∥C^t_H − Ĉ^{m,t}_H∥_HS ≤ ε and ∥C_H − Ĉ^m_H∥_HS ≤ ε occurs with probability at least 1 − δ/2. Hence, by a union bound, they occur simultaneously with probability at least 1 − δ.
In the remainder of this proof we assume that both events occur. Then all the statements deduced in the following hold with probability at least 1 − δ.

Let us define the intermediate approximation
Let ψ ∈ H be arbitrary. Setting C := C^t_H − Ĉ^{m,t}_H, we expand the difference in terms of ⟨Cψ, ê_j⟩ ê_j and bound it accordingly. Next, we aim at estimating the remaining error. By (2.4), it suffices to estimate the above error in the ∥·∥-norm. By Theorem B.3, the first summand can be estimated directly. For the second summand, note that ε < δ_N by assumption and therefore λ̂_N ≥ λ_N/2. For j = 1, ..., N, according to Theorem B.1 this yields the corresponding eigenfunction bounds. All together, we obtain (recall (2.4)) the claimed estimate, which implies (4.4). □ 4.2. Projection error in case of Koopman-invariance of the RKHS. In the preceding section, we have seen that the empirical operator K̂^{m,t}_N can be written as (Ĉ^m_H)^† Ĉ^{m,t}_H if m = N. In the limit m → ∞, we would arrive at the operator C_H^{−1} C^t_H, which is not even well-defined for all ψ ∈ H, in general. However, if the RKHS is invariant under K^t, the above operator limit is well-defined as a bounded operator on H. In this situation we are able to extend Theorem 4.1 to an estimate on the full error made by our empirical estimator.
We start by defining the operator K^t_H := C_H^{−1} C^t_H on its natural domain. Since the covariance operator C_H is injective (Lemma 2.4), it possesses an (in general unbounded) inverse on its range, and therefore: Proposition 4.4. For t > 0, the following statements are equivalent: Proof. With regard to the two representations (4.5) and (4.6) of the domain, it is immediate that both (i) and (iii) are equivalent to dom K^t_H = H. The equivalence of the latter to (ii) follows from the closed graph theorem. □ Note that if one of (i)-(iii) holds, then K^t_H = K^t|_H. Theorem 4.5. In addition to the assumptions in Theorem 4.1, assume that H is invariant under the Koopman operator K^t. For fixed N ∈ N, let δ_N be as in (4.2), choose ε, δ, and m as in Theorem 4.1, and define the empirical estimator K̂^{m,t}_N as in (4.3). Then, with probability at least 1 − δ, the full error bound holds. Proof. First of all, Theorem 4.1 bounds the estimation part of the error; combining it with the truncation estimate proves the theorem. □ Remark 4.6. (a) The proof of Theorem 4.5 shows that the projection error ∥K^t ψ − K^t_N ψ∥_µ decays at least as fast as the square roots of the eigenvalues of C_H (cf. Figure 4(c)).

(b) In E, we prove that the RKHS generated by Gaussian RBF kernels on R is invariant under the Koopman semigroup associated with the 1D Ornstein-Uhlenbeck process. In fact, it can be proved that this invariance also holds in higher dimensions. This shows that the assumption in Theorem 4.5 is not too exotic and can be satisfied.

ILLUSTRATION WITH THE ORNSTEIN-UHLENBECK PROCESS
For the numerical illustration of our results, we consider the Ornstein-Uhlenbeck (OU) process on X = R, which is given by an SDE of the form (2.1) with linear drift b(x) = −αx and constant diffusion, where α > 0 is a parameter.
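As an illustration with assumed constants (we take unit diffusion, dX_t = −αX_t dt + dW_t, which need not match the paper's normalization), the OU process can be simulated with Euler-Maruyama; regardless of the constant diffusion coefficient, the drift induces the conditional mean decay E[X_t | X_0 = x] = x e^{−αt}, which we check:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, t, dt, x0, n = 1.0, 1.0, 0.01, 2.0, 20_000
x = np.full(n, x0)                      # n independent paths started at x0
for _ in range(int(t / dt)):
    x = x + (-alpha * x) * dt + np.sqrt(dt) * rng.normal(size=n)
# Conditional-mean decay induced by the linear drift:
print(abs(x.mean() - x0 * np.exp(-alpha * t)) < 0.03)  # True
```

The same trajectories, subsampled with lag t, are the kind of sliding-window data used in sampling scenario (2).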

Analytical Results. Since all relevant properties of the OU process are available in analytical form, we can exactly calculate all of the terms appearing in our theoretical error bounds. Moreover, we can also compute the exact estimation and prediction errors for finite data in closed form. Let us begin by recapping the analytical results required for our analysis, which can be found in [48].
The invariant measure µ and the density of the stochastic transition kernel ρ_t are known in closed form. The Koopman operators K^t are self-adjoint in L^2_µ(R); their eigenvalues are given by q_j = e^{−αjt}, with corresponding eigenfunctions ψ_j expressed in terms of the physicist's Hermite polynomials H_j. We consider the Gaussian radial basis function (RBF) kernel with bandwidth σ > 0, for which k(x, x) = 1 for all x. Let us quickly verify that this choice of the kernel satisfies the compatibility assumptions (A1)-(A3). Indeed, (A1) is trivial as k(x, x) = 1, and (A3) follows easily from the continuity of the functions in H.
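A short sketch of how these ingredients are evaluated in practice (our own code): the physicist's Hermite polynomials are available through numpy's `hermite` module, and the Koopman eigenvalues q_j = e^{−αjt} are elementary:

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def H(j, x):
    """Physicist's Hermite polynomial H_j evaluated at x."""
    return hermval(x, [0.0] * j + [1.0])

alpha, t = 1.0, 0.05
q = [np.exp(-alpha * j * t) for j in range(4)]  # Koopman eigenvalues
print(H(2, 1.0), np.isclose(q[1], np.exp(-0.05)))  # 2.0 True
```

Here H_2(x) = 4x² − 2, so H(2, 1.0) returns 2.0, confirming the physicist's (not probabilist's) convention.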
The Mercer eigenvalues and features with respect to the invariant measure µ of the Ornstein-Uhlenbeck process, i.e., the eigenvalues and eigenfunctions of the integral operator E^*E in L^2_µ(R), are also available in analytical form [10]. With these results, we can compute the variance of the empirical estimator for C^t_H as described in Theorem 3.1. The eigenvalues q_j were already given above, and the coefficients d_{j,t} (cf. Remark 3.2) can be evaluated accordingly. The series needs to be truncated at a finite number of terms, and the integrals can be calculated by numerical integration. The Hilbert-Schmidt norm of the cross-covariance operator C^t_H can be computed similarly (see the proof of Theorem 3.1). Since, for the Gaussian RBF kernel, we have φ(x) = k(x, x) = 1 for all x, this completes the list of terms required by Theorem 3.1 and Remark 3.4. In addition, we notice that upon replacing either one or two of the integrals in (5.1) by finite-data averages, we can also calculate ∥Ĉ^{m,t}_H∥²_HS and ⟨C^t_H, Ĉ^{m,t}_H⟩_HS. Therefore, the estimation error for finite data {(x_k, y_k)}_{k=1}^m can be obtained by simply expanding

∥Ĉ^{m,t}_H − C^t_H∥²_HS = ∥Ĉ^{m,t}_H∥²_HS − 2⟨Ĉ^{m,t}_H, C^t_H⟩_HS + ∥C^t_H∥²_HS,

allowing us to precisely compare the estimation error to the error bounds obtained in Theorem 3.1.
Besides the estimation error for C^t_H, we are also interested in the prediction error, which is bounded according to Theorem 4.1. We will compare these bounds to the actual error ∥K^t_N ϕ − K̂^{m,t}_N ϕ∥_µ for a specific observable ϕ ∈ H and a fixed number N of Mercer features. For the OU process, it is again beneficial to consider Gaussian observables ϕ. Application of the Koopman operator leads to yet another, unnormalized Gaussian observable. The inner products of K^t ϕ with the Mercer eigenfunctions φ_i can be evaluated by numerical integration, providing full access to the truncated observable K^t_N ϕ. On the other hand, the empirical approximation K̂^{m,t}_N ϕ can be computed directly based on the data. The functions ê_j can be obtained from the eigenvalue decomposition of the standard kernel Gramian matrix K_X = [k(x_i, x_j)]_{i,j=1}^m, as the latter is the matrix representation of the empirical covariance operator Ĉ^m_H on the subspace span{Φ(x_k)}_{k=1}^m. If (1/m) K_X = V Λ V^⊤ is the spectral decomposition of the Gramian, the correctly normalized eigenfunctions according to Theorem 4.1 are obtained from the columns of V after rescaling.

5.2. Numerical Results. For the actual numerical experiments, we set α = 1, choose the Koopman lag time as t = 0.05, and downsample all simulation data such that successive time steps are separated by time t. We compute the exact variance E[∥C^t_H − Ĉ^{m,t}_H∥²_HS] by the expression given in Theorem 3.1, and also the coarser estimate for the variance given in Remark 3.4. In addition, we also compute the i.i.d. variance (1/m) E_0(t). We test three different kernel bandwidths, σ ∈ {0.05, 0.1, 0.5}. All Mercer series are truncated after the first 10 terms for σ ∈ {0.1, 0.5}, and 20 terms for σ = 0.05, while Koopman eigenfunction expansions are truncated after 15 terms.
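The empirical predictor built from the whitened features can be sketched as follows (our own illustration, not the paper's experiment): the coefficients ⟨Ĉ^{m,t}_H ϕ, ê_j⟩ = (1/m) ∑_k ϕ(y_k) ê_j(x_k) need only point evaluations, and at the data points ê_j(x_i) = √m V[i, j]. As a pure sanity check we use synthetic identity pairs y_k = x_k, for which the N = m predictor acts as a projection and reproduces any observable on the data:

```python
import numpy as np

rng = np.random.default_rng(6)
xs = rng.normal(size=50)
ys = xs.copy()                          # "identity dynamics" test pairs
m = len(xs)
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / 2)  # unit-bandwidth RBF Gramian
_, V = np.linalg.eigh(K / m)
E = np.sqrt(m) * V                      # E[i, j] = ehat_j(x_i)

def predict_at_data(phi, N):
    """Khat^{m,t}_N phi evaluated at the sample points x_i."""
    coeffs = (phi(ys)[:, None] * E[:, :N]).mean(axis=0)  # <Chat phi, ehat_j>
    return E[:, :N] @ coeffs

phi = np.sin
err = np.abs(predict_at_data(phi, N=m) - phi(xs)).max()
print(err < 1e-8)  # True
```

For genuine lagged pairs (y_k from the dynamics) and N < m, the same two lines implement the truncated estimator whose error Theorem 4.1 bounds.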
In the first set of experiments, we use Chebyshev's inequality as in Proposition 3.5, combined with the variance estimates described above, to compute the maximal estimation error ∥C^t_H − Ĉ^{m,t}_H∥_HS that can be guaranteed with confidence 1 − δ = 0.9, for a range of data sizes m between m = 20 and m = 50,000. As a comparison, we generate 200 independent simulations with simulation horizon m · t for each data size m. We then compute the resulting estimation error using the expressions given in the previous section. The comparison of these results for all data sizes m and the different kernel bandwidths is shown in Figure 3. We observe that the bound based on the exact variance from Theorem 3.1 is quite accurate, over-estimating the actual error by only about a factor of three, and captures the detailed qualitative dependence of the estimation error on m and σ. The coarser bound from Remark 3.4, however, appears to discard too much information: it over-estimates the error by at least an order of magnitude and does not change significantly with σ. Finally, we note that for larger kernel bandwidths, the i.i.d. variance is indeed too small, leading to an under-estimation of the error. This observation confirms that it is indeed necessary to take the correlation between data points into account.
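The mechanism behind such bounds is simple: Chebyshev's inequality gives P(|Z − EZ| ≥ ε) ≤ Var(Z)/ε², so choosing ε = √(Var(Z)/δ) guarantees the error stays below ε with confidence at least 1 − δ. A sketch checking the (typically conservative) coverage for a plain sample-mean estimator (our own toy setup, not the paper's operator-valued setting):

```python
import numpy as np

rng = np.random.default_rng(7)
m, delta, trials = 100, 0.1, 2000
var_mean = 1.0 / m                        # variance of the mean of m std normals
eps = np.sqrt(var_mean / delta)           # Chebyshev error level for confidence 1-delta
errors = np.abs(rng.normal(size=(trials, m)).mean(axis=1))
coverage = np.mean(errors <= eps)
print(coverage >= 1 - delta)  # True
```

As in the experiments above, the guaranteed confidence 1 − δ is typically far exceeded in practice, which is why Chebyshev-based bounds over-estimate the actual error by a moderate factor.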
In a second set of experiments, we test the performance of our theoretical bounds concerning the prediction of expectations for individual observables, obtained in Theorem 4.1. For the same three Gaussian RBF kernels as in the first set of experiments, we consider the observable ϕ = φ 0 , i.e., the first Mercer feature. As above, we choose N = 10 or N = 20, depending on the bandwidth, to truncate the Mercer series expansion K t N ϕ and its empirical approximation K m,t N ϕ. Note that ϕ is a different observable depending on the bandwidth. Again, we set 1 − δ = 0.9 and use Theorem 4.1 to bound the L 2 µ -error between K t N ϕ and K m,t N ϕ. As a comparison, we compute the actual L 2 µ -error by numerical integration, using the fact that we can evaluate K t N ϕ(x) and K m,t N ϕ(x) at any x based on the discussion above. We repeat this procedure 20 times and provide average errors and standard deviations. The results for all three kernels are shown in Figure 4, and we find that our theoretical bounds are much too pessimistic in all cases. This finding highlights our previous observation that bounding the prediction error outside the RKHS still requires more in-depth research.
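Since µ is Gaussian for the OU process, the numerical integration of the L 2 µ -error can be done efficiently with Gauss-Hermite quadrature. A sketch of such a helper (our own construction, not the paper's code):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def l2mu_error(f, g, alpha=1.0, n_nodes=40):
    """L^2_mu distance between observables f and g, where mu = N(0, 1/(2*alpha))
    is the OU invariant measure, computed by Gauss-Hermite quadrature."""
    t, w = hermgauss(n_nodes)            # nodes/weights for int e^{-t^2} h(t) dt
    s = np.sqrt(1.0 / (2.0 * alpha))     # standard deviation of mu
    x = np.sqrt(2.0) * s * t             # change of variables to N(0, s^2)
    return np.sqrt(np.sum(w * (f(x) - g(x)) ** 2) / np.sqrt(np.pi))
```

As a sanity check, l2mu_error(lambda x: x, lambda x: 0.0 * x) recovers the standard deviation of µ, here sqrt(1/(2α)).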

CONCLUSIONS
We have analyzed the finite-data estimation error for data-driven approximations of the Koopman operator on reproducing kernel Hilbert spaces. More specifically, we have provided an exact expression for the variance of empirical estimators for the cross-covariance operator, if a sliding-window estimator is applied to a long ergodic trajectory of the dynamical system (Theorem 3.1). This setting is relevant for many complex systems, such as molecular dynamics simulations. Our results present a significant improvement over the state of the art, since they concern a setting where the notorious problem of dictionary selection can be circumvented, and therefore no longer depend on the dictionary size. We have also extended the concept of asymptotic variance to an infinite-dimensional approximation space for the Koopman operator. Our numerical study on the Ornstein-Uhlenbeck process has shown that, even using a simple mass concentration inequality, accurate bounds on the estimation error can be obtained (Figure 3).
In our second main contribution, Theorem 4.1, we have extended our estimates to a uniform bound on the prediction error for observables in the RKHS. Thereby, we have circumvented dealing with an unbounded inverse of the covariance operator by applying a finite-dimensional truncation of the associated Mercer series. In the case of Koopman-invariance of the RKHS, we were even able to find a bound on the truncation error, which then yields estimates for the full approximation error (Theorem 4.5). The resulting error bounds have, however, proven very conservative in the numerical examples (Figure 4). Obtaining sharper bounds on the prediction error therefore constitutes a primary goal for future research.
Hence, the injectivity of E follows from (A2). If (e i ) is an orthonormal basis of H, then the claim is a consequence of ∥Φ(x)∥ 2 = φ(x).
(c) This follows from (a) and ker C H = ker EE * = ker E * = {0} by (A3). □

Proof of Theorem 2.5. By Lemma 2.4, the operator E * E is a positive self-adjoint trace-class operator. Hence, by the well-known spectral theory of compact operators (see, e.g., [13]), there exists an orthonormal basis (e j ) ∞ j=1 of L 2 µ (X ) consisting of eigenfunctions of E * E corresponding to a summable sequence (λ j ) ∞ j=1 of strictly positive eigenvalues. Since E * ψ = ψ for ψ ∈ H, we have Ee j = λ j e j and thus e j ∈ H for all j, as well as C H e j = EE * e j = Ee j = λ j e j . Moreover, ⟨f i , f j ⟩ = √(λ j /λ i ) ⟨Ee i , e j ⟩ = √(λ j /λ i ) ⟨e i , e j ⟩ µ = δ ij by (2.6), so that the f j indeed form an orthonormal system in H. The completeness of (f j ) in H follows from the injectivity of E. Finally, Σ ∞ j=1 λ j = Tr C H = ∥φ∥ 1 and k(x, y) = ⟨Φ(x), Φ(y)⟩ = Σ j ⟨Φ(x), f j ⟩⟨f j , Φ(y)⟩ = Σ j f j (x)f j (y), which completes the proof.

APPENDIX B. SOME FACTS FROM SPECTRAL THEORY
In this section, let H be a Hilbert space.If P is an orthogonal projection in H, we set P ⊥ = I − P .For v ∈ H, ∥v∥ = 1, denote by P v the rank-one orthogonal projection onto span{v}.
We say that a linear operator on H is non-negative if it is self-adjoint and its spectrum is contained in [0, ∞).For a non-negative compact operator T on H we denote by λ 1 (T ) ≥ λ 2 (T ) ≥ . . . the eigenvalues of T in descending order (counting multiplicities).We set λ j (T ) = 0 if j > rank(T ).
Moreover, if T has only simple eigenvalues (i.e., dim ker(T − λ) = 1 for each eigenvalue λ of T ), we let P j (T ) denote the orthogonal projection onto the eigenspace ker(T − λ j (T )) and Q n (T ) = Σ n j=1 P j (T ) the spectral projection corresponding to the n largest eigenvalues of T .

Proof. First of all, the second equation in (B.1) is clear. Second, if P v,w denotes the orthogonal projection onto H v,w := span{v, w}, the problem reduces to a two-dimensional one in H v,w . Now, if x ∈ H v,w with ∥x∥ = 1, we write x = av + bw and obtain a 2 + 2abγ + b 2 = 1, where γ = ⟨v, w⟩. Moreover, ⟨x, v⟩ = a + bγ and ⟨x, w⟩ = aγ + b. Hence, the objective function is constant on {x ∈ H v,w : ∥x∥ = 1}, and (B.1) is proved. □

The next theorem is a variant of the Davis-Kahan sin(Θ) theorem (cf. [65]).

Theorem B.3. Let T and T̃ be non-negative Hilbert-Schmidt operators on H, let n ∈ N, assume that the largest n + 1 eigenvalues of T are simple, and set δ := min 1≤j≤n (λ j (T ) − λ j+1 (T ))/2. If ∥T − T̃ ∥ HS < δ, then the stated bound holds for j = 1, . . ., n.

Proof. For j ∈ N put λ j = λ j (T ), P j = P j (T ), λ̃ j = λ j (T̃ ), and P̃ j = P j (T̃ ). By Theorem B.1, we have |λ j − λ̃ j | ≤ ∥T − T̃ ∥ HS < δ for all j, hence λ̃ j is contained in the interval I j = (λ j − δ, λ j + δ) for j = 1, . . ., n + 1. By assumption, sup I j+1 ≤ inf I j for j = 1, . . ., n. In particular, the intervals I 1 , . . ., I n+1 are pairwise disjoint. Now, let j ∈ {1, . . ., n}. Then for k ∈ N \ {j} we have |λ̃ k − λ j | > δ. Therefore, dist(λ j , σ(T̃ ) \ {λ̃ j }) ≥ δ. As T P j = λ j P j and P ⊥ j T = T P ⊥ j , the claimed bound follows. □

Since the right-hand side also makes sense for f ∈ L 1 µ , the operator (K t ) * extends to a bounded operator P t on L 1 µ . From the defining identity, it is readily seen that P t is a Markov operator, i.e., P t 1 = 1. Let f be a simple function, i.e., f = Σ n i=1 a i 1 A i , where the A i are mutually disjoint. Hence, there exists a subsequence (f n k ) such that (K t ) * f n k → (K t ) * f µ-a.e. as k → ∞. W.l.o.g., we may therefore assume that (K t ) * f n → (K t ) * f µ-a.e. as n → ∞. By monotone convergence, the limit is a finite number. Hence, indeed, f • K t 1 (gh) ∈ L 1 µ and, by majorized convergence, the claim follows. □
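The eigenvalue stability invoked in the proof (Theorem B.1, a Weyl-type bound in the Hilbert-Schmidt norm) is easy to verify numerically in finite dimensions, where the Hilbert-Schmidt norm coincides with the Frobenius norm. A small sketch with synthetic matrices (all names our own):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
T = A @ A.T                                  # non-negative (PSD) operator
E = 1e-3 * rng.standard_normal((6, 6))
E = (E + E.T) / 2.0                          # self-adjoint perturbation
T_tilde = T + E

lam = np.sort(np.linalg.eigvalsh(T))[::-1]           # descending eigenvalues
lam_tilde = np.sort(np.linalg.eigvalsh(T_tilde))[::-1]

hs_dist = np.linalg.norm(E, "fro")           # Hilbert-Schmidt (Frobenius) norm
max_shift = np.max(np.abs(lam - lam_tilde))  # bounded by hs_dist
```

The inequality max_shift ≤ hs_dist holds for every self-adjoint perturbation, which is exactly the eigenvalue-localization step used to place λ̃ j inside the interval I j in the proof above.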

APPENDIX D. ERGODICITY AND THE KOOPMAN SEMIGROUP
In this section, we prove the following proposition on the spectral properties of the generator L under the ergodicity assumption.
Proof. Concerning the "in particular" part, we only mention that Lψ = 0 implies K t ψ = ψ for all t ≥ 0 and that Lψ = iωψ, ω ∈ R \ {0}, implies K 2π/ω ψ = ψ. So, let us show that K t ψ = ψ for some t > 0 and ψ ∈ L 2 µ (X ) is only possible for constant ψ. For this, we consider the Markov process (X nt ) ∞ n=0 . For convenience, we assume w.l.o.g. that t = 1. By invariance of µ, the process (X n ) ∞ n=0 is stationary, i.e., (X n ) ∞ n=0 and (X n+1 ) ∞ n=0 are identically distributed as X N 0 -valued random variables. According to [20, Lemma 9.2], there exist X -valued random variables X −k , k ∈ N, such that X := (X n ) n∈Z is also stationary. Denote by P µ the law of the X Z -valued random variable X.

On S := X Z define the left shift T : S → S by T (x n ) n∈Z := (x n+1 ) n∈Z . Stationarity of X means that also T X ∼ P µ .

A set A ∈ B Z X := ⊗ k∈Z B X is called shift-invariant if T −1 A = A. It is easy to see that the shift-invariant sets form a sub-σ-algebra I of B Z X . By [18, Corollary 5.11] and the ergodicity of µ, we have P µ (A) ∈ {0, 1} for any A ∈ I. The claim then follows from Birkhoff's Ergodic Theorem [20, Theorem 9.6]. □

In many works in the present literature on the Koopman operator for deterministic dynamical systems in connection with kernels, it is assumed that the Koopman operator maps the RKHS boundedly (or even compactly) into itself. However, as was proved in [14], the RKHS generated by the radial basis function kernel is invariant under the Koopman operator of a discrete-time system on R n if and only if the dynamics are affine.

In the following, we show that the situation is essentially different for stochastic systems: we prove that the RBF RKHS is invariant under the Koopman operator associated with the OU process.
The Ornstein-Uhlenbeck (OU) process on X = R is the solution of the SDE dX t = −αX t dt + dW t . The invariant measure µ and the Markov transition kernel ρ t , t > 0, are known explicitly: µ = N (0, 1/(2α)) and ρ t (x, ·) = N (e −αt x, (1 − e −2αt )/(2α)).
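The role of ergodicity can be illustrated directly for the OU process: by Birkhoff's theorem, time averages along a single trajectory converge to averages over µ. A quick sketch (sample size, seed, and tolerance are our own choices) checking this for the observable ψ(x) = x², whose µ-average is 1/(2α):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, t, m = 1.0, 0.05, 200_000

# Exact OU transition: X_{k+1} ~ N(e^{-alpha t} X_k, (1 - e^{-2 alpha t})/(2 alpha)).
decay = np.exp(-alpha * t)
cond_std = np.sqrt((1.0 - np.exp(-2.0 * alpha * t)) / (2.0 * alpha))

x = np.empty(m)
x[0] = 0.0                                   # arbitrary (non-stationary) start
eps = rng.normal(0.0, cond_std, size=m - 1)  # pre-sampled transition noise
for k in range(m - 1):
    x[k + 1] = decay * x[k] + eps[k]

time_average = np.mean(x ** 2)               # tends to 1/(2*alpha) as m grows
```

The relaxation time of the process is of order 1/α, so the influence of the arbitrary starting point is negligible over this trajectory length.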
By H σ we denote the RKHS generated by the kernel k σ .Hilbert space norm and scalar product on H σ will be denoted by ∥ • ∥ σ and ⟨• , •⟩ σ , respectively.For y ∈ R and a kernel k on R set k y (x) := k(x, y).
For two positive definite kernels k and k̃ on R we write k ⪯ k̃ if Σ n i,j=1 α i α j k(x i , x j ) ≤ Σ n i,j=1 α i α j k̃(x i , x j ) for any choice of n ∈ N and α j , x j ∈ R, j = 1, . . ., n. We also write V ↪ W for two normed vector spaces V and W if V ⊂ W is continuously embedded in W .


Remarkably, invariance of H under the Koopman operator implies that the left-hand side not only reproduces the Koopman operator on H, but actually defines a bounded operator. Parts of the next proposition can be found in [27, Theorem 5.3] and [9, Theorem 1].

FIGURE 3. Probabilistic error estimates for C t H associated to the OU process, at lag time t = 0.05, and the Gaussian RBF kernel with different bandwidths σ ∈ {0.05, 0.1, 0.5} (indicated by circles, x-es, and squares). We show the estimated error obtained from Proposition 3.5, with confidence 1 − δ = 0.9, using either the exact variance given in Theorem 3.1 (blue), the coarser estimate in Remark 3.4 (green), or the i.i.d. variance (1/m) E 0 (t) (purple). The red curves in both panels show the 0.9-percentile of the estimation error based on 200 independent simulations.

FIGURE 4. Comparison of the theoretical bound on the prediction error ∥K t N ϕ − K m,t N ϕ∥ µ , if ϕ is chosen as the first Mercer feature φ 0 , using N = 20 (for σ = 0.05) or N = 10 (otherwise) in the Mercer series representation. The predicted error is shown in blue; different bandwidths are indicated by circles, x-es, and squares. Error bars for the actual error obtained from 20 independent data sets are shown in red.
Note that C t H ψ ∈ ran C H if and only if EK t ψ = C H ϕ for some ϕ ∈ H. Since C H ϕ = Eϕ and ker E = {0}, the latter is equivalent to K t ψ = ϕ ∈ H, which proves the representation of the domain. As to the closedness of K t H , let (ψ n ) ⊂ dom K t H and ϕ ∈ H be such that ψ n → ψ in H and K t H ψ n → ϕ in H as n → ∞. The latter implies C t H ψ n → C H ϕ, while the former implies C t H ψ n → C t H ψ in H as n → ∞, from which we conclude that C t H ψ = C H ϕ, i.e., ψ ∈ dom K t H and K t H ψ = ϕ. □

If the Koopman operator leaves the RKHS H invariant (i.e., K t H ⊂ H), then K t H is defined on all of H.

By Jensen's inequality, for every convex ϕ : R → R we have ϕ • K t ψ ≤ K t (ϕ • ψ) and thus |K t ψ| p ≤ K t |ψ| p , which, by invariance of µ, leads to ∥K t ψ∥ p p = ∫ |K t ψ| p dµ ≤ ∫ K t |ψ| p dµ = ∫ |ψ| p dµ = ∥ψ∥ p p .
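The L p -contraction above can be seen in closed form for the OU process: for ψ(x) = x, the conditional mean of the exact transition kernel gives K t ψ(x) = e −αt x, so the L 2 µ -norm shrinks by exactly the factor e −αt . A tiny sanity check (assuming the standard OU conditional mean; variable names our own):

```python
import numpy as np

alpha, t = 1.0, 0.05
s2 = 1.0 / (2.0 * alpha)                 # variance of the invariant measure mu

# psi(x) = x; the OU transition kernel has mean e^{-alpha t} x, hence
# (K^t psi)(x) = e^{-alpha t} x and the L^2_mu norms are explicit:
norm_psi = np.sqrt(s2)
norm_Kt_psi = np.exp(-alpha * t) * np.sqrt(s2)

contraction_ok = norm_Kt_psi <= norm_psi  # Koopman operator contracts L^2_mu
```

This matches the general inequality ∥K t ψ∥ p ≤ ∥ψ∥ p derived above, with equality only in the limit t → 0.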