Consistently recovering the signal from noisy functional data

In practice most functional data cannot be recorded on a continuum, but rather at discrete time points. It is also quite common that these measurements come with an additive error, which one would like to eliminate for the statistical analysis. When the measurements for each functional datum are taken on the same grid, the underlying signal-plus-noise model can be viewed as a factor model. The signals correspond to the common components of the factor model, the noise to the idiosyncratic components. We formulate a framework which allows us to consistently recover the signal by a PCA based factor model estimation scheme. Our theoretical results hold under rather mild conditions; in particular we don't require specific smoothness assumptions for the underlying curves and allow for a certain degree of autocorrelation in the noise.


Introduction
As a consequence of the explosive increase of available data in practically all areas of life, data science (as a generic term for the diverse data processing disciplines) has been steadily gaining popularity over the past decades. Handling massive data not only requires larger human and computational resources, but also demands and entails the development of new scientific approaches. Within the field of statistics, functional data analysis (FDA) is one such area which has experienced a massive surge in interest over the last couple of years. Its basic objective is to handle data where each observation X_t can be viewed as a process X_t(s) defined on some continuous domain U. To date a number of comprehensive textbooks (e.g., Ramsay and Silverman (2005); Ferraty and Vieu (2006); Horváth and Kokoszka (2012); Hsing and Eubank (2015)) can be consulted for a general overview of the numerous tools and methods which have been developed. The most active domains of research arguably are related to functional regression and dimension reduction techniques (see e.g., Cuevas (2014) for a survey). The latest developments are compactly summarized in Goia and Vieu (2016) and Aneiros et al. (2019).
Our contribution concerns the preprocessing step which is needed in most functional data analyses. Typically X_t(s) cannot be sampled over the entire continuum; rather we record a discretized version. Moreover, it is not uncommon that measuring devices become more error-prone with an increase in sampling frequency, so that often we actually observe a discretized and noisy version of X_t(s). The question of how to deal with the discrepancy between the theoretical model and the actual measurements is often the starting point in theoretical and practical FDA problems. While the existing literature has been focusing on different variants of smoothing techniques, we explore this problem from a different perspective.
Throughout this paper we are dealing with a stationary sequence of functional data (X_t : t ≥ 1), where each data point is of the form (X_t(s) : s ∈ [0, 1]). In the common case where U is an interval, the restriction to [0, 1] comes with no loss of generality. We assume that X_t(s) is recorded at discrete time points 0 ≤ s_1 < s_2 < ... < s_p ≤ 1. We focus on data where these sampling points are identical for each t and where the number of sampling points p is large. Such a setting is very common when we have machine recorded data. Irrespective of whether we have a sparse or a dense observation grid, most papers in the FDA literature assume that measurements might come with an additional error, yielding actual observations of the form

y_t = (X_t(s_1), ..., X_t(s_p))^⊤ + (u_{t1}, ..., u_{tp})^⊤ =: X_t(s) + u_t.   (1)

Here and in the sequel s = (s_1, ..., s_p) and g(s) = (g(s_1), ..., g(s_p))^⊤. When we want to process such data, we need to eliminate the noise u_t in order to guarantee a valid inference for X_t. Most available approaches towards this end assume X_t(s) to be a smooth curve and then propose to estimate the latent signal with diverse smoothing techniques. The smoothing is mostly done curve-by-curve, for example via spline smoothing (e.g. Ramsay and Silverman (2005)), kernel smoothing (e.g. Wand and Jones (1995)) or local linear regression (e.g. Fan (1993)). Alternatively, Staniswalis and Lee (1998) suggest to smooth the empirical covariance of the raw data in order to get an estimate of Γ_X(s, s′) = Cov(X_t(s), X_t(s′)). The eigenfunctions of the smoothed covariance are then used to get a proxy for the Karhunen-Loève expansion of X_t. In this paper we investigate an alternative method, which doesn't make use of the common smoothing techniques and which in turn doesn't rely on potentially unverifiable smoothness of the curves.
Rather, we use that the data y_t given in (1) can be shown to follow some (approximate) factor model, where X_t(s) and u_t correspond to the common components and idiosyncratic components of y_t, respectively. This means that the components of X_t(s) are driven by some lower dimensional process (the common factors) and hence are strongly correlated, whereas the components of u_t are uncorrelated or only mildly correlated and represent unsystematic fluctuations (see e.g., Mardia et al. (1979) for a basic introduction).
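The signal-plus-noise observations (1) and their factor structure can be mimicked in a short simulation; a minimal sketch, in which the basis curves, score variances and noise level are illustrative choices and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, L = 200, 100, 3                 # sample size, grid size, signal dimension
s = np.linspace(0.0, 1.0, p)          # common sampling grid 0 <= s_1 < ... < s_p <= 1

# Low-dimensional signals X_t(s) = sum_l x_{tl} phi_l(s), evaluated on the grid.
phi = np.stack([np.sqrt(2) * np.sin((l + 1) * np.pi * s) for l in range(L)])  # (L, p)
scores = rng.normal(size=(T, L)) * np.array([2.0, 1.0, 0.5])                  # scores x_{tl}
X = (scores @ phi).T                  # shape (p, T); column t holds X_t(s)

u = rng.normal(scale=0.3, size=(p, T))  # additive measurement error u_t
Y = X + u                               # observed y_t = X_t(s) + u_t, as in (1)
print(Y.shape)
```

The columns of `X` are strongly cross-correlated (they live in a 3-dimensional space), while the entries of `u` are idiosyncratic, which is exactly the factor structure described above.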
Recovering X_t(s) can thus be seen as estimating the common components in a factor model. In the companion paper Hörmann and Jammoul (2021) we have worked out the connection to factor models and thoroughly explored the practical performance of existing MLE or PCA based estimation schemes on real and simulated data. We found that these approaches lead to very convincing and nicely interpretable results. In a variety of considered settings the factor model estimates outperform competing smoothing techniques in terms of the resulting mean square approximation error. This advantage is particularly remarkable for moderately sized p and large T, or when the underlying signal X_t(s) is not smooth.
The primary purpose of this article is to provide the theoretical validation of the factor approach. Following Fan et al. (2013) we will explore a principal components based estimator for the common components. In Hörmann and Jammoul (2021) we have seen that it leads to slightly better practical performance when compared to a likelihood based approach (e.g., Bai and Li (2012)). Moreover, it is very simple to implement. When EX_t = 0, the estimator is of the form

(X̂_1(s), ..., X̂_T(s)) = Y Ê Ê^⊤,   (2)

where Y = (y_1, ..., y_T) and where Ê = (ê_1, ..., ê_L) are the eigenvectors of T^{-1} Y^⊤ Y belonging to the L largest eigenvalues γ̂_1 ≥ ... ≥ γ̂_L. In our theorems below we will prove that

max_{1≤t≤T} max_{1≤j≤p} |X̂_t(s_j) − X_t(s_j)| →P 0.

Here →P denotes convergence in probability. Our results are based on T → ∞ and p = p(T) → ∞. While our reasoning is similar to Fan et al. (2013), we are working with assumptions specific to functional data. In particular, we are going to allow the number of factors L to diverge, i.e. L = L(T) → ∞. Still, we will be able to work under much less restrictive moment and dependence assumptions compared to the current factor model literature. They are also much less restrictive than the assumptions in comparable FDA smoothing results. We will expand on this in more detail in Section 2.3.
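The estimator (2) amounts to projecting the observations onto the leading eigenvectors of T^{-1} Y^⊤ Y. A minimal numerical sketch, with columns of `Y` holding the observations y_t; all model ingredients (basis, score variances, noise level) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, L = 300, 80, 2
s = np.linspace(0.0, 1.0, p)
phi = np.stack([np.sqrt(2) * np.sin((l + 1) * np.pi * s) for l in range(L)])  # (L, p)
X = ((rng.normal(size=(T, L)) * np.array([3.0, 1.5])) @ phi).T   # signals, (p, T)
Y = X + rng.normal(scale=0.2, size=(p, T))                       # y_t = X_t(s) + u_t

# Eigenvectors of T^{-1} Y'Y for the L largest eigenvalues gamma_1 >= ... >= gamma_L.
gamma, V = np.linalg.eigh(Y.T @ Y / T)          # eigh returns ascending order
E_hat = V[:, -L:]                               # (e_1, ..., e_L), up to column order
X_hat = Y @ E_hat @ E_hat.T                     # estimate of (X_1(s), ..., X_T(s))

mse_fit = np.mean((X_hat - X) ** 2)             # reconstruction error of the signal
mse_raw = np.mean((Y - X) ** 2)                 # raw noise level, ~0.04 here
print(mse_fit < mse_raw)
```

Note that no smoothing of the individual curves is involved; the denoising comes entirely from the cross-sectional and temporal averaging implicit in the PCA.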
In the next section we formulate our assumptions and the main theorems. In Section 3 we give the core steps of the proofs. We conclude in Section 4. Technical lemmas and detailed proofs are given in the Appendix.

Results
In this section we formulate and discuss our assumptions and present our large sample results. It will be assumed throughout that (X t : t ≥ 1) is stationary and that the curves (X t (s) : s ∈ [0, 1]) are second-order stochastic processes. We then can define µ(s) = EX t (s) and Γ X (s, s ′ ) = Cov(X t (s), X t (s ′ )), respectively. We will be considering the setting where the sample size T and the number of sampling points p = p(T ) tend to infinity. The data are sampled as in (1).

Assumptions
Let us now list the detailed assumptions on the processes (u t ) and (X t ).
Assumption 1. The noise process (u_t) is i.i.d. zero mean and independent of the signals (X_t). The processes (u_ti : 1 ≤ i ≤ p) are Gaussian with covariance function γ_u(h) = Cov(u_{t(i+h)}, u_{ti}), such that Σ_{h∈Z} |γ_u(h)| ≤ C_u < ∞.
Assuming Gaussianity of the noise processes simplifies our proofs and assumptions and seems reasonably justified in the context of modelling measurement errors. It is important to note that we don't impose that u_ti and u_tj are uncorrelated, as is often required for factor models. In many real data situations it is likely that errors of consecutive measurements are at least mildly correlated. Our main theorems below can be extended under suitable moment and dependence assumptions for the (u_ti : 1 ≤ i ≤ p); see Appendix B. Summability of the autocovariances implies that the correlation between the errors decays quickly with increasing lag. This is in line with the paradigm of approximate factor models. Such models have first been discussed in Chamberlain and Rothschild (1983) in the context of macroeconomic applications and are now quite commonly used in high dimensional factor model settings. Several works have further investigated approximate factor models since then, see e.g., Bai (2003); Bai and Liao (2016); Choi (2012); Bai and Li (2016); Bai and Ng (2019).
For the signal process (X_t) we will allow for temporal dependence. A concept that proved to be rather useful in this context is the so-called L^p-m-approximability, which has been proposed in Hörmann and Kokoszka (2010). A sequence is said to be L^p-m-approximable if it has an ergodic representation of the form X_t = g(δ_t, δ_{t−1}, ...) with some measurable functional g and i.i.d. elements (δ_t) in some measurable space, and satisfies

Σ_{m≥1} (E‖X_m − X_m^{(m)}‖^p)^{1/p} < ∞,   (3)

where X_t^{(m)} = g(δ_t, ..., δ_{t−m+1}, δ′_{t−m}, δ′_{t−m−1}, ...). Here ‖·‖ denotes the L²-norm for square integrable functions on [0, 1] and (δ′_t) is an independent copy of the innovations (δ_t). This means that we can approximate X_t by coupled m-dependent sequences sufficiently well. Several popular functional time series models (e.g. functional ARMA or functional GARCH) fall into this framework. Unlike diverse mixing conditions, L^p-m-approximability is often much easier to establish. For the purpose of illustration consider a functional AR(1) model X_t = ̺(X_{t−1}) + δ_t. Under suitable assumptions on the linear operator ̺ (see e.g. Bosq (2000)) we get by iterative application of the recursion that X_t = Σ_{k≥0} ̺^k(δ_{t−k}), and hence the required representation holds. A sufficient assumption is ‖̺‖ < 1. Here the coupled version is X_t^{(m)} = Σ_{k<m} ̺^k(δ_{t−k}) + Σ_{k≥m} ̺^k(δ′_{t−k}), so that (E‖X_m − X_m^{(m)}‖^p)^{1/p} = O(‖̺‖^m). This shows that with increasing m, the L^p error between the original data and the coupled m-dependent sequence goes to zero fast enough to guarantee summability in (3). More details can be found in Hörmann and Kokoszka (2010).
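The geometric decay of the coupling error in the AR(1) example can be checked numerically. The sketch below uses a finite-dimensional proxy; the operator `A`, the dimensions and the Monte Carlo sizes are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, N = 5, 60, 500              # state dimension, series truncation, MC replications
A = 0.6 * np.linalg.qr(rng.normal(size=(d, d)))[0]   # AR operator with norm 0.6 < 1

# Precompute A^0, ..., A^{K-1} for X_t = sum_{k>=0} A^k delta_{t-k}.
pows = [np.eye(d)]
for _ in range(K - 1):
    pows.append(A @ pows[-1])
pows = np.stack(pows)             # shape (K, d, d)

errs = []
for m in (1, 2, 4, 8):
    sq = 0.0
    for _ in range(N):
        delta = rng.normal(size=(K, d))
        delta_c = delta.copy()
        delta_c[m:] = rng.normal(size=(K - m, d))   # independent copy at lags >= m
        # X_t - X_t^{(m)} = sum_{k>=m} A^k (delta_k - delta'_k)
        diff = np.einsum("kij,kj->i", pows, delta - delta_c)
        sq += diff @ diff
    errs.append(float(np.sqrt(sq / N)))             # (E ||X - X^{(m)}||^2)^{1/2}
print([round(e, 2) for e in errs])
```

The printed errors shrink roughly by the factor ‖A‖ = 0.6 per unit increase of m, which is the geometric decay needed for summability in (3).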
The zero mean assumption is not necessary, but it simplifies the presentation. In practice µ(s) will be estimated via µ̂(s) = T^{-1}(y_1 + ... + y_T) and the data will be centered. It is implicit in the definition of L^4-m-approximability that the process (X_t) is strictly stationary and ergodic, which is why the conditions (ii) and (iii) can be restricted to X_1. The main theoretical restriction is Assumption 2 (iv). It is motivated by Mercer's theorem, which yields, under the assumption of a continuous covariance kernel, that for a functional variable X_t

sup_{s∈[0,1]} E(X_t(s) − X_t^L(s))² → 0,  L → ∞,   (4)

where φ_ℓ denotes the eigenfunction of the covariance operator Γ_X belonging to the ℓ-th eigenvalue λ_ℓ and the x_{tℓ} = ∫_0^1 (X_t(s) − µ(s)) φ_ℓ(s) ds =: ⟨X_t − µ, φ_ℓ⟩ are the functional principal component scores. In most real data problems the convergence in (4) is very fast, so that even for relatively small L the variables X_t^L(s) = µ(s) + Σ_{ℓ=1}^L x_{tℓ} φ_ℓ(s) define only a slight perturbation of X_t(s), lying in the L-dimensional space spanned by the functional principal components φ_1, ..., φ_L. Hence this Assumption 2 (iv) holds asymptotically for a very broad class of functional data. Now the key observation that we are going to exploit in the context of this paper is that under Assumptions 1 and 2 the y_t given in (1) follow some L-factor model, where X_t(s) is the common component and u_t the idiosyncratic component. See Hörmann and Jammoul (2021).
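The uniform truncation error in (4) can be visualized for a concrete kernel. The sketch below uses the Brownian-motion covariance min(s, s′) as an illustrative example (the helper `tail` is our own naming):

```python
import numpy as np

p = 300
s = np.linspace(1 / p, 1.0, p)
K = np.minimum.outer(s, s)                 # Brownian-motion covariance kernel min(s, s')
lam, phi = np.linalg.eigh(K / p)           # discretized eigenpairs of Gamma_X
lam, phi = lam[::-1], phi[:, ::-1] * np.sqrt(p)   # descending order; normalize so ∫phi² ≈ 1

def tail(L):
    """sup_s sum_{l > L} lambda_l phi_l(s)^2: the uniform truncation error in (4)."""
    return float(np.max((phi[:, L:] ** 2 * lam[L:]).sum(axis=1)))

print([round(tail(L), 4) for L in (1, 5, 20)])
```

The printed values decrease in L, mirroring the uniform convergence delivered by Mercer's theorem; for smoother kernels the decay is much faster still.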
At this stage we stress once more that p = p(T) and L = L(T), which explains the asymptotic formulation of this condition. Hence, this assumption is very mild.

Theorems
We are now ready to formulate our asymptotic results. We distinguish the scenarios where L is fixed (Theorem 1) and where L = L(T ) → ∞ (Theorem 2). Our main results are formulated with L known. At the end of this section we comment on the estimation of L. All following asymptotic formulations are understood as T → ∞.
Theorem 1. Let Assumptions 1-3 hold. Assume that p = p(T ) diverges at a subexponential rate and let L be arbitrary but fixed. Furthermore, suppose that

Hence the estimator is consistent if
Theorem 1 needs further explanation. First, we comment on p/γ̂_L = O_P(1). The following lemma provides a simple condition under which this assumption holds.
Then, under the conditions of Theorem 1, we have max_{i≥1} |γ̂_i/p − λ_i| →P 0. The lemma shows, among others, that p/γ̂_L is a consistent estimator for λ_L^{-1} and hence is O_P(1). Condition (5) holds if the covariance kernel Γ_X is uniformly continuous. We would like to point out that the conditions in Lemma 1 may be adapted and weakened. We decided to disentangle this problem from the theorem and to present conditions which are neat and not very technical.
Concerning the uniform consistency in Theorem 1, we need in addition a bound for max_{1≤t≤T} ‖f_t‖. Such bounds can easily be obtained under higher order moment conditions. We formulate the following lemma.
Lemma 2. Suppose that Assumption 2 holds and that L is fixed.
The lemma implies that if we request moments of order > 4 for the signal process X t , then our Theorem 1 provides uniform consistency over time and cross-section.
Next we consider the setting where the dimension of the function space is allowed to diverge. Not surprisingly, this requires extra conditions, e.g. on the eigenvalues of Γ X and on the growth rate of L. Instead of listing several different sets of conditions and results, we have decided to present here one specific setup which allows for relatively neat conditions. However, we transparently outline the main steps and bounds needed for the proofs in Section 3. From there it is not very complicated to see where the respective conditions enter and how changing the assumptions will change the results.
Here we refer to the decomposition of the approximation error in (8) and (9) in particular. This leaves some flexibility and allows the analyst to adapt the theorems to the setup under investigation.
Theorem 2. Let Assumptions 1-3 hold. Assume that p = p(T) diverges at a subexponential rate and that L = L(T) diverges at a subpolynomial rate. Furthermore, assume that for some ν > 0 and some ρ > 0 we have λ_j ≥ ρ j^{-ν}, and that there is some β > 0 such that p/γ̂_L = O_P(L^β). Then, for any
As one referee pointed out, it may appear counter-intuitive that the reconstruction error vanishes more slowly when the eigenvalue sequence decreases faster. The problem here is that the smaller the eigenvalue, the weaker the signal associated with the corresponding factor and the more difficult it is to estimate. A similar problem occurs in functional regression. While a fast decay of the eigenvalues associated with the functional covariate indicates that we are "almost" in a finite dimensional setting, convergence rates in fact become slower. See, for example, the results in Hall and Horowitz (2007).
The technical conditions in Theorem 2 are mild. But as we noted before, they can easily be generalized. For example, it is possible to assume a polynomial rate for L(T), but then we will get weaker and more complicated error terms. One part of our assumption which is less transparent is p/γ̂_L = O_P(L^β). The following lemma provides a sufficient condition.
In Theorems 1 and 2 the required number of factors L is assumed to be known, even though it may grow with T within the setting of Theorem 2. Naturally, in practice L is not known and must be estimated before applying the above mentioned methods. Determining the number of factors is a separate problem, which goes beyond the scope of our article. We refer e.g., to Bai and Ng (2002) or Onatski (2010). In our practical implementations we found the bi-cross-validation method suggested by Owen and Wang (2016) to be most effective. However, we show that the use of an appropriate estimator for L does not interfere with our results. Let us assume here that an estimator L̂ exists, such that

Pr(|L̂ − L| = 0) → 1.   (6)
Corollary 1. Suppose that the number of factors is estimated byL, and that this estimator satisfies (6). Then the statements of Theorems 1 and 2 remain valid.
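For illustration, one simple device for choosing L is the eigenvalue-ratio heuristic, which picks the largest relative gap in the spectrum. We stress that this is a stand-in for exposition only (the paper itself recommends bi-cross-validation); the simulated model below is an invented example:

```python
import numpy as np

rng = np.random.default_rng(3)
T, p, L_true = 400, 120, 3
s = np.linspace(0.0, 1.0, p)
phi = np.stack([np.sqrt(2) * np.sin((l + 1) * np.pi * s) for l in range(L_true)])
X = ((rng.normal(size=(T, L_true)) * np.array([3.0, 2.0, 1.0])) @ phi).T
Y = X + rng.normal(scale=0.2, size=(p, T))

# Nonzero eigenvalues of T^{-1} Y'Y equal those of T^{-1} YY'; use the p x p version.
gamma = np.linalg.eigvalsh(Y @ Y.T / T)[::-1]        # eigenvalues, descending
ratios = gamma[:-1] / gamma[1:]
L_hat = int(np.argmax(ratios[:10]) + 1)              # largest gap among leading eigenvalues
print(L_hat)
```

In this example the three signal eigenvalues dwarf the noise spectrum, so the largest ratio occurs at the true dimension.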
We conclude this section by commenting on the dependence structure of the estimated signals (X̂_t(s)). As one of the reviewers pointed out, our procedure to some extent modifies the autocorrelation structure of the functional series. For example, if the X_t were independent, then the X̂_t may nevertheless be dependent, since each X̂_t is computed from the entire sample. Intuitively, this effect will be asymptotically negligible by our main results. Let us illustrate that this intuition is correct on the basis of the empirical autocovariance. To this end we note that

T^{-1} Σ_t X̂_{t+|h|}(s_i) X̂_t(s_j) − T^{-1} Σ_t X_{t+|h|}(s_i) X_t(s_j)
= T^{-1} Σ_t (X̂_{t+|h|}(s_i) − X_{t+|h|}(s_i)) X̂_t(s_j) + T^{-1} Σ_t X_{t+|h|}(s_i) (X̂_t(s_j) − X_t(s_j)).

The first term on the right hand side can be bounded by

max_{1≤t≤T} max_{1≤j≤p} |X̂_t(s_j) − X_t(s_j)| × T^{-1} Σ_{t=1}^T max_{1≤j≤p} |X̂_t(s_j)|.

The second term can be treated similarly. The factor T^{-1} Σ_{t=1}^T max_{1≤j≤p} |X_t(s_j)| is O_P(1) by Assumption 2 (iii) and the ergodic theorem, while the first factor tends to zero at the rate specified by our theorems. On the other hand, it follows from the ergodic theorem that the empirical acf of the latent signals X_t(s_i) is consistent for Cov(X_{t+|h|}(s_i), X_t(s_j)). The imposed weak dependence assumption can be used to obtain rates.
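This asymptotic negligibility is easy to observe in simulation. The sketch below (with invented model ingredients and our own helper name `acf`) compares the lag-1 empirical autocovariance of the fits with that of the latent signals when the X_t are independent:

```python
import numpy as np

rng = np.random.default_rng(4)
T, p, L = 400, 100, 2
s = np.linspace(0.0, 1.0, p)
phi = np.stack([np.sqrt(2) * np.sin((l + 1) * np.pi * s) for l in range(L)])
scores = rng.normal(size=(T, L)) * np.array([2.0, 1.0])   # independent across t
X = (scores @ phi).T
Y = X + rng.normal(scale=0.3, size=(p, T))

E = np.linalg.eigh(Y.T @ Y / T)[1][:, -L:]
X_hat = Y @ E @ E.T                                        # PCA factor estimate

def acf(Z, h, i, j):
    """Empirical lag-h autocovariance of (Z_t(s_i), Z_t(s_j)), columns of Z as time."""
    return float((Z[i, h:] * Z[j, : Z.shape[1] - h]).mean())

# Both lag-1 autocovariances are near zero; their difference is of the order of
# the uniform estimation error and vanishes asymptotically.
d = abs(acf(X_hat, 1, 10, 10) - acf(X, 1, 10, 10))
print(d < 0.1)
```

So although each X̂_t uses the whole sample, the induced spurious autocorrelation is dominated by the reconstruction error rates of the theorems.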

Comparison to existing results
Let us begin by noting that the common curve-by-curve fitting techniques, such as spline smoothing or non-linear regression, don't profit from T → ∞. The estimation error will only decrease with growing p and not with a growing sample size. Hence our results do not directly compare to these methods.
A method taking into account the entire sample is Staniswalis and Lee (1998). Their approach, in contrast to ours, returns a full curve and doesn't require a fixed sampling design. Of course, we can also obtain a full curve by using a sensible interpolation scheme, e.g. X̃_t(s) := X̂_t(s_i) for all s ∈ [s_i, s_{i+1}) is a simple solution. We think, however, that such an extension of X̂_t(s) to a curve X̃_t(s) is of little practical relevance, since for the processing of real data we will most often use the discretised curves anyway. For example, in a functional regression model Y_t = ∫ β(s) X_t(s) ds + ε_t, the integral typically can only be computed numerically, e.g. by applying a composite trapezoidal rule. It is then natural to use s as the nodes.
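The step-function extension and the trapezoidal integral above can be sketched as follows; the curve, the coefficient function and the helper name `x_tilde` are illustrative assumptions:

```python
import numpy as np

p = 50
s = np.linspace(0.0, 1.0, p)                  # sampling grid, used as quadrature nodes
x_hat = np.sin(2 * np.pi * s)                 # stands in for recovered values X_t(s_1..s_p)

def x_tilde(v):
    """Piecewise-constant extension: X(v) := X(s_i) for v in [s_i, s_{i+1})."""
    i = np.clip(np.searchsorted(s, v, side="right") - 1, 0, p - 1)
    return x_hat[i]

# Integral of beta(s) X_t(s) via the composite trapezoidal rule with nodes s.
beta = np.exp(s)                              # some illustrative coefficient function
g = beta * x_hat
integral = float(np.sum((g[:-1] + g[1:]) * np.diff(s)) / 2.0)
print(round(integral, 4))
```

For this smooth integrand the trapezoidal rule on 50 nodes is already accurate to a few thousandths, supporting the remark that the discretised curves suffice in practice.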
Extending our results to obtain a theoretical bound for sup_s |X̃_t(s) − X_t(s)| is conceptually not very complicated. Set h = max_{1≤i<p} |s_{i+1} − s_i| and define the modulus of continuity of X_t as δ_X(h) = sup_{|s−s′|≤h} |X_t(s) − X_t(s′)|. Then sup_s |X̃_t(s) − X_t(s)| ≤ max_{1≤i≤p} |X̂_t(s_i) − X_t(s_i)| + δ_X(h). For the first term on the right we have provided bounds. By imposing suitable assumptions on the path properties of the signal X_t, the errors δ_X(h) can be controlled, provided the sampling grid s is dense. Let us remark at this point that for our theorems we don't require a dense grid s, i.e., we do not need that h → 0 as p → ∞.
Corresponding asymptotic results for the method of Staniswalis and Lee (1998) have been established in Müller et al. (2006). These authors also assume a fixed and dense sampling design but indicate that such assumptions can be relaxed to a certain extent. Their framework requires smoothness of the underlying signal in terms of higher order derivatives. This is an assumption which we particularly seek to avoid. In Hörmann and Jammoul (2021) we have seen that smoothness of the signal is not just a technical assumption needed for their proof, but also that the practical performance suffers severely for signals which are not very smooth. An important technical restriction in Müller et al. (2006) is that signals are assumed to be bounded, which then also excludes the important case of Gaussian processes. In our paper we only request moments of order > 4.
The bounds in Müller et al. (2006) are quite involved and concern, for fixed t, the error sup_s |X̂_t(s) − X_t(s)|. To the best of our knowledge, ours is the first paper to provide uniform bounds over s and 1 ≤ t ≤ T.
Basically all papers considering the setting (1) assume that the noise components are i.i.d. or at least white noise. We think that this is a severe restriction. In real data it is rather plausible that errors of subsequent measurements are correlated. From the factor model perspective, we have improved on the conditions used in Fan et al. (2013), who require, among others, exponential moments for the involved processes. Moreover, we allow L → ∞, which is also novel.
Proofs

The j-th columns of Y^⊤ and U^⊤ are denoted Y_j and U_j, respectively. The j-th row of F^⊤ is denoted by F_j. Then we have by (1), (4) and Assumption 2 (iv) that y_t = B f_t + u_t, where the p × L matrix B := (φ_1(s), ..., φ_L(s)) contains the discretised eigenfunctions and f_t := (x_{t1}, ..., x_{tL})^⊤ the scores, so that B f_t = X_t(s). In this notation, the objective is to estimate BF^⊤ through some estimator B̂F̂^⊤ and to show that BF^⊤ − B̂F̂^⊤ tends to zero, uniformly over each component. In particular we let F̂ = √T Ê and B̂ = T^{-1} Y F̂, where Ê = (ê_1, ..., ê_L) denotes the eigenvectors associated with the L largest eigenvalues of T^{-1} Y^⊤ Y. Then X̂ = B̂F̂^⊤ as in (2). We introduce the matrices Λ̂ = diag(γ̂_1, ..., γ̂_L) and H := T^{-1} Λ̂^{-1} F̂^⊤ F B^⊤ B. We will see in Lemma 8 that H is asymptotically orthogonal, which means that BF^⊤ − B̂F̂^⊤ ≈ BH^⊤ H F^⊤ − B̂F̂^⊤. The basic idea is then to show that B̂ − BH^⊤ and F̂^⊤ − HF^⊤ become small. The matrix H guarantees that the estimated and the empirical scores share the same orientation.
More specifically we have X̂_t(s_j) − X_t(s_j) = ... where R^{(1)} := max_{1≤t≤T} ... Here, with a slight abuse of notation, ‖·‖ denotes next to the L²-norm also the Euclidean norm of a vector and the spectral norm of a matrix. The concrete meaning will be clear from the context. Bounds for the R^{(i)} will be obtained below in a sequence of lemmas. The proofs of these technical lemmas are given in Appendix A.
We begin with the term R^{(1)}. By simple algebraic manipulations it follows from (7) that ... Using the spectral theorem we get the representation ... Notice that F̂^⊤ − HF^⊤ is an L × T random matrix. The t-th column can be written as ... The following lemma provides the order of magnitude of max_{1≤t≤T} ‖A_{mt}‖, m ∈ {1, 2, 3}.
Next we study the order of magnitude of R^{(2)}. By the definition of B̂ and H we get ... Rearranging the terms we thus get ... The following lemma provides the order of magnitude of max_{1≤j≤p} ‖B_{mj}‖, m ∈ {1, 2, 3}.
Lemma 5. Consider Assumptions 1, 2 and 3. Suppose that L/(T λ_L²) → 0 and that p diverges at a subexponential rate. Then, as T → ∞, ... Thus, if m_T denotes the maximum of the rates in (14)-(16), then R^{(2)} = O_P(m_T (p/γ̂_L)).
Proof of Corollary 1: We let X̂_t(s_i|K) be the estimated components in (2) with Ê = (ê_1, ..., ê_K). When L is unknown our estimator is of the form X̂_t(s_i|L̂). Moreover, X̂_t(s_i|L) = X̂_t(s_i) as it was used in Theorems 1 and 2. From those theorems we have statements of the form

Pr(max_{1≤t≤T} max_{1≤i≤p} |X̂_t(s_i|L) − X_t(s_i)| > a_T) → 0.

This remains true if we replace L by L̂, since

Pr(max_{1≤t≤T} max_{1≤i≤p} |X̂_t(s_i|L̂) − X_t(s_i)| > a_T) ≤ Pr(L̂ ≠ L) + Pr(max_{1≤t≤T} max_{1≤i≤p} |X̂_t(s_i|L) − X_t(s_i)| > a_T) → 0

by (6).

Conclusion
In this paper we develop the theoretical foundation for a factor model based approach to preprocessing functional data. We show that the recovery of the underlying signal from noisy functional data can be done consistently using a simple and straightforward application of a principal component factor model. We relax many assumptions common in the approximate factor model literature as well as in comparable FDA smoothing techniques, and provide a general and neat setting which appears realistic in real life applications. We extend common factor analysis approaches by considering the case where the number of factors may grow with the dimensions of the dataset. Furthermore, we expand upon comparable FDA smoothing techniques by including the whole dataset in the signal recovery instead of applying common curve-by-curve techniques. Practical applications as well as extensive simulation studies are presented in the companion paper Hörmann and Jammoul (2021).

A Proofs of technical lemmas
In this appendix we prove the lemmas stated in Section 3 and some further technical lemmas. We start by reviewing some further background and notation. All random variables below are defined on a common probability space (Ω, A, P). We note that by our assumptions the processes X_t are elements of H = L²([0, 1]), the space of square integrable functions on the interval [0, 1]. This space is equipped with the scalar product ⟨x, y⟩ = ∫_0^1 x(t) y(t) dt and the norm ‖x‖ = ⟨x, x⟩^{1/2}. For the sake of a light notation we will use ‖·‖ also for the Euclidean norm of a vector and the spectral norm of a matrix. It will be clear from the context what type of norm it indicates. We use ‖·‖_F for the Frobenius norm.
The signals X t can also be viewed as elements in L 2 H = L 2 H (Ω, A, P ), i.e. the space of H-valued random elements X with E X 2 < ∞. The space L 2 H is also a Hilbert space with inner product E X, Y .
The outer product ⊗ between two functions x, y ∈ H defines a Hilbert-Schmidt (HS) operator through (x ⊗ y)(z) = x⟨z, y⟩. The set of Hilbert-Schmidt operators on H will be denoted by S. Let (e_i) be an orthonormal basis (ONB) of H. Then the space S equipped with the scalar product ⟨W, V⟩_S := Σ_{i≥1} ⟨W(e_i), V(e_i)⟩ (which can be shown to be independent of the ONB) is another separable Hilbert space. It is easily seen that E‖X‖⁴ < ∞ implies E‖X ⊗ X‖²_S < ∞ and thus X ⊗ X ∈ L²_S(Ω, A, P). Then we may define Ω_X := Var(X ⊗ X). For the definition of covariance operators on Hilbert spaces we refer to Bosq (2000).
We use →d to indicate convergence in distribution. A normal random element in some separable Hilbert space H with mean µ and covariance operator Ω is denoted by N_H(µ, Ω).

Preliminary lemmas
Lemma 9. Define W_T := Σ_{t=1}^T X_t ⊗ X_t. Under Assumption 2 (i) ... Proof: Assumption 2 (i) implies that the process (X_t ⊗ X_t : t ≥ 1) is L²-m-approximable in S. This follows by a simple modification of the proof of Lemma 2.1 in Hörmann and Kokoszka (2010). Thus Theorem 8 in Hörmann and Cerovecki (2017) applies, which in turn yields the central limit theorem (see Theorem 5 in Hörmann and Cerovecki (2017)). Part (ii) follows directly from Theorem 3.1 in Hörmann and Kokoszka (2010).

Lemma 10. Under Assumption 1 there is a constant c_2, independent of p, such that for k ∈ {1, 2}: (i) E(u_2^⊤ u_1)^{2k} ≤ c_2 p^k and (ii) E‖u_1‖^{4k} ≤ c_2 p^{2k}.
Proof: We show (i). The proof of (ii) is along the same lines. The main ingredient is Isserlis' theorem, which states that for a zero-mean multivariate normal vector (Z_1, ..., Z_{2k})^⊤ the product moments are given by

E(Z_1 ⋯ Z_{2k}) = Σ_{P ∈ P[I]} Π_{{i,j} ∈ P} E(Z_i Z_j),

where P[I] denotes the set of all pairings of the components of some index vector I. Denote u_1 = (ε_1, ..., ε_p)^⊤. Since u_1 and u_2 are i.i.d. we get E(u_2^⊤ u_1)^{2k} = Σ_{i_1,...,i_{2k}=1}^p (E ε_{i_1} ⋯ ε_{i_{2k}})², and thus ... The last inequality was deduced from Assumption 1.
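The combinatorics behind Isserlis' theorem can be made concrete. The sketch below enumerates the pairing set P[I] (our own helper `pairings`) and checks the k = 2 identity on an illustrative 2 × 2 covariance matrix:

```python
import numpy as np

def pairings(idx):
    """All partitions of the index list into unordered pairs: the set P[I]."""
    idx = list(idx)
    if not idx:
        return [[]]
    first, rest0 = idx[0], idx[1:]
    result = []
    for j, partner in enumerate(rest0):
        rest = rest0[:j] + rest0[j + 1:]
        for tail in pairings(rest):
            result.append([(first, partner)] + tail)
    return result

# For 2k indices there are (2k-1)!! pairings: 3, 15, 105 for k = 2, 3, 4.
counts = [len(pairings(range(2 * k))) for k in (2, 3, 4)]

# Check E[Z_a Z_b Z_c Z_d] = s_ab s_cd + s_ac s_bd + s_ad s_bc on E[Z_1^2 Z_2^2]
# for an illustrative covariance matrix Sigma.
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
comp = [0, 0, 1, 1]                       # the product Z_1 Z_1 Z_2 Z_2
val = sum(
    np.prod([Sigma[comp[a], comp[b]] for a, b in P]) for P in pairings(range(4))
)
print(counts, val)                        # [3, 15, 105] 2.5
```

The (2k−1)!! growth of the pairing count is exactly what produces the p^k bound in (i) once the summability of the autocovariances is used.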
Lemma 11. Under Assumption 2 (i) ... Proof: Denoting by δ_ij the Kronecker delta, we notice that the (i, j)-th entry of the (L × L)-matrix T^{-1} F^⊤ F − I_L can be written as ... This shows that ... By Lemma 9 (i) we infer that T^{-1/2} ‖W_T − T Γ_X‖_S = O_P(1), and hence the result follows.
For the next result we note that the k-th row of f̂_t − H f_t is given by ... where F̂_k is the k-th row of F̂^⊤.

Lemma 12. Under Assumptions 1 and 2
Proof: We show (20) and (21). For (20) we use that ... In the last step we used the Cauchy-Schwarz inequality and the fact that f̂_{kℓ} (the ℓ-th column and k-th row of F̂^⊤) satisfies Σ_{ℓ=1}^T f̂²_{kℓ} = T. Using the Markov inequality with Assumption 1 and Lemma 10 we get for any κ > 0 ... The first statement of the lemma is shown. By similar arguments as above we obtain, using B f_t = X_t(s), ... Then, using the independence between X_t(s) and u_ℓ (Assumption 1), we get that ... By Assumption 2 (ii) we have sup_{s∈[0,1]} EX²(s) < ∞, and so (21) follows with similar arguments as before using the Markov inequality. The bound for (22) can be derived in complete analogy to (21).

Lemma 13. Under Assumptions 1 and 2 and if
Then, we note that for any κ > 0 ... For p_2 we get with Lemma 9 (ii) and the Markov inequality ... To derive the probability p_1(κ) we first condition on X_1, ..., X_T. Then ... Conditional on X_1, ..., X_T the variable ... By an elementary bound for the normal distribution function (see e.g., Petrov (1995, p. 177)) we get ... Thus for any i and j ... By taking the expectation of the conditional probability we obtain a bound for the unconditional probability. Then, summing over i and j yields ... Now the proposition of the lemma follows immediately.

Thus, for any
showing that γ̂_i^X/p = λ̂*_i. Finally, by a basic inequality for matrix norms, we have ‖Γ_u‖ ≤ (‖Γ_u‖_c ‖Γ_u‖_r)^{1/2}, where ‖A‖_c and ‖A‖_r denote the maximal absolute column sum and the maximal absolute row sum of some matrix A, respectively. In particular, this implies that ‖Γ_u‖ ≤ Σ_{h∈Z} |γ_u(h)| ≤ C_u (Assumption 1). By results on the distribution of the largest eigenvalue of random covariance matrices in Johnstone (2001) and Karoui (2005), one can deduce that ... Next we observe that

By our previous arguments we see that this last term is
and hence T^{-1} ‖XX^⊤‖ = O_P(p). Thus, we can conclude ... This finishes the proof.
Proof of Lemma 2: It holds by the Bonferroni and the Markov inequalities that for any ε > 0 ... By the stationarity of (X_t) and ⟨X_1, φ_i⟩² ≤ ‖X_1‖² we have ... The statement (i) follows. For statement (ii) we use Pr(exp(α‖f_t‖) > T^{αε}). Now using again the Markov inequality and choosing ε > 1/α, this last term tends to zero.

Proof of Lemma 3:
We are going to show that ... Then, by the lower bound on the eigenvalues, we conclude that the right-hand term in (25) is bounded by 2 Pr(|λ_L − γ̂_L/p| > ρ L^{-ν}/2). Since we assume that Lemma 1 applies, and since we assume L^{2ν}/max{p, T} → 0, this term goes to zero. As for (25), we remark that for a random variable Y > 0 and some x > 0 it holds that

Proof of Lemma 4: Recall that
and that Σ_{ℓ=1}^T ‖f_ℓ‖² = TL. Thus, by the triangle inequality and the Cauchy-Schwarz inequality it follows that ... The Markov inequality yields ... From Lemma 10 (ii) we infer that ... Yet another application of the Markov inequality and Lemma 10 gives ... Combining these estimates yields (11). Next we verify (12). We recall that B f_t = X_t(s) and hence ... Consequently, by the Markov inequality ... Moreover, ... E u_{1,i_1} u_{1,i_2} u_{1,j_1} u_{1,j_2} E X_1(s_{i_1}) X_1(s_{i_2}) X_1(s_{j_1}) X_1(s_{j_2}).
The Cauchy-Schwarz inequality applied twice gives ... By Assumption 2 (ii) this term is bounded by C_X. Combining these results with Assumption 1 it is easy to derive that ... This yields (12). Relation (13) can be derived analogously.
Using Lemma 7 and Lemma 13, the claim follows. For B_{3j}, we note that with Assumption 2 it follows that ... The claim then follows directly from Lemma 7 and Lemma 11. This finishes the proof.
Proof of Lemma 6: We just have to note that ‖b_j‖² = Var(X_1(s_j)).

Proof of Lemma 7: It holds that
Thus, the following easily verifiable assertions imply the proof: ... In (26) we used Lemma 9, while (27) follows from Σ_{k≥1} λ_k < ∞ and Assumption 3.
Proof of Lemma 8: Let us first derive a bound for ‖HH^⊤ − I_L‖. We have ... For the first term in (28) we get ... by Lemmas 7 and 11, and the second term is equal to ... We have ‖H‖ = O_P(√(pL/γ̂_L) √λ_L) (Lemma 7), ‖F‖_F = O_P((TL/λ_L)^{1/2}) (see proof of Lemma 7) and ‖F̂‖_F = √(LT). From Lemma 12 we deduce that ... Combined, this shows that the second term in (28) is ... Define D_L := HH^⊤ − I_L and denote S = S_T = {ω ∈ Ω : ‖D_L‖ < 1}. If the bound in (29) tends to zero, then Pr(S) → 1. On S the series W_L := I_L − D_L + D_L² − D_L³ ± ... converges and it is easily seen that W_L is the inverse of I_L + D_L = HH^⊤. But then H is invertible and H^{-1} = H^⊤ W_L.
Hence we get

B Weakening the assumption on Gaussian noise
As mentioned in Section 2.1, one may weaken the assumed Gaussianity of the error process in Assumption 1. We give a rough outline of the consequences of dropping this restriction. In the following we instead assume 8 moments for u_t and the validity of some moment inequalities which are well known in the i.i.d. case and which have been verified under Assumption 1. Additionally, we assume p = O(T²) in order to guarantee the same convergence rates in our theorems; previously, subexponential growth of p = p(T) was possible. Below we discuss the necessary modifications.
(I) In Lemma 10 we prove (i) E(u_2^⊤ u_1)^{2k} ≤ c_2 p^k and (ii) E‖u_1‖^{4k} ≤ c_2 p^{2k}, k ∈ {1, 2}. In the case of i.i.d. noise this follows immediately from Rosenthal's inequality (see Petrov (1975)). To apply this inequality we need 2k moments for (i) and 4k moments for (ii), which ultimately means that we need 8 moments. Extensions to stationary processes exist (see e.g. Liu et al. (2013)). In order to provide a more general setting, we now state (i) and (ii) as assumptions.
(II) In Lemma 13 we need a bound for max_{j≤p} max_{i≤L} ...; the result is immediate.
The lemma is now applied to (31) with κ = c (Lp)^{1/4} T^{-1/2}. Then for any ε > 0 there is a c independent of T, p and L such that p_1(κ) < ε. Using this in Lemma 13 implies that max_{j≤p} T^{-1} ‖F^⊤ U_j‖ = O_P((Lp)^{1/4}/√T).
If again we assume 8 moments for u_t, the analogous approach yields max_{j≤p} T^{-1} ‖F^⊤ U_j‖ = O_P((Lp)^{1/8}/√T).
(III) The bound in (II) is used in the proof of Lemma 5. In (15) we obtain the modification ... This implies that the bound for R^{(2)} needs to be modified. In the proof of Theorem 1 (where L is fixed) this needs to be replaced by O_P(1/T^{1/4} + T^{1/4}/√p + p^{1/8}/√T).
Since we require p = O(T²), the rate remains unchanged. Moreover, the factor L in (32) enters with a smaller power than in (15), and hence this modification also does not affect the rate in Theorem 2, where L is allowed to grow. In the i.i.d. case this can again be deduced by Rosenthal's inequality; likewise we may use some extension thereof in the dependent case ... by the Rosenthal inequality. Note that with respect to t the u_t are assumed to be independent. For our proof this term is required to be O(1), which is the case due to the assumption p = O(T²).