Ridge regression for the functional concurrent model

The aim of this paper is to propose estimators of the unknown functional coefficients in the Functional Concurrent Model (FCM). We extend the Ridge Regression method developed in the classical linear case to the functional data framework. Two distinct penalized estimators are obtained: one with a constant regularization parameter and the other with a functional one. We prove the convergence in probability of these estimators with a rate. We then study the practical choice of both regularization parameters. Additionally, we present simulations that show the accuracy of these estimators despite a very low signal-to-noise ratio.


Introduction
Functional Data Analysis (FDA) provides very good tools to handle data that are functions of some covariate (e.g. time, when dealing with longitudinal data); see Hsing and Eubank [11] or Horváth and Kokoszka [10]. These tools allow for better modelling of complex relationships than classical multivariate data analysis does, as noticed by Ramsay and Silverman [15, Ch. 1] and Yao et al. [20, 19], among others.
There are several models in FDA for studying the relationship between two variables. In this paper we are interested in particular in the Functional Concurrent Model (FCM), defined as

Y(t) = β_0(t) + β_1(t) X(t) + ε(t),   (1.1)

where t ∈ R, β_0 and β_1 are the unknown functions to be estimated, X and Y are random functions and ε is a random noise function. All the functions considered here are complex valued. From a practical perspective, all functional linear models can be reduced to a functional concurrent model with several covariates (Ramsay and Silverman [15, p. 220]). This model is also related to the functional varying coefficient model (VCM) and has been studied for example by Wu et al. [18] or, more recently, by Şentürk and Müller [16].
Another practical advantage of model (1.1) is that it simplifies the study of the following convolution model

W(u) = ∫_R θ(u − s) Z(s) ds + η(u),   (1.2)

where u, s ∈ R, through the Fourier transform F, with Y = F(W), β_0 ≡ 0, β_1 = F(θ), X = F(Z) and ε = F(η).
Despite the abundant literature related to the FCM or the functional VCM, there is hardly any paper providing estimators of the unknown functions in model (1.1) together with asymptotic properties stated in the norm of the functional space to which they belong.
As noticed by Ramsay and Silverman [15, p. 259], most of the current estimation methods come from a multivariate data analysis approach rather than from a functional one. For some applications, for example when the observations are highly auto-correlated, taking this functional nature into account may be decisive. If not, multivariate approaches may cause a loss of information because, as noticed by Şentürk and Müller [16, p. 1257], they "do not take full advantage of the functional nature of the underlying data". In practice this loss of information may reduce the accuracy of estimation and prediction. To circumvent this problem, Şentürk and Müller [16] propose a three-step functional approach based on smoothing and least squares estimation. However, their convergence results, obtained on compact sets, do not allow the study of specific models like (1.2), for which convergence on the whole real line is required.
Besides, Ramsay et al. [14, Ch. 10] propose a practical estimation method: the random functions are projected onto an adequate finite-dimensional subspace and a penalization is then used to choose the estimator. They do not provide a theoretical study of its asymptotic properties.
The objective of the present paper is to propose estimators of the functions β 0 and β 1 in the FCM (1.1) for which the asymptotic properties are obtained. Our estimation approach is based on the Ridge Regression method developed in the classical linear case, see Hoerl [8]. We extend this to the functional data framework of model (1.1).
To ease notation and the presentation of the results, we introduce in Section 2 a simplified centered model. The functional ridge regression estimator of the functional coefficient is then defined with a constant regularization parameter. In Section 3 we establish the consistency of this estimator and obtain a rate of convergence. Section 4 addresses the practical choice of the regularization parameter through cross-validation criteria; we also introduce a more flexible estimator with a functional regularization parameter. Simulation trials are presented in Section 5, comparing the two penalized estimators with the estimator of Ramsay et al. [14, Ch. 10] in a very low signal-to-noise ratio (SNR) setting. Finally, an application to a real data set is presented in Section 6. All the proofs are postponed to Section 8.

Estimator and hypotheses
Let (X_i, Y_i)_{i=1,…,n} be an i.i.d. sample of the FCM (1.1). To remove the functional intercept β_0, we center the model (1.1). The estimator of β_0 depends on the estimator of β_1 obtained from the centered model. Given that the natural estimators of E[X] and E[Y] are the empirical means X̄_n := (1/n) Σ_{i=1}^n X_i and Ȳ_n := (1/n) Σ_{i=1}^n Y_i, the estimator of β_0 is defined as

β̂_0 := Ȳ_n − β̂_1 X̄_n.   (2.1)

The convergence results on β̂_1 immediately transpose to β̂_0. Now, to focus on the estimation of β_1, we define the elements of the centered model as X := X − E[X], Y := Y − E[Y] and β := β_1, and the centered FCM writes

Y(t) = β(t) X(t) + ε(t).   (2.2)
In what follows we discuss the estimation of β.

Functional ridge regression estimator (FRRE)
The definition of the estimator of β in the centered model (2.2) is inspired by the estimator introduced by Hoerl [8] in the ridge regularization method, which deals with ill-posed problems in classical linear regression. Let λ_n > 0 be a regularization parameter; we define the Functional Ridge Regression Estimator (FRRE) of β as

β̂_n(t) := ( Σ_{i=1}^n Y_i(t) X_i^*(t) ) / ( Σ_{i=1}^n |X_i(t)|² + λ_n ),   (2.3)

where the exponent * stands for the complex conjugate. In the classical linear regression case, Hoerl and Kennard [9, p. 62] proved that there is always a regularization parameter for which the ridge estimator is better than the Ordinary Least Squares (OLS) estimator. Huh and Olkin [12] studied some asymptotic properties of the ridge estimator in this case. In the context of functional linear regression with scalar output, Hall et al. [6, p. 73] have also used a ridge regularization method to invert the whole covariance operator of X. Their approach has two main differences with ours: we use (i) functional outputs (Y_i) and (ii) an inversion of the diagonal terms only of the covariance structure of X.
In our case, the use of λ_n in the denominator prevents division by zero: because E[X] = 0 (centered model), the empirical second moment (1/n) Σ_{i=1}^n |X_i(t)|² may be close to zero at some points t, and the penalization helps to control the instability of the estimator. The simulation studies in Section 5 show that in practice a better estimator is obtained with a positive regularization parameter.
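On discretized curves, the FRRE is a pointwise ridge estimator and can be computed in a few lines. The sketch below assumes curves observed on a common time grid; the function name `frre` and the array layout are ours, not the paper's.

```python
import numpy as np

def frre(X, Y, lam):
    """Functional Ridge Regression Estimator, computed pointwise.

    X, Y : complex arrays of shape (n, T) -- n curves sampled on a
           common grid of T time points.
    lam  : ridge regularization parameter (lambda_n > 0), either a
           scalar or an array of shape (T,) for a functional parameter.
    Returns the estimated coefficient curve beta_hat of shape (T,).
    """
    num = np.sum(Y * np.conj(X), axis=0)        # sum_i Y_i(t) X_i*(t)
    den = np.sum(np.abs(X) ** 2, axis=0) + lam  # sum_i |X_i(t)|^2 + lambda_n
    return num / den
```

Setting `lam = 0` recovers the unpenalized pointwise least squares estimator wherever the denominator is nonzero; a functional parameter Λ_n(t) is passed as an array over the grid.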

Notations and general hypotheses of the FCM
Before studying the FCM, let us define some useful notation. We denote by L²(R, C) = L² the set of square integrable complex valued functions, with the L²-norm ‖f‖_{L²} := (∫_R |f(x)|² dx)^{1/2} and its associated inner product ⟨·,·⟩. Besides, given a subset K ⊂ R, we write ‖f‖_{L²(K)} := (∫_K |f(x)|² dx)^{1/2}. Throughout, | · | denotes the complex modulus.
The theoretical results given in the next sections are proved on the whole real line. For this reason, we restrict the study to the set of functions that vanish at infinity. Let C_0(R, C) = C_0 be the space of complex valued continuous functions f that vanish at infinity: for all ζ > 0 there exists R > 0 such that |f(t)| < ζ for all |t| > R. We use the supremum norm ‖f‖_{C_0} := sup_{x∈R} |f(x)|; in particular, for a subset K ⊂ R, ‖f‖_{C_0(K)} := sup_{x∈K} |f(x)|.
Finally, throughout this paper, the support of a continuous function f : R → C is the set supp(f) := {t ∈ R : |f(t)| ≠ 0}. This set is open because f is continuous. Besides, we define the boundary of a set S as ∂(S) := S̄ \ int(S), where S̄ is the closure of S and int(S) is its interior.
The space C_0 is too large: its geometry does not allow for the application of the Central Limit Theorem (CLT) under the sole hypothesis of the existence of the covariance operator, that is, E(‖X‖²_{C_0}) < ∞ (see Ledoux and Talagrand [13, Ch. 10]). To circumvent this difficulty, we consider functions that belong to the space C_0 ∩ L². The general hypotheses used throughout the paper require in particular that the moments E(‖X‖²_{C_0}) and E(‖X‖²_{L²}), together with the analogous moments of the noise ε, are all finite. We do not assume that E[X] = 0 in model (2.2). Therefore, we deal with a more general case than the one derived after centering model (1.1), and our results are also valid in the centered case.

Asymptotic properties of the FRRE
From definition (2.3), it is easy to show that the FRRE β̂_n has the following bias-variance decomposition:

β̂_n(t) − β(t) = −β(t) (λ_n/n) / ( S̄_n(t) + λ_n/n ) + ( (1/n) Σ_{i=1}^n ε_i(t) X_i^*(t) ) / ( S̄_n(t) + λ_n/n ),   (3.1)

where S̄_n := (1/n) Σ_{i=1}^n |X_i|². In this equation, we can see that the penalization introduces a bias (first term) but helps to control the variance (last term). Thus, the penalization should be neither too big nor too small. Note also that, when E[X] ≈ 0, the quantity S̄_n(t) in the denominator might be close to zero at some values of t. Therefore, the penalization (λ_n > 0) is necessary to prevent the denominator from being too small.
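For completeness, the bias-variance decomposition (3.1) can be recovered in two lines by substituting the model Y_i = βX_i + ε_i into the estimator; we use the notation of Section 8, λ̃_n := λ_n/n and S̄_n := (1/n)Σ_{i=1}^n |X_i|²:

```latex
\hat\beta_n - \beta
 = \frac{\tfrac1n\sum_{i=1}^n Y_i X_i^*}{\bar S_n + \tilde\lambda_n} - \beta
 = \frac{\beta \bar S_n + \tfrac1n\sum_{i=1}^n \varepsilon_i X_i^*}{\bar S_n + \tilde\lambda_n} - \beta
 = -\,\beta\,\frac{\tilde\lambda_n}{\bar S_n + \tilde\lambda_n}
   + \frac{\tfrac1n\sum_{i=1}^n \varepsilon_i X_i^*}{\bar S_n + \tilde\lambda_n},
```

since (1/n)Σ Y_i X_i^* = β S̄_n + (1/n)Σ ε_i X_i^*.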
Then from equation (3.1) we deduce that the ill-posed degree of this estimation problem depends on the intervals where these two conditions are simultaneously satisfied: (i) E[|X|²] is close to zero and (ii) β E[|X|²] is not close to zero. Together they imply a big bias, because β is then significantly bigger than the denominator.
The main results of this paper are the convergence in probability of the FRRE with a rate, and the mean square error rate, under broad conditions. Hypothesis (nA1) below is stronger than the negation of (A1): it provides that there exists some t_0 in supp(|β|) such that X is zero almost surely in a neighborhood of t_0. Then there exists a constant C > 0 such that, almost surely, the estimation error stays bounded below by C.

Consistency of the estimator
In what follows we obtain some rates of convergence over the whole real line and over compact subsets.

Rate of convergence
To obtain a rate of convergence, we need to control the shapes of the functions β and E[|X|²]. We take the regularization parameter λ_n := n^{1 − 1/(4α+2)}, where α > 0 comes from hypothesis (A5).

where γ := min{ 1/(2(2α+1)), 1/2 − 1/(2(2α+1)) }. Under hypothesis (A4), the ratio β/E[|X|²] can naturally be L²-bounded under this condition.
Next, (A5a) requires that, around the points p ∈ C_{β,∂X}, the function E[|X|²] goes to zero more slowly than a polynomial of degree α. The degree of ill-posedness of the problem depends on how close to zero E[|X|²] is, and hypothesis (A5a) measures this through the polynomial degree α. In this way the rate of convergence, which directly depends on α, is related to the ill-posed nature of the problem.
Parts (b) and (c) of (A5) help us control the tails of β and |X| around infinity. They are useful only when card(C_{β,∂X}) = +∞. Note that the set C_{β,∂X} is always countable (see the proof of Theorem 3.5).
Finally, hypothesis (A6) replaces (A2) in Theorem 3.1, as the rate of convergence strongly depends on the behaviour of β E[|X|²] around the points of C_{β,∂X}, which itself depends on α. Note that (A6) always implies (A2).

Remark 3.8.
It is natural to ask whether the convergence rate obtained in Theorem 3.5 is optimal. Stone [17] obtained an optimal convergence rate in a multivariate nonparametric regression setting; its transposition to statistical models with functional variables is still an open problem. In our case, the convergence rate in Theorem 3.5 can be written in the form n^{−α/(2α+1)} with α < 1/2, which leads to a rate slower than n^{−1/4}. The condition α < 1/2 prevents us from getting convergence rates of the same form as those given in Stone [17]. This constraint enables us to bound the quantity 1/E(|X|²), which is crucial to control the bias term. Indeed, the convergence rate is stated in a broad setting: (i) for the L² norm over the whole real line, (ii) without any assumption on the regularity of the curve X, and (iii) without any assumption on the distribution of X.
Under stronger but more intuitive hypotheses, we can also obtain convergence results similar to that of Theorem 3.5; Corollary 3.9 is an example. Corollary 3.9. In addition to hypotheses (A1), (A2) and (A3), we assume (A4bis). Hypothesis (A4bis) is a reformulation of (A4) and of part (c) of (A5). It is required to control the second term of (3.1) and the decreasing rate of β with respect to E[|X|²] around infinity (tail control). Besides, note that (A4bis) implies that C_{β,∂X} = ∅. Theorem 3.10 presents a simpler convergence result on compact subsets of the support of E[|X|]. This theorem relies on general hypotheses and ensures convergence in a wide variety of cases.

Further results
In the previous subsection, we presented convergence theorems stated in probability (consistency). By adapting the arguments of the proofs, we can also obtain convergence of the mean square error; we proved the following theorem.
Moreover, confidence bands for β are computed in Proposition 3.12 under suitable noise conditions. We first compute the expectation and the variance of β̂_n conditionally on the sample X_1, …, X_n. Then we define an unbiased estimator of the variance of the noise for each value t ∈ R, with which we compute the confidence interval of β at this value t.
Proposition 3.12. The expectation and variance of β̂_n conditional on a sample X_1, …, X_n are given explicitly. Additionally, if for a given value t ∈ R we suppose that the noise conditions above hold, we obtain a confidence interval for β(t) with critical value t_{n−1}(1 − α/2).

Predictive and generalized cross-validation
This section is devoted to developing a selection procedure for the regularization parameter λ_n, for a given sample (X_i, Y_i)_{i=1,…,n}, based on a Predictive Cross-Validation (PCV) criterion. Next, we introduce a Generalized Cross-Validation (GCV) criterion, which is computationally faster than the PCV, and show how the GCV criterion is bounded.

Functional regularization parameter
Given that we are working with functional data, another possibility for the estimator defined in (2.3) is to use a time-dependent function Λ n (t) instead of a constant number λ n . We shall optimize, for each time t, the choice of Λ n (t).
To that aim, we compute the PCV for each time t ∈ R: the leave-one-out prediction at t is computed with the sample (X_j(t), Y_j(t))_{j∈{1,…,n}\{i}}. As above, we obtain a simpler formula for PCV(Λ_n(t)) (see the proposition below), which yields a faster computation.

Proposition 4.3. A simplified closed-form expression of PCV(Λ_n(t)) holds. This criterion is discussed in the next section, dedicated to simulation studies, where its performance is evaluated and compared to that of GCV(λ_n).
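The closed form of Proposition 4.3 is not reproduced in this excerpt, but the pointwise PCV criterion can always be evaluated by direct leave-one-out refitting, since at a fixed t the FRRE is a scalar ridge estimator. The following sketch (function names ours) selects Λ_n(t) over a grid of candidate penalties, independently at each time point:

```python
import numpy as np

def pcv_pointwise(X, Y, lam_grid):
    """Select a functional regularization parameter Lambda_n(t) by
    leave-one-out predictive cross-validation at each time point.
    This is a direct refit per left-out curve; the closed form of
    Proposition 4.3 would avoid the inner loop.

    X, Y     : complex arrays of shape (n, T).
    lam_grid : 1-D array of positive candidate penalty values.
    Returns (best_lam, best_score), both of shape (T,).
    """
    n, T = X.shape
    S = np.sum(np.abs(X) ** 2, axis=0)    # sum_i |X_i(t)|^2
    N = np.sum(Y * np.conj(X), axis=0)    # sum_i Y_i(t) X_i*(t)
    scores = np.empty((len(lam_grid), T))
    for k, lam in enumerate(lam_grid):
        err = np.zeros(T)
        for i in range(n):
            # ridge estimator computed without curve i
            b_minus_i = (N - Y[i] * np.conj(X[i])) / (S - np.abs(X[i]) ** 2 + lam)
            err += np.abs(Y[i] - b_minus_i * X[i]) ** 2
        scores[k] = err / n
    best = np.argmin(scores, axis=0)
    return lam_grid[best], scores[best, np.arange(T)]
```

A constant λ_n can be selected with the same routine by averaging `scores` over the time axis before taking the argmin.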
Theoretical results can be obtained on the asymptotic properties of the estimator associated with the functional regularization parameter; for instance, we proved the following theorem, where β̂_n is obtained with Λ_n(t) minimizing (4.2).

Simulation study
We divide the simulation study into two parts. First, in settings 1 and 2, we present a comparative numerical analysis of different estimators for model (1.1). Then, in the second part, a third simulation setting is introduced to study numerically the dependence of the convergence rate (n^{−γ}) on α, where α bounds the decreasing rate of E[|X|²] towards 0, as described in Theorem 3.5. In this case we use the model without intercept (2.2).

Comparison of estimation methods
For settings 1 and 2, we evaluate our estimation procedures when the Signal-to-Noise Ratio (SNR) is low, that is, under noisy conditions. Both approaches for computing the FRRE (using λ_n and Λ_n(t)) are compared, along with the non-penalized case (λ_n = 0). Furthermore, we also compare them to the estimator defined by Ramsay et al. [14, Ch. 10]. In this approach, the random functions are projected onto an adequate finite-dimensional subspace generated by the Fourier basis; the estimator is obtained as a solution of a penalized least squares criterion and is implemented in the R package fda.
We use the estimator (2.1) of β_0 and the FRRE estimator of β_1 after centering; together they form the pair of estimators (5.1). For each setting we computed 500 Monte Carlo runs to evaluate the mean absolute deviation error (MADE) and the weighted average squared error (WASE), defined in the same way as in Şentürk and Müller [16, p. 1261], where [0, T] is the domain of β_0 and β_1 and range(β_r) is the range of the function β_r for r = 0, 1.
In the first setting, we analyze how the estimators behave when E[X ] > 0. Then, in the second one, we study a case where the penalization (λ > 0) is clearly needed, that is, when E[X ] = 0 and β 0 = 0.
The general hypotheses (HA1 F CM ) -(HA3 F CM ) are satisfied for both settings. The regularization parameter λ n and the function Λ n were optimized over the interval [0, 100].

Setting 1
We simulated samples of size n = 70. The input curves X_i, for i = 1, …, n, were generated with mean function μ_X(t) = t + sin(t) and a covariance function constructed from the first 10 eigenfunctions of the Wiener process with their corresponding eigenvalues. The function β_0 is defined as β_0(t) = (t − 0.25)² 1_{[0.25,1]}(t); the function β_1 and the noise ε_i complete the definition of this setting.
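Input curves of this kind can be generated from the standard Karhunen-Loève expansion of the Wiener process on [0, 1], whose eigenfunctions are φ_k(t) = √2 sin((k − 1/2)πt) with eigenvalues 1/((k − 1/2)π)². The sketch below (function name ours) truncates the expansion at 10 terms, as in this setting:

```python
import numpy as np

def simulate_inputs(n, t, mu, n_eig=10, rng=None):
    """Generate n input curves on the grid t as mu(t) plus a truncated
    Karhunen-Loeve expansion of the Wiener process on [0, 1]:
    phi_k(t) = sqrt(2) sin((k - 1/2) pi t), eigenvalue 1/((k - 1/2) pi)^2.
    Returns an array of shape (n, len(t)).
    """
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_eig + 1)
    freq = (k - 0.5) * np.pi                        # shape (n_eig,)
    phi = np.sqrt(2.0) * np.sin(np.outer(t, freq))  # (T, n_eig)
    sd = 1.0 / freq                                 # sqrt of eigenvalues
    scores = rng.normal(size=(n, n_eig)) * sd       # N(0, lambda_k) scores
    return mu(t) + scores @ phi.T                   # (n, T)

# Setting-1-style inputs: mean mu_X(t) = t + sin(t), 10 eigenfunctions
t = np.linspace(0.0, 1.0, 101)
X = simulate_inputs(70, t, lambda s: s + np.sin(s), n_eig=10,
                    rng=np.random.default_rng(0))
```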

Results:
The simulation results are presented in Figures 1 and 2 and in Table 1, where the performance of the four estimators is illustrated. We can see that, even under rather noisy conditions (SNR = 2), the estimators perform well, which shows their robustness. Furthermore, β_1 is better estimated than β_0 (see Figure 1) for two reasons: (i) it is estimated before β_0 in (5.1), and (ii) since X̄_n ≈ μ_X has some periodicity, it introduces cycles in the estimators of β_0, which is monotone.
Lastly, let us remark that the FRRE computed with a functional regularization Λ_n gives on average better estimates. To understand this better, in Figure 3 we compare the mean of the 500 calibrated functional regularization parameters (Λ_n) with the mean of the corresponding calibrated constant regularization parameters (λ_n), which is equal to 0.5289 (sd = 0.1096).
The FRRE computed with a functional regularization Λ_n can reduce, where necessary, either the bias or the variance of the estimator in (3.1). This adaptability makes it more efficient, as illustrated in Figure 3. On the one hand, Λ_n penalizes much more on the intervals where β_1 is equal to zero, reducing the variance in (3.1); on the other hand, Λ_n is close to zero where β_1 > 0, reducing the bias.

Setting 2
We simulated samples of size n = 100. The input curves X_i, for i = 1, …, n, were generated from two white Gaussian noises: the first over the subinterval [0, 0.5] with variance σ²_{X,I1} = 0.5, and the second over [0.5, 1] with variance σ²_{X,I2} = 0.5/10 = 0.05. Accordingly, E[X] = 0 and the function E[|X|²] is constant over each of these subintervals.
Function β_0 is null and β_1 is defined piecewise on the two subintervals. Since the distribution of X differs across the subintervals, different penalization values are needed over each of them; a functional penalization like Λ_n is more flexible and consequently performs better than a constant one. Moreover, given that the noise is 20 times bigger over [0.5, 1] than over [0, 0.5], a bigger penalization is needed over [0.5, 1] to bound the variance in the bias-variance decomposition (3.1). This is better handled by the flexible functional penalization; similarly, the bias is also better handled by a flexible penalization.
Finally, the FRRE estimators are more suitable than the estimator introduced by Ramsay et al. [14]. The main reason is that the FRRE estimators are pointwise defined, which avoids projecting the random functions onto a finite-dimensional subspace that may be composed of too-regular functions (the Fourier basis). Thus, the approach we propose can better handle complex datasets of random functions, such as realizations of white Gaussian noise.

Dependence of the convergence rate on α
As stated in Theorem 3.5, the convergence rate of the estimator (‖β̂_n − β‖_{L²}) is bounded by O_P(n^{−γ}), where γ := min{ 1/(2(2α+1)), 1/2 − 1/(2(2α+1)) }. Therefore, this rate depends on α. In this way, the rate is directly related to the behavior of E[|X|²(t)] around the border points p ∈ C_{β,∂X}. This behavior is captured through the polynomial lower bound |t − p|^α of hypothesis (A5), part (a).
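A few lines suffice to check that the two terms of this min balance exactly at α = 1/2 (where γ reaches its maximum 1/4) and that, for α ≤ 1/2, γ reduces to α/(2α+1), the form used in Remark 3.8. The function name is ours:

```python
def gamma_exponent(alpha):
    """Rate exponent gamma from Theorem 3.5, as a function of alpha > 0.
    The binding term of the min switches at alpha = 1/2."""
    a = 1.0 / (2.0 * (2.0 * alpha + 1.0))
    return min(a, 0.5 - a)

# For alpha <= 1/2 the min equals 1/2 - 1/(2(2*alpha+1)) = alpha/(2*alpha+1)
```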
We present in setting 3 a case that explicitly shows the dependence of the convergence rate on α. In particular, we are interested in the behavior of ‖β̂_n − β‖_{L²} and of its upper bound C_n defined in (5.2), where the constant D_0 = 10 has been chosen empirically so that (5.2) is a bound of ‖β̂_n − β‖_{L²}. From the proof of Theorem 3.5 (see Section 8), we can see that C_n has a rate equal to O_P(n^{−γ}). To illustrate Theorem 3.5, we chose p = 0 ∈ C_{β,∂X} (see Assumption (A5)). The random functions X_i and Y_i are defined in a neighbourhood of p.
The functional coefficient is defined as β(t) = 1.5 − t 2 . Lastly, the output functions Y i are generated according to model (2.2).
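The paper's exact generator for the X_i in this setting is not reproduced in this excerpt; a minimal construction with the stated second-moment behavior E[|X(t)|²] = |t|^α is X(t) = |t|^{α/2} Z(t) with Z standard Gaussian white noise. The sketch below (function name and noise level ours) also produces outputs from the centered model (2.2) with β(t) = 1.5 − t²:

```python
import numpy as np

def simulate_setting3(n, t, alpha, noise_sd=0.1, rng=None):
    """Hypothetical Setting-3-style data: input curves satisfying
    E[|X(t)|^2] = |t|^alpha near p = 0, and outputs from the centered
    model (2.2) with beta(t) = 1.5 - t^2.
    Returns (X, Y, beta) with X, Y of shape (n, len(t)).
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.abs(t) ** (alpha / 2.0) * rng.normal(size=(n, t.size))
    beta = 1.5 - t ** 2
    eps = noise_sd * rng.normal(size=(n, t.size))
    Y = beta * X + eps
    return X, Y, beta
```

Note that |X(t)|² vanishes in mean at p = 0 at the polynomial speed |t|^α, which is exactly the boundary behavior that hypothesis (A5a) quantifies.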
From these definitions, E[|X(t)|²] = |t|^α and p := 0 ∈ C_{β,∂X}.

Results:
In Tables 3 and 4 we show the mean values of ‖β̂_n − β‖_{L²} and of its upper bound C_n, respectively. Clearly, as the value of α increases, the convergence rate deteriorates due to the increasing bias: when α ≫ 0, E[|X|²] ≈ 0 near p, and the bias behaves like β in Equation (3.1), slowing down its rate.
The upper bound C_n behaves as expected for n large enough: its convergence rate is very low when α ≈ 0 and improves to reach its maximum value for α = 1/2.
We can also see that ‖β̂_n − β‖_{L²} tends to 0 faster when α tends to 0. Indeed, when α ≈ 0, the region where E[|X|²] is close to zero is small. Using an equispaced grid around zero, we can assume that at all these observation times t_k, |X(t_k)|² > 0.5. Therefore, in Equation (3.1), we can bound the variance and get an optimal rate for it. Similarly, the convergence rate of the bias (O(λ_n/n)) is high because, when α ≈ 0, λ_n/n ≈ n^{−1/2}, which is the parametric convergence rate.
In this way, we can see that when α tends to 0, both the variance and the bias have better convergence rates than C_n = O_P(n^{−γ}). Thus, the convergence rate of ‖β̂_n − β‖_{L²} turns out to be better than that of C_n, the upper bound obtained in Theorem 3.5; this bound is not optimal. The additional Proposition 8.5 in Section 8 shows how to improve the upper bound on compact sets.

Application
We illustrate the use of the estimators in (5.1) on the "gait data". These data have been processed by Ramsay et al. [14, p. 158] as an example of estimation in the FCM and can be found in the R package fda. The data "are measurements of angle at the hip and knee of 39 children as they walk through a single gait cycle. The cycle begins at the point where the child's heel under the leg being observed strikes the ground. For plotting simplicity we run time here over the interval [0,20], since there are 20 times at which the two angles are observed." The main question the authors wanted to study was: "How much control does the hip angle have over the knee angle?". Accordingly, the hip angle curves are the covariates X_i and the knee angle curves the responses Y_i. The authors model this interaction through the FCM with intercept (1.1).
The estimators of β_0 and β_1 in (5.1), with optimized constant and functional parameters, are presented in Figure 6. These estimators gave results similar to those obtained with fda, with a better computation time. Additionally, the empirical mean Ȳ_n is also compared to β̂_0 to see what would happen if β_1 = 0, that is, if the hip angle (X) did not influence the knee angle (Y). From Figure 6 (left panel) we see that a functional coefficient β_1 is indeed required.

Conclusions
In this paper we generalized the Ridge Regression method to define the FRRE estimator of the functional coefficient β 1 in the FCM (1.1). We proved its consistency for the L 2 -norm, and obtained its rate of convergence over the whole real line, not only on compact sets.
From a practical point of view, we introduced two penalized estimators, one with a constant regularization parameter and the other with a functional one. The functional regularization is more flexible in cases where the noise variance changes over the estimation interval, or where the functional parameter β is close to 0. For both estimators, we provided a selection procedure through PCV.
In addition we compared this estimation method with that of Ramsay et al. [14,Ch. 10] in a simulation study and in an application. Both perform well under noisy conditions and in some cases the former is more robust, may better handle complex datasets of random functions and is faster to compute.
All these results open new perspectives for studying the FCM with several covariates and related models such as the convolution model (1.2), for which the properties of the Fourier transform allow the convergence results to be transposed to an estimator based on the FRRE.

Proof of Theorem 3.1
Let us first introduce a useful technical lemma. Here we will denote ϕ := E[|X| 2 ] ∈ C 0 .

Lemma 8.1. Under hypotheses (A1) and (A2) of Theorem 3.1, if there exists a sequence of functions (f_n)_{n≥1} ⊂ C_0 such that ‖f_n − ϕ‖_{C_0} → 0, then there exist:
1. a sequence of subsets (C_j)_{j≥1} of R whose measures m(C_j) go to zero, where m is the Lebesgue measure,
2. a strictly increasing sequence of natural numbers (N_j)_{j≥1} ⊂ N and a sequence of real numbers (d_n)_{n≥1} converging to zero,
such that for every j ≥ 1 and n ∈ {N_j, …, N_{j+1}}, inequality (8.1) holds.

Proof. We define the sequence α_r := λ_r/r, which decreases to 0, and the sets K^ϕ_r := ϕ^{−1}([α_r, +∞)) and K^β_q := |β|^{−1}([1/q, +∞)) for r, q ∈ N⁺. All these sets are compact and cover the supports of ϕ and β respectively, that is, ∪_{r≥1} ↑ K^ϕ_r = supp(ϕ) and ∪_{q≥1} ↑ K^β_q = supp(β). Without loss of generality, we can suppose that there exists some Q_1 ∈ N such that K^β_{Q_1} ≠ ∅ (otherwise β ≡ 0); we then redefine, for all q ∈ N, K^β_q := K^β_{Q_1+q}. Let us take a sequence δ_s decreasing to 0 and define, for all s ∈ N, the open sets C_s, which is possible since the supports of continuous functions are open.

Thus, from the definition of K^ϕ_r and the fact that α_r goes to zero, there exists r_1 ∈ N such that for all r ≥ r_1, K^β_1 \ C_1 ⊂ K^ϕ_r. Moreover, from (A2) there exists r̃_1 > r_1 such that, for all r ≥ r̃_1, λ_r/r ≤ λ_{r̃_1}/r̃_1. Considering K^β_1 \ C_1, from the definition of K^ϕ_{r_1} and the uniform convergence of (f_n)_{n≥1} towards ϕ, we deduce that there exists N_1 > r̃_1 such that the required lower bound holds for all n ≥ N_1 and t ∈ K^ϕ_{r_1}. In particular we deduce, for all n ≥ N_1 > r_1, the corresponding inequality, because of the definition of α_{r_1}. Similarly, K^β_2 \ C_2 ⊂ int(supp(ϕ)), and there exists r_2 > r_1 such that for all r ≥ r_2, K^β_2 \ C_{δ_2} ⊂ K^ϕ_r. From (A2) there exists r̃_2 > r_2 such that max_{r≥r̃_2} λ_r/r ≤ λ_{r̃_2}/r̃_2. Again, given the definition of K^ϕ_{r_2} and the uniform convergence of (f_n)_{n≥1} towards ϕ, we deduce that there exists N_2 > r̃_2 such that the analogous bound holds for all n ≥ N_2 and t ∈ K^ϕ_{r_2}, and hence for all n ≥ N_2 > r_2.

We continue this way to build three strictly increasing sequences r_j ↑ ∞, r̃_j ↑ ∞ and N_j ↑ ∞ such that the bound holds for all j ∈ N. Let n be an integer greater than N_1; then there exists an integer j such that n belongs to the set {N_j, N_j + 1, …, N_{j+1} − 1}, and the sequence (d_n) is then defined accordingly. It is easy to see that this sequence goes to zero, and from (8.2) we conclude that the claimed inequality holds for all n ∈ {N_j, N_j + 1, …, N_{j+1} − 1}, because of the definition of K^β_j (outside K^β_j, |β| is bounded by 1/j).

Proof of Theorem 3.1. Let us start by showing the first convergence, which holds because of (HA1_FCM) and (HA3_FCM). Now, due to the moment monotonicity, E[‖εX‖_{L²}] < +∞, so εX is strongly integrable for the L²-norm and the function E[εX] exists and belongs to L². From (HA1_FCM), E[εX] is the zero function. We conclude, and finally (8.4) is obtained from the following fact.

As √n/λ_n → 0 by (A3), we obtain the convergence in probability of this part.
To conclude the proof, it is enough to show (8.5). To that purpose, let us take an arbitrary and fixed value ω ∈ S. Then for n ≥ 1 we define the sequence of functions f_n := (1/n) Σ_{i=1}^n |X_i(ω)|². Clearly this sequence belongs to C_0 and ‖f_n − ϕ‖_{C_0} → 0. Thus we can use Lemma 8.1, which implies that there exist a sequence (C_j)_{j≥1} of subsets of R, a strictly increasing sequence of natural numbers (N_j)_{j≥1} ⊂ N and a sequence of real numbers (d_n)_{n≥1} ⊂ R converging to zero, such that inequality (8.1) holds.
At this point we define, for n ≥ N_1, R_n := 1/d_n → ∞ and the intervals Ī_n := [−R_n, +R_n]. For n ∈ {N_j, N_j + 1, …, N_{j+1} − 1}, the triangle inequality and inequality (8.1) give the desired bound for every such n. Finally, the sequence of functions |β · 1_{C_j}| is bounded by |β| and converges pointwise to zero almost everywhere, because the exceptional set is countable and hence of measure zero.
By the dominated convergence theorem, lim_{j→∞} ‖β · 1_{C_j}‖_{L²} = 0. Thus L = 0, and (8.5) is proved because ω is an arbitrary element of S and P(S) = 1.

Proof of Theorem 3.5
We use (3.1) and the triangle inequality to obtain a decomposition whose first part is the same as in Theorem 3.1.

Now we can bound the remaining term. Let us take a fixed v ≥ 1. Given that ξ_l is strictly decreasing to zero, by hypothesis (A6) there exists a unique number N_v ≥ 1 such that the defining inequalities hold, and the corresponding bound follows for every n ≥ N_v. Because of (A6), there also exists M_3 > 0 such that the analogous bound holds for all l ≥ 1. Using these inequalities, we can prove the corresponding bound for every n < N_v.

To finish the proof of this lemma, we bound the sequence ‖A_n‖²_{L²(J)} = Σ_{v≥1} ‖A_n‖²_{L²(J_v)}. In order to do this we define, for each n ≥ 1, the set C_n := {v ≥ 1 : n < N_v}, and obtain the claimed bound for all n ≥ 1.

Proof of Corollary 3.6. Direct computation, using α < 1/2 in Theorem 3.5.
Proof of Corollary 3.9. As in the proof of Theorem 3.5, we only need to bound the bias term. To achieve this we use a method similar to that of Lemma 8.2. First note that hypothesis (A4bis) implies that, for all t ∈ supp(β), |β(t)|/ϕ(t) is finite; consequently, supp(β) ⊂ supp(ϕ).
Proof of Theorem 3.10. We start with a decomposition whose first part is the same as in Theorem 3.1, and finish the proof of the theorem by showing (8.8). Given that K ⊂ supp(ϕ), there exists a positive number s_1 > 0 such that ϕ ≥ s_1 on K; we define s := s_1/2 and obtain, for every n ∈ N, a two-part bound.

Clearly, the first part is bounded using the uniform positivity of ϕ on K. The second part is bounded as follows. Thanks to hypothesis (A3), we have ‖S̄_n − ϕ‖_{L²(K)} = O_P(1/√n). This, together with the fact that |S̄_n − ϕ| > s whenever S̄_n ∈ [0, s], allows us to obtain m(K ∩ {S̄_n ∈ [0, s]}) = O_P(1/√n), and as a consequence the term involving β/(S̄_n + λ̃_n) on {S̄_n ∈ [0, s]} is controlled, which finishes the proof of (8.8).

Further results on the convergence rates
Proof of Corollary 8.4. Take K = supp(β) in Theorem 3.10 to bound ‖β̂_n − β‖_{L²(K)} from above. Then ‖β̂_n − β‖_{L²(K^c)} ≤ O_P(√n/λ_n) because, first, β ≡ 0 over K^c, which implies that the bias is null outside K (see (3.1)), and secondly, O_P(√n/λ_n) is the natural upper bound of the variance over K^c. Thus, using the bias-variance decomposition, we bound ‖β̂_n − β‖_{L²(R)} as we wanted.
Under more restrictive hypotheses we can obtain the optimal rate of convergence. This is shown in Proposition 8.5 for the model (2.2).
Furthermore, this holds under hypotheses (A1bis), (A3), (A4ter), and by replacing (A2) with (A2bis): (λ_n)_{n≥1} ⊂ R⁺ is the constant sequence equal to λ > 0.

Proof of Proposition 8.5. We start with the decomposition over K := supp(β) and K^c. First, we obtain the convergence rates over K, considering the bias-variance decomposition. We deduce from hypothesis (A4ter) that almost surely ‖β̂_n − β‖_{L²(K)} ≤ (λ_n/n) ‖β m_X‖_{L²(K)}, which, under hypothesis (A2), implies the claimed rate. Likewise, given that the bias is null over K^c, we obtain from a similar bias-variance decomposition the corresponding rate over K^c. Thus, we get the convergence rate of ‖β̂_n − β‖_{L²} by adding the rates over K and K^c, that is, O_P(λ_n/n) under hypothesis (A2). Finally, when hypothesis (A2bis) holds, the upper bound of ‖β̂_n − β‖_{L²(K)} is O_P(λ/n) + O_P(1/√n), with λ > 0 constant; this bound is equal to O_P(1/√n). The bound of ‖β̂_n − β‖_{L²(K^c)} is also O_P(1/√n). Adding both, we get the optimal convergence rate.

Proof of Theorem 3.11
From the decomposition (3.1) we obtain the corresponding bound, where λ̃_n := λ_n/n and S̄_n := (1/n) Σ_{i=1}^n |X_i|². Thus, to finish this proof we need to prove two equalities.

Let us prove the first one. We know that hypothesis (A4bis) implies that the set C_{β,∂X} := supp(|β|) \ ∂(supp(ϕ)) is empty (see the proof of Corollary 3.9). For this reason, by taking J := ∅, hypotheses (A4) and (A5) of Theorem 3.5 hold. Now we can extend inequality (8.7) of Lemma 8.2 to the whole real line, because C = supp(β) in this inequality. The second term on the right side of the resulting inequality is non-random; hence we need to prove that the expectation of the first term goes to zero. From hypothesis (A3) we can prove that |ϕ − S̄_n|² is a random function belonging to L¹(R, R) and that E ∫_R |ϕ − S̄_n|² < ∞. Thus, by the Fubini and Tonelli theorems (see Brezis [2, p. 91]) and thanks to the independence of the X_i, putting d := 2‖ϕ‖²_{L²} + 2E‖|X|²‖²_{L²} < ∞, we obtain the first equality.

Next, to finish this proof, we prove the second equality. From hypotheses (HA1_FCM) and (HA3_FCM) it can be proved that the random function |(1/n) Σ_{i=1}^n ε_i X_i^*|² belongs to L¹(R, R) and that its expectation is bounded above. Thus, thanks to the independence of the ε_i and X_i, we obtain what we wanted.

Proof of Proposition 3.12. The conditional expectation comes directly from the decomposition (3.1). To compute the variance, let us define, for i = 1, …, n, the random functions g_i := X_i^* / ((1/n) Σ_{i=1}^n |X_i|² + λ_n/n). Since the g_i are independent of the ε_i, we obtain the variance in terms of a function D_X. Here we need D_X(t) > 0, otherwise X_1(t) = … = X_n(t) = 0 and nothing can be inferred about β(t).