Lower bound in regression for functional data by representation of small ball probabilities

: The minimax rate for estimating the regression function r(·) = E(y | X = ·) when y ∈ R and X takes values in a function space is the initial motivation of this work. Recent articles underline the major role of the shifted small ball probability P(‖X − x₀‖ < ·) in the variance of classical estimates. The main results are twofold. First, starting from a theorem by [41], we study the small ball probability P(S < ε) when ε ↓ 0, with S = Σ_{i=1}^{+∞} λ_i Z_i, where the Z_i's are i.i.d. positive random variables and (λ_i)_{i∈N} is a positive nonincreasing sequence such that Σ λ_i < +∞. It is shown that P(S < ·) belongs to a class of functions introduced by de Haan and well known in extreme value theory, the class of Gamma-varying functions, for which an exponential-integral representation is available. Second, this approach allows us to derive minimax lower bounds for the risk at a fixed point x₀ when X ∈ H, some Hilbert space of functions. Denoting this minimax risk R*_n = inf_{T_n} sup_{r∈E} E(T_n − r(x₀))², where T_n is any estimate of r(x₀) and E is some class of smooth functions from H to R, it turns out that, in a general framework, n^τ R*_n → +∞ for any τ > 0. This negative result may pave the way towards new approaches for modeling regression with functional data.


Preliminaries
The three following subsections are independent. The first introduces the nonparametric regression model for functional data and simply raises the problems attached to obtaining sharp bounds for the quadratic risk at a fixed point. The second gives some basic material about small ball probabilities. The third collects classical results from extreme value theory, as well as the definition of the class Γ₀, which is then briefly described. The notions encountered in this long introduction, though initially distinct from each other, merge in the sequel of this work and give birth to the main results. Proofs are given in the last section.

The nonparametric regression model for functional data
Statistics for functional data is a recent domain which has received increasing interest and was boosted by computational advances. We briefly recall that the main purpose of functional data analysis (FDA) is to model and study datasets where observations are of functional nature (usually observed on a grid, then smoothed, approximated and reconstructed by projection onto a suitable basis). FDA extends the classical statistical models designed for vectors to the situation when the data are functions or curves. We refer to the monographs by [47] and [22] for an overview of this topic. Consider the general regression problem with functional data as inputs: y = r(X) + ε, (1.1) where y and ε are real random variables with Eε = 0 and variance denoted σ²_ε, X belongs to the Hilbert space H and r is a function from H to R. The space H may be chosen to be L²(T), where T is a compact set in the Euclidean space, or some Sobolev space H^{2,m}(T). It is endowed with an inner product ⟨·,·⟩ inducing a norm ‖·‖. Estimating the regression function at a fixed point x₀, namely r(x₀) = E(y | X = x₀), is possible from an i.i.d. sample (y_i, X_i)_{1≤i≤n} by a classical Nadaraya-Watson approach (see [54] for a general presentation in the finite dimensional setting and [22] for implementation on functional data). This model was studied for instance in [21], and asymptotic results were derived in [19], like a first upper bound for the quadratic risk at a fixed point, and in [18] for a uniform bound under entropy conditions. Apparently no projection-based estimate in this model has been introduced yet, certainly due to a lack of theoretical results on approximation theory for functions defined on a Hilbert space. When X ∈ R^d, model (1.1) has of course been studied exhaustively.
It has been known since [49,50] that the optimal rate of convergence in the minimax sense depends on smoothness conditions on r, namely the order m of the highest achievable smooth derivative of r (and subject to the choice of an accurate kernel K), as well as on the dimension d. This rate is more precisely of order n^{−m/(2m+d)} and deteriorates when the dimension d increases. This phenomenon is known in nonparametric statistics as the 'curse of dimensionality'. Letting d go to infinity in the rate above should leave us rather pessimistic about what we could expect from model (1.1) with functional input X. Indeed the issue of the minimax risk in this model, that is when X belongs to a function space, is open and will be addressed here. Slightly anticipating, we can say that the concerns raised by the curse of dimensionality are grounded and that the rate for any efficient estimate is slower than any power of n.
We consider here an adapted Nadaraya-Watson estimate, which is asymptotically efficient when X ∈ R^d: r̂(x₀) = Σ_{i=1}^n y_i K(‖X_i − x₀‖/h) / Σ_{i=1}^n K(‖X_i − x₀‖/h), where x₀ is a fixed point of the space, K is a kernel, that is a measurable, unilateral (defined on R₊) positive function with ∫K = 1, and h = (h_n)_{n∈N} is a nonnegative sequence tending to 0 (the bandwidth). We consider the following additional conditions on the kernel K: K has compact support (say [0, 1]), is absolutely continuous and bounded above and below, with K(1) > 0. These conditions hold for the naive kernel, K(u) = 1 if and only if u ∈ [0, 1]. We do not seek minimal conditions on the kernel here; the assumption above could certainly be alleviated but is sufficient to carry out computations. The estimate r̂(x₀) has already been considered in the papers mentioned above (see also [22,24]).
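As an illustration, the Nadaraya-Watson estimate with the naive kernel K = 1_{[0,1]} can be sketched in a few lines of Python. Everything in the simulation (the Brownian-like curves, the toy functional r(x), the grid size, the noise level) is an arbitrary choice for the example, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 101                                   # grid size for the discretized curves

def r(x):                                 # toy regression functional: mean of x(t)^2
    return np.mean(x**2)

# n rough random curves X_i (Brownian-like partial sums; an arbitrary choice)
n = 2000
X = np.cumsum(rng.normal(scale=0.05, size=(n, m)), axis=1)
y = np.array([r(x) for x in X]) + rng.normal(scale=0.01, size=n)

def nw_estimate(x0, h):
    """Nadaraya-Watson estimate of r(x0) with the naive kernel K = 1_[0,1]."""
    d = np.sqrt(np.mean((X - x0)**2, axis=1))   # discretized L2 norms ||X_i - x0||
    w = (d <= h).astype(float)                  # K(||X_i - x0|| / h)
    return np.sum(w * y) / np.sum(w)            # x0 taken in the sample => sum(w) >= 1

x0 = X[0]
est = nw_estimate(x0, h=0.2)
print(est)
```

Since the weights are nonnegative, the estimate is always a weighted average of the observed y_i, whatever the bandwidth.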
Considering the L²-risk at a fixed point x₀, namely R_n(x₀) = E[(r̂(x₀) − r(x₀))²], leads to a bias-variance decomposition in which ε_i = y_i − r(X_i) plays the role of the noise. It will be shown later in Proposition 5 that the variance term admits an expansion where the b_i's are positive and non random constants given in Proposition 5.
The sequence (b_i)_{i∈N} is not unveiled here because it depends on several parameters which will be introduced later, namely the derivatives of r and other functions depending on the distribution of X. We keep in mind that the bias-variance decomposition of the risk is essentially based on the computation of two sorts of moments: EK(‖X − x₀‖/h) and E[⟨X, e_i⟩² K(‖X‖/h)].

Calculation of E[⟨X, e_i⟩² K(‖X‖/h)]
which appears in the numerator of (1.5) (note that here x₀ does not appear anymore) is more tricky and will not be done at this stage. This expectation is bounded above by E[‖X‖² K(‖X‖/h)], which will be treated like EK(‖X − x₀‖/h) at (1.7) below. We ask the reader to accept for the moment that the computation of R_n(x₀) critically depends on moments of the form E[‖X‖^p K(‖X‖/h)] for some integer p.
In a multivariate setting, when X is an R^d valued random variable and the density f_X of X is smooth enough at x₀, simple calculus leads in many situations to: EK(‖X − x₀‖/h) ∼ c_d f_X(x₀) h^d, where c_d denotes the volume of the unit ball in the space R^d. The r.h.s. of the formula above may vary, depending on the support of the distribution of X. However, neither Lebesgue measure nor a counterpart to f_X may be defined when X is valued in a Hilbert space for instance. The classical notion of volume of a ball cannot be generalized to such spaces. As a consequence, when X is a process, the density of X at x₀ does not make sense anymore, or should be revisited. A major issue is then to compute the preceding expectation without assuming that f_X(x₀) exists. Applying Fubini's theorem is sufficient to get rid of the density. Denoting P(‖X − x₀‖ < h) = F_{x₀}(h), we obtain: EK(‖X − x₀‖/h) = K(1)F_{x₀}(h) − ∫₀¹ K′(s)F_{x₀}(hs) ds, and the evaluation of the expectation above essentially depends again on the small ball probability F_{x₀}(·). Assume that F_{x₀} is regularly varying at zero with index d (which is usually true when X is finite dimensional); then EK(‖X − x₀‖/h) ∼ c_{d,K} F_{x₀}(h), where c_{d,K} depends only on d and K. Unfortunately, when X lies in a function space, the most classical examples of F_{x₀}(h) are not regularly varying, as will be seen below. However we notice, for further purpose, that the theory of regular variation is of some help in this important special case.
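A quick Monte Carlo experiment illustrates the finite-dimensional picture, namely that the small ball probability scales like c_d f_X(x₀) h^d; the choices (standard Gaussian X in R³, x₀ = 0, h = 0.5) are ours and serve only as a sanity check:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
d, h, n = 3, 0.5, 400_000

# Monte Carlo estimate of the small ball probability F_0(h) = P(||X|| < h)
X = rng.standard_normal((n, d))
F_mc = np.mean(np.linalg.norm(X, axis=1) < h)

# finite-dimensional approximation: F_0(h) ~ c_d f_X(0) h^d for small h
c_d = math.pi**(d / 2) / math.gamma(d / 2 + 1)   # volume of the unit ball of R^d
f0 = (2 * math.pi)**(-d / 2)                     # N(0, I_d) density at the origin
F_approx = c_d * f0 * h**d

print(F_mc, F_approx)   # same order of magnitude; the h^d scaling is the key point
```

For smaller h the two values get closer, but the Monte Carlo estimate then needs many more samples, which is precisely why these probabilities are delicate objects.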
Remark 1. It is straightforward to see that the same method may yield the value of integrals such as E[‖X − x₀‖^p K(‖X − x₀‖/h)]. When X is a random function, the behaviour of F_{x₀} at 0 is crucial and determines the rate of convergence to zero of the expectations in (1.4) and (1.5), which is what statisticians are truly interested in. The behaviour of this small ball probability has been an issue hard to circumvent. Usually authors express their final results in terms of F_{x₀}(·) and cannot derive more explicit formulae in a general framework (see again [19] or [18]). Over the past decade some authors turned their attention to the question of modeling probability distributions for curve-data with applications in statistics: see [11,31,12] in a general setting. But a unified and general result for representing F_{x₀}(·) efficiently is missing.
Remark 2. Some variants of the Nadaraya-Watson estimate above were studied when the norm ‖·‖ of the function space is replaced by a semi-norm ν(·). This usually fixes the problem of the small ball probability. Indeed, taking projection semi-norms like ν(x) = ‖P_d x‖, where P_d is the projection onto some d-dimensional space, it is easily seen that P(ν(X − x₀) < h) ∼ c_d h^d. However the semi-norm estimate is usually not consistent. It suffices to take x₀ in the null space of ν to see that ν(X − x₀) = ν(X) and the estimate cannot converge. This route may be attractive for numerical and practical reasons but will not be considered in this formal setting. The case of increasing d = d_n has not been investigated, at least to the best of the author's knowledge.
A special case of (1.1) is the linear regression model y = ∫ X(s)β(s) ds + ε. It has been extensively investigated in recent years. We refer for instance to [31,10,9] (see also the references therein). The optimal rate for the risk is now fully determined in this linear model, in contrast to the non-linear one investigated here. A comparison is given later at Remark 8.

About non-shifted and shifted small ball problems
Small ball problems could generally be stated in the following way: consider a random variable X with values in a general normed space (E, ‖·‖) (which may be infinite-dimensional) and estimate P(‖X‖ < ε) for small values of ε. This issue may be viewed as a counterpart of large deviations or concentration problems (where P(‖X‖ > M) is studied for large M), and the terms small deviations or lower tail behaviour are sometimes encountered to name small ball problems.

A. Mas
The core of the literature on small ball problems focuses on Gaussian random variables. The survey by [39] is a complete state of the art, introducing the main concepts and providing numerous references. Another reference is Chapter 18 of [40], entirely devoted to Gaussian random functions. Much attention has been given to Brownian motion (when (E, ‖·‖) = (C(0,1), |·|_∞)) or its relatives (fractional Brownian motion, Brownian sheet, etc.). The case of stable random elements was also investigated (see for instance [38,3]). Another issue is related to the norm. Indeed, in infinite dimensional spaces norms or metrics are not equivalent, and this may influence the local behaviour of P(‖X‖ < ε).
A more general question, certainly more attractive in statistics, is the shifted small ball probability P(‖X − x₀‖ < ε) for a fixed x₀. A concern arises from the shift x₀. It turns out that, in general, computations cannot be carried out for any x₀. Several works focus on making explicit the set of those x₀ for which the shifted small ball probability may be computed from the non-shifted one (when x₀ = 0). We refer to [7] or [36] for instance. A classical example stems from the situation where P_{X−x₀} ≪ P_X, where P_X denotes the probability distribution induced by the random element X. The classical Cameron-Martin theorem for Brownian motion illustrates this case. Absolute continuity yields: P(‖X − x₀‖ < ε) = ∫_{B(0,ε)} f_{x₀} dP_X, where f_{x₀} = dP_{X−x₀}/dP_X and B(0, ε) stands for the ball centered at 0 with radius ε. When f_{x₀} is regular enough in a neighborhood of zero: P(‖X − x₀‖ < ε) ∼ f_{x₀}(0) P(‖X‖ < ε). About this fact see Proposition 2.1 in [1]. In general the sharpness of existing results may vary, depending on the triplet ((E, ‖·‖), P_X, x₀) under consideration. In fact there are only a few spaces for which the local behaviour of P(‖X − x₀‖ < ε) is explicitly described. Quite often lower and upper bounds are computed, so that P(‖X − x₀‖ < ε) ≍ ϕ_{x₀}(ε), where ϕ_{x₀} is known and f ≍ g means here that, for some positive constants c₋ and c₊, the positive functions f and g satisfy c₋g ≤ f ≤ c₊g. Sometimes only one of these bounds is accessible or needed. It is worth noting or recalling a few crucial features of small deviations techniques. The Laplace transform, as in large deviations problems, is a major tool when coupled with the saddlepoint method. Small deviations are intimately connected with the entropy of the unit ball of the reproducing kernel Hilbert space associated with X, with the l-approximation numbers of X (i.e. the rate of approximation of X by a finite dimensional random variable, see [37]) or with the degree of compactness of linear operators generating X (see [38]).
All these notions are clearly connected to the regularity of the process X, when X is a process.
Applications of small ball probabilities are numerous: they appear when studying rates of convergence in the Law of the Iterated Logarithm (see [53,34]) or the rate of escape of Brownian motion (see [15]). They even, surprisingly, provide a sufficient condition for the CLT (see [35], Theorem 10.13 p. 289). However, small ball problems have until now remained a matter essentially reserved to probability theory. For example, [55,56] found applications of small ball techniques to Bayesian statistics. Recently [27] introduced them in conditional extremal quantile estimation. It turns out that this topic may also be of interest in functional data analysis. As shown in the previous section, since the Lebesgue density of an infinite-dimensional random element X does not exist, all the inferential techniques based on the density cannot hold anymore. In this framework, small ball probabilities appear as natural counterparts and should be investigated with much care.
First we introduce the l² framework. Consider X a random variable defined the following way: X = Σ_{i=1}^{+∞} √λ_i x_i e_i, (1.9) where (e_i)_{i∈N} is the canonical basis of l², (λ_i)_{i∈N} is a real positive sequence arranged in a non-increasing order such that Σ_{i=1}^{+∞} λ_i < +∞, and (x_i)_{i∈N} is a sequence of real independent and identically distributed random variables with null expectation. From Kolmogorov's 0-1 law it is straightforward to see that X exists as an l²-valued random element. The squared norm of X is then S = ‖X‖² = Σ_{i=1}^{+∞} λ_i x_i². The small ball problem consists here in estimating, for different choices of the sequences (λ_i)_{i∈N} and (x_i)_{i∈N}, the probability P(S < r) when r tends to zero. The latter probability is expected to depend on the λ_i's. About this fact we refer to [14].
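The rapid collapse of P(S < r) as r ↓ 0 can be observed by simulation of S = Σ λ_i x_i²; the eigenvalue sequence λ_i = i⁻², the Gaussian x_i and the truncation of the series at 100 terms are all arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 100, 100_000                  # series truncation, Monte Carlo sample size
lam = 1.0 / np.arange(1, k + 1)**2   # lambda_i = i^{-2} (an arbitrary example)

# S = sum_i lambda_i * x_i^2 with x_i i.i.d. N(0,1), i.e. Z_i = x_i^2
S = rng.standard_normal((n, k))**2 @ lam

for r in [1.6, 0.8, 0.4, 0.2]:
    print(r, np.mean(S < r))         # P(S < r) collapses rapidly as r decreases
```

Halving r repeatedly makes the empirical probability drop much faster than any fixed power r^d would, a first hint that these small ball probabilities are not regularly varying at 0.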
The inspection of the case E = l² is motivated by the application to functional statistics mentioned in the paragraph above. Indeed random functions are often reconstructed by interpolation techniques, like splines or wavelets, in Hilbert spaces such as L²([0, T]) or the Sobolev space W^{m,2}([0, T]), m ∈ N. Then the random element X is valued in a separable Hilbert space H, and all these Hilbert spaces of functions are isometrically isomorphic to l². In this framework a useful tool is the so-called Karhunen-Loève decomposition (sometimes referred to as Proper Orthogonal Decomposition in other areas of mathematics such as PDEs). Any centered random function X will be represented by its coordinates in a basis of eigenvectors of the covariance operator C_X = E[X ⊗ X]. When the e_i's are the eigenvectors of C_X and the λ_i the associated eigenvalues, X = Σ_{i=1}^{+∞} √λ_i x_i e_i, (1.10) where the x_i's are uncorrelated real random variables. The x_i's are always independent when X is Gaussian, and independence is assumed in most settings. The l² random element defined in (1.9) is formally identifiable with this Karhunen-Loève decomposition familiar in Functional Data Analysis.
Historically, the description of the exact behaviour of Gaussian small ball probabilities in Hilbert space is due to [52]. However we borrow the notations from [41], who extended Sytaya's results to the non-Gaussian framework. First, in order to alleviate notations, set once and for all S = Σ_{i=1}^{+∞} λ_i Z_i, where the λ_i > 0 are arranged in decreasing order with Σ_{i=1}^{+∞} λ_i < +∞ and the Z_i are positive random variables (they stand for the x_i²'s above). For the sake of completeness, and since the main theorems of this work heavily rely on his results, we recall them. In the previously mentioned article Lifshits proved that: P(S < r) ∼ Λ(γ) exp(γr) / (γσ√(2π)), (1.12) where γ and σ are functions of r defined below and Λ(γ) = E exp(−γS) is the Laplace transform of S evaluated at γ(r). The definitions of γ and σ are implicit. Let S_γ be the Esscher transform of S, that is the random variable with distribution exp(−γx)P_S(dx)/Λ(γ). Then set: E S_γ = r, (1.13) and σ² = V S_γ, (1.14) where V denotes variance. We note for further purpose that the implicit function theorem ensures the existence and smoothness of the functions γ(r) and σ²(γ(r)) derived from (1.13) and (1.14). Without further assumption on the λ_i's, P(S < r) cannot be made more explicit. This is done for instance in [14], where the authors considered the case of λ_i with polynomial and exponential decay.

The class Γ 0
As a last part of this long introduction we shift from functional data analysis and small ball problems to extreme value theory. The theory of extremes is another well-known topic connecting probability theory, mathematical statistics and real analysis through regular variation and Karamata's theory. The foundations of extreme value theory may be illustrated by the famous Fisher-Tippett theorem (see [25] and [28]). This classical result asserts that whenever U_1, ..., U_n is an i.i.d. sample of real random variables and the suitably normalized maximum M_n = max{U_1, ..., U_n} converges in distribution to a nondegenerate limit G, then G has the same type as one of the three distributions: Gumbel, Fréchet or Weibull. The Gumbel law, also named the double exponential distribution, with cumulative distribution function R(x) = exp(−exp(−x)), defines the domain of attraction of the third type. Laurens de Haan in [29] characterized the (cumulative) distribution functions of U such that M_n belongs to the domain of attraction of R. We give this result below.
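The convergence to the Gumbel law is easy to check by simulation. Exp(1) variables are a convenient example: their maxima, centred by log n, converge in distribution to the Gumbel law R(x) = exp(−exp(−x)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 500, 20_000             # sample size per maximum, number of maxima

# maxima of Exp(1) samples, centred by log n, are approximately Gumbel-distributed
M = rng.exponential(size=(m, n)).max(axis=1) - np.log(n)

x = 0.0
empirical = np.mean(M <= x)
gumbel = np.exp(-np.exp(-x))   # R(x) = exp(-exp(-x)), here exp(-1) at x = 0
print(empirical, gumbel)
```

The two printed values agree to within Monte Carlo error; repeating the comparison at other points x traces out the whole Gumbel cdf.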
Theorem ([29]). If F is the cumulative distribution function of a real random variable X which belongs to the domain of attraction of the third type (Gumbel), there exists a measurable function ρ : R → R₊, called the auxiliary function of F, such that for all t ∈ R: lim_{x→x₊} (1 − F(x + tρ(x)))/(1 − F(x)) = exp(−t), where x₊ ≤ +∞ is the right endpoint of F. This property was initially introduced by de Haan as a form of regular variation (see the title of his article). This class of distribution functions is referred to as de Haan's Gamma class, Γ, in the book by [5] and within this article. Surprisingly, in their book as well as in de Haan's article, no example of a function belonging to Γ is given. The cumulative distribution function of the Gaussian distribution belongs to this class with x₊ = +∞ and ρ(s) = 1/s.
Since we focus on the local behaviour at zero, and not at infinity, of the cumulative distribution function of a real valued random variable, we have to modify again slightly the definitions above. We introduce the class Γ₀ and feature some of its properties below. We share most of our notations with [5], which differ from those of de Haan. Definition 1. The class Γ₀ consists of those functions F : R → R₊, null over (−∞, 0], non decreasing with F(0) = 0 and right-continuous, for which there exists a continuous non decreasing function ρ : V₊ → R₊, defined on some right-neighborhood of zero V₊, such that ρ(0) = 0 and, for all x ∈ R, lim_{s↓0} F(s + xρ(s))/F(s) = exp(x). The function ρ is called the auxiliary function of F.
The properties of the auxiliary function are crucial.
Proposition 1. From Definition 1 above we deduce that ρ(s)/s → 0 as s → 0 and that ρ is self-neglecting, which means that ρ(s + xρ(s))/ρ(s) → 1 as s ↓ 0, locally uniformly in x. Remark 3. When the property in the Proposition above does not hold locally uniformly but only pointwise, the function is called Beurling slowly varying. Assuming that ρ is continuous in Definition 1 yields local uniformity and enables us to consider a self-neglecting ρ.

The class Γ₀ admits an exponential-integral representation. In fact the following theorem asserts that the local behaviour at 0 of any F in Γ₀ depends only on the auxiliary mapping ρ. Theorem 1. Let F belong to Γ₀ with self-neglecting auxiliary function ρ; then, when s → 0, F(s) = exp(η(s) − ∫_s^{s₀} du/ρ(u)), (1.16) with η(s) → c ∈ R. The auxiliary function ρ is unique up to asymptotic equivalence and may be taken as ρ = F/F′. The proofs of Proposition 1 and of Theorem 1 are closely inspired from the proofs of Lemma 3.10.1, Proposition 3.10.3 and Theorem 3.10.8 in [5] and will be omitted.
Let us also mention that Gaïffas in [26] proposed to model locally the density of sparse data by Gamma-varying functions. This is another statistical application for Γ 0 .
It is simple to construct explicit examples of functions in Γ₀ by tuning the auxiliary function ρ and taking η(·) = 0 in (1.16). For instance, taking ρ(s) = s² yields F(s) = exp(−1/s) up to a multiplicative constant. Obviously constants may be added in front of or within the exponential. The next Proposition highlights a specific feature of functions in the class Γ₀ which will be used later: they are very flat at zero. Proposition 2. Let F belong to Γ₀. Then for all integers p, F^{(p)}(0) = 0, where F^{(p)} denotes the derivative of order p of F.
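As a numerical sanity check, one can verify the defining limit of the class Γ₀, F(s + xρ(s))/F(s) → eˣ as s ↓ 0, on the concrete pair F(s) = exp(−1/s), ρ(s) = s² (one example of the construction just described). Logarithms are used throughout because F itself underflows to 0 in floating point near the origin:

```python
import numpy as np

# F(s) = exp(-1/s) with auxiliary function rho(s) = s^2: a concrete member of Gamma_0.
# Work with log F to avoid underflow: F(s) is astronomically small near 0.
logF = lambda s: -1.0 / s
rho = lambda s: s**2

x = 1.5
for s in [1e-1, 1e-2, 1e-3, 1e-4]:
    ratio = np.exp(logF(s + x * rho(s)) - logF(s))
    print(s, ratio)          # approaches exp(1.5) ~ 4.4817 as s decreases
```

A closed-form check confirms the limit: the ratio equals exp(x/(1 + xs)), which tends to eˣ as s ↓ 0.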

Main results
We are ready to give the main results. This section is split into three parts. In the first it is shown that the function F_{x₀}(·), which is crucial for evaluating the risk in model (1.1), belongs to the class Γ₀ of Gamma-varying functions in a quite general framework. In the second we focus on the case of a Gaussian design. In the third we use the properties of the class Γ₀ to derive upper and lower bounds on the risk for (1.1) at a fixed point. The degeneracy of the lower bound (with respect to the case when X is finite dimensional) announced earlier may be seen as an ultimate symptom of the curse of dimensionality. If f and g are two positive functions, the notation f ≲_x g means that lim_{u→x} f(u)/g(u) ≤ c for some positive constant c.

Small ball probabilities of random functions are Gamma-varying
This sub-section connects the two apparently distinct notions of probability seen before: the class of small ball probabilities in l² and de Haan's Gamma class of functions. Both families of functions are defined by their local behaviour around 0. In what follows, the exponent −1 is strictly reserved for denoting the generalized inverse of a function f, denoted f⁻¹. Consequently, in general, f⁻¹ ≠ 1/f. Let us introduce the function λ(·) which interpolates the λ_j's in a smooth way (which means that λ(j) = λ_j for all j and λ is C¹). Since our results rely on those of [41], we now recall the assumption needed in that article. Let G denote the (cumulative) distribution function of Z; it is assumed that there exist b ∈ (0, 1), c₁ > 1, c₂ ∈ (0, 1) and c₃ > 0 such that assumption A₀ of [41] holds for r < c₃. As mentioned in [41], assumption A₀ states that the local behaviour at 0 of G is polynomial, and A₀ holds whenever the density g of Z is regularly varying at 0 with index α > −1. We also note that the assumption above holds for a large class of classical positive distributions of Z itself (Gamma, Beta, ...) or when Z = X² with X Gaussian, Laplace, uniform or Student distributed, for instance. These considerations are of interest for the statistician in order not to limit the approach to Gaussian models. Note that the assumption on the convergence of the third order moment of Z was alleviated in some recent papers. We keep it here since it is general enough for our purpose.
When (Z_i)_{i∈N} is a sequence of random variables whose cumulative distribution function G is regularly varying at 0 with strictly positive index, the explicit form of the small ball probability was derived for explicit sequences of log-convex λ(·) by [14]. In particular they show that, for such sequences, P(S < r) admits explicit asymptotic expressions, (2.2) and (2.3) below, involving a bounded function ψ₀. Formula (2.2) is proved as well at page 269 in [40]. Simple algebra proves that both functions on the right hand side of (2.2) and (2.3) have all their derivatives vanishing at 0. We notice that the r.h.s. of (2.2) is always flatter than the r.h.s. of (2.3), which in turn will always be flatter at 0 than any polynomial function (like c_d s^d). However we notice that the degree of flatness is directly connected with the rate of decrease of the λ_i's, which quantifies, exactly like the l-numbers, the accuracy of a finite-dimensional approximation of X. We emphasize the following Proposition, which will not be proved.
The auxiliary functions ρ₁ and ρ₂ could be computed more precisely, but we only need equivalences at this stage.
We are ready to extend this fact to general sequences (λ_i)_{i∈N}. Recall that the function γ(·) was defined implicitly in (1.13). In words it is, up to sign, the inverse of the first order derivative of the log-Laplace transform of S.
Theorem. Let F(r) = P(S < r) be the small ball probability of S; then F ∈ Γ₀ with auxiliary function ρ(r) = 1/γ(r), and the representation (1.12) may be rephrased only in terms of γ(·); this is formula (2.5), in which r₀ = EZ · Σ_{j=1}^{+∞} λ_j. Obviously the r.h.s. of (2.5) is mathematically the same object as the r.h.s. of (1.12). The Gamma-varying version of the right hand side is ρ′(r) exp[−∫_r^{r₀} ds/ρ(s)]. We believe however that this new version is slightly more explicit and maybe more suited for statistical purposes. We will take advantage of the properties of the class Γ₀ listed earlier.
The Theorem may be intuitively explained in view of Proposition 2. Indeed, when X lies in R^d and in a general context, F(s) ∼₀ p_d(s) = c_d s^d. The function p_d has the following property: p_d^{(k)}(0) = 0 whenever k ≠ d. Consequently, in an infinite dimensional space we can expect all the derivatives at 0 to be null, and this property is recovered through Proposition 2. A more geometric way to understand this consists in considering the problem of the concentration of a probability measure. Let µ be the measure associated with the random variable X. Once again starting from R^d and letting d increase (even if this approach is not really fair), we see that µ must allocate a constant mass of 1 to a space whose dimension increases. Then µ gets more and more diffuse, allocating less mass to balls and rarely visiting fixed points such as x₀ (and their neighborhoods), resulting in a very flat small ball probability function.
The following corollary provides some information about the rate of decrease to zero of F (·) when an additional assumption is made on ρ.
This property of the small ball probability has to be connected with property (1.17). It is referred to as rapid variation at 0 in the literature on regular variation, and may be compared or contrasted with the regularly varying situation discussed below (1.7). The assumptions RV₊ and RV₁ will be encountered again when addressing the case of nonparametric regression. Finally, note that for the auxiliary functions ρ₁ and ρ₂ appearing in Proposition 3 and arising from [14] we get ρ₁ ∈ RV₊ and ρ₂ ∈ RV₁.

Remark 5.
For the sake of completeness we point out the following fact, which may be misleading: indeed we started from P(‖X‖² < r), and the properties of this function may differ from those of the true small ball probability P(‖X‖ < r) = P(‖X‖² < r²). It is simple to show that if F ∈ Γ₀ with auxiliary function ρ_F, then G(r) = F(r²) belongs to Γ₀ as well, with auxiliary function ρ_G defined by ρ_G(r) = ρ_F(r²)/(2r).
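The transformation rule for the auxiliary function can also be checked numerically. With the (assumed) example F(s) = exp(−1/s) and ρ_F(s) = s², one gets G(r) = exp(−1/r²) and ρ_G(r) = ρ_F(r²)/(2r) = r³/2; the Γ₀ ratio for G should then tend to eˣ:

```python
import numpy as np

# If F is Gamma-varying at 0 with auxiliary rho_F, G(r) = F(r^2) should be too,
# with auxiliary rho_G(r) = rho_F(r^2)/(2r). Example: F(s) = exp(-1/s), rho_F(s) = s^2.
logG = lambda r: -1.0 / r**2     # log G(r) = log F(r^2); logs avoid underflow
rho_G = lambda r: r**3 / 2.0     # = rho_F(r^2) / (2r)

x = 1.0
for r in [1e-1, 1e-2, 1e-3]:
    ratio = np.exp(logG(r + x * rho_G(r)) - logG(r))
    print(r, ratio)              # approaches exp(1) ~ 2.7183 as r decreases
```

The convergence is already visible at r = 0.01, which makes the factor 1/(2r) in ρ_G easy to misplace if one forgets the change of variable.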

Gaussian framework
Assuming that X is Gaussian, hence that the x_i's in (1.10) are N(0,1) distributed, provides a critical amount of extra information. Indeed it is then possible to compute (1.13) in a more explicit form: r = Σ_{j=1}^{+∞} λ_j/(1 + 2λ_jγ), which is the initial equation linking r and γ. We derive below an explicit link between the λ_j's and γ(·), or equivalently ρ(·). Under rather general assumptions on the rate of decrease of the λ_j's we obtain as well an upper bound for the small ball probability, which will be exploited in the next subsection when investigating a lower bound for the regression.
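In the Gaussian case the Laplace transform is Λ(γ) = Π_j (1 + 2λ_jγ)^{-1/2}, so the implicit equation E S_γ = r reads r = Σ_j λ_j/(1 + 2λ_jγ) and can be solved numerically by bisection, since the right hand side is decreasing in γ. The eigenvalues λ_j = j⁻² and the truncation at 200 terms are arbitrary choices for this sketch:

```python
import numpy as np

lam = 1.0 / np.arange(1, 201)**2        # assumed eigenvalues lambda_j = j^{-2}

def mean_esscher(gamma):
    # E S_gamma = -(log Lambda)'(gamma) = sum_j lambda_j / (1 + 2 lambda_j gamma)
    # for Gaussian x_j, i.e. Z_j = x_j^2 chi-squared with one degree of freedom
    return np.sum(lam / (1.0 + 2.0 * lam * gamma))

def gamma_of_r(r, hi=1e12):
    """Solve mean_esscher(gamma) = r by bisection (the map is decreasing)."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_esscher(mid) > r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for r in [1.0, 0.5, 0.1, 0.01]:
    print(r, gamma_of_r(r))             # gamma(r) grows without bound as r decreases
```

The blow-up of γ(r) = 1/ρ(r) as r ↓ 0 is exactly the mechanism behind the flatness of the small ball probability.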

Upper and lower bound in regression for functional data
We fix once and for all the assumptions considered in what follows. These assumptions come in addition to those considered in the previous sections. Recall that if g is some function defined on H with values in R, the first order Fréchet derivative of g at x₀ (its infinite-dimensional gradient) may be identified with an element of H. The second order derivative g″(x₀) (Hessian operator) is identified with a symmetric operator from H to H.
Assumptions on the distribution of X. The random element X is centered and in the development (1.10) the x_i's are independent. We have P_{X−x₀} ≪ P_X with f_{x₀} = dP_{X−x₀}/dP_X such that f_{x₀}(0) > 0, f′_{x₀}(x₀) ∈ H exists, and the second order derivative of f_{x₀}, denoted f″_{x₀}(x), is, for all x in a neighborhood of x₀, a bounded linear operator from H to H. Denote ∂_i f_{x₀} = ⟨f′_{x₀}(x₀), e_i⟩, where e_i is one of the eigenvectors appearing in (1.10). Besides, we assume that for all i the density of the margin ⟨X, e_i⟩ is symmetric.
Discussion and examples. Let for instance X be Gaussian. Chapters 9 and 10 in [40] are clear about these issues (see more specifically pp. 102-107). It is possible to shift the assumptions on the regularity of X to conditions on the regularity of x₀. First, in order to define f′_{x₀}, we need to assume that x₀ = (m_i)_{i≥1} belongs to the kernel of X, that is Σ_{i∈N} m_i²/λ_i < +∞. We are interested in the smoothness of these functions at 0.
The major point is the finiteness of the latter series. This is subject to a condition of decay on the coefficients m_i with respect to the λ_i, hence of smoothness of the fixed function x₀.
It turns out that the Gaussian framework may be generalized to some other, though somewhat restricted, stable distributions. Skorohod ([48]) gives conditions for non-Gaussian distributions to satisfy the condition P_{X−x₀} ≪ P_X. We refer to the characterization of stable distributions on Hilbert spaces by Jurek ([33]) for further reading. When these formal conditions hold and X = (λ₁Z₁, λ₂Z₂, ...) with the Z_i centered with unit variance and independent, conditions on the m_i and λ_i similar to the Gaussian case may be formulated from simple calculations, all linking the smoothness of f_{x₀} with the relative decay rate of the Fourier coefficients of x₀ with respect to the eigenvalues.
Assumptions on the regression function. Assume that r″(x) exists for x in a neighborhood of x₀. We denote ∂_i r_{x₀} = ⟨r′(x₀), e_i⟩ and ∂²_{ii} r_{x₀} = ⟨r″(x₀)(e_i), e_i⟩, and assume as well that these derivatives satisfy the summability condition (2.8). Discussion and examples: apart from the technical assumption just mentioned, the conditions on r given here are classical. They hold for instance when r(x) = r₀(ω(x)), where r₀ : R → R has a second order derivative around ω(x₀) and ω : H → R is a simple functional such as ω(x) = Σ_{i=1}^D ⟨x, g_i⟩ with D ∈ N and g_i ∈ H, or even ω(x) = ‖x‖ for x₀ = 0. But the functional setting forces us to consider other kinds of possible regression functions. We briefly comment on two of them that may produce irregularity: evaluation and derivation-based functionals. In [17] the regression function r involves sums and products of evaluation functionals, for instance r_FHV(x) = x(t₁)x(t₂), where t₁ and t₂ are two fixed points in [0, 1]. In [20], r_FKV(x) = ∫ t cos(t) [x′(t)]² dt. We recall that the linear mapping d_{t₀}(x) = x(t₀) is not continuous when H = L²([0, 1]). If we require that H is a reproducing kernel Hilbert space (typically a Sobolev space with boundary conditions, such as the Cameron-Martin space H₀² = {f ∈ L²([0, 1]) : ∫ f′² < +∞, f(0) = 0}), then d_{t₀}(·) = ⟨k_{t₀}, ·⟩ where k_{t₀} ∈ H, and d_{t₀} is bounded hence continuous (by the Riesz representation theorem). Then we derive d′_{t₀}(x) = k_{t₀} for all x and d″_{t₀} = 0, and the condition (2.8) on r_FHV may be expressed in terms of the coefficients of the kernels k_{t₁} and k_{t₂} in the basis (e_i). For the second example, considering smooth spaces of functions will here again ensure the smoothness of the operator. Some calculations show that ∇r_FKV(x)(t) = 2∫₀ᵗ s cos(s) x′(s) ds. Here too the assumption will hold if the coordinates of ∇r_FKV(x₀) = r′(x₀) and of the diagonal of r″(x₀) in the basis (e_i) tend rapidly to zero.

Upper bound
In view of the results of the preceding section we are in a position to simplify some computations. Turning to the local moments defined at (1.7), from properties of functions in Γ_0 and specifically (1.17) we get: We see again that the representation theorem of the preceding section is of some help to simplify our calculations. We mention for immediate purpose that the derivation of both formulae above leads as well to: Let the local moments of order 1 and 2 of X at x_0 be respectively defined by: Note that M_{K,1} belongs to H since (X − x_0) K(‖X − x_0‖/h) does. Formula (2.12) may be made explicit. First, let u and v be two points in the vector space; then u ⊗ v is the linear operator defined by [u ⊗ v](x) = ⟨v, x⟩ u, and so is (X − x_0) ⊗ (X − x_0) K(‖X − x_0‖/h). Erasing x_0 and K(‖X − x_0‖/h) gives the usual covariance operator of X for a centered X, which is trace-class by definition whenever E‖X‖² < +∞. The special covariance operator M_{K,2}(x_0) is

obtained by shifting and smoothing X around x_0, and M_{K,2} is also a linear trace-class operator acting from and onto H whenever E[‖X − x_0‖² K(‖X − x_0‖/h)] is finite. We refer to [44] for some statistical results on local moments for finite-dimensional random variables and to [42] for some related results dealing with (2.12) and where random functions and small ball problems appear.
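For discretized curves, empirical counterparts of the two local moments are straightforward to form. The sketch below is a hypothetical illustration (the function name, the kernel and the toy data are our own assumptions): the sample mean replaces the expectation in M_{K,1}(x_0) = E[(X − x_0) K(‖X − x_0‖/h)] and M_{K,2}(x_0) = E[(X − x_0) ⊗ (X − x_0) K(‖X − x_0‖/h)].

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical local moments of order 1 and 2 for curves discretized on d
# grid points. M2 is the matrix of the operator M_{K,2}(x0) in that grid,
# i.e. the kernel-weighted empirical covariance of the shifted sample.

def local_moments(X, x0, h, K):
    """X: (n, d) array of discretized curves; returns (M1, M2)."""
    D = X - x0                               # shifted sample, shape (n, d)
    w = K(np.linalg.norm(D, axis=1) / h)     # kernel weights, shape (n,)
    M1 = (w[:, None] * D).mean(axis=0)       # vector in R^d
    M2 = np.einsum('i,ij,ik->jk', w, D, D) / len(X)   # (d, d) operator
    return M1, M2

K = lambda u: np.exp(-u**2 / 2) * (u <= 1)   # truncated Gaussian kernel
n, d = 500, 50
X = rng.standard_normal((n, d)) / np.sqrt(np.arange(1, d + 1))  # decaying coords
x0 = np.zeros(d)

M1, M2 = local_moments(X, x0, h=2.0, K=K)
print(M1.shape, M2.shape, np.trace(M2))
```

As in the text, dropping the shift and the weight K(‖X − x_0‖/h) in M2 recovers the usual empirical covariance operator, and the finite trace printed above mirrors the trace-class condition.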
The next Proposition bounds above and below the bias part of the risk of our kernel estimate.

Proposition 5. For the variance part of the risk the equivalence holds
For the bias part we have (1.5). From the line above we derive an approximate rate: where c_+ and c_− depend only on η_2. Hence when ρ(h) ≍ h^m for some m > 0, B_n(x_0) decreases to 0 at most and at least at a polynomial rate.
The problem here is to ensure a rough control of B_n(x_0). As will be seen soon, ρ^6(h) turns out to be regularly varying in most cases and decays to zero at a polynomial rate. The unusual framework (namely with distributions in the class Γ_0) motivates proving to the reader that B_n(x_0) does not reach an unusual rate of decrease to 0 (namely exponential). And the bound c_− ρ^6(h) ≤ B_n(x_0) will justify the conditions under which the minimax lower bound for the risk is going to be derived. Remark 7. The role of the auxiliary function ρ is crucial. The question of its estimation is quite simple indeed. From [5] Corollary 3.10.5(b) p. 177 we know that ρ may be taken as F/F′. A natural estimator of ρ is then F̂/f̂ where f̂ (resp. F̂) is a kernel estimator of the density (resp. of the cumulative distribution function) of ‖X‖. A simple procedure for the practitioner consists here in considering the sample made of the positive real-valued random variables (z_i = ‖X_i‖ 1{‖X_i‖ < s})_{1≤i≤n}, where s is some threshold close to zero, and then performing the estimation of f and F classically on this sample.
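The procedure of Remark 7 can be sketched in a few lines. Everything below (the bandwidth, threshold, sample size and toy distribution of ‖X‖) is an illustrative assumption; note in passing that conditioning on ‖X‖ < s rescales F and f by the same constant, so the ratio F/f is unaffected by the thresholding.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the practitioner's procedure of Remark 7: estimate
# rho ~ F/F' near 0 from the thresholded sample z_i = ||X_i|| 1{||X_i|| < s_thr},
# with the empirical cdf for F and a Gaussian kernel density estimate for f = F'.

def rho_hat(z, s, bw):
    """Estimate rho(s) = F(s)/f(s) from the positive observations z."""
    F = np.mean(z <= s)                                          # empirical cdf
    f = np.mean(np.exp(-0.5 * ((s - z) / bw) ** 2)) / (bw * np.sqrt(2 * np.pi))
    return F / f

# Toy norms with density 2u on [0, 1]: F(s) = s^2, f(s) = 2s, rho(s) = s/2.
norms = np.sqrt(rng.uniform(size=200_000))
z = norms[norms < 0.5]          # threshold s_thr = 0.5

s = 0.2
est = rho_hat(z, s, bw=0.01)
print(est, s / 2)               # estimate vs. exact value 0.1
```

On this toy model the estimator recovers ρ(s) = s/2 to within a few percent, which is all the rough control of ρ required above.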

Lower bound
From the preceding subsection the optimal risk for the kernel estimate is obtained by selecting an h balancing the trade-off between variance and bias. Imagine that we had found in Proposition 5 a result such as B_n(x_0) ≍ F^κ(h) for some κ > 0. Then the optimal bandwidth would stem from n^{−1} ≍ F^{1+κ}(h), leading to R_n(x_0) ≍ n^{−κ/(1+κ)}, which would contradict the initial claim of a degenerate rate for the risk. This explains why we derived the lower bound on B_n(x_0) in Proposition 5. As will be seen now, when r belongs to a class large enough to inherit classical approximation features, R_n cannot decrease at a polynomial rate. What we mean by classical approximation features is made explicit now. Let E_p denote any class of R-valued functions defined on H such that: For instance E_p may be the class of Hölder functions of order p ∈ ]0, 1[. When E_p is the class of functions which have two derivatives at x_0 we see from Proposition 5 that we have in addition: Optimizing the bias-variance trade-off in the risk leads to choosing an h such that sup_{r∈E_p} B_n(x_0) = V_n(x_0). The next Lemma deals with this issue.
Let c* be some constant and h* be the solution of the functional equation: then n^β/(nF(h*)) → +∞ for any β > 0. When X is non-Gaussian but satisfies the assumptions (2.1) and (2.6) the same conclusion holds.
Proof of the Lemma. Only the case 0 < β < 1 has to be investigated. When X is Gaussian the lemma is easily derived from Proposition 4 since it was proved that F(h) ≺_0 exp[−(log 1/h)^{1+1/α}] holds. When X is not Gaussian and RV_1 holds, the proof of Corollary 1 shows that β log n + 2p log h* > β ς(h*) log(1/h*) − 2p log h*, where ς(h*) tends to +∞ when h* tends to 0. When RV_+ holds the proof is the same with c_α (h*)^{1−α} instead of ς(h*) log(1/h*). Now our approach to derive lower bounds for the minimax risk follows Tsybakov's scheme (see [54]): we construct two models r_0 and r_1 far enough from each other but such that the Hellinger distance between the two models is bounded. Let p_ε stand for the density of ε. Assume that for some constant p* and for all y ∈ R, This assumption is general and appears in Tsybakov's book. It holds under smoothness assumptions on p_ε. We comment on it briefly. If Λ(y) denotes the left-hand side in the inequality above, Λ(y) ≤ 2 for all y and we just need to study Λ on a compact neighborhood of 0 (up to a rescaling through the constant p*).
Part II: Let X be non-Gaussian but satisfy the conditions (2.1). Let ρ be the auxiliary function of the small ball probability of X. Assume that ρ is regularly varying at 0 with index α ≥ 1, with either α > 1, or α = 1 and ρ(s)/s log(1/s); then again n^β R_n → +∞ for any β > 0.
In Part II we recall, for the sake of completeness, the conditions RV_+ and RV_1 introduced earlier. The theorem above shows that it is not possible to estimate the regression function in a nonparametric model with functional inputs at a polynomial rate. The rates may be considered as degenerate even when the functional variable X is very smooth (case λ(x) = exp(−x^α) for some α > 0) and the data concentrated close to a finite-dimensional space. In the classical situation of polynomial decay, λ(x) ≃ x^{−α} for some α > 1, the situation gets even worse and the optimal rate we may recover is a logarithmic power.
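The degeneracy asserted by the Lemma and the Theorem can be illustrated numerically under simplifying assumptions: a Gaussian-type small ball F(h) = exp(−(log 1/h)²) (the case α = 1, and a purely illustrative choice of constants) and a bias of order h^p with p = 1. Balancing bias and variance gives nF(h*) = h*^{−2p}, and the effective local sample size nF(h*) indeed grows slower than any power of n.

```python
import math

# With F(h) = exp(-(log 1/h)^2), the balanced bandwidth h* solves
#   n F(h*) = h*^(-2p),  i.e.  log n - L^2 - 2 p L = 0 with L = log(1/h*),
# and log(n F(h*)) / log n -> 0: no polynomial rate is attainable.

def F(h):
    return math.exp(-math.log(1.0 / h) ** 2)

def h_star(n, p=1.0):
    """Bisection for g(h) = log n - L^2 - 2 p L = 0, L = log(1/h)."""
    g = lambda h: math.log(n) - math.log(1.0 / h) ** 2 - 2 * p * math.log(1.0 / h)
    lo, hi = 1e-12, 1.0 - 1e-12       # g(lo) < 0 < g(hi), g increasing in h
    for _ in range(200):
        mid = math.sqrt(lo * hi)      # bisect on the log scale
        if g(mid) > 0:
            hi = mid                  # root lies below mid
        else:
            lo = mid
    return mid

ratios = [math.log(n * F(h_star(n))) / math.log(n)
          for n in (10**3, 10**6, 10**9, 10**12)]
print(ratios)   # slowly decreasing toward 0
```

In closed form L = −1 + (1 + log n)^{1/2} here, so log(nF(h*)) = 2L grows like (log n)^{1/2}: the ratio above decays, but only at a logarithmic speed, which is exactly the degeneracy discussed in the text.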
Remark 8. An interesting comparison is possible with the linear regression model as studied in [8], since these authors focus on the risk at a fixed point too. Their Theorem 4.2, p. 2168, coupled with the rate given at page 2167, gives a lower bound for a linear r. This lower bound is possibly parametric (that is O(1/n)) for very regular designs. Otherwise it depends on the relative smoothness of x_0 (positively) and of X through the λ's (negatively) as well as on the regularity of the unknown slope function, but remains polynomial. Obviously the lower bound deteriorates when shifting to the general regression model. The next corollary is stated for completeness because the Theorem above gives only a lower bound for R_n. Its proof is omitted since it stems from the derivation of Theorem 3.

Corollary 2. We get in fact
where h * is defined at (2.14) and the kernel estimator reaches this optimal rate up to constants.
These results are clearly connected with the complexity of the setting: the general nonparametric model coupled with the sparsity of functional spaces already mentioned in the paragraph below Proposition 2.
Remark 9. Other classes of regression functions could be considered. Here E_p was considered because calculations are possible when looking for an upper bound. However the theorem above holds, up to a change of constants, when r belongs to a class E_p for which: As in a finite-dimensional framework, obtaining large values of p switches the problem to defining higher-order kernels designed for functional data. This issue is out of the scope of this work. Yet, because of the degeneracy of the convergence rate, we are not sure it deserves much attention in this setting.

Complementary facts
In this short section are collected results of secondary interest. They complete however the preceding ones by underlining some facts about the non-uniqueness and the limits of the representation obtained above. Indeed the preceding theorems lead to the following question: is it possible to obtain a one-to-one representation, in a general framework, of the small ball probabilities of random elements in ℓ² (characterized by the sequence (λ_i)_{i∈N}) by a function in Γ_0, depending solely on its auxiliary function ρ? The answer is negative for at least two reasons. First it is plain that two series S and S′ built from different sequences (λ_i, Z_i)_{i∈N} may have equivalent (at 0) small ball probabilities. Second, imagine that we confine ourselves to Gaussian small ball probabilities and consider again the right-hand side of (2.5), denoting it F ∈ Γ_0 with auxiliary function ρ. Simple calculus shows that any function φF, where φ(x + tρ(x))/φ(x) → 1 when x → 0, belongs to Γ_0 with exactly the same auxiliary function ρ. Consequently even fixing the distribution of the sequence Z_i is not sufficient to obtain a one-to-one mapping between small ball probabilities and the set Γ_0.
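This non-uniqueness can be observed numerically. The specific F, ρ and φ below are illustrative choices, not the paper's: F(s) = exp(−1/s) lies in Γ_0 at 0 with auxiliary function ρ(s) = s², and so does φF for φ(s) = 1/s, with the same auxiliary function, although (φF)(s)/F(s) = 1/s blows up, so φF and F are not equivalent at 0.

```python
import math

# Check that F and G = phi * F share the limit
#   F(s + x rho(s)) / F(s) -> exp(x) as s -> 0,
# working through log F to avoid underflow of exp(-1/s) for tiny s.

def gamma_ratio(logF, s, x, rho):
    """F(s + x rho(s)) / F(s), computed on the log scale."""
    return math.exp(logF(s + x * rho(s)) - logF(s))

rho = lambda s: s * s
logF = lambda s: -1.0 / s                    # F(s) = exp(-1/s)
logG = lambda s: -1.0 / s - math.log(s)      # G = phi * F with phi(s) = 1/s

x = 1.7
for s in (1e-2, 1e-4, 1e-6):
    print(gamma_ratio(logF, s, x, rho), gamma_ratio(logG, s, x, rho))
# both columns tend to exp(1.7) ~ 5.4739 as s -> 0
```

Both sequences converge to e^x while G/F = 1/s diverges, which is precisely why fixing the auxiliary function cannot pin down a single element of Γ_0.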
Indeed, pick a function F_0 in the class Γ_0. This function is essentially defined by its auxiliary function ρ_0(·), and Theorem 1 is not precise enough for us to identify it with a small ball probability. This is due to the non-uniqueness of ρ mentioned just under (1.16) by the words 'up to asymptotic equivalence'. If ρ_1 ∼_0 ρ_0 then lim_{s↓0+} F_0(s + xρ_1(s))/F_0(s) = exp(x) as well. But the local behaviour at 0 of F_1(s) = exp{η(s) − ∫_s^1 1/ρ_1(t) dt} may differ from that of F_0, and F_0 may not be equivalent to F_1. What we show below is that if F_0 is accurately scaled we may deduce from F_0 a new function F*_0 which has the same auxiliary function as F_0 (but which may not be equivalent to F_0) and such that, for a well-chosen sequence (λ_i)_{i∈N}, the Gaussian small ball probability P(S < r) satisfies P(S < r) ∼_0 F*_0(r). We start with a definition which seems to be new. Definition 2. Let ρ be a self-neglecting function. A measurable function φ is called ρ-self-neglecting if: It is obvious that, if φ is ρ-self-neglecting, it is ρ*-self-neglecting whenever ρ* ∼_0 ρ. We propose below in Theorem 5 a representation theorem for ρ-self-neglecting functions.
Definition 3. Pick a ρ_0 in the class of self-neglecting functions at 0 such that ρ_0(0) = 0. We define the equivalence class of a function F_0 ∈ Γ_0 with auxiliary function ρ_0 by the relationship △ defined for all G in Γ_0 by: Recall that ϕ(t) = tγ(t) is defined within Proposition 4.
Theorem 4. Let F_0 ∈ Γ_0 with auxiliary function ρ_0 = 1/γ_0. Assume that ρ_0 is regularly varying at 0 with index κ > 1 and C¹ in a neighborhood of 0. Consider the equivalence class of F_0 in Γ_0\△, say F_0. Then one may pick a sequence (λ_i)_{i∈N} such that, with S = Σ_{i=1}^{+∞} λ_i Z_i where the Z_i follow a χ²(1) distribution: For the sake of completeness we obtain a last result, complementing and illustrating Proposition 3. From this Proposition we see that F△G if F = φG where φ is ρ-self-neglecting. The forthcoming Theorem represents these functions φ.
Theorem 5. Let ρ be self-neglecting at 0 and not vanishing in a neighborhood of 0. A function φ is ρ-self-neglecting if and only if: where c(u) → c ∈ ]0, +∞) and ε(u) → 0 when u → 0, and ε has the same regularity as ρ.
This theorem slightly generalizes the representation Theorem 2.11.3 for self-neglecting functions, p. 121 in [5], initially due to [6]. If one takes φ = ρ, the representation above coincides with the one announced in this theorem.

Conclusion and perspectives
The first main result of this article identifies small ball probabilities in ℓ² with a class of rapidly varying functions involved in extreme value theory and whose derivatives at all orders vanish at zero. This representation was obtained through previous works, especially the initial formula (1.12) of [41]. We hope that this new formulation will be more convenient for modeling small ball probabilities with some applied, especially statistical, purposes in mind. However many other questions arise. For instance the generalization to random elements with values in ℓ^p or in more general Banach spaces is certainly an intricate matter, since the starting formulae (1.12) and the following ones seem to be intimately suited to the space ℓ². A surprising fact is the parallel that can be drawn between large deviations on the one hand and extreme value theory on the other hand. Both were initially introduced to model and explore large values of sequences of random elements. It turns out that both provide an accurate setting to study small deviations as well: the Laplace transform for the classical approach, and methods around the domain of attraction of the third type (Gamma class, self-neglecting functions, ...) as outlined here. However the connections between regular variation and small ball probabilities have been known since de Bruijn in 1959 and his theorem on Laplace transforms (see Theorem 4.12.9 in [5]). This work confirms that both Tauberian and extreme value theory may provide tools complementing large deviation techniques to derive new results in this area.
The other fact is, as an application of the previous one, that the optimal risk in nonparametric regression for functional data is slow, in the sense that we cannot expect to obtain polynomial rates in the reasonable setting used in this work. It is obviously interpretable in terms of the curse of dimensionality. In the past similar negative results were obtained in different contexts. We have in mind the optimal rates in deconvolution with supersmooth noise as obtained by [16]. The lower bounds were O(1/log^τ n), τ > 0, and this situation is close to the one encountered here. Fortunately Fan's theorems did not prevent people from performing deconvolution with very smoothly corrupted data, but they pointed out the formal limits of the approach.
Here one may hope that this negative result will raise new challenges, since it will shift attention to new, more restricted and parsimonious non-linear models like the one introduced in [45]: where the r_i are functions defined on R and estimated from one-dimensional projections of the data X. It is known since [51] that this model is not subject to the curse of dimensionality when X takes values in R^k for a fixed k. Letting k increase with the sample size would be a possible track to introduce nonlinearity in regression models for functional data while avoiding some prohibitive features of a general model. A work was carried out by [23] in this direction.
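A model additive in a fixed number of projections can be sketched as follows. Everything below is a hedged illustration in the spirit of the model evoked above, not the estimator of [45]: the data-generating choices, the single greedy pass and the bandwidth are our own assumptions, and a real implementation would backfit and select k and h by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Additive-in-projections sketch: r(x) ~ sum_{i<=k} r_i(<x, e_i>) with
# k fixed. Each r_i is fitted by a one-dimensional Nadaraya-Watson
# smoother on the i-th score, in a single pass over the residuals.

def nw_1d(t_train, y_train, t_eval, h):
    """One-dimensional Nadaraya-Watson smoother with a Gaussian kernel."""
    w = np.exp(-0.5 * ((t_eval[:, None] - t_train[None, :]) / h) ** 2)
    return (w @ y_train) / w.sum(axis=1)

n, d, k = 2000, 30, 2
X = rng.standard_normal((n, d)) / np.arange(1, d + 1)    # decaying scores
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

fit = np.zeros(n)
resid = y - y.mean()
for i in range(k):                  # one pass over the k retained scores
    ri = nw_1d(X[:, i], resid, X[:, i], h=0.1)
    fit, resid = fit + ri, resid - ri

mse = np.mean(resid ** 2)           # what the k-term additive fit leaves over
print(mse, np.var(y))
```

Each r_i is a one-dimensional smoother, so the fit enjoys one-dimensional rates, consistent with the remark that such restricted models escape the curse of dimensionality for fixed k.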
Finally the rather negative result in Theorem 3 should be tempered since it relies heavily on the use of the theoretical norm. Other strategies, based on data-driven semi-norms coupled with automatic bandwidth selection, provide very satisfactory performances. The reader interested in these novel methods may refer to [46] or to [2]. As in other areas of statistics, purely numerical techniques may sooner or later overcome the flaws and stumbling stones of theory.

Proofs
Considerations about the smoothness at 0 of F and ρ are not the focus of this work and we will take it for granted that both functions are smooth enough. Besides, along the proofs we may sometimes consider generalized or local inverses of some functions which may not be invertible or have smooth derivatives everywhere. For example the auxiliary function ρ defined on R_+, for which we always have ρ′(0) = 0, has no smooth inverse on [0, c] for c > 0. But we may frequently use the smoothness of, say, ρ and ρ^{−1} on sets ]a, b[ for 0 < a < b without always justifying it. This section is split into two subsections. The first collects the derivations of results related to the small ball probability representation by de Haan's Gamma class. The second is devoted to the upper and lower bounds for the risk at a fixed point in the regression model for functional data.

Proofs of results of sections 2.1, 2.2 and 3
Proof of Proposition 1. Suppose that ρ(s)/s does not tend to zero when s does. Then we may pick an ε > 0 such that ρ(s_k)/s_k > ε for infinitely many s_k ↓ 0 as k ↑ +∞. Now fix x < −ε^{−1}; then s_k + xρ(s_k) < 0 and F(s_k + xρ(s_k)) = 0 for all k, so F(s_k + xρ(s_k))/F(s_k) cannot converge to exp(x). The second part of the proof, namely ensuring that ρ is self-neglecting, follows the lines of the proof of Proposition 3.10.6 in [5].
We start the proof of Theorem 2.
We will more specifically prove below that when s decays to 0: The two next lemmas are dedicated to showing that, in the line above, the fraction as well as the exponential both tend to 1 when s goes to zero and ρ is chosen as in the Theorem. We just have to clarify formula (2.5) within the Theorem. This stems directly from (1.12). Indeed from (1.13) and (1.14) we see that σ² = −∂r/∂γ, and we just have to show that γr + log Λ(γ) = ∫_{r_0}^{r} γ(s) ds.
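The identity can be sketched in two lines under the standing conventions of [41] (namely that Λ(γ) is the Laplace transform of S and that r = r(γ) is the associated tilted mean, so that (log Λ)′(γ) = −r(γ), consistent with σ² = −∂r/∂γ > 0); the base point r_0 absorbs the constant of integration:

```latex
\frac{d}{d\gamma}\Bigl[\gamma\, r(\gamma) + \log \Lambda(\gamma)\Bigr]
  = r(\gamma) + \gamma\, r'(\gamma) - r(\gamma)
  = \gamma\, r'(\gamma),
\qquad\text{hence}\qquad
\gamma r + \log \Lambda(\gamma)
  = \int \gamma\, r'(\gamma)\, d\gamma
  = \int_{r_0}^{r} \gamma(s)\, ds,
```

the last equality by the change of variable s = r(γ), licit since r′(γ) = −σ² < 0 makes r strictly monotone with inverse γ(·).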
Remark 11. Obviously γ has at least two (we do not need more) continuous derivatives on a neighborhood of infinity (here ]1, +∞) for instance). It is also straightforward to see that γ, which is strictly decreasing on ]1, +∞), is also a C¹ diffeomorphism on this set. Clearly lim_{s→0} ρ(s) = 0, but from Lemma 2 p. 431 in [41] it is plain that ρ(s)/s also tends to zero when s does, which implies that ρ′(0) = 0. Indeed proving that ρ(s)/s tends to zero comes down to proving that sγ(s) → +∞.
The continuity of ρ′ at 0 and its nullity at 0 (see Remark 11) implies on the one hand that the line above is bounded above for fixed x and s (hence d) going to zero, and also that: At last, I(s + xρ(s)) − I(s) → x, which finishes the proof of the Lemma.
This proves that F*_0 △ F_0. It remains to show that F*_0 ∼_0 P(S < ·). As above, γ_0 = 1/ρ_0. Start from (2.7), that is r = Σ_j λ_j/(1 + 2γ_0 λ_j). Now, following the proof of Proposition 4, we set J(r) = r/ρ_0(r) (we just make use of equation (4.5), fix J(r)ρ_0(r)/r = 1 instead of bounding it above and below) and take a(·) = J^{−1}(·); then finally S = Σ_{i=1}^{+∞} Z_i/a(i). By construction P(S < ·) ∼_0 F*_0. Finally we turn to the proof of Theorem 5 and start with a Lemma. This Lemma, its proof and the subsequent proof of the Theorem adapt the derivation of Lemma 2.11.2 and Theorem 2.11.3 of [5].
Proof. First note that the sequence x_n is decreasing since ρ ≥ 0, and notice from the properties of self-neglecting functions (namely ρ(s)/s → 0 when s → 0) that for a sufficiently small x_0 > 0, x_n ≥ 0 for all n. The limit of x_n exists and is denoted l. Suppose that l > 0. Then ρ(l) > 0 and, since ρ is a nondecreasing function, ρ(x_k) ≥ ρ(l) for all k. At last: letting n go to infinity, x_n goes to −∞, which contradicts x_n ≥ 0, hence the Lemma.
Proof of Theorem 5. Let x_n be as in the preceding Lemma. Let p be a C^∞ probability density on [0, 1] and set for x_{n+1} ≤ u ≤ x_n: The proof takes three steps.
The third and last step is devoted to proving that |ε (u)| → 0 when u → 0. Indeed for all x n+1 ≤ u ≤ x n , We focus on .
Just as above, ρ(x_n − λ_u ρ(x_n))/ρ(x_n) → 1 since ρ is self-neglecting. Finally, by the definition of φ, we get ln[φ(x_n)/φ(x_n − ρ(x_n))] → 0, which finishes the proof of the Theorem.

Proofs of results of section 2.3
Proof of Proposition 5. We start with V_n(x_0). It is simple to see that V_n(x_0) = nσ²_ε E[ω²_{1,n}] with: Computations like those carried out in [19] show that: hence that (see (2.9)) V_n(x_0) ∼