Optimal model selection in heteroscedastic regression using piecewise polynomial functions

Abstract: We consider the estimation of a regression function with random design and heteroscedastic noise in a nonparametric setting. More precisely, we address the problem of characterizing the optimal penalty when the regression function is estimated by using a penalized least-squares model selection method. In this context, we show the existence of a minimal penalty, defined to be the maximum level of penalization under which the model selection procedure totally misbehaves. The optimal penalty is shown to be twice the minimal one and to satisfy a non-asymptotic pathwise oracle inequality with leading constant almost one. Finally, the ideal penalty being unknown in general, we propose a hold-out penalization procedure and show that the latter is asymptotically optimal.


Introduction
Given a collection of models and associated estimators, two different model selection tasks can be tackled: find out the smallest true model (consistency problem), or select an estimator achieving the best performance according to some criterion, called a risk or a loss (efficiency problem). We focus on the efficiency problem, where the leading idea of penalization, that goes back to early works of Akaike [2,3] and Mallows [33], is to perform an unbiased - or uniformly biased - estimation of the risk of the estimators. FPE and AIC procedures proposed by Akaike respectively in [2] and [3], as well as Mallows' $C_p$ or $C_L$ [33], aim to do so by adding to the empirical risk a penalty which depends on the dimension of the models.
The first analysis of such procedures had the drawback of being fundamentally asymptotic, considering in particular that the number of models as well as their dimensions are fixed while the sample size tends to infinity. As explained for instance in Massart [34], in various statistical settings it is natural to let these quantities depend on the amount of data. Thus, pointing out the importance of Talagrand's type concentration inequalities in the nonasymptotic approach, Birgé and Massart [16,18] and Barron, Birgé and Massart [11] have been able to build nonasymptotic oracle inequalities for penalization procedures. Their framework takes into account the complexity of the collection of models as a parameter depending on the sample size.
In an abstract risk minimization framework, which includes statistical learning problems such as classification or regression, many distribution-dependent and data-dependent penalties have been proposed, from the more general and less accurate global penalties, see Koltchinskii [27], Bartlett et al. [12], to the refined local Rademacher complexities in the case where some favorable noise conditions hold (see for instance Bartlett, Bousquet and Mendelson [13], Koltchinskii [28]). But as a price to pay for generality, the above penalties suffer from their dependence on unknown constants. These penalized procedures are very difficult to implement and calibrate in practice. Moreover, the existing risk bounds for these procedures contain very large leading constants. Other general-purpose penalties have been proposed, such as the bootstrap penalties of Efron [26] and the resampling and V-fold penalties of Arlot [5,6]. These penalties are essentially resampling estimates of the difference between the empirical risk and the risk. Arlot [5,6] proved sharp pathwise oracle inequalities for the resampling and V-fold penalties in the case of regression with random design and heteroscedastic noise on histogram models, and conjectured that the restriction to histograms is mainly technical and that his results can be extended to more general situations.
Model selection via penalization is not the only method which provides sharp oracle inequalities for the estimation of a nonparametric regression function. Indeed, aggregation techniques and PAC-Bayesian bounds also make it possible to obtain nearly optimal constants in the oracle inequalities. Bunea et al. [21] derived some sharp oracle inequalities for different aggregation tasks by means of a single unifying procedure. However, their results require a fixed design and homoscedastic Gaussian noise. By using aggregation with exponential weights, Dalalyan and Tsybakov obtained in [25] oracle inequalities of a PAC-Bayesian flavor with leading constant one and optimal rate of the remainder term for the estimation of a regression function with deterministic design and homoscedastic errors. Furthermore, these authors allowed error distributions which are symmetric or n-divisible. PAC-Bayesian methods are systematically investigated in Catoni [23]. The work of Lecué and Mendelson [29] concerning the aggregation by empirical risk minimization of a finite family of functions seems to handle the case of a random design and heteroscedastic noise, even if this example is not explicitly developed. The oracle inequalities obtained by Lecué and Mendelson are sharp and valid with probability close to one. In particular, they are related to oracle inequalities obtained, in expectation, by Catoni in [23].
A difference between aggregation and model selection studies is that in most aggregation results, the estimators at hand are considered as deterministic functions. However, notable exceptions are the following. Leung and Barron [32] proved sharp oracle inequalities for the aggregation of projection estimators in the Gaussian sequence model. Rigollet and Tsybakov [35] recently showed sharp bounds for the aggregation of some linear estimators, including projection estimators, in a regression setting, with fixed design and homoscedastic Gaussian noise. More general PAC-Bayesian type inequalities were also recently obtained by Dalalyan and Salmon [24], considering the aggregation of affine estimators in heteroscedastic regression, with Gaussian noise and fixed design.
Birgé and Massart [19] discovered, in a generalized linear Gaussian model setting, that the optimal penalty is closely related to the minimal one. An optimal penalty is a penalty which gives an oracle inequality with leading constant converging to one when the sample size tends to infinity. The minimal penalty is defined to be the maximal penalty under which the procedure totally misbehaves (in a sense to be specified below). Birgé and Massart [19] proved sharp upper and lower bounds for the minimal penalty. These authors also showed that the optimal penalty is twice the minimal one, both for small and large collections of models. These facts are called the slope heuristics. The authors also exhibited a jump in the dimension of the selected model occurring around the value of the minimal penalty, and used it to estimate the minimal penalty from the data. Taking a penalty equal to twice the previous estimate then gives a nonasymptotic quasi-optimal data-driven model selection procedure. The algorithm proposed by Birgé and Massart [19] to estimate the minimal penalty relies on the previous knowledge of the shape of the latter, which is a known function of the dimension of the models in their setting. Thus, their procedure gives a data-driven calibration of the minimal penalty.
Considering the case of Gaussian least-squares regression with unknown variance, Baraud et al. [10] have also derived lower bounds on the penalty terms for small and large collections of models. In the setting of maximum likelihood estimation of density on histograms, Castellan [22] obtained a lower bound on the penalty term, in the case of small collections of models.
The slope heuristics has been then extended by Arlot and Massart [9] in a bounded regression framework, with heteroscedastic noise and random design.
The authors considered least-squares estimators on a "small" collection of histogram models. Their analysis differs from the one of Birgé and Massart [19] in an important way. Indeed, Arlot and Massart [9] did not assume a particular shape of the penalty term. As a matter of fact, the penalties considered by Birgé and Massart [19] were known functions of the dimension of the models, whereas heteroscedasticity of the noise allowed Arlot and Massart to consider situations where the shape of the penalty is not even a function of the dimension of the models. In such general cases, the authors proposed to estimate the shape of the penalty by using Arlot's resampling or V-fold penalties, proved to be efficient in their regression framework by Arlot [5,6].
The approach developed in [9] is more general than the histogram case, except for some identified technical parts of the proofs, thus providing a general framework that can be applied to other problems. The authors have also identified, in the case of histograms, the minimal penalty as the mean of the empirical excess loss on each model, and the ideal penalty to be estimated as the sum of the empirical excess loss and true excess loss on each model. The slope heuristics then heavily relies on the fact that the empirical excess loss is equivalent to the true excess loss for models of reasonable dimensions.
Arlot and Massart [9] conjectured that this equivalence between the empirical and true excess loss is a quite general fact in M-estimation. A general result supporting this conjecture is the high dimensional Wilks' phenomenon investigated by Boucheron and Massart [20] in the setting of bounded contrast minimization. The authors derive in [20] concentration inequalities for the empirical excess loss, under some margin conditions (called "noise conditions" by the authors) and when the considered model satisfies some general "complexity condition" on the first moment of the supremum of the empirical process on localized slices of variance in the loss class. The latter assumption can be made explicit under suitable covering entropy conditions on the model.
Lerasle [31] proved the validity of the slope heuristics in a least-squares density estimation setting, under rather mild conditions on the considered linear models. The approach developed by the author in this framework allows sharp computations and the empirical excess loss is shown to be exactly equal to the true excess loss. Lerasle [31] also proved in the least-squares density estimation setting the efficiency of Arlot's resampling penalties. Moreover, Lerasle [30] generalized the previous results to weakly dependent data. Arlot and Bach [8] recently considered the problem of selecting among linear estimators in nonparametric regression. Their framework includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. In such cases, the minimal penalty is not necessarily half the optimal one, but the authors propose to estimate the unknown variance by the minimal penalty and to use it in a plug-in version of Mallows' $C_L$. The latter penalty is proved to be optimal by establishing a nonasymptotic oracle inequality with constant close to one, converging to one when the sample size tends to infinity.
In this paper, we prove the validity of the slope heuristics in the framework of bounded regression with random design and heteroscedastic noise. This is done by considering a "small" collection of finite-dimensional linear models of piecewise polynomial functions. This setting extends the case of histograms already treated by Arlot and Massart [9]. An interesting consequence is that piecewise polynomial functions are known to have good approximation properties in Besov spaces and can lead to minimax rates of convergence, see for instance [11,37]. As a matter of fact, histograms allow minimax procedures only on Hölder spaces.
Our validation of the slope heuristics is of asymptotic nature. However, the complexity of the collection of models as well as their dimensions are not constant terms in our analysis. These quantities are indeed allowed to depend on the sample size n.
If the noise is homoscedastic, then the shape of the ideal penalty is known, and is linear in the dimension of the models as in the case of Mallows' $C_p$. However, if the noise is heteroscedastic, then Arlot [7] showed that the ideal penalty is not even a function of the linear dimensions of the models. So, it is necessary to give a suitable estimator of this shape. As emphasized by Arlot [5,6], V-fold and resampling penalties are good, natural candidates for this task. In this paper, we show that a hold-out penalty - which is closely related to a special case of resampling penalty - is indeed asymptotically optimal under very mild conditions on the data split. As a matter of fact, a half-and-half split leads to an optimal penalization. It is worth noticing that hold-out type procedures have also been exploited in Chapter 8 of Massart [34] as simple tools to overcome the margin adaptivity issue in classification.
The paper is organized as follows. In Section 2, we describe the statistical framework. The slope heuristics is presented in Section 3, and the hold-out penalization is considered in Section 4. The proofs are collected in Section 5.

Penalized least-squares model selection
Let us take n independent observations $\xi_i = (X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$ with common distribution P. In Sections 2.2 and 3.2-4, the feature space is $\mathcal{X} = [0, 1]$. The marginal distribution of $X_i$ is denoted by $P^X$. We assume that the data satisfy the relation
$$Y_i = s^*(X_i) + \sigma(X_i)\,\varepsilon_i, \qquad (1)$$
where $s^* \in L^2(P^X)$. Conditionally on $X_i$, the residual $\varepsilon_i$ is assumed to have zero mean and variance equal to one. The function $\sigma : \mathcal{X} \to \mathbb{R}_+$ is the unknown heteroscedastic noise level. A generic random variable with distribution P, independent of the sample $(\xi_1, \ldots, \xi_n)$, is denoted by $\xi = (X, Y)$. It follows from (1) that $s^*$ is the unknown regression function of Y with respect to X. Our aim is to estimate $s^*$ from the sample. To do so, we are given a finite collection of models $\mathcal{M}_n$, with cardinality depending on the sample size n. Each model $M \in \mathcal{M}_n$ is assumed to be a finite-dimensional vector space. We denote by $D_M$ the linear dimension of M. In the main part of this paper, we focus on models of piecewise polynomial functions, introduced in Section 2.2 below.
We denote by $\|s\|_2 = (\int_{\mathcal{X}} s^2 \, dP^X)^{1/2}$ the usual norm in $L^2(P^X)$ and by $s_M$ the linear projection of $s^*$ onto M in the Hilbert space $(L^2(P^X), \|\cdot\|_2)$. For a function $f \in L^1(P)$, we write $P(f) = Pf = \mathbb{E}[f(\xi)]$. Setting $K : L^2(P^X) \to L^1(P)$ the least-squares contrast, defined by $(Ks)(x, y) = (y - s(x))^2$, the regression function $s^*$ satisfies
$$s^* \in \arg\min_{s \in L^2(P^X)} P(Ks).$$
For the linear projections $s_M$ we get
$$s_M \in \arg\min_{s \in M} P(Ks).$$
For each model $M \in \mathcal{M}_n$, we consider a least-squares estimator $s_n(M)$ (possibly non unique), satisfying
$$s_n(M) \in \arg\min_{s \in M} P_n(Ks),$$
where $P_n = n^{-1} \sum_{i=1}^{n} \delta_{\xi_i}$ is the empirical measure built from the data. In order to avoid cumbersome notations, we will often write Ks in place of K(s) for the image of a suitable function s by the contrast K. We measure the performance of the least-squares estimators by their excess loss,
$$\ell(s^*, s_n(M)) := P(Ks_n(M)) - P(Ks^*) \geq 0.$$
Given the collection of models $\mathcal{M}_n$, an oracle model $M^*$ is defined as a minimizer of the losses - or equivalently excess losses - of the estimators at hand,
$$M^* \in \arg\min_{M \in \mathcal{M}_n} \ell(s^*, s_n(M)). \qquad (5)$$
The associated oracle estimator $s_n(M^*)$ thus achieves the best performance in terms of excess loss among the collection $\{s_n(M) \,;\, M \in \mathcal{M}_n\}$. The oracle model is a random quantity because it depends on the data, and it is also unknown as it depends on the distribution P of the data. We propose to estimate the oracle model by a penalization procedure. Given some known penalty pen, that is a function from $\mathcal{M}_n$ to $\mathbb{R}$, we consider the following data-dependent model, also called selected model,
$$\widehat{M} \in \arg\min_{M \in \mathcal{M}_n} \left\{ P_n(Ks_n(M)) + \mathrm{pen}(M) \right\}. \qquad (6)$$
Our aim is then to find a good penalty, such that the selected model $\widehat{M}$ satisfies an oracle inequality of the form
$$\ell(s^*, s_n(\widehat{M})) \leq C \inf_{M \in \mathcal{M}_n} \ell(s^*, s_n(M)),$$
with some positive constant C as close to one as possible and with probability close to one, typically more than $1 - Ln^{-2}$ for some positive constant L.
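The penalized selection rule just described can be sketched numerically. The sketch below is ours, not part of the paper: it simulates data from the heteroscedastic model, uses regular histogram models on [0, 1] as a stand-in for the collection of models, and a placeholder linear penalty; all helper names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from Y = s*(X) + sigma(X) * eps (heteroscedastic model).
n = 500
X = rng.uniform(0.0, 1.0, n)
s_star = lambda x: np.sin(2 * np.pi * x)   # hypothetical target
sigma = lambda x: 0.2 + 0.3 * x            # hypothetical noise level
Y = s_star(X) + sigma(X) * rng.standard_normal(n)

def fit_histogram(X, Y, D):
    """Least-squares estimator s_n(M) on the regular histogram model with D pieces."""
    bins = np.minimum((X * D).astype(int), D - 1)
    means = np.array([Y[bins == k].mean() if np.any(bins == k) else 0.0
                      for k in range(D)])
    return bins, means

def empirical_risk(Y, pred):
    """Empirical risk P_n(Ks) = (1/n) sum_i (Y_i - s(X_i))^2."""
    return float(np.mean((Y - pred) ** 2))

# Selected model: argmin over M of P_n(K s_n(M)) + pen(M).
dims = [1, 2, 4, 8, 16, 32, 64]
pen = lambda D: 0.2 * D / n                # placeholder penalty shape
crit = []
for D in dims:
    bins, means = fit_histogram(X, Y, D)
    crit.append(empirical_risk(Y, means[bins]) + pen(D))
D_hat = dims[int(np.argmin(crit))]
```

In the paper the penalty is of course not handed to the statistician; Sections 3 and 4 are precisely about replacing the placeholder `pen` by a data-driven quantity.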

Piecewise polynomial functions
Let us take $\mathcal{X} = [0, 1]$ the unit interval and $\mathcal{P}$ a finite partition of $\mathcal{X}$. For a positive integer r and any $(I, j) \in \mathcal{P} \times \{0, \ldots, r\}$, we set $\varphi_{I,j}(x) = x^j \mathbf{1}_I(x)$ and define M as the linear span of the family $(\varphi_{I,j})_{(I,j) \in \mathcal{P} \times \{0, \ldots, r\}}$, that is, the space of piecewise polynomial functions of degree at most r on the partition $\mathcal{P}$. The linear dimension of M is then equal to $(r + 1)|\mathcal{P}|$.
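Since basis functions attached to distinct intervals have disjoint supports, a least-squares estimator on such a model can be computed piece by piece. The sketch below fits an independent degree-r polynomial on each interval of the partition; the partition, target and sample size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fit_piecewise_poly(X, Y, breaks, r):
    """Least-squares fit on the model of piecewise polynomials of degree <= r
    over the partition of [0, 1] induced by `breaks`; dimension (r+1)*|P|."""
    coefs = []
    for a, b in zip(breaks[:-1], breaks[1:]):
        mask = (X >= a) & (X < b)
        if mask.sum() > r:                 # at least r+1 points on the piece
            coefs.append(np.polyfit(X[mask], Y[mask], r))
        else:
            coefs.append(np.zeros(r + 1))
    return coefs

def predict_piecewise_poly(x, breaks, coefs):
    """Evaluate the piecewise polynomial fit at the points x."""
    idx = np.clip(np.searchsorted(breaks, x, side="right") - 1,
                  0, len(coefs) - 1)
    return np.array([np.polyval(coefs[i], xi) for i, xi in zip(idx, x)])

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 400)
Y = np.cos(3 * X) + 0.1 * rng.standard_normal(400)
breaks = np.linspace(0, 1, 5)              # |P| = 4 pieces
coefs = fit_piecewise_poly(X, Y, breaks, r=2)   # dimension (2+1)*4 = 12
pred = predict_piecewise_poly(X, breaks, coefs)
```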
Notice that models of histograms on the unit interval are exactly models of piecewise polynomial functions with degrees not larger than 0. In [36], it is shown that models of piecewise polynomial functions have nice analytical and statistical properties. Let us recall two of them.
In Lemma 8 of [36], it is proved that if the distribution $P^X$ has a density with respect to the Lebesgue measure Leb on $\mathcal{X} = [0, 1]$ which is uniformly bounded away from zero, and if the considered partition $\mathcal{P}$ is lower regular with respect to Leb - that is, there exists a positive constant c such that $|\mathcal{P}| \inf_{I \in \mathcal{P}} \mathrm{Leb}(I) \geq c > 0$ - then the associated model of piecewise polynomial functions is equipped with a localized orthonormal basis in $L^2(P^X)$. For a formal definition of a localized basis, see Section 5 below. Since the pioneering work of Birgé and Massart [15,17,34], the property of localized basis is known to play a key role in M-estimation and model selection using vector spaces or more general sieves.
Considering models of piecewise polynomial functions on the unit interval, where the density of P X with respect to Leb is both uniformly bounded and bounded away from 0 and where the underlying partition is lower regular with respect to Leb, it is shown in Lemma 9 of [36] that the least-squares estimator s n (M ) converges in sup-norm to the linear projection s M of the regression function s * .
Assumptions of lower regularity of the considered partitions as well as the existence of a uniformly bounded density of P X with respect to the Lebesgue measure on X , will thus naturally arise when dealing with least-squares model selection using piecewise polynomial functions -see Section 3.2 below. Furthermore, the interested reader will find in Section 5 a more general version of our results, available for linear models equipped with a localized basis and where least-squares estimators converge in sup-norm to the linear projections of the regression function onto the models.

Underlying concepts
In order to clarify our approach and to highlight the connection of the present paper with the results previously established in [36], we first give a brief heuristic explanation of the major mathematical facts underlying the slope phenomenon.
We rewrite the definition of the oracle model $M^*$ given in (5) as
$$M^* \in \arg\min_{M \in \mathcal{M}_n} \left\{ P_n(Ks_n(M)) + \mathrm{pen_{id}}(M) \right\}, \quad \text{where } \mathrm{pen_{id}}(M) := P(Ks_n(M)) - P_n(Ks_n(M)).$$
The penalty function $\mathrm{pen_{id}}$ is called the ideal penalty - as it allows to select the oracle - and is unknown because it depends on the distribution of the data. As pointed out by Arlot and Massart [9], the main idea of penalization in the efficiency problem is to give some sharp estimate, up to a constant, of the ideal penalty. This would yield an (asymptotically) unbiased - or uniformly biased over the collection of models $\mathcal{M}_n$ - estimation of the loss. Such a penalization would lead to a sharp oracle inequality for the selected model.
A penalty term pen opt is said to be optimal if it achieves an oracle inequality with leading constant converging to one when the sample size n tends to infinity.
Concerning the estimation of the optimal penalty, Arlot and Massart [9] conjectured that the mean of the empirical excess loss $\mathbb{E}[P_n(Ks_M - Ks_n(M))]$ satisfies the following slope heuristics in a quite general M-estimation framework:

(i) If a penalty $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}_+$ is such that, for all models $M \in \mathcal{M}_n$,
$$\mathrm{pen}(M) \leq (1 - \delta)\, \mathbb{E}[P_n(Ks_M - Ks_n(M))]$$
with $\delta > 0$, then the dimension of the selected model $\widehat{M}$ is "very large" and the excess loss of the selected estimator $s_n(\widehat{M})$ is "much larger" than the excess loss of the oracle.

(ii) If a penalty is such that, for all models $M \in \mathcal{M}_n$,
$$\mathrm{pen}(M) \geq (1 + \delta)\, \mathbb{E}[P_n(Ks_M - Ks_n(M))]$$
with $\delta > 0$, then the corresponding model selection procedure satisfies an oracle inequality with a leading constant $C(\delta) < +\infty$ and the dimension of the selected model is "not too large". Moreover,
$$\mathrm{pen}(M) = 2\, \mathbb{E}[P_n(Ks_M - Ks_n(M))]$$
is an optimal penalty.
The mean of the empirical excess loss on M, as M varies in $\mathcal{M}_n$, is thus conjectured to be the maximal value of the penalty under which the model selection procedure totally misbehaves or, equivalently, the minimal value of the penalty above which the procedure achieves an oracle inequality. It is called the minimal penalty, denoted by $\mathrm{pen_{min}}$:
$$\mathrm{pen_{min}}(M) := \mathbb{E}[P_n(Ks_M - Ks_n(M))].$$
The optimal penalty is then close to twice the minimal one, $\mathrm{pen_{opt}} \approx 2\, \mathrm{pen_{min}}$.
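In practice, the dimension jump mentioned in the introduction gives a way to exploit this relation: sweep a constant C in a penalty of the form C times a known penalty shape, locate the value of C at which the selected dimension drops from very large to moderate, and then use twice that value. The toy sketch below is ours, not the paper's procedure: it uses histogram models and the linear penalty shape $D_M/n$ (appropriate in a homoscedastic-like situation), with illustrative constants throughout.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + 0.25 * rng.standard_normal(n)

def hist_risk(X, Y, D):
    """Empirical risk P_n(K s_n(M)) of the regular histogram estimator with D pieces."""
    bins = np.minimum((X * D).astype(int), D - 1)
    risk = 0.0
    for k in range(D):
        yk = Y[bins == k]
        if yk.size:
            risk += ((yk - yk.mean()) ** 2).sum()
    return risk / len(Y)

dims = np.arange(1, 201)
risks = np.array([hist_risk(X, Y, D) for D in dims])

def selected_dim(C):
    """Dimension selected by the penalty C * D / n (linear penalty shape)."""
    return dims[np.argmin(risks + C * dims / n)]

# Sweep the constant C: the selected dimension jumps from very large to
# moderate around the minimal-penalty constant; use twice the jump location.
Cs = np.linspace(0.0, 1.0, 201)
sel = np.array([selected_dim(C) for C in Cs])
jump = np.argmax(-np.diff(sel))        # largest drop in selected dimension
C_min = Cs[jump + 1]                   # estimated minimal-penalty constant
C_opt = 2 * C_min                      # slope heuristics: pen_opt ~ 2 pen_min
```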
Let us now briefly explain points (i) and (ii) above. We give in Section 3.3 precise results which validate the slope heuristics for models of piecewise polynomial functions. If the chosen penalty is less than the minimal one, $\mathrm{pen} = (1 - \delta)\,\mathrm{pen_{min}}$ with $\delta \in [0, 1]$, the algorithm minimizes over $\mathcal{M}_n$,
$$P_n(Ks_n(M)) + \mathrm{pen}(M) - P_n(Ks^*) \approx \ell(s^*, s_M) - \delta\, P_n(Ks_M - Ks_n(M)).$$
In the latter identity, we neglect the difference between the empirical and true loss of the projections $s_M$, as well as the deviations of the empirical excess loss $P_n(Ks_M - Ks_n(M))$ from its mean. Indeed, as shown by Boucheron and Massart [20], the empirical excess loss satisfies a concentration inequality in a general framework, which allows one to neglect the difference with its mean, at least for models that are not too small.
As the empirical excess loss is increasing and the excess loss of the projection s M is decreasing with respect to the complexity of the models, the penalized criterion is (almost) decreasing with respect to the complexity of the models, and the selected model is among the largest of the collection.
On the contrary, if the chosen penalty is greater than the minimal one, $\mathrm{pen} = (1 + \delta)\,\mathrm{pen_{min}}$ with $\delta > 0$, then by the same kind of manipulations, the selected model minimizes the following criterion: for all $M \in \mathcal{M}_n$,
$$P_n(Ks_n(M)) + \mathrm{pen}(M) - P_n(Ks^*) \approx \ell(s^*, s_M) + \delta\, P_n(Ks_M - Ks_n(M)). \qquad (8)$$
The selected model thus achieves a trade-off between the bias of the models, which decreases with the complexity, and the empirical excess loss, which increases with the complexity of the models. The selected dimension would then be reasonable, and the trade-off between the bias and the complexity of the models is likely to give some oracle inequality.
Finally, if we take $\delta = 1$ in the latter case, $\mathrm{pen} = 2 \times \mathrm{pen_{min}}$, and if we assume that the empirical excess loss is equivalent to the true excess loss,
$$P_n(Ks_M - Ks_n(M)) \approx P(Ks_n(M) - Ks_M), \qquad (9)$$
then according to (8),
$$P_n(Ks_n(M)) + \mathrm{pen}(M) - P_n(Ks^*) \approx \ell(s^*, s_M) + P(Ks_n(M) - Ks_M) = \ell(s^*, s_n(M)).$$
Hence, $\ell(s^*, s_n(\widehat{M})) \approx \ell(s^*, s_n(M^*))$ and the procedure is nearly optimal. One can find in [36] some results showing that (9) is a quite general fact in least-squares regression and is in particular satisfied for models of piecewise polynomial functions. These results thus represent preliminary material for the present study, and we shall base our arguments on the results exposed in [36].

Assumptions and comments
We take X = [0, 1], Leb is the Lebesgue measure on X , and linear models M ∈ M n are models of piecewise polynomial functions. We denote by P M the partition of X underlying the model M .
Set of assumptions for piecewise polynomial functions (SAPP):

There exists a positive constant A that bounds the data: $|Y_i| \leq A < \infty$.

(Ad$_{\mathrm{Leb}}$) $P^X$ has a density f with respect to Leb satisfying, for some constants $c_{\min}$ and $c_{\max}$, $0 < c_{\min} \leq f(x) \leq c_{\max} < +\infty$ for all $x \in \mathcal{X}$.

(Aud) There exists $r \in \mathbb{N}^*$ such that, for all $M \in \mathcal{M}_n$, all $I \in \mathcal{P}_M$ and all $p \in M$, $\deg(p|_I) \leq r$.

Assumptions (Aud) and (Alr) specify some quantities related to the choice of the models of piecewise polynomial functions. Assumption (P1) states that the collection of models has a "small" complexity, more precisely a polynomially increasing one with respect to the amount of data. For this kind of complexity, if one wants to design a good model selection procedure for prediction, the chosen penalty should estimate the mean of the ideal one on each model, up to a constant. Indeed, as Talagrand's type concentration inequalities for the empirical process are exponential, they allow one to neglect the deviations of the quantities of interest from their means, uniformly over the collection of models. This is not the case for large collections of models, where one has to put an extra log factor depending on the complexity of the collection of models inside the penalty, see for instance [16,11].
We assume in (P3) that the collection of models contains a model $M_0$ of reasonably large dimension and a model $M_1$ of high dimension; this is necessary since we prove the existence of a jump between high and reasonably large dimensions. Notice that, in practice, the parameter $\beta_+$, which depends on the bias of the models, is not known, and so the existence of $M_0$ is not straightforward. However, it suffices for the statistician to take at least one model per dimension below the chosen upper bound to ensure the existence of $M_0$ and $M_1$.
We require in (Ap$_u$) that the quality of approximation of the collection of models is good enough in terms of the quadratic loss. More precisely, we ask for a polynomial decrease of the excess loss of the linear projections of the regression function onto the models. It is well known that piecewise polynomial functions with uniformly bounded degrees have good approximation properties in Besov spaces. More precisely, as stated in Lemma 12 of Barron, Birgé and Massart [11], if $\mathcal{X} = [0, 1]$ and the regression function $s^*$ belongs to the Besov space $B_{\alpha,p,\infty}(\mathcal{X})$ (see the definition in [11]), then taking models of piecewise polynomial functions of degree bounded by $r > \alpha - 1$ on regular partitions with respect to the Lebesgue measure Leb on $\mathcal{X}$, and assuming that $P^X$ has a density with respect to Leb which is bounded in sup-norm, assumption (Ap$_u$) is satisfied.
Assumption (Ab) is rather restrictive, since it excludes Gaussian noise. However, the assumption of bounded noise is somehow classical when dealing with M-estimation and related procedures. Indeed, a central tool in this field is empirical process theory and, especially, concentration inequalities for the supremum of the empirical process. We used the classical inequalities of Bousquet, and Klein and Rio, in [36]. As a matter of fact, we do not know yet whether an adaptation of our proofs (including the results established in [36]) using extensions of the latter inequalities to some unbounded cases - as for instance Adamczak's concentration inequalities [1] - would be possible.
The noise restriction stated in (An) is needed to derive our results, which are optimal to the first order. More precisely, it allows in [36] to obtain sharp lower bounds for the true and empirical excess losses on a fixed model. This assumption is also needed in the work of Arlot and Massart [9] concerning the case of histogram models. As noticed in Section 5.3 of [36], assumption (An) could be replaced by the following assumption, which states that the partitions underlying the models of piecewise polynomial functions are regular from above with respect to the Lebesgue measure on [0, 1].

Statement of the theorems
We are now able to state our main results leading to the slope heuristics. They describe the behavior of the penalization procedure defined in (6).
Theorem 1 justifies the first part (i) of the slope heuristics exposed in Section 3.
As a matter of fact, it shows that there exists a level such that, if the penalty is smaller than this level for one of the largest models, then the dimension of the output is among the largest dimensions of the collection and the excess loss of the selected estimator is much larger than the excess loss of the oracle. Moreover, this level is given by the mean of the empirical excess loss of the least-squares estimator on each model. Let us also notice that the lower bound given in (11) gets worse as β + increases. This is due to the fact that when β + increases, the approximation properties of the models improve and the performances in terms of excess loss for the oracle estimator also improve.
The following theorem validates the second part of the slope heuristics.
such that, for all $n \geq n_0$, with probability at least $1 - A_3 n^{-2}$, the conclusions above hold. Assume that, in addition, the following assumption holds:

(Ap) The bias decreases like a power of $D_M$: there exist $\beta_- \geq \beta_+ > 0$ and $C_+, C_- > 0$ such that
$$C_- D_M^{-\beta_-} \leq \ell(s^*, s_M) \leq C_+ D_M^{-\beta_+}.$$

Then, for all $n \geq n_0((\mathrm{SAPP}), C_-, \beta_-, \beta_+, \eta, \delta)$, with probability at least $1 - A_3 n^{-2}$, the oracle inequality (15) holds together with the lower bound (16) on the selected dimension.

Theorem 2 states that if the penalty is close to twice the minimal one, then the selected estimator satisfies a pathwise oracle inequality with constant almost one, and so the model selection procedure is approximately optimal. Moreover, the selected model has a reasonable dimension, bounded by a power less than one of the sample size.
Condition (Ap) allows to remove the remainder terms from the oracle inequality (15) by ensuring that the selected model is of dimension not too small, as stated in (16). Assumption (Ap) is the conjunction of assumption (Ap$_u$) with a polynomial lower bound on the bias of the models. On histogram models, Arlot showed in Section 8.10 of [4] that this lower bound is satisfied for non-constant α-Hölder (α ∈ (0, 1]) regression functions and for regular partitions.
Finally, from Theorems 1 and 2, we identify the minimal penalty with the mean of the empirical excess loss on each model:
$$\mathrm{pen_{min}}(M) = \mathbb{E}[P_n(Ks_M - Ks_n(M))], \qquad M \in \mathcal{M}_n.$$

Hold-out penalization
The conditions on the penalty given in Theorems 1 and 2 cannot be directly checked in practice. Indeed, they are expressed in terms of the mean of the empirical excess loss on each model, which is an unknown quantity in general.
Nevertheless, in the homoscedastic case, it is easy to see that Mallows' penalty is a nonasymptotic quasi-optimal penalty. According to Theorem 2, such a penalty is given by twice the mean of the empirical excess loss. Now, using Theorem 10 of [36] (with an explicit control of the second-order terms in the following equivalence), we get
$$\mathbb{E}[P_n(Ks_M - Ks_n(M))] \approx \frac{1}{n} \sum_{k=1}^{D_M} \mathbb{E}\left[\varphi_k^2(X)\,\sigma^2(X)\right],$$
where $(\varphi_k)_{1 \leq k \leq D_M}$ is an orthonormal basis of $(M, \|\cdot\|_2)$. By easy computations, we deduce that if the noise is homoscedastic, that is $\sigma^2(X) \equiv \sigma^2 > 0$, it holds
$$\mathbb{E}[P_n(Ks_M - Ks_n(M))] = \frac{\sigma^2 D_M}{n} + \text{(second-order term)}. \qquad (18)$$
The second term on the right of identity (18) being negligible for the models of interest under the conditions of Theorem 2 (thanks to Lemma 7 in [36], which implies that $\sum_{k=1}^{D_M} \varphi_k^2 \leq L D_M$ for some constant $L > 0$), we conclude that an asymptotically optimal penalty is given by $2\sigma^2 D_M/n$, which is Mallows' classical penalty.
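The approximation $\mathbb{E}[P_n(Ks_M - Ks_n(M))] \approx \sigma^2 D_M/n$ can be checked by a small Monte Carlo experiment. The sketch below is ours: it uses a histogram model that contains a piecewise constant target aligned with the partition, so that $s_M = s^*$ and the empirical excess loss is exactly the weighted sum of squared deviations of the bin means; the constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, D, sig = 2000, 20, 0.3
# Piecewise constant target, aligned with the D-piece regular partition,
# so the projection s_M equals s* and the bias is zero.
s_star = lambda x: np.where(x < 0.5, 0.0, 1.0)

vals = []
for _ in range(300):
    X = rng.uniform(0, 1, n)
    Y = s_star(X) + sig * rng.standard_normal(n)
    bins = np.minimum((X * D).astype(int), D - 1)
    # Empirical excess loss P_n(K s_M - K s_n(M)); since s_M = s* here,
    # it equals sum_k (n_k / n) * (mean(Y in bin k) - s*_k)^2.
    excess = 0.0
    for k in range(D):
        yk = Y[bins == k]
        xk = X[bins == k]
        if yk.size:
            excess += yk.size * (yk.mean() - s_star(xk).mean()) ** 2
    vals.append(excess / n)

mean_excess = float(np.mean(vals))     # should be close to sig^2 * D / n
```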
In the case where the noise level is homoscedastic but unknown, Mallows' penalty is known only up to a constant, the noise level, which can be estimated via the slope heuristics (for practical issues about the slope heuristics, see Baudry et al. [14]). But in the common situation where the noise level is sufficiently heteroscedastic, the shape of the ideal penalty is not linear in the dimension of the models, and not even a function of the linear dimensions. In such a case, Arlot [7] proved that any calibration of a linear penalty leads to a suboptimal procedure, although it can still achieve an oracle inequality with a leading constant greater than one.
In order to achieve a nearly optimal selection procedure in the general situation, it remains to estimate the ideal penalty or, thanks to the slope heuristics, the shape of the ideal penalty. This section is devoted to this task. We propose a hold-out type penalty that automatically adapts to heteroscedasticity. Let us now detail our hold-out penalization procedure.
The ideal penalty is defined by $\mathrm{pen_{id}}(M) = P(Ks_n(M)) - P_n(Ks_n(M))$. Let $(I_1, I_2)$ be a partition of $\{1, \ldots, n\}$ and consider the hold-out penalty
$$\mathrm{pen_{ho}}(M) = C \left( P_{n_2}(Ks_{n_1}(M)) - P_{n_1}(Ks_{n_1}(M)) \right),$$
where $P_{n_i} = n_i^{-1} \sum_{j \in I_i} \delta_{\xi_j}$, $n_i = \mathrm{Card}(I_i)$ for $i = 1, 2$, $s_{n_1}(M) \in \arg\min_{s \in M} P_{n_1}(Ks)$ and $C > 0$ is a constant to be determined. Indeed, if $n_1$ is not too small, $P_{n_1}(Ks_{n_1}(M))$ is likely to vary like $P_n(Ks_n(M))$, and $P_{n_2}(Ks_{n_1}(M))$ is, conditionally on $(\xi_j)_{j \in I_1}$, an unbiased estimate of $P(Ks_{n_1}(M))$, which again is likely to vary like $P(Ks_n(M))$. Moreover, we see from Theorem 10 in [36] that when the model M is fixed, the quantities $P_n(Ks_n(M))$ and $P(Ks_n(M))$ are almost inversely proportional to n, so a good constant in front of the hold-out penalty should be $C_{\mathrm{opt}} = n_1/n$. The previous observation is justified by the following theorem, where for the sake of clarity we fix $n_1 = n_2 = n/2$. For a more general version of Theorem 3, see Section 5.3. We select the model
$$\widehat{M} \in \arg\min_{M \in \mathcal{M}_n} \left\{ P_n(Ks_n(M)) + \mathrm{pen_{ho}}(M) \right\}.$$
Assume that, in addition, (Ap) holds (see Theorem 2). Then, for all $n \geq n_0((\mathrm{SAPP}), C_-, \beta_-, \eta)$, with probability at least $1 - A_6 n^{-2}$, the selected estimator $s_n(\widehat{M})$ satisfies a pathwise oracle inequality with leading constant converging to one as the sample size tends to infinity. Theorem 3 thus shows the asymptotic optimality of the hold-out penalization procedure for a half-and-half split of the data. This is a remarkable fact compared to the classical hold-out, defined by
$$\widehat{M}_{\mathrm{ho}} \in \arg\min_{M \in \mathcal{M}_n} P_{n_2}(Ks_{n_1}(M)). \qquad (22)$$
Indeed, the choice $n_1 = n/2$ in (22) is likely to lead to an asymptotically suboptimal procedure, as the criterion is close in expectation to $P(Ks_{n/2}(M))$, and so is close to the oracle, but for n/2 data points. The hold-out penalization allows us to overcome this difficulty. Arlot [5,6] described similar advantages for resampling and V-fold penalties.
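The hold-out penalization above can be sketched as follows, again with regular histogram models standing in for the collection; the split, models and constants are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = rng.uniform(0, 1, n)
sigma = lambda x: 0.1 + 0.4 * x        # hypothetical heteroscedastic noise level
Y = np.sin(2 * np.pi * X) + sigma(X) * rng.standard_normal(n)

# Half-and-half split: I1 for training, I2 for the hold-out estimate.
perm = rng.permutation(n)
I1, I2 = perm[: n // 2], perm[n // 2 :]
n1 = len(I1)

def hist_fit(Xtr, Ytr, D):
    """Least-squares estimator on the D-piece regular histogram model."""
    bins = np.minimum((Xtr * D).astype(int), D - 1)
    return np.array([Ytr[bins == k].mean() if np.any(bins == k) else 0.0
                     for k in range(D)])

def hist_risk(Xev, Yev, means, D):
    """Empirical risk of a fitted histogram on the evaluation sample."""
    bins = np.minimum((Xev * D).astype(int), D - 1)
    return float(np.mean((Yev - means[bins]) ** 2))

dims = [1, 2, 4, 8, 16, 32, 64]
crit = []
for D in dims:
    means1 = hist_fit(X[I1], Y[I1], D)              # s_{n1}(M)
    pen_ho = (n1 / n) * (hist_risk(X[I2], Y[I2], means1, D)
                         - hist_risk(X[I1], Y[I1], means1, D))
    means = hist_fit(X, Y, D)                       # s_n(M) on the full sample
    crit.append(hist_risk(X, Y, means, D) + pen_ho) # P_n(K s_n(M)) + pen_ho(M)
D_hat = dims[int(np.argmin(crit))]
```

Note that, unlike the classical hold-out (22), the final estimator is refitted on the full sample; only the penalty is computed from the split.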
Notice also that the random hold-out penalty proposed by Arlot [6] is proportional to the mean over the splits of our hold-out penalty, thus providing a "stabilization effect" in practice. This should bring some improvement compared to our single split, at the price of an increased computational cost. However, the stabilization effect seems more difficult to study mathematically, and our results provide a first step toward the study of the more complicated resampling penalties.
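To illustrate this stabilization effect, one may compare a single draw of the hold-out penalty with its mean over several random splits, as in Arlot's random hold-out penalty. The toy sketch below (our own construction, with an illustrative histogram model of fixed dimension) is again not the paper's procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, V = 400, 8, 25  # sample size, model dimension, number of splits (illustrative)
X = rng.uniform(0.0, 1.0, n)
Y = np.sin(2 * np.pi * X) + (0.2 + 0.8 * X) * rng.normal(size=n)

def pen_split(I1, I2):
    """(P_n2 - P_n1)(K s_n1(M)) for the regular histogram model with D bins,
    the estimator s_n1(M) being fitted on the first half of the split."""
    b = np.minimum((X * D).astype(int), D - 1)
    means = np.array([Y[I1][b[I1] == k].mean() if (b[I1] == k).any() else 0.0
                      for k in range(D)])
    r1 = np.mean((Y[I1] - means[b[I1]]) ** 2)  # P_n1(K s_n1(M))
    r2 = np.mean((Y[I2] - means[b[I2]]) ** 2)  # P_n2(K s_n1(M))
    return r2 - r1

# A single split versus the average over V independent splits: the latter is
# proportional to the random hold-out penalty, and is less split-dependent.
pens = []
for _ in range(V):
    perm = rng.permutation(n)
    pens.append(pen_split(perm[: n // 2], perm[n // 2 :]))
print("single split:", pens[0], " averaged:", float(np.mean(pens)))
```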

Proofs
We first present in Section 5.1 some "structural" properties of models, denoted (GSA), that are sufficient for our needs and that are satisfied by the models of piecewise polynomial functions considered in (SAPP). Then, in Sections 5.2 and 5.3 respectively, we prove the results stated in Sections 3.3 and 4 under (GSA) instead of (SAPP).

A more general setting
Notice that the covariate space X is general in (GSA). Let us explain how assumptions (Ab'), (Ad_Leb), (Aud) and (Alr) of (SAPP) allow us to recover (Ab), (Alb) and (Ac_∞) of (GSA) in the special case of models of piecewise polynomial functions. Assumption (Ab') only differs from (Ab) by the fact that the projections of the target onto the models are uniformly bounded in sup-norm. In the general case this is not guaranteed, but for piecewise polynomial functions of uniformly bounded degrees it follows from simple computations (see Section 5.3 in [36]). Next, assumption (Alb) requires the existence of a localized orthonormal basis for each model. In the case of piecewise polynomial functions, this is ensured by (Ad_Leb), (Aud) and (Alr); see Lemma 8 of [36]. Finally, assumption (Ac_∞) states the consistency of each estimator in sup-norm. Again, this is satisfied for models of piecewise polynomial functions under assumptions (Ad_Leb), (Aud) and (Alr); this result is established in Lemma 9 of [36].
Let us now describe a set of assumptions, less restrictive than (SAPP), that allows us to recover (GSA) when considering histogram models. Lemmas 5 and 6 of [36] allow us to recover (GSA) from (SAH) for histogram models.

Set of assumptions for histogram models: (SAH)
Given a linear histogram model M ∈ M_n, we denote by P_M the associated partition of X. Theorems 1 and 2 would also be valid when replacing the set of assumptions (SAPP) by (SAH). This would lead to an (almost exact) recovery of the assumptions and results described in Theorems 2 and 3 of [9], concerning the selection of least-squares estimators among histogram models.

Proofs related to Section 3.3
The following remark will be useful.
we have, for all n ≥ n_0(A_{M,+}, A, A_cons, n_1, r_M, σ_min, α_M), the corresponding bounds, where K_2 is an orthonormal basis of (M, ‖·‖_2). Moreover, for all M ∈ M_n, we have by Theorem 3 of [36], for a positive constant A_u depending on A, A_cons, r_M and α_M, and for all n ≥ n_0(A_cons, n_1), the analogous controls. Two technical lemmas are needed. In the first lemma, we evaluate the minimal penalty E[P_n(K s_M − K s_n(M))] for models of dimension not too small.
Proof. By Bernstein's inequality we have, for all x > 0, the corresponding deviation bound; taking x = α ln n then gives the first part of Lemma 5, for A_d given in (40). Now, noticing that 2√(ab) ≤ aη + bη^{−1} for all η > 0, and using this in (41) with a = ℓ(s*, s_M), b = 4A²α(ln n)/n and η = D, we obtain the second part.

Then, for a model M ∈ M_n such that A_{M,+}(ln n)² ≤ D_M, we apply Lemma 4 and, by (29), it holds for all n ≥ n_0(A_{M,+}, A, A_cons, n_1, r_M, σ_min, α_M), up to a factor 1 − L_{A_{M,−},A,σ_min,r_M,α_M} ε_n²(M). As M̂ minimizes crit′ over M_n, it is therefore sufficient, by (47), to control pen(M) − pen′_id(M), or equivalently crit′(M), in terms of the excess loss ℓ(s*, s_n(M)), for every M ∈ M_n, in order to derive oracle inequalities.

Let Ω_n be the event on which:
• For all models M ∈ M_n of dimension D_M such that A_{M,+}(ln n)³ ≤ D_M, (12) and the controls below hold.
By (25), (26), (27) and (28) in Remark 1, Lemma 4, Lemma 5 applied with α = 2 + α_M, and since (12) holds with probability at least 1 − A_p n^{−2}, we get that Ω_n has large probability for all n ≥ n_0((GSA)).

Control on the criterion crit′ for models of dimension not too small: We consider models M ∈ M_n such that A_{M,+}(ln n)³ ≤ D_M. Notice that (50) implies, by (24), the corresponding control. Now, using (P2) in (24) gives that, for all models M ∈ M_n such that A_{M,+}(ln n)³ ≤ D_M and for all n ≥ n_0((GSA)), 0 < L_(GSA) ε_n(M) ≤ 1/2. As ℓ(s*, s_n(M)) = ℓ(s*, s_M) + p_1(M), we thus have on Ω_n, for all n ≥ n_0((GSA)), the desired comparison. Hence, using (56) in (55), we have on Ω_n, for all models M ∈ M_n such that A_{M,+}(ln n)³ ≤ D_M and for all n ≥ n_0((GSA)), the corresponding bound. Consequently, for all such models and for all n ≥ n_0((GSA)), it holds on Ω_n, using (47), the announced control of crit′.

Control on the criterion crit′ for models of small dimension: We consider models M ∈ M_n such that D_M ≤ A_{M,+}(ln n)³.
By (13) and (50), we have on Ω_n, for all n ≥ n_0((GSA), η), and by (12), the corresponding controls. Consequently, we have on Ω_n, for all n ≥ n_0((GSA), η), the resulting bound. To conclude, notice that the upper bound (86) is smaller than the lower bound given in (83) for all n ≥ n_0((GSA), η, δ). Hence, points 2 and 3 above yield inequality (75). Moreover, the upper bound (86) is smaller than the lower bounds given in (80), derived by using (Ap), and in (83), for all n ≥ n_0((GSA), C_−, β_−, η, δ). This gives (76), and Lemma 6 is proved.
Proof of Theorem 1. As in the proof of Theorem 2, we consider the event Ω′_n, of probability at least 1 − L_{c_M,A_p} n^{−2} for all n ≥ n_0((GSA)), on which:
• (10) holds;
• for all models M ∈ M_n of dimension D_M such that A_{M,+}(ln n)² ≤ D_M, the corresponding controls hold;
• the analogous controls hold for all models M ∈ M_n with D_M ≤ A_{M,+}(ln n)²;
• for every M ∈ M_n,
δ(M) ≤ L_(GSA) ( √( ℓ(s*, s_M) (ln n)/n ) + (ln n)/n ). (93)
It then holds, for all n ≥ n_0((GSA), c, η), with probability at least 1 − A_6 n^{−2}, a control of the excess loss ℓ(s*, s_{n_1}(M̂)). Assume that in addition (Ap) holds (see Theorem 2). Then it holds for all n ≥ n_0((GSA), C_−, β_−, η, c), with probability at least 1 − A_6 n^{−2}, the corresponding oracle inequality.

Proof. By Bernstein's inequality (see Corollary 2.10 in [34]), applied to the sum of the (s_{n_1}(M))(ξ_i) conditionally on (ξ_j)_{j∈I_1}, we get a deviation bound valid for all x > 0.

Remark 2. It is easy to see that, by using the assumption of consistency in sup-norm for a fixed model, stated as (H5) in [36], instead of (Ac_∞), and by using Theorem 4 of [36] instead of inequality (27), the results established in Lemma 9 remain valid with probability bounds proportional to n^{−α}, for any α > 0 (in Lemma 9 we only derive the case α = 2 + α_M, for convenience).
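For the reader's convenience, we recall a standard form of Bernstein's inequality, close to Corollary 2.10 in [34] (our notation; the constants may differ slightly from the exact statement cited):

```latex
% Bernstein's inequality. Let X_1, ..., X_n be independent real random
% variables with |X_i| <= b almost surely, and set
%   S = \sum_{i=1}^n (X_i - \mathbb{E}[X_i]), \qquad
%   v = \sum_{i=1}^n \mathbb{E}[X_i^2].
% Then, for every x > 0,
\[
  \mathbb{P}\Big( S \ge \sqrt{2 v x} + \frac{b x}{3} \Big) \le e^{-x} .
\]
```

Taking x = α ln n, as in the proofs above, turns this bound into an event of probability at least 1 − n^{−α}.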
Proof of Theorem 8. We set pen_0(M) = pen_ho(M) − (n_1/n)(P_{n_2}(Ks*) − P_{n_1}(Ks*)). It is worth noting that P_{n_2}(Ks*) − P_{n_1}(Ks*) is a quantity independent of M as M varies in M_n. Hence, the procedure defined by pen_0 selects the same model as the hold-out procedure defined by pen_ho, and it will be convenient for our analysis to consider pen_0 instead of pen_ho. As a matter of fact, we derive Theorem 8 as a corollary of Theorem 2 applied with pen ≡ pen_0, through the use of Lemma 9. We
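The first step above relies only on the elementary fact that adding the same M-independent quantity to every criterion value leaves the minimizer unchanged; a toy numerical check (the criterion values are hypothetical):

```python
import numpy as np

# Hypothetical penalized criterion values over a collection of four models.
crit = np.array([0.90, 0.55, 0.70, 0.61])

# Shifting every value by the same M-independent constant, playing the role of
# -(n1/n) * (P_n2(Ks*) - P_n1(Ks*)), does not change the selected model.
shift = -0.173
assert int(np.argmin(crit)) == int(np.argmin(crit + shift))
print("selected model index:", int(np.argmin(crit)))
```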