On the prediction loss of the lasso in the partially labeled setting

In this paper we revisit the risk bounds of the lasso estimator in the context of transductive and semi-supervised learning. In other words, the setting under consideration is that of regression with random design under partial labeling. The main goal is to obtain user-friendly bounds on the off-sample prediction risk. To this end, the simple setting of a bounded response variable and bounded (high-dimensional) covariates is considered. We propose some new adaptations of the lasso to these settings and establish oracle inequalities both in expectation and in deviation. These results provide non-asymptotic upper bounds on the risk that highlight the interplay between the bias due to the mis-specification of the linear model, the bias due to the approximate sparsity and the variance. They also demonstrate that the presence of a large number of unlabeled features may have a significant positive impact in situations where the restricted eigenvalue of the design matrix vanishes or is very small.


INTRODUCTION
We consider the problem of prediction under the quadratic loss. That is, for a random feature-label pair $(X, Y)$ drawn from a distribution $P$ on a product space $\mathcal{X} \times \mathcal{Y}$, we aim at predicting $Y$ as a function of $X$. The goal is to find a measurable function $f : \mathcal{X} \to \mathcal{Y}$ such that the expected quadratic risk, $R(f) = \mathbf{E}[(Y - f(X))^2]$, is as small as possible. When $\mathcal{Y}$ is an interval of $\mathbb{R}$ and $\mathcal{X}$ is a measurable set in $\mathbb{R}^p$ (which is the setting considered in the present work), the Bayes predictor, defined as the minimizer of $R(f)$ over all measurable functions $f : \mathcal{X} \to \mathcal{Y}$, is the regression function (Vapnik, 1998) $f^\star(x) = \mathbf{E}[Y \mid X = x]$. Using $f^\star$, the problem can be rewritten in a form which is more familiar in Statistics, namely $Y = f^\star(X) + \xi$, where the noise variable $\xi$ satisfies $\mathbf{E}[\xi \mid X] = 0$, $P_X$-almost surely. In the present work, we tackle the prediction problem in the case where the available data $\mathcal{D}_{\mathrm{all}}$ is of the form $\mathcal{D}_{\mathrm{all}} = \mathcal{D}_{\mathrm{labeled}} \cup \mathcal{D}_{\mathrm{unlabeled}}$, where $\mathcal{D}_{\mathrm{labeled}} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ and $\mathcal{D}_{\mathrm{unlabeled}} = \{X_{n+1}, \ldots, X_N\}$.
3 avenue Pierre Larousse, 92245 Malakoff, France; 26 Shabolovka street, Laboratory of Stochastic Analysis and its Applications, Moscow, Russian Federation.
The labeled sample $\mathcal{D}_{\mathrm{labeled}}$ is composed of independent and identically distributed (i.i.d.) feature-label pairs with distribution $P$. The unlabeled sample $\mathcal{D}_{\mathrm{unlabeled}}$ contains only i.i.d. features, with distribution $P_X$, and is independent of $\mathcal{D}_{\mathrm{labeled}}$. This formal setting accounts for a number of realistic situations in which the labeling process is costly while the unlabeled data points are available in abundance (see, for instance, Balcan et al., 2005; Guillaumin et al., 2010; Brouard et al., 2011), that is, $n$ may be quite small compared to $N$. Here, the baseline idea is to build upon the sample $\mathcal{D}_{\mathrm{unlabeled}}$ to improve the supervised prediction process based on $\mathcal{D}_{\mathrm{labeled}}$ alone. In this context, our study encompasses two closely related settings: semi-supervised learning and transductive learning.
In the semi-supervised learning setting, one aims at constructing a predictor $\hat f$, based on the data $\mathcal{D}_{\mathrm{all}}$, such that the excess risk $\mathcal{E}(\hat f) = R(\hat f) - R(f^\star) = \int_{\mathcal{X}} (\hat f(x) - f^\star(x))^2 \, P_X(dx)$ (1) is as small as possible. This learning framework differs from classical supervised learning only in that the data set is enriched by the unlabeled features.
In contrast with this, the goal of transductive learning is to predict solely the labels of the observed unlabeled features. This amounts to considering the same setting as above but to measure the quality of a prediction function $\hat f$ by the excess risk $\mathcal{E}_{\mathrm{TL}}(\hat f) = \frac{1}{N-n} \sum_{i=n+1}^{N} (\hat f(X_i) - f^\star(X_i))^2$ (2). We refer the reader to (Chapelle et al., 2006; Zhu, 2008) and the references therein for a comprehensive survey on the topic of semi-supervised and transductive learning. Theoretical analysis of the generalisation error and the excess risk in this context can be found in (Rigollet, 2007; Wang and Shen, 2007; Lafferty and Wasserman, 2007), whereas the closely related area of manifold learning is studied in (Belkin et al., 2006; Nadler et al., 2009; Niyogi, 2013). The purpose of the present work differs from these papers in that we put the emphasis on the high-dimensional setting and the sparsity assumption. The goal is to understand whether the unlabeled data can help in predicting the unknown labels using the $\ell_1$-penalized empirical risk minimizers. From another perspective, that of multi-view learning, the problem of sparse semi-supervised learning is investigated in (Sun and Shawe-Taylor, 2010).
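To make the transductive criterion concrete, here is a minimal numerical sketch. It assumes that the transductive excess risk is the average squared gap between a predictor and the regression function over the unlabeled features; the function name `transductive_excess_risk` and the toy vectors are hypothetical and serve only as an illustration. In practice $f^\star$ is unknown, so this quantity can only be evaluated in simulations.

```python
import numpy as np

def transductive_excess_risk(predict, f_star, X_unlab):
    """Average squared gap between predict and f_star over the unlabeled rows.

    predict, f_star: callables mapping an (m, p) array to an (m,) array.
    f_star is unknown in practice; it is available only in simulations.
    """
    diff = predict(X_unlab) - f_star(X_unlab)
    return float(np.mean(diff ** 2))

# Toy usage with a linear predictor and a linear regression function.
rng = np.random.default_rng(0)
X_unlab = rng.uniform(-1.0, 1.0, size=(50, 3))      # bounded features
beta = np.array([1.0, 0.0, -0.5])                   # hypothetical estimate
beta_star = np.array([1.2, 0.0, -0.4])              # hypothetical truth
risk = transductive_excess_risk(lambda X: X @ beta,
                                lambda X: X @ beta_star,
                                X_unlab)
```

For linear predictors this number equals a quadratic form in the empirical second-moment matrix of the unlabeled features, which is the reason the transductive setting is technically convenient.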
When the feature vector is high dimensional, it is reasonable to consider prediction strategies based on "simple" functions $f$ in order to limit the computational cost. A widely used approach is then to look for a good linear predictor $f_\beta(x) = x^\top \beta$. When the dimension $p$ is of the same order as (or larger than) the size $n$ of the labeled sample, the simple empirical risk minimizer (i.e., the least squares estimator) is a poor predictor since it suffers from the curse of dimensionality. To circumvent this shortcoming, one popular approach is to use the $\ell_1$-penalised empirical risk minimizer, also known as the lasso estimator (Tibshirani, 1996): where $\lambda > 0$ stands for a tuning parameter and Statistical properties of the lasso with regard to the prediction error were studied in many papers, the most relevant (to our purposes) of which will be discussed in the next section. We also refer the reader to (Bühlmann and van de Geer, 2011) for an overview of related topics. The rationale behind this approach is that (a) the term ] is an unbiased estimator of the excess risk $\mathcal{E}(f_\beta)$ and (b) the $\ell_1$-penalty term favors predictors $f_\beta$ defined via a (nearly) sparse vector $\beta$.
The prediction rules we are going to analyze in the present work are suitable adaptations of the (supervised) lasso to the semi-supervised and the transductive settings. More precisely, we consider the estimator where $\lambda > 0$ and $A \in \mathbb{R}^{p \times p}$ are parameters to be chosen by the statistician. This definition is based on the following observation. The unlabeled sample may be used to get an improved estimator of the excess risk is the $p \times p$ covariance matrix. Indeed, the population covariance matrix can be estimated using both labeled and unlabeled data. A similar observation holds for the transductive excess risk $\mathcal{E}_{\mathrm{TL}}(f_\beta)$.
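Denoting by $\Sigma_{\mathrm{lab}}$ the empirical covariance matrix based on the labeled sample, that is $\Sigma_{\mathrm{lab}} = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^\top$, one checks that the vector $\hat\beta$ coincides with the lasso estimator (3) when $A = \Sigma_{\mathrm{lab}}^{1/2}$. If an unlabeled sample is available, the foregoing discussion suggests a different choice for the matrix $A$. This choice depends on the setting under consideration. Namely, defining the matrices $\Sigma_{\mathrm{all}} = \frac{1}{N} \sum_{i=1}^{N} X_i X_i^\top$ and $\Sigma_{\mathrm{unlab}} = \frac{1}{N-n} \sum_{i=n+1}^{N} X_i X_i^\top$, we choose $A = \Sigma_{\mathrm{all}}^{1/2}$ and $A = \Sigma_{\mathrm{unlab}}^{1/2}$ in the semi-supervised and transductive settings, respectively.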
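The three choices of $A$ can be sketched in code as follows. This is only an illustration under stated assumptions: the objective is taken to be $\|A\beta\|_2^2 - \frac{2}{n} Y^\top X_{\mathrm{lab}} \beta + \lambda \|\beta\|_1$ (the scaling of the penalty is an assumption of the sketch), it is minimized by plain proximal gradient descent, and the names `partially_labeled_lasso`, `second_moment` and `soft_threshold` are hypothetical. Note that the objective depends on $A$ only through $A^\top A$, so no matrix square root needs to be computed.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def second_moment(X):
    """Uncentered empirical covariance (1/m) * X^T X of the rows of X."""
    return X.T @ X / X.shape[0]

def partially_labeled_lasso(X_lab, y, X_unlab, lam, setting="semi", n_iter=5000):
    """Proximal-gradient sketch for
        min_b  b^T S b - (2/n) y^T X_lab b + lam * ||b||_1,   with S = A^T A.
    The penalty scaling (lam rather than a multiple of it) is an assumption.
    """
    n, p = X_lab.shape
    if setting == "supervised":        # A = Sigma_lab^{1/2}: usual lasso
        S = second_moment(X_lab)
    elif setting == "semi":            # A = Sigma_all^{1/2}
        S = second_moment(np.vstack([X_lab, X_unlab]))
    elif setting == "transductive":    # A = Sigma_unlab^{1/2}
        S = second_moment(X_unlab)
    else:
        raise ValueError(f"unknown setting: {setting!r}")
    z = X_lab.T @ y / n                      # linear term (1/n) X_lab^T y
    lip = 2.0 * np.linalg.norm(S, 2)         # Lipschitz constant of the gradient
    b = np.zeros(p)
    if lip == 0.0:                           # degenerate case: all features are zero
        return b
    for _ in range(n_iter):
        grad = 2.0 * (S @ b - z)             # gradient of the smooth part
        b = soft_threshold(b - grad / lip, lam / lip)
    return b
```

With `setting="supervised"` the quadratic part equals $\frac{1}{n}\|Y - X_{\mathrm{lab}}\beta\|_2^2$ up to a term that does not depend on $\beta$, which is the sense in which the generalized estimator reduces to the ordinary lasso.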
The following two assumptions made on the probability distribution P will be repeatedly used throughout this work.
(A1) The random variables $Y$ and $X$ have zero mean and finite variance. Furthermore, all the coordinates $X^j$ of the random vector $X$ satisfy $\mathbf{E}[(X^j)^2] = 1$.
(A2) The random variables $Y$ and $X^j$ are almost surely bounded. That is, there exist constants $B_Y$ and $B_X$ such that $|Y| \le B_Y$ and $\max_{j \in [p]} |X^j| \le B_X$ almost surely.
Assumption (A1) is fairly mild, since one can get close to it by centering and scaling the observed labels and features. For features, the centering and the scaling may be performed using the sample mean and the sample variance computed over the whole data set. It is however important to require this assumption, since its violation may seriously affect the quality of the $\ell_1$-penalized least-squares estimator $\hat\beta$, unless the terms $|\beta_j|$ of the $\ell_1$-norm are weighted according to the magnitude of the corresponding feature $X^j$. The second assumption is less crucial both for practical and theoretical purposes, given that its primary aim is to allow for user-friendly, easy-to-interpret theoretical guarantees. In most situations, even if assumption (A2) is violated, the predictor $f_{\hat\beta}$ does have a fairly small prediction error rate.
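The preprocessing alluded to above can be sketched as follows. The function name `standardize_partially_labeled` is hypothetical, and centering the labels by their sample mean without rescaling them is a choice made for this sketch (Assumption (A1) only asks the labels to have zero mean and finite variance).

```python
import numpy as np

def standardize_partially_labeled(X_lab, y, X_unlab):
    """Center and scale the features with statistics of the whole data set.

    Returns standardized copies of X_lab and X_unlab and centered labels.
    """
    X_all = np.vstack([X_lab, X_unlab])
    mean = X_all.mean(axis=0)                  # sample mean over all N features
    scale = X_all.std(axis=0)                  # sample standard deviation
    scale = np.where(scale > 0, scale, 1.0)    # leave constant columns unscaled
    return (X_lab - mean) / scale, y - y.mean(), (X_unlab - mean) / scale
```

After this step the stacked features have empirical mean zero and empirical second moment one in every coordinate, which is the empirical counterpart of (A1).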
The main contributions of the present work are:
• Review of the relevant recent literature on the off-sample performance of the lasso in the prediction problem.
• Non-asymptotic bounds for the prediction error of the lasso in the semi-supervised and transductive settings that guarantee the fast rate under the restricted eigenvalue condition. We have made an effort to keep the results easy to understand and to obtain small constants. These results are simple enough to be taught to graduate students.
• Oracle inequalities in expectation for the prediction error of the lasso. To the best of our knowledge, such results were not available in the literature until the very recent paper (Bellec et al., 2016).
To give a foretaste of the results detailed in the rest of this work, let us state and briefly discuss a risk bound in the semi-supervised setting (the complete form of the result is provided in Theorem 7). For a matrix $A$, we denote by $\|A\|$ its largest singular value and by $\kappa_A$ the compatibility constant (see Section 2 for a precise definition).
Theorem. Let assumption (A1) be fulfilled and let the random variables $Y$, $X^j$ be bounded in absolute value by 1. For a prescribed tolerance level $\delta \in (0, 1)$, assume that the overall sample size $N$ and the tuning parameter $\lambda$ satisfy $N \ge 18 p \|\Sigma^{-1}\| \log(3p/\delta)$ and Then, for every $J \subseteq \{1, \ldots, p\}$, with probability at least $1 - \delta$, the estimator $\hat\beta$ defined in (4)
This result follows in the footsteps of many recent papers such as (Koltchinskii et al., 2011; Sun and Zhang, 2012; Dalalyan et al., 2014) among others. The term oracle inequality refers to the fact that it allows us to compare the excess risk of the predictor $f_{\hat\beta}$ to that of the best possible nearly sparse prediction function. (By nearly sparse we understand here a vector $\beta$ such that for a set $J \subseteq \{1, \ldots, p\}$ of small cardinality the entries of $\beta$ with indices in $J^c$ have small magnitude; that is, $\|\beta_{J^c}\|_1 = \sum_{j \notin J} |\beta_j|$ is small.) Indeed, if we denote by $\bar\beta$ a nearly $s$-sparse vector in $\mathbb{R}^p$ such that the excess risk $\mathcal{E}(f_{\bar\beta})$ is small, then the aforestated risk bound is the sum of three terms having a clear interpretation. The first term, $\mathcal{E}(f_{\bar\beta})$, is a bias term due to the $s$-sparse linear approximation. The second term, $\lambda \|\bar\beta_{J^c}\|_1$, is the bias due to approximate $s$-sparsity. (Note that it vanishes if $\bar\beta$ is exactly $s$-sparse and $J$ is taken as its support.) Finally, the third term measures the magnitude of the stochastic error. Assuming the compatibility constant to be bounded away from 0, this last term is of the order $s \log(p)/n$, which is known to be optimal (see footnote 3) over all possible estimators (Ye and Zhang, 2010; Raskutti et al., 2011; Rigollet and Tsybakov, 2011, 2012). Inequality (5) readily shows the advantage of using the unlabeled data: the compatibility constant involved in the last term of the right hand side is computed for the overall covariance matrix. When the size of the labeled sample is small relative to the dimension $p$, the corresponding constant computed for $\Sigma_{\mathrm{lab}}$ may be very close (and even equal) to zero. This may downgrade the fast rate of the original lasso to the slow rate $\|\bar\beta\|_1 / \sqrt{n}$. Instead, if a large number of unlabeled features are used, it becomes more plausible to assume that the compatibility constant is bounded away from zero. In relation with this, it is important to underline that the unlabeled sample cannot help to improve the fast rate of convergence of the lasso, $s \log(p)/n$, which is optimal in the minimax sense. The best we can hope to achieve using the unlabeled sample is the relaxation of the conditions guaranteeing the fast rate. Another worthwhile remark is that the theorem stated above is valid when the size of the unlabeled sample is significantly larger than the dimension $p$. Interestingly, this condition is not required for getting the analogous result in the transductive set-up.
The rest of the paper is organized as follows. In Section 2, we introduce the notations used throughout the paper. Section 3 contains a review of the relevant literature and discusses the relation of the previous work with our results. Section 4 presents risk bounds for the prediction error of the lasso in the transductive setting, whereas Section 5 is devoted to the analogous results in the semi-supervised setting. Conclusions are drawn in Section 6. The proofs are postponed to Section 7.

NOTATIONS
In the sequel, for any integer $k$ we denote by $[k]$ the set $\{1, \ldots, k\}$. For any $q \in [1, +\infty]$ the notation $\|v\|_q$ refers to the $\ell_q$-norm of a vector $v$ belonging to a Euclidean space $\mathbb{R}^k$ with arbitrary dimension $k$. Since there is no risk of confusion, we omit the dependence on $k$ in the notation. For any square matrix $A \in \mathbb{R}^{p \times p}$ we denote by $A^+$ its Moore-Penrose pseudoinverse and by $\|A\|$ its spectral norm defined by $\|A\| = \sup_{v : \|v\|_2 = 1} \|Av\|_2$. We use boldface italic letters for vectors and boldface letters for matrices. Throughout the manuscript, the index $j$ will be used for referring to the $p$ features, whereas the index $i$ will refer to the observations ($i \in [n]$ or $i \in [N]$). For any set of indices $J \subseteq [p]$ and any $\beta = (\beta_1, \ldots, \beta_p)^\top \in \mathbb{R}^p$, we define $\beta_J$ as the $p$-dimensional vector whose $j$-th coordinate equals $\beta_j$ if $j \in J$ and 0 otherwise. We denote the cardinality of any $J \subseteq [p]$ by $|J|$. Also, we set $\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\}$. In particular, whenever and $c > 0$, we introduce the compatibility constants One easily checks that these two constants are of the same order of magnitude in the sense that for every c, c > 0. These constants are slightly larger than the restricted eigenvalues (Bickel et al., 2009) defined by For more details, we refer the reader to van de Geer and Bühlmann (2009).
3. More precisely, the optimal rate is $s \log(1 + p/s)/n$, which is of the same order as $s \log(p)/n$ for most values of $s$.
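For readers who prefer code to notation, the basic objects of this section can be written with NumPy as in the sketch below. The helper names `restrict`, `support` and `spectral_norm` are hypothetical, and indices are 0-based in the code whereas they are 1-based in the text.

```python
import numpy as np

def restrict(beta, J):
    """Return beta_J: entries with index in J are kept, all others set to 0."""
    out = np.zeros_like(beta)
    idx = np.array(sorted(J), dtype=int)
    out[idx] = beta[idx]
    return out

def support(beta):
    """Return supp(beta), the set of indices of the nonzero entries."""
    return {j for j, b in enumerate(beta) if b != 0}

def spectral_norm(A):
    """Largest singular value of the matrix A."""
    return float(np.linalg.norm(A, 2))

beta = np.array([1.0, 0.0, -2.0])
beta_J = restrict(beta, {0, 1})                 # keeps the first two coordinates
l1_off_J = np.linalg.norm(beta - beta_J, 1)     # ||beta_{J^c}||_1
A_pinv = np.linalg.pinv(np.diag([2.0, 0.0]))    # Moore-Penrose pseudoinverse
```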

BRIEF OVERVIEW OF RELATED WORK
The material of this paper stands on the shoulders of giants and this section aims at providing a unified overview of some of the most relevant results in our setting, without having the ambition of being exhaustive. For each of the selected papers, we will discuss its strengths and limitations in relation with the results presented further in this work. Some recent results, obtained in the context of matrix regression, can be specialized to our problem and should be put in perspective with our contribution. For instance, a large part of Chapter 9 in (Koltchinskii, 2011) is devoted to the problem of assessing the off-sample excess risk of the trace-norm penalized empirical risk minimizer in the setting of trace regression with random design. One can arguably consider that setting as an extension of the random design regression problem by restricting attention to the set of diagonal matrices. Then the estimator studied in Koltchinskii (2011) coincides with the lasso estimator (3). With our notations, the main result of Chapter 9 in (Koltchinskii, 2011) reads as follows.
Theorem 1 (Theorem 9.3 in Koltchinskii, 2011). Assume that Assumptions (A1) and (A2) hold. Then there exist universal positive constants $c_1$ and $c_2$ such that, if for some $\delta \in (0, 1)$, the estimator (3) satisfies, with probability larger than $1 - \delta$, where
This result can be briefly compared to the risk bound in (5). The main advantages of this result are that (a) it is established under much weaker assumptions on the boundedness of the random variables $X$ and $Y$ than those of Assumption (A2), (b) it holds not only for the vector regression but also for matrix regression, (c) it contains no restriction on the sample size and (d) it involves the compatibility constant of the population covariance matrix $\Sigma$. On the negative side, the oracle inequality in Theorem 1 is not sharp since the factor in front of $\mathcal{E}(f_\beta)$ is not equal to one and, more importantly, the rate of convergence of the remainder term is sub-optimal in most situations. Indeed, if the best linear predictor corresponds to an $s$-sparse vector the nonzero entries of which are of the same order, then the term $\|\beta\|_1^2 \log(k/\delta) \log(n)/n$, present in the right hand side, is of order $s^2 \log(n) \log\log(n + p)/n$, whereas the remainder term in (5) is of smaller order $s \log(p)/n$. On a related note, Koltchinskii et al. (2011) establish sharp oracle inequalities for the trace-norm penalized least-squares estimator in the problem of matrix estimation and completion under low rank assumption. Using our notation, Theorem 2 in (Koltchinskii et al., 2011) yields the following result.
The original result (Koltchinskii et al., 2011, Theorem 2) is slightly different from the aforestated one. In particular, it is expressed in terms of the restricted eigenvalue constant with respect to the population covariance matrix $\Sigma$. However, all these differences imply only minor modifications in the proofs. Theorem 2 is very similar to the risk bounds that we establish in the present work, but has the obvious shortcoming of requiring the covariance matrix $\Sigma$ to be known. In fact, this corresponds to the situation in which infinitely many unlabeled feature vectors $X_{n+1}, X_{n+2}, \ldots$ are available, that is $N = +\infty$. To some extent, one of the purposes of the present work is to provide risk bounds analogous to the result of Theorem 2 but valid for a broad range of values of $N$. Note that the choice of the tuning parameter $\lambda$ advocated by all the aforementioned results is of the same order of magnitude.
To the best of our knowledge, the only paper establishing risk bounds for a transductive version of the lasso is (Alquier and Hebiri, 2012). In that paper, the authors considered the problem of transductive learning in a linear model $Y = X^\top \beta^\star + \xi$ under the sparsity constraint. The estimator they studied is slightly different from ours and is defined by For the predictor $f_{\hat\beta}$ based on this estimator, the authors established the following risk bound.
Theorem 3 (Theorems 4.3 and 4.4 in Alquier and Hebiri, 2012). Assume that for some . Let $E_1$ be the event "all the unlabeled features $\{X_{n+i} : i \in [N - n]\}$ belong to the linear span of the labeled features $\{X_i : i \in [n]\}$" and let $\delta \in (0, 1)$. Denote by $a_{n,N,p}$ the harmonic mean of the diagonal entries of the matrix $\Sigma_{\mathrm{unlab}} \Sigma_{\mathrm{lab}}^{+} \Sigma_{\mathrm{unlab}}$. Then the estimator (6) with
This result is close in spirit to the result that we establish in this work in the setting of transductive learning. Note however that there are three main differences. First, we do not confine our study to the well-specified situation in which the Bayes predictor is linear, $f^\star(x) = x^\top \beta^\star$ for every $x \in \mathbb{R}^p$, with a sparse vector $\beta^\star$. Second, we avoid the unpleasant restriction that the unlabeled features are linear combinations of labeled features. Third, we replace the factor $a_{n,N,p}$, which may be quite large, by a more tractable quantity. This being said, the result of Alquier and Hebiri (2012), in contrast with our results, does not require the unlabeled features to be drawn from the same distribution as the labeled features.
We also review a recent result from (Lecué and Mendelson, 2016). In that paper, the authors consider the isotropic case $\Sigma = I_p$, where $I_p$ stands for the $p \times p$ identity matrix, but impose only weak assumptions on the moments of the noise. Translated to our notations, their result can be formulated as follows.
Theorem 4 (Theorem 1.3 in Lecué and Mendelson, 2016). Let Assumption (A2) be satisfied and let $\Sigma = I_p$. Let $f_{\bar\beta}$ be the best linear approximation in $L_2(P_X)$ of the regression function $f^\star$, that is $\bar\beta \in \arg\min_{\beta \in \mathbb{R}^p} \mathcal{E}(f_\beta)$. Let $\delta \in (0, 1)$ be a prescribed tolerance level. There are three constants $c_1(\delta)$, $c_2(\delta, B_X)$ and $c_3(\delta, B_X)$ such that, if $\bar\beta$ is nearly $s$-sparse in the sense that 1/2 , then with probability at least $1 - \delta$ the lasso estimator satisfies
The principal strength of this result is that it is valid under a very weak assumption on the tails of the noise, but it has the shortcoming of requiring the minimizer of the excess risk to be nearly $s$-sparse with a quite precise upper bound on the authorized non-sparsity bias. From this point of view, an upper bound of the form (5) provides more information on the robustness of the prediction rule with respect to the model mis-specification.
The proofs of the results above assess the off-sample prediction error rate of the lasso by using direct arguments. An alternative approach (adopted, for example, in Raskutti et al., 2010; Koltchinskii, 2011; Oliveira, 2013; Rudelson and Zhou, 2013) consists in taking advantage of the in-sample risk bounds in order to assess the off-sample excess risk. In short, by means of nowadays well-known techniques (developed in Bickel et al., 2009; Juditsky and Nemirovski, 2011; Bühlmann and van de Geer, 2011; Belloni et al., 2014; Dalalyan et al., 2014, for instance) for a well-specified model, an upper bound on the in-sample risk, is obtained along with proving that the vector $\hat\beta - \beta^\star$ belongs to the dimension-reduction cone appearing in the definition of the compatibility constant. Then, using suitably chosen concentration arguments, it is shown that (with high probability) the compatibility constant $\kappa_{\Sigma_{\mathrm{lab}}}(J^\star, c)$ of the empirical covariance matrix $\Sigma_{\mathrm{lab}}$ is lower bounded by a (multiple of a) compatibility constant $\kappa_{\Sigma}(J^\star, c')$ of the population covariance matrix, provided that the sparsity $s$ is of order $n/\log(p)$. The main conceptual differences between the aforementioned papers are in the conditions on the random vectors $X_i$. In (Raskutti et al., 2010), it is assumed that the $X_i$'s are Gaussian. In Rudelson and Zhou (2013) and Theorem 9.2 in Koltchinskii (2011), sub-Gaussian and bounded designs are considered, whereas only a bounded moment condition is required in Oliveira (2013). We will not reproduce their results here because (a) they do not allow one to account for the robustness to the model mis-specification and, to a lesser extent, (b) the constants involved in the bounds are not explicit.

RISK BOUNDS IN TRANSDUCTIVE SETTING
We first consider the case of transductive learning. From an intuitive point of view, this case is simpler than the case of semi-supervised learning since a prediction needs to be carried out only for the features in $\mathcal{D}_{\mathrm{unlabeled}}$. Indeed, recall from (2) that in this context, the excess risk of the linear predictor $f_\beta$ is defined by $\mathcal{E}_{\mathrm{TL}}(f_\beta) = \frac{1}{N-n} \sum_{i=n+1}^{N} (X_i^\top \beta - f^\star(X_i))^2$ and the suitably adapted lasso estimator is given by choosing $A = \Sigma_{\mathrm{unlab}}^{1/2}$ in (4). Note here that the role of the term $\frac{2}{n} Y^\top X_{\mathrm{lab}}$ is to estimate the term $\frac{2}{N-n} \sum_{i=n+1}^{N} f^\star(X_i) X_i^\top$, which appears after developing the square in the excess risk. Since the latter belongs to the image of the matrix $X_{\mathrm{unlab}}^\top$, one can slightly improve the estimator by projecting onto the subspace of $\mathbb{R}^p$ spanned by the unlabeled vectors $X_i$. This amounts to replacing the term $Y^\top X_{\mathrm{lab}} \beta$ by $Y^\top X_{\mathrm{lab}} \Pi_{\mathrm{unlab}} \beta$, where $\Pi_{\mathrm{unlab}}$ stands for the orthogonal projector in $\mathbb{R}^p$ onto $\mathrm{Span}(X_{n+1}, \ldots, X_N)$. However, from a theoretical point of view, this modification has no impact on the risk bound stated below. That is why we confine our attention to the lasso estimator that does not use this modification.
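The projection step mentioned above can be sketched as follows. The sketch relies on the standard fact that, for a matrix $M$ whose rows are the unlabeled features, $M^{+} M$ is the orthogonal projector onto the span of these rows; the function names `span_projector` and `projected_linear_term` are hypothetical.

```python
import numpy as np

def span_projector(X_unlab):
    """Orthogonal projector in R^p onto the span of the rows of X_unlab."""
    return np.linalg.pinv(X_unlab) @ X_unlab

def projected_linear_term(X_lab, y, X_unlab):
    """Vector Pi_unlab X_lab^T y / n, which replaces X_lab^T y / n."""
    n = X_lab.shape[0]
    return span_projector(X_unlab) @ (X_lab.T @ y) / n

# Toy usage: two unlabeled features spanning the first two coordinate axes.
X_unlab = np.array([[1.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0]])
X_lab = np.array([[1.0, 2.0, 3.0],
                  [0.0, 1.0, -1.0]])
y = np.array([1.0, 2.0])
P = span_projector(X_unlab)
z = projected_linear_term(X_lab, y, X_unlab)   # third coordinate is removed
```

When the unlabeled features span the whole of $\mathbb{R}^p$, the projector is the identity and the modification has no effect at all.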
Theorem 5. Let Assumptions (A1) and (A2) be fulfilled. Define $n_\star = n \wedge (N - n)$ and assume that, for a given $\delta \in (0, 1)$, the tuning parameter $\lambda$ satisfies Then, with probability at least $1 - \delta$, the predictor $f_{\hat\beta}$ satisfies
A few comments are in order. First, Theorem 5 holds for any pair of integers $n$ and $N$ larger than 1. However, it is especially relevant when the number $N - n$ of unlabeled features is larger than the number $n$ of labeled ones. As already mentioned, this kind of situation is frequent in applications where the labeling procedure is expensive. In this case, $n_\star = n$ and Theorem 5 takes the same form as (5) with the notable advantage that the size of the unlabeled sample does not need to be of larger order than the dimension $p$. Let us present a few implications of this result in the well-specified case.
Well-specified case. Recall that the well-specified case refers to the situation where there exists $\beta^\star \in \mathbb{R}^p$ such that the Bayes predictor $f^\star$ satisfies $f^\star(x) = x^\top \beta^\star$, $P_X$-almost surely. In this case, the excess risk of a predictor $f_\beta$ can be written as $\mathcal{E}_{\mathrm{TL}}(f_\beta) = \|\Sigma_{\mathrm{unlab}}^{1/2} (\beta - \beta^\star)\|_2^2$. In this form, the technical tractability of the transductive learning problem appears clearly since the matrix $A = \Sigma_{\mathrm{unlab}}^{1/2}$ used in the definition of the estimator $\hat\beta$ coincides with the one appearing in the excess loss. As we shall see later, this is indeed not the case for semi-supervised learning. Now, the choice of $\beta = \beta^\star$ and $J = J^\star$ in the right hand side of inequality (8) yields The choice of $\lambda$ provided by the right hand side of inequality (7), along with the condition $n_\star \ge B_X^2 \log(2p/\delta)$, leads to the bound with probability at least $1 - \delta$. Comparing our result with that of Alquier and Hebiri (2012) (cf. Theorem 3 above), we can note that Theorem 5 holds without the assumption that the unlabeled features belong to the linear span of the labeled ones. On the other hand, Alquier and Hebiri (2012) do not require the labeled and the unlabeled features to be drawn from the same distribution.
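For completeness, the identity behind the well-specified form of the transductive excess risk can be spelled out as below. The derivation assumes that $\Sigma_{\mathrm{unlab}}$ denotes the uncentered empirical second-moment matrix of the unlabeled features and that the transductive excess risk is the average squared prediction gap over these features.

```latex
\begin{align*}
\mathcal{E}_{\mathrm{TL}}(f_\beta)
  &= \frac{1}{N-n}\sum_{i=n+1}^{N}\bigl(X_i^\top\beta - X_i^\top\beta^\star\bigr)^2 \\
  &= (\beta-\beta^\star)^\top
     \Bigl(\frac{1}{N-n}\sum_{i=n+1}^{N} X_i X_i^\top\Bigr)
     (\beta-\beta^\star) \\
  &= \bigl\|\Sigma_{\mathrm{unlab}}^{1/2}(\beta-\beta^\star)\bigr\|_2^2 .
\end{align*}
```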

RISK BOUNDS IN SEMI-SUPERVISED SETTING
We now turn to the more challenging problem of semi-supervised learning. In this section, we first consider the well-specified setting in which the Bayes predictor $f^\star$ is linear. We start with risk bounds that hold with a probability close to one. Such bounds are often termed in deviation as opposed to those holding in expectation.
Well-specified case. We assume here that $f^\star(x) = x^\top \beta^\star$, $P_X$-almost surely, for some $\beta^\star \in \mathbb{R}^p$. In this context, the excess risk of the linear predictor $f_\beta$, defined in (1), becomes $\mathcal{E}(f_\beta) = \|\Sigma^{1/2} (\beta - \beta^\star)\|_2^2$. This setting is more restrictive than the mis-specified setting considered below, but it has the advantage of allowing us to obtain risk bounds that are small even if the sample size $N$ is not necessarily larger than the dimension $p$. The next result assesses the performance of the predictor $f_{\hat\beta}$ where corresponding to the choice $A = \Sigma_{\mathrm{all}}^{1/2}$ in (4). In the next result, we set where the restricted eigenvalue $\kappa^{\mathrm{RE}}_A(J, c)$ is defined in Section 2.
With probability at least $1 - \delta$, it holds In addition, if the overall sample size $N$ is such that $16 s^\star B_X^2 \sqrt{2 \log(4p^2/\delta)} \le \kappa_{\Sigma}(J^\star, 3) \sqrt{N}$ then, with probability at least $1 - \delta$, the predictor $f_{\hat\beta}$ satisfies the inequality
This theorem provides three different risk bounds, all of them being valid for the same choice of the tuning parameter $\lambda$, that clearly show the benefits of using unlabeled data. The first two bounds are stated in eq. (11). They share the common feature of depending on a characteristic (compatibility constant or restricted eigenvalue) of the sample covariance matrix. The latter is computed using both labeled and unlabeled data. For large values of $N$, it is more likely that these characteristics are bounded away from zero than those of the sample covariance matrix based on the labeled data only. In the asymptotic setting where $s^\star$ goes to infinity with the sample size and the dimension, the second term in the right hand side of eq. (11) is of smaller order than the first one and is rate optimal, provided that the restricted eigenvalue is lower bounded by a fixed positive constant. However, for finite and small values of $s^\star$ the first term in the right hand side of eq. (11) might be smaller than the second term.
This being said, it might be more insightful to look at non-random upper bounds on the excess risk such as the one stated in eq. (12). It basically tells us that if the overall sample size is larger than a multiple of $(s^\star)^2 \log p$, then the off-sample prediction risk of the semi-supervised lasso estimator achieves the fast rate $s^\star \log(p)/n$. Note that if we use only the labeled data points, the best known results, as recalled in Section 3 above, provide the fast rate when $n$ is larger than a multiple of $s^\star \log p$. Thus, if $N$ is of the same order as $n$, our result above is not the sharpest possible, but it has the advantage of being easy to prove and, nevertheless, demonstrating the gain of using the unlabeled data. In particular, the proofs of results providing the fast rate under the condition $n \ge C s^\star \log(p)$, for some $C > 0$, involve the important step of lower bounding the compatibility constant of the sample covariance matrix by its population counterpart. This step uses concentration arguments which are often tedious and come with implicit (or unreasonably large) constants. Instead, our proof makes use of much simpler tools essentially boiling down to the classical Bernstein inequality and leads to explicit and small constants.
Mis-specified case. Mathematical analysis of the semi-supervised lasso under mis-specification is more involved, since it requires careful control of the bias terms corresponding to the non-linearity and the non-sparsity of the model. We first state results providing risk bounds in deviation, then state their counterpart in expectation.

Suppose in addition that
Then the semi-supervised lasso estimator β defined in (10) above satisfies with probability larger than 1 − δ.
The novelty of Theorem 7 lies in the semi-supervised nature of the estimator (10), which involves all the unlabeled features through the matrix $A = \Sigma_{\mathrm{all}}^{1/2}$ in eq. (4). In particular, Theorem 7 quantifies the natural intuition according to which, if $N$ is large enough, the matrix $A = \Sigma_{\mathrm{all}}^{1/2}$ is a good estimator of $\Sigma^{1/2}$ and a result similar to Theorem 2 should hold. As mentioned in the introduction, an attractive feature of the upper bound in eq. (14) is that it is of the same form as the recent oracle inequalities established in the case of fixed design regression (see, for instance, Dalalyan et al., 2014; Pensky, 2014, and the references therein) and quantifies in an easy-to-understand manner the error terms accounting for the non-linearity and the non-sparsity of the true regression function $f^\star$.
The minimal number $N$ of features satisfying (13) depends on $\|\Sigma^{-1}\| = \lambda_{\min}^{-1}(\Sigma)$, reflecting the fact that the quality of approximation of the identity matrix $I_p$ by $\Sigma^{-1/2} \Sigma_{\mathrm{all}} \Sigma^{-1/2}$ depends on $\|\Sigma^{-1}\|$. One can remark that under constraint (13), the lowest eigenvalue of the sample covariance matrix is close to its population counterpart (Vershynin, 2010) and provides a simple lower bound on the compatibility constant $\kappa_{\Sigma_{\mathrm{all}}}(J, 3)$ appearing in eq. (14). These considerations lead to the following corollary.
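The approximation of $I_p$ by $\Sigma^{-1/2} \Sigma_{\mathrm{all}} \Sigma^{-1/2}$ can be explored numerically with the sketch below. It is a simulation under assumptions chosen for illustration only: the features are i.i.d. Rademacher variables (bounded, centered, unit variance, so that $\Sigma = I_p$), and the function name `whitened_deviation` is hypothetical.

```python
import numpy as np

def whitened_deviation(X_all, Sigma):
    """Spectral norm of Sigma^{-1/2} Sigma_all Sigma^{-1/2} - I_p.

    Sigma must be symmetric positive definite.
    """
    vals, vecs = np.linalg.eigh(Sigma)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T     # Sigma^{-1/2}
    Sigma_all = X_all.T @ X_all / X_all.shape[0]         # empirical covariance
    p = Sigma.shape[0]
    return float(np.linalg.norm(inv_sqrt @ Sigma_all @ inv_sqrt - np.eye(p), 2))

rng = np.random.default_rng(0)
p = 5
Sigma = np.eye(p)
X_small = rng.choice([-1.0, 1.0], size=(100, p))         # N = 100
X_large = rng.choice([-1.0, 1.0], size=(100_000, p))     # N = 100000
dev_small = whitened_deviation(X_small, Sigma)
dev_large = whitened_deviation(X_large, Sigma)
```

In such simulations the deviation shrinks as the overall sample size grows, which is the mechanism that allows the compatibility constant of $\Sigma_{\mathrm{all}}$ to be compared with that of $\Sigma$.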
Corollary 1. Under the conditions of Theorem 7, with probability at least $1 - \delta$,
Let us also mention that the factor $B_X^2 p \|\Sigma^{-1}\|$ present in the right hand side of eq. (13) is an upper bound on the norm $\|\Sigma^{-1/2} X_i\|_2^2$ under assumption (A2). Under additional assumptions on the support of the features $X_i$, this expression may be replaced by a smaller one leading thus to a relaxation of condition (13).
Sharp oracle inequality in expectation. All the previously stated results assert that the lasso estimator has a small prediction error on an event of overwhelming probability. However, in these results, the choice of the tuning parameter $\lambda$ and, therefore, the final predictor $f_{\hat\beta}$, depends on the prescribed level of tolerance. A consequence of this dependence is that one cannot integrate out the bounds in deviation in order to get a bound in expectation. This is probably one of the reasons why the bounds in expectation for the lasso are scarce in the literature. To fill this gap, we state below a risk bound in expectation that can be easily deduced from the bounds in deviation.
Theorem 8. Let Assumptions (A1) and (A2) be fulfilled. Suppose that the overall sample size is such that $N \ge 18 B_X^2 p \|\Sigma^{-1}\| \log(3pN^2)$. Then, for the tuning parameter the semi-supervised lasso estimator $\hat\beta$ defined in (10) above satisfies
The proof of this theorem is postponed to Section 7.2.3. The bound above is not optimal in terms of its dependence on $N$. In particular, it blows up when $N$ goes to infinity and all the other parameters are fixed. However, this divergence is only logarithmic in $N$. The dominating term in the risk bound above is (at least in the well-specified setting) of the order $\lambda^2 |J| \asymp s \log(pN)/n$.

CONCLUSION
We have reviewed some recent results on the prediction accuracy of the lasso in the problem of regression with random design and have proposed their extensions to the setting where the labels of some data points are not available. Theoretical guarantees stated in previous sections are formulated as oracle inequalities that allow us to compare the excess risk of a suitable adaptation of the lasso to the best possible (nearly) sparse prediction function. We have opted for considering only those risk bounds that provide the fast rate and are valid under some conditions on the design such as the restricted eigenvalue condition or the compatibility condition. Some of the established upper bounds involve the compatibility constant of the sample covariance matrix. Using results on random matrices (Rudelson and Zhou, 2013; Oliveira, 2013; Bah and Tanner, 2014) they can be further worked out to get deterministic upper bounds. However, the evaluation of the restricted eigenvalues and related quantities of the random covariance-type matrices is a dynamically evolving research area and we expect that new advances will be made in the near future.
The main high-level message of the contributions of this paper is that one can take advantage of the unlabeled sample for improving the prediction accuracy of the lasso. Roughly speaking, if the size of the unlabeled sample is larger than the ambient dimension, then the modified lasso predictor has a prediction risk that converges to zero at the optimal rate even if the sample covariance matrix based only on the labeled sample does not satisfy the compatibility or the restricted eigenvalue condition. However, it should be acknowledged that when the model is well specified (that is, there exists a sparse linear combination of the features with an extremely low approximation error) and the population covariance matrix is well-conditioned, then the original lasso might perform as well as, or even better than, the modified lasso proposed in this work. Therefore, one can conclude that the use of the unlabeled sample improves the robustness of the lasso to model mis-specification.
We would also like to emphasize that, pursuing pedagogical goals, we have restricted our attention to the simple case of bounded feature vectors and bounded labels. All the proofs presented in this paper are based on elementary arguments and are fairly simple. Using more involved arguments, they can be carried over to the case of sub-Gaussian design and labels. It would be interesting to explore their extensions to other settings such as regression with structured sparsity, low rank matrix regression or matrix completion.

PROOFS
We start with a general result that holds for the penalized least squares predictor with arbitrary convex penalty. This result is of independent interest. It generalizes the corresponding result of Koltchinskii et al. (2011) established for the matrix trace-norm penalties. The proof that we present here is different from the one in Koltchinskii et al. (2011) in that it does not rely on the precise form of the sub-differential of the penalty function.
Lemma 1. Let $n, p \ge 1$. Let $\mathrm{pen} : \mathbb{R}^p \to \mathbb{R}$ be any convex function and $\hat\beta$ be defined by

Proof. Let us introduce the function $\Phi(\beta) = \|A\beta\|_2^2 - \frac{2}{n} Y^\top X_{\mathrm{lab}}\beta + \mathrm{pen}(\beta)$ for every $\beta \in \mathbb{R}^p$, so that $\hat\beta$ is a minimum point of $\Phi$. Since the latter is a convex function, we know that the zero vector $0_p$ of $\mathbb{R}^p$ belongs to the sub-differential $\partial\Phi(\hat\beta)$ of $\Phi$ at $\hat\beta$. For all $\beta \in \mathbb{R}^p$, let

The function $\psi$ is proper and convex. It is also differentiable on $\mathbb{R}^p$ and the sub-differential of $\psi$ at $\hat\beta$ is reduced to its gradient at $\hat\beta$, so that $\partial\psi(\hat\beta) = \{\nabla\psi(\hat\beta)\} = \{0_p\}$. The function $\bar\Phi$ defined on $\mathbb{R}^p$ is the sum of an affine function and the convex function pen, thus it is also convex. The functions $\psi$, $\bar\Phi$ are proper and convex, the function $\psi$ is continuous on $\mathbb{R}^p$ so by the Moreau-Rockafellar Theorem,

Thus $0_p \in \partial\bar\Phi(\hat\beta)$, which can be rewritten as

By adding $\psi(\beta)$ on both sides of the previous display, we obtain

Rearranging the terms of this inequality, we get the claim of the lemma.
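The chain of inequalities behind this proof can be spelled out as in the sketch below. The explicit definitions of $\psi$ and $\bar\Phi$ are assumptions, chosen to match the properties invoked in the proof (a differentiable convex $\psi$ whose gradient vanishes at $\hat\beta$, and a remainder equal to an affine function plus pen); they are not copied from the original displays.

```latex
% Assumed definitions (hypothetical reconstruction):
\[
\psi(\beta) = \|A(\beta-\hat\beta)\|_2^2,
\qquad
\bar\Phi(\beta) = \Phi(\beta) - \psi(\beta).
\]
% \bar\Phi is affine plus pen, because the quadratic terms cancel:
\[
\|A\beta\|_2^2 - \|A(\beta-\hat\beta)\|_2^2
= 2\hat\beta^\top A^\top A\beta - \|A\hat\beta\|_2^2 .
\]
% Sum rule for sub-differentials at \hat\beta, with \nabla\psi(\hat\beta) = 0_p:
\[
0_p \in \partial\Phi(\hat\beta)
= \{\nabla\psi(\hat\beta)\} + \partial\bar\Phi(\hat\beta)
= \partial\bar\Phi(\hat\beta)
\quad\Longrightarrow\quad
\bar\Phi(\beta) \ge \bar\Phi(\hat\beta) \quad \text{for all } \beta\in\mathbb{R}^p .
\]
% Adding \psi(\beta) to both sides and using \psi(\hat\beta) = 0:
\[
\Phi(\beta) \ge \Phi(\hat\beta) + \|A(\beta-\hat\beta)\|_2^2 .
\]
```

Under these assumed definitions, the gain over the plain optimality condition $\Phi(\beta) \ge \Phi(\hat\beta)$ is the extra quadratic term $\|A(\beta-\hat\beta)\|_2^2$, obtained without using any explicit form of the sub-differential of the penalty.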
We will also repeatedly use the following result.
Proof. To ease notation, we set $u = \beta - \beta'$. Using that

If $c_\gamma \|u_J\|_1 < \|u_{J^c}\|_1$, the claim of the lemma is straightforward. Otherwise, $\|u_{J^c}\|_1 \le c_\gamma \|u_J\|_1$ and using the definition of the compatibility constant we get

2λ(γ + 1)

which completes the proof.
To close this subsection of auxiliary results, we provide simple upper bounds on the quantiles of some random noise variables.
Proof. We will only prove the inequality corresponding to $\zeta$. The others, being very similar, are left to the reader. Denote

and introduce the random vectors
The vectors $Z_i$ are independent, centered, bounded and satisfy

One can also bound from above the variance of the $j$-th component $Z_{ij}$ of $Z_i$ as follows. If $i \le n$ then, in view of Assumptions (A1) and (A2),

Hence, we may easily deduce that, for all $j \in [p]$,

Therefore, using the Bernstein inequality recalled in Proposition 4 of Appendix A, for every $j \in [p]$ and every $\delta > 0$, we get that the inequality

holds with probability at most $\delta/p$. The claim of Proposition 1 follows from the union bound.
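The two ingredients of this argument, a Bernstein-type quantile for each coordinate followed by a union bound over the $p$ coordinates, can be illustrated numerically. The sketch below uses the classical form of Bernstein's inequality for independent centered variables bounded by `bound` in absolute value; its constants may differ from those of Proposition 4 of Appendix A, and the function names and the uniform toy variables are assumptions made for illustration.

```python
import math
import random

def bernstein_quantile(variance_sum, bound, delta):
    """Level t with P(sum_i Z_i > t) <= delta (classical Bernstein form).

    Assumes independent centered Z_i with |Z_i| <= bound and
    sum_i Var(Z_i) <= variance_sum.
    """
    x = math.log(1.0 / delta)
    return math.sqrt(2.0 * variance_sum * x) + bound * x / 3.0

def sup_norm_quantile(variance_sum, bound, delta, p):
    """Union bound over p coordinates and both signs: budget delta / (2p) each."""
    return bernstein_quantile(variance_sum, bound, delta / (2.0 * p))

def exceedance_rate(n_terms, p, delta, n_trials=300, seed=0):
    """Monte Carlo frequency of max_j |sum_i Z_ij| exceeding the bound.

    Toy model: Z_ij uniform on [-1, 1], so Var(Z_ij) = 1/3 and |Z_ij| <= 1.
    """
    rng = random.Random(seed)
    t = sup_norm_quantile(n_terms / 3.0, 1.0, delta, p)
    hits = 0
    for _ in range(n_trials):
        sup = max(abs(sum(rng.uniform(-1.0, 1.0) for _ in range(n_terms)))
                  for _ in range(p))
        hits += sup > t
    return hits / n_trials

rate = exceedance_rate(n_terms=50, p=10, delta=0.1)
```

In this toy setting the observed exceedance frequency `rate` is expected to stay below the nominal level `delta`, typically by a wide margin, since both Bernstein's inequality and the union bound are conservative.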
Remark 7.1. One can easily check that the inequality $\mathbf{E}[Z_{ij}^2] \le (N B_Y/n)^2$, for $i = 1, \ldots, n$, used in the previous proof can be replaced by E . This may lead to a better risk bound in the cases where the random variable $Y_i$ is not well concentrated around its average value.
We are now in a position to prove the main theorems of this paper.

Proof of Theorem 5
The proof of Theorem 5 follows directly from Proposition 1 and Proposition 2 below.For simplicity, the parameter γ > 1 introduced in Proposition 2 is fixed at the value γ = 2 in Theorem 5.
Proposition 2. Let $\zeta$ be as in Proposition 1. For any $\gamma > 1$, we set $c_\gamma = (\gamma + 1)/(\gamma - 1)$. On the event $\mathcal{E} = \{\|\zeta\|_\infty \le \lambda/\gamma\}$, for every $\beta \in \mathbb{R}^p$ and every $J \subseteq [p]$, we have

Proof. Along the proof, we will use for convenience the shorthand notations $m = N - n$ and $A = \widehat\Sigma_{\mathrm{unlab}}^{1/2}$. First, notice that developing the square in the expression

This implies that for every $\beta \in \mathbb{R}^p$, we have

Using Lemma 1 with the convex penalty term $\mathrm{pen}(\beta) = 2\lambda\|\beta\|_1$, we deduce that, for every

Combining equations (16) and (17), we get that on the event $\mathcal{E}$, for every $\beta \in \mathbb{R}^p$ and every

The claim of the proposition follows from eq. (18) by applying Lemma 2 with $\mu = \lambda$.
To conclude the proof of Theorem 5, it suffices to note that in view of Proposition 1, the probability of the event

7.2 Proofs for the semi-supervised version of the lasso

We start this section by some arguments that are shared by the proofs of both theorems stated in Section 5. Let $J \subseteq [p]$ and let $\bar\beta$ be a minimizer of the right hand side of (14). Note in particular that $\bar\beta$ is a deterministic vector depending on the unknown distribution $P$ of the data. In addition, if the model is well-specified and $J = J^\star$ then $\bar\beta = \beta^\star$. We will also use the notation $u = \hat\beta - \bar\beta$ and

Furthermore, to ease notation, we set

Next, notice that

and that

where in the last line we have used the identity $2a^\top b = \|a + b\|_2^2 - \|a\|_2^2 - \|b\|_2^2$ with $a = Au$ and $b = A\bar\beta$. Transforming eq. (20) thanks to (21) and (22) we obtain

where we have used the identity

and the definitions of $\zeta^{(1)}$ and $\zeta^{(2)}$. Applying Lemma 1 with $\mathrm{pen}(\beta) = 2\lambda\|\beta\|_1$ and combining its result with (23), we arrive at

7.2.1 Proof of Theorem 6. As mentioned earlier, in the well-specified setting we have $\bar\beta = \beta^\star$ and, therefore, $\mathcal{E}(f_{\hat\beta}) = \|\Sigma^{1/2}u\|_2^2$ and $\mathcal{E}(f_{\beta^\star}) = 0$. Hence, (24) yields

Combining the duality inequality |u

This implies that $\|u_{(J^\star)^c}\|_1 \le 3\|u_{J^\star}\|_1$ and, therefore,

On the other hand, if we denote by $I$ the set of the $s^\star$ largest entries of the vector $|u|$, inequality (26) implies that $2\|\widehat\Sigma_N^{1/2}u\|_2^2 \le \lambda(3\|u_I\|_1 - \|u_{I^c}\|_1)$. Therefore, using the definition of the restricted eigenvalue and similar arguments as above, we deduce that $\|u_I\|_2 \le 3\lambda\sqrt{s^\star}$/(2κ RE

Combining (27) and (28), we get the first claim of the theorem.
To get the second claim of the theorem, we go back to (26) and use the following inequalities:

In the sequel, let us denote $\kappa = \kappa_\Sigma(J^\star, 3)$ for brevity. Then, upper bounding the two instances of $\|u_{J^\star}\|_1$ in (29) by $(s^\star\|\Sigma^{1/2}u\|_2^2/\kappa)^{1/2}$, we infer that on $\mathcal{E}$,

Dividing both sides by $\|\Sigma^{1/2}u\|_2$ (if this quantity vanishes then the claim of the theorem is obviously true) and after some algebra, we get the inequality

Lemma 5. Let $\mathrm{pen} : \mathbb{R}^p \to [0, +\infty)$ be a convex function such that $\mathrm{pen}(0_p) = 0$. Let $\beta$ be a minimizer of the function

Proof. We apply Lemma 1 with $A = \mathbf{E}[XX^\top]^{1/2}$, $n = 1$, $Y = 1$ and

Rearranging the terms and using that $\mathrm{pen}(\beta) \ge 0$, we get

imply that $\mathbf{P}(\mathcal{E}_1) \ge 1 - \delta/3$ and $\mathbf{P}(\mathcal{E}_2) \ge 1 - \delta/3$. One can easily check that under the conditions of the theorem, the two inequalities of the last display are satisfied. Therefore, we have $\mathbf{P}(\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3) \ge 1 - \delta$. Finally, applying Proposition 3 we get the claim of the theorem.
7.2.3 Proof of the oracle inequality in expectation. Let $\delta$ be a positive number smaller than 1 to be chosen later. We have already seen in 1 that on an event $\mathcal{E}$ of probability $1 - \delta$, we have

On the other hand, using the fact that $\hat\beta$ minimises the function $\psi(\beta) = \|\widehat\Sigma_N^{1/2}\beta\|_2^2 - \frac{2}{n} Y^\top X_n\beta + 2\lambda\|\beta\|_1$, we have $\psi(\hat\beta) \le \psi(0_p)$, which yields

Note that $\widehat\Sigma_N^{-1/2}$ is understood as the Moore-Penrose pseudo-inverse and all the expressions involving this quantity are well defined since $N\widehat\Sigma_N \succeq n\widehat\Sigma_n = X_n^\top X_n$. This implies that

$\|Y\|_2^2$, which entails

It is also true that for every $\beta \in \mathbb{R}^p$,

Therefore, we have

Combining this inequality with (35), we get

Setting $\delta = N^{-2}$, we get the claim of the theorem.
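The passage from a bound in deviation to a bound in expectation used in this proof can be sketched generically as follows. The symbols $R$ (a nonnegative random loss), $b(\delta)$ (a nonnegative bound valid on $\mathcal{E}$) and $M$ (a deterministic bound valid on every outcome) are placeholders introduced for illustration only; they are assumptions and not the exact quantities of the paper.

```latex
% Assumptions: 0 <= R <= b(delta) on E, R <= M everywhere, P(E) >= 1 - delta.
\begin{align*}
\mathbf{E}[R]
  &= \mathbf{E}[R\,\mathbf{1}_{\mathcal{E}}] + \mathbf{E}[R\,\mathbf{1}_{\mathcal{E}^c}] \\
  &\le b(\delta)\,\mathbf{P}(\mathcal{E}) + M\,\mathbf{P}(\mathcal{E}^c) \\
  &\le b(\delta) + M\delta .
\end{align*}
% With delta = N^{-2} the remainder term is of order N^{-2}:
\[
\mathbf{E}[R] \le b(N^{-2}) + \frac{M}{N^{2}} .
\]
```

With this choice, $\log(1/\delta) = 2\log N$, which is one way to see how a logarithmic dependence on $N$ can enter the tuning parameter and hence the final bound.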