Dimension reduction for regression estimation with nearest neighbor method

Abstract: In regression with a high-dimensional predictor vector, dimension reduction methods aim at replacing the predictor by a lower-dimensional version without loss of information on the regression. In this context, the so-called central mean subspace is the key object of dimension reduction. The last two decades have seen the emergence of many methods to estimate the central mean subspace. In this paper, we go one step further and study the performance of a k-nearest neighbor type estimate of the regression function, based on an estimator of the central mean subspace. In our setting, the predictor lies in R^p with fixed p, i.e. p does not depend on the sample size. The estimate is first proved to be consistent. The improvement due to the dimension reduction step is then observed in terms of its rate of convergence. All the results are distribution-free. As an application, we give an explicit rate of convergence using the SIR method. The method is illustrated by a simulation study.


Introduction
In full generality, the goal of regression is to infer about the conditional law of the response variable Y given the R^p-valued predictor X. Many different methods have been developed to address this issue. In the present paper, we consider sufficient dimension reduction, which is a body of theory and methods for reducing the dimension of X while preserving information on the regression (see Li [13,14], and Cook and Weisberg [6]). Basically, the idea is to replace the predictor with its projection onto a subspace of the predictor space, without loss of information on the conditional distribution of Y given X. Several methods have been introduced to estimate this subspace: sliced inverse regression (SIR; Li [13]), sliced average variance estimation (SAVE; Cook and Weisberg [6]), average derivative estimation (ADE; Härdle and Stoker [10]), ... See also the paper by Cook and Weisberg [7], who give an introductory account of studying regression via these methods.

Even if the methods above give a complete picture of the dependence of Y on X, certain characteristics of the conditional distribution may often be of special interest. In particular, regression is often understood to imply a study of the conditional expectation E[Y|X]. In what follows, the response variable Y is a univariate and integrable random variable. Following the ideas developed for the conditional distribution, Cook and Li [4] introduced the central mean subspace, which will be of great interest in this paper. Let us recall the definition. For a matrix Λ ∈ M_p(R), denote by S(Λ) the space spanned by the columns of Λ. Here, M_p(R) stands for the set of p×p matrices with real coefficients. Letting Λ^T denote the transpose of Λ, we say that S(Λ) is a mean dimension-reduction subspace if

E[Y | X] = E[Y | Λ^T X],     (1.1)

that is, if replacing the predictor by its projection onto S(Λ) entails no loss of information on the regression. When the intersection of all mean dimension-reduction subspaces is itself a mean dimension-reduction subspace, it is called the central mean subspace and is denoted by S_{E[Y|X]}. In this respect, a matrix Λ that spans the central mean subspace is called a candidate matrix. Hence the central mean subspace, which exists under mild conditions (see Cook [1-3]), is the target of sufficient dimension reduction for the mean response E[Y|X]. Various methods have been developed to estimate S_{E[Y|X]}, among which principal Hessian directions (pHd; Li [14]), iterative Hessian transformation (IHT; Cook and Li [4]) and minimum average variance estimation (MAVE; Xia et al. [16]). Discussions, improvements and relevant papers can be found in Zhu and Zeng [19], Ye and Weiss [17] or Cook and Ni [5].
Regarding the regression estimation problem in a nonparametric setting, the aim of the dimension-reduction methods is to overcome the curse of dimensionality (which roughly says that the rate of convergence of any estimator deteriorates as p grows) by accelerating the rate of convergence. Indeed, assuming (1.1), it is naturally expected that the rate of convergence of any estimator will depend on rank(Λ) instead of p, since Λ^T X lies in a vector space of dimension rank(Λ). In general, rank(Λ) is much smaller than p, hence the rate of convergence in the estimation of E[Y|X] may be considerably improved. For this estimation problem, we shall use the so-called k-nearest neighbor (NN) method, which is one of the most studied methods in nonparametric regression estimation since it provides efficient and tractable estimators (e.g., see the monograph by Györfi et al. [8], and the references therein). As far as we know, similar studies in a dimension-reduction setting have only been carried out for particular models, such as additive models or projection pursuit for instance. We refer the reader to Chapter 22 in the book by Györfi et al. [8] for a complete list of references on the subject.
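To give an order of magnitude (a standard back-of-the-envelope computation, not a statement taken from the paper), recall that the minimax rate for Lipschitz regression functions over a q-dimensional predictor is n^{-2/(2+q)}. With p = 10 and rank(Λ) = 2, which is the situation of the simulation study below, this gives

    n^{-2/(2+p)} = n^{-1/6}   versus   n^{-2/(2+rank(Λ))} = n^{-1/2},

so that dividing the error by 10 requires multiplying the sample size by roughly 10^{(2+p)/2} = 10^6 in the full-dimensional case, but only by 10^{(2+d)/2} = 10^2 after an ideal dimension reduction.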
In the present paper, we address the problem of estimating the conditional expectation E[Y|X] based on a sequence (X_1, Y_1), ..., (X_N, Y_N) of i.i.d. copies of (X, Y). In our setting, the predictor X lies in R^p with fixed p, i.e. p does not depend on the sample size. Assuming the existence of a mean dimension-reduction subspace as in (1.1), we first construct in Section 2 the k-NN type estimator based on an estimate Λ̂ of Λ. Roughly speaking, it is defined as the k-NN regression estimate drawn from the (Λ̂X_i, Y_i)'s. In a distribution-free setting, we prove consistency of the estimator (Theorem 2.1) and we show that the rate of convergence essentially depends on rank(Λ) (Theorem 2.2). In particular, up to the terms induced by the dimension-reduction methodology, we recover the usual optimal rate when the predictor belongs to R^{rank(Λ)}. Section 3 is devoted to the term induced by the dimension-reduction method: in a general setting, we propose and study the performance (convergence and rate) of a numerically robust estimator. As an example, we consider in Section 4 the case where the candidate matrix is constructed via the SIR method. This section also presents a data-driven choice of the tuning parameters and a simulation study. All the proofs are postponed to the last three sections.

The estimator
Throughout this section, we shall work under the following assumption. Recall that throughout the paper, the dimension p does not depend on the sample size n; in particular, it cannot grow with n.
Basic assumption: there exists Λ ∈ M_p(R) such that S(Λ^T) is a mean dimension-reduction subspace, i.e. E[Y | X] = E[Y | ΛX].
Note that we have written "Λ" instead of the usual "Λ^T" in the conditional expectation. This choice is for notational simplicity since, in this section, we only have to deal with Λ.
The estimation of the regression function requires one to first estimate the matrix Λ and then to estimate the regression function r defined by r(x) = E[Y | ΛX = x], x ∈ R^p. To reach this goal, we assume throughout the paper that the sample size N is even, with N = 2n. We split the dataset into two sub-samples: the first n data (X_1, Y_1), ..., (X_n, Y_n) are used to estimate the matrix Λ, whereas the last ones (X_{n+1}, Y_{n+1}), ..., (X_{2n}, Y_{2n}) are used to estimate the body of the regression function r.
For the first estimation problem, we assume in this section that we have at hand an estimate Λ̂ of Λ, constructed with the observations (X_1, Y_1), ..., (X_n, Y_n). We refer to Sections 3 and 4 for an efficient and tractable way to estimate Λ. We now explain the nearest neighbor method that will be used to estimate the function r (for more information on the NN method, we refer the reader to Chapter 6 of the monograph by Györfi et al. [8]). For all i = n+1, ..., 2n, we let X̂_i = Λ̂X_i. Then, if x ∈ R^p, we reorder the data (X̂_{n+1}, Y_{n+1}), ..., (X̂_{2n}, Y_{2n}) according to increasing values of {||X̂_i − x||, i = n+1, ..., 2n}, where ||.|| stands for the Schur norm of any vector or matrix. The reordered data sequence is denoted by (X̂_{(n+1)}(x), Y_{(n+1)}(x)), ..., (X̂_{(2n)}(x), Y_{(2n)}(x)), which means that ||X̂_{(n+1)}(x) − x|| ≤ ... ≤ ||X̂_{(2n)}(x) − x||. In this approach, X̂_{(i)}(x) is called the i-th NN of x. Note that if X̂_i and X̂_j are equidistant from x, i.e. ||X̂_i − x|| = ||X̂_j − x||, then we have a tie. As usual, we then declare X̂_i closer to x than X̂_j if i < j. We now let k = k(n) ≤ n be an integer and, for all i = n+1, ..., 2n, we set W_i(x) = 1/k if X̂_i is among the k nearest neighbors of x, and W_i(x) = 0 otherwise. Observe that we have Σ_{i=n+1}^{2n} W_i(x) = 1. In this respect, the estimate r̂ of r is then defined by r̂(x) = Σ_{i=n+1}^{2n} W_i(x) Y_i. From a computational point of view, the complexity of computing r̂(x) is O(n ln n) on average, using a randomized Quick-Sort algorithm.
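As a minimal illustration of the construction just described (a sketch only: the function and variable names below, such as knn_dr_estimate, Lambda_hat, X2 and Y2, are our own and not part of the paper), the following Python code computes r̂(x) from the second half of the sample, with ties broken by the smaller index as above.

    import numpy as np

    def knn_dr_estimate(x, Lambda_hat, X2, Y2, k):
        # x          : query point in R^p
        # Lambda_hat : (p, p) estimate of Lambda
        # X2, Y2     : second half of the sample, shapes (n, p) and (n,)
        # k          : number of nearest neighbors, 1 <= k <= n
        Xt = X2 @ Lambda_hat.T                     # projected predictors Lambda_hat X_i
        xt = Lambda_hat @ x                        # projected query point
        dist = np.linalg.norm(Xt - xt, axis=1)
        order = np.argsort(dist, kind="stable")    # stable sort: ties broken by smaller index
        return Y2[order[:k]].mean()                # weights W_i = 1/k on the k nearest neighbors

The O(n ln n) average complexity mentioned above corresponds to the sorting step; a partial sort of the distances would even bring the average cost of a single evaluation down to O(n).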

Behavior of r̂
In the sequel, (X, Y) is independent of the whole sample and has the same distribution as (X_1, Y_1). Observe that our results are distribution-free; in particular, we do not assume that the law of (X, Y) has a density. The first result, whose proof is deferred to Section 5, establishes a consistency property for the estimator r̂(Λ̂X).
Therefore, we assume in the following that k/n → 0. Recall that the consistency assumption Λ̂ →P Λ holds for the standard dimension-reduction methodologies, as we shall see in Sections 3 and 4.
We now turn to the study of the rate of convergence. Recall that the function r is Lipschitz if there exists L > 0 such that for all x_1, x_2 ∈ R^p: |r(x_1) − r(x_2)| ≤ L ||x_1 − x_2||. Because we deal with the estimation of E[Y|ΛX], it is naturally expected that the convergence rate in Theorem 2.1 depends on the dimension of the vector space spanned by the matrix Λ. In the sequel, d stands for the rank of Λ, and we also denote by d̂ the estimator of d given by d̂ = rank(Λ̂). Section 6 is devoted to the proof of the following result (Theorem 2.2).

Remark 2.3. When d ≤ 2, under the additional conditions of Problem 6.7 in the book of Györfi et al. [8], a slight adaptation of the proof of Theorem 2.2 enables us to derive the same convergence rate.
Observe that the global error is decomposed into two terms: first, the classical error term C/k + C(k/n)^{2/d} arising in nonparametric regression estimation using k-NN when the predictor belongs to R^d (see Chapter 6 in Györfi et al. [8]); second, the term induced by the dimension-reduction method. We shall concentrate on this term in the next two sections. Note also that in this result, the best choice of k, namely k = n^{2/(2+d)}, gives the following bound: Hence, up to the last two terms, our nearest neighbor estimate achieves the usual optimal rate in regression estimation when the predictor belongs to R^d (see Ibragimov and Khasminskii [11], Györfi et al. [8]). With this result, one may quantify the positive effects of the dimension reduction step, measured in terms of the rate of convergence.
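For completeness, the balancing computation behind this choice of k (a standard argument, restated here with generic constants) reads:

    C/k ~ C (k/n)^{2/d}   when   k^{(2+d)/d} ~ n^{2/d},   i.e.   k ~ n^{2/(2+d)},

in which case both classical terms are of order n^{-2/(2+d)}, the usual optimal rate in dimension d.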
The next section is dedicated to the construction and estimation of Λ in a general setting.

Construction of Λ
Papers dealing with dimension reduction primarily focus on the determination of a candidate matrix M ∈ M_p(R) such that the central mean subspace is spanned by the columns of M, i.e. S(M) = S_{E[Y|X]}. Observe that the matrix M is symmetric for the standard dimension-reduction methodologies. We shall see in the next section an explicit description of M with the SIR method. Note that the matrix M is in some sense minimal, because it spans the smallest mean dimension-reduction subspace.
In this section, the matrix Λ of Section 2 will be constructed from a candidate matrix via a spectral decomposition. There are two main reasons for this: first, it automatically gives the effective directions of the reduced space; second, the thresholding procedure of the empirical eigenvalues developed below is robust from a numerical point of view.
Here, we only have to assume that M ∈ M_p(R) is a symmetric matrix such that S(M) is a mean dimension-reduction subspace, i.e. E[Y | X] = E[Y | MX] (recall that M is symmetric, so M^T = M).
We let rank(M) = d. Furthermore, we denote by λ_1, ..., λ_p the eigenvalues of M, indexed as follows: λ_1 ≥ ... ≥ λ_p. Set now v_1, ..., v_p the normalized eigenvectors associated with λ_1, ..., λ_p, and ℓ_1 < ... < ℓ_d the integers such that λ_{ℓ_j} ≠ 0 for all j = 1, ..., d. Recall that v_1, ..., v_p are orthogonal vectors. In the particular case where M is positive semi-definite, the λ_{ℓ_j} are simply the d largest eigenvalues. The matrix Λ is then constructed from the λ_{ℓ_j}'s and v_{ℓ_j}'s; in particular, the basic assumption of Section 2 holds. We also assume that we have at hand an estimator M̂ ∈ M_p(R) of M, constructed with the first n data (X_1, Y_1), ..., (X_n, Y_n). We suppose that M̂ is a symmetric matrix with real coefficients, and we denote by λ̂_1, ..., λ̂_p its eigenvalues indexed as follows: λ̂_1 ≥ ... ≥ λ̂_p, and by v̂_1, ..., v̂_p the corresponding normalized eigenvectors. A natural, and numerically robust, estimator d̂ of d is then obtained by thresholding the eigenvalues: d̂ = #{j : |λ̂_j| ≥ τ}, where the threshold τ is some positive real number with τ ≤ 1, to be specified later. Let l̂_1 < ... < l̂_{d̂} be the integers such that |λ̂_{l̂_j}| ≥ τ for all j = 1, ..., d̂. Then, we define Λ̂ from the λ̂_{l̂_j}'s and v̂_{l̂_j}'s as in (3.1), and we observe that rank(Λ̂) = d̂.
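The thresholding step can be summarized by the following Python sketch. Since the display (3.1) defining Λ̂ is not reproduced above, the construction used below (keeping the rank-one terms λ̂_{l̂_j} v̂_{l̂_j} v̂_{l̂_j}^T whose eigenvalues pass the threshold) is only one plausible reading of it, stated here as an assumption; the name estimate_Lambda is ours.

    import numpy as np

    def estimate_Lambda(M_hat, tau):
        # M_hat : (p, p) symmetric estimate of the candidate matrix M
        # tau   : threshold, 0 < tau <= 1
        eigval, eigvec = np.linalg.eigh(M_hat)     # spectral decomposition of the symmetric M_hat
        keep = np.abs(eigval) >= tau               # retained indices j with |lambda_hat_j| >= tau
        d_hat = int(keep.sum())                    # estimated rank d_hat
        # assumed form of (3.1): sum of the retained rank-one terms, which has rank d_hat
        Lambda_hat = (eigvec[:, keep] * eigval[keep]) @ eigvec[:, keep].T
        return Lambda_hat, d_hat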

Rate of convergence
It is an easy task to prove that if M̂ converges in probability to M, then the conclusion of Theorem 2.1 holds for the resulting estimator, provided k → ∞ and k/n → 0. This subsection is dedicated to the rate of convergence in the above convergence result.
As seen in Theorem 2.2, we need to give bounds for the two terms of the error bound that are induced by the dimension-reduction step. The bounds are given in Lemmas 7.1 and 7.2 in Section 7. As an application of Theorem 2.2, we immediately deduce the following result (Corollary 3.1). The next section is dedicated to the case where M is constructed with the SIR method. In this context, we can give a bound for E||M̂ − M||^2, hence an explicit rate of convergence of r̂(Λ̂X) to E[Y|X].

Theoretical results
The goal of this section is to apply Corollary 3.1 when the candidate matrix M is constructed with some dimension-reduction method. It appears that for each dimension-reduction method (SIR, ADE, MAVE, ...), the estimator M̂ of M is such that √n(M̂ − M) converges in distribution. However, in view of an application of Corollary 3.1, we need a bound for the quantity E||M̂ − M||^2. Each dimension-reduction method requires a specific analysis, and an exhaustive study of all of them is beyond the scope of the paper.
Hence, we have chosen to study the case where M is constructed with SIR, since it is one of the most popular and powerful dimension-reduction methods, and because it is the subject of many recent papers (e.g., Saracco [15], Zhu and Zeng [19] and the references therein).
In this section, we assume that X and Y are bounded. For simplicity, we also assume that X is standard, i.e. X has mean 0 and identity variance matrix. With the SIR method, the candidate matrix M of Section 3, further denoted M_SIR, is the symmetric matrix defined by: In view of an application of Corollary 3.1, we assume throughout a condition linking S(M_SIR) to the central subspace S_{Y|X} of Y given X (e.g. Li [13]). We refer to the papers by Li [13] and Hall and Li [9] for discussions on this assumption, as well as sufficient conditions on the model that ensure this property. In particular, S(M_SIR) is then a mean dimension-reduction subspace, hence we are in a position to apply the results of Section 3.
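Since the display defining M_SIR is not reproduced above, we recall, for orientation only, the usual population-level SIR matrix for a standardized predictor (this is the classical object of Li [13]; the paper's exact definition, possibly a sliced version of it, may differ):

    M_SIR = Cov(E[X | Y]) = E[ E[X|Y] E[X|Y]^T ]   (using E[X] = 0),

whose column space is contained in the central subspace S_{Y|X} under Li's linearity condition on the design.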
Let us introduce the partition {I(h), h = 1, ..., H} of the support of Y, such that each slice I(h) (abbreviated as h) is an interval with length κ/H for some κ > 0, and moreover: In this respect, a natural estimator M̂_SIR for the SIR matrix M_SIR is given by (4.1), where for any slice h: We now denote by m_h the theoretical counterpart of m̂_h, i.e.
and by M′_SIR the matrix: It is an easy exercise to prove that for some constant C > 0 that does not depend on n and H. Hence, in the estimation of M_SIR by M̂_SIR, the bound on the variance term does not require additional assumptions. The bias term M′_SIR − M_SIR, however, has to be handled with care. In the sequel, r_inv stands for the inverse regression function, that is, r_inv(y) = E[X | Y = y]. We observe that for each slice h: Hence, provided r_inv is Lipschitz, one obtains: for some constant C > 0, where the c_h's are contained in the I(h)'s. Moreover, we observe that M_SIR can be written as Therefore, for some constant C > 0. Under the Lipschitz assumption on r_inv, we thus get from (4.2), (4.3) and (4.4): Hence, we recover the usual optimal rate when the predictor vector belongs to a d-dimensional vector space.
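The construction of M̂_SIR in (4.1) can be illustrated by the following Python sketch, which follows the usual sliced-SIR recipe (equal-length slices, within-slice means of the standardized predictor, weighting by the empirical slice probabilities); it is meant to match (4.1) in spirit, but since the display is not reproduced above we cannot guarantee the exact form, and the name sir_matrix is ours.

    import numpy as np

    def sir_matrix(X, Y, H):
        # X : (n, p) standardized predictors, Y : (n,) responses, H : number of slices
        p = X.shape[1]
        edges = np.linspace(Y.min(), Y.max(), H + 1)     # equal-length slices of the support of Y
        M_hat = np.zeros((p, p))
        for h in range(H):
            if h < H - 1:
                in_h = (Y >= edges[h]) & (Y < edges[h + 1])
            else:
                in_h = (Y >= edges[h]) & (Y <= edges[h + 1])
            p_h = in_h.mean()                            # empirical probability of slice h
            if p_h > 0:
                m_h = X[in_h].mean(axis=0)               # within-slice mean of X
                M_hat += p_h * np.outer(m_h, m_h)        # contribution p_h * m_h m_h^T
        return M_hat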

Statistical methodology: the NN-SIR method
With a simulation study in view, the first point is to provide a data-driven choice of the parameters H, τ and k. The aim here is only to propose a data-dependent selection of the tuning parameters; a theoretical study of this selection is beyond the scope of the paper.
We assume that the estimate of the candidate matrix and the body of the regression estimate are both constructed with the learning sample D_learn = {(X_1, Y_1), ..., (X_n, Y_n)}. (In the theorems, the independence of the two sub-samples was essentially imposed to avoid some technical difficulties in the proofs.) More precisely, for each number of slices H, we denote by M̂^H_SIR the estimate of the matrix M_SIR constructed via equation (4.1). Following the construction leading to (3.1), this gives the matrix Λ̂^{H,τ}_SIR for each threshold τ. For each number of nearest neighbors k, we then construct the estimate r̂^{H,τ,k} with the learning sample. The best random choice for (H, τ, k) is obtained via a minimization of the function (H, τ, k) ↦ E[ |r̂^{H,τ,k}(Λ̂^{H,τ}_SIR X) − E[Y|X]|^2 | D_learn ], where X is independent of the learning sample. However, this best choice cannot be computed from the data; the idea presented below is to approximate it by splitting the data. We introduce a data-dependent choice of (H, τ, k) that is inspired by the method of Chapter 7 in the book by Györfi et al. [8]. Let now D_test = {(X_{n+1}, Y_{n+1}), ..., (X_{2n}, Y_{2n})} be the testing data set of size n. We use the testing data to select the parameters Ĥ, τ̂ and k̂ that satisfy

(Ĥ, τ̂, k̂) ∈ argmin (1/n) Σ_{i=n+1}^{2n} |Y_i − r̂^{H,τ,k}(Λ̂^{H,τ}_SIR X_i)|^2,     (4.5)

where the minimum is taken over all H ∈ N, 0 < τ ≤ 1 and 1 ≤ k ≤ n. Such a method has been proved to be efficient in the classical regression estimation problem using NN (see Chapter 7 in the book by Györfi et al. [8]), in the sense that the selected parameter approximates the best random choice of the parameter.
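A direct transcription of this selection rule is sketched below in Python; it reuses the hypothetical helpers sir_matrix, estimate_Lambda and knn_dr_estimate from the earlier sketches, and the parameter grids are illustrative choices of ours, not prescribed by the paper. As in the text, both the candidate matrix and the regression body are fitted on the learning sample, and the empirical risk (4.5) is evaluated on the testing sample.

    import numpy as np

    def select_parameters(X_learn, Y_learn, X_test, Y_test,
                          H_grid=(2, 5, 10), tau_grid=(0.01, 0.05, 0.1),
                          k_grid=(1, 5, 10, 25)):
        best, best_risk = None, np.inf
        for H in H_grid:
            M_hat = sir_matrix(X_learn, Y_learn, H)              # candidate matrix on the learning set
            for tau in tau_grid:
                Lambda_hat, _ = estimate_Lambda(M_hat, tau)      # thresholded spectral estimate
                for k in k_grid:
                    preds = np.array([knn_dr_estimate(x, Lambda_hat, X_learn, Y_learn, k)
                                      for x in X_test])
                    risk = np.mean((Y_test - preds) ** 2)        # empirical L2 risk, as in (4.5)
                    if risk < best_risk:
                        best, best_risk = (H, tau, k), risk
        return best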
In the sequel, we shall make use of the estimate r̂^{Ĥ,τ̂,k̂}, and we further refer to this method as NN-SIR.

A small simulation study
The aim of the simulation study is to quickly illustrate the fact that, in NN regression, the dimension reduction step considerably improves the performance of the classical NN estimate. For this reason, we compare our NN-SIR method to the classical NN regression method, with a similar choice for the tuning parameters.
We let p = 10 and X = (X^{(1)}, ..., X^{(10)}) be a 10-dimensional standard Gaussian vector. We study the following models: Here, σ > 0 and ε is a standard real Gaussian variable independent of X. Both models, which were studied by Li [13] in his seminal work on SIR, have the property that the reduced dimension d is 2.
We compare our NN-SIR method with the usual NN method, based on a NN estimator of the regression function with the 10-dimensional predictor (10-dim. NN). Furthermore, it is of interest to compare the performance of NN-SIR to the classical NN method based on the first two components of the predictor (2-dim. NN). Of course, the latter method must be the one that works best, but it is based on full knowledge of the models, hence an unrealistic situation. The results of the 2-dim. NN method must therefore be seen as the best possible results in NN regression estimation. As for the data-driven choices of the parameters H, τ and k: for both the 2-dim. NN and 10-dim. NN methods, the choice of the number of NN is obtained by the methodology developed in Györfi et al. [8], Chapter 7; more precisely, it is obtained as in (4.5), but with the parameters H and τ removed.
We compute the estimators for a data set of size 400 (i.e. n = 200) and σ = 0.1 or 0.5. For each method, we split this data set into the learning data set D_learn of size 200 and the testing data set D_test of size 200 (see the previous section). For each estimate, say r̂, we compute an approximation of the quadratic distance between it and E[Y|X] using a Monte Carlo algorithm (with a sample of size 200), i.e. we compute an approximation of E|r̂(X) − E[Y|X]|^2. This step is replicated 200 times, with independent samples. We then compute the mean and the standard deviation (in parentheses) of these experiments. The results appear in the following tables. As expected, the (unrealistic) 2-dim. NN method provides the best performance, whereas we observe the curse of dimensionality for the 10-dim. NN method. Our NN-SIR method performs very well in each case, and the simulation study illustrates the effects of the dimension reduction step, which overcomes the curse of dimensionality: first, the numerical results are close to those of the 2-dim. NN method, in which it is known that the model is 2-dimensional; second, the results are far from those of the 10-dim. NN method.
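The Monte Carlo approximation of the quadratic error described above can be sketched as follows (an illustration only; true_r stands for the regression function E[Y|X = ·] of the simulated model, which is known in this setting, and r_hat for any of the fitted estimates):

    import numpy as np

    def monte_carlo_error(r_hat, true_r, p=10, mc_size=200, seed=None):
        # approximate E|r_hat(X) - E[Y|X]|^2 with X a standard Gaussian vector in R^p
        rng = np.random.default_rng(seed)
        X_mc = rng.standard_normal((mc_size, p))
        return float(np.mean([(r_hat(x) - true_r(x)) ** 2 for x in X_mc]))

Repeating this computation over 200 independent training samples gives the mean and standard deviation reported in the tables.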

Preliminaries
For simplicity, we assume throughout the section that |Y| ≤ 1. We let X̂ = Λ̂X, X̃ = ΛX and, for all i = n+1, ..., 2n: X̂_i = Λ̂X_i.

Proof. In the proof, μ stands for the distribution of X. Let ε > 0. Since X is independent from the sample and distributed according to μ, we have the following equality: Then, according to the Lebesgue dominated convergence theorem, one only needs to prove that for all x in the support of μ: Observe now that for all x: Let a > ||Λ||. If ||Λ̂ − Λ|| ≤ a and ||X̂_i − Λ̂x|| ≤ ε, we then have: Therefore, According to the strong law of large numbers: Assume that the latter quantity equals 0. Then, we have a.s.
Lemma 5.2. Let ϕ : R^p → R be a uniformly continuous function such that 0 ≤ ϕ ≤ 1. If k/n → 0 and Λ̂ →P Λ, we have:

Proof. For all K > 0, we let X_K = X 1{||X|| ≤ K}. Then, we write X̂_K = Λ̂X_K, X̃_K = ΛX_K, and similarly for X̂_{i,K} and X̃_{i,K}. Moreover, W_{i,K} is defined as W_i, but with the X̂_{i,K}'s instead of the X̂_i's (see Section 2.1). A moment's thought reveals that, since Σ_{i=n+1}^{2n} W_{i,K}(X̂_K) = 1: where R_K is a positive real number that satisfies sup_n R_K → 0 as K → ∞. Therefore, one only needs to prove that for all K > 0, one has: We now proceed to prove this property. Fix K > 0 and ε > 0. There exists r > 0 such that |ϕ(x) − ϕ(x′)| ≤ ε whenever ||x − x′|| ≤ r. Hence, one only needs to prove that the rightmost term tends to 0. If ||Λ̂ − Λ|| ≤ r/(4K) and ||X̂_{i,K} − X̂_K|| > r, then: Now denote by X̂_{(i),K}(x) the i-th NN of x ∈ R^p among {X̂_{n+1,K}, ..., X̂_{2n,K}}.
we can deduce from (5.4), Lemma 5.1 and the fact that Λ̂ converges to Λ in probability that Using (5.3), we get that for all ε > 0: hence (5.2) holds.
Lemma 5.3. Let ψ : R^p → R_+ be a Borel function which is bounded by 1.
Then, there exists a constant C > 0 that only depends on p and such that

Proof. By Doob's factorization lemma, there exists a Borel function ξ : R^p → R_+ such that for all i = n+1, ..., 2n: Note that such a function does not depend on i, because the law of the pair (X_i, X̂_i) does not depend on i. We let S = {(X_1, Y_1), ..., (X_n, Y_n)} and E = {X̂_{n+1}, ..., X̂_{2n}}. Then, By Stone's lemma (e.g. Lemma 6.3 in Györfi et al. [8]), there exists a constant C > 0 only depending on p, and such that: This leads to: by definition of ξ, hence the lemma.

Proof of Theorem 2.1
In the sequel, r̄ stands for the function defined for all x ∈ R^p by: Fix ε > 0. There exists a continuous function r′ : R^p → R with a bounded support such that One may also choose r′ so that 0 ≤ r′ ≤ 1. Since Σ_{i=n+1}^{2n} W_i(X̂) = 1, we have by Jensen's inequality: Introducing the continuous function r′, we obtain: According to Lemma 5.3 and by definition of r′, we then get: for some constant C > 0. Therefore, by Lemma 5.2, we have for all ε > 0: The task is now to prove the following property: But, if i, j = n+1, ..., 2n are different: Moreover, Σ_{i=n+1}^{2n} W_i(X̂) = 1, W_i(X̂) ≤ 1/k and |Y| ≤ 1 by assumption. The theorem is now a straightforward consequence of (5.5).

Proof of Theorem 2.2
Recall that we assume here that k/n → 0. We shall make use of the notation of Section 5.1: X̂ = Λ̂X, X̃ = ΛX and, for all i = n+1, ..., 2n: X̂_i = Λ̂X_i. For simplicity, we assume throughout the proof that ||X|| ≤ 1 and |Y| ≤ 1. Finally, we denote by S the sub-sample (X_1, Y_1), ..., (X_n, Y_n). The proof below will borrow and adapt some elements from the proof of Theorem 6.2 in Györfi et al. [8]. We first need a lemma.

Lemma 6.1. If d ≥ 3, then there exists a constant C > 0 such that: on the event where d̂ ≤ d and ||Λ̂|| ≤ 2||Λ||.
Proof. We assume throughout the proof that the sub-sample S is fixed, with d̂ ≤ d and ||Λ̂|| ≤ 2||Λ||, and we denote by μ̂ the law of X̂ (given S). Since d̂ ≤ d, the support of μ̂ is contained in some vector space of dimension d. For simplicity, we shall consider that μ̂ is a probability measure on R^d. We first fix ε > 0. Then, where B(x, r) stands for the Euclidean closed ball in R^d with center at x and radius r. Since ||X|| ≤ 1, the support supp(μ̂) of μ̂ is contained in the ball B(0, ||Λ̂||). Thus, one can find N(ε) Euclidean balls in R^d with radius ε, say B_1, ..., B_{N(ε)}, such that Observe that if x ∈ B_j, then B_j ⊂ B(x, 2ε). Consequently, Recall now that ||X|| ≤ 1 and hence ||X̂|| ≤ ||Λ̂||. Therefore: Using (6.2) and (6.1) leads to the following bound: Since ||Λ̂|| ≤ 2||Λ||, it is now an easy task to prove that, provided d ≥ 3, for some constant C > 0, hence the lemma.
We are now in a position to prove Theorem 2.2.
Proof of Theorem 2.2. We shall use a bias-variance decomposition of the following form: where we put, with the notation S_W = S ∪ {X_{n+1}, ..., X_{2n}}: We first proceed to bound I_1. Let us remark that since, by assumption, r(X̃) = E[Y | X̃]: Consequently, since, as seen in a similar context in the proof of Theorem 2.1, provided i, j = n+1, ..., 2n are different. Using the properties Σ_{i=n+1}^{2n} W_i(X̂) = 1, W_i(X̂) ≤ 1/k and |Y| ≤ 1, we conclude that: We now proceed to bound I_2. Since r is a Lipschitz function, there exists a constant L > 0 such that |r(x_1) − r(x_2)| ≤ L ||x_1 − x_2|| for all x_1, x_2 ∈ R^p. Then, according to (6.4): where we used the facts that ||X|| ≤ 1 and Σ_{i=n+1}^{2n} W_i(X̂) = 1. We now let ñ = [n/k], and we split the sub-sample {X̂_1, ..., X̂_{kñ}} into k sub-samples Z_1, ..., Z_k of size ñ, with Z_i = {..., X̂_{(i+1)ñ}}, i = 1, ..., k.
For each sample Z_i, we denote by Z_i^{(1)} the closest element of Z_i to X̂ (ties being handled as usual). Then, Jensen's inequality and (6.6) give Therefore, on the event where d̂ ≤ d and ||Λ̂|| ≤ 2||Λ||, we have by Lemma 6.1: for some constant C > 0. Since k/n → 0, there exists a constant κ > 0 such that ñ ≥ κn/k. Hence, on the event where d̂ ≤ d and ||Λ̂|| ≤ 2||Λ||, By (6.5) and (6.3), we then deduce that, for some constant C′ > 0: on the event where d̂ ≤ d and ||Λ̂|| ≤ 2||Λ||. Noticing that ||Λ̂ − Λ|| > ||Λ|| when ||Λ̂|| > 2||Λ||, and since |r(X̃)| ≤ 1 and |r̂(X̂)| ≤ 1, we obtain, using the Markov inequality: Finally, by the Lipschitz property of r, The last two inequalities give the result since, by the basic assumption, r(X̃) = E[Y|X].

Proof of Corollary 3.1
The proof of Corollary 3.1 is straightforward from Theorem 2.2 and Lemmas 7.1 and 7.2 below.
Our next task is to bound the quantity E||Λ − Λ̂||^2. For this purpose, we recall the following classical fact (see, e.g., Kato [12]): for any symmetric matrix A ∈ M_p(R), let v_i(A) be the normalized eigenvector associated with the i-th largest eigenvalue. If this is a simple eigenvalue, then there exists δ_A > 0 such that for any symmetric matrix A′ ∈ M_p(R) with ||A − A′|| ≤ δ_A: ||v_i(A) − v_i(A′)|| ≤ C_0 ||A − A′||, for some constant C_0 > 0 that only depends on A.
Lemma 7.2. Assume that the non-null eigenvalues of M have multiplicity 1. Then, there exists a constant C > 0 such that:

Proof. We let Here, and in the following, C is a positive constant whose value may change from line to line. Since ||v_j|| = ||v̂_j|| = 1 for all j = 1, ..., p, we have: