Principal quantile regression for sufficient dimension reduction with heteroscedasticity

Abstract: Sufficient dimension reduction (SDR) is a successful tool for reducing data dimensionality without stringent model assumptions. In practice, data often display heteroscedasticity, which is of scientific importance in general but frequently overlooked, since a primary goal of most existing statistical methods is to identify the conditional mean relationship among variables. In this article, we propose a new SDR method called principal quantile regression (PQR) that efficiently tackles heteroscedasticity. PQR can naturally be extended to a nonlinear version via the kernel trick. Asymptotic properties are established and an efficient solution path-based algorithm is provided. Numerical examples based on both simulated and real data demonstrate PQR's advantageous performance over existing SDR methods. PQR still performs very competitively even in the absence of heteroscedasticity.


Background on sufficient dimension reduction
Dimension reduction is often of primary interest in high dimensional data analysis. Principal component analysis (PCA) is widely used in this regard, but it struggles to identify the relationship between the response and the covariates. Variable selection can be regarded as another type of dimension reduction; however, it often relies on specific model assumptions which may be violated in practice. Sufficient dimension reduction (SDR) has recently received much attention thanks to its promising performance in reducing data dimensionality under relatively mild model assumptions. SDR achieves dimension reduction of a p-dimensional predictor X by finding a matrix B = (b_1, · · · , b_d) ∈ R^{p×d} which satisfies

Y ⊥ X | B^T X. (1.1)

The intersection of all subspaces span(B) satisfying (1.1) is itself such a subspace under mild conditions (Cook [3]; Yin, Li and Cook [38]) and is called the central subspace, typically denoted by S_{Y|X}. We assume its existence throughout this article and further that span(B) = S_{Y|X} to facilitate the estimation. The dimension d of S_{Y|X} is called the structural dimension, another important quantity to be inferred from the data. The SDR model (1.1) is often called linear SDR since the dimension reduction is achieved through finding a linear mapping, B^T X. The linear SDR can naturally be extended in a nonlinear fashion by assuming

Y ⊥ X | φ(X), (1.2)

where φ : R^p → R^d is an arbitrary function of X defined on a Hilbert space denoted by H [5]. Similarly to B, φ is not unique, and we assume that φ is unique modulo injective transformations to guarantee identifiability [18]. A variety of SDR methods have been developed in the literature. Sliced inverse regression [SIR, 15] and sliced average variance estimation [SAVE, 8] are two seminal works for linear SDR and are still widely used in many applications.
Other linear SDR methods include, but are not limited to, principal Hessian directions [pHd, 17, 4], partial least squares [PLS, 11], inverse regression [6], contour regression [23], directional regression [22], and the composite quantile outer-product of gradients method [qOPG, 13]. Toward nonlinear SDR, several methods have been proposed that exploit a reproducing kernel Hilbert space for H and estimate the nonlinear function φ via the kernel trick. See, for example, Wu [34], Wu, Liang and Mukherjee [35], and Yeh, Huang and Lee [37].
The principal support vector machine [PSVM, 18] is the first attempt to tackle both linear and nonlinear SDR in a unified framework. PSVM connects SDR to the support vector machine [SVM, 33], and the idea is as follows. First, dichotomize the continuous response Y based on its value: a pseudo binary variable is introduced as Ỹ = 1 if Y is larger than a pre-specified cutoff c and Ỹ = −1 otherwise. Next, find a hyperplane of the standardized X that separates the two classes by training a linear SVM with respect to (X, Ỹ). It then turns out that the normal of the hyperplane lies in S_{Y|X}, and hence SDR naturally follows. The PSVM can be readily extended to nonlinear SDR via the kernel trick, just like the SVM.

Motivation
In practice, it is often observed that data display heteroscedasticity. The heteroscedasticity itself can be of scientific importance, but it is often overlooked since most existing statistical methods focus primarily on the conditional mean relationship. In principle, SDR only requires the conditional independence assumption as shown in (1.1) or (1.2), and there is no inherent difficulty in uncovering underlying heteroscedasticity in the data. However, most SDR methods are designed mainly for the mean relationship and may be inefficient at uncovering heteroscedasticity. This is illustrated in the upcoming toy example.
Quantile regression (QR) is known as a reasonable alternative to least squares (LS) regression when the errors are non-i.i.d. or have heteroscedastic variance. The QR explores the entire conditional distribution of Y given X by varying the target quantile level, while LS regression focuses only on the conditional mean E(Y | X). Kong and Xia [13] showed that the gradients of regression quantiles lie in S_{Y|X} and proposed an associated SDR method, qOPG. Compared with moment-based methods such as SIR and SAVE, qOPG requires less restrictive assumptions and can identify all dimension reduction directions, including those associated with a heteroscedastic structure.
The QR also has a very close connection to the SVM due to the similarity of their loss functions. However, the SVM solution depends on only a part of the data, namely the data points that are either close to the classification boundary or misclassified, while the QR takes all the data points into account when estimating the conditional quantile function. For this reason, the SVM is not well suited to capturing heteroscedasticity, which makes the PSVM inefficient at extracting information from it. Motivated by this, we propose principal quantile regression (PQR) for SDR with heteroscedasticity. In order to illustrate how the SVM and QR behave differently with heteroscedastic data, we consider the following toy example in which Y depends on X only through the error term. Notice that the conditional variance of Y given X depends on X. For the PSVM, the continuous response Y_i is artificially dichotomized based on a given c as Ỹ_{i,c} = 1 if Y_i > c and −1 otherwise. We set two values c_1 and c_2 as the 33.3% and 66.6% sample percentiles of the Y_i's, respectively, and apply the SVM to {(X_i, Ỹ_{i,c_h}), i = 1, · · · , n} for the two cutoffs c_h, h = 1, 2. Next, the QR is applied to {(X_i, Y_i), i = 1, 2, · · · , n} with two different quantile levels, 33.3% and 66.6%. The results are depicted in Figure 1. Panel (a) shows the SVM classification boundaries and panel (b) plots the regression functions estimated by the QR. As shown in panel (a), the two classification boundaries estimated from the SVM (red solid lines) are parallel with both slopes nearly zero. It clearly shows that their normals are hardly affected by the heteroscedasticity, which makes the PSVM fail. On the other hand, the QR produces different (non-zero) slope estimates for different quantile levels since it takes into account the behavior of all data points. See panel (b) of Figure 1.
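To make this contrast concrete, the following sketch fits linear QR at the two quantile levels by directly minimizing the empirical check loss. The heteroscedastic model Y = Xε with X ∼ U(0, 1) and ε ∼ N(0, 1) is our own stand-in for the toy example (the paper's exact model display is not reproduced here), and the code is an illustration, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 1.0, n)        # positive covariate
y = x * rng.standard_normal(n)      # heteroscedastic: sd(Y | X = x) = x

def qr_fit(x, y, tau):
    """Fit a linear quantile regression by minimizing the empirical check loss."""
    def loss(theta):
        u = y - theta[0] - theta[1] * x
        # check loss rho_tau(u) = u * {tau - I(u < 0)}
        return np.where(u < 0, (tau - 1.0) * u, tau * u).mean()
    return minimize(loss, x0=np.zeros(2), method="Nelder-Mead").x

b_lo = qr_fit(x, y, 0.333)[1]   # slope at the 33.3% quantile level
b_hi = qr_fit(x, y, 0.666)[1]   # slope at the 66.6% quantile level
print(b_lo, b_hi)               # lower-quantile slope negative, upper positive
```

Because sd(Y | X = x) grows with x, the fitted slopes at the two levels take opposite signs, which is precisely the quantile-level-dependent behavior that PQR exploits and that a single SVM boundary misses.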
This simple example justifies the use of quantile regression as an alternative to the SVM in the presence of heteroscedasticity and provides a clear motivation for the PQR.
The rest of the article is organized as follows. In Section 2, the PQR for linear SDR, referred to as the linear PQR, is developed, and its theoretical properties and computational issues are described in detail. The kernel PQR (KPQR) is proposed in Section 3 as a nonlinear extension of the linear PQR for nonlinear SDR. The finite sample performance of the proposed methods is investigated via simulation in Section 4 and real data analysis in Section 5. Discussion follows in Section 6. All technical proofs are relegated to the Appendix.

Principal quantile regression
Let us begin with a brief introduction of the linear QR model: Y = α + β^T X + ε, where the random error ε satisfies P(ε ≤ 0 | X) = τ for a given target quantile level τ ∈ (0, 1). The regression function α + β^T X thus represents the τth conditional quantile of Y given X, and α and β are the parameters of interest. The QR does not require the error term to be i.i.d. and hence is often regarded as an attractive alternative to conventional mean regression in the presence of heteroscedasticity. At the population level, the QR solves

(α_0, β_0) = argmin_{α,β} E[ρ_τ(Y − α − β^T X)], (2.1)

where ρ_τ(u) = u{τ − I(u < 0)} denotes the check loss function. Sample estimates of α_0 and β_0 are obtained by minimizing the empirical counterpart of (2.1). Motivated by Li, Artemiou and Li [18] and Shin et al. [28], the τth PQR objective function at the population level is defined by

Λ_τ(θ) = β^T Σ β + λ E[ρ_τ{Y − α − β^T (X − E(X))}], (2.2)

where θ = (α, β^T)^T, Σ = cov(X), and λ > 0 is the regularization parameter which balances the data fit and the model complexity. Let θ_{0,τ} = (α_{0,τ}, β_{0,τ}^T)^T be the minimizer of (2.2). Then it can be shown that β_{0,τ} is unbiased for the linear SDR (1.1) as follows.
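As a minimal sketch of these quantities, the code below implements the check loss ρ_τ and an empirical PQR-type objective that combines β^T Σ̂ β with the averaged check loss on centered predictors; the function names and interface are our own illustrative reading, not the authors' code:

```python
import numpy as np

def check_loss(u, tau):
    """Check loss rho_tau(u) = u * {tau - I(u < 0)}."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 0, (tau - 1.0) * u, tau * u)

def pqr_objective(theta, X, y, tau, lam):
    """Empirical PQR-type objective: beta' Sigma_hat beta plus
    lam times the mean check loss of y - alpha - beta'(X - mean(X))."""
    alpha, beta = theta[0], theta[1:]
    Xc = X - X.mean(axis=0)            # center the predictors
    Sigma = np.cov(X, rowvar=False)    # sample covariance of X
    resid = y - alpha - Xc @ beta
    return float(beta @ Sigma @ beta + lam * check_loss(resid, tau).mean())

print(check_loss(np.array([1.0, -1.0]), 0.3))   # [0.3 0.7]
```

Note the asymmetry of the loss: a positive residual of size 1 costs τ while a negative one costs 1 − τ, which is what ties the minimizer to the τth conditional quantile.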

Theorem 1.
Under the linearity condition that E(X | B^T X) is a linear function of B^T X, β_{0,τ} ∈ S_{Y|X} for any given τ ∈ (0, 1).
The linearity condition plays an essential role in many SDR methods. It implies E(β^T X | B^T X) = β^T P_B(Σ) X when X is centered so that E(X) = 0, where P_B(Σ) = B(B^T Σ B)^{−1} B^T Σ. It is known that the linearity condition holds when X is elliptically symmetric [20, 19] or when p is large [9]. It is important to note that the linearity condition concerns only the marginal distribution of X, not the conditional distribution of Y | X.
The proposed PQR shares a fundamental similarity with Kong and Xia [13], which recovers S_{Y|X} from derivatives of the conditional quantile of Y | X with respect to X, since β_{0,τ} can be viewed as a linear approximation of the derivative of the τth quantile of Y | X. We admit that Theorem 1 is a similar but weaker result than Lemma 1 of Kong and Xia [13] in the sense that the linear PQR requires the linearity condition and does not recover S_{Y|X} exhaustively. Both drawbacks are due to the global linearity of the target function, α + β^T{X − E(X)}, and are resolved once the kernel PQR is employed. However, the linear target quantile function works reasonably well in many real applications and brings a substantial amount of computational savings, as demonstrated in Section 4.1.
Since the linear PQR does not possess exhaustiveness, we assume the coverage condition that span(B) = S_{Y|X}. See Cook and Ni [7] for the practical impact of the coverage condition. One can then recover S_{Y|X} by obtaining multiple solutions β_{0,τ} for different values of τ. More explicitly, we consider H distinct values τ_h, h = 1, · · · , H, where H is assumed to be larger than d in order to fully recover a d-dimensional subspace. The selection of d will be further discussed in Section 2.4. Let θ_{0,h} = (α_{0,h}, β_{0,h}^T)^T denote the sequence of minimizers of (2.2) for the different values τ_h, h = 1, · · · , H. Then S_{Y|X} is estimated by the eigenvectors of M_0 associated with its non-zero eigenvalues, where

M_0 = Σ_{h=1}^H β_{0,h} β_{0,h}^T.

Notice that, unlike for PSVM [18], an additional step of dichotomizing the response is not required for PQR. Instead, PQR obtains multiple solutions β_{0,h} by varying the quantile level τ, which controls the shape of the loss function ρ_τ. This makes PQR fundamentally different from PSVM, whose objective function varies as the (pseudo) binary response changes while the loss function remains fixed. In fact, the idea is much closer to the principal weighted SVM [28].
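The aggregation step can be sketched in a few lines: stack the per-quantile solutions β̂_h, form M = Σ_h β̂_h β̂_h^T, and read off the leading eigenvectors. This is a schematic illustration with made-up coefficient vectors, not the paper's estimator:

```python
import numpy as np

def candidate_matrix(betas):
    """M = sum_h beta_h beta_h', the PQR candidate matrix."""
    B = np.asarray(betas)            # H x p array whose rows are the beta_h
    return B.T @ B

def recover_directions(betas, d):
    """The d leading eigenvectors of M estimate a basis of the central subspace."""
    M = candidate_matrix(betas)
    vals, vecs = np.linalg.eigh(M)   # eigh returns ascending eigenvalues
    return vecs[:, ::-1][:, :d]      # reorder to take the leading ones

# Toy check: every beta_h lies on span{(1, 2)/sqrt(5)}
b = np.array([1.0, 2.0]) / np.sqrt(5.0)
betas = [c * b for c in (0.5, -1.0, 2.0)]
v = recover_directions(betas, 1)[:, 0]
print(abs(v @ b))                    # ~ 1.0: the direction is recovered up to sign
```

Because M is a sum of rank-one terms, its rank is at most the number of linearly independent β̂_h, which is why H must exceed d to recover a d-dimensional subspace.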
In order to see the connection to the QR, suppose that X satisfies E(X) = 0_p and cov(X) = I_p, where 0_p and I_p are the p-dimensional zero vector and identity matrix, respectively. Then (2.2) reduces to

Λ_τ(θ) = ‖β‖² + λ E[ρ_τ(Y − α − β^T X)],

which can be viewed as a population version of the L_2-penalized linear QR. That is, the L_2-penalized linear QR coefficient β is unbiased for linear SDR when X is standardized. This provides an intuitive explanation of why the PQR is advantageous in handling heteroscedasticity.
We remark that the optimization must be carried out repeatedly over different values of τ to obtain (2.5), which can be computationally intensive, especially when n and/or H is large. To improve computational efficiency, we consider the transformation η = Σ̂^{1/2} β, which turns the sample version of (2.2) into a standard L_2-penalized quantile regression problem, (2.6). It is shown that the solution of (2.6) moves in a piecewise linear manner as τ varies, which enables us to develop an efficient algorithm that computes the entire solution trajectories of η (and hence β) as a function of τ ∈ (0, 1) with the same computational complexity as solving a single optimization problem for one τ [30, 27]. In order to illustrate the solution paths for the aforementioned toy example, we add four additional noise variables X_2, · · · , X_5. The five paths of β_1, · · · , β_5 as a function of τ are depicted in Figure 2. Notice that X_1 is the only signal variable, and its solution profile of β_1 (red solid line) shows significantly larger variability compared to those of β_2, · · · , β_5 (black dashed lines) corresponding to the noise variables.

Large sample properties
Without loss of generality, we assume E(X) = 0_p in this section. For notational simplicity, we omit the subscript τ when it is fixed, and let θ_0 and θ̂_n be the minimizers of Λ_τ(θ) and Λ̂_{n,τ}(θ), respectively, where Λ̂_{n,τ}(θ) denotes the empirical version of Λ_τ(θ) based on a sample.
We first establish the consistency of θ̂_n for an arbitrary given τ.
In order to derive the asymptotic distribution of the PQR solution θ̂_n, a Bahadur representation is given in Theorem 3.

Theorem 3. Under assumptions (C1)-(C4) in the Appendix, θ̂_n admits a Bahadur representation.
As a consequence of Theorem 3, a Bahadur representation of β̂_{n,h} follows. With this representation, we can establish the asymptotic normality of M̂_n as follows.
Finally, a basis of the central subspace S_{Y|X} is estimated by the first d leading eigenvectors of M̂_n, denoted by V̂_n = (v̂_1, · · · , v̂_d). The asymptotic normality of V̂_n is established in the following corollary, which is a direct consequence of Theorem 4 and Bura and Pfeiffer [1].
Here the operator ⊗ denotes the Kronecker product.

Determination of structural dimension
In practice, the structural dimension d is typically unknown and must be inferred from the data. Following the spirit of Li, Artemiou and Li [18], we estimate it by the d̂ that maximizes a BIC-type criterion G_n(k; ρ, M̂_n), where υ_j is the jth leading eigenvalue of the candidate matrix M̂_n and ρ > 0 is a tuning parameter. By Theorem 4 and Theorem 8 of Li, Artemiou and Li [18], we can prove that d̂ is a consistent estimator of d, i.e., lim_{n→∞} P(d̂ = d) = 1.
In order to tune ρ, we propose the following procedure based on cross-validation. First, randomly split the data into training and test sets, denoted by {(X_j^tr, Y_j^tr) : j = 1, · · · , n_tr} and {(X_j^ts, Y_j^ts) : j = 1, · · · , n_ts}, respectively, where n_ts = n − n_tr. Then apply the linear PQR to the training set and let M̂_n^tr denote the corresponding candidate matrix. For a given grid of ρ values, repeat steps 1-4 below.
3. For the given τ_h, h = 1, · · · , H, apply the linear QR to {(X_j^tr, Y_j^tr) : j = 1, · · · , n_tr} to obtain the τ_h-th conditional quantile function estimates of Y | X, denoted by f̂_{τ_h}(X). 4. Compute the total test quantile loss

TC(ρ) = Σ_{h=1}^H Σ_{j=1}^{n_ts} ρ_{τ_h}{Y_j^ts − f̂_{τ_h}(X_j^ts)}.

We repeat the above procedure on each fold of the cross-validation and select ρ* as the minimizer of the sum of TC(ρ) across folds.
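The loss in step 4 can be written down directly. Under our reading of the procedure, it sums the check loss over quantile levels and test observations, TC = Σ_h Σ_j ρ_{τ_h}(Y_j^ts − f̂_{τ_h}(X_j^ts)); the dictionary interface below is our own illustration:

```python
import numpy as np

def check_loss(u, tau):
    """Check loss rho_tau(u) = u * {tau - I(u < 0)}."""
    return np.where(u < 0, (tau - 1.0) * u, tau * u)

def total_quantile_loss(y_test, qhat):
    """Total test quantile loss summed over quantile levels and test points.
    `qhat` maps each level tau_h to the fitted conditional quantiles
    f_hat_{tau_h}(X_j^ts) evaluated on the test set."""
    return float(sum(check_loss(y_test - fhat, tau).sum()
                     for tau, fhat in qhat.items()))

# Tiny check: residuals 1 and 2 at tau = 0.5, where rho_.5(u) = |u| / 2
y = np.array([1.0, 2.0])
qhat = {0.5: np.zeros(2)}
print(total_quantile_loss(y, qhat))   # 1.5
```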
Finally, we propose to choose the d̂ which maximizes G_n(k; ρ*, M̂_n) using the full data. This tuning method is named cross-validation Bayesian information criterion (CVBIC). In Section 4.3, we investigate the numerical performance of CVBIC under a variety of combinations of p, d, and n using different models.

Theorem 5.
Consider the identity mapping from a function in H to a function in L_2(P_X), where L_2(P_X) = {f : ∫ |f|² dP_X < ∞} with P_X the probability measure induced by X. Assume that the mapping is continuous and H is a dense subset of L_2(P_X). Then ψ_{0,τ}(X) has a one-to-one transformation that is measurable with respect to σ{φ(X)}, the σ-field generated by φ(X).
The concept of unbiasedness in nonlinear SDR was first introduced by Li, Artemiou and Li [18]. See also Lee, Li and Chiaromonte [14] for a general theory of nonlinear SDR.

Sample estimation
The objective function (3.1) is not as easy to minimize as in the linear case since the space H is infinite dimensional. To tackle this issue, we employ reproducing kernel Hilbert space (RKHS) theory and let H_K be the RKHS associated with a positive definite kernel K(·, ·). Common choices of the kernel function include the radial basis kernel K(X, X′) = exp(−r‖X − X′‖²), r > 0, and the polynomial kernel (c + X^T X′)^q with a positive integer q and c ≥ 0. By RKHS theory, the minimizer of the empirical version of (3.1) has a finite representation [12]; namely, the solution can always be represented in the form ψ(·) = α^T k_n(·), where α = (α_1, · · · , α_n)^T and k_n = {K(·, X_i) : i = 1, · · · , n}^T. However, as pointed out by Li, Artemiou and Li [18] in the context of PSVM, this finite form cannot be directly used for the KPQR because it always overfits the training data. Instead, Li, Artemiou and Li [18] introduce the alternative

ψ(·) = γ^T ω(·), (3.2)

where γ = (γ_1, · · · , γ_b)^T, ω(X) = {ω_j(X) : j = 1, · · · , b}^T, and ω_j(X) is the jth leading eigenfunction of the sample covariance operator Σ_n defined by ⟨ψ_1, Σ_n ψ_2⟩_{H_K} = cov_n{ψ_1(X), ψ_2(X)} for ψ_1, ψ_2 ∈ H_K, with cov_n denoting the sample covariance. By Proposition 2 of Li, Artemiou and Li [18], the evaluations ω_j(X_i) can be computed from w_j and λ_j, the jth leading eigenvector and eigenvalue of the matrix (I_n − J_n/n) K_n (I_n − J_n/n). Here K_n is the kernel matrix whose (i, j)th element is K(X_i, X_j), i, j = 1, · · · , n, and J_n denotes the n-dimensional square matrix whose elements are all one. In fact, ω_j(X_i) is closely related to kernel principal component analysis [26] applied to the covariates X_i, i = 1, · · · , n on the RKHS generated by K(·, ·).
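The doubly centered kernel matrix (I_n − J_n/n) K_n (I_n − J_n/n) and its leading eigenpairs, which underlie the basis functions ω_j, can be computed directly. The following is a generic kernel-PCA-style sketch, not the authors' code:

```python
import numpy as np

def gaussian_kernel_matrix(X, r):
    """K_ij = exp(-r * ||X_i - X_j||^2), the radial basis kernel matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-r * sq)

def centered_kernel_eig(K, b):
    """Top-b eigenpairs of (I - J/n) K (I - J/n)."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n   # centering matrix I - J/n
    Kc = C @ K @ C
    vals, vecs = np.linalg.eigh(Kc)       # ascending order
    return vals[::-1][:b], vecs[:, ::-1][:, :b], Kc

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
vals, vecs, Kc = centered_kernel_eig(gaussian_kernel_matrix(X, 0.5), b=5)
print(np.abs(Kc.sum(axis=0)).max())   # ~ 0: centering removes row/column means
```

Centering on both sides is what makes the eigenvectors correspond to principal directions of the feature-mapped data with the mean removed, mirroring the sample covariance operator Σ_n.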
Essentially, the representation (3.2) restricts the full solution space spanned by K(·, X_i), i = 1, · · · , n to the subspace spanned by the first b principal directions in the RKHS, in order to avoid overfitting. For the choice of b, any integer between n/3 and 2n/3 can be used [18]; for our KPQR, we propose b = n/4 based on our limited numerical experience. Finally, the sample version of (3.1) is given by (3.4), where Ω is the (n × b) matrix whose (i, j)th element is ω_j(X_i) − n^{−1} Σ_{m=1}^n ω_j(X_m) for i = 1, · · · , n and j = 1, · · · , b, and Ω_i is the ith row of Ω. The dual problem of (3.4) is an even simpler quadratic program, as shown in Theorem 6.

Theorem 6. Let v̂ = (v̂_1, · · · , v̂_n)^T and η̂ = (η̂_1, · · · , η̂_n)^T denote the maximizers of the following quadratic program.

Then the minimizer of (3.4), γ̂, is obtained directly from these dual solutions.
Similarly to the linear PQR, we can obtain the entire solution profile ν̂ = (ν̂_1, · · · , ν̂_n)^T. For a given grid τ_1 < · · · < τ_H, we extract from the solution profile a sequence of kernel PQR solutions (α̂_{n,h}, γ̂_{n,h}) corresponding to τ_h and obtain the first d leading eigenvectors, V̂_n = (v̂_1, · · · , v̂_d), of Σ_{h=1}^H γ̂_{n,h} γ̂_{n,h}^T. The estimated d sufficient predictors evaluated at X are then given by φ̂(X) = {v̂_1^T ω(X), · · · , v̂_d^T ω(X)}^T. To the best of our knowledge, no method has been developed for estimating the structural dimension d in nonlinear SDR. In this article, we have skipped developing the estimation of d for the kernel PQR since it requires theoretical analysis for the RKHS, which is somewhat beyond the scope of this paper. In practice, one can consider the cross-validation idea proposed by Xia et al. [36], which selects the optimal d that results in the best (cross-validated) prediction accuracy on the reduced space. Although the idea was originally proposed in the linear SDR context, we believe it can readily be extended to nonlinear SDR as long as the notion of the reduced space is clearly defined at the sample level, which is the case for the kernel PQR.

Simulation studies
We carried out simulations to investigate the finite sample performance of the proposed linear and kernel PQR. We assume that the true structural dimension d is known for both linear and kernel PQR in Sections 4.1 and 4.2, respectively. In Section 4.3, we demonstrate the performance of the CVBIC procedure for estimating d for the linear PQR.

Linear sufficient dimension reduction
We consider the following six models, where ε ∼ N(0, 1), p = 10, 20, 30, and the sample size n is taken to be 300. Note that only two predictors, X_1 and X_2, are informative (i.e., d = 2) in models (L1)-(L5), while a third informative predictor X_3 appears (i.e., d = 3) in (L6).
We compare the linear PQR with SIR, PSVM, and qOPG. To evaluate the performance of each method, we use the following distance measure between the estimated and true basis matrices:

‖P_B̂ − P_B‖_F, (4.1)

where P_A denotes the orthogonal projection matrix onto span(A) and ‖A‖_F is the Frobenius norm of a matrix A.
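The distance measure (4.1) compares the orthogonal projections onto the estimated and true column spaces, so it is invariant to the choice of basis within each span. A direct implementation (our own sketch) is:

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix onto span(A)."""
    return A @ np.linalg.pinv(A.T @ A) @ A.T

def subspace_distance(A, B):
    """Frobenius distance ||P_A - P_B||_F between the two spans."""
    return np.linalg.norm(proj(A) - proj(B), "fro")

e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
print(subspace_distance(e1, 2 * e1))   # 0.0: same span, different basis
print(subspace_distance(e1, e2))       # sqrt(2): orthogonal spans
```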
The number of slices is fixed at 10 for SIR and PSVM, in line with the usual practice in the SDR literature for this sample size. For the linear PQR and qOPG, to be fair, we also set the number of distinct quantile levels to 10. For simplicity, we also refer to the number of quantile levels as the number of slices from now on. Note that both PSVM and PQR involve a cost parameter λ. We tried different values of λ and observed that the performance is not overly sensitive to its value; the reported values are from the best case each method can achieve. Table 1 contains the distance measure (4.1) averaged over 100 independent repetitions.
We observe that the linear PQR performs consistently better than SIR and PSVM for all scenarios under consideration. The improvement over SIR and PSVM is dramatic for models (L1)-(L4), which feature heteroscedasticity, so PQR shows the promising performance expected of it. Even for model (L5), which has no heteroscedasticity, PQR performs very competitively in comparison with SIR and PSVM. Although the improvement over qOPG turns out not to be significant, PQR reduces the computational time dramatically: qOPG is more computationally intensive than all the other methods, taking 478.1 minutes to estimate the central subspace when p = 10 and up to 3645 minutes for p = 30, while PQR takes less than 5 minutes. Figure 3 shows the computing time of the two methods for different p under model (L2). Under model (L6), the linear PQR performs unsatisfactorily. This is because the regression function is approximately symmetric about the origin, a setting in which both SIR and the linear PSVM also fail badly. This motivates the kernel version.
In the review process, one referee asked us to include comparisons with SAVE [8], contour regression [CR, 23], and directional regression [DR, 22]. The performance of these three methods is summarized in the last three columns of Table 1, which shows that PQR outperforms all three as well.

Nonlinear sufficient dimension reduction
To investigate the performance of nonlinear SDR methods, we consider the following six models (single-index location-only models with d = 1 among them), where φ_1 involves 1 + X_2² and φ_2 = sin(X_2)/2. We set p = 10 and n = 100. The predictor X is independently generated from U(−1, 1) for models (N22) and (N23) and from the standard normal distribution for the remaining four models. We compare the kernel PQR (KPQR) with the kernel SIR [KSIR, 34] and the kernel PSVM [KPSVM, 18]. For all of these kernel-based methods, we employ the Gaussian kernel K(X, X′) = exp{−‖X − X′‖²/(2σ²)} with σ chosen as the median pairwise distance between predictors in the data.
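The bandwidth rule just described, σ set to the median pairwise distance, is the standard "median heuristic" and is easy to compute; a small sketch with a toy design matrix:

```python
import numpy as np

def median_heuristic_sigma(X):
    """Gaussian-kernel bandwidth: median pairwise Euclidean distance."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices_from(d, k=1)   # distinct pairs only, no diagonal
    return float(np.median(d[iu]))

def gaussian_kernel(x, xp, sigma):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return float(np.exp(-((x - xp) ** 2).sum() / (2.0 * sigma ** 2)))

X = np.array([[0.0], [1.0], [3.0]])   # pairwise distances: 1, 3, 2
print(median_heuristic_sigma(X))      # 2.0
```

The heuristic puts a typical pair of points at roughly one bandwidth apart, which keeps the kernel matrix away from the degenerate extremes of all-ones or near-identity.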
It is not appropriate to use (4.1) to evaluate the performance of nonlinear SDR methods, so we consider two alternatives. The first is the distance correlation [29] between φ̂(X) and φ(X), denoted by dCor{φ̂(X), φ(X)}. To avoid potential overfitting, we generate an independent test set of size 1000 and evaluate the distance correlations over the test set. The second measures the prediction performance of the regression model built on the dimension reduction space. Toward this, we fit a nonparametric regression model of Y on φ̂(X) via local constant smoothing on the training set. Then the coefficient of determination, R² = Cor²(Y, Ŷ), over the independent test set can be regarded as a reasonable performance measure for nonlinear SDR methods, where Ŷ denotes the predicted value of the test Y from the nonparametric regression model.
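The first measure, the sample distance correlation [29], can be implemented from its definition via double-centered pairwise distance matrices. A minimal univariate sketch (our own, not the paper's code):

```python
import numpy as np

def _dcenter(D):
    """Double-center a distance matrix: subtract row/column means, add grand mean."""
    return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()

def distance_correlation(x, y):
    """Sample distance correlation of two univariate samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    A = _dcenter(np.abs(x[:, None] - x[None, :]))
    B = _dcenter(np.abs(y[:, None] - y[None, :]))
    dcov2 = (A * B).mean()
    dvar = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / dvar)) if dvar > 0 else 0.0

x = np.array([0.0, 1.0, 2.0, 4.0])
print(distance_correlation(x, 3 * x + 1))   # 1.0: affine dependence
```

Unlike Pearson correlation, distance correlation is zero only under independence, which is why it is a sensible criterion for comparing φ̂(X) with φ(X) even when the relationship is nonlinear.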
For both measures, larger values indicate better performance in terms of SDR. For both KPQR and KPSVM, we try different values of the cost parameter λ and set it to the value giving the best performance. Results over 100 independent repetitions are reported in Table 2. The proposed KPQR outperforms KSIR and KPSVM under all scenarios considered, including both heteroscedastic and homoscedastic models. In the review process, one referee pointed out that it is not easy to tell whether the improvement over KPSVM is significant. While revising, we performed paired t-tests, which show that the improvements over KPSVM are significant with p-values less than .0001 for all models and for both the distance correlation and coefficient of determination criteria. To provide a better snapshot of the PQR with heteroscedasticity, we additionally consider the following error-only models (i.e., d = 1).
Here ε ∼ N(0, 0.5) and X ∼ N_p(0_p, I_p) with p = 10 and n = 100. The averaged distance correlation dCor{φ̂(X), φ(X)} over test sets across 100 repetitions is reported in Table 3. Note that the R² measure does not make sense here because there is no mean relation between Y and X. KPQR shows a clear improvement in capturing signals from heteroscedasticity.

Estimation of structural dimension
We now investigate the performance of the proposed CVBIC procedure for estimating the structural dimension from the linear PQR candidate matrix. Table 4 reports the empirical probabilities (in percent) that CVBIC correctly estimates d. The numerical performance of the proposed CVBIC approach for the linear PQR is quite promising, especially when n is large.

Real data analysis
We apply the proposed method to the Boston housing data [10]. The dependent variable Y is the logarithm of the median value of owner-occupied homes in each census tract (Table 5). We do not consider tracts bounding the Charles River, so X_4 is excluded in the following analysis. The sample size ends up being 471, with each observation/case representing a census tract.
Y: median value of owner-occupied homes in $1000's
X_1: per capita crime rate by town
X_2: proportion of residential land zoned for lots over 25,000 sq. ft.
X_3: proportion of non-retail business acres per town
X_4: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
X_5: nitric oxides concentration (parts per 10 million)
X_6: average number of rooms per dwelling
X_7: proportion of owner-occupied units built prior to 1940
X_8: weighted distances to five Boston employment centres
X_9: index of accessibility to radial highways
X_10: full-value property-tax rate per $10,000
X_11: pupil-teacher ratio by town
X_12: 1000(Bk − 0.63)², where Bk is the proportion of blacks by town
X_13: % lower status of the population

We first apply the linear SDR methods considered in Section 4, with the same settings as for the simulated data. In order to estimate the structural dimension d in the PQR, we employ the four-fold CVBIC approach and select the optimal ρ = 0.02. The corresponding BIC-type criterion in (2.10) is maximized at k = 1, giving d̂ = 1. Since the true B is not available for real data, we use the distance correlation between the sufficient predictor projected onto the estimated S_{Y|X}, b̂_1^T X, and Y, dCor(b̂_1^T X, Y), as a performance measure. Table 6 reports the distance correlations from the SDR results of PSVM and PQR under different values of λ. The performance of PQR is not overly sensitive to λ, and their best performances are 0.836 and 0.869, respectively. The distance correlations of SIR and qOPG are 0.856 and 0.865, respectively. Figure 4 depicts the scatter plots of Y against b̂_1^T X obtained from the different SDR methods. Based on the distance measurements and the scatter plots, we see that PQR achieves slightly better prediction than SIR and PSVM. Compared with qOPG, PQR has a great computational advantage: it takes 6.9 seconds for PQR versus 9.7 minutes for qOPG to estimate B.
Toward nonlinear SDR, we also apply KPQR, KSIR, and KPSVM with the Gaussian kernel to the Boston housing data. First, we randomly divide the dataset into nonoverlapping training and test sets with 300 and 171 observations, respectively. The performance measures used in Section 4.2 for nonlinear SDR methods are then computed. We repeat this process on 100 random splittings of training and test sets under a wide range of cost parameters, with the number of slices fixed at 10. Table 7 reports the averaged R² and dCor{φ̂(X), Y} over the test set under selected cost parameter values. Paired t-tests show that the improvements of KPQR over both KSIR and KPSVM are significant in terms of both the distance correlation and coefficient of determination criteria. Figure 5 depicts the scatter plots of Y against φ̂(X), and of Y against its predicted value Ŷ, over the test set for one random splitting. They indicate that KPQR achieves slightly better prediction than KSIR and KPSVM. Between the linear and kernel methods, we think the linear methods give better predictions for the Boston housing data set.

Conclusion
In this paper, we proposed a new SDR method, PQR, by exploiting the QR, which is particularly useful in the presence of heteroscedasticity. Compared with PSVM, PQR greatly improves the ability to capture the conditional variance structure of Y | X. Thanks to the flexibility of SDR under very mild assumptions, PQR possesses a wide range of applicability in applications where heteroscedasticity itself is of interest. Our limited numerical experience shows that PQR still performs very competitively even in the absence of heteroscedasticity.