Iterative application of dimension reduction methods

: The goal of this article is to introduce an iterative application of dimension reduction methods. It is known that in some situations, methods such as Sliced Inverse Regression (SIR), Ordinary Least Squares (OLS) and Cumulative Mean Estimation (CUME) are able to ﬁnd only a partial basis for the dimension reduction subspace. However, for many models these methods are very good estimators of this partial basis. In this paper we propose a simple iterative procedure which diﬀers from existing combined approaches in the sense that the initial partial basis is estimated ﬁrst and the second dimension reduction approach seeks only the remainder of the dimension reduction subspace. Our approach is compared against that of existing combined dimension reduction approaches via simulated data as well as two example data sets.


Introduction
In regression analysis, one of the challenges faced with multi-dimensional data sets is graphical visualization to allow the relationship(s) between response and predictor variables to be detected. Within the past 20 years, in order to combat this challenge, several dimension reduction techniques that are very simple to implement have been introduced, including Sliced Inverse Regression (SIR) [16], Sliced Average Variance Estimation (SAVE) [8] and Principal Hessian Directions (PHD) [18]. Whilst classical slicing estimation methods such as SIR and SAVE require dividing the response into a discrete number of slices, more recent methods have been proposed which negate the need for slicing, such as Discretization-Expectation Estimation [27], Cumulative Mean Estimation (CUME), Cumulative Variance Estimation (CUVE) and Cumulative Directional Regression (CUDR) [26]. Consider a univariate response variable Y ∈ R and pdimensional predictor vector x = [X 1 , . . . , X p ] ⊤ ∈ R p . Then we can define the K-index model by where K < p, f is an unknown link function, β 1 , . . . , β K are linearly independent p-dimensional column vectors and ǫ is an error term independent of x. Define S to be the span of (β 1 , . . . , β K ) and note that replacing x with β ⊤ 1 x, . . . , β ⊤ K x reduces the dimension of the model, since the dimension of the latter is K < p. Li [16] coined S the effective dimension reduction (e.d.r.) space whose elements are e.d.r. directions. For identifiability purposes we will assume throughout that S is a central dimension reduction (CDR) subspace, which Cook [6] defined to be the intersection of all dimension reduction subspaces.
It is the aim of dimension reduction methods to find a basis for S. However, existing methodologies are not without their limitations. To combine the noted strengths of various methods, several authors have promoted the combination of methods, in particular when one or more of the methods are restricted to only find a partial basis for S (see, for e.g., [17,25,28,23]). Methods that find the full basis for S are said to have the "exhaustiveness" property, which was rigorously defined in [15].
The purpose of this article is to propose a new way of combining two (or potentially more) dimension reduction methods by using each one iteratively. This approach ensures that second (or further) iterations only return information regarding S that has not already been found. Section 2 discusses some existing methods and argues the advantages and disadvantages of each. Section 3 describes the theory and implementation of the iterative method of combining dimension reduction methods proposed in this article. The success of the iterative approach is assessed at the sample level in Section 4, where it is compared with various existing methods via simulations and examples. Concluding remarks are given in Section 5.

Dimension reduction
The use of dimension reduction methods to be implemented within this paper is not limited to the setting of a continuous Y . However, when Y is discrete it is difficult to conceptualize the role of the error term in (1). The model can be further generalized (see, for e.g., [6]) to state that Y and x are such that, for where ⊥ ⊥ denotes independence. This is equivalent to requiring that Y depends on x only through β ⊤ 1 x, . . . , β ⊤ K x whilst avoiding the necessity of an error term. Since the link function is unknown, we cannot uniquely determine the unique e.d.r. directions given by β 1 , . . . , β K . However, dimension reduction can still be achieved via γ ⊤ 1 x, . . . , γ ⊤ K x for any γ 1 , . . . , γ K such that the Span(γ 1 , . . . , γ K ) = Span(β 1 , . . . , β K ). In other words, any basis for S will suffice. Given a basis γ 1 , . . . , γ K for S, a plot of Y versus γ ⊤ 1 x, . . . , γ ⊤ K x is called a Sufficient Summary Plot (SSP, see [6]) which can be used to explore the relationship between Y and x in the lower dimensional setting. At the sample level and for {y i , x i } n i=1 denoting n sample realizations of (Y, x) and γ 1 , . . . , γ K an estimated basis for S, an Estimated SSP (ESSP) is constructed by plotting the y i 's versus the γ ⊤ 1 x i , . . . , γ ⊤ K x i 's. We now briefly discuss several dimension reduction techniques that will be the focus of our iterative approach. For simplicity we discuss them only at the population level (i.e. for a random Y, x). However, all are moment based and therefore sample estimates follow simply.

Ordinary least squares
Whilst Ordinary Least Squares (OLS) was intended for use in the multiple linear regression framework, it is also useful in the dimension reduction setting. Let β ols = Σ −1 Σ xy denote the OLS slope vector where Σ xy = Cov(x, Y ) is the covariance between x and Y . For K = 1 and an additive error in (1), Brillinger [2] showed that when x is multivariate normal, β ols = cβ 1 for a c ∈ R. The conditions for which this result holds were generalized by Li & Duan [19]. They relaxed the normality condition for x and required only that x. An additive error term was also no longer necessary. In the case of a general K in (1), consider the following condition proposed by Li [16]: which holds when x is elliptically symmetrically distributed. However, there are other situations which satisfy Condition 2.1 and Hall & Li [11] showed that it often approximately holds for high dimensional x.
for e.g., the proof to Lemma 1 in [21]). Let u be orthogonal to S. Under Condition 2.1, Hence, under Condition 2.1 and the model in (1), u ⊤ β ols = 0 for all u orthogonal to S so that β ols ∈ S. As a result, even when K > 1, OLS can still be used to extract part of S. Further discussion and results on this in a more general setting (that includes OLS) can be found on pages 143-147 of Cook [6]. This is an important point when we discuss opportunities for iterative use of dimension reduction methods later.

Sliced inverse regression
Unlike OLS, Sliced Inverse Regression [16] (SIR) can, although is not guaranteed to, return an entire basis for S when K > 1. Under Condition 2.1, Li [16] showed that eigenvectors corresponding to non-zero eigenvalues of Cov[E(x|Y )] are elements of ΣS (which we define as the basis for the span of the Σβ i 's). Let S 1 , S 2 , . . . , S H denote non-overlapping subranges of Y such that H h=1 S h = range(Y ). Li circumvented the challenge faced with determining E(x|Y ) by utilizing 'slicing' which is equivalent to discretizing Y according to which S h it belongs to. This slicing strategy means that E(x|Y ) is approximated by the slice means given as E(x|Y ∈ S h ) = µ h , (h = 1, . . . , H). The choice of p h 's is up to the researcher although it is common (and straightforward) to choose equally probable slices such that each p h = 1/H. In practice this is akin to choosing an (approximately) equal number of observations per slice.
Let Γ = γ 1 , . . . , γ K denote the basis for S returned by SIR. When SIR operates on the standardized predictor Z = Σ −1/2 (x − µ) and is followed by a re-standardization back to the original scale, then Γ ⊤ ΣΓ = I K such that the dimension reduced predictors γ ⊤ 1 x, . . . , γ ⊤ K x are uncorrelated and each have unit variance. The SIR algorithm is therefore Step 1 For p = [p 1 , . . . , p H ] and µ z,h = E(Z|Y ∈ S h ), determine Step 2 Return Σ −1/2 η 1 , . . . , Σ −1/2 η K as a basis for S where η 1 , . . . , η K are the eigenvectors corresponding to the K largest eigenvalues of V p .
Under Condition 2.1 and the model in (2), Σ −1/2 η i ∈ S if λ i is non-zero [16,6]. Hence, provided rank(V p ) = K, SIR returns a basis for S under these conditions. However, there are two notable limitations which can cause rank(V p ) < K, both of which are are important to this paper. Firstly, as illustrated by Cook & Weisberg [8], for model types which exhibit some symmetric dependency structure about the mean of x, SIR will fail to find an entire basis for S (see, for e.g., [16,8]). Secondly, since p 1 µ z,1 + p 2 µ z,2 + . . . + p H µ z,H = 0, the µ z,h 's are linearly dependent so that the maximum rank of V p is H − 1. Whilst for a continuous Y one can choose H large enough so that H −1 ≥ K, when Y is discrete the response discretization is not necessary and the number of slices H is equal to the number of unique elements in the response space. For example, if Y is binary then there are H = 2 slices and the maximum rank of V p is one. Both of these limitations mean that for some model types, SIR may only find part of a basis for S.

Sliced average variance estimation
Consider the following condition: Recalling the slicing approach used by SIR and discussed in Section 2.2, Cook & Weisberg [8] showed that if Conditions 2.1 and 2.2 hold (which are both satisfied when x is normal), then slice covariances also contain information regarding S (Li and Wang [14] also noted that SAVE holds the exhaustiveness property if x|Y is normally distributed). Cook & Weisberg therefore introduced SAVE, which may be implemented in a similar manner to SIR but where Step 1 in the SIR algorithm becomes, for SAVE, Step 1 For p = [p 1 , . . . , p H ] and Σ z,h = Cov(Z|Y ∈ S h ), determine and the eigen-analysis in Step 2 is conducted on M p to obtain a basis for S after re-standardization. SAVE does not suffer the same restrictions with respect to symmetric dependency or rank deficiencies for a discrete response that SIR does. Therefore, provided the additional Condition 2.2 also holds, SAVE can be a useful approach for returning a basis for S.

Principal Hessian directions
For x ∼ N p (µ, Σ), Li [18] used Stein's Lemma [24] and properties of the Hessian matrix to introduce PHD which may also be used to derive a basis for S. As with SAVE, the algorithm for PHD is similar to SIR where Step 1 For µ Y = E(Y ), determine the weight-covariance matrix and, again, the eigen-analysis in Step 2 is now conducted on Σ yzz to obtain a basis for S after re-standardization. One advantage of PHD over SIR and SAVE is that a slicing parameter need not be chosen. Additionally, whilst the normality condition for x seems restrictive, PHD is still applicable if Conditions 2.1 and 2.2 hold, although, as noted by Cook [5], the connection with the Hessian matrix may be lost. As stated by Li [18], one restriction of PHD is that due to the nature of the Hessian matrix, PHD is not expected to return linear components.
Li [18] also highlighted that when linear functions of x are added or subtracted from Y then, under the normality condition for x, the Hessian matrix does not change and at the population level PHD is not effected. This lead to another version of PHD that utilizes is the OLS residual. This version of PHD is typically preferred (see, for e.g., [5]) and, for many model types, estimator variability is smaller for this residuals-based approach when compared to the standard PHD approach [22]. We note that this residuals approach simply involves a transformation of Y using the OLS residual before the implementation of PHD. This is distinguished from the method we later propose in Section 3.3, which instead involves first implementing OLS in the dimension reduction sense to obtain the first component of S (rather than to simply transform Y ), followed by the implementation of residuals-based PHD.

Combined approaches
In the rejoinder to his original paper which introduced SIR, Li [17] introduced SIRII which, like SAVE, utilized slice covariances. In doing so Li introduced SIRII α , for which a convenient notation is SIRII α = (1 − α)SIR 2 + α(SIRII), which should be read as sum of (1 − α) times the SIR matrix squared (i.e. V 2 p ) and α times the SIRII matrix which is based on sliced covariances. In a similar fashion, Ye & Weiss [25] introduced the combination of (1 − α)SIR+αPHD as well as (1 − α)SIR+αSAVE. Zhu et al. [28] analysed these methods further and also introduced variations on the combination of (1−α)SIR+αSAVE. As a result of their analysis, the hybrid method they recommended was (1−α)SIR+αSAVE with a parameter value of α = 0.5.

Cumulative slicing procedures
Whilst the slicing approaches such as SIR and SAVE are simple to utilize, a possible deficiency in the continuous response case is that they may be sensitive to the number of slices chosen. Zhu et al. [26] extended these methods by using the idea of cumulative slicing. Whilst SIR extracts information regarding S using slice means given as E(x|Y ∈ S h ) and SAVE via the slice covariances denoted Cov(x|Y ∈ S h ), (h = 1, . . . , H), Cumulative Mean Estimation (CUME) and Cumulative Variance Estimation (CUVE) instead use E[xI(Y ≤ y)] and Cov[xI(Y ≤ y)] for all y ∈ R and where I(·) is the indicator function. Like SIR, CUME requires that Condition 2.1 holds and is similarly limited with respect to detecting all components when symmetric dependency structure is evident. Similarly to SAVE, CUVE requires that both Conditions 2.1 and 2.2 hold.
In the same way as for SIR, SAVE and PHD, we consider CUME and CUVE initially on the standardized scale with a re-standardization back to the original scale using Σ −1/2 following the eigen-decomposition step. For CUME, we then have: Step Step 1: The approaches also allow for the possibility of weighting with respect to Y although we are specifically dealing with the case of equal weighting as recommended by Zhu et al. [26], which are shown above. In addition to CUME and CUVE, Zhu et al. also provide a cumulative slicing extension to the directional regression proposed by Li and Wang [14]. Complementally to the CUME and CUVE approaches, Cumulative Directional Regression (CUDR) uses is an independent copy of (Y, x). CUME, CUVE and CUDR are all moment-based approaches and so can be employed conveniently in practice.

Iterative use of inverse regression methods
Before we discuss a simple iterative approach for dimension reduction, we will consider a motivating example that highlights how SIR can be a very good estimator of a partial basis for S.

A motivating example
Consider the K = 2 model where x ∼ N 10 (0, I 10 ), β 1 = (1, 0, . . . , 0) ⊤ and β 2 = (0, 1, 0, . . . , 0) ⊤ . As discussed by Cook & Weisberg [8] and Li [16], we would expect SIR to fail at finding the β ⊤ 1 x 2 part of the model due to symmetric dependency around the mean of x. However, we do not expect SAVE to have trouble finding this component. For this example we will consider simulating n = 120 observations according to this model. Let X n denote the 120 × 10 matrix whose ith row is x i -the simulated regressor vector for the ith observation. Let Γ sir and Γ save denote the K = 2 basis estimates for S that are returned by SIR and SAVE respectively. Since the dimension reduced regressors are of primary importance and it is a basis for S that is to be targeted, we will report the canonical correlations between each of the estimated dimension reduced regressors and XB (the dimension reduced regressors with B = [β 1 , β 2 ]). We have also chosen H = 5 equally probable slices. We provide the average canonical correlations in Table 1 where standard deviations of these correlations are in parentheses. We can see that SIR is highly capable of finding one component with an average first canonical correlation of 0.978 and a small standard deviation of only 0.014. However, it typically performs poorly with respect to the second element of the basis where the average canonical correlation is only 0.258. SAVE also performs very well with the first component (average first canonical correlation of 0.939) but not as well as SIR. The standard deviation for the SAVE correlation is also three times that of SIR indicating higher variability in estimation. As expected, SAVE performs better than SIR when estimating the second component.
In conclusion for this motivating example, we see that SIR is capable of estimating a partial basis extremely well for this model. It would therefore be beneficial to use this partial estimate with another use of SAVE.

Theory
We now discuss the iterative usage of dimension reduction methods that preserves a good partial estimate of a basis for S found by the first method. Given the methods discussed in Section 2, a simple approach to achieve this is to remove the already estimated component from the matrix derived for the second dimension reduction method in Step 1.
Proposition 3.1. Let C be a p × p symmetric matrix whose eigenvectors corresponding to non-zero eigenvalues are elements of Σ 1/2 S. Furthermore, let P be a projection matrix onto Σ 1/2 S 1 ⊂ Σ 1/2 S. Let S ′ 1 be the complement of S 1 . Then the eigenvectors corresponding to the non-zero eigenvalues of (I p − P) C (I p − P) are elements of Σ 1/2 S ∩ Σ 1/2 S ′ 1 (the basis whose elements are all in Σ 1/2 S but not in Σ 1/2 S 1 ).
Proof. Since P is a projection matrix onto a subspace Σ 1/2 S, then I p − P is a projection matrix onto the complement of this space. Hence, any eigenvectors corresponding to nonzero eigenvalues must still be elements of Σ 1/2 S although they must also be orthogonal to Σ 1/2 S 1 . Proposition 3.1 is useful because it provides a simple means for which to remove from estimation the already determined components of the dimension reduction subspace. The result is general, and there exist many possible combinations of existing dimension reduction methods that can be used iteratively. We will focus our attention on three particular iterative combinations where we have chosen the methods due mainly to (i) the first method having proven to be a useful estimator in many applications though also having noted limitations and (ii) the second method being a natural pairing with the first and one which seeks to overcome the aforementioned restrictions.
We choose to operate firstly on the z-scale (i.e. focus on Σ 1/2 S) and then to re-standardize to the x-scale to preserve some convenient characteristics of the dimension reduction methods discussed earlier. For example, if γ 1 , . . . , γ K are a basis for S returned by SIR, SAVE or PHD, then Var(γ ⊤ k x) = 1 and That is, the dimension reduced regressors are scaled to have variance 1 and they are also mutually uncorrelated. By removing the component on the z-scale for the iterative approach and then restandardizing with respect to Σ −1/2 (as is done for SIR, SAVE and PHD) we retain these features for the final dimension reduction subspace basis.

Iterative OLS and PHD
As in Section 2.1, let β ols denote the least squares slope vector which, under the model in (1) and Condition 2.1, is an element of S. For simplicity, let γ 1 , . . . , γ L denote the PHD directions where under the assumption of a normal x (or the less restrictive combination of Conditions 2.1 and 2.2) and the model in (1), Span(γ 1 , . . . , γ L ) ⊆ S. Given that OLS can only, at most, find one direction in the dimension reduction subspace, our intention here is to use PHD conditionally on the implementation of OLS. To remove the OLS component from the basis to be returned by PHD, we define the associated projection matrix for OLS as which projects onto the subspace of Σ 1/2 S spanned by Σ 1/2 β ols . Let the PHD matrix whose eigenvectors corresponding to non-zero eigenvalues are elements of Σ 1/2 S be denoted Σ ·zz which is equal to either Σ yzz or Σ rzz as denoted in (3) and (4) respectively. Therefore, to remove the component already retrieved by OLS, Step 1 of the PHD algorithm becomes Step 1 Determine (I p − P ols ) Σ ·zz (I p − P ols ) .
Step 2 then involves an eigen-analysis of this matrix and re-standardized eigenvectors corresponding to nonzero eigenvalues to form a basis for S when combined with the original OLS direction. Let Γ be this basis. Then choosing a re-scaled OLS direction of β ⊤ ols Σβ ols −1/2 β ols results in Γ ⊤ ΣΓ = I K so that each dimension reduced predictor has unit variance and the dimension reduction predictors are mutually uncorrelated (note, however, that simply choosing β ols still results in uncorrelated dimension reduced regressors). We will refer to this approach as PHD|OLS. We note that PHD|OLS is quite different from the Iterative Hessian Directions (IHT) method proposed by Cook and Li [7]. Although both methods are hybrids of OLS and PHD, PHD|OLS uses OLS and PHD separately by retrieving a component by OLS first and then retrieving further components using PHD conditional on the OLS component already obtained.
By contrast, IHT retrieves a basis for S based on an eigen decomposition of a matrix constructed using the OLS direction and powers of the Hessian matrix.
Lue et al. [20] also use OLS in conjunction with r-based PHD, but in a censored survival regression setting.

Iterative SIR and SAVE
SIR and SAVE make a natural pairing with SIR utilizing slice means and SAVE slice covariances. When Y is continuous, the same slicing strategy can be implemented for both although this isn't strictly necessary. For the iterative approach we choose to use SIR first since (i) simulations have shown that it has very good estimation properties for a wide choice of models and (ii) there are some known and discussed limitations involving some model types where only part of a basis for S is achievable. Let γ 1 , . . . , γ L denote a partial basis for S returned by SIR. Then the projection matrix onto the associated subspace of Σ 1/2 S is equal to For this approach, which we will refer to as SAVE|SIR, Step 1 becomes Step 1 Determine (I p − P sir ) M p (I p − P sir ) and Step 2 is applied to this matrix. For simplicity, let Γ denote the basis for S which consists of the original SIR directions and the additional SAVE directions from the iterative step, which are the elements corresponding to restandardized eigenvectors corresponding to non-zero eigenvalues of the matrix in Step 2 above. Then, again, Γ ⊤ ΣΓ = I K resulting in unit variance dimension reduced regressors that are mutually uncorrelated.

Iterative CUME and CUVE
The iterative approach for CUME and CUVE is almost identical to that taken for SIR and SAVE. Since CUME requires fewer conditions on x than what is required for CUVE, but where CUME may be restricted due to symmetric dependency, we focus on the application of CUME to obtain a partial basis followed by CUVE to find the remaining elements of a basis for S. Here we define P cume as the projection matrix for the partial basis of Σ 1/2 S obtained by CUME. Application of CUVE then follows after a pre-and post-multiplication of (I p − P cume ) with respect to the matrix shown in Step 1 for CUVE.

Simulated comparisons
In this Section, we compare the performance of SIR, SAVE, (1 − α) SIR+αSAVE (which, for simplicity, we will refer to as SIR+SAVE), SAVE|SIR, PHD|OLS, CUDR and CUVE|CUME using simulated data. Note that SIR+SAVE is equivalent to SIR when α = 0 and SAVE when α = 1. All models considered are K = 2 models. Simulations were run 500 times for each model and we used the canonical correlations between γ ⊤ 1 x, . . . , γ ⊤ K x and β ⊤ 1 x, . . . , β ⊤ K x as the method of assessment. Similar methods of assessment are commonly used in the context of dimension reduction analysis (see, for example, [16] and [28]).
The first two models we consider have a continuous response. For these models we look at the first two canonical correlations, denoted r 1 and r 2 , separately, as well as looking at averages. Since we will be looking at K = 2 models where in many cases some methods will be expected to estimate one direction very well and the other poorly, seeing r 1 and r 2 separately will provide useful insight.
For x ∼ N 10 (0, I 10 ) and ǫ ∼ N (0, 1), the first two models are  Figure 1 shows the results of the simulation for Model 1 for 500 trials. The first two figures show the mean canonical correlations (first and second) whilst the third is the mean of the average of the first two canonical correlations. The table displays the standard deviations for the average canonical correlations. Six methods are compared; namely SAVE, SAVE|SIR, PHD|OLS, SIR+SAVE, CUDR and CUVE|CUME. Both the residuals-based and standard PHD approaches were considered, however only the residuals-based PHD results are shown, as they were typically superior for this example. For SIR+SAVE, three values of α were considered (0.2, 0.5 and 0.8), however for the sake of brevity only the SIR+SAVE results with α = 0.2 are reported, as its performance was superior to SIR+SAVE with α =0.5 and 0.8. We have chosen n = 50, 100, 200 and 400 and H = 2, 5 and 10 equally probable choices for SIR and SAVE.
Model 1 was used for simulations in [28] because it has both a symmetric (β ⊤ 2 x) 2 and asymmetric (β ⊤ 1 x) 3 element, so we would expect SIR to do well at identifying the asymmetric element and SAVE to do well at finding the symmetric element. Figure 1 shows that PHD|OLS and CUVE|CUME's results are better than or equal to those of any of the other methods for all choices of n for both r 1 and r 2 . SAVE|SIR is the next best performer, followed by SIR+SAVE and CUDR. The canonical correlations for SAVE are consistently the lowest for all choices of n and H and for both directions. SAVE also shows relative  Similarly to the analysis for Model 1, Figure 2 shows the results of the simulation for Model 2. This model was chosen because, as for Model 1, the model contains a symmetric and an asymmetric element. Again, CUVE|CUME and PHD|OLS are the best performers for this model, and they produce superior results for all choices of n and H and for both directions. Next are SAVE|SIR and SIR+SAVE which provide similar results, followed by CUDR and SAVE respectively. Models 1 and 2 show that for all chosen values of n, H and α, the results of CUVE|CUME, SAVE|SIR and PHD|OLS are better than or equal to those of SIR+SAVE and SAVE.
In the examples we have considered thus far, PHD|OLS and CUVE|CUME have performed exceptionally well. However, PHD|OLS can be sensitive to extremely large response values (in comparison to other responses), whereas slicing methods such as SIR, SAVE, CUME etc. can be robust to particularly large (or small) response values since the response plays a 'positioning role' only for the ordering of the regressors and we briefly highlight this here. Additionally, and in fairness to PHD|OLS, we also provide an example where PHD|OLS is the preferred approach in a setting where large responses are not a concern. The third model we therefore consider is from Zhu et al. [26] and is defined to be, for x ∼ N 15 (0, Σ), where Σ is the covariance matrix with ijth element defined to be σ ij = 0.2 |i−j| , and β 1 = [1, 1, 1, 0, . . . , 0] ⊤ and β 2 = [1, 0, 0, 0, 1, 3, 0, . . . , 0] ⊤ , We also reconsider Model 2 but with directions β 1 = [1, 0, . . . , 0] ⊤ and β 2 = [0, 1, 0, . . . , 0] ⊤ . For brevity we report only the average canonical correlations and standard deviations for 500 trials, p = 15 and n = 400. Table 2 shows that for Model 3, which contains extremely large response values in comparison to other responses, PHD|OLS performs poorly, achieving an average canonical correlation of just 0.465, compared to 0.949 and 0.968 for SAVE|SIR and CUVE|CUME respectively. On the other hand, PHD|OLS clearly outperforms the other two methods for modified Model 2, achieving an average canonical correlation of 0.968 compared to 0.82 for SAVE|SIR and 0.898 for CUVE|CUME.
We now consider a model which has a discrete binary response, which, for x ∼ N 10 (0, I 10 ) and ǫ ∼ N (0, 1), is defined to be     Since Model 4 is a binary response model, slicing occurs naturally. As such, we focus our analysis on the SIR and SAVE based approaches which are then naturally suited to this model. In Figure 3 we have chosen to show boxplots of the average canonical correlations so that we could highlight a particular commonality between SIR+SAVE and SAVE|SIR. The results for SIR and SAVE|SIR are on the right. SIR typically performs poorly due to its inability to target an entire basis for S. SAVE does not perform well for small n, although its perfor-mance improves with increasing n. The results of SAVE|SIR and SIR+SAVE are superior to those of SIR and SAVE, however the advantage of SAVE|SIR is that it does not require the choice of an additional parameter α. What is obvious, however, is that SIR+SAVE clearly improves as α decreases but also the results seem to also approach those of SAVE|SIR. We discuss this in the following remark. Remark 4.1. As noted earlier in this section, SIR+SAVE is equivalent to SAVE when α = 1 and equivalent to SIR when α = 0. As such, we would expect that as α increases, the results of SIR+SAVE would approach those of SAVE, and that as α decreases, the results of SIR+SAVE would approach those of SIR. Figure 3 seems to support the former claim, however the opposite of the latter claim has occurred where as α decreases, the results seem to approach those of SAVE|SIR rather than SIR. This is limited to the binary response case and can be explained. In the binary case, the SIR matrix has exact rank one. Therefore, when the SAVE matrix is added, all of the additional information (other than that already found by SIR) is in that matrix even if α is small (but nonzero). A small α increases the chance of the SIR information being undisturbed by SAVE for the SIR+SAVE approach. That is, for small α the SIR matrix is prominent and the SIR information is likely to be utilized. For large α, the SAVE matrix becomes dominant and, as such, the SAVE information is likely to contribute the most to the estimated basis.
In Remark 4.1 we highlighted why SIR+SAVE seems to approach SAVE|SIR when Y is binary despite it being typically expected to approach SIR with decreasing α. To highlight this further we reconsider Model 1 (a continuous model) and also now consider a fifth model, which has a discrete ternary response, defined as In Figure 4 we provide boxplots of average canonical correlations for Model 1 and Model 5. The data was simulated for n = 200 observations. Unlike in the binary case, we can now see that the results for SIR+SAVE approach the results for SIR with decreasing α. Once again the results approach those for SAVE as α increases. Overall the best performers for both models were SIR+SAVE with α approximately 0.5 and SAVE|SIR which provided similar results.

Choosing K and the number of iterations
So far we have dealt with known K. However, in practice K is usually not known and needs to be estimated. Tests for K have been developed for the various methods discussed in this paper. For example, Li ( [16,18]) introduced conservative tests for SIR and PHD respectively whilst Cook [5] and Bentler  and Xie [1] provided further suggestions for improved tests for PHD. Zhu et al. [26] provide tests for CUME, CUVE and CUDR. Other methods exist, including those given by Bura and Cook [3] and Ferre [10].
When it comes to the number of directions to choose from the first method (i.e. OLS, SIR or CUME for the approaches we have considered here) then either: (i) The number of initial directions may be already determined. For example, naturally OLS can only return one initial direction and SIR is restricted to returning, at most, H − 1 (the number of slices minus one) which is most limiting when Y is discrete where H is the number of uniquely possible responses. (ii) A test for K or visual inspection of the eigenvalues from the eigen-decomposition step in the algorithm can be used for guidance.
For the second situation above, many tests for K are biased towards the expected value of K (i.e. the rank of the matrix in Step 1 of the methods discussed earlier) rather than the true dimension K of the underlying model. For example, we simulated 1000 trials of Model 1 for n = 400 and p = 10 and then used the test from Li [16] for SIR. The test selected K = 1 95.5% of the time and when this happens the choice of one direction from SIR follows naturally.
Complementary to this, such estimates for K can be used simultaneously to both determine the number of directions to select for the first method and also the number of directions to choose from the second. For example, Zhu et al. [26] consider various models with true dimension K = 2 but where, due to symmetrical dependence, CUME (and also SIR) are expected to find only one direction. For the majority of their models (four of six) and 1000 simulated runs, the test for K based on the CUME eigenvalues chose the estimated K as K = 1 100% of the time. On the other hand, the estimated K using the CUVE and CUDR eigenvalues was chosen to be K = 2 at least 80% of the time and up to 99% of the time (with the exception of one model for CUDR). Hence, the contradicting estimates for K occurring in practice would lead us to choose one direction from the first method and then focus on achieving the final element in the second. This approach can similarly be employed with SAVE|SIR and PHD|OLS.
For the iterative approaches we have chosen, we have opted for either OLS, SIR or CUME to be the first method. We have done so for two reasons. Firstly, these methods have less restrictive distributional conditions on x than the methods that follow next and have been shown to perform very well for a variety of models. Secondly, it is these approaches that are known to lack 'exhaustiveness' when symmetric dependency is evident. However, there is no specific reason as to why the order of application of the methods cannot be reversed. For example, OLS could follow PHD however it is less obvious how many directions one would choose for PHD and how to ensure that the chosen directions do not already include the OLS direction. Hence, the application of OLS, SIR and CUME first is an admission that these methods have missed some elements of S although an acknowledgement that they may have been very successful in finding a partial basis.
Finally, it may also be possible to carry out more than two iterations of dimension reduction methods. For example, we could use OLS first, followed by PHD and then finish with SIR to extract perhaps a final element that was not detected via PHD|OLS. However, thought would need to be given as to why the first two collectively may fail to find a complete basis and given that we have a preference for small K, the larger choices of K that intuitively follow from more than two iterations would still restrict the researcher in a data visualization capacity. We anticipate that combining two dimension reduction methodologies would suffice in most situations.

Examples
Example 4.1. For the first example we consider data provided by Dr. Hayley Castlehouse and analyzed as part of her PhD dissertation [4]. The data consists of soil compositions for 41 soil samples that were taken from a site in North Lincolnshire, England. We let the response be iron-oxide bound arsenic (As) in mg.kg −1 . The 15 predictors are depth and level of the chemical compounds PO 3− 4 , Mg 2+ , Cl − , NO − 3 , NH + 4 , SO 2− 4 , Ca 2+ , K + , F − , Na + , TIC, TOC, Fe, and Mn. To estimate the dimension of S, we implement the Bayesian information criterion (BIC) recommended by Zhu et al. [26], which is based on the eigenvalues returned by the dimension reduction method utilized. When run on the CUME eigenvalues, we estimate K = 1. By contrast, when we use the CUDR eigenvalues, we estimate that K = 2. Given the contradiction between these two estimates, we will look at including some iterative approaches in our analysis.
The eleven dimension reduction methods which are compared using this data are contained in Table 3. Whilst OLS, PHD and the cumulative slicing procedures do not require the choice of a slicing parameter, a value of H = 6 has been chosen for the other methods. There are seven observations in each slice with the exception of the fourth slice, which contains six observations. For SIR+SAVE, a value of α = 0.5 has been chosen, which was the value recommended by Zhu et al. in [28].
To assess the quality of the directions found, a polynomial model of degree two has been fit using multiple linear regression least squares to the data. Table 3 shows the adjusted R 2 values where either the first direction (i.e. K = 1) or the first two directions (i.e. K = 2) found by each method have been used. For all methods except CUVE, the results where K = 2 were an improvement when compared to those where K = 1. However, the improvement for SIR was almost negligible, as its adjusted R 2 value only increased from 0.724 to 0.728. This indicates that SIR found all of its useful information in the first direction, so that when it was combined with SAVE to find a second direction, any useful information that SAVE was able to find resulted in a much better adjusted R 2 value for SAVE|SIR of 0.777. A similar phenomenon has occurred with PHD|OLS. OLS on its own achieved an adjusted R 2 value of 0.808, however when combined with PHD, the extra information found by PHD was enough to improve the adjusted R 2 value to 0.892. In contrast with PHD|OLS's adjusted R 2 value of 0.892 when K = 2, PHD's result, when used as a standalone method, was 0.280. Similarly, SAVE's adjusted R 2 value was 0.282. These results show the value of the iterative approach to dimension reduction because the performance of PHD and SAVE has been greatly improved by conditioning on the information found by OLS and SIR respectively. SIR+SAVE's result of adjusted R 2 =0.262 (for K = 2) indicates that in this instance, the information found by SIR was overwhelmed by that of SAVE when the two matrices were added together with α = 0.5. Overall, the best result was that of PHD|OLS when K = 2 (adjusted R 2 = 0.892).
Example 4.2. In the next example, we consider the Pen Digit data analysed by Zhu & Hastie [29] and Sheather et al. [23]. The Pen Digit database contains multiple samples from 44 different writers of handwritten digits from 0 to 9. There are 16 attributes (or predictors) for each digit. Whilst there is both a training dataset and a learning dataset, we focus on the training dataset, as did Sheather et al. We also focus on the 0s and 8s, since the estimated sufficient summary plots (ESSP) indicate that these two digits cannot be separated using just one direction (see Figure 5). This results in n = 1, 499 observations, p = 16 predictors and a binary response. Due to the discrete response of the Pen Digit data, which leads to natural slicing of the data, the four methods we compare are SAVE, SIR, SAVE|SIR and SIR+SAVE with α = 0.5. The first method of assessment we use is the ESSP plot for each method shown in Figure 5. This figure shows that SIR is able to separate the groups using the first direction, but the boundary of separation slightly overlaps. SIR's second direction, as expected, provides no further useful information since the SIR matrix is of rank one and information found in a second direction would be by chance only. The second directions of SAVE|SIR and SIR+SAVE provide useful information, as they show greater variability amongst the 8's, making separation easier. SAVE also separates well, and clearly needs two directions to do so.
The second method of assessment we use involves the use of cross validation [13] and Support Vector Machines (SVM), introduced by Karatzoglou et al. [12]. We again compare SAVE, SIR, SAVE|SIR and SIR+SAVE with α = 0.5. Using cross validation, we use random sampling to divide the data into a training set and a testing set. Based on the directions found by the respective methods using observations in the training set, we predict the response for each observation in the testing set using a two-dimensional model via the use of SVM. To create this model, the R package e1071 [9] was used and a radial kernel was chosen. Thus, a classification rate can be found by calculating the percentage of correct predictions. This process was repeated 500 times so that average classification rates, along with standard deviations, could be found. In order to test the performance of each dimension reduction method relative to sample size, we let the number of observations in the training set take the values n = 200, n = 100, n = 60 and n = 40. Figure 6 shows the result of this analysis. For all chosen values of n, we can see that SAVE|SIR and SIR+SAVE produce the best classification results. This supports the results of the binary response simulation in Section 4.1, and SAVE|SIR has the advantage because, unlike SIR+SAVE, it does not require the choice of the extra parameter α. In general, SIR produces the worst results because it is only able to find one direction. However, we note that whilst SAVE's average classification rate is higher than SIR's, its variability is much higher when n = 60 and n = 40, suggesting it is more volatile than SIR for this model.

Conclusions
In this article, we introduced a new way of combining existing dimension reduction methods. Whilst it is known that methods such as SIR, OLS and CUME are restricted in some cases, we have shown that the information found by these methods for part of S can still be extremely useful. We have compared this new iterative method with existing stand-alone dimension reduction methods using both simulated and real-world data with either a continuous or discrete response, and found that the method we introduce is capable of producing superior results. We also compared the new method against the combination method recommended by Zhu et al. [28], SIR+SAVE and found the results of the new method to be at least as good as SIR+SAVE but without the added complication of having to choose the parameter α. This does not mean that our approach should necessarily be preferred over the existing combinations, but rather that it provides another alternative that may sometimes provide improved results.