Sensitivity of principal Hessian direction analysis

We provide sensitivity comparisons for two competing versions of the dimension reduction method principal Hessian directions (pHd). These comparisons consider the effects of small perturbations on the estimation of the dimension reduction subspace via the influence function. We show that the two versions of pHd can behave completely differently in the presence of certain observational types. Our results also provide evidence that outliers in the traditional sense may or may not be highly influential in practice. Since influential observations may lurk within otherwise typical data, we consider the influence function in the empirical setting for the efficient detection of influential observations in practice.


Introduction
Dimension reduction methods have increased in popularity in recent times due to an abundance of high-dimensional data. The increased acceptance of such methods gives rise to the need for further understanding with regard to the sensitivity of the associated estimators. For many dimension reduction methods, however, there is a lack of diagnostics that can be used to detect influential observations. The purpose of this paper is to compare the sensitivity of two related, yet competing, dimension reduction methods and to provide an influence diagnostic that is useful in practice.
Consider a univariate response variable Y and p-dimensional predictor vector X. In the regression setting, when p is large it may be difficult to visually determine the complex structure relating Y and X due to our own inability to visualize data in more than a few dimensions. As such, dimension reduction methods that seek to reduce the dimension of X without loss of important regression information are highly valued.
Here we examine the multiple-index model

Y = f(β_1^⊤X, …, β_K^⊤X, ε),   (1)

with B = [β_1, …, β_K], where the β_k (k = 1, …, K) are unknown p-dimensional column vectors, ε is the error term with ε ⊥⊥ X (where ⊥⊥ will denote independence throughout), E(ε) = 0 and f is the unknown link function. If we let Γ = [γ_1, …, γ_K] denote an arbitrary basis for S = span(β_1, …, β_K), then dimension reduction without loss of information can be achieved by replacing X with Γ^⊤X when K < p. Li [13] calls S the effective dimension reduction (e.d.r.) space and we will follow the lead of Cook [5] in assuming that S is a central subspace, in that it is defined at its minimum dimension.
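As an illustrative sketch of the model above (the link function, dimensions and directions below are arbitrary choices for demonstration, not taken from the paper), data from a multiple-index model can be generated so that Y depends on X only through the K-dimensional projection B^⊤X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 500, 6, 2

# Illustrative e.d.r. directions spanning S: beta_1 = e_1, beta_2 = e_2
B = np.zeros((p, K))
B[0, 0] = 1.0
B[1, 1] = 1.0

X = rng.standard_normal((n, p))
eps = 0.1 * rng.standard_normal(n)

# Y = f(beta_1'X, beta_2'X, eps): Y depends on X only through B^T X
U = X @ B
Y = np.sin(U[:, 0]) + U[:, 1] ** 2 + eps

# Replacing the p-dimensional X by the K-dimensional B^T X therefore
# loses no regression information
print(U.shape)   # (500, 2)
```

Any basis Γ of span(B) would serve equally well here, since f can absorb the change of basis.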
Many dimension reduction methods have been recently proposed that seek to identify S without prior knowledge of f and only mild distributional conditions for X. These include Sliced Inverse Regression (sir, [13]), Sliced Average Variance Estimates (save, [6]), sirii [14], Principal Hessian Directions (phd, [15]) and Minimum Average Variance Estimation (mave, [22]) to name a few.
Gather et al. [10; 11] show that, at the sample level, sir can fail in the presence of just one 'bad' observation; a finding supported by way of the influence function by Prendergast [18; 19]. Prendergast [20] provided similar results via the influence function for save and sirii and showed that either of these methods or sir may be the preferred choice, from a sensitivity standpoint, with respect to certain types of observations. Lue [17] introduced a trimming algorithm for one version of phd that iteratively trimmed observations and was shown to work well under simulations of some perturbed models.
Despite the fact that two different versions of phd were introduced by Li [15], there has been little in the way of developing sensitivity comparisons between them. Cook [4] notes that one of these versions may be preferable when the underlying model incorporates strong linear trends. The first purpose of this paper is to analyze and compare the sensitivity of these methods at the model. This allows for a deeper understanding into the detrimental effect that certain observational types may have in practice and allows us to explore the differences in the methods when dealing with such observations. As a consequence of such analyses, the second purpose of this paper is to introduce influence measures that can detect influential observations in practice.

Principal Hessian directions
Of the many recently proposed dimension reduction procedures, principal Hessian directions (phd) is perhaps the most intuitive extension of existing methodology. Though the method was developed by Li [15] using Stein's Lemma [21], phd is strongly related to Ordinary Least Squares (ols) regression. Let X ∼ N_p(μ, Σ) and suppose that the model given in (1) holds with K = 1. It can be shown (see [2], [3] and [16]) that, under these conditions,

β_ols = Σ^{-1}Σ_xy ∈ S,   (2)

where Σ_xy = E{(X − μ)(Y − μ_y)} is the covariance between X and Y. Hence, in the single-index case where K = 1 for the model given in (1), ols may be employed to derive a basis for S when the predictor variable is normally distributed. An exception to this is when Σ^{-1}Σ_xy in (2) is 0, in which case, whilst the ols direction is trivially an element of S, the direction itself does not provide a basis for S.

Let X ∼ N_p(μ, Σ) and denote μ_y = E(Y) and Σ_yxx = E{(Y − μ_y)(X − μ)(X − μ)^⊤}. With the application of Stein's Lemma [21], Li [15] showed that the average Hessian matrix of E(Y|X) is given as

H_x = Σ^{-1}Σ_yxxΣ^{-1},   (3)

where the eigenvectors corresponding to nonzero eigenvalues of H_x are elements of S. Li also noted that adding a linear function of B^⊤X to Y does not change H_x, so that an alternative definition is

H_x = Σ^{-1}Σ_rxxΣ^{-1},   (4)

where Σ_rxx is defined as Σ_yxx but with Y − μ_y replaced by the population ols residual for the linear regression of Y on X. The original phd methods estimated the matrix H_z based on Z = Σ^{-1/2}(X − μ), which provides an orthonormal basis for Σ^{1/2}S. Re-transformation using Σ^{-1/2} could then be utilized to provide a basis for S. However, the eigenvectors based on nonzero eigenvalues of H_x provide an orthonormal basis for S and, as such, all further reference throughout this paper to the phd methods will concern estimation of H_x.
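A minimal moment-based sketch of the two competing estimators (our simplified construction, assuming the average Hessian takes the sandwich form Σ^{-1}Σ_yxxΣ^{-1} under normality, with an illustrative quadratic link): the y-based version estimates Σ_yxx directly, while the r-based version replaces Y − μ_y with the ols residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20000, 5
X = rng.standard_normal((n, p))          # Sigma = I for simplicity
beta = np.zeros(p)
beta[0] = 1.0                            # S = span(e_1)
y = (X @ beta) ** 2 + 0.1 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

# y-based: Sigma_yxx = E{(Y - mu_y)(X - mu)(X - mu)^T}
w_y = y - y.mean()
Sigma_yxx = (Xc * w_y[:, None]).T @ Xc / n
H_y = Sigma_inv @ Sigma_yxx @ Sigma_inv

# r-based: replace Y - mu_y with the ols residual r
b_ols = np.linalg.solve(np.cov(X, rowvar=False),
                        (Xc * w_y[:, None]).mean(axis=0))
r = w_y - Xc @ b_ols
Sigma_rxx = (Xc * r[:, None]).T @ Xc / n
H_r = Sigma_inv @ Sigma_rxx @ Sigma_inv

# The eigenvector with largest absolute eigenvalue should be close to beta
for H in (H_y, H_r):
    vals, vecs = np.linalg.eigh(H)
    v = vecs[:, np.argmax(np.abs(vals))]
    print(round(abs(v @ beta), 2))       # close to 1 for both versions
```

For this symmetric link the ols slope is near zero, so the two versions nearly coincide; the sensitivity comparisons in the paper concern how they differ under perturbation.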

Perturbation analysis in the dimension reduction setting
Consider an arbitrary distribution function F and define the contamination distribution, with respect to F and contaminant point w, to be F_ϵ = (1 − ϵ)F + ϵΔ_w, where 0 < ϵ < 1 and Δ_w is the Dirac measure putting all of its mass at w. Consider a statistical estimator with functional t defined at F and F_ϵ. The influence function [12] for t at F is defined to be

IF(t, F; w) = lim_{ϵ↓0} {t(F_ϵ) − t(F)}/ϵ.   (5)

The influence function approximates the relative influence of an observation w, from a large sample generated from F, on the estimator t. Perturbation analysis in dimension reduction seeks to study the effect of small perturbations on detecting a correct basis for S. Let b_k (k = 1, …, K) denote the functional for an e.d.r. direction estimator with, for an arbitrary distribution F, ‖b_k(F)‖ = 1 and b_i(F)^⊤b_j(F) = 0 (i ≠ j). Also, let (Y, X) ∼ G such that the model in (1) is satisfied and span{b_1(G), …, b_K(G)} = S, so that b_1(G), …, b_K(G) provide a basis for S.
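The limit defining the influence function can be evaluated numerically. A small sketch for the variance functional at F = N(μ, σ²) (our illustrative choice of functional and distribution): the contaminated moments are available in closed form, and the difference quotient approaches the known influence function (w − μ)² − σ².

```python
# Influence function of the variance functional at F = N(mu, sigma^2),
# evaluated via the contamination F_eps = (1 - eps) F + eps * Delta_w.
mu, sigma2, w = 0.0, 1.0, 3.0

def var_eps(eps):
    m1 = (1 - eps) * mu + eps * w                     # E_{F_eps}[W]
    m2 = (1 - eps) * (sigma2 + mu ** 2) + eps * w ** 2  # E_{F_eps}[W^2]
    return m2 - m1 ** 2                               # Var_{F_eps}(W)

for eps in (0.1, 0.01, 0.001):
    # difference quotient -> (w - mu)^2 - sigma^2 = 8 as eps -> 0
    print((var_eps(eps) - sigma2) / eps)
```

The printed values approach 8, the value of the influence function at w = 3.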
In the dimension reduction setting, define the contamination distribution function as

G_ϵ = (1 − ϵ)G + ϵΔ_(y_0, x_0),   (6)

where 0 < ϵ < 1 and Δ_(y_0, x_0) is the Dirac measure putting all of its mass at the point (y_0, x_0) ∈ R^{p+1}. Let S_ϵ = span{b_1(G_ϵ), …, b_K(G_ϵ)} be the equal-dimension perturbed equivalent of S. Since the basis for S is of primary relevance, a perturbation analysis seeking changes in S_ϵ should not simply compare S and S_ϵ column by column. Following the lead of Bénasséni [1], one approach is to study the angle between each b_k(G_ϵ) and its projection onto S. Noting that many measures of angle are insensitive to small perturbations, Bénasséni introduced a measure between spans that utilizes the average sine of the angle between each element of one basis and its projection onto the space spanned by the other. Bénasséni then also derived the influence function for this measure based on eigenvector subsets of the covariance matrix estimator.
Prendergast [19] utilized Bénasséni's measure for a sensitivity analysis of sir using the influence function. Prendergast [20] extended this result to include the methods save and sirii and provided useful sensitivity comparisons between these methods and sir. For a given (y_0, x_0), the influence function for this measure is simply the negative of the average sine of the angle between each perturbed direction and its projection onto the unperturbed space, in the limit as ϵ ↓ 0. Hence, the sine of this angle can be seen as a relative increase in sine due to an ϵ-perturbation. We now provide a formal definition of the Relative Increase in Sine with respect to the kth e.d.r. direction estimator.
Definition 3.1. Using the notation defined above, let θ_{ϵ,k} denote the angle between b_k(G_ϵ) and its projection onto S. The Relative Increase in absolute Sine (ris) for the kth direction is defined to be

ris(b_k, G; y_0, x_0) = lim_{ϵ↓0} |sin(θ_{ϵ,k})|/ϵ.

Remark 3.1. Let s denote the statistical functional such that, at an arbitrary distribution F, s(F) = sin(θ_F), where θ_F is the angle between b_k(F) and its projection onto S. Then, with θ_{ϵ,k} defined as in Definition 3.1, and since sin(θ_{0,k}) = 0,

ris(b_k, G; y_0, x_0) = |IF(s, G; y_0, x_0)|.

There is a strong link between the ris and the influence functions for sir, save and sirii considered by [19; 20] in that they are equal to the ris under the appropriate conditions for which they were defined. Assume θ_{ϵ,k} ∈ [−π, π]; the ris is then nonnegative and equals zero when, to first order, the perturbed direction remains within S. Closed-form solutions to ris(b_k, G; y_0, x_0) can then be used to study the effect that various observational types have on the kth e.d.r. direction estimator. This will be examined with respect to phd in the next section.
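The sine of the angle between a perturbed direction and its projection onto S, and the stabilization of sin(θ_ϵ)/ϵ as ϵ ↓ 0, can be sketched directly (an illustrative construction with S = span(e_1, e_2) and a hand-chosen perturbation direction, not the paper's estimators):

```python
import numpy as np

def sin_angle(b, Gamma):
    """sin of the angle between b and its projection onto span(Gamma),
    with Gamma assumed orthonormal."""
    P = Gamma @ Gamma.T                    # projection onto S
    b = b / np.linalg.norm(b)
    return np.linalg.norm((np.eye(len(b)) - P) @ b)

p = 4
Gamma = np.eye(p)[:, :2]                   # S = span(e_1, e_2)
d = np.array([0.0, 0.0, 1.0, 0.0])         # perturbation leaving S

# For b_eps = e_1 + eps * d, sin(theta_eps)/eps stabilizes as eps -> 0,
# mimicking the limit that defines the ris
for eps in (0.1, 0.01, 0.001):
    b_eps = Gamma[:, 0] + eps * d
    print(sin_angle(b_eps, Gamma) / eps)   # approaches 1
```

Here the limiting value 1 is simply ‖(I − P_S)d‖ for this unit perturbation direction.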

Influence on the PHD e.d.r. space estimator
Throughout this section assume G_ϵ and G are defined as in (6) with the following condition.

Condition 4.1. The model in (1) holds with X ∼ N_p(μ, Σ) and rank(H_x) = K.
Under Condition 4.1, let |λ_1| ≥ … ≥ |λ_K| > 0 denote the absolute nonzero eigenvalues of H_x that correspond to the phd e.d.r. directions γ_1, …, γ_K and let Γ = [γ_1, …, γ_K]. The proof of the following theorem can be found in the Appendix.

Theorem 4.1. With the notation defined above, let b^y_k and b^r_k denote the functionals for the kth y-based and r-based phd e.d.r. direction estimators, so that, at G and under Condition 4.1, closed-form expressions for ris(b^y_k, G; y_0, x_0) and ris(b^r_k, G; y_0, x_0) are available.

Remark 4.1. The ris measures for the phd_y and phd_r methods are equal for any given (y_0, x_0) when Σ_xy = 0. This can occur when Y ⊥⊥ X (a trivial case that is not supported under the assumption that rank(H_x) > 0) or for some types of dependence for which Y and X are nonetheless uncorrelated.

We now consider some examples that allow us to study the sensitivity of the phd methods.
This example is interesting for two reasons. Firstly, despite the fact that both phd_y and phd_r estimate the same matrix, the two methods can behave completely differently with respect to certain types of observations. Secondly, [19; 20] showed that observations of this type can be highly influential for similar dimension reduction methods such as sir, save and sirii. This is not the case for phd_r, however, so that, with respect to observations of this type, phd_r is unusual among such methods.
In Figure 1 (a) we plot ris(b^y_1, G; y_0, x_0) for varying cos(θ_0) and ‖x_0‖. It is clear from this plot that just small changes in θ_0 can result in large changes in influence, in particular with increasing ‖x_0‖. It is also clear, however, that outliers in the predictor space, in the sense of a large ‖x_0‖, are not necessarily highly influential on the e.d.r. space estimator. In fact, it is possible for outlying observations to have little or no influence. In plot (b) we provide a simple cross-section of ris(b^y_1, G; y_0, x_0) where ‖x_0‖ = 2. This plot emphasizes the large differences in influence that can be obtained with only small rotations of x_0.
Similarly, in Figure 1 (c) we plot ris(b^r_1, G; y_0, x_0) for varying cos(θ_0) and ‖x_0‖. Again it is evident that small rotations of x_0 can effect large changes in influence on the r-based e.d.r. space estimator. This is again emphasized via a cross-section where ‖x_0‖ = 2 in plot (d). For the range of cos(θ_0) and ‖x_0‖ values considered here, the highest influence was achieved for the r-based method.
However, for some types of observations it is clear that this method is less sensitive than the y-based approach. As mentioned in Example 4.1, there is zero influence on the r-based e.d.r. space estimator when x_0 ⊥ S. This is again emphasized in plot (d), whereas the same observational type has nonzero influence on the y-based method.

Sample based sensitivity
Before we look at sample versions of the ris, we review sample versions of the influence function in general (see, e.g., [7]). Consider a sample of n observations, w_1, …, w_n, drawn from F, and let F_n denote the empirical distribution of this sample. Also, let F_{n,(j)} denote the empirical distribution for the sample without the jth observation. Recall the definition of the influence function for a statistical functional t given in (5). The sample influence function (SIF) for the jth observation on the estimator t is obtained by replacing F_ϵ with F_n and F with F_{n,(j)}, such that SIF(t, F_n; w_j) = (n − 1){t(F_n) − t(F_{n,(j)})}. An approximating empirical version of the SIF can be achieved by replacing F with F_n in a closed-form derivation of the influence function. This approximating version is often referred to as the empirical influence function (EIF) and depends only on estimates at F_n and the observation w_j.
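For the sample mean the SIF and EIF coincide exactly, since the influence function of the mean functional is IF(t, F; w) = w − E_F(W); a quick numerical check on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(20)
n = len(x)

j = 0
x_del = np.delete(x, j)

# Sample influence function: (n - 1){t(F_n) - t(F_{n,(j)})}
sif = (n - 1) * (x.mean() - x_del.mean())

# Empirical influence function: plug F_n into IF(t, F; w) = w - mu
eif = x[j] - x.mean()

print(np.isclose(sif, eif))   # True: the two coincide exactly for the mean
```

For nonlinear functionals, such as the phd estimators considered here, the two versions generally differ, which is what motivates the hybrid measure introduced below.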

Sample versions of the RIS
Due to the link between the ris and the influence function (see Remark 3.1), sample versions of the ris based on the SIF and EIF will now be introduced to detect influential observations in practice. Let {(y_i, x_i) : i = 1, …, n} denote a sample of n observations, with the sample mean and covariance of the x_i's given as x̄ and S, and the sample mean of the y_i's given as ȳ. For this sample, let G_n denote the empirical distribution and let G_{n,(j)} denote the empirical distribution with the jth observation removed. Also, let Γ̂_y = [γ̂_{y,1}, …, γ̂_{y,K}] denote the estimated basis for S at G_n for y-based phd and similarly denote Γ̂_r = [γ̂_{r,1}, …, γ̂_{r,K}] for r-based phd, with P̂_y = Γ̂_yΓ̂_y^⊤ and P̂_r = Γ̂_rΓ̂_r^⊤. Also suppose that γ̂_{y,k} and γ̂_{r,k} are associated with the eigenvalues λ̂_{y,k} and λ̂_{r,k} respectively.
Let θ^y_{kj} denote the angle between the kth y-based estimated e.d.r. direction at G_{n,(j)} (i.e. without the jth observation) and its projection with respect to P̂_y onto the space spanned by the columns of Γ̂_y. Then the sample ris for the jth observation is sris_{y,k}(y_j, x_j) = (n − 1) sin θ^y_{kj}, and similarly we define sris_{r,k}(y_j, x_j) = (n − 1) sin θ^r_{kj} for the r-based approach.
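A leave-one-out sketch of the sris (using a simplified moment-based stand-in for the phd_y estimator; the function names, the link function and the planted perturbation below are ours, for illustration only):

```python
import numpy as np

def phd_y_dirs(X, y, K):
    """Leading K directions from a simplified moment-based phd_y sketch:
    eigenvectors of Sigma^-1 Sigma_yxx Sigma^-1 with largest |eigenvalue|."""
    Xc = X - X.mean(axis=0)
    w = y - y.mean()
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    H = Sigma_inv @ ((Xc * w[:, None]).T @ Xc / len(y)) @ Sigma_inv
    vals, vecs = np.linalg.eigh(H)
    order = np.argsort(-np.abs(vals))
    return vecs[:, order[:K]]

def sris(X, y, K=1):
    """(n - 1) * sin(angle) between each leave-one-out direction and its
    projection onto the full-sample estimated e.d.r. space."""
    n = len(y)
    Gamma = phd_y_dirs(X, y, K)
    P = Gamma @ Gamma.T
    out = np.empty(n)
    for j in range(n):
        b = phd_y_dirs(np.delete(X, j, axis=0), np.delete(y, j), K)[:, 0]
        out[j] = (n - 1) * np.linalg.norm(b - P @ b)   # b is unit length
    return out

rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = X[:, 0] ** 2 + 0.2 * rng.standard_normal(n)

# Plant one gross perturbation in the predictor space
X[0] = 6.0 * np.array([0.0, 1.0, 1.0, 1.0])

scores = sris(X, y)
print(scores.shape)   # (200,): one influence value per observation
```

Larger sris values flag observations whose removal rotates the estimated e.d.r. space the most.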
Two issues arise with the use of the sris. The first is that, whilst it may be employed to detect influential observations, the measure provides little interpretive information as to why an observation may or may not be influential. The second issue is that the e.d.r. space needs to be estimated n + 1 times; once each at G_n, G_{n,(1)}, …, G_{n,(n)}. An alternative is to approximate the sris by replacing G with G_n in the ris, obtaining a version that replaces the unknown parameters with their respective estimates at G_n. We will let these y- and r-based phd empirical measures be denoted eris_{y,k}(y_j, x_j) and eris_{r,k}(y_j, x_j) respectively.
The empirical approximations to the sample influence measures may not offer a reasonable approximation to the sample measures when n is small [20]. Prendergast [20] therefore introduced a hybrid measure that utilizes both the empirical and sample measures, which improved the approximation whilst retaining the efficiency and interpretative strengths of the empirical measure. For example, from the Appendix we have

ris(b^y_k, G; y_0, x_0) = ‖(I_p − P_S)IF(H_y, G; y_0, x_0)γ_k/λ_k‖,

where IF(H_y, G; y_0, x_0) is the influence function for the y-based phd average Hessian matrix estimator. Hence the empirical ris is

eris_{y,k}(y_j, x_j) = ‖(I_p − P̂_y)EIF(H_y, G_n; y_j, x_j)γ̂_{y,k}/λ̂_{y,k}‖,

where EIF(H_y, G_n; y_j, x_j) is the empirical influence function for H_y at G_n. The idea of the hybrid measure is to replace EIF(H_y, G_n; y_j, x_j) with an efficiently computed SIF(H_y, G_n; y_j, x_j) = (n − 1){H_y(G_n) − H_y(G_{n,(j)})}, which can be derived in closed form in terms of (y_j, x_j) and the estimates at G_n.
Let Σ̂_yxx denote the maximum likelihood estimate of Σ_yxx at G_n and let S denote the usual unbiased estimator of Σ at G_n. Similarly, let these estimates at G_{n,(j)} be denoted Σ̂_{yxx,(j)} and S_{(j)} respectively. Then, for S_xy denoting the usual unbiased estimate at G_n of the covariance between the x_i's and y_i's, it can be shown that Σ̂_{yxx,(j)} has a closed-form expression in terms of (y_j, x_j), S_xy and the estimates at G_n. This, along with the fact that (see, for example, [20])

S_{(j)} = {(n − 1)S − n(x_j − x̄)(x_j − x̄)^⊤/(n − 1)}/(n − 2),

allows us to derive a closed-form solution for SIF(H_y, G_n; y_j, x_j). We will denote the hybrid measure that replaces EIF(H_y, G_n; y_j, x_j) with this closed-form solution for SIF(H_y, G_n; y_j, x_j) in eris_{y,k}(y_j, x_j) as hris_{y,k}(y_j, x_j). Similarly, we can define a version for the r-based approach and denote this hris_{r,k}(y_j, x_j).
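The leave-one-out identity for the unbiased covariance estimator, (n − 2)S_{(j)} = (n − 1)S − n(x_j − x̄)(x_j − x̄)^⊤/(n − 1), is a standard result and is easily verified numerically (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.standard_normal((n, p))

S = np.cov(X, rowvar=False)            # unbiased estimate at G_n
xbar = X.mean(axis=0)

j = 5
d = (X[j] - xbar)[:, None]

# Closed-form leave-one-out update:
# (n - 2) S_(j) = (n - 1) S - n/(n - 1) (x_j - xbar)(x_j - xbar)^T
S_j_closed = ((n - 1) * S - (n / (n - 1)) * (d @ d.T)) / (n - 2)

# Direct recomputation without observation j
S_j_direct = np.cov(np.delete(X, j, axis=0), rowvar=False)

print(np.allclose(S_j_closed, S_j_direct))   # True
```

Such closed-form updates are what make the hybrid measure efficient: no re-estimation over n deleted samples is required.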
For comparative purposes we will also consider the Mahalanobis Distance (md) as a measure of outlyingness for observations in the predictor space. For the ith observation this is given as

md_i = {(x_i − x̄)^⊤S^{-1}(x_i − x̄)}^{1/2}.

We now consider an example that looks at the usefulness of these influence measures in practice.
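A short sketch of the md computation, reading md_i as the root of the standardized quadratic form in x_i, on illustrative data with one planted predictor-space outlier:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 3))
X[0] = [5.0, 5.0, 5.0]                 # one clear outlier in the predictor space

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
Xc = X - xbar

# md_i = sqrt{(x_i - xbar)^T S^{-1} (x_i - xbar)}
md = np.sqrt(np.einsum('ij,jk,ik->i', Xc, S_inv, Xc))

print(md.argmax())   # 0: the planted outlier has the largest md
```

As the Hitter's data example illustrates, a large md need not imply a large influence on the phd estimators, and vice versa.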

Hitter's data example
The Hitter's data set, first published in Sports Illustrated (April 20, 1987), contains seventeen quantitative variables concerning regular and leading substitute hitters competing in American major league baseball in 1986. The response is the log of salary; individuals whose salary was not recorded were omitted, leaving a total of n = 263 observations. Lue [17] also applied phd to these data. The three largest absolute eigenvalues for phd_y are 0.0314, 0.0238 and 0.0060 and, as such, we choose K = 2. In Figure 2 we provide plots of sample versions of the ris for phd_y and phd_r and the md values for the Hitter's data. For clarity, all data in the plots are ordered according to the size of the phd_y ris values, such that i indexes the ith smallest average of the ris values for the 1st and 2nd phd_y directions.
Plot (a) shows that the eris provides a good approximation to the sris and can be used to successfully detect influential observations for this example with respect to phd_y; however, it tends to underestimate the sris. The hris, on the other hand, generally gives an improved approximation for these data. Plot (b) indicates similar findings for phd_r, though the ordering according to the phd_y values makes it difficult to draw direct comparisons. This will be left to the discussion of Table 1.
In Plot (c) we provide direct comparisons between the sris for phd_y and phd_r. We see that the magnitude of influence can be significantly greater for phd_r, with the largest average sris for phd_r being more than three-fold the largest calculated for phd_y. Conversely, however, it is also clear from this plot that, for some observations that are highly influential on the phd_y estimator, little influence is recorded for the phd_r estimator. This plot further emphasizes the difference between the methods with regard to sensitivity.
In Plot (d) we provide the md values for the data. Here it is evident that there is little tendency for outliers to be influential and vice versa when compared to the influence values recorded for phd y . We leave comparisons of the md values with the influence on the phd r estimator to the discussion of Table 1. In Table 1 we provide further comparisons between the sample versions of the ris for phd y and phd r using Spearman Rank Correlations.
For this example we see that the sris for each of the phd y directions is approximated well by the respective eris values. With respect to phd r , the eris approximates the sris very well for the first direction and moderately well with respect to the second direction. The hris approximates the sris extremely well for each direction estimated using either method.
The low correlations between the sris and md values emphasize that not all outliers are influential and vice versa, so that treating outliers (for example, by trimming) may not necessarily benefit the estimates. As such, observations that are troublesome from an influence perspective may lurk within otherwise typical observations.

Conclusion
We have introduced and considered an influence measure (ris) based on the influence function and Bénasséni's coefficient to compare two versions of Principal Hessian Directions (phd y and phd r ). Despite the fact that phd y and phd r seek to estimate the same Hessian matrix (and hence a basis for S) under assumed normality of the predictor variable, we have shown that these methods can behave differently in the presence of certain observational types.
Since these differences exist in favor of either phd y or phd r depending on the observational types considered, we recommend the implementation of both approaches in practice and for users to give consideration to both analyses.
The unboundedness of the influence measure for both methods also reiterates the findings for other dimension reduction methods by [10; 11; 18; 19; 20], which show that such methods can fail in the presence of certain types of observations. As such, work on the robustification of phd_y and phd_r should be initiated.
We also provided details for how a measure such as the sris can be utilized at the sample level to detect influential observations in practice. Two sample measures, the eris and hris, were considered as efficient approximations to the true sample influence. The eris tended to underestimate the influence, in particular for small samples, though was typically successful at detecting influential observations for the example considered. For this example it is important to note that the hris provided an excellent approximation to the sample influence.

A.1. Preliminaries
For simplicity throughout, when necessary let {. . .}^⊤ denote the transpose of the preceding term enclosed in {}. Let T_y and T denote the functionals for the usual mean estimators of Y and X respectively, where T_y(G) = μ_y and T(G) = μ. Also, let C denote the functional for the usual covariance matrix estimator, where C(G) = Σ, and recall that cov_G(Y, X) = Σ_yx with Σ_xy = Σ_yx^⊤.

A.3. RIS proof for r-based PHD of Theorem 4.1
The same conditions and definitions as those given for the ris proof for phd_y are likewise employed here. Let C_rxx be the functional defined at an arbitrary F to be C_rxx(F) = ∫ r_F(Y, X){X − T(F)}{X − T(F)}^⊤ dF, where r_F(Y, X) denotes the ols residual function for the regression of Y on X with (Y, X) ∼ F, and denote C_rxx(G) = Σ_rxx. The ols residual functional is of the form r_F(Y, X) = Y − T_y(F) − {X − T(F)}^⊤{C(F)}^{-1}C_xy(F), so that, at G_ϵ, the residual may be expanded in terms of the influence functions for T_y, T, C and C_xy. Then, from (8), and since Σ_rxx = Σ_yxx when X ∼ N_p(μ, Σ), the influence function for C_rxx follows. From (11), the remainder of the proof can be completed by closely following the proof for the phd_y ris.
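The identity Σ_rxx = Σ_yxx under normal X, used in the step above, holds because the third central moments of a normal vector vanish; it can be checked by simulation (an illustrative sketch with an arbitrary link chosen to have a strong linear trend, so that r differs markedly from Y − μ_y):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200000, 3
X = rng.standard_normal((n, p))
y = np.sin(X[:, 0]) + X[:, 0] + 0.1 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)
w = y - y.mean()

# ols residual r = (Y - mu_y) - (X - mu)^T Sigma^{-1} Sigma_xy
b = np.linalg.solve(np.cov(X, rowvar=False),
                    (Xc * w[:, None]).mean(axis=0))
r = w - Xc @ b

Sigma_yxx = (Xc * w[:, None]).T @ Xc / n
Sigma_rxx = (Xc * r[:, None]).T @ Xc / n

# The difference is a sample third-moment term that vanishes for normal X
print(np.abs(Sigma_rxx - Sigma_yxx).max() < 0.05)   # True for this sample size
```

For non-normal X the two matrices need not agree, which is why Condition 4.1 assumes normality.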