Dimension reduction with expectation of conditional difference measure

In this article, we introduce a flexible model-free approach to sufficient dimension reduction analysis using the expectation of conditional difference measure. Without strict conditions such as the linearity condition or the constant covariance condition, the method estimates the central subspace exhaustively and efficiently under linear or nonlinear relationships between the response and predictors. The method is especially meaningful when the response is categorical. We also study the √n-consistency and asymptotic normality of the estimate. The efficacy of our method is demonstrated through both simulations and a real data analysis.


Introduction
With the increase of dimensionality, the volume of the space increases so fast that the available data become sparse (Bellman, 1961). This sparsity is a problem for many statistical methods, since not enough data are available for model fitting or inference. Consequently, many classical models derived from oversimplified assumptions, as well as nonparametric methods, are no longer reliable. Therefore, dimension reduction, which reduces the data dimension while retaining sufficient important information, can play a critical role in high-dimensional data analysis. With dimension reduction as a pre-processing step, the number of reduced dimensions is often small, so parametric and nonparametric modelling methods can then be readily applied to the reduced data.
Sufficient dimension reduction is one approach to dimension reduction. It focuses on finding a linear transformation of the predictors such that, given the transformed predictors, the response and the original predictors are independent (Cook, 1994, 1996; Li, 1991). Over the past 25 years, sufficient dimension reduction has been an active research area, and many methods have been developed to estimate the central subspace (Cook, 1996). These methods can be classified into three groups: inverse, forward and joint regression methods. Inverse regression methods use the regression of X|Y and require certain conditions on X, such as the linearity condition and/or the constant covariance condition. Specific methods include sliced inverse regression (SIR; Li, 1991), sliced average variance estimation (SAVE; Cook & Weisberg, 1991) and directional regression (DR; Li & Wang, 2007); see also Cook and Ni (2005), Cook and Zhang (2014), Dong and Li (2010), Fung et al. (2002) and Zhu and Fang (1996). Forward regression methods include the minimum average variance estimation (MAVE; Xia et al., 2002) and its variants (Xia, 2007; Wang & Xia, 2008), the average derivative estimate (Härdle & Stoker, 1989; Powell et al., 1989), and structure-adaptive methods (Hristache et al., 2001; Ma & Zhu, 2013). Forward methods require nonparametric techniques such as kernel smoothing. Joint regression methods work with the joint distribution of (Y, X); they include the principal Hessian directions (PHD; Cook, 1998; Li, 1992) and the Fourier method (Zeng & Zhu, 2010; Zhu & Zeng, 2006), and they require either smoothing techniques or stronger conditions.
In this article, we develop a new sufficient dimension reduction method based on the measure proposed in Yin and Yuan (2020) to estimate the central subspace. It involves slicing the range of Y into several intervals, similar to classical inverse approaches such as SIR and SAVE, but it requires neither the linearity condition nor the constant covariance condition and can exhaustively recover the central subspace without any smoothing requirement. On the other hand, compared with other sufficient dimension reduction methods based on distance measures, such as Sheng and Yin (2016), our method is more natural when the response Y is categorical with no numerical meaning, because the measure used in this article is properly defined for categorical variables.
This article is organized as follows: Section 2 introduces the new sufficient dimension reduction method, the algorithm, the theoretical properties and the method for estimating the structural dimension d. Section 3 presents the simulation studies, Section 4 presents the real data analysis, and a brief discussion follows in Section 5.

A measure of divergence
Yin and Yuan (2020) proposed a new measure of divergence for testing independence between two random vectors. Let X ∈ R^p and Y ∈ R^q, where p and q are positive integers. For X and Y with finite first moments, the measure between X and Y is a nonnegative number, C(X|Y), defined by

C^2(X|Y) = E_Y ∫_{R^p} |f_{X|Y}(t) − f_X(t)|^2 w(t) dt,

where f_{X|Y} and f_X stand for the characteristic functions of X|Y and X, respectively. Let |f|^2 = f f̄ for a complex-valued function f, with f̄ being the conjugate of f. The weight function w(t) is a specially chosen positive function; more details on w(t) can be found in Yin and Yuan (2020). They also give the equivalent formula

C^2(X|Y) = E[2|X_Y − X'| − |X_Y − X'_Y|] − E|X − X'|,

where the expectation is over all the random vectors involved; for instance, the expectations involving X_Y are first taken conditionally on Y and then over Y. Here (X', Y') is an independent and identically distributed copy of (X, Y), X_Y denotes a random variable distributed as X|Y, and X'_Y denotes a random variable distributed as X'|Y' with Y' = Y, so that X_Y and X'_Y are conditionally i.i.d. given Y.
One property of C^2(X|Y) is that it equals 0 if and only if the two random vectors are independent (Yin & Yuan, 2020). This property makes it possible to use C^2(X|Y) as a sufficient dimension reduction tool. Moreover, the measure works well for both continuous and categorical Y. Because C^2(X|Y) is well defined for categorical Y, our method is particularly meaningful when the class labels of a dataset have no numerical meaning, a setting where other measures do not enjoy a similar advantage.
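As a concrete illustration, the moment form of the measure can be estimated from data by plugging in sample means of pairwise distances. Below is a minimal sketch for a categorical Y; the function name and implementation details are ours, not from the article.

```python
import numpy as np

def ecd_sq(X, y):
    """Plug-in estimate of C^2(X|Y) for categorical y, using the
    energy-distance form E[2|X_Y - X'| - |X_Y - X'_Y|] - E|X - X'|,
    with slice weights n_y / n."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    y = np.asarray(y)
    n = X.shape[0]
    # pairwise Euclidean distances |X_i - X_j|
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    grand = D.mean()                         # estimates E|X - X'|
    total = 0.0
    for level in np.unique(y):
        idx = np.flatnonzero(y == level)
        cross = D[idx].mean()                # estimates E|X_Y - X'| for this slice
        within = D[np.ix_(idx, idx)].mean()  # estimates E|X_Y - X'_Y|
        total += (idx.size / n) * (2.0 * cross - within - grand)
    return total
```

When the conditional distribution of X within every class coincides with the marginal distribution, the statistic is (near) zero; location or scale differences between classes make it positive.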

Review of sufficient dimension reduction
Let γ be a p × q matrix with q ≤ p, and let ⊥⊥ denote independence. The following conditional independence leads to the definition of sufficient dimension reduction:

Y ⊥⊥ X | γ'X, (3)

where (3) indicates that the regression information of Y given X is completely contained in the linear combinations γ'X. The column space of γ in (3), denoted by S(γ), is called a dimension reduction subspace.
If the intersection of all dimension reduction subspaces is itself a dimension reduction subspace, then it is called the central subspace (CS), denoted by S_{Y|X} (Cook, 1994, 1996; Li, 1991). Under mild conditions, the CS exists (Cook, 1998; Yin et al., 2008). Throughout the article, we assume the CS exists, in which case it is unique. Furthermore, let d denote the structural dimension of the CS, and let Σ_X be the covariance matrix of X, which is assumed to be nonsingular. Our primary goal is to identify the CS by estimating d and a p × d basis matrix of the CS.
Here we introduce some notation needed in the following sections. Let β be a matrix and S(β) be the subspace spanned by the column vectors of β; dim(S(β)) is the dimension of S(β). P_{β(Σ_X)} denotes the projection operator onto S(β) with respect to the inner product <a, b> = a'Σ_X b, that is,

P_{β(Σ_X)} = β(β'Σ_Xβ)^{-1}β'Σ_X,  Q_{β(Σ_X)} = I − P_{β(Σ_X)},

where I is the identity matrix.
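A small sketch of this projection operator (the function name is ours); the docstring records the standard closed form of the projection with respect to the Σ_X inner product.

```python
import numpy as np

def proj_sigma(beta, Sigma):
    """Projection onto S(beta) w.r.t. the inner product <a, b> = a' Sigma b:
    P = beta (beta' Sigma beta)^{-1} beta' Sigma."""
    beta = np.asarray(beta, dtype=float)
    if beta.ndim == 1:
        beta = beta[:, None]
    return beta @ np.linalg.solve(beta.T @ Sigma @ beta, beta.T @ Sigma)
```

P is idempotent and fixes S(beta); the complementary projection is Q = np.eye(p) - P.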

The new sufficient dimension reduction method
Let β be an arbitrary p × d_0 matrix, where 1 ≤ d_0 ≤ p. Under mild conditions, it can be proved that solving

max_{β: β'Σ_Xβ = I_{d_0}} C^2(β'X|Y) (4)

will yield a basis of the central subspace.
Here the squared divergence between β'X and Y is defined as

C^2(β'X|Y) = E[2|β'X_Y − β'X'| − |β'X_Y − β'X'_Y|] − E|β'X − β'X'|.

The conditions E|X| < ∞, E|Y| < ∞ and E|X_Y| < ∞ in Yin and Yuan (2020) guarantee that C^2(β'X|Y) is finite; thus throughout the article, we assume they hold. The constraint β'Σ_Xβ = I_{d_0} in (4) is needed because C^2(cβ'X|Y) = |c| C^2(β'X|Y) for any constant c (Yin & Yuan, 2020).
The following propositions justify our estimator. They ensure that if we maximize C 2 (β X|Y) with respect to β under the constraint and some mild conditions, the solution indeed spans the CS.
Proposition 2.1: Let η be a p × d basis matrix of the CS and β be a p × d_1 matrix with η'Σ_Xη = I_d, β'Σ_Xβ = I_{d_1} and S(β) ⊆ S(η). Then C^2(β'X|Y) ≤ C^2(η'X|Y), with equality if and only if S(β) = S(η).

Proposition 2.2: Let η be a p × d basis matrix of the CS and β be a p × d_2 matrix with η'Σ_Xη = I_d and β'Σ_Xβ = I_{d_2}; here d_2 may be larger than, smaller than or equal to d. Suppose P_{η(Σ_X)}X ⊥⊥ Q_{η(Σ_X)}X and S(β) is not a subspace of S(η). Then C^2(β'X|Y) < C^2(η'X|Y).
Proposition 2.1 indicates that if S(β) is a subspace of the CS, then C^2(β'X|Y) is less than or equal to C^2(η'X|Y), and the equality holds if and only if β is a basis matrix of the CS, i.e., S(β) = S(η). Proposition 2.2 implies that if S(β) is not a subspace of the CS, then C^2(β'X|Y) is strictly less than C^2(η'X|Y) under a mild condition. Together, the two propositions show that we can identify the CS by maximizing C^2(β'X|Y) with respect to β under the quadratic constraint. The condition in Proposition 2.2, P_{η(Σ_X)}X ⊥⊥ Q_{η(Σ_X)}X, was discussed in Sheng and Yin (2013), who showed that the condition is not very strict and is satisfied asymptotically when p is reasonably large. Proofs of Propositions 2.1 and 2.2 are in Appendix A.

Estimating the CS when d is specified
In this section, we develop an algorithm for estimating the CS when the structural dimension d is known. Let (X, Y) = {(X_i, Y_i), i = 1, ..., n} be a random sample from (X, Y) and let β be a p × d matrix. For the purpose of slicing, these n observations can be equivalently written as (X_{y,k_y}, Y_{y,k_y}), where y = 1, ..., H, k_y = 1, ..., n_y, and n_y is the number of observations in slice y. The empirical version of C^2(β'X|Y), denoted by C^2_n(β'X|Y), is defined as

C^2_n(β'X|Y) = Σ_{y=1}^H (n_y/n) [ (2/(n_y n)) Σ_{k_y=1}^{n_y} Σ_{j=1}^n |β'X_{y,k_y} − β'X_j| − (1/n_y^2) Σ_{k_y=1}^{n_y} Σ_{l_y=1}^{n_y} |β'X_{y,k_y} − β'X_{y,l_y}| − (1/n^2) Σ_{i=1}^n Σ_{j=1}^n |β'X_i − β'X_j| ].

Here |·| is the Euclidean norm in the respective dimension. Let Σ̂_X be the estimate of Σ_X. Then an estimated basis matrix of the CS, say η_n, is

η_n = arg max_{β: β'Σ̂_Xβ = I_d} C^2_n(β'X|Y). (6)

An outline of the algorithm is as follows.
(1) Initialization: obtain an initial value η^(0) using any existing sufficient dimension reduction method, such as SIR (Li, 1991) or SAVE (Cook & Weisberg, 1991).
(2) Iteration: let η^(k) be the estimate of η in the kth iteration. To search for η^(k+1), the interior-point approach is applied: the original optimization problem in (6) is replaced by a sequence of barrier subproblems, which are solved approximately by two powerful tools, sequential quadratic programming and trust-region techniques. In this process, one of two main types of steps is used at each iteration: a direct step or a conjugate gradient step. By default, the algorithm tries a direct step first; if the direct step fails, it attempts a conjugate gradient step. More extensive descriptions of the interior-point approach are given in Byrd et al. (1999, 2000) and Waltz et al. (2006).
(3) Convergence check: if the difference between η^(k) and η^(k+1) is smaller than a preset tolerance, such as 10^{-6}, stop the iteration and set η_n = η^(k+1); otherwise, set k := k + 1 and go to step (2).
In the above algorithm, we assume the structural dimension d is known, which is not true in practice. We will propose an approach to estimate d in Section 2.6.
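The estimation step can be sketched in a few lines of code. Here scipy's SLSQP solver stands in for the interior-point routine described above, and the empirical objective is the slice-weighted energy-distance form; all names and implementation details are ours, a sketch rather than the article's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def ecd_obj(b, X, y, p, d):
    """Empirical C^2_n(beta'X | Y): slice-weighted energy distances."""
    Z = X @ b.reshape(p, d)
    n = Z.shape[0]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    grand, val = D.mean(), 0.0
    for lev in np.unique(y):
        idx = np.flatnonzero(y == lev)
        val += (idx.size / n) * (2 * D[idx].mean()
                                 - D[np.ix_(idx, idx)].mean() - grand)
    return val

def estimate_cs(X, y, d, beta0, tol=1e-6):
    """Maximize C^2_n(beta'X | Y) subject to beta' Sigma_hat beta = I_d."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    cons = {"type": "eq",
            "fun": lambda b: (b.reshape(p, d).T @ Sigma @ b.reshape(p, d)
                              - np.eye(d)).ravel()}
    res = minimize(lambda b: -ecd_obj(b, X, y, p, d),
                   np.asarray(beta0, dtype=float).ravel(),
                   method="SLSQP", constraints=[cons], tol=tol)
    return res.x.reshape(p, d)
```

In practice the initial value beta0 would come from SIR or SAVE, as in step (1) of the algorithm; a local solver like SLSQP inherits the usual sensitivity to that starting point.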

Theoretical properties
Proposition 2.3: Let η_n = arg max_{β: β'Σ̂_Xβ = I_d} C^2_n(β'X|Y), and let η be a basis matrix of the CS with η'Σ_Xη = I_d. Under the condition P_{η(Σ_X)}X ⊥⊥ Q_{η(Σ_X)}X, η_n is a consistent estimator of a basis of the CS; that is, there exists a rotation matrix Q with Q'Q = I_d such that η_n converges in probability to ηQ.

Furthermore, we can prove the √n-consistency and asymptotic normality of the estimator, as stated below.
Proposition 2.4: Let η_n = arg max_{β: β'Σ̂_Xβ = I_d} C^2_n(β'X|Y), and let η be a basis matrix of the CS with η'Σ_Xη = I_d. Under the regularity conditions in the supplementary file, there exists a rotation matrix Q with Q'Q = I_d such that

√n (vec(η_n) − vec(ηQ)) →_d N(0, V),

where V is the covariance matrix given in the supplementary file.
Proofs of Propositions 2.3 and 2.4 are in Appendices B and C, respectively.

Estimating structural dimension d
There is a rich literature on determining d in sufficient dimension reduction: for example, nonparametric methods such as Wang and Xia (2008), Ye and Weiss (2003) and Luo and Li (2016), and eigen-decomposition-based methods such as Luo et al. (2009) and Wang et al. (2015). Here we apply the kNN method proposed in Wang et al. (2015).
The structural dimension d can be estimated by the following kNN procedure.
(1) Find the k nearest neighbours of each data point (X_i, Y_i).
(2) For each data point (X_i, Y_i), apply the method proposed in this article to its k nearest neighbours and estimate β̂_i. Here the dimension of β̂_i is set to 1.
(3) Calculate the eigenvalues of the matrix Σ_{i=1}^n β̂_i β̂_i' and order them as λ_1 ≥ λ_2 ≥ ... ≥ λ_p.
(4) Calculate the ratios r_i = λ_i/λ_{i+1}, i = 1, ..., p − 1. The dimension d is estimated as the index i at which the largest r_i occurs.
The maximal eigenvalue ratio criterion in the last step was suggested by Luo et al. (2009) and was also used by Li and Yin (2009) and Sheng and Yuan (2020).
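The procedure above can be sketched as follows. The per-neighbourhood fit is abstracted as a plug-in function: a simple local least-squares slope stands in for the one-direction ECD fit of step (2), and a tiny ridge guards the eigenvalue ratios against division by zero. Names and details are ours.

```python
import numpy as np

def ols_dir(Xk, yk):
    """Hypothetical stand-in for the per-neighbourhood 1-D fit:
    a unit-norm local least-squares slope."""
    b, *_ = np.linalg.lstsq(Xk - Xk.mean(0), yk - yk.mean(), rcond=None)
    nrm = np.linalg.norm(b)
    return b / nrm if nrm > 0 else b

def estimate_d(X, y, k, local_dir=ols_dir):
    """kNN estimate of the structural dimension d via the maximal
    eigenvalue ratio criterion applied to sum_i beta_i beta_i'."""
    n, p = X.shape
    M = np.zeros((p, p))
    for i in range(n):
        # k nearest neighbours of X_i (including X_i itself)
        idx = np.argsort(np.linalg.norm(X - X[i], axis=1))[:k]
        b = local_dir(X[idx], y[idx])
        M += np.outer(b, b)
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]   # lambda_1 >= ... >= lambda_p
    ratios = lam[:-1] / (lam[1:] + 1e-12)        # r_i = lambda_i / lambda_{i+1}
    return int(np.argmax(ratios)) + 1
```

When the local directions concentrate on a d-dimensional subspace, the first d eigenvalues of the accumulated matrix dominate and the ratio r_d spikes.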

Simulation studies
Estimation accuracy is measured by the distance Δ_m(Ŝ, S) = ‖P_Ŝ − P_S‖ (Li et al., 2005), where S is the true d-dimensional CS in R^p, Ŝ is its estimate, P_S and P_Ŝ are the orthogonal projections onto S and Ŝ, respectively, and ‖·‖ is the maximum singular value of a matrix. The smaller Δ_m is, the better the estimate. A method also works better if it has a smaller standard error of Δ_m. In the following, the first three examples show the good performance of the proposed method for both continuous and categorical responses, assuming the dimension d is known. The last example illustrates the performance of estimating the dimension d using the kNN procedure in Section 2.6. Example 3.1: Consider Model 1, where X ∼ N(0, I_p), ε ∼ N(0, 1) and ε is independent of X, β_1 = (1, 0, ..., 0)' and β_2 = (0, 1, ..., 0)'. We compare DCOV (Sheng & Yin, 2016), SIR (Li, 1991), SAVE (Cook & Weisberg, 1991) and LAD (Cook & Forzani, 2009) with our method, ECD, using 10 slices. Table 1 shows the average estimation accuracy (Δ̄_m) and its standard error (SE) under different (n, p) combinations over 500 replications. Note that ECD performs consistently better than the other methods under all the different (n, p) combinations.
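The accuracy measure above (the maximum singular value of the difference of the two orthogonal projections) can be computed directly from basis matrices of the subspaces; a minimal sketch, with the function name ours:

```python
import numpy as np

def proj_distance(B_hat, B):
    """Delta_m(S_hat, S) = ||P_hat - P||: maximum singular value of the
    difference of the orthogonal projections onto the column spaces."""
    def P(A):
        A = np.asarray(A, dtype=float)
        return A @ np.linalg.pinv(A.T @ A) @ A.T  # orthogonal projection onto S(A)
    return np.linalg.norm(P(B_hat) - P(B), ord=2)
```

The distance is 0 when the two subspaces coincide (regardless of the bases chosen) and reaches 1 when some direction of one subspace is orthogonal to the other.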
From the simulation results, we find that the ECD method outperforms the other methods when the response is continuous. When the response is categorical, it also performs better than SIR, SAVE and LAD, and its performance is comparable to DCOV. More specifically, the accuracy of ECD and DCOV becomes very close as the sample size n gets large when the response Y is categorical. On the other hand, ECD is computationally faster than DCOV because of the slicing technique used in calculating C^2_n(β'X|Y). For example, when (n, p) = (200, 6), ECD is about 2.7 times faster than DCOV under Models 1 and 2, and about 3.6 times faster under Model 3. Overall, ECD is superior to the other methods.
Example 3.4: Estimating d. We test the performance of the kNN procedure in Section 2.6 on Models 1 and 2. Table 4 shows that the kNN procedure estimates the dimension d very precisely, whether the response is continuous or categorical.

Real data analysis
To further investigate the performance of our method, we apply it to the Pen Digit database from the UCI machine-learning repository. The data contain 10,992 samples of hand-written digits {0, 1, ..., 9}. The digits were collected from 44 writers, each of whom was asked to write 250 random digits. Every digit is represented as a 16-dimensional feature vector. The 44 writers are divided into two groups, in which 30 are used for training while the others are used for testing. The data set and more details are available at archive.ics.uci.edu/ml/machine-learning-databases/pendigits/.
We choose the 0's, 6's and 9's, three digits that are hard to classify, as an illustration. In this subset of the database, there are 2,219 cases in the training data and 1,035 cases in the test data. We apply the dimension reduction methods to the 16-dimensional predictor vector of the training set, which serves as a preparatory step for the three-group classification problem. Because the response has three slices, SIR estimates only two directions in the dimension reduction subspace. The other methods, SAVE, DCOV and ECD, all estimate three directions. Figure 1 presents the two-dimensional plot of (SIR1, SIR2), and Figure 2 shows the three-dimensional plots of (SAVE1, SAVE2, SAVE3), (DCOV1, DCOV2, DCOV3) and (ECD1, ECD2, ECD3). SIR provides only location separation of the three groups. SAVE suggests covariance differences among the three groups, but provides no clear location separation. Both DCOV and ECD capture both the location separation and the covariance differences, but ECD presents a clearer separation among the three groups. The three-dimensional plot of (ECD1, ECD2, ECD3) gives a comprehensive demonstration of the different features of the three groups.

Discussion
In this article, we proposed a new sufficient dimension reduction method, studied its asymptotic properties, and introduced a kNN procedure to estimate the structural dimension d. The numerical studies show that our method estimates the CS accurately and efficiently. In the future, we plan to develop a variable selection method by combining our method with penalization methods such as the LASSO (Tibshirani, 1996). Furthermore, the method can be extended to large-p-small-n problems using the framework of Yin and Hilafu (2015).

Disclosure statement
No potential conflict of interest was reported by the author(s).

Appendix A. Proofs of Propositions 2.1 and 2.2

In order to prove Propositions 2.1 and 2.2 in Section 2.3 of the article, we first provide and prove the following Lemma A.1.

Proof of Proposition 2.1:
Assume the singular value decomposition of A is UΛV', where U is a d × d orthogonal matrix, V is a d_1 × d_1 orthogonal matrix and Λ is a d × d_1 diagonal matrix with nonnegative diagonal entries; it is easy to prove that all the nonzero diagonal entries of Λ equal 1. Let U'η'X = (X̃_1, ..., X̃_d)'. Since all the nonzero diagonal entries of Λ equal 1, Λ'U'η'X = (X̃_1, ..., X̃_{d_1})', and by Lemma A.1 we get C^2(Λ'U'η'X|Y) ≤ C^2(U'η'X|Y), with equality if and only if d = d_1. Based on Theorem 3, part (2) of Yin and Yuan (2020), C^2(β'X|Y) = C^2(Λ'U'η'X|Y) and C^2(U'η'X|Y) = C^2(η'X|Y). Thus C^2(β'X|Y) ≤ C^2(η'X|Y), and equality holds if and only if S(β) = S(η).

Appendix B. Proof of Proposition 2.3
In order to prove Proposition 2.3 in Section 2.5 of this article, we provide and prove the following Lemma B.1 first.

Lemma B.1:
Suppose the support of X, say S, is compact, and η_n → η in probability. Then C^2_n(η_n'X|Y) − C^2_n(η'X|Y) → 0 in probability.

Proof: Based on Corollary 1 of Yin and Yuan (2020), C^2_n(η_n'X|Y) and C^2_n(η'X|Y) are weighted averages of pairwise distances of the forms |η_n'X_{y,k_y} − η_n'X_{y,l_y}| and |η'X_{y,k_y} − η'X_{y,l_y}|, respectively. Because η_n → η in probability, write η_n = η + ε_n; then for any ϵ > 0, ||ε_n|| < ϵ when n → ∞, where ||·|| is the Frobenius norm. Hence, by the compactness condition on X, for a positive constant c_x and large n, |C^2_n(η_n'X|Y) − C^2_n(η'X|Y)| ≤ c_x ϵ. Hence the conclusion follows.

Proof of Proposition 2.3:
To simplify the proof, we restrict the support of X to a compact set S; it can be shown that S_{Y|X} = S_{Y|X_S} (Yin et al., 2008, Proposition 10), where X_S is X restricted to S. Without loss of generality, we assume Q = I_d. Suppose η_n is not a consistent estimator of a basis of S_{Y|X}. Then there exist a subsequence, still indexed by n, and an η* satisfying η*'Σ_Xη* = I_d such that η_n → η* in probability but Span(η*) ≠ Span(η).
By Lemma B.1, C^2_n(η_n'X|Y) − C^2_n(η*'X|Y) → 0 in probability, and C^2_n(η*'X|Y) → C^2(η*'X|Y) in probability. Therefore, C^2_n(η_n'X|Y) → C^2(η*'X|Y) in probability. On the other hand, because η_n = arg max_{β: β'Σ̂_Xβ = I_d} C^2_n(β'X|Y), we have C^2_n(η_n'X|Y) ≥ C^2_n(η'X|Y). Taking limits on both sides of this inequality gives C^2(η*'X|Y) ≥ C^2(η'X|Y). However, we have proved that under the assumption P_{η(Σ_X)}X ⊥⊥ Q_{η(Σ_X)}X, η = arg max_{β: β'Σ_Xβ = I_d} C^2(β'X|Y), and we also assume that the central subspace is unique. Since Span(η*) ≠ Span(η), this contradicts C^2(η*'X|Y) ≥ C^2(η'X|Y), so η_n is a consistent estimator of a basis of the central subspace.

Appendix C. Proof of Proposition 2.4
The Lagrange multiplier technique is used to prove the √n-consistency of vec(η_n) in Proposition 2.4 in Section 2.5 of the article. First, we introduce the following notation and conditions, and we also give a new definition.
For a random sample (X, Y), define the Lagrangian function associated with the constrained problem (6),

L_n(ζ) = C^2_n(β'X|Y) − λ'vec(β'Σ̂_Xβ − I_d),

where ζ = (vec(β)', λ')'. Then there exists a λ_n such that (vec(η_n)', λ_n')' is a stationary point of L_n(ζ). Let θ_n = (vec(η_n)', λ_n')'; then ∇L_n(θ_n) = 0. Let η be a basis of the CS. Under the assumption P_{η(Σ_X)}X ⊥⊥ Q_{η(Σ_X)}X, there exists a rotation matrix Q with Q'Q = I_d such that ηQ = arg max_{β: β'Σ_Xβ = I_d} C^2(β'X|Y). Without loss of generality, we assume Q = I_d here. Therefore, there exists a λ_0 such that (vec(η)', λ_0')' is a stationary point of the population Lagrangian L(ζ). Let θ = (vec(η)', λ_0')'.
In the proof, we need to take derivatives of C^2(η'X|Y) and C^2_n(η'X|Y) with respect to vec(η); for simplicity of notation, when considering these derivatives we write C(η) and C_n(η) for C^2(η'X|Y) and C^2_n(η'X|Y), respectively.
Here is additional notation that will be used in the following proof. I_(d,d) is the vec-permutation matrix. I_m is the m × m identity matrix, and I_m(:, i) denotes its ith column. A ⊗ B denotes the Kronecker product of matrices A and B, and vec(·) is the vec operator.
Furthermore, we give the following definition and assumptions.
Definition C.1: Let N(η) = {α : ||α − η|| ≤ c}, where α is a p × d matrix with α'Σ_Xα = I_d, c is a fixed small constant, and ||·|| is the Frobenius norm. We define an indicator function ρ(X, X'), where X' is an i.i.d. copy of X and ϵ_0 is a small number. We define the second and third derivatives of C(η) with respect to vec(η) as C''(η)ρ(X, X') and C'''(η)ρ(X, X'). For simplicity of notation, we still use C''(η) and C'''(η) to denote C''(η)ρ(X, X') and C'''(η)ρ(X, X'), respectively.
The reason we use this definition is that, under Definition C.1, the second and third derivatives of C(η) and C_n(η) are bounded in a neighbourhood of the central subspace.

Assumption C.2:
Assumption C.1 is needed for Proposition 2.4 in the main article and Lemma C.1 in the next section; it is similar to the conditions assumed in Theorem 6.1.6 of Lehmann (1999, Ch. 6). This assumption is required for the asymptotic properties of U-statistics.
Assumption C.2 is in the spirit of the von Mises proposition (Serfling, 1980, Section 6.1), which states that if the first nonvanishing term of the Taylor expansion is the linear term, then the √n-consistency of the differentiable statistical function can be achieved. In our case, we assume the corresponding matrix is nonsingular, which guarantees the √n-consistency. If the matrix is singular, then consistency of order n or higher for some parts of our estimates can be proved.
In order to prove Proposition 2.4 in Section 2.5 of the paper, we first provide and prove the following Lemma C.1, which establishes a limiting N(0, V) distribution; the explicit expression for V is given in the proof.