Supervised Dimensionality Reduction via Distance Correlation Maximization

In our work, we propose a novel formulation for supervised dimensionality reduction based on a nonlinear dependency criterion called Statistical Distance Correlation, Szekely et. al. (2007). We propose an objective which is free of distributional assumptions on regression variables and regression model assumptions. Our proposed formulation is based on learning a low-dimensional feature representation $\mathbf{z}$, which maximizes the squared sum of Distance Correlations between low dimensional features $\mathbf{z}$ and response $y$, and also between features $\mathbf{z}$ and covariates $\mathbf{x}$. We propose a novel algorithm to optimize our proposed objective using the Generalized Minimization Maximizaiton method of \Parizi et. al. (2015). We show superior empirical results on multiple datasets proving the effectiveness of our proposed approach over several relevant state-of-the-art supervised dimensionality reduction methods.


Introduction
Rapid developments of imaging technology, microarray data analysis, computer vision, neuroimaging, hyperspectral data analysis and many other applications call for the anal-ysis of high-dimensional data.The problem of supervised dimensionality reduction is concerned with finding a low-dimensional representation of data such that this representation can be effectively used in a supervised learning task.Such representations help in providing a meaningful interpretation and visualization of the data, and also help to prevent overfitting when the number of dimensions greatly exceeds the number of samples, thus working as a form of regularization.In this paper we focus on supervised dimensionality reduction in the regression setting, where we consider the problem of predicting a univariate response y i ∈ R from a vector of continuous covariates x i ∈ R p , for i = 1 to n.
Sliced Inverse Regression (SIR) of Li [1991], Lue [2009], Szretter and Yohai [2009] is one of the earliest developed supervised dimensionality reduction techniques and is a seminal work that introduced the concept of a central subspace that we now describe.This technique aims to find a subspace given by the column space of a p×d matrix B with d << p such that y |= X|B T X where |= indicates statistical independence.Under mild conditions the intersection of all such dimension reducing subspaces is itself a dimension reducing subspace, and is called the central subspace [Cook, 1996].SIR aims to estimate this central subspace.Sliced Average Variance Estimation (SAVE) of Shao et al. [2009] and Shao et al. [2007] is another early method that can be used to estimate the central subspace.SIR uses a sample version of the first conditional moment EX | Y to construct an estimator of this subspace and SAVE uses the sample first and second conditional moments to estimate it.Likelihood Acquired Directions (LAD) of Cook and Forzani [2009] is a technique that obtains the maximum likelihood estimator of the central subspace under assumptions of conditional normality of the predictors given the response.Like LAD, methods SIR and SAVE rely on elliptical distributional assumptions like Gaussianity of the data.
More recent methods that do not require any distributional assumptions on the marginal distribution of x or on the conditional distribution of y.The authors of Gradient Based Kernel Dimension Reduction (gKDR), Fukumizu and Leng [2014], use an equivalent formulation of the conditional independence relation y |= X|B T X using conditional cross-covariance operators and aim to find a B that maximizes the mutual information I(B T X, y).In this work, the authors estimate the conditional cross-covariance operators by using Gaussian kernels.gKDR instead uses kernels only to provide equivalent characterizations of conditional independence using sample estimators of cross-covariance operators.
Sufficient Component Analysis (SCA) of Yamada et al. [2011] is another technique where the B is also learnt using a dependence criterion.SCA aims to maximize the leastsquares mutual information given by SM I(Z, Y ) = 1 2 ( pzy(z,y) pz(z)py(y) − 1) 2 dzdy between the projected features Z = B T X and the response.This is done under orthonormal constraints over B, and the optimal solution is found by approximating pzy(z,y) pz(z)py(y) using method of density ratio estimation [Sugiyama et al., 2012, Vapnik et al., 2015], and also an analytical closed form solution for the minima is obtained.In Suzuki and Sugiyama [2013] (LSDR), the authors optimize this objective using a natural gradient based iterative solution on the Steifel manifold S m d (R) via a line search along the geodesic in the direction of the natural gradient [Amari, 1998, Nishimori andAkaho, 2005].Our contribution in this paper is as follows: We propose a new formulation for supervised dimensionality reduction that is based on a dependency criterion called Distance Correlation, [Szekely et al., 2007].This setup is free of distributional, as well as regression model assumptions.The novelty in our formulation is that we do not restrict the transformation from x to z to be linear, as in case many of the above techniques.To further add, other works of Li et al. [2012], Kong et al. [2015], Berrendero et al. [2014] have used Distance Correlation as a criterion for feature selection in a regression setting.In our work, we show benefits Distance Correlation as a criterion for supervised low-dimensional feature learning.
In our work we use the following notation: The spectral radius of a matrix M is denoted by λ max (M), i th eigenvalue by λ i (M), and i th generalized eigenvalue Ax = λ i Bx by λ i (A, B).Moreover, λ max (M ) (λ max (A, B)), and λ max (M ) (λ min (A, B)) respectively, the maximum and minimum eigenvalues (generalized eigenvalues) of matrices M, A and B. We use the usual partial ordering for symmetric matrices: A B means A − B is positive semidefinite; similarly for the relationships , ≺, .The norm • will be either the Euclidean norm for vectors or the norm that it induces for matrices, unless otherwise specified.

Distance Correlation
Distance Correlation introduced by Szekely et al. [2007] and Székely et al. [2009], Székely and Rizzo [2012,2013] is a measure nonlinear dependencies between random vectors of arbitrary dimensions.We describe below α-distance covariance which is an extended version of standard distance covariance for α = 1.Definition 2.1.Distance Covariance [Székely et al., 2007], α-dCov: Distance covariance between random variables x ∈ R d and y ∈ R m with finite first moments is a nonnegative number given by where f x , f y are characteristic functions of x, y, f x,y is the joint characteristic function, and w(t, s) is a weight function defined as w(t, s) α2 α Γ((α+d)/2) .The distance covariance is zero if and only if random variables x and y are independent.From above definition of distance covariance, we have the following expression for Distance Correlation.
Definition 2.2.Distance Correlation [Székely et al., 2007] (α-dCorr): The squared Distance Correlation between random variables x ∈ R d and y ∈ R m with finite first moments is a nonnegative number defined as The Distance Correlation defined above has the following interesting properties; 1) ρ 2 (x, x) is defined for arbitrary dimensions of x and y, 2) ρ 2 (x, y) = 0 if and only if x and y are independent, and 3) ρ 2 (x, y) satisfies the relation 0 ≤ ρ 2 (x, y) ≤ 1.In our work, we use α-Distance Covariance with α = 2 and in the following paper for simplicity just refer to it as Distance Correlation.
We define sample version of distance covariance given samples {(x k , y k )|k = 1, 2, . . ., n} sampled i.i.d.from joint distribution of random vectors x ∈ R d and y ∈ R m .To do so, we define two squared Euclidean distance matrices E X and E Y , where each entry with k, l ∈ {1, 2, . . ., n}.These squared distance matrices are when double-centered, by making their row and column sums zero, and are denoted as E X , Q X , respectively.So given a double-centering matrix Hence sample distance correlation (for α = 2) is defined as follows.
Definition 2.3.Sample Distance Correlation [Székely et al., 2007]: Given i.i.d samples X × Y = {(x k , y k )|k = 1, 2, 3, . . ., n} and corresponding double centered Euclidean distance matrices E X and E Y , then the squared sample distance correlation is defined as, and equivalently sample distance correlation is given by .

Laplacian Formulation of Sample Distance Correlation
In this section, we propose a Laplacian formulation of sample distance covariance, and sample distance correlation, which we later use to propose our objective used for supervised dimensionality reduction (SDR).
A graph Laplacian version of sample distance correlation can be obtained as follows, Lemma 3.1.Given matrices of squared Euclidean distances E X and E Y , and Laplacians L X and L Y formed over adjacency matrics E X and E Y , the square of sample distance correlation ρ2 (X, Y) is given by Proof.Given matrices E X , E Y , and column centered matrices X, Y, from result of Torgerson [1952] we have that E In the problem of multidimensional scaling (MDS) [Borg and Groenen, 2005], we know for a given adjacency matrix say W and a Laplacian matrix L, Now for the Laplacian L = L X and adjacency matrix W = E Y we can represent Tr X T L Y X in terms of E Y as follows, From the fact [E X ] i,j = ( x i , x i + x j , x j − 2 x i , x j ), and also E X = −2 X X T we get It also follows that Similarly, we can express the sample distance covariance using Laplacians L X and L Y as The sample distance variances can be expressed as ν2 (X, substituting back into expression of sample distance correlation above we get Equation 1.

Problem Statement
The goal in supervised dimensionality reduction (SDR) is to learn a low dimensional representation Z ∈ R n×p of input features X ∈ R n×d , so as to predict the respone vector y ∈ R from Z.The intuition being that Z captures all information relevant to predict y.Also, during testing, for out-of-sample prediction, for a new data point x * , we estimate z * assuming that it is predictable from x * .In our proposed formulation, we use aforementioned Laplacian based sample distance correlation to measure dependencies between variables.We propose maximize dependencies between the low dimensional features Z and response vector y, and also low dimensional features Z with input features X.Our objective is to maximize the sum of squares of these two sample distance correlations which is given by, On simplification we get the following optimization problem which we refer to as Problem (P). where are constants, and

Algorithm
In the proposed problem (Problem (P)), we observe that numerator of our objective is convex while denominator is non-convex due the presence of a square root and a nonlinear Laplacian term L Z on Z. Hence, this makes direct optimization of this objective practically infeasible.So to optimize Problem (P), we present a surrogate objective Problem (Q) which lower bounds our proposed original objective.We maximize this lower bound with respect to Z and show that optimizing this surrogate objective Problem (Q) (lower bound), also maximizes the proposed objective in Problem (P).We do so by utlizing the Generalized Minorization-Maximization (G-MM) framework of Parizi et al. [2015].
The G-MM framework of Parizi et al. [2015] is an extension of the well known MM framework of Lange et al. [2000].It removes the equality constraint between both objectives at every iteration Z k , except at initialization step Z 0 .This allows the use a broader class of surrogates that avoid maximization iterations being trapped at sharp local maxima, and also makes the problem less sensitive to problem initializations.
The surrogate lower bound objective is as follows, where M ∈ R n×d belongs to the set of column-centered matrices.
The surrogate problem (Problem (Q)) is convex in both its numerator and denominator for a fixed auxiliary variable M. Theorem 4.1 provides the required justification that under certain conditions, maximizing the surrogate Problem (Q) also maximizes the proposed objective Problem (P) .
An outline of the strategy for optimization is as follows: Proof.For convergence it is enough for us to show the following, [Parizi et al., 2015]: T , we observe that Z 0 column-centered, Statement 2 follows from the optimization Z k+1 = arg max g(Z, Z k ).To prove statement 3 we have to show that .
Since numerators on both sides are equal, it is enough for us to show that It follows from the rescaling step (step c) of the optimization strategy that the left hand side Tr Z t+1 L Z t+1 Z t+1 is always greater that one, and so taking square root of it implies We summarize all of the above steps in Algorithm 4.1 below and section 5 further describes optimization algorithm to solve Problem (Q) required by it. Solve,

Optimization
In this section, we propose a framework for optimizing the surrogate objective g(Z, M), referred to as Problem (Q), for a fixed M = Z k .We observe that for a given value of M, g(Z, M) is a ratio of two convex functions.To solve this, we convert this maximization problem to an equivalent minimization problem h(Z, M), by taking its reciprocal [Schaible, 1976].This allows us to utilize the Quadratic Fractional Programming Problem (QFPP) framework of Dinkelbach [1967] and Zhang [2008] to minimize h(Z, M).We refer to this new minimization problem as Problem (R).It is stated below.
where M = Z k .
In his seminal work Dinkelbach [1967] and later Zhang [2008] proposed a novel framework to solve constrained QFP problems by converting it to an equivalent parametric optimization problem, by introducing a scalar parameter α ∈ R. We utilize this equivalence proposed to defined new parametric problem, Problem (S).The solution involves a search over the scalar parameter α while repeatedly solving Problem (S) to get the required solution Z k+1 .This search process continues until values of α converge.
In a nutshell, Dinkelbach [1967] and Zhang [2008] frameworks suggest the following optimizations are equivalent: To see the equivalence of h(Z, M) in Problem (R) to h(z) above we observe that: In subsection 5.1 we propose a Golden Section Search [Kiefer, 1953] based algorithm (Algorithm 5.1) which utilizes concavity property of H(Z; α) with respect to α to locate the best α * .During this search we repeatedly solve Problem (S) starting with an intial interval 0 = α l ≤ α ≤ α u = λ min (L M , S X,y ) for a fixed M, then at each step shorten the search interval by moving upper and lower limits closer to each other.We continue until convergence to α * .The choice of the upper limit of α u = λ min (L M , S X,y ) is motivated by proof of Lemma A.2.
To solve Problem (S) for a given α, we propose an iterative algorithm in subsection 5.2 (Algorithm 5.2).It uses the classical Majorization-Minimization framework of Lange [2013].

Distance Correlation Maximization Algorithm
Algorithm 5.2 gives a iterative fixed point algorithm which solves Problem (S).Theorem 5.2 provides a fixed point iterate used to minimize H(Z, α) with respect to Z for a given α.The fixed point iterate4 Z t+1 = HZ t minimizes Problem (S) and a monotonic convergence is assured by the Majorization-Minimization result of Lange [2013].Theorem 5.2 below derives the fixed point iterate used in Algorithm 5.2.Theorem 5.2.For a fixed γ 2 (Lemma A.1), some α (Lemma A.2) and the iterate Z t = HZ t−1 monotonically minimizes the objective, Proof.From Lemma A.1 we know that, (γ 2 D X − L M ) 0. Hence the following would hold true for any real matrix N, Rearranging the terms we get the following inequality over Tr Z T L M Z , It also follows that the surrogate function l(Z, N) − αTr Z T S X,y Z majorizes our desired objective function H(Z; α).To optimize this surrogate loss we equate its gradient to zero and rearrange the terms to obtain which gives us the update equation Z t+1 = HZ t where H is given by, Hence it follows from framework of Lange [2013] that above update equation monotonically minimizes H(Z; α).
Algorithm 5.2 summarizes the steps of an iterative Majorization-Minimization approach to solve Problem (S).

Experiments
In this section we present experimental results that compare our proposed method with several state-of-the-art supervised dimensionality reduction techniques on a regression task.(i) We run our proposed algorithm on the training set X Train to learn low-dimensional features Z Train .

Methodology
Algorithm 5.2 Distance Correlation Maximization for a given α We learn the map ψ : z → y using Support Vector Regression on Z Train and Y Train .
(iii) We learn mappings φ i : x → z i , i = 1 to d for each dimension of z using Support Vector Regression on X Train and Z Train .
During testing/out-of-sample phase, given a test input x * , we use maps φ i : x → z i for i = 1 to d and generate z * .We then utilize maps ψ : z → y on z * to get the predicted response y * .Figure 1 illustrates the testing phase of our methodology.

Datasets
In our results we report the Root Mean Squared (RMS) errors on five datasets from the UCI-Machine Learning Repository [Lichman, 2013] in Tables 1 to 5. We use the following datasets in our experiments.
(a) Boston Housing [Harrison and Rubinfeld, 1978]: This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass.This dataset has been used extensively throughout the vast regression literature to benchmark algorithms.The response variable to be predicted is the median value of owner-occupied homes.
(b) Relative Location of Computed Tomography (CT) Slices [Graf et al., 2011]: This dataset consists of 385 features extracted from computed tomography (CT) images.Each CT slice is described by two histograms in polar space that are concatenated to form the final feature vector.The response variable to be predicted is the relative location of an image on the axial axis.The ground truth of responses in this dataset was constructed by manually annotating up to 10 distinct landmarks in each CT Volume with a known location.This response takes values in the range [0, 180] where 0 denotes the top of the head and 180 denotes the the soles of the feet.
(c) BlogFeedback [Buza, 2014]: This dataset originates from a set of raw HTML documents of blog posts that were crawled and processed.The task associated with this data is to predict the number of comments in the upcoming 24 hours.In order to simulate this situation, the dataset was curated by choosing a base time (in the past) and selecting the blog posts that were published at most 72 hours before the selected base date/time.Then a set of 281 features of the selected blog posts were computed from the information that was available at the basetime.The target is to predict the number of comments that the blog post received in the next 24 hours, relative to the basetime.In the training data, the base times were in the years 2010 and 2011.In the test data the base times were in February and March 2012.
(d) Geographical Origin of Music [Zhou et al., 2014]: Instances in this dataset contain audio features extracted from 1059 wave files covering 33 countries/areas.The task associated with the data is to predict the geographical origin of music.The program MARSYAS was used to extract 68 audio features from the wave files.These were appended with 48 chromatic attributes that describe the notes of the scale bringing the total number of features to 116.
(e) UJI Indoor Localization [Torres-Sospedra et al., 2014]: The UJIIndoorLoc is a Multi-Building Multi-Floor indoor localization database that relies on WLAN/WiFi fingerprinting technology.Automatic user localization consists of estimating the position of the user (latitude, longitude and altitude) by using an electronic device, usually a mobile phone.The task is to predict the actual longitude and latitude.
The database consists of 19937 training/reference records and 1111 validation/test records.The 529 features contain the WiFi fingerprint, the coordinates where it was taken, and other useful information.Given that this paper focusses on the setting of univariate responses, we only aim to predict the 'Longitude'.

Results
We perform five-fold cross validation on each of these datasets and report the average Root Mean Square (RMS) error on the hold-out test sets.Tables 1 to 5 present the cross-validated RMS error of our proposed method (DisCoMax), and six other supervised dimensionality reduction techniques namely; LSDR [Suzuki and Sugiyama, 2013], gKDR [Fukumizu and Leng, 2014], SCA [Yamada et al., 2011], LAD [Cook and Forzani, 2009], SAVE [Shao et al., 2009] and [Shao et al., 2007] and SIR [Li, 1991].
We fix folds across the seven techniques presented within each of the tables (Tables 1  to 5).We also compute RMS errors for increasing dimensions d = 3, 5, 7, 9 and 11.We note the significant improvement in the predictive performance (smaller error) of DisCoMax learnt features across for all cases with different dimensionality, and also gradual increase performance (smaller error) as we increase dimensionality learnt features.
For baseline comparison purposes, in case of the Boston Housing dataset, we observe a RMS error of 0.1719 using Support Vector Regression without any dimensionality reduction (d = 13).This when compared to DisCoMax RMS errors which ranged between 0.1559 (d = 3) and 0.1297 (d = 11) always did worse.We bold errors for DisCoMax for cases where errors were significantly better when compared with their corresponding standard deviations taken into account.
Figure 3a and 3b repectively show the overall growth of distance correlations (ρ(X, Z), ρ(Z, y)) and f (Z), with respect to the fixed point iterations (t), for α * = 800 × 10 4 .We periodically observe a sharp increases in f (Z) and distance correlations after each DisCoMax subproblem of 220 fixed point iterations.The figures show four such G-MM iterations of Algorithm 4.1.These sharp increases are due to the resubstitution of M = Z k in Step 2 of Algorithm 4.1.This clearly shows us that we are able to maximized are original proposed objective in Problem (P).

Conclusion
In our work, we proposed a novel method to perform supervised dimensionality reduction.Our method aims to maximize an objective based on a statistical measure of dependence called statistical distance correlation.Our proposed method does not necessarily constrain the dimension reduction projection to be linear.We also propose a novel algorithm to optimize our proposed objective using the Generalized Minorization-Maximization approach of Parizi et al. [2015].Finally, we show a superior empirical performance of our method on several regression problems in comparison to existing state-of-the-art meth-   For future work, we aim to extend our framework to handle multivariate responses y ∈ R q , as distance correlation is applicable to variables with arbitrary dimensions.Our proposed approach is practically applicable on relatively small datasets, as it involves repeatedly solving multiple optimization subproblems.So we aim to to simplyfy this approach so that it is tractable for larger size (several thousands of examples) datasets.In our work, we currently tackle the out-of-sample issue by learning mutiple SVR's, one for each dimension of z, we plan to extend our framework so as to learn explicit out-of-sample mappings from x to z.
If we compare the above equation with a the general update equation from Zhang et al. [2000], which is of the form where ∇f (Z t ) is the gradient of the objective function f (Z) we get, Now from Theorem A.1 we conclude that B(Z) 0, We also check the following condition from Zhang et al. [2000] that or equivalently, as in our case 0 2Q 4(Q + P), which is indeed true.Hence it follows that λ max (T (Z)) ≤ 1 which implies λ max (H) ≤ 1.
We now proceed to show that at end of every (t + 1) fixed point iterations we have Tr Z T t+1 L Z t+1 Z t+1 ≤ Tr (Z t+1 L Z 0 Z t+1 ).
Lemma A.4.For fixed point iteration Z t+1 = HZ t for optimization of Z k+1 = arg max Z g(Z, Z k ), we have, Proof.Laplacian for a weighted adjacency matrix W (with self loops) is defined as L = D − W where D is a diagonal degree matrix with diagonal elements [D] i,i = j [W] i,j and zero off-diagonal entries [Chung, 1997].For adjacency matrix E Z we have Torgerson, 1952].We have Laplacian as L Z = D Z − E Z with D Z = 0.This gives us for Z t+1 the Laplacian L Z t+1 = 2Z t+1 Z T t+1 .It also follows from the fact that since we choose our intialization Z 0 as column-centered matrix, and Z t+1 = HZ t are also successively column-centered for all t > 0. Hence, L Z t+1 = 2 Z t+1 Z T t+1 .Now substituting Z t+1 = HZ t in Laplacian equation L Z t+1 we get, Substituting above equation into right hand side of the statement to be proved gives us, Substituting eigen decomposition of H = QΛQ T where Λ is a diagonal eigenvalues matrix with values less than one (Theorem A.3) we get, For Λ = I (identity matrix) gives us, a) Initialize: Initialize Z 0 = cJ d , 0 T (n−d)×d T , a column-centered matrix where c = and J d ∈ R d×d is a centering matrix.This is motivated by statement 1) in proof of Theorem 4.1.b) Optimize: Maximize the surrogate lower bound Z k+1 = arg max g(Z, Z k ) (See section 5).c) Rescaling: Rescale Z k+1 ← κZ k+1 such that Tr Z k+1 L Z k+1 Z k+1 is greater than one.This is motivated by proof of statement 3) of Theorem 4.1, and also the fact that g(Z, M) = g(κZ, M) and f (Z) = f (κZ) for any scalar κ. d) Repeat step b and c above until convergence.Theorem 4.1.Under above strategy, maximizing the surrogate Problem Q also maximizes Problem P.
Theorem 5.1.Let G : R → R be defined asG(α) = min Z H(Z; α) = min Z Tr Z T L M Z − αTr Z T S X,y Zas derived from Problem (S), then following statements hold true.1.G is continuous at any α ∈ R. 2. G is concave over α ∈ R.3.G(α) = 0, has a unique solution α * .Algorithm 5.1 exploits the concavity property of G(α) to perform a Golden Section Search over α.Subsection 5.2 provides an iterative Majorization-Minimization algorithm (Algorithm 5.2) to solve this minimization problem Problem (S).Algorithm 5.1 Golden Section Search for

Figure 2 :
Figure 2: Effect of α values on growth of the proposed objective in Algorithm 5.2 the figures show slower (faster) growth of distance correlations for smaller (larger) α.

Figure 3 :
Figure 3: Overall gradual increase in f (Z) (Figure 3a) and distance correlations (Figure for α * = 800 × 10 4 .Plots show increase in both for each DisCoMax subproblem of (Algorithm 5.2) and four outer G-MM iterations of Algorithm 4.1

Table 1 :
[Harrison and Rubinfeld, 1978]binfeld, 1978]: U.S Census Service concerning housing in the area of Boston Mass.To predict median value of owner-occupied homes.Baseline results SVR RMS error 0.1719.
: This data contains features computed from raw HTML documents of blog posts.The task associated with this data is to predict the number of comments in the upcoming 24 hours.

Table 4 :
[Zhou et al., 2014]f CT slices[Zhou et al., 2014]: Dataset consists of 385 features extracted from CT images.Features are concatenation of two histograms in polar space.The response variable is the relative location of an image on the axial axis.

Table 5 :
UJI Indoor Localization [Torres-Sospedra et al., 2014]: Multi-Building Multi-Floor indoor localization database.The task is to predict the actual longitude and latitude.The 529 attributes contain the WiFi fingerprint, the coordinates where it was taken.The database consists of around 20ktraining/reference records and 11k validation/test records.