Feature Selection in High-Dimensional Models via EBIC with Energy Distance Correlation

In this paper, the LASSO method with extended Bayesian information criteria (EBIC) for feature selection in high-dimensional models is studied. We propose the use of the energy distance correlation in place of the ordinary correlation coefficient to measure the dependence of two variables. The energy distance correlation detects linear and non-linear association between two variables, unlike the ordinary correlation coefficient, which detects only linear association. EBIC is adopted as the stopping criterion. It is shown that the new method is more powerful than Luo and Chen’s method for feature selection. This is demonstrated by simulation studies and illustrated by a real-life example. It is also proved that the new algorithm is selection-consistent.


Introduction
Advancements in technology have led to the production of sophisticated machines that can measure many details about every observational or experimental unit in a system. This results in data with more features (predictors, p) than observational or experimental units (sample size n), referred to as high-dimensional data. Much of these data come from genetic research, e-commerce, biomedical imaging, and functional magnetic resonance imaging, among many other sources.
Since there are many more features (p) than the sample size (n), such data are analyzed using a sparse high-dimensional regression (SHR) model:

$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i, \quad i = 1, \ldots, n. \qquad (1)$$

It is assumed that only a relatively small number of the $\beta_j$'s are nonzero. The main goal in the analysis is feature selection. As stated by [1], feature selection typically has two goals. The first is model building with desirable prediction properties. The second is identifying the features with nonzero coefficients. For convenience, such features are referred to as relevant features in this paper. One approach to the SHR model is to estimate the $\beta_j$'s by a regularization method, that is, by minimizing the penalized least squares

$$\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|),$$

where $\lambda$ is the regularization parameter and $p_\lambda$ is a penalty function. When $p_\lambda$ is based on the $L_1$ norm, $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$, the method is the LASSO [2]. The $L_1$ penalty is able to shrink the coefficients of redundant predictors exactly to zero; thus, the LASSO usually results in sparse models that are easier to interpret. Other penalty functions such as the SCAD [3], the adaptive LASSO (ALasso) [4], and the minimax concave penalty [5] have also been reported in the literature. SCAD smoothly clips the $L_1$ penalty for small $|\beta_j|$ and assigns a constant penalty for large $|\beta_j|$. The adaptive LASSO uses the weighted $L_1$ penalty $p_\lambda(|\beta_j|) = \lambda w_j |\beta_j|$, where the $w_j$ are data-driven weights. The regularization parameter $\lambda$ is usually chosen by cross-validation (CV).
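To illustrate how the $L_1$ penalty shrinks the coefficients of redundant predictors exactly to zero, the following is a minimal coordinate-descent sketch of the LASSO (the function names and standardization convention $x_j^\tau x_j = n$ are our own choices, not taken from any particular library), built on the soft-thresholding update that solves each one-dimensional penalized subproblem.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the L1 norm: shrinks z toward 0,
    # returning exactly 0 whenever |z| <= t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO (illustrative sketch).

    Assumes the columns of X are standardized (x_j^T 1 = 0, x_j^T x_j = n)
    and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave feature j out of the current fit
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam)
    return beta
```

On a sparse problem, most fitted coefficients come out exactly zero, which is the source of the interpretable sparse models mentioned above.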
Another approach to analyzing the SHR model, large-p-small-n problem, is sequential variable selection, which is designed to reduce the dimension of the data such that d < p. There are two forms of sequential variable selection. The first uses the sure screening property, which selects from the many features a subset which contains the relevant features (predictors). This is usually followed by a regularization method such as SCAD or ALasso to identify and estimate the relevant predictors from the reduced feature space. The other form is to sequentially select the relevant features through a repetitive process which terminates when a stopping criterion is met.
A recent addition to sequential feature selection is the sequential LASSO cum EBIC in ultra-high-dimensional feature space (SLasso) by [1], which sequentially solves partially penalized least squares problems and uses EBIC as the stopping criterion. The EBIC proposed by [6] is suitable for model selection in large model spaces and has the ordinary BIC as a special case. For large model spaces, the ordinary BIC tends to select a model with many spurious variables. Let k and k + 1 be the numbers of predictors in two nested candidate models. Using EBIC as the selection or stopping criterion, the model with k predictors is selected if EBIC(k + 1) > EBIC(k).
In SLasso, sequentially solving the partially penalized least squares reduces to selecting the feature(s) which maximize the ordinary correlation coefficient between the features and the response variable at each step. It is well-known that the Pearson correlation coefficient is used for measuring the strength of linear associations. Thus, maximizing the Pearson correlation coefficient might not work well for data structures where the relationship between at least one feature and the response variable is nonlinear.
In this article, we propose the use of the energy distance correlation instead of the ordinary correlation coefficient to identify and maximize both the linear and nonlinear relationships that might exist between each feature and the response. Energy distance is a metric that measures the distance between the distributions of random vectors. The name 'energy' is motivated by analogy to the potential energy between objects in a gravitational space. The potential energy is zero if and only if the locations (the gravitational centers) of the two objects coincide, and increases as their distance in space increases [7]. The energy distance correlation has an explicit relationship with the product-moment correlation, but unlike the classical definition of correlation, energy distance correlation is zero only if the random vectors are independent. The empirical energy distance correlation is based on Euclidean distances between sample elements rather than sample moments.
The remainder of the article is arranged as follows: in Section 2, we discuss the derivation of the energy distance correlation, extended Bayesian information criteria and our proposed method (energy distance correlation with EBIC (Edc + EBIC)). In Section 3, we report simulation studies comparing Edc + EBIC with various other methods and provide an analysis of real data. In Section 4, we conclude the article with a discussion of the results.

Energy Distance Correlation
The authors in [8] proposed the energy distance correlation between two random variables. Suppose that W ∈ R^p and Z ∈ R^q are two random vectors with E‖W‖ < ∞ and E‖Z‖ < ∞, where ‖·‖ is the Euclidean norm and E is the expected value. Let F and G be the cumulative distribution functions (CDFs) of W and Z, respectively. Further, let W′ denote an independent and identically distributed (iid) copy of W; that is, W and W′ are iid. Similarly, Z and Z′ are iid.
The squared energy distance can be defined in terms of expected distances between the random vectors,

$$D^2(F, G) = 2E\|W - Z\| - E\|W - W'\| - E\|Z - Z'\|,$$

where W and Z are independent, and the energy distance between the distributions F and G is defined as the square root of $D^2(F, G)$.
The energy distance correlation between random vectors W and Z with finite first moments is the nonnegative number R(W, Z) defined by

$$R^2(W, Z) = \begin{cases} \dfrac{\nu^2(W, Z)}{\sqrt{\nu^2(W)\,\nu^2(Z)}}, & \nu^2(W)\,\nu^2(Z) > 0, \\[4pt] 0, & \nu^2(W)\,\nu^2(Z) = 0, \end{cases}$$

where ν²(W, Z) is the energy distance covariance between W and Z, and ν²(W) and ν²(Z) are the energy distance variances of W and Z, respectively. For a statistical sample (w, z) = {(w_k, z_k), k = 1, 2, ..., n} from a pair of real- or vector-valued random variables (W, Z), the sample energy distance correlation R_n(w, z) is calculated by first computing the n × n distance matrices (a_{j,k}) and (b_{j,k}) containing all pairwise distances,

$$a_{j,k} = \|w_j - w_k\|, \qquad b_{j,k} = \|z_j - z_k\|, \qquad j, k = 1, 2, \ldots, n,$$

where ‖·‖ denotes the Euclidean norm. Secondly, all doubly centered distances are calculated,

$$A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \qquad B_{j,k} = b_{j,k} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot},$$

where $\bar{a}_{j\cdot}$ is the j-th row mean, $\bar{a}_{\cdot k}$ is the k-th column mean, and $\bar{a}_{\cdot\cdot}$ is the grand mean of the distance matrix of the w sample; the notation is similar for the b values.

The squared sample distance covariance (a scalar) is the arithmetic average of the products $A_{j,k}B_{j,k}$:

$$\nu_n^2(w, z) = \frac{1}{n^2}\sum_{j,k=1}^{n} A_{j,k}B_{j,k}.$$

The sample energy distance variances for the samples w and z are $\nu_n^2(w) = \nu_n^2(w, w) = \frac{1}{n^2}\sum_{j,k} A_{j,k}^2$ and $\nu_n^2(z) = \nu_n^2(z, z) = \frac{1}{n^2}\sum_{j,k} B_{j,k}^2$, and the sample energy distance correlation is

$$R_n^2(w, z) = \begin{cases} \dfrac{\nu_n^2(w, z)}{\sqrt{\nu_n^2(w)\,\nu_n^2(z)}}, & \nu_n^2(w)\,\nu_n^2(z) > 0, \\[4pt] 0, & \nu_n^2(w)\,\nu_n^2(z) = 0. \end{cases}$$

Some basic properties of the distance correlation are as follows: (i) $0 \le R_n(w, z) \le 1$; (ii) $R(W, Z) = 0$ if and only if W and Z are independent; (iii) if $R_n(w, z) = 1$, then there exist a vector a, a nonzero real number b, and an orthogonal matrix C such that $z = a + b\,wC$. For further details on the energy distance correlation, see [7].
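The sample computation above can be sketched in a few lines of NumPy. This is a minimal illustration under our own function name and array conventions; the doubly centered matrices follow the definitions just given.

```python
import numpy as np

def distance_correlation(w, z):
    """Sample energy distance correlation R_n(w, z) in [0, 1].

    w and z are paired samples of length n; each may be 1-D (scalars)
    or 2-D (one row vector per observation).
    """
    w = np.asarray(w, dtype=float).reshape(len(w), -1)
    z = np.asarray(z, dtype=float).reshape(len(z), -1)

    # Pairwise Euclidean distance matrices a_{jk} and b_{jk}
    a = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=-1)
    b = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)

    # Doubly centered distances: subtract row and column means, add grand mean
    A = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    B = b - b.mean(axis=1, keepdims=True) - b.mean(axis=0, keepdims=True) + b.mean()

    # Squared sample distance covariance and variances: averages of products
    dcov2, dvar_w, dvar_z = (A * B).mean(), (A * A).mean(), (B * B).mean()
    denom = np.sqrt(dvar_w * dvar_z)
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0
```

Unlike the Pearson correlation, this statistic is clearly positive for a purely nonlinear relationship such as z = w², which is the property the proposed method exploits.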

EBIC
Ref. [6] derived the EBIC, a family of criteria that contains the ordinary BIC as a special case. Let {(y_i, x_i) : i = 1, 2, ..., n} be independent observations. Suppose that the conditional density function of y_i given x_i is f(y_i | x_i, β), where β ∈ Θ ⊂ R^{p_n}, p_n being a positive integer. The likelihood function of β is

$$L_n(\beta) = \prod_{i=1}^{n} f(y_i \mid x_i, \beta).$$

Denote y = (y_1, y_2, ..., y_n)^τ. Let s ⊂ {1, 2, ..., p_n} and let β(s) be the parameter vector β with the components outside s set to 0. Let S be the underlying model space, i.e., S = {s : s ⊆ {1, 2, ..., p_n}}, let p(s) be a prior for model s, and assume that, given s, the prior density of β(s) is π(β(s)). The posterior probability of model s is then

$$p(s \mid y) \propto p(s) \int f(y \mid \beta(s))\, \pi(\beta(s))\, d\beta(s).$$

Suppose S is partitioned into $\cup_{j=1}^{p_n} S_j$, such that the models within each S_j have equal dimension, and let τ(S_j) be the size of S_j. Assign the prior distribution P(S_j) proportional to τ^η(S_j) for some η between 0 and 1, and, for each s ∈ S_j, assign equal probability p(s | S_j) = 1/τ(S_j). Then, the extended BIC family is given by

$$\mathrm{EBIC}_{\gamma}(s) = -2 \log L_n\{\hat{\beta}(s)\} + |s| \log n + 2\gamma \log \tau(S_j), \qquad \gamma = 1 - \eta,$$

where $\hat{\beta}(s)$ is the maximum likelihood estimator of β(s) and |s| is the number of components in s. When all features enter the model symmetrically, $\tau(S_j) = \binom{p_n}{j}$ for the models of dimension j.
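For the Gaussian linear model the criterion has a closed form, since $-2\log L_n\{\hat\beta(s)\} = n\log(\mathrm{RSS}/n)$ up to an additive constant shared by all models. The following is a minimal sketch under our own function names; setting `gamma = 0` recovers the ordinary BIC.

```python
import numpy as np
from math import lgamma, log

def log_binom(p, k):
    # log of the binomial coefficient C(p, k) via log-gamma,
    # i.e. log tau(S_j) when tau(S_j) = C(p, j)
    return lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)

def ebic_linear(y, X_s, p, gamma=1.0):
    """EBIC_gamma for a Gaussian linear model fit on the columns X_s.

    y: response of length n; X_s: n x |s| design matrix of candidate
    model s; p: total number of candidate features.
    """
    n, k = X_s.shape
    beta_hat, *_ = np.linalg.lstsq(X_s, y, rcond=None)
    rss = float(np.sum((y - X_s @ beta_hat) ** 2))
    # -2 * maximized Gaussian log-likelihood, up to an additive constant
    return n * log(rss / n) + k * log(n) + 2 * gamma * log_binom(p, k)
```

The combinatorial term $2\gamma\log\binom{p}{k}$ is what protects against spurious selections when the model space is large: adding a noise predictor must reduce the fit term by more than $\log n + 2\gamma\log\binom{p}{k+1}/\binom{p}{k}$ to be accepted.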

Energy Distance Correlation with EBIC (Edc + EBIC) Algorithm
We propose a sequential model selection method which we call energy distance correlation with EBIC, abbreviated Edc + EBIC for convenience. Let y_i, i = 1, ..., n, be a continuous response variable and let X = (x_1, ..., x_p) be the n × p data matrix with columns x_j, j = 1, ..., p. Let S be the index set of all predictors, and let $s_0 = \{j : \beta_j \ne 0,\ j = 1, \ldots, p\}$ be the index set of the relevant predictors. For s ⊂ S, let $s^- = s^c \cap s_0$; if $s \subset s_0$, then $s^-$ is the complement of s in $s_0$. Let $p_0 = |s_0|$ be the number of elements in the set $s_0$.
At the initial stage, we standardize all the variables. Next, we find the energy distance correlation between the response variable and each of the predictor variables, $\{R(x_j, y) : j = 1, \ldots, p\}$. We then select the predictor (feature) with the highest distance correlation with the response and store its index in the active set $s_1^*$.
Let L(s) be the linear space spanned by the columns of X(s) and let $H(s) = X(s)[X(s)^\tau X(s)]^{-1}X(s)^\tau$ be its corresponding projection matrix. We then compute $\tilde{y} = [I - H(s_1^*)]y$, the part of y left unexplained by $X(s_1^*)$. This gives $X(s_1^*)$ close to a zero chance of being selected in the subsequent steps.
For the general step $k \ge 1$, we calculate $\{R(x_j, \tilde{y}) : j = 1, \ldots, p\}$ over the features not yet selected and update the active set to $s_{k+1}^*$, the union of all previously selected features and the current one. We then compute EBIC($s_{k+1}^*$) and compare it with EBIC($s_k^*$). The procedure stops if EBIC($s_{k+1}^*$) > EBIC($s_k^*$), and the selected features, which we call the relevant features, are $X(s_k^*)$. We can then fit a linear regression model between the response y and the relevant features.
We wish to note that care must be taken in fitting this model because some of the predictors might be non-linearly related to y, and thus some of the predictors may have to enter into the model in their quadratic or cubic form, etc. Alternatively, a Box-Cox transformation can be performed on the data before fitting the model.
The algorithm details are given in the following.
• Initial Step: With y and x_j, j = 1, ..., p, standardized so that $y^\tau 1 = 0$, $x_j^\tau 1 = 0$ and $y^\tau y = n$, $x_j^\tau x_j = n$, compute R(x_j, y) for j ∈ S and let $s_1^* = \{\arg\max_{j \in S} R(x_j, y)\}$.
• General Step: For k = 1, 2, ..., compute $\tilde{y} = [I - H(s_k^*)]y$ and $R(x_j, \tilde{y})$ for $j \in S \setminus s_k^*$, and let $s_{k+1}^* = s_k^* \cup \{\arg\max_{j \in S \setminus s_k^*} R(x_j, \tilde{y})\}$. If EBIC($s_{k+1}^*$) > EBIC($s_k^*$), stop and take $X(s_k^*)$ as the selected model; otherwise continue.
When the process terminates, return the least-squares estimates for the parameters in the selected model.
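The loop above can be sketched as follows. This is a self-contained illustration under our own naming, with compact re-implementations of the sample distance correlation and the Gaussian-model EBIC as helpers; it is a sketch of the idea, not the authors' reference implementation.

```python
import numpy as np
from math import lgamma, log

def _dcor(u, v):
    # Sample energy distance correlation between two 1-D samples
    a = np.abs(u[:, None] - u[None, :])
    b = np.abs(v[:, None] - v[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    den = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt((A * B).mean() / den) if den > 0 else 0.0

def _ebic(y, X_s, p, gamma):
    # EBIC for a Gaussian linear model; gamma = 0 gives the ordinary BIC
    n, k = X_s.shape
    beta, *_ = np.linalg.lstsq(X_s, y, rcond=None)
    rss = float(np.sum((y - X_s @ beta) ** 2))
    log_tau = lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)
    return n * log(rss / n) + k * log(n) + 2 * gamma * log_tau

def edc_ebic_select(X, y, gamma=1.0):
    """Sequential Edc + EBIC sketch: repeatedly add the feature with the
    largest distance correlation with the current residual; stop when
    EBIC increases."""
    n, p = X.shape
    Z = (X - X.mean(0)) / X.std(0)        # x_j^T 1 = 0, common scale
    yz = (y - y.mean()) / y.std()
    active, resid, best = [], yz.copy(), np.inf
    for _ in range(min(n - 2, p)):
        cand = [j for j in range(p) if j not in active]
        j_star = max(cand, key=lambda j: _dcor(Z[:, j], resid))
        trial = active + [j_star]
        e = _ebic(yz, Z[:, trial], p, gamma)
        if e > best:                      # EBIC rose: keep previous model
            break
        best, active = e, trial
        # Residualize: resid = [I - H(s*_k)] y, projecting off X(s*_k)
        coef, *_ = np.linalg.lstsq(Z[:, active], yz, rcond=None)
        resid = yz - Z[:, active] @ coef
    return active
```

The least-squares residualization plays the role of $[I - H(s_k^*)]y$, so already-selected columns have essentially no chance of being picked again.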

Selection Consistency of Edc + EBIC
We attempt to establish the large-sample property of Edc + EBIC. We will show that, under regularity conditions, Edc + EBIC is selection-consistent. The proof essentially follows the approach in [9]. We proceed with the regularity conditions. Assumption 1. The random vectors X and Y possess subexponential tail probabilities, uniformly in p, specified as follows: there is a constant $a_0 > 0$ such that, for any $0 < a \le 2a_0$,

$$\sup_p \max_{1 \le k \le p} E\{\exp(a X_k^2)\} < \infty \quad \text{and} \quad E\{\exp(a \|Y\|_q^2)\} < \infty.$$
Details of Assumptions 1 and 2 are stated in [9]. Intuitively, Assumption 1 makes it possible to establish a relationship between the energy distance correlation and the squared Pearson correlation, which aids the derivations in the proof. Assumption 2 requires that the energy distance correlation for the relevant predictors cannot be too small. Assumption 3 requires that, in the sequential step of the algorithm, the maximum energy distance correlation between the already selected features and the residual response $\tilde{y}$ is smaller than the maximum energy distance correlation between the remaining features and the residual response. Theorem 1. Suppose that Assumptions 1-3 hold. The proposed Edc + EBIC with the energy distance correlation is selection-consistent, i.e., $\lim_{n\to\infty} P(s_{k^*}^* = s_{0n}) = 1$, where $s_{k^*}^*$ is the set of features selected at the $k^*$-th step of Edc + EBIC such that $|s_{k^*}^*| = p_{0n}$, $s_{0n}$ is the set of relevant features, and $p_{0n} = |s_{0n}|$.
Proof. Suppose that X ∈ R^p and Y ∈ R^q have cumulative distribution functions (CDFs) F and G, respectively, with E‖X‖ < ∞ and E‖Y‖ < ∞. The energy distance correlation R(X, Y) is the square root of the standardized coefficient

$$R^2(X, Y) = \frac{\mathrm{dcov}^2(X, Y)}{\sqrt{\mathrm{dcov}^2(X, X)\,\mathrm{dcov}^2(Y, Y)}}.$$

The numerator is the squared distance covariance defined by [8] as

$$\mathrm{dcov}^2(X, Y) = S_1 + S_2 - 2S_3,$$

where

$$S_1 = E\|X - X'\|_p\|Y - Y'\|_q, \qquad S_2 = E\|X - X'\|_p\,E\|Y - Y'\|_q, \qquad S_3 = E\|X - X'\|_p\|Y - Y''\|_q,$$

and (X, Y), (X′, Y′), and (X″, Y″) are independently and identically distributed. For a random sample {(x_i, y_i), i = 1, ..., n} from (X, Y), [8] estimated $S_1$, $S_2$, $S_3$ as

$$\hat{S}_1 = \frac{1}{n^2}\sum_{k,l=1}^{n}\|x_k - x_l\|_p\|y_k - y_l\|_q, \qquad \hat{S}_2 = \frac{1}{n^2}\sum_{k,l=1}^{n}\|x_k - x_l\|_p \cdot \frac{1}{n^2}\sum_{k,l=1}^{n}\|y_k - y_l\|_q,$$

$$\hat{S}_3 = \frac{1}{n^3}\sum_{k=1}^{n}\sum_{l=1}^{n}\sum_{m=1}^{n}\|x_k - x_l\|_p\|y_k - y_m\|_q,$$

so the sample distance covariance is $\widehat{\mathrm{dcov}}^2 = \hat{S}_1 + \hat{S}_2 - 2\hat{S}_3$. The remaining part of the proof is to show that the energy distance correlation is uniformly consistent and has the sure screening property. The numerator and denominator of the energy distance correlation have a similar form, so to show the uniform consistency of the energy distance correlation it suffices to show that both the numerator and the denominator are uniformly consistent.
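As a quick numerical check (not part of the proof), the moment form $\hat S_1 + \hat S_2 - 2\hat S_3$ agrees exactly with the doubly-centered form of the squared sample distance covariance used earlier; the function names below are our own.

```python
import numpy as np

def dcov2_centered(x, y):
    # Squared sample distance covariance via doubly centered distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

def dcov2_moments(x, y):
    # The same quantity via the moment form S1_hat + S2_hat - 2 * S3_hat
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    n = len(x)
    s1 = (a * b).mean()            # (1/n^2) sum_{k,l} a_kl * b_kl
    s2 = a.mean() * b.mean()       # product of the two mean distances
    s3 = (a @ b).sum() / n ** 3    # (1/n^3) sum_k sum_l sum_m a_kl * b_km
    return s1 + s2 - 2 * s3
```

The agreement of the two forms is an exact algebraic identity, not an asymptotic one, which is why the proof can work with whichever form is more convenient.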
The uniform consistency of the numerator, $\widehat{\mathrm{dcov}}^2 = \hat{S}_1 + \hat{S}_2 - 2\hat{S}_3$, of the energy distance correlation between the random vectors (X, Y) was shown by [9]. In the general step of the sequential algorithm for Edc + EBIC, however, the energy distance correlation is calculated between the residuals $\tilde{y} = [I - H(s_k^*)]y$ and $\tilde{x}_j = [I - H(s_k^*)]x_j$ at each step of the algorithm. Thus, the uniform consistency for Edc + EBIC follows by applying the proof of [9] to these residuals.
Additionally, [9] showed that the energy distance correlation has the sure screening property; that is, it is able to select a subset of the features that contains the relevant features. Their argument applies here because we use the energy distance correlation as well; thus, Edc + EBIC has the sure screening property.
Therefore, the Edc + EBIC is selection-consistent since it is uniformly consistent and has the sure screening property. The proof is complete.

Sure Independence Screening Using Energy Distance Correlation
We establish the need for Edc + EBIC by first examining the performance of a sure independence screening method introduced by [9], called Distance Correlation Sure Independence Screening (DC-SIS). It is similar to the Sure Independence Screening (SIS) introduced by [10].
In SIS, a componentwise regression is performed between each predictor and the response, and the first n − 1 or [n/log(n)] predictors with the largest absolute estimates are selected. Performing a componentwise regression is equivalent to computing the ordinary correlation between the response and each predictor when the two variables are standardized. In DC-SIS, the ordinary correlation is therefore replaced with the energy distance correlation.
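The screening step can be sketched as follows. The function names are our own, and the compact helper recomputes the sample distance correlation defined earlier in the paper.

```python
import numpy as np

def dcor(u, v):
    # Sample energy distance correlation between two 1-D samples
    a = np.abs(u[:, None] - u[None, :])
    b = np.abs(v[:, None] - v[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    den = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt((A * B).mean() / den) if den > 0 else 0.0

def dc_sis_screen(X, y, d=None):
    """Keep the d features with the largest distance correlation with y.

    By default d = [n / log n], the screening size used in this section.
    """
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))
    scores = np.array([dcor(X[:, j], y) for j in range(p)])
    return np.argsort(scores)[::-1][:d]   # indices of the top-d features
```

A regularization method such as SCAD or ALasso is then applied to the d retained columns only.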
We examine the performance of DC-SIS through a simulation study. We are interested in observing, on average, the model size selected by SCAD or ALasso if we screened the data first using DC-SIS. We present two simulation set-ups. For each simulation we generated two hundred datasets, and for each dataset we ran SCAD, ALasso, DC-SIS + SCAD, DC-SIS + ALasso and found the average model size and the standard deviation.
In [10], two simulation setups that we adapted for this subsection are discussed, namely an independent-features setup and a dependent-features setup. In Tables 1 and 2 we present results under the independent-features setup, and in Tables 3-5 we present results under the dependent-features setup. In each simulation, n is the sample size, p is the number of features, and s is the true model size. For the screening using the energy distance correlation, we kept d = [n/log n] features and then applied SCAD or ALasso.
In Tables 1 and 2, we report the average selected model size and the standard deviations. We observe that, in all cases, applying the sure screening by distance correlation before either SCAD or ALasso did not lead to significant differences in the average model size compared with applying SCAD or ALasso directly to the data. This suggests that screening by distance correlation before SCAD or ALasso did not yield the intended result and thus needs some improvement. In Tables 3-5, we report the selected model size and the standard deviation. We observe that applying DC-SIS followed by either SCAD or ALasso did not yield any significant difference in the average model size, as was also observed in the independent-features setup.

Simulation Studies to Compare Edc + EBIC with Other Feature Selection Methods
In this simulation study, we adopted two simulation setups from [1], which they call group A and group B. Under their group A we considered four settings of the covariance structure for the design matrix X, namely GA1, GA2, GA3, and GA5; under their group B we considered all three settings, namely GB1, GB2, and GB3. We compared the performance of the adaptive LASSO (ALasso) [11], SCAD [12], SIS + SCAD [10], SLasso [1], and the energy distance correlation with EBIC (Edc + EBIC) based on the model size (MSize), the positive discovery rate

$$\mathrm{PDR} = \frac{|s_{k^*}^* \cap s_0|}{|s_0|},$$

and the false discovery rate

$$\mathrm{FDR} = \frac{|s_{k^*}^* \cap s_0^c|}{|s_{k^*}^*|},$$

averaged over 200 and 500 simulation replications, respectively.
We considered the diverging pattern $(n, p, p_0) = (n, [5e^{n^{0.3}}], [4n^{0.16}])$, meaning that as the sample size increased, the number of predictors and the number of relevant predictors also increased. The coefficients were generated as independent random variables distributed as $(-1)^u (4n^{-0.15} + |z|)$, where u ∼ Bernoulli(0.4) and z is a normal random variable with mean 0 whose variance satisfies P(|z| ≥ 0.1) = 0.25. The variance of the error term in the linear model was determined through $\beta(s_0)^\tau \Sigma\, \beta(s_0)$, the variance of the linear predictor, where Σ is the variance-covariance matrix of the relevant features. The response variable was simulated from the sparse high-dimensional regression (SHR) model (1).

Group A Simulations and Results
We used two sample sizes, n = 100 and n = 200. By the diverging pattern considered for the simulation, for a sample size of 100 we have $(n, p, p_0) = (100, 268, 8)$, and for a sample size of 200 we have $(n, p, p_0) = (200, 672, 9)$. Under the two sample sizes we had eight and nine relevant features, respectively, and we expected a well-performing selection method to select, on average, the right number of relevant predictors (model size). The details of the covariance structures for GA1, GA2, GA3, and GA5 are in [1].
In Tables 6 and 7, we report the simulation results under the conditions for GA1, GA2, GA3, and GA5 using sample sizes 100 and 200, respectively. We observed that, under all setups, Edc + EBIC improved in terms of the average model size, PDR, and FDR as the sample size increased. This demonstrates the selection consistency of Edc + EBIC. Table 6 compares the methods using PDR, FDR, and model size (MSize) averaged over 200 simulation replications, with 8 relevant predictors and a sample size of 100; the standard deviations are in parentheses.

Group B Simulations and Results
We considered three different covariance structures, named GB1, GB2, and GB3, for the features (predictors), as used in [1]. We also increased the signal-to-noise ratio by increasing the expected magnitude of the coefficients of the relevant predictors.
In Table 8, we report the simulation results under the conditions for GB1, GB2, and GB3. We observed that SLasso and Edc + EBIC performed best: SLasso had the highest PDR, while Edc + EBIC had the lowest FDR. Thus, even when there was some correlation among the features, Edc + EBIC still performed well.

Real Data Example
The new method was applied to the gene expression data used in [1]. For the data and details of data collection and variable definitions, see [1].
This study aimed to find, among the 18,975 remaining probes, those most closely related to TRIM32. The response variable was the expression level of probe 1389163_at, and the features were the expression levels of the remaining 18,975 probes. Of these, the top 3000 probes with the largest variances were considered. The expression levels were standardized to have mean 0 and standard deviation 1.
In our analysis, for each of 100 replications we selected a random sample of size 100 from the 120 rats and applied Edc + EBIC to the sample. Across these 100 replications, Edc + EBIC selected the distinct probes 1367705_at and 1367728_at. In [1], the probes selected by five variable selection methods are reported; the probes selected by Edc + EBIC did not intersect with any of them. Among the probes selected by those five methods, some intersected, but this is not a surprise, because these methods essentially maximize only the linear relationship between TRIM32 and each of the probes. Since Edc + EBIC is capable of detecting and maximizing both the linear and nonlinear relationships that might exist between TRIM32 and each of the probes, as evidenced in the simulation studies by the high PDR and low FDR, we are convinced that the two probes selected by our method are the most strongly associated with TRIM32.

Conclusions and Discussion
From the simulation results in Tables 6 and 7, we observed that, as the sample size (n) increased, Edc + EBIC selected on average the expected number of predictors, and did so with decreasing standard deviations, meaning that across the simulation runs the number of selected predictors was increasingly close to the expected number of relevant predictors. We also observed a positive discovery rate of 100%, indicating that, on average in each simulation run, all of the relevant features were among those selected. Of greater importance were the small false discovery rates recorded as the sample size increased.
We found through simulation studies that when we applied the Distance Correlation Sure Independence Screening (DC-SIS) proposed by [9] for variable screening, followed by a regularization method such as SCAD or ALasso, the average model size selected was higher than expected, with large standard deviations.