Structured feature selection using coordinate descent optimization

Background: Existing feature selection methods typically do not take into account prior knowledge in the form of structural relationships among features. In this study, the features are structured into groups based on prior knowledge. The problem addressed in this article is how to select one representative feature from each group such that the selected features jointly discriminate the classes. The problem is formulated as a binary constrained optimization, and the combinatorial optimization is relaxed as a convex-concave problem, which is then transformed into a sequence of convex optimization problems so that it can be solved by any standard optimization algorithm. Moreover, a block coordinate gradient descent (BCGD) optimization algorithm is proposed for high-dimensional feature selection, which in our experiments was four times faster than a standard optimization algorithm.

Results: In order to test the effectiveness of the proposed formulation, we used microarray analysis as a case study, where genes with similar expressions or similar molecular functions were grouped together. In particular, the proposed block coordinate gradient descent feature selection method is evaluated on five benchmark microarray gene expression datasets, and evidence is provided that the proposed method gives more accurate results than state-of-the-art gene selection methods. Out of 25 experiments, the proposed method achieved the highest average AUC in 13, while each of the other methods achieved the highest average AUC in no more than 6.

Conclusion: A method is developed to select one feature from each group. When the features are grouped based on similarity in gene expression, we showed that the proposed algorithm is more accurate than state-of-the-art gene selection methods that were specifically developed to select highly discriminative and less redundant genes. In addition, the proposed method can exploit any grouping structure among features, while the alternative methods are restricted to similarity-based grouping.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0954-4) contains supplementary material, which is available to authorized users.


Analytics in Sports
One major objective of analytics in sports is to enhance team performance by selecting the best possible players and making the best possible decisions on the field or court [?]. Imagine that a coach needs to select the best possible set of players for the team. Intuitively, the set of all available players can be grouped (based on their positions on the field) into G groups, where each group contains all players who play in that position. Since the objective is to select the best team, one may claim that the problem can be solved by selecting the best player for each position separately. However, with that approach the synergy among the players is not considered. For example, suppose players 1 and 2 are the best players for positions A and B, respectively, but the two might not cooperate well; including players 1 and 2 in the same team might then reduce the performance of the entire team. Therefore, the idea is to select one player from each group such that the selected team has the best performance.

Derivation of Loss Functions

Logistic Loss
Assume that the logistic loss is the loss function of interest [3, ?]. Then

    L(w) = Σ_{m=1}^{M} log(1 + exp(−y_m ŷ_m)),

where y_m is the label of the mth example and f_g^{i,m} is the ith feature in group g for the mth example. Assume that ŷ_m = Σ_g Σ_i w_g^i f_g^{i,m} is the predicted score. Then the Jacobian and the Hessian matrix are defined as:

    ∂L/∂w_g^i = − Σ_m y_m f_g^{i,m} / (1 + exp(y_m ŷ_m)),
    ∂²L/∂w_g^i ∂w_h^j = Σ_m f_g^{i,m} f_h^{j,m} σ_m (1 − σ_m),   with σ_m = 1 / (1 + exp(y_m ŷ_m)).
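To make the derivation concrete, the logistic loss together with its Jacobian and Hessian can be computed as follows (a minimal NumPy sketch; the function name and the flattened (N, M) feature matrix F are our own conventions, with the group/feature indices g, i flattened into a single row index):

```python
import numpy as np

def logistic_loss(w, F, y):
    """Logistic loss, Jacobian, and Hessian for a linear score.

    w : (N,) weight vector over all features (all groups concatenated)
    F : (N, M) feature matrix; column m is the feature vector of example m
    y : (M,) labels in {-1, +1}
    """
    yhat = w @ F                              # predicted scores, shape (M,)
    margins = y * yhat                        # y_m * yhat_m
    loss = np.sum(np.log1p(np.exp(-margins)))
    sigma = 1.0 / (1.0 + np.exp(margins))     # sigmoid(-y_m * yhat_m)
    jac = -F @ (y * sigma)                    # gradient, shape (N,)
    D = sigma * (1.0 - sigma)                 # per-example curvature
    hess = (F * D) @ F.T                      # Hessian, shape (N, N)
    return loss, jac, hess
```

The Hessian is positive semidefinite, which is what makes each convex subproblem in the sequence tractable for a standard solver.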

Class Separation Loss
In order to precisely define the class separation loss, let IntraCS denote the intra-class distance, which is the sum of the distances between all pairs of examples from the same class C:

    IntraCS = Σ_{m,k ∈ C} || w ⊙ f_{:m} − w ⊙ f_{:k} ||²,

where f_{:m} is the feature vector of the mth example, w is the weight vector for all features, ⊙ is the pairwise (element-wise) multiplication, and the sum runs over all examples m and k from the same class C.

Similarly, the inter-class distance InterCS is defined as the sum of the distances between all pairs of examples from different classes:

    InterCS = Σ_{m ∈ C1, k ∈ C2} || w ⊙ f_{:m} − w ⊙ f_{:k} ||²,

where the sum runs over all examples from different classes C1 and C2.
The objective of the class separation loss is to minimize the intra-class distance and maximize the inter-class distance; therefore, the loss function is defined as CS = IntraCS − InterCS. Following [4], the class separation loss CS can equivalently be written as

    CS(w) = Σ_{m,k} A_{mk} || w ⊙ f_{:m} − w ⊙ f_{:k} ||²,    (9)

where A_{mk} is 1 if the examples m and k are in the same class and −1 otherwise. Let L be the Laplacian of the adjacency matrix A. Then equation (9) can be rewritten as the linear term [4]

    CS(w) = 2 Σ_g Σ_i Q_g^i w_g^i,

where Q_g^i is the value of the entry in Q = diag(F L F^T) that corresponds to w_g^i (F being the matrix whose mth column is f_{:m}), and we used the fact that the weights are binary, so (w_g^i)² = w_g^i. The Jacobian is then the constant vector 2Q and the Hessian is the zero matrix.

Implementation Details
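As a concrete check, the equivalence between the pairwise form of the class separation loss and its Laplacian-based form can be verified numerically (a minimal NumPy sketch of our reading of the derivation; function and variable names are ours):

```python
import numpy as np

def class_separation_loss(w, F, labels):
    """Class separation loss: direct pairwise form and Laplacian form.

    w      : (N,) feature weights
    F      : (N, M) feature matrix; column m is example m
    labels : (M,) class labels
    A_mk = +1 if examples m and k share a class, -1 otherwise.
    """
    labels = np.asarray(labels)
    A = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian of A
    # Direct pairwise form: sum_{m,k} A_mk || w*f_m - w*f_k ||^2
    X = w[:, None] * F                        # weighted features, (N, M)
    diff = X[:, :, None] - X[:, None, :]      # (N, M, M) pairwise differences
    direct = np.sum(A * np.sum(diff ** 2, axis=0))
    # Compact form: 2 * sum_i w_i^2 * Q_i, with Q = diag(F L F^T)
    Q = np.einsum('im,mk,ik->i', F, L, F)
    compact = 2.0 * np.sum(w ** 2 * Q)
    return direct, compact
```

The identity Σ_{m,k} A_{mk} ||x_m − x_k||² = 2·tr(X L Xᵀ) is what collapses the O(M²) pairwise sum into a single pass over the features.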

Convergence Criteria
The convergence of the optimization algorithm (either the standard trust-region-reflective or the proposed BCGD algorithm) can be based on the norm of the step tolerance. However, we found that using the step tolerance as the convergence test, while it does converge to the minimum, takes more iterations than necessary in both algorithms. In other words, the algorithm identifies the representatives (one feature from each group) in just a few iterations, and the remaining iterations only fine-tune the weights. Since we are not interested in finding the optimal weights of the features (because all of them would approach zero and the prototypes would approach 1) but rather in finding the prototypes themselves, we test the convergence of the algorithm based on the features selected in the last 3 iterations: if the same features are selected for 3 consecutive iterations, we consider the algorithm to have converged. While this convergence test is heuristic, we found that it gains scalability without sacrificing optimality: both tests yielded the same set of features, while the heuristic one is much faster. To have a fair comparison between the trust-region-reflective and BCGD algorithms, we used the same convergence test for both.
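The heuristic test described above can be sketched in a few lines (names and the exact bookkeeping are our own conventions):

```python
def has_converged(history, patience=3):
    """Heuristic convergence test: stop when the set of selected features
    (e.g., the top-weighted feature of each group) is unchanged for the
    last `patience` iterations.

    history : list with one entry per iteration, each entry a frozenset
              of the feature indices selected at that iteration.
    """
    if len(history) < patience:
        return False
    last = history[-patience:]
    return all(s == last[0] for s in last)
```

In the optimization loop, one would append the frozenset of per-group argmax indices after every iteration and stop as soon as `has_converged(history)` returns True, instead of waiting for the step norm to fall below the tolerance.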

Parameters for Gene Expression Experiments
In the gene expression experiments, and in order to have a fair comparison with the STBIP algorithm, the class separation loss function is used. The Lagrangian parameters are set to λ1 = λ2 = 100 in all gene expression experiments. The value of the Lagrangian parameter is chosen to balance the two terms: the minimization of the loss function and the constraints on the weights that enforce choosing one representative feature from each group. A higher value puts more emphasis on satisfying the constraints, whereas a lower value puts more emphasis on minimizing the loss function.
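The balance described above can be illustrated with a penalized objective. The exact constraint terms are not spelled out in this section, so the two penalties below (a sum-to-one penalty per group and a binariness penalty) are assumptions chosen purely for illustration, not the paper's formulation:

```python
import numpy as np

def penalized_objective(w, groups, loss_fn, lam1=100.0, lam2=100.0):
    """Hypothetical penalized objective: loss plus Lagrangian penalties.

    w       : (N,) feature weights
    groups  : list of index arrays, one per group
    loss_fn : callable mapping w to a scalar loss (e.g., class separation)
    lam1, lam2 : Lagrangian parameters (both set to 100 in the paper's
                 gene expression experiments)
    """
    loss = loss_fn(w)
    # Assumed penalty: each group's weights should sum to one.
    sum_to_one = sum((np.sum(w[g]) - 1.0) ** 2 for g in groups)
    # Assumed penalty: each weight should be near 0 or 1.
    binariness = np.sum(w * (1.0 - w))
    return loss + lam1 * sum_to_one + lam2 * binariness
```

With both penalties exactly zero only at one-hot, binary weights per group, larger λ values push the solution toward selecting a single representative per group, while smaller values favor the raw loss.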

Synthetic Data Generation Process
Let us assume that the number of examples is M and that each class has the same number of examples, M/2. We generated N features, where each feature value is drawn uniformly from {0, 1}, and split the features randomly into G groups (each group may have a different number of features). Then we chose one feature from each group and replaced it with a discriminative feature that is 1 for one class and 0 for the other. The rationale for this generation process is that we need data on which to compare the computational time of the trust-region and block coordinate gradient descent algorithms.
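The generation process above can be sketched as follows (a minimal NumPy sketch; the function name, signature, and seed handling are our own conventions):

```python
import numpy as np

def generate_synthetic(M=100, N=50, G=5, seed=0):
    """Synthetic data: M examples (M/2 per class), N binary features split
    randomly into G groups; one randomly chosen feature per group is
    replaced by a perfectly discriminative feature (1 in class +1, 0 in
    class -1). Returns the (N, M) feature matrix, labels, groups, and the
    indices of the planted discriminative features."""
    rng = np.random.default_rng(seed)
    y = np.repeat([1, -1], M // 2)                     # balanced classes
    F = rng.integers(0, 2, size=(N, M)).astype(float)  # uniform {0, 1}
    perm = rng.permutation(N)
    cuts = np.sort(rng.choice(np.arange(1, N), size=G - 1, replace=False))
    groups = np.split(perm, cuts)                      # G random groups
    planted = []
    for g in groups:
        i = rng.choice(g)                              # one feature per group
        F[i] = (y == 1).astype(float)                  # make it discriminative
        planted.append(i)
    return F, y, groups, planted
```

Because the planted features are known, one can verify that both optimizers recover the same representatives while timing only differs.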

Scalability of BCGD
We compared the average running time of both algorithms, TR and BCGD, on different numbers of groups, as shown in Figure 1, and on different numbers of features, as shown in Figure 2.