A Hierarchical Feature and Sample Selection Framework and Its Application for Alzheimer’s Disease Diagnosis

Classification is one of the most important tasks in machine learning. Due to feature redundancy or outliers in samples, using all available data for training a classifier may be suboptimal. For example, the Alzheimer’s disease (AD) is correlated with certain brain regions or single nucleotide polymorphisms (SNPs), and identification of relevant features is critical for computer-aided diagnosis. Many existing methods first select features from structural magnetic resonance imaging (MRI) or SNPs and then use those features to build the classifier. However, with the presence of many redundant features, the most discriminative features are difficult to be identified in a single step. Thus, we formulate a hierarchical feature and sample selection framework to gradually select informative features and discard ambiguous samples in multiple steps for improved classifier learning. To positively guide the data manifold preservation process, we utilize both labeled and unlabeled data during training, making our method semi-supervised. For validation, we conduct experiments on AD diagnosis by selecting mutually informative features from both MRI and SNP, and using the most discriminative samples for training. The superior classification results demonstrate the effectiveness of our approach, as compared with the rivals.

multiple atlases or templates are available, the classification performance can be further improved 17,18 . Besides structural MRI, other imaging modalities such as functional MRI can also be used in AD/MCI diagnosis [19][20][21][22][23] , as they provide additional functional information about hypometabolism and specific protein quantification, which can be beneficial in disease diagnosis.
Besides imaging data that provide tissue level information to help AD diagnosis, genetic variants, which are related to AD, have also been shown to be valuable for AD diagnosis 24,25 . Genome-wide association studies (GWAS) were conducted to discover the association between the single nucleotide polymorphism (SNP) and the imaging data 26 . The SNP reveals molecular level information, which is complementary to the tissue level information in the imaging data. In ref. 27, the associations between SNPs and MRI-derived measures with the presence of AD were explored and the informative SNPs were identified to guide the disease interpretation. To date, most previous works have focused on analyzing the correlation between imaging and genetic data 28 , yet using both types of data for AD/MCI diagnosis has received very limited attention 29 .
Computer-aided diagnoses, including those for AD/MCI, often encounter a challenge that the data dimensionality is usually much higher than the number of available samples for model training 30 . This imbalance between feature number and sample size may affect the learning of a classification model for disease prediction, or a regression model for clinical score prediction. Furthermore, feature redundancy exists in both imaging and genetic data in terms of specific diseases. For example, in MRI-based diagnosis, features are usually generated by segmenting a brain into different regions-of-interest (ROIs) 29 . As some of the ROIs may be irrelevant to AD/MCI, feature selection can be conducted to identify the most relevant brain regions in order to learn the classification model more effectively. Similarly, only a small number of SNPs from a large pool are associated with AD/MCI 29 . Therefore, it is preferable to use only the most discriminative features from both MRI and SNPs for classification model training.
For AD diagnosis, various feature selection schemes, either unsupervised or supervised, have been proposed. Typical feature selection methods include t-test 31 , Fisher score 32 , Laplacian score 33 , and Pearson correlation 34 . Recently, sparse feature learning, e.g., the LASSO-based sparse feature learning 35 , has become a popular choice for feature selection. Besides using the  1 -norm based sparsity constraint for feature selection, the grouping or relational information embedded in data has also been introduced for improving feature selection procedures 17,36 . It is also important to mention that the unsupervised methods often consider certain data distributions or manifold structures, while the association between features and the corresponding class labels are overlooked. On the other hand, the supervised feature selection methods can be more effective by utilizing the label information in the learning process. In practice, unlabeled data may also be available but unusable by the supervised methods. In addition, while most of the previous works focused on feature selection, they did not consider discarding poor samples. Those unwanted samples may have been contaminated by noise, or may be outliers. Including poor samples can affect the model learning, thus degrading the diagnosis accuracy 37 .
In this paper, we propose a joint feature and sample selection framework which takes advantage of all labeled data along with unlabeled ones, in order to find the most informative features for classifier training. Specifically, a semi-supervised hierarchical feature and sample selection (ss-HMFSS) framework is introduced, which simultaneously selects discriminative features and samples from multimodal data. Besides a sparse constraint, we also impose a manifold term, which regularize on both labeled and unlabeled data. This regularization term preserves the neighborhood structures during the mapping from the original feature space to the label space. In our semi-supervised setting, we are able to exploit useful information from both labeled and unlabeled data, wherein the latter of which may be abundant in clinical practice.
Since the redundant features and poor samples may not be scarce, instead of achieving feature and sample selection in one single step, we perform feature and sample selection in a hierarchical manner, i.e., in multiple steps. Moreover, the feature coefficients learned in one hierarchy are used not only to discard unimportant features but also to weight the remaining features. The updated features and pruned sample set from each current hierarchy are supplied to the next round to further identify a smaller subset with even more discriminative features and samples. In this way, we gradually refine the feature and sample sets step-by-step, undermining the effect of non-discriminative data.
To validate our methodology, we conduct experiment on AD diagnosis. Structural MRI and SNPs are jointly used to improve the diagnosis accuracy, as data from both modalities are mutually informative measures in understanding disease prognosis 26 . The final selected features and samples by our method are used to train classifiers (in our case we use a Support Vector Machine (SVM)). The experimental data include 737 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. In different classification tasks, i.e., AD versus NC, MCI versus NC, and progressive MCI (pMCI) versus stable MCI (sMCI), our method demonstrates superior results, as compared with other competing methods.

Results
Experimental Settings. We consider three binary classification tasks in the experiments, namely AD vs.
NC, MCI vs. NC, and pMCI vs. sMCI. We adopt a cross-validation strategy (10-fold) in order to examine the classification performance. In detail, the data are randomly divided into ten roughly equal portions, and in each fold, the subjects in one fold are used as testing data, while the rest subjects are used as training data. Such process is executed ten times to alleviate bias in random partitioning. For the unlabeled data in our method, we choose the irrelevant subjects with respect to the current classification task, e.g., when we classify AD and NC, the data from MCI subjects are used as unlabeled data. The dimensionality of the SNP features is reduced to that of the MRI features before our joint feature and sample learning.
The parameters in feature and sample selection for each classification task are selected by grid search on the training data. The parameters λ 1 and λ 2 in Eq. 7 for regularization purpose are searched from the range {2 10 , 2 −9 , … , 2 0 }. After each hierarchy, 5% samples are discarded, and the features whose coefficients in w are smaller than 10 −3 are removed. The neighborhood size K in Eqs (5) and (6) is set to 20, as we empirically find that this is a reasonable choice to allow sufficient neighbors to assign a reliable soft label to unlabeled samples. To train the classifier, we use the implementation of LibSVM 38 for linear SVM model training with the default parameter C = 1, since we observe that the results are not sensitive to the changes in this parameter. To validate the statistical significance of our method, we perform paired-sample t-test to compare our method with the other benchmark methods.
Effects of Hierarchical Structure. To examine the effectiveness of the proposed hierarchical framework, Fig. 1 compares the classification accuracy (ACC) and area under the receiver operating characteristic (ROC) curve (AUC) under different settings of number of hierarchies. It is observed that the use of more hierarchies benefits the classification performance in all tasks, although the improvement becomes marginal after three hierarchies. Especially for the task of pMCI vs. sMCI classification, where the training data are not abundant, keeping discarding samples and features in a sequence of hierarchies may result in insufficient classification model learning. Therefore, we set the number of hierarchies to three in the following experiments. After this iterative process, on average, about 40% of the features are selected for training the classification models. It is also worth mentioning that compared to AD vs. NC classification, MCI vs. NC and pMCI vs. sMCI classifications are also critical in early diagnosis and possible therapeutic interventions, and these tasks can be more difficult, as demonstrated by lower values in ACC and AUC.
Effects of Multimodal Features. In our method, both MRI and SNP features are used. To study the contribution of individual feature modality, the ROC curves of the classification results using single feature modality are compared with those using both modalities in Fig. 2, and the values of ACC and AUC are listed in Table 1. As observed, using both modalities, i.e., MRI and SNP, better classification performances are achieved as compared with the use of a single feature modality in different classification tasks. This suggests that our method can effectively utilize the information from both modalities, and therefore produces better overall performance. For AD vs. NC classification, MRI features are more discriminative than the SNPs, while the opposite is observed in MCI vs. NC and pMCI vs. sMCI classifications. This suggests that the SNP features are more helpful in discerning the subtle differences in the possible presence of MCI.
Effects of Feature and Sample Selection. In our method, we select both discriminative features and samples to help build better classification models. To verify the individual contribution of feature and sample  selection, we compare the ACC and AUC values of the proposed method with both feature and sample selection, to using only sample or feature selection. The outcomes are shown in Table 2. We can observe that when using sample selection or feature selection only, the classification performance is inferior to the proposed method with both sample and feature selection. The contribution of feature selection is more significant than that of sample selection. This suggests that removing feature redundancy is more imperative. It is worthwhile to mention that while discarding samples is helpful, excessively doing so may be less effective or even counterproductive, due to the small sample size for training as a result.
Comparison with Other Methods. For a more comprehensive comparison, the proposed method (ss-HMFSS) is compared with some popular and advanced methods for AD related diagnosis. Specifically, the methods being compared are the following: • No feature selection (no FS), using MRI features only.
• No feature selection, using SNP features only.
• No feature selection, using both MRI and SNP features.
For all methods, linear SVM is used as the classifier. To more thoroughly compare performances, besides ACC and AUC, we also report sensitivity (SEN) and specificity (SPE). The sensitivity is defined by SEN = TP/ (TP + FN) and the specificity is defined by SPE = TN/(TN + FP), in which TP denotes true positive, FN denotes false negative, TN denotes true negative, and FP denotes false positive. SEN measures the classification accuracy for the positive samples, and SPE measures the classification accuracy for the negative samples.
The average classification results from the 10-fold cross-validation are reported in Table 3. Regarding the overall performance, AD vs. NC classification is relatively easier for different methods, as compared with MCI vs. NC and pMCI vs. sMCI classifications, as evidenced by higher performances in AD vs. NC classification. Regarding feature modality, MRI is more discriminative than SNP in distinguishing AD from NC, while for MCI vs. NC and pMCI vs. NC classifications, SNP is more useful.
Directly combining features from two different modalities may not necessarily improve classification performance. For example, in AD vs. NC, simply concatenating MRI and SNP features decreases the classification accuracy to 87.5%, as compared to an accuracy of 88.3% by using only MRI features. This is because SNP features are less discriminative for this classification task, and simply adding them affects the classification model learning. When features from both modalities are combined, a feature selection step is helpful, as indicated by the improved results using different feature selection methods. Compared with unsupervised feature selection method such as Laplacian score 33 , the supervised ones, i.e., Fisher score 32 and LASSO 35 perform better.
The RIML method 17 is a recently proposed multimodal feature selection method, representing the state-of-the-art in feature selection for AD and MCI diagnosis. It considers the relationships among samples and different feature modalities when performing feature selection in a single step. On the contrary, we improve the effectiveness of feature selection by employing a hierarchical framework to keep only the most discriminative  Table 3. Comparison of classification performance by six different methods (in %). Symbol † indicates p < 0.01 in the t-test as compared to the best method, and * means p < 0.05. parahippocampal gyrus, middle temporal gyrus, and precuneus, have been shown to be related to AD [39][40][41] . The selections of ROIs by our method are congruent with those from previous works.
Selected SNP Features. The most frequently selected SNP features and their gene origins are listed in Table 4. These genes have also been reported to be related to AD in previous works 29,[42][43][44] . For example, the CTNNA3 gene, which is a protein-coding gene, is a top candidate gene for AD 29 . The SNPs in SORL1, DAPK1 and SORCS1 genes have shown significant association with hippocampal volume change, which is related to AD progression 42 . The VEGFA gene is associated with an increased risk of developing AD, as well as an accelerated cognitive decline 43 . The SNPs in APOE have also been related to neuroimaging measures in brain disorders such as MCI and AD 44 . The discovery of those SNPs by our method suggests that our method is able to identify the most relevant SNPs for AD diagnosis.

Discussion
To sum up, we have presented a semi-supervised hierarchical feature and sample selection (ss-HMFSS) framework, in which both labeled and unlabeled data can be utilized to preserve the data manifold in the learning process. To validate the effectiveness of our method, we conducted experiments on AD diagnosis with both imaging and genetic data from ADNI cohort. Results showed that the proposed hierarchical scheme was able to gradually refine the feature and sample set in multiple steps, therefore leading to superior performances in AD vs. NC, MCI vs. NC, and pMCI vs. sMCI classifications.
In clinical applications, differentiating pMCI and sMCI is of great interest and importance. The results on pMCI vs. sMCI classification by our method in Table 3 indicate that the classification ability of our algorithm on this task is on par with that for MCI vs. NC classification (with an accuracy of 80.8% as compared to that of 80.1% for MCI vs. NC classification). Although the performance itself may not warrant highly accurate computerized diagnosis, we believe that the results by our method can aid physicians by providing a useful second opinion for reference.
Different from ROI-based MRI features, the dimensionality of the original SNP features is substantially higher. Since only a small set of genetic variants are directly related to AD 25 , using a prior knowledge from clinical studies to select only the most relevant SNPs may result in a more effective classification model learning.
We have use two different modalities, i.e., MRI and SNP, for three binary classification tasks. Although other modalities such as PET (positron emission tomography) and CSF (cerebrospinal fluid) are available for some subjects in the ADNI-1 dataset, the subjects in our experiment do not have complete data from each modality. Note that a subset of the population in our study may contain all data modality, yet the results from a smaller test set may be less informative and conclusive. We conjecture that, with the inclusion of more data modalities, the predictive performance of the trained diagnostic models can be further improved. Therefore, in future, we plan to utilize more data with additional modalities, e.g. PET and CSF, to help further improve diagnosis performance.
When examining the data used in this work, we noticed that the numbers of females and males are not homogeneous. Given that the available data are not quite abundant from a machine learning point of view, we decided to use all the data to train and validate our algorithm on sample and feature selection. This would help alleviate under-fitting in the iterative learning process if more data are needed than available. Regarding the ethnicity, over 90% of the subjects in this study are white, and the rest are mainly black or Asian. Therefore, this dataset may not be the first choice to study the correlation between AD and race. It has been reported that gender 45,46 and race 47,48 are important factors in AD studies. Studying the impact of gender or race on AD diagnosis would help further improve the algorithm development and diagnosis. For a more comprehensive study, dataset without demographic bias needs to be collected. In this work, we train a model without taking into account the gender or race information, and this is a current limitation. Nevertheless, the goal of this work is to introduce a generic machine learning framework, which can be readily applied for AD diagnosis. Future work is expected to address these aforementioned aspects. In addition, another limitation of our method is that it requires complete data from different modalities for each subject. Extending our method to handle incomplete data is our current ongoing work.

Methods
Data. The data in our experiments are from the ADNI-1 dataset (http://adni.loni.usc.edu). This dataset enrolls subjects who were 55-90 years old with study partners who can provide independent evaluations of functioning. The general inclusion/exclusion criteria for the enrolled subjects are the following: • NC subjects: Mini-Mental State Examination (MMSE) scores between 24 and 30 (inclusive), a Clinical Dementia Rating (CDR) of 0, non-depressed, non-MCI, and non-demented.   Specifically, in this study, we use 737 subjects whose MRI and SNP features are both available in the dataset. Among these subjects, 171 were diagnosed with AD, 362 were MCI patients, and the rest 204 subjects were NCs. Among the MCI patients, 157 of them were labeled as pMCI, and 205 were sMCI. The sMCI subjects were diagnosed previously as MCI patients but remained stable all the time, while pMCI refers to the MCI patients who converted to AD within a 24 months span. Table 5 summarizes the demographic information of the subjects in our experiments.
Preprocessing. The data preprocessing follows the procedures as outlined in ref. 29. Specifically, for MRI data, the preprocessing steps included skull stripping 49 , dura removal, intensity inhomogeneity correction, cerebellum removal, tissue segmentation, and registration. The preprocessed images were then divided into 93 pre-defined ROIs based on the template in ref. 50, and the gray matter volume in these ROIs were calculated as MRI features. Note that the gray matter volumes were corrected for the total intracranial volume of each subject, in order to account for the body size variations in the population.
The SNP data were genotyped using the Human 610-Quad BeadChip 42 . According to the AlzGene database (www.alzgene.org), only SNPs that belong to the top AD gene candidates were selected after standard quality control (QC). The QC of SNP data included the following steps: • Call rate check per SNP per subject.
• Marker removal by the minor allele frequency.
After QC, the SNPs were imputed to estimate the missing genotypes, and the Illumina annotation information was used to select a subset of SNPs 51 . The dimensionality of the processed SNP data is 2098. Since this SNP feature dimension is much higher than that of MRI, we perform sparse feature learning 35 on the training data to reduce the number of SNP features to the same dimension as the MRI features.
The framework of the proposed method is illustrated in Fig. 4. After features are extracted and preprocessed from the raw SNP and MRI data, we first calculate the graph Laplacian matrix to model the data structure, using the concatenated features from both labeled and unlabeled data. This Laplacian matrix is then used in the manifold regularization to jointly learn the feature coefficients and sample weights. In each hierarchy, the features are selected and weighted based on the learned coefficients, and the samples are pruned by discarding those with smaller sample weights. The updated features and samples are then forwarded to the next hierarchy for further selection, following the same process. In such a hierarchical manner, we gradually select the most useful features and samples to mitigate the adverse effect of data redundancy in the learning process. Finally, the selected features and samples are used to train classification models using SVM for AD/MCI diagnosis tasks. In the following, we explain in detail how the joint feature and sample selection works in each hierarchy.
Throughout this section, we use boldface uppercase letters to denote matrices (e.g., X), and boldface lowercase letters to denote vectors (e.g., x). All non-bold letters denote scalar variables. x 2 2 and x 1 represent the squared Euclidean norm and the  1 norm of x, respectively. The transpose of X is denoted as X Τ .
Suppose we have N l labeled training subjects with their class labels and the corresponding features from both MRI and SNP, denoted by is the loss function defined on the labeled data,  ∼ y X X w ( , , , ) m is the manifold regularization term for labeled data as well as unlabeled data. The regularization term is based on the assumption that if two samples x p and x q are close to each other in their original feature space, after mapping into the new space (i.e., label space), their neighborhood structure should also be maintained, with an illustration given in Fig. 5 1 is the sparse regularizer for the purpose of feature selection, and only features with non-trivial coefficients in w are expected to be discriminative. In the following, we explain in detail how the loss function and the manifold regularization term are defined, and how the sample weights are incorporated.

Loss Function. The loss function y X
( , )  considers the weighted loss for each sample, and is given by where  ∈ × A N N l l is a diagonal matrix, and each diagonal element denotes the weight for a data sample. Intuitively, a sample that can be more accurately mapped into the label space with minimal error is more desirable, comparatively, and thus it should contribute more to the classification model. The sample weights in A will be learned through optimization and the samples with larger weights will be selected to train the classifier.
Manifold Regularization. The manifold regularization preserves the neighborhood structures for both labeled and unlabeled data when they are mapped from feature to label space:  S(p, q), and S is the affinity matrix with S(p, q) denoting the similarity between samples x p and x q . S(p, q) is defined as Figure 4. Framework of the proposed semi-supervised hierarchical multimodal feature and sample selection (ss-HMFSS) for AD and MCI diagnosis. The data are first preprocessed, and features are extracted from MRI and SNP, respectively. The MRI features and the preselected SNP features from both labeled and unlabeled data are used to exploit the data manifold via a Laplacian matrix computation. In a joint feature and sample selection learning framework, manifold preservation, feature selection, and sample selection are achieved. This learning process is performed in a hierarchical manner to gradually identify a set of the most discriminative features and samples, which are then used to train the classification model. where y p and y q are the labels for x p and x q . For the case of unlabeled data, y p is a soft label for an unlabeled data sample x p , defined as p p pos where k p pos is the number of x p 's neighbors with positive class labels out of its K neighbors in total. Note that for an unlabeled sample, the nearest neighbors are searched only in the labeled training data, and the soft label represents its proximity to a target class. Using such definition, the similarity matrix S encodes relationships among both labeled and unlabeled samples.
ul applies weights on both labeled and unlabeled samples. The elements in A are different for labeled and unlabeled data: ), it is assigned a smaller weight, since this sample may not be discriminative enough in terms of class separation. The weights in A for the labeled data are to be learned in the optimization process.
Objective Function. Taking into account the loss function, the manifold regularization, as well as the sparse regularization on features, the overall objective function is where the elements in A are enforced to be non-negative to assign physically interpretable weights to different samples. Also, the diagonal of A should sum to one, which makes the sample weights interpretable as probabilities, and ensures that sample weights will not be all zeros.
Optimization. Since Eq. (7) is biconvex with respect to w and A, we employ an alternating optimization strategy to solve this problem, meaning that we split the objective function into two sub-problems and then solve them iteratively. When one unknown variable is fixed, the resulting sub-problem would be convex. In such a way, the original objective function can converge to the optimal point 52 . Specifically, we first fix A to find the solution of w, and then vice versa. When A is fixed, Eq. (7)  It is easy to verify that Eq. (8) is non-smooth, although it is convex, because of  1 -norm regularizer. One way to cope with this problem is to approach the original non-smooth objective function using a function which is smooth. Then this smooth objective function can be solved using standard fast algorithms. In this work, we resort to the widely used Accelerated Proximal Gradient (APG) method 53 to solve Eq. (8).
In the second step, given a fixed w, the objective function in Eq. (7)  Note that since the unlabeled data are irrelevant to the original objective function in Eq. (7), we only need to optimize A via Eq. (9). Eq. (9) is convex with respect to A, and can be efficiently solved via quadratic programming 54 .
To this end, the discriminative features are identified by the significant values in w, and the poor samples are assigned lower weights in A. Therefore, those less useful features and samples can be discarded based on the values in w and A, which leads to a more compact yet effective subset of features and samples as compared with the original data. In addition, the learned coefficients in w can be used to weight the features, addressing their importance. This completes the first hierarchy. In the next hierarchy, the selected samples and updated feature sets are used similarly in the optimization of Eq. (7) to further refine the sample and feature sets. The entire process of the proposed method is summarized in Algorithm 1.

Input:
Labeled and unlabeled data, the number of hierarchies L.
1: Initialize labeled sample weights in A and feature coefficients in w.
2: for i = 1 to L do

3:
Calculate the data similarity scores in S by Eq. (4).

4:
Calculate the sample weights in A by Eq. (6).
8: unitl convergence 9: Discard poor samples and non-discriminative features based on the values in A and w.
10: Weight the remaining features by the coefficients in w.

11: end for
Output: Subset of samples and features.