Metabolomic biosignature differentiates melancholic depressive patients from healthy controls

Major depressive disorder (MDD) is a heterogeneous disease at the level of clinical symptoms, and this heterogeneity is likely reflected at the level of biology. Two clinical subtypes within MDD that have garnered interest are “melancholic depression” and “anxious depression”. Metabolomics enables us to characterize hundreds of small molecules that comprise the metabolome, and recent work suggests the blood metabolome may be able to inform treatment decisions for MDD, however work is at an early stage. Here we examine a metabolomics data set to (1) test whether clinically homogenous MDD subtypes are also more biologically homogeneous, and hence more predictiable, (2) devise a robust machine learning framework that preserves biological meaning, and (3) describe the metabolomic biosignature for melancholic depression. With the proposed computational system we achieves around 80 % classification accuracy, sensitivity and specificity for melancholic depression, but only ~72 % for anxious depression or MDD, suggesting the blood metabolome contains more information about melancholic depression.. We develop an ensemble feature selection framework (EFSF) in which features are first clustered, and learning then takes place on the cluster centroids, retaining information about correlated features during the feature selection process rather than discarding them as most machine learning methods will do. Analysis of the most discriminative feature clusters revealed differences in metabolic classes such as amino acids and lipids as well as pathways studied extensively in MDD such as the activation of cortisol in chronic stress. We find the greater clinical homogeneity does indeed lead to better prediction based on biological measurements in the case of melancholic depression. Melancholic depression is shown to be associated with changes in amino acids, catecholamines, lipids, stress hormones, and immune-related metabolites. The proposed computational framework can be adapted to analyze data from many other biomedical applications where the data has similar characteristics.


Background
Major Depressive Disorder is the most common mental illness, affecting an estimated 350 million people worldwide [1]. For many, MDD is a lifelong illness consisting of recurrent episode, each of which may cause disability and severely interfere with an individual's everyday life, and greatly increase the risk of suicidal behavior [2]. Current antidepressants are fully effective in only about a third of patients, with another third partially responding. Thus there is a tremendous need to identify novel therapies to help those not served well by today's treatment strategies. This can only come with a deeper understanding of the biology and the biological heterogeneity of MDD.
The identification of replicable biomarkers differentiating patients with MDD from healthy controls has lagged behind other diseases. This is reflected in a megaanalysis of GWAS data performed by the Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium on 9240 MDD cases and 9519 controls which identified no replicable markers, despite the detection of significant markers in 94 % of other disease tested in populations of the same size [3]. A key factor likely hindering the discovery of biomarkers for or prediction of MDD compared with other diseases is high degree of clinical and biological heterogeneity [4][5][6][7][8][9][10][11]. In the DSM V diagnostic manual, MDD is defined by patients having at least one of the two core depressive symptoms, 'depressed mood' and 'anhedonia' , and at least 5 of 9 overall symptoms. This definition alone leads to heterogeneity based on the combination of symptoms endorsed by any given patient. Even within individual symptom items there is heterogeneity with criteria such as "gaining weight or losing weight", "hypersomnia or insomnia", "psychomotor agitation or retardation". It would be surprising if such clinical heterogeneity were not also reflected in the biology of MDD.
Two patient groups easily separable based on clinical features are those with melancholic depression and those with anxious depression. Melancholic depression is characterized by pervasive anhedonia, lack of reactivity to circumstances, depressed mood of a distinct quality, and typical vegetative symptoms such as appetite or weight loss, early morning awakening, worse mood in the morning [4,12,13]. The prevalence of melancholic depression among depression population is around 25-30 % [14]. Melancholic depression is found to exhibit stronger response to physical treatments but weaker response to psychotherapy or placebos compared to other subtypes of depression [15][16][17]. Anxious depression is characterized by the co-occurrence of MDD with anxious symptoms, possibly subthreshold for diagnosis of an anxiety disorder. Anxiety during an MDD episode has been widely studied as a predictor of a more severe course of illness, and poorer patient outcomes [18][19][20].
Here we report results on the Janssen-BRC metabolomics data set, consisting of 97 healthy control and 90 MDD subjects, of which 21 suffer melancholic depression and 58 from anxious depression. In this work, our goals are three-fold. First, we test the hypothesis that more clinically homogeneous groups of MDD patients are easier to predict from healthy controls than the entire MDD group using blood metabolomics data. Second, we develop a novel method for building maximally predictive and robust machine-learning classifiers that retain information on the correlation structure of the metabolomics data to ease biological interpretation. Third, we use this framework to describe the metabolomics biosignature of melancholic depression.

Data description
The Janssen-BRC metabolomic data set was part of a casecontrol study of MDD designed to detect biomarkers of depression and subtypes of depression by investigating various data sources, such as patients' demographic information, psychophysiological and neuropsychological indices, and molecular profiling. Volunteers were recruited nationwide by Brain Resource Company (BRC) in Australia. Patients were designated as MDD if they scored > = 14 on the Hamilton Depression Rating Scale-17 (HAMD17). Within MDD subjects, patients were designated as melancholic depressed if they additionally scored > =8 on the CORE scale [21][22][23][24]. The CORE scale for melancholic depression consists of a 18-item scale with each item rated on a 4-point scale (0-3) by clinicians [25,26]. Patients were designated as anxious depressed the number of comorbid anxiety disorders on the M.I.N.I. International Neuropsychiatric Interview > 0. Based on these diagnostic criteria, the data set consists of 97 healthy control and 90 MDD subjects, of which 21 suffer melancholic depression and 58 from anxious depression. One healthy control subject was not evaluated for anxiety, was therefore not included in the analysis of healthy control vs. anxious depressed out of caution. For detailed statistics of the samples used in our study, readers may refer to Table 1.
This study was approved by Institutional Review Board (IRB)/Independent Ethics Committee (IEC) including the protocol and written informed consent form at each trial site. Written consent was obtained from participants prior to enrollment in the study.

Metabolomics sample
A 20 ml sample of plasma was collected into EDTAcontaining tubes from 99 depressed subjects and 100 healthy controls (gender and age matched) collected from four recruitment sites at BRC, Australia. Samples were stored at BRC at -20°C prior to shipment after which they were shipped on dry ice and stored at -80°C. Plasma samples were profiled by Metanomics Health GmbH. Plasma samples were extracted by a proprietary method and separated into four fractions (lipid and polar fractions) prepared for Gas chromatography-mass spectrometry (GC-MS) and Liquid chromatography-mass spectrometry (LC-MS/MS). For GC-MS analytics the samples were sequentially derivatized before measurement. In LC-MS/MS analysis a metanomics proprietary technology was applied which allows target and high sensitivity MRM (Multiple Reaction Monitoring) profiling in parallel to full screen analyses.
Data were acquired on 272 peaks: 160 peaks mapped to known metabolites and the remaining 112 unknown. Data were centered to the median of healthy control samples, and log-transformed to assure normal distribution of the data.

Correction of storage time effects
The concentration of many metabolites is known to change as a function of storage time. We removed 44 metabolites reported by Metanomics Health as having sensitivity to storage time in the same direction as the change observed in our data set, leaving us with 228 metabolites. Those not deemed sensitive to storage may still have residual storage time effects. To assess this, we calculate the p-values of bivariate correlations between metabolites and the storage time, and assess their deviation from the null distribution (that expected under the null hypothesis of no metabolite associated with the storage time) in a Quantile-Quantile (QQ) plot (Additional file 1: Figure S1 (left)). The calculated p-values (-log transformed) for metabolites are sorted in descending order and plotted against the values sampled from the expected uniform distribution of p-values. The strong deviations from the straight line suggest that either the null distribution is incorrect or that there exists a true association. In our study, we observe that the linear relationship with the storage time at -20°continues up to 200 days, while the effects of the storage time are skewed after day 200. We thus remove samples stored for more than or equal to 200 days.
To control for this effect, we correct the metabolite data by taking the residuals of linear regression models on storage time at -20° [27]. We build the regression model on healthy controls only, and then apply the model to all subjects in order to avoid the removal of some disease-related effects as suggesied in [28].
Specifically, let X be one column of the metabolite feature vector (one metabolite) to be corrected, and T be the column vector of the storage time at -20°. The lengths of X and T equal the sample size. Let X H and T H denote the feature and storage time vectors of healthy controls respectively. Then, we correct the metabolite features as follows.
1) We build the following linear regression model on the 97 healthy control samples where β 1 and β 0 stand for the effect of storage time T on X, and the bias, respectively. We can estimate the parameters as follows: where S is a two-column matrix with the first column being a vector of ones and the second column being T.
Once β 1 is determined, the corrected metabolite feature X c will be 2) We apply the above storage time correction on all subjects, and repeat this procedure for all metabolite features.
After correction, the strong deviation is removed from the QQ plot (Additional file 1: Figure S1 (right)) of the p-values of bivariate correlation between metabolite features and storage time, which indicates the effectiveness of the correction approach.

Imputation of missing values
Among the 187 samples used in our study, 1.10 % of the feature values are missing. We impute the missing values by several different approaches: halfMin: Impute the missing values by half of the minimum value in the corresponding feature. The assumption behind this method is that most of the missing values are too small to be detected, and therefore a simple approach is to replace the missing entries with reasonably small values. For methods such as GC/MS and LC/MS where nonlinear maps must be aligned to match peaks across samples, it may be a poor assumption that a missing value corresponds to a value below the limit of quantification, because in some instances a missing value may be the result of a misaligned, though possibly large, peak which does not get counted. kNN3: Impute the missing values by the k-nearest neighbor method (kNN). kNN imputes a missing value with a weighted average of the top k nearest-neighbor columns (k = 3 was used here). The weights used in kNN are inversely proportional to the distances from the neighbor columns. EM: Impute the missing values by the expectationmaximization (EM) method [29]. Under the assumption that the data matrix is Gaussian distributed, EM algorithm imputes the missing values with conditional expectation values by iteratively estimating the mean and covariance matrix from incomplete data and maximizing the likelihood of the available data. SVD: Impute the missing values by the Singular Value Decomposition (SVD) method. The SVD method, assuming the data matrix is low-rank, imputes the missing values by iteratively updating the data matrix with low-rank approximations.
In our study, all the input data matrices are normalized with zero mean and unit standard deviation before feature selection or classification. The distributions of original and imputed values of four metabolite features (Glyoxylate ratio, Caffeine ratio, Elaidicacid ratio and Indole 3 propionic acid ratio) are shown in Additional file 1: Figure S2. The distribution of the values imputed by kNN3, EM and SVD are very similar to that of original data while the halfMin method yields an imputed data with more small values as it assumes that the missing values are too weak to be observed. For our primary results reported we use kNN3 and contrast with halfMin to compare the effect on classifier performance.

Cluster representation
Recent studies on statistical learning show that advanced feature learning algorithms like Lasso may fail to select important but highly correlated features simultaneously, and show that the clustered Lasso may lead to improved prediction and feature selection [30]. In clustered Lasso, we first apply clustering on the features to identify a set of feature groups. Then, we construct a reduced dataset consisting of cluster-representatives, which refer to the averages of the features from the same cluster. In our experiments, we first cluster the features into 100 groups by two clustering algorithms, K-means and hierarchical clustering, and then generate a 100-dimensional dataset with each feature being the centroid of each cluster. The distance used in K-means was Euclidean distance while the inter-cluster distance used in hierarchical clustering was the maximum correlation between points in two different clusters. For each cluster, the cluster centroid is calculated, and this is the new feature which is then used for classification.
We show the QQ plots of the p-values in the twosample t-test are improved through feature clustering (Additional file 1: Figure S3). Although all the p-values obtained on raw features seem to be non-significant except for the minimum one, there exist strong signals for the clustering representatives (including K-means and Hierarchical Clustering). This is because QQ plot assumes that the p-values are independent to each other while the raw metabolite features are correlated to each other and the clustering on features can effectively mitigate the correlation.
An ensemble feature selection framework (EFSF) Figure 2c illustrates the feature selection + ensemble framework used in our study and we name the process the Ensemble Feature Selection Framework (EFSF) Ensemble learning is a general and powerful framework in machine learning. The underlying philosophy of ensemble learning is to build a learner by combining a collection of base learners [31]. Ensemble learning can be divided into two tasks: 1) building a set of base learners from the training data; 2) combining the base learners to produce the predictor. The majority voting scheme in our study is one type of the ensemble learning methods. The base learners (e.g., SVM, Random Forest) are first trained from each undersample, and their predictions on testing data are then combined to form the final prediction which decides the output that has the majority (i.e., more than half the votes).

Undersampling
In our study, the ratio between the number of melancholic depressive patients and the number of the healthy controls is around 1:5, which is shown in Table 1. Traditional machine learning methods are less effective for such severely imbalanced data set as the classifier trained in such case will be biased towards the majority class. We adopt a very effective and commonly used approach, called "undersampling", to deal with the imbalance problem [32][33][34]. In our case, we have a relatively large number of negative samples (e.g., healthy controls) and a relatively small number of positive samples (e.g., melancholic depressive patients). During the training stage, we randomly select a subset of negative samples with size equal to the total number of positive samples, and build a classifier on the combination of the positive samples and the undersampled negative samples. For instance, if there are 90 negative samples and 16 positive samples available in the training set, then 16 out of 90 negative samples will be randomly selected, which is combined with the 16 positive samples to form a training set to build a classifier. To reduce the variability of random undersampling and further enhance the performance, we repeat the undersampling procedure 30 times in the experiment and thus build 30 classifiers during the training process. Finally, we combine all the k classifiers via majority vote as the final prediction.

Feature selection
To identify the most predictive features, we apply feature selection. Since we adopt the undersampling strategy in the training process, we do feature selection on each undersampled data set and combine feature selection results to generate a final ranking of features. We employ both univariate and multivariate feature selection methods in this study including T-test [35], Fisher's Score [36], Gini Index [37] and Stability Selection [38].

Classification
Two different widely used classifiers were tested in this study: Support vector machines (SVM) and Random Forest (RF). SVMs were first introduced in 1992 [39]. In the task of binary classification, SVM separates the two classes of data points by determining a boundary with a hyperplane which maximizes the margin of the boundary. The margin is defined as the width that a boundary could be increased by before hitting a data point. SVMs can easily build non-linear classifiers by adopting the kernel tricks [39] to implicitly map the input data into high-dimensional non-linear feature space where the data are more easily separable. In RF, the learning approach models a predictor that averages a collection of de-correlated decision / regression trees [31,40]. In random forest, the algorithm makes a large set of bootstrap samples and builds decision / regression trees on them by selecting the best split point among a random subset of features. The final model of the random forest is the ensemble (average) of the trees grown in the bootstrap samples. Random forest takes advantage of the bootstrap aggregation to effectively reduce the high-variance made by the trees and shown success in a wide range of applications [31,40].
In our experiments, we report the classification performances obtained from 10-fold cross validation. We randomly divide the data into 10 sets of equal size, and, holding out one set for testing, use the remaining 9 sets for training. The EFSF is applied on the training data and the learned models are used for prediction. Each set is used for testing once and thus the training and testing procedure is repeated 10 times.

Classification performance measures
In our experiment, we treat melancholic depressed patients as positive samples and healthy controls as negative samples. Because the dataset is highly imbalanced, the commonly used performance measure, i.e., classification accuracy, is not sufficient. In addition to accuracy, we also reported both sensitivity and specificity which measure the proportions of positive and negative samples classified correctly. Specifically, In our experiment, the classification performance measures (i.e., accuracy, sensitivity and specificity) are obtained from 10-fold cross-validation, described above.

Permutation testing
In order to demonstrate the strength of the signals discovered in our study, we propose to use the framework of permutation testing [41,42] to validate the learning results. In the permutation test [41,42], sample labels (i.e. Melancholic Depressed, Healthy Control) are randomly permuted and the trained classifiers are tested on the permuted data set. The permutation-based p-value is defined as the fraction of randomized samples where the classifier performs better in the original data than in the permuted data, and is computed as: whereD is a set of K randomized (permuted) data sets D ' of the original data D, f is the trained classifier, and e represents the error function.

Network visualization
Metabolites from the top 15 ranked clusters in the Kmeans clustering were analyzed using IPA (Ingenuity® Systems, www.ingenuity.com) for visualization and metabolite annotation. Metabolites were mapped to KEGG identifiers using the Human Metabolome Database (version 3.5, http://www.hmdb.ca/); KEGG ID's were used as input in IPA. In some cases, the identity of metabolites was too specific for mapping to the KEGG database.
Where possible, metabolites were assigned a KEGG ID, which encompasses a class of molecules, e.g. triacylglycerides. A union of 76 metabolites from the 4 feature selection methods was used as input. Of the 76 metabolites, 56 were mapped to KEGG.

Results and discussion
Classification performance on MDD subtypes We first compared the performance of the Random Forest classifier using kNN3 imputation using individual metabolites (228 features) on three different classification tasks illustrated in Fig. 1 (Fig. 1b), suggesting that Melancholic Depressed patients may be more homogenous at the biological level (at least the blood metabolome) and easer to predict using a metabolomics biosignature. That Anxious Depressed subjects were not predicted with as high a degree of accuracy could mean that either there is less of a signal in the blood metabolome associated with symptoms of anxious depression, or alternatively that within that designation, there is still a considerable degree of biological heterogeneity. For the remainder of this work we focus on building a robust classifier for melancholic depression, and to describe the biology underlying the biosignature.

A robust classifier for melancholic depression
We next focused on optimizing the classification of the Melancholic Depressed subjects from Healthy Controls. Classifiers were built using either individual metabolites (228 features), or deriving features by first clustering metabolites using K-means or hierarchical clustering into 100 cluster. The 100 cluster centroids were used as features in the predictive models (Fig. 2b). The Ensemble Feature Selection Framework (EFSF) was run testing four imputation methods (halfMin, kNN3, EM and SVD, described in the Methods), and four feature selection methods (Fisher's Score, Gini Index, T-Test, and Stability Selection as described in the Methods). Tables 2 and 3 show the classification results by Random Forest and SVM, respectively, across all imputation, feature selection and classification methods. We observed that Random Forest achieved higher classification performance than SVM in most cases.
Imputing missing values with half of the minimum feature value achieved better performance than other methods on individual metabolite features, while Knearest neighbor imputation outperforms others on cluster-representatives (both K-means and hierarchical clustering). The performance obtained by the EM imputation is slightly worse. However, overall all imputation methods achieve fairly comparable classification performance, a robustness that may be conferred by the small amount of missingness in the data set.
Stability selection, which penalizes highly correlated features that are less likely to be selected together in a model, performed consistently better on the individual features (not clustered). In contrast, Gini Index performed better when learning was performed on clustercentroids.
Classification performance (especially sensitivity) overall was higher on K-means-clustered features than when using individual metabolite features; this indicates that highly correlated features with high predictive power can be effectively grouped by clustering approaches.
We also compare the EFSF framework with the standard undersampling method using exactly the same training-testing data splitting. The result of the standard undersampling method depends on which majority samples are selected. To get a good estimate of the performance of the standard undersampling method, we simply use the k classifiers from the k undersampled (balanced) data constructed in the EFSF and report the average performance of these k classifiers. The results are shown in Additional file 1: Tables S1 and S2 for Random Forest and Support Vector Machines, respectively. We observe that, the highest performance (accuracy, sensitivity, and specificity) achieved is around 73 %, which is significantly lower than that obtained by the EFSF.
To further validate strength of the signals discovered in the study, we permutations tests (100 iterations, randomly permuting sample labels) and find that, for all cases, the classifier consistently performs better in the original data than in the permuted data and the permutation-based p-value is 1/101, which indicates that the performances achieved in the original study is significantly high with significant level α = 0.01 (Fig. 3). This suggests our classification performance is not the simple result of over-fitting to a small number of samples.

The blood metabolomic biosignature of melancholic depression
We selected K-means clustering, kNN3 imputation with Random Forest classification as the combination method with highest overall classification performance on which to perform pathway analysis. Each feature selection method resulted in similar classification accuracy (range of 78.69-79.74 %), although cluster representatives selected by each method did not overlap completely. The number of cluster representatives required for optimal performance also varied; 15, 30, 36, and 18 for Gini, Stability, Fisher, and t-test, respectively. Features comprising the top 15 clusters ranked by score were selected from each method for evaluation of the biosignature. The union metabolites from the 4 methods was used, as each feature selection method resulted in similar classification accuracy and was, therefore, hypothesized to contribute biological information of interest. This yielded 76 metabolites, 48 of which were selected by at least 3 of the 4 methods. Stability Selection exhibited the least overlap with other methods, identifying 19 unique metabolites.
Fifty-six of the 76 metabolites were mapped to KEGG identifiers and analyzed. Table 3 lists the unique metabolites with their ontology class and the cluster ranking by each feature selection method. Top ranked features selected by other combinations of the ensemble framework parameters are included in Additional file 1: Tables S3-S7. c Illustration of the ensemble learning framework. Given the imbalanced training data, we randomly undersample the training data k times, and then we perform feature selection and classification on each undersampled dataset. Finally, we combine all k classifiers to make the final prediction, and report out top cluster-features  Many of the clusters were comprised of metabolites falling within the same ontology class (Table 4), suggesting that expression of classes of metabolites often covary with each other. As expected, performing classification with cluster representatives retained metabolites which could be biologically relevant, but which would have been excluded as redundant information using other classification methods.
Overall, the majority of the selected metabolites were increased in melancholic depressed vs. healthy control subjects (48 of 56 metabolites). Some metabolite classes contribute greatly to the classification of melancholic depressed subjects from healthy controls (Fig. 4). Eleven of the 56 metabolites analyzed were amino acids (turquoise outlined metabolites), and several others (pseudouridine, phosphotyrosine, urea, fumaric acid, and succinic acid) are products of amino acid degradation or related. Fourteen of the metabolites are related to lipids (grey outlined metabolites) and most are increased in melancholic depressed subjects. While these changes suggest differences related to overall metabolism, there was no obvious mechanism revealed by canonical pathway analysis for these classes of metabolites.
Some of the metabolites suggested functional pathways of interest. Several are involved in catecholamine synthesis and degradation (outlined in blue), which have been studied for the treatment of depression. Three metabolites related to stress hormone signaling (outlined in red) were also identified as increased in melancholic depression vs. healthy controls. There are also 2 immunerelated molecules (outlined in purple) that are decreased in melancholic depressed subjects.

Discussion
Several studies have sought to understand symptomatic differences between MDD subtypes with inconsistent conclusions. A recent NESDA study [23] identified 3 main symptomatic groups which seemed to approximate melancholic, atypical, and undifferentiated/mixed subtypes. Another meta-analysis concluded that there is too much heterogeneity in studies and not enough detailed symptomatic information available to classify subtypes of MDD [43]. A third group found that response to antidepressants could not be predicted based on depression subtypes [5]. The variability of these studies suggests that symptomatic classification of MDD may not always reflect a shared etiology in depression subtypes. Several studies, however, point to factors such as severity of MDD, heritability, chronicity, and traumatic childhood events as being characteristic and predictive of melancholic depression [21,23]. The fact that melancholia has a strong heritable component suggests that there is an underlying biological dysfunction which could be represented by molecular features and which could also be used to classify melancholic depressed subjects from other MDD subtypes or healthy controls. In our study, we have identified blood metabolite markers which are able to distinguish melancholic subjects from healthy controls with 80 % accuracy.
In our empirical study, the ensemble feature selection framework has shown great effectiveness in both feature selection and classification on imbalanced metabolite data. We thus expect it to be a promising framework capable of achieving robust classification results for the analysis of other biomedical data where the sample distribution is severely imbalanced. Based on the EFSF framework, we identified discriminative metabolite features related to melancholic depression, which are biologically relevant to the disease state.
We found that different methods of data imputation and feature selection had only minor effects on the accuracy of classification, suggesting that there are robust metabolite changes that can classify Melancholic Depressed subjects from Healthy Controls. Interestingly, although the overall accuracy of classification was similar, Stability feature selection identified several different metabolites than other methods. Different from other feature selection methods, stability selection identifies features which are stable under data perturbation (via subsampling/bootstrapping). Because all of the methods resulted in similar accuracy, we hypothesized that all of the metabolites were important contributors to melancholia and applied pathway analysis to understand biological functions that can contribute to the disease.
Despite characteristics of slowed motor movement, melancholic depression is described as a physiological state of hyperarousal, which includes a chronic activation of the hypothalamic-pituitary-adrenal axis. This stress pathway has been studied extensively in melancholia and in animal models of chronic stress (reviewed in [21,44]). Elevated levels of the glucocorticoid, cortisol (also called hydrocortisone), is an important signaling molecule in this pathway and is increased in numerous studies of melancholia. Although cortisol changes are present in multiple subtypes of MDD, a recent metaanalysis of 354 studies found that the effect size for cortisol was larger when restricted to just melancholic depressed subjects [45]. Consistent with other studies, we identified an increase in cortisol in melancholia vs healthy controls. However, because we took a metabolomic approach, we also identified increases in other metabolites in the hormone biosynthesis pathway (androstenedione and corticosterone). Together these data might suggest a dysregulation promoting steroidogenesis in the upstream pathway. Glucocorticoids are also important anti-inflammatory molecules. There is Table 4 Unique metabolites included in pathway analysis. Metabolites from the kNN3 imputed, K-means clustering, were analyzed using IPA. Standardized KEGG nomenclature is included together with metabolite class and ranking of cluster representatives in the four feature selection methods. Metabolites were selected for pathway analysis if they were members of a cluster that was among the top 15 cluster-centroids selected by GINI, Stability, Fisher, or T-Test. Of 76 selected metabolites, 48 were selected by 3 of the 4 methods. 19   renewed interest in the involvement of inflammation in the development of MDD, and it has been hypothesized that there is immune repression in melancholia mediated by chronic cortisol levels and activated inflammation in atypical depression [21,46]. Our results are consistent with immune repression, evidenced by lower levels of histamine and arachidonic acid in melancholic depressed compared to healthy control subjects. We also saw increases in catecholamine pathways, which are inherently linked to glucocorticoids through a complex feedback system. Lamers et al. (reviewed in [21]) have postulated that melancholic depression is a state of chronic stress, which activates corticotropin releasing hormone (CRH), cortisol, and norepinephrine (NE) pathways in the absence of inhibitory feedback. Several metabolites in this pathway were removed from our dataset due to instability over long storage times, including NE itself. However, NE's precursor, dopamine, and metabolites of NE (L-metanephrine and Lnormetanephrine) were increased in melancholic depressed subjects, which could suggest that NE may also be elevated. Another member of the catecholamine pathway is serotonin, which is decreased in our dataset. An important class of antidepressants, selective serotonin reuptake inhibitors (SSRI), is aimed at increasing extracellular levels of serotonin to treat depression. A Fig. 4 Network analysis of metabolites. IPA network analysis was used to view connections between selected metabolites. Several classes and biological functions were observed and are highlighted in the diagram. Red-filled metabolites are increased comparing melancholic depressed subjects to healthy controls, while green-filled metabolites are decreased. Higher intensity color indicated larger changes. Increases range from 1.94 to 1.07 fold change, while decreases range from -1.71 to -1.02 decrease in plasma serotonin, as seen in these melancholic subjects, would be hypothesized to contribute to a depressive state.
Heart disease and type II diabetes (T2DM) have been cited as comorbidities of depression, which increases the mortality rate in depressed subjects independent of suicide. One shared mechanism for heart disease and T2DM is dyslipidemia characterized by increased triglycerides and fatty acids. Metabolic factors have been hypothesized to be associated with atypical depression, because of an increase in appetite and body mass index (BMI) [23]. The POWER study [4], however, found significant increases in triglycerides, ACTH, and leptin in a melancholic, but not atypical depression, cohort. Conversely, at least one study found no differences in triglycerides between melancholic and healthy subjects [46]. We found that both triglycerides and fatty acids were increased in melancholic depressed subjects compared to healthy controls, even though there was no difference in BMI between the groups (data not shown). Clusters containing triglycerides were consistently ranked highly by 3 of the 4 feature selection methods, indicating that they were important in classification of melancholia. Although not measured in this study, inflammatory factors such as cytokines may also play a role in development of insulin resistance in these subjects.
Two modified triglycerides (triglyerceride hydroperoxides) are particularly interesting as they were highly ranked in each of the feature selection methods. There is little literature about lipid hydroperoxides, although a recent study correlated an increase in lipid hydroperoxides and an inflammatory marker, c-reactive protein (CRP), in depressed smokers [47]. Lipid hydroperoxides are hypothesized to occur physiologically in plasma under conditions of oxidative stress and may also play a role in the development of heart disease [48]. In the NESDA dataset, the melancholic group contained a statistically significant increase in smokers over other types of depression [23]. We were unable to explore this correlation as our study did not record smoking status.
Amino acids are another class of metabolites that is consistently increased in this study of melancholic depression. While a mechanism for elevated amino acids cannot be determined from this dataset, there is evidence that this class of metabolites can differentiate MDD from healthy subjects. Branched chain amino acids were decreased following treatment with the selective serotonin reuptake inhibitor antidepressant sertraline [49]. Another recent study concludes that a combination of tryptophan, cysteine, and glutamine are capable of differentiating MDD from controls, although their results, unlike ours, identify a decrease in these plasma amino acids [50]. The inconsistency in directionality of amino acid changes potentially arises from the difference in study populations, where Xu et al. [50] examined MDD, and our results are determined from the melancholic subtype, which comprises only~20 % of MDD patients in our study.

Conclusions
This paper explores the metabolic biomarkers for melancholic depression, a major subtype of MDD. A comprehensive multivariate study on the identification of metabolic biomarkers that are strongly related to melancholic depression is of great importance and necessity for understanding the illness and the development of pharmaceuticals and therapies. To address the problems of confounding effects, incomplete data, feature correlation, and imbalanced sample distributions, we explored different data correction, imputation, and feature grouping methods with an ensemble feature selection framework for identifying biomarkers and building prediction model. Our extensive experiments on the metabolite data show that strong signals exist in metabolite data that can differentiate melancholic depressive patients from healthy controls. Our pathway analysis on differentiating metabolites elucidates underlying molecular mechanisms that could contribute to melancholic depression. We expect that the proposed computational system can be adapted to analyze other biomedical data with similar characteristics, which are common in many biomedical applications.

Additional file
Additional file 1: Figure S1. Effect of storage time correction. Figure  S2. The distributions of original and imputed metabolite features: Glyoxylate ratio, Caffeine ratio, Elaidicacid ratio and Indole 3 propionic acid ratio. Figure S3. QQ plots of the p-values of the two-sample t-tests on raw features, k-means and hierarchical clustering representatives. Table S1. Classification performance obtained by Random Forest on metabolite data using the standard undersampling technique. Table S2. Classification performance obtained by Support Vector Machines on metabolite data using the standard undersampling technique. Abbreviations BMI, body mass index; BRC, Brain Resource Company; CRH, corticotropin releasing hormone; EFSF, ensemble feature selection framework; EM, expectation-maximization; halfMin, half of the minimum value; HC, healthy control; JRD, Janssen Research & Development, L.L.C; kNN, k nearest neighbor method; MDD, major depressive disorder; NE, norepinephrine; PMD, psychomotor disturbance; SVD, singular value decomposition; T2DM, type II diabetes