Optimal Algorithm for Metabolomics Classification and Feature Selection varies by Dataset

Metabolomics, the systematic identification and quantification of all metabolites in a biological system, is increasingly applied towards identification of biomarkers for disease diagnosis, prognosis and risk prediction. Applications of metabolomics extend across the health spectrum including Alzheimer's, cancer, diabetes, and trauma. Despite the continued interest in metabolomics there are numerous techniques for analyzing metabolomics datasets with the intent to classify group membership (e.g. Control or Treated). These include Partial Least Squares Discriminant Analysis, Support Vector Machines, Random Forest, Regularized Generalized Linear Models, and Prediction Analysis for Microarrays. Each classification algorithm is dependent upon different assumptions and can potentially lead to alternate conclusions. This project seeks to conduct an in depth comparison of algorithm performance on both simulated and real datasets to determine which algorithms perform best given alternate dataset structures. Three simulated datasets were generated to validate algorithm performance and mimic 'real' metabolomics data: (Han et al., 2011) independent null dataset (no correlation, no discriminatory variables), (Davis, Schiller, Eurich, & Sawyer, 2012) correlated null (no discriminating variables), (Guan et al., 2009) correlated discriminatory. This comparison is also applied to 3 open-access datasets including two Nuclear Magnetic Resonance (NMR) and one Mass Spectrometry (MS) dataset. Performance was evaluated based on the Robustness-Performance-Trade-off (RPT) incorporating a balance between model classification accuracy and feature selection stability. We also provide a free, open-source R Bioconductor package (OmicsMarkeR) that conducts the analyses herein. The proposed work provides an important advancement in metabolomics analysis and helps alleviate the confusion of potentially paradoxical analyses thereby leading to improved exploration of disease states and identification of clinically important biomarkers.


Introduction
Metabolomics, similar to the other two common 'omics' approaches (i.e.transcriptomics and proteomics), is defined as the systematic identification and quantification of all metabolites in a biological system.Such data is commonly acquired via Nuclear Magnetic Resonance spectroscopy (NMR) or Mass Spectrometry (MS).Metabolomics has been increasingly applied towards identification of biomarkers for disease diagnosis, prognosis and risk prediction.Applications extend across the health spectrum including Alzheimer's (Han et al., 2011), cancer (Davis et al., 2012;Guan et al., 2009;Nishiumi et al., 2010), diabetes (Bain et al., 2009), and trauma (Determan et al. 2014).
Following the initial pre-processing (e.g.peak picking, deconvolution, integration, etc.), a metabolomics dataset must ultimately be analyzed to typically classify two or more classes/conditions in addition to identifying the most important metabolites for the discrimination (e.g.biomarker studies).The availability and use of multivariate approaches is rapidly becoming critical with decreased cost and increased access to high-throughput metabolomics platforms including NMR and MS resulting in "large p, small n" problems (i.e.many more variables than samples).The common univariate tests become grossly underpowered to assess every feature and require a secondary model if classification is desired.The restrictive assumptions of univariate tests (e.g.normality) are typically avoided with more sophisticated multivariate, machine learning algorithms.
But despite the continued interest in metabolomics there is no standard statistical approach resulting in numerous techniques applied inconsistently across experiments.Common methods include Partial Least Squares Discriminant Analysis (PLSDA), Lasso and Elastic-Net Regularized Generalized Linear Models (GLMNET), Support Vector Machines (SVM), Random Forests (RF), Gradient Boosting Machines (GBM), and Prediction Analysis for Microarrays (PAM).Although each method is an effective classification algorithm, previous comparisons of algorithms in gene expression experiments report that different algorithms provide improved accuracy for different datasets (JW Lee, JB Lee, Park, Song, 2005).Limited algorithm comparisons in metabolomics studies (i.e.comparing two or three methods) often measure performance solely on accuracy and neglect the stability of features selected (i.e.how consistently the same features are identified).Even though an analysis reports high accuracy, repeating the biomarker discovery procedures can result in different feature subsets even within the same datasets (Ein-Dor, Zuk, & Domany, 2006;Michiels, Koscielny, & Hill, 2005;Zucknick, Richardson, & Stronach, 2008).Feature selection can also introduce optimistic bias into statistical inference because the signal-noise ratio of the data set is increased by the feature selection procedure.An extreme example is that all features are irrelevant to a response, but the selected features will still appear fairly predictive to the response which is, however, completely by chance (Ambroise & McLachlan, 2002).Therefore, to determine which algorithms perform optimally, both feature selection stability and overall classification accuracy must be evaluated together.
In this work, we evaluate the six aforementioned classification algorithms performance and stability on both in silico and experimentally acquired datasets.Metabolomics datasets are inherently multivariate with both independent and multicollinear variables in addition to possessing a mix of Gaussian and non-Gaussian distributions.To evaluate algorithm performance on such datasets it is necessary to generate standardized datasets that mimic true metabolomics data and possess known results as a benchmark.Furthermore, an application to previously acquired datasets from multiple platforms with previous results is provided herein.
The goal of this experiment has been to demonstrate variability in algorithm performance across multiple datasets of different characteristics (e.g.sample size, number variables, etc.) and various methodologies (e.g.ensemble).To our knowledge, this is also the first integration of feature stability with model performance to metabolomics datasets.Furthermore, the ability to conduct multiple analyses utilizing accepted methods resulting in similar conclusions adds further support to any conclusions in a field where reproducibility is often a major concern.

Datasets
Despite the growth of metabolomics, there is no commonly accepted gold standard dataset for algorithm evaluation.This necessitates the production of simulated datasets that accurately mimic typical metabolomics datasets from both NMR and MS.This requires the perturbation of normality in multiple variables and inclusion of multicollinearity as is typical of metabolite distributions and relationships.It is also necessary to determine the performance of algorithms when examining the null condition wherein there is no difference between conditions.Therefore, three simulated datasets were generated (null independent, null correlated, and correlated discriminatory) to analyze algorithm performance that may also be used by others for further performance evaluations with respect to mimicking biological datasets as opposed to datasets designed to evaluate specific algorithm performances.This was repeated twice, once at the NMR scale and once at the MS scale, as the number of resolvable features between the two techniques can be an order of magnitude; NMR typically can resolve 50-75 metabolites whereas MS can resolve 100's to 1000's of metabolites (Wishart 2010).Although exceedingly large datasets are possible with in silico data the sample sizes herein were selected to more accurately reflect empirical datasets given the limits from costs and/or sample availability.Low and high sample sizes were set at 25 and 50 samples per group respectively.Although these are still high for many applications, this allows the use of leave k-fold out cross-validation.For very small datasets one may use leave-one-out cross-validation or the option to forgo validation, if appropriate parameters are known.

Simulated Datasets
Each simulated dataset consists of N samples and p variables representing the individual samples and metabolites of an experiment.In addition, for the purposes of classification, groups are assigned to represent classes within a dataset (e.g.Control, Cancer). 1) Null: A null dataset was generated to provide an absolute base where no classification should be found.
Simulated data were generated with the create.random.matrixfunction following previously established methods (Wongravee et al., 2009) with the following noted modifications.The initial datasets was generated with random numbers from a normal distribution, each dataset consisting of N samples (nsamp, 25/50 per group, low and high) and p variables (nvar, NMR = 50, MS = 1000).The normal distribution was also perturbed by adding a second matrix containing uniform random numbers between -0.2 and 0.2 (perturb = 0.2).The samples were then assigned to groups of equal numbers.
2) Null Correlated: To mimic 'real' metabolomics datasets, correlations were induced for the null correlated datasets with the create.corr.matrixfunction.This provided a dataset that could be used to evaluate the effect of correlations on classification and feature selection performance.Blocks of variables of size b (min.block.size= 2, max.block.size= 5) were randomly assigned and had values replaced with correlated values derived from the first variable of the specific block.We elected to incorporate blocks of size 1 (min.block.size= 1) as the smaller metabolomics datasets in NMR more likely possess independent variables.This induced correlation was also perturbed to more accurately represent real data.Derived correlation coefficients were compared to real metabolomics datasets to validate the method (data not shown).
3) Discriminatory: To facilitate discriminatory analysis, D variables (NMR = 10, MS = 20) were randomly induced to be discriminatory with the create.discr.matrixfunction.A discriminatory index (l) was selected and for each variable D whereby a random number between -l and l was added to one group and subtracted from the other.
The MS explores potential biomarkers of Hepatocellular Carcinoma in serum samples (Xiao et al. 2012) and was accessed from the open source metabolomics data repository Metabolights (Haug et al., 2013) accession number MTBLS19.

Classifier and Feature Selection Algorithms
Below we describe six classification algorithms utilized in the metabolomics literature, with built-in or added feature selection capability.We briefly describe how we used each for classification and feature selection.With respect to feature selection, investigators also may or may not have an approximate idea of how many features (i.e.metabolites) they expect to be discriminating.As such two methods, defined herein as 'subset' and 'model derived', are applied whereby a specific number of variables are specified a priori or the specific model is allowed to return an internally determined number of variables respectively.

Partial Least Squares Discriminant Analysis (PLSDA)
Partial Least Squares Discriminant Analysis is a dimension reduction technique analogous to principal component analysis.The algorithm focuses on maximizing the variance of the dependent variables explained by the independent variables (Wold, 1975).It is robust to multicollinearity, missing data, and skewed distributions (Cassel, Hackl, & Westlund, 1999).These models were tuned on the number of components to be retained.Feature selection was accomplished by ranking features on the sum of squares of their loading weights, a technique known as variable importance of projection (VIP) (Wold, Johansson, & Cocchi, 1993).Model derived features were selected as those with a VIP score >= 1.0.PLSDA is commonly used in metabolomics investigations including multiple forms of cancer, cardiac ischemia, parkinson's disease and asthma (Nishiumi et al., 2010;Bodi et al., 2012;Bogdanov et al., 2008;Carraro et al., 2007;Chen et al., 2011;Duarte et al., 2010;Qiu et al., 2010).This technique has also been previously implemented in our lab investigating hemorrhagic shock (Lexcen, Lusczek, Witowski, Mulier, & Beilman, 2012).It is readily available in the R package DiscriMiner (Sanchez, 2012).

Regularized General Linear Model (GLMNET)
Generalized linear models are a more flexible form of linear regression that allows the response variables to have non-parametric distributions.To avoid the risk of overfitting data in multiple linear regression a regularization method can be applied.We have selected to use the elastic-net penalty, which is a compromise between the LASSO and Ridge shrinkage methods and has been shown to outperform LASSO (Zou & Hastie, 2005).In brief, elastic net is a weighted average of the lasso and ridge solutions.The LASSO penalty encourages non-discriminative features to have a coefficient that is exactly zero, thus it performs feature selection; and the Ridge penalty is used to overcome multicollinearity.Having both of these penalties integrated facilitates analysis of data with collinearity and internal feature selection.These models were tuned on the lambda penalty and the elastic-net parameter alpha.Important features are identified as those with non-zero coefficients.These coefficients were also ranked for subset feature selection.Although it is less common than other techniques it has been used in recent metabolomics studies (Rohart et al., 2012, Wahl et al., 2013).GLMNET is readily available in the R package glmnet (Friedman, Hastie, & Tibshirani, 2010).

Random Forest (RF)
Random forest is a machine learning algorithm that uses a combination of tree predictors (i.e.forest) such that each tree is constructed on a random sample of the observations thereby independently ensuring that the distributions are the same for all the trees in the forest.Each tree in the forest provides a 'vote' for the best class.This is constructed on a training subset of the data and tested against the remaining test data known as the 'out-of-bag' (OOB) data.The scaled sum of the votes derived from the trained trees determines the final "score" (Breiman, 2001).Random Forests were tuned on the number of trees and the number of variables randomly sampled at each split.The feature selection is determined by permuting variables in the OOB and observing increases in error.A variable score indicates greater importance to the model.These scores were ranked for subset feature selection and those exceeding a score of 1.0 for model derived results.It is robust to noise and outliers and computationally faster than bagging or boosting.Prior studies have reported error rates comparable if not better than other predictors such as logistic regression, linear discriminant analysis, quadratic discriminant analysis (QDA), K-nearest neighbors (KNN), Support Vector Machines (SVM), classification and regression trees (CART) and Naïve Bayes (Breiman, 2001;Folleco, Khoshgoftaar, Van Hulse, & Bullard, 2008;Svetnik et al., 2003).However, consistency of selected feature rankings has been shown to be problematic for high dimensional problems (Verikas, Gelzinis, & Bacauskiene, 2011).It has been used in several metabolomics studies (Hische et al., 2012;Houtkooper et al., 2011;Patterson et al., 2011) and is readily available in the R package randomForest (Liaw & Wiener, 2002).

Gradient Boosting Machine (GBM)
Gradient boosting is another machine learning technique applied most commonly to decision trees that produces robust and interpretable procedures for both regression and classification (Freidman, 2001).Unlike the bagging approach (e.g.random forest), where trees are constructed independently and are thus assumed to make prediction errors independently, gradient boosting trees are constructed sequentially and each new tree is fitted to compensate for errors committed by previous trees.As with random forest, feature selection is determined by permuting variables in the OOB and observing increases in error resulting in a subsequent variable score.Gradient Boosting models were tuned on the number of trees, the interaction depth, and learning rate (i.e.shrinkage, step-size reduction).Feature selection was accomplished via ranking scores and those exceeding 1.0 for model derived results.Boosting has become known as one of the most powerful learning ideas in the last twenty years (Hastie, Tibshirani, & Friedman, 2009) but curiously has never been applied to metabolomics settings.To our knowledge, this is the first application of boosting to analyze metabolomics data.Freidman's gradient boosting machine algorithm is available in the R package gbm (Ridgeway, 2013).

Support Vector Machines (SVM)
Support vector machine is based on the structural risk minimization principle from statistical learning theory (Vapnik, 1998).It can be applied to classification problems with the idea of structural risk minimization to find a hypothesis that has the lowest probability of error.It has been shown to be robust to high dimensionality, noisy data, and outliers.Prior comparisons with PLSDA report improved overall accuracy with less features (Mahadevan, Shah, Marrie, Slupsky, 2008) but feature selection consistency is unknown.This classification algorithm is readily available within the R e1071 package (Meyer, Dimiriadou, Hornik, Weingessel, & Leisch, 2012) where we elected to use the common linear kernel and tuned on the cost parameter.Feature selection was accomplished via recursive feature elimination (RFE) as detailed by Guyon as to our knowledge there is no metric specifically designed for SVM (Guyon, Weston, Barnhill, & Vapnik, 2002).

Prediction Analysis for Microarrays (PAM)
Prediction Analysis for Microarrays is a modified nearest centroid classification method to include centroid shrinkage and contains an embedded feature selection step (Tibshirani, Hastie, Narasimhan, & Chu, 2002).In brief, the average value for each variable is divided by the within-class standard deviation to provide class centroids.These class centroids are then shrunk towards zero by a defined threshold to reduce noise and facilitate variable selection.Then a new sample profile is compared to each of the class centroids.The class whose centroid is closest is the predicted class.These models were tuned on the 'threshold' parameter for the centroid shrinkage.The internal feature selection is accomplished by identifying features with non-zero coefficients which are subsequently ranked for subset selection.This technique has not been used widely in metabolomics investigations; however, as the name implies it has been successfully been used for classification in gene expression experiments (Ray et al., 2007;Sadanandam et al., 2013).This algorithm is readily available in the R package pamr.

Evaluate Stability of Feature Selection Techniques
The high-dimensional datasets of metabolomics often necessitate feature selection techniques to reduce dimensionality to the most important features to facilitate subsequent analysis.Although many approaches rely exclusively on classification accuracy of feature subsets to facilitate biomarker selection this is problematic where several different feature subsets may yield equally optimal results (Saeys, Inza, & Larrañaga, 2007).It is therefore necessary to evaluate the robustness of feature selection techniques applied to metabolomics data to facilitate improved reproducibility and confidence in identified biomarkers.In brief, algorithm robustness were evaluated via instance (bootstrapped data subsets) and function (alternate algorithms) perturbation and evaluated by the Jaccard's Index (Real & Vargas, 1996).Other options include the Dice-Sorensen's Index (Dice, 1945;Sorensen, 1948), Ochiai's Index (Ochiai, 1957), Percent of Overlapping Features (Shi et al., 2005), Kuncheva's Index (Kuncheva, 2007), Spearman Rank Correlation, and Canberra Distance (Jurman et al., 2008).A comparison of these metrics is beyond the scope of this article.
Two common approaches are applied within instance perturbation to evaluate the robustness of feature selection techniques: perturbation at the instance level (i.e.removing or adding samples) or at the feature level (i.e.adding noise).We have selected to evaluate robustness of feature selection algorithms by estimating stability following perturbation at the instance level as the number of samples is the most likely problem facing metabolomics investigations.

Single Feature Selection Stability and Classification Performance
For each feature selection algorithm we estimated stability via instance perturbation with the fs.stability function.
Instance perturbation was conducted via bootstrapping, without replacement, 90% (p = 0.9) of the data 10 times (k = 10) thereby creating a training and testing dataset for each iteration.For each training dataset all 6 feature selection algorithms were run simultaneously to provide an ordered list of selected feature rankings.Each iteration tunes the full model (optimize = TRUE) with a tuning grid of a specified resolution determining how fine the tuning parameters are optimized (resolution = 5).To avoid overfitting, 10-fold cross-validation was utilized (k.fold = 10) wherein 1/10 th of the each training data subset is randomly removed and the model evaluated on this test fold.Results were averaged over all 10 folds to provide the confusion matrix for subsequent performance metrics evaluated on the typical prediction accuracy (metric = "Accuracy") which is the proportion of true results (true positive and true negative) in the sample population.The optimized models were then used to extract feature subsets of a user specified length (f, NMR = 10, MS = 20) or optionally by the model defined cutoff (model.features= FALSE).These feature subsets are compared via the Jaccard index (Equation 1).The overall stability is defined as the average over all pairwise similarity comparisons between each of the feature selection runs.The final model is refit using the extracted feature subset from the individual method and re-optimized using the initial tuning grid generated.Lastly, this trimmed model is used to predict the initial testing dataset generated at the start of the iteration.Accuracy was extracted with the performance.metricsfunction to compare each algorithms performance.This is repeated for the additional 9 times utilizing the previously optimized parameters for the full model generation (optimize.resample= FALSE).

Balance Stability and Classification Performance
In every scientific investigation which sample size is a limitation (i.e.most studies), researchers must balance power and sensitivity.The same principle is applied to balancing feature selection robustness and classifier performance as both are integral to confident biomarker identification.We utilized the robustness-performance trade-off (RPT) to balance feature selection stability and classification performance (Saeys, Abeel, & de Peer, 2008).In brief, the user can specify the parameter β to control the relative importance of feature stability versus classification.The default value of β = 1 which represents equal importance between stability and classification.

Ensemble Feature Selection Stability and Classification Performance
Ensemble feature selection has been shown to improve stability in gene expression studies (Abeel, Helleputte, de Peer, Dupont, & Saeys, 2010;Davis et al. 2006).Therefore it is important to incorporate such analysis into metabolomics analysis for each algorithm.In essence, ensemble approaches use different data subsets and aggregating the results following feature selection.As described in the 'Single Feature Selection Stability and Classification Performance' section, stability was evaluated via instance perturbation with the fs.ensembl.stabilityfunction.For each subsample a second level of instance perturbation generated 40 (bags = 40) further datasets via bootstrap aggregation (aka.Bagging) (Breiman, 1996).For each bag a separate feature ranking was performed.The resulting list of selected feature rankings from each bag were combined via linear aggregation (aggregation.metrics= "CLA") whereby the sum of the individual feature ranks within each bag contribute linearly to the overall final rank.The Jaccard Index was used to measure similarity and averaged over all pairwise comparisons for an overall measure of stability.Function perturbation, the use of multiple feature selection algorithms, was also conducted by the list of methods chosen within fs.stability and fs.ensembl.stability.Lastly, in contrast to non-ensemble approaches, there are no model derived runs because all features must be ranked for aggregation methods.Non-ensemble analysis of random and correlated dataset analyses provided generally expected results.Accuracy exceeded 0.700 for SVM, RF, GLMNET and PAM but stability remained low (>=0.47).While RF achieved the highest accuracy (often in excess of 0.900), it had the lowest stability warranting caution in interpreting results.Notably, accuracy was generally higher with the MS-scale dataset where accuracy exceeded 0.900 for the same four algorithms.However the stability of the feature subsets was also lower.Ensemble analysis of NMR-scale and MS-scale random and correlated datasets again reflected previous analysis with high accuracy levels for SVM, RF, GLMNET and PAM but low stability (Supplementary File -S1).

Discriminatory datasets
Analysis of the NMR-scale discriminatory dataset determined PAM as the optimal model with the highest RPT and PPI% (percent of discriminating features positively identified, Equation 3) (Table 1).GLMNET performed similarly with better accuracy but lower stability and PPI%.The model derived analysis also provided PAM with the highest RPT, however, the low sample size resulted in a conservative trimming of features resulting in many remaining in the model and decreasing the PPI%.The highest PPI% was reported by SVM which also had the very high accuracy; however it is also noted that there is no internal trimming metric for SVM and only the top 10% of features are returned making this a more restricted subset model.This suggests that experiments with lower sample sizes may need to restrict to only a few of the most discriminate features.The MS-scale datasets also reported PAM with the highest RPT and stability but the highest PPI% was reported by PLSDA.Such a situation supports the value of using multiple algorithms to determine consistent results.

Equation 3 Percent Features Positively Identified (PPI%) | ∩ | | |
Ensemble analysis of NMR-scale discriminatory dataset reported SVM with the highest RPT with PLSDA, GLMNET and PAM performing similarly.The MS-scale analysis was less conclusive with mixed performance among SVM, GLMNET and PAM (Supplementary File -S2).

Binary Classification -High Samples
3.1.2.1 Random and correlated datasets (N=100, p=50/1000 (NMR/MS)) Increased sample size had little effect on accuracy and stability of NMR-scale random and correlated datasets but worsened models of MS-scale data (Supplementary File -S1).This should be expected as having more data should increase the likelihood of calculating the 'true' condition.As expected, there was also little effect of high samples on the ensemble analysis of the both random and correlated datasets of MS-scale and NMR-scale sizes.

Discriminatory Datasets
Overall performance greatly improved in the NMR-scale discriminatory dataset with accuracy exceeding 0.9 for three algorithms, stability exceeding 0.8, and PPI% to 80% (Table 1).Performance in the MS-scale dataset also improved with stability and PPI% increasing most notably for GBM and RF respectively (Supplementary File -S2).Both datasets provide a circumstance whereby no algorithm performs best and multiple methods are beneficial.
The results of the ensemble NMR-scale discriminatory dataset improved RF and SVM but worsened GLMNET and PAM (Table 1).Likewise, the ensemble MS-scale discriminatory dataset also improved performance with increased samples most notably in RF stability and PPI% (Supplementary File -S2).However, the best performing algorithms were PLSDA, PAM and GLMNET.Although ensemble aggregation tends to improve model performance it may also have no effect or even worsen model performance.NMR-scale random and correlated datasets generally had low accuracy and stability whereas MS-scale had four algorithms consistently with accuracy >= 0.700 but stability still remained low.Ensemble analysis of random and correlated datasets had little effect on performance of both NMR-scale and MS-scale (Supplementary File -S1).

Discriminatory Datasets
Analysis of the NMR-scale discriminatory dataset determined SVM as the best performing algorithm with the highest RPT whereas PAM and RF had the highest stability and accuracy respectively.The model derived results report GLMNET and PAM among the best; however, as with the binary classification problems the low sample size resulted in untrimmed features wherein nearly all were retained resulting in a decreased PPI%.Curiously, SVM reported the highest PPI% (64%) again suggesting that the lower sample size may restrict to only a few of the most discriminating features (Table 2).The MS-scale dataset had high predictive accuracy with SVM, GLMNET and PAM but very low with PLSDA and GBM.Curiously, PLSDA had the highest PPI% and stability (0.53) suggesting a potential use in comparison to better classifying algorithms for feature selection (Supplementary File -S2).
Ensemble analysis of the NMR-scale discriminatory dataset reported GLMNET with the highest RPT but PLSDA with the highest PPI% (Table 2).This reflects the single run analysis whereby the improved stability of ensemble methods provided improved classification and stability of GLMNET.The MS-scale analysis reported SVM as the optimum algorithm with the highest RPT and accuracy but PAM reported the highest PPI%.NMR-scale random and correlated datasets continued to provide expected results, whereby most models had poor classification.In contrast, MS-scale data provided high accuracy but continued to provide very poor stability.Ensemble analysis of random and correlated datasets again reflected previous analysis with low accuracy and stability for both NMR-scale and MS-scale data (Supplementary File -S1).

Discriminatory Datasets
Analysis of the NMR-scale discriminatory dataset determined GLMNET as the best performing algorithm despite lower accuracy (Table 2).Model derived results determined GLMNET and PAM as the best models however they did not successfully extract any discriminating features resulting in a depleted PPI% whereas SVM reported the highest PPI% at 60%.
Ensemble analysis of the discriminatory dataset did not significantly improve performance for the NMR-scale dataset.This is also reflected in the MS-scale data where only RF and PLSDA stability improved slightly (Supplementary File -S2).

Eisner -Urine analysis of Cachexia via NMR
Non-ensemble analysis determined PAM and GLMNET as the best overall models according the RPT (0.711, 0.705).Important features were extracted with the feature.tablefunction.Adipate, Glucose, 3-Hydroxyisovalerate were identified in all subsamples by both models.PAM also identified creatine and succinate consistently (Table 3).GLMNET identified leucine, quinolinate, and valine as important features (Table 4).Seven of these metabolites were within the top 10 metabolites identified by Eisner et.al. with creatine being the 11 th (Eisner et al., 2011).Furthermore, the other 3 metabolites in the top 10 (myo-inositol, betaine, and N,N-dimethylglycine) were also identified in the top 10 by GLMNET and PAM.Random forest provided the best classification accuracy at the expense of stability.SVM performed similarly well with respect to classification accuracy (Supplementary File -S3), however, stability was also quite low (0.31).
Ensemble methods improved stability of GBM, SVM, and RF; however, it noticeably decreased GLMNET stability from 0.64 to 0.49.Performance decreased slightly for PLSDA but improved for PAM; both proved the best overall models.Overall 8 of the 9 identified metabolites (frequency >= 0.9) by PLSDA and PAM were in the top 10 identified by .For this particular dataset the ensemble approach does not appear necessary as neither model performance nor feature stability was significantly improved.Irrespective, the applied methods provide further validation and support for the classification and metabolites selected.The PAM and GLMNET models performed very well with the Ametaj dataset with RPT values of 0.850 and 0.806 respectively.Although PLSDA did not have a high RPT (0.497), the stability was very high (0.77) and was therefore evaluated for consistency with PAM and GLMNET.Glucose, endotoxin, and methylamine were consistently identified by PAM, GLMNET and PLSDA (Supplementary File -S4).Glucose and endotoxin were expected and methylamine was the first statistically significant metabolite discussed by Ametaj et.al.(Ametaj et al., 2010).In addition, uracil, acetate, fumarate, and lactate were also consistently identified by at least two models.
All of these metabolites are identified by Ametaj et.al.except for lactate which is reported as non-significant but was approaching significance (0.149).This suggests a potential power issue although the authors' comment on the conversion of lactate to proprionate in ruminates is well supported.
Ensemble methods improved stability of PLSDA, GBM, SVM, and RF; however, it slightly decreased GLMNET and PAM stability by 0.02 and 0.07 respectively (Supplementary File -S4).This general improvement is expected with smaller sample sizes as variability often greater.The stability of all models was very high with all >= 0.67 except for RF.Stability of PLSDA was highest; however, classification accuracy was poor (0.438).Accuracy increased slightly with PAM and GLMNET but decreased in SVM to match GLMNET.These three proved to be the best overall models.Although the stability was high, identified features did vary between models creating what we refer to as a 'hierarchy of confidence' whereby the greatest confidence would be placed in the most consistently identified features within and across algorithms.Endotoxin, glucose, and methylamine were once more identified by four models (PAM, SVM, PLSDA and GBM).Alanine was also identified by 3 models (PAM, GLMNET, PLSDA).In addition, acetate, 3-PP, and uracil were also identified by at least two models consistently.All of these metabolites were identified as important by Ametaj et.al.Lastly, although most results were consistent, ferulate was also identified by PAM and SVM which was not discussed previously.It is also apparent that given the small sample size and greater variability within the data, this analysis benefits from ensemble methods.

Xiao -Serum Analysis of Hepatocellular Carcinoma via MS
Within the negative mode comparisons PAM mostly performed better than all other algorithms applied with respect to RPT (Supplementary File -S5).There was only one exception where PAM and GLMNET had almost identical RPT values (0.679, 0.680).However, PAM stability was consistently the highest suggesting it to be the ideal method to explore this particular dataset.Additionally, it should be noted that PAM did not have very high accuracy compared to other methods such as GLMNET and SVM.Random Forest also had very high accuracy but exceptionally low stability making it a good tool for analyses that require high classification accuracy but do not require further feature identification.The higher accuracy makes GLMNET a close second to the PAM approach.An ensemble analysis was performed, however only RF improved stability significantly rendering the ensemble analysis of little help.Identified features were largely consistent with previous analyses (Supplementary File -S6), however, each comparison identified at least an additional 15 metabolites not mentioned in the prior manuscript.
Within the positive mode comparisons GLMNET performed slightly better than PAM in all except the 1 comparison; however, PAM consistently maintained the highest stability demonstrating that even with very high-dimensional datasets no single algorithm dominates.As with the negative mode, an ensemble analysis was performed but only RF improved stability significantly making ensemble aggregation unnecessary.As with the NMR studies the two models provided further support to mutually identified features (Supplementary File -S7).These features included GDCA, Oleoylcarnitine, GCDCA, L-N 2 -(2-Carboxyethyl) arginine, Tetracosahexaenoic acid, Palmitoyl carnintine and Linoelaidyl carnitine which support the author's interpretations of bile and fatty acid metabolism.In addition, 48 metabolites were consistently identified by PAM but were not mentioned in the paper however it is possible they were identified as the authors do not present all identified metabolites.

Discussion
The results presented focused on determining if a single classification algorithm performs better than other commonly used algorithms in the field of metabolomics.Although there is strong support for several algorithms, there are additional papers stating one algorithm outperforms another.This is expected according to the concept of No Free Lunch Theorem (Wolpert & Macready, 1997) which, in essence, states that there is no single model that is appropriate for all problems.Our analyses support this theory and suggest a comparative approach to evaluate algorithm performance as well as to provide additional support via multiple algorithms for future conclusions.As such, the Bioconductor package OmicsMarkeR has been created to facilitate the rapid comparison of algorithm performances on individual datasets.
The null datasets initially appeared to demonstrate abnormally high accuracy and AUROC values when such metrics theoretically should result in ~0.50.However, these final models were re-tuned and built upon the 'best' features identified likely resulting in an inflated model metric.This has provided an example of the 'optimistic bias' that may result from feature selection (Ambroise & McLachlan, 2002).Furthermore, increased dimensionality of mass spectrometry datasets also increases the likelihood of finding discriminating variables by chance.Increased noise may also result in an inflated model metric.It is important to note that despite high model metrics, the low stability and RPT values provide strong evidence that the relationship has been found by chance and no further information can be reliably drawn from the model thereby increasing the value of such metrics in metabolomics investigations.Using the fit.only.modelfunction on the null datasets confirmed the datasets as a whole did not possess any inherent classification whereby Accuracy and AUROC were ~0.50 (data not shown).
The binary and multivariate simulation results demonstrate the variation in optimal models.Depending on the users approach to a dataset (i.e.feature subset or model derived), number of samples relative to features, the number of groups, or application of ensemble methods, a different algorithm may be more appropriate.Within the simulation analyses PLSDA, SVM, PAM, RF, and GLMNET all proved to be optimal algorithms for different situations.Curiously, the reduction in features generally decreased GBM performance but feature selection was relatively consistent with other methods.Random Forest, in most cases, did not prove to be the optimal algorithm but consistently had among the highest accuracy but very low stability; which could be partially mitigated by an ensemble approach.This low stability is expected given the algorithms search encompasses interactions between variables resulting in a larger search space.If a user wished to weigh accuracy more than stability, the RPT value can be easily recalculated with the RPT function by increasing the beta parameter (e.g.beta = 1.5 increases the weight of accuracy by 50%).
The Eisner cachexia NMR dataset provided an example of the value behind using multiple algorithms.In the event that algorithms have similar overall performance, analysis of features selected by both may provide further support for identified features.Additionally, this dataset demonstrated the circumstance whereby ensemble approaches are superfluous.Consistent with published results, the Eisner dataset did reflect the high accuracy of SVM but stability was low.The GLMNET and PAM analyses provided respectable accuracy and high stability.The ability to assess stability of identified features for classification models is valuable if rapid and clinically applicable tests are to be developed.1 is meant to be a general guide; however, the results herein strongly suggest using multiple algorithms and comparing performance on each unique dataset as well as ensemble methods which have been made far more accessible with the OmicsMarkeR package.It is important to emphasize, however, that there are far more considerations to be considered including the strengths and weaknesses of the available technology such as sensitivity, reproducibility, costs, and sample preparation (Robertson, 2005).In addition, preprocessing methods from normalization techniques to NMR preprocessing (e.g.binning) to the diverse software implementations of mass spectrometry processing (Castillo, Gpalacharyulu, Yetukuri, & Orešič, 2011) likely will impact results and further comparisons are encouraged.
In addition, the package contains two Monte-Carlo permutation functions (perm.classand perm.features) for assessing model performance and identified features for further evaluation.Current areas of improvement include the addition of further algorithms, database searching, improving memory efficiency, and easy to access graphics (e.g.scores plots, variable importance plots, etc.).The R package OmicsMarkeR for this analysis is publically accessible from the bioconductor platform (www.bioconductor.org).

Table 1 .
Results from NMR-scale Binary Classification Simulations.RPT -Robustness-Performance Trade-off, PPI% -Percent features positively identified.*SVM doesn't have internal cutoff so defaults to top 10%.

Table 2 .
Results from NMR-scale Multi-class Classification Simulations.RPT -Robustness-Performance Trade-off, PPI% -Percent features positively identified.*SVM doesn't have internal cutoff so defaults to top 10%

Table 3 .
Feature table of PAM analysis consisting of consistency (i.e.number of times the feature identified as important) and frequency (i.e. percentage of iterations feature identified)

Table 4 .
Feature table of GLMNET analysis consisting of consistency (i.e.number of times the feature identified as important) and frequency (i.e. percentage of iterations feature identified) discriminatory features and the datasets scale (i.e.MS or NMR).The diagram in Figure