SECIMTools: a suite of metabolomics data analysis tools

Background Metabolomics has the promise to transform the area of personalized medicine with the rapid development of high throughput technology for untargeted analysis of metabolites. Open access, easy to use, analytic tools that are broadly accessible to the biological community need to be developed. While technology used in metabolomics varies, most metabolomics studies have a set of features identified. Galaxy is an open access platform that enables scientists at all levels to interact with big data. Galaxy promotes reproducibility by saving histories and enabling the sharing workflows among scientists. Results SECIMTools (SouthEast Center for Integrated Metabolomics) is a set of Python applications that are available both as standalone tools and wrapped for use in Galaxy. The suite includes a comprehensive set of quality control metrics (retention time window evaluation and various peak evaluation tools), visualization techniques (hierarchical cluster heatmap, principal component analysis, modular modularity clustering), basic statistical analysis methods (partial least squares - discriminant analysis, analysis of variance, t-test, Kruskal-Wallis non-parametric test), advanced classification methods (random forest, support vector machines), and advanced variable selection tools (least absolute shrinkage and selection operator LASSO and Elastic Net). Conclusions SECIMTools leverages the Galaxy platform and enables integrated workflows for metabolomics data analysis made from building blocks designed for easy use and interpretability. Standard data formats and a set of utilities allow arbitrary linkages between tools to encourage novel workflow designs. The Galaxy framework enables future data integration for metabolomics studies with other omics data. Electronic supplementary material The online version of this article (10.1186/s12859-018-2134-1) contains supplementary material, which is available to authorized users.


Data Input
This section describes the structure and the format of the input datasets for SECIMTools on Galaxy. Only data files in text format can be used in SECIMTools.

Tabular File Structure
All the data files used for calculations in Galaxy must be text files and use the tabular delimiter. If your data are not saved in a tabular delimited file they will not appear in the dropdown menus for the SECIMTools. Tab delimited files often use the .tab or .tsv extensions.
A mechanism for the file conversion is included in Galaxy. The following sequence of steps can be used to convert a text file with an arbitrary delimiter to a text file with a tabular format: 1. Select the Text Manipulation menu on the left side of the screen. 2. Select the Convert tool which will convert delimiters to TAB. 3. Select the delimiter to convert (e.g. comma, whitespace, and colon) and the dataset you wish to convert. 4. Press the "Execute" button and new tabular delimited dataset will be created.

Wide Format Dataset
A wide formatted dataset should contain feature measurements for each sample with samples in columns and features in rows.
Feature information may also be included in wide format datasets (e.g. average mass to charge ratio and retention time). The dataset must contain a column with a unique identifier for each row (Unique Feature ID). For example the column "Compound" in the figure above. In the wide dataset all the columns containing data to be analyzed should to be numeric. Missing values, can be represented by empty cells or cells with a common value denoting a missing observation. Missing values can be characters (e.g. NA). Please see the section on naming conventions below for more information about naming column headers.

Long Format Dataset
A long formatted dataset contains one column containing the numeric data and another column (or columns) uniquely identifying the numeric data. One of the columns should contain sample identifiers identical to the sample identifiers in the Design Dataset. This column must be called 'sampleID'. The other column contains numeric values (e.g. scores generated from a PCA) for each sample. In cases where there are multiple variables for each sampleID (e.g. peak intensities), additional columns such as the unique Feature identifier may be required to fully describe the content of the value column.

Design Dataset
The design dataset is used to relate the sample values in the wide format dataset with sample characteristics. Examples of sample characteristics are: treatment group, batch number, sample weight, order in which the sample was processed in the instrument (run order). The sample identifiers used as column headers in the wide dataset have to be saved in the design file with the column name sampleID. This is necessary for the tools to link the two files during analysis. Spelling of the sampleID column name is case sensitive. Values in the design dataset can be categorical (such as grouping variable containing treatment name or batch name) or numeric (such as sample weight or run order). Please see the section on naming conventions below for more information about how to name column headers.

Relationship between the Wide Format Dataset and the Design Dataset
In most of the tools the sampleID column in the design file is used to link sample information to the metabolomics data in the wide format dataset. The values for the column name in the wide format and in the sampleID column in the design file must match exactly and the matching algorithm is case sensitive. If there are values in the sampleID column that do not match a column name in the wide dataset a warning will be generated. Warnings are shown in the Tool Standard Output and the Tool Standard Error links in the information section for each tool output. To access the output click on the circular "eye". If there are columns with the data in the wide dataset that are absent from the sampleID column in the design file those columns will not be included in analysis.

Annotation File
The annotation files use the wide format files that link the featureID-s to the corresponding compound names and additional feature information. The annotation files are expected to have at least two columns: one containing the featureID-s and the other contains the corresponding compound names.

Flag File
Flag files are auxiliary datasets that are produced by many of the SECIMTools and are also used as an input by some SECIMTools. The flag files are named in this way to indicate that they usually contain columns of binary values (1/0) even though the other columns are possible. These files are used to identify (flag) features or samples. The flag files can either be in the wide format or in the design format.
For the flag files produced by the SECIMTools, the wide format flag files have the unique feature identifier column (featureID) and additional flag column(s) used to classify the features. The columns that start with the name flag_ contain binary values (0 or 1).
The design format flag files have the sampleID column and additional flag column(s) used to classify the samples. The columns with the flags contain binary values (0 or 1). The column indicating the sampleID must have values that match the values in the design file, but the name of the column in the flag file is arbitrary.
Users can import their own flag files. The tools that accept flag files are flexible and allow any categorical variable. Users only need to be mindful that the sampleID or featureID column values in the flag files have to match the corresponding values in the data files exactly.

Naming conventions
There are naming conventions used by SECIMTools. These follow the conventions and restrictions of the Linux operating system and the tabular text file format used by Galaxy. The majority of the programming languages recognize variable names that follow these rules. Most of these rules are straightforward, however when dealing with chemical data there are some particular challenges. The design file values should not contain a comma (,) in the column name. Compound names often contain special characters that can be problematic in some contexts. Since SECIMTools use several published algorithms by other authors and incorporate many functions from other software packages, users are advised to separate these columns into an annotation file during analysis and join them at the end of the analysis for results interpretation.
Column names and values in the sampleID and featureID columns should use roman letters from the English alphabet, should not: start with a number; be longer than 32 characters; or have special characters. The only permitted special character is the underline symbol (_).
To address possible naming violations that may cause problems the SECIMTools suite has a buildin string cleaning interface executed prior to running any tool. The interface replaces special characters with '_' symbol and prepends the '_' symbol to the column names and featureIDs that start with numbers. By default tools like MZmine output unique identifiers that are numeric for each row. The SECIMTools interface prevents this common convention from being problematic in operating within the Galaxy environment.

Viewing Output
To view the output files click on the view button ("eye" symbol) in the green box in the history on the right side of the screen. If the tool outputs multiple files each file will have a separate view button.

Saving Output
To download and save the output to the local machine click on the title of the output in the history. This will expand the box and give more information about the tool output. After expanding the history green box, click on the download button ("diskette" symbol) on the bottom of the box to save the file to a local machine.

Analysis of Variance (ANOVA) Fixed Effects Model
The tool fits an analysis of variance (ANOVA) fixed effects models with multiple grouping variables, their interactions and covariates. The analysis is performed independently for each row. The user can choose whether to include interactions between grouping variables in the model or to use a pure additive model Click on the Analysis of Variance (ANOVA) Fixed Effects Model tool in the SECIMTools menu on the left side of the screen.
The ANOVA fixed effects model is fit performed using the "ols" function from the statsmodels package in Python. In the Unique Feature ID field type the name of the unique Feature identifier. 4. In the Group(s)/Treatment(s) text box, type the names of the columns in the Design File that identify the variables used for the model. Categorical and continuous variables are supported. Separate the column names with a comma and without a space. For example "treatment_group,gender,age". Column names are case sensitive. 5. In Type of Group(s)/Treatment(s) text box, input the variable types. These are specified as C for categorical and N for continuous and must be in the same order as the variable names. For example for "treatment_group,gender,age" input "C,C,N" 6.
Clicking on the checkbox Calculate ANOVA with interactions will calculate all possible pairwise interactions between the Group(s)/Treatment(s) variables. Numerical variables are not included in the interactions. 7.
Click Execute

Output
This tool outputs two files: - TSV file for the results table containing the fixed effects ANOVA results for each variable,  -PDF file for the volcano plots.
A volcano plot is generated by the ANOVA Fixed Effects tool. On the plot, the differences between the group means are displayed on the x-axis. The corresponding -log base 10 p -values from the test of the null hypothesis are displayed on the y-axis. Each dot represents a feature. One volcano plot is generated for each pairwise comparison. The red dashed line in the volcano plot(s)

Bland-Altman (BA) Plot
The Bland-Altman plot (BA-Plot) is used to look at the concordance of data between pairs of samples, particularly between replicates. The script generates BA-plots for all pairwise combinations of samples. If both a grouping variable name and a specific group name are provided, then the BA-Plots are generated only for pairwise combinations within the specified group.
In addition to generating BA-plots, a linear regression fit is produced between the values that correspond to the pair of samples to identify (flag) any unusual outlying feature values. The flags produced by the regression fit are used to generate distribution plots and text files for (i) each sample (column) and for (ii) each feature (row).
Bland-Altman plots are described in the reference below: Bland JM, Altman D. 1986. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet. 327:307-310.

NOTE: The tool is relevant for Mass Spectroscopy (MS) data analysis and allows removing "noise" from the data using values in the blank samples as a reference.
This tool uses features in negative (blank) control samples to calculate a limit of detection and flag features in non-blank samples that are below this limit. The BFF Threshold for each feature is equal to (3*Standard Deviation of the blank group) + (the average of the blank group) and is computed across blank samples only. If, for a given feature, the computed BFF Threshold is less than or equal to 0, the user specified BFF Threshold overrides the computed BFF Threshold (default value for user specified BFF Threshold is 5000). The user-specified BFF threshold is particularly important when the blank group contains a lot of zero values or is on log-transformed scale. A feature is flagged as below the detection limit for a given group if the ratio (group mean -BFF Threshold) / BFF Threshold is less than the Criterion Value (default 100) for the average within that group.

Output
This tool outputs two files: -TSV file with values compared to the Criterion Value. It contains the following columns: o Unique Feature ID o A column for each Group/Treatment: Contains the value of (group mean -BFF Threshold) / BFF Thresholds for each feature. The BFF Threshold is calculated as (3*Standard Deviation of the Blank Name group) + (the average of the Blank Name group) and is computed across Blank Name samples only.
-TSV file containing flags for each feature. Flag values of one (1) correspond to features which failed to satisfy the BFF Threshold Criterion Value and are considered below the detection limit for the given group. Columns are: o Unique Feature ID o flag_bff_{Group/Treatment}_off: 0/1 indicator flag for each feature. A feature is flagged as below the detection limit for the given group if the ratio (group mean -BFF Threshold) / BFF Thresholds is less than the Criterion Value. The value "1" is assigned to those features that fall below the Criterion Value.

Compare Flags
This tool compares two flag columns in a flag file and creates a 'cross tabulation' results file. Flags from multiple flag files can be combined by first running the 'Merge_Flags' tool.

1.
Select the Flag File from the dropdown menu.

2.
In the Column Name for Flag 1 text box, enter the name of the first flag column to compare. 3.
In the Column Name for Flag 2 text box, enter the name of the second flag column to compare.

Output
This tool outputs a single file: -TSV file containing the frequencies of the compared flags in the appropriate cells.

Coefficient of Variation (CV) Flags
This tool calculates the coefficient of variation (standard deviation as a percentage of the mean) and is often used to look at the consistency of features across samples. The user can define what percent of features with the highest CV to flag. If no percentage is selected, then the top 10% of features with the highest CV are flagged (default value). The CV value corresponding to the percentage is indicated in the output plots.

1.
Select the Wide Dataset from the drop-down menu.

2.
Select the Design File from the drop-down menu. 3.
In the Unique Feature ID text box, type the name of the unique feature identifier. 4.
In the Group/Treatment [Optional] text box, type the name of the column in the Design File that identifies the groups.

5.
In the CV cutoff [Optional] text box, enter a CV cutoff to flag the top X% of features with the largest CV's. Enter the value as a decimal (e.g.: 10% = 0.10). If no value is entered, the top 10% (Default = 0.1) of features with the highest CV's are flagged. 6.
Click Execute

Output
This tool outputs two files: -TSV file containing the CV Flags for each feature for each group (if group variable is specified). A flag value of one (1) corresponds to features with large CV values as specified by the CV cutoff. : o Unique Feature ID o flag_feature_big_CV_{Group/Treatment [Optional]}: 0/1 indicator flag for each feature. The value "1" is assigned to those features with CV values that exceed the CV cutoff criteria.
-PDF file containing histograms with overlaid density plots of the coefficients of variation for each group (optional, if the group variable is provided) and a summary density plot containing the densities for each group without the histograms.
Histogram and density plot (in green) of the coefficients of variation for each feature across all samples. The dotted red line indicates the CV value corresponding to the top 10% of coefficient of variation values and the actual value that corresponds to the top 10% is also displayed in the legend.

NOTE: This tool is primarily intended for identification of compounds in a target file given a mass spectroscopy library file.
Each metabolite (feature) is characterized by mass to charge (m/z) ratio and retention time (RT). This tool matches two files: (1) a mass spectroscopy library file and (2) a target annotation file. The library file (in tsv format) contains a list of compounds and their associated m/z ratios and RTs. The target annotation file contains the m/z ratios and RTs for the experimental samples. The unique identifier in the target annotation file is matched to compound name(s) from the library file based on the m/z ratio and RT. The match is performed using a window around the m/z ratio and a window around the RT.

Output
This tool outputs one file: -TSV file. This file will contain the original target annotation input file plus an additional column containing the name of any compounds matching the m/z ratio and RT. o Target Annotation Unique Feature ID column o Target Annotation Mass/Charge column o Target Annotation RT column o Library Unique Feature ID column: If a compound is matched, this column will contain the name of the compound. If no compound is matched this field will be empty. o {Additional columns}: Note, Any additional columns present in the Annotation File.

Data Normalization and Re-Scaling
The first three normalization methods (Mean, Sum and Median) perform re-scaling of the data by sample. Each individual sample (column) in the wide dataset is re-scaled by dividing of all the feature values within that column by the mean, median or sum of those feature values. Each sample (column) is re-scaled independently from the other samples (columns).
The last six normalization methods (Centering, Pareto, Autoscaling, Range, Level, and Variable Stability (VAST)) perform scaling of the data by features. Each feature (row) is re-scaled independently from other features. Each individual feature (row) in the wide dataset is centered by subtraction of the mean of that feature and is re-scaled by dividing all the feature values within that row by the scaling factor. The scaling factor is computed from the feature values in the current row and depends on the selected method. Centering does not have a scaling factor and does not perform division, Autoscaling uses the standard deviation, Pareto scaling uses the square root of the standard deviation, Range uses the difference between the max and min values, and Level uses the mean. VAST scaling is performed in two steps. The first step is Autoscaling, followed by division of the resulting feature values in each row by the coefficient of variation for that feature.
More information on the scaling methods are available via the references: Keun, Hector C., Timothy MD Ebbels, Henrik Antti, Mary E. Bollard, Olaf Beckonert, Elaine Holmes, John C. Lindon, and Jeremy K. Nicholson. "Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling." Analytica chimica acta 490, no. In the Unique Feature ID text box, type the name of the Unique feature identifier.

4.
Select the Normalization Method using the radio button. Each method indicates whether it is applied to samples or features.

Output
This tool outputs one file: -TSV file containing the same column names as in the original Wide Dataset where the values in each cell correspond to the values after normalization/re-scaling.

Distribution of Features across Samples
The tool summarizes the distribution of 50 randomly selected features (rows) across all samples. Boxplots with outliers and mean value are provided for each selected feature across all samples.
If group or treatment information is provided, boxplots are generated for samples within each group and for all samples. If a group or treatment variable is not provided, boxplots are provided for all samples in the dataset.
As an additional summary, all features are summarized in a single density plot. If group or treatment information is provided, then one density plot is provided for all features and all samples within each group. A single plot is also provided for all samples and features, ignoring the group information. If no group or treatment information is provided, a single density plot is provided for all features and all samples in the dataset.

NOTE: While it recommended to use the tool with log transformed data, the tool remains functional for non-transformed data.
1. Select the Wide Dataset from the drop-down menu. 2. Select the Design File from the drop-down menu. 3. In the Unique Feature ID text box, type the name of the Unique feature identifier. 4. In the Group/Treatment [Optional] text box, type the name of the column in the Design File that identifies your groups or treatments. 5. Click Execute.

Output
The tool outputs one file: -PDF file with boxplot(s) and density plot(s): if the Group/Treatment [Optional] variable is provided plots will be generated for every group as well as for all samples. Otherwise, a single plot will be generated for all samples.
Boxplots of the values for 50 randomly selected features across the treatment groups (top). Density plot for all features across all samples is presented in one graph (bottom).

Distribution of Features within Samples
The tool plots the distribution of features within each sample. All samples are colored by group and are plotted on the same graph for comparison purposes. The distributions of the features within each sample are presented as estimated densities and box-and-whiskers plots with potential outliers. If the sample run order variable column name form the design file is specified in the input, box plots will be displayed according to the run order.

Output
This tool outputs one file: -PDF file containing two graphs: o Density plot.
The density plots illustrate the distribution of features within the given sample. Each line represents a sample and its colored based on the group it belongs to.
o Box and whiskers plot.
Boxplots of the distribution of features within each sample. Each boxplot is color coded based on the group it belongs to. Run order variable (if provided) defines the order of those boxplots.

Hierarchical Cluster Heatmap
This tool generates a hierarchical cluster heatmap from a wide format dataset. This tool works best on the log-transformed data without missing values. An option to add a hierarchical clustering dendrogram to the heatmap figure is included along with an option to remove plot labels.
The Hierarchical Cluster Heatmpaps are produced using the "heatmap" and "clustermap" functions from the seaborn package in Python.

Output
The tool outputs one file: -A PDF file with a hierarchical cluster heatmap of the data Heatmap representing the current dataset (left), different settings allow the usage of dendrograms (right).

Imputation (Mean, Median, K-Nearest Neighbours (KNN), Stochastic)
The tool performs an imputation procedure for missing data based on three conceptually different methods: (1) naive imputation (mean, median), (2) K-nearest neighbor imputation (KNN) and (3) stochastic imputation (based on normal and Poisson distributions). The user specifies which method to use.
Imputations are performed separately for each sample group since treatment groups are expected to be different. If only a single sample (column) is available for a given group, nothing is imputed and the sample is kept intact. An option to select which values should be treated as missing is included. The default value is an empty cell in the dataset with the option to treat zeroes, negative values and user-defined value(s) as missing and subsequently impute missing values.
(1) Naive imputation computes the mean (or median) of the features within the samples for a given group and uses that value to impute values for that feature among the missing samples. Feature values for all missing samples in the group get the same value equal to the mean (median) of that feature from the available samples, provided the allowed missing threshold is met.
(2) K-Nearest Neighbors (KNN) imputation is based on the procedure where the K nearest neighbor samples (default value K = 5) for the given sample within each group are considered. The neighboring samples are used to generate the missing feature value for the current samples. If less than the specified K neighbors are available for the current sample in the current group, then the maximum available number of neighbors is used. If the proportion of missing values for each row (feature) is greater than the specified Row Percent Cutoff value (default 0.5), then the column (sample) mean is imputed instead of feature values from the KNN algorithm. The maximum proportion of missing values for each column (sample) can be specified (Column Percent Cutoff default = 0.8) which determines whether a sample should be imputed or not. If the proportion of missing values for each sample is greater than the specified value, then the missing values are not imputed and the imputation process is interrupted. The algorithm is deterministic and always imputes the same missing values for the same settings.
The KNN imputation is performed using the "impute.knn" function from the impute package in R.
More details on the algorithm are available via the reference: (3) Stochastic imputation is based on the assumption that each feature within a given group follows some underlying distribution. As a result, all missing values are generated from that underlying distribution. The parameter(s) of the underlying distribution is (are) estimated from the observed feature values within that group. Two distribution options are available: normal (recommended for logged and negative data) and Poisson (recommended for nonnegative counts). The normal distribution parameters for each feature are estimated by the mean and standard deviation of that feature values among the observed samples in the given group. If all observed values for a feature are the same then the standard deviation is assumed to be 1/3 of the absolute value of the mean of those values. The Poisson distribution parameter is estimated by the mean of the observed values for a given feature and is expected to be positive for the imputation procedure to work correctly.
The stochastic imputation is performed using the "Impute", "MCMC", "Poisson", and "Normal" functions from the PyMC package in Python.

Kruskal-Wallis Non-Parametric Test
The tool performs a Kruskal-Wallis non-parametric test, an analog of the one-way ANOVA F-test that does not rely on the normality assumption of the distribution of features. Unlike t-tests or an ANOVA F-test, a Kruskal-Wallis test is based on ranks where ranks are compared between groups. The test is performed (1) for all samples from all groups together and (2) for all the samples belonging to each possible pair of groups.
The Kruskal-Wallis tests are performed using the "kruskalwallis" function from the SciPy package in Python.
The reader is referred to the literature for more details on the Kruskal-Wallis test and the computation/approximation of the corresponding p-values: Kruskal, William H., and W. Allen Wallis. "Use of ranks in one-criterion variance analysis." Journal of the American statistical Association 47, no. 260 (1952) is statistically significant using α=0.05 o Flag_significant_0p01_on_ all equal to 1 if the difference in ranks between all groups is statistically significant using α=0.01 o Flag_significant_0p1_on_ all equal to 1 if the difference in ranks between all groups statistically significant using α=0.1 o Flag_significant_0p05_on_{ group_i_ group_j }: equal to 1 if the difference in ranks between the group i and µ is statistically significant using α=0.05 o Flag_significant_0p01_on_{ group_i_ group_j } equal to 1 if the difference in ranks between the group i and µ is statistically significant using α=0.01 o Flag_significant_0p1_on_{ group_i_ group_j } equal to 1 if the difference in ranks between the group i and µ is statistically significant using α=0.1 A PDF file with the volcano plot(s) for the difference(s).
The volcano plot is produced based on the values of the differences between the means of group i and group j for each feature (x-axis) and the corresponding negative log (base 10) p-values (y-axis). Each dot on the plot represents a feature. An individual volcano plot is generated for each pair of groups. The red dashed line in the volcano plot(s) corresponds to a p-value = 0.01 (2 on the negative log base 10 scale

LASSO/Elastic Net Variable Selection
The tool selects (identifies) features that are different between pairs of treatment groups. The selection is performed based on the logistic regression with Elastic Net shrinkage (with LASSO being a special case). The selection method is defined by shrinkage parameter α. Variable selection can be performed for any value of α in the range (0:1] where α = 1 corresponds to the fewest number of variables and the most strict selection criterion (LASSO) and α = 0 corresponds to shrinkage without variable selection (Ridge Regression). The default value is α = 0.5.
The LASSO/Elastic Net variable selection is performed using the "glmnet" and "cv. 3. In the Unique Feature ID text box, type the name of the Unique feature identifier. 4. In the Group/Treatment text box, type the name of the column in the Design File that identifies your groups. 5. In the α text box, type the value for the α parameter. This parameter specifies the penalty for the LASSO/Elastic Net procedure. Default = 0.5 6. Click Execute.

Output
The tool outputs three files: -A TSV file containing the values of the coefficients (including zeroes) for each feature generated by the algorithm for each pair of comparisons (in columns). These coefficients are produced from the transformed data (as part of the LASSO/EN method) and should be interpreted with caution. o {group_i_vs_group_j}: Coefficients for samples in group i vs samples in group j.
-A TSV file containing the corresponding flags for each feature. o {group_i_vs_group_j}_selection_flag_on: 0/1 flag where the value "1" corresponds to features selected by the method.
-A PDF file containing graphs for each pairwise comparison between the groups.
o The first graph displays the behavior of the coefficients based on the value of penalty parameter λ..
The plot shows the behavior of the LASSO/Elastic Net coefficients for the specified penalty split parameter α= 0.5. The value of α= 1 corresponds to the LASSO penalty split parameter. The value of the penalty parameter λ is displayed on the x-axis. The value λ = 0 (left part of the graph) forces all regression coefficients to be zero. The optimal value for the penalty λ is determined by a crossvalidation procedure. For easier visualization the line for each coefficient (feature) has its own color.
o The second graph provides an estimate of the cross-validation error based on the log value of penalty λ.
The value of the penalty parameter λ on the log scale is plotted along the x axis. The corresponding cross validation error (binomial deviance) is plotted along the y axis. The number of features that corresponds to the given value of penalty log(λ) is displayed on the top of the graph along the xaxis. The two vertical lines correspond to λ-s picked by the cross-validation algorithm. The first λ (dotted line on the left) corresponds to the smallest mean cross-validation error. The second λ (dotted line on the right) corresponds to the most regularized model with the error within one standard deviation from the minimum cross-validation error.

Linear Discriminant Analysis (LDA)
The tool performs linear discriminant analysis (LDA) of the data. LDA is a supervised method based on the projection of data in the linear subspace to achieve maximum separation between samples in different groups and minimum separation between samples within groups. The subspace dimension defines the number of components used to describe the variability within the data. Due to the LDA method specification, the subspace dimension must be less than the number of treatment groups. The user has an option to specify the dimension of the subspace directly (default = 2) or to perform single or double cross-validation to determine the dimension of the subspace. For single and double cross-validation, the dataset is split when model fit is performed. For double cross-validation, the data set is split into pieces and the model fit is performed on one piece using cross-validation and evaluated on the other pieces. For single cross-validation, the data are used to both fit and evaluate the model using a three-fold cross validation. Visual summaries are provided in the form of a 2D plot where samples are colored by group and plotted along the determined subspace components pairwise.

Note: A minimum of 100 samples is required by the tool for single or double cross validation
Details about the LDA method can be found in:

Output
This tool outputs: (1) TSV file containing the components produced by the model for each sample.
Component_{i}: contains the score values for each sample. The number of levels {i} is specified in the Number of components text box or determined via cross validation.
(2) TSV file containing the sample classifications produced by the model.
(3) TSV file containing the classification accuracy (in percent) of the algorithm with respect to the number of correctly classified samples.
(4) A PDF file containing 2D plots for all pairwise comparisons of components. The samples are represented as dots and colored by treatment group.
Scatterplot for the first two components for each sample. The samples are color coded based on the group they belong to.

Log and G-Log Transformation
This tool carries out either log or generalized log (g-log) transformation of values in a Wide Format dataset using the logarithm base specified by the user. The logarithmic transformation has the formula: ( ) for each data cell and generalized logarithmic transformation has the formula: ( + √ 2 + ) for each data cell. The generalized version becomes standard logarithmic transformation re-scaled by √2 if the value is 0. The value of is specified by the user. Three bases are available for both logarithmic transformations: base e (natural), base 2, and base 10.

Output
This tool outputs one file: -TSV file. Contains the same column and row names as the original Wide Dataset where the values in each cell correspond to the values obtained by either log or generalized og transformation procedure.

NOTE:This tool is primarily intended for matching mass spectrometry data processed using different parameter settings.
Each metabolite (feature) is characterized by a mass to charge (m/z) ratio and retention time (RT). After raw metabolomics data are processed (such as in mzMine), features are given internal identifiers (numbers) that are different for every run or set of parameters, making it very difficult to impossible to directly compare results across different parameter setting using the internal identifiers. However, it is possible to link internal identifiers using the m/z ratio and RT for each feature since changing parameter settings are predicted to result in only minor variations in m/z ratio and RT. This tool matches two MS datasets that correspond to different parameter settings. Each file should contain at least three columns: the m/z ratio, RT and internal identifier (feature ID). A feature matches across datasets if the m/z ratio and RT values in both MS files fall within a user defined window surrounding the m/z ratio and RT. The size of each window can be specified by the user where the final window width is two times the specified value. 8. In the Mass/Charge cut value text box enter the window value for the m/z ratio. Accuracy of the method will change depending on this value. The default is 0.005. 9. In the Retention Time Cut Value text box enter the window value for the RT. Accuracy of the method will change depending on this value. The default is 0.15 10. In the File 1 Name text box type a short name or identifier for the File 1. This name will be use in the plot. 11. In the File 2 Name text box type a short name or identifier for the File 2. This name will be use in the plot. 12. Click Execute.

Output
This tool outputs six files.
-TSV All peak combinations file. This file contains all combinations of possible features between File 1 and File 2.
-TSV Matched peak combinations file. This file contains the features that match between File 1 and File 2.
-TSV Unmatched peak combinations in file1. This file contains the features in File 1 that do not have a match in File 2.
-TSV Unmatched peak combinations in file2. This file contains the features in File 2 that do not have a match in File 1.
All four output files have the following 6 columns: parameter is provided only the sampleIDs for the group will be included.
-A PDF file of the distribution of the digit counts within each group of samples.
Histogram of the maximum differences between the largest and the smallest number of digits for a given feature across all samples. Differences larger than one might suggest that this feature is unstable.

Modify Design File
The tool creates a new design file based on an existing wide dataset and design file where the specified groups or samples are removed in the new design file.

Output
This tool will output one TSV file: -A TSV Design File where the selected Group(s)/Sample(s) to drop have been removed. The remaining column names and corresponding values will be the same as in the original Design File.

Merge Flags
The tool performs a merge of two or more flag files. The flag files can be either in wide format or in design format. The merging requirements are that (1) the number of rows should be the same in all files being merged and (2)  If the columns in file one are ("column_a", "column_b") and the columns in file two are ("flag_a","flag_b") the columns in the result file will be ("column_a", "column_b", "flag_a","flag_b").

Modulated Modularity Clustering (MMC)
Modulated Modularity Clustering method (MMC) was designed to detect latent structure of the variance-covariance matrix using weighted graphs. The method searches for optimal community structure and detects the magnitude of pairwise relationships. The optimal number of clusters and the optimal cluster size are selected by the method during the analysis.
The initial boundaries (lower and upper) for sigma as well as the number of points in the search grid are specified initially by the user. In the Unique Feature ID text box, type the name of the Unique feature identifier.

3.
In the p-value column text box, type the name of the column in your Wide Dataset that contains p-values.

4.
In the α text box, type the value to be used for multiple correction. Default α =0.05. 5.
Click Execute.

Output
This tool outputs two files: o flag_{p-value column}_bonferroni_off: 0/1 flag where "1" represents those features in which the p-value after the Bonferroni correction are smaller than α. o flag_{p-value column}_bHochberg_off: 0/1 flag where "1" represents those features in which the p-value after the Benjamin/Hochberg correction are smaller than α. o flag_{p-value column}_bYekutieli_off: 0/1 flag where "1" represents those features in which the p-value after the Benjamin/Yekutieli correction are smaller than α.

Partial Least Squares Discriminant Analysis (PLS-DA)
The tool performs partial least square discriminant analysis (PLS-DA) for the two treatment groups selected. The subspace dimension defines the number of components that will be used to describe the variability within the data. The subspace dimension is specified in the range of two to the number of sample in the two selected groups. The user has the option to specify the dimension of the subspace directly (Default=2) or to perform single or double cross-validation to determine the dimension of the subspace. For double cross-validation the data set is split into pieces and the model fit is performed on one piece using cross-validation and evaluated on the other pieces. For single cross-validation using three-fold cross validation is used.
PLS-DA is performed using the "PLSRegression", GridSearchCV, and "cross_val_score" functions from the scikit-learn package in Python.

NOTE: A minimum of 100 samples is required by the tool for single or double cross validation.
Details about the PLD-DA method are available via references: Geladi, Paul, and Bruce R. Kowalski. "Partial least-squares regression: a tutorial." Analytica chimica acta 185 (1986)

NOTE: The user has to insure that group names themselves do not contain commas or spaces. The separator for the two groups should only include comma and no extra spaces.
1. Select the Wide Dataset from the drop-down menu.

2.
Select the Design File from the drop-down menu. 3.
In the Unique Feature ID text box, type the name of the unique feature identifier. 4.
In the Group/Treatment text box, type the name of the column in the Design File that identifies your groups. 5.
In the Name of the Groups to Compare text box, type the names of the two groups to compare. The user has to insure that group names themselves do not contain commas. The separator for the two groups should only include comma and no extra spaces.

6.
Select Cross-validation Options. None corresponds to no cross-validation. Nested corresponds to nested cross-validation 7. In the Number of components text box, type the number of components for the analysis to use (default = 2). This field is used only when Cross-validation Options field is set to none. 8. Click Execute.

Output
This tool outputs three files: A PDF file containing the 2D plots for all pairwise comparisons of components between the two treatment groups.

Penalized Mahalanobis Distance (PMD)
The Penalized Mahalanobis distance (PMD) tool can be used to compare samples within a group and to accounts for the correlation structure between metabolites. In contrast, Standardized Euclidian distance (SED) relies solely on geometric distance and ignores any dependency structures between features. PMD incorporates the correlation structure inside the distance measurement.
When correlation structure and dependency between metabolites are ignored, the features inverse variance-covariance matrix simplifies to a diagonal matrix with diagonal values. In this case, MD simplifies to SED. When the number of features is greater than the number of samples, the inverse of the features variance-covariance matrix does not exist. This is the case for most -omic data. Here, the inverse is estimated using a regularization method (Archambeau et al. 2004).

NOTE: Groups with less than three samples will be excluded from the analysis.
1. Select the Wide Dataset from the drop-down menu. 2. Select the Design File from the drop-down menu. 3. In the Unique Feature ID text box, type the name of the Unique feature identifier. 4. In the Group/Treatment [Optional] text box, type the name of the column in the Design File that identifies your groups. 5. In the Input Run Order Name [Optional] text box, type the name of the column in your Design File that has the Run Order variable. 6. In the Additional groups to separate by [Optional] text box, type the name of additional columns in your Design File to separate your data by. 7. In the Threshold text box, type the threshold, specified as a percentile, to be used for the distribution of the distance values. The default is 0.95. 8. In the ƛ Penalty text box, type the percentile to be used on the calculation of the penalized variance-covariance matrix.

Output
This tool outputs three different files: -TSV file. This file contains the distances from each sample to the estimated mean. o sampleID o distance_to_mean: distance from the sample to the estimated mean. If Group/Treatment [Optional] is given then distances will be calculated from each sample to the estimated group mean.
-TSV file. This contains a n x n matrix (where n is the number computed samples) of the pairwise distances between the samples. If the Group/Treatment [Optional] variable is specified, the distances will be computed within the groups. o sampleID -A PDF file containing o Boxplots of the distribution of distances.
Penalized Mahalanobis distances are computed pairwise for each sample within a group. All pairwise distances computed are summarized for each sample as boxplots. Potential outliers (blue dots), means (red squares), and median (dark blue bars) are displayed. The threshold to declare a potential outlier a threshold (specified as a percentile) in the input. This example shows the Standardized Euclidean Distances in the "Chardonnay, Carneros, CA 2003 (CH02)" group. The dashed lines correspond to the cutoffs are computed from beta, normal and chi-squared distributions in red, yellow and green respectively. Please note that normal and chi-squared cutoffs are expected to be very close to each other.
A scatterplot of the Penalized Mahalanobis Distances in the "Chardonnay, Carneros, CA 2003 (CH02)" group. Distances are computed pairwise between samples within a group. The dashed lines correspond to the cutoffs are computed from beta, normal and chi-squared distributions in red, yellow and green respectively. Please note that normal and chi-squared cutoffs are expected to be very close to each other.

Principal Component Analysis (PCA)
The tool performs principal component analysis (PCA) of the data. Visual summaries are provided in the form of 2D and 3D scatter plots for the first three principal components. The samples in the scatter plots are colored based on group classifications.
PCA is performed using the "PCA" function from the scikit-learn package in Python. -TSV file. This file contains the scores for each sampleID in the Wide Dataset. o SampleID o PC{i}: Each column contains the scores for each component/sample.
-A PDF file of scatter plots of the first three principal components.
The samples are color coded based on group classifications. Scatterplot of the first two principal components (left) and 3D scatterplot shows the first three components (right).

Random Forests (RF)
The tool identifies features that are different between treatment groups based on the random forest (RF) algorithm. Based on Classification and Regression Trees (CART), random forests are an ensemble learning method for classification, regression and variable importance evaluation.
Random Forest classification is performed using the "RandomForestClassifier" function from the scikit-learn package in Python.
Details about the algorithm can be found in the book: Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32. The variable importance plot displays the 20 most important features based on the random forest algorithm. The color of each feature changes from the most important among the 20 (dark blue) to the least important among the 20 (light blue).

Output
This tool outputs two files:

NOTE: This tool is primarily intended for flagging features with variation in retention times in mass spectrometry data analysis. The goal of the tool is to identify potential problems with the instrument or with data processing and pre-processing.
The retention time for a given feature is predicted to be relatively consistent across samples. This tool identifies potential abnormalities or shifts in the retention time for a feature. The tool uses multiple criteria to identify feature's discrepancies.

Output
The tool outputs two files: -TSV file. This file contains flags for each feature, it has the following columns: o Unique Feature ID o flag_RT_Q95Q05_outlier: 0/1 flag where the value "1" is for features where the difference in the retention time between the 95 th and 5 th percentile (customizable to 90 th and 10 th percentiles) is greater than the user-specified threshold in minutes (default is 0.2 minutes). o flag_RT_max_gt_threshold: 0/1 flag where the value "1" is for features where the difference between the retention time maximum and median is greater than userspecified threshold in minutes/2 (default is 02/2 = 0.1 minutes).
o flag_RT_min_lt_threshold: 0/1 flag where the value "1" is for features where the difference between the retention time minimum and median is greater than userspecified threshold in minutes/2 (default is 02/2 = 0.1 minutes).. o flag_RT_min_max_outlier: 0/1 flag where the value "1" is for features where the retention time minimum or maximum is greater than 3 times the standard deviation away from the mean. o flag_RT_big_CV: 0/1 flag where the value "1" is for features whose coefficient of variation (CV) in retention time is greater than some threshold percentile of all the features. The default value is 0.1 which corresponds to 90 th percentile.
-PDF file. This file contains a density plot of the coefficients of variation (CV) for the retention time.
The histogram and smoothed density plot of the coefficients of variation for retention time. The red dotted line shows the cutoff for the top 10 % of the data.
o flag_feature_runOrder_pval_01: 0/1 flag for each feature where the value "1" is assigned for p-values smaller than 0.01. o flag_feature_runOrder_pval_05: 0/1 flag for each feature where the value "1" is assigned for p-values smaller than 0.05. -a PDF file with fitted regression plots for each significant feature.
This plot shows the color-coded scatterplot of the values of "_1001826" feature for all samples. The run order of the samples is displayed along the -axis and the feature value is displayed on the -axis. Each color represents a different group of samples. A regression line has been fit and the R 2 Is displayed on the plot. If there is no effect of run order, the slope of the regression line will be zero. A test of the null hypothesis that the slope is zero is performed and the resultingvalue. The fitted line (light blue) and the 95% confidence bands (red) are displayed on the graph.
The tool provides a 2D scatter plot of values in a Long Format file. If coloring by group is desired, the column with the sample names in the Long Format dataset has to have the name "sampleID" to match the name in the Design File. Scatter plot 2D allows the user to plot any pair of values from the Principal Component Analysis (PCA) output or plot other data.

1.
Select the Long Dataset from the dropdown menu. The scores in the PCA tool are output in the long format or the user can upload some other data set.

2.
In the Sample ID text box, type the name of the Sample ID column from the Long Format dataset. In order for the coloring to work correctly this name should be the same as in the Design File i.e. the name for the samples column should be sampleID.

3.
In the X Group Title text box, type the name the column name from your Long Format that you want to use for X axis. 4.
In the Y Group Title text box, type the name the column name from your Long Format that you want to use for Y axis. 5.
In the Z Group Title text box, type the name the column name from your Long Format that you want to use for Z axis. 6 The user has an option to specify the palette and the color scheme within the palette. If the palette is specified the color scheme has to be specified. The list of available palettes is diverging, qualitative, sequential, cubehelix, tableau, and wesanderson.

Output
This tool outputs one file: -A PDF file containing a 3D scatterplot of the selected data.

Standardized Euclidean Distance (SED)
The tool is designed to identify samples that are different using the standardized Euclidian distance (SED) between samples. The tool estimates the variance of features and calculates the SED between each pair of samples in addition to the SED between each sample and the estimated mean. If a group or treatment variable is provided, then the same distance plots are generated for each group and for all samples together. SED computations are performed using the "DistanceMetric" function from the scikit-learn package in Python. NOTE: Groups with less than three samples will be excluded from the analysis.

Output
This tool will output one file: -TSV Flag File. This file contains the same column names and values as the original Flag File with the addition of the following columns: o flag_sum: The sum of the value of the flags by feature (row). For example, if "row_1" flags are equal to "0","1", and "1" then flag_sum will be "2". o flag_mean: The mean of the flags by feature. For example, if "row_1" flags are equal to "0","1", and "1" then flag_mean will be "0.666". o flag_any: This flag will be a "1" if any of the flag values for the feature are a "1". For example, if "row_1" flags are equal to "0", "1", and "1" then flag_any will be "1". o flag_all: This flag will be a "1" only if all the flag values for the feature are a "1". For example, if "row_1" flags are equal to "0", "1", and "1" then flag_all will be "0".

Support Vector Machine (SVM) Classifier
Given a set of supervised samples in a Training Dataset, the SVM training algorithm builds a model based on these samples that can be used for predicting the categories of new, unclassified samples in a Target Dataset. The Target Dataset is not used for model training or evaluation, only for prediction based on the finalized model. SVM classification is performed on the target data and accuracy is estimated for both Target and Training Datasets.
SVM classification is performed using the "SVC", "GridSearchCV", and "cross_val_score" functions from the scikit-learn package in Python.

NOTE: Design files for both target and training datasets are required and the grouping variable column should be present in both files and have exactly the same name in both files. Unique Feature ID name should also be the same for both datasets and the feature names should be identical. Training Dataset can also be used as Target Dataset.
SVM uses different kernel functions to carry out different types of classification such as radial bassis (gaussian) To use the SVM tool, users need to provide the following information: (i) a Training Dataset with known categories in the training design file and (ii) a Target Dataset with predicted categories in the target design file.
(iii) the name of the Group/Treatment classification column should be the same for both design files.
(iv) the Unique Feature IDs should be the same in both the wide datasets.
(v) the number of Unique Feature IDs should be the same in both the wide datasets.

NOTE: A minimum of 100 samples is required by the tool for single or double cross validation. The use of machine learning algorithms (including the support vector machine classifier) and cross-validation on datasets with small numbers of samples is controversial.
1. Select your training Wide dataset from the drop down menu.

2.
Select your training Design file in the corresponding drop down menu.

3.
Select your target Wide dataset from the corresponding drop down menu. Training wide dataset can also be used as target wide. 4.
Select your target Design file in the corresponding drop down menu. Training design dataset can also be used as target design. 5.
In the Group/Treatment text box, type the name of the column in the Design File that identifies your groups. 6.
In the Unique Feature ID text box, type the name of the unique feature identifier. 7.
Select the In the Group/Treatment text box, type the name of the column in the Design File that identifies your groups. 8.
In the Unique Feature ID text box, type the name of the Unique feature identifier Select the elect the Polynomial Degree used in the algorithm as an integer. The default degree is cubic (3). The parameter is used for polynomial kernels only. 9.
Select the SVM Kernel function 10.
Select the Polynomial Degree used in the algorithm as an integer. The default degree is cubic (3). The parameter is used for polynomial kernels only.

11.
Select Cross-validation Options parameter. None corresponds to no cross-validation. Nested corresponds to nested cross-validation which is also called double. 12. Select Regularization Parameter C. This field is relevant only if Cross-validation Options field is none. Larger values for C add to the bias and overfitting but decrease misclassification. Smaller values for C decrease bias and overfitting but increase misclassification. The default value is 1 and must be positive. The parameter is used for all kernels.

Select Regularization Parameter C (Lower Bound). This field is relevant only if Cross-
validation Options parameter is either single or nested. Has to be positive. 14. Select Regularization Parameter C (Upper Bound). This field is relevant only if Crossvalidation Options parameter is either single or nested. Has to be positive and bigger than Regularization Parameter C (Lower Bound). 15. Select Coefficient A the kernel coefficient gamma parameter. The parameter must be greater than zero. If zero value is specified it will be converted to 1/(number of features). This is the default value. The parameter is only relevant for radial basis function, polynomial and sigmoid kernels. 16. Select Coefficient B the independent term in the kernel function. The default value is zero.
The parameter is only relevant for polynomial and sigmoid kernels. 17. Click Execute.

Threshold Based Flags
This tool flags a feature in a given group with a binary indicator if, for half (or more) of the samples within the group, the feature value is below a user specified threshold or is missing. The default threshold value of 30,000 is primarily useful for peak intensities from mass spectroscopy data and should be evaluated carefully for other types of values (e.g. for peak height).

1.
Select the Wide Dataset from the drop-down menu.

2.
Select the Design File from the dropdown menu.

3.
In the Unique Feature ID text box, type the name of the Unique feature identifier.

4.
In the Group/Treatment text box, type the name of the column in the Design File that identifies your groups.

5.
In the Cutoff text box, type a threshold value. The default value is 30,000.

Output
This tool outputs one file: -TSV file: The file contains flags for each feature by group. The file contains the following columns: o Unique Feature ID o flag_feature{group}_off is flagged a "1" if the feature is below the Cutoff value in half of more of the samples in {group}. o flag_feature_any_off is flagged a "1" when at least one of the flag_feature{group}_off for that feature is flagged a "1". o flag_feature_all_off is flagged a "1" when all the flag_feature_{group}_off for that feature are flagged a "1". o flag_significant_0p05_on_{group_i_Mu} is equal to 1 if prob_greater_than_t_for_diff_{group_i_Mu} is less than 0.05 o flag_significant_0p01_on_{group_i_Mu} is equal to 1 if prob_greater_than_t_for_diff_{group_i_Mu} is less than 0.01 o flag_significant_0p1_on_{group_i_Mu} isequal to if prob_greater_than_t_for_diff_{group_i_Mu} is less than 0.1 -PDF file with the volcano plot(s) for the difference(s).
The volcano plot is produced based on the values of the differences between the mean of a current group i and the value of µ for each feature and the corresponding negative log (base 10) p-values. The differences are displayed on x-axis and the re-scaled p-values are displayed on the y-axis. Each dot on the plot represents a feature. Individual volcano plots are generated for each group, if Group/Treatment is given, and single volcano plot is generated if no grouping variable is provided. The red dashed line in the volcano plot(s) corresponds to a p-value = 0.01 (2 on the negative log base 10 scale). Volcano plots were first described in: Jin, Wei, Rebecca M. Riley, Russell D. Wolfinger, Kevin P. White, Gisele Passador-Gurgel, and Greg Gibson. "The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster." Nature genetics 29, no. 4 (2001): 389-395.

Output
This tool outputs two TSV files: one file with summary statistics and another file with the flags. The comparisons are performed for pairs of different groups i and j provided by the user in Group/Treatment variable. Only two group are compared in paired t-test due to the test design. The number of comparisons in unpaired t-test is not limited and is controlled by the number of groups provided by the user in Group/Treatment variable.
The volcano plot is produced based on the values of the differences between the means of group i and group j for each feature and the corresponding negative log (base 10) p-values. o flag_significant_0p05_on_{group_i_ group_j} is equal to 1 if perm_greater_than_t_for_diff_{group_i_ group_j} is less than 0.05 o flag_significant_0p01_on_{group_i_ group_j} is equal to 1 if perm_greater_than_t_for_diff_{group_i_ group_j} is less than 0.01 o flag_significant_0p1_on_{group_i_ group_j} is equal to if perm_greater_than_t_for_diff_{group_i_ group_j} is less than 0.1 -PDF file with the volcano plot(s) for the difference(s).
The volcano plot is produced based on the values of the differences between the means of group i and group j for each feature and the corresponding negative log (base 10) p-values estimated from permutation. The differences between the group values are displayed on x-axis and the re-scaled p-values are displayed on the y-axis. Each dot on the plot represents a feature. Individual volcano plots are generated for each pair of groups compared. The red dashed line in the volcano plot(s) corresponds to a p-value = 0.01 (2 on the negative log base 10 scale). Volcano plots were first described in: Jin, Wei, Rebecca M. Riley, Russell D. Wolfinger, Kevin P. White, Gisele Passador-Gurgel, and Greg Gibson. "The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster." Nature genetics 29, no. 4 (2001): 389-395.