Data in support of enhancing metabolomics research through data mining

Metabolomics research has evolved considerably, particularly during the last decade, and interest in this ‘omic’ discipline is now more evident than ever. However, the future of metabolomics will depend on its capability to find biomarkers. For that reason, data mining constitutes a challenging task in the metabolomics workflow. This work has been designed in support of the research article entitled “Enhancing metabolomics research through data mining”, which proposed a methodological guideline for data handling. A study of aging in a healthy population was used as a guiding thread to illustrate this process. Here we provide a further interpretation of the statistical results obtained. We also focus on the importance of graphical visualization tools as an aid to understanding the most common univariate and multivariate data analyses applied in metabolomics.


Subject area
Chemistry/Biology

More specific subject area
Human metabolomics.

Type of data
Table, R code files, graph, figure.

How data was acquired
Mass spectrometry, clinical laboratory.

Experimental factors
Serum samples from healthy male and female volunteers, collected under fasting conditions.

Experimental features
Methanol and chloroform/methanol serum extracts were analyzed with three separate ultra-performance liquid chromatography-mass spectrometry based platforms.

Data source location
Basque Country, Spain.

Value of the data
Metabolites related to aging in a healthy population are highlighted as a result of two different post-acquisition approaches, considering age as a categorical and as a continuous variable.
R functions are provided for different statistical tests, including graphical visualization tools.
Data are presented through a web application, which is expected to help with the visualization and interpretation of univariate and multivariate data analyses.

Data
Serum samples and anthropometric data from the healthy male and female volunteers included in this study were provided by the Basque Biobank for Research-OEHUN (http://www.biobancovasco.org/) and were processed with the appropriate approval of the Ethics Committee. Samples were analyzed in a COBAS 6000 (Roche Diagnostics GmbH, Germany), and hematological parameters were measured in a GEN-S (Beckman COULTER Inc., USA), at the OSARTEN K.E. laboratory.
Metabolomics profiling data acquired by ultra-performance liquid chromatography coupled to mass spectrometry (UPLC-MS) were pre-processed using the TargetLynx application manager for MassLynx 4.1 (Waters Corp., Milford, MA). The peak-picking process included 466 metabolic features, identified prior to the analysis.

Experimental design, materials and methods
In metabolic profiling, there is no single platform or method to analyze the entire metabolome of a biological sample, mainly due to the wide concentration range of the metabolites coupled with their extensive chemical diversity [2,3]. The current study used multiple UPLC-MS platforms, which were optimized for extensive coverage of the serum metabolome. Metabolite extraction was accomplished by fractionating the samples into pools of species with similar physicochemical properties, using appropriate combinations of organic solvents [4]. Then, three separate UPLC-MS based platforms were used. Briefly, a UPLC-single quadrupole-MS amino acid analysis system was combined with two separate UPLC-time-of-flight-MS based platforms analyzing methanol and chloroform/methanol extracts. Identified ion features in the methanol extract platform included non-esterified fatty acids, oxidized fatty acids, acyl carnitines, N-acyl ethanolamines, bile acids, steroids, monoacylglycerophospholipids, and monoetherglycerophospholipids. The chloroform/methanol extract platform provided coverage over glycerolipids, sphingolipids, diacylglycerophospholipids, acyl-ether-glycerophospholipids, cholesteryl esters, and primary fatty acid amides.
Data pre-processing, data pre-treatment and data processing steps have been widely described [5]. A schematic flowchart of this metabolic profiling workflow is shown in Fig. 1.

Statistical analysis of anthropometric, analytical and hematological parameters
A heatmap of the correlation between age and the anthropometric, analytical and hematological parameters is included in Fig. 2. The variation of each variable with age and gender was evaluated by a two-way ANOVA (Table 1). The analysis per variable was completed with a boxplot and a table indicating the mean value and standard deviation per group. Those results are presented in Supplementary Material 1.

Statistical analysis and visualization
The advantages of using both univariate and multivariate approaches in data mining have been recently reviewed [6]. Both approaches are complementary and their results do not necessarily coincide. Following the advice to combine the use of both univariate and multivariate approaches, we have developed a web application. This is expected to help with the visualization and interpretation of the data analyses.

AgingAnalysis: an interactive web application
The AgingAnalysis application has been developed using the R package shiny. The application is accessible at http://rstudio.owlmetabolomics.com:8031/AgingAnalysis/. The application itself contains a manual describing the different configuration options. This guide is included in the 'Appendix' tab. In addition, the aging project's data can be downloaded from the web site (Fig. 3).
The univariate and multivariate analyses that can be performed through this interactive web site are briefly described below.

Univariate analysis:
In univariate data analysis, only one variable is analyzed at a time. The available statistical tests and visualization tools are described below:
-'Fold-change heatmap' window: The heatmap represents metabolomic signatures associated with aging. For each comparison, log-transformed ion abundance ratios are depicted, as represented by the scale. Darker green and red colors indicate larger drops or elevations of the metabolite levels with age, respectively. Gray lines correspond to significant fold-changes of individual metabolites; darker grays highlight higher significance (Student's t-test p<0.05, p<0.01 or p<0.001). Metabolites in this heatmap are ordered according to the carbon number and unsaturation degree of their esterified chains.
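The two quantities behind each heatmap cell can be sketched as follows. This is an illustrative Python sketch using only the standard library, not the code of the web application. It assumes a base-2 logarithm and a ratio of group means, neither of which is stated in the text, and the group names and values are hypothetical. It returns the pooled-variance Student's t statistic and its degrees of freedom; the p-value would then be read from a t distribution.

```python
import math
from statistics import mean, variance

def log2_fold_change(group_a, group_b):
    """log2 of the ratio of mean abundances (group_b relative to group_a).
    Base 2 and ratio-of-means are assumptions made for this sketch."""
    return math.log2(mean(group_b) / mean(group_a))

def student_t(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_b) - mean(group_a)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df

# Hypothetical ion abundances of one metabolite in two age groups
younger = [10.0, 12.0, 11.0, 13.0]
older = [20.0, 24.0, 22.0, 26.0]

print("log2 fold-change:", log2_fold_change(younger, older))
print("t statistic, df:", student_t(younger, older))
```

A positive log ratio corresponds to an elevation with age (red in the heatmap) and a negative one to a drop (green).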

Multivariate analysis:
Multivariate data approaches analyze two or more variables at once. The application provides the results of several multivariate analyses, in which the 466 metabolites are included:

Statistical analysis using R functions
R is a strongly functional language and an environment for statistical computing and graphics [12,13]. R is freely distributed, and its popularity stems from its easy-to-learn programming syntax, its powerful graphics facilities and the wide range of available statistical techniques.
Here, we provide R functions for three statistical tests, which include an easier determination of the optimal lambda in Box-Cox transformations (tboxcox) and the determination of homoscedasticity through Levene's and Bartlett's tests (levene_test and bartlett_test, respectively). These functions also include graphical visualization tools.

Table 1. Two-way ANOVA analysis of biochemical parameters. Factors: gender and age. Mean difference is significant at the 0.05 level (p<0.001 ***; p<0.01 **; p<0.05 *; p<0.1).

Box-Cox transformations using tboxcox R function
Normal distribution of the data is one of the most important assumptions in multivariate analysis. If it is violated, the Box-Cox transformation provides a systematic procedure for correcting the non-normal distribution. The optimal transformation is obtained by calculating a lambda parameter. The proposed R function tboxcox determines the optimal value for lambda and includes graphical visualization tools (Supplementary Material 2). Several examples generated with this code are provided in Supplementary Material 3, illustrating the most common transformations. Those are generic examples, created by generating values from a normal distribution and applying the inverse transformations to them.
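The idea of choosing lambda can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the tboxcox function from Supplementary Material 2. It assumes that the optimal lambda is the one maximizing the Box-Cox profile log-likelihood over a grid from -2 to 2, which is a common choice but not confirmed by the text. The example mirrors the generic examples described above: normal values are generated and the inverse of the log transformation is applied, so the recovered lambda should be close to 0.

```python
import math
import random

def boxcox(x, lam):
    """Box-Cox transform of positive values: log(x) if lam == 0, else (x**lam - 1) / lam."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda (up to an additive constant)."""
    n = len(x)
    y = boxcox(x, lam)
    m = sum(y) / n
    var = sum((v - m) ** 2 for v in y) / n  # maximum-likelihood variance
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in x)

def optimal_lambda(x, grid=None):
    """Grid search for the lambda with the highest profile log-likelihood.
    The default grid (-2 to 2, step 0.01) is an assumption of this sketch."""
    if grid is None:
        grid = [i / 100.0 for i in range(-200, 201)]
    return max(grid, key=lambda lam: boxcox_loglik(x, lam))

# Generic example: normal values pushed through exp(), the inverse of the log transform
rng = random.Random(0)
normal_values = [rng.gauss(5.0, 0.5) for _ in range(500)]
skewed_values = [math.exp(v) for v in normal_values]

print("estimated lambda:", optimal_lambda(skewed_values))
```

A grid search is used here because it also yields the whole likelihood curve, which is the quantity usually plotted when lambda is chosen graphically.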

bartlett_test and levene_test R functions for testing the homogeneity of variances
Levene's and Bartlett's tests are used to verify the homogeneity of variance. Here, R functions for both tests are provided (Supplementary Material 4). The results obtained with Levene's and Bartlett's tests on our aging research data were compared. Both tests accepted homoscedasticity for 348 and rejected it for three out of 361 variables. However, homogeneity of variance was additionally rejected for eight variables by Bartlett's test only and for two variables by Levene's test only (p<0.01) (Fig. 4).
In addition, the importance of the assumption of homogeneity of variance is discussed in Supplementary Material 5, together with two examples of acceptance and rejection.
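For orientation, the two test statistics can be computed as in the following sketch. It is a standard-library Python illustration and not the levene_test or bartlett_test functions of Supplementary Material 4. Deviations in Levene's statistic are taken from the group mean by default, which is an assumption; passing the median gives the Brown-Forsythe variant. Only the statistics and their degrees of freedom are returned. Bartlett's statistic is referred to a chi-square distribution with k - 1 degrees of freedom and Levene's to an F distribution with (k - 1, N - k) degrees of freedom.

```python
import math
from statistics import mean, median, variance

def bartlett_statistic(groups):
    """Bartlett's statistic and its chi-square degrees of freedom (k - 1)."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    variances = [variance(g) for g in groups]
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances))
    correction = 1.0 + (sum(1.0 / (n - 1) for n in sizes)
                        - 1.0 / (total - k)) / (3.0 * (k - 1))
    return numerator / correction, k - 1

def levene_statistic(groups, center=mean):
    """Levene's W and its F degrees of freedom (k - 1, N - k).
    center=mean is the classical form; center=median is Brown-Forsythe."""
    k = len(groups)
    total = sum(len(g) for g in groups)
    # Absolute deviations of each value from its own group center
    z = [[abs(v - center(g)) for v in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    grand = sum(sum(zi) for zi in z) / total
    between = sum(len(zi) * (zm - grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (total - k) / (k - 1) * between / within, (k - 1, total - k)

# Hypothetical groups with clearly different spreads
narrow = [1.0, 2.0, 3.0, 4.0, 5.0]
wide = [10.0, 20.0, 30.0, 40.0, 50.0]

print("Bartlett:", bartlett_statistic([narrow, wide]))
print("Levene:", levene_statistic([narrow, wide]))
```

Bartlett's test is known to be more sensitive to departures from normality than Levene's, which is one reason the two tests can disagree on some variables.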

MANOVA
A multivariate analysis of variance (MANOVA) was one of the multivariate models selected to decipher an aging metabolic signature [5]. This model was considered for studying age as a categorical variable, establishing the groups according to the age of the volunteers.

Age as independent variable
In order to fulfill the sample size requirements of the MANOVA analysis, the data were screened to find which variables showed the most evident differences among the age groups. The per-variable ANOVA test revealed that 45 out of 148 metabolites had p<0.01 (Supplementary Material Table 1). A heatmap representation of the mean vectors is depicted in Fig. 5a.
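The per-variable screening step can be sketched as follows, in standard-library Python and under stated assumptions: the metabolite names and values are hypothetical, and variables are ranked by the one-way ANOVA F statistic instead of being filtered at p<0.01, because the p-value requires the F distribution, which is not computed here.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic and its degrees of freedom (k - 1, N - k)."""
    k = len(groups)
    total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / total
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (between / (k - 1)) / (within / (total - k)), (k - 1, total - k)

# Hypothetical data: metabolite name -> abundances split into three age groups
data = {
    "metabolite_A": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]],
    "metabolite_B": [[1.0, 2.0, 3.0], [1.5, 2.0, 2.5], [1.0, 2.5, 2.5]],
}

# Rank variables from the most to the least evident group differences
ranked = sorted(data, key=lambda name: anova_f(data[name])[0], reverse=True)
for name in ranked:
    print(name, anova_f(data[name]))
```

Variables at the top of such a ranking would be the candidates retained for the MANOVA model.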

Age and gender as independent variables
As in the previous case, an ANOVA test was applied for each variable (Supplementary Material Table 2). Only 15 out of 141 metabolites had p<0.01. A heatmap representation of the mean vectors is displayed in Fig. 5b.

Linear analysis
A linear least-squares regression analysis was the second multivariate model selected. In this case, age was considered as a continuous variable [5]. Prior to model construction, the samples were randomly divided into estimation (80% of the volunteers) and validation (20%) data sets. Possible overfitting of the model was assessed by comparing the residuals of both data sets. Complete information about the evaluation of the residuals is available in Supplementary Material 6.
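The estimation/validation scheme can be illustrated with a minimal sketch in standard-library Python. It is a simplification and not the model of the study: a single predictor is used instead of a multivariate model, the data are synthetic, and the residuals of the two sets are compared through their root-mean-square error, which is only one possible way to compare them.

```python
import math
import random

def split_indices(n, fraction=0.8, seed=0):
    """Randomly split range(n) into estimation and validation index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(fraction * n))
    return idx[:cut], idx[cut:]

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(x, y, slope, intercept):
    """Root-mean-square of the residuals y - (intercept + slope * x)."""
    return math.sqrt(sum((b - (intercept + slope * a)) ** 2
                         for a, b in zip(x, y)) / len(x))

# Synthetic data: one hypothetical metabolite level and a noisy linear age
rng = random.Random(1)
level = [rng.uniform(0.0, 10.0) for _ in range(100)]
age = [30.0 + 4.0 * v + rng.gauss(0.0, 2.0) for v in level]

est, val = split_indices(len(level))
x_est, y_est = [level[i] for i in est], [age[i] for i in est]
x_val, y_val = [level[i] for i in val], [age[i] for i in val]

slope, intercept = fit_line(x_est, y_est)  # fitted on the estimation set only
print("estimation RMSE:", rmse(x_est, y_est, slope, intercept))
print("validation RMSE:", rmse(x_val, y_val, slope, intercept))
```

A validation error much larger than the estimation error would point to overfitting; similar values suggest that the model generalizes to volunteers not used in the fit.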