Targeted UPLC-MS Metabolic Analysis of Human Faeces Reveals Novel Low-Invasive Candidate Markers for Colorectal Cancer

Low invasive tests with high sensitivity for colorectal cancer and advanced precancerous lesions will increase adherence rates, and improve clinical outcomes. We have performed an ultra-performance liquid chromatography/time-of-flight mass spectrometry (UPLC-(TOF) MS)-based metabolomics study to identify faecal biomarkers for the detection of patients with advanced neoplasia. A cohort of 80 patients with advanced neoplasia (40 advanced adenomas and 40 colorectal cancers) and 49 healthy subjects were analysed in the study. We evaluated the faecal levels of 105 metabolites including glycerolipids, glycerophospholipids, sterol lipids and sphingolipids. We found 18 metabolites that were significantly altered in patients with advanced neoplasia compared to controls. The combinations of seven metabolites including ChoE(18:1), ChoE(18:2), ChoE(20:4), PE(16:0/18:1), SM(d18:1/23:0), SM(42:3) and TG(54:1), discriminated advanced neoplasia patients from healthy controls. These seven metabolites were employed to construct a predictive model that provides an area under the curve (AUC) median value of 0.821. The inclusion of faecal haemoglobin concentration in the metabolomics signature improved the predictive model to an AUC of 0.885. In silico gene expression analysis of tumour tissue supports our results and puts the differentially expressed metabolites into biological context, showing that glycerolipids and sphingolipids metabolism and GPI-anchor biosynthesis pathways may play a role in tumour progression.


Introduction
Colorectal cancer (CRC) is the second leading cause of cancer death in developed countries [1]. Although knowledge of the genetic-and diet-associated mechanisms involved in CRC establishment and progression is rapidly increasing [2], still the best prognosis is obtained when malignancy is detected early. CRC screening, which detects both precancerous polyps and CRC, can reduce both colorectal cancer incidence and mortality [3][4][5][6][7]. Through screening, the incidence of colorectal cancer can be reduced by 30% with a mortality reduction of 50% depending on the screening modality and the participation rates [7,8]. These data clearly support the strategy to have efficient and sensitive screening methods. Screening tests available include detecting haemoglobin or DNA mutations/alterations in feces [4,9], radiologic or endoscopic (flexible sigmoidoscopy, colonoscopy, and computed tomographic colonography) methods [10]. Each test has its own advantages, has demonstrated to be cost-effective, and has associated limitations and risks [10]. Although colonoscopy is considered the most accurate test for early detection and prevention of colorectal cancer [11], its applicability is limited due to the secondary effects associated with it (mild and severe), the low adherence in average and familial-risk populations and the limited resources available [12,13].
On the other hand, most of CRC are still diagnosed in symptomatic patients, even when CRC screening programs are established [14]. In this regard, symptoms and symptom-based prediction models have a limited accuracy for CRC detection in this population. CRC diagnostic biomarkers, such as faecal haemoglobin, can improve the diagnostic process either alone or within prediction models [15][16][17]. For all those reasons, the development of non-invasive methods to detect CRC either in asymptomatic and symptomatic patients is an area of interest for patients, clinicians and healthcare providers.
Metabolomics is the omics technology dedicated to the measurement of small molecules (<2000 Da) that are present in a biological system. Major advances and new development of analytical instruments, together with the implementation of bioinformatics tools for robust data analysis allows simultaneous measurement and analysis of a huge number of metabolites from a biological system [18][19][20][21]. In consequence, metabolomics has become one of the main technologies for biomarker identification and for unraveling pathophysiological mechanisms in many diseases, including cancer. The development of ultra-performance liquid chromatography (UPLC) has improved both resolution and sensitivity of metabolomics analysis. It has also allowed the rapid separation of metabolites when compared to conventional LC methods [22,23]. Notably, several metabolomics studies have been performed aiming to identify new CRC biomarkers, as reviewed by Zhang et al. [24]. For diagnostics purpose, several studies exist, although the majority of them have been performed on serum samples [25][26][27][28][29][30][31][32][33], tissue [34][35][36] and urine [37]. To our knowledge, only one study was found that studied metabolomics differences directly in human feces samples, like our project design, using NMR-based metabolomics [38]. Metabolomics study of faeces may be more effective in detecting novel colon cancer makers than other approaches because faeces are in close proximity to the colorectal mucosa and are a product of interactions between dietary components and the microbiota. This latter is affected by and seems to play an important role in the progression of colon cancer [39,40]. Existent literature has identified several metabolites, some being consistently altered in CRC individuals and others being increased in some studies and decreased in other ones [24]. These studies have allowed the identification of several altered metabolic pathways, including carbohydrate and amino acid metabolisms, and lipid-related metabolic pathways. Significantly, most of the studies found differences in metabolites of the tricarboxilic acid (TCA) cycle. Also, importantly, alterations on short-chain fatty acids (SCFAs) levels were found for feces-metabolomics study, which clearly indicates a role for the CRC-specific microbiota composition [38]. Lipid metabolism is an important pathway of cellular energy metabolism and its alteration has been related to CRC development and progression. Alterations on metabolic pathways for the eight distinct pathways of lipid metabolism, including corresponding genes and lipid-specific cell receptors, have been reviewed by Yan et al. 2016 [41].

Results
For the study, we have analysed faecal samples collected from 49 healthy, 40 CRC patients and 40 AD patients (see Materials and Methods for more details). On these samples, we have performed a metabolomics profiling using the UPLC-MS approach as described in Materials and Methods. There is no single method to analyse the entire set of metabolites of a biological sample, mainly due to the wide concentration range of the metabolites joined to their extensive chemical diversity. For this study, we have employed an UPLC-MS method (Supplementary Figure S1) capable of detecting consistently the 105 identified metabolites listed in Supplementary Table S1, that includes fatty acyls, glycerolipids, glycerophospholipids, sterol lipids and sphingolipids.

Multivariate Analysis
First, we analysed the metabolomic profiling of the 105 metabolites by unsupervised principal component analysis (PCA). We did not find any clustering of samples according to their classification as cases (AD and CRC) and controls (C), as seen on the score plot in Figure 1; neither, did if each group (AD, CRC and C) was compared separately each other (Supplementary Figure S2).

Results
For the study, we have analysed faecal samples collected from 49 healthy, 40 CRC patients and 40 AD patients (see Materials and Methods for more details). On these samples, we have performed a metabolomics profiling using the UPLC-MS approach as described in Materials and Methods. There is no single method to analyse the entire set of metabolites of a biological sample, mainly due to the wide concentration range of the metabolites joined to their extensive chemical diversity. For this study, we have employed an UPLC-MS method (Supplementary Figure S1) capable of detecting consistently the 105 identified metabolites listed in Supplementary Table S1, that includes fatty acyls, glycerolipids, glycerophospholipids, sterol lipids and sphingolipids.

Multivariate Analysis
First, we analysed the metabolomic profiling of the 105 metabolites by unsupervised principal component analysis (PCA). We did not find any clustering of samples according to their classification as cases (AD and CRC) and controls (C), as seen on the score plot in Figure 1; neither, did if each group (AD, CRC and C) was compared separately each other (Supplementary Figure S2).  Neither the application of orthogonal (partial least squares) projections to latent structures (OPLS) or multivariate analysis was suitable for obtaining a separation between the groups of samples (data not shown). This lack of discrimination between groups through multivariate analysis highlights the expected high heterogeneity that exists between individuals.

Univariate Analysis
As it is complementary to the multivariate analysis, we have applied a univariate approach that has been shown to be an alternative for metabolomics data sets with elevated heterogeneity [26]. The comparison of the 105 metabolites between cases (AD plus CRC) versus control (C) samples, showed significant (adjusted p-value < 0.05) difference of the fold change for 18 of them as can be observed in the Volcano plot ( Figure 2A). Differences were mostly seen in sphingolipid family (SM and Cer, but not CMH), but also included ChoE, PC, PE and TG metabolites. The most altered metabolite was Cer(42:3), and all metabolites were higher in the case group, except for two of them, Cer(d18:1/16:0) and TG(54:1), which were lower than the control group ( Figure 2A). Other highly altered metabolites (log2 fold change < 1) were Cer(d18:1/24:1) + Cer(d18:2/24:0), PE(16:0/18:1), PE(16:0/18:2) and TG(54:1) (Figure 2A). Neither the application of orthogonal (partial least squares) projections to latent structures (OPLS) or multivariate analysis was suitable for obtaining a separation between the groups of samples (data not shown). This lack of discrimination between groups through multivariate analysis highlights the expected high heterogeneity that exists between individuals.

Univariate Analysis
As it is complementary to the multivariate analysis, we have applied a univariate approach that has been shown to be an alternative for metabolomics data sets with elevated heterogeneity [26]. The comparison of the 105 metabolites between cases (AD plus CRC) versus control (C) samples, showed significant (adjusted p-value < 0.05) difference of the fold change for 18 of them as can be observed in the Volcano plot ( Figure 2A). Differences were mostly seen in sphingolipid family (SM and Cer, but not CMH), but also included ChoE, PC, PE and TG metabolites. The most altered metabolite was Cer(42:3), and all metabolites were higher in the case group, except for two of them, Cer(d18:1/16:0) and TG(54:1), which were lower than the control group ( Figure 2A). Other highly altered metabolites (log2 fold change < 1) were Cer(d18:1/24:1) + Cer(d18:2/24:0), PE(16:0/18:1), PE(16:0/18:2) and TG(54:1) ( Figure 2A). Paired comparisons of sample groups revealed significant differences for some metabolic classes between CRC and AD, and also between CRC and C individuals (Table 1). Stool samples of patients with CRC had higher levels than AD or C samples of PC and also ChoE and SM metabolite classes. TG family showed the maximum differences when AD was compared to C samples, with alterations in 12 metabolites of the family; it was lower in AD than C. Actually, most of the differences between AD and C groups were found in this metabolite family, with only one metabolite altered for DG, PC and PE families. CMH and MG families did not show any difference in any comparison.
Ceramides, ChoE, PC and SM metabolite families were consistently increased in cancer samples. Only TG metabolites showed a specific trend for AD samples, being decreased with respect to the control samples, but showing no differences when comparing C versus CRC samples. Only PE family was consistently increased in both CRC and AD samples when compared to C group. Paired comparisons of sample groups revealed significant differences for some metabolic classes between CRC and AD, and also between CRC and C individuals (Table 1). Stool samples of patients with CRC had higher levels than AD or C samples of PC and also ChoE and SM metabolite classes. TG family showed the maximum differences when AD was compared to C samples, with alterations in 12 metabolites of the family; it was lower in AD than C. Actually, most of the differences between AD and C groups were found in this metabolite family, with only one metabolite altered for DG, PC and PE families. CMH and MG families did not show any difference in any comparison.
Ceramides, ChoE, PC and SM metabolite families were consistently increased in cancer samples. Only TG metabolites showed a specific trend for AD samples, being decreased with respect to the control samples, but showing no differences when comparing C versus CRC samples. Only PE family was consistently increased in both CRC and AD samples when compared to C group.
The analysis of the individual metabolites also showed a difference between sample groups ( Figure 2B). The heatmaps display the fold change of the 105 metabolites included in the analysis and their significances according to the Student's t-test for the comparisons performed between CRC and C, CRC and AD and between AD and C groups. In the comparison of case (AD plus CRC) versus C groups, significant metabolites were found mainly in Cer, ChoE, PE and SM families. While the ceramide family included both increased and decreased metabolites; only increased levels of metabolites belonging to ChoE, PC, PE and SM families were found in the case group.
The comparisons of CRC versus C, and CRC versus AD groups also revealed significant alteration of the levels of metabolites belonging to Cer, ChoE, PE and SM families, but in this case also the abundance of many metabolites belonging to the PC family were significantly altered. Most of the metabolites of these families were elevated in the CRC group in both comparisons. All these changes were not observed when comparing the AD and control groups indicating that those metabolites were mostly altered in the CRC group. Interestingly, a significant down-regulation of metabolites belonging to the TG family was observed mainly in the AD group ( Figure 2B).
We also performed ANOVA test to detect significant differences in the metabolic profile between the three groups studies (C vs. AD vs. CRC). As a result, 29 differentially expressed metabolites belonging to Cer, ChoE, PC, PE and SM classes were found to be statistically significant in agreement with the previous paired analysis (Supplementary Table S2). Also, in concordance with the previous analysis, TG altered metabolites showed a specific pattern, being decreased in the AD group.

Predictive Models
In order to construct prediction models for cases (CRC and AD), the cohort was randomly separated in the training set containing 80% of the samples, and the validation set containing the remaining 20% of samples. To avoid possible bias derived from the data separation, we applied a bootstrap method, generating 10,000 different combinations of both training and validation datasets. By applying general linear models to the training set, we were able to find seven metabolites that when combined provide an AUC value of 0.821 (sensitivity 0.833 and specificity 0.800) ( We computed also the potential effects of age and sex upon the performance of our model. We were able to slightly increase the predictive ability of the model when adding the age (AUC = 0.838), sex (AUC = 0.837) and the combination of both (AUC = 0.848) features to the model ( Figure 3C). When combining our metabolite model with faecal occult blood (FOB) parameter we were able to increase the AUC value up to 0.885.

Correlation of the Metabolites with Clinical Parameters
A number of clinical parameters were available for the 129 samples analysed in this study including age, gender, FOB test (cut-off 100 ng/mL), carcinoembryonic antigen (CEA) test and COLONPREDICT index. COLONPREDICT is a CRC prediction model that takes into account demographic, symptoms, laboratory and anorectal examination results applicable both in primary and secondary healthcare units [16]. Thus, we evaluated if any of the 105 metabolites analysed in faecal samples correlated with any of the clinical parameters (Supplementary Table S3). There was not strong correlation with age, neither with CEA nor COLONPREDICT or gender, and there were only minor correlations with some clinical data as follows. Several TG metabolites correlated inversely with age data. Also, some metabolites belonging to the DG family correlate with age data, in the same direction as the TG metabolites. COLONPREDICT test showed the highest degree of correlation with metabolites of different families including CMH, PC, ChoE, PE, and SM. Although only slightly, ChoE(18:2) correlated directly with the FOB parameter (Supplementary Table S3).
We also studied how clinical parameters classified samples between the three groups (C, CRC and AD) and between two groups (C and Case) (Supplementary Figure S3). Both ANOVA test for the classification into three groups ( Table 2) and Tukey's HSD test for the classification into two groups ( Table 2) showed that COLONPREDICT was the best index to discriminate between samples, followed by FOB. We could see that gender had nearly no differences upon the discrimination between groups, compared to all other clinical parameters. It is important to note that no clinical parameter was able to significantly differentiate between C and AD sample groups.

Correlation of the Metabolites with Clinical Parameters
A number of clinical parameters were available for the 129 samples analysed in this study including age, gender, FOB test (cut-off 100 ng/mL), carcinoembryonic antigen (CEA) test and COLONPREDICT index. COLONPREDICT is a CRC prediction model that takes into account demographic, symptoms, laboratory and anorectal examination results applicable both in primary and secondary healthcare units [16]. Thus, we evaluated if any of the 105 metabolites analysed in faecal samples correlated with any of the clinical parameters (Supplementary Table S3). There was not strong correlation with age, neither with CEA nor COLONPREDICT or gender, and there were only minor correlations with some clinical data as follows. Several TG metabolites correlated inversely with age data. Also, some metabolites belonging to the DG family correlate with age data, in the same direction as the TG metabolites. COLONPREDICT test showed the highest degree of correlation with metabolites of different families including CMH, PC, ChoE, PE, and SM. Although only slightly, ChoE(18:2) correlated directly with the FOB parameter (Supplementary Table S3).
We also studied how clinical parameters classified samples between the three groups (C, CRC and AD) and between two groups (C and Case) (Supplementary Figure S3). Both ANOVA test for the classification into three groups ( Table 2) and Tukey's HSD test for the classification into two groups ( Table 2) showed that COLONPREDICT was the best index to discriminate between samples, followed by FOB. We could see that gender had nearly no differences upon the discrimination between groups, compared to all other clinical parameters. It is important to note that no clinical parameter was able to significantly differentiate between C and AD sample groups. Table 2. Differences between sample classification of several clinical parameters, either for the groups comparison (C, AD and CRC) and for the pairwise comparison (Control vs. Case). ANOVA test has been used for the study of differences between the three groups classification (C, AD and CRC) and Tukey's HSD test was used to analyse pairwise classifications (C vs. AD, C vs. CRC and AD vs. CRC). Tukey's HSD column depicts those pairwise combinations (of the three tested combinations) that showed to be significantly different. Avg. stands for average.

Gene Expression Analysis of Enzymes Involved in the Metabolism of Altered Metabolites
Metabolites that were differentially expressed between case and control samples (Figure 2A), and with a KEGG or HMDB code already defined, were employed to identify possible metabolic pathways altered in colorectal cancer. By using the differentially expressed metabolites, we could in-silico identify 211 gene-encoding proteins that mainly clustered in three different metabolic pathways ( Figure 4A). The identified pathways were glycerophospholipids metabolism, sphingolipids metabolism and the glycosylphosphatidylinositol (GPI)-anchor biosynthesis pathway suggesting that these pathways could be altered in colorectal cancer (Supplementary Figure S4). We analysed the expression levels of these gene-encoding proteins in the available gene-expression dataset of biopsies of colorectal cancer and normal mucosae of the colon [42]. We have observed that 15 of them showed a significantly different fold change between control and cancer (case) samples ( Figure 4B). We have also observed a downregulation of CERS4, SMPD1 and SMPD3 ( Figure 4B), which are responsible for the transformation of sphingosines and sphingomyelins to ceramides. We also observed downregulation of genes that encoded enzymes that catalyse the degradation of phosphocholine into choline metabolite, mainly from the phospholipase D (PLD) family: PLB1, PLD1, PNPLA7, PLA2G12B, PLA2G4C ( Figure 4B). Furthermore, there was a significant downregulation of the genes PIGK and PIGZ, which encode enzymes involved in GPI-anchor biosynthesis. In addition, an upregulation of the genes LPCAT1 and LCAT ( Figure 4B) that encode enzymes involved in the synthesis of phosphatidylcholine and cholesteryl esters, respectively, was also observed. Together, all these alterations on genes involved in lipid metabolism of the tumoral tissue support the lipid changes detected in the faecal samples.  . Gene networks of enzymes related with metabolism of stool CRC-altered lipids. Three major pathways could be observed: Sphingolipid and glycerophospholipid metabolisms, and GPI-anchor biosynthesis (A). Gene expression in silico analysis of CRC tumoral tissue. The expression of gene-encoding enzymes involved in the metabolism of stool-altered lipids was analysed in publicly available GEO dataset GSE37364 that compared tumoral versus healthy tissue of the same individual. All displayed genes were highly significant (p-value < 0.001) except PLPP1 (p-value = 0.05) and PIGK (p-value = 0.02) (B).

Discussion
CRC screening with faecal occult blood (FOB) test has demonstrated efficacy in randomized trials. Nonetheless, the low sensitivity for advanced neoplasia of the test suggests the need for more accurate alternative diagnostic tests. In the present study, we have performed an UPLC-based targeted metabolomics analysis of stool to detect candidate endogenous metabolites suitable for the assessment of colon cancer using minimally invasive techniques. Metabolomic study of faeces can be more effective, because faeces are in close proximity to the colorectal mucosa. To date, metabolomics analyses of faecal samples have mostly been restricted to experimental studies in animal and small cross-sectional studies in humans [42][43][44][45][46][47][48][49][50][51][52]. While GC/MS-based metabolic profiling of faecal water has been reported [53][54][55], there exists only limited studies on the profiling and identification of metabolites within the complete faecal material; notably, lyophilized human faeces where its metabotype was confirmed to be more comprehensive than faecal water [47]. Previously, Ponnusamy et al. [56] profiled whole faeces from irritable bowel syndrome using GC/MS and identified several metabolites as candidate biomarkers for the disease. In the current work, a semi-quantitative analysis of 105 metabolites reveals significant differences in the faecal composition of cancer samples in the  Figure 4B). Some of the genes were of special interest as they serve as union nexuses of different metabolic pathways. Thus, PLPP1 and PLPP3 genes encoded lipid phosphate phosphatases (LPPs) with broad substrate specificity that dephosphorylate lipid substrates including phosphatidic acid, lysophosphatidic acid, ceramide 1-phosphate, sphingosine 1-phosphate, and diacylglycerol pyrophosphate [57]. One of their enzymatic reactions is the conversion of phosphatidic acid to diacylglycerol which is a central lipid for glycerophospholipids, triacylglycerols and sphingolipid metabolisms. In consequence, they modulate different signalling pathways and generate building blocks for lipid metabolism-regulating physiological and pathological processes including vascular function and tumor progression [58]. These also indicate that the altered metabolism of the tumour could be detected in stools, and consequently be detected in a non-invasive manner.
In our study, the most significant lipids altered in stool were cholesteryl esters, particularly ChoE(18:2) and ChoE(20:4) that were increased in CRC samples. This was in agreement with the fact that acetate-a short chain fatty acid-which is the precursor molecule for endogenous cholesterol production, has been reported to be elevated in CRC [59]. In addition, our in silico analysis of the gene expression profile of tumoral tissue reported by Valcz et al. [42] shows increased tumoral levels of the gene encoding the enzyme phosphatidylcholine-sterol acyltransferase responsible for the cholesteryl ester synthesis. Together, the data suggest that the levels of cholesteryl esters in stools can be a suitable non-invasive measurement to detect and follow up colorectal cancer. Interestingly, the combination of FOB with our 7-metabolites of our metabolomics model increases the discriminating ability as judged by the AUC value that passed from 0.821 to 0.885.
It is important to highlight that one of the strengths of our study includes careful processing and preservation of the faecal specimens, and our quantification of within-subject intraclass correlation coefficient (ICC), from which we could estimate statistical power with our cutting-edge faecal metabolomics platform. Our platform has high sensitivity and technical reproducibility, but it has limited ability to detect some volatile and larger molecules.
Our study's major limitations are its small size and cross-sectional, hospital-based case-control design. It provided no assessment of temporality and could only detect strong associations with CRC. Also, the fact that this is a targeted metabolomics obviously biases the results towards lipid species, which is also an important limitation. As we mentioned in the introduction section, lipid alterations have been previously associated with CRC development and progression [41]. We considered, therefore, that our panel of metabolites would be sufficient to find potential CRC biomarkers. Also, keeping in mind the diagnostics aim of this study, we decided to use targeted metabolomics because it's cheaper than an untargeted one, making it a more affordable option. Targeted metabolomics allows an easier interpretation of results and, therefore, an easier translation to clinical practice, which we also considered to be an important point for the diagnostics purpose. As no restriction on diet was provided to the participants in the study, another limitation is the lack of control for potential diet-confounding factors. Nevertheless, we believe this potential diet's effects to be minimal, as all participants came from two Spanish regions that share the same dietary patterns. We did not specifically control for age, sex, tumour position and staging for this study, which constitutes another important limitation. The decision of not to control for those factors was done taking into account the sample size, not big enough to generate sufficiently big subgroups to obtain statistically robust data. In order to minimize those variables effects, we incorporated the 10,000 iterations through random subsetting of the population for the modelization step, thus generating 10,000 different populations, covering a huge range of different composition trains and test subpopulations that could reduce the potential bias towards some of the mentioned factors. Another strength of our study is the comparison against the FOB test and other clinical parameters. For every one of these comparisons, our model composed by the 7-metabolites performed better than the clinical parameters alone. Also, the integration of gene expression data in the study supports the identification of differentially expressed metabolites and puts them into context, providing some insights on how and why the levels are different between healthy controls and cancer patients.

Chemicals
HPLC-MS grade solvents were purchase from Sigma Aldrich (St. Louis, MO, USA). Reference metabolite standard compounds were obtained from Sigma Aldrich, Larodan Fine Chemicals (Malmö, Sweden) and Avanti Polar Lipids (Alabaster, AL, USA).

Clinical Samples and Study Population
The samples were collected during COLONPREDICT study, a multicentre, cross-sectional, blinded study of diagnostic tests aimed to create and validate a CRC prediction index in symptomatic patients based on available biomarkers, clinical and demographical data [16]. The study was approved by the Clinical Research Ethics Committee of Galicia (Code 2011/038). As the samples were collected from the COLONPREDICT study, the population selection characteristics were the same of that study. The cohort consisted of consecutive patients with gastrointestinal symptoms referred for colonoscopy from primary and secondary health care to Complexo Hospitalario Universitario de Ourense, Spain. Exclusion criteria for the COLONPREDICT study were: age under 18, pregnancy, asymptomatic individuals undergoing colonoscopy for CRC screening, patients with previous history of colonic disease, patients requiring hospital admission, patients whose symptoms had ceased within 3 months of evaluation, and patients who declined to participate after reading the informed consent form. Patients self-collected a faecal sample from one bowel movement without specific diet or medication restrictions the week before a colonoscopy was performed at home and delivered to the hospital. The faecal sample was brought to the laboratory in less than 4 hours, split in aliquots and immediately frozen at −80 • C. We selected samples from 40 patients with advanced adenoma-AD-(≥10 mm, villous histology, high-grade dysplasia), 40with CRC and 49 with a normal colonoscopy. The characteristics of the patients differed with respect to age (CRC = 73.1 ± 10.6 years, AD = 68.8 ± 44.6 years, normal = 61.5 ± 14.4 years; p < 0.001) and sex (CRC = 60.0% male, AD = 59.1% male, normal = 27.5% male; p = 0.004). The CRC were located in the rectum (32.5%), colon distal to splenic flexure (45%) and proximal to splenic flexure (22.5%). The tumour stage at diagnosis was: I (24.2%), II (30.3%), III (30.3%) and IV (15.2%). After brief vortex mixing, the samples were incubated for 1 hour at −20 • C. After centrifugation at 16,000 × g for 15 min, 35 µL of the lower organic phase was collected and the solvent was removed. The dried extracts were then reconstituted in 1000 µL acetronitrile/isopropanol (1:1), centrifuged (16,000 × g for 5 min), and transferred to vials for UPLC®-MS analysis on an Acquity-Xevo G2 QTof system (Waters Corp., Milford, MA, USA). Samples were randomly divided into three batches, which contained a maximum of 78 samples. Chromatographic method and mass spectrometric detection conditions were described by Barr et al. [60]. Of the different platforms described, the one corresponding to ours was Platform 3.

Data Pre-Processing
Data pre-processing was processed using the TargetLynx application manager for MassLynx 4.1 (Waters Corp). A total of 105 UPLC-MS features were analysed, all of them identified prior to the analysis. Peak detection and noise reduction were performed as previously described [61,62]. Intra-and inter-batch normalization process was based on multiple internal standards and the pool calibration samples approach described by Martinez-Arranz et al. [62].

Data Analysis
The biomarker assessment in this study was organized in sequential and consecutive phases for discovery and biological validation. Firstly, 133 metabolites including glycerolipids, glycerophospholipids, sterol lipids and sphingolipids were selected as candidate biomarkers for initial analysis faeces samples from advanced neoplasia cases, colorectal cancer and cancer-free controls (Discovery Phase). Secondly, the potential clinical use of the most promising validated candidates was tested in faeces samples from colon cancer cases, a small set of adenomas, and cancer-free controls. Reported STARD guidelines have been the basis for defining our protocol.
Metabolites with less than 70% of the values present were removed from the analysis (remaining 105 metabolites into the analysis). Remaining missing values were imputed metabolite by metabolite, taking the minimal value for the metabolite and dividing it by 10. Data was then normalized with the log10 transformation. Univariate statistical analyses were also performed calculating group percentage changes and the analysis of variance (ANOVA) for the comparison among the different groups: CRC, AD and control (C). Student's t-test p-values were calculated for the comparison between cases (AD and CRC) and C groups, as well as for the comparisons CRC and C, CRC and AD and between AD and C groups. Multivariate analyses were also performed, including both Principal Component Analysis (PCA) and Partial Least Squares Discriminant Analysis one (PLS-DA). ANOVA tests and Tukey's HSD tests were also calculated for several clinical parameters (FOB, sex, age, CEA and COLONPREDICT test) to determine its effectiveness to classify our samples into categories (CRC, AD and C or Case-Control). All p-values were adjusted with Bonferroni methodology unless otherwise stated.
A logistic regression (LR) was performed to identify a predictive signature capable of distinguishing between cases and control groups. LR is a commonly used technique for data classification. We first analysed the correlations between metabolites, establishing a cut-off at ρ 0.75. For each pair of correlated metabolites, we removed the one that separated the worst out of the two groups. A forward stepwise method was selected as variable selection approach, where the analysis started with an empty model and variables were added one at a time as long as these additions are worthy, by measuring the Area Under the Curve (AUC) value. This process finished when no more variables could be added. All samples were randomly divided into estimation (80% of all subjects; n = 101) and validation (20% of all subjects; n = 26) groups, both cohorts having an equal proportional representation of individuals belonging to cases and control groups. Ten-thousand iterations of both subsetting into estimation and validation groups and model constructing were generated, to avoid population-based biases. Receiver operating characteristic (ROC) curve analysis was used to assess its discriminatory power. Overall diagnostic accuracy for a given two-class comparison was given by the area under the ROC curve (AUC) with its associated standard error. Sensitivity, specificity and accuracy values were calculated.
All calculations were performed using statistical software package R v.3.1.1 (R Development Core Team, 2011; http://cran.r-project.org) with caret, caTools and receiver operating characteristic R (ROCR) packages to produce ROC curves and AUC estimate; MASS package was used to generate the LR. Additionally, SIMCA-P+ 12.0.1 (Umetrics AB, Umeå, Sweden) was used for PCA and PLS-DA multivariate data analysis.
Retrieval of genes and enzymes related with differentially expressed metabolites found in the study was done with custom Python scripts, which takes advantage of the published Python packages Biopython [63] and bioservices [64], which were used to access both HMDB and KEGG databases. These custom scripts retrieve information on the metabolite entries on both HMDB and KEGG databases regarding the enzymes involved in the metabolism of cited metabolites, as in which pathways are they present. We identified gene-encoding proteins involved in the metabolism of the seven metabolites of the predictive model, and we uploaded those genes to the STRING database [65], in order to identify the interaction between them, any potential clusterization and possible affected metabolic pathways. Genetic expression was obtained from publicly available GEO databaset GSE37364 [42]. The datasets were uploaded to R and the expression of selected genes was plotted into boxplots. Mapping of both metabolites and genes into metabolic pathways was done with pathview package [66] and custom R scripts.

Conclusions
This study highlights the power of UPLC-MS-based metabolomics approach in the discovery of novel non-invasive markers for colorectal cancer. With this study, we identified alterations in two main metabolic pathways, the glycerophospholipids and glycosphingolipids metabolisms. We found 18 metabolites differentially expressed between case samples (CRC + AD) and healthy controls, being mainly increased in case ones. We also showed how a discrimination model based only on metabolite species was able to differentiate between case (CRC+AD) samples and healthy ones and is better than those used nowadays, based in several clinical parameters like FOB, CEA, etc. The model generated included these metabolites: ChoE

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.