Metabolite Predictors of Breast and Colorectal Cancer Risk in the Women’s Health Initiative

Metabolomics has been used extensively to capture the exposome. We investigated whether prospectively measured metabolites provided predictive power beyond well-established risk factors among 758 women with adjudicated cancers [n = 577 breast (BC) and n = 181 colorectal (CRC)] and n = 758 controls with available specimens (collected mean 7.2 years prior to diagnosis) in the Women’s Health Initiative Bone Mineral Density subcohort. Fasting samples were analyzed by LC-MS/MS and lipidomics in serum, plus GC-MS and NMR in 24 h urine. For feature selection, we applied LASSO regression and Super Learner algorithms. Prediction models were subsequently derived using logistic regression and Super Learner procedures, with performance assessed using cross-validation (CV). For BC, metabolites did not increase predictive performance over established risk factors (CV-AUCs~0.57). For CRC, prediction increased with the addition of metabolites (median CV-AUC across platforms increased from ~0.54 to ~0.60). Metabolites related to energy metabolism: adenosine, 2-hydroxyglutarate, N-acetyl-glycine, taurine, threonine, LPC (FA20:3), acetate, and glycerate; protein metabolism: histidine, leucic acid, isoleucine, N-acetyl-glutamate, allantoin, N-acetyl-neuraminate, hydroxyproline, and uracil; and dietary/microbial metabolites: myo-inositol, trimethylamine-N-oxide, and 7-methylguanine, consistently contributed to CRC prediction. Energy metabolism may play a key role in the development of CRC and may be evident prior to disease development.


Introduction
Breast cancer (BC) and colorectal cancer (CRC) are, respectively, the first and third most common incident cancers among women in the US [1]. Substantial evidence outlined in the Third Expert Report of the World Cancer Research Fund (WCRF)/American Institute for Cancer Research (AICR) continuous update project supports the premise that dietary patterns and lifestyle factors significantly influence the risk of these cancers [2]. The Expert Report emphasizes the importance of maintaining a healthy weight, engaging in regular physical activity, adopting a diet high in fruits, vegetables, whole grains, and dietary fiber, and reducing intakes of red meat, animal fats, and refined carbohydrates [2][3][4][5]. Further, evidence suggests that even moderate alcohol consumption can contribute to an increased risk of post-menopausal BC and CRC [6].
Diet is a complex mixture of nutrients, bioactives, additives, and other components that can contribute to the risk of cancer [7]. Some chemicals, such as heterocyclic amines and polycyclic aromatic hydrocarbons formed when meat or fish are cooked at high temperatures, may be directly carcinogenic [8]. Other nutrients, such as saturated fat or added sugars, may be linked with cancer risk indirectly through alterations in various signaling pathways, such as insulin or inflammation [9][10][11]. Saturated fat may also contribute to excess caloric intake and weight gain [12], while foods rich in fermentable fiber may lead to beneficial gut microbial community structure [13,14]. These exposures, along with phenotypic information, can be captured with high-dimensional tools applied to blood or urine, such as metabolomics. Metabolomics is the comprehensive, qualitative, and quantitative study of the small molecules in an organism and includes both aqueous and lipid metabolites [15]. The metabolome reflects endogenous processes as well as diet and other environmental exposures. Thus, it provides a sensitive approach for testing and tracing the involvement of altered biological pathways and networks associated with chronic diseases, such as cancer. Although metabolomics has been used extensively to search for biomarkers of early cancer detection [16][17][18], metabolomic profiles are now also being used as risk markers associated with environmental exposures [19,20].
In this study, we aimed to identify prediagnostic serum and urine metabolites, measured on multiple metabolomics platforms, that predict BC and CRC beyond well-established risk factors within the Women's Health Initiative (WHI) Bone Mineral Density (BMD) subcohort. Specifically, comparing several variable selection and prediction models, we assessed how well metabolites alone predicted BC and CRC relative to models with only demographic, clinical, and lifestyle covariates, and whether metabolites improved prediction performance when added to these well-established risk factors. Comparing the results across variable selection and prediction approaches also provides an evaluation of the robustness of the selected metabolites and their prediction performance. These analyses may reveal novel metabolite-cancer associations and mechanisms, particularly for diet-related metabolites.

Women's Health Initiative
The WHI recruited 161,808 post-menopausal women from 40 clinical centers nationwide between 1 October 1993 and 21 December 1998 [21]. All women were 50-79 years old when they were enrolled in at least one of three clinical trials (CT; n = 68,132) or an observational study (OS; n = 93,676). The three WHI CTs were randomized controlled trials of menopausal hormone therapy, low-fat dietary modification, and calcium/vitamin D supplementation. The WHI BMD subcohort included all participants at three clinical centers (Birmingham, AL; Pittsburgh, PA; and Tucson, AZ, with a satellite in Phoenix, AZ) (n = 11,020) chosen to maximize racial and ethnic diversity. All women completed core questionnaires covering medical history, reproductive history, family history, medication use, dietary intake, and personal habits [21].

Case and Control Selection
Cases and controls for this analysis were selected from the WHI BMD subcohort. The eligible sample was restricted to women who had sufficient serum (300 µL) and urine (550 µL) samples from the same time point, prior to and closest to the BC or CRC case diagnosis date, and who had no missing covariate data (n = 10,451). Clinical outcomes were reported biannually in the CT through the trial periods (until 2005) and annually thereafter, and annually in the OS. An initial report of invasive cancer during cohort follow-up was confirmed by a review of medical records and pathology reports by physician adjudicators.
The cases were defined as earliest incident invasive BC or CRC so that the biospecimen collection would be comparatively proximate. Each of the 758 case women was matched 1-to-1 to a control woman, disease free at the case occurrence follow-up time, based on age (within 2 years; Table 1), WHI enrollment date (within 2 months to control for follow-up duration), and self-identified race or ethnicity; the closest match was selected based on criteria to minimize an overall distance measure [22]. In total, 54% of the selected sample were in the OS, 34% in the dietary modification (DM) trial, and 12% in the hormone trials (HT) (but not in the DM trial). Our final population included n = 758 adjudicated cancers (577 invasive breast and 181 colorectal) and n = 758 controls.

Targeted LC-MS: Serum samples were analyzed by targeted LC-MS/MS using liquid chromatography coupled to a Sciex Triple Quad 6500+ triple quadrupole mass spectrometer equipped with an ESI ionization source, as described previously [23]. The instrument was attached to two Shimadzu UPLC pumps, and the pumps were connected to an autosampler in parallel so that chromatographic separation could be performed using two analytical hydrophilic interaction liquid chromatography (HILIC) columns independently, one for positive ionization mode and the other for negative ionization mode. Identical columns (Waters XBridge BEH Amide XP) were used for both separations, and the samples were injected for each column separately. While one column was performing separation and MS data acquisition in ESI+ ionization mode, the other column was equilibrated and readied for analysis in ESI− mode. The LC-MS system was controlled using AB Sciex Analyst 1.6.3 software. Serum metabolites were extracted using methanol in a 1:2 (v/v) ratio, dried, and reconstituted in HILIC solvent. MS data acquisition was performed in multiple reaction monitoring (MRM) mode. Measured MS peaks were integrated using AB Sciex MultiQuant 3.0.3 software. A total of 304 metabolites were targeted (see Supplemental Table S4), of which 150 were detected with less than 20% missing values.

Measurements of Urine Metabolites
NMR spectroscopy: Metabolite profiles from 24 h urine samples were analyzed by NMR spectroscopy using a Bruker Avance III 800 MHz NMR spectrometer. Each sample (300 µL) was mixed with 300 µL phosphate buffer in D2O (pH = 7.4) containing an internal standard, 3-(trimethylsilyl)propionic acid-2,2,3,3-d4 sodium salt (TSP). Data were acquired at 298 K using a one-dimensional pulse sequence with suppression of the residual water signal using presaturation. Spectral width, time domain points, relaxation delay, and number of transients were 10,000 Hz, 32,768, 2 s, and 64, respectively. The raw data were Fourier transformed after zero filling by a factor of two and multiplication by an exponential window function with a line broadening of 0.5 Hz. The resulting spectra were phase and baseline corrected and referenced to the internal standard, TSP. Metabolite peaks were identified using databases, and relative concentrations for 59 metabolites were obtained (Supplemental Table S4). None of the metabolites had missing values.

Metabolite Quality Controls (QC)
Analysis protocols used multiple layers of QC samples as well as isotope-labeled or unlabeled internal standards to assess instrument stability/performance during the analysis and help with normalization and metabolite quantitation. The types of QCs used included: (a) unblinded instrument QC samples (commercially obtained pooled human serum from Innovative Research, Inc. (Novi, MI, USA)) run every 10 samples and at the beginning and end of each batch of samples; (b) blinded, pooled study samples (5% for urine; 10% for serum) interspersed with the biological study samples (3 QCs/batch of 27 study samples), used to normalize batches of samples over the run; (c) 17 split-sample blinded duplicates of study samples also interspersed with study serum and urine samples, used to calculate reported median metabolite coefficient of variation (CV) values; (d) isotope-labeled internal standards for targeted analysis of aqueous metabolites (n = 33) and lipids (n = 54) in serum, which enabled absolute concentration determination and ensured evaluation of instrument stability and data quality; (e) internal standard, TSP, used to assess the spectral quality, calibrate spectra, and help with data normalization of urine NMR spectra; and (f) FAME (fatty acid methyl esters) of different fatty acid chain lengths for retention time indexing and myristic acid-d27 for help with metabolite identification and data normalization, respectively. Median CVs of blinded pooled study QC samples for the four different platforms (two for serum analysis and two for urine analysis) across the samples were 2.9% for global NMR from 24 h urine, 6.4% for targeted lipidomics, 20.7% for targeted LC-MS/MS, and 45.4% for global GC-MS.
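As an illustration of the median CV summary reported above, the following minimal sketch computes a percent coefficient of variation per metabolite from repeated QC measurements and then takes the median across metabolites. The function names, the example values, and the use of the sample standard deviation are assumptions made for illustration; the authors' exact computation is not described in this section.

```python
import statistics

def cv_percent(replicates):
    """Percent coefficient of variation: 100 * sample SD / mean (assumed definition)."""
    return 100.0 * statistics.stdev(replicates) / statistics.mean(replicates)

def median_cv(qc_measurements):
    """Median of per-metabolite CVs; qc_measurements maps metabolite -> QC values."""
    return statistics.median(cv_percent(v) for v in qc_measurements.values())

# Hypothetical QC intensities for three metabolites (illustrative values only).
qc = {
    "metabolite_a": [100.0, 100.0, 100.0],  # perfectly reproducible, CV = 0%
    "metabolite_b": [9.0, 11.0],
    "metabolite_c": [8.0, 12.0],
}
platform_median_cv = median_cv(qc)
```

A platform-level summary such as the 2.9% to 45.4% values above would correspond to `platform_median_cv` computed over all metabolites on that platform.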

Statistical Analysis

Participant Data
From the originally collected participant data, we selected a base set of covariates (age, chronologic time of visit, and race or ethnicity) and all demographic, clinical, and lifestyle covariates that were adjusted for in Prentice et al. [28]. Some categorical variables, including race or ethnicity, education level, and income, were recoded as binary variables. The base set of variables listed above and other demographic, clinical, and lifestyle covariates considered are summarized in Table 1. We considered all identified metabolites from the four metabolomics platforms with less than 20% missing data. For each outcome, we used all cases corresponding to that outcome and all controls (i.e., all 758 controls were used in predicting both outcomes), which has been shown to improve prediction performance [29].

Imputing Missing Data
While the base set of covariates was measured on all participants in this study, there were missing data in some of the demographic, clinical, and lifestyle variables [BMI < 1%, waist circumference < 1%, smoking history < 1%, energy expenditure 6%, and intake of alcohol, calcium, folate, and red and processed meat missing for fewer than 3% of participants]. We used multiple imputation via chained equations [30] to perform imputations. We ignored the outcome in all imputation models to simplify our procedure for assessing prediction performance described below (see, e.g., [31]). Metabolomics variables with more than 20% missing values were removed to help ensure robust results. For the remaining metabolomics variables, values below detection limits were imputed with half of the minimum nonzero value. For each platform, we created one set of multiple imputed datasets consisting of only the metabolites and base set of variables, and a second set of multiple imputed datasets consisting of the metabolites and all risk factor variables (including the base set of variables). More detail on the imputation procedure is provided in the Supplemental Methods.
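The two metabolite-level rules described above (dropping variables with more than 20% missing values, then filling below-detection values with half of the minimum nonzero value) can be sketched as follows. This is a minimal illustration in standard-library Python, not the authors' code (their analyses used R); the function name, the use of `None` to mark a below-detection value, and the example data are assumptions.

```python
def filter_and_impute(table, max_missing=0.20):
    """table maps metabolite name -> list of values; None marks a missing value."""
    cleaned = {}
    for name, values in table.items():
        n_missing = sum(v is None for v in values)
        if n_missing / len(values) > max_missing:
            continue  # drop metabolites with more than 20% missing values
        positives = [v for v in values if v is not None and v > 0]
        if not positives:
            continue  # assumed guard: no nonzero value to derive a fill from
        fill = min(positives) / 2.0  # half of the minimum nonzero value
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned

# Hypothetical measurements (illustrative values only).
example = {
    "glycerate": [4.0, None, 2.0, 6.0, 8.0],   # 20% missing: kept
    "unknown_x": [None, None, 1.0, 3.0, 5.0],  # 40% missing: dropped
}
result = filter_and_impute(example)
```

In this example the single missing glycerate value is replaced by 1.0, half of the smallest observed nonzero value (2.0).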
After each platform-specific imputation step was complete, we further processed the data following a similar specification to Zheng et al. [32]. In particular, outliers were truncated to within three times the interquartile range of the first and third quartiles. For LC-MS and GC-MS metabolites, we normalized the data within each imputation round and batch using local polynomial regression fitting (the R function loess) with the tuning parameter set to 0.75 among quality-control samples.
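The outlier truncation rule can be sketched as clipping each metabolite to the interval [Q1 − 3·IQR, Q3 + 3·IQR]. The sketch below assumes that reading of the rule and uses the "inclusive" quartile definition from Python's `statistics` module; the quartile convention used by the authors is not stated, and the function name and data are hypothetical.

```python
import statistics

def truncate_outliers(values, k=3.0):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k = 3 follows the text above."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# Hypothetical intensities with one extreme value (illustrative only).
values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
truncated = truncate_outliers(values)
```

Here Q1 = 2.25 and Q3 = 4.75, so the upper bound is 12.25 and only the extreme value 100.0 is pulled in; the remaining values are unchanged.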
To minimize the effect of possible correlated variables on our results and to study the utility of different platforms, we first considered each measurement platform (NMR, LC-MS, GC-MS, and Lipidyzer) separately. Then, for each platform, we performed analyses based on metabolomics alone, and on established risk factors + metabolomics, with the base set of covariates (age, chronologic time of visit, and race or ethnicity) always included in each analysis. In a sensitivity analysis, we pooled the metabolites from all platforms together.

Algorithms Used for Selecting a Set of Metabolites
In all analyses, we adjusted for the base set of variables to account for the sampling design [33]. We report adjusted analyses using only the metabolites and using the metabolites plus other established risk factors. For a given platform and set of adjustment covariates, to evaluate the robustness of variable selection and prediction performance, we applied three algorithms to select a set of metabolites and covariates to use in the final risk-prediction algorithm, described below. The first algorithm performed no variable selection (i.e., allowed all metabolites and covariates into the final prediction algorithm). The second was lasso regression [34] implemented in the R package glmnet, with tuning parameters selected using ten-fold cross-validation. We forced the base set of covariates into all lasso models to ensure proper adjustment for these variables. We then selected all variables with a nonzero estimated coefficient.
The final procedure used the Super Learner [35] implemented in the R package SuperLearner [36]. The Super Learner is a particular implementation of stacking models [37]; in this algorithm, a library of candidate learners is fit to the data, and cross-validation is used to create the convex combination of these candidate learners that minimizes a cross-validated loss criterion. In these analyses, we used the non-negative log-likelihood loss function. The resulting convex combination has both finite-sample and asymptotic guarantees on its performance [35]. Our candidate library consisted of elastic net regression [38], boosted trees [39], and random forests [40]. The R implementations of these algorithms and the tuning parameters used are provided in Supplemental Table S1. To perform variable selection using the Super Learner, we first computed a variable importance measure for each candidate algorithm: an estimated coefficient for the elastic net and a decrease in Gini impurity for both trees and forests. We then ranked the variables from most to least important by the algorithm-specific metrics and combined the ranks using the convex weights of the Super Learner. We then selected variables with weighted rank (weights based on the Super Learner; see the Supplemental Methods) in the top 20. This ensures that algorithms with high weight in the Super Learner ensemble, which implies favorable cross-validated performance, have a large influence in selecting variables.
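The weighted-rank selection step can be illustrated with a short sketch. It assumes that each learner's ranks are averaged using the ensemble weights and that the variables with the smallest weighted rank are kept; the exact combination rule is given in the Supplemental Methods and may differ. The learner names, importance values, and weights below are hypothetical.

```python
def rank_descending(importance):
    """Rank variables by absolute importance; rank 1 is the most important."""
    ordered = sorted(importance, key=lambda var: -abs(importance[var]))
    return {var: position + 1 for position, var in enumerate(ordered)}

def weighted_rank_selection(importances, weights, top_k):
    """Combine per-learner ranks with ensemble weights and keep the top_k variables."""
    ranks = {learner: rank_descending(imp) for learner, imp in importances.items()}
    variables = list(next(iter(importances.values())))
    combined = {
        var: sum(weights[learner] * ranks[learner][var] for learner in ranks)
        for var in variables
    }
    selected = sorted(combined, key=combined.get)[:top_k]
    return selected, combined

# Hypothetical importances: coefficients for the elastic net, impurity decreases otherwise.
importances = {
    "elastic_net": {"glycerate": 0.8, "taurine": -0.5, "uracil": 0.0, "acetate": 0.1},
    "boosted_trees": {"glycerate": 0.30, "taurine": 0.10, "uracil": 0.40, "acetate": 0.20},
    "random_forest": {"glycerate": 0.35, "taurine": 0.30, "uracil": 0.15, "acetate": 0.20},
}
weights = {"elastic_net": 0.5, "boosted_trees": 0.25, "random_forest": 0.25}  # convex weights
selected, combined = weighted_rank_selection(importances, weights, top_k=2)
```

Because the elastic net carries half of the ensemble weight in this toy example, its ranking dominates: glycerate and taurine are selected even though the boosted trees rank uracil first.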

Assessing Prediction Performance
After selecting a set of metabolites and covariates, we assessed the performance of these variables in predicting either BC or CRC. We fit two final prediction algorithms for each platform. The first was a simple logistic regression. The second was the Super Learner, using the same approach as described above. This resulted in four procedures based on variable selection: variable selection with the lasso, followed by either logistic regression (denoted lasso + GLM below) or the Super Learner (denoted lasso + SL) for prediction; and variable selection with the Super Learner, followed by logistic regression (denoted SL + GLM) or the Super Learner (denoted SL + SL) for prediction. We compared these four approaches with two that did not use variable selection: the Super Learner with all variables (denoted SL) and the Super Learner with all variables that used a library of candidate learners augmented with variable selection algorithms [denoted SL (with screens)]. Further details on these procedures are provided in the Supplemental Methods.
Assessing prediction performance of sets of selected variables was complicated by the fact that these variables were not determined a priori [41]. We used cross-validation to assess the performance of a combined procedure for variable selection and prediction using the selected variables, whereby the selected variables and prediction algorithm were determined on training data and prediction performance was evaluated on independent data. We repeated this cross-validated procedure 100 times for each platform and set of adjustment variables (base set of variables only or all risk factor variables). We measured prediction performance using the cross-validated area under the receiver operating characteristic curve (CV-AUC). Detail on this cross-validated procedure is provided in the Supplemental Methods.
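The key point of this design is that variable selection happens inside each training fold, so the held-out fold never influences which variables are chosen. The sketch below illustrates that structure with a deliberately simple stand-in for the selection and prediction step (pick the single variable with the largest case-control mean difference); it is not the lasso or Super Learner procedure used in the study, and the function names, stratified fold assignment, and simulated data are all assumptions.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a case scores higher than a control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_and_fit(X, y):
    """Toy stand-in for selection + prediction, using training data only."""
    def mean_diff(j):
        cases = [row[j] for row, lab in zip(X, y) if lab == 1]
        ctrls = [row[j] for row, lab in zip(X, y) if lab == 0]
        return sum(cases) / len(cases) - sum(ctrls) / len(ctrls)
    diffs = [mean_diff(j) for j in range(len(X[0]))]
    best = max(range(len(diffs)), key=lambda j: abs(diffs[j]))
    sign = 1.0 if diffs[best] >= 0 else -1.0
    return lambda row: sign * row[best]

def cv_auc(X, y, n_folds=5, seed=0):
    """Mean held-out AUC over stratified folds; selection is redone in every fold."""
    rng = random.Random(seed)
    cases = [i for i, lab in enumerate(y) if lab == 1]
    ctrls = [i for i, lab in enumerate(y) if lab == 0]
    rng.shuffle(cases)
    rng.shuffle(ctrls)
    folds = [cases[k::n_folds] + ctrls[k::n_folds] for k in range(n_folds)]
    fold_aucs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = select_and_fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_aucs.append(auc([y[i] for i in test_idx], [predict(X[i]) for i in test_idx]))
    return sum(fold_aucs) / len(fold_aucs)

# Simulated data: 100 cases and 100 controls; only the first variable carries signal.
rng = random.Random(1)
y = [1] * 100 + [0] * 100
X = [[rng.gauss(1.5 * lab, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)] for lab in y]
estimate = cv_auc(X, y)
```

Repeating the procedure, as the authors did 100 times, would amount to calling `cv_auc` with different `seed` values and summarizing the resulting estimates.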

Final Selection of Metabolites
We obtained a final set of selected variables from each platform by applying the variable selection procedure to the full set of observations for each imputed dataset; our final set consisted of those metabolites and covariates that were selected in over 70% of the individual imputed datasets. For each platform and set of adjustment covariates, we then took the union of the sets resulting from the two variable selection procedures. Our final set of metabolites was a further union of the platform-specific selected sets, while the final set of adjustment covariates was the unique covariates selected in any of the platform-specific analyses (Supplemental Figures S1 and S2).
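The final selection rule combines a frequency threshold across imputed datasets with a union across selection procedures. A minimal sketch, with hypothetical selections from five imputed datasets per procedure, is shown below; the function name and example data are illustrative only.

```python
from collections import Counter

def stable_selection(selected_per_imputation, threshold=0.70):
    """Keep variables selected in more than `threshold` of the imputed datasets."""
    counts = Counter(var for sel in selected_per_imputation for var in set(sel))
    n_datasets = len(selected_per_imputation)
    return {var for var, count in counts.items() if count / n_datasets > threshold}

# Hypothetical selections, one set per imputed dataset.
lasso_runs = [
    {"glycerate", "taurine"}, {"glycerate", "taurine"}, {"glycerate", "uracil"},
    {"glycerate", "taurine"}, {"glycerate", "taurine"},
]
sl_runs = [
    {"adenosine", "glycerate"}, {"adenosine"}, {"adenosine", "taurine"},
    {"adenosine"}, {"adenosine", "glycerate"},
]

# Union of the sets obtained from the two selection procedures for one platform.
platform_set = stable_selection(lasso_runs) | stable_selection(sl_runs)
```

In this example taurine survives through the lasso (selected in 4 of 5 datasets) even though the Super Learner rarely chose it, which is the intended effect of taking the union.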

Post Hoc Sensitivity Analyses
Prior studies within WHI cohorts suggest an interaction between HT use and insulin such that associations between obesity-related measures (i.e., BMI, adipokines, levels of insulin, etc.) and both BC and CRC were only observed among non-HT users [42,43]. It has been proposed that oral HT exposes the liver to a large dose of estrogen, leading to altered hepatic protein synthesis. Because HT could potentially alter metabolites associated with BC and CRC in our analysis, we conducted a sensitivity analysis for both cancer outcomes, excluding women randomized to the active arms of the HT or who reported current HT use at baseline.

Results
Characteristics of the WHI BMD participants stratified by BC and CRC cases and controls are given in Table 1. The mean time between blood draw and cancer diagnosis was 7.2 years (IQR 2.4-11.6 years).
Of the four metabolomics platforms, the greatest prediction potential was observed with LC-MS, which targeted water-soluble metabolites in serum. We present the cross-validated performance of each procedure for predicting BC and CRC using the LC-MS platform in Figure 1. In the left panel, we see that the base set of covariates, forced into all prediction models, and demographic, clinical, and lifestyle variables alone were moderately predictive of BC, and similar across all six prediction procedures, with a maximum CV-AUC of 0.572 from the SL (with screens) procedure. Performance for predicting CRC based on addition of risk covariates was similar (absolute difference 0.014), at a maximum CV-AUC of 0.558. In the right panel, we overlay the prediction performance using the metabolites and the prediction performance using all covariates and the metabolites. Metabolites alone were not good predictors of BC (CV-AUCs at or below 0.5), and the prediction performance of risk covariates plus metabolites was similar to that of the covariates alone, without performance improvement. In contrast, for CRC, prediction performance was improved for all six algorithms when using metabolites alone or metabolites plus risk covariates, with a maximum CV-AUC of 0.593 based on metabolites alone and 0.608 combining metabolites and clinical variables from the SL algorithm with no variable selection.
Results for the remaining platforms also tended to be consistent across the six algorithms. GC-MS and Lipidyzer-detected metabolites provided little to no additional prediction performance for either BC or CRC over the clinical variables. NMR-detected metabolites tended not to increase prediction performance for BC over the clinical variables; for prediction of CRC, these metabolites had a performance comparable to the risk covariates and also led to a slight increase in prediction performance when added to the risk covariates. The full set of results is presented in Supplemental Tables S2 and S3.
In Table 2, we present the selected risk covariates and metabolites for predicting both BC and CRC, as well as the estimated proportion of explained variation (PEV) for each metabolite. Several risk covariates shown to be predictive of BC (including Gail 5-year risk score) or CRC (at least one colonoscopy or colon polyp removed) were selected, lending support to our selection results. For individual metabolites, the PEVs were similar and in the range of 0.21 to 0.25, suggesting that many of the metabolites do not differentiate prediction performance alone, but can result in differential prediction performance together. Glycerate explained the most variability (PEV = 0.25) in CRC (adjusted for all risk factor variables). Metabolites selected for CRC, along with function, are given in Table 3.

Table 2 footnotes:
1 All variables listed below were selected by either the lasso or SL selection procedure in the corresponding platform-specific analysis. The base set of covariates (forced into all models) were age, WHI enrollment date, and self-reported race or ethnicity. Selected covariates for breast cancer: education level, income, alcohol intake, current smoking, total folate intake, Gail 5-year risk, family history of CRC, prior removal of ≤1 colon polyp, currently using estrogen, waist circumference, BMI (kg/m2), randomized to CaD or HT, date of sample draw visit. Selected covariates for colorectal cancer: age, self-reported race/ethnicity, education, income, alcohol intake, total folate intake, waist circumference, BMI (kg/m2), ≥1 colonoscopy, prior removal of ≥1 colon polyp, sample draw visit, randomized to DM control arm.
2 The proportion of explained variation (PEV) was estimated by first creating a dataset with only the selected metabolites and covariates for each outcome. Then, we used cross-validation to fit a logistic regression on each set of training data and predict on the test data; the PEV is defined as the correlation between the observed outcomes and the predictions.
3 Positive direction of the estimated coefficient from the multiple logistic regression model implies higher odds of being a case; negative direction implies lower odds of being a case.
4 In CE, X:A; FFA, X:A; DAG, X:A/Y:B; HCER, X:A; PC, X:A/Y:B; PE, X:A/Y:B; and LPC, X:A, X and Y indicate the number of carbon atoms and A and B indicate the number of double bonds in the fatty acid chains. Lipids without both A and B represent the sum of all fatty acids in that class. For example, DAG (14:1) equals the sum of all diacylglycerol, i.e., summing all DAG (x/14:1) and DAG (14:1/x). In TAG, X:A/Y:B, X indicates the total number of carbon atoms and A indicates the total number of double bonds in the three fatty acid chains, and Y indicates the number of carbon atoms and B indicates the number of double bonds in one of the fatty acid chains.
5 Values represent mass at retention time of the unknown metabolites, e.g., 73 12.10 indicates a mass of 73 at 12.10 min.
To assess the effect of performing variable selection and estimating prediction performance based on each platform separately, we performed a sensitivity analysis. In this analysis, we pooled the metabolites from each platform together after imputation but before the variable selection and prediction performance analysis. Here, we only fit the lasso + GLM algorithm since we observed similar performance across procedures in the primary analysis. Estimated prediction performance based on the pooled set of metabolites was similar to that observed for the lasso + GLM algorithm for the LC-MS metabolites: CV-AUCs of 0.554 (BC) and 0.58 (CRC). In Table 4, we present the set of selected risk covariates and metabolites, along with the estimated PEV. Many metabolites selected from this sensitivity analysis were also selected in the platform-specific analyses, more so for CRC than BC, which had few metabolites selected in the pooled analysis. The estimated PEV was also similar for most metabolites, with glycerate, which was positively associated with CRC, again providing the largest PEV.

Table 4 footnotes:
2 The proportion of explained variation (PEV) was estimated by first creating a dataset with only the selected metabolites and covariates for each outcome. Then, we used cross-validation to fit a logistic regression on each set of training data and predict on the test data; the PEV is defined as the correlation between the observed outcomes and the predictions.
3 Positive direction of the estimated coefficient from the multiple logistic regression model implies higher odds of being a case; negative direction implies lower odds of being a case.
4 FFA: free fatty acid; FA: fatty acid; TAG: triacylglyceride; PC: phosphatidyl choline; CE: cholesterol ester.
5 Values represent mass at retention time of the unknown metabolites, e.g., 73 12.10 indicates a mass of 73 at 12.10 min.

In post hoc analyses excluding women using HT, prediction performance in the subpopulation was modestly improved for CRC compared to the full population (CV-AUCs ranged from 0.622 to 0.637, versus 0.589 to 0.608 in the full population). Prediction performance for BC was slightly decreased in the subpopulation compared to the full population (CV-AUCs ranged from 0.535 to 0.554, versus 0.559 to 0.563 in the full population). Several LC-MS metabolites were selected in the subgroup analysis that were also selected in the whole-cohort analysis: cysteinyl glycine, N-isovaleryl glycine, and valine (BC); adenosine, leucic acid, glycerate, hydroxyproline, and 2-hydroxyglutarate (CRC). Additional metabolites selected in the subgroup analysis for CRC included adipic and 3-hydroxybutyric acids, involved in fatty acid metabolism; betaine, a marker of whole grains; glucuronate, found in gums and fermented beverages; and trigonelline, found in coffee.

Discussion
In this well-characterized cohort of post-menopausal women, we evaluated whether serum and urine metabolites from multiple platforms predicted BC and CRC as well as, or improved prediction beyond, well-established risk factors. For BC, risk covariates alone provided moderate predictive power, in the range of CV-AUC 0.57, with metabolites contributing no improvement. In fact, the highest CV-AUC using both risk covariates and metabolites was <0.56. Conversely, for CRC, the addition of metabolites, particularly serum aqueous species from the LC-MS platform, modestly improved prediction performance over risk covariates alone, from a CV-AUC of 0.54 to 0.61. This improvement was consistent across various prediction algorithms and metabolite platforms and held whether we performed variable selection within each platform separately or after pooling all metabolites together.
While metabolites did not provide additional prediction power for BC in our analyses, of those that were selected, a large proportion were lipids (22 of 43 named metabolites) or metabolites related to lipid metabolism. Associations between lipids and BC align with accumulating evidence associating excess adiposity, especially after menopause, with increased BC risk [2,5,44]. Obesity is associated with systemic inflammation, insulin resistance, altered steroid metabolism, and other metabolic derangements, all factors mechanistically linked to carcinogenesis [44][45][46]. However, few lipids were selected in sensitivity analyses where all metabolites were pooled across all platforms. Moreover, variables such as alcohol intake, waist circumference and BMI, current estrogen use, and Gail 5-year risk score were superior to metabolites in predicting BC.
In contrast, mainly aqueous and urinary metabolites were selected in predictive models for CRC, with few if any lipids. Twenty-two different named metabolites that contributed to CRC prediction were selected across the various prediction algorithms, the majority consistently across procedures. Several were related to energy metabolism, including adenosine, 2-hydroxyglutarate, and glycerate, with additional metabolites, N-acetyl-glycine, taurine, threonine, and lysophosphatidyl choline [LPC (FA20:3)], related to fatty acid metabolism in particular. These metabolites suggest altered metabolism, a hallmark of cancer. An even larger proportion of metabolites were involved in protein metabolism. Histidine, N-acetyl-glutamate, and allantoin were inversely associated with CRC. As has been previously reported in two other large prospective cohorts, higher circulating histidine, even up to 10 years prior to diagnosis, was associated with reduced risk of CRC [47]. N-acetyl-glutamate functions as a cofactor in ureagenesis, converting nitrogen from protein to urea acids such as allantoin [48]. Other amino acids and peptides were positively associated with CRC, potentially reflecting higher protein intakes. For example, trimethylamine N-oxide (TMAO) is elevated in blood after consumption of fish or of foods rich in choline and carnitine, such as red meat, eggs, and dairy products; these compounds can be converted to trimethylamine by gut microbes [49] and subsequently to TMAO by hepatic enzymes. This metabolite has also been previously linked with CRC [50,51]. The branched-chain amino acid isoleucine; leucic acid, a metabolite of leucine; hydroxyproline; methylguanine; and N-acetyl-neuraminate are all metabolites derived from animal protein. Higher intakes of animal protein, especially red and processed meat, are a known risk factor for CRC [52,53]. Adenosine, in addition to its role in ATP generation, is involved, along with the purine 7-methylguanine and the pyrimidine uracil, in DNA and RNA synthesis and in signaling. Lastly, myo-inositol is a biomarker found in whole grains. These metabolites as a group are highly representative of dietary exposures and support conclusions from the WCRF/AICR Third Expert Report indicating probable or convincing evidence for several dietary components contributing to CRC risk, i.e., red and processed meat, heme-containing foods in general, and low intake of fruits and non-starchy vegetables, but less so for BC risk, with strong evidence limited to alcohol intake [2].
While metabolite biomarkers have historically been used for cancer detection, studies are now using pre-diagnostic metabolites to examine environmental exposure and cancer risk. To date, 10 studies have focused on BC, with varying metabolite signatures [19,[54][55][56][57][58][59][60][61][62][63]. These studies, including information on population, sample size, follow-up time, and analytic platforms, have been extensively detailed in His et al. [19]. Most studies report significant associations with one or more metabolites, most commonly specific lipid species and amino acids. However, as has been noted, except for steroids there is little metabolite overlap across studies, including our own, making comparisons very difficult [19,60].
Similar work in large prospective cohorts using pre-diagnostic samples has been conducted in the context of CRC. A case-control study nested within two Shanghai cohorts identified several serum phosphatidylcholines and phosphatidylethanolamines that were inversely associated with CRC, suggesting that dysregulation of glycerophospholipids may contribute to CRC [64]. In the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial, an inverse association was reported between leucyl-leucine, a metabolite representing incomplete protein catabolism, and CRC risk after eight years of follow-up, although the association did not remain significant after adjusting for multiple comparisons [65]. In the European Prospective Investigation into Cancer and Nutrition, concentrations of two lipid species, hydroxysphingomyelin C22:2 and acyl-alkyl-phosphatidylcholine C34:3, were significantly inversely associated with CRC risk using a targeted metabolomics approach [66], with nine additional features, including two potentially annotated ceramides, reported in a follow-up analysis using untargeted lipidomics [67]. Investigators also identified a metabolite signature of greater body size, i.e., BMI, waist circumference, and waist-to-hip ratio, that was associated with CRC. These metabolites were mainly related to amino acids and lipids, and some were reversible with weight loss in a small subset of participants in a weight-loss pilot intervention [68]. In a multicenter study, a panel of 17 urine metabolites separated CRC patients from controls, with two providing good prediction in post hoc analyses (AUC of 0.86): diacetylspermine and kynurenine [69]. Another untargeted metabolomics approach was employed in the Cancer Prevention Study II Nutrition Cohort, where six named metabolites were related to CRC risk: guanidinoacetate, 2′-O-methylcytidine, vanillylmandelate, bilirubin (E,E), N-palmitoylglycine, and 3-methylxanthine [70]. Finally, using untargeted metabolomics on plasma obtained up to 26 years prior to diagnosis in the Northern Sweden Health and Disease Study, seven features were found to be associated with CRC risk, two of which were identified as pyroglutamic acid, an amino acid derivative, and hydroxytigecycline, an antibiotic metabolite [71]. In both of the latter two studies, efforts to replicate previous findings in prospective cohorts were unsuccessful, except for 3-hydroxybutyric acid [71]. While the majority of metabolites selected in our predictive analyses for CRC were also novel, histidine, TMAO, and hydroxyproline were previously reported in other studies [47,50,51,69].
The strengths of this study include a well-characterized cohort, the novel use of four different metabolomics platforms, inclusion of both pre-diagnostic serum and urine, and several variable selection and prediction algorithms, which all yielded similar results, lending confidence to our findings. Further, our results are comparable to those reported by Wang et al. [72], using an alternate statistical approach in this population [73]. Our study population comprised post-menopausal women, and the findings may not be generalizable to other populations. We did not adjust for medication use, which may alter metabolite concentrations [74]; however, sensitivity analyses excluding women randomized to HT or women reporting use of HT at baseline yielded similar prediction performance results for both cancer outcomes. Other limitations include those commonly associated with metabolomic studies. A single data point may not be sufficient to adequately capture environmental or dietary exposures, and different metabolites measured may represent either exogenous exposures or alterations in endogenous processes. Further, the metabolites we detected do not cover all pathways. Nonetheless, we identified a panel of metabolites that were associated with risk of CRC and are biologically plausible.
In summary, we report a panel of metabolites associated with CRC risk in a subset derived from a large prospective cohort of post-menopausal women. That we identified
Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Fred Hutchinson Cancer Center (protocol code #6299 approved 31 December 2006 with continuation approval obtained 29 November 2023).
Informed Consent Statement: Written informed consent was obtained from all participants involved in the study.
Data Availability Statement: Data, codebook, and analytic code used in this report may be accessed in a collaborative mode as described on the Women's Health Initiative website (www.whi.org).

Figure 1. Cross-validated area under the receiver operating characteristic curve (CV-AUC) averaged over 100 Monte Carlo replications of each variable selection + regression procedure for predicting breast cancer and colorectal cancer, with 95% confidence intervals (CIs). (Panel A) Analysis using only the covariates. (Panel B) Analyses using covariates only (circles), LC-MS metabolites + base set of covariates (triangles), and LC-MS metabolites + all covariates (squares). Point estimates of CV-AUC are provided at the bottom of each panel (on the right-hand panel, the point estimates correspond to the LC-MS metabolites + all covariates analysis).
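The CV-AUC summarized in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' analytic code: the stratified 5-fold split, the normal-approximation interval across replications (which reflects only Monte Carlo variability), and the helper names (`auc`, `stratified_folds`, `cv_auc`, `monte_carlo_cv_auc`, `centroid_fit`) are all assumptions made for illustration, and the toy scorer stands in for the logistic regression and Super Learner procedures used in the study.

```python
import random
import statistics


def auc(scores, labels):
    """Rank-based AUC: chance that a random case outscores a random control."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a case and a control count as half a win.
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def stratified_folds(labels, k, rng):
    """Shuffle cases and controls separately, then deal them into k folds."""
    folds = [[] for _ in range(k)]
    for group in (0, 1):
        members = [i for i, y in enumerate(labels) if y == group]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def cv_auc(x, y, fit, k, rng):
    """One replication: train on k-1 folds, score the held-out fold, average."""
    fold_aucs = []
    for held_out in stratified_folds(y, k, rng):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        fold_aucs.append(auc([predict(x[i]) for i in held_out],
                             [y[i] for i in held_out]))
    return statistics.mean(fold_aucs)


def monte_carlo_cv_auc(x, y, fit, reps=100, k=5, seed=0):
    """Average CV-AUC over repeated random splits, with an approximate 95% CI."""
    rng = random.Random(seed)
    values = [cv_auc(x, y, fit, k, rng) for _ in range(reps)]
    mean = statistics.mean(values)
    # Assumption: normal approximation across replications.
    half_width = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return mean, (mean - half_width, mean + half_width)


def centroid_fit(train_x, train_y):
    """Toy scorer (hypothetical): weights are case-minus-control feature means."""
    cases = [row for row, y in zip(train_x, train_y) if y == 1]
    controls = [row for row, y in zip(train_x, train_y) if y == 0]
    weights = [statistics.mean(a) - statistics.mean(b)
               for a, b in zip(zip(*cases), zip(*controls))]
    return lambda row: sum(w * v for w, v in zip(weights, row))


if __name__ == "__main__":
    # Simulated data: one weakly informative feature, 100 cases and 100 controls.
    data_rng = random.Random(1)
    y = [i % 2 for i in range(200)]
    x = [[data_rng.gauss(0.3 * label, 1.0)] for label in y]
    print(monte_carlo_cv_auc(x, y, centroid_fit, reps=20))
```

Because the model is refit inside every training fold, the held-out AUC is not inflated by having seen the test observations; any variable selection step would need to sit inside `fit` for the same reason.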


Table 1. Demographic, clinical, and lifestyle characteristics of the breast (BC) and colorectal cancer (CRC) cases and controls in the WHI Bone Mineral Density subcohort ¹.

Table 2. Metabolites selected with proportion of explained variation for predicting breast cancer and colorectal cancer ¹.

Table 3. Metabolite predictors of colorectal cancer derived across all platforms and prediction algorithms, and sensitivity analyses, along with class and function ¹.

Table 4. Metabolites selected for predicting breast cancer and colorectal cancer in pooled analysis ¹.

Table 4. Cont.
All variables listed below were selected using the LASSO for variable selection with all four platforms pooled together prior to variable selection. The base set of covariates (forced into all models) are age, WHI enrollment date, and self-reported race or ethnicity. Selected covariates for breast cancer: age, self-reported race/ethnicity, income, Gail 5-year risk score, waist circumference, sample draw visit, randomized to the CaD control arm. Selected covariates for colorectal cancer: age, self-reported race/ethnicity, income, education, waist circumference, sample draw visit.
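The idea of forcing a base set of covariates into every model while letting the LASSO select among the remaining variables can be sketched as follows. This is an illustrative stand-in rather than the study's implementation: it fits an L1-penalized logistic regression by proximal gradient descent and simply exempts the forced columns from the penalty. The function names, step size, penalty value, and toy data are all assumptions.

```python
import math


def soft_threshold(value, threshold):
    """Shrink a coefficient toward zero; small values become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(x, y, penalised, lam=0.1, step=0.5, iters=500):
    """L1-penalized logistic regression via proximal gradient descent.

    penalised[j] is False for covariates forced into the model (no shrinkage)
    and True for candidate variables subject to LASSO selection.
    """
    n, p = len(x), len(x[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(iters):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(x, y):
            eta = intercept + sum(b * v for b, v in zip(beta, row))
            resid = 1.0 / (1.0 + math.exp(-eta)) - label
            grad0 += resid / n
            for j, v in enumerate(row):
                grad[j] += resid * v / n
        intercept -= step * grad0  # the intercept is never penalized
        for j in range(p):
            updated = beta[j] - step * grad[j]
            beta[j] = soft_threshold(updated, step * lam) if penalised[j] else updated
    return intercept, beta


# Toy data (hypothetical): column 0 is a forced covariate coded -1/+1,
# column 1 is a candidate metabolite scaled to [-1, 1].
toy_x = [[1, 0.2], [1, -0.5], [1, 0.9], [-1, -0.1],
         [-1, 0.4], [-1, -0.8], [-1, 0.3], [1, 0.0]]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    for lam in (0.0, 0.05, 2.0):
        print(lam, fit_lasso_logistic(toy_x, toy_y, [False, True], lam=lam))
```

With a sufficiently large penalty, the candidate metabolite coefficient is set exactly to zero while the forced covariate keeps its unshrunk estimate, which is the behavior described for the base covariates above. Standard LASSO software typically exposes the same idea through per-variable penalty weights.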