Highly accurate diagnosis of pancreatic cancer by integrative modeling using gut microbiome and exposome data

Summary The noninvasive detection of pancreatic ductal adenocarcinoma (PDAC) remains an immense challenge. In this study, we propose a robust, accurate, and noninvasive classifier, Multi-Omics Co-training Graph Convolutional Networks (MOCO-GCN). It achieved high accuracy (0.9 ± 0.06), F1 score (0.9 ± 0.07), and AUROC (0.89 ± 0.08), surpassing contemporary approaches. The performance of the model was validated on an external cohort of German PDAC patients. Additionally, mediation analysis suggested that the exposome may affect PDAC development through its complex interplay with the gut microbiome. For example, Fusobacterium hwasookii nucleatum, known for its ability to induce inflammatory responses, may serve as a mediator of the impact of rheumatoid arthritis on PDAC. Overall, our study sheds light on how the exposome and microbiome in concert could contribute to PDAC development, and enables PDAC diagnosis with high fidelity and interpretability.


INTRODUCTION
Pancreatic ductal adenocarcinoma (PDAC) is the fourth leading cause of cancer-related death. 1,2 It ranks last among all cancers in terms of prognosis, and only about 4% of patients live five years after diagnosis, as it often presents at an advanced stage. 3,4 Recent studies have explored PDAC biomarkers in tumor, 5,6 blood, 7 pancreatic tissue, 8 urine, 9 and serum. 10 Currently, the only FDA-approved biomarker for pancreatic cancer is carbohydrate antigen (CA) 19-9; however, its specificity is limited by a high false positive rate, as its concentration may increase in benign conditions such as gallstones and bile duct obstruction. 11,12 Consequently, a noninvasive, robust, and accurate screening and diagnostic tool for PDAC is still urgently needed.
Numerous studies have explored links between PDAC and the oral [13][14][15] or fecal microbiome. 16,17 Nagata et al. conducted a multinational study and predicted PDAC using 30 gut and 18 oral microbial species, achieving area under the receiver operating characteristic curve (AUROC) values of 0.78-0.82. 16 Kartal et al. proposed a fecal metagenomic classifier based on 27 gut microbial species that identified PDAC with high accuracy (0.84 AUROC), validated the classifier in an independent German cohort (0.83 AUROC), and confirmed its specificity in 25 publicly available studies. 17 Additionally, Half et al. 18 and Ren et al. 19 employed random forest models to predict PDAC, with AUROCs of 0.825 and 0.842. These studies have shown that microbiota-based screening for the detection of PDAC is feasible.
The exposome is the comprehensive collection of all exposures, including smoking, alcohol, diet, exercise, other lifestyle factors, medication, host diseases, and more. 20 Risk factors associated with the development of PDAC include alcohol, 21 advancing age, 22,23 smoking, 24 family history, 25,26 diabetes, 27,28 and obesity. 29 Changes in the gut microbiome can both affect and mediate the effects of the exposome on the risk of PDAC. For example, dysbiosis of the microbiota has been linked to an increased incidence of obesity, 30 with signaling pathways leading to NF-κB activation and contributing to inflammation. 31 Conversely, exposures that affect pancreatic tumor evolution could also affect the gut microbiome. Phillip et al. suggested that long-term alcohol consumption could induce dysbiosis of Firmicutes and Bacteroidetes, which are enriched or depleted in PDAC. 32 Meanwhile, physical activity may protect against PDAC by increasing the abundance of SCFA-producing bacteria. 33 Therefore, the microbiome and exposome could in concert influence the metabolic and immune pathways of PDAC.
Pancreatic cancer used to be considered a localized disease because it arises in the pancreas, an organ in the abdomen that lies behind the lower part of the stomach. 34 However, the current understanding of PDAC is that it is not solely a tumor microenvironment issue but also a systemic and environmental disease that involves both the microbiome and exposome. The possible pathways of microbiome and exposome interactions that influence pancreatic carcinogenesis are shown in Figure S1A. Treatment efficiency and adverse effects can differ vastly between individuals due to differences in age, sex, and environmental factors. The aim of precision medicine is thus to design the

Impact of exposome on microbial composition in pancreatic cancer patients
To identify the impact of the exposome on microbial composition in pancreatic cancer, we used gut microbiome data from a Spanish cohort, 17 which included 57 PDAC patients and 50 controls, along with 24 detailed host variables (Figure S1C). Each metadata variable was encoded as binary, with a positive and a negative class. The 24 host metadata variables were categorized into the following groups: subject characteristics, lifestyle factors, oral health, medication use, and host disease. Most of these variables are known risk factors for pancreatic cancer. Our aim was then to determine whether the distribution of microbial composition among the participants differed with variations in the host variables.
To achieve this, we created two cohorts, a confounder-unmatched cohort and a matched cohort, based on whether the confounding variables matched. For the matched cohort, we reselected PDAC patients in a pairwise manner by identifying, for each patient, a control participant whose host metadata variables differed in at most five of the 24 variables. In the same way, for the unmatched cohort, patients and controls were chosen to be as unmatched as possible for the confounding variables. The matched and unmatched cohorts were each composed of 50 samples (25 cases and 25 controls). The specific cohorts are shown in Table S4. We then conducted a series of statistical analyses to explore whether matching cases and controls for confounding variables could reduce the observed differences in the microbiota. Our analysis of beta diversity revealed a significant difference in microbiota between PDAC patients and controls in the unmatched cohort (PERMANOVA, p = 0.001), but not in the matched cohort (p = 0.249) (Figure 1A). Similarly, the alpha diversity difference was significant in the unmatched cohort (Wilcoxon test, p = 0.002) but not in the matched cohort (p = 0.099) (Figure 1B). We also used the Wilcoxon test to identify taxa enriched in PDAC or controls (Figures 1C and 1D). In the unmatched cohort, 5 taxa had a p value below 0.001, 31 taxa a p value below 0.01, and 88 taxa a p value below 0.05, whereas in the matched cohort 12 taxa had a p value below 0.01 and 83 taxa a p value below 0.05 (Figure S2). Additionally, we calculated the area under the receiver operating characteristic curve (AUROC) for distinguishing PDAC from controls, before and after matching variables, using a random forest with 25-repeat stratified 4-fold cross-validation. According to the AUROC values, classification performance differed markedly between the matched and unmatched cohorts (Figure 1E). Overall, our results indicated that the exposome played a significant role in shaping the composition of the gut microbiome in PDAC patients.
We used the aforementioned random forest framework (Figure 2A) to predict each binary variable in turn from the microbiome data. The resulting AUROC values (Figure 2B) revealed significant associations between the microbiota and certain host variables, including jaundice, alcohol, acid regurgitation medication use, family history of PDAC, corticosteroid medication use, diabetes, and country (mean AUROC > 0.6). In addition, we observed 21 significant associations (FDR Spearman < 0.05) between 21 species and 6 exposures (Figure 2C and Table S1). For instance, the consumption of probiotics showed a significant positive correlation with the abundance of Lactobacillus species and Clostridium species. The cellular components and metabolites of these species play a crucial role in probiotic functions, primarily by activating gut epithelial cells and improving the integrity of the intestinal barrier. 36
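The false discovery rate (FDR) threshold applied to the Spearman correlations above relies on the Benjamini-Hochberg correction named in the legend of Figure 2. As a minimal sketch of that correction, and not the authors' actual pipeline, the function below adjusts a list of raw p values; the function name and the input values are hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that each adjusted value
    # is never larger than the one ranked above it (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for five species-exposure pairs.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
fdr = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in fdr]
print(fdr)
print(significant)
```

In this toy input, the third and fourth pairs have raw p values below 0.05 but adjusted values above it, which is why an FDR cutoff retains fewer associations than an uncorrected cutoff would.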

Exposome-microbiome mediation effects in PDAC
To explore the connections between the exposome, the gut microbiome, and PDAC, we conducted a bi-directional mediation analysis using 23 exposures and 9 species that exhibited significant associations with PDAC (FDR Spearman < 0.05). We identified a total of 29 mediating linkages (FDR mediation < 0.05 and FDR inverse-mediation > 0.05), with 23 involving the exposome acting on PDAC through the microbiome, and 6 involving the microbiome acting on PDAC through the exposome (Figures 3A and 3B). Most of these linkages were related to the impact of lifestyle factors and host disease on PDAC through the microbiome. For example, we observed that the abundance of Alloscardovia omnicolens can mediate the effect of diabetes on the risk of PDAC (Figure 3C). Rheumatoid arthritis (RA) is a chronic and systemic disease primarily characterized by inflammatory synovitis, the underlying cause of which remains unknown. 37 Fusobacterium hwasookii nucleatum has the ability to activate the immune system and trigger inflammatory responses. 38 In patients with RA, immune dysregulation and the progression of joint inflammation can potentially influence the composition of the microbial community, thereby contributing to an increase in Fusobacterium hwasookii nucleatum abundance. We observed that Fusobacterium hwasookii nucleatum can mediate the impact of rheumatoid arthritis on PDAC (Figure 3D; P mediation = 2.2 × 10⁻¹⁶). Previously, Motasem et al. conducted a comprehensive nationwide study, which proposed that RA can manifest with extra-articular involvement in multiple organs, including the pancreas. Their findings revealed an elevated risk of pancreatic cancer among patients with RA, and those with a history of RA often exhibited a poorer prognosis. 39

The exposome is used in concert with gut microbiome data to predict pancreatic cancer
To investigate the potential of combining exposome and gut microbiome data in predicting pancreatic cancer, we employed a framework called MOCO-GCN, which integrates predictions from both sources by leveraging their potential influence on the disease course and outcomes. MOCO-GCN is composed of two-view co-training Graph Convolutional Networks (GCNs) and a View Correlation Discovery Network (VCDN) to classify PDAC and controls. The framework of MOCO-GCN is shown in Figure 4A. Specifically, the co-training GCNs predict initial labels from the exposome and microbiome data by distilling knowledge from each other, while the VCDN integrates the initial labels by exploring the latent associations in the higher-level label space across the exposome and microbiome data. 40 To evaluate the performance of MOCO-GCN, we performed 4-fold cross-validation and assessed the final model performance using the average accuracy (ACC), F1 score (F1), AUROC, sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC). We trained our model on 23 exposures and 125 species selected through differential abundance analyses (Wilcoxon, p < 0.05; Table S2) and achieved excellent performance, with 0.9 ± 0.06 ACC, 0.9 ± 0.07 F1, 0.89 ± 0.08 AUROC (Figure 4B), 0.86 ± 0.13 Sn, 0.93 ± 0.10 Sp, and 0.80 ± 0.11 MCC. Additionally, we conducted a sensitivity analysis that focused on the parameter k, which represents the average number of edges retained per node. Figure 4C illustrates the performance of MOCO-GCN as k varies from 2 to 10, demonstrating the stability of our model. We compared the performance of our model with several traditional machine learning methods, including Support vector machine classifier (SVM), Linear regression trained with L1 regularization (Lasso), Random Forest classifier (RF), and Gradient tree boosting-based classifier (XGBoost), and with other multi-omics classification methods: MOGONET (Multi-Omics Graph Convolutional NETworks) 41 and NN_VCDN (fully connected NN with the same layers as the GCN in MOGONET). The traditional machine learning methods were trained with the direct concatenation of the 125 species and 23 exposures as input. According to the classification results (Figure 4D), our model outperformed the previous methods and was better able to predict pancreatic cancer by integrating exposome and microbiome data.
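The integration step described above, in which the VCDN combines the initial labels of the two views in the label space, can be illustrated with a small worked example. The sketch below assumes the construction commonly described for VCDN in MOGONET, where the class-probability vectors of the views are multiplied entry by entry into a cross-omics discovery tensor and then flattened as input to a fully connected network. Whether MOCO-GCN follows exactly this construction is an assumption, and the function name and probability values are hypothetical.

```python
def cross_view_tensor(p_exposome, p_microbiome):
    """Flattened outer product of two class-probability vectors.

    Entry (a, b) is the product of the exposome view's probability for
    class a and the microbiome view's probability for class b.
    """
    return [a * b for a in p_exposome for b in p_microbiome]


# Hypothetical initial predictions for one sample: [P(control), P(PDAC)].
p_exp = [0.3, 0.7]   # from the exposome-specific GCN
p_mic = [0.1, 0.9]   # from the microbiome-specific GCN

features = cross_view_tensor(p_exp, p_mic)
print(features)  # four entries, one per (exposome class, microbiome class) pair
```

With two views and two classes the tensor has four entries. The last entry is large only when both views favor PDAC, so a network fed these values can learn from agreement and disagreement between the views rather than from each prediction alone.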
According to the feature importance calculated by our model, the top 45 features, consisting of three exposures and 42 species, are shown as a heatmap in Figure S3. Seventeen bacterial species were increased in the PDAC patients (n = 57) in comparison to the controls (n = 50), whereas 25 bacterial species were decreased. Among the 42 important species, 26 (61.9%) belonged to the Firmicutes phylum, 7 (16.7%) belonged to the Fusobacteria phylum, 3 (7.1%) belonged to the CFB group bacteria phylum, 1 (2.4%) belonged to the Actinobacteria phylum, 1 (2.4%) belonged to the Basidiomycete fungi phylum, 3 (7.1%) belonged to the High G + C Gram-positive bacteria class, and 1 (2.4%) belonged to the B-proteobacteria class. These results demonstrated the crucial role of the Firmicutes phylum in shaping the division between PDAC and controls. Species increased in the gut microbiomes of PDAC patients included Fusobacterium hwasookii nucleatum, Alloscardovia omnicolens, Veillonella spp. (Veillonella atypica and Veillonella parvula), and several unknown species in the phylum Firmicutes, while depleted species included several from the order Clostridiales, Bacteroides coprocola, Faecalibacterium prausnitzii, Bifidobacterium bifidum, and unknown Bacteroidales. Of note, our results were consistent with previous studies [16][17][18][19]42 for 27 of the 42 species investigated.

Validation on an external cohort and comparison with previous studies
To evaluate the specificity of the trained models for PDAC, we assessed the accuracy of predictions using a dataset from a German study. 17 This dataset consisted of 44 PDAC patients and 32 controls, with detailed information on 14 exposures. On the validation population from Germany, the MOCO-GCN model achieved an accuracy (ACC) of 0.89 ± 0.07, an F1 score of 0.91 ± 0.04, and an area under the receiver operating characteristic curve (AUROC) of 0.81 ± 0.19 (Figure 4B). To further validate the performance of our model, we collected studies conducted within the past five years [16][17][18][19]43 that investigated microbial prediction of pancreatic cancer. These studies primarily used traditional machine learning methods such as random forest and lasso regression. As illustrated in Figure 4E, our model, which incorporates exposome data, exhibited superior predictive capabilities compared to previous studies. These results collectively demonstrate the practicality and efficacy of MOCO-GCN in predicting pancreatic cancer by leveraging exposome and microbiome data.
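The threshold-based metrics reported for both cohorts (ACC, F1, Sn, Sp, and MCC) all follow from the four counts of a binary confusion matrix. The sketch below shows the standard definitions, with PDAC taken as the positive class. The counts are hypothetical and are not taken from the study, and the function name is only illustrative.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, F1, Sn, Sp, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0          # specificity
    acc = (tp + tn) / total if total else 0.0          # accuracy
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC stays informative when the two classes are imbalanced.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "F1": f1, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts for one test fold (not data from this study).
metrics = binary_metrics(tp=18, fp=5, tn=15, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Averaging each metric over the cross-validation folds and reporting its standard deviation gives summaries in the mean ± spread form used in the text.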

DISCUSSION
This study advances our understanding of the complex relationship between PDAC, the microbiome, and the exposome. Our findings provide compelling evidence for the influential role of the exposome in microbiome-related studies of pancreatic cancer. We not only accurately predicted PDAC by combining microbiome and exposome data, but also gained important insights into the species-level and exposure-level associations between these factors. First, our model demonstrated superior performance compared to other methods and previous studies, and achieved satisfactory results on an external cohort. This underscores the importance of comprehensively considering the microbiome and exposome in PDAC-related research. Second, we emphasize the pivotal role of the exposome in the interplay between PDAC and the microbiome, highlighting the need to account for the specificity of, and correlation between, these factors in future studies. Third, we argue that pancreatic cancer should be regarded not only as a localized disease, but as a systemic, environmental, and microenvironmental disease. Taken together, this study contributes to our understanding of the factors involved in PDAC development and underscores the importance of incorporating microbiome and exposome data in future research and clinical practice.

Limitations of the study
This study provided valuable insights into the role of the microbiome and exposome in PDAC. However, it has several limitations that warrant acknowledgment. First, the sample size was small, and comprehensive gut microbiome and exposome data are difficult to collect from a large population of pancreatic cancer patients; more extensive longitudinal data are needed to further elucidate the clinical translation and practical applications of this research. Second, the metadata variables were limited to binary values, which restricted our ability to perform a more detailed analysis of factors such as alcohol consumption and smoking. Moreover, due to the lack of available data, we were unable to analyze two crucial risk factors for PDAC, exercise and diet. Third, despite the strong performance of our predictive model, the inherent complexity of deep learning models limits their interpretability, posing challenges in understanding the factors driving predictions. Therefore, more comprehensive follow-up research and analysis are required to clarify the mechanisms underlying PDAC as a microenvironmental and systemic disease. Addressing these limitations can provide a more comprehensive understanding of the factors contributing to PDAC development, informing the development of more effective prevention and treatment strategies. While our study primarily focuses on clinical aspects, we recognize the need for additional research to address challenges related to clinical costs and to advance practical applications in real-world medicine. In conclusion, our research provides initial methodologies and evidence for PDAC diagnosis based on microbiome and exposome data, but further research based on larger-scale and more in-depth pancreatic cancer data is essential.

Figure 1. Variation in microbiota between PDAC and controls in the matched and unmatched cohorts due to confounding variables (A) Principal coordinates analysis (PCoA) plot of PDAC and controls in the confounder-unmatched and matched cohorts, with the PERMANOVA p value. The centroids for PDAC and controls are depicted by an outlined circle. Colors denote groups, with blue for controls and red for PDAC patients. (B) Alpha diversity, calculated as the Shannon index, comparing PDAC and controls in the unmatched and matched cohorts. Colors denote groups, with blue for controls and red for PDAC patients. Pairwise comparisons were performed using the Wilcoxon test. (C and D) Differential abundance analysis between PDAC and controls in the matched and unmatched cohorts, performed with the Wilcoxon test. The y axis is log10 (p values) and the x axis is the generalized fold change. Purple dots represent species that are significantly differentially abundant in either group, while black dots show nonsignificant species. (E) Random forest AUROC values for PDAC and controls before and after matching for confounding variables.

Figure 2. The random forest analysis framework and the significant associations between species and exposome (A) The random forest analysis framework. A 25-repeat stratified 4-fold cross-validation over 75/25 splits was used for each binary variable. (B) The area under the receiver operating characteristic curve (AUROC) for 23 binary lifestyle and disease variables. (C) Correlation network diagram showing the significant associations between 21 species and 6 exposures, calculated using the Spearman correlation coefficient. The FDR was calculated using the Benjamini-Hochberg correction. Red lines denote positive relationships, while blue lines denote negative relationships.

Figure 3. Mediation analysis identifies linkages between the microbiome, exposome, and pancreatic cancer (A) Parallel coordinates chart showing the 23 significant mediation effects (FDR < 0.05) of the exposome on PDAC through the microbiome. Shown are the exposome (left) and microbiome (right). The curved lines connecting the panels indicate the mediation effects. (B) Parallel coordinates chart showing the 6 significant mediation effects (FDR < 0.05) of the microbiome on PDAC through the exposome. Shown are the microbiome (left) and exposome (right). (C) Analysis of the effect of diabetes on PDAC as mediated by the abundance of Alloscardovia omnicolens. (D) Analysis of the effect of rheumatoid arthritis on PDAC as mediated by the abundance of Fusobacterium hwasookii nucleatum.

Figure 4. Illustration and performance of MOCO-GCN (A) The framework of MOCO-GCN. It combines a two-view co-training Graph Convolutional Networks (GCNs) module, which learns features of the different omics data by distilling knowledge from each other, and a View Correlation Discovery Network (VCDN) module, which integrates the multi-omics data. The species-GCN and exposome-GCN are each trained to perform class prediction using the corresponding sample similarity network generated from the exposome and microbiome data. The co-training allows them to distill knowledge from each other by adding their most confident unlabeled data into the training set. The cross-omics discovery tensor is calculated from the initial predictions of the omics-specific GCNs and forwarded to the VCDN for final prediction. MOCO-GCN is an end-to-end model and all networks are trained jointly. (B) The performance of MOCO-GCN on the Spanish and German cohorts, shown as receiver operating characteristic (ROC) curves with the 95% CI shaded in the corresponding color. (C) Performance of MOCO-GCN under different values of the hyperparameter k. (D) The comparison between MOCO-GCN and several traditional machine learning methods, including Support vector machine classifier (SVM), Linear regression trained with L1 regularization (Lasso), Random Forest classifier (RF), and Gradient tree boosting-based classifier (XGBoost), and other multi-omics classification methods: MOGONET (Multi-Omics Graph Convolutional NETworks) and NN_VCDN (fully connected NN with the same layers as the GCN in MOGONET). Data are represented as mean ± SEM. (E) The comparison between this study and previous studies that predict PDAC using gut microbiome alone.