Large-Scale Machine Learning Analysis Reveals DNA Methylation and Gene Expression Response Signatures for Gemcitabine-Treated Pancreatic Cancer

Background: Gemcitabine is a first-line chemotherapy for pancreatic adenocarcinoma (PAAD), but many PAAD patients do not respond to gemcitabine-containing treatments. Being able to predict such nonresponders would hence permit the undelayed administration of more promising treatments while sparing gemcitabine life-threatening side effects for those patients. Unfortunately, the few predictors of PAAD patient response to this drug are weak, none of them exploiting yet the power of machine learning (ML). Methods: Here, we applied ML to predict the response of PAAD patients to gemcitabine from the molecular profiles of their tumors. More concretely, we collected diverse molecular profiles of PAAD patient tumors along with the corresponding clinical data (gemcitabine responses and clinical features) from the Genomic Data Commons resource. From systematically combining 8 tumor profiles with 16 classification algorithms, each of the resulting 128 ML models was evaluated by multiple 10-fold cross-validations. Results: Only 7 of these 128 models were predictive, which underlines the importance of carrying out such a large-scale analysis to avoid missing the most predictive models. These were here random forest using 4 selected mRNAs [0.44 Matthews correlation coefficient (MCC), 0.785 receiver operating characteristic–area under the curve (ROC-AUC)] and XGBoost combining 12 DNA methylation probes (0.32 MCC, 0.697 ROC-AUC). By contrast, the hENT1 marker obtained much worse random-level performance (practically 0 MCC, 0.5 ROC-AUC). Despite not being trained to predict prognosis (overall and progression-free survival), these ML models were also able to anticipate this patient outcome. Conclusions: We release these promising ML models so that they can be evaluated prospectively on other gemcitabine-treated PAAD patients.


Introduction
Pancreatic cancer (PC) is the seventh leading cause of death among all cancers (495,773 new cases and 466,003 deaths worldwide in 2020 [1]).Surgical resection remains the only potential therapeutic option for patients with early-stage PC [2].However, PC is difficult to diagnose due to nonspecific initial symptoms.Consequently, most patients are at the advanced or metastatic stage when they are diagnosed, where the 5-year survival rate is about 3% [3].In addition, PC also lacks effective early detection methods, predictive biomarkers, and risk prediction models.
While several target therapies, chemotherapies, and immunotherapies have been approved for PC, gemcitabine has been widely used as the standard first-line treatment for PC.Gemcitabine is administered in combination with other cytotoxic agents, such as nab-paclitaxel, carboplatin, cisplatin, or oxaliplatin.Such gemcitabine-containing treatments generally offer enhanced patient survival and clinical benefit [4][5][6][7][8][9][10].The anti-proliferative properties of gemcitabine come from inhibiting DNA synthesis [11].Gemcitabine requires intracellular phosphorylation to be converted into its active form, difluorodeoxycytidine triphosphate (dFdCTP), which competes with deoxycytidine triphosphate (dCTP) for incorporation into DNA during DNA elongation, resulting in inhibition of DNA synthesis and inducing apoptosis of cancer cells [12].However, 54% of gemcitabine-treated PC patients can exhibit primary resistance [13].It is therefore important to find an effective way to anticipate which patients will benefit from gemcitabine.In this way, a more promising drug treatment could be administered without delay while avoiding gemcitabine's adverse effects for patients who are unlikely to respond.
Several studies examined the gemcitabine-related biomarkers involved in its metabolism [14,15].Most studies about predictive markers of gemcitabine treatment response focus on human equilibrative nucleoside transporter 1 (hENT1) expression, with high hENT1 expression being reported to show clinical benefit in selected treatment settings and survival endpoint [16][17][18][19].However, some studies did not observe these benefits [20,21].This might be due to study differences such as the way hENT1 expression is measured, different clinical endpoints, or different gemcitabine combos.This study will propose promising alternatives for this marker.
ML techniques have been proven to be powerful tools to generate predictive computational models from the typically highdimensional pharmaco-omics datasets.In particular, large-scale ML analysis, including algorithms enhancing supervised learning with feature selection such as optimal model complexity (OMC), has been shown to be advantageous [44,45,50].To our knowledge, despite its promise, no study has yet applied a broad panel of ML algorithms to classify PC patients' response to gemcitabine using various molecular profiles and clinical datasets.With this purpose, we retrospectively collected molecular profiles of The Cancer Genomic Atlas (TCGA)-Pancreatic Adenocarcinoma (PAAD) tumors from the U.S. National Cancer Institute (NCI) Genomic Data Commons (GDC) database (https://gdc.cancer.gov/).One of the objectives of GDC is generating standardized molecular profiles and clinical datasets across different cancer genome programs [51].We span 8 molecular profiles ranking from 1,339 miRNA isoforms (isomiR) to the 450,000 DNA methylation of CpG probes and 8 ML classification algorithms that employed all features and OMC feature selection, producing a total of 128 diverse models to predict PAAD patients' responses to gemcitabine.The goal of this study is to identify the robust ML model to classify gemcitabine responders and nonresponders in PAAD patients and to compare the predictive performances to gemcitabine single-gene marker.

Clinical data acquisition and preprocessing
The most common type of PC is PAAD [52].The open access molecular profiling and clinical data of 185 primary pancreatic tumor samples in TCGA-PAAD project was downloaded from GDC application programming interface (API) (version 25.0).Neuroendocrine tumors and other noncancerous samples were thus not considered.To ensure the consistency of drug name, misspelling and synonyms (e.g., gemzar and gemcitabine HCl) of drugs were standardized according to the NCI drug dictionary and DrugBank to gemcitabine.Then, drug response information for each patient was obtained by querying the clinical records.The patients with missing gemcitabine response or responding inconsistently to a drug were excluded to retain only valid records.The patients who received gemcitabine prior to tumor resection were removed as indicated by the time of tumor procurement and the start of treatment.After these curation steps, 70 gemcitabine-treated PAAD patients remained.The gemcitabine responses provided by TCGA-PAAD were binarized into 2 classes: responder and nonresponder.We defined a responder as a patient who had complete response (CR) or partial response (PR), and a nonresponder as a patient who had stable disease (SD) or progressive disease (PD).TCGA-PAAD reports best response over the entire treatment period, starting from the initiation of the treatment until the end of treatment, i.e., not only the response observed at last follow-up.

Molecular data acquisition and preprocessing
The curated gemcitabine responses were combined with the corresponding molecular profiles via TCGA patient IDs.Eight molecular profiles provided by GDC were considered in this analysis.The mRNA and miRNA expressions were generated from next-generation sequencing (NGS).mRNA(FPKM) was defined as mRNA in fragment per kilobase of exon per million mapped fragments (FPKM) unit, while mRNA(FPKM-UQ) was defined as mRNA in upper-quartile normalized FPKM unit.miRNA and isomiR were log 2 -transformed reads per million mapped reads (RPM).CpG (5′-cytosine-phosphateguanine-3′) is beta values of DNA methylation level at known CpG sites using about 450,000 probes from Illumina Human Methylation 450 BeadChip.CGI is the average methylation beta of all probes' values at known CpG sites corresponding to their CpG island.CNV(mean) is the average of copy number variations (CNVs) across segmented DNA corresponding to their gene using CNTools (version 3.8) of R package (version 3.5.1).CNV(median) is determined in the same way as CNV(mean), except the median is calculated.Those features missing any value across patients were removed.Thus, 8 datasets were generated, where each dataset was served as a set of features and subjected to ML classification algorithms to make the predictions.Note that we consider slightly different profile normalization to elucidate the impact of these differences on ML model performance.

Data preparation for ML
A computational framework was developed to estimate the predictive ability of molecular profiles on PAAD patients' response to gemcitabine as shown in Fig. 1.In brief, each dataset was randomly split into a training and testing set using stratified K-fold, for K values of 5, 10, and 70.Eight ML classification algorithms, including classification and regression tree (CART), random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LGBM), logistic regression (LR), support vector classifier (SVC) with linear kernel, SVC with the radial basis function (RBF) kernel, and k-nearest neighbors (KNN) algorithms, employed all features and OMC feature selection, producing a total of 128 models that were used to predict patients' responses to gemcitabine.We performed 5 K-fold cross-validation (CV) repetitions, each with different random seed to assess the variability of resulting ML models.The median of evaluation metrics across 5 repetitions was reported.All analyses of this study were performed using Python version 3.7.3(https://www.python.org/)and scikit-learn version 0.24.2 (https://scikit-learn.org).

Building all-features classification models
Stratified K-fold standard CV runs were performed to estimate the predictive performance of ML algorithms for classification using all available features in each molecular dataset.After partitioning of the samples into K-folds, one of the K-folds was used as the test set for testing the model trained on the remaining K − 1 folds.The CV process was then repeated K times with each one of the K-folds used sequentially once in the test set.CV ensures that each patient has exactly one out-of-sample prediction regardless of the value of K. Model performance was not calculated on the test set of every fold and then averaged because each of the test sets would be too small and thus harm comparison to CV with other K values.As it is common with low-sample scenarios, we used instead a merged CV, where the 70 out-of-sample predictions were merged from all folds and Matthews correlation coefficient (MCC) was calculated once with all 70 samples.

Building OMC classification models
The OMC models [45] were implemented in this study to overcome the high dimensionality problem, in this case, the number of features is larger than the number of patients.Stratified K-fold nested CV was performed to estimate how well classification ML algorithms perform using only the most relevant features to gemcitabine responses.In brief, the analysis of variance (ANOVA) test was used to calculate the P value of each feature, with a low P value indicating high discriminative power in terms of gemcitabine response.Then, classification algorithms were trained on only the considered subset of features (the top 2 to n/2 subset of features, where n is the number of samples).Among all n/2 trained models in the inner loop of the nested CV, the best model that achieved the highest MCC was selected and used to predict samples in its outer loop.The resulting 70 out-of-sample predictions were also merged from all folds, and each model performance metric was calculated once with all 70 samples.

Building a model based on hENT1 expression
hENT1 has been identified as a prognostic molecular marker in PC patients treated with gemcitabine [16,17].To compare the performance of the best models to that of using a single hENT1 gene only, hENT1 expression derived from the mRNA(FPKM) profile was trained on the same algorithms as the 2 most predictive models (RF and XGBoost) using standard 10-fold CV from merging the out-of-sample predictions from the 10 folds.

Model performance evaluation
When a model estimated real-valued class probabilities, these were transformed to binary classes by setting a threshold to 0.5.The probabilities above this threshold were classed as responders; otherwise, they were classed as nonresponders.The 4 confusion matrix categories, including true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), Fig. 1.ML workflow to predict PAAD patient response to gemcitabine.The TCGA-PAAD clinical, biospecimen, and molecular profiles comprising mRNA in FPKM, and FPKM-UQ unit, miRNAs and isoforms, copy number variation, DNA methylation at CpGs, and CGI profiles from high-throughput technologies of patient tumors were retrieved from the NCI GDC.These molecular profiles were matched with corresponding clinical information using the patient's TCGA barcode.The samples meeting the exclusion criteria were filtered out to retain only valid high-quality records.Each molecular profile of the tumors was preprocessed and subsequently submitted by each of the 8 supervised learning algorithm analyses.The performances of the resulting all-features ML models were evaluated by 10-fold CV (5 repetitions per model, each using a different random seed).Thus, 5 MCC and ROC-AUC determinations were carried out for each model.To mitigate the high dimensionality of the datasets, each of the 8 algorithms was coupled with the OMC strategy, with the resulting models undergoing identical performance evaluation.Overall, a total of 128 ML models (16 binary classification models for each of the 8 molecular profiles) were evaluated on the same cohort of patients.Further details can be found in the Methods section.
were counted and used to calculate performance metrics including MCC and receiver operating characteristic-area under the curve (ROC-AUC).

Construction of protein-protein interaction (PPI) network
To unravel the relationships among the genes corresponding to 12 predictive CpG probes, these genes were mapped to STRING database version 11.5 (https://string-db.org/)[53] to retrieve both known and predicted protein-protein interactions (PPIs).We limited the species to "Homo sapiens" and required at least the minimum confidence interaction score (combine score > 0.15).The text mining, experiments, database, co-expression, co-occurrence, neighborhood, and gene fusion evidences were included to construct PPI.

The pathway enrichment analysis
In order to understand the roles of the genes found to be associated with the predictive CpG probes in PAAD treated with gemcitabine, the genes corresponding to 12 CpG probes were subjected to the Database for Annotation, Visualization and Integrated Discovery (DAVID) web server [54] for Gene Ontology (GO) pathway enrichment analysis.In addition, the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was also investigated using the KEGG Orthology-Based Annotation System (KOBAS) web server (http://kobas.cbi.pku.edu.cn/)[55].A statistically significant P value was calculated using a Fisher exact test, which further employed the Benjamini-Hochberg multiple-testing correction for the correct P value.The significant enriched pathways were identified based on a P value less than 0.05.For each pathway, input genes that are part of the pathway are counted, and the enrichment ratio was also calculated.

Statistical analysis
The two-tailed unpaired Welch's t test was used to generate P values in data analysis.The median overall survival (OS) and progression-free survival (PFS) were calculated for 4 groups, including actual and predicted responders and nonresponders, representing the time point on the Kaplan-Meier plot where 50% of the patients in each group have survived (for OS) or have not experienced disease progression (for PFS).Log-rank test was used to compute statistical differences in OS and PFS.The univariate and multivariate Cox regression analyses were performed to assess hazard ratio (HR) and 95% confidence intervals (CIs) of gemcitabine response predictors (4 mRNAs and 12 CpG probes).A P value less than 0.05 was considered as statistically significant.

Results
A total of 70 molecularly profiled PAAD patients treated with gemcitabine (41 nonresponders and 29 responders) were derived from TCGA-PAAD project.The patient's distribution and available features of each molecular profile, spanning from the 1,339 miRNA isoform expressions to the 450,000 methylation levels of DNA probes, were presented in Fig. S1.We trained each of 8 molecular profiles from PAAD patients' tumor on 8 classification ML algorithms with and without OMC feature selection, producing a total of 128 gemcitabine response prediction models.Computational pipeline is described in the Methods section and is represented in Fig. 1.

ML model performances at predicting PAAD patient response to gemcitabine
For each of the algorithm-profile pair, the model performance was evaluated by stratified 10-fold CV with 5 repetitions, each repetition with a different random seed.Figure 2 shows the out-of-sample median MCC (mMCC) of the 5 repetitions (the boxplot with these 5 MCC determinations can be seen in Fig. S2).OMC models, which are enhanced with further feature selection, tend to perform better than all-features models.Indeed, only 31 (48.4%) of the 64 OMC models yield mMCC values below 0.1, compared to 56 (87.5%) of the all-features models.Models exploiting mRNA and DNA methylation profiles distinguish responders from nonresponders with mMCCs of over 0.3.Both mRNA(FPKM) and mRNA(FPKM-UQ) exhibited the similar predictive performances regardless of normalized gene expression metrics used.mRNA(FPKM) achieved the highest MCC; hence, it is recommended for this challenging problem.RF-OMC and LR-OMC incorporating mRNAs are the 2 best ML models, achieving mMCC of 0.44, and 0.4, respectively.In addition, XGBoost incorporating DNA methylation of CpG probes obtained mMCC of 0.32.Such poor performance was also observed when using SVC for these problem instances.All nonresponders were misclassified using some molecular profiles [either mRNA(FPKM), mRNA(FPKM-UQ), CpG, or CGI], leading to undefined MCCs.Median ROC-AUCs range from 0.407 to 0.674 (Fig. 2).

OMC models are generally more predictive while retaining a concise subset of predictive features
OMC is the process of selecting the most informative features that considerably reduces the number of features for model building.As shown in Fig. 2, the OMC models tend to be more predictive, whereas the all-features models obtained a nearrandom predictive level.The most predictive models with an mMCC of at least 0.3 were OMC models that employed either mRNAs or CpG probes (Fig. S3).These 7 predictive OMC models also obtained the highest median ROC-AUCs in this challenging problem, ranking from 0.654 with CART-OMC employing mRNA(FPKM) to 0.785 with RF-OMC employing mRNA(FPKM) (Fig. 3).We employ MCC as the main evaluation metric, as it is less flattering than ROC-AUC.

The differential expression of the gemcitabine predictors
mRNA expression is the most predictive profile for stratifying PAAD patients treated with gemcitabine, giving the highest MCC.Four (SEPW1P, RP11-179A10.1,ATF4P4, and CTC-429L19.3) of 60,483 mRNAs are the most predictive features selected by RF-OMC that employed mRNA(FPKM) profile (mMCC of 0.44).Figure 4 and Table S2 show the differential expression of these 4 predictive genes between the responders and nonresponders using two-sided Welch's t test.All these 4 predictive genes are significantly (P < 0.01) up-regulated in responders compared to nonresponders.CTC-429L19.3 is the most expressed genes, while ATF4P4, RP11-179A10.1, and SEPW1P are the least expressed genes.Interestingly, these 4 predictive genes were commonly selected by other 5 most predictive OMC models with an mMCC of at least 0.3 employing mRNA profiles (Table S3).In particular, CART-OMC, the second most predictive model that employed mRNA(FPKM), also combined the same 4 predictive mRNAs to build classification tree (mMCC of 0.35) (Fig. S4).

PPI network analysis of genes corresponding to the 12 predictive CpG probes
The interactions among the genes corresponding to 12 predictive CpG probes were assessed using STRING web server [53], which integrates both known and predicted PPIs.Deciphering the protein interaction network could help in understanding the physical and functional associations of these protein that could influence chemo-susceptibility. Eleven of 13 genes corresponding to predictive CpG probes were mapped in the STRING database, while the remaining 2 non-protein-coding genes were not mapped (MIR6815 is miRNA, and RP11-429P3.3 is antisense).Figure 5 shows the network with 11 nodes and 14 edges of PPI.The nodes are proteins, while edges are their interactions (PPI enrichment P value of 0.035).The known and predicted interactions were observed among 6 proteins (CRIP1, PXN, COL18A1, B2M, ELANE, and PLOD3).TMEM191B and DNAH2 were linked with text mining of scientific literature evidence.The remaining 3 genes (C14orf80, AHRR, and ZNF703) were disconnected nodes in the network.These 13 genes corresponding to predictive CpG probes were also enriched in cancer-associated pathways in GO and KEGG pathway enrichment analysis (see the next section).Overall, the results show that these gene interactions could contribute to the gemcitabine resistance in PAAD patients.
An analogous analysis could not be performed with the 4 predictive mRNAs, as these do not code for proteins and hence are not included in the STRING database.Two of them (ATF4P4 and SEPW1P) are annotated as pseudogenes, while the other 2 (RP11-179A10.1 and CTC-429L19.3) are long noncoding RNAs.

GO and KEGG pathway enrichment analysis of genes corresponding to the 12 predictive CpG probes
The GO enrichment analysis of genes corresponding to 12 predictive CpG probes revealed a total of 43 significantly (P < 0.05) enriched GO pathways (Fig. S5 and Table S4) across 3 GO categories: biological process (BP), cellular component (CC), and molecular function (MF).By analyzing BP, we found that Fig. 2. Heatmap presenting the mMCC of the five 10-fold CV repetitions for each of the 128 models.The vertical axis shows the considered molecular profiles, while the horizontal axis shows the ML algorithms used.The first 8 columns on the left show supervised ML algorithms implemented with OMC feature selection considering only the subset of informative features during the model training (the suffix "OMC" was added to the algorithm name), while the remaining 8 columns on the right show the corresponding algorithms that used all features available in datasets.Five MCCs were obtained from five 10-fold CV runs ranking from −1 to 1.The mMCC was shown in the heatmap.A positive MCC means that the model accurately predicts gemcitabine responses better than a random classifier.MCC of 0.0 indicates a prediction no better than a random.A negative MCC means that the model performs worse than a random.A model that obtained undefined MCC, predicting the same class for all instances where the denominator of MCC is zero, is indicated as blank boxes.The 2 most predictive molecular profiles were mRNA expression and DNA methylation of CpG: mMCC of 0.44 from RF-OMC using mRNA expression profiles and mMCC of 0.32 from XGBoost-OMC using DNA methylation of CpG profile.In addition, the OMC models tend to perform better than the corresponding all-features models.
genes were enriched in 39 GO BP pathways.Most of these pathways are cancer-related pathways and contributed to cancer progression, for example, cell differentiation and development, cell proliferation, regulation of cell adhesion, and cell motility.Cancer cells undergo rapid proliferation and migration [56].Cellular adhesion is also an important process in cancer progression, as it allows cancer cells to interact with and invade surrounding tissue [57].Additionally, the extracellular matrix and extracellular structures are important components of the tumor microenvironment and play a critical role in cancer progression [58].These pathways provide insights into the underlying biological mechanisms that drive cancer development.Interestingly, 8 genes (COL18A1, PXN, ZNF703, AHRR, PLOD3, CRIP1, B2M, and ELANE) were predominantly enriched in response to chemical and response to stimulus pathways.Dysregulation of these 8 enriched genes in response to chemical or stimulus pathways could promote carcinogenesis and drug resistance [59][60][61].Six of these 8 enriched genes were also linked with known and predicted interaction evidences in PPI network, suggesting that these genes may function in coordinated manner and contributed to drug resistance (Fig. 5).Furthermore, GO enrichment through CC showed that the terms endoplasmic reticulum lumen, macromolecular complex, and specific granule lumen indicate an intracellular localization where they are likely to interact.By analyzing MF, 2 genes were enriched in transcriptional repressor activity and RNA polymerase II transcription factor binding pathways.This pathway was related to gemcitabine mechanism, which was involved in negative regulation of transcription from an RNA polymerase II promoter that led to apoptosis [62].
The KEGG pathway enrichment analysis (Table S5) showed that genes corresponding to CpG probes were significantly (P < 0.05) enriched in 10 KEGG pathways, with the metabolism of xenobiotics by cytochrome P450 pathway being particularly noteworthy due to its involvement in drug metabolism and patient response to treatment.Overexpression of cytochrome P450 causes rapid drug elimination before it reaches the target site [63].Moreover, antigen processing and presentation pathway is critical for the recognition of tumor cells by the immune system.Tumor cells can present antigens on their surface, which can be recognized by T cells and trigger an immune response.Immunotherapy approaches such as immune checkpoint inhibitors target this pathway to enhance T cell activity and improve antitumor immunity [64].S6.The orange line inside each boxplot represents the median expression across patients for a given feature, while the green triangle represents the mean expression.The boxplot for each feature contains the P value of mean differential expression between responders and nonresponders using a two-sided Welch's t test.****P ≤ 0.0001, significant difference in expression levels between 2 groups.
Altogether, our findings show that dysregulation of DNA methylation of predictive CpGs in this study could be associated to the alteration of their corresponding genes and pathways in which they are involved, leading to cancer development and gemcitabine resistance in PAAD patients.The pathway enrichment analysis provides insight into the underlying biological mechanisms of genes corresponding to predictive CpG that could be targeted for cancer therapy.However, functional validations are needed to test our bioinformatics findings.

Assessing the robustness of the best model in predicting PAAD patients' gemcitabine responses
We next investigated the effect of training dataset sizes on the predictive performance of the best model.Leave-one-out CV (LOOCV), 5-fold CV (5CV), and 10-fold CV (10CV) were used to evaluate the RF-OMC that employed mRNA(FPKM) profile model performance.All the CVs were repeated exactly in the same manner for all-features models.The MCCs and ROC-AUCs of 5 repetitions, each with different random seeds, are presented in Fig. 6.The results show the improvement in predictive performances as we increase the training dataset size from 80% of the data in 5CV to 99% of the data in LOOCV, while all-features models obtained a near-random predictive level and were significantly worse than those obtained from OMC models in all cases (Fig. 6).We also investigated whether the performances obtained from the signal in the dataset was better than what would be predicted by chance from class-permuted version.As a result, all permutation models were significantly worse than those arising from OMC models in all cases (Fig. S6).The similar result was obtained when using XGBoost-OMC, which employed DNA methylation of CpG profile and showed improved predictive performance as the training dataset size increased (Fig. S9).

CpG-based XGBoost-OMC can predict survival of gemcitabine-treated PAAD patients
We evaluated the OS (defined as the time interval from the date of diagnosis to death or the last known follow-up date) and PFS (defined as the time interval from the date of diagnosis to disease progression or the last known follow-up date) time difference between actual and predicted responders and nonresponders by the XGBoost-OMC combined 12 CpG probes.Among 70 patients with survival data and gemcitabine responses, 41 (58.6%) nonresponders and 29 (41.4%)responders (Fig. 7A), XGBoost-OMC incorporating 12 CpG probes classified these patients into 43 predicted nonresponders and 27 predicted responders.The confusion matrix was shown in Table S3.S8), as annotated by the GDC, were input into STRING database for PPI network analysis.Out of 13 genes corresponding to predictive CpG probes, 11 protein-coding genes were found in the STRING database, while the remaining 2 genes (MIR6815 and RP11-429P3) were not found as they are non-protein-coding genes (MIR6815 is miRNA, and RP11-429P3.3 is antisense).These 11 protein-coding genes were connected by 14 edges representing their known and predicted interactions as shown in the legend.A PPI enrichment P value of 0.035 was obtained, indicating that the protein input has more interactions among themselves than what would be expected for a set of proteins of the same size and degree distribution drawn at random from the genome.As a result, the proteins were connected as a group.The nodes in the PPI network represent proteins, and the edges represent their known and predicted interactions.The median OS and PFS were calculated for 4 groups, including actual and predicted responders and nonresponders by the XGBoost-OMC model.The median OS and PFS were not significantly different between the actual and the predicted responders (log-rank test P value of 0.19 for OS and 0.14 for PFS) and nonresponders (log-rank P value of 0.33 for OS and 0.11 for PFS).The actual responders had significantly longer OS (median OS was 72.7 months) and PFS (median PFS was 53.3 months) than actual nonresponders (median OS was 19.7 months; median PFS was 12.1) with log-rank test P value < 0.005.Similarly, the predicted responders had longer OS (median OS was 35.3 months) and PFS (median PFS was 17.1 months) compared to the predicted nonresponders (median OS was 20.6 moths; median PFS was 13.1).However, the difference was not statistically significant with log-rank test P value of 0.19 and 0.09 in OS and PFS, respectively (Fig. 7B).

mRNA-based RF-OMC can also predict survival of gemcitabine-treated PAAD patients
Among 65 mRNA profiled patients, 36 nonresponders and 29 responders (Fig. 8A), RF-OMC incorporating 4 mRNAs classified these patients into 35 predicted nonresponders and 30 predicted responders.The confusion matrix was shown in Table S3.The median OS and PFS were not significantly different between the actual and the predicted responders (log-rank test P value of 0.21 for OS and 0.15 for PFS) and nonresponders (log-rank P value of 0.45 for OS and 0.20 for PFS).The OS and the PFS were significantly longer in the actual responders (median OS was 72.7 months; median PFS was 53.3 months) compared to the actual nonresponders (median OS was 17.7 months; median PFS was 10.2 months) with log-rank test P value < 0.005.Similarly, the OS and PFS were longer in the predicted responders (median OS was 35.3; median PFS was 16.4) compared to the predicted nonresponders (median OS was 21.1; median PFS was 12.2) with log-rank test P value of 0.19 for OS and 0.04 for PFS (Fig. 8B).
Notably, the median OS and PFS time of actual response to gemcitabine and those predicted by CpG-based XGBoost and mRNA-based RF-OMC models exhibited the similar results.The median PFS of predicted responders and predicted nonresponders by the mRNA-based RF-OMC model shows a statistically significant difference (log-rank test P value of 0.04), despite the small sample size (Fig. 8B).This suggests that RF-OMC combining 4 mRNAs could be a prognostic predictor of gemcitabine-treated PC patients.

Identification of the predictive 4 mRNAs and the 12 CpG probes as the independent survival predictors in gemcitabine-treated PAAD patients
We next investigated whether gemcitabine response predictors (predictive 4 mRNAs and 12 CpG probes) could be survival predictors in PAAD patients.The univariate Cox regression analysis of 4 mRNAs from 65 mRNA profiled patients revealed that 3 genes could be independent biomarkers for prediction of OS and PFS (P < 0.05) (Fig. S7A).The high expression of SEPW1P (HR = 0.59 for OS; HR = 0.72 for PFS), RP11-179A10.1 (HR = 0.63 for OS; HR = 0.58 for PFS), and CTC-429L19.3 (HR = 0.64 for OS; HR = 0.58 for PFS) was significantly associated with improved OS and PFS, but it was not significant in multivariate Cox regression analysis (Fig. S7,  A and C).These finding were consistent with the differential expression of these 4 mRNAs, which were shown to be highly expressed in responders (Fig. 4).
Furthermore, the univariate analysis of the 12 CpG probes from 70 DNA methylation profiled patients revealed that 3 CpG probes could be used as independent biomarkers for OS prediction (P < 0.05) (Fig. S7B).The high DNA methylation of cg15200796 (HR = 0.68) and cg16049691 (HR = 0.7) was significantly associated with improved OS.In contrast, high DNA methylation of cg03208198 (HR = 1.4) was significantly associated with poor OS.In addition, 4 CpG probes could be used as independent biomarkers to predict PFS (P < 0.05) (Fig. S7B).The high DNA methylation of cg07239938 (HR = 0.71), cg15200796 (HR = 0.77), and ch.2.184672693R (HR = 0.61) was associated with improved PFS, whereas high DNA methylation of cg03208198 (HR = 1.47) was negatively associated with PFS.The remaining CpG probes had no significant association with patient survival.In multivariate Cox regression analysis, only cg03208198 (HR = 1.56 for OS; HR = 1.41 for PFS) was shown to be significantly associated with OS and PFS (Fig. S7D).These results were consistent with the differential expression of these 12 CpG probes, where high DNA methylation of cg03208198 was associated with nonresponders, while the remaining CpG probes showed high DNA methylation in responders (Fig. 4).These suggest that gemcitabine response predictors could be used as independent survival predictors of PAAD patients receiving gemcitabine.
Subsequently, we also performed univariate Cox regression analysis on hENT1 gene.The result in Fig. S8A shows that hENT1 gene was not significantly associated with both OS (HR = 1.01;P = 0.63) and PFS (HR = 1.0;P = 0.76) in the univariate model, suggesting that hENT1 gene was not prognostic in PAAD patient receiving gemcitabine.In addition, Fig. S8B demonstrates poor patient stratification by using hENT1 gene trained on RF model, in which there was a trend toward better survival in predicted nonresponders.

The effect of data integration on the gemcitabine response prediction
We next investigated the effect of data integration on the predictive performance of the best model.The RF-OMC was trained on mRNA profile, clinical features (Table S1), and integrated mRNA and clinical features.Figure 9A shows the MCCs across five 10-fold CV repetitions.Combining mRNA and clinical features (FPKM + clinical) did not improve the model performance (mMCC of 0.44) compared to those built on only mRNA features (FPKM) (mMCC of 0.44).RF-OMC prediction based on only clinical data (clinical) led to the lowest performance (mMCC of 0.014).If merging the 2 best molecular profiles provided a performance boost, it would justify the time and financial overhead of determining both profiles in every patient.However, RF-OMC integrating the 2 most predictive CpG methylation and mRNA profile performed worse than those by individual profile (mMCC of 0.349).Last, RF-OMC employed all the molecular profiles and clinical features (All_profiles) did not improve the prediction (mMCC of 0.253).Overall, there is no benefit in integrating patients' clinical data with molecular profiles on the predictive performance of the best model.Importantly, increasing number of features could reduce the model performances as a curse of dimensionality.We did not carry out systematic ablation experiments combining more than 2 profiles mostly due to unfavorable cost-benefit reasons.

Comparing the model prediction based on mRNA and CpG methylation profiles to the hENT1, a well-established gemcitabine response biomarker
Many studies reported that hENT1 has shown to be a predictive biomarker of gemcitabine response in PC.The high expression of hENT1 is associated with sensitivity to gemcitabine and increased overall and disease-free survival [16,17,65].Figure 9B shows that RF-OMC combining 4 mRNAs (mMCC of 0.44) significantly outperformed RF prediction using only hENT1 gene expression (mMCC of −0.066).Similarly, XGBoost-OMC combining 12 CpG probes (mMCC of 0.32) obtained a significantly higher MCC compared to XGBoost that employed only hENT1 gene expression (mMCC of −0.045).Table S6 also presents the predictive performances in terms of MCC and ROC-AUC.The results suggest that the prediction based on either 4 predictive mRNAs or 12 CpG probes outperforms those by hENT1 single-gene biomarker in predicting gemcitabine responses.

Discussion
Omics-based precision oncology, also known as targeted therapies or genomic medicine, has improved cancer patient management.The focus has been on either single-gene marker [17,66,67], a single molecular profile [67][68][69][70], or the discovery of rare actionable mutations [71,72].However, predicting drug responses is a challenging task in oncology.Using a single-gene marker or a few molecular profiles is often insufficient to correctly predict drug responses [50,73,74].The previous researches [44,50,75] demonstrated that the multiple molecular profiles provide new opportunities to identify robust biosignatures that can discriminate drug responses and aid to tailor treatments for individual cancer patients [44,50,75].
Supervised ML algorithms are increasingly being applied to the multiple molecular profiles as they can predict the clinical outcomes for new data based on what they learned from the previous data.Owing to well-curated molecular profiling data of GDC, we present a large-scale analysis of supervised 128 ML models built on multiple molecular profiles and algorithms to predict gemcitabine responses in 70 PAAD patients.Smaller sample sizes are not uncommon for cancer-specific drug response prediction problems [76].Here, out-of-sample predictions were carried out with rigorous CV procedures, which have been shown to anticipate held-out test set performance in smaller datasets than the one used in our study [77].
Using five 10-fold CV runs, 7 of 128 (5%) models are predictive with mMCC > 0.3.These 7 predictive models were all OMC models.Training only the most informative features in OMC models increases the chance of making accurate prediction by removing the irrelevant features from high dimensionality dataset early prior to model training, and model performance improves compared to all-features models.The 2 most predictive molecular profiles in this study revealed 4 of 60,483 mRNAs (0.066%) from RF-OMC model and 12 of 450,000 CpG probes (0.027%) from XGBoost-OMC model as the most informative biosignatures for predicting gemcitabine responses in PAAD patients.These models did not only exhibit high out-of-sample mMCC values robust to different partitions and repetitions, but these values were also significantly higher than those from all-features models (Fig. 6) and class-permuted models (Figs.S6 and S9).Moreover, the performance of these 2 OMC models monotonically increased with larger training set sizes (Fig. 6).Last, note that, in general, other combinations of molecular profiles and algorithms will be predictive in different drugs and cancer types [44,45,50].
The RF-OMC model nonlinearly combined 4 predictive mRNA features including SEPW1P, RP11-179A10.1,ATF4P4, and CTC-429L19.3 that have accurately predicted PAAD patients' responses to gemcitabine.Two of these predictive mRNAs (ATF4P4 and SEPW1P) are pseudogene, while the remaining mRNAs are RNA genes (RP11-179A10.1 and CTC-429L19.3)(Table S7).These genes might be involved in drug resistance and cancer progression through alterations in their expression.Other studies have already examined the roles of these genes in other types of cancer.For example, depletion of SEPW1P enhances tumor suppressor activities in breast cancer [78].The single-nucleotide polymorphisms (SNPs) located in long noncoding RNA RP11-179A10.1 were reportedly associated with poorer outcomes in ovarian cancer [79], but their association with PAAD has not been documented.
Previous studies have shown that potential cancer biomarkers can be identified using DNA methylation profiles [80][81][82].In this study, we could identify 12 DNA methylation of CpG probes affecting 14 genes as the predictive biomarkers of gemcitabine responses in PAAD patients by XGBoost-OMC model.The genes found to be associated with these CpG probes are presented in bracket: cg03208198 (COL18A1 and MIR6815), cg07239938 (ELANE), cg11528307 (C14orf80 and CRIP1), cg15200796 (TMEM191C), cg16049691 (AHRR), cg16379910 (B2M), cg17804635 (ZNF703), cg23854567 (PXN), cg24128434 (DNAH2), cg25653341 (PLOD3), cg27152190 (RP11-429P3.3),and ch.2.184672693R (not linked to any gene).The 2 CpG probes (cg115283072 and ch.2.184672693R) were commonly selected by other models that employed CpG profile, including CART-OMC, LGBM-OMC, and RF-OMC, with mMCC of 0.28, 0.18, and 0.18, respectively (Table S8).In addition, we also investigated the differential expression of gemcitabine response predictors and found that increased 4 mRNA expressions and DNA methylation of 11 CpG sites in pretreatment biopsies from gemcitabine-treated PAAD patients are correlated with better response to therapy, while increased DNA methylation of cg03208198 predicts resistance to therapy (Fig. 4).In fact, dysregulation of DNA methylation (hypo-or hypermethylation) is related to affected gene activation or associated with gene silencing including known tumor suppressor genes and loss of gene functions in cancer [83].Some of the genes corresponding to the predictive CpG probes have been previously reported to be associated with the prognosis and clinical outcomes in various cancers.The high expression of COL18A1 in PC [84], gastric cancer [85], and esophageal squamous cell carcinoma [86] showed poor clinical outcomes.Down-regulation of CRIP1 has also been identified in invasive PC [87].ZNF703 is an oncogenic transcriptional regulator involved in cell adhesion, movement, and proliferation [88].Overexpression of ZNF703 is linked to tamoxifen resistance [89].From this subset of predictive genes, we carried out PPI and pathway analyses to further support follow-up mechanistic studies.Here, we found that the genes corresponding to predictive CpG probes were significantly enriched in a number of cancer-associated pathways and drug resistance from GO and KEGG pathway database (Fig. S5 and Tables S4 and S5).
We next analyzed the survival difference observed between predicted responders and nonresponders by using the best models among gemcitabine-treated PAAD patients.The Kaplan-Meier survival analysis shows that the median OS and PFS were not significantly different between the actual and the predicted  S1), and combining all of 8 molecular profiles and clinical data (All_profile + Clinical).MCC ranges between −1 (misclassification) and 1 (perfect classification), with 0 corresponding to no better than a random.Altogether, integrated patients' clinical data have no effect on the predictive performance of the best model.Importantly, increasing number of features could reduce the model performances.(B) To compare the predictive performances of the 2 most predictive models with those using hENT1 only, hENT1 gene expression from mRNA(FPKM) profile was trained on the same algorithms as the 2 most predictive models (RF and XGBoost).Boxplots comparing the MCCs of the 2 most predictive models [mRNA(FPKM)_RF-OMC and CpG_XGBoost-OMC] to those using the expression of the hENT1 gene only (hENT1_RF and hENT1_XGBoost) in gemcitabine response prediction of PAAD patients.The 2 most predictive models obtained in this study significantly outperformed the prediction based on hENT1 expression as a single predictor.Statistical comparisons between different models were performed using two-sided Welch's t test.Stars denote the P value, where nonsignificant "ns" means 0.05 < P ≤ 1.00.*0.01 < P ≤ 0.05, **0.001 < P ≤ 0.01, ***0.0001< P ≤ 0.001, and ****P ≤ 0.0001.
responders and nonresponders from the best models.Both OS and PFS were increased in responders compared to nonresponders (Figs. 7 and 8).The univariate analysis suggests that gemcitabine response predictors could be used as independent features for survival prediction of PAAD patients receiving gemcitabine (Fig. S7).Some studies have reported these genes as predictors of PAAD survival outcomes [68,90].Note that building a better predictor of PAAD survival would require considering all PAAD patients regardless of the administered drug, which is out of the scope of this study.
Additionally, merging features did not improve the model performances (Fig. 9A).Clinical features alone were barely predictive, and combining them with predictive mRNA profile did not improve the model performance.Furthermore, there is no improvement by integrating 2 most predictive molecular profiles.In addition, loss in predictive accuracy was observed when combining all molecular profiles and clinical features.These suggest that RF-OMC was not able to capture the hidden informative features when the dimensionality dataset exceeded 500,000 features.
Furthermore, several studies have reported hENT1 expression as a gemcitabine response biomarker in PC [16,17], but the mMCC across five 10-fold CV runs was barely predictive (mMCC ~ 0), while the prediction based on 4 predictive mRNAs (mMCC of 0.44) and 12 predictive CpGs (mMCC of 0.32) achieved a significantly higher performance (Fig. 9B).This could suggest the attractiveness of the identified biosignatures.However, the improvement of these ML models over the hENT-1marker will have to be more accurately quantified in a prospective clinical trial.
Although the best models, based on mRNAs and CpG profiles, have predictive and robust performances, there are several limitations to be considered.First, this is a retrospective study using publicly available data from TCGA-PAAD project.While the number of patients is large for a cancer-specific drug response study (diagnosis or prognosis datasets are not restricted to a single administered drug and hence are larger), a larger cohort would further improve model performance.Indeed, as shown in Fig. 6, the more data used to train the model, the more accurate the model.Second, an independent validation cohort would further test the generalizability of the best model, i.e., whether this is able to stratify patients with different characteristics or technical validations (e.g., sample preparation, sequencing procedure, and bioinformatics pipeline).With that said, this is not critical here as it is when using clinical trial data, as TCGA data come from clinical practice in different hospitals.Third, the effect of drug combinations could be further evaluated along with its individual effect, as a few patients were given other drugs in combination with gemcitabine.

Conclusions
In summary, this systematic ML analysis revealed the discovery of predictive, reproducible, and robust gene expression and DNA methylation computational models for gemcitabinetreated PAAD patients.These models can be employed as response signatures to guide clinicians to stratify PAAD patients who will respond to gemcitabine in order to propose other suitable drugs without delay and avoid side effects to patients who are unlikely to respond.In addition, these signatures could be a starting point to explore the gemcitabine resistance mechanism.

Fig. 3 .
Fig. 3. ROC-AUC across five 10-fold CV runs of the 7 most predictive models with the highest median ROC-AUC.(A) Highest ROC-AUC among five 10-fold CV runs for each model.A ROC curve depicts the trade-off between the true-positive rates and the false-positive rates of a classifier by varying the threshold of the probability of response.A ROC-AUC of 1 is the best model performance at distinguishing between responders and nonresponders, while a ROC-AUC of 0.5 corresponds to the random guessing denoted by a dashed line.RF-OMC that employed the mRNA model has the highest ROC-AUC of 0.785.(B) ROC-AUC variability across the five 10-fold CV runs for each of the 7 most predictive models with the highest median ROC-AUC.The horizontal dashed line at 0.5 represents random-level performance.

Fig. 4 .
Fig. 4. Boxplot showing the differential expression of the predictive genes and DNA methylation of CpG probes between responders and nonresponders.The y axis represents the normalized expression level for 4 predictive genes from RF-OMC employing mRNA(FPKM) profile (A) and 12 predictive DNA methylation CpG probes from XGBoost-OMC employing CpG profile (B) between responders and nonresponders.The corresponding gene(s) for each CpG probe can be found in TableS6.The orange line inside each boxplot represents the median expression across patients for a given feature, while the green triangle represents the mean expression.The boxplot for each feature contains the P value of mean differential expression between responders and nonresponders using a two-sided Welch's t test.****P ≤ 0.0001, significant difference in expression levels between 2 groups.

Fig. 5 .
Fig. 5.The PPI network analysis.The genes associated with the 12 predictive CpG probes (TableS8), as annotated by the GDC, were input into STRING database for PPI network analysis.Out of 13 genes corresponding to predictive CpG probes, 11 protein-coding genes were found in the STRING database, while the remaining 2 genes (MIR6815 and RP11-429P3) were not found as they are non-protein-coding genes (MIR6815 is miRNA, and RP11-429P3.3 is antisense).These 11 protein-coding genes were connected by 14 edges representing their known and predicted interactions as shown in the legend.A PPI enrichment P value of 0.035 was obtained, indicating that the protein input has more interactions among themselves than what would be expected for a set of proteins of the same size and degree distribution drawn at random from the genome.As a result, the proteins were connected as a group.The nodes in the PPI network represent proteins, and the edges represent their known and predicted interactions.

Fig. 6 .
Fig. 6.The impact of training dataset size on the performance of best model versus its corresponding all-features model.The boxplots present the distributions of MCC (A) and ROC-AUC (B) obtained across five 10-fold CV runs of RF-OMC employing mRNA(FPKM) profile with 5CV, 10CV, and LOOCV (i.e., K = 5, 10, and 70, respectively).RF incorporated OMC feature selection (RF-OMC) was represented in deep purple, while RF trained on all available 60,483 mRNA features [RF(all-features)] was represented in light purple.Each boxplot contains the P value of mean differential expression between 2 groups with a two-sided Welch's t test.****P ≤ 0.0001.The model performances improved as the training dataset size increased.In addition, the OMC models significantly outperformed the corresponding all-features models in all CVs.The dashed line represents the random classifier at MCC of 0 and ROC-AUC of 0.5.MCC or ROC-AUC of 1 indicates a perfectly accurate prediction.

Fig. 7 .
Fig. 7. Comparison between the median OS and PFS time of actual patient response to gemcitabine and those predicted by the CpG-based XGBoost-OMC model.(A) Flowchart of patient selection.(B) Median OS (defined as the time interval from the date of diagnosis to death or the last known follow-up date) and PFS (defined as the time interval from the date of diagnosis to disease progression or the last known follow-up date) time for 4 groups, including actual and predicted responders and nonresponders by the XGBoost-OMC model.The median OS and PFS are defined as the time point on the Kaplan-Meier plot, where 50% of the patients in each group have survived (for OS) or have not experienced disease progression (for PFS).The figure highlights the ability of the CpG-based XGBoost-OMC model to accurately predict patient gemcitabine response, as demonstrated by the similar trend between the predicted and actual outcomes.The actual and predicted responders had longer OS and PFS than actual and predicted nonresponders.

Fig. 8 .
Fig. 8.Comparison between the median OS and PFS time of actual patient response to gemcitabine and those predicted by the mRNA-based RF-OMC model.(A) Flowchart of patient selection.(B) Median OS (defined as the time interval from the date of diagnosis to death or the last known follow-up date) and PFS (defined as the time interval from the date of diagnosis to disease progression or the last known follow-up date) time for 4 groups, including actual and predicted responders and nonresponders by the RF-OMC model.The median OS and PFS are defined as the time point on the Kaplan-Meier plot where 50% of the patients in each group have survived (for OS) or have not experienced disease progression (for PFS).The figure highlights the ability of the mRNA-based RF-OMC model to accurately predict patient gemcitabine response, as demonstrated by the similar trend between the predicted and actual outcomes.The actual and predicted responders had longer OS and PFS than actual and predicted nonresponders.

Fig. 9 .
Fig. 9. Comparison of the predictive performances of the best model to those combining other datasets and those using hENT1 single gene only in predicting gemcitabine response in PAAD patients.(A) Boxplots comparing the MCCs obtained from five 10-fold CV runs (each dot represents a run) of the RF-OMC prediction based on mRNA profile [mRNA(FPKM)], combining mRNA and clinical dataset [mRNA(FPKM) + Clinical], combining mRNA and CpG profile [mRNA(FPKM) + CpG] and clinical data only (Clinical) (TableS1), and combining all of 8 molecular profiles and clinical data (All_profile + Clinical).MCC ranges between −1 (misclassification) and 1 (perfect classification), with 0 corresponding to no better than a random.Altogether, integrated patients' clinical data have no effect on the predictive performance of the best model.Importantly, increasing number of features could reduce the model performances.(B) To compare the predictive performances of the 2 most predictive models with those using hENT1 only, hENT1 gene expression from mRNA(FPKM) profile was trained on the same algorithms as the 2 most predictive models (RF and XGBoost).Boxplots comparing the MCCs of the 2 most predictive models [mRNA(FPKM)_RF-OMC and CpG_XGBoost-OMC] to those using the expression of the hENT1 gene only (hENT1_RF and hENT1_XGBoost) in gemcitabine response prediction of PAAD patients.The 2 most predictive models obtained in this study significantly outperformed the prediction based on hENT1 expression as a single predictor.Statistical comparisons between different models were performed using two-sided Welch's t test.Stars denote the P value, where nonsignificant "ns" means 0.05 < P ≤ 1.00.*0.01 < P ≤ 0.05, **0.001 < P ≤ 0.01, ***0.0001< P ≤ 0.001, and ****P ≤ 0.0001.