Diagnostic plasma small extracellular vesicles miRNA signatures for pancreatic cancer using machine learning methods

Highlights • Potential utility of sEV-miRNA signature in PDAC diagnosis using machine learning methods.• A novel sEV biomarker, miR-664a-3p, was identified for the diagnosis of PDAC.• SEV-miR-664a-3p can potentially promote angiogenesis and metastasis.• SEV-miR-664a-3p provide new insight into PDAC pathogenesis, and reveal novel regulators of this disease.


Introduction
Pancreatic ductal adenocarcinoma (PDAC), the most common pathological type of pancreatic cancer, is one of the most highly malignant gastrointestinal tumours.Pancreatic cancer has become the third leading cause of cancer-related death in the United States, with a 5year survival rate of less than 10% [1].Lacking typical clinical symptoms and effective diagnostic biomarkers, pancreatic cancer is often Abbreviations: :PDAC, pancreatic ductal adenocarcinoma; CA19-9, carbohydrate antigen 19-9; SEV, small extracellular vesicles; miRNA, MicroRNA; RAB3GAP2, RAB3 GTPase activating non-catalytic protein subunit 2; GPA, granulomatosis with polyangiitis; HC, healthy control; CP, chronic pancreatitis; BPT, benign pancreatic tumor; BPD, benign pancreatic disease; LASSO, least absolute shrinkage and selection operator; RF, random forest; SVM-RFE, support vector machine recursive feature elimination; SVM, support vector machines; GA, genetic algorithm; PPV, positive predictive value; NPV, negative predictive value; AUC, the area under the curve; GEO, gene expression omnibus; BP, biological process; GO, gene ontology; DEGs, differentially expressed genes; FISH, fluorescent in situ hybridization; KEGG, Kyoto encyclopedia of genes and genomes; ECM, extracellular matrix; ctDNA, circulating tumour DNA; CTCs, circulating tumour cells; cfDNA, cell-free DNA.
diagnosed at an advanced stage [2].This can lead to missed opportunities for patients to receive surgery at an early stage.Carbohydrate antigen 19-9 (CA19-9) is used in the diagnosis and treatment evaluation of PDAC cases.However, CA19-9 has been reported to distinguish PDAC from non-PDAC with a limited sensitivity of 0.80 (0.72-0.8) and specificity of 0.75 (0.68-0.80) [3].
Small extracellular vesicles (SEV or Exosomes) contain various biomolecular cargoes, including proteins, nucleic acids, and lipids [4].The identification of specific sEV cargo has led to biomarkers being discovered and developed, which can be used for cancer diagnosis and prognosis [5].In recent years, small EV-derived microRNAs (sEV-miRNAs) have attracted attention as a promising blood-based biomarker for cancer detection [6][7][8].
Machine learning has improved the ability to discover relevant features in large and high-dimensional data sets collected from sequencing studies [9,10].This empowers the revelation of sophisticated patterns and correlations, potentially missed by conventional methodologies.Machine learning algorithms, foster an environment of perpetual learning from novel data.This progressively refines and elevates the processes involved in biomarker identification and, consequently, significantly boosts the precision and accuracy of diagnosis [11,12].From these specific selected features, supervised machine learning methods have been used to develop classifiers for disease diagnosis.Commonly used machine learning methods include the least absolute shrinkage and selection operator (LASSO), support vector machine (SVM) and random forest (RF).Each algorithm possesses unique variable filtering capabilities, and their combined application can effectively address the limitations of individual methods [13,14].To enhance accuracy, it may be more effective to employ a combination of multiple methods when screening for diagnostic biomarkers.In our findings, we employed three distinct machine learning algorithms: LASSO, SVM and RF.
Multiple reports have shown that miR-664a-3p, which is genomically located in the intron of RAB3 GTPase Activating Non-Catalytic Protein Subunit 2 (RAB3GAP2), is abnormally expressed in a number of cancers, including gastric cancer, cervical cancer, osteosarcoma, breast cancer, and T-cell acute lymphoblastic leukaemia [15][16][17][18][19]. Another study demonstrated that there were increased levels of miR-664a-3p in EVs in granulomatosis with polyangiitis (GPA), which could be used to discriminate between active GPA and remission [20].However, it is unclear whether miR-664a-3p expression is altered in pancreatic cancer tissues or plasma sEV.Additionally, its biological functions in PDAC tumorigenesis, progression, and metastasis remains unknown.
In this study, we identified plasma sEV-miRNA biomarkers associated with PDAC using three supervised machine learning feature selection methods.The candidate plasma sEV-miRNAs were used to establish diagnostic models with or without CA19-9.Then, we assessed the performance of these models in a validation cohort.MiR-664a-3p was screened by all three machine learning methods.Next, we further explored the diagnostic performance, correlation with clinicopathological characteristics, and biological functions of miR-664a-3p in PDAC.The findings of this study highlight the clinical and fundamental significance of sEV-miRNAs in PDAC and shed new light on liquid biopsies using sEV-miRNAs using machine learning methods.

Patient enrollment and sample collection
A total of 106 participants were recruited for the training cohort and validation cohort, including patients with PDAC (n=58), healthy controls (HC, n=20), CP patients (n=12) and BPT patients (n=16).All the participants were enrolled from 2020.8 in Sir Run Run Shaw Hospital, school of medicine, Zhejiang university (Hangzhou, China).Characteristics of participants including age, gender, tumor stage, CA19-9 and CEA level were shown in Table 1.
Blood samples were collected from participants with 10 mL EDTAcoated Vacutainer tubes.All the samples were collected before surgery, chemotherapy, radiotherapy, targeted therapy and immunotherapy.Within two hours of blood collection, plasma was separated by centrifugation at 1,300g for 10 min at 4 • C. The samples were stored at -80 • C.

Plasma sEV
For each patient, 1.5 mL of plasma was used, and sEV was isolated by affinity-based binding to spin columns using an exoRNeasy Serum/ Plasma Kit (Qiagen, Hilden, Germany) following the manufacturer's instructions.

Culture medium sEV
For sEV isolation from culture medium, cells were cultured in serumfree medium for 48 h and the culture supernatant was collected.Then the tubes were immediately centrifuged at 3,000 g for 20 min at 4 • C to remove cells and debris and filtered on a 0.22 μm filter.Next, the sEV was isolated by using Cell Culture Media Exosome Purification Kits (Norgen Biotech Corp., Thorold, ON, Canada).

sEV RNA
RNAs were extracted using a Norgen Exosomal RNA Isolation Kit (Norgen Biotek, 58000, Canada) in accordance with the manufacturer's instructions.

Small RNA sequencing and bioinformatics analysis
The experimental workflow was performed according to standard procedures provided by Illumina, including library preparation and sequencing experiments.Small RNA sequencing libraries were prepared using TruSeq Small RNA Sample Prep Kits (Illumina, San Diego, USA).After the library preparation was completed, the constructed library was sequenced using Illumina Hiseq2000/2500, and the sequencing read length was single-ended 1 × 50bp.The miRNA data analysis software provided by LC Biology is ACGT101-miR (LC Sciences, Houston, Texas, USA) independently developed by the company.The analysis process is as follows (clean reads are obtained after the raw data is processed by quality control, the 3′ joints are removed from the clean reads, and length screening is performed to keep the base length in 18-25nt (plant) or 18-26nt (animal) sequence.Align the remaining sequences against various RNA database sequences (excluding miRNAs), such as mRNA data library, RFam database (including rRNA, tRNA, snRNA, snoRNA, etc.) and Repbase database (repeated sequence database), and filtered, and finally obtained valid data and it can be used for subsequent small RNA data analysis.Differential expression of miRNAs based on normalized deep-sequencing counts was analyzed by Student t test in this study.The significance threshold was set to be 0.05.

RNA sequencing analysis
Cells were treated with hsa-miR-664a-3p mimics or Ctrl mimics for 48 h.Then, total RNA was extracted from cells using TRIzol (Invitrogen).Library construction and sequencing were performed by LC Bio Technology CO.,Ltd (Hangzhou, China).The RNA libraries were sequenced on the illumine Novaseq TM 6000 platform by LC Bio Technology CO.,Ltd (Hangzhou, China).

Real-time qPCR
Total RNA was extracted with TRIzol reagent (Invitrogen, USA) and isopropanol precipitation methods.The extracted RNA was quantified by measuring the absorbance at 260 nm with a PuXin UV/visible spectrophotometer TU-1810.For miRNA level detection, the hsa-miR-664a-3p RT-qPCR was carried out using a Bulge-Loop miRNA qRT-PCR Starter Kit from Ribo Biotechnology (Guangzhou, China) with the provided miRNA reference gene (u6).Hsa-miR-664a-3p and u6 RT-specific forward and reverse primers and specific bulge-loop miRNA qRT-PCR primers were designed and purchased from Ribo Biotechnology (Guangzhou, China).For mRNA detection, total RNA (1μg) was removed DNA and performed Reverse transcription with Hifair II 1st Strand cDNA Synthesis SuperMix for qPCR.All real-time PCR analyses were performed using gene-specific primers and the LightCycler® 480 real-time PCR system (Roche) using SYBR Premix Ex TaqTM (TaKaRa, RR420A).The relative amounts of each mRNA were analyzed using the comparative CT method, as described by Schmittgen and Livak49.All data are expressed as the mean ± SE of n = 3 independent repeats.All statistical analyses were performed using a two-tailed, paired Student's t-test, and P< 0.05 was considered significant.Primer sequences were listed in Supplementary Table 1.

Western blot
Cells were lysed with RIPA buffer (#FD009, FUDE, China) containing a mixture of protease inhibitor cocktail kit (#P1049, Biyuntian, China).Then the lysates were cleared by centrifugation and the concentrations of proteins were measured by BCA protein assay kit (#P0009, Biyuntian, China).Each total protein sample (4µg) was separated on SDS-PAGE gels and then transferred to PVDF membranes.After incubation of the membranes with primary antibodies at 4 • C overnight, the samples were incubated with the secondary antibodies conjugated with horseradish peroxidase for 1h at room temperature.The Chemiluminescence signals were visualized using a ChemiDoc Touch Imaging System (Bio-Rad).Below is the detailed antibody information (Anti-β-Actin (ABM40032, Abbkine), Anti-Snail (#3879, CST), Anti-Slug (#9585, CST), Anti-E-Cadherin (ab40772, Abcam), Anti-N-Cadherin (ab18203, CST).

miRNA fluorescence in situ hybridization (FISH)
FISH assays were performed using miRNA Fluorescence in situ Hybridization Kit (GenePharma) according to the instructions.Oligonucleotide modified probe labeled with cy3 for hsa-miR-664a-3p was provided by GenePharma.First, formalin-fixed paraffin-embedded sections were preheated for 30 min at 60 • C and fresh xylene was used to remove paraffin from the tissue.Then, sections were rehydrated by 10 min incubations in decreasing concentrations of ethanol, and incubated with Proteinase K solution for 15 min at 37 • C and denaturing solution for 8 min at 78 • C.After dehydration, probes were added in the hybridization solution and incubated for 14 h at 37 • C. Sections were washed and counterstained with DAPI (GenePharma).Images were acquired using a fluorescence microscope (Olympus fv3000).

Data and statistical analyses
The raw read count of RNA-seq data was converted to TPM values for scale and normalization across all samples.Variates expressed in less than 25% of the entire samples were excluded, and the remaining features were used for subsequent statistical analyses.The plasma sEV miRNA-seq profiles were randomly distributed in training (n=72) and validation cohorts (n=34).The student t test was used to calculate differential expression exmiRs in 58 PDAC patients and 48 non-PDAC individuals.In addition, the false discovery rates (FDRs) of each marker were controlled by Benjamini-Hochberg method.
We applied several machine learning approaches for feature selection, including LASSO regression, random forest (RF) and support vector machines recursive feature elimination (SVM-RFE).LASSO was implemented using R package "glmnet", with alpha value equal to 1. Random Forest was implemented using R package "randomForest".SVM-RFE was implemented using R package "e1071", "kernlab" and "caret".
LASSO.Least absolute shrinkage and selection operator (LASSO) on binomial logistic regression using the glmnet package in R (v4.0) was used to select relevant miRNAs, by eliminating parameters with a coefficient of 0. One of the advantages to using a LASSO method is that coefficients are shrunk and removed, reducing variance without substantially increasing the bias [21].The lasso approach permits learning X. Pu et al. and weighting features most relevant to predicting PDAC/non-PDAC status through variable selection.Ultimately, by combining the selected features, the lasso logistic regression model was used to assign a probabilistic score to all samples that were being analysed both in training cohort and validation cohort.
Random forest.The random forest algorithm (RF) is an ensemble learning method based on bagging, which can handle classification problems well.It introduces randomness, is not easy to overfit, and can handle high-latitude data and big data, with high accuracy [22].After feature selection, Matlab 2019b was used to develop binary classification model.SVM-RFE.Support vector machines (SVM) are a powerful tool to analyze data with a number of predictors approximately equal or larger than the number of observations.However, originally, application of SVM to analyze biomedical data was limited because SVM was not designed to evaluate importance of predictor variables [23].Support vector machine recursive feature elimination (SVM-RFE), can filter relevant features and remove relatively insignificant feature variables in order to achieve higher classification performance [24].On the basis of using SVM-RFE for screening features, the genetic algorithm (GA) is used to find the best g and c values, and then the GA-SVM algorithm is used for classification.
To compare classifiers, we assessed sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy and the area under the curve (AUC) among the classifiers both in training cohort and validation cohort.
Statistical analyses were performed using GraphPad Prism 9 software (GraphPad Software, La Jolla, CA, USA) to assess differences among experimental groups.Statistical differences were analyzed by two-tailed Student's t-test, paired t test, or one-way ANOVA test followed by Fig. 1.A summarized diagram of the study and workflow of data generation and analyses.A total of 106 participants were recruited for the training cohort and validation cohort, including patients with PDAC (n=58), HC (n=20), CP (n=12) and BPT (n=16).MiRNA sequencing was performed after isolation of plasma sEV miRNA.251 plasma sEV-DEmiRs were differentially expressed in 58 PDAC patients compared with 48 non-PDAC (20HC, 12CP and 16BPT) individuals.To further screen sEV-DEmiRs, three machine learning methods, LASSO regression method, the RF algorithm and SVM-RFE method were used to perform variable selection.The candidate plasma sEV-DEmiRs were used to establish diagnostic models with or without CA19-9 in training cohort.Then, the performance of these models was assessed in validation cohort.MiR-664a-3p was identified by all three feature selection methods.The diagnostic performance of plasma sEV miR-664a-3p in PDAC combined with or without CA19-9 was evaluated both in training cohort and validation cohort.Then the biological functions of sEV miR-664a-3p in PDAC was explored both in vivo and in vitro.PDAC, pancreatic ductal adenocarcinoma; HC, healthy control; CP, chronic pancreatitis; BPT, benign pancreatic tumor; MiRNA, microRNA; SEV, small extracellular vesicles; DE-miRs, differentially expressed microRNAs; LASSO, least absolute shrinkage and selection operator; RF, random forest; SVM-RFE, support vector machine recursive feature elimination; CA19-9, carbohydrate antigen 19-9.
The workflow was described in Fig. 1.

Establishment of sEV-miRNA d-signatures for PDAC using machine learning methods
Overall, 251 plasma sEV-miRNAs were differentially expressed in 58 PDAC patients compared with 48 non-PDAC (20 healthy controls (HC), 12 chronic pancreatitis (CP), and 16 benign pancreatic tumour (BPT)) individuals.To further screen sEV-miRNAs, we used the least absolute shrinkage and selection operator (LASSO) regression method, random forest (RF) algorithm, and support vector machine recursive feature elimination (SVM-RFE) method to perform variable selection.
Twenty-five sEV-miRNAs were selected using the LASSO regression analysis (Fig. 2A).From these results, we filtered out three sEV-miRNAs (miR-144-5p, miR-1294, and miR-148a-3p) that were downregulated in PDAC patient plasma samples compared with those in samples from non-PDAC individuals.Before multivariate analysis, correlation analysis (Fig. 2B) and collinearity diagnostics (Supplementary Table 2) were conducted to exclude collinearity between features.Finally, 11 sEV-miRNAs (miR-664a-3p, miR-652-5p, miR-33a-3p, miR-5010-3p, miR-335-3p, miR-548e-5p, miR-940, miR-616-5p, miR-490-3p, miR-5100-3p, and miR-548d-3p) remained and were used for logistic regression modelling.The LASSO-logistic sEV-miRNA diagnostic signature (dsignature) comprised the 11 sEV-miRNAs that distinguished PDAC from non-PDAC with an AUC of 0.968, sensitivity of 92.3%, and specificity of 97.0% in the training cohort and an AUC of 0.835, sensitivity of 73.7%, and specificity of 100.0% in the validation cohort (Fig. 2C).To further evaluate the diagnostic performance of the LASSO-logistic sEV-miRNA d-signature among other groups, we also compared the PDAC group with the HC and benign pancreatic disease (BPD), CP and BPT patients) groups.The results showed that the LASSO-logistic d-signature could distinguish PDAC from HC with an AUC of 0.978, sensitivity of 94.9%, and specificity of 100.0% in the training cohort and an AUC of 0.939, sensitivity of 89.5%, and specificity of 100.0% in the validation cohort.The PDAC patients were also distinguished from BPD patients with an AUC of 1, sensitivity of 100.0%, and specificity of 100.0% in the training cohort and an AUC of 0.789, sensitivity of 63.2%, and specificity of 100.0% in the validation cohort (Table 2.1).Unsupervised hierarchical clustering using the 11 sEV-miRNAs could effectively distinguish PDAC from non-PDAC in both the training and validation cohorts (Fig. 2D and  E).
The top 30 sEV-miRNAs (Supplementary Table 3) were selected by Gini importance using the RF algorithm (Fig. 3A).Using the RF method with 30 sEV-miRNAs, we developed a diagnostic model and generated a RF sEV-miRNA d-signature for PDAC.This was able to distinguish PDAC patients from non-PDAC individuals with an accuracy of 100% (Fig. 3B), sensitivity of 100.0%, and specificity of 100.0% in the training cohort (Fig. 3C) and an accuracy of 79.4% (Fig. 3B), sensitivity of 68.4%, and specificity of 93.3% (Fig. 3C) in the validation cohort.When differentiating PDAC patients from HC, the RF sEV-miRNA d-signature displayed an AUC of 1, sensitivity of 100.0%, and specificity of 100.0% in the training cohort and an AUC of 0.952, sensitivity of 89.5%, and specificity of 66.7% in the validation cohort.The RF d-signature appeared to be less effective for distinguishing PDAC from BPD, with an AUC of 1, sensitivity of 100.0%, and specificity of 100.0% in the training cohort and an AUC of 0.470, sensitivity of 73.3%, and specificity of 44.4% in the validation cohort (Table 2

.2).
Four sEV-miRNAs (Supplementary Table 4) were identified by using SVM-RFE (Fig. 3D).The GA-SVM was further used to establish a diagnostic model.The results showed that the SVM-RFE d-signature could distinguish PDAC patients from non-PDAC patients with an AUC of 0.882, sensitivity of 71.8%, and specificity of 97.0% in the training cohort and an AUC of 0.79, sensitivity of 68.4%, and specificity of 93.3% in the validation cohort (Fig. 3E).When distinguishing PDAC from HC, for the training set, this model obtained an AUC of 0.947, sensitivity of 89.7%, and specificity of 78.6%, while for the validation set, it had an AUC of 0.956, sensitivity of 100.0%, and specificity of 66.7%.For differentiating PDAC from BPD, the SVM-RFE d-signature had an AUC of 0.908, sensitivity of 87.2%, and specificity of 78.9% in the training cohort and an AUC of 0.760, sensitivity of 73.3%, and specificity of 77.8% in the validation cohort (Table 2.3).
Because CA19-9 has been reported as a blood-based tumour marker for pancreatic cancer [25], in this study, we calculated the AUC of CA19-9 for distinguishing between PDAC patients and non-PDAC individuals.The AUC of CA19-9 was 0.830 (0.732-0.929), with a sensitivity of 71.8% and specificity of 93.9% in the training cohort (Supplementary Fig. 1).This was lower than the AUC values of all three d-signature models.In the validation cohort, the AUC of CA19-9 was 0.8351 (0.700-0.971), with a sensitivity of 57.8% and specificity of 93.3% (Supplementary Fig. 1).
To further improve the performance of the sEV-miRNA d-signatures, the prediction model was combined with the serum CA19-9.The data suggested that the combination of the sEV-miRNA signature with serum CA19-9 resulted in a better performance than with each model alone.When in combination with CA19-9, the model could distinguish PDAC from non-PDAC and improve the AUC to 1, 1, and 0.955 in the training cohort and to 0.919, 0.919, and 0.881 in the validation cohort by using features selected via LASSO-logistic, RF, and SVM-RFE, respectively (Fig. 4A-C).When differentiating PDAC from HC, features selected by all three machine learning methods combined with CA19-9 had an excellent diagnostic performance with an AUC of 1, 1, and 0.962 in the training cohort and 0.974, 0.962, and 0.974 in the validation cohort via LASSO-logistic, RF, and SVM-RFE, respectively.When distinguishing PDAC from BPD, the LASSO-logistic d-signature and SVM-RFE d-signature combined with CA19-9 had respective AUC values of 1 and 0.908 in the training cohort and 0.877 and 0.877 in the validation cohort.However, the RF d-signature combined with CA19-9 appeared to have a less effective diagnostic performance with an AUC of 1 in the training cohort and 0.478 in the validation cohort (Fig. 4D).
To make a clearer comparison of the performances of the different dsignatures using these three machine learning methods with or without CA19-9, we summarized several indicators in a table (Fig. 4D), including the sensitivity, specificity, and AUC values.More complete and detailed data are presented in Table 2.

Plasma sEV miR-664a-3p levels can distinguish PDAC from non-PDAC
Plasma sEV miR-664a-3p was identified by all three feature selection methods (Fig. 5A).To evaluate the potential of sEV miR-664a-3p as a biomarker for PDAC diagnosis, we first compared the expression levels of sEV miR-664a-3p in plasma samples from PDAC patients and non-PDAC individuals.The results showed that plasma sEV miR-664a-3p levels were significantly higher in PDAC patients than in non-PDAC subjects in both the training and validation cohorts (Fig. 5B).In addition, we analysed the expression of plasma sEV miR-664a-3p in HCs and BPD patients.Both the training and validation cohorts showed significant differences in plasma sEV miR-664a-3p levels between PDAC patients and HCs (Fig. 5C) and PDAC patients and BPD patients (Fig. 5D), and was higher in PDAC group.Then, we performed ROC curve analysis to evaluate the diagnostic value of plasma sEV miR-664a-3p in PDAC.Plasma sEV miR-664a-3p levels could be used to distinguish PDAC patients from non-PDAC individuals (AUC=0.826 in the training cohort, AUC=0.730 in the validation cohort), from HC (AUC=0.848 in the training cohort, AUC=0.737 in the validation cohort), and from BPD patients (AUC=0.810 in the training cohort, AUC=0.725 in the validation cohort) (Supplementary Fig. 2A-C).Next, we combined plasma sEV miR-664a-3p with CA19-9 to evaluate its diagnostic performance.The results suggested that the combination led to a greater diagnostic

Table 2
Performance of d-signature in the PDAC using LASSO-logistic regression, RF and SVM-RFE with/without CA19-9.performance for discriminating PDAC from non-PDAC individuals (AUC=0.8943 in the training cohort, AUC=0.940 in the validation cohort), from HCs (AUC=0.951 in the training cohort, AUC=1.000 in the validation cohort), and from BPD patients (AUC=0.810 in the training cohort, AUC= 0.924 in the validation cohort) (Supplementary Fig. 2D-F).More detailed data are summarized in Fig. 5E and F.
To explore miR-664a-3p expression in other tissue types in PDAC patients, such as blood, tumour tissue, and normal-adjacent tissues, we used public databases for further analysis.Serum miR-664a-3p expression levels in PDAC patients and HCs showed no significant difference in the GSE59856 and GSE85589 data sets (Supplementary Fig. 3A and C).ROC curve analysis showed that serum miR-664a-3p can hardly distinguish PDAC patients from healthy individuals in GSE59856 and GSE85589 (Supplementary Fig. 3B and D).There was no differential expression of miR-664a-3p in normal-adjacent and pancreatic cancer tissues by analysing the miRNA-seq data from The Cancer Genome Atlas (TCGA) database in PDAC (Supplementary Fig. 4A).ROC curve analysis showed an AUC of 0.52 (0.19-0.85), indicating a poor diagnostic performance in PDAC patients (Supplementary Fig. 4B).Additionally, Kaplan-Meier survival analysis showed that tissue miR-664a-3p expression levels were not related to the prognosis of PDAC patients (Supplementary Fig. 4C).We further compared the miR-664a-3p expression levels in seven paired cancer and para-cancerous tissues in our centre.A paired Student's t-test was used to assess the statistical significance of this analysis, which suggested that there were no significant differences in miR-664a-3p expression levels between PDAC tissues and their para-cancerous tissues (Supplementary Fig. 4D), P=0.8125).To further validate this result, we evaluated miR-664a-3p expression by analysing the GSE119794 cohort from the Gene Expression Omnibus (GEO) database, which contains miRNA-seq profiles of pancreatic cancer in 10 paired tumour and normal pancreatic samples from patients.In accordance with our results, no significant differences were observed between the cancer tissues and para-cancerous tissues (Supplementary Fig. 4E), P=0.4316).
Taken together, miR-664a-3p was enriched in plasma sEV of PDAC and was capable of distinguishing PDAC from non-PDAC, HC, and BPD individuals with or without CA19-9.However, tissue expression levels of miR-664a-3p were not significantly different between PDAC and noncancer tissues.

MiR-664a-3p is enriched in sEVs and may play a role in PDAC metastasis
To further investigate the role of miR-664a-3p in PDAC, we measured miR-664a-3p expression levels in six PDAC cell lines and one normal pancreatic ductal epithelial cell line (HPDE6C7) (Fig. 6A).We found that the expression levels of miR-664a-3p in ASPC-1, BxPC-3, and CFPAC-1 cells were relatively higher than in HPDE6C7 cells, while two additional PDAC cell lines, MiaPaCa-2 and Panc-1, expressed lower amounts of miR-664a-3p compared with HPDE6C7.Additionally, there was no significant difference in miR-664a-3p expression between HPC-Y5 and HPDE6C7 cells.These results indicate that miR-664a-3p is expressed at different levels in various PDAC cell lines.Next, we investigated miR-664a-3p expression in sEV extracted from the culture medium (CM) of six PDAC cell lines and HPDE6C7 cells (Fig. 6B).We observed that the expression levels of CM sEV miR-664a-3p were higher in ASPC-1, BxPC-3, CFPAC-1, and MiaPaCa-2 cells than in HPDE6C7 cells.The results also showed that the CM sEV miR-664a-3p levels in HPC-Y5 and Panc-1 cells were not significantly different compared with the levels in HPDE6C7 cells, but the sEV levels were all remarkably higher compared with those in PDAC cell lines normalized to expression in HPDE6C7 (Fig. 6C).The abovementioned data show that miR-664a-3p is enriched in PDAC cell-derived sEV.
After evaluating miR-664a-3p expression levels in PDAC cell lines and their CM sEV, we separated these cell lines into groups with high (ASPC-1, BxPC-3, and CFPAC-1) and low (MiaPaCa-2 and Panc-1) miR-664a-3p expression.To explore potential biological functional differences between the high and low expression groups, we downloaded the microarray expression dataset GSE40098 from the GEO database and analysed 54 differentially expressed genes (DEGs) by using the online analysis tool GEO2R.Biological Process (BP) of gene ontology (GO) analysis revealed 16 significantly enriched GO terms (Fig. 6D).Among them, GO analysis associated these DEGs with vessel formation and adhesion and motility-related pathways, such as angiogenesis, cell migration, cell-cell adhesion, and cell motility (Fig. 6E).These play important roles in the process of metastasis.Additionally, through online analysis and visualization, we compared the metastatic potential of these five PDAC cell lines in MetMap (https://pubs.broadinstitute.org/metmap).The petal plots showed the metastatic ability of the five PDAC cell lines in five organs, including brain, bone, kidney, liver, and lung.Forest plot showed the overall metastatic potential combined with five distant organs of five PDAC cell lines (Fig. 6F).This analysis suggested that ASPC-1, BxPC-3, and CFPAC-1 cells did exhibit higher metastatic abilities than MiaPaCa-2 and Panc-1 cells.Therefore, we hypothesized that miR-664a-3p is associated with PDAC metastasis and is enriched in sEV to exert its functions.
To investigate this further, we analysed the correlation between plasma sEV miR-664a-3p levels and clinicopathological characteristics using the chi-square test in 58 PDAC patients.The patients were grouped into miR-664a-3p-Low (n=29) or miR-664a-3p-High (n=29) groups based on the miR-664a-3p median expression value.We found that increased plasma sEV miR-664a-3p was significantly positively associated with the presence of vascular invasion (P=0.0139) and lower surgery ratio (P=0.0178).To some extent, higher levels of plasma sEV miR-664a-3p was positively correlated with advanced TNM stage (P=0.0654), the presence of distant metastasis (P=0.0656), and poor differentiation (P=0.0572),though these results were not statistically significant (Table 3).We also re-analysed the correlation by grouping the patients into the high (upper 75 th percentile) or low (lower 75 th percentile) expression groups.The results showed that increased miR-664a-3p expression was significantly positively correlated with poor differentiation (P=0.0202) and associated with vascular invasion (P=0.0588)(Supplementary Table 5).This suggests that higher sEV miR-664a-3p levels indicate a higher degree of malignancy and greater chance of metastasis.
Next, we performed fluorescent in situ hybridization (FISH) and immunofluorescence staining to detect miR-664a-3p and VEGFA expression in the PDAC tissues.The data showed that miR-664a-3p was mainly distributed in the PDAC cancer stroma, including fibrous tissues and vessels (Fig. 6G), and was positively correlated with VEGFA (Supplementary Fig. 5).This was consistent with former findings in this study, showing that miR-664a-3p may promote angiogenesis and metastasis.

MiR-664a-3p may promote PDAC metastasis by regulating the epithelialmesenchymal transition (EMT) and angiogenesis
Then, we investigated the roles of miR-664a-3p in PDAC cell lines.Cell proliferation was not markedly regulated by miR-664a-3p in ASPC-1, BxPC-3, CFPAC-1, MiaPaCa-2, and Panc-1 cells (Supplementary Fig. 6).To gain further insight into the functions of miR-664a-3p in these cell lines, we performed RNA-seq in MiaPaCa-2 cells treated with miR-664a-3p mimics for 48 h.From these results, 114 DEGs were selected for further analysis.The gene ontology (GO) functional terms were identified after functional enrichment analysis of the DEGs (Fig. 7A), and we found enrichment of GO terms associated with cell adhesion, extracellular exosomes, and collagen containing extracellular matrix, which was consistent with effects on cell migration and metastasis.Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis suggested that overexpression of miR-664a-3p was associated with focal adhesion, extracellular matrix (ECM)-receptor interaction, the Apelin signalling pathway, and the Rap1 signalling pathway (Fig. 7B), which are reportedly related to migration and angiogenesis  Cell adhesion and migration are critical steps in tumour invasion and metastasis.Because the EMT is known to broadly regulate cancer invasion and metastasis [30], we next assessed this in PDAC cells treated with miR-664a-3p mimics or inhibitor by investigating the protein expression levels of EMT markers.Western blot analysis indicated that miR-664a-3p mimic transfection increased the levels of Snail and N-cadherin and attenuated the levels of E-cadherin in MiaPaCa-2 cells.The CFPAC-1 cells treated with the inhibitor showed the opposite trends (Fig. 7C).The scratch assay suggested that miR-664a-3p mimics promoted the migration of MiaPaCa-2 cells, whereas the migration of CFPAC-1 cells were inhibited by miR-664a-3p inhibitor (Fig. 7D).Transwell assays were further performed to investigate the effects of miR-664a-3p on cell migration, and the results were in accordance with those of the scratch assay, indicating decreased migration after miR-664a-3p inhibitor treatment in CFPAC-1 cells and miR-664a-3p mimics treatment in MiaPaCa-2 cells (Fig. 7E).As indicated by the DEGs, angiogenesis was enriched in the miR-664a-3p high expression cell lines compared with that in the low expression cell lines.It was also enriched in the MiaPaCa-2 cells treated with miR-664a-3p mimics compared with that in cells treated with the control mimics.HIF-1α, VEGFA, and VEGFC have been reported to act as factors that contribute to tumour angiogenesis [31].qRT-PCR was performed to detect the expression levels of these three genes.The results suggested that the HIF-1α, VEGFA, and VEGFC mRNA expression levels were decreased in CFPAC-1 cells treated with the miR-664a-3p inhibitor, while no significant difference was observed when the cells were treated with miR-664a-3p mimics (Fig. 7F).From these data, we can preliminarily conclude that miR-664a-3p can promote EMT and angiogenesis in PDAC cells, thereby increasing the possibility of metastasis.To further explore the target genes of miR-664a-3p, we used online tools (miRDB, Targetscan, and miRWalk) to predict.By overlapping the results of three databases and pathway genes which associated with EMT and angiogenesis, we obtained six genes (AMIGO2, OMD, THBS2, ACTN2, C1QTNF3, and MME) (Fig. 7G).Among these six genes, OMD, ACTN2, and C1QTNF3 were upregulated in miR-664a-3p mimics treated group.And MME level was not detected (value of 0) in miR-664a-3p mimics treated group.Therefore, we selected AMIGO2 and THBS2 as potential targets of miR-664a-3p for validation by qRT-PCR.The result showed that inhibition of miR-664a-3p could significantly augment the expression of THBS2 in CFPAC-1.The cells treated with the mimics showed the opposite trends (Fig. 7H).While the expression level of AMIGO2 was not detected for the possibility of an extremely low abundance.Collectively, THBS2, related to angiogenesis and cancer progression, may be a potential target gene of miR-664a-3p.

Discussion
In recent years, liquid biopsy has gained increasing attention and been developed as a "non-invasive" diagnostic method.Circulating tumour DNA (ctDNA), circulating tumour cells (CTCs), and EVs have promise for being used as future tools in liquid biopsies.During cell death, ctDNA is formed when fragments of somatic tumour DNA are released into the blood circulation [32].This can allow for the detection of certain mutations specific to the tumour cells.However, it was reported that in early tumour stages, ctDNA may represent only <1% of total cell-free DNA (cfDNA) [33].A few studies have even described no detection of ctDNA in early-stage patients, which increases the difficulty of using it as an early diagnostic tool [34].CTCs can diffuse from the primary tumour site into distant sites through the vasculature.An advantage of CTCs is that they contain RNA, DNA, and protein from the tumour cells, and have a spatial resolution of biomarkers [35].However, they are poorly enriched in blood, with only 1 to 10 CTCs in 1 mL of blood [36].This considerably increases the detection difficulty and amount of blood needed.Therefore, the use of sEV can complement and offer an exciting alternative to other liquid biopsy methods for better diagnostic outcomes through the following advantages ((1) they are abundant in the blood and easy to enrich), (2) they are highly stable and easy to preserve), and (3) they can be used to not only diagnose, but also treat diseases.
Recent studies have suggested that sEV miRNAs can serve as noninvasive biomarkers for cancer diagnosis and prognosis [37].For the first time, Que et al. found that serum-derived sEV miR-17-5p and miR-21 were significantly upregulated in PDAC patients and could be used to distinguish these from HCs and patients with other pancreatic diseases with AUC values of 0.887 and 0.897, respectively [38].Since then, sEV miR-21 was widely used to distinguish PDAC patients from non-PDAC individuals with an AUC ranging from 0.72 to 1 [39][40][41][42].In addition to sEV miR-21, several other sEV miRNAs, single or combined, were applied to establish a diagnostic model in PDAC.Lai et al. reported that miR-10b, miR-30c, miR-181a, and miR-let7a were able to distinguish PDAC patients from HCs with an AUC of 1 [8].Zhou et al. and Xiao et al. also found that sEV miR-10b could serve as biomarker for PDAC diagnosis, with AUC values of 0.88 and 0.99 [41,43].Those studies, however, were limited by the small number of sEV miRNAs detected in plasma, less varied controls, a lack of a validation set, a relatively small sample size, or a lack of effective calculation methods.
In this study, we obtained miRNA-seq expression profiles from 106 human plasma sEV, including 58 PDAC patients and 48 non-PDAC individuals.Though the sample size was not the largest, we included more comprehensive cases in the non-PDAC group, such as HC, CP patients, and BPT patients.Additionally, we divided the participants into two groups, and along with the training set, a validation set was used to assess the algorithm's effectiveness.To the best of our knowledge, this is the only published plasma sEV miRNA-seq expression profile library that includes all these groups from human plasma sEV.
In addition, we used several machine learning methods to select features, then developed diagnostic models on this basis.In this study, we used three machine learning algorithms to select features.We identified 11, 30, and 4 sEV miRNAs by using the LASSO-logistic regression, RF, and SVM-RFE methods, respectively.After this, we began to establish diagnostic models with or without CA19-9 by using the features selected in the training cohort and re-accessed in a validation cohort.The results suggested that the LASSO d-signature, RF d-    [44].With machine learning, not only can we predict patient status, but we can also identify previously unknown interactions and identify novel biological features [45].We then focused on miR-664a-3p because it was selected by all three machine learning methods.Our data are the first to suggest that plasma sEV miR-664a-3p is significantly upregulated in PDAC patients.Plasma sEV miR-664a-3p could distinguish PDAC patients from non-PDAC individuals, HC, and BPD patients with or without CA19-9.
However, there have been no reports regarding sEV miR-664a-3p in PDAC, and reports on miR-664a-3p in other diseases are even more rare.Marcin et al. reported that miR-664a-3p was overexpressed in GPA and could discriminate between active disease and remission [20].MiR-644a was decreased in blood and tissue samples from Lichen sclerosis (LS) patients compared with that in samples from healthy female volunteers [46].Apart from the clinical significance of miR-664a-3p, there are several studies that investigated the biological role of this miRNA in cancers and other diseases.He et al. reported that the upregulation of miR-664a-3p inhibits the proliferation of ovarian granulosa cells and increases apoptosis rates by downregulating the expression of BCL2A1 and blocking the MAPK/ERK signalling pathway [47].However, Wang et al. described that miR-664a-3p was frequently upregulated in GC tissues and cells, and elevated expression of miR-664a-3p could significantly promote cancer cell proliferation and invasion in vitro and in vivo [15].These findings suggest that miR-664a-3p plays varied roles in different cases.These studies prompted us to further investigate the role of miR-664a-3p in PDAC, especially sEV miR-664a-3p.
Analysis of miR-664a-3p expression in PDAC cell lines and their CM sEV normalized to the levels in HPDE6C7 showed that miR-664a-3p was enriched in PDAC CM sEV.Similarly, plasma sEV miR-664a-3p was upregulated in PDAC patients, while it showed no significant difference between cancer and para-cancer tissues.These results underscored that miR-664a-3p in sEV plays an important role in pancreatic cancer diagnosis.Further analysis of the correlation between the expression of plasma sEV miR-664a-3p in PDAC patients and clinicopathological characteristics indicated that higher plasma sEV miR-664a-3p levels were positively correlated with the presence of vascular invasion, lower surgery ratio, advanced TNM stage, presence of distant metastasis, and poor differentiation.FISH assay also showed that miR-664a-3p was mainly distributed in neovascularization and fibrous tissues within PDAC stroma.Our study also demonstrated that upregulating miR-664a-3p in CFPAC-1 cells could increase the protein expression levels of EMT markers Snail and N-cadherin and attenuate the levels of E-cadherin.Cells treated with a miR-664a-3p inhibitor showed the opposite expression trends for these markers.Meanwhile, inhibition of miR-664a-3p could significantly decrease the expression of HIF-1α, VEGFA, and VEGFC, which have been reported to act as factors that contribute to tumour angiogenesis.These results suggest that miR-664a-3p is involved in the invasion and metastasis of pancreatic cancer, which are possibly mediated by sEV.Subsequently, we found THBS2 as a downstream target of miR-664a-3p.THBS2, secreted by not only cancer cells but also endothelial cells and immune cells [48], plays an important role in cancer progression, though serving as either a tumour-suppressor or tumour-promoter remains controversial [49].As a multifunctional glycoprotein, it was involved in angiogenesis, cell motility, and cytoskeletal organization [50,51].And it has been well acknowledged as potent endogenous inhibitors of angiogenesis [52].In this study, overexpression of miR-664a-3p could decrease the expression level of THBS2, which may lead to angiogenesis and cancer progression.This brings up a potential target for the treatment of PDAC.
However, there are still several limitations in our study.Firstly, despite the rarity of the sample, the in-house dataset had not enough sample size.Secondly, due to a lack of large and comprehensive public database, we have not performed external validation.Thirdly, though we've demonstrated the diagnostic efficacy of sEV miR-664a-3p in PDAC and conducted initial exploratory direction in angiogenesis and metastasis, the deep mechanism about how sEV transfers miR-664a-3p to recipient cells is not fully elucidated.
In summary, our study used three machine learning feature selection algorithms to identify three diagnostic miRNA signatures for PDAC plasma sEV.These signatures showed robust diagnostic abilities in distinguishing PDAC from HC, non-PDAC patients, and BPD patients.We created an integrated analysis of features selected using the three machine learning methods, then focused on an overlapping miRNA.Plasma sEV miR-664a-3p may serve as a novel PDAC diagnostic biomarker, and higher expression levels are also suggestive of a higher likelihood of tumour invasion and metastasis.Collectively, these sEV miRNAs, especially miR-664a-3p, and their target genes may provide a novel PDAC diagnostic signature, reveal novel disease mechanisms, and highlight future putative drug targets.

Fig. 2 .
Fig. 2. A plasma sEV miRNAs signature in the diagnosis of PDAC based on LASSO Logistic regression.(A) The LASSO logistic regression method is used for feature selection), (B) Matrix of correlation (Spearman) between miRNAs selected by LASSO Logistic regression), (C) ROC curves of sEV miRNAs for PDAC patients in training cohort (solid line) and validation cohort (dashed line)), (D) Unsupervised hierarchical clustering of 11 sEV miRNAs selected for use in the d-signature in the training cohort), (E) Unsupervised hierarchical clustering of 11 sEV miRNAs selected for use in the d-signature in the validation cohort.ROC, receiver operating characteristic; AUC, area under the curve; PPR, positive predictive ratio; FPR, false predictive ratio.

Fig. 3 .
Fig. 3.A plasma sEV miRNAs signature in the diagnosis of PDAC based on random forest machine learning algorithm and SVM-RFE feature selection and SVM classification algorithm.(A) Variable importance in random forests considering mean decrease in Gini index), (B) Prediction results (training cohort prediction results (upper), validation cohort prediction results (lower)), (C) Confusion matrix of training data (upper) and validation data (lower)), (D) Cross-validated RMSE by recursive feature elimination (RFE) algorithm), (E) ROC curves of 4 sEV miRNAs selected by SVM-RFE for PDAC patients in training set (solid line) and validation set (dashed line).RMSE, root mean square error.

Fig. 4 .
Fig. 4. ROC curves of three plasma sEV miRNAs signatures selected by different machine learning methods combined with CA19-9 for PDAC patients in training cohort (solid line) and validation cohort (dashed line) and summaries of indicators of diagnostic performance.(A) LASSO logistic regression utilising 11 sEV miRNAs), (B) Random forest algorithm utilising 30 sEV miRNAs), (C) SVM-RFE algorithm utilizing 4 sEV miRNAs.(D) Summaries of performance of d-signature based on machine learning methods with or without CA19-9.D-signature, diagnostic signature.

Table 1
Characteristics of participants both in training cohort and validation cohort.

Table 2 .
1 Performance of d-signature in the PDAC using LASSO-logistic regression with/without CA19-9.

Table 2 .
2 Performance of d-signature in the PDAC using RF with/without CA19-9.

Table 3
Plasma sEV miR-664a-3p was correlated with clinical pathologic characteristics of PDAC patients.
signature, and SVM-RFE d-signature could all successfully distinguish PDAC patients from HC with respective AUC values of 0.978, 1, and 0.9469 in the training cohort and of 0.939, 0.9524, and 0.9561 in the validation cohort.When combined with CA19-9, the performances could be improved further, with respective AUC values of 1, 1, and 0.9615 in the training cohort, and of 0.974, 0.98, and 0.9737 in the validation cohort.The LASSO d-signature and RF d-signature performed better than the SVM-RFE d-signature for distinguishing PDAC patients from non-PDAC individuals both in the training and validation sets with or without CA19-9.In addition, the LASSO d-signature performed best for distinguishing PDAC patients from BPD patients with an AUC of 1 in both the training and validation sets.Furthermore, the LASSO d-signature had AUC values of 1 in the training set and 0.877 in the validation cohort when combined with CA19-9.Wang et al. reported in a study that included 17 PDAC and 12 BPD patients that sEV miR-1226-3p had acceptable performance for differentiating PDAC from BPD patients, with an AUC of 0.74