Discovery and validation of serum glycoprotein biomarkers for high grade serous ovarian cancer

Abstract
Purpose: This study aimed to identify serum glycoprotein biomarkers for early detection of high‐grade serous ovarian cancer (HGSOC), the most common and aggressive histotype of ovarian cancer.
Experimental design: The glycoproteomics pipeline lectin magnetic bead array (LeMBA)‐mass spectrometry (MS) was used in age‐matched case‐control serum samples. Clinical samples collected at diagnosis were divided into discovery (n = 30) and validation (n = 98) sets. We also analysed a set of preclinical sera (n = 30) collected prior to HGSOC diagnosis in the UK Collaborative Trial of Ovarian Cancer Screening.
Results: A 7‐lectin LeMBA‐MS/MS discovery screen shortlisted 59 candidate proteins and three lectins. Validation analysis using 3‐lectin LeMBA‐multiple reaction monitoring (MRM) confirmed elevated A1AT, AACT, CO9, HPT and ITIH3 and reduced A2MG, ALS, IBP3 and PON1 glycoforms in HGSOC. The best performing multimarker signature had 87.7% area under the receiver operating curve, 90.7% specificity and 70.4% sensitivity for distinguishing HGSOC from benign and healthy groups. In the preclinical set, CO9, ITIH3 and A2MG glycoforms were altered in samples collected 11.1 ± 5.1 months prior to HGSOC diagnosis, suggesting potential for early detection.
Conclusions and clinical relevance: Our findings provide evidence of candidate early HGSOC serum glycoprotein biomarkers, laying the foundation for further study in larger cohorts.


INTRODUCTION
While ovarian cancer is only the third most common gynaecological cancer worldwide, it is the leading cause of gynaecological cancer mortality. One of the main factors for this high mortality rate is diagnosis at an advanced stage, owing to non-specific symptoms and the lack of effective predictive or diagnostic blood biomarkers. Only 29% of women with distant metastases survive 5 years, compared to 92% with localized disease [1].
The existing ovarian cancer diagnostic tests in clinical use, transvaginal ultrasound and serum CA125, do not have the sensitivity required to detect the disease at an early stage. Indeed, two large ovarian cancer screening trials using a combination of these modalities found no evidence of a reduction in disease-specific mortality on long-term follow-up [2,3]. Furthermore, neither test is specific to cancer and both trials reported unnecessary surgery in women without cancer [4]. This has led to a concerted effort to discover biomarkers for early detection of ovarian cancer [5].
Cancer is associated with alterations in the glycosylation machinery and in glycan structures on circulating proteins [6]. Several studies have shown that measuring specific glycoforms of cancer biomarkers can improve specificity. For ovarian cancer, glycosylated forms of CA125 measured by microarray [7], lectin immunoassay [8] or glycosylation-specific antibodies [9] can significantly improve differential diagnosis. This suggests that a glycoform-specific glycoprotein biomarker panel may achieve the high specificity and sensitivity required for ovarian cancer screening. Glycomic and glycoproteomic studies on ovarian cancer serum and tissues have revealed differential abundance of several types of N-glycans, including fucosylated, sialylated and high-mannose types [10-15]. However, these potential biomarkers are yet to be clinically validated.
Here, we report on a study using the lectin magnetic bead array (LeMBA)-coupled mass spectrometry (MS) platform [16] for ovarian cancer serum glycoprotein biomarker discovery and validation. LeMBA is a one-pot, high-throughput glycoproteomics method that needs no depletion of abundant serum proteins, potentially increasing the robustness of the biomarker development process, as we previously reported for oesophageal adenocarcinoma [17,18] and canine haemangiosarcoma [19]. It has not previously been used for glycoprotein biomarker studies in ovarian cancer. We focused on the most common and aggressive histotype, high grade serous ovarian cancer (HGSOC), which accounts for ∼70% of ovarian cancers and most of the disease-specific mortality [5]. Furthermore, to discover biomarkers with the potential for early detection, in addition to using samples collected at clinical diagnosis as is the norm, we also evaluated samples collected prior to ovarian cancer diagnosis in the multicentre randomised controlled trial, the United Kingdom Collaborative Trial of Ovarian Cancer Screening (UKCTOCS).

Study design
Case-control studies were undertaken using serum samples from women with HGSOC (cases) and two age-matched control groups: women with a benign ovarian neoplasm and healthy women. Sample sets came from three independent cohorts: (1) a clinical set of 30 serum samples from the United Kingdom Ovarian Population Study (UKOPS) collected from women at diagnosis of HGSOC (n = 10), benign ovarian neoplasms (n = 10) and healthy controls (n = 10) (Table S1); (2) a pre-clinical set of 30 serum samples from the UKCTOCS trial [20] collected from women at a mean interval of 11.1 ± 5.1 months prior to diagnosis of HGSOC (n = 10), benign ovarian neoplasms (n = 10) and healthy controls (n = 10) (Table S1); and (3) a clinical set of 95 serum samples from the Australian Ovarian Cancer Study (AOCS) collected from women at diagnosis of HGSOC (n = 39), benign ovarian neoplasms (n = 28) and healthy controls (n = 28) (Table S2).
The discovery phase included the UKOPS clinical set and the UKCTOCS pre-clinical set. A shortlist of candidate proteins and lectins from the discovery phase was then validated using the independent clinical set from the AOCS (Figure 1). This study was approved by the QIMR

Biomarker discovery phase
LeMBA-MS was used for both discovery and validation phases (Figure 1). For discovery, seven lectins were selected from the litera-

Lectin magnetic bead pulldown and on-bead trypsin digest
LeMBA-MS with the selected seven lectins was performed as previously described [16,18]. Pulldown using the prepared lectin magnetic beads was performed on the AssayMAP Bravo liquid handler workstation (Agilent Technologies, USA) using one lectin per microplate. Briefly, 50 µL conjugated beads and 100 µL denatured serum were added to each well of a 96-

Statement of Clinical Relevance
Ovarian cancer continues to be associated with high disease mortality. Much of this is due to diagnosis at an advanced stage of the most common and aggressive histotype, high grade serous ovarian cancer (HGSOC). This has led to significant efforts to detect the disease earlier, when treatment is more effective. However, to date there is no effective screening test. We describe the discovery and validation of a novel blood glycoprotein signature for HGSOC using lectin magnetic bead array (LeMBA)-coupled mass spectrometry in a case-control study of clinical samples collected at diagnosis.

2.2.3 Data dependent acquisition mass spectrometry
Shotgun proteomics using data-dependent acquisition was performed on a SCIEX TripleTOF 5600+ mass spectrometer (SCIEX, USA) coupled to a Shimadzu LC-20AD Prominence nano liquid chromatography system (Shimadzu, Japan). All solvents and reagents were of MS grade (Thermo Fisher, USA). The mass spectrometer was controlled using Analyst 1.7 software (SCIEX, USA). Digested peptides were resuspended in 0.1% v/v FA and injected onto a Protecol C18 analytical column (200 Å, 3 µm, 150 mm × 150 µm, Trajan Scientific, Australia) connected to a Protecol guard column (Polar 120 Å, 3 µm, 10 mm × 300 µm, Trajan Scientific, Australia). The sample injection order was randomised in the worklist. The column compartment was maintained at 45 °C. The peptides were eluted using mobile phase A (0.1% v/v FA) over the specified gradient of mobile phase B (95% acetonitrile, 5% v/v water, 0.1% v/v FA) for 60 min at a flow rate of […]
FIGURE 1 Biomarker study design. Discovery and validation of HGSOC biomarkers was conducted in two phases, starting from separate clinical cohorts (1) and sera collection (2). Lectin selection (3) was based on literature for discovery phase and the discovery results for the validation phase. Both phases use LeMBA (4), liquid handler-assisted pulldown (5) and on-bead digestion (6). Shotgun mass spectrometry was conducted for discovery phase (7) followed by discovery of candidates (8) for development of a targeted mass spectrometry assay (9) for validation phase. Both univariate and multivariate analyses were conducted for biomarker validation (10).
The acquired raw ion spectra for each lectin batch were searched against the reviewed UniProt human proteome database (20,365 proteins, accession date 1 January 2020) using MaxQuant software, v.1.6.6.0 [21]. The MaxQuant contaminant database (247 entries) was also searched to identify contaminants such as keratin. MaxQuant parameters were set as follows: digestion was set to trypsin with two missed cleavages; the fixed modification was cysteine carbamidomethylation; variable modifications were methionine oxidation and N-terminal acetylation; label free quantification (LFQ) was enabled with minimum ratio count set to 2; unique and razor peptides were used for protein identification; match between runs was set as TRUE; and the false discovery rate (FDR) for protein and peptide identification was set at 0.01. AB Sciex Q-TOF was set as the instrument type using default settings. Peptide tolerance was set to 0.07 Da for the first search and 0.06 Da for the main search. MS/MS tolerance was set at 40 ppm. The search results were imported into R software v1.4.1103 (www.R-project.org) for further data processing and statistical analyses.

2.2.4 Mass spectrometry data processing and statistical analysis
The generated protein list for each lectin batch was filtered to remove contaminants, reverse-identified protein IDs, proteins with < 2 peptide IDs and proteins with score < 5. Proteins missing in < 25% of all samples were considered missing at random and imputed using localized least squares regression (llsimpute) [22]. Proteins missing in > 25% of samples were imputed with the minimum detected value (values drawn randomly from a normal distribution centred at the sample minimum, with SD estimated from non-missing proteins). Quantitative analysis was conducted at the protein level using the summed intensity of all peptides mapped to each protein. Log2 transformed data were analysed using the R limma package [23] and Student's t-test. As candidates from the discovery phase would be further confirmed by targeted MS, and applying a false discovery rate correction to the dataset yielded very few significant differences, we shortlisted candidates based on non-adjusted p-values. Differentially abundant proteins were visualised by volcano plots, using the criteria p-value < 0.05 and log2 fold change > 1, and labelled using the nomenclature 'lectin-UniProt entry name'. All graphical output was generated using R or GraphPad Prism v9 (San Diego, USA) and figures were prepared using Illustrator v26.3.1 (Adobe Inc, USA) and Biorender (www.biorender.com).
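The two-tier handling of missing values described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the R code used in the study: the function names `split_by_missingness` and `impute_minimum` and the example table are hypothetical, values are assumed to be on the log2 scale, the SD is assumed to be the standard deviation of the observed values in each sample, and a protein missing in exactly 25% of samples is assumed to fall in the low-missingness group. The local least squares step for low-missingness proteins is not implemented here.

```python
import random
import statistics

MISSING_CUTOFF = 0.25  # 25% threshold quoted in the text


def split_by_missingness(table, cutoff=MISSING_CUTOFF):
    """Split proteins into (low_missing, high_missing) name lists.

    `table` maps protein name -> list of log2 intensities, one per sample,
    with None marking a missing value. Exactly 25% missing is treated as
    low missingness (assumption; the text does not cover this case).
    """
    low, high = [], []
    for protein, values in table.items():
        fraction = sum(v is None for v in values) / len(values)
        (high if fraction > cutoff else low).append(protein)
    return low, high


def impute_minimum(table, proteins, seed=0):
    """Fill gaps in `proteins` with draws from N(sample minimum, sample SD)."""
    rng = random.Random(seed)
    n_samples = len(next(iter(table.values())))
    centre, spread = [], []
    for j in range(n_samples):
        observed = [vals[j] for vals in table.values() if vals[j] is not None]
        centre.append(min(observed))
        spread.append(statistics.stdev(observed))  # assumed SD estimate
    # Work on a copy so the input table is left untouched.
    filled = {name: list(vals) for name, vals in table.items()}
    for name in proteins:
        for j, value in enumerate(filled[name]):
            if value is None:
                filled[name][j] = rng.gauss(centre[j], spread[j])
    return filled


# Hypothetical log2 intensities for four proteins across four samples.
example = {
    "P1": [20.0, 21.0, 19.5, 20.5],
    "P2": [18.0, None, 18.5, 17.5],
    "P3": [None, None, 15.0, None],
    "P4": [22.0, 22.5, 21.5, 23.0],
}
low, high = split_by_missingness(example)
filled = impute_minimum(example, high)
```

In this sketch only the high-missingness protein is filled; the single gap in the low-missingness protein is left for a regression-based imputation step.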

2.3 Biomarker validation phase
Due to a mass spectrometer software failure, data for the first batch of AAL-pulldown samples were not saved. The entire plate had to be rerun using remaining sample volume, and 25 samples were noted to have lower remaining volume. In addition, an injection problem was noted for one SNA-pulldown sample.

Data analysis
The data for each lectin MRM-MS were analysed independently, with no comparisons performed across the three lectins. Data quality control was conducted using the peak areas of the three internal standard chicken ovalbumin peptides (Figure S1) and the three spiked-in SIS peptides (Figure S2). While most chicken ovalbumin peptides were consistent, the 25 AAL samples and one SNA sample with noted aberrations at the MS step showed larger variability and were removed as outliers (Figures S1, S2). After outlier removal, the calculated %CV for all peptide standards was less than 10% (except peptide AAL-VASMASEK, 11.3%) (Table S3) and the peak areas of all samples remaining in the analysis were normally distributed (Figure S3). As the dataset had low %CV, we decided that normalisation was not required. Student's t-test was performed between the case and control groups, and the Benjamini-Hochberg false discovery rate correction was applied to identify significantly differing proteins at adjusted p-value < 0.05. All graphical output was generated using R or GraphPad Prism and figures were prepared using Adobe Illustrator and Biorender.
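The two quantities used in this quality control and testing step, the percent coefficient of variation and the Benjamini-Hochberg adjusted p-value, can be illustrated with a short Python sketch. The function names `percent_cv` and `benjamini_hochberg` and the input numbers are hypothetical, and the use of the sample standard deviation for %CV is an assumption; the study itself ran these calculations in R or GraphPad Prism.

```python
import statistics


def percent_cv(values):
    """Percent coefficient of variation: 100 * SD / mean (sample SD assumed)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical peak areas of one internal standard peptide across samples.
standard_cv = percent_cv([100.0, 110.0, 90.0])

# Hypothetical raw p-values for four peptides.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [q < 0.05 for q in adjusted]
```

With these made-up inputs only the first peptide stays below the 0.05 cut-off after adjustment, which shows how the correction can remove candidates that pass an unadjusted threshold.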

Multi-marker panel development
Generalized regression with binomial distribution and lasso estimation was used to develop multi-marker panels using JMP Pro version 16.2.0 (JMP Pro Inc., Cary, NC, USA). Performance of the multimarker panels was assessed using leave-one-out cross-validation, where the area under the receiver operating characteristic curve (AUC), specificity and sensitivity were calculated on the left-out observations. The number of times each parameter (peptide) was chosen in the cross-validation models is presented as an indication of the relative importance of the markers for prediction. All models were also run using peptides standardised by subtraction of the mean of the three internal standard chicken ovalbumin peptides, but these are not reported as they yielded comparable results.
Inclusion of the 25 AAL outliers was also tested and led to similar, albeit slightly worse, prediction models.
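The evaluation scheme described above, scoring each left-out sample, computing AUC, sensitivity and specificity on those scores, and counting how often each peptide is selected, can be sketched as follows. This is an illustration only: the lasso model was fitted in JMP Pro, and `fit_midpoint_rule` below is a deliberately simple toy stand-in, not a lasso regression. All function names, the 0.5 probability threshold and the example data are hypothetical.

```python
import math
from collections import Counter


def leave_one_out(X, y, fit):
    """Score each sample with a model fitted on all other samples.

    `fit(X_train, y_train)` must return (score_fn, selected_features).
    """
    scores, selected_counts = [], Counter()
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        score_fn, selected = fit(X_train, y_train)
        scores.append(score_fn(X[i]))
        selected_counts.update(selected)
    return scores, selected_counts


def auc(scores, labels):
    """AUC as the fraction of case/control pairs ranked correctly (ties 0.5)."""
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(scores, labels, threshold=0.5):
    tp = sum(s >= threshold for s, lab in zip(scores, labels) if lab == 1)
    tn = sum(s < threshold for s, lab in zip(scores, labels) if lab == 0)
    return tp / labels.count(1), tn / labels.count(0)


def stable_features(selected_counts, n_folds, min_fraction=0.5):
    """Features selected in at least `min_fraction` of the folds."""
    return sorted(f for f, n in selected_counts.items()
                  if n / n_folds >= min_fraction)


def fit_midpoint_rule(X_train, y_train):
    """Toy stand-in for the lasso model: uses the most separated feature."""
    best = None
    for j in range(len(X_train[0])):
        case = [r[j] for r, lab in zip(X_train, y_train) if lab == 1]
        ctrl = [r[j] for r, lab in zip(X_train, y_train) if lab == 0]
        case_mean, ctrl_mean = sum(case) / len(case), sum(ctrl) / len(ctrl)
        gap = case_mean - ctrl_mean
        if best is None or abs(gap) > abs(best[1]):
            best = (j, gap, (case_mean + ctrl_mean) / 2)
    j, gap, mid = best
    sign = 1.0 if gap >= 0 else -1.0
    return (lambda row: 1 / (1 + math.exp(-sign * (row[j] - mid)))), [j]


# Hypothetical data: feature 0 separates controls (0) from cases (1).
X = [[1.0, 5.0], [1.2, 5.1], [0.9, 4.9], [3.0, 5.0], [3.2, 5.2], [2.9, 4.8]]
y = [0, 0, 0, 1, 1, 1]
scores, counts = leave_one_out(X, y, fit_midpoint_rule)
model_auc = auc(scores, y)
sens, spec = sensitivity_specificity(scores, y)
stable = stable_features(counts, n_folds=len(X))
```

Because every score comes from a model that never saw the sample being scored, the resulting metrics are less optimistic than those computed on the training data, and the selection counts give a simple view of how stable each marker is across folds.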

Discovery of candidate serum glycoprotein biomarkers
In the UKOPS discovery set, we found 15 and 16 differentially abundant proteins when HGSOC samples were compared to benign and healthy samples, respectively (Figure 2, Table 1, Table S4). Com- […] Figure 2D).

FIGURE 2 Biomarker discovery data. Volcano and two-way scatter plots visualising the differentially abundant proteins and correlated proteins, respectively, between the benign and HGSOC (A, C, E) and healthy and HGSOC (B, D, F) clinical comparisons for UKOPS and UKCTOCS sample sets. The volcano plots highlight all differential glycoproteins according to the criteria p < 0.05, log2 fold change > 1. The scatter plots highlight select glycoproteins (log2 fold change > 0.5) that are upregulated (green dots) and downregulated (red dots) in both sample sets. All candidates are indicated using the nomenclature 'lectin-UniProt entry name'.
Additionally, a subset of candidates displayed the same expression trend in both sample sets, such as LPHA-LG3BP, STL-HEP2 and SNA-CO9 in the HGSOC versus benign comparison (Figure 2E) and ECA-CO8G, SNA-ATRN and STL-A1AT in the healthy versus HGSOC comparison (Figure 2F).

Validation of candidate biomarkers
Three lectins (AAL, SNA and STL) with the largest number of candidate proteins discovered in both UKOPS and UKCTOCS sets (Table 1) were selected for validation in the independent clinical AOCS cohort.
Note: For each lectin, significant proteins with p-value < 0.05 and log2 FC > 1 identified in each clinical cohort are listed. The total number of candidate proteins across each lectin and clinical cohort accounts for overlaps. The proteins are labelled by their UniProt entry name.
The lists of protein candidates discovered from UKOPS and UKCTOCS were combined to generate a list of 44 proteins, which fell short of the target of ∼60 candidate proteins that we previously used for biomarker validation [17,18]. To assess additional candidates that may be just outside the p < 0.05 cut-off, we expanded the […] Table S6.
Univariate analysis for HGSOC versus benign and HGSOC versus healthy samples was conducted at the peptide level on each lectin dataset, with significance cut-offs set at adjusted p-value < 0.05 and log 2 fold change > 0.5 (Table S7) Table S8).

Development of multi-marker signature for HGSOC
The area under the receiver operating characteristic curve (AUC), specificity and sensitivity of the developed multi-peptide models are detailed in Table 3 and Table S9. All four models performed similarly, with the AAL signature having the highest AUC (87.5%), sensitivity (70.4%) and specificity (90.7%). To further inspect the stable peptides for each model, we retained peptides chosen in at least 50% of the cross-validation runs (Table 3).
This analysis revealed several interesting observations. The IBP3 peptide ALAQCAPPPAVCAELVR was always selected for each lectin signature, indicating strong predictive value for HGSOC. Two peptides were highly stable for the SNA, STL and combined signatures, namely A2MG_NEDSLVFVQTDK and CHLE_NIAAFGGNPK. For the combined signature, both SNA-binding and STL-binding IBP3_ALAQCAPPPAVCAELVR were selected with high stability (100% and 91.6%, respectively).
Note: Table shows the number of significant peptides for either HGSOC versus benign or HGSOC versus healthy comparison (q-value < 0.05 and log2 FC > 0.5), the number of proteins with any significant peptide, and the number of proteins with all measured peptides significant and consistent in direction. Proteins with all peptides consistent are arranged in alphabetical order of their UniProt entry name, according to the direction of change in HGSOC.

Evaluation of validated biomarker candidates for early HGSOC detection
To determine if any of the nine validated univariate protein biomarkers were altered in the pre-clinical samples, we re-examined the UKCTOCS discovery LeMBA-DDA-MS data set for the nine proteins (Table 2).
Three proteins, namely CO9, ITIH3 and A2MG, were significantly altered in the UKCTOCS samples. Figure 4 illustrates the comparative data from the discovery UKCTOCS (protein level) and validation (peptide level) phases. AAL-CO9 (Figure 4A) and STL-ITIH3 (Figure 4B) were significantly higher in the HGSOC group compared to the benign and healthy groups in the UKCTOCS set, while SNA-A2MG was lower in the HGSOC group (Figure 4C […] Table S8).

DISCUSSION
This study, involving multiple independent clinical and pre-clinical sample sets, provides evidence of glycoproteins as serum biomarkers for HGSOC and lays the foundation for further research on larger patient cohorts. Excitingly, three glycoproteins (CO9, ITIH3 and A2MG) were altered in pre-clinical serum samples collected 11.1 ± 5.1 months prior to HGSOC diagnosis, suggesting promise in early detection.
The observed changes in proteins pulled down by the three lectins (AAL, SNA and STL) suggest alterations in α-fucose, sialic acid and N-acetylglucosamine during HGSOC development that require further characterization.
To increase likelihood of successful biomarker development, our glycoprotein-focused biomarker pipeline addressed the issue of technical and biological variations by using LeMBA as a common platform across all phases, and developing multi-marker panels, respectively.
The LeMBA-MRM platform also has the advantage of being able to be deployed as a clinical assay [24], reducing the time it takes for the findings to be translated for patient use. Alternatively, lectin-immunoassays can be developed for the discovered biomarkers [25].
We employed a phased biomarker study design to operate within budget [26], where the discovery phase screen uses a relatively small sample size to generate a shortlist of lectins and protein candidates for validation in a larger cohort. In view of the small discovery sample size, our choice of low-stringency statistics on the discovery data, together with the experimental design of analysing all candidate proteins against the three selected lectins (using the single MRM assay), was ultimately critical for successful biomarker validation. Notably, only one specific lectin-protein discovery phase candidate (STL-PON1, Table 1) was ultimately confirmed in the validation cohort (Table 2). The validated biomarkers comprised mostly lectin-protein combinations that were just outside the initial cut-off for the discovery analysis but were analysed in the validation cohort because the custom MRM assay was used on all three selected lectins. This outcome highlights the need for data-specific […] Atlas [29], and correlated with VTE incidence in 32 cancer types in a previously reported Dutch study [27]. While a moderate correlation was found between VTE risk and expression of Tissue Factor (F3), a major pro-coagulant factor [29], the authors noted heterogeneity […] [36]. A2MG is a broad-spectrum protease inhibitor which inhibits thrombin and the complement pathway [37]. Reduced A2MG levels may indicate elevated coagulation and complement pathway activity, although the role of the SNA-A2MG glycoform is currently unknown.
TABLE 3 Signature peptides from lectin-pulldowns for distinguishing HGSOC from benign/healthy and the stability of each peptide assessed by number of times selected during cross-validations.
C9 is the terminal component of the complement cascade, and has also been found to be elevated in the serum of patients with gastric [38], lung [39], colorectal [40,41] and esophageal cancers [17,18] [39]. Recently, we reported the release of C9+ EVs by esophageal adenocarcinoma cells as a potential mechanism of the elevated serum C9 glycoform in esophageal cancer [42]. However, the specific glycosylation differences in cancer serum and EVs, as well as the molecular mechanisms underpinning the altered C9 glycosylation in different cancer types, remain to be determined.
Inter-alpha-trypsin inhibitor family members, including ITIH3, have been implicated in inflammation and carcinogenesis [43]. In addition to its protease inhibitor activity, ITIH3 is thought to stabilize the extracellular matrix through binding to hyaluronic acid [44]. Interestingly, previous reports suggest blood ITIH3 to be elevated for a similar range of cancers as C9, namely, lung [45], gastric [46], pancreatic [47] and colorectal [48] cancers.
In conclusion, we report the discovery and validation of serum glycoprotein markers for HGSOC using a lectin-assisted proteomics workflow that is directly translatable to blood tests. The validated markers show high specificity when benchmarked against the existing ovarian cancer biomarker, CA125. Their utility in ovarian cancer diagnosis and monitoring will need to be evaluated in additional cohorts, such as at diagnosis and following surgery/chemotherapy.

DATA AVAILABILITY STATEMENT
The mass spectrometry proteomics data for the discovery phase have been deposited to the ProteomeXchange Consortium via the PRIDE [49] partner repository with the dataset identifier PXD032299. The targeted mass spectrometry data for the validation phase have been deposited to PRIDE via Panorama Public database with the dataset identifier PXD033108.