Identification of multiple proteoforms biomarkers on clinical samples by routine Top-Down approaches

Top-Down approaches have an extremely high biological relevance, especially when it comes to biomarker discovery, but the necessary pre-fractionation constraints are not easily compatible with the robustness requirements and the size of clinical sample cohorts. We have demonstrated that intact protein profiling studies could be run on UHR-Q-ToF with limited pre-fractionation (Schmit et al., 2017) [1]. The dataset presented herein is an extension of this research. Proteoforms known to play a role in the pathophysiological process of Alzheimer's disease were identified as candidate biomarkers. In this article, the mass spectrometry performance of these candidates is demonstrated.



Value of data
Proof of concept for intact protein analysis on biofluid.
These data present identified proteoforms originating from the CSF of Alzheimer's disease patients.
Proteoform sequences and/or modifications will be shared with the community to extend the available information and to better understand the physiopathology.

Data
Proteins contained in CSF were directly analyzed by LC-MS with a Top-Down approach. This type of analysis gave information on proteoform composition. This proof-of-concept study was applied to a patient cohort (30 samples) in the context of Alzheimer's disease. The samples were separated into 3 groups: group 1 (patients with Alzheimer's disease), group 2 (patients with other neurodegenerative diseases), and group 3 (patients with non-neurodegenerative diseases).
The number of compounds after MS analysis totaled between 12,000 and 18,000. Compounds were filtered for charge state > 1 and SNAP correlation > 0.75; compounds with a mass difference of less than 2 ppm and a retention time difference of less than 2 min were considered identical. More than 5000 compounds common to the datasets from all 30 patients were used for the statistical analysis. Compounds whose p-value was below a 0.02 threshold were then tested for correlation (Pearson's correlation) with all the clinical markers, including AD markers (Tau concentration; results of memory tests). Positively correlated compounds (r² > 0.8) were then selected for further MS/MS analysis and identification. MS results containing the monoisotopic pattern, Extracted Ion Chromatogram, and MS/MS spectra are presented in 6 figures.
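The matching and selection rules above can be illustrated with a minimal sketch. It assumes the tolerances are applied as strict "less than" comparisons and reads the correlation threshold as r² > 0.8 with a positive r; the names `Compound`, `same_compound`, `pearson_r`, and `is_selected` are hypothetical and are not part of the Bruker software.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    mono_mass: float  # deconvoluted monoisotopic mass, in Da
    rt_min: float     # retention time, in minutes


def ppm_difference(mass_a, mass_b):
    """Relative mass difference in parts per million, referenced to mass_a."""
    return abs(mass_a - mass_b) / mass_a * 1e6


def same_compound(a, b, ppm_tol=2.0, rt_tol_min=2.0):
    """True when both the mass and retention time differences are within tolerance."""
    return (ppm_difference(a.mono_mass, b.mono_mass) < ppm_tol
            and abs(a.rt_min - b.rt_min) < rt_tol_min)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def is_selected(r, r2_threshold=0.8):
    """Assumed selection rule: positive correlation with r squared above the threshold."""
    return r > 0 and r * r > r2_threshold
```

For example, two compounds at 6500.00 Da and 6500.01 Da (about 1.5 ppm apart) eluting 0.8 min apart would be treated as the same compound under these assumptions.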
Proteoforms found to be regulated in AD pathology are listed in Table 1. These proteoforms derive from 3 canonical proteins: clusterin, secretogranin-2, and chromogranin-A. These proteins are known biomarkers of AD and neurodegenerative disorders [1].
- Clusterin proteoforms are shown in Figs. 1-3. CLUS-01 to CLUS-03 correspond to the C-terminal part of the full-length protein, starting at position 390 or 391 (Fig. 1). These closely related species co-eluted. CLUS-01 was identified by Byonic™ (Fig. 2) with a score of 770.1, and the two other species were deduced from mass differences (below 2.4 ppm) at the MS1 level. CLUS-04, a shorter proteoform starting at position 420, was identified by Mascot with a peptide score of 52 and good mass precision (MS1: 3.4 ppm; MS2: 4.24 ppm) (Fig. 3).
- Secretogranin-2 proteoforms are shown in Fig. 4. One proteoform corresponding to the middle part (182-214) of the protein was identified by Byonic™ with a score of 1213.1.
- Chromogranin-A proteoforms are shown in Figs. 5 and 6. Two proteoforms were detected (Chrom-01 and Chrom-02), with very different intensities that indicated a completely different stoichiometry. The form spanning positions 439 to 457 was present in very low quantity, and its MS/MS identification required manual de novo sequencing. This sequencing used stringent mass precision criteria at the MS1 level (< 3 ppm) and the MS2 level (< 10 ppm). A longer proteoform was 37.5 times more intense, based on MS1 ion extraction, and could be identified by Byonic™ with a score of 468.2 (Fig. 6).

Experimental design, materials and methods
Experimental design and the materials and methods have been reported previously [1]. [...] for 5 min, ramp to 9% B in 10 min, then ramp to 35% B in 110 min, ramp to 40% B in 8 min, ramp to 60% B in 9 min and ramp to 95% B in 3 min maintained for 15 min. Ramp down to 5% B in 3 min, re-equilibrate for 23 min). The nano-LC system was coupled to an Impact II™ benchtop UHR-Q-ToF (Bruker Daltonik, Bremen, Germany) through a CaptiveSpray nanoBooster™ source (Bruker Daltonik, Bremen, Germany). Drying gas flow and temperature were set to 5 L/min and 180 °C, respectively, and nanoBooster gas pressure was set to 0.2 bar. The nanoBooster reservoir was filled with acetonitrile.

LC-MS data processing for CSF samples
Data Processing: LC-MS data were automatically processed (calibration, protein signal extraction with Dissect™, deconvolution and determination of monoisotopic masses with SNAP™, charge state filtering, similarity filtering, export of deconvoluted monoisotopic masses with corresponding retention times and intensities) in Data Analysis 4.2™ (Bruker Daltonik, Bremen, Germany). Singly charged compounds were automatically excluded, and only isotopically resolved compounds were taken into account.
Statistical analyses were performed with Profile Analysis 2.1™ (Bruker Daltonik). The retention times, intensities and deconvoluted masses obtained for each compound from the Data Analysis processing were used to generate the bucket table. The mass accuracy tolerance was set to 2 ppm and the retention time tolerance to 0.5 min (High Flow analysis) or 2 min (Low Flow analysis). Compounds sharing the same mass and retention time coordinates within those tolerances were considered as similar. The bucket tables were built with all compounds present in at least 60% of one class, and missing values were replaced by the average value of the bucket in the class the analysis belongs to. Intensity values were then normalized with the quantile normalization algorithm available in Profile Analysis. A Student's t-test was performed to reveal compounds capable of discriminating 2 classes (p-value < 0.02). Statistical analysis was performed with MedCalc™ 12.
[...] either manually with BioTools 3.2™ (Bruker Daltonik, Bremen, Germany) or automatically with Byonic™ (Protein Metrics, San Carlos, USA). With BioTools, the Top-Down Sequencing search functionality was used with Mascot 2.4 (Matrix Science) to identify proteoforms with a partially unmodified sequence. When this approach did not suffice to identify the protein from which the designated proteoform originates, a BLAST search was performed after an initial tag determination.
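The bucket-table rules described earlier in this section (presence in at least 60% of one class, class-mean replacement of missing values) can be sketched as follows. This is an illustration under stated assumptions, not the Profile Analysis implementation: missing intensities are represented as `None`, the 60% threshold is treated as inclusive, and the function names `keep_bucket` and `impute_class_mean` are hypothetical. Quantile normalization and the t-test are not covered.

```python
def keep_bucket(intensities, classes, min_fraction=0.6):
    """Keep a bucket if it is observed in at least min_fraction of the
    samples of at least one class.

    intensities: one value per sample, None when the compound is missing.
    classes: class label of each sample, in the same order.
    """
    for label in set(classes):
        values = [v for v, c in zip(intensities, classes) if c == label]
        present = sum(v is not None for v in values)
        if values and present / len(values) >= min_fraction:
            return True
    return False


def impute_class_mean(intensities, classes):
    """Replace each missing value by the mean of the observed values of the
    same bucket within the sample's own class.

    If a class has no observed value for this bucket, its missing values
    stay None (an assumption; the source does not describe this case).
    """
    means = {}
    for label in set(classes):
        observed = [v for v, c in zip(intensities, classes)
                    if c == label and v is not None]
        means[label] = sum(observed) / len(observed) if observed else None
    return [means[c] if v is None else v
            for v, c in zip(intensities, classes)]
```

With 5 samples per class, a compound observed in 3 of the 5 samples of one class passes the filter even if it is nearly absent from the other class, which is consistent with the goal of finding compounds that discriminate between classes.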
In both cases, the full characterization was then obtained by mutation/modification searches performed with the Sequence Editor functionality available in BioTools 3.2™. Byonic searches Top-Down data in the same way as Bottom-Up data, meaning that the user supplies a protein database, allowed PTMs, and specificity of N- and C-termini, where "fully specific" [...]. Byonic searches were performed with various protein databases (one containing only full secretogranin, transthyretin, and chromogranin sequences; [...] modifications are applied to all potential sites in a protein, with separate limits for each type of modification as well as a limit on the total number of modifications. The searches allowed 10 ppm precursor mass tolerance, 30 ppm fragment mass tolerance, and symmetric "narrow" compensation for precursor monoisotopic mass calls, which allows no error in nominal mass for precursors up to 2500 Da, ±1 Da error for precursors from 2500 to 5000 Da, and ±2 Da error for precursors of mass greater than 5000 Da.
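The effect of the "narrow" compensation can be illustrated with a short sketch, reading the tolerance steps as ±1 Da and ±2 Da. It is not Byonic's implementation: the inclusive handling of the 2500 Da and 5000 Da boundaries, the use of the theoretical mass to pick the tier, the 1.00335 Da isotope spacing (the ¹³C-¹²C mass difference), and the function names are all assumptions made for illustration.

```python
ISOTOPE_SPACING_DA = 1.00335  # 13C - 12C mass difference, assumed spacing


def allowed_nominal_mass_errors(precursor_mass_da):
    """Nominal mass errors (in isotope steps) tolerated for a given precursor mass."""
    if precursor_mass_da <= 2500:
        return (0,)
    if precursor_mass_da <= 5000:
        return (-1, 0, 1)
    return (-2, -1, 0, 1, 2)


def precursor_matches(observed_da, theoretical_da, ppm_tol=10.0):
    """True if the observed monoisotopic mass agrees with the theoretical mass
    within ppm_tol after correcting for an allowed monoisotopic miscall."""
    for k in allowed_nominal_mass_errors(theoretical_da):
        corrected = observed_da - k * ISOTOPE_SPACING_DA
        if abs(corrected - theoretical_da) / theoretical_da * 1e6 <= ppm_tol:
            return True
    return False
```

Under these assumptions, a 6000 Da precursor whose monoisotopic peak was called two isotopes too high would still match, whereas a 2000 Da precursor called one isotope too high would not.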