Multiple Glycation Sites in Blood Plasma Proteins as an Integrated Biomarker of Type 2 Diabetes Mellitus

Type 2 diabetes mellitus (T2DM) is one of the most widespread metabolic diseases. Because of its asymptomatic onset and slow development, early diagnosis and adequate glycemic control are the prerequisites for successful T2DM therapy. In this context, individual amino acid residues might be sensitive indicators of alterations in blood glycation levels. Moreover, due to the large variation in the half-life times of plasma proteins, a generalized biomarker based on multiple glycation sites might provide comprehensive control of the glycemic status across any desired time span. Therefore, here, we address the patterns of glycation sites in highly abundant blood plasma proteins of T2DM patients and corresponding age- and gender-matched controls by comprehensive liquid chromatography-mass spectrometry (LC-MS). The analysis revealed 42 lysyl residues whose glycation was significantly upregulated under hyperglycemic conditions. For 32 of these glycation sites, biomarker behavior was demonstrated here for the first time. The differentially glycated lysines represented nine plasma proteins with half-lives from 2 to 21 days, giving access to an integrated biomarker based on multiple protein-specific Amadori peptides. The validation of this biomarker relied on linear discriminant analysis (LDA) with random sub-sampling of the training set and leave-one-out cross-validation (LOOCV), which resulted in an accuracy, specificity, and sensitivity of 92%, 100%, and 85%, respectively.


Introduction
Diabetes is a ubiquitous disease, with a worldwide occurrence exceeding 387 million cases in 2014 and expected to reach 592 million by the year 2035 [1]. More than 90% of all cases are represented by type 2 diabetes mellitus (T2DM), characterized by impaired insulin action and/or insulin secretion [2]. As the onset of the early metabolic alterations (insulin tolerance and
Figure 1. Early and advanced glycation in human blood plasma. The major pathways of advanced glycation end product (AGE) formation: monosaccharide autoxidation (Wolff pathway), autoxidation of aldimines (Namiki pathway), polyol pathway, autoxidation of Amadori products (Hodge pathway), and non-oxidative pathway.
Fasting plasma glucose (FPG) is the preferred diagnostic parameter, but an increase in blood glucose is also detected in patients suffering from diseases other than diabetes mellitus [12]. Currently, the blood content of the hemoglobin isoform HbA1c, glycated at the N-terminus of its β-chain (>6.5% of the total hemoglobin fraction), is one of the principal diagnostic criteria of this disease [13] and an efficient tool for long-term (60-90 days) glycemic control [14]. However, HbA1c does not deliver any information about the short-term alterations in plasma glucose concentrations that accompany the onset of metabolic syndrome [15]. In contrast, short-lived plasma proteins make it possible to shorten the time frame of glycemic control. Thus, human serum albumin (HSA), the major blood plasma protein with a half-life of 21 days, can be used as a marker of T2DM [16]. Its global glycation rates can be quantified by an array of enzymatic [17], colorimetric [18], immunochemical [19], electrophoretic [20], and chromatographic [21] methods. Importantly, the levels of HSA glycation vary from 1% to 10% in healthy individuals and from 20% to 90% in patients with diabetes [22]. However, as the glycation rates at individual lysyl residues in the HSA molecule differ substantially [23], the sensitivity of the potential glycation sites to short-term changes in blood glucose levels might also differ.
In this context, the monitoring of glycation rates at specific glycation sites might be advantageous in comparison to the quantification of global glycation levels. Therefore, during the last decade, mass spectrometry was intensively employed to establish such techniques [24][25][26]. In all cases, the analytics relied on the bottom-up proteomic approach, based on nano-scaled liquid chromatography-mass spectrometry (nanoLC-MS) and tandem mass spectrometry (MS/MS). Thus, we showed that only some HSA lysyl residues are differentially glycated in the plasma of T2DM patients, whereas glycation rates at other potential modification sites did not seem to be affected [27]. Hence, tryptic peptides representing differentially glycated sites might be considered as prospective T2DM biomarkers. For some of them, this was additionally confirmed by absolute quantification using internal standardization with dabsylated bi-labeled Amadori-modified peptides [28] or 13C,15N-labeled synthetic analogs of specifically glycated peptides [29]. As was shown recently, one of the plasma glycation sites (namely K141 in haptoglobin) might provide an additional diagnostic tool in combination with well-established T2DM markers, such as FPG and HbA1c. The main advantage of combining the two markers, HbA1c and haptoglobin glycated at K141, is the simultaneous consideration of two proteins with different half-life times, i.e., 3 to 4 and 2 to 4 days, respectively. This makes the combined biomarker sensitive to both long- and short-term fluctuations of blood glucose concentrations. The combination of glycated K141 of haptoglobin and HbA1c identified T2DM with a sensitivity of 94%, a specificity of 98%, and an accuracy of 96% [30].
In this context, it is logical to assume that a biomarker strategy based on multiple specific glycation sites in plasma proteins could substantially increase the efficiency of glycemic control and disease prediction. Indeed, the involvement of several glycated proteins with different half-life times (τ1/2) allows several time segments of glycemic control to be addressed without additional analyses. Moreover, this approach might decrease the impact of individual glycation sites on the overall dispersion of the data when larger cohorts are considered. Therefore, here, we present a mass spectrometry-based biomarker approach relying on multiple glycation sites. We demonstrate its applicability to different time points in the span of glycemic control. Based on the results of linear discriminant analysis (LDA) performed for the cohorts of T2DM patients and individuals without diabetes, we characterize a set of Amadori-modified peptides with a diagnostic accuracy of up to 92%.

Normalization by Plasma Protein Contents and Tryptic Digestion
According to the applied workflow (Figure 2), the normalization of analyte abundances relied on the determination of plasma protein contents and on quality controls. The plasma protein concentrations, determined by the Bradford assay, varied from 36.2 to 82.4 mg/mL (Table S1), and were 58.8 ± 9.85 and 61.9 ± 13.5 mg/mL in the T2DM and normoglycemic groups, respectively. These values were used to normalize the protein amounts taken for tryptic digestion and were considered in the label-free quantification procedure (Figure 2). As adequate normalization is a prerequisite for satisfactory precision of label-free quantification [31], the plasma protein contents determined by the Bradford assay were cross-verified by sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) (protein load of 5 µg per lane) (Figure S1), and total lane densities were recorded (Table S1). The whole-lane densitometric analysis revealed an average density of 19,483 ± 1289 arbitrary units (AU; relative standard deviation (RSD) = 6.6%), thereby providing sufficient precision for the reliable quantification of individual glycated peptides by the signal intensities of the corresponding quasi-molecular ions. The subsequent tryptic digestion was considered to be quantitative, as the human serum albumin (HSA) band could not be detected in the SDS-PAGE experiments with tryptic digests (Figure S2), indicating a digestion efficiency better than 95% [27].
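The densitometric RSD quoted above can be rechecked with simple arithmetic. The following is a minimal sketch: the helper names are hypothetical, and the volume calculation only illustrates how a measured concentration maps to a fixed protein load; it is not taken from the authors' protocol.

```python
def rsd_percent(mean, sd):
    """Relative standard deviation (coefficient of variation) in percent."""
    return 100.0 * sd / mean


def plasma_volume_ul(protein_ug, concentration_mg_per_ml):
    """Plasma volume (µL) carrying a given protein amount (µg).

    1 mg/mL equals 1 µg/µL, so the volume is amount / concentration.
    """
    return protein_ug / concentration_mg_per_ml


# Whole-lane densitometry reported in the text: 19,483 ± 1289 AU
print(round(rsd_percent(19483, 1289), 1))  # 6.6

# Illustrative only: volume holding a 5 µg load at the lowest and highest
# plasma protein concentrations reported in the text (36.2 and 82.4 mg/mL)
for conc in (36.2, 82.4):
    print(conc, round(plasma_volume_ul(5.0, conc), 3))
```

In practice such sub-microliter volumes would be handled after dilution; the point of the sketch is only that the amount taken scales inversely with the measured concentration.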

Annotation of Prospectively Glycated Peptides
Due to the high resolution and mass accuracy of the measurements (typically more than 12,000 and less than 3 ppm, respectively), individual glycated peptides could be reliably annotated by their retention times (tR) and the exact masses (m/z) of their multiply charged quasi-molecular ions. This step relied on the list of glycated tryptic peptides obtained from the plasma proteins of T2DM patients, based on the comprehensive works of Zhang et al. [26,32] and representing, to the best of our knowledge, the most complete set of reliably identified in vivo plasma glycation sites (more than 350). For each glycated peptide identified by Zhang et al., all possible charge states were calculated with consideration of the amino acid composition. On the basis of these data, characteristic extracted ion chromatograms (XICs) were constructed in the mass ranges of m/z ± 0.02 for all matched signals with z ≥ 2. For this, the time-of-flight (TOF) scans obtained with pooled diabetic plasma protein digests (n = 3) were used. Most of the glycated peptides were annotated with sub-ppm mass accuracies and in all cases represented the minimal mass deviations from the theoretically predicted values in the ranges used for building the corresponding XICs (Table 1). This strategy revealed a total of 51 Amadori-modified peptides present in the pooled plasma protein digests in each sample, representing 10 major plasma proteins. As these features could be reliably detected, i.e., demonstrated a signal-to-noise ratio ≥ 3 in the corresponding XICs, their relative abundances in the T2DM and non-diabetic cohorts were addressed by the label-free quantification approach.
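The annotation logic described above (charge-state m/z values, an m/z ± 0.02 extraction window, and a ppm mass deviation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' software: the function names are hypothetical, the neutral mass in the example is a made-up value, and the rule for the maximal charge (one charge for the N-terminus plus one per K, R, or H residue) is a common heuristic that the text does not specify.

```python
PROTON_MASS = 1.007276466   # Da
AMADORI_SHIFT = 162.052823  # Da, C6H10O5 added per glucose-derived Amadori moiety


def max_charge_heuristic(sequence):
    """Assumed heuristic: N-terminus plus each basic residue can carry a proton."""
    return 1 + sum(sequence.count(residue) for residue in "KRH")


def mz_for_charge(neutral_mass, z):
    """m/z of the [M + zH]z+ quasi-molecular ion."""
    return (neutral_mass + z * PROTON_MASS) / z


def charge_state_mzs(neutral_mass, max_charge):
    """m/z values for all charge states with z >= 2, as used for the XICs."""
    return {z: mz_for_charge(neutral_mass, z) for z in range(2, max_charge + 1)}


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass deviation in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def in_xic_window(observed_mz, theoretical_mz, half_width=0.02):
    """True if the observed signal falls inside the m/z ± 0.02 window."""
    return abs(observed_mz - theoretical_mz) <= half_width


# Hypothetical unmodified peptide mass; the glycated form is heavier by AMADORI_SHIFT
glycated_mass = 2000.0 + AMADORI_SHIFT
z_max = max_charge_heuristic("LAEYHAKATEHLSTLSEK")  # sequence quoted in the text
for z, mz in charge_state_mzs(glycated_mass, z_max).items():
    print(z, round(mz, 4))
```

Among several signals inside one window, the candidate with the smallest absolute `ppm_error` would correspond to the "minimal mass deviation" criterion mentioned in the text.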

Individual tryptic digests obtained from T2DM patients (n = 20) and corresponding controls (n = 18) were analyzed by capillary RP-HPLC-ESI-QqTOF-MS. The relative quantification relied on a label-free strategy using a digest prepared from T2DM pooled plasma as a quality control (injected after every eighth sample).
Statistical analysis relied on the Mann-Whitney test. C*, M*, and K* denote carbamidomethylated cysteine, methionine sulfoxide, and fructosamine lysine residues, respectively. Ac, Sp, and Sn denote the accuracy, specificity, and sensitivity, respectively (calculated in MS Excel with the Real Statistics add-on, http://www.real-statistics.com). The mass accuracy of each annotation is reported as well.

Label-free Quantification
A paired Mann-Whitney test performed for the intensities of individual peptide signals (expressed as integral peak areas of the corresponding XICs at defined retention times (tR)) revealed 42 differentially (p ≤ 0.01) glycated peptides representing nine plasma proteins (Figure 3 and Figure S3). Of these, the biomarker properties of 10 peptides had already been proposed earlier [27,30], whereas for 32 species, a statistically significant T2DM-related abundance increase was observed here for the first time. The levels of glycation at 25 HSA lysyl residues (represented by 27 tryptic peptides and comprising 43% of the 58 lysines present in the protein sequence) were significantly increased in the plasma of T2DM patients in comparison to the healthy controls. In contrast, only six modification sites were up-regulated in serotransferrin (six tryptic peptides, 10.3% of all available lysines), whereas each of the other proteins was represented by only two sites (α-2-macroglobulin and Ig kappa chain C region) or one site demonstrating enhanced glycation in T2DM plasma (Table 1). Most of the annotated prospective biomarkers could distinguish T2DM patients with sensitivities of 85% to 100% and accuracies exceeding 80% (Table 1). Thus, each of the differentially glycated proteins was represented by at least one marker peptide with a biomarker sensitivity typically higher than 85% (with one exception for the apolipoprotein A-derived peptide, LAEYHAKATEHLSTLSEK, modified at lysine 219, Table 1). Interestingly, the τ1/2 values of these proteins completely covered a time span of three weeks (Table 2).
Integration of selected peptide signals in the quality control (QC) samples revealed a high intra- and inter-day precision of analyte retention times (tR) and abundances. Thus, the RSDs for the intra- and inter-day precision of tR typically did not exceed 0.54% and 0.94%, whereas for peak areas, they were not higher than 11.0% and 16.46%, respectively (Table S2).
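For readers unfamiliar with the rank-based comparison used here, the following minimal sketch computes the Mann-Whitney U statistic for two groups of peak areas. It treats the groups as independent samples, uses the normal approximation without tie or continuity correction, and runs on invented peak areas; the authors' own calculation was done in MS Excel, so this is an illustration rather than their implementation.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p) using the normal approximation.

    No tie correction and no continuity correction are applied, so the
    p-value is only approximate for small groups or heavily tied data.
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical XIC peak areas (arbitrary units) for one glycated peptide
t2dm = [5200, 6100, 4800, 7300, 6900, 5600]
controls = [2100, 2600, 1900, 3100, 2400, 2800]
u, p = mann_whitney_u(t2dm, controls)
print(u, p <= 0.01)
```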

Sequence Assessment of Differentially Abundant Glycated Peptides
All features demonstrating significantly higher abundances in patients with diabetes were identified by tandem mass spectrometry (MS/MS, Table S3). The MS/MS spectra of 18 glycated peptides could be acquired in data-dependent acquisition (DDA) experiments with tryptic digests obtained from pooled T2DM plasma (n = 3). The fragmentation of the other 24 species was achieved in targeted LC-MS/MS experiments. The sequences of the annotated peptides were confirmed by a search against the UniProt human database (downloaded 7 June 2016) and verified by manual interpretation of the MS/MS spectra, as exemplified in Figure 3 (for the spectral information, see Figure S4). The sequences of six peptides (3, 4, 32, 34, 37, 40) could not be identified by the Mascot search engine and were assigned by manual interpretation (the complete spectral information is presented in Figure S5).

For all annotated peptides, manual interpretation revealed a good coverage of the peptide sequences by intense b- and y-fragment ions (Figure 3), whereas their glycated status could, in the majority of cases, be confirmed by the patterns of water and formaldehyde losses from the quasi-molecular ions, which are characteristic for glucose-derived Amadori compounds [40,41]. Additionally, the presence of a sugar moiety at the corresponding lysyl residues was confirmed by Amadori-related pyrylium fragment ions.
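The neutral-loss pattern mentioned above can be turned into expected m/z values for a given precursor. The sketch below assumes the loss series commonly reported for glucose-derived Amadori peptides (one to three water molecules, and three water molecules plus formaldehyde); the text itself only states that water and formaldehyde losses were observed, so the exact series, the function name, and the example precursor are assumptions.

```python
WATER = 18.010565         # Da, monoisotopic H2O
FORMALDEHYDE = 30.010565  # Da, monoisotopic HCHO

# Assumed loss series; the source only mentions "water and formaldehyde losses"
NEUTRAL_LOSSES = {
    "-H2O": WATER,
    "-2H2O": 2 * WATER,
    "-3H2O": 3 * WATER,
    "-3H2O-HCHO": 3 * WATER + FORMALDEHYDE,
}


def neutral_loss_mzs(precursor_mz, z):
    """Expected m/z of each neutral-loss ion; the charge z is retained."""
    return {label: precursor_mz - mass / z for label, mass in NEUTRAL_LOSSES.items()}


# Hypothetical doubly charged precursor at m/z 500.0
for label, mz in neutral_loss_mzs(500.0, 2).items():
    print(label, round(mz, 4))
```

Matching these values against the upper m/z region of an MS/MS spectrum is one simple way to flag a spectrum as a candidate Amadori peptide before inspecting the b- and y-ion series.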

Linear Discriminant Analysis
First, the data were subjected to principal component analysis (PCA), and the relative influence of the individual extracted principal components (PCs) was addressed, retaining only the first nine components, which together account for 95% of the sample variance (see the corresponding variable loadings listed in Table S4). As can be seen from Figure 4, even the first two principal components provide decent class separation. Alternatively, another procedure for the reduction of dimensionality, i.e., variance inflation factor (VIF)-based factor filtering (following the workflow for the generation of the LDA model presented in the Materials and Methods part), was applied to the original data. With this procedure, diverse feature subsets could be generated depending on the order of the feature testing. To characterize the possible non-collinear feature combinations, the algorithm was run 1000 times for the VIF cutoff values of 5 and 10. On the basis of the obtained results (Tables S5 and S6), the most frequent combinations of factors were selected for further training with LDA.
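Because the exact filtering workflow is given in the Materials and Methods part, which is not reproduced here, the following is only one plausible reading of an order-dependent VIF filter: features are visited in random order, and a feature is kept if all VIFs of the kept set stay at or below the cutoff. The VIFs are taken as the diagonal of the inverse correlation matrix. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def correlation_matrix(columns):
    """Pearson correlation matrix; columns must be non-constant, equal length."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = sum(d * d for d in dev) ** 0.5
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF of each feature = diagonal of the inverse correlation matrix."""
    if len(columns) == 1:
        return [1.0]
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


def vif_filter(features, cutoff, rng):
    """Visit features in random order; keep one if all VIFs stay <= cutoff."""
    names = list(features)
    rng.shuffle(names)
    kept = []
    for name in names:
        trial = kept + [name]
        try:
            acceptable = max(vifs([features[k] for k in trial])) <= cutoff
        except ValueError:
            acceptable = False
        if acceptable:
            kept = trial
    return sorted(kept)


def frequent_subsets(features, cutoff, runs=1000, seed=0):
    """Count how often each feature subset results from repeated random orders."""
    rng = random.Random(seed)
    return Counter(tuple(vif_filter(features, cutoff, rng)) for _ in range(runs))


# Toy data: "a" and "b" are nearly collinear, "c" is only moderately correlated
toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "c": [3, 1, 4, 1, 5, 9],
}
print(frequent_subsets(toy, cutoff=5, runs=1000).most_common())
```

With the toy data, every run keeps "c" together with whichever of "a" or "b" was tested first, which illustrates why the resulting subset depends on the order of testing and why repeated runs are needed to find the most frequent combinations.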
The results of the filtering procedure, based on the VIFs, are summarized in Tables S5 and S6, where individual variables are labeled as in Table 1. For further processing, one of the most highly represented sets of variables with VIF = 10 and two sets represented in 70% of all optimization runs with VIF = 5 were selected. The training and prediction accuracy was validated for the selected sets of variables in the leave-one-out cross-validation (LOOCV) and sub-sampling LDA parameterization modes (Figure 5). The resulting values for the accuracy, specificity, and sensitivity are summarized in Table 3.
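The LOOCV scheme and the three performance figures can be illustrated with a deliberately simplified classifier. The sketch below uses a univariate LDA with equal priors and a pooled variance, which reduces to assigning a sample to the nearest class mean; the authors' model is multivariate, and the toy values and function names are hypothetical. The last lines show counts that would be consistent with the reported percentages for 20 patients and 18 controls; these counts are an inference and are not stated in the text.

```python
def mean(values):
    return sum(values) / len(values)


def loocv_lda_1d(values, labels):
    """Leave-one-out cross-validation of a nearest-class-mean (1D LDA) rule.

    labels: 1 for T2DM, 0 for control. Returns (tp, fn, tn, fp).
    """
    tp = fn = tn = fp = 0
    samples = list(zip(values, labels))
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # hold out sample i
        mean_pos = mean([v for v, l in train if l == 1])
        mean_neg = mean([v for v, l in train if l == 0])
        predicted = 1 if abs(x - mean_pos) < abs(x - mean_neg) else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def metrics(tp, fn, tn, fp):
    """Return (accuracy, specificity, sensitivity)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, specificity, sensitivity


# Hypothetical discriminant scores: four controls followed by four patients
scores = [1.0, 1.2, 0.9, 1.1, 2.0, 2.3, 1.05, 2.1]
classes = [0, 0, 0, 0, 1, 1, 1, 1]
print(metrics(*loocv_lda_1d(scores, classes)))

# Inferred, not reported: 17 of 20 patients and 18 of 18 controls classified
# correctly would give about 92% accuracy, 100% specificity, 85% sensitivity
print([round(m, 2) for m in metrics(tp=17, fn=3, tn=18, fp=0)])
```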

Figure 5. Values of the linear discriminant analysis (LDA) response variable plotted for T2DM (open diamonds) and control (closed circles) individuals, calculated using selected predictive variable sets (as summarized in Table 3). (A) Estimated via leave-one-out cross-validation; (B) estimated from original samples, with LDA trained on generated samples.

Experimental Setup
Protein glycation is one of the most universal markers of diabetes mellitus. It is recognized as an indirect, but adequate marker of blood glucose contents over a certain period of time [42]. Indeed, as blood sugar concentrations fluctuate over a wide range, depending on circadian rhythm, food intake, medication, and other factors [43], the determination of actual blood glucose levels usually does not deliver clinically valuable information. In this context, HbA1c is commonly used both in the diagnostics of T2DM and in the long-term therapy control of the disease [42], whereas the determination of short-term markers, like glycated albumin and the total plasma fructosamine fraction [44,45], is still much less common in clinical practice. Thus, persisting hyperglycemia is typically confirmed only over the last three months prior to determination, whereas little or no information about the dynamics of glycation (and, hence, blood glucose levels) is available.
Therefore, the monitoring of several glycated proteins that differ in their half-life times would be a promising alternative to the traditional approach to glycemic control. However, as the glycation rates at individual protein lysyl residues vary substantially [23], the total glycation level of each protein would be less sensitive to sugar fluctuations than the modification levels at individual residues. To address this concept in statistically representative cohorts, we analyzed the relative quantities of the corresponding glycated species in a group (n = 20) of T2DM patients matched by age and gender to healthy controls. This cohort size was a good starting point for this pilot study, but because of the high number of potential marker peptides, we employed an α-level of 0.01 (i.e., lower than usual) and implemented a family-wise error rate control procedure [46], thus ensuring that the sample sizes used in our study were sufficient to support all related inferences. As for PCA/LDA, dimensionality reduction techniques are well suited to a high ratio of parameters to samples, and according to Sharma et al. [47] and references therein, our case (approximately 40 dimensions to 40 samples) is several orders of magnitude below the typical thresholds above which small-sample-size problems should be specifically addressed.
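The text does not name the family-wise error rate procedure from reference [46]. As an illustration only, the sketch below shows Holm's step-down method, one widely used procedure of this kind, applied at the α-level of 0.01 mentioned above; the choice of Holm's method, the function name, and the p-values are assumptions.

```python
def holm_reject(p_values, alpha=0.01):
    """Holm step-down procedure; returns one rejection flag per hypothesis.

    The smallest p-value is compared with alpha / m, the next with
    alpha / (m - 1), and so on; testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break
    return reject


# Hypothetical p-values for four candidate peptides
print(holm_reject([0.02, 0.001, 0.004, 0.03], alpha=0.01))
# [False, True, False, False]
```

In the example, 0.004 would pass an uncorrected threshold of 0.01 but fails the Holm threshold of 0.01 / 3, which shows how such a correction tightens the criterion when many peptides are tested at once.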
To access individual glycation sites, we applied an experimental setup relying on the bottom-up proteomic approach and LC-MS-based label-free quantification (Figure 2). As we have shown previously, the abundances of such Amadori peptides (i.e., the areas of corresponding peaks in appropriate XICs) can be used to quantify corresponding glycation sites in bovine serum albumin [25]. Moreover, testing of this approach in small cohorts by nano-scaled reversed-phase high performance liquid chromatography-electrospray ionization-tandem mass-spectrometry (nanoRP-HPLC-ESI-MS/MS) DDA experiments revealed the differential abundance of some glycation sites in T2DM patients and normoglycemic individuals, although the poor precision of the quantification restricted the reliability of the results' interpretation [27].
Besides employing larger cohorts, this limitation can generally be overcome by using a reliable single-step chromatography system (i.e., skipping the trapping procedure) as part of a targeted or untargeted quantitative platform. Although the targeted approach, recently proposed by Spiller and co-workers [30], provides sufficient precision [29], it addresses only a limited number of potential marker sites. Moreover, the selection of the marker peptides addressed in the multiple reaction monitoring (MRM) triple quadrupole (QqQ) method proposed by these authors was based on our pilot study, which was performed in small cohorts of only five probands each [27]. Such a setup could certainly miss many glycation sites with biomarker properties.
Therefore, here, we decided on an unbiased profiling approach based on high-resolution mass spectrometry (HR-MS) coupled on-line to microbore reversed-phase high performance liquid chromatography (RP-HPLC), with subsequent annotation of previously identified glycated peptides by the m/z of the corresponding quasi-molecular ions. For this, we relied on the best available database of in vivo plasma glycation sites, built by Metz and co-workers on the basis of multi-dimensional in-depth plasma proteomics analysis [26]. Although, in the absence of enrichment and pre-fractionation steps, our straightforward approach (comprising boronic acid affinity chromatography (BAC) and RP-HPLC steps) yielded only 51 of the most abundant peptides, representing 46 modification sites among the 2205 identified by nano-LC-MS/MS-based proteomics [26], their quantification was accurate and highly precise. Indeed, our method provided a 14- and 7-fold improvement of the intra- and inter-day precision for the peak area, respectively, in comparison to the values obtained by Frolov et al. (2014) [27] (Table S2). Moreover, the inter-day precision values were even slightly better than those generated with QqQ-MS instrumentation [29], with a substantially higher number of considered peptides. This improved precision positively affected the significance of the differences between the T2DM and control groups (Table 1, Figure 3 and Figure S3), with a simultaneous reduction of the analysis time and costs in comparison to the conventional LC-based proteomic approach.
In this context, the analytical workflow (Figure 2) was reduced to a combination of three principal steps: enrichment of glycated peptides by BAC, desalting of the samples by solid-phase extraction (SPE), and subsequent reversed-phase chromatography (RPC) separation [24,48]. In our recent work, we showed that the SPE step is mandatory for an acceptable recovery of Amadori peptides from the reversed phase [29]. Therefore, it was included in the proposed workflow.

Biomarker Potential of Glycated Peptides
Although the size of the cohorts used in our previous study was insufficient for reliable conclusions, five individuals per comparison group was a good starting point to establish new biomarkers based on specific glycation sites in plasma proteins [27]. The increase of the cohort size in combination with an untargeted LC-MS approach resulted in the identification of a higher number of significantly up-regulated individual glycation sites with higher confidence (Table 1) in comparison to the published data, obtained with both spectral counting [49] and integration of specific signals in individual XICs [27]. Thus, our setup resulted in a higher biomarker potential of the plasma protein glycation sites considered in this context earlier [27][28][29] (Figure 6), and yielded new prospective Amadori peptides, whose biomarker behavior has not been reported yet. As all these peptides belonged to proteins with varying half-life times, we succeeded in addressing the time scale of glycemic control (Table 2). Indeed, relatively long-living proteins (HSA and the immunoglobulin (Ig) kappa chain C region with τ1/2 of up to three weeks [39]) are exposed to plasma glucose for times much longer than those relevant for α-2-macroglobulin and complement C4-A protein (τ1/2 of 1 to 3 days [33,34]). Hence, the glycation sites representing the latter proteins would deliver valuable information about blood sugar levels over the days directly preceding the analysis. Such data would directly show if short-term fluctuations of plasma glucose levels do occur. Our results indicate a significantly lower degree of such fluctuations in the plasma of healthy individuals in comparison to T2DM patients (Table 1, Figures 6 and S4). Thus, in the future, the corresponding tryptic peptides might be addressed as possible markers of impaired glucose tolerance.
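The relation between protein half-life and the time window reported by a glycation site can be illustrated with a short calculation. The sketch below assumes simple first-order (exponential) clearance of plasma proteins, which is a modelling assumption not stated in the text; the function names are hypothetical, and the half-lives of 21 and 2 days are taken as representative of the long- and short-lived proteins discussed above.

```python
from math import log

def fraction_remaining(age_days, half_life_days):
    """Fraction of a protein cohort still circulating after age_days,
    assuming first-order (exponential) clearance."""
    return 0.5 ** (age_days / half_life_days)

def mean_lifetime(half_life_days):
    """Mean residence time of a molecule under first-order clearance."""
    return half_life_days / log(2)

# Representative half-lives: long-lived (e.g., HSA-like) vs. short-lived proteins.
for name, t_half in [("long-lived, 21 d", 21.0), ("short-lived, 2 d", 2.0)]:
    print(name,
          "mean lifetime (d):", round(mean_lifetime(t_half), 1),
          "fraction older than 7 d:", round(fraction_remaining(7, t_half), 3))
```

Under this assumption, most molecules of a short-lived protein sampled today were synthesized within the preceding few days, whereas a long-lived protein integrates glucose exposure over several weeks.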

Specific Set of Glycation Sites as an Integrated Biomarker of T2DM
As the proteins listed in Table 2 cover a large range of half-lives, consideration of the interplay between representative glycation sites might provide a generalized T2DM marker, addressing a dynamic aspect of glycemic control. In this context, we tried to identify a set of multiple glycated sites in several marker proteins with various τ1/2. This set is expected to provide sensitivity sufficient for the prediction of T2DM and efficient glycemic control, i.e., provide separation of groups different from existing ones, e.g., glycated hemoglobin (Figure 7). However, a biomarker based on multiple glycated proteins would give a higher degree of flexibility in the depth of glycemic control.
A basic T2DM marker, relying on two glycation sites, was proposed recently. Using a decision tree algorithm, Spiller and co-workers successfully combined quantitative information on a selected haptoglobin glycation site (K141) with a well-established T2DM marker, HbA1c [30]. Unfortunately, due to its "step-by-step" decision tree design, this methodology is generally not applicable to the assessment of multiple factor contributions. Besides, when large cohorts are used, uncertainties related to overfitting and multiple testing need to be considered [49], i.e., a proper inspection of all resulting decision trees becomes critical for the accuracy of disease prediction. Indeed, as this calculation strategy does not assess factors simultaneously, the risk that the chosen set of classification steps and cut-off procedures "learns" some specific features of the training set is relatively high. Therefore, to avoid a potential bias associated with the application of the decision tree design to multiple features, we decided on LDA as the main classification algorithm. The corresponding output variable represented a simple weighted sum of the analyzed features, which could successfully meet the requirements for the establishment of a time dimension in glycaemic control. Thereby, the procedures for selection of the model parameters aimed at the removal of highly correlated sites (duplicates or linear combinations) in the feature space (VIF-based filtering) or at the transformation of the feature space itself (PCA). We believe that the application of several complementary feature space transformation and randomization techniques (i.e., PCA, VIF-based filtering, random sub-sampling of the training set, and leave-one-out cross-validation) allowed us to exclude any experimental bias and ensure a high reliability of the prediction model set.
Generally, all model generation and validation procedures resulted in similar metric values (Table 3). It can be noted that the models generated with a VIF cutoff of 5 demonstrated a clearly better performance (sensitivity, accuracy, and specificity of 80%, 90%, and 100%, respectively) in comparison to the models with a VIF cutoff of 10, which contained higher numbers of features. The performance of the PCA-based model was intermediate, possibly due to the averaging of the specifics of individual features upon calculation of the principal component (PC) variable values. It is important to stress that the diagnostic results presented in Table 3 are conservative, i.e., bias-free [49]. Also, as the sub-sampling training sets discard the information about feature cross-correlations, they might be less informative than the ones used for leave-one-out cross-validation, thus providing less accurate predictions.
Also, in the future, for the collection of blood samples, it is necessary to keep in mind not only the therapy, but also the diet of patients. Thus, dietary polyphenols, which influence blood glucose at different levels, may also help to control and prevent diabetes complications by decreasing hyperglycemia and improving acute insulin secretion and insulin sensitivity [50].

Reagents
Unless stated otherwise, materials were obtained from the following manufacturers. AMRESCO LLC (

Setup of Experimental Cohorts
The T2DM patient (n = 20) and normoglycemic control (n = 18) groups comprised non-smoking female volunteers aged 45 to 75 years (63.4 ± 7.9 and 60.7 ± 4.7, respectively), who were not receiving hormone replacement therapy and had no clinically manifested diabetes complications (Tables 4 and 5). The control individuals had HbA1c levels not exceeding 6.5% and had neither diagnosed diabetes nor anti-hyperglycemic therapy in their medical history. All participants provided written informed consent. The study was approved on 02-03-2015 by the Local Ethical Committee of the Federal Almazov North-West Medical Research Centre, Saint-Petersburg, Russian Federation, and was performed in agreement with the Declaration of Helsinki.

Blood Sampling and Plasma Isolation
The blood samples (approximately 10 mL each) were collected in polypropylene tubes coated with ethylenediaminetetraacetic acid (Becton Dickinson, Franklin Lakes, NJ, USA). Plasma was separated by centrifugation (1200× g, 15 min, 4 °C) and transferred to 1.5 mL polypropylene tubes. The total plasma protein contents were determined by Bradford assay in a 96-well microtiter plate format as described by Greifenhagen and co-workers [51]. The precision of protein determination was verified by SDS-PAGE according to an established protocol [11]. Average densities across individual lanes (expressed in arbitrary units) were determined by a ChemiDoc XRS imaging system controlled by Quantity One® 1-D analysis software (Bio-Rad Laboratories Ltd., Moscow, Russia). Thereby, for inter-gel normalization, the first and the last plasma protein samples loaded on each gel were replicated in the previous and following gels, respectively (Figure S1). For the calculation of RSDs, the densities of individual lanes were normalized to the gel average value. Individual plasma samples were split into aliquots of 20 µL and stored at −80 °C. Alternatively, 30 µL of each T2DM sample were combined to obtain a representative pool of diabetic material. The plasma contents of human serum albumin were determined colorimetrically at 628 nm after a color reaction with bromocresol purple (Clinical Chemistry Analyzer CA90, Furuno Electric Co. LTD, Nishinomiya, Japan) using an Albumin kit (Aptec Diagnostics nv) and the control serum kits, Precinorm U and Precipath U (Roche Diagnostics). The plasma HbA1c level was measured using a commercial kit (Bio-Rad, Hercules, CA, USA) on a Bio-Rad D-10 hemoglobin analyzer (Bio-Rad Laboratories Inc., Hercules, CA, USA).

Tryptic Digestion
The plasma proteins were digested according to Frolov and co-workers [27] with slight modifications. Briefly, aliquots of plasma containing 150 µg of protein were diluted with 100 mmol/L ammonium bicarbonate buffer (pH 8.0), complemented with 10 µL of SDS (0.5% in water, w/v) and 10 µL of 50 mmol/L TCEP in 100 mmol/L ammonium bicarbonate buffer, and incubated for 30 min at 37 °C under continuous shaking (450 rpm). Afterwards, the samples were cooled to room temperature (RT), 11 µL of 100 mmol/L iodoacetamide in 100 mmol/L ammonium bicarbonate buffer were added, and alkylation of free sulfhydryls was performed for 15 min in darkness at RT. After completion of the incubation, the proteins were sequentially digested at 37 °C with trypsin (25 µg/mL in 100 mmol/L ammonium bicarbonate buffer) at enzyme-to-protein ratios of 1:20 and 1:40 (w/w) for 5 and 12 h, respectively, under continuous shaking (450 rpm).
The completeness of the digest was verified by SDS-PAGE as described by Schmidt and co-workers [52] with modifications. Briefly, aliquots of digested samples containing 5 µg of protein were diluted at least 2-fold with sample buffer (65.8 mmol/L Tris-HCl, pH 6.8, 20% (v/v) glycerol, 2% SDS, 10% (v/v) β-mercaptoethanol, 0.05% (v/v) bromophenol blue) and heated at 95 °C for 5 min. Afterwards, the samples were separated on a polyacrylamide gel (T = 12.00%, C = 2.65%) and stained with colloidal Coomassie Brilliant Blue G 250 dye. The digests were frozen and stored at −80 °C before further analysis.

Boronic Acid Affinity Chromatography
Enrichment of glycated peptides was performed according to Soboleva and co-workers [28] with slight modifications. In detail, the pH of tryptic digests was adjusted to 8.0 with 25% (v/v) ammonium hydroxide using indicator paper (Lachema, Brno, Czech Republic), before 400 µL of ice-cold (4 °C) loading buffer (250 mmol/L ammonium acetate, 50 mmol/L magnesium acetate, pH 8.1) were added. The samples were loaded on 1 mL polypropylene gravity flow columns packed with m-aminophenylboronic acid (mAPBA) agarose, and unbound peptides were washed out with 12 mL of ice-cold (4 °C) loading buffer. Afterwards, glycated peptides were sequentially eluted with 0.1 and 0.2 mol/L warm (37 °C) acetic acid (8 and 2 mL, respectively). The eluates were combined and loaded on Oasis HLB SPE cartridges, installed on the VacElut 12 Manifold (Agilent Technologies, Moscow, Russia), pre-conditioned with 1 mL of methanol, and pre-equilibrated with 2 mL of 0.1% (v/v) aqueous (aq.) formic acid. After a wash with 2 mL of 0.1% (v/v) formic acid, peptides were eluted in a step gradient of 40%, 60%, and 80% acetonitrile (0.33 mL each), as described by Spiller et al. [29]. The SPE eluates were combined, dried under vacuum by a CentriVap Vacuum Concentrator (Labconco, Kansas City, MO, USA), and stored at −20 °C for further analysis.

LC-MS Analysis
The dried eluates were reconstituted in 40 µL of 3% (v/v) acetonitrile in aq. 0.1% (v/v) formic acid, and 8 µL of the obtained solutions were loaded on a ZORBAX SB column (C18, ID 0.3 mm, length 150 mm, particle size 3.5 µm, Agilent Technologies, Moscow, Russia) using an Agilent 1200 Compact liquid chromatograph equipped with an Agilent 1200 Infinity autosampler and Agilent 1260 Infinity capillary pump (Agilent Technologies, Moscow, Russia). The eluents, A and B, were 4% and 90% acetonitrile, respectively, both containing 0.1% (v/v) formic acid. After a 5-min isocratic step (0% eluent B), glycated peptides were eluted at a flow rate of 5 µL/min at 25 °C in sequential linear gradients to 45% and to 100% eluent B in 30 and 2 min, respectively. The column effluents were introduced on-line into an Agilent 6538 Ultra High Definition Accurate-Mass Q-TOF quadrupole-time of flight (QqTOF) mass spectrometer via a dual electrospray ionization (ESI) source (Agilent Technologies, Moscow, Russia). The instrument was operated in the positive ion mode under the settings summarized in Table S7 and controlled by MassHunter Workstation software (Agilent Technologies, Moscow, Russia). Analyte annotation and label-free quantification relied on TOF-MS scans acquired in the mass range of 400-2000 m/z. Thereby, a pooled enriched tryptic digest, obtained from T2DM patients, was used as an external QC, injected after every eighth sample. Prospective Amadori-modified peptides were annotated by retention times (tR) and exact m/z values (mass accuracy better than 3 ppm). Relative abundances of individual prospectively glycated peptides were calculated by integration of characteristic extracted ion chromatograms (XICs, m/z ± 0.02) built for the annotated m/z values at specific tR values (quantitative analysis tool of the MassHunter Workstation software).
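The two numeric criteria used for annotation and quantification, the 3 ppm mass accuracy limit and the ±0.02 m/z extraction window, can be sketched as follows. This is only an illustration of the arithmetic: the function names, the scan data layout, and the trapezoidal integration are assumptions, and the sketch does not reproduce the peak integration implemented in the MassHunter software.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def within_tolerance(observed_mz, theoretical_mz, max_ppm=3.0):
    """True if the observed m/z matches the annotated one within max_ppm."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= max_ppm

def xic_area(scans, target_mz, rt_window, mz_halfwidth=0.02):
    """Integrate an extracted ion chromatogram.

    scans: list of (retention_time_min, [(mz, intensity), ...]) tuples
           (hypothetical layout, sorted by retention time).
    Intensities inside target_mz +/- mz_halfwidth are summed per scan and
    integrated over rt_window with the trapezoidal rule.
    """
    trace = []
    for rt, peaks in scans:
        if rt_window[0] <= rt <= rt_window[1]:
            intensity = sum(i for mz, i in peaks
                            if abs(mz - target_mz) <= mz_halfwidth)
            trace.append((rt, intensity))
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(trace, trace[1:]):
        area += 0.5 * (i0 + i1) * (rt1 - rt0)
    return area

# Invented example values, for illustration only.
example_scans = [
    (10.0, [(700.350, 0.0), (650.000, 500.0)]),
    (10.1, [(700.351, 100.0)]),
    (10.2, [(700.349, 200.0)]),
    (10.3, [(700.350, 100.0)]),
    (10.4, [(700.400, 999.0)]),  # outside the +/- 0.02 window, ignored
]
print(within_tolerance(700.3514, 700.3500))
print(xic_area(example_scans, 700.350, (10.0, 10.4)))
```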

MS/MS Analysis
The sequences and glycation status of differentially abundant glycated peptides were confirmed by MS/MS using a combination of DDA and targeted MS/MS experiments using an LTQ-Orbitrap Velos Pro mass spectrometer. For this, a digest of pooled T2DM plasma (1.5 µg) was loaded on the Acclaim® PepMap100 pre-column (C18-phase, ID 75 µm, length 2 cm, particle size 3 µm), and separated on the EASY-Spray ES803 C18 column (500 × 0.075 mm, 2 µm particle size, 40 °C) using an EASY-nLC 1000 nano liquid chromatography system controlled by the Xcalibur 2.1.0 software. Eluents A and B were water and acetonitrile, respectively, both containing 0.1% (v/v) formic acid. The analytes were eluted at the flow rate of 0.3 µL/min as follows: 5% B (0-15 min), 5% to 40% B    Table S8. Tandem mass spectra were searched against a FASTA file containing sequences of the proteins annotated by QqTOF-MS using the Mascot search engine within Proteome Discoverer 1.4 software (Thermo-Fisher Scientific, Bremen, Germany) with the following settings: peptide tolerance, 7 ppm; MS/MS match tolerance, 0.8 Da; 3 missed cleavage sites per peptide. Glycated peptides were annotated by their tR, m/z values, and isotopic patterns. The results were filtered with consideration of peptide confidence (medium), rank (one), false positive rate (0.05), and post-translational modifications: carbamidomethylation (C), oxidation (M), and glycation (K).

Statistical Analysis
The statistical significance of differences in the relative abundances of specific glycated peptides detected in the plasma protein digests obtained from T2DM patients and healthy individuals was determined by the Mann-Whitney U test [53]. Holm-Bonferroni correction was applied to adjust significance for multiple hypothesis testing [46]. Since the intensities observed for the signals representing individual charge states of the same peptides demonstrated high correlation, these values were averaged for further processing. To distinguish the T2DM patients from the age-matched controls without diabetes, a linear discriminant analysis (LDA) approach was applied as described by Venables and Ripley [54]. All LDA-related procedures were performed in R (MASS package); all other calculations were performed in MS Excel with the Real Statistics add-on (http://www.real-statistics.com).
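The two statistical steps named above can be sketched in a few lines. The example below is a minimal, standard-library illustration and not the Excel-based procedure used in the study: `mann_whitney_u` returns only the U statistic (a p-value would in practice come from exact tables or a statistics package such as `scipy.stats.mannwhitneyu`), and `holm_adjust` returns Holm-Bonferroni adjusted p-values. All function names and input values are hypothetical.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y (ties receive average ranks)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_x = len(x)
    return rank_sum_x - n_x * (n_x + 1) / 2.0

def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, k in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[k])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[k] = running_max
    return adjusted

# Invented peak areas and p-values, for illustration only.
print(mann_whitney_u([4.0, 5.0, 6.0], [1.0, 2.0, 3.0]))
print(holm_adjust([0.01, 0.04, 0.03]))
```

A peptide would then be reported as significantly up-regulated only if its adjusted p-value stays below the chosen significance level.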

Generation of the LDA Model
Because the initial correlation analysis of the feature space showed a high degree of multicollinearity between factors (see Supplementary Figure S6 for a factor cross-correlation plot), we employed two strategies for feature space optimization prior to LDA calculation: (i) PCA-based transformation of the feature space, and (ii) generation of an orthogonal set of original features by the removal of factors with high variance inflation factor (VIF) values [55]. The latter can be calculated for any variable in the set (Equation (1)) and is a numeric measure of the extent to which the variable in question can be predicted by a linear combination of the other variables. The most commonly used critical VIF values are 5 and 10, corresponding to multiple correlations of approximately 90% and 95%, respectively, between the tested variable and the remaining variables in the set.
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \qquad (1)$$
where $\mathrm{VIF}_i$ is the variance inflation factor for variable $i$, and $R_i^2$ is the coefficient of determination, i.e., the proportion of the variance in variable $i$ that is predictable from the other independent variables.
The application of PCA was straightforward, and only the PCs responsible for 95% of the original sample variance were selected. The corresponding values were estimated for each sample, and the resulting reduced feature set was subjected to further processing with LDA. The VIF-based filtering relied on an iterative procedure, employing a random subset of two features (a test subset), defined at the beginning of each calculation run. Afterwards, further features were randomly selected from the original set and sequentially included in the test subset in each new analysis run. Then, individual features with VIF values above a selected cutoff (set as described above [43]) were iteratively removed from the test subset in the order of decreasing values. This procedure was repeated until all initial features had been tested.
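The correspondence between the VIF cutoffs and the multiple correlation values quoted above can be checked numerically. The sketch below assumes the standard definition VIF = 1/(1 − R²); the function names are hypothetical.

```python
from math import sqrt

def vif_from_r2(r_squared):
    """Variance inflation factor from the coefficient of determination R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)

def multiple_r_from_vif(vif):
    """Multiple correlation coefficient R that corresponds to a given VIF."""
    if vif < 1.0:
        raise ValueError("VIF cannot be smaller than 1")
    return sqrt(1.0 - 1.0 / vif)

for cutoff in (5.0, 10.0):
    print("VIF cutoff", cutoff, "-> multiple R =", round(multiple_r_from_vif(cutoff), 3))
```

A cutoff of 5 corresponds to R² = 0.80 (R of about 0.89), and a cutoff of 10 to R² = 0.90 (R of about 0.95), in line with the approximate 90% and 95% values given in the text.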

Validation of the LDA Model
To assess the model performance, two complementary approaches were employed. In the first one, accuracy, sensitivity, and specificity were determined for the established set of diagnostic peptides. To avoid the overfitting problem [56], parameterization of the LDA model was performed in silico with a set of 400 calculated sample patterns (200 for each cohort), obtained by sub-sampling the feature distributions of the original samples as described by Politis and co-authors [57]. Thereby, all selected training samples shared less than 25% of the factor values with any of the original (verification) vectors. The second approach relied on the leave-one-out cross-validation procedure [58]. It employed a set of iterations, where at each run one sample from the original sample set was taken for verification, whereas the other samples were used for parameterization of the LDA model. The performance characteristics (i.e., accuracy, sensitivity, and specificity) were calculated from the set of individual iteration outcomes as described elsewhere [59,60]. For the PCA-generated feature subset, only the leave-one-out cross-validation procedure was performed.
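The leave-one-out procedure and the derived performance metrics can be illustrated with a deliberately simplified example. The sketch below uses a single feature and a two-class LDA with pooled variance and class-share priors, written with the standard library only; it is not the multi-feature MASS-based model of this study, and the function names and toy data are invented for illustration.

```python
from math import log

def fit_lda_1d(values, labels):
    """Fit a one-feature, two-class LDA: class means, priors, pooled variance."""
    n = len(values)
    params, within_ss = {}, 0.0
    for cls in (0, 1):
        xs = [v for v, lab in zip(values, labels) if lab == cls]
        mean = sum(xs) / len(xs)
        within_ss += sum((v - mean) ** 2 for v in xs)
        params[cls] = (mean, len(xs) / n)  # (class mean, prior from class share)
    return params, within_ss / (n - 2)     # pooled within-class variance

def predict_lda_1d(model, x):
    """Assign x to the class with the larger linear discriminant score."""
    params, var = model
    def score(cls):
        mean, prior = params[cls]
        return x * mean / var - mean ** 2 / (2.0 * var) + log(prior)
    return 1 if score(1) > score(0) else 0

def loocv_metrics(values, labels):
    """Leave-one-out cross-validation; label 1 = patient, label 0 = control."""
    tp = tn = fp = fn = 0
    for i in range(len(values)):
        model = fit_lda_1d(values[:i] + values[i + 1:], labels[:i] + labels[i + 1:])
        predicted, actual = predict_lda_1d(model, values[i]), labels[i]
        if actual == 1:
            tp, fn = tp + (predicted == 1), fn + (predicted == 0)
        else:
            tn, fp = tn + (predicted == 0), fp + (predicted == 1)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "counts": (tp, fn, tn, fp),
    }

# Invented relative abundances: five controls, five patients (one control-like).
toy_values = [1.0, 1.2, 0.9, 1.1, 1.3, 3.0, 3.4, 2.9, 3.2, 1.0]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(loocv_metrics(toy_values, toy_labels))
```

Because each sample is classified by a model that never saw it, the resulting metrics are less prone to optimistic bias than metrics computed on the training data themselves.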

Conclusions
T2DM is one of the most widely spread metabolic disorders. Typically, the first stages of the disease are slow and are not accompanied by any clinically manifested symptoms. For this reason, T2DM is most often discovered at the stage of complications, which makes therapy less efficient and more expensive. Unfortunately, HbA1c, a recognized T2DM marker, delivers information about changes in glycaemic status over three months and, hence, is insensitive to short-term glucose excursions preceding the disease. Thus, control of glycemic status over various, especially short, periods of time might increase the rates of early T2DM discovery. In this context, our approach might bring the desired "time dimension" to glycemic control. First, the integrated biomarker proposed here not only covers the three weeks before blood sampling, but also indicates a continuous character of the glycation process throughout this period. Second, although our integrated biomarker relies on multiple glycation sites, all required information is acquired in one experiment, which is advantageous in comparison to approaches employing several tests. Finally, the proposed marker has potential for further optimization. Thus, it can be "tuned" for shorter times by excluding relatively long-living proteins. On the other hand, implementation of immunoaffinity depletion might substantially extend the pattern of marker peptides and, hence, the selection of marker proteins and the reliability of the marker.