Plasma cytokines for predicting diabetic retinopathy among type 2 diabetic patients via machine learning algorithms

Aims: This study aimed to investigate changes of plasma cytokines and to develop machine learning classifiers for predicting non-proliferative diabetic retinopathy among type 2 diabetes mellitus patients. Results: There were 12 plasma cytokines significantly higher in the non-proliferative diabetic retinopathy group in the pilot cohort. The validation cohort showed that angiopoietin 1, platelet-derived growth factor-BB, tissue inhibitors of metalloproteinase 2 and vascular endothelial growth factor receptor 2 were significantly higher in the NPDR group. Machine learning algorithms using the random forest yielded the best performance, with sensitivity of 92.3%, specificity of 75%, PPV of 82.8%, NPV of 88.2% and area under the curve of 0.84. Conclusions: Plasma angiopoietin 1, platelet-derived growth factor-BB, and vascular endothelial growth factor receptor 2 were associated with presence of non-proliferative diabetic retinopathy and may be good biomarkers that play important roles in pathophysiology of diabetic retinopathy. Materials and Methods: In pilot cohort, 60 plasma cytokines were simultaneously measured. In validation cohort, angiopoietin 1, CXC-chemokine ligand 16, platelet-derived growth factor-BB, tissue inhibitors of metalloproteinase 1, tissue inhibitors of metalloproteinase 2, and vascular endothelial growth factor receptor 2 were validated using ELISA kits. Machine learning algorithms were developed to build a prediction model for non-proliferative diabetic retinopathy.


INTRODUCTION
Diabetic retinopathy (DR), one of the most prominent microvascular complications of diabetes mellitus (DM), is the leading cause of vision impairment and new-onset blindness in the working-age population and diabetes mellitus patients [1,2]. The increase in the global prevalence of diabetic eye diseases, comprising DR and diabetic macular edema (DME), is intimately connected to the soaring prevalence of DM [3][4][5]. It was reported that across China, the prevalence of DR and sight-threatening DR were 27.9% and 12.6% in diabetic patients, respectively [6].
For algorithm development, deep learning techniques have been used for automated detection of DR and DME, based on features in retinal fundus photographs and achieved robust performance [7][8][9][10]. Although image-based features of DR are well-known, knowledge about its protein phenotype are limited. It is accepted that angiogenesis and inflammation crosstalk are intrinsic components of DR [11,12]. Increasing evidence shows that, in retinal cells and tissues, various cytokines, including vascular endothelial growth factor (VEGF), matrix metalloproteinases (MMPs), and tissue inhibitors of metalloproteases (TIMPs), play essential roles in the progress of DR via angiogenic, inflammatory and fibrotic reactions [13][14][15][16][17]. Thus, cytokines play AGING important roles in the pathophysiology of DR. However, the associations between plasma cytokines and nonprogressive DR (NPDR) are unclear. This is the first study to investigate the associations between plasma cytokines and non-progressive DR (NPDR) and to build a prediction model for NPDR. In this study, we hypothesized that the pathological processes leading to NPDR caused characteristic changes in the concentrations of plasma proteins. We then investigated the characteristic changes in plasma cytokines, generating a detectable disease-specific protein phenotype, and finally developed machine learning classifiers for NPDR at the protein level.

Study subjects
For plasma protein profiling, 14 patients with NPDR and 14 patients with T2DM were selected as the pilot cohort. The mean ages of patients with NPDR or T2DM were 62.71 vs. 58.50 years, respectively, and the median durations of diabetes were 13.57 vs. 8.08 years, respectively. The proportion of hypertension was significantly higher in the NPDR group (78.6% vs. 28.6%, p = 0.023). For validation, 115 patients with NPDR and 115 patients with T2DM were selected as the validation cohort. The mean ages of patients with NPDR or T2DM were 60.40 vs. 58.63 years, respectively, and the median durations of diabetes were 8.69 vs. 6.92 years, respectively. In the same manner, the proportion of hypertension was significant higher in the NPDR group (60.9% vs. 47.0%, p = 0.047) ( Table 1).

Identification of predominant plasma cytokines in NPDR patients
We profiled plasma cytokines by using the human glassbased arrays and obtained semi-quantifiable results for 60 plasma cytokines. Compared with T2DM patients, the relative changes of the 60 cytokines were shown in Figure 1A. There were 27 cytokines significantly different between the two groups, among which the fold change of 12 plasma cytokines were larger than four ( Figure 1B). As shown in the volcano plot, the top 10 increased cytokines were PDGF-BB, leptin, ANG-1, TIMP-1, RANTES, TIMP-2, ENA-78, angiostatin, CXCL16, and VEGFR2, and the top 10 decreased cytokines were IL-10, ANGPTL4, bFGF, VEGFR3, HB-EGF, IL-12p40, IGF-1, IL-17, I-309, and LIF ( Figure  1C). Based on the top 10 increased cytokines, PCA was performed, showing a clear separation between the two groups (Supplementary Figure 1). These findings suggested that plasma cytokines may be helpful to distinguish NPDR from T2DM patients.

Correlation between cytokines and clinical characteristics
Pearson's correlation analysis was performed to investigate the potential relationships among cytokines and clinical characteristics. For NPDR, PDGF-BB was weakly correlated with diabetic duration (r = 0.34), and VEGFR2 was weakly correlated with total cholesterol (r = 0.33) and low-density lipoprotein (r = 0.30) (Supplementary Figure 2A). For T2DM patients, there was no obvious relationship between plasma cytokines and clinical characteristics (Supplementary Figure 2B).
The proportion of hypertension was significant higher in the NPDR group in both the pilot and validation cohorts. To further eliminate the interference of hypertension on the six plasma cytokines, we focused on comparing concentrations of the six plasma cytokines in patients with or without hypertension. Supplementary Figure 3 shows that there was no significant difference of the mean levels of ANG-1, CXCL16, PDGF-BB, TIMP-1, TIMP-2, and VEGFR2 in the NPDR and T2DM groups (322.83 vs. 315.24 ng/mL, 3,796.44 vs. 3,889.65 pg/mL, 32.17 vs. 30.74 pg/mL, 6.70 vs. 6.77 ng/mL, 107.13 vs 112.94 ng/mL, and 12.99 vs. 12.99 ng/mL, respectively). Thus, the higher concentration of these six cytokines in NPDR patients may have minimal association with hypertension in this study.

Feature selection for the machine learning algorithms
We then used PCA to compute the relative contributions of each cytokine to the separation among NPDR and T2DM patients. The first and second principal components of the PCA plot (Dim1 and Dim2) accounted for 36.0%, and 17.1% of the variations, respectively, in the dataset. The projections of samples in PCA were distinguished with relatively small over-
Finally, combined PCA, random forest, and lasso regression results for ANG-1, VEGFR2, and PDGF-BB were selected for a machine learning prediction model building.

Development and validation of machine learning classifiers
To select a high-performance classifier for prediction, we developed ANN, LR, SVM, XBG, and RF classifiers based on ANG-1, VEGFR2, and PDGF-BB. In 10-fold cross validation of the train set, the mean AUC of ANN, LR, SVM, XBG, and RF were 0.82, 0.83, 0.82, 0.82, and 0.85, respectively ( Figure 3). This finding revealed that all classifiers performed similarly and exhibited excellent performance in the training set.
For validation, the test set was used to evaluate the performance of machine learning classifiers.

DISCUSSION
Because of the main role that angiogenesis and inflammation have in the development and progression of NPDR, we hypothesized that angiogenesis-and inflammation-related cytokines in the plasma might be different in NPDR patients, and could be novel predictive biomarkers. To the best of our knowledge, this is the first large-scale study to determine specific plasma cytokines for the diagnosis of NPDR when compared with those in T2DM patients. In the pilot cohort with a small number of samples, cytokines antibody arrays were performed to identify 60 plasma cytokines. The results showed that 27 cytokines were increased in patients with NPDR, among which 12 cytokines were increased in the NPDR group (fold change > 4). In the larger-scale validation cohort, ELISA kits were used to validate six of the 12 plasma cytokines. Four out of six plasma cytokines, ANG-1, PDGF-BB, TIMP-2, and VEGFR2, were confirmed to be significantly higher in NPDR patients. These results suggested that plasma cytokines may be specifically involved in the development of NPDR.

AGING
The main goal of this study was to identify potential plasma biomarkers of patients with NPDR. Feature selection indicated that NPDR was highly associated with ANG-1, PDGF-BB, and VEGFR2, so these three cytokines were finally included in the machine learning algorithms. LR, ANN, SVM, RF, and XGB classifier confirmed that these three cytokines were highly discriminatory for NPDR in the independent test set, with the sensitivity ranging from 73.1% to 92.3%, with the specificity ranging from 65.0% to 75%, and with the AUC ranging 0.72 to 0.84. Among the five machine learning algorithms, RF classifier, with a sensitivity of 92.3% and the AUC of 0.84 in the test set, showed excellent discrimination of NPDR from T2DM patients.
Angiogenesis-and inflammation-related cytokines play vital roles in injuries of human retinal endothelial cells in culture. ANG-1, a member of the angiopoietins family, is a growth factor that plays a key role in vessel homeostasis, angiogenesis, and vascular permeability via interacting with the Tie2 transmembrane receptor tyrosine kinase [18][19][20]. Damage of the blood retinal barrier, which is induced by diabetes, is inhibited by Ang-1 in a dose-dependent manner [21]. PDGF-BB has been reported to be involved in astrogliosis and the formation of proliferative membranes in retinopathy by activating PDGFRα and PDGFRβ [22]. The upregulated combination of VEGF-A and VEGFR2 is a response to the ischemia induced by retinal vascular damage, and stimulates extraretinal vascular outgrowth to the retinal surface without amelioration of ischemia in the retina [23][24][25]. TIMP-2 is an endogenous inhibitor of matrix metalloproteinase-2 and may act as a protector to reduce the loss of capillary cells resulting in the development of diabetic retinopathy [26,27]. Although angiogenesisand inflammation-related cytokines are involved in the development and progression of DR, their changes in DR are unclear.
With the development of an algorithm, deep learning techniques have been used for automated detection of DR. Based on features in retinal fundus photographs,  Two previous studies reported that serum levels of ANG-1 were significantly higher in the NPDR group, when compared to the T2DM group [32,33]. Paine et al. also reported that the plasma levels of soluble VEGFR2 consistently increased with the severity of DR [34]. Consistent with these findings, in the present study, we showed that in NPDR, ANG-1, TIMP-2, VEGFR2, and PDGF-BB were significantly increased. The protective cytokines, ANG-1 and TIMP-2, were increased in the NPDR group. A possible explanation for this might be that the increases of ANG-1 and TIMP-2 may represent an adaptive compensatory mechanism to promote cellular repair, to suppress the development of retinal or choroidal neovascularization, and to strengthen the integrity of the vascular structure [35].

AGING
The plasma cytokine changes in NPDR patients have been controversial, and the correlations between plasma cytokines and clinical features were also unclear. Pearson's correlation indicated that for NPDR patients, PDGF-BB was weakly correlated with the duration of diabetes (r = 0.34), and VEGFR2 was weakly correlated with total cholesterol (r = 0.33) and low-density lipoprotein (r = 0.30). According to previous studies, diabetic duration, total cholesterol, and low-density lipoprotein were risk factors for diabetic retinopathy [6,36,37]. Whether PDGF-BB and VEGFR2 act independently or in concert with blood lipids during NPDR is still unclear, so further studies are needed.
The strengths of this study were as follows. It was the first study to include a large number of patients with comparable baselines. It contained a pilot study for screening of possible cytokines associated with NPDR and a large-scale cohort for ELISA verification Machine learning algorithms based on these plasma cytokines exhibited good performance for distinguishing DR from T2DM patients. However, we acknowledge several limitations in our study. First, the examination for DR was based on two-field fundus photographs, which are theoretically less sensitive than seven-field retinal photographs. However, the presence of mild DR would be underestimated only if the retinal pathologies were present in the peripheral area of the retina. Second, patients with coronary heart disease (CHD) were excluded in this study. The diagnosis of CHD, however, was based on the history of disease provided by the patients. A significant percentage of diabetic patients with coronary heart disease usually have no symptoms, so it may be inevitable that a few CHD patients were included in this study.
In summary, we report that plasma cytokines including ANG-1, PDGF-BB, TIMP-2, and VEGFR2 were increased, and that plasma cytokine patterns were comprehensive predicators of DR based on machine learning algorithms. Our results suggested that plasma cytokines could be strong risk markers of NPDR. AGING to further measure the concentrations of plasma angiopoietin 1 (ANG-1), CXC-chemokine ligand 16 (CXCL16), platelet-derived growth factor-BB (PDGF-BB), tissue inhibitors of metalloproteinase 1 (TIMP-1), tissue inhibitors of metalloproteinase 2 (TIMP-2), and vascular endothelial growth factor receptor 2 (VEGF R2) using ELISA kits.

Patients
The 2010 diagnostic criteria of T2DM from the American Diabetes Association were used: (1) glycosylated hemoglobin (HbA1c) ≥ 6.5%; (2) fasting plasma glucose ≥ 7.1 mmol/L; (3) 2 h of blood glucose during an oral glucose tolerance analysis ≥ 11.1 mmol/L; and (4) in a typical hyperglycemic or hyperglycemia crisis patient, random blood glucose was ≥ 11.1 mmol/L. Two-field retinal photographs were taken of each eye of all patients by a trained photographer, using a nonmydriatic fundus camera (Topcon, Tokyo, Japan). The diagnosis and grading of DR were conducted by two trained specialists following the Early Treatment of Diabetic Retinopathy Study Researched Group (ETDRS) [38] as follows: (1)

Clinical examination and data collection
Blood biochemical parameters, including fasting glucose, HbA1c, 2 h postprandial C-peptide, triglycerides, total cholesterol, and low-density lipoprotein were collected at the time of the screening. All information from the patients, including height, weight, diabetic related complications, other histories of diseases, retinal examination, and optical coherence tomography, were recorded. Plasma samples were collected in ethylenediaminetetraacetic acid tubes and were immediately centrifuged at 1,400 × g for 10 min at 4° C, and then the supernatant was aliquoted and stored at -80° C, avoiding freeze thaw cycles. All samples were collected with the signed informed consent from all patients, and all related procedures were performed with the approval of the internal review and ethics boards of the indicated hospitals.

Cytokine antibody assay
Plasma soluble cytokines were measured in duplicate using the Ray Biotech G-Series Human Angiogenesis Array 2 and 3 (Ray Biotech, Norcross, GA, USA) following the recommended protocols. Briefly, all samples were biotinylated. The antibodies were immobilized in specific spot locations on glass slides. The incubation of array membranes with biological samples resulted in the binding of cytokines to the corresponding antibodies. Signals were visualized using streptavidin-horseradish peroxidase conjugates and colorimetric assays. Final spot intensities were measured as the original intensities after subtracting the background. The two kits provided high sensitivity and specificity to simultaneously detect a total of 60 cytokines from the plasma. As determined by densitometry, the inter-array coefficient of variation of spot signal intensities was less than 20%.

Differential protein level analysis
To identify proteins with significant concentrations in the plasma, the raw data were normalized and then the fold change of NPDR vs. T2DM for each cytokine was calculated using the "edgeR" package [39]. The fold change values of cytokines were used to indicate their relative concentration levels. Any fold change ≥ 2 or ≤ 0.5 with FDR < 0.05 was considered as significant. Based on differential plasma protein, principle component analysis (PCA) was conducted to evaluate variation between two groups using the "ggbiplot" package.

ELISA validation
Plasma concentration of PDGF-BB, TIMP-1, TIMP-2, ANG-1, CXCL16, and VEGFR2 were determined in the validation cohort by an ELISA kit following the manufacturer's instructions (Human ELISA kit, MLbio, Shanghai, China). The intra-assay coefficient of variation was 10%, and the inter-assay coefficient of variation was 12%. No significant cross-reactivity or interference was observed.

Machine learning algorithms to distinguish NPDR from T2DM patients
The whole data set was randomly divided into the training set (80%) and the test set (20%). To prevent overfitting, the training set was randomly split into 10 equal-sized subgroups using the 10-folds cross validation method. In 10-folds cross validation, nine subgroups were retained as training data and the remaining one subgroup was used as the validation data for testing the model. The cross-validation process was then repeated 10 times, with each of the 10 subsamples used exactly once as the validation data. The 10 results from the folds then were averaged to produce a single estimation. Finally, the test set was used to evaluate the model (Supplementary Figure 5). Five machine learning AGING algorithms were trained: the artificial neural network (ANN), logistic regression (LR), random forest (RF), support vector machine (SVM), and xgradient-boosting (XGB). Parameters for each machine learning method were shown in Supplementary Table 2. The performance of each classifier was evaluated by its accuracy, sensitivity and specificity, positive predictive value (PPV), negative predictive value (NPV), f1 score, Matthews correlation coefficient (MCC), and area under the curve (AUC) of the receiver operating characteristic (ROC).

Statistical analysis
Differences in clinical characteristics and cytokines between groups were calculated using the Wilcoxon test for continuous variables and the chi-square test for categorical variables. Pearson's correlation was performed to assess the relationship of the plasma cytokines and clinicopathological characteristics. Machine learning algorithms and diagnostic performance were evaluated using scikit-learn (V0.21.3) in Python V3.7.4. Other data analysis and visualization were performed by R software, version 3.6.2 (The R Project for Statistical Computing, Vienna, Austria). A two-sided P-value < 0.05 was considered statistically significant.

Supplementary Tables
Supplementary Table 1