Derivation and Validation of Essential Predictors and Risk Index for Early Detection of Diabetic Retinopathy Using Electronic Health Records

Diabetic retinopathy (DR) is a leading cause of blindness among working-aged adults. The growing prevalence of diabetes calls for cost-effective tools to improve compliance with eye examinations for early detection of DR. The objective of this research is to identify essential predictors and develop predictive technologies for DR using electronic health records. We conducted a retrospective analysis on a derivation cohort with 3749 DR and 94,127 non-DR diabetic patients. In the analysis, an ensemble predictor selection method was employed to find essential predictors among 26 variables covering demographics, duration of diabetes, complications and laboratory results. A predictive model and a risk index were built on the selected essential predictors, and then validated using an independent validation cohort with 869 DR and 6448 non-DR diabetic patients. Of the 26 variables, 10 were identified as essential for predicting DR. The predictive model achieved an AUC of 0.85 on the derivation cohort and 0.77 on the validation cohort. For the risk index, the AUCs were 0.81 and 0.73 on the derivation and validation cohorts, respectively. The predictive technologies can provide an early warning sign that motivates patients to comply with eye examinations for early screening and potential treatment.


Introduction
Diabetic retinopathy (DR) is a vision-threatening microvascular complication of diabetes and a leading cause of blindness among working-aged adults globally [1][2][3][4][5]. According to the 2002 American Diabetes Association Position Statement, nearly all patients with type 1 diabetes and over 60% of patients with type 2 diabetes develop retinopathy during the first 20 years of the disease [6]. In 2015, about 1.5 million Americans were newly diagnosed with diabetes, and an additional 84.1 million Americans had prediabetes [7]. The fast-growing number of new diabetes cases suggests that DR will continue to be a major cause of vision loss and associated functional impairment in the U.S. in the coming years.
Because DR can progress to irreversible stages (in which visual acuity cannot be restored) with relatively few symptoms, early detection and treatment are essential to preventing DR progression and the subsequent vision loss [8]. Although DR diagnostic and treatment options have advanced significantly over the past decades, early detection and screening for DR remain challenging due to poor adherence to annual examination guidelines and a lack of resources to deploy comprehensive screening programs [3,9], especially in rural and underdeveloped areas. Therefore, there is a critical need for stakeholders to pursue innovative ways to implement timely, cost-effective detection techniques and programs in communities.
Clinical predictive models [10][11][12][13][14] provide an effective alternative for improving access to early DR screening under the current limitations, by producing accurate risk estimates of disease based on important biomarkers. Predictive models have been extensively investigated and adopted in diabetes studies [15][16][17][18][19][20][21]. In particular for DR, many conditions comorbid with diabetes, such as hyperglycemia, hypertension and dyslipidemia, have been found to be significantly associated with DR [1,2,[22][23][24][25][26][27]. In addition, HbA1c, fasting plasma glucose, hemoglobin, hematocrit and many other laboratory test values have been found to be risk factors for DR development [28][29][30][31][32]. Based on the risk factors identified, a few prediction models have been developed to predict the incidence and development of DR [33][34][35][36][37][38]. However, most of these models incorporate a multitude of laboratory variables, leaving no consensus about which laboratory tests are required for effective and economical prediction of DR. The objective of this study is to identify the set of laboratory tests most important for DR prediction, and to use these essential laboratory results, in conjunction with other key predictors, to develop an accurate, cost-effective predictive model and an easy-to-use risk index. These predictive technologies can assist healthcare providers in identifying patients at high DR risk and counseling them toward ophthalmic examination and proper treatment at early stages.

Materials and Methods
We performed a retrospective, secondary analysis of electronic health records (EHRs) extracted from two different data sources for derivation and validation, respectively. The details of our data and analysis are elaborated in the remainder of this section.

Data Sources and Data Extraction
We used the Cerner Health Facts® EHR data warehouse (Cerner Corporation, Kansas City, MO, USA) as our derivation data source to identify essential predictors, and develop a model and a risk index to predict the onset of DR. Health Facts® contains clinical data contributed voluntarily from over 200 hospitals using Cerner EHR systems across the U.S. spanning the past two decades. Cerner Corporation collects and integrates the data with its internally established procedures in compliance with Health Insurance Portability and Accountability Act (HIPAA) laws, thus all data are de-identified. The data in Health Facts® are mostly time-stamped and cover a variety of aspects of patients' hospital records including encounters, diagnoses, procedures, medications, vital signs, laboratory results, etc. Since Health Facts® has been completely de-identified according to HIPAA regulations, the Institutional Review Boards (IRB) at Oklahoma State University (OSU) exempted the study from review.
In order to validate the predictors and the predictive technologies derived from Health Facts®, we used the Healthcare Enterprise Repository for Ontological Narration (HERON) [39] from the University of Kansas Medical Center (KUMC) as our validation data source. HERON was established with the EHRs collected from KUMC and its affiliated clinical organizations since 2010. The data contained in HERON include patient demographics and time-stamped encounter, diagnosis, procedure, laboratory, vital sign and medication records. All the HERON data are de-identified, therefore the validation study was exempted from the IRB review of both OSU and KUMC, and was approved by the HERON Data Request Oversight Committee. It should be noted that KUMC and its affiliated clinical organizations have been using Epic EHR systems since 2007, hence the data included in HERON for validation are completely independent from our derivation cohort extracted from Health Facts®.
In this study, we identified diabetic and DR patients from the data sources using the corresponding International Classification of Diseases, Ninth and Tenth Revisions, Clinical Modification (ICD-9/10-CM) diagnosis codes, as listed in Table 1. In particular, diabetic patients were defined as having at least one of the 250.x, E10.x and E11.x diagnosis codes [21]. Among diabetic patients, those who had 362.0x, E10.31x-E10.35x or E11.31x-E11.35x diagnosis code(s) were identified as DR patients (cases); the remaining diabetic patients were considered non-DR patients (controls).
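The case/control definitions above can be expressed as a simple code-prefix matching rule. The Python sketch below is illustrative only (the study's pipeline was implemented in R); the function name and patterns are derived solely from the codes quoted above, not from the full Table 1.

```python
import re

# Prefix patterns for the codes quoted above (illustrative, not the full Table 1).
DIABETES_PATTERN = re.compile(r"^(250\.|E10\.|E11\.)")          # 250.x, E10.x, E11.x
DR_PATTERN = re.compile(r"^(362\.0|E10\.3[1-5]|E11\.3[1-5])")   # 362.0x, E10.31x-E10.35x, E11.31x-E11.35x

def classify_patient(dx_codes):
    """Return 'DR', 'non-DR', or None (not diabetic) for a patient's ICD code list."""
    codes = [c.upper() for c in dx_codes]
    if not any(DIABETES_PATTERN.match(c) for c in codes):
        return None                       # not in the diabetic cohort at all
    if any(DR_PATTERN.match(c) for c in codes):
        return "DR"                       # case
    return "non-DR"                       # control
```

Note that a retinopathy code alone does not place a patient in the case cohort; cases are defined only among patients who first qualify as diabetic.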

Data Preprocessing
To support developing early prediction models for DR, we employed the window-based data aggregation approach proposed by Ng et al. [40] to preprocess the data. As illustrated in Figure 1, the method first identifies an event of interest (EOI), then assigns two successive time windows prior to the event: a prediction window and an observation window. The risk prediction is made at the beginning of the prediction window, using the data aggregated over the observation window, to estimate the risk of the EOI before its actual occurrence. We used the first DR diagnosis as the EOI for the case cohort and selected a random encounter after the diabetes diagnosis as the EOI for the control patients. The lengths of the prediction window and observation window were set to six months and two years, respectively. (In preliminary studies, we tested three observation-window settings: the latest available data before the prediction window, one year, and two years. The two-year observation window yielded the best predictive accuracy and was therefore chosen.)
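The window layout can be sketched as follows. This is an illustrative Python fragment (the study used R); `window_bounds` is a hypothetical helper, and the six-month prediction window is approximated as 183 days.

```python
from datetime import date, timedelta

# Window lengths from the paper; the 183-day value approximates six months.
PREDICTION_WINDOW = timedelta(days=183)   # ~6 months before the EOI
OBSERVATION_WINDOW = timedelta(days=730)  # 2 years of data to aggregate

def window_bounds(eoi_date):
    """Return (obs_start, obs_end, prediction_point) for an event of interest.

    The risk estimate is made at prediction_point, using only data that fall
    inside [obs_start, obs_end), i.e., strictly before the prediction window.
    """
    prediction_point = eoi_date - PREDICTION_WINDOW
    obs_end = prediction_point
    obs_start = obs_end - OBSERVATION_WINDOW
    return obs_start, obs_end, prediction_point
```

For a case patient, `eoi_date` would be the first DR diagnosis date; for a control, a random encounter after the diabetes diagnosis.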

Figure 1. The observation window (2 years) and the prediction window preceding the event of interest.

Our study considered a total of 26 variables: three demographics (gender, race, and age at the beginning of the prediction window), two additional diabetic microvascular complications, nephropathy and neuropathy (the corresponding ICD-9/10 codes are listed in Table 1), duration of diabetes (measured in years from the first diabetic diagnosis to the beginning of the prediction window), and the results of 20 different routine blood tests for diabetic patients, as listed in Table 2. Before aggregating the laboratory results, the interquartile range method [41] was used to identify and remove outliers; the laboratory results within the observation window were then aggregated by taking their mean. The two diabetic microvascular complications, nephropathy and neuropathy, were modeled as binary variables: if a complication occurred before the prediction window, the associated variable was set to 1, and to 0 otherwise. Furthermore, our analysis included only patients with complete records for all variables; in other words, if a patient record had any missing demographics or laboratory results (the complication and duration-of-diabetes variables cannot be missing by construction), the record was excluded from the subsequent analysis.
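The outlier removal and mean aggregation of laboratory results might be sketched as follows in Python (the study used R). `iqr_filter` and `aggregate_lab` are hypothetical helpers, and the 1.5×IQR fence is the conventional choice; the paper does not specify the multiplier.

```python
import statistics

def iqr_filter(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR], the interquartile range rule [41]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

def aggregate_lab(values):
    """Aggregate one lab test over the observation window: remove outliers, then average."""
    kept = iqr_filter(values)
    return statistics.mean(kept) if kept else None  # None if nothing survives filtering
```

One such aggregated value per laboratory test, per patient, feeds the downstream modeling.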

Essential Predictor Identification and Predictive Modeling
As shown in Figure 2, our analysis first evaluated the bivariate association of each variable with the onset of DR. To that end, a Chi-squared test was applied to the categorical variables, including age, gender, race, duration of diabetes and diabetic complications, while the two-sample t-test was used for laboratory results. Furthermore, we used the odds ratio (OR) and its 95% confidence interval (CI) to compare the association strengths (with DR) among the levels of each categorical variable. Significant variables (p < 0.05) from the bivariate analysis were then carried forward to predictive modeling and key predictor selection.

To identify a compact set of predictors with the best predictive power among the statistically significant variables, we randomly partitioned the derivation data into a training set (70%) and a testing set (30%), and applied the machine-learning-based ensemble predictor selection (EPS) method proposed by Song et al. [42] to the training set; the testing set was held out for internal validation. The EPS method consists of two steps: (i) ranking aggregated variable importance and (ii) a golden-section search for a minimal predictor count. In the first step, the method builds machine-learning models on bootstrap samples drawn from the training set and returns the variable importance (in terms of contribution to prediction accuracy) for each bootstrap sample. The importance values are then aggregated across the bootstrap samples, and the variables are sorted by aggregated importance. The second step performs a golden-section search on the sorted variables to determine a minimal set of predictors that maintains predictive accuracy close to that of the full model incorporating all significant variables. In our implementation of the EPS method, we employed extreme gradient boosting (XGBoost) as the machine-learning model.
XGBoost is a popular tree-based machine-learning technique [43] and has demonstrated outstanding performance in EPS [42]. We evaluated predictive accuracy using the area under the receiver operating characteristic curve (AUC), and aggregated variable importance by a weighted average, with the AUC on each bootstrap sample used as the weight.
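As a rough illustration of the two EPS steps, the following Python sketch (the original pipeline was in R) uses a pluggable `fit_and_score` stand-in for training XGBoost on a bootstrap sample. The golden-section update and the `tol_auc` tolerance are simplifications of the method in [42], not its exact specification.

```python
import random
from collections import defaultdict

def aggregate_importance(data, features, fit_and_score, n_boot=50, seed=0):
    """Step (i): rank features by AUC-weighted importance over bootstrap samples.

    fit_and_score(sample, features) is a stand-in for an XGBoost fit; it must
    return (auc, {feature: importance}).
    """
    rng = random.Random(seed)
    total = defaultdict(float)
    weight_sum = 0.0
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]      # bootstrap resample
        auc, importances = fit_and_score(sample, features)
        weight_sum += auc
        for f, imp in importances.items():
            total[f] += auc * imp                      # weight importance by AUC
    return sorted(features, key=lambda f: total[f] / weight_sum, reverse=True)

def golden_section_search(ranked, accuracy_of, tol_auc=0.01):
    """Step (ii), simplified: find the smallest prefix of the ranked predictors
    whose AUC stays within tol_auc of the full model's AUC."""
    phi = (5 ** 0.5 - 1) / 2
    target = accuracy_of(ranked) - tol_auc
    lo, hi = 0, len(ranked)        # invariant: a prefix of length hi reaches target
    while hi - lo > 1:
        mid = int(round(hi - phi * (hi - lo)))   # golden-section interior point
        if accuracy_of(ranked[:mid]) >= target:
            hi = mid               # mid predictors suffice; try fewer
        else:
            lo = mid               # need more than mid predictors
    return ranked[:hi]
```

In the study, `accuracy_of` would retrain the model on the given predictor subset and report its AUC; here it is left abstract so the search logic stands alone.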

Risk Index Development
The machine-learning-based predictive model is a "black box" in nature and thus difficult to interpret [44,45]. To address this issue, risk indices are often developed based on essential predictors to provide predictive tools that are more user-friendly and easier to interpret [46,47]. In this research, we used the scoring method described in [48] to create an index for predicting DR risk. The scoring method consists of the following five steps: (i) Fit a logistic regression on the specified n predictors and obtain the coefficients {β_i | i = 0, 1, . . . , n}, where β_0 is the intercept and β_i is the coefficient of the i-th predictor. (ii) Break down each numerical predictor into intervals (i.e., levels) and determine the reference level for each predictor based on clinical expertise. (iii) For each predictor, calculate the distance from each level to the reference level in terms of regression risk units. The distance is defined as β_i(M_ij − M_iR), where M_ij and M_iR are the level values of level j and the reference level of predictor i, respectively. The level value is defined as the midpoint for numerical, interval levels and as non-negative integers for other types of levels. (iv) Define a constant B specifying how many regression risk units map to one point in the scoring system. In this study, we let B = 5 × β_age; in other words, one point in the risk scoring system corresponds to the increase in regression risk units associated with a 5-year change in age. (v) Compute the score for each level of a predictor by rounding β_i(M_ij − M_iR)/B to the nearest integer. The risk index is the sum of all predictors' scores.
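Steps (iii)-(v) can be illustrated with a small Python sketch (the study used R). All coefficients and level boundaries below, including `beta_age` and the HbA1c intervals, are made-up values for demonstration, not the paper's fitted results.

```python
def level_scores(beta, levels, ref_level, B):
    """Steps (iii) and (v): points for each level = round(beta*(M_j - M_ref)/B)."""
    m_ref = levels[ref_level]
    return {name: round(beta * (m - m_ref) / B) for name, m in levels.items()}

# Step (iv): one point equals the regression risk of a 5-year change in age.
beta_age = 0.04            # hypothetical per-year log-odds coefficient
B = 5 * beta_age           # = 0.2 regression risk units per point

# Example predictor: HbA1c broken into interval levels; level value = interval midpoint.
hba1c_levels = {"<6.5%": 5.75, "6.5-8%": 7.25, "8-10%": 9.0, ">10%": 11.0}
hba1c_scores = level_scores(beta=0.30, levels=hba1c_levels, ref_level="<6.5%", B=B)

def risk_index(per_predictor_scores):
    """The risk index is the sum of the patient's per-predictor scores."""
    return sum(per_predictor_scores)
```

A patient's index is obtained by looking up one score per essential predictor and summing, which is what makes the tool usable with pen and paper.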
All the data preprocessing and predictive technologies presented in this article were implemented using R 3.6.0 [49]. Figure 3 shows the development workflow for our derivation cohort, in which we included 3749 DR and 94,127 non-DR diabetic patients (a DR rate of 3.8%). By applying a similar workflow to the validation data source, we obtained a validation cohort with 869 DR and 6448 non-DR diabetic patients (a DR rate of 11.9%). The cohort statistics and bivariate results are listed in Table 2, which shows that the associations of many variables with DR in the validation cohort were consistent with those in the derivation cohort. For example, a longer duration of diabetes is significantly associated with a higher risk of DR in both cohorts (p-values < 0.001), and gender is statistically insignificant in both cohorts (p-values = 0.707 and 0.955 in the derivation and validation cohorts, respectively). However, certain differences in statistics remain between the two cohorts, especially for the laboratory results. For example, triglycerides (p-value = 0.436) and anion gap (p-value = 0.848) were found to be insignificant in the derivation cohort, but significant in the validation cohort (p-values = 0.003 and 0.011, respectively). Conversely, chloride, MCHC, bilirubin and WBC were significant in the derivation cohort (p-values ≤ 0.001) but insignificant in the validation cohort (p-values of 0.082, 0.084, 0.651 and 0.074, respectively). An interesting observation from the results is that diabetic patients aged 65 or older have a lower risk of developing DR than their younger peers, and this observation holds for both the derivation and validation cohorts. Though seemingly counter-intuitive, a similar finding was reported in [50] (interested readers may refer to the discussion therein for possible reasons, which are beyond the scope of this paper).

Results
We then excluded the variables found statistically insignificant in the derivation cohort (gender, triglycerides and anion gap) and entered the remaining 23 variables into the EPS process to build a predictive model and select the essential predictors based on the derivation cohort. The variable importance returned by EPS is shown in Figure 4. Among the 23 variables, 10 were selected as the essential predictors: age, creatinine, HbA1c, neuropathy, duration of diabetes, WBC, nephropathy, glucose, hematocrit, and sodium. Comparing the predictive accuracies of the full model with all 23 variables against the compact model with only the 10 essential predictors (AUCs are shown in Figure 5), we find that the compact model (AUCs of 0.85 and 0.77 for the derivation and validation cohorts, respectively) achieved accuracies very close to those of the full model (AUCs of 0.85 and 0.78, respectively) on both cohorts.

The top 10 essential predictors were entered into a multivariate logistic regression for risk index development. Table 3 lists the logistic regression results as well as the scoring system derived from them. The risk index calculated with the scoring system estimates the DR risk with considerably good accuracy: its AUC was 0.81 on the derivation cohort and 0.73 on the validation cohort (shown in Figure 5), close to those given by the full models. Though the developed risk index ranges from 0 to 160, our results suggest breaking the index into eight intervals with cutoffs at 40, 50, 60, 70, 80, 90, and 100, as shown in Figure 6.
From the risk index distributions shown in Figure 6, we can observe that the index is capable of reflecting the trend of the DR risk: as the index score increases, the patient risk of developing DR becomes higher in both the derivation and the validation cohorts, supporting the generalizability of the proposed risk index.

Figure 6. Risk index distributions in the cohorts.
The age-specific AUCs of our compact model and risk index are presented in Table 4. It shows that these technologies had better predictive accuracy for younger patients than for senior patients. Furthermore, there exist certain inconsistencies in the performance between the derivation cohort and the validation cohort for patients aged 85 or older. The phenomenon may be a result of the complex health conditions of senior patients and implies the need to improve the technologies' accuracy for this group of patients in the future.

Discussion
A large set of risk factors has been reported in the literature to be significantly associated with the incidence of DR, leaving little consensus regarding which variables are essential for DR prediction. A major contribution of our work is to derive and validate a small set of the most important predictors from this large variable set. We showed that these predictors are essential in that they contributed the majority of the predictive accuracy, and adding more variables did not improve the accuracy significantly.
Our analysis and technologies also have several practical implications. Firstly, we derived a predictive model with a minimal number of predictors that retains high prediction accuracy for DR. As previously stated, the current method for detecting or diagnosing DR is the annual ophthalmic exam. However, the annual eye examination has a low compliance rate in many areas, especially rural communities, due to the lack of specialists and the high overhead to patients [51]. Our study tackles this poor-compliance challenge by developing an automatic predictive model whose output provides an effective early warning of DR risk and prompts practitioners to counsel and refer patients for ophthalmic examination and proper treatment. Compared with several published models that similarly aim to predict DR risk, our model includes fewer predictors, resulting in easier interpretation and lower data collection costs.
Secondly, the risk index we developed provides a practical and user-friendly tool to monitor the essential predictors for early warning signs of DR. Though the machine learning model achieved high predictive accuracies in both derivation and validation cohorts, its black box nature makes it difficult to understand how decisions are made. Moreover, complex machine-learning algorithms require the support of specific software (such as R or EHR systems) for their execution, which are often less user-friendly to health workers and can lead to additional costs to hospitals/clinics. The proposed risk index was designed to address these issues.
More notably, we externally validated the predictive model and the risk index by testing the technologies on a large patient cohort from an independent EHR database. The technologies demonstrated promising discriminative ability in the external data source, which implies their generalizability to the entire diabetic patient population. As far as we know, this study is the first effort to validate diagnostic predictive models for DR on two distinct patient cohorts.
With respect to utilizing the predictive technologies, we recommend that health workers and patients use the risk index because of its ease of use. EHR vendors, on the other hand, may integrate the machine-learning model into their EHR systems to make this more accurate prediction tool available to providers.
Caveats: It is worth noting that the predictive technologies are not intended to replace the regular eye examination, which is the gold standard for DR diagnosis. However, they can be useful screening tools for identifying patients who may be at higher risk, helping ensure timely diagnosis and intervention. Moreover, it is important to note that the statistical results presented in this article reveal associations between the predictors and the development of DR; the associations do not necessarily reflect causality. It is still unclear why the laboratory values are significantly associated with DR, and how they contribute to the prediction from a pathological perspective. Therefore, the pathological role of the essential predictors in the development of DR should be investigated in the future.
Future Work: Several directions are worth investigating following this study. Firstly, better-quality data on senior diabetic patients are desired to improve the prediction accuracy for this group. Secondly, we plan to include more comorbidities, treatments and laboratory results in the feature selection process and predictive models, both to find novel essential predictors and to improve predictive accuracy. We are also interested in conducting clinical trials to further validate the essential predictors and the predictive tools. Furthermore, the methods employed in this study can be extended to nephropathy, neuropathy and other diabetic complications to create a comprehensive prognosis toolkit for diabetes and its complications. Lastly, the pathological relationship between the development of DR and the key predictors is also an interesting direction for future research.
Limitations: There exist two limitations in this research. Firstly, EHRs do not necessarily capture the complete picture of patient health. Useful data may be missing from our EHR data sources for many reasons: a patient may not comply with follow-up or treatment, a DR diagnosis may not be entered into the EHR problem list due to neglect, or a patient may visit other hospitals with different EHR systems. Secondly, due to de-identification, a small number (2173) of patients older than 90 were recorded as 90 years old in the derivation cohort (there were no such patients in the validation cohort). This inaccuracy may undermine the prediction performance of our technologies for patients older than 90; more accurate data for patients aged 90 years or older would help address the limitation.

Conclusions
DR is a major cause of blindness among working-aged adults worldwide. The vision loss that occurs at the late stages of DR cannot be reversed; as a result, early diagnosis of DR is crucial. In this study, we conducted a retrospective analysis of EHR data to identify a set of essential predictors of DR. Based on these key predictors, we then derived and validated a compact predictive model and a risk index for early detection of DR. The technologies demonstrated promising prediction accuracies, both internally on the derivation cohort and externally on the validation cohort. The DR risk given by the predictive technologies can serve as an early warning sign to urge patients to comply with prompt ophthalmic examination, which currently has a relatively low compliance rate.

Institutional Review Board Statement: Ethical review and approval were waived for this study because all data used are HIPAA compliant and de-identified.

Informed Consent Statement: Not applicable.
Data Availability Statement: Restrictions apply to the availability of the Cerner Health Facts® EHR data. The data were obtained from the Cerner Corporation and are available from the authors with the permission of the Cerner Corporation. The HERON data presented in this study are available on request from KUMC. The data are not publicly available because they are proprietary to KUMC.