A Clinical Decision Support System to Stratify the Temporal Risk of Diabetic Retinopathy

Diabetic Retinopathy (DR) is the most common and insidious microvascular complication of diabetes, and can progress asymptomatically until a sudden loss of vision occurs. Although DR is prevalent nowadays, its prevention remains challenging. The multiple aim of this study was to predict the risk of developing DR as diabetic complication (task 1) and, subsequently, temporally stratify the DR risk (task 2) using electronic health records data. To perform these objectives, a novel preprocessing procedure was designed to select both control and pathological patients, and moreover, a novel fully annotated/standardized 120K dataset from multiple diabetologic centers was provided. Globally, although the Extreme Gradient Boosting model offers satisfying predictive performance, the Random Forest model obtained the best predictive performance to solve task 1 and task 2, reaching the best Area Under the Precision-Recall Curve of 72.43 % and 84.38 %, respectively. Also the features importance extracted from the best Machine Learning (ML) models is provided. The proposed Artificial Intelligence-based solution was proven to be capable of generalizing across different diabetologic centers while ensuring high-interpretability. Moreover, the proposed ML solution is currently being adopted as a Clinical Decision Support System in several diabetologic centers for DR screening and follow-up purposes.


I. INTRODUCTION
The diabetic retinopathy (DR) is the most common and insidious microvascular complication of diabetes, and can progress asymptomatically until a sudden loss of vision occurs [1]. Almost all patients with type 1 diabetes mellitus and 60% of patients with type 2 diabetes mellitus will develop DR during the first 20 years from onset of diabetes [1]. With the rising prevalence of diabetes and increasing numbers of people with diabetes living longer, the number of people with DR and visual impairment due to this disease is rising worldwide [2]. Early diagnosis of diabetic patients and appropriate timely treatment has gradually become an effective measure to prevent DR disease, with a positive economic impact on patients and the healthcare system. Although DR is prevalent nowadays, its prevention remains challenging. Physicians typically diagnose the presence and severity of DR through visual The associate editor coordinating the review of this manuscript and approving it for publication was Vicente García-Díaz . assessment of the fundus images by direct evaluation. Given the large number of diabetes patients globally, this process is expensive and time consuming [3], [4]. Furthermore, 75% of worldwide DR patients live in underdeveloped areas, where sufficient specialists and adequate medical infrastructures for this purpose are unavailable [5]. Consequently, millions of persons continue to experience vision impairment without proper predictive diagnosis and eye care.
Screening tests are performed in asymptomatic persons to assess for the presence of a particular disease or the risk of that disease. An effective screening test program should reduce morbidity and mortality in a population by detecting disease at a stage at which treatment will make a difference. Global screening programs have been created to counter the proliferation of preventable eye diseases, but DR exists at large a scale and its detection on individual basis is scarcely effective. Additionally, given also the current cost-conscious era of healthcare, a policy to reduce unnecessary screening for several retinal diseases is becoming necessary [6].
The availability and the huge amount of Electronic Health Record (EHR) data is exponentially growing, thus early diagnosis of patients with diabetes using EHR data has gradually become an effective measure to prevent DR disease [7]. Using routine EHR data (i.e., demographics information, lab tests, exams, pathologies), it is possible to predict whether diabetic patients are going to develop DR. Comparing this approach with conventional gold standard DR diagnosis (e.g., fundus images), it does not only eliminate the time and money costs, but also maintains an acceptable accuracy [8]. Conventional diagnosis of DR requires a professional medical facility to obtain the fundus image by advanced medical equipment and then a professional physician to evaluate the single case of study. Early diagnosis of DR with convenient, easy-to-access, free, routine EHR data may result a simple and convenient alternative to manage and treat diabetic patients.
The multiple aim of this study was to i) predict the risk of developing DR as diabetic complication (i.e., presence/absence of DR) [i.e., task 1] and, subsequently only for DR patients, ii) temporally stratify the DR risk (i.e., in 0-2 years or in 2-5 years) [i.e., task 2] using EHR data.
The purpose of this work is to bridge the gap between the ML research applied to EHR data and the development of a Clinical Decision Support System (CDSS) while keeping humans at the center of the design and evaluation process [9]- [11]. It is worth noting that the task 1 predicts the presence/absence of DR, while the task 2, if the presence of DR was predicted by the task 1, predicts when the patients will develop DR (i.e., in 0-2 years or in 2-5 years). The proposed two-stage hierarchical ML procedure can be seen as a sequential predictive process within a Clinical Decision Support System (CDSS) application. In particular, main contributions to the biomedical informatics field can be summarized as follows: (i) the employment of the novel fully annotated/standardized 120k dataset from multiple diabetologic centers, thus leading to a higher clinical impact procedure, (ii) the design of a novel preprocessing procedure for selecting control and pathological patients (iii) the application of a two-stage ML procedure to firstly predict the presence/absence of DR and secondly to stratify the temporal risk of the disease for each patient, (iv) the effectiveness of the proposed experimental procedure for generalizing across different diabetologic centers.
The rest of the paper is organized as follows: Section II gives an overview of the state-of-the-art approaches for risk prediction and risk stratification of DR; Section III describes the 120K dataset and the preprocessing procedure; Section IV describes the experimental and validation procedure of the ML models comparison; Section V shows the predictive performance and features importance results; Section VI and Section VII discuss the experimental findings and conclude the paper.

II. RELATED WORK
The gold standard of DR diagnosis is represented by the assessment of fundus images. Thus, among all diabetes-related complications, the DR is the most studied field based on DL imaging techniques. Thus several DL models based on fundus images were designed to identify DR from a non-pathological condition [5], classify different DR stages [4], and predict future DR progression [1]. Diagnostic studies for DR based on EHR data have been still poorly explored, but something more has already been attempted in terms of screening and risk prediction. Targeted screening intervals based on EHR data analysis was adopted to detect DR and to reduce the burden of unnecessary screening examinations [12]. Increasing the interval between screening visits for DR beyond 1 year in low-risk patients is reasonable, since the data showed little difference between the 1-year and 2-year screening frequency with respect to clinical outcomes [13]; additionally, extending the time interval between screening visits to every 3 or 4 years on the basis of retinopathy status and glycated hemoglobin level might effectively decrease the rates of screening adherence in the population [6].
Individualized risk assessments were studied using both epidemiologic and clinical data, including the type and duration of diabetes, glycated hemoglobin or mean blood glucose levels, blood pressure, and the presence and grade of retinopathy [14]. Almost a 30-year period of diabetic patients' fundus images were analysed to simplify individualized risk assessments: an accurate assessment of the risk of proliferative DR or clinically significant macular edema was possible with the use of only the patient's current retinopathy status and glycated hemoglobin levels [12]. Moreover, also a recommended time until the next eye examination on the basis of these two factors was estimated.
Focusing on ML-based solutions using EHR data, in [8] the predictive performance of several ML models (i.e., Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Naive Bayes (NB)) were compared with the aim to identify DR adopting features engineering techniques. However, differently from our proposed work, the temporal risk stratification task was not provided. Moreover, in [8] the classes of DR and control patients were perfectly balanced, and this aspect, beyond that favoring the predictive ML model, unfortunately never reflects the real clinical scenario. In [15] was presented a robust end-to-end ML-based SaaS framework, consisting of a ridge regularized survival SVM with a clinical kernel, coupled with Chi-square distance-based feature selection, to uncover relevant DR risk factors associated with disease outcomes by exploiting the weak correlations in EHRs. In our work we use neither features selection [15] nor features engineering [8] because these strategies require human effort and affects the standardized EHR data not ensuring reproducibility and scalability. In [16] a data-driven survival analysis approach was presented to predict when a patient will develop complications after the initial T2D diagnosis and to rank the associated risk. Moreover, to better capture the correlations of time-to-events of multiple complications, a further multi-task version of the survival model was developed. However, as in [8], [15], the observational temporal window of interest (TWOI) for DR patients was selected considering only the DR diagnosis code, but not also the non-DR diagnosis code. Instead to the best of our knowledge, we did the first attempt to define a TWOI for DR patients based on a double-check of DR and non-DR codes. This conservative choice to avoid a misleading TWOI for DR patients is motivated by the presence of frequent typos, physician's transcription errors, and EHR framework faults or anomalies in real-world EHR scenario. On the contrary, differently from DL imaging techniques [1], [4], [5], all the studies based on EHR data [8], [15], [16] gave much more attention to model interpretability and pattern localization.
Previous breakthrough research findings rely on DL techniques to diagnose DR in patients with medical imaging. Although the medical imaging achieves reasonable recognition accuracy, the application of mass, easy-to-obtain and routine EHR data can make an early diagnosis of the DR more convenient and suitable.

A. DATASET
The 120K dataset, provided by Regione Marche, was collected by aggregating patients of several Italian diabetologic centers. The dataset consists of 120K diabetic patients and was organized in the following 3 different fields: • The demographics field stores the patient's identificative number (ID patient), gender, year of birth, and diabetes diagnosis date. In particular, the first name and the surname of the patients were anonymized and associated to a random numeric ID patient.
• The pathological field stores the ID patient, the pathology codes, and the pathology diagnosis date.
• The lab tests field stores ID patient, the lab tests codes, the lab tests values, and the lab tests prescription date.

B. PREPROCESSING
In accordance with the diabetologist, all the pathology codes associated with DR were identified and summarized in Table 1. The first pathology code (i.e., -3001) indicates a non-DR condition, while all the remaining codes indicate a DR condition.
All the pathology codes that were not included in Table 1 were removed from pathological field. Then, for each patient, both pathology codes and lab tests codes were removed if pathology diagnosis date and lab tests prescription date were earlier than diabetes diagnosis date. Finally, the inclusion criteria to select the time-widow of interest (TWOI) were presented for both control patients and retino patients as depicted in Figure 1.

1) CONTROL PATIENTS -TWOI
A control patient must have at least 2 pathology codes (i.e., -3001) of non-DR and none of the remaining pathology codes available in Table 1. A TWOI of a control patient (see Figure 1 -upper side) is delimited by the earliest pathology code of non-DR and the latest code of non-DR.

2) RETINO PATIENTS -TWOI
A control patient must have at least a pathology code (i.e., -3001) of non-DR and at least one of the remaining pathology codes available in Table 1. A TWOI of a retino patient (see Figure 1 -bottom side) is delimited by the earliest pathology code (i.e., -3001) of non-DR and the earliest pathology code of DR available in Table 1. A patient was included in the study only if the date of the earliest non-DR code (i.e., -3001) was before the earliest date of DR code.

C. TASKS DEFINITION
During the task definition stage, the matrices X 1 and X 2 fed to the ML model were defined both for task 1 and task 2, as well as the ground-truth vectors Y 1 and Y 2 1) TASK 1 The task 1, defined as the prediction between control and DR patients, was evaluated by taking the average of all the lab tests values enclosed in the range of the TWOI. The X 1 = M 1 × N 1 matrix was obtained (see Figure 2), which was composed by m 1 = 1, 2, . . . , M 1 patients and n 1 = 1, 2, . . . , N 1 − 3 unique lab tests codes (i.e., predictors). In addition to the already existing predictors (i.e., unique lab tests codes), also the information of gender, age, and duration of diabetes was added, by obtaining the final M 1 × N 1 matrix. All the missing values of X 1 matrix was filled with an extra-values imputation (i.e., −999). Several standard data imputation techniques (i.e., median, mean, KNN) were tested, but extra values imputation guaranteed the best predictive performance. This benefit can be explained by the fact that the extra value imputation allows to properly track the missing value mechanism, thus exploiting a correlation between the occurrences of missing values and the dependent variables.    Figure 3 shows the missing values distribution of the X 1 matrix. The ground-truth vector Y 1 of size M 1 ×1 is composed of control patients, labelized as negative and retino patients, labelized as positive.
2) TASK 2 The task 2, defined as the temporal stratification of the DR risk, was evaluated only among the retino patients. For each patient, the unique lab tests prescription dates enclosed in the TWOI were pointed out. Each of those represented an observation of the patient. Thus, for each patient, starting from the earliest observation close to the lower boundary of the TWOI, the mobile averages of all the lab tests values inside the range of the dynamic time-windows were taken, observation by observation, until the latest observation close to the upper boundary of the TWOI. A M 2 × N 2 matrix was obtained (see Figure 4), which was composed by m 2 = 1, 2, . . . , M 2 total observations of all patients and n 2 = 1, 2, . . . , N 2 − 4 unique lab tests codes. In addition to the already existing predictors (i.e., unique lab tests codes), also the information of gender, age, duration of diabetes, and incremental number of observations (seq) per patient was added, by obtaining the final M 2 × N 2 matrix. All the missing values of X 2 matrix was filled with VOLUME 9, 2021  an extra-values imputation (i.e., -999). The task 2 consisted in the prediction of the temporal distance between the date of each patient's observation and the date of DR diagnosis. The risk was defined ''high'' if the temporal distance is within the range of 0 − 2 years, otherwise was defined mid-low if within the range of 2 − 5 years. The ground-truth vector Y 2 of size M 2 × 1 is composed of short-term risk patients and long-term risk patients. Retino patients whose temporal distance was greater than 5 years are excluded from the study.

A. EXPERIMENTAL PROCEDURE
To perform the task 1, a Tenfold Cross-Validation (CV-10) experimental procedure was chosen. CV-10 was implemented dividing all patients in ten folds, by selecting nine folds for training and one fold for testing. During the training stage of the CV-10 procedure, SMOTE [17] was utilized to equally balance DR patients with respect to control patients. CV-10 procedure was implemented without considering the temporal evolution of predictors, providing an overall average of the patient's clinical history.
On the contrary, to perform the task 2, a Tenfold Cross-Validation Over Patients (CVOP-10) was chosen. CVOP-10 was implemented dividing all observations grouped by patients in ten folds, by selecting nine folds for training and one fold for testing. CVOP-10 procedure was implemented considering the temporal evolution of the patient's predictors and allowed to generalize across unseen patients.

B. METRICS
The proposed task 1 and task 2 were evaluated by considering the following metrics for the classification task: • Accuracy: the percentage of correct predictions; • Macro-precision (Precision): the Precision is calculated for each class and then the unweighted mean is taken; • Macro-recall (Recall): the Recall is calculated for each class and then the unweighted mean is taken; • Macro-F1 (F1): the harmonic mean of precision and recall averaged over all output categories; • Area Under the receiver operating characteristic Curve (AUC): the AUC represents the probability that the classifier will rank a randomly chosen positive sample higher than a randomly chosen negative one; • Area Under the Precision-Recall Curve (PRAUC): The PRAUC can be interpreted as the relationship between precision and recall (sensitivity) and is considered more informative than the AUC plot when evaluating binary classifiers on imbalanced data [18].

C. VALIDATION PROCEDURE
For what concern the CV-10 and CVOP-10 experimental procedures, the optimization of the hyperparameters of the ML models was performed implementing a grid-search and optimizing the Recall in a nested Fivefold Cross-Validation.
Recall was preferred over other optimization objectives, because the minimization of false negatives has the most clinical relevance for the task 1 experiment. Hence, each split of the outer loop was trained with the optimal hyperparameters tuned in the inner loop. Although this procedure was computationally expensive, it allowed to obtain an unbiased and robust performance evaluation [19]. In according with the work in [8], the predictive performances of several ML models such as XGBoosting (XGB), LR, DT, RF, SVM, and NB were compared. Table 5 summarizes the range of the hyperparameters optimized during validation stage for each ML model. The features importance of the XGB model was extracted in according to the logic of showing the number of times the feature is used to split data, while for the RF model in according to the logic of averaging the decrease in impurity over trees. We decided to not explore model-agnostic methods for showing the importance of each feature, because we aimed to emphasize the model intrinsic dependency. For that reason we show a global feature importance that is intrinsic within the designed ensemble-based white box models. Future work could be explored in order to provide further interpretability of the proposed approach by exploiting post-hoc explainable AI methodology, specifically tailored to clinician point of view. This methodology includes the possibility (i) to provide local feature importance (SHAP [20] and LIME [21]), (ii) to unraveled rules and laws (features interaction) and (iii) to provide transparent risk equations (model non-linearity) [22].  Table 6 shows the predictive performance experimental results and it is evident as RF and XGB have proved to be best models for both task 1 and task 2. In accordance also with [8], RF1 is the best model for task  Figure 5 shows the top-10 discriminant predictors for task 1, while Figure 6 for task 2. Only the two best models (i.e., XGB and RF) were considered in this evaluation.

A. PREDICTIVE PERFORMANCE
Globally, XGB and RF models obtained the best predictive performance to solve task 1 and task 2. Tree-based models (i.e., DT, RF, XGB) obtained the best predictive performance in both tasks and considerably overcome the other ML models. However, the simple DT1 did not guarantee robustness against class imbalance, because PRAUC (57.30) was significantly inferior than RF1 (72.43) and XGB1 (71.26). Thus, only the more complex tree-based models (i.e., RF, XGB) achieved the best result to solve task1 and task 2. Differently from [8] we designed the experimental setup using raw EHR data without employing features engineering techniques. This aspect assumes a great relevance in terms of data interpretability and algorithm scalability. Thus, the possibility of extracting raw features from EHRs permits the clinician to appreciate and interpret each single features contribution, and moreover, this scenario could be more easily scalable and transferable to other standardized clinician domains avoiding an hand-crafted human intervention.

B. CLINICAL SIGNIFICANCE
The contribution of the single predictors changes every time in relation to different tasks and different ML models (i.e., inter-task and inter-model variability). The common intersection regards the strong correlation between features importance and its associated missing value rate (mvr). The higher the quality and the completeness of the data, the more VOLUME 9, 2021 TABLE 6. Experimental results (i.e., task1 and task2) for each machine learning model: XGBoosting (XGB), logistic regression (LR), decision tree (DT), random forest (RF), support vector machine (SVM), and naive Bayes (NB). Predictive performance and standard deviation are reported for each metric.  Table 4.  Table 4. important the predictor will assume. This suggests how a high-quality data collection and representation always positively affects every AI-based solution.
Moving to task 2, the sequential number of observations per patient (mvr = 0%), weight (mvr ≈ 0%), and glycated hemoglobin represented the most important predictors for RF2, while BMI, fasting glycaemia, and glycated hemoglobin for XGB2. The temporal information as the sequential number of observations per patient appears predominant in RF2, while the 3 most important predictors in XGB2 were appeared in top-10 predictors in XGB1. The XGB model tends to give importance to the same cluster of predictors, on the contrary the RF model was able to capture the very important temporal information represented by sequential number of observations per patient to perform the task 2.
Both ML models (i.e., RF and XGB), adopting the proposed novel preprocessing procedure for selecting control and pathological patients, were effective capable to generalize across different diabetologic centers, a necessary requirement to transfer the AI-based solution to other clinical scenarios.
The proposed AI-based solution represents the core of a CDSS. In fact, Meteda srl, leading company in Italy for innovative software solutions for diabetes, has already efficiently integrated the proposed ML-based CDSS into the EHR architecture of some diabetic centers for a pilot study. The first release of the predictive medicine tool is focused on DR and can be addressed to primary care levels or specialists for screening and follow-up purposes. Clinicians are currently adopting the proposed ML-based CDSS trained on 120K dataset on new unseen patients from other different diabetic centers. The preliminary outcomes are proving that our proposed solution is generalizable also across heterogeneous clinical scenarios.

C. FUTURE WORK
Future work may be oriented to extend the operating range and provide to predict also other diabetic complications (i.e., cardiopathy, nefropathy, neuropathy, and vasculopathy). In this direction, a full-version of the proposed CDSS will be released to the whole clinical ecosystem. Another relevant aspect to focus on could be enhancing the quality of EHR data collection in clinical scenario; thus, the proposed ML-based CDSS could help each diabetic centers to reach baseline standards in the collection, completeness, and quality of the data. Diabetic centers located below a certain quality threshold could be alerted by the ML-based CDSS to bridge the gap. Moreover, diabetic centers should be pushed to achieve targeted data quality indicators (e.g., mvr, prescription of a particular lab test, repetition of a particular lab test, etc.), both to facilitate the predictive system and to improve the diabetic patient's quality of life. Future work may be addressed to measure the generalizability of the model across different healthcare infrastructure and lifestyle by collecting and including diabetic patients from different geographical areas.

VII. CONCLUSION
This work proposed a two-stage ML procedure as the core of a CDSS to firstly predict the presence/absence of DR and secondly to temporally stratify the risk of the disease for each patient. For this objective, a novel preprocessing procedure was designed to select both control and pathological patients, and moreover, the novel fully annotated/standardized 120K dataset from multiple diabetologic centers was provided. The proposed AI-based solution was proven to be capable of generalizing across different diabetologic centers and was utilised as pilot study in several diabetologic centers for DR screening and follow-up purposes.