Recognition and Detection of Diabetic Retinopathy Using Densenet-65 Based Faster-RCNN

Abstract: Diabetes is a metabolic disorder that can lead to a retinal complication called diabetic retinopathy (DR), one of the four main causes of blindness worldwide. DR usually has no clear symptoms before onset, which makes identifying the disease a challenging task. The healthcare industry may face unfavorable consequences if the gap in identifying DR is not filled with effective automation. Our objective is therefore to develop an automatic and cost-effective method for classifying DR samples. In this work, we present a custom Faster-RCNN technique for the recognition and classification of DR lesions from retinal images. After pre-processing, we generate the annotations of the dataset, which are required for model training. We then introduce DenseNet-65 at the feature extraction level of Faster-RCNN to compute a representative set of key points. Finally, the Faster-RCNN localizes the lesions and classifies the input sample into five classes. Rigorous experiments performed on a Kaggle dataset comprising 88,704 images show that the introduced methodology achieves an accuracy of 97.2%. We have compared our technique with state-of-the-art approaches to show its robustness in terms of DR localization and classification. Additionally, we performed cross-dataset validation on the Kaggle and APTOS datasets and achieved strong results in both the training and testing phases.


Introduction
Diabetes, scientifically known as diabetes mellitus, is a metabolic imbalance that leads to an increase in the level of glucose in the bloodstream. According to an estimate provided in [1], about 415 million people are affected by this disease. Prolonged diabetes causes retinal complications that result in a medical condition called DR, which is one of the four main causes of blindness worldwide. More than 80% of people who have had diabetes for a long time suffer from this condition [2]. The high level of glucose in circulating blood causes blood leaks and an increased supply of glucose to the retina. This often leads to abnormal lesions in the retina, i.e., microaneurysms, hard exudates, cotton wool spots, and hemorrhages, thus causing vision impairment [3]. DR usually has no clear symptoms before onset. The most common screening tool used for the detection of DR is retinal (fundus) photography.
For treatment purposes and to avoid vision impairment, DR is classified into different levels according to the severity of the disorder. According to research on the early treatment of DR and the international clinical DR scale, there are five levels of DR severity. In the zeroth level of DR severity, there is no abnormality. The first, second, third, and fourth levels correspond to mild non-proliferative diabetic retinopathy (NPDR, characterized by microaneurysms), moderate NPDR, severe NPDR, and proliferative DR, respectively. Tab. 1 summarizes the five levels of DR severity with their respective fundoscopy observations. For computerized identification of DR, hand-coded key points were initially used to detect the lesions of DR [4][5][6][7][8][9][10][11][12][13][14]. However, these approaches exhibit low performance due to large changes in color and size, intra-class variations, bright regions, and high variations among different classes. Moreover, small signs other than microaneurysms, medical rule marks, and other objects also contribute to the unpromising results of CAD solutions. Another reason for the degraded performance of automated DR detection systems is the inclusion of non-affected regions with the affected area, which in turn gives a weak set of features. To achieve promising performance in computer-based diabetic retinal disease detection, an efficient set of key points is required.
Object detection and classification in images using various machine learning techniques have been a focus of the research community [15,16]. Especially with the advent of CNNs, various models have been proposed to accomplish the tasks of object detection and classification in the areas of computer vision (CV), speech recognition, natural language processing (NLP), robotics, and medicine [17][18][19][20][21]. Similarly, there are various examples of deep learning (DL) use in biomedical applications [22,23]. In this work, we introduce a technique that covers the data preparation, recognition, and classification of DR from retinal images. In the first step, we prepare our dataset with the help of ground truths. For detection and feature extraction, we propose a CNN algorithm named DenseNet-65 for images of size 340 × 240 pixels. We also present a performance comparison of our model in terms of accuracy with DenseNet-121, ResNet-50, and EfficientNet-B5. Moreover, we have compared our approach against the most recent techniques. Our analysis reveals that the introduced technique has the potential to correctly classify the images. The following are the main contributions of our work:
• The development of annotations for a large dataset with a total of 88,704 images.
• We have introduced a customized Faster-RCNN with DenseNet-65 at the feature extraction level, which improves the localization of small objects while decreasing both training and testing time complexity. By removing unnecessary layers, DenseNet-65 minimizes the loss of the bottom-level high-resolution key points and preserves the information of small target regions, which is otherwise lost by repeated key points.
• The development of a technique for classifying DR images using the DenseNet-65 architecture instead of hand-engineered features, which reduces cost and the need for face-to-face consultations and diagnosis.
• Furthermore, we have compared the classification accuracy of the presented framework with other algorithms such as AlexNet, VGG, GoogleNet, and ResNet-11. The results presented in this work show that the DenseNet architecture performs well in comparison to the latest approaches.
The remainder of the manuscript is arranged as follows: In Section 2, we present the related work. This includes the work on the classification of DR images using handcrafted features and DL approaches. In Section 3, we present the proposed methodology of DR image classification using the custom Faster-RCNN. In Section 4, we present the results and evaluations of the introduced work. Finally, in Section 5 we conclude our work.

Related Work
Several approaches have been introduced in the past to classify images of the normal retina and the retina with DR. In [24], the authors propose a technique that uses mixture models to dynamically threshold the images for differentiating exudates from the background. Afterward, edge detection is applied to separate cotton wool spots from the background texture. The proposed work reports a sensitivity of 100% and a specificity of 90%. The authors in [25] present an algorithm that performs 2-step classification by combining four machine learning techniques, namely, k-nearest neighbors (KNN) [26], Gaussian mixture models (GMM) [27], support vector machines (SVM) [28], and the AdaBoost algorithm [29]. The authors report a sensitivity and specificity of 100% and 53.16%, respectively. Priya et al. [30] propose a framework to categorize fundus samples into two classes: proliferative DR and non-proliferative DR. The proposed technique first extracts hand-engineered features of DR abnormalities, for instance, hemorrhages, hard exudates, and swollen blood vessels. These hand-engineered features are then used to train a hybrid model of probabilistic neural networks (PNN), SVM, and Bayesian classifiers. The accuracy of each model is computed separately, i.e., 89.6%, 94.4%, and 97.6% for PNN, SVM, and Bayesian classifiers, respectively. In [31], the authors propose a technique based on the idea of a bag of visual words. In the initial stage, the proposed algorithm detects points of interest based on hand-engineered features. Second, the feature vectors of these detected points are used to construct the dictionary. Finally, the algorithm uses an SVM to classify whether the input image of the human retina contains hard exudates.
With the introduction of DL, the focus has shifted to methods that classify DR images by employing deep neural networks as a replacement for hand-coded key points. The related work on categorizing normal and DR retinas using DL methodologies is summarized in Tab. 2.

| Reference | Method | Results | Limitations |
|---|---|---|---|
| Li et al. [33] | Fine-tuning-based CNN classifier design, in which fine-tuning is applied to all the layers of the pre-trained CNN and then only to selected layers of the pre-trained network. An alternative CNN method is also proposed that computes key points and then trains an SVM for classifying the DR images. | Fine-tuning specific layers of the CNN performed best compared with fine-tuning all layers and with the SVM method. The former method results in an accuracy of 92.01%. | The classification results can be improved by carefully fine-tuning an advanced CNN architecture. |
| Zhang et al. [34] | Deeply supervised ResNet for the classification of DR severity level. The ResNet architecture is modified by adding 3 sets of side-output layers in the hidden layers of an 11-layer ResNet. | The 11-layer ResNet achieves a classification accuracy of 81.0%. | The accuracy can be improved by either making ResNet deeper or using different architectures. |
| Wang et al. [3] | Various CNN frameworks for the severity level classification of DR. The authors use AlexNet, VGG16, and InceptionNet-V3 for the classification of DR samples. | The accuracy of the proposed work is very low. Modification of the architecture can lead to better accuracy. | The accuracy of the algorithms can be further improved. |
| Zhang et al. [36] | A system called DeepDR for the automatic recognition of DR images and their severity level using transfer learning and ensemble learning. | The proposed DeepDR has a sensitivity and specificity of 97.5% and 97.7%, respectively. | The accuracy can be improved, and the complexity of the model can be lowered. |
| Bodapati et al. [37] | DL-based automated DR identification network using fundus images. The authors compute deep features by employing VGG16-fc1, VGG16-fc2, and Xception networks. Based on the obtained set of hybrid features, a DNN model is used to specify the DR severity level. | The introduced framework achieves an accuracy of 80.96%. | The prediction accuracy of the technique can be further enhanced. |

Proposed Methodology
The presented work comprises two main parts. The first is 'dataset preparation' and the second is the custom 'Faster-RCNN builder' for localization and classification.
The first module develops the annotations for DR lesions to locate the exact region of each lesion, while the second component of the introduced framework builds a new type of Faster-RCNN. This module comprises two sub-modules: the first is a CNN framework and the other is the training component, which trains the Faster-RCNN using the key points computed by the CNN model. Faster-RCNN accepts two types of input: the image sample and the location of the lesion in the input image. Fig. 1 shows the functionality of the presented technique. At first, an input sample along with the annotation's bounding box (bbox) is passed to the nominated CNN model. The bbox identifies the region of interest (ROI) in the CNN key points. With these bboxes, the reserved key points from the training samples are selected. Based on the computed features, the Faster-RCNN trains a classifier and generates a regressor for the given regions. The classifier module assigns a predicted class to each object, and the regressor component learns to determine the coordinates of the potential bbox to locate the lesion in each image. Finally, accuracy is estimated for each unit using the metrics employed in the CV field.

Preprocessing
Like any other real-world dataset, our data contains various artifacts, such as noise, out-of-focus images, and underexposed or overexposed images. These may lead to poor classification results. Therefore, we perform data pre-processing on the samples before inputting them to the CNNs.
where σ represents the standard deviation of the Gaussian, x and y represent the distances from the origin along the horizontal and vertical axes, and G(x, y) is the output of the Gaussian filter. Afterward, we subtract the local average color, obtained by blurring the image, from the image using Eq. (2).
where I'(x, y), I(x, y), and G(x, y) * I(x, y) represent the contrast-corrected image, the original image, and the original image convolved with the Gaussian filter, respectively.
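A standard form consistent with these symbol definitions is given here as an assumption rather than as the authors' exact expressions. In particular, any scaling or offset constants in Eq. (2) are not known:

```latex
\begin{align}
G(x, y) &= \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{1} \\
I'(x, y) &= I(x, y) - \bigl(G(x, y) * I(x, y)\bigr) \tag{2}
\end{align}
```

Here * denotes two-dimensional convolution and I'(x, y) is the contrast-corrected image.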
Second, we remove regions that carry no information. In the original dataset, certain areas of the image can be removed without affecting the output. Therefore, we crop these regions from the input image. Cropping the images not only enhances classification performance but also assists in reducing computation.
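The two pre-processing steps can be sketched as follows. This is a minimal, dependency-free illustration on a grayscale image stored as nested lists, not the authors' implementation. The kernel size, the value of sigma, the edge-replication padding, and the brightness threshold used for cropping are all assumptions chosen for the example.

```python
import math

def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian kernel (size must be odd)."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def blur(image, kernel):
    """Convolve a grayscale image with the kernel, replicating edge pixels."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for kr, krow in enumerate(kernel):
                for kc, kval in enumerate(krow):
                    rr = min(max(r + kr - half, 0), h - 1)
                    cc = min(max(c + kc - half, 0), w - 1)
                    acc += kval * image[rr][cc]
            out[r][c] = acc
    return out

def subtract_local_average(image, size=5, sigma=1.0):
    """Step 1: subtract the Gaussian-blurred image (local average) from the image."""
    blurred = blur(image, gaussian_kernel(size, sigma))
    return [[p - b for p, b in zip(prow, brow)]
            for prow, brow in zip(image, blurred)]

def crop_uninformative(image, threshold=10):
    """Step 2: crop to the bounding box of pixels brighter than the threshold."""
    rows = [r for r, row in enumerate(image) if any(p > threshold for p in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > threshold for row in image)]
    if not rows or not cols:
        return image  # nothing above the threshold, leave the image unchanged
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]
```

In practice the same operations would be applied per color channel with an optimized image library; the nested-list version only makes the arithmetic explicit.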

Annotations
The location of the DR lesions in every sample is needed by the training procedure to detect the diseased area. In this work, we have used the LabelImg tool to generate the annotations of the retinal samples and have manually created a bbox for every sample. The dimensions of the bbox and the associated class for each object, i.e., xmin, ymin, xmax, ymax, width, and height, are stored in XML files. The XML files are used to generate a CSV file, from which the train.record file is created and later employed in the training procedure.
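The XML-to-CSV step can be sketched as below, assuming the annotations use the Pascal VOC layout that LabelImg writes by default. The column order, the sample file name, the class label, and the coordinates are hypothetical; the conversion of the CSV file into train.record is not shown.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column order for the CSV file.
FIELDS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]

def voc_xml_to_rows(xml_text):
    """Turn one Pascal VOC annotation into a list of CSV rows (one per bbox)."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):  # one <object> element per annotated bbox
        box = obj.find("bndbox")
        rows.append({
            "filename": filename,
            "width": width,
            "height": height,
            "class": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return rows

def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical annotation: file name, class label, and coordinates are made up.
SAMPLE_XML = """<annotation>
  <filename>retina_0001.jpeg</filename>
  <size><width>340</width><height>240</height><depth>3</depth></size>
  <object>
    <name>moderate</name>
    <bndbox><xmin>52</xmin><ymin>40</ymin><xmax>118</xmax><ymax>97</ymax></bndbox>
  </object>
</annotation>"""

print(rows_to_csv(voc_xml_to_rows(SAMPLE_XML)))
```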

Faster-RCNN
The Faster-RCNN [19] algorithm is an extension of existing approaches, i.e., R-CNN [21] and Fast-RCNN [20], which employed the Edge Boxes [41] technique to generate region proposals for possible object areas. However, Faster-RCNN differs from [21] in that it uses a Region Proposal Network (RPN) to create region proposals directly as part of the framework. In other words, Faster-RCNN uses the RPN as an alternative to the Edge Boxes algorithm. The computational cost of producing region proposals in Faster-RCNN is considerably lower than that of the Edge Boxes technique. Concisely, the RPN finalizes the ranking of anchor boxes, which indicates the anchor boxes most likely to contain regions of interest (ROIs). So, in Faster-RCNN, region proposal generation is quick and better attuned to the input samples. Faster-RCNN generates two types of output: (i) a classification that gives the class associated with each object and (ii) the coordinates of the bbox.
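The anchor ranking performed by the RPN can be illustrated with a small sketch. The three scales and three aspect ratios below are the defaults of the original Faster-RCNN design and are assumed here, since the anchor settings of this work are not stated. The objectness scores are hypothetical inputs that a trained RPN would normally predict.

```python
import math

def generate_anchors(center_x, center_y,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (xmin, ymin, xmax, ymax) anchors centered on one location."""
    anchors = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:  # ratio is height divided by width
            width = math.sqrt(area / ratio)
            height = width * ratio
            anchors.append((center_x - width / 2.0, center_y - height / 2.0,
                            center_x + width / 2.0, center_y + height / 2.0))
    return anchors

def top_k_anchors(anchors, scores, k):
    """Rank anchors by objectness score and keep the k most likely ones."""
    ranked = sorted(zip(scores, anchors), key=lambda pair: pair[0], reverse=True)
    return [anchor for _, anchor in ranked[:k]]
```

Each anchor keeps the area of its scale while changing shape with the ratio, so a single location is covered by nine candidate boxes before ranking.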

Custom Feature Faster-RCNN Builder
A CNN is a special type of NN that is developed to perceive, recognize, and detect visual attributes from 1D, 2D, or ND matrices. In the presented work, image pixels are passed as input to the CNN framework. We have employed DenseNet-65 as the feature extractor in the Faster-RCNN approach. DenseNet [42] is a recent CNN architecture in which each layer is connected to all preceding layers. DenseNet comprises a set of dense blocks that are sequentially linked, with extra convolutional and pooling layers between successive dense blocks. DenseNet can represent complex transformations, which to some degree mitigates the absence of the target's position information in the top-level key points. DenseNet minimizes the number of parameters, which makes it cost-efficient. Moreover, DenseNet assists the propagation of key points and encourages their reuse, which makes it more suitable for lesion classification. So, in this paper, we have used DenseNet-65 as the feature extractor for Faster-RCNN. The architectural description of DenseNet is given in Tab. 4, which lists the layers from which the key points are selected for further processing by Faster-RCNN. It also gives the size to which the query sample is readjusted before key points are computed from the allocated layer. The training parameters for the customized Faster-RCNN are shown in Tab. 3. The detailed flow of our presented approach is shown in Algorithm 1. The main process of lesion classification through Faster-RCNN can be divided into four steps. First, the input sample along with its annotation is given to DenseNet-65 to compute the feature map. Second, the calculated key points are used as input to the RPN component to obtain the feature information of the region proposals. Third, the ROI pooling layer produces the proposal feature maps by using the feature map calculated by the convolutional layers and the proposals from the RPN unit. Finally, the classifier unit gives the class associated with each lesion, while the bbox generated by the bbox regression gives the final location of the identified lesion.
The proposed method is assessed using the Intersection over Union (IOU), as described in Fig. 2a, where X denotes the ground truth rectangle and Y denotes the estimated rectangle containing DR lesions.
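The IOU of a ground truth rectangle X and an estimated rectangle Y is the area of their intersection divided by the area of their union. A minimal sketch follows, assuming boxes are given as (xmin, ymin, xmax, ymax) tuples. The function names are illustrative, and the default threshold of 0.5 is the value used in this work to decide whether a lesion is identified.

```python
def iou(box_x, box_y):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) rectangles."""
    x1, y1, x2, y2 = box_x
    u1, v1, u2, v2 = box_y
    inter_w = max(0.0, min(x2, u2) - max(x1, u1))
    inter_h = max(0.0, min(y2, v2) - max(y1, v1))
    intersection = inter_w * inter_h
    union = (x2 - x1) * (y2 - y1) + (u2 - u1) * (v2 - v1) - intersection
    return intersection / union if union > 0 else 0.0

def is_lesion_identified(ground_truth, estimate, threshold=0.5):
    """A detection counts as identified when its IOU exceeds the threshold."""
    return iou(ground_truth, estimate) > threshold
```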
A lesion is considered identified when the value of IOU is greater than 0.5, and not identified when the value is less than 0.5. The Average Precision (AP) is commonly employed to evaluate the precision of object detectors, e.g., R-CNN, SSD, and YOLO. The geometrical explanation of precision is shown in Fig. 2b. In our framework for the detection of DR lesions, AP depends on the idea of IOU.
DenseNet-65 differs from the traditional DenseNet in two main ways: (i) DenseNet-65 has fewer parameters than the original model, because its first convolution layer has 32 channels instead of 64 and a kernel size of 3 × 3 instead of 7 × 7; (ii) the number of layers within each dense block is adjusted to deal with the computational complexity. Tab. 4 describes the structure of the proposed DenseNet-65 model.
After multiple dense connections, the number of feature maps (FPs) rises significantly, so a transition layer (TL) is added to decrease the feature dimension from the preceding dense block. The structure of the TL is shown in Fig. 4: it comprises BN and a 1 × 1 ConvL (which decreases the number of channels by half), followed by a 2 × 2 average pooling layer that decreases the size of the FPs. In Fig. 4, t denotes the total number of channels and pool denotes average pooling.
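The effect of the transition layer on tensor shape can be sketched as follows. The sketch assumes a pooling stride of 2, which is not stated in the text, and it omits BN and the learned 1 × 1 convolution weights. Only the halving of the channel count and the 2 × 2 average pooling are shown, and the function names are illustrative.

```python
def transition_output_shape(channels, height, width):
    """Shape after the TL: the 1 x 1 conv halves the channels and
    the 2 x 2 average pooling (stride 2 assumed) halves each spatial side."""
    return channels // 2, height // 2, width // 2

def average_pool_2x2(feature_map):
    """Apply 2 x 2 average pooling with stride 2 to one 2D feature map."""
    h, w = len(feature_map), len(feature_map[0])
    return [[(feature_map[r][c] + feature_map[r][c + 1]
              + feature_map[r + 1][c] + feature_map[r + 1][c + 1]) / 4.0
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]
```

For instance, under these assumptions a block output with t channels of size 56 × 56 leaves the transition layer with t/2 channels of size 28 × 28.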

Detection Process
Faster-RCNN is a deep-learning-based technique that does not depend on methods such as selective search for proposal generation. Therefore, the input sample with its annotation is given as input to the network, which directly computes the bbox showing the lesion location and the associated class.

Dataset
In this work, we employ the DR image database provided by Kaggle. There are two sets of training images with a total of 88,704 images. A label.csv file is provided that contains the information regarding the severity level of DR. The samples in the database were collected using various cameras in multiple clinics over time. Sample images of the five classes from the Kaggle database are shown in Fig. 5.
In this part, we show the simulation results of ResNet, DenseNet-65, and EfficientNet-B5. The results are presented in terms of accuracy for DR image classification. Tab. 5 compares the 3 models used in this work for the classification of DR images in terms of trainable parameters, total parameters, loss, and model accuracy. As presented in Tab. 5, DenseNet-65 has a significantly smaller number of total parameters, whereas EfficientNet-B5 has the highest number of model parameters. This is because the DenseNet architecture does not rely solely on the power of very deep and wide networks; rather, it makes efficient reuse of model parameters, i.e., there is no need to compute redundant feature maps, which results in a significantly smaller number of total model parameters. For instance, the DenseNet architecture considered in this work is DenseNet-65, i.e., it is 65 layers deep. Similarly, the ResNet used in this work has 50 layers; however, its number of parameters is still significantly higher than that of DenseNet-65. Our analysis reveals that the classification performance of DenseNet-65 is higher than that of the other methods, as shown in Tab. 6. DenseNet-65 correctly classifies 95.6% of the images that represent human retinas suffering from DR. In contrast, the classification accuracy of ResNet and EfficientNet-B5 is 90.4% and 94.5%, respectively. Moreover, the techniques in [36] and [43] are computationally complex and may not perform well in the presence of bright regions, noise, or light variations in retinal images. Our method overcomes these problems by employing an efficient network for feature computation that can represent complex transformations, which makes it robust to such image artifacts.

| Model | Accuracy (%) |
|---|---|
| AlexNet [37] | 89.75 |
| VGG [37] | 95.6 |
| GoogleNet [37] | 93.36 |
| ResNet [37] | 90.40 |
| DenseNet-121 [37] | 92.39 |
| EfficientNet-B5 [43] | 94.5 |
| DenseNet-65 (proposed) | 97.2 |

Localization of DR Lesions Using Custom Faster-RCNN
For localization of DR signs, the diseased areas are treated as positive examples, while the remaining healthy parts are treated as negative examples. The overlap is categorized using an IOU threshold, which was set to 0.5: areas with an IOU below this score are considered background or negative, while areas with an IOU above 0.5 are classified as lesions. Fig. 6 shows the localization output of the custom Faster-RCNN on retinal samples along with the confidence value of each detection. The confidence values are higher than 0.89 and reach up to 0.99.
The results of the presented methodology are analyzed using the mean IOU and precision over all samples of the test database. Tab. 7 demonstrates that the introduced framework achieved a mean IOU of 0.969 and a precision of 0.974. Our method exhibits better results because of the precise localization of lesions by the custom Faster-RCNN based on DenseNet-65.

Stage Wise Performance
The stage-wise results of the introduced framework are analyzed through experiments. Faster-RCNN precisely localizes and classifies the DR lesions. The classification results of DR in terms of accuracy, precision, recall, F1-score, and error rate are presented in Tab. 8. The results show that the introduced methodology attained strong accuracy, precision, recall, and F1-score together with a low error rate. The presented technique attained average values of accuracy, precision, recall, F1-score, and error rate of 0.972, 0.974, 0.96, 0.966, and 0.034, respectively. The accurate key points computed by DenseNet-65, which represent each class distinctly, are the reason for the good classification. Moreover, a slight confusion between the No DR and Mild DR classes is found; however, both remain recognizable. Because of this efficient key point computation, our method shows state-of-the-art DR classification performance, which exhibits the robustness of the presented network. The confusion matrix is shown in Fig. 7.
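Metrics of this kind can be derived from a confusion matrix as sketched below. The sketch assumes rows are true classes and columns are predicted classes and uses macro averaging over classes. Whether the reported averages were computed this way is not stated, and the error rate is left out because its definition is not given. The 3-class matrix at the end uses made-up counts purely for illustration.

```python
def classification_metrics(confusion):
    """Accuracy plus macro-averaged precision, recall, and F1-score.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precisions, recalls, f1_scores = [], [], []
    for i in range(n):
        true_positive = confusion[i][i]
        predicted_i = sum(confusion[r][i] for r in range(n))  # column sum
        actual_i = sum(confusion[i])                          # row sum
        precision = true_positive / predicted_i if predicted_i else 0.0
        recall = true_positive / actual_i if actual_i else 0.0
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }

# Illustrative counts only, not taken from Fig. 7.
toy_confusion = [[8, 2, 0],
                 [1, 9, 0],
                 [0, 0, 10]]
print(classification_metrics(toy_confusion))
```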

Comparative Analysis
In the present work, we report results obtained by running the computer simulation 10 times. In each run, we randomly split the data with a ratio of 70% to 30% for training and testing, respectively. The average results in the form of performance evaluation metrics were then considered.
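This evaluation protocol can be sketched as follows. The `evaluate` callback is a hypothetical placeholder for training the model on the training indices and returning a metric measured on the test indices; the seed is an assumption added to make the runs repeatable.

```python
import random

def split_indices(n_samples, train_fraction=0.7, rng=None):
    """Randomly split sample indices into training and testing subsets."""
    if rng is None:
        rng = random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]

def repeated_evaluation(n_samples, evaluate, runs=10, seed=0):
    """Average a metric over several independent random 70/30 splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train_idx, test_idx = split_indices(n_samples, 0.7, rng)
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)
```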
In Tab. 9, we compare the proposed approach for DR classification with the methods presented in Xu et al. [32], Li et al. [33], Zhang et al. [36], Li et al. [40], Wu et al. [44], and Pratt et al. [45]. These techniques are capable of classifying DR from retinal images; however, they require intensive training and exhibit lower accuracy for training samples with the class imbalance problem. Our method achieved the highest average accuracy of 97.2%, which signifies the reliability of the introduced solution against the other approaches.
The proposed method achieved an average accuracy of 97.2%, while the comparative approaches attained an average accuracy of 84.735%; our technique therefore gives a performance gain of 12.46 percentage points. Furthermore, the presented approach can readily be run on CPU- or GPU-based systems, and the test time per sample is 0.9 s, which is faster than that of the other methods. Our analysis shows that the proposed technique can correctly classify the images.

| Method | Accuracy (%) |
|---|---|
| Xu et al. [32] | 94.50 |
| Li et al. [33] | 92.01 |
| Zhang et al. [36] | 81.00 |
| Li et al. [40] | 82.80 |
| Wu et al. [44] | 83.10 |
| Pratt et al. [45] | 75.00 |
| Proposed | 97.20 |
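The reported average and gain can be checked directly from the accuracies listed for the comparative methods. The short calculation below uses only those values; the variable names are illustrative.

```python
# Accuracies (%) of the comparative methods as listed in Tab. 9.
comparative = {
    "Xu et al. [32]": 94.50,
    "Li et al. [33]": 92.01,
    "Zhang et al. [36]": 81.00,
    "Li et al. [40]": 82.80,
    "Wu et al. [44]": 83.10,
    "Pratt et al. [45]": 75.00,
}
proposed = 97.20

# Mean accuracy of the six comparative methods.
mean_comparative = sum(comparative.values()) / len(comparative)
# Absolute gain of the proposed method, in percentage points.
gain = proposed - mean_comparative

print(f"mean of comparative methods: {mean_comparative:.3f}%")
print(f"gain of proposed method: {gain:.3f} percentage points")
```

The mean is 84.735% and the difference is 12.465 percentage points, which matches the reported gain of 12.46 as an absolute difference in accuracy.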

Cross-Dataset Validation
To further assess the presented approach, we perform cross-dataset validation: we trained our method on the Kaggle database and tested it on the APTOS-2019 dataset [46] provided by the "Asia Pacific Tele-Ophthalmology Society." The dataset contains 3662 retinal samples collected from several clinics under diverse image-capturing environments using fundus photography by Aravind Eye Hospital in India. This dataset consists of the same five classes as the Kaggle dataset.
We have plotted a box plot for the cross-dataset evaluation in Fig. 8, in which the training and testing accuracies are spread across the number line into quartiles, median, whiskers, and outliers. According to the figure, we attained an average accuracy of 98.1% for training and 97.5% for testing, which shows that our proposed work also performs well on unseen samples. Therefore, it can be concluded that the introduced framework is robust for DR localization and classification.

Conclusions
In this work, we introduced a novel approach to accurately identify the different levels of DR by using a custom Faster-RCNN framework and have presented an application for lesion classification as well. More precisely, we used DenseNet-65 to compute deep features from the given sample, on which Faster-RCNN is trained for DR recognition. The proposed approach can efficiently classify retinal images into five classes. Moreover, our method is robust to various artifacts, i.e., blurring, scale and rotational variations, intensity changes, and contrast variations. The reported results confirm that our technique outperforms the latest approaches. In the future, we plan to extend our technique to other eye-related diseases.