Fundus-DeepNet: Multi-Label Deep Learning Classification System for Enhanced Detection of Multiple Ocular Diseases through Data Fusion of Fundus Images

Detecting multiple ocular diseases in fundus images is crucial in ophthalmic diagnosis. This study introduces the Fundus-DeepNet system, an automated multi-label deep learning classification system designed to identify multiple ocular diseases by integrating feature representations from pairs of fundus images (e.g., left and right eyes). The study initiates with a comprehensive image pre-processing procedure, including circular border cropping, image resizing, contrast enhancement, noise removal


Introduction
A recent report from the World Health Organization (WHO) reveals that the global population of individuals experiencing visual impairment exceeds 2.2 billion. With early identification and treatment, at least 45% of these cases could be avoided [1]. The retina, a layer of light-sensitive tissue at the back of the human eye, transforms incoming light into neural signals processed by the visual cortex for object and scene identification. Examining the retinal tissue plays a vital role in assessing and maintaining an individual's overall health and wellness [2]. Hence, it is crucial to identify ocular diseases in their early stages and provide prompt treatment to prevent permanent vision loss. Ocular diseases affecting the retina can lead to blindness or visual impairment. The most frequent ocular diseases include diabetic retinopathy, age-related macular degeneration (AMD), glaucoma, cataracts, hypertension, and myopia [3]. A single fundus image may display signs of two or more ocular diseases. Furthermore, ocular diseases can be identified by observing anomalies around the optic nerve, veins, macula, optic disk, and other retinal structures. However, early symptoms of many ocular diseases are rarely visible [4].
For the early detection of ocular diseases, both optical coherence tomography (OCT) and fundus photography have proven to be effective imaging modalities [5]. OCT imaging produces cross-sectional images of retinal layers, while fundus imaging provides wide-field 2D images of the retina and its surrounding structures. Fundus photography is a non-invasive and cost-effective method for screening and identifying ocular disorders compared to the more expensive OCT imaging. In general, the utilization of digital retinal imaging has the potential to improve telemedical consultations, leading to increased accessibility for precise and timely sub-specialty treatments, particularly in underserved areas [6]. As a result, ophthalmological experts primarily employ fundus images to identify various ocular diseases. However, utilizing fundus images for diagnosing ocular diseases can be challenging for several reasons. Firstly, the manual examination of fundus images is a time-consuming and labour-intensive task, which complicates the process of reaching a definitive and precise diagnosis [3]. Many underdeveloped countries face a shortage of ophthalmologists who are capable of conducting manual assessments. Secondly, accurately detecting common ocular diseases, such as diabetic retinopathy, glaucoma, and AMD, in their early stages can be challenging due to limited initial visual indicators. Thirdly, despite the advantages of ocular fundus photography, obtaining a sufficient number of high-quality fundus images can be problematic, especially for less common fundus diseases. This difficulty primarily arises from the low contrast and the potential presence of features that resemble eye anatomy in the generated fundus images, making them hard to differentiate. Therefore, the development of an automated computer-aided diagnosis (CAD) system is of critical importance to alleviate the burden on ophthalmologists and offer a rapid and precise diagnosis based on fundus images.
Several CAD systems have been developed based on traditional handcrafted feature extraction approaches for ocular disease identification, which have numerous limitations and require a substantial amount of prior knowledge. For instance, Koh et al. [7] suggested a method for extracting distinguishing feature representations from fundus images using Speeded Up Robust Features (SURF) and Pyramid Histogram of Oriented Gradients (PHOG) descriptors. These features are combined using canonical correlation analysis, and class labels are assigned using a K-Nearest Neighbor (K-NN) classifier. Over the past several years, deep neural networks (DNNs) have made significant advancements in the field of computer vision, surpassing the capabilities of traditional handcrafted approaches [2][8][9]. The Convolutional Neural Network (CNN), a prominent type of DNN, has witnessed considerable progress and has made substantial contributions to the field of medical imaging, encompassing disease classification and object localization. Due to their capacity to automatically learn highly discriminative feature representations from images, CNNs have found extensive application in ocular disease diagnosis, specifically in tasks like glaucoma classification [10], retinal vessel segmentation [11], optic disc segmentation [12], and more. While CNN models have demonstrated proficiency in identifying fundus diseases, there remain certain limitations and further challenges to address. Firstly, most published studies have focused solely on a single ocular disease, as seen in [13][14][15]. Consequently, many current models yield favorable results for specific tasks but may not be adept at handling complex real-world scenarios (e.g., multiple fundus image classification). We assert that it is crucial to develop a more efficient and comprehensive fundus screening system capable of simultaneously identifying multiple ocular diseases. Secondly, there is a scarcity of datasets containing authentic fundus images with annotations for multiple ocular diseases. Thirdly, for the classification task, most publicly known CNN-based models examine fundus images from a single eye. However, in the majority of clinical cases, ophthalmologists diagnose patients by considering evidence from both eyes. Studies have indicated a close correlation in the progression of ophthalmic diseases between bilateral eyes [16]. This finding implies that using information from bilateral fundus images for diagnosing ophthalmic patients would be a more effective approach. In this scenario, several data fusion techniques may be used to combine the data obtained from the two fundus images, thereby enhancing the overall performance of the diagnostic system.
In this work, we introduce an automated and expedited multi-label deep learning classification system, named the Fundus-DeepNet system, designed to address the challenge of detecting multiple ocular diseases in fundus images. The Fundus-DeepNet system takes pairs of fundus images captured from the left and right eyes as input. It comprises three primary parts. Initially, an image pre-processing procedure is implemented, encompassing circular border cropping, image resizing, image contrast enhancement, noise removal, and data augmentation. This is followed by the extraction of discriminative deep feature representations from a pair of fundus images, achieved by feeding them into multiple deep learning blocks acting as feature descriptors: HRNet and Attention Block. Subsequently, these extracted feature representations are refined and integrated into a single feature representation by feeding them into the SENet Block. Finally, a trained DRBM, in conjunction with the Softmax layer, generates the probability distribution of eight distinct ocular diseases, serving as a non-linear classifier.
The primary contributions of this study can be summarized as follows:
1. To improve the contrast and minimize noise in fundus images, we propose an effective image enhancement algorithm based on the contrast-limited adaptive histogram equalization (CLAHE) method, coupled with a median filter. We argue that using pre-processed fundus image data for training deep learning models, as opposed to using raw data directly, can lead to substantial enhancements in their ability to learn more valuable feature representations. Furthermore, this approach can reduce the computational complexity involved in producing an optimally trained model.
2. Remarkably, this attention mechanism is implemented with minimal computational overhead. Using a DRBM trained as a non-linear classifier along with the Softmax layer, the system generates the probability distribution of eight different ocular diseases.
3. We demonstrate that the proposed Fundus-DeepNet system exhibits superior performance compared to existing state-of-the-art methods. This evaluation is conducted using the OIA-ODIR dataset, which is a publicly accessible dataset comprising a diverse collection of challenging fundus images encompassing eight distinct ocular diseases.
The structure of this paper is as follows: Section 2 provides a discussion of related research concerning the classification of multiple ocular diseases. In Section 3, we detail the fundus image dataset used and describe the proposed Fundus-DeepNet system. The experimental results are presented in Section 4. Finally, Section 5 concludes the work by summarizing the results and identifying potential avenues for future research.

Related Works
In this section, we provide a comprehensive review of the most advanced systems available for classifying multiple ocular diseases. We thoroughly investigate the limitations, highlight the main directions, and present proposed solutions within the context of our developed system.
Koh et al. [7] proposed an automatic retinal disease screening system to distinguish between normal and abnormal fundus images, encompassing glaucoma, AMD, and diabetic retinopathy (DR). Their approach involved extracting discriminative feature representations by employing SURF and PHOG descriptors. The features from these descriptors were combined using canonical correlation analysis, followed by classification using a K-NN classifier. On their dataset of 1,804 fundus images, the system achieved an accuracy of 96.21%, a sensitivity of 95%, and a specificity of 97.42%.
Islam et al. [17] developed a CNN model trained from scratch using pre-processed fundus images from the ODIR dataset, though it did not address the simultaneous classification of multiple eye diseases from pairs of fundus images. Their best results included an F-score of 85%, an AUC value of 80.5%, and a Kappa score of 31%. Luo et al. [18] combined the EfficientNet model with a mixture loss function (FC-loss) to automatically identify normal, AMD, glaucoma, and cataract cases. Their system's performance was evaluated for binary classification on the OIA-ODIR dataset, with the best results achieved in categorizing fundus images as either normal or affected by cataract disease. Gour and Khanna [2]  their models directly on raw fundus images, potentially restricting the ability of the adopted CNN model to generalize effectively. We demonstrate that, in comparison to using the adopted deep learning model directly with the raw fundus image, training it using the processed fundus image can significantly enhance the generalization power and decrease the computational cost of the model. A novel multiple fundus diagnostic system is developed and named the Fundus-DeepNet system to reliably identify various ocular diseases using coloured fundus pictures and overcome the prior limitations. To expand the training dataset and mitigate issues related to overfitting and data imbalance, we employ different augmentation strategies. Furthermore, we trained the proposed Fundus-DeepNet system using the previously processed fundus images rather than the raw images directly to reduce the generalization error and prevent overfitting problems. the SENet block, merges the features learned from both left and right fundus images. As described in [22], the element multiplication method is employed in all five places due to its simplicity and efficiency, especially when utilizing a CNN's deep backbone.

Dataset Description
The accuracy of the Fundus-DeepNet system has been assessed in this study through a series of extensive experiments carried out on the OIA-ODIR dataset [3]. This publicly available dataset, created by Shanggong Medical Technology Co., is the first global multiple ocular disease detection dataset relying on fundus images. It consists of 10,000 fundus images representing eight distinct ocular diseases, sourced from 5,000 patients. The dataset was divided into three subsets: a training set with 3,500 patients, an off-site test set with 500 patients, and an on-site test set with 1,000 patients. Deep networks are trained using the training set, and the off-site test set can serve as a validation set for model selection. The deep network's ability to generalize is evaluated on the on-site test set.
The OIA-ODIR dataset is a multi-class dataset that comprises eight categories for diagnosing abnormalities in the eyes, including normal case (N), diabetic retinopathy (D), glaucoma (G), cataracts (C), AMD (A), myopia (M), hypertension (H), and other abnormalities (O). Table 1 displays how the 5,000 patients were distributed among the training and testing sets. The process of annotating the dataset with ground truth took over 10 months, with experienced ophthalmologists carrying out the task. Negotiation was used to settle any disagreements until all annotators agreed [3]. Some samples from the OIA-ODIR dataset are shown in Figure 2.
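Because one patient may carry several of the eight annotations at once, a multi-label target is commonly stored as a multi-hot vector. The following is a minimal sketch of such an encoding, assuming the class order N, D, G, C, A, M, H, O listed above; the helper names `encode_labels` and `decode_labels` are hypothetical and are not taken from the paper or from the dataset's own tooling.

```python
import numpy as np

# Category order follows the list in the text: N, D, G, C, A, M, H, O.
CLASSES = ["N", "D", "G", "C", "A", "M", "H", "O"]


def encode_labels(findings):
    """Return an 8-element multi-hot vector for a collection of category codes."""
    unknown = set(findings) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown category codes: {sorted(unknown)}")
    return np.array([1 if c in findings else 0 for c in CLASSES], dtype=np.int64)


def decode_labels(vector):
    """Return the category codes whose entries are set in a multi-hot vector."""
    return [c for c, flag in zip(CLASSES, vector) if flag]
```

For example, a patient annotated with both diabetic retinopathy and cataract would be encoded by `encode_labels({"D", "C"})`, which sets the second and fourth entries of the vector.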

Image Pre-processing Stage
The fundus images in the OIA-ODIR dataset can have substantial differences in lighting conditions, image resolution, contrast, and color saturation. These variations can create difficulties for image analysis algorithms and may necessitate the application of pre-processing or normalization techniques to overcome these challenges [3][18]. The proposed image pre-processing stage consists of five key processes: circular border cropping, image resizing, image contrast enhancement, noise reduction, and data augmentation. The majority of fundus images in the OIA-ODIR dataset have black borders lacking the necessary information for detecting ocular diseases (see Figure 2). Furthermore, the fundus images within this dataset exhibit diverse image sizes due to being captured using various cameras. A circular border cropping process is applied to reduce the negative impact of the black borders. In this study, the auto-cropping procedure is implemented as follows:
1. Utilize the OpenCV library to transform the colored image into grayscale. In the grayscale image, white pixels take the value 255 and black pixels take the value 0.
2. Generate a mask for clipping that comprises values of 0 and 1. If a pixel's value is greater than the specified tolerance, the mask value is set to 1 (True). Conversely, if a pixel's value is equal to or below the tolerance, the mask value is set to 0 (False). Herein, the default tolerance value is 6.
3. Identify a rectangular region encompassing rows and columns containing pixel values of 1.
4. Extract the identified rectangular region from the image in RGB format.
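The four cropping steps above can be sketched as follows. This is an illustrative version only: it uses NumPy alone to stay self-contained, so the grayscale conversion relies on the common 0.299/0.587/0.114 luma weights as an assumption in place of the OpenCV call named in step 1, and the function name `crop_black_border` is hypothetical.

```python
import numpy as np


def crop_black_border(rgb, tol=6):
    """Crop an RGB image (H x W x 3, uint8) to the bounding box of its non-black pixels."""
    # Step 1: grayscale conversion. The luma weights are an assumption made here;
    # the paper performs this conversion with OpenCV.
    gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
    # Step 2: the mask is True where a pixel is brighter than the tolerance.
    mask = gray > tol
    if not mask.any():
        return rgb  # entirely dark image: there is nothing to crop against
    # Step 3: rows and columns that contain at least one True mask value.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    # Step 4: extract the enclosing rectangle from the original RGB image.
    return rgb[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

Returning the input unchanged for an entirely dark image is a design choice of this sketch, made to avoid producing an empty array.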
This step significantly decreases the computation requirements, eliminates extraneous data, and increases the effectiveness of subsequent analysis. The cropped images are resized to a standard size of (224×224) pixels, as supported by the majority of DNNs. Next, the input fundus image is converted into a grayscale image and an image enhancement algorithm is applied to improve the quality of the original images, since certain fundus images in the OIA-ODIR dataset have poor resolutions that make the feature representations difficult to distinguish. Herein, the CLAHE method [23] was applied to enhance the local contrast, and a median filter with a size of (3×3) pixels was applied to eliminate noise in the input fundus image (see Figure 3). Instead of processing the full image, CLAHE works with small regions of it called tiles. The artificial borders are then eliminated by combining the neighboring tiles using bilinear interpolation. Two crucial parameters, known as the clip limit

Backbone Network
The global feature maps are extracted using the backbone network from the input pair of fundus

Attention Block
The attention block leverages features extracted from the backbone network to produce distinct feature attention maps. As shown in Figure 1, 3×3 Depthwise Separable Convolution (DSConv) [25] and 1×1 Pointwise Convolution (PConv) [26] are used to transfer the feature maps and obtained from the backbone network into and . In neural networks, DSConv is substantially superior to traditional convolutions, improving representational efficiency while using fewer parameters and operating at a lower computational cost. The output of the PConv layer is passed through the Batch Normalization (BN) layer to avoid the issue of overfitting and accelerate model convergence. This is followed by applying the Rectified Linear Unit (ReLU) function to increase model non-linearity. The final feature representations and generated from the attention block can be acquired by combining the feature maps produced from both DSConv and PConv as follows: Here, denotes the concatenation procedure, and refers to the batch normalization process. The attention block is intended to emphasize relevant features while suppressing irrelevant or noisy regions when detecting lesion areas in fundus images. The task is applied as follows:
1. The input image is converted into a set of feature maps using the convolutional layers in the attention block. Different degrees of abstraction from the input image are captured in these maps.
2. The attention block produces attention maps that highlight various spatial regions inside the feature maps in terms of relevance. These maps are learned during the training process.
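The statement that DSConv uses fewer parameters than a traditional convolution can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored), and the kernel size and channel counts in the usage note are illustrative assumptions, not values from the paper.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution with c_in input and c_out output channels."""
    return k * k * c_in * c_out


def dsconv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution: one k x k filter per
    input channel, followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

With an assumed 3×3 kernel, 64 input channels, and 128 output channels, the standard convolution has 73,728 weights while the depthwise separable version has 8,768, roughly 12% of the original.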

SENet Block
A channel attention technique called SENet [18] can be trained to emphasize crucial feature representations and suppress unhelpful ones, thus enhancing network performance. Consequently, SENet has been integrated into the proposed Fundus-DeepNet system. The main structure of the utilized SENet block is illustrated in Figure 5. Squeeze and excitation operations are employed to provide global information to a SENet block before it undergoes the subsequent transformation. During the squeeze operation, global feature maps (1×1×D) are generated by employing Global Average Pooling (GAP). This process condenses the global spatial information into a channel descriptor. In the excitation operation, the data gathered during the squeeze operation is passed through the Fully-Connected (FC) layer, Dropout, ReLU, FC, Dropout, and ReLU layers. Initially, we incorporate an FC layer with a dropout ratio of 0.3 to reduce the complexity of the intricate co-adaptation of units in the FC layer by avoiding the emergence of interdependencies between them. After enhancing the network's non-linearity using the ReLU function, an FC layer, also known as a dimensionality-increasing layer, is utilized to restore the channel dimension. To augment the network's non-linear capabilities and reduce the computational load, dimensionality reduction is followed by dimensionality expansion during the excitation operation. This approach also allows the network to effectively capture channel-wise dependencies. The final output feature representation is obtained by multiplying the input feature maps generated from and with the output of the last ReLU function. Finally, as in Equation (3), the resulting feature representations and are fused using the element-wise multiplication method to obtain the output from the SENet block.
Before passing the obtained feature representations into the adopted classifier, a (1×1) PConv and GAP are applied to extract more discriminative feature representations and reduce the computational complexity of the network.
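The squeeze, excitation, and fusion steps described in this section can be sketched in NumPy as follows. This is a simplified inference-time illustration, not the authors' implementation: dropout is omitted because it is inactive at inference, and the function names, weight shapes, and reduced dimension are assumptions introduced here.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def se_block(feat, w1, b1, w2, b2):
    """Reweight the channels of a feature map of shape (H, W, D).

    w1 has shape (D, R) and w2 has shape (R, D), where R < D is an
    assumed reduced dimension.
    """
    # Squeeze: global average pooling yields one descriptor per channel, shape (D,).
    z = feat.mean(axis=(0, 1))
    # Excitation: FC (reduce) -> ReLU -> FC (restore) -> ReLU, following the
    # layer order given in the text.
    s = relu(z @ w1 + b1)
    s = relu(s @ w2 + b2)
    # Scale: multiply each input channel by its excitation weight.
    return feat * s


def fuse(left, right):
    """Fuse the two reweighted feature maps by element-wise multiplication."""
    return left * right
```

A call such as `fuse(se_block(f_left, ...), se_block(f_right, ...))` would then produce a single fused representation from the two eyes; whether the two branches share excitation weights is not stated in the text and is left open here.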

Discriminative Restricted Boltzmann Machines
To effectively represent the joint distribution of inputs and target classes in the task of detecting multiple ocular diseases, a non-linear classifier is trained using a single Restricted Boltzmann Machine (RBM) with two sets of visible units [27][28]. In addition to the visible units used to represent input feature vectors, it is employed in conjunction with the Softmax layer [29]. This combination is utilized to generate the probability distribution of eight different ocular diseases [30]. During the training of the DRBM, the stochastic gradient descent method was implemented to maximize the log-likelihood of the training data. Thus, the following weight-updating rules might be implemented: Herein, denotes the learning rate, 〈.〉 and 〈.〉 denote the positive and negative phases, respectively. Lastly, denotes the bias term for the visible units, while represents the bias term for the hidden units. As is well known, it is difficult to compute the negative-phase term 〈.〉 in Equation (4). Thus, to calculate the second term in Equation (4), the Contrastive Divergence (CD) method [31] was employed to adjust the parameters of a particular RBM by implementing k steps of Gibbs sampling from the probability distribution. The following steps explain how to implement the CD algorithm for a single sample:
1. The visible units (v) are initially provided with the training data to calculate the probability of the hidden units. The same probability distribution is then sampled to create a hidden activation vector (h).
2. The outer product of (v) and (h) is computed in the positive phase.
3. The visible units (v′) are reconstructed by sampling from (h) using the conditional probability of the visible units given the hidden units. Then, the hidden unit activations (h′) are resampled from (v′) in a single Gibbs sampling step.
4. The outer product of (v′) and (h′) is calculated in the negative phase.
For many applications, k is often set to 1 in the CD learning algorithm. For weight initialization in this study, small random values were employed, drawn from a normal distribution with a mean of zero and a standard deviation of 0.02. In the positive phase, the weights and the visible units were used to compute the probabilities that determine the binary states of the hidden units. This phase is known as the positive phase because it increases the probability of the training data, whereas the negative phase decreases the probability of the samples generated by the model. A whole positive-negative phase is treated as one training epoch, and the difference between the model's generated samples and the actual data vector is calculated at the end of each epoch. To update all of the weights, the derivative of the probability of the visible units with respect to the weights, which represents the expectation of the difference between the positive and negative phase contributions, was taken.
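The CD procedure with k = 1 can be illustrated with a short NumPy sketch. It covers a plain binary RBM only, without the additional class units of the discriminative variant, so it should be read as an illustration of the sampling and update pattern and not as the authors' DRBM. The symbols `W`, `b` (visible bias), `c` (hidden bias), and `lr`, as well as the function name `cd1_step`, are assumptions introduced here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_step(v, W, b, c, lr, rng):
    """One CD-1 update for a binary RBM on a single visible vector v.

    W: (n_visible, n_hidden) weights, b: visible bias, c: hidden bias.
    """
    # Step 1: hidden probabilities given the data, then a sampled hidden vector h.
    ph = sigmoid(v @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Step 2: positive phase, the outer product of v and h.
    pos = np.outer(v, h)
    # Step 3: reconstruct v' from h, then resample h' from v' (one Gibbs step).
    pv = sigmoid(h @ W.T + b)
    v_rec = (rng.random(pv.shape) < pv).astype(float)
    ph_rec = sigmoid(v_rec @ W + c)
    h_rec = (rng.random(ph_rec.shape) < ph_rec).astype(float)
    # Step 4: negative phase, the outer product of v' and h'.
    neg = np.outer(v_rec, h_rec)
    # Update: learning rate times (positive phase minus negative phase).
    W_new = W + lr * (pos - neg)
    b_new = b + lr * (v - v_rec)
    c_new = c + lr * (h - h_rec)
    return W_new, b_new, c_new
```

In line with the initialization described above, the weights could be drawn with `np.random.default_rng(0).normal(0.0, 0.02, size=(n_visible, n_hidden))`, where the seed and layer sizes are arbitrary choices.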

Experimental Results
This section provides comprehensive details regarding the implementation of all conducted experiments. It also describes the performance assessment metrics used for the proposed Fundus-DeepNet system. Finally, an evaluation of the reliability and efficiency of the Fundus-DeepNet system is conducted, comparing it to other well-established, cutting-edge systems.

Implementation Details
The code for the proposed Fundus-DeepNet system is written in Python. The development environment comprises a Google Colab server equipped with a 69K GPU graphics card, 16 GB of memory, operating on a 64-bit Windows 10 system, and an Intel(R) Core(TM) i7-43450U CPU. All deep learning models are developed using the TensorFlow framework. To ensure consistency, the original images are resized to a resolution of (224×224) pixels, which is widely accepted by many DNNs as the standard image dimension. This standardization is necessary due to the diverse sources of the OIA-ODIR dataset, collected from various hospitals with different cameras. In this study, the ratio of the training set to the testing set is 7:3 (i.e., 7,000 images are used in the training set, while the remaining 3,000 images are allocated to the testing set). The training set is further divided into two subsets: a training set comprising 80% (5,950 samples) of the original training set from the OIA-ODIR dataset, and a validation set containing the remaining 20% (1,050 samples). For training all employed DNNs, the Adam optimizer is employed. The initial learning rate is set at 0.001, with a fixed batch size of 32, a weight decay of 0.0005, a momentum of 0.9, and a dropout ratio of 0.5. However, we found that a learning rate of 0.001 was inefficient for the DRBM, which took excessively long to converge at this low value. Consequently, the learning rate was adjusted to 0.01 exclusively for the DRBM in all subsequent experiments. Moreover, the early stopping technique was employed to determine the appropriate number of training epochs for all the employed DNNs. This method stops the training process when the classification error on the validation set begins to increase once more. In this study, a consistent number of 100 epochs is employed across all the conducted experiments.
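The early stopping rule described above can be sketched as a small function over a list of per-epoch validation errors. This is a minimal illustration, assuming a hypothetical `patience` parameter (the number of consecutive non-improving epochs tolerated before stopping); the text does not state which patience value, if any, was used.

```python
def early_stopping_epoch(val_errors, patience=1):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation error has failed to improve on its
    best value for `patience` consecutive epochs; if that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best = err
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_errors) - 1
```

With the 100-epoch cap mentioned above, `val_errors` would hold at most 100 entries, so the function returns either the stopping epoch or the final one.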

Evaluation Metrics
To evaluate the effectiveness of the proposed Fundus-DeepNet system on the OIA-ODIR dataset, we computed six quantitative performance metrics: Accuracy Rate (AR), Precision, Recall, F1-score, Kappa score, and Area Under Curve (AUC). In classification problems, AR, which measures the percentage of properly classified samples, is the fundamental evaluation metric. Precision represents the likelihood that a particular sample is correctly identified as positive among all samples predicted as positive. Recall describes the percentage of accurately diagnosed fundus diseases among all actual fundus diseases in the sample. The F1-score, a metric combining precision and recall, reaches higher values when both rates are high. The

Performance Evaluation of Different CNN Backbones
To address the identification of multiple ocular diseases, we employed a transfer learning strategy. This involved using a pre-trained CNN model as an initial step, which proved to be more effective and straightforward than training the CNN model from scratch. Hyper-parameters were fine-tuned to improve the performance of the pre-trained model. Various CNN models pre-trained on the ImageNet dataset were evaluated in terms of their capacity to serve as a CNN backbone for classifying fundus images into eight distinct classes. To arrive at a final decision, we employed a fusion method based on element-wise multiplication to combine the outcomes obtained from both the right and left fundus images. Additionally, we evaluated the effectiveness of the suggested image enhancement procedure in improving the capacity of the utilized CNN model to acquire more distinct feature representations. The classification results of different CNN backbones, with and without the proposed image enhancement procedure, using both off-site and on-site test sets, are presented in Tables 2 and 3, respectively. From these tables, it was observed that the performance of all the employed CNN models significantly improved, achieving nearly 4% to 9% higher results compared to those obtained without applying the proposed image enhancement procedure across all calculated evaluation metrics. This improvement arises from training the DNNs using pre-processed images, which facilitates the learning of more discriminative lesion characteristics in fundus images and prevents bias toward classes with a high number of images during the training process. Furthermore, when the ResNet model is employed as a backbone, the results improve as the network depth increases. For instance, replacing the ResNet18 model with the ResNet101 model, along with the proposed image enhancement procedure using the off-site test set, led to an increase of 7.01%, 6.3%, 7.17%, 6.74%, 6.05%, and 8.2% for AR, Precision, Recall, F1-score, Kappa score, and AUC, respectively. In the on-site test set, the results increased by 8.81%, 7.02%, 9.49%, 8.28%, 9.54%, and 8.4% for AR, Precision, Recall, F1-score, Kappa score, and AUC, respectively. Finally, comparable performance was observed for ResNet101 and the HRNet model on the on-site test set. However, the HRNet model outperformed all employed pre-trained models, achieving AR, Precision, Recall, F1-score, Kappa score, and AUC of 74.62%, 75.23%, 76.73%, 75.97%, 77.62%, and 77.67%, respectively. Hence, we opted to employ the HRNet model as the CNN backbone in the proposed Fundus-DeepNet system.

Ablation Experiments
In this section, we conduct ablation tests to provide a more comprehensive illustration of the influence of each block in the proposed Fundus-DeepNet system. The results of these tests are presented in architecture. This results in the reweighting of these features and retrieval of more valuable information. Secondly, the DSConv, PConv, and residual connections have played a significant role in extracting more discriminative feature representations and avoiding network degradation during the learning process.

Comparison Study
The effectiveness of the proposed Fundus-DeepNet system has been assessed by comparing its performance with that of the current cutting-edge methods for classifying multiple ocular diseases. The comparison results are presented in Table 5. To ensure a fair comparison, the performance of the developed Fundus-DeepNet system was evaluated and compared with other systems in terms of metrics such as the F1-score, Kappa score, AUC, and final score, which represents their average value. Based on the results presented in Table 5, it can be observed that while BFENet [32] has achieved a slightly higher F1-score of 89.2% compared to the proposed system in the off-site test set, it has performed less effectively in terms of all other metrics on both the off-site and on-site test sets. For instance, the proposed system has increased the Kappa score, AUC, and final score in the off-site test set by 35.42%, 8.56%, and 14.44%, respectively. In the on-site test set, the proposed system has managed to increase the F1-score, Kappa score, AUC, and final score by 0.53%, 37.68%, 9.56%, and 15.93%, respectively. The proposed Fundus-DeepNet system outperforms other existing methods in detecting multiple ocular diseases by employing an attention mechanism that enhances feature information by incorporating both local and global feature representations within a two-stream interactive architecture. Additionally, we enrich the extracted feature representations by obtaining multi-scale features. Although the results obtained using the proposed Fundus-DeepNet system are encouraging in accurately detecting multiple ocular diseases from a pair of fundus images, there are some limitations and challenges that need to be addressed to enhance its accuracy. One of the main obstacles that might limit the effectiveness of the proposed system is the lack of available data. As is well known, an adequate amount of data is required for efficient DNN training. Finding labeled fundus image datasets that are sufficiently vast and diverse might be difficult in the field of ocular disease detection. Due to the lack of data, models may perform poorly and become overfit. Additionally, an unequal distribution of data among different ocular disease categories can have a notable impact on the overall effectiveness of the proposed system. This discrepancy can lead to biased model predictions, as the proposed system might encounter challenges in understanding the features of less common diseases. Nevertheless, this problem was addressed by implementing a comprehensive data augmentation process on training instances associated with less common diseases. Moreover, we trained the Fundus-DeepNet system using the pre-processed fundus images instead of the original images directly. This approach aimed to decrease generalization errors and mitigate issues related to overfitting. Some instances of samples that were correctly and incorrectly diagnosed in specific types of ocular diseases are shown in Figure 6.
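Since the final score is defined above as the average of the F1-score, Kappa score, and AUC, it can be reproduced with a one-line computation. The sketch below is illustrative, and the function name `final_score` is hypothetical.

```python
def final_score(f1, kappa, auc):
    """Average of the F1-score, Kappa score, and AUC (all in percent)."""
    return (f1 + kappa + auc) / 3.0


# Off-site and on-site values as reported in the Conclusions section.
off_site = round(final_score(88.56, 88.92, 99.76), 2)
on_site = round(final_score(89.13, 88.98, 99.86), 2)
```

Applied to the values reported in the Conclusions section, this gives 92.41 for the off-site test set and 92.66 for the on-site test set, which matches the final scores stated there.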

Conclusions and Future Work
We have addressed the challenge of identifying multiple ocular diseases in fundus images by developing the Fundus-DeepNet system, an efficient automated deep learning classification system capable of handling multiple labels rapidly. The system operates on pairs of fundus images from both eyes and comprises three core components: image pre-processing, deep feature extraction, and disease classification. In experiments on the challenging OIA-ODIR dataset, which covers a wide range of fundus images representing eight distinct ocular diseases, the proposed Fundus-DeepNet system performed strongly in comparison to the most advanced existing systems for classifying multiple ocular diseases. The proposed image enhancement procedure improved the performance of all the CNN models evaluated, by approximately 4% to 9% across all the calculated evaluation metrics relative to the results obtained without it. On the off-site test set, the system achieved an F1-score, Kappa score, AUC, and final score of 88.56%, 88.92%, 99.76%, and 92.41%, respectively. Similarly, on the on-site test set, it attained an F1-score, Kappa score, AUC, and final score of 89.13%, 88.98%, 99.86%, and 92.66%, respectively. These findings underscore the effectiveness of the Fundus-DeepNet system in accurately detecting multiple ocular diseases, offering significant potential for facilitating early diagnosis and treatment in ophthalmology. The system's capability to analyze pairs of fundus images from both eyes contributes to a more comprehensive assessment and thus to more accurate disease detection.
Several directions remain for future work. Firstly, expanding the dataset with more diverse and representative fundus images of ocular diseases would enhance the system's generalization and robustness. Moreover, the accuracy of the Fundus-DeepNet system could be further increased by experimenting with other deep learning architectures, exploring new attention mechanisms, and optimizing hyper-parameters. Finally, the Fundus-DeepNet system could be deployed in actual clinical settings, and prospective studies could be carried out to confirm its efficacy and usability under real-world conditions, opening the path for its incorporation into clinical practice.

2. We introduce an efficient deep feature extraction framework to derive discriminative deep feature representations from a pair of fundus images. This framework consists of four major parts. The backbone CNN extracts the global features from both left and right fundus images. Additionally, we assess the effectiveness of several CNN model backbones for the task of multiple ocular disease detection. The attention block learns additional high-level feature representations to differentiate lesion portions using the output of the backbone network. The SENet block effectively incorporates channel-wise attention for feature refinement and fusion, utilizing discriminative feature maps obtained from the previous step.
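To make the idea of channel-wise attention concrete, the following is a minimal pure-Python sketch of a generic squeeze-and-excitation (SE) block: global average pooling per channel, two small fully connected layers, and a sigmoid gate that rescales each channel. It is not the authors' implementation; the layer sizes, the weights, and the function names (`squeeze`, `dense`, `se_block`) are hypothetical and chosen only for illustration.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one scalar per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def dense(vec, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [
        sum(w * v for w, v in zip(row, vec)) + b
        for row, b in zip(weights, bias)
    ]


def se_block(feature_map, w1, b1, w2, b2):
    """Rescale each channel by a learned gate in (0, 1)."""
    z = squeeze(feature_map)                              # squeeze step
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]      # ReLU bottleneck
    gates = [1.0 / (1.0 + math.exp(-g))                   # sigmoid gate
             for g in dense(hidden, w2, b2)]
    return [
        [[gate * px for px in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]


# Hypothetical 2-channel, 2x2 feature map and all-zero weights (illustration only).
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[2.0, 2.0], [2.0, 2.0]]]
w1, b1 = [[0.0, 0.0]], [0.0]          # 2 channels -> 1 hidden unit
w2, b2 = [[0.0], [0.0]], [0.0, 0.0]   # 1 hidden unit -> 2 channel gates
scaled = se_block(fmap, w1, b1, w2, b2)
```

With all-zero weights every gate equals sigmoid(0) = 0.5, so each channel is simply halved; trained weights would instead emphasize the more informative channels.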
suggested employing a transfer learning technique to fine-tune pre-trained CNN models (InceptionV3, VGG16, ResNet, and MobileNet) for the classification of different ocular diseases, achieving the best results with the VGG16 model and the SGD optimizer. They attained an F1-score of 84.93% and an AUC of 85.57% on the ODIR dataset. Yang and Yi [19] developed a three-part deep learning model for the automatic identification of multiple ocular diseases. The first part applied a simple image pre-processing algorithm to discard unwanted information and artificially enlarge the fundus image dataset. In the second part, they employed a feature extraction network, DSRA-CNN, based on the Xception architecture. This network integrated three functional blocks: the DS block, the DSR block, and the SE block. Finally, a Softmax classifier was built on the extracted features to classify eight distinct fundus diseases. The developed model, assessed on the ODIR dataset, achieved an accuracy rate of 87.90%, a precision of 88.50%, an F1-score of 88.16%, and a kappa score of 86.17%. Ouda et al. [20] developed a shallow multi-label CNN (ML-CNN) model and trained it from scratch on the RFMiD dataset to classify fundus images into multiple ocular diseases. The ML-CNN model comprised three phases: pre-processing, modeling, and prediction. The pre-processing phase employed various transformation techniques for normalization and data augmentation, including rotation, contrast, brightness, and saturation adjustments, horizontal flipping, and vertical flipping. Experimental results, obtained through cross-validation, demonstrated the effectiveness of the ML-CNN model. It achieved an accuracy rate of 94.3%, a recall of 80%, a precision of 91.5%, a Dice similarity coefficient (DSC) of 99%, and an AUC of 96.7%. Deng and Ding [21] introduced a CNN model named EB-IRV2, which fuses features from the EfficientNet-B2 and InceptionResNetV2 models with patient information such as age and gender. This integration aims to enhance the accuracy of classifying multiple ocular diseases. On the ODIR dataset, the EB-IRV2 model achieved an accuracy rate of 96.00%, an F1-score of 94.11%, and a recall of 92.37%. The studies mentioned above have demonstrated the potential and efficacy of deep learning models for identifying multiple ocular diseases through the analysis of fundus images. However, some limitations need to be addressed, including: (i) The model's performance declines as the number of classes increases, especially when there are insufficient training examples and unavoidable image noise. (ii) Due to unbalanced and/or inadequate datasets, certain systems exhibit a conservative nature that prevents their deployment in real-world scenarios. (iii) Most CNN-based studies on classifying ocular diseases commonly train

Figure 1 illustrates the block diagram of the Fundus-DeepNet system proposed to address the challenge of detecting multiple ocular diseases in fundus images. This study aims to classify fundus images into eight different types of ocular diseases. The Fundus-DeepNet system is composed of three main parts. Firstly, the input fundus image undergoes pre-processing to improve image contrast, reduce noise, and improve the learning capacity of the employed deep learning models. Secondly, discriminative deep feature representations are extracted from a pair of fundus images using an effective deep feature extraction and fusion framework. Thirdly, a nonlinear classifier based on a DRBM associated with the Softmax layer generates the probability distribution over the eight ocular diseases, from which the class label is obtained. In the proposed Fundus-DeepNet system, the fusion process is implemented five times in two different places. Two of these fusion processes are implemented in the attention block, and the remaining ones in the SENet block. The first four fusion processes fuse feature representations learned from the same fundus image. The final fusion process, implemented in

Figure 1: The diagram illustrating the structure of the proposed Fundus-DeepNet system.

Figure 2: Some examples of fundus images in the OIA-ODIR dataset.

(CL) and block size (BS) are set in the CLAHE method to regulate the quality of the enhanced image. When the input image has a low intensity level, a higher value of the CL parameter increases the image's brightness. Increasing the value of the BS parameter, on the other hand, improves the contrast level and extends the range of intensity values present in the image. In this study, CL and BS were set to 2 and 8×8, respectively. Since the classes contain unequal numbers of images, classes with more images receive greater training weights than classes with fewer images, which biases the classification results in their direction. For instance, the training dataset contains 1,135 normal images, which show no signs of ocular disease, whereas some conditions, such as hypertension, have fewer than 100 fundus images. To address class imbalance and reduce overfitting, several data augmentation techniques were applied: rotation, horizontal flipping, vertical flipping, saturation change, hue change, and brightness change.
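The balancing idea can be sketched in a few lines of pure Python. The example below is an illustration under stated assumptions, not the authors' pipeline: images are small grayscale grids stored as nested lists, only three of the listed transformations are shown, and the hypertension count of 90 is a hypothetical placeholder (the text only says "fewer than 100"). The helper names are likewise illustrative.

```python
import math


def hflip(img):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in img]


def vflip(img):
    """Vertical flip: reverse the row order."""
    return img[::-1]


def adjust_brightness(img, factor):
    """Scale pixel intensities and clip them to the 8-bit range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def extra_copies_per_image(class_counts):
    """Augmented copies needed per image so each class reaches the largest class."""
    target = max(class_counts.values())
    return {name: math.ceil(target / n) - 1 for name, n in class_counts.items()}


# 1,135 normal images is from the text; 90 is a hypothetical placeholder.
counts = {"normal": 1135, "hypertension": 90}
plan = extra_copies_per_image(counts)

tiny = [[10, 200], [30, 40]]  # toy 2x2 grayscale image
augmented = [hflip(tiny), vflip(tiny), adjust_brightness(tiny, 1.5)]
```

Under these placeholder counts, each hypertension image would need 12 augmented copies to match the normal class, while normal images need none.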

Figure 3: The results of the suggested image enhancement procedure: (a) Original fundus image, (b) Cropped image, (c) Applying CLAHE, and (d) Applying median filter.
images. The backbone network can be any CNN-based deep learning model pre-trained on the ImageNet dataset; it extracts two sets of features, one from each image in the input pair of fundus images. The left-eye and right-eye fundus images each have size H × W × 3, where H and W stand for the height and width of the given fundus photographs and 3 denotes their three color channels. The backbone CNN module produces one output feature map per eye, each of size H × W × C, with H, W, and C referring to the height, width, and number of channels of the extracted features. The performance of different backbone networks has been assessed, as explained in the experimental results section. We choose HRNet, shown in Figure 4, as the backbone CNN module since it produces the best results on the adopted dataset. HRNet begins with a high-resolution sub-network and, as more branches are added, progressively introduces high-resolution to low-resolution sub-networks one at a time [24]. These multi-resolution sub-networks are connected in parallel, and multiple multi-scale fusions transfer information between the parallel sub-networks to improve the high-resolution representations. HRNet thus maintains high-resolution features throughout, with 4 stages corresponding to 4 branches and 4 resolutions. Following the input step, two stride-2 3×3 convolution layers increase the width (number of convolution layer channels) to 64 and lower the resolution to 1/4. The channel number C was chosen to be 32, which corresponds to HRNet-W32 (where W stands for width); the four branches have C, 2C, 4C, and 8C channels, respectively, and resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32. In this study, the final four output features are merged to construct multi-scale feature maps, which are then used as input for the attention block.
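The branch widths and resolutions described above can be tabulated with a short sketch. Two points are assumptions not stated in this section: the 512 × 512 input size is a hypothetical example, and the merge is modeled as upsampling every branch to the highest resolution followed by channel concatenation, which is one common way to combine HRNet outputs and may differ from the authors' choice. The function names are illustrative.

```python
def hrnet_branch_shapes(height, width, base_channels=32, num_branches=4):
    """Return (channels, height, width) for each HRNet branch.

    Branch i has base_channels * 2**i channels at 1 / (4 * 2**i) of the input resolution.
    """
    shapes = []
    for i in range(num_branches):
        stride = 4 * 2 ** i
        shapes.append((base_channels * 2 ** i, height // stride, width // stride))
    return shapes


def merged_shape(shapes):
    """Shape after upsampling all branches to the first branch's size and concatenating.

    Assumption: concatenation-based merging; the paper only says the features are merged.
    """
    channels = sum(c for c, _, _ in shapes)
    _, h, w = shapes[0]
    return (channels, h, w)


# Hypothetical 512 x 512 input, HRNet-W32 (C = 32).
branches = hrnet_branch_shapes(512, 512)
for c, h, w in branches:
    print(f"{c:>3} channels at {h} x {w}")
print("merged:", merged_shape(branches))
```

Under these assumptions the four branches carry 32, 64, 128, and 256 channels, so a concatenated multi-scale map would have 480 channels at one quarter of the input resolution.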

Figure 4: The main architecture of HRNet. The feature maps are represented by rectangular blocks, and Stem refers to the down-sampling process.

Figure 5: The main structure of the adopted SENet block.
kappa score evaluates the level of agreement between classified results and corresponding ground truth labels. The Area Under the ROC Curve (AUC) measures the model's classification accuracy, improving as it approaches 1. It is often used to assess the model's stability. The formulas for these six quantitative performance metrics are expressed in terms of TP, TN, FP, and FN, which represent True Positives, True Negatives, False Positives, and False Negatives, respectively.
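For reference, the count-based metrics can be sketched from the four confusion-matrix counts as follows. These are the standard textbook definitions for a single binary label (accuracy, precision, recall, F1-score, and Cohen's kappa); the paper's own formulas, which may aggregate over classes, are not recoverable from this section, so the code is an assumption rather than the authors' exact computation. The counts in the example are made up for illustration.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard metrics for one binary label from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

AUC is not included because it depends on the ranking of predicted probabilities rather than on a single set of counts.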

Figure 6: Samples that were correctly and incorrectly diagnosed in specific types of ocular diseases using the proposed Fundus-DeepNet system.

Table 1: Patient case distribution within each class in the training and test sets.

Table 2: Classification results of different CNN backbones with and without the proposed image enhancement procedure using the off-site test set.

Table 3: Classification results of different CNN backbones with and without the proposed image enhancement procedure using the on-site test set.

Table 4 .
Model 1 corresponds to the HRNet model, which is trained on pre-processed fundus images and utilized as the CNN backbone in the proposed Fundus-DeepNet system. This model attained accuracy rates of 74.62% and 71.14% on the off-site and on-site test sets, respectively. Model 2 is Model 1 with the addition of the attention block. The accuracy of Model 2 increased by around 4.5% and 12.19% on the off-site and on-site test sets, respectively, while the number of parameters is only about 14.64% of Model 1. Model 3 adds the SENet block on top of Model 2, resulting in slight improvements in all calculated metrics due to the improved capacity to recognize fundus image lesions. The proposed Fundus-DeepNet system, which combines the attention block, SENet block, 1×1 PConv, and GAP, is referred to as Model 4. The DRBM classifier is then used to generate the probability distribution for eight distinct ocular diseases from the extracted feature representations. On the off-site test set, this model has achieved AR, Precision, Recall, F1-score,

Table 4: The ablation experiments of the proposed Fundus-DeepNet system.

Table 5: The comparison between the proposed Fundus-DeepNet system and other existing systems on the OIA-ODIR dataset.