Retinal diseases classification based on hybrid ensemble deep learning and optical coherence tomography images

Optical coherence tomography (OCT) is a noninvasive, high-resolution imaging technique widely used in clinical practice to depict the structure of the retina. Over the past few decades, ophthalmologists have used OCT to diagnose, monitor, and treat retinal diseases. However, manual analysis of the complicated retinal layers in two colors, black and white, is time consuming, and even experienced ophthalmologists may be prone to erroneous diagnoses. Therefore, in this study, we propose an automatic method for diagnosing five retinal diseases based on hybrid and ensemble deep learning (DL) methods. DL extracts one thousand convolutional features from the images for training the classifiers. Machine learning methods classify the extracted features, and the outputs of two classifiers are fused to improve classification performance: the distribution probabilities of the two classifiers for the same class are aggregated, and the class with the highest probability is predicted. The limitation of a small dataset is addressed by fine-tuning classification knowledge and generating augmented images using transfer learning and data augmentation. Multiple DL models and machine learning classifiers are compared to identify a suitable model and classifier for the OCT images. The proposed method is trained and evaluated using OCT images collected from a hospital and exhibits a classification accuracy of 97.68% (InceptionResNetV2 with an ensemble of extreme gradient boosting (XGBoost) and k-nearest neighbor (k-NN)). The experimental results show that the proposed method can improve OCT classification performance; moreover, in the case of a limited dataset, the proposed method is critical for developing accurate classifiers.


Introduction
Common macular and vascular diseases include age-related macular degeneration (ARMD), diabetic macular edema (DME), branch retinal vein occlusion (BRVO), central retinal vein occlusion (CRVO), and central serous chorioretinopathy (CSCR), which are among the leading causes of visual impairment and blindness worldwide [1][2][3]. According to the World Health Organization (WHO), DME, which primarily affects working-age adults, affected 425 million people worldwide in 2017 and is expected to affect 629 million people by 2045 [4]. The WHO also estimates that 196 million people had ARMD in 2020; this number is expected to rise to 288 million by 2040 [5]. The prevalence of ARMD in elderly people is 40% at the age of 70 years, rising to 70% at the age of 80 years. Rogers et al. [6] found that BRVO and CRVO affected 13.9 million and 2.5 million of the world's population aged 30 years and older, respectively, in 2008. Men have a higher prevalence of CSCR than women [7]. A large population is afflicted by these diseases, and projections suggest that this number will escalate in the future. However, these diseases can be treated at their first stage, and patients can recover lost vision through early detection and treatment [8][9][10].
Optical coherence tomography (OCT) is a noninvasive imaging modality that provides high-resolution information within a cross-sectional area. OCT retinal imaging enables visualization of the thickness, structure, and detail of the various layers of the retina. In addition, when the retina develops a disease, OCT enables the visualization of abnormal features and damaged retinal structures [11]. Therefore, retinal OCT images are widely used in the medical field to monitor information in medical images prior to treatment or for the diagnosis of various diseases.
For several years, ophthalmologists have analyzed the comprehensive information inside the retina for retinal care services, treatment, and diagnosis using retinal OCT images in clinical settings. Clinicians perform these tasks manually and must wait for each process; as a result, manual analysis is time consuming when there are numerous OCT images. Even if the clinician has great expertise, the analysis may not be accurate [12]. Automated techniques based on deep learning (DL) or machine learning with artificial intelligence have been proposed to overcome this limitation.
Recently, computer algorithms based on artificial intelligence, DL, and machine learning have been proposed for the automatic diagnosis of various retinal diseases and have been applied in clinical health care. Han et al. [13] modified three well-known convolutional neural network (CNN) models to distinguish normal retinas and three subtypes of neovascular age-related macular degeneration (nAMD). The classification layers of the original CNN models were replaced by new layers: four fully connected layers and three dropout layers, with a leaky rectified linear unit (Leaky ReLU) as the activation function. The modified models were trained using the transfer learning technique and tested on 920 OCT images; the VGG-16 model achieved an accuracy of 87.4%. Sotoudeh-Paima et al. [14] classified OCT images as normal, AMD, or choroidal neovascularization (CNV) using a multiscale CNN, which achieved a classification accuracy of 93.40% on a public dataset. Elaziz et al. [15] developed a four-class classification method for identifying retinal diseases from OCT images based on an ensemble DL model and machine learning. First, features were extracted by two models, MobileNet and DenseNet, and concatenated as the full features of the input images. Then, feature selection was performed to remove irrelevant features, and the useful features were input into machine learning classifiers to classify the input data. A total of 968 OCT images were used to evaluate the classification performance, and an accuracy of 94.31% was achieved. Another study, by Liu et al. [16], used a DL model to extract attention features from OCT images and used them as guiding features for classifying CNV, DME, drusen, and normal. The classification performance was assessed using public datasets, and an average accuracy of 95.10% was achieved. Minagi et al. [17] used transfer learning with universal adversarial perturbations (UAPs) for classification with a limited dataset.
Three types of medical images, including OCT images, were used to assess diseases, and the DL models were pretrained on the ImageNet dataset. The UAP algorithm was used to generate a training set from the provided data to train the DL model; this adversarial retraining is an algorithm for improving the classification performance of DL models. A total of 11,200 OCT images were used in training and assessing the model's performance, and a classification accuracy of 95.3% was achieved for the four classes: CNV, DME, drusen, and normal. Tayal et al. [18] presented a four-class ocular disease classification based on three CNN models using OCT images. The images were enhanced before being fed to the CNN models. To assess the performance of the presented method, 6,678 publicly available OCT images were evaluated. The method achieved an accuracy of 96.50% with a CNN model comprising nine layers; this nine-layer CNN outperformed the five- and seven-layer CNNs that were also tested.
According to the literature, retinal OCT classification has been developed using DL and DL-based methods such as transfer learning, smoothing generative adversarial networks, adversarial retraining, and multiscale CNNs. These methods improve model performance by fine-tuning knowledge from previous tasks for the OCT image problem, increasing the dataset size for training, changing how data are input to the training model, and changing the training input image sizes. However, these classification methods achieve accuracies below 97.00%, indicating their potential for further improvement. Moreover, these studies classify retinal diseases into fewer than five classes. This study aims to improve the classification accuracy and detect five classes of retinal diseases, more than in the previous studies highlighted in the literature.
In this study, we propose an automatic method based on a hybrid of deep learning and ensemble machine learning for screening five different retinal diseases from OCT images, improving the performance of OCT image classification. The proposed method improves classification accuracy, outperforming standalone classifiers without the hybrid. In addition, it can be trained using a smaller dataset from our hospital that has been strictly labelled by experts. Moreover, the proposed method can be deployed on a web server, openly accessible for evaluation, returning predictions within seconds.

Dataset
All OCT images were collected from Soonchunhyang University's Bucheon Hospital. The OCT images were collected and normalized after approval by the Bucheon Hospital's Institutional Review Board (IRB). The OCT images were captured using DRI-OCT (Topcon Medical Systems, Inc., Oakland, NJ, USA). The scan range was 3-12 mm in the horizontal and vertical directions, with a lateral resolution of 20 μm and an in-depth resolution of 8 μm. The shooting speed was 100,000 A-scans per second. The OCT images were collected in two batches: the first comprised 2,000 images captured between April and September 2021, and the second consisted of 998 images collected over approximately five months, from September 2021 to January 2022. Therefore, the total number of OCT images collected was 2,998; these were labeled by ophthalmologists into five retinal disease classes (ARMD: 740, BRVO: 450, CRVO: 299, CSCR: 749, and DME: 760) as the ground truth.
This study was approved by the Institutional Review Board (IRB) from Soonchunhyang University Bucheon Hospital, Bucheon, Republic of Korea (IRB approval number: 2021-05-001). All methods were performed in accordance with relevant guidelines and regulations. Informed consent was obtained from all subjects.

OCT image preprocessing
Image processing is a technique for performing various operations on the original images to convert them into a format suitable for DL models or to extract useful features. In image classification based on deep learning, image processing is an essential initial step to transform an image before feeding it to the CNN model. The CNN model requires a fixed image input size, and higher-resolution images demand longer computing times. To shorten the computing time and meet the input size required by the CNN models, all OCT images were resized to 300 pixels in height and 500 pixels in width. The OCT image dataset was split into an 80% training set and a 20% testing set. The training set was used to train the deep learning model, and the testing set was used to assess performance.
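A minimal sketch of this preprocessing step, assuming the images are held as numpy arrays. The nearest-neighbour resize below is a stand-in for a library call such as cv2.resize or PIL's Image.resize, and the synthetic arrays and labels are placeholders for the hospital OCT data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def resize_nearest(img, height=300, width=500):
    # Nearest-neighbour resize via index selection; a simple stand-in for
    # a library call such as cv2.resize or PIL's Image.resize.
    h, w = img.shape[:2]
    rows = np.arange(height) * h // height
    cols = np.arange(width) * w // width
    return img[rows][:, cols]

# Synthetic stand-ins for the OCT images and their five disease labels.
images = np.random.rand(50, 320, 640)
labels = np.arange(50) % 5

X = np.stack([resize_nearest(img) for img in images])
# 80% training / 20% testing split, stratified by disease label.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)
```

In practice the resize is performed once and the resized images are fed to every TL-CNN model, so all extractors see the same 300 x 500 inputs.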

Data augmentation
The size of the dataset has a significant impact on DL performance; a larger dataset generally enables better performance. However, in the medical field, most datasets are limited in size. Data augmentation is a technique developed to overcome dataset limitations by performing different operations on the available data to create new data, thereby enlarging the dataset. Additionally, data augmentation is used to enhance performance [19], generalize the model [20], and avoid overfitting [21]. We utilized data augmentation techniques from the Python library imgaug, including vertical flip, rotation, scaling, brightness, saturation, contrast, contrast enhancement, and equalization. The OCT images were rotated at angles of 170, 175, 185, and 190 degrees; these angles suit the rectangular shape of the images and avoid losing information from the original OCT images. Images were scaled by a random factor between 0.01 and 0.12; the brightness level ranged from 1 to 3; the saturation operation ranged from 1 to 5, increasing by one at each level; random contrast used contrast values from 0.2 to 3; contrast enhancement used levels from 1 to 1.5; and image equalization used levels from 0.9 to 1.4. At the end of the data augmentation process, one OCT image serves as the basis for generating 29 augmented images. Therefore, the training set comprised a total of 69,455 OCT images, including the original samples. The acquired OCT and augmented images are shown in Figure 1. Data augmentation was applied only to the training set used to train the proposed method. After the augmentation operation, the OCT images are partitioned by the 10-fold cross-validation technique into folds for training the model (training data) and validating it during training.

Figure 2 shows the architecture of the proposed method, which comprises three significant blocks: feature extraction, classification, and performance boosting.
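The operator families above can be sketched as follows. This is a numpy/scipy stand-in for the imgaug pipeline used in the study (imgaug itself provides these operators as augmenters); the specific ranges follow the text, while the choice of exactly 29 random draws per image mirrors the stated augmentation factor:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def augment_once(image, rng):
    """Apply one randomly chosen operator family (flip, rotation, scale,
    brightness, contrast) -- a numpy/scipy stand-in for imgaug augmenters."""
    op = rng.integers(5)
    if op == 0:                                   # vertical flip
        return np.flipud(image)
    if op == 1:                                   # rotation near 180 degrees
        angle = rng.choice([170, 175, 185, 190])
        return ndimage.rotate(image, angle, reshape=False, mode="nearest")
    if op == 2:                                   # slight zoom (scale 0.01-0.12)
        z = 1 + rng.uniform(0.01, 0.12)
        h, w = image.shape
        zoomed = ndimage.zoom(image, z)
        top, left = (zoomed.shape[0] - h) // 2, (zoomed.shape[1] - w) // 2
        return zoomed[top:top + h, left:left + w]  # centre-crop back to size
    if op == 3:                                   # brightness level 1-3
        return np.clip(image * rng.uniform(1, 3), 0, 255)
    c = rng.uniform(0.2, 3)                       # contrast 0.2-3 about the mean
    return np.clip((image - image.mean()) * c + image.mean(), 0, 255)

image = rng.uniform(0, 255, size=(300, 500))      # one synthetic OCT image
augmented = [augment_once(image, rng) for _ in range(29)]  # 29 variants
```

Every variant keeps the 300 x 500 shape, so the augmented images can be fed to the same CNN input layer as the originals.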
First, transfer learning based on CNN models extracts one thousand features from the OCT images. Second, various machine learning algorithms are used to classify the OCT images based on the features extracted by the CNN model. Finally, the ensemble algorithm fuses the distribution probabilities of the same class and predicts the retinal disease class based on probability. Each block of the proposed architecture is described in detail in the following subsections.

Figure 2.
System architecture overview of the proposed method. The proposed method accepts images with a resolution of 500 pixels in width and 300 pixels in height. The CNN models extract features from the OCT images, and machine learning algorithms classify them. A voting classifier ensembles the output probabilities to predict the retinal disease.

Feature extraction based on transfer learning
Transfer learning is a technique used to transfer the knowledge of a related, previously learned task to improve the learning of a new task. Training a CNN model from scratch is computationally expensive and time consuming; moreover, an extensive dataset is required to achieve good performance. Therefore, transfer learning has been developed to overcome these drawbacks of DL [22]. To retrain a model on a new task based on prior knowledge, the pretrained base layers are frozen and the small top layers are trained and fine-tuned. In this study, the transfer learning CNN (TL-CNN) models EfficientNetB0 [23], InceptionResNetV2 [24], InceptionV3 [25], ResNet50 [26], VGG16 [27], and VGG19 [28] are selected and updated. The names of the modified CNN models start with TL, indicating transfer learning, and end with the original names of the CNN models: TL-EfficientNetB0, TL-InceptionResNetV2, TL-InceptionV3, TL-ResNet50, TL-VGG16, and TL-VGG19. The original CNN models were created for generic image classification tasks; they were trained and tested on a large dataset (ImageNet) to categorize 1,000 different types of images. To use a CNN model with the transfer learning technique to classify retinal OCT images, the classification layers of each CNN model must be modified to suit the target classes of the specific problem, here the categorization of OCT images. The new classification head stacks a GlobalAveragePooling2D layer, one Normalization layer, and two Dense layers. The first Dense layer consists of 1,024 units with the ReLU activation function, and the final Dense layer has five output units. Finally, the updated model is retrained to fine-tune the previous feature representations in the base model and make them more relevant for OCT image classification. The output is a five-element vector representing the class probability distribution computed by the Softmax activation function.
As mentioned previously, a CNN model based on transfer learning is used to extract convolutional features from the OCT images. The convolutional features are read from the TL-CNN models at the GlobalAveragePooling2D layer of the classification head. These features are one-dimensional. Different models provide different features and feature counts depending on the structure and convolutional filters of the model.
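A minimal Keras sketch of this head replacement and GAP-level feature extraction. The tiny convolutional base below is a lightweight stand-in for a pretrained model such as tf.keras.applications.InceptionResNetV2(include_top=False, weights="imagenet", input_shape=(300, 500, 3)), and the Normalization layer is given identity statistics here, whereas in practice it would be adapted to the training features:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_tl_model(base, num_classes=5):
    """Attach the new classification head described in the text:
    GlobalAveragePooling2D -> Normalization -> Dense(1024, relu)
    -> Dense(num_classes, softmax). The base layers are frozen so only
    the new head is trained first; fine-tuning can unfreeze them later."""
    base.trainable = False
    x = layers.GlobalAveragePooling2D(name="gap")(base.output)
    x = layers.Normalization(mean=0.0, variance=1.0)(x)  # identity stats (sketch)
    x = layers.Dense(1024, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base.input, out)

# Tiny convolutional stand-in for a pretrained ImageNet base.
inp = layers.Input(shape=(300, 500, 3))
feat = layers.Conv2D(8, 3, activation="relu")(inp)
base = models.Model(inp, feat)

model = build_tl_model(base)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# One-dimensional convolutional features are read off at the GAP layer.
extractor = models.Model(model.input, model.get_layer("gap").output)
features = extractor.predict(np.zeros((2, 300, 500, 3)), verbose=0)
```

The feature vector length equals the number of channels in the base model's last convolutional block, which is why different TL-CNN models yield different feature counts.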

Ensemble voting classifier
Individual machine learning classifiers provide different identification accuracies because each classifier has its own learning ability to identify classes from the given features. Therefore, an ensemble method is used to aggregate the distribution probabilities of two classifiers. Based on experiments, the proposed method selects the two classifiers with the highest predictions (k-NN and XGBoost) for aggregation. The ensemble uses soft voting, which performs better than the individual models [35]. Whereas hard voting predicts the final class label as the label most frequently predicted by the classifiers, soft voting predicts the class label by averaging the class probabilities p:

ŷ = argmax_i Σ_k w_k p_ik, (1)

where w_k is the weight of machine learning classifier k, which can be either k-NN or XGBoost; each classifier automatically learns from the disease features in the OCT images and then identifies the type of disease from the input data; i represents the class label of the retinal diseases, where i ∈ {0: ARMD, 1: BRVO, 2: CRVO, 3: CSCR, 4: DME}; and p_ik represents the probability assigned by classifier k to class i. Table 1 presents the proposed algorithm, which includes image processing, data splitting, data augmentation, feature extraction, classification, and the ensemble of classifiers; its final steps are:

training-accuracy = accuracy(predicted-labels, labels)
22: save training weights
23: voting = "soft"
24: ML1 = k-NN(train-data, train-labels, test-data)
25: ML2 = XGBoost(train-data, train-labels, test-data)
26: procedure ENSEMBLE_CLASSIFIERS(train-data, train-labels, test-data)
27:   ensemble-classifiers = concatenate(ML1, ML2)
28:   ensemble-classifiers.fit(train-data, train-labels)
29:   predictions = ensemble-classifiers.predict(test-data)
30:   save training weights, visualize results
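The soft-voting ensemble can be sketched with scikit-learn's VotingClassifier. Here synthetic 1-D features stand in for the CNN-extracted OCT features, and GradientBoostingClassifier stands in for XGBoost (whose XGBClassifier exposes the same scikit-learn fit/predict_proba interface); the manual computation at the end reproduces Eq. (1) with equal weights w_k:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the extracted OCT features and five labels.
X, y = make_classification(n_samples=300, n_features=64, n_informative=20,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("xgb", GradientBoostingClassifier(n_estimators=30,
                                                   random_state=0))],
    voting="soft")  # average the class probabilities, then take the argmax
ensemble.fit(X_tr, y_tr)

# Soft voting by hand, matching Eq. (1) with equal weights w_k:
probs = [clf.predict_proba(X_te) for clf in ensemble.estimators_]
manual = np.argmax(np.mean(probs, axis=0), axis=1)
```

With equal weights, the manual argmax over the averaged probability matrices matches VotingClassifier's own soft-voting prediction.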

Experiments
The proposed OCT image classification method was developed using Python 3.7, TensorFlow 2.6.0, and scikit-learn. It was run on a personal computer with the Windows 10 operating system, powered by an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz, 192 GB of RAM, and an NVIDIA TITAN RTX GPU.
The proposed OCT image classification method was trained using augmented OCT images and evaluated using a test set. There were two types of training. First, six TL-CNN models were trained to perform feature extraction from OCT images.
The six TL-CNN models were separately trained using a combination of the training set and the augmented images of the training set. The combined data were split using a 10-fold cross-validation algorithm to separate the images for training, validate the model during training, and prevent overfitting. Furthermore, the TL-CNN models were individually trained with a fixed batch size of 64, 100 epochs, and the Adam optimizer with a learning rate of 0.0001. The learning rate was selected based on the standard learning rate provided by the TensorFlow library. With a setting of 100 epochs, each model is trained 100 times on the same data; performance therefore improves as the weights are updated according to the loss computed over the repeated training passes. The weights of each TL-CNN model were recorded in a separate file after training and were used to extract features from the training and testing sets. Then, the machine learning models were trained on the convolutional features extracted by the TL-CNN models to obtain the class probabilities. Six machine learning models were separately trained, and their weights were recorded after training was completed. Finally, an ensemble method based on soft voting was applied to average the class probabilities of the two classifiers and obtain an effective final class prediction.
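The 10-fold partitioning of the combined training data can be sketched with scikit-learn; the index array stands in for the augmented training images, and the actual model fitting (batch size 64, 100 epochs, Adam at 0.0001) is elided as a comment:

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(1000)  # stand-ins for the augmented training images
fold_sizes = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True,
                                random_state=0).split(indices):
    # Each TL-CNN model is fit on train_idx (batch size 64, 100 epochs,
    # Adam optimizer with learning rate 0.0001) and monitored on val_idx
    # during training to detect overfitting.
    fold_sizes.append((len(train_idx), len(val_idx)))
```

Each fold reserves one tenth of the training data as the validation split, so every image serves as validation data exactly once across the ten folds.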

Results and discussion
The results of the proposed OCT image classification are divided into three parts: classification results, deployment of the classification results to web services, and a comparison of the results with similar studies in terms of classification accuracy.

Classification
A test set was used to evaluate the performance of the proposed method after training the model. The same preprocessing was performed on the test dataset as on the training dataset, without data augmentation. The test set contained 601 OCT images, which were used to assess the classification performance. Six TL-CNN models were individually trained to extract features from the OCT images and store the extracted features in pickle format. Six machine learning classifiers were used to discriminate the classes of the OCT images based on the features extracted by the TL-CNNs. Statistical measures were computed to evaluate the classification ability among the classes: sensitivity, specificity, precision, and accuracy. The relationship between the sensitivity and specificity of the various categories was shown through a receiver operating characteristic (ROC) curve. Moreover, the confusion matrix was analyzed, which indicates the correct and incorrect class predictions. Table 2 lists the test results using TL-EfficientNetB0 as an extractor and seven classifiers, including the ensemble classifier; the ensemble classifier outperformed the others, with a sensitivity, specificity, precision, and accuracy of 96.17%, 98.92%, 95.89%, and 95.85%, respectively. The second highest performance was achieved by the k-NN classifier, with a sensitivity, specificity, precision, and accuracy of 87.37%, 96.95%, 88.82%, and 88.89%, respectively. The classification results of the other machine learning classifiers are unstable, increasing and decreasing without a clear pattern. Table 3 shows the classification results when using TL-InceptionResNetV2 as an extractor and seven classifiers; the ensemble classifier outperformed the others, with a sensitivity, specificity, precision, and accuracy of 97.42%, 99.40%, 97.49%, and 97.68%, respectively.
The second highest performance was achieved by the k-NN classifier, with a sensitivity, specificity, precision, and accuracy of 87.37%, 96.48%, 88.19%, and 87.56%, respectively. In addition, with the same extractor, the classification performance of XGBoost was similar to that of the k-NN classifier. Table 4 lists the evaluation results when using the TL-InceptionV3 extractor and seven machine learning classifiers; the ensemble classifier outperformed the other methods, with a sensitivity, specificity, precision, and accuracy of 91.34%, 97.59%, 91.03%, and 91.04%, respectively. The second highest performance was achieved by XGBoost, with a sensitivity, specificity, precision, and accuracy of 84.42%, 95.10%, 82.88%, and 82.91%, respectively. Table 5 lists the classification results when using the TL-ResNet50 model as a feature extractor and classifying those features with seven different classifiers; the ensemble classifier outperformed the others, obtaining a sensitivity, specificity, precision, and accuracy of 96.46%, 99.14%, 96.76%, and 96.68%, respectively. The second highest performance was achieved by XGBoost, with a sensitivity, specificity, precision, and accuracy of 87.63%, 96.59%, 88.27%, and 87.73%, respectively. The performances of the two other classifiers, SVM and k-NN, were comparable to each other and better than those of the remaining three classifiers in the experiments. Table 6 lists the test results of the proposed classification with TL-VGG16 as a feature extractor and seven machine learning classifiers; the ensemble classifier exhibited the best performance, with a sensitivity, specificity, precision, and classification accuracy of 92.07%, 98.00%, 92.60%, and 92.54%, respectively. The XGBoost classifier had the second highest performance with TL-VGG16 as a feature extractor, obtaining a sensitivity, specificity, precision, and accuracy of 80.48%, 94.91%, 81.44%, and 82.26%, respectively. A similar performance was observed for SVM and k-NN.
Table 7 lists the classification test results when using the TL-VGG19 model for feature extraction and classifying those features with various classifiers. The ensemble classifier outperformed the five other classifiers; its sensitivity, specificity, precision, and accuracy are 93.86%, 93.40%, 93.44%, and 93.86%, respectively. The second- and third-highest performances were achieved by XGBoost and SVM, respectively. Table 3. Performance summary of the proposed classification through feature extraction using TL-InceptionResNetV2, six classifiers, and the ensemble voting classifier. Different classifiers yield different sensitivities, specificities, precisions, and accuracies. The proposed classification method with the ensemble classifier outperforms the others on all statistical measures. The six TL-CNN models were compared, and TL-InceptionResNetV2 achieved a better performance than the other five models used in this study, with a sensitivity, specificity, precision, and accuracy of 97.42%, 99.40%, 97.49%, and 97.68%, respectively. For every TL-CNN model, the ensemble algorithm outperformed the individual classifiers. The individual k-NN and XGBoost classifiers performed better than the other three individual classifiers; accordingly, the ensemble of k-NN and XGBoost also achieved a better performance than k-NN and XGBoost alone. Figure 3 shows the ROC curves of the best-performing configuration of the proposed classification method, TL-InceptionResNetV2 with the ensemble classifiers (k-NN and XGBoost). The ROC areas for the classes ARMD, BRVO, CRVO, CSCR, and DME are 0.99, 0.96, 0.99, 0.99, and 0.98, respectively. The relationship between the sensitivity and specificity of the five classes is the most important aspect. The confusion matrix is implemented using the sklearn library in Python. The size of the test data is essential to demonstrating the robustness of the classification. The confusion matrix shows the number of correct and incorrect predictions among all classes.
Figure 4 shows the confusion matrix of the proposed method with the best performance: 148 of 149 ARMD images are correctly predicted; 85 of 91 BRVO images are correctly predicted (3 incorrectly predicted as ARMD and 3 as DME); 59 of 60 CRVO images are correctly predicted, with one image incorrectly predicted as BRVO; 148 of 150 CSCR images are correctly predicted, with two images incorrectly predicted as ARMD; and 149 of 153 DME images are correctly predicted, with four incorrect predictions (ARMD: 1, BRVO: 1, and CRVO: 2).
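The per-class metrics reported above follow directly from such a confusion matrix. The sketch below computes sensitivity, specificity, and precision per class; the small 3-class matrix is a hypothetical illustration, not the paper's 5-class results:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class sensitivity, specificity, and precision from a multiclass
    confusion matrix (rows = true class, columns = predicted class)."""
    tp = np.diag(cm).astype(float)       # correct predictions per class
    fn = cm.sum(axis=1) - tp             # missed members of the class
    fp = cm.sum(axis=0) - tp             # other classes predicted as it
    tn = cm.sum() - tp - fn - fp         # everything else
    return tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)

# A small illustrative 3-class matrix (hypothetical values).
cm = np.array([[48, 1, 1],
               [2, 45, 3],
               [0, 2, 48]])
sensitivity, specificity, precision = per_class_metrics(cm)
accuracy = np.diag(cm).sum() / cm.sum()  # overall accuracy
```

Averaging these per-class values over the five disease classes yields summary figures of the kind listed in Tables 2-7.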

OCT image classification web service
To make the proposed method applicable and accessible externally through an Internet connection, we deployed the proposed OCT image classification on a web server using the Flask framework. The web server receives one image at a time and feeds it into the proposed classification method to predict retinal diseases. The input is an OCT image consisting of three channels with a resolution of 300 pixels in height and 500 pixels in width. When an OCT image is input through the web service user interface (UI), the image is transferred to a computer server that runs the DL classification model. First, the computer server performs the same image preprocessing as that used for the training and test sets. Second, the preprocessed image is input into the trained classification model for prediction. Finally, the predicted results are forwarded to the web service using the Flask framework. The prediction results consist of the input image, the distribution of probabilities among the five classes, the diagnosed retinal disease class, and the prediction time for the image. The prediction time is the time taken from inputting an image to the web service to returning the prediction result. Figure 5 shows the initial UI of the web server. The prediction results obtained after inputting the OCT images are shown in Figure 6. The "Select an Image" button allows the user to browse to the location of a stored image and upload it to the web service, and the "Predict" button sends the image to the deep learning server and receives the diagnosis class.
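A minimal Flask sketch of this deployment. The `predict_probs` function is a placeholder for the trained feature extractor plus ensemble classifier (here it returns a uniform distribution), and the route name `/predict` is an assumption for illustration:

```python
import io

import numpy as np
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)
CLASSES = ["ARMD", "BRVO", "CRVO", "CSCR", "DME"]

def predict_probs(image_array):
    # Placeholder for the trained TL-CNN extractor + ensemble classifier;
    # this sketch simply returns a uniform class distribution.
    return np.full(len(CLASSES), 1.0 / len(CLASSES))

@app.route("/predict", methods=["POST"])
def predict():
    # Receive one OCT image, apply the same preprocessing as in training
    # (three channels, 500 x 300 pixels), and return the diagnosis.
    raw = request.files["image"].read()
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    arr = np.asarray(img.resize((500, 300))) / 255.0
    probs = predict_probs(arr)
    return jsonify({"probabilities": dict(zip(CLASSES, probs.tolist())),
                    "diagnosis": CLASSES[int(np.argmax(probs))]})
```

The JSON response mirrors the reported prediction output: the class probability distribution and the diagnosed retinal disease class.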

Comparison results
The accuracy of the proposed OCT image classification method is compared with that of the recent studies reviewed in the literature review section, as listed in Table 8. These studies focused on transfer learning, developing new models, and combining well-known CNN models with machine learning. All the listed studies used either different OCT datasets or a combination of such datasets. Moreover, the number and type of classification classes differed, with at most four classes. We classify retinal diseases into five classes using a dataset obtained from a hospital; a larger number of classes can affect the performance of classification methods. Table 8 lists the methods and algorithms that have been presented, including proposed models with transfer learning, a multiscale DL model, and transfer learning using existing CNN models. However, the results listed in the literature review show accuracies below 97%. Instead of relying on a single classifier, this study combines two machine learning classifiers and uses a DL model as the feature extractor. Our study exhibits an accuracy of 97.68%, which is higher than the accuracies of the aforementioned studies. In addition, the number of classification classes is higher than in the studies reviewed.
Our study classifies retinal OCT images into disease classes that differ from those of the reviewed studies and are not available in public datasets. We hope that datasets covering these retinal diseases will become available in the future, and we will then evaluate the proposed OCT image classification system on a public dataset.

Conclusions
This study presents a hybrid ensemble OCT image classification method for diagnosing five classes of retinal diseases. The proposed method employs an ensemble of machine learning classifiers as the classifier and a deep learning model as the feature extractor. We identified the deep learning model and ensemble classifiers most suitable for OCT image classification. The proposed model outperformed the individual classifiers. With an accuracy of 97.68%, the best deep learning model and ensemble of machine learning classifiers were TL-InceptionResNetV2 and the aggregation of k-NN and XGBoost. This classification method can be deployed as a web service for convenient diagnosis of retinal diseases over the Internet. Moreover, the prediction time was only a few seconds per image. This study contributes to the development of accurate multiclass OCT image classification. In the future, we aim to further improve the classification performance. If datasets with the same classes as in our study are made public, we will assess the proposed method on them to broaden its applicability. In the medical field, the improved performance can be used to automatically classify OCT images, eliminating time-consuming manual tasks, and this classification can also aid in the prevention of vision loss.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Data availability
The data used to support this study have not been made openly available because they are real clinical data from Soonchunhyang University Bucheon Hospital, and patient privacy must be protected, as individuals could be identified from these data. However, the data are available from the corresponding author upon reasonable request.