Automatic detection of retinopathy with optical coherence tomography images via a semi-supervised deep learning method

Automatic detection of retinopathy via computer vision techniques is of great importance for clinical applications. However, traditional deep learning based methods in computer vision require a large amount of labeled data, which is expensive to obtain and may not be available in clinical applications. To mitigate this issue, we propose a semi-supervised deep learning method built upon a pre-trained VGG-16 and virtual adversarial training (VAT) for the detection of retinopathy with optical coherence tomography (OCT) images. It requires only a few labeled OCT images together with a number of unlabeled ones for model training. We evaluate the proposed method on two popular datasets. With only 80 labeled OCT images, it achieves classification accuracies of 0.942 and 0.936, sensitivities of 0.942 and 0.936, specificities of 0.971 and 0.979, and AUCs (areas under the ROC curve) of 0.997 and 0.993 on the two datasets, respectively. Compared with human experts, it reaches expert level with 80 labeled OCT images and outperforms four out of six experts with 200 labeled OCT images. Furthermore, we adopt the Gradient-weighted Class Activation Mapping (Grad-CAM) method to visualize the key regions that the proposed method focuses on when making predictions. The visualization shows that the proposed method can accurately recognize the key patterns of the input OCT images when predicting retinopathy.


Introduction
Optical coherence tomography (OCT) has been widely adopted by ophthalmologists for the detection of retinopathies because it is high-resolution, contactless and nondestructive [1,2]. Based on textural and morphological variations, it can reveal major retinopathies, such as age-related macular degeneration (AMD) and diabetic macular edema (DME), which generally lead to vision loss. AMD can be further divided into "dry" AMD and "wet" AMD (also known as choroidal neovascularization (CNV)), where the typical feature of DRUSEN can be seen. With a huge number of OCT images being scanned, i.e., around 30 million per year [3], manual identification of OCT images is a heavy burden for ophthalmologists. The identification task becomes even more challenging in developing countries where the number of qualified ophthalmologists is inadequate.
Recently, deep learning has achieved great success in many challenging areas, such as image classification and natural language understanding [4]. One of the most popular deep learning algorithms is the convolutional neural network (CNN). The CNN is designed for image processing and thus has been widely used for the recognition of OCT images [5][6][7][8]. Lee

Overview
The proposed framework for retinopathy detection is shown in Fig. 1. First, the VGG-16 model is pretrained on the popular ImageNet dataset [24]. Note that the standard VGG-16 network was designed for a classification task with 1,000 classes. Therefore, its last layer, i.e., the softmax layer, has 1,000 outputs, corresponding to the predicted probabilities of the 1,000 classes. In our case, the number of outputs of the softmax layer needs to be changed to the number of retinopathy categories. Specifically, we replace the last layer of the pretrained VGG-16 (i.e., the softmax layer with 1,000 outputs) by a softmax layer with C outputs, where C is the number of classes, e.g., normal, AMD, DME, etc. The weights of this new softmax layer are randomly initialized. After pretraining, the next step is to fine-tune the parameters of the model using the small number of labeled OCT images together with the large number of unlabeled ones. Note that the pretraining provides a good initialization, which leads to faster and better convergence of the model. For the labeled and unlabeled data, a cross-entropy loss and a VAT loss are calculated, respectively. Both losses are then combined to obtain the final loss of the model. Finally, stochastic gradient descent (SGD) on the final loss is adopted to optimize the parameters of the model for retinopathy detection. The pretrained VGG-16 and the VAT are detailed in the following subsections.

Pretrained VGG-16 for OCT images
VGG-16, a very deep CNN model, has been widely used for image classification tasks [25]. Figure 2 shows the detailed structure of the VGG-16 for the detection of retinopathy with OCT images. First, the original OCT B-scans are resized to 224 × 224 × 3, which is the input size of VGG-16. The resized OCT B-scans go through several convolutional layers with a filter size of 3 × 3 and a stride of 1, followed by max-pooling layers with a pooling size of 2 × 2 and a stride of 2. The number of filters of each convolutional layer can be found in Fig. 2. After the last max-pooling layer, a flatten layer converts the data into a one-dimensional feature vector. This feature vector is fed into two fully-connected (FC) layers with a hidden size of 4096 and a softmax classification layer with C outputs. The activation function of the convolutional layers and fully-connected layers is the popular rectified linear unit (ReLU). Training a very deep CNN model such as VGG-16 from scratch generally requires a large amount of labeled data. Here, we adopt the idea of pretraining, i.e., training the model on a large public dataset with labels. Specifically, we train the VGG-16 on the ImageNet dataset. The pretrained VGG-16 thus has a strong capability of feature learning on images. When this pretrained VGG-16 is applied to the detection of retinopathy on OCT images, it requires fewer labeled images and converges faster in model training.
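To make the layer arithmetic above concrete, the following minimal sketch traces the spatial size of a 224 × 224 × 3 input through the convolution and pooling stack. The filter counts are those of the standard VGG-16 configuration and a padding of 1 is assumed for every convolution (neither is stated in the text; Fig. 2 remains the authoritative source). The helper names are hypothetical.

```python
# Standard VGG-16 layout (assumed): integers are conv filter counts, "M" is 2x2 max-pooling.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial size after max-pooling."""
    return (size - kernel) // stride + 1

def trace_shapes(size=224, channels=3, cfg=VGG16_CFG):
    """Return the (height, width, channels) shape after every layer."""
    shapes = []
    for layer in cfg:
        if layer == "M":
            size = pool_out(size)
        else:
            size, channels = conv_out(size), layer
        shapes.append((size, size, channels))
    return shapes

shapes = trace_shapes()
h, w, c = shapes[-1]
flat = h * w * c  # length of the one-dimensional vector fed to the first FC layer
print(shapes[-1], flat)
```

Under these assumptions the last max-pooling layer outputs a 7 × 7 × 512 tensor, so the flatten layer produces a vector of length 25,088 before the two 4096-unit FC layers.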

Virtual adversarial training
Due to the high cost of data annotation and the lack of ophthalmologists in many countries, especially developing countries, collecting a large labeled OCT image dataset to train a deep learning model for automatic detection of retinopathies is not realistic. However, unlabeled OCT images are widely available, e.g., around 30 million OCT images are generated per year [26]. In this work, we attempt to develop a deep learning algorithm (built upon the pretrained VGG-16) that uses very few labeled OCT images and a large number of unlabeled ones.
We adopt the idea of local distributional smoothness (LDS), which measures the smoothness of the output distribution of a model with respect to its inputs [20]. In other words, small perturbations of the input should cause only small changes in the model output, resulting in a robust model with good generalization performance. Virtual adversarial training (VAT) is an effective way to achieve LDS of a model [20]. It finds a small perturbation of an input such that the output distribution of the model changes the most. Then, the model is trained to minimize this change caused by the perturbed input. Specifically, given an input x (e.g., an OCT image), a new sample x̃ is generated by adding a small perturbation µ to the original input, i.e., x̃ = x + µ. Among all the possible perturbations µ, the adversarial perturbation, denoted as µ_adv, leads to the largest change in the model output. Here, the change in the model output is measured by the KL divergence. The µ_adv can be obtained by solving the following equation:
$$\mu_{adv} = \arg\max_{\mu;\ \|\mu\|_2 \le \epsilon} \mathrm{KL}\big(p(y \mid x, \theta)\ \|\ p(y \mid x + \mu, \theta)\big) \qquad (1)$$
where θ denotes the model parameters, y is the model output, and the constraint ∥µ∥_2 ≤ ϵ ensures that the norm of the perturbation µ does not exceed ϵ, which is a pre-defined hyper-parameter. After obtaining the adversarial perturbation µ_adv, the objective of VAT is to minimize the KL divergence between the model outputs for the original and the perturbed inputs, resulting in a robust model with LDS towards various perturbations. Thus, the VAT loss can be defined as follows:
$$L_{VAT} = \mathrm{KL}\big(p(y \mid x, \theta)\ \|\ p(y \mid x + \mu_{adv}, \theta)\big) \qquad (2)$$
Based on the above analysis, the calculation of the VAT loss requires only the model outputs, but no labels. This differs from conventional adversarial training, which requires the true labels. Therefore, the VAT loss can be used for semi-supervised learning where a large number of unlabeled samples are available.
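The search for the adversarial perturbation can be illustrated on a toy problem. The sketch below is not the released implementation: it replaces the VGG-16 by a hypothetical two-feature linear softmax classifier, and it approximates the norm-constrained maximizer of the KL divergence with a single power-iteration step (the approximation commonly used for VAT), using finite differences instead of back-propagation. All names, the weights `W`, the input `x`, and the step sizes `xi` and `h` are assumptions made for illustration.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    """Toy linear softmax classifier standing in for p(y | x, theta)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def adversarial_perturbation(W, x, eps, xi=1e-2, h=1e-5, seed=0):
    """One power-iteration step: gradient of the KL divergence at a small random offset."""
    rng = random.Random(seed)
    p = predict(W, x)
    d = [rng.gauss(0.0, 1.0) for _ in x]
    r = [xi * a / norm(d) for a in d]            # small random starting perturbation

    def divergence(pert):
        return kl(p, predict(W, [a + b for a, b in zip(x, pert)]))

    grad = []
    for i in range(len(x)):                      # central finite differences
        plus, minus = list(r), list(r)
        plus[i] += h
        minus[i] -= h
        grad.append((divergence(plus) - divergence(minus)) / (2 * h))
    g = norm(grad)
    return [eps * a / g for a in grad]           # rescale to the allowed norm eps

def vat_loss(W, x, eps):
    """KL divergence between predictions on x and on x + mu_adv (no label needed)."""
    mu = adversarial_perturbation(W, x, eps)
    return kl(predict(W, x), predict(W, [a + b for a, b in zip(x, mu)]))

W = [[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]]       # hypothetical 3-class weights
x = [0.5, -1.0]                                  # hypothetical unlabeled input
mu_adv = adversarial_perturbation(W, x, eps=3.0)
loss = vat_loss(W, x, eps=3.0)
```

Note that neither function receives a label, which is what allows the same computation to run on unlabeled OCT images.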
In this work, we assume that very few labeled OCT images and a large number of unlabeled ones are available for retinopathy detection. For the labeled OCT images, the cross-entropy (CE) loss is calculated during model training, which can be expressed as
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\log \hat{y}_{ic} \qquad (3)$$
where N is the total number of labeled OCT images, C is the number of classes, y_ic represents the true probability of sample i belonging to class c, and ŷ_ic represents the corresponding predicted probability. The VAT loss in Eq. (2) is then employed for the unlabeled OCT images. Hence, the overall loss of the proposed method can be expressed as
$$L = L_{CE}(x_l) + \alpha\, L_{VAT}(x_{ul}) \qquad (4)$$
where x_l and x_ul denote the labeled and unlabeled OCT images, and α is the hyper-parameter that controls the relative contributions of the CE loss and the VAT loss.
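As a small numerical illustration of how the two terms are combined, the sketch below computes a cross-entropy over one-hot labels and adds an α-weighted VAT term. It assumes the CE loss is averaged over the N labeled samples and that the two terms are simply summed; the function names and the example values are hypothetical.

```python
import math

def cross_entropy(y_true, y_pred, tiny=1e-12):
    """Mean cross-entropy over N labeled samples (rows) and C classes (columns)."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total -= sum(t * math.log(p + tiny) for t, p in zip(t_row, p_row))
    return total / len(y_true)

def overall_loss(ce_labeled, vat_unlabeled, alpha=2.0):
    """CE on the labeled batch plus alpha times VAT on the unlabeled batch."""
    return ce_labeled + alpha * vat_unlabeled

# One labeled sample of class 0, predicted with probability 0.5.
ce = cross_entropy([[1, 0, 0]], [[0.5, 0.25, 0.25]])
total = overall_loss(ce, vat_unlabeled=0.1, alpha=2.0)
```

With α = 0 the second term vanishes and only the supervised CE loss remains, which corresponds to fine-tuning the pretrained VGG-16 on the labeled images alone.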

Grad-CAM for visualization
For the automatic detection of retinopathy via deep learning, a major concern is how to interpret the predictions of the model, so that ophthalmologists can be confident in the predictions.
Here, we adopt a gradient-based method, i.e., Grad-CAM [21], which visualizes the regions of the input (also known as the saliency map) that are "vital" for the current prediction. As retinopathies have clear patterns on OCT images, we can verify the model by checking whether it focuses on these patterns when predicting different retinopathies. The basic idea of Grad-CAM is to use the gradient information flowing into the last convolutional layer of the VGG-16 to assess the importance of each neuron for a prediction. Let y^c be the score (before the softmax layer) of the class of interest c, and let A^k_ij be the pixel at the i-th row and j-th column of the k-th feature map of the last convolutional layer. The weight of each feature map is calculated from the gradient information as
$$w_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^k} \qquad (5)$$
where Z is the total number of pixels of the k-th feature map. After obtaining the weights of all the feature maps, the saliency map can be expressed as
$$L_{Grad\text{-}CAM}^c = \mathrm{ReLU}\Big(\sum_{k} w_k^c A^k\Big) \qquad (6)$$
Here, the ReLU keeps only the pixels that have a positive relationship with the class of interest. With the saliency map, we can understand which regions of an OCT image play a key role when predicting different retinopathies. This can be further verified by ophthalmologists, so that users can be confident when using the proposed method for automatic detection of retinopathy with OCT images.
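The two Grad-CAM steps (averaging the gradients of each feature map into one weight, then applying a ReLU to the weighted sum of feature maps) can be checked by hand on a tiny example. The sketch below uses two hypothetical 2 × 2 feature maps with made-up gradients; in practice both would come from the last convolutional layer of the network via back-propagation.

```python
def grad_cam(feature_maps, gradients):
    """feature_maps[k][i][j] is A^k_ij; gradients[k][i][j] is d y^c / d A^k_ij."""
    weights = []
    for grad in gradients:
        z = sum(len(row) for row in grad)                  # number of pixels Z
        weights.append(sum(sum(row) for row in grad) / z)  # average gradient of map k
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(rows):
            for j in range(cols):
                saliency[i][j] += w * fmap[i][j]
    # ReLU keeps only pixels that contribute positively to the class of interest.
    saliency = [[max(0.0, v) for v in row] for row in saliency]
    return weights, saliency

A = [[[1.0, 2.0], [0.0, 1.0]],       # feature map k = 0 (hypothetical values)
     [[2.0, 0.0], [1.0, 3.0]]]       # feature map k = 1
G = [[[0.5, 0.5], [0.5, 0.5]],       # gradients for map 0
     [[-1.0, -1.0], [0.0, 0.0]]]     # gradients for map 1
weights, saliency = grad_cam(A, G)
print(weights)    # [0.5, -0.5]
print(saliency)   # [[0.0, 1.0], [0.0, 0.0]]
```

Only the top-right pixel survives the ReLU here, because it is the only location where the positively weighted map dominates the negatively weighted one. For display, the low-resolution saliency map is typically upsampled to the input size and overlaid on the image.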

Implementation details
For the proposed method, stochastic gradient descent (SGD) with a learning rate of 0.001 and a momentum of 0.9 is adopted for parameter optimization in training. We use mini-batch training. The batch size is 8 for the labeled data, and the batch sizes for the unlabeled data are 16 and 32 for the two datasets (namely BOE and CELL), respectively. Let us take the training procedure on the BOE dataset as an example. In each step, given a batch of labeled OCT images (8 images) and a batch of unlabeled OCT images (16 images), the CE loss and the VAT loss are calculated. The overall loss is then obtained from Equation (4) and back-propagated for parameter optimization with SGD. Since there are many more unlabeled OCT images than labeled ones, many unlabeled batches are still unused after one pass over all the labeled OCT images (i.e., one training epoch). In the next training epoch, the labeled data therefore restarts from its first batch, while the unlabeled data continues from its next unused batch, until its last batch is reached. After all the unlabeled batches have been used, the unlabeled data returns to its first batch for the next training step. Note that the CELL dataset has so many unlabeled images that we may not be able to loop over all of them even after training on the labeled data for many epochs.
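The pairing of labeled and unlabeled batches described above can be sketched as a small generator. This is an illustrative reading of the procedure rather than the released training loop: image indices stand in for images, shuffling is omitted, and the function names are hypothetical.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches (the last one may be smaller)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def paired_batches(labeled, unlabeled, labeled_bs, unlabeled_bs, epochs):
    """Yield (epoch, labeled_batch, unlabeled_batch) for every training step."""
    labeled_batches = make_batches(labeled, labeled_bs)
    unlabeled_batches = make_batches(unlabeled, unlabeled_bs)
    cursor = 0  # position in the unlabeled batches, carried over between epochs
    for epoch in range(epochs):
        for labeled_batch in labeled_batches:  # labeled data restarts every epoch
            yield epoch, labeled_batch, unlabeled_batches[cursor]
            # wrap around to the first unlabeled batch after the last one
            cursor = (cursor + 1) % len(unlabeled_batches)

# BOE-like sizes: 80 labeled and 1,357 unlabeled images, batch sizes 8 and 16.
steps = list(paired_batches(list(range(80)), list(range(1357)), 8, 16, epochs=2))
print(len(steps))        # 20 steps: 10 per epoch
print(steps[10][2][0])   # 160: the second epoch continues with the 11th unlabeled batch
```

With these sizes one epoch consumes only 10 of the 85 unlabeled batches, so roughly nine epochs are needed before the unlabeled data wraps around.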
In this paper, ϵ is set to 3 in Equation (1) and α is set to 2 in Equation (4). These hyper-parameters are determined by grid search on the validation set. Training is stopped when the performance on the validation set starts to degrade. The detailed training procedure is shown in Fig. 3. The proposed method converges very fast, as the pretraining provides a good initialization. Moreover, dropout and batch normalization are adopted to prevent overfitting. The code is written in Python 3 with PyTorch and runs on an NVIDIA GeForce RTX 2080 Ti GPU. We have released the code of the proposed method on GitHub at the following link: https://github.com/xuqing88/Pytorch-SSDL-OCT.
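The stopping rule mentioned above (stop once the validation performance starts to degrade) can be sketched as follows. The `patience` parameter and the example accuracies are assumptions introduced for illustration; the text does not specify how many degraded evaluations are tolerated.

```python
def early_stopping_epoch(val_scores, patience=1):
    """Return the index of the best epoch seen before training is stopped."""
    best_epoch, best_score, bad_epochs = 0, float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score, bad_epochs = epoch, score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation performance has degraded
                break
    return best_epoch

# Hypothetical validation accuracies after each epoch.
history = [0.80, 0.88, 0.91, 0.90, 0.93]
print(early_stopping_epoch(history))              # 2: stops at the first drop
print(early_stopping_epoch(history, patience=2))  # 4: tolerates one bad epoch
```

A patience of 1 matches the literal wording of the text, while a larger patience avoids stopping on a single noisy evaluation, which can matter when the validation set contains only 80 images.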

Data description
In this paper, we use two popular retinopathy detection datasets to evaluate the performance of the proposed method. The first dataset (denoted as the BOE dataset) was collected by Srinivasan et al. [22] from 45 subjects at three universities, i.e., Duke University, Harvard University and the University of Michigan. It contains 723 images with AMD, 1,101 images with DME, and 1,407 NORMAL images, which were collected from 15 patients with AMD, 15 patients with DME, and 15 normal subjects. Note that each subject has one volume scan with a different number of B-scans. For the BOE dataset, the number of outputs of the softmax layer, C, equals 3, i.e., it is a three-class classification task. The second dataset (denoted as the CELL dataset) was obtained from [23]. In total, 84,484 OCT images were collected from 5,319 subjects, where the numbers of CNV, DME, DRUSEN and NORMAL images are 37,455, 11,498, 8,866, and 26,565, respectively. For the CELL dataset, C equals 4, i.e., it is a four-class classification task. Fig. 4 shows some typical OCT images with different retinopathies in the two datasets. Since the input size of VGG-16 is 224 × 224 × 3 while the original size of the OCT images varies, the first step is to resize all the OCT images to 224 × 224 × 3. This is the only preprocessing required by the proposed method. In this paper, we consider a very challenging and practical scenario where only very few labeled OCT images and a large number of unlabeled ones are used for training. Specifically, we randomly select 80 labeled OCT images, which account for around 2.5% and 0.095% of the total OCT images in the BOE and CELL datasets, respectively, for training the proposed method. Note that the training also includes 1,357 unlabeled OCT images for the BOE dataset and 83,324 unlabeled OCT images for the CELL dataset. Another 80 randomly selected OCT images are utilized for validation.
For testing, we use 924 and 1,000 OCT images in the BOE and CELL datasets, respectively. Note that we perform patient-level random selection to guarantee that the scans from the same patient appear in one set only (i.e., the training, validation or test set). Specifically, the training and validation data are from 33 and 4,686 subjects for the BOE and CELL datasets, respectively, while the test data are from another 12 and 633 subjects for the BOE and CELL datasets, respectively.
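A patient-level split can be sketched by sampling subjects first and assigning images afterwards, so that no subject contributes images to more than one set. The snippet below is a minimal illustration with synthetic identifiers (45 subjects with three B-scans each); the function name, the seed, and the mapping format are hypothetical.

```python
import random

def patient_level_split(image_to_subject, n_test_subjects, seed=0):
    """Hold out whole subjects for testing; return (train_val, test, test_subjects)."""
    subjects = sorted(set(image_to_subject.values()))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    test_subjects = set(subjects[:n_test_subjects])
    train_val = [img for img, s in image_to_subject.items() if s not in test_subjects]
    test = [img for img, s in image_to_subject.items() if s in test_subjects]
    return train_val, test, test_subjects

# Synthetic stand-in for the BOE subjects: 45 subjects, 3 B-scans per subject.
image_to_subject = {f"s{sid:02d}_b{b}": sid for sid in range(45) for b in range(3)}
train_val, test, test_subjects = patient_level_split(image_to_subject, n_test_subjects=12)

train_val_subjects = {image_to_subject[img] for img in train_val}
print(len(train_val_subjects), len(test_subjects))  # 33 12
```

The same idea applies one level down when separating the training and validation images. Splitting at the image level instead would let near-identical B-scans of one volume appear on both sides and inflate the test accuracy.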

Experimental setup
To evaluate the performance of the proposed method, we compare it with some benchmark approaches in the literature, including the SVM with Histogram of Oriented Gradients (HOG) features [22], a normal CNN [27], AlexNet [28] with and without pretraining, ResNet-18 [10] with and without pretraining, and VGG-16 [16] with and without pretraining. Note that these supervised learning methods can only use the labeled OCT images for model training, i.e., 80 labeled OCT images. We also conduct a comparison with some advanced semi-supervised deep learning methods, i.e., pseudo-labeling [17], temporal ensembling [18], and mean teacher [19]. For a fair comparison, the pretrained VGG-16 is combined with each of these semi-supervised techniques.
In order to quantify the performance of different methods, we adopt the evaluation metrics of classification accuracy, sensitivity (also known as recall), specificity, AUC and ROC curve, which are widely used in the literature [10,16,29].
The definition of overall accuracy is straightforward and can be expressed as
$$\mathrm{Accuracy}_{overall} = \frac{\text{Number of correctly classified samples}}{\text{Total number of samples}} = \frac{TP_A + TP_B + TP_C}{N} \qquad (7)$$
where N is the total number of samples, and TP_A, TP_B, and TP_C are the numbers of correctly classified samples (True Positive) for the classes A, B and C, respectively. Nevertheless, the sensitivity, specificity, AUC and ROC curve are originally defined for binary classification. In order to use these evaluation metrics for multi-class classification, we adopt the general strategy of one-vs-rest [30].
To explain the calculation of sensitivity and specificity, let us use a toy example of three-class classification (namely classes A, B and C). We can define three one-vs-rest scenarios, i.e., A-vs-rest, B-vs-rest, and C-vs-rest. Figure 5 shows the calculations of sensitivity and specificity for the scenario of A-vs-rest, denoted as Sensitivity_A and Specificity_A, B-vs-rest, denoted as Sensitivity_B and Specificity_B, and C-vs-rest, denoted as Sensitivity_C and Specificity_C. Here, we use the "micro-average" strategy for the calculation of the overall sensitivity and specificity. In particular, the reported sensitivity and specificity are averages over all the one-vs-rest scenarios, weighted by class size. The detailed calculations are
$$\mathrm{Sensitivity} = \frac{N_A\,\mathrm{Sensitivity}_A + N_B\,\mathrm{Sensitivity}_B + N_C\,\mathrm{Sensitivity}_C}{N} \qquad (8)$$
$$\mathrm{Specificity} = \frac{N_A\,\mathrm{Specificity}_A + N_B\,\mathrm{Specificity}_B + N_C\,\mathrm{Specificity}_C}{N} \qquad (9)$$
where N_A, N_B, and N_C represent the total numbers of samples in the classes A, B and C, respectively, and N = N_A + N_B + N_C. We also adopt the one-vs-rest strategy to draw the ROC curve of each method. Specifically, the one-vs-rest strategy generates C ROC curves (C equals the number of classes) for each method. The final ROC curve is then obtained by averaging over all the C ROC curves. The corresponding AUC is the area under the final ROC curve.
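The one-vs-rest calculation can be reproduced from a confusion matrix in a few lines. The sketch below uses a made-up three-class confusion matrix (rows are true classes, columns are predicted classes) and weights each class by its number of samples, as described above; the function name is hypothetical.

```python
def one_vs_rest_metrics(cm):
    """cm[i][j] = number of samples of true class i that were predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                               # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n_classes)) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        per_class.append((tp / (tp + fn), tn / (tn + fp), tp + fn))
    # Averages weighted by the class sizes N_A, N_B, N_C.
    sensitivity = sum(s * n for s, _, n in per_class) / total
    specificity = sum(sp * n for _, sp, n in per_class) / total
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total
    return accuracy, sensitivity, specificity, per_class

cm = [[8, 1, 1],     # class A (10 samples, hypothetical counts)
      [2, 16, 2],    # class B (20 samples)
      [0, 1, 19]]    # class C (20 samples)
accuracy, sensitivity, specificity, per_class = one_vs_rest_metrics(cm)
print(accuracy, sensitivity, specificity)
```

With class-size weights, the overall sensitivity reduces to (TP_A + TP_B + TP_C) / N, i.e., it equals the overall accuracy. This is consistent with the identical accuracy and sensitivity values reported for the proposed method (0.942 and 0.936).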

Comparison with benchmark approaches
The experimental results are shown in Table 1 and Table 2. Among the supervised benchmark approaches, the VGG-16 with pretraining performs best on both datasets. This indicates the effectiveness of the VGG-16 network and the pretraining scheme for the detection of retinopathy with OCT images. The semi-supervised learning algorithms further improve the performance of retinopathy detection. Moreover, the proposed semi-supervised deep learning method significantly outperforms all the benchmark approaches, with accuracies of 0.942 and 0.936, sensitivities of 0.942 and 0.936, specificities of 0.971 and 0.979, and AUCs of 0.997 and 0.993 on the two datasets. Note that we only use 80 labeled OCT images for training the proposed method. The standard deviation of the proposed method is also small, indicating its robustness. Since the SVM does not contain random components, its standard deviation is not available. To demonstrate the significance of the improvement over the benchmark approaches, the p-values of the proposed method against the other methods are also shown in Table 1 and Table 2. All the p-values are less than 0.05, which indicates that the improvement is statistically significant. We also present the training and testing time of all the approaches. Even though most of the methods require a long time (up to 5 hours) for model training, this training process only needs to be done once, which is acceptable for real applications. The testing time (for all the testing samples, i.e., 924 for BOE and 1,000 for CELL) of all the approaches is quite small. The inference time for a single OCT image is less than 10 ms for the proposed method, which is adequate for real applications.
We also implemented the supervised baseline methods with all the images in the training set labeled (i.e., 1,437 labeled OCT images for BOE and 83,404 labeled OCT images for CELL), which can be treated as an upper bound. The results are shown in Tables 3 and 4. For the BOE dataset, the proposed method (accuracy 0.942) outperforms most of the supervised baseline methods trained with all the labeled OCT images. It performs slightly worse than the VGG-16 with pretraining. For the CELL dataset, all the supervised baseline methods (except SVM) significantly outperform the proposed method (accuracy 0.936). It is worth noting that for the BOE dataset, the supervised baseline methods use 1,437 labeled OCT images for training, which is around 18 times as many labeled images as our semi-supervised learning method uses (only 80 labeled OCT images). For the CELL dataset, the supervised baseline methods use 83,404 labeled OCT images for training, which is around 1,000 times as many labeled images as our semi-supervised learning method uses. That is why all the supervised learning methods, including those without pretraining, achieve very good classification performance and significantly outperform the proposed method on this dataset. In real life, however, it is almost impossible to obtain such a huge labeled dataset for model training. Even labeling thousands of OCT images can be very challenging in resource-limited regions. That is why the proposed method can be of great significance in real applications. The ROC curves of all the approaches on the two datasets are shown in Fig. 6. The proposed semi-supervised deep learning method clearly performs the best, which is consistent with the other evaluation metrics in Tables 1 and 2. The confusion matrices of the proposed method are illustrated in Fig. 7. The NORMAL class is relatively easy to detect due to its clean patterns (see Fig. 4). Since DME has more complicated patterns, the recognition of DME is relatively challenging in both datasets.

Impact of the number of labeled OCT images
Here, we investigate the performance of the VGG-16 with pretraining and the proposed semi-supervised deep learning method with different numbers of labeled OCT images. Figure 8 shows the experimental results. Specifically, we test the two methods with 40, 60, 80, 100, 120, 160 and 200 labeled OCT images. The performance of both models improves with more labeled OCT images. If the number of labeled OCT images is too small, e.g., 40, the models may not be able to converge during training, leading to poor performance. When the number of labeled OCT images is larger than 60, the accuracies of the proposed model on both datasets are higher than 0.9, which is crucial in real applications where the number of labeled images is very limited. In all scenarios, the proposed method outperforms the VGG-16 with pretraining, which indicates the effectiveness of the proposed method.

Impacts of the hyper-parameters α and ϵ
For our proposed method, α in Equation (4) is a key hyper-parameter, which controls the contributions of the CE loss and the VAT loss. Specifically, we tested α with values of [0, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0] on the two datasets. We also investigated the impact of the hyper-parameter ϵ, which controls the magnitude of the perturbation in Equation (1), on model performance. The results are shown in Fig. 9. Note that α = 0 means that the VAT loss is not used, which reduces the method to the VGG-16 with pretraining. The performance of the proposed method first improves as α increases (i.e., as the VAT loss contributes more). This indicates the effectiveness of the VAT loss on unlabeled data for enhancing the performance of retinopathy detection. When α is too large, however, the performance of the proposed method degrades. This is because the optimization is then dominated by the VAT loss on the unlabeled data, which lacks the correct supervision from the labeled data. Therefore, a careful selection of this hyper-parameter is of great importance for the proposed method. In this paper, we have chosen α = 2 for both datasets. According to Fig. 9, a perturbation that is too small or too large also degrades the performance of the proposed method. In this paper, we have chosen ϵ = 3 for both datasets.
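The hyper-parameter selection can be sketched as an exhaustive search over (α, ϵ) pairs scored on the validation set. In the snippet below, the α grid is the one listed above, while the ϵ candidates and the scoring function `toy_validate` are placeholders invented for illustration. A real run would train the model for each pair and return its validation accuracy, and the toy scores are not results from the paper.

```python
from itertools import product

ALPHAS = [0, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]  # grid listed in the text
EPSILONS = [1, 2, 3, 4, 5]                         # hypothetical candidates

def grid_search(validate, alphas=ALPHAS, epsilons=EPSILONS):
    """Return (best_score, alpha, eps) maximizing validate(alpha, eps)."""
    return max((validate(a, e), a, e) for a, e in product(alphas, epsilons))

def toy_validate(alpha, eps):
    """Placeholder for: train with (alpha, eps), then return validation accuracy."""
    return 0.94 - 0.01 * abs(alpha - 2.0) - 0.02 * abs(eps - 3)

best_score, best_alpha, best_eps = grid_search(toy_validate)
print(best_alpha, best_eps)  # 2.0 3 for this synthetic score surface
```

Because every grid point requires a full training run, the 8 × 5 grid above would cost 40 runs per dataset. Searching the two hyper-parameters one at a time is a cheaper alternative when they interact only weakly.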

Comparison with human experts
For the CELL dataset, Kermany et al. also asked six human experts with significant experience in an academic ophthalmology center to classify 1,000 OCT images [13]. Here, we compare the proposed method with these human experts for identifying different retinopathies with OCT images. The results are presented in Fig. 10, where the proposed method trained with different numbers of labeled OCT images, i.e., 80, 160 and 200, is selected for comparison. The proposed method with only 80 labeled OCT images achieves expert level when identifying retinopathy (the lowest recognition accuracy of the experts is 0.921). With 200 labeled OCT images, the proposed method outperforms four out of six human experts, which indicates the effectiveness of the proposed method. It also suggests that the proposed automatic retinopathy detection system can be used for clinical applications.
Fig. 7 (caption fragment). The rest refer to the misclassification rates for each scenario. For example, the value at the first row and second column in the left figure indicates that 6.5% of samples with AMD have been wrongly classified as DME.
Fig. 8. The performance of the proposed method with different numbers of labeled OCT images. "ACC", "Sens" and "Spec" stand for accuracy, sensitivity and specificity, respectively.

Visualization with Grad-CAM
In order to interpret the predictions of the models, we adopt the Grad-CAM technique to show the saliency maps of the proposed method, the VGG-16 with pretraining, pseudo-labeling and mean teacher (pseudo-labeling and mean teacher are the second best models in Table 2 and Table 1, respectively). Some manually selected representative samples are illustrated in Fig. 11. The VGG-16 with pretraining often focuses on the wrong regions of the OCT images when making predictions, leading to poor performance. Generally, the semi-supervised learning methods are better at finding the key patterns. Among them, the proposed method performs the best. It can accurately find the key patterns of the OCT images when predicting different retinopathies. This explains why the proposed method performs the best for retinopathy detection. It will also give confidence to ophthalmologists when using the proposed system for the detection of retinopathy with OCT images.
Fig. 11. The visualization of VGG-16 with pretraining, pseudo-labeling, mean teacher and the proposed method by using Grad-CAM.
We need to emphasize that there are also some bad cases where the models cannot find the key patterns. These cases may lead to wrong predictions. We have manually selected two bad cases (true label is DME) from the two datasets, shown in Fig. 12. The network prediction of each method is indicated on the figure. None of the four models captures the key patterns well. In such cases, the models have a much higher chance of giving wrong predictions. For example, the mean teacher focuses on a normal area of the sample from BOE, resulting in a wrong prediction of NORMAL. Similarly, the VGG-16 with pretraining and pseudo-labeling also predict DME as NORMAL when focusing on a normal area of the OCT image from CELL. The proposed method and mean teacher predict the DME as DRUSEN, as they focus on a region that looks like drusen in the given sample from CELL.

Discussion
In this paper, we proposed a semi-supervised deep learning method for automatically detecting retinopathy with OCT images. The proposed method achieves recognition accuracies of 0.942 and 0.936 with only 80 labeled OCT images for training on two widely used datasets. Even though the benchmark approaches in the literature have shown great performance for the detection of retinopathies based on OCT images, they rely heavily on a huge number of labeled OCT images for model training, which may not be available in real-world scenarios. The proposed semi-supervised deep learning method requires only very few labeled OCT images and a large number of unlabeled ones to achieve a high detection accuracy. We also verified that if more labeled OCT images are available, the performance of the proposed method can be further enhanced.
When comparing the proposed method with human experts, we find that the proposed method trained on only 80 labeled OCT images achieves expert level in recognizing retinopathies with OCT images. By further increasing the number of labeled OCT images to 200, it outperforms four out of six experts. Note that these human experts have significant clinical experience in an academic ophthalmology center. This property of the proposed method is meaningful and practical, especially for developing countries where the number of ophthalmologists is inadequate for the diagnosis of retinopathy with OCT images.
Giving an accurate prediction alone is not enough for clinical applications. If we can also provide some evidence of how the algorithm makes certain predictions, ophthalmologists will be more confident in the prediction results. In this paper, we adopt the Grad-CAM approach to visualize the key regions (patterns) of the input OCT images that determine the final predictions. The experiments show that the proposed method is able to focus on the correct patterns when predicting retinopathies.
Overall, the proposed semi-supervised deep learning method for retinopathy detection requires only a very small number of labeled OCT images, which can be easily obtained. It achieves expert-level performance with only 80 labeled OCT images. Besides, it can provide some evidence for its predictions by finding the key patterns of the OCT images. However, the proposed method has some limitations when applied to real-world clinical applications. First of all, since only very few labeled OCT images are adopted, the class imbalance issue (e.g., the normal class may be dominant in real applications) may have a big influence on model performance.
Another key issue is the quality of these few labeled OCT images. If the quality of the OCT images cannot be guaranteed, the performance of the proposed method would suffer considerably.
In our future work, we will consider further enhancing the performance of the proposed method by using optical coherence tomography angiography (OCTA) images, which contain additional information for retinopathy detection [31,32].

Conclusion
In this paper, we proposed a semi-supervised deep learning method for retinopathy detection with OCT images. The proposed method consists of a pretrained VGG-16 network for feature learning on raw OCT images and virtual adversarial training (VAT) to incorporate a large number of unlabeled OCT images for performance improvement. In experiments, the proposed method outperforms the benchmark approaches and achieves accuracies of 0.942 and 0.936 with only 80 labeled OCT images on two popular datasets. Compared with human experts, the proposed method with 200 labeled OCT images performs better than four out of six experts. By adopting the Gradient-weighted Class Activation Mapping (Grad-CAM) technique, we also illustrate the saliency maps, which indicate the key regions of the input OCT images when making predictions. The proposed method can accurately find the key patterns in the input OCT images, which helps explain its superior performance on retinopathy detection.