Deep learning for quality assessment of retinal OCT images

Abstract: Optical coherence tomography (OCT) is a promising high-speed, non-invasive imaging modality that provides high-resolution retinal scans. However, a variety of external factors, such as light occlusion and patient movement, can seriously degrade OCT image quality, which complicates both manual retinopathy detection and computer-aided diagnosis. As such, this study first presents an OCT image quality assessment (OCT-IQA) system capable of automatic classification based on signal completeness, location, and effectiveness. Four CNN architectures (VGG-16, Inception-V3, ResNet-18, and ResNet-50) from the ImageNet classification task were used to train the proposed OCT-IQA system via transfer learning. ResNet-50, which achieved the best performance, was then integrated into the final OCT-IQA network. The usefulness of this approach was evaluated using retinopathy detection results. A retinopathy classification network was first trained by fine-tuning the Inception-V3 model. The model was then applied to two test datasets created randomly from the original dataset; one was screened by the OCT-IQA system and included only high-quality images, while the other mixed high- and low-quality images. Results showed that retinopathy detection accuracy and area under the curve (AUC) were 3.75% and 1.56% higher, respectively, for the filtered data than for the unfiltered data. These experimental results demonstrate the effectiveness of the proposed OCT-IQA system and suggest that deep learning could be applied to the design of computer-aided diagnostic systems (CADSs) for automatic retinopathy detection.

Two test datasets were adopted for retinopathy detection. One was first fed to the OCT-IQA network to eliminate unqualified images, producing a 'pure' dataset; the other was composed of acceptable and unacceptable images in a 1:1 ratio and labeled the 'mixed' set. Test results showed that retinopathy detection accuracy and area under the curve (AUC) were 3.75% and 1.56% higher, respectively, for the pure data than for the mixed data, demonstrating that image quality is a vital element in automatic retinopathy detection.

Database
We acquired 15,379 original retinal OCT images using a Zeiss Cirrus device (Carl Zeiss Meditec, Inc., Dublin, CA) from 710 patients aged 47 to 85 years and stored them as JPG files at a resolution of 1536 × 1024 pixels. A minimum of five B-scans of the retinal macular area was selected for each patient to allow thorough observation of the macula.
Data annotation was conducted after the images were collected. Two medical students manually screened the data and removed unclassifiable images (i.e., those with complete signal obstruction); 781 images were excluded in this first step. The remaining images were separately assessed by two specialists, each with more than five years of clinical experience, according to the reference standard detailed in Table 1. This evaluation included four quality categories ('good,' 'off-center,' 'signal-shielded,' and 'other') and two clinical categories ('normal' and 'abnormal'). A further 1063 images were eliminated due to a lack of consensus between the two specialists. Typical examples from each quality category are shown in Fig. 1, and representative 'normal' and 'abnormal' retina images are shown in Fig. 2. Following annotation, the database distribution was as follows: good (11,804 images; 56.27% abnormal and 43.72% normal), off-center (647 images; 17.16% unrecognizable anomaly, 34% abnormal, and 48.84% normal), signal-shielded (710 images; 36.06% unrecognizable anomaly, 27.04% abnormal, and 36.9% normal), and other (351 images; 100% unrecognizable anomaly). All poor-quality images (with both unrecognizable and recognizable anomalies) and 20% of the high-quality images, selected at random, were used to train the OCT-IQA classifier so that the numbers of high- and low-quality images were balanced.

Table 1. Criteria for the quality and clinical labels.

Quality label: Description
Good: The complete structure of the retina can be clearly observed in the image, and the signal is useful for retinopathy diagnosis.
Off-center: The retinal signal is placed too high or too low in the image, so the signal cannot be fully displayed.
Signal-shielded: Total or partial loss of the retinal signal.
Other: Poor quality under other circumstances, such as images with good signal quality but without useful signal information, or images with serious artifacts.

Clinical label: Description
Normal: The retina is healthy, without any deformation or defect.
Abnormal: The retina is unhealthy because of deformation, edema, bleeding, hiatus, or other anomalies.

The included dataset exhibited an imbalance problem, as fewer poor-quality images (in the 'off-center,' 'signal-shielded,' and 'other' categories) were available than high-quality images, primarily due to the limited sample size. This uneven sample distribution can significantly influence classification results, and the number of poor-quality images alone was too small to train a model without overfitting. As such, data augmentation was applied to resolve these issues.

Data augmentation
Data augmentation is used to produce new synthetic samples from simple transformations of original images [24]. This study included few poor-quality images, so these were augmented to compensate for their low numbers. Horizontal mirroring, rotation through random angles (±10 degrees), and a contrast enhancement defined by Eq. (1) were applied, where I denotes the destination image, i the source image, and v (= 10 in this study) the degree of contrast enhancement. Before augmentation, one hundred images were randomly selected from the original dataset for each category to test the model; the remaining samples were randomly divided into training and validation sets (scans from the same patients were kept in the same set). Augmentation was then applied to each set to produce sufficient samples while preserving the fraction of images in each category (i.e., fewer samples in the 'other' class). The final data distribution is shown in Table 2.
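The augmentation operations above can be sketched as follows. This is a minimal NumPy sketch with illustrative names; since Eq. (1) is not reproduced in this excerpt, a simple linear contrast stretch about the image mean is assumed for the contrast step, and the random-angle rotation is only indicated (it would typically use an interpolating routine such as scipy.ndimage.rotate).

```python
import numpy as np

def augment(image, v=10, angle_range=10, rng=None):
    """Generate simple augmented variants of a grayscale OCT B-scan.

    v plays the role of the contrast-enhancement degree from Eq. (1);
    the exact formula is not reproduced here, so a linear stretch
    about the mean intensity is assumed purely for illustration.
    """
    rng = np.random.default_rng(rng)
    variants = []
    # Horizontal mirroring.
    variants.append(image[:, ::-1])
    # Assumed contrast enhancement: stretch intensities about the mean.
    mean = image.mean()
    contrast = np.clip(mean + (1 + v / 100.0) * (image - mean), 0, 255)
    variants.append(contrast.astype(image.dtype))
    # A rotation through a random angle in [-10, +10] degrees would be
    # applied here as well (e.g., scipy.ndimage.rotate); omitted to keep
    # this sketch dependency-free.
    _ = rng.uniform(-angle_range, angle_range)
    return variants
```

In practice each training image would pass through these transforms until the per-category sample counts in Table 2 were reached.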

Convolutional neural network (CNN)
CNNs are a common framework used in deep learning. VGG, ResNet, and Inception are typical CNNs that have been successfully applied to medical image classification problems such as skin cancer classification [25], early diagnosis of Alzheimer's disease [26], and retinal vessel detection in fundus images [27]. A CNN is typically composed of convolutional, pooling, and fully connected layers. Convolutional layers calculate convolutions between specific kernels and the input data, after which an activation function is applied to produce a new feature map. The convolutional operation for a single output channel can be expressed as

o_{z+1}(x, y) = Σ_{w=1}^{W} Σ_{h=1}^{H} Σ_{s=1}^{S} k(w, h, s) · i_z(x + w, y + h, s),   (2)

where k is the convolutional kernel and W, H, and S represent the width, height, and depth of k. The kernel convolves the input i_z along its width and height dimensions to produce the output o_{z+1} at location (x, y). The nonlinear activation function F is the rectified linear unit,

F(x) = max(0, x).   (3)

The feature map is then sent to the pooling layer for feature selection and information filtering. Finally, in the fully connected layer, every output unit is connected to all units of the previous feature map. VGG-16, Inception-V3, and ResNet were adopted in this study, as described below. VGG-16 [28]: a plain CNN consisting of 13 convolutional layers, 5 max-pooling layers, and 3 fully connected layers. All convolutional layers in VGG-16 feature a small 3 × 3 kernel, and each max-pooling layer uses a 2 × 2 kernel to decrease the number of parameters.
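The single-channel convolution and ReLU activation described above can be written out directly. The following NumPy sketch (function and variable names are illustrative, not from the paper) computes one output feature map from the nested sums:

```python
import numpy as np

def conv_relu(i_z, k):
    """Single-channel convolution followed by ReLU.

    i_z : input feature map, shape (rows, cols, S)
    k   : kernel, shape (H, W, S) -- height H, width W, depth S
    Returns o_{z+1}, the single-channel output feature map.
    """
    H, W, S = k.shape
    assert i_z.shape[2] == S, "kernel depth must match input depth"
    out = np.zeros((i_z.shape[0] - H + 1, i_z.shape[1] - W + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Sum of elementwise products over width, height, and depth.
            out[y, x] = np.sum(k * i_z[y:y + H, x:x + W, :])
    return np.maximum(out, 0.0)  # ReLU: F(x) = max(0, x)
```

Real CNN frameworks implement the same operation with many kernels in parallel and highly optimized inner loops; this direct form is only meant to make the indexing concrete.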
Inception-V3 [29]: a deeper network with substantially fewer parameters than VGG-16. An 'Inception' module is used in this architecture, which factorizes larger convolutions into pairs of smaller asymmetric convolutions, such as 1 × 3 and 3 × 1. This reduces the parameter count, accelerates calculation, and helps prevent over-fitting.
ResNet (ResNet-18, ResNet-50) [30]: uses residual blocks with skip-connection bypasses to resolve the difficulties of training deeper networks, allowing for higher accuracy. The architecture comes in varying depths of 18, 50, 101, and 152 layers. In this study we evaluated ResNet-18 and ResNet-50, because the residual block in ResNet-50 includes adjustments that improve performance on ImageNet tasks, as shown in Fig. 3; the two networks were then tested separately on our dataset. ResNet-18 and ResNet-50 both consist of six modules, denoted conv1, conv2_x, conv3_x, conv4_x, conv5_x, and FC. Conv1 is a convolutional layer; conv2_x through conv5_x contain 2, 2, 2, and 2 residual blocks in ResNet-18 and 3, 4, 6, and 3 in ResNet-50, respectively; and FC is the fully connected layer.

A transfer learning framework for OCT-IQA
Deep CNNs are robust to shifts, rotations, and scaling. They also contain large numbers of trainable parameters, giving them a high capacity for generalization. However, large amounts of labeled data are necessary to avoid under-fitting and over-fitting, which is a challenge with medical images. Furthermore, adjustments to the training parameters are required to improve convergence.
Transfer learning is a useful method for resolving these issues and has been used in a variety of fields [9,31,32]. The basic idea of transfer learning is that features extracted by a pre-trained model can be reused in a specific classification task. The architecture proposed in this study, which consists of an OCT-IQA network and a retinopathy network, was trained using transfer learning (see Fig. 4). The workflow for this architecture is as follows.
1) The OCT-IQA network was trained to remove unqualified images, including those with off-center, signal-shielded, and other quality problems. The four CNNs introduced previously (VGG-16, ResNet-18, ResNet-50, and Inception-V3) were fine-tuned in two steps to identify the most suitable network for OCT-IQA. The first step was network initialization, in which the convolutional layers of each CNN were initialized with weights pre-trained on the ImageNet dataset by the corresponding network [33]. The second step was redesigning new top layers, because these architectures were originally designed for a 1000-category classification task: a new final layer and a softmax layer were added to each pre-trained network and retrained to recognize the specific classes. In this study, images were classified into four categories during OCT-IQA. A randomly selected test dataset was used to evaluate the four fine-tuned networks, and the architecture with the best performance was selected.
2) The OCT-IQA model was used to classify the entire dataset into unqualified images (i.e., images that would need to be retaken) and acceptable images.
3) Images with serious quality issues were removed. The remaining samples, which had been previously labeled by ophthalmologists, were used to train the retinopathy detection network to classify acceptable images into normal and abnormal categories.

Experimental setup
Four CNNs were fine-tuned during the OCT-IQA process. To reduce computational complexity, input images were resized to 299 × 299 × 3 for Inception-V3 and to 224 × 224 × 3 for VGG-16, ResNet-18, and ResNet-50. A learning rate of 0.001 and stochastic gradient descent [34] with Nesterov momentum of 0.9 were used to fine-tune the four CNNs. Training was considered complete when the cross-entropy cost function and the accuracy converged; the corresponding training and loss curves are shown in Fig. 5. This required 18 epochs for ResNet-18 and 24 epochs for the other architectures, with a batch size of 32. The training data were fed directly to the networks with pre-trained weight parameters and processed on a desktop computer with an Intel Xeon E5-2620 CPU, 32 GB RAM, and two NVIDIA GeForce GTX 1080Ti GPUs. The cross-entropy loss can be expressed as

L = − Σ_i g_i log(o_i),   (4)

where g_i is the ground truth and o_i is the predicted output.
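The cross-entropy cost above can be computed directly, as in this minimal NumPy sketch (the clipping constant is an added numerical safeguard, not from the paper):

```python
import numpy as np

def cross_entropy(g, o, eps=1e-12):
    """Cross-entropy loss: L = -sum_i g_i * log(o_i).

    g : one-hot ground-truth vector
    o : predicted class probabilities (e.g., softmax output)
    """
    o = np.clip(o, eps, 1.0)  # guard against log(0)
    return -np.sum(g * np.log(o))
```

With a one-hot target the sum reduces to the negative log-probability the network assigns to the true class, so the loss approaches zero as that probability approaches one.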

Experimental results for OCT-IQA
Images labeled 'off-center,' 'signal-shielded,' or 'other' were considered unqualified, while high-quality images were classified as acceptable. The OCT-IQA classifier was trained using five-fold cross-validation on the training and validation sets described in Table 2. The data were divided into five separate, equally stratified subsets; each subset was used in turn to evaluate the model while the remaining four were used for training. The final assessment was made on the test set. Following this five-fold cross-validation, the area under the receiver operating characteristic curve (AUC) was calculated and used to measure each model's classification ability. The model with the highest AUC for each architecture was then tested on the test dataset. The specificity (SP), recall (RE), and precision (PR) were calculated for the individual classes to assess model performance. In this four-category classification problem, true positives and true negatives were test images correctly assigned to the positive and negative classes, respectively; the positive class was the appointed class, and the negative class included all samples not in that class. False positives and false negatives were test images incorrectly assigned to the positive and negative classes, respectively. Recall is the proportion of true positives in the sum of true positives and false negatives; SP is the proportion of true negatives in the sum of true negatives and false positives; and PR is the proportion of true positives in the sum of true positives and false positives. In addition, the overall accuracy (OA), overall recall (OR), overall specificity (OS), overall precision (OP), and F1 score (F1) were calculated, as shown in Table 3. These parameters were computed using Eqs. (5)-(8), where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives.
Evaluation metrics were defined as follows:

OA = (TP + TN) / (TP + TN + FP + FN),   (5)
RE = TP / (TP + FN),   (6)
SP = TN / (TN + FP),   (7)
PR = TP / (TP + FP),   (8)

with F1 = 2 · PR · RE / (PR + RE).

Table 3. Accuracy, recall, specificity, precision, and F1 score for the different network architectures used to classify image quality into 'good,' 'off-center,' 'signal-shielded,' and 'other.' The best value for each metric is shown in bold.

These results suggest that networks with deeper layers (e.g., ResNet-18, ResNet-50, and Inception-V3) and improved architectures (e.g., residual and inception modules) outperformed networks with basic structures (VGG-16). Specifically, ResNet-50 achieved better results for the 'good' and 'other' cases, though its performance was not ideal for 'signal-shielded' or 'off-center' images. In contrast, Inception-V3 and ResNet-18 achieved better results in these two categories, suggesting that the networks may focus on different features. ResNet-50 achieved the highest OA, OR, OS, and F1 scores, while its OP was slightly lower than that of ResNet-18. ROC curves, which plot the true positive rate against the false positive rate and represent overall classification ability across the four classes, were also used to determine the most suitable architecture for OCT-IQA; curves closer to the upper-left corner indicate better performance. Figure 7 shows that ResNet-50 performed better than the other three models, achieving the highest AUC (0.9947).
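The one-vs-rest counting of TP, TN, FP, and FN described above can be sketched as follows (a minimal NumPy sketch; function and variable names are illustrative):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, cls):
    """One-vs-rest RE, SP, PR, and F1 for the appointed class `cls`.

    y_true, y_pred : arrays of integer class labels; every sample not in
    `cls` counts as the negative class, as described in the text.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))
    tn = np.sum((y_pred != cls) & (y_true != cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    re = tp / (tp + fn)  # recall
    sp = tn / (tn + fp)  # specificity
    pr = tp / (tp + fp)  # precision
    f1 = 2 * pr * re / (pr + re)
    return {"RE": re, "SP": sp, "PR": pr, "F1": f1}
```

Averaging these per-class values over the four quality categories gives the overall OR, OS, and OP figures reported in Table 3.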

As such, we selected the ResNet-50 model for OCT-IQA, with an OA of 96.25%, an OR of 97%, an OS of 98.98%, and an AUC of 99.47%.
Heat maps were generated during the training phase to determine which regions of an image received the most attention. Figure 8 shows heat maps generated from the retrained ResNet-50 model. The images in Fig. 8(a) and 8(b) were classified as off-center, with the regions above and below the retina highlighted. The image in Fig. 8(c) was classified as signal-shielded, and the regions with a signal hiatus were highlighted. The image in Fig. 8(d) was classified as 'other,' and the region around the optic disk was highlighted. These results demonstrate that the key features used in OCT-IQA (i.e., signal location and signal appearance) were successfully learned by the network.

Results of retinopathy detection
A second subnetwork was trained to demonstrate the necessity of IQA in retinopathy detection and to construct a more robust retinopathy detection model. Inception-V3 has been shown to perform well in retinopathy classification (i.e., drusen, CNV, and DME [9]); as such, we fine-tuned this architecture to train the retinopathy detection model, designed to classify images into 'normal' and 'abnormal' categories. 'Normal' indicated a healthy retina, while 'abnormal' implied the retina suffered from a retinopathy (e.g., drusen, macular edema, or neovascularization) causing retinal distortion. The retinopathy detection dataset was established by eliminating all images in the 'other' category and some images in the 'signal-shielded' and 'off-center' categories, since retinopathies in these images could not be recognized during anomaly grading. Afterward, 12,794 images consisting of 7,056 abnormal (6,644 good, 220 off-center, and 192 signal-shielded) and 5,738 normal (5,160 good, 316 off-center, and 262 signal-shielded) samples were selected from the 13,535 annotated images. The dataset was separated into training, validation, and testing sets in a process similar to that described above. Only the training data were augmented, using contrast enhancement (defined by Eq. (1)) and horizontal flipping. The final dataset consisted of 23,602 images: 22,530 training, 472 validation, and 600 test samples (200 of poor and 400 of good quality). The experimental settings (i.e., optimizer, batch size, and learning rate) were the same as those used for Inception-V3 in the first sub-network.
Five-fold cross-validation involving the training and validation sets was applied to select hyper-parameters and confirm the robustness and stability of the Inception-V3 model. Results showed only slight variability in network performance, with a mean AUC of 99.90%. The model with the highest AUC (99.96%) was selected as the final model; the corresponding five-fold ROC curves are shown in Fig. 9. We also tested the generalizability of the proposed model on the public dataset from study [9], which was captured with a Spectralis OCT device (Heidelberg Engineering, Germany) and had a different signal distribution and image size from ours. That dataset classifies images into four categories: choroidal neovascularization, diabetic macular edema, drusen, and normal. We tested our retinopathy detection model on its open test set (1,000 images, 250 from each category). As our model is a two-class model, the choroidal neovascularization, diabetic macular edema, and drusen images together formed the abnormal category, and the normal images formed the normal category. The model achieved 99.5% AUC and 99.87% sensitivity, demonstrating that it generalizes well.
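The five-fold protocol used here can be sketched with a plain index split (illustrative only; the study additionally stratified the folds by category and kept scans from the same patient in a single set):

```python
import numpy as np

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Each fold serves once as the validation set while the remaining
    folds are used for training, matching the protocol in the text.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, val
```

Training one model per fold and comparing the resulting AUCs is then what allows the best of the five candidate models to be selected as the final one.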
The effect of unqualified images on retinopathy detection was evaluated using 'pure' and 'mixed' test datasets. The test data were first fed to the OCT-IQA network to obtain quality labels. The pure set then consisted only of good-quality images (400 images, 200 from each clinical category), while the mixed set consisted of 100 poor-quality images (50 off-center and 50 signal-shielded) and 100 acceptable-quality images. Corresponding test results are shown in Table 4. Retinopathy detection accuracy for the pure dataset was 97.5%, higher than the 93.75% achieved on the mixed dataset; sensitivity and specificity were also higher for the pure dataset. ROC curves (Fig. 10) and confusion matrices (Fig. 11) were generated to evaluate model performance. The resulting AUC values were 99.74% and 98.18% for the pure and mixed sets, respectively, suggesting that classification performance was compromised by unqualified images. The confusion matrices show that the true positives and true negatives improved significantly for the pure dataset. Heat maps were also generated during testing to determine whether lesions were discovered by the network. The images in Fig. 12(a) and 12(b) are of good quality, 12(c) is off-center, and 12(d) is signal-shielded. Drusen are highlighted in Fig. 12(a), and retinal thickening and pigment epithelial detachment are evident in Fig. 12(b). Figure 12(c) shows macular edema, and retinal atrophy is visible in 12(d). The network evidently classified the retina as normal or abnormal based on the lesion, and it could reach the correct conclusion even for poor-quality images as long as the lesion could be observed. As such, lesion visibility is critical in retinopathy detection: if the retinal structure or lesion is severely shielded, the model cannot acquire sufficient information to draw accurate conclusions.
These experimental results indicate that unqualified images can negatively affect retinopathy detection. Poor-quality images often exhibit signal shielding or signal deficiency, which makes lesions difficult to identify. OCT-IQA can detect unqualified images in real time so that they can be eliminated and recaptured, which reduces processing time and helps improve the accuracy of retinopathy detection.

Conclusion and discussion
In this study, an architecture based on deep CNNs was proposed to classify the quality of OCT images and identify the presence of retinopathies. The system consisted of an OCT-IQA classifier and a retinopathy detector. Several existing CNNs, including VGG-16, ResNet-18, ResNet-50, and Inception-V3, were fine-tuned by replacing the final layers and retraining on our dataset to determine the most suitable CNN for the proposed architecture. ResNet-50 achieved the best performance in classifying OCT image quality, with an overall accuracy of 96.25%, an OR of 97.33%, and an AUC of 99.47%. We confirmed the influence of image quality on retinopathy detection by constructing a separate model, which adopted the Inception-V3 architecture and was tested using 'pure' and 'mixed' datasets. The pure dataset was screened by OCT-IQA and included only good-quality images, while the mixed dataset contained both high-quality and unqualified images. The model achieved better classification results on the data that passed OCT-IQA, with testing accuracy 3.75% higher for the pure dataset than for the mixed dataset. In summary, a trained IQA model was used to remove non-beneficial images during CADS preprocessing, and a CNN model was developed and tested that achieved excellent performance in retinopathy detection. This suggests that the models discussed above could be advantageous in the design of a computer-aided diagnostic system (CADS) for automatic lesion detection.
Furthermore, the proposed OCT-IQA model demonstrates the ability to classify images into four specified quality categories, which could in the future be applied to a hardware control system to assist the automatic reacquisition of poor-quality images.