Effect of optical coherence tomography and angiography sampling rate towards diabetic retinopathy severity classification

Optical coherence tomography (OCT) and OCT angiography (OCT-A) may benefit the screening of diabetic retinopathy (DR). This study investigated the effect of laterally subsampling OCT/OCT-A en face scans by up to a factor of 8 when using deep neural networks for automated referable DR (rDR) classification. There was no significant difference in classification performance across all evaluation metrics when subsampling up to a factor of 3, and only minimal differences up to a factor of 8. Our findings suggest that OCT/OCT-A systems can reduce the number of samples acquired (and hence the acquisition time) for a volume covering a given field of view on the retina when the scans are used for rDR classification.


Introduction
Diabetic retinopathy (DR) is a disease affecting the neuronal tissues and microvasculature at the back of the eye that can lead to vision loss [1]. Fundus photography is the current gold standard for the screening and treatment of DR, allowing clinicians and retinal specialists to detect hemorrhages, microaneurysms, drusen, and hard exudates, among other hallmarks of DR [2]. While fundus photography can capture a wide-field image of the retina, it cannot quantify the microvasculature at capillary resolution or resolve depth information [3]. Hence, fundus photography is often paired with fluorescein angiography, which visualizes the microvasculature through invasive dye injections.
Optical coherence tomography (OCT) and OCT angiography (OCT-A) allow for high-resolution visualization and quantification of the microvasculature noninvasively. OCT provides cross-sectional structural visualization of the retinal layers, which allows for the detection of DR biomarkers including retinal thinning and disorganized retinal inner layers [3][4][5]. OCT-A visualizes the blood flow in the microvasculature and facilitates the detection of foveal avascular zone (FAZ) morphology, abnormal vascular loops, neovascularization, and regions of non-perfusion [3,6], all of which are biomarkers of DR. While OCT and OCT-A may provide a complementary benefit to the screening protocol for DR, one drawback is the limited field of view (FOV) [3]. OCT acquisition can be augmented by approaches such as montaging [7] and motion tracking [8] to minimize the motion artifacts and patient discomfort that become more prevalent with longer acquisition times.
Deep learning has the potential to aid ophthalmologists with their decision-making [9,10], and an autonomous deep-learning-based DR diagnostic system has received FDA approval [11]. Autonomous artificial intelligence (AI) can alleviate the overwhelming workload faced by clinicians [12], reduce financial burden, and standardize decisions where clinicians disagree due to differing interpretations of the literature [9]. Following the FDA-approved autonomous DR diagnostic system, other

Dataset and subsampling protocol
In this study, 374 eyes from 237 unique patients (with or without diabetes) were recruited and imaged at the Eye Care Center of Vancouver General Hospital. The project protocol was approved by the Research Ethics Boards at the University of British Columbia and Vancouver General Hospital, and the experiment was performed in accordance with the tenets of the Declaration of Helsinki. Written informed consent was obtained from all subjects. A retinal specialist evaluated each patient using a 30° macula cube (25 B-scans, high-speed, ART 10) acquired with Spectralis OCT (Heidelberg Engineering Inc., Heidelberg, Germany) to exclude macular edema [33]. This was paired with at least one 200° ultra-widefield image recorded with Optomap (Daytona, Optos Inc., Marlborough, MA) for DR severity grading [33]. The retinal specialist graded DR severity based on the International Clinical Disease Severity Scale for DR [34]. The distribution of the data on the clinician-graded five-stage DR scale was: normal (156), mild non-proliferative DR (NPDR; 68), moderate NPDR (27), severe NPDR (60), and proliferative DR (PDR; 63). Further details regarding the acquisition protocol, ground truth, and inclusion and exclusion criteria are as described in our previously published study comparing perfusion parameters of different regions in the retina [33].
Image acquisition was performed using a commercially available swept-source (SS) OCT (Plex Elite 9000; Carl Zeiss Meditec, Dublin, CA) centered on the fovea. The OCT system extracts the superficial region encompassing the inner limiting membrane (ILM) and the inner plexiform layer (IPL), whereas the deep region ranges from the IPL to the outer plexiform layer (OPL); both are derived from the device-specific ILM, retinal pigment epithelium (RPE), and RPE-fit segmentations [35]. Referring to the nomenclature proposed in the literature [36], the superficial and deep en face images best correspond to the superficial (SVC) and deep vascular complexes (DVC), respectively. In this study, we extracted the 3×3mm SVC and DVC images from both the OCT structural and OCT-A volumes at their original resolution (300×300 pixels) using the Zeiss Macular Density v0.7.1 algorithm.
The data used for analysis comprised the SVC OCT-A, DVC OCT-A, and an average intensity projection (AIP) of the SVC and DVC OCT structural en face images. The AIP was generated through a pixel-wise mean of the OCT structural en face images from the two complexes. OCT structural en face images were included to capture key findings in DR such as microaneurysms, exudates, retinal thinning, and disorganized retinal inner layers [3][4][5], which these projections may capture. We have previously shown that neural networks trained on the OCT structural en face images achieved performances comparable to those trained on the OCT-A en face images [18]. The OCT and OCT-A image data were combined into a three-channel image, shown in Fig. 1 as an RGB image. This combination of the data facilitated the use of transfer learning with ImageNet weights. The OCT-A and OCT intensity en face images from two depths were combined with the intent of capturing different biomarkers of DR. Images were digitally subsampled laterally by removing B-scans from the OCT and OCT-A en face images without altering the axial resolution (A-scans), as shown in Fig. 2. In the case of subsampling by a factor of 2, every other B-scan would remain; for a factor of 3, every third B-scan would remain, and so on. Original 300×300-pixel images (10µm lateral resolution), when subsampled by factors of 2, 3, 4, 5, and 8, produced images with 150, 100, 75, 60, and 37 lateral pixels by 300 axial (A-scan depth) pixels, respectively; this is equivalent to lateral sample spacing (in the slow scan direction) of approximately 20, 30, 40, 50, and 80µm, respectively. Subsampled images were subsequently laterally upsampled, using nearest-neighbor interpolation, back to 300×300-pixel images for homogeneous input shapes and interpretable Gradient-weighted Class Activation Mappings (Grad-CAMs) for neural network attention visualization.
The neural network's performance across all subsampling factors was compared to the performance on the full-resolution images.
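The lateral subsampling and nearest-neighbor restoration described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' exact code; note that for factors that do not divide 300 evenly (e.g. 8), simple slicing retains one more B-scan than the counts quoted above.

```python
import numpy as np

def subsample_and_restore(enface, factor):
    """Laterally subsample an en face image by keeping every `factor`-th
    B-scan (rows, slow-scan direction), then restore the original height
    with nearest-neighbor interpolation for a homogeneous input shape."""
    kept = enface[::factor, :]  # e.g. 300 -> 150 rows for a factor of 2
    # Nearest-neighbor row indices mapping back to the original height
    rows = enface.shape[0]
    idx = (np.arange(rows) * kept.shape[0]) // rows
    return kept, kept[idx, :]
```

Because the restoration is nearest-neighbor, each retained B-scan is simply repeated, so no new image information is introduced by the upsampling step.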

Experimental settings
Model evaluation and training hyperparameter selection utilized nested 5-fold cross-validation. The dataset was split across five folds with each graded DR severity equally distributed across the folds. Each fold consisted of the following distribution of data: normal (31-32), mild NPDR (13-14), moderate NPDR (5-6), severe NPDR (12), and PDR (12-13). This ensured that the folds had similar representation from each severity, promoting fairness and consistency across folds, since differentiating between the severities near the decision boundary (mild NPDR and moderate NPDR) is a more difficult task than categorizing the extremes (normal and PDR). Nested 5-fold cross-validation was performed by iterating through the combinations of folds such that each fold was used at least once for training, validation, and testing, resulting in 20 models for each test. This rigorous evaluation method ensured repeatability and fair representation of neural network performances. Nested 5-fold cross-validation was also utilized for training hyperparameter selection. Before training, the less prevalent class was upsampled through random dropout, linear contrast changes, and flips using the ImgAug library to balance the dataset. Throughout training, the batches of images were further augmented through horizontal and vertical flips, rotations between [−10°, 10°], and random translations to develop a more generalizable and robust neural network.
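The nested 5-fold scheme can be made concrete with a short sketch. Assuming each ordered (validation, test) pair of distinct folds defines one model, 5 folds yield the 20 models per test described above; the function name is ours, for illustration only.

```python
from itertools import permutations

def nested_cv_splits(n_folds=5):
    """Enumerate (train, val, test) fold assignments for nested k-fold
    cross-validation: every ordered pair of distinct (val, test) folds
    yields one model, so 5 folds give 5*4 = 20 models per experiment."""
    splits = []
    for val, test in permutations(range(n_folds), 2):
        # The remaining folds form the training set
        train = [f for f in range(n_folds) if f not in (val, test)]
        splits.append((train, val, test))
    return splits
```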
Our training consisted of a two-step approach, similar to the experimental settings from our previous publication [18]. The base model, detailed in Table 1, was similarly derived from the VGG-19 architecture [37] and initialized with ImageNet weights. Two fully connected layers were appended after the base model as the classifier. First, due to our limited dataset size, we leveraged the benefits of transfer learning by freezing most of the weights in the convolutional base and training the classifier. This step utilized a cyclic learning rate decaying from 5×10⁻⁴ to 1×10⁻⁵ three times over 300 epochs. The best model was saved according to the minimum validation loss; to decrease training time, early stopping was implemented such that if the validation loss had not decreased for 30 epochs, the second stage of training would begin. In the second stage, we allowed all the layers of the best model from the first step to be trained with a lower learning rate of 5×10⁻⁵. The same callback functions from the first step were used, with early stopping executed if the validation loss had not decreased for 20 epochs. Both steps trained the neural network with a batch size of 8, the Adam optimizer, and binary cross-entropy loss.
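The text specifies a cyclic learning rate decaying from 5×10⁻⁴ to 1×10⁻⁵ three times over 300 epochs, but not the decay shape. The sketch below assumes a cosine-shaped decay within each cycle, one common choice; it could be passed to a scheduler callback such as Keras's LearningRateScheduler.

```python
import math

def cyclic_lr(epoch, lr_max=5e-4, lr_min=1e-5, cycles=3, total_epochs=300):
    """Learning rate decaying from lr_max to lr_min, repeated `cycles`
    times over `total_epochs`; a cosine-shaped decay is assumed here."""
    cycle_len = total_epochs / cycles
    t = (epoch % cycle_len) / cycle_len  # position within the current cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```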
The DNN was developed and evaluated in TensorFlow with the Keras API [38] using Python 3.6.3 on server nodes of the Canadian supercomputer "Cedar", each equipped with an NVIDIA Tesla V100-SXM2 GPU and 32GB RAM. Nested 5-fold cross-validated training, testing, and Grad-CAM visualization required less than 6 hours.

Model evaluation
Trained models were evaluated quantitatively using a wide range of metrics used and proposed in the literature, for more transparent representation and comparison across studies. Class predictions were categorized with a threshold of 0.5 on the model's probabilistic output. The model performance on the allocated test fold was evaluated for accuracy [17,18], balanced accuracy, area under the receiver operating characteristic curve (AUROC) [13,14], area under the precision-recall curve (AUPRC) [40], F1 Score [41,42], sensitivity [13,14,17,18], and specificity [13,14,17,18]. Accuracy, AUROC, F1 Score, sensitivity, and specificity have been reported in the literature for diabetes-related classification evaluation, whereas AUPRC has been proposed over AUROC for retinal disease classification using OCT as it better represents the prediction performance on unbalanced datasets [40]. Like AUROC, accuracy is a poor metric for unbalanced datasets and should either be calculated on a balanced test set or substituted with balanced accuracy. Hence, given the unbalanced nature of medical datasets, some publications omit accuracy and only report AUROC along with specificity and sensitivity [13,14]. Each evaluation metric for the 20 models from each subsampling factor in the nested 5-fold cross-validation was evaluated for a statistically significant (p < 0.05) difference in means across subsampling factors through repeated-measures ANOVA and post-hoc two-tailed t-tests. Neural network rDR classification evaluation metrics were compared across subsampling factors and to our previously published deep learning methods [18].
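The threshold-based metrics above can be illustrated with a minimal, dependency-free sketch (AUROC and AUPRC, which integrate over all thresholds, are omitted for brevity); the function and its name are ours, for illustration only.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold probabilistic outputs at 0.5 and compute the
    confusion-matrix-based metrics reported in the study."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
    spec = tn / (tn + fp) if tn + fp else 0.0   # specificity
    prec = tp / (tp + fp) if tp + fp else 0.0   # precision
    return {
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (sens + spec) / 2,  # robust to class imbalance
        "sensitivity": sens,
        "specificity": spec,
        "f1": 2 * prec * sens / (prec + sens) if prec + sens else 0.0,
    }
```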

Visual Explanations
Grad-CAM computes the gradients flowing into the last convolutional layer of a convolutional neural network (CNN), identifies regions of high importance for the classification, and visualizes their relative weight as a heatmap [43]. The resulting class activation maps are especially relevant for autonomous AI in clinical decision-making, as they allow qualitative verification of the reasoning behind a neural network's predictions. OCT and OCT-A en face images contain clinically relevant biomarkers analogous to those clinicians consider when examining fundus photography and fluorescein angiography during DR screening. Additionally, Grad-CAMs allow us to visualize the consistency between neural networks trained on images with different subsampling factors. We qualitatively evaluated the model's ability to detect regions containing biomarkers of DR and validated that, when B-scans are removed, the neural network continues to detect rDR based on the clinically relevant biomarkers found at the original sampling rate. For example, if at higher resolutions the class activation map shows the neural network focusing on the FAZ and regions of non-perfusion, while a more sparsely sampled image of the same scan yields a class activation map highlighting regions insignificant to DR, this may suggest that the input images have been sampled too sparsely.
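The Grad-CAM computation from [43] can be sketched framework-agnostically in NumPy, given the last convolutional layer's activations and the gradients of the class score with respect to them (obtaining those gradients is framework-specific, e.g. via tf.GradientTape, and is omitted here):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap [43]: channel weights are the spatial average of
    the class-score gradients at the last convolutional layer; the map is
    the ReLU of the weighted sum of feature maps, normalized for overlay.
    Both inputs have shape (H, W, C)."""
    weights = gradients.mean(axis=(0, 1))  # alpha_k via global average pooling
    cam = np.maximum((feature_maps * weights).sum(axis=-1), 0.0)  # ReLU
    if cam.max() > 0:
        cam /= cam.max()  # scale to [0, 1] for heatmap superposition
    return cam
```

In the study, the resulting map would be upsampled to the 300×300 input size and superimposed on the SVC image as a heatmap.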

Results
The DNN models trained on the laterally subsampled images were evaluated for rDR classification accuracy, balanced accuracy, AUROC, AUPRC, F1 Score, sensitivity, and specificity, and were compared to the models trained on the original-resolution images. Table 2 summarizes the mean, standard deviation, and statistically significant differences (p < 0.05) in means of the evaluation metrics across all subsampling factors. Generally, the lateral subsampling factor was negatively correlated with the classification performance of the neural network. With the early stopping criteria, it was observed that larger subsampling factors required more training epochs. The slower convergence accompanying larger subsampling factors suggests that as the biomarkers and features of DR become more difficult to identify, the neural network requires a longer training period. There were no significant differences between the original image and subsampling by a factor of 3 across all metrics. Repeated-measures ANOVA tests revealed no significant difference in means across all subsampling factors for both sensitivity and specificity. Although subsampling factors of 4, 5, and 8 were statistically worse, the evaluation metrics were comparable, and all models trained across the subsampling factors fell within one standard deviation of the models trained at the original sampling rate on every evaluation metric. Model probabilistic outputs were evaluated for the 5 DR severities to understand the effect of thresholding and provide further insight into the performance of the neural network. Figure 3 is a violin plot showing the probabilistic outputs of all 20 models for each test. Subsampling factors are represented by different colors, and the red dotted line separates the 5 DR severities into the binary stratified groups of non-referable DR and rDR.
The violin plot shows that the number of false negatives (probabilistic outputs above the probability of 0.5 and to the left of the red dotted decision boundary) was consistent across subsampling factors. In contrast, the false positives (probabilistic outputs below the probability of 0.5 and to the right of the red dotted decision boundary) were primarily errors from the models trained on heavily subsampled images. This is consistent with Table 2, in that the specificity remains relatively consistent while the sensitivity decreases with increasing subsampling factor.
The class activation maps generated from Grad-CAM are represented as heatmaps superimposed on the SVC input image to visualize the regions of importance. There was no significant difference in evaluation metrics, as shown in Table 2, between models trained on the original resolution and those subsampled up to a factor of 3. Figure 4 shows the consistency across heatmaps for correctly classified PDR and normal eyes across all subsampling factors.

Discussion
Image acquisition with OCT and OCT-A requires compromises between sampling density, FOV, and duration (with longer durations increasing motion artifacts). A contribution of this study is showing that deep neural network classification of rDR based on OCT/OCT-A acquired in the parafovea is relatively insensitive (a reduction of less than 0.05 across all evaluation metrics) to lateral subsampling by factors of up to 8, and hence to the corresponding gain in acquisition efficiency. This finding has significant implications for OCT acquisition system hardware and acquisition parameters.
The quantitative evaluation of the DNN classification performance, as shown in Table 2, found no statistically significant difference when using subsampled images up to a factor of 3. This result suggests that commercial OCT systems can use a 3× lower B-scan sampling density without a loss of neural network classification performance for rDR. Notably, neural networks trained on images with a subsampling factor of 2 showed an insignificant improvement over those trained on the originally sampled images; this difference was an order of magnitude below the standard deviation and was not significant when evaluated through two-tailed t-tests. The lower sampling density could be used to shorten the duration of acquisition. Alternatively, for a set volume acquisition duration, the distribution of the samples across the retina could be reshaped to cover a wider FOV. By imaging 3× less densely in the 'slow scan' direction (increasing the distance between B-scans), the samples could be reallocated to imaging 3× more of the retina. For example, a 300×300-pixel 3mm en face scan could be sampled sparsely to encompass approximately a 5×5mm region.
Generally, the neural network classification performances as quantified by the evaluation metrics were negatively correlated with the subsampling factor. However, the decrease in performance was only on the order of a few percent, and we speculate that this may be overcome by the benefits of capturing a wider FOV. As an example, subsampling by a factor of 8 could be paired with imaging an approximately 8.5×8.5mm FOV on the retina in the same acquisition time as the original fully sampled 3×3mm scan. Our results suggest that the central 3×3mm region could be cropped from the center of the subsampled volume, and neural networks would achieve rDR classification performance comparable to those trained on the original 3×3mm en face, with the following average changes in performance (factor of 8 minus original): accuracy (−0.021), AUROC (−0.022), AUPRC (−0.032), balanced accuracy (−0.024), F1 Score (−0.030), sensitivity (−0.044), and specificity (−0.005). It remains to be explored whether the additional features detected in the larger FOV would overcome this change in performance.
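The field-of-view arithmetic above can be checked with a one-line calculation, assuming the freed samples are redistributed isotropically over a square FOV (the function name is ours, for illustration):

```python
import math

def widened_fov(fov_mm=3.0, subsample_factor=3):
    """For a fixed acquisition time, reducing sampling density k-fold frees
    samples to cover k-fold more retinal area, i.e. a sqrt(k)-fold wider
    square field of view."""
    return fov_mm * math.sqrt(subsample_factor)
```

This reproduces the approximately 5×5mm (factor of 3) and 8.5×8.5mm (factor of 8) figures quoted in the text.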
While our study focuses on the relative performance between models trained on images with different subsampling factors, our models can also be evaluated against our previously published results [18]. Comparisons across methods evaluated on different datasets are often difficult due to differences in the ground truth and in the distribution of data; hence, we compare against our previously published work using the same dataset with different preprocessing steps. Table 3 shows the performance of our neural network trained on original and 8× subsampled images compared to our ensemble learning and standard VGG-19 approaches [18]. The classification performance in this manuscript on the fully sampled images is comparable to the results published in [18] using a similar non-ensembled VGG-19 approach, but is significantly worse than the ensemble learning approach.
We focused on the evaluation across subsampling factors rather than optimizing specifically for model performance. The performance would likely improve with our ensemble learning framework, but the added training time and model complexity were not necessary, as the comparison across subsampling factors still stands. A fairer comparison is to our previously published ensemble learning work, where we also reported a three-channel input utilizing the VGG-19 base network, with an accuracy of 0.877, sensitivity of 0.942, and specificity of 0.805, shown in Table 3. While our methods differ slightly (in this study, we evaluated using nested 5-fold cross-validation, omitted Zeiss preprocessing that included interpolation, and used a slightly smaller dataset), we report similar results. The Grad-CAMs in Fig. 4 show that the neural network consistently focuses on regions near the FAZ and regions of non-perfusion across all subsampling factors in the referable eye, whereas a non-referable eye results in Grad-CAMs scanning the entire retina for features of DR. From the class activation maps, we speculate that as the sampling density decreases, more of the weight of the prediction shifts towards general perfusion density and texture rather than regions of non-perfusion separated by well-connected vessels. While the neural network's attention shifts slightly with subsampling, the regions of interest remain consistent with the original resolution. This qualitatively reinforces that our neural network's performance and classification reasoning are not heavily impacted when training and testing on laterally subsampled OCT and OCT-A en face scans up to a factor of 8.
Future work should evaluate the true performance of an autonomous DR screening tool on wide-field OCT images and its effect on patient outcomes. We have previously shown that the capillary network outside the parafovea contains early changes from DR [33], which may benefit rDR classification. Regions containing DR-related changes in the microvasculature should be further explored, and our OCT systems could be re-evaluated to capture a wider FOV targeting hallmarks of retinal diseases to improve classification performance. Sampling over a wider FOV may capture features that allow DNNs to tackle more difficult problems, including autonomous DR prognostication and further stratified classification [44]. While machine learning methods may not be significantly impacted by the lower resolutions caused by a wider FOV, a clinician's ability to screen patients must still be a priority, and methods of reconstructing image resolution [26] should continue to be explored.
Conventional quantitative retinal imaging biomarkers, including vessel and perfusion density and FAZ metrics, have been explored extensively as clinically explainable features to predict DR severity. Therefore, it would also be interesting to explore the effect of lateral sampling on these biomarkers of DR captured by OCT and OCT-A, and to compare the corresponding effect to the neural-network-based approach. This may provide additional insight into the decision-making of the neural network and whether relevant biomarkers remain for neural-network-based feature extraction.
Although this study demonstrates a DNN's ability to detect rDR on heavily subsampled images, it is limited by the number of images in our dataset, lack of access to labeled wide-field OCT data, and lack of an external independent testing set. OCT-A is a relatively new imaging modality, and autonomous tools utilizing OCT-A are limited by the small independent datasets at each institution and a lack of widespread data sharing [40]; we mitigated these limitations by leveraging transfer learning and data augmentation.

Conclusion
DNNs have been used to aid clinicians with their decision-making and, in our case, to provide tools for rDR classification. In this report, we demonstrated no significant differences across all evaluation metrics for our automated rDR classification when subsampling up to a factor of 3, and a minimal effect up to a factor of 8. The purpose of this study was not to validate a neural network's ability to detect rDR, but to investigate its performance when the input images are subsampled. Our results suggest that OCT/OCT-A systems can sample the microvasculature more sparsely without significantly impeding our automated rDR classification tools. As a result, the additional acquisition time can be reallocated towards imaging more of the microvasculature.
Data availability. The retinal image data underlying the results presented in this paper are not publicly available at this time.