MEDnet, a neural network for automated detection of avascular area in OCT angiography

: Screening and assessing diabetic retinopathy (DR) are essential for reducing morbidity associated with diabetes. Macular ischemia is known to correlate with the severity of retinopathy. Recent studies have shown that optical coherence tomography angiography (OCTA), with intrinsic contrast from blood flow motion, is well suited for quantified analysis of the avascular area, which is potentially a useful biomarker in DR. In this study, we propose the first deep learning solution to segment the avascular area in OCTA of DR. The network design consists of a multi-scaled encoder-decoder neural network (MEDnet) to detect the non-perfusion area in 6 × 6 mm 2 and in ultra-wide field retinal angiograms. Avascular areas were effectively detected in DR subjects of various disease stages as well as in the foveal avascular zone of healthy subjects.


Introduction
Diabetic retinopathy (DR) is a leading cause of blindness [1,2]. Capillary damage from hyperglycemia causes vision loss through downstream effects, such as retinal ischemia, edema, and neovascularization. Structural optical coherence tomography (OCT) has been used to objectively guide the treatment of diabetic macular edema, however, the disease severity and treatment threshold are still largely dependent on subjective interpretation of fluorescein angiography (FA) [3]. The recently developed OCT angiography (OCTA) [4,5] provides label-free, three-dimensional images of retinal and choroidal circulation with capillary detail. Not only is it safer, faster, and less expensive than conventional dye-based angiography, OCTA provides the potential of giving clinicians objective tools for determining severity of disease by detecting and quantifying neovascularization and non-perfusion (avascular) areas.
Our prior studies [6][7][8][9] have demonstrated that the avascular area of the superficial capillary complex in the retina is an important indicator of DR stage and progression. In order to accurately measure the avascular area, OCTA flow pixels need to be classified properly in order to correctly identify regions with abnormal inter-vascular space. However, discriminating vascular signal in OCTA is a challenging task owing to the dependence of background flow signal on local tissue reflectivity and the confounding effects of eye motion [10,11].
In our previous work, we proposed a method to automatically quantify the avascular area on 3 × 3 mm 2 OCTA image of the macula [6,9]. Today, high-speed OCT systems [12] and efficient OCTA algorithms [13] have made possible the acquisition of considerably larger fields of view (6 × 6 mm 2 or more); unfortunately, larger fields of view introduce new image processing challenges to the classification of flow pixels. For instance, 6 × 6 mm 2 OCT angiograms are more likely to contain shadows caused by vitreous floaters or pupil vignetting, and are more vulnerable to shadowing effects due to lower sampling rates. Moreover, the 6 × 6 mm 2 area encompasses vasculature on two sides of the fovea (optic disc vs temporal side) in which the normal inter-vascular space is significantly different, demanding a more sophisticated detection/segmentation algorithm.
This segmentation task can be considered a pixel-wise classification problem and is amenable to machine learning approaches. Semantic image segmentation with deep convolutional net is an active research field and a lot of deep network solutions have been proposed. Fully convolutional networks [14] were proposed to transform fully connected layers in CNNs into convolutional layers, in order to convert the network output into a heat map. Because the encoding module reduces the resolution of the input by a factor of 32, it is difficult for the decoding module to produce a fine segmentation map. To solve the loss of resolution, across-layer connections have been used in fully convolutional solutions. A successful FCN called U-net [15] added a contracting path to capture context and a symmetric expanding path to identify the location of objects with precision. In order to improve the segmentation accuracy, Deeplab used atrous convolution kernels [16,17], which not only can reduce the loss of resolution but also the number of trainable parameters. Using the state of the art network structure (e.g. VGG [18], ResNet [19], and Inception [20]) as a part of semantic segmentation network can streamline the design process of network and can take the advantage of the superior performance of existing networks. For example, Segnet [21] borrowed VGG16 network structure and built an efficient semantic segmentation network.
Recently, several machine learning solutions have successfully segmented pathological areas with abnormal tissue reflectance characteristics in OCT images [22][23][24][25][26][27]. However, the metrics based on OCTA image analysis can complement OCT for earlier assessment of ocular diseases with a vascular component, such as DR. Pavle et al. had proved deep convolution network can segment the foveal microvasculature [28] on OCTA images. In this study, we developed a novel deep learning architecture, named multi-scaled encoder-decoder neural network (MEDnet), with a powerful multi-scale feature extraction capability incorporated to segment the non-perfusion area in 6 × 6 mm 2 angiograms of superficial retinal flow.

Methods
The segmentation task consists of pixel-wise classification into two classes: vascular versus avascular area. Our solution, called MEDnet, is a fully convolutional network containing a layer with multi-scale atrous convolutions of different dilation rates aimed at generating feature maps sensitive to the different scales of non-perfusion. This is necessary because the size of avascular areas can be variable and because pixels contained in them might encounter noise in their vicinity, potentially confounding the classification process. The inclusion of this type of layer also helps to reduce the size of the network and, consequently, the number of model parameters to only 890,000.

Data acquisition
OCTA scans were acquired over a 6 × 6 mm 2 region using a 70-kHz OCT commercial AngioVue system (RTVue-XR; Optovue, Inc.) centered at 840nm with a full-width half maximum bandwidth of 45nm. Two repeated B-scans were taken at each of 304 raster positions and each B-scan comprised 304 A-lines. The commercial version of the splitspectrum amplitude decorrelation angiography (SSADA) algorithm [13] was used to calculate OCTA data. Then, the retinal layers ( Fig. 1 (A)) and the boundaries of retinal vascular plexuses ( Fig. 1(B)) were segmented using a graph search method [29]. The reflectance ( Fig.  1(C-D)) and angiographic images ( Fig. 1(E-F)) of the superficial vascular complex (SVC) angiogram were obtained by average and maximum projection of the corresponding volumetric data within the slab of interest, respectively.

Network architecture
The architecture of the proposed network is illustrated in Fig. 2(A). It can be divided into two parts, encoder and decoder. In the encoder section, a multi-scale block formed by three atrous convolutions with different dilation rates ( Fig. 2(B)) was employed to extract multi-scale features from input images. The outputs of the blocks containing the atrous convolutions were concatenated across the depth dimension into a single tensor before being fed to the next layer. After that, each "Conv block" consisted of successive convolutions with 3 × 3-pixel kernel followed by a batch normalization stage and a max pooling operation. The batch normalization was applied to accelerate deep network training and reduce overfitting. The role of the convolution blocks (conv_2 and conv_3) was to encode the image whereas dConv_5, dConv_6 and dConv_7 made up the decoder. The decoder blocks received the outputs of encoder blocks through across-layer connections that allowed the resolution of the output to be preserved and stabilized the training phase [15,30]. A Sigmoid activation function was used in the output layer for pixel-wise classification.

Network parameters
The parameters of each layer are listed in Table 1. Convolutional layers had kernels of size 3 × 3 pixels except for the 1 × 1 pixel convolution operations included in the atrous convolution block with the purpose of reducing the depth of the output, and hence, the computational cost. The default atrous convolution itself had a 3 × 3-pixel kernel size. The dilation rate n meant that n-1 zeros have been padded in between the rows and columns of the original filter ( Fig.  2(B)).

Training data
The training data consisted of en face angiograms of the SVC (Fig. 3(A)), OCT reflectance of the same slab ( Fig. 3(B)), and the corresponding manually segmented non-perfusion binary map ( Fig. 3(C)). During the imaging process, occlusion of the back-scattered signal by anterior objects (eyelashes, vitreous floaters, pupil vignetting) might cause loss of the flow signal at the corresponding position and this may be responsible of pixel misclassification. In order to prevent the potentially confounding impact of shadowed areas, we incorporated the corresponding OCT reflectance images in the training stage. Three expert graders manually delineated non-perfusion area maps, and the ground truth maps were generated by the pixelwise vote on the three manually labeled maps. In order to alleviate the limitations associated with a small training data set, data augmentation techniques [16,17,31] (flipping, noise addition, and rotation) were used to increase the amount of training data.

Loss function and optimizer
The loss function used in training stage was the mean square error (Eq. (1)) with L2 regularization loss (Eq. (2)). Mean square error can provide the distance between the actual label and the predicted value whereas L2 regularization loss can measure the scale of the model and avoid overfitting [31].The total loss is the sum of the mean square error and L2 regularization loss (Eq. (3)).
where E is the mean square error, N is the number of samples in a training batch, y is the label,  y is the predicted value, w is weight factor of the model, p is the total number of weight factor of the model, R is L2 regularization loss and T is the total loss. In the training phase, we used the stochastic gradient descent optimizer (SGD) with an exponential decay learning rate (Eq. (4)) to optimize the total loss.
where t is the training step (250 per epoch), t l is the learning rate of t -th training step, a is the decay factor, l is the initial learning rate and s is the step decay responsible for reducing the learning rate.

Data set
To train MEDnet we collected OCTA volume data from 76 healthy eyes and 104 eyes with DR, for a total of 180 en face angiograms of SVC (304 × 304 pixels). The DR eyes were arranged by disease severity into three sub-groups, severe DR (include severe nonproliferative diabetic retinopathy (NPDR) and proliferative diabetic retinopathy (PDR)), mild to moderate NPDR, and diabetes without retinopathy ( Table 2). These images were annotated for the avascular area by three expert graders. We randomly divided the data set into two groups, 140 samples in the training set and 40 samples in the test set. Both training set and test set have same disease severity distribution. After application of randomized data augmentation operations (Gaussian noise (mean = 0, sigma = 0.5), salt and pepper (salt = 0.001, pepper = 0.001), horizontal flipping, vertical flipping and 180° rotation) the training set was increased to 750 images. During the training phase, 10% of the images were isolated for cross-validation.

Implementation
MEDnet was implemented in Python 3.6 with Keras (Tensorflow-backend) and run on a PC with an Intel i7 CPU, GTX 1080Ti GPU, and 32G RAM. The hyper-parameters are specified in Table 3. We stop training when the accuracy rate became stable in the learning curve, and found that the network can achieve the best generalization performance by the 15-th training epoch. Three samples per training batch were suitable for the memory space available in our hardware. We used a large initial learning rate (L = 0.1) to acquire a high convergence speed, a decay factor a = 0.9 and a step decay s = 200 to obtain a smooth decline in learning rates.

Performance evaluation
To evaluate the performance of MEDnet we applied the trained model on the test set. Several factors can affect the performance of the network, principally a low OCT signal strength index (SSI) and the severity of the disease. To evaluate these dependencies we separated the test set into two groups of twenty eyes each. In the first group we divided images into 4 different sub-groups with different SSI ranges; each group contained five images ( Table 3). The other group was arranged by disease severity into another four sub-groups of five scans, containing healthy control subjects, diabetes without retinopathy, mild to moderate NPDR, and severe DR respectively. Since the output of MEDnet consists of a probability map with a value ranging from 0 to 1, a threshold of 0.5 is set to represent the non-perfusion area from the background to compare it with the manual avascular area map. The training phase takes less than 30 minutes on a single NVidia 1080ti GPU, and segment one image takes 2.5 seconds on Intel Core i7 CPU. The accuracy, precision, recall and F1-score (Eq. (5)) were evaluated in Table 4 where TP are true positives (as in correctly predicted non-perfusion area pixels), TN are true negatives, FP are false positives and FN are false negatives. The evaluation shows that classification accuracy, precision (the ability to not classify normal areas as diseased), and F1-score deteriorated for high disease severity and low SSI, which was expected. The recall was very close to one, indicating excellent sensitivity, as almost all of the avascular area pixels were detected. The peculiarity that precision was lower than recall indicated that the inaccuracies found were mostly caused by the avascular area size being overestimated with respect to the ground truth. Because the network did not perform equally in avoiding false positive and false negative pixels, the F1-score was a better metric to describe network performance as it took both observations into consideration. The avascular area in the healthy controls is concentrated in the macular area ( Fig. 4 (A4,  B4)), while in the DR groups, there are more avascular areas outside the FAZ and randomly distributed over the SVC angiograms ( Fig. 4(C4, D4)). Therefore, the cumulative classification error on the severe DR group was larger than for healthy controls, as it was more likely to exhibit mismatch with the subjective manual segmentation (Table 4). With regards to data with different signal strengths, our method could achieve good accuracy for all scans within the range of SSI values recommended by the manufacturer (SSI>55) but better accuracy for high quality scans (Table 4). This is owing to the fact that low quality scans have a larger prevalence of artifacts causing artificially high OCTA signal (such as in motion artifacts) or signal loss due to pupil vignetting (Fig. 4 (D2-D3)). Moreover, the low quality scans usually exhibited deteriorated vascular integrity and might have biased the classification towards larger avascular areas.

Performance on wide field of view OCTA
In a disease like DR it is useful to gain access to the largest available field of view (FOV). Although the FOV of a single OCTA scan is limited by hardware capabilities, software registration and stitching of images acquired in different retinal locations can assist to generate ultra-wide FOV OCTA. For this purpose, a DR subject was imaged on the optic disc, macula, and temporal retina using a wide-field OCTA prototype housed at the Center for ophthalmic optics and lasers of Oregon Health and Science University and reported previously [12]. For wide-field OCTA, 800 OCT B-scans were acquired at 400 positions over an area of 8 × 10 mm 2 (vertical priority) and the SSADA algorithm was used to generate the angiograms. Detection of the avascular area was applied on the 8 × 10 mm 2 images and they were later rigidly montaged to represent an area of 20 × 10 mm 2 (Fig. 5). The network was able to detect the avascular area apparent to a human grader-despite the large prevalence of noise and without having been trained for images from this OCT instrument-over a considerably larger FOV.

Discussion
We designed a deep convolutional neural network named MEDnet for automated quantification of retinal non-perfusion in DR using OCTA in 6 × 6 mm 2 macular angiograms of the SVC. To the best of our knowledge this is the first deep learning solution used to segment pathological areas in the retina using OCTA. The network uses a multi-scale block with atrous convolutions to enhance the multi-scale feature extraction capabilities [16,17] and across-layer connections to preserve lateral resolution. The features of retinal vasculature in OCTA en face images are very complex and difficult to describe using traditional computer vision methods. However, deep convolutional networks have strong feature representation capabilities. Before us Pavle et al. [28] had used deep learning based method to successfully segment the foveal microvasculature on OCTA en face images. In this work, MEDnet proved deep learning can generalize well to the complex vasculature of OCTA images and accurately localize the regions with loss of perfusion.
The experimental results indicated that MEDnet has good performance (F1-score>80%) for scans of different disease severity and image quality. Although the performance is satisfying, it should be interpreted with care, as the manual segmentation is done using subjective criteria. The threshold that determines whether an inter-vascular space is an avascular area is arbitrary. Moreover, owing to the complexity of retinal angiograms and the amount of detail available, expert graders are unlikely to segment the whole extent of the avascular area. For this reason, the area calculated by the network tends to be larger than the area delineated manually.
Another limitation of this method is related to the confounding factor introduced by low OCT signal artifacts. Since both angiographic and reflectance en face images are fed to the network, it has the ability to identify OCTA signal loss owing to vignetting and vitreous floaters and prevent their classification as avascular areas (Fig. 6(A1-A4)). In some cases, the avascular area could extend to areas affected by vignetting (Fig. 6(B1-B4)). Potential misclassification can occur there, which it is difficult to evaluate, given that the difference between vignetting and avascular areas are not very evident to human graders either. Although some work has been done in reflectance-adjusted thresholding [10,32] and boosting of OCTA signal underneath occluding material [33], the OCT signal level at which OCTA data is irretrievable is still unknown. A mathematical model is still necessary to identify these areas and further exclude them from analysis. Such model could be constructed from scans acquired from healthy eyes and containing local shadows from vitreous floaters or pupil vignetting, but with signal strength above the threshold recommended by the manufacturer.
Besides optical signal occlusion, other artifacts are caused by microsaccadic eye motion during scanning. These artifacts are very common in OCTA [11] and are apparent in en face angiograms as continuous lines crossing entire B-scans. As expected, microsaccadic artifacts can confuse the network into classifying these pixels as vascular, owing to their large OCTA signal level. (Fig. 6(C1-C4)). The data presented in this manuscript corresponds to the superficial vascular complex. However, DR effects on ocular circulation are not restrained to superficial retinal flow. In fact, the capillaries forming the intermediate and deep vascular plexuses have been reported to be strongly affected by DR disease progression [8]. Integrated with projection-resolved (PR) OCTA [34], there is potential in the future to extend the algorithm's capabilities to all three retinal plexuses found in the macular region.

Conclusions
In summary, we have reported a deep learning solution for the segmentation of avascular areas in the retina of DR eyes using OCTA. The network could classify pixels with confidence owing to access to multi-scale context and preservation of the lateral resolution. The inclusion of atrous convolutions with different dilations allowed it to generate features with different receptive fields without increasing the computational load. Consequently, the multi-scale feature maps offered more accurate decision making in the classification process, despite the prevalence of noise in avascular areas. Moreover, the excellent performance on ultra-wide field OCTA highlights the potential clinical applications of this deep learning configuration for the early detection and progression assessment of DR.