Semi-supervised deep learning based 3D analysis of the peripapillary region

Optical coherence tomography (OCT) has become an essential tool in the evaluation of glaucoma, typically through analyzing retinal nerve fiber layer changes in circumpapillary scans. Three-dimensional OCT volumes enable a much more thorough analysis of the optic nerve head (ONH) region, which may be the site of initial glaucomatous optic nerve damage. Automated analysis of this region is of great interest, though large anatomical variations and the termination of layers make the requisite peripapillary layer and Bruch's membrane opening (BMO) segmentation a challenging task. Several machine learning-based methods have been proposed for retinal layer segmentation, and a few for the ONH region, but they typically depend on either heavily averaged or pre-processed B-scans or a large amount of annotated data, which is tedious and resource-intensive to produce. We evaluated a semi-supervised adversarial deep learning method for segmenting peripapillary retinal layers in OCT B-scans to take advantage of unlabeled data. We show that the use of a generative adversarial network and unlabeled data can improve the performance of segmentation. Additionally, we use a Faster R-CNN architecture to automatically segment the BMO. The proposed methods are then used for the 3D morphometric analysis of both control and glaucomatous ONH volumes to demonstrate their potential for clinical utility.


Introduction
Glaucoma is a leading cause of irreversible blindness worldwide [1] and the second most common cause of blindness in the developed world [2]. It is characterized by the degeneration of retinal ganglion cells and the loss of their axons [3], which manifests as narrowing of the neuroretinal rim and structural remodeling of the optic nerve head (ONH) [4]. Current clinical indices for evaluating the progression of glaucomatous optic neuropathy include assessment of the cup-to-disc ratio and neuroretinal rim area in fundus photographs, visual field testing, and retinal nerve fiber layer (RNFL) analysis using optical coherence tomography (OCT) images.
Although conventional RNFL analysis is limited to sectoral averaging of RNFL thickness and the circumpapillary thickness profile, recent advances in OCT imaging have enabled the acquisition of high-resolution 3D images from which morphometric measurements and new biomarkers based on anatomical landmarks can be derived. One such landmark, the Bruch's membrane opening (BMO), has been shown to be a more reliable basis for neuroretinal rim measurements than the clinically identified disc margin. Glaucomatous eyes in this study were characterized by glaucomatous appearance of the optic nerve, glaucomatous visual field abnormalities, and peripapillary SD-OCT analysis. In addition, the severity of glaucomatous visual field loss was quantified by visual field mean deviation (MD) values.
As the purpose and parameters were different for each of the previous studies from which this dataset was generated, the availability of manually corrected segmentations differed for each volume set, ranging from no segmentations, to layer segmentations only, BMO segmentations only, or both layer and BMO segmentations. A breakdown of the participants' demographics for each dataset is shown in Table 1.

OCT volume acquisition and processing
Details on the OCT system used in this study have been previously published [29]. In brief, a custom-built swept-source OCT system with a center wavelength of 1.06 µm and a 100 kHz sweep rate was used to image the ONH region. The acquired three-dimensional (3D) images consisted of 400 B-scans, each with 400 A-scans and 1024 pixels per A-scan. The imaged region in physical space spanned an axial depth of 2.8 mm and a square area of 5 × 5 to 8 × 8 mm², depending on the axial length of the eye and the scan angle. The resulting voxel dimensions were 2.7 µm in the axial direction and ranged from 12.5 to 20 µm in the lateral direction. Axial displacement caused by involuntary axial eye motion during image acquisition was corrected using cross-correlation between adjacent frames, which were subsequently cropped to 640 pixels in the A-scan direction. Each B-scan used as input to the network was averaged with the preceding and following 5 B-scans. Additionally, three-dimensional bounded variation smoothing was applied to the volumes used for BMO segmentation in order to reduce the effect of speckle while preserving and enhancing edges.
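The cross-correlation-based axial motion correction described above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; the function names and the `max_shift` bound are assumptions.

```python
import numpy as np

def axial_shift(bscan_a, bscan_b, max_shift=50):
    """Estimate the axial (pixel) shift of bscan_b relative to bscan_a
    by cross-correlating their mean A-scan intensity profiles."""
    # Collapse each B-scan (depth x width) to a zero-mean axial profile.
    prof_a = bscan_a.mean(axis=1) - bscan_a.mean()
    prof_b = bscan_b.mean(axis=1) - bscan_b.mean()
    # Full cross-correlation; the peak location gives the relative lag.
    corr = np.correlate(prof_a, prof_b, mode="full")
    lag = int(np.argmax(corr)) - (len(prof_b) - 1)
    return int(np.clip(lag, -max_shift, max_shift))

def align_volume(volume):
    """Shift each B-scan axially to match the previous, already aligned one."""
    aligned = [volume[0]]
    for bscan in volume[1:]:
        # A positive displacement of the current frame yields a negative lag,
        # so rolling by the lag moves the frame back into alignment.
        aligned.append(np.roll(bscan, axial_shift(aligned[-1], bscan), axis=0))
    return np.stack(aligned)
```

In practice subpixel peak interpolation and a bulk-motion model would refine this, but the peak of the frame-to-frame correlation is the core of the technique.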

Ground truth labels
Automated layer segmentations of the inner limiting membrane (ILM), the posterior boundary of the RNFL, Bruch's membrane (BM), Bruch's membrane opening, and the choroid-sclera boundary (CS boundary) were generated in 3D using a graph-cut algorithm [15,31]. The area of the automated segmentation result outside of the optic cup was examined and corrected by trained research engineers in Amira (version 5.2; Visage Imaging, San Diego, CA, USA) or ITK-SNAP (version 3.2).
The BMO, defined as the termination point of the high-reflectance BM/retinal pigment epithelium (RPE) complex, was segmented on 80 radial slices extracted from the smoothed volume, intersecting at the approximate center of the BMO and spaced at a constant angle of 2.25°. The ONH is a relatively radially symmetric structure, and radial slices provide a more consistent cross-sectional view of the BMO compared to the raster scan pattern in which the volumes were acquired. For input into the BMO Segmentation Network, the BMO was considered to be the 50 × 50 pixel box centered on the manual BMO segmentation points. Example segmentations of the 4 layer boundaries and the Bruch's membrane opening are shown in Fig. 1.
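Radial slice extraction of this kind can be sketched with nearest-neighbour sampling in numpy. This is an illustrative reconstruction rather than the authors' code; the interpolation scheme and axis conventions are assumptions. Note that 80 slices over 180° gives the stated 2.25° spacing, since each slice spans the full diameter.

```python
import numpy as np

def extract_radial_slices(volume, center, n_slices=80):
    """Sample radial B-scans from a raster-acquired OCT volume (z, y, x),
    rotating about an axial line through `center` = (y, x)."""
    depth, height, width = volume.shape
    cy, cx = center
    # Largest radius that keeps every sample inside the volume.
    max_r = min(cy, cx, height - 1 - cy, width - 1 - cx)
    radii = np.arange(-max_r, max_r + 1)      # samples along the slice
    slices = []
    for i in range(n_slices):
        theta = i * np.pi / n_slices          # 180/80 = 2.25 degree spacing
        ys = np.clip(np.round(cy + radii * np.sin(theta)).astype(int), 0, height - 1)
        xs = np.clip(np.round(cx + radii * np.cos(theta)).astype(int), 0, width - 1)
        slices.append(volume[:, ys, xs])      # nearest-neighbour sampling
    return np.stack(slices)                   # (n_slices, depth, 2*max_r + 1)
```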

Architecture
A Pix2Pix GAN [32] based approach was used for the layer segmentation. For the generator network, we used a U-Net based segmentation network [33] comprising a contracting and an expanding path connected by skip connections, as denoted in Fig. 2. The contracting path contained 4 blocks, each consisting of a 2D convolutional layer (3 × 3), a dropout layer, a 2D convolutional layer (3 × 3), and a max pooling layer. Similarly, the expanding path contained 4 blocks, each consisting of a 2D transpose convolutional layer (2 × 2, stride 2 × 2) concatenated with the corresponding layer from the contracting path, a 2D convolutional layer (3 × 3), a dropout layer, and a 2D convolutional layer (3 × 3). To compare the effect of semi-supervised learning on a standard U-Net and on a GAN, the output layer was slightly different for each network. For the U-Net implementation, the output layer was a 1 × 1 convolutional layer with 5 filters, corresponding to the number of regions segmented, and a sigmoid activation; pixel-wise classifications were made based on the highest probability. For the GAN implementation, to keep the output layer differentiable, it was implemented as a 1 × 1 convolutional layer with 1 filter and a linear activation, where the output is integer encoded to the number of regions segmented. The discriminator was implemented as a PatchGAN [32]; its output is a single feature map of manual/automated predictions, corresponding to a patch size of 16 × 16 on the original image, which is then averaged to give a single score.
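The difference between the two output encodings can be illustrated with a toy numpy example. The array values here are made up: `probs` stands in for the 5-filter sigmoid output of the U-Net head, and `raw` for the single-filter linear output of the GAN generator head.

```python
import numpy as np

# U-Net head: 5 per-region probability maps (toy 2x2 image, regions 0..4);
# pixel-wise classification takes the channel with the highest probability.
probs = np.zeros((2, 2, 5))
probs[0, 0, 2], probs[0, 1, 0] = 0.9, 0.8
probs[1, 0, 4], probs[1, 1, 1] = 0.7, 0.6
labels_unet = probs.argmax(axis=-1)                     # [[2, 0], [4, 1]]

# GAN generator head: one linear channel whose value is the integer-encoded
# region index; continuous predictions are snapped to the nearest valid label.
raw = np.array([[0.2, 1.7], [3.9, 2.4]])
labels_gan = np.clip(np.round(raw), 0, 4).astype(int)   # [[0, 2], [4, 2]]
```

The integer-encoded head keeps the generator output a single differentiable map that the PatchGAN discriminator can consume directly, at the cost of imposing an artificial ordering on the region labels.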

Training
Data was augmented during training with random horizontal flips, cropping from 0 to 10%, linear contrast stretching from 0.75 to 1, and rotations from -20 to +20 degrees. The Adam optimizer with a learning rate of 1e-4 was used, and the loss function was categorical cross-entropy. Additionally, a batch size of 1 was used, and the maximum number of epochs was set to 50 with a callback set to stop training if the validation loss had not improved in 5 epochs. To test the effects of semi-supervised learning on the layer segmentation, the GAN was first trained in a fully supervised fashion on the available labeled dataset. Subsequently, the network was fine-tuned using a ratio of 10 supervised images to 1 pseudo-labelled image. When training the GAN using the unlabeled data, only the generator weights were updated to preserve the integrity of the discriminator, and the predicted output of the generator network was rounded to the nearest integer before being used as the pseudo-label. Training time was an average of 4 hours for the U-Net architectures, 20 hours for the GAN, and 28 hours for the pseudo-labeled GAN.

Architecture
The architecture for the BMO segmentation network was based on the Faster R-CNN [34] architecture. A ResNet 50 backbone was used, and the ImageNet pretrained weights were loaded. The output feature map was then used as an input for a region proposal network (RPN). Each point on the feature map is considered an anchor, and as the OCT radial frames were approximately the same size and all ground truth inputs were 50 × 50 pixel squares (as mentioned in Section 2.3), we only used a 1:1 ratio anchor of size 50 × 50. The RPN consisted of a 3 × 3 convolutional layer connected to two 1 × 1 convolutional output layers for classification and box-regression.
After regions have been proposed, region of interest (ROI) pooling is performed. The output is then fed through three 3 × 3 convolutional layers, averaged with a 7 × 7 filter, and flattened. The final step is a softmax function for classification and a linear regression to refine the boxes' locations. A high-level overview of this network is shown in Fig. 3.

Training
For training, all anchors are separated into BMO and non-BMO patches based on their Intersection over Union (IoU) with the ground truth. Anchors that overlap a manual segmentation box with an IoU larger than 0.5 are considered BMO, and anchors with an IoU less than 0.1 are considered non-BMO. A mini-batch of 256 of these anchors was used, equally split between the BMO and non-BMO classes. Non-maximum suppression is applied to ensure that the proposed regions do not overlap.
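The IoU computation and anchor labelling above can be sketched as follows; this is a minimal Python illustration, with the box format (x0, y0, x1, y1) and function names being assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_box, pos_thresh=0.5, neg_thresh=0.1):
    """Label each anchor: 1 = BMO, 0 = non-BMO, -1 = ignored.

    Anchors with IoU between the two thresholds are neither clearly BMO
    nor clearly background, so they are excluded from the mini-batch.
    """
    labels = []
    for a in anchors:
        v = iou(a, gt_box)
        labels.append(1 if v > pos_thresh else (0 if v < neg_thresh else -1))
    return labels
```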
The Adam optimizer with a learning rate of 1e-5 was used to train the RPN and classifier layers (labelled FC Layers in Fig. 3). The loss function was defined as the sum of the classification and bounding box regression losses. The classification loss was the log loss over the BMO and non-BMO classes. The box regression loss was the smooth L1 loss for the coordinates. Three-fold cross-validation was performed on this dataset. Training time was 3 hours on average. Effort was made to keep the glaucoma/control ratio equal across the three training sets while maintaining separation of subjects so that no subject appeared in both the training and test sets.

Clinical parameters
Four boundaries were extracted using the layer segmentation network: the inner limiting membrane, the posterior boundary of the RNFL, the posterior boundary of the BM, and the choroid-sclera boundary. To extract these boundaries, the segmentations for each B-scan were stacked back into a volume. The largest 3D connected component corresponding to each tissue was assigned to that tissue, and holes within the connected component were filled by the corresponding tissue. The boundaries were taken to be the first pixel in the axial direction for each tissue. Three shape characteristics were measured using these extracted layers and BMO points as previously described [18]: RNFL thickness, choroidal thickness, and BMO area.
Nerve fiber layer thickness was measured at each pixel of the posterior RNFL surface as the closest distance to the ILM surface. Similarly, choroidal thickness was measured at each pixel of the posterior CS boundary as the closest distance to the BM surface. For statistical analysis, the thickness measurements were averaged over an elliptical annulus, inwardly bounded at 0.75 mm from BMO and outwardly bounded at 1.75 mm from BMO. This provided a level of anatomical consistency in averaging measurements over multiple eyes with different image and BMO sizes.
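A minimal numpy sketch of the closest-distance thickness measurement and the annulus averaging follows. It is illustrative only: the distance search is brute force, the annulus here is circular rather than the elliptical annulus scaled from the BMO described above, and all names and pixel sizes are assumptions.

```python
import numpy as np

def closest_distance_thickness(top, bottom, dx=1.0, dz=1.0):
    """Thickness at each (y, x) of `bottom` as the closest 3D distance to
    the `top` surface. Surfaces are height maps (axial position per pixel);
    dx/dz are lateral/axial pixel sizes. Brute force, fine for small grids."""
    h, w = bottom.shape
    ys, xs = np.mgrid[0:h, 0:w]
    top_pts = np.stack([ys.ravel() * dx, xs.ravel() * dx, top.ravel() * dz], axis=1)
    thick = np.empty_like(bottom, dtype=float)
    for i in range(h):
        for j in range(w):
            p = np.array([i * dx, j * dx, bottom[i, j] * dz])
            thick[i, j] = np.min(np.linalg.norm(top_pts - p, axis=1))
    return thick

def annulus_mean(values, center, r_in, r_out, dx=1.0):
    """Average `values` over an annulus between r_in and r_out from `center`."""
    h, w = values.shape
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot((ys - center[0]) * dx, (xs - center[1]) * dx)
    mask = (r >= r_in) & (r <= r_out)
    return values[mask].mean()
```

For flat parallel surfaces the closest distance reduces to the axial separation, which is a useful sanity check; on tilted anatomy it gives a smaller, orientation-independent thickness than a purely axial measurement would.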
To quantify the BMO shape, segmented points on the radial frames were first transformed back to the volume scans. Erroneous points were then eliminated by removing any segmentations more than one standard deviation away from the mean axial position of the segmented BMO points. An ellipse was then fitted to the segmented BMO points by first finding the best-fit plane using principal component analysis (PCA) and fitting an ellipse to the projection of the BMO points on the plane by least-squares criterion. Bruch's membrane opening area was calculated from the fitted ellipse.
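The plane fit, projection, and ellipse-area computation can be sketched in numpy as follows. This is an illustrative reconstruction: the conic parameterization is one standard least-squares choice, not necessarily the authors', and the area formula is the standard result for an ellipse expressed as a conic.

```python
import numpy as np

def bmo_ellipse_area(points):
    """Fit a plane to 3D BMO points via PCA, project the points onto it,
    least-squares fit an ellipse (as a conic) and return the ellipse area."""
    centered = points - points.mean(axis=0)
    # PCA: the two leading principal axes span the best-fit plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    uv = centered @ vt[:2].T                        # in-plane 2D coordinates
    x, y = uv[:, 0], uv[:, 1]
    # Conic fit: A x^2 + B xy + C y^2 + D x + E y = 1 (linear least squares).
    M = np.column_stack([x * x, x * y, y * y, x, y])
    A, B, C, D, E = np.linalg.lstsq(M, np.ones(len(x)), rcond=None)[0]
    F = -1.0
    Q = np.array([[A, B / 2, D / 2],
                  [B / 2, C, E / 2],
                  [D / 2, E / 2, F]])
    A33 = Q[:2, :2]
    # Area of the conic's ellipse: pi * |det Q| / det(A33)^(3/2).
    return np.pi * abs(np.linalg.det(Q)) / np.linalg.det(A33) ** 1.5
```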

Analysis
The Dice similarity coefficient was used to measure the spatial overlap between the manual and automated layer segmentation. It is defined between 0 and 1, where 0 represents no overlap and 1 represents complete overlap. The Dice similarity coefficient was calculated for each tissue as Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|), where X denotes the set of pixels corresponding to the tissue in the manual segmentation and Y denotes the set of pixels corresponding to the tissue in the automatically segmented image. Mean average precision (mAP) was used to measure the accuracy of the BMO detection network. Average precision was calculated as AP = Σ_n (R_n − R_{n−1}) P_n, where P_n and R_n are the precision and recall at the n-th threshold. A prediction was considered positive if the IoU ≥ 0.5. Clinical parameters for the volumes which had both BMO and layer manual segmentations are presented in a table including the mean and standard deviation for both manual and automated measurements. Paired, two-tailed Student's t-tests were run to compare the means of the parameters. Bland-Altman plots were also used to evaluate the agreement between manual and automated methods. The differences between manual and automated measures were plotted against the average of both measures. The mean and standard deviation (SD) of the differences, the mean of the absolute differences, and 95% limits of agreement (± 1.96 SD) were calculated. Statistical significance was set at P < 0.05 for all tests performed.
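The per-tissue Dice computation is straightforward in numpy; a sketch, assuming integer-encoded label maps (function name illustrative):

```python
import numpy as np

def dice(manual, auto, label):
    """Dice similarity coefficient for one tissue label between two
    integer-encoded segmentation maps: 2|X ∩ Y| / (|X| + |Y|)."""
    x = manual == label
    y = auto == label
    denom = x.sum() + y.sum()
    # If the tissue is absent from both maps, treat it as perfect agreement.
    return 2.0 * np.logical_and(x, y).sum() / denom if denom else 1.0
```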

Layer segmentation
To test the effects of pseudo-labelling and adversarial training, the networks were trained using two different numbers of B-scans: 1000 B-scans (800 training, 200 validation) and 10,000 B-scans (8,000 training, 2,000 validation). The training set included data from 22 volumes (7 subjects, 9 eyes). The validation set included data from 4 volumes (2 subjects, 2 eyes). The training and validation sets had an equal split between control and glaucomatous B-scans. For the semi-supervised approaches, all unlabeled B-scans were available to be randomly selected for training. A total of 171 volumes (20 subjects, 40 eyes) were used for testing purposes. Subjects were divided between the training, validation, and testing sets such that they did not appear in more than one set. Longitudinal data (different time points) from the same subject were contained in the same set.
In order to compare the effects of the different architectures and training schemes, no postprocessing was done on the network outputs for the mean Dice coefficients reported in Table 2. As shown, the semi-supervised (SS) Pix2Pix GAN has higher Dice values for the regions inside the retina and choroid. The Dice values were slightly worse for the vitreous and scleral regions in this scheme, but this was mainly due to fluctuations in noise in these regions, and these errors were simple to remove in the post-processing steps described in Section 2.6. Therefore, the SS Pix2Pix GAN was chosen for parameterization in Section 3.3. Figure 4 shows representative B-scans from the test set and their segmentation results using the different architectures, which were trained on 10,000 labeled B-scans. The top two rows are from glaucomatous volumes, whereas the bottom two rows are from healthy volumes. The U-Net examples show consistent spikes in the segmentations, particularly in the glaucomatous volumes, which are not present in the ground truth segmentations. As neither of the GAN-based segmentations has this artifact, the adversarial learning technique provides segmentations that qualitatively look more similar to the ground truth. The qualitative differences between the semi-supervised and non-semi-supervised GAN are small, but can be seen in some instances such as the Bruch's membrane segmentation in the first row.

BMO segmentation
The mAP for the BMO segmentation network was 0.8547 for glaucomatous (n=21,596 manual BMO points) and 0.9567 for control subjects (n=23,052 manual BMO points). Qualitatively, the network performed quite well. Example B-scans showing BMO segmentations of both control and glaucomatous subjects are shown in Fig. 5.

Parameters
The clinical parameters extracted from the datasets are shown in Table 3. There was no statistically significant difference between the manual and automated measurements for the BMO area, though both thickness parameters were significantly thinner for the automated measurements. However, the coefficient of determination was above 0.97, showing excellent correlation.
Example BMO segmentations with corresponding ground truth segmentations are overlaid on the sum-voxel, en face view of the OCT volumes in Fig. 6. Manual segmentations (purple dots), the fit ellipse (green circle) and fit ellipse center (green star), as well as the automated segmentations (yellow dots), the fit ellipse (blue circle) and fit ellipse center (blue star), are shown for a control (A), high-myope control (B) and glaucomatous (C) subject. Example automated RNFL and choroidal thickness measurements from the semi-supervised Pix2Pix GAN method are overlaid on the sum-voxel, en face view of the OCT volumes in Fig. 7. Qualitatively, the RNFL thickness maps follow the characteristic pattern expected in the control eyes, whereas the glaucomatous eye (Fig. 7(C)) exhibits a much thinner RNFL, which is congruent with the pathophysiology of glaucoma. Additionally, the choroid is a highly vascular layer, and large differences in thickness can be seen in all three eyes with no apparent pattern.
Scatter plots showing the Pearson's correlation coefficient for glaucomatous and control data points are shown in Fig. 8. All clinical parameters showed good correlation between those extracted from automated and manual segmentations. Bland-Altman plots for the clinical parameters further confirm the high reliability of the automated measurements and are shown in Fig. 9.

Discussion
Training neural networks to segment diseased data generally requires a large amount of manually annotated ground truth images for fully supervised learning. Although a general U-Net architecture did show good performance when trained on 1000 B-scans (the equivalent of 2.5 OCT volumes), significantly better performance was shown for both adversarial training and fine-tuning using pseudo-labels when using a small dataset. This is particularly useful for medical imaging modalities, as expertly segmented scans are a finite and highly limited resource. Importantly, the results demonstrate that training the semi-supervised GAN approaches with 10× more labeled data yielded only a further 1-2% improvement in performance.
Fine-tuning the network using pseudo-labelling generally improved the Dice scores for the regions of interest; however, it should be noted that the regions above the ILM and below the choroid-sclera boundary became slightly worse after pseudo-labelling, as indicated in Table 2. This may be due to the reinforcement of poor segmentations during the pseudo-labelling training, particularly in regions of noise such as above the ILM and below the choroid-sclera boundary. As these errors generally present themselves as smaller pockets fully encased by the correctly segmented region, they are easily filtered out during the post-processing steps described in Section 2.6.
It is important to address that although the Dice similarity coefficients reported above for the peripapillary layers were satisfactory, they did not reach the values other papers [20,27] have reported for similar layer segmentation. This may be due to several factors. First, the variation in the anatomical shape and retinal layers in the region near the ONH is much higher than in the macula. Additionally, the manual segmentations provided for layer segmentation were completed for studies that did not consider the region inside the BMO. Therefore, manual raters were told not to correct inside the optic cup as it was not used for parameterization, causing Dice scores inside the optic cup to be significantly lower. As such, the clinical measurements provide a better idea of the accuracy of the networks.
The clinical thickness parameters extracted from the automatically segmented volumes were shown to be slightly thinner than those from the manually segmented volumes. There was, however, excellent correlation with the ground truth parameters, suggesting this was a small systematic difference. Additionally, through the Bland-Altman analysis, we see that the majority of datapoints fall within the limits of agreement. Although the thickness values were averaged to calculate a single score for ease of comparison, further 3D analysis could be done using these automated segmentations.
We were also able to show how a Faster R-CNN could be used to detect the BMO in radial OCT scans. From both the mAP and coefficient of determination values, BMO segmentation was shown to be better on control data than on glaucomatous data. This may be due to the larger number of control radial frames for training or the poorer quality of the glaucomatous dataset. However, most of the erroneous BMO segmentations were easily eliminated during post-processing (Section 2.6), leading to no significant difference in the BMO area parameter when comparing manual and automated methods, as seen in Table 3, with good correlation for both glaucoma (R² = 0.9353) and control eyes (R² = 0.9885) as shown in Fig. 8.
Future work using this pipeline may include looking at the 3D BMO minimum rim width (BMO-MRW), a parameter that has been shown to be useful in discriminating preperimetric and perimetric glaucoma [35]. Moreover, the datasets used contained a large amount of longitudinal data, which increased the number of B-scans but not of subjects. Therefore, future studies with a larger number of subjects are warranted. Additionally, the methods described in this paper are readily translatable to more of the retinal layers which may also be of interest to clinicians, such as the ganglion cell layer.

Conclusion
In this study, we presented a Generative Adversarial Network based method for segmenting the peripapillary ONH layers which outperformed the vanilla U-Net. Through the use of pseudo-labelling, B-scans that did not have a corresponding manual segmentation could still be used and provided a further increase in performance. A Faster R-CNN was also used to segment the BMO from the volumes, allowing for comparison of volumetric parameters. The BMO area showed no statistically significant difference, while the thickness parameters were slightly undersegmented but highly correlated.