Towards Label-Free 3D Segmentation of Optical Coherence Tomography Images of the Optic Nerve Head Using Deep Learning

Since the introduction of optical coherence tomography (OCT), it has been possible to study the complex 3D morphological changes of the optic nerve head (ONH) tissues that occur along with the progression of glaucoma. Although several deep learning (DL) techniques have been recently proposed for the automated extraction (segmentation) and quantification of these morphological changes, the device specific nature and the difficulty in preparing manual segmentations (training data) limit their clinical adoption. With several new manufacturers and next-generation OCT devices entering the market, the complexity in deploying DL algorithms clinically is only increasing. To address this, we propose a DL based 3D segmentation framework that is easily translatable across OCT devices in a label-free manner (i.e. without the need to manually re-segment data for each device). Specifically, we developed 2 sets of DL networks. The first (referred to as the enhancer) was able to enhance OCT image quality from 3 OCT devices, and harmonized image-characteristics across these devices. The second performed 3D segmentation of 6 important ONH tissue layers. We found that the use of the enhancer was critical for our segmentation network to achieve device independency. In other words, our 3D segmentation network trained on any of 3 devices successfully segmented ONH tissue layers from the other two devices with high performance (Dice coefficients>0.92). With such an approach, we could automatically segment images from new OCT devices without ever needing manual segmentation data from such devices.


Introduction
The complex 3D structural changes of the optic nerve head (ONH) tissues that manifest with the progression of glaucoma have been extensively studied and are better understood owing to advancements in optical coherence tomography (OCT) imaging [78]. These include changes such as the thinning of the retinal nerve fiber layer (RNFL) [10,62], changes in the choroidal thickness [51], minimum rim width [33], and lamina curvature and depth [34,68]. The automated segmentation and analysis of these parameters in 3D from OCT volumes could improve the current clinical management of glaucoma.
Recent deep learning (DL) based systems have exploited a combination of low-level (i.e. edge information, contrast and intensity profile) and high-level features (i.e. speckle pattern, texture, noise) from OCT volumes to identify different tissues, yielding human-level [21,22,27,54,77,82,69,86] and pathology-invariant [21,22,69] segmentations. Yet, given the variability in image characteristics (e.g. contrast or speckle noise) across devices as a result of proprietary processing software [12], a DL system designed for one device cannot be directly translated to others [79]. Since it is common for clinics to own different OCT devices, and for patients to be imaged by different OCT devices during their care, the device-specific nature of these DL algorithms considerably limits their clinical adoption.
While there currently exist only a few major commercial manufacturers of spectral-domain OCT (SD-OCT), such as Carl Zeiss Meditec (Dublin, CA, USA), Heidelberg Engineering (Heidelberg, Germany), Optovue Inc. (Fremont, CA, USA), Nidek (Aichi, Japan), Optopol Technology (Zawiercie, Poland), Canon Inc. (Tokyo, Japan), and Leica Microsystems (Wetzlar, Germany), several others have already started to release, or will soon release, next-generation OCT devices. This further increases the complexity of deploying DL algorithms clinically. Given that reliable segmentations [12] are an important step towards diagnosing glaucoma accurately, there is a need for a single DL segmentation framework that is not only translatable across devices, but also versatile enough to accept data from next-generation OCT devices.
In this study, we developed a DL-based 3D segmentation framework that is easily translatable across OCT devices in a label-free manner (without the need to manually re-segment data for each device). To achieve this, we first designed an enhancer: a DL network that can improve the quality of OCT B-scans and harmonize image characteristics across OCT devices. Because of such pre-processing, we demonstrate that a segmentation framework trained on one device can be used to segment volumes from other unseen devices.

Overview
The proposed study consisted of two parts: (1) image enhancement, and (2) 3D segmentation.
We first designed and validated a DL based image enhancement network to simultaneously de-noise (reduce speckle noise), compensate (improve tissue visibility and eliminate artefacts) [32], contrast enhance (better differentiate tissue boundaries) [32], and histogram equalize (reduce intensity inhomogeneity) OCT B-scans from three commercially available SD-OCT devices (Spectralis, Cirrus, RTVue). The network was trained and tested with images from all three devices.
A 3D DL-based segmentation framework was then designed and validated to isolate six ONH tissues from OCT volumes. The framework was trained and tested separately on OCT volumes from each of the three devices with and without image enhancement.

Patient Recruitment
A total of 450 patients were recruited from four centers: the Singapore National Eye Center (Singapore), Rajan Eye Care Hospital (Chennai, India), Aravind Eye Hospital (Madurai, India), and the East Avenue Medical Center (Quezon City, Philippines) (Table 1). All subjects gave written informed consent. The study adhered to the tenets of the Declaration of Helsinki and was approved by the institutional review board of the respective hospitals. The cohort comprised 225 healthy and 225 glaucoma subjects. The inclusion criteria for healthy subjects were: an intraocular pressure (IOP) less than 21 mmHg, healthy optic discs with a vertical cup-disc ratio (VCDR) less than or equal to 0.5, and normal visual field tests. Glaucoma was diagnosed with the presence of glaucomatous optic neuropathy (GON), VCDR > 0.7 and/or neuroretinal rim narrowing with repeatable glaucomatous visual field defects. We excluded subjects with corneal abnormalities that could compromise the quality of the scans.

Optical Coherence Tomography Imaging
All 450 subjects were seated and imaged using spectral-domain OCT under dark room conditions in the respective hospitals. 150 subjects (75 healthy + 75 glaucoma) had one of their ONHs imaged using Spectralis (Heidelberg Engineering, Heidelberg, Germany), 150 (75 healthy + 75 glaucoma) using Cirrus (model: HD 5000, Carl Zeiss Meditec, Dublin, CA, USA), and another 150 (75 healthy + 75 glaucoma) using RTVue (Optovue Inc., Fremont, CA, USA). For glaucoma subjects, the eye with GON was imaged, and if both eyes met the inclusion criteria, one eye was randomly selected. For healthy controls, the right ONH was imaged. The scanning specifications for each device can be found in Table 1.
From the dataset of 450 volumes, 390 (130 from each device) were used for training and testing the image enhancement network, while the remaining 60 (20 from each device) were used for training and testing the 3D segmentation framework.

Image Enhancement
The enhancer network was trained to reproduce simple mathematical operations including spatial averaging, compensation, contrast enhancement, and histogram equalization. When using images from a single device, the use of a DL network to perform such operations would be seen as unnecessary, as one could readily use the mathematical operators instead. However, when mixing images from multiple devices, besides performing such enhancement operations, the network also reduces the differences in image characteristics across the devices, resulting in images that are harmonized (i.e. less device-specific). This is a necessary step for robust, device-independent 3D segmentation.
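To make one of these "simple mathematical operations" concrete, the following is a minimal sketch of histogram equalization on a toy image. It is not the authors' digital enhancement pipeline, whose exact implementation is not given here; the function name `equalize_histogram`, the list-of-lists image layout, and the 8-bit intensity range are assumptions made for illustration only.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a 2D list of integer intensities in [0, levels - 1]."""
    flat = [pixel for row in image for pixel in row]
    if not flat:
        raise ValueError("image must contain at least one pixel")
    # Count how often each intensity occurs.
    hist = [0] * levels
    for pixel in flat:
        hist[pixel] += 1
    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = min(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: there is nothing to spread out.
        return [list(row) for row in image]
    # Map the cumulative distribution onto the full intensity range.
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[pixel] - cdf_min) * scale) for pixel in row]
            for row in image]


b_scan = [[50, 50], [100, 150]]  # toy 2x2 stand-in for a B-scan
equalized = equalize_histogram(b_scan)
```

The darkest intensity is mapped to 0 and the brightest to `levels - 1`, which is the sense in which equalization reduces intensity inhomogeneity within a single image.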

Image Enhancement Dataset Preparation
The 390 volumes were first resized (in pixels) to 448 (height) The image enhancement network was then trained with 36,000 pairs (12,000 per device) of baseline and digitally-enhanced B-scans, respectively. Another 1,440 pairs were used for testing. B-scans from the same patient were not shared between training and testing.

Image Enhancement Network Description
Briefly, as in our earlier DL-based image enhancement study [23], the proposed enhancer exploited the inherent advantages of U-Net [38] and its skip connections [61], residual learning [44], dilated convolutions [92], and multi-scale hierarchical feature extraction [53]. We used the same network architecture, except that the output layer was now activated by the sigmoid activation function [67] (originally tanh). The design, implementation, significance of each component, and data augmentation details are described in our earlier study [23]. The loss function was a weighted combination of the root mean square error (RMSE) and a multi-scale perceptual loss [41] function that was based on the VGG-19 DL model [80].
Pixel-to-pixel loss functions (e.g., RMSE) compare only the low-level features (i.e., edge information) between the DL prediction and its corresponding ground truth, often leading to over-smoothed (blurred) images [41], especially in image-to-image translation problems (e.g., de-noising). In contrast, perceptual-loss-based functions exploit the high-level features (i.e., texture, abstract patterns) [41,13,7,50] in these images to assess their differences, enabling the DL network to achieve human-like visual understanding [94]. Thus, a weighted combination of both loss functions allows the DL network to preserve the low- and high-level features in its predictions, limiting the effects of blurring. To compute the perceptual loss, the output of the enhancer (referred to as the 'DL-enhanced' B-scan) and its corresponding digitally-enhanced B-scan were separately passed through the VGG-19 [80] DL model that was pre-trained on the ImageNet dataset [19]. Feature maps at multiple scales (outputs from the 2nd, 4th, 6th, 10th, and 14th convolutional layers) were extracted, and the perceptual loss was computed as the mean RMSE (average of all scales) between the extracted features from the DL-enhanced and its corresponding digitally-enhanced B-scan.
Experimentally, the RMSE and perceptual loss when combined in a weighted-ratio of 1.0:0.01 offered the best performance (qualitative and quantitative; as described below).
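The weighted combination described above can be sketched as follows. This is an illustrative approximation rather than the authors' training code: the feature maps are passed in as plain flattened lists (hypothetical stand-ins for the VGG-19 activations at each scale), and the function names `rmse`, `perceptual_loss`, and `combined_loss` are assumptions. Only the 1.0:0.01 weighting and the "mean RMSE over scales" definition come from the text.

```python
import math


def rmse(a, b):
    """Root mean square error between two equal-length flat sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def perceptual_loss(pred_features, target_features):
    """Mean RMSE across feature maps taken at several scales.

    Each argument is a list with one flattened feature map per scale
    (hypothetical placeholders for VGG-19 layer outputs).
    """
    per_scale = [rmse(p, t) for p, t in zip(pred_features, target_features)]
    return sum(per_scale) / len(per_scale)


def combined_loss(pred, target, pred_features, target_features,
                  w_rmse=1.0, w_perceptual=0.01):
    """Weighted sum of the pixel-wise RMSE and the multi-scale perceptual loss."""
    return (w_rmse * rmse(pred, target)
            + w_perceptual * perceptual_loss(pred_features, target_features))


# Toy example: pixel RMSE = 1.0, perceptual loss = mean(2.0, 0.0) = 1.0.
loss = combined_loss(
    pred=[1.0, 1.0], target=[0.0, 0.0],
    pred_features=[[2.0, 2.0], [0.0, 0.0]],
    target_features=[[0.0, 0.0], [0.0, 0.0]],
)
```

With the small perceptual weight, the pixel-wise term dominates the magnitude of the loss while the perceptual term acts as a regularizer against blurring.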
The enhancer comprised a total of 900K trainable parameters, and was trained end-to-end using the Adam optimizer [24] with a learning rate of 0.0001. We trained and tested on an NVIDIA GTX 1080 Founders Edition GPU with CUDA 10.1 and cuDNN v7.5 acceleration. Using the given hardware configuration, the DL network enhanced a single baseline B-scan in under 25 ms.

Image Enhancement Qualitative Analysis
Upon training, the network was used to enhance the unseen baseline B-scans from all the three devices. The DL-enhanced B-scans were qualitatively assessed by two expert observers (S.K.D and T.P.H) for the following: (1) noise reduction, (2) deep tissue visibility and blood vessel shadows, (3) contrast enhancement and intensity inhomogeneity, and (4) DL induced artifacts.

Image Enhancement Quantitative Analysis
The following metrics were used to quantitatively assess the performance of the enhancer: (1) universal image quality index (UIQI) [96], and (2) structural similarity index (SSIM) [97]. We used the UIQI to assess the extent of image enhancement (baseline vs. DL-enhanced B-scans), while the SSIM was used to assess the structural reliability of the DL-enhanced B-scans (digitally-enhanced vs. DL-enhanced).
Unlike the traditional error summation methods (e.g., RMSE) that compared only the intensity differences, the UIQI jointly modeled the (1) loss of correlation ($L_C$), (2) luminance distortion ($D_L$), and (3) contrast distortion ($D_C$) to assess image quality [96]. It was defined as (x: baseline; y: DL-enhanced B-scan):

$$\mathrm{UIQI}(x, y) = L_C \times D_L \times D_C = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \times \frac{2 \mu_x \mu_y}{\mu_x^2 + \mu_y^2} \times \frac{2 \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$$

where $L_C$ measured the degree of linear correlation between the baseline and DL-enhanced B-scans; $D_L$ and $D_C$ measured the distortion in luminance and contrast, respectively; $\mu_x$, $\sigma_x$, $\sigma_x^2$ denoted the mean, standard deviation, and variance of the intensity for B-scan x, while $\mu_y$, $\sigma_y$, $\sigma_y^2$ denoted the same for B-scan y; $\sigma_{xy}$ was the cross-covariance between the two B-scans. The UIQI was defined between -1 (poor quality) and +1 (excellent quality). As in our previous study [23], the SSIM (x: digitally-enhanced; y: DL-enhanced B-scan) was defined as:

$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

The constants $C_1$ and $C_2$ (to stabilize the division) were chosen as 6.50 and 58.52, as recommended in a previous study [97]. The SSIM was defined between -1 (no similarity) and +1 (perfect similarity).
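As an illustration of how the two indices combine means, variances, and the cross-covariance, here is a minimal sketch that evaluates both over an entire image at once. This is a simplifying assumption: in practice these indices are typically computed over local sliding windows and then averaged, and the windowing used in this study is not stated here. The function names and the flat-list image layout are hypothetical.

```python
def _moments(x, y):
    """Means, variances, and cross-covariance of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return mx, my, vx, vy, cxy


def uiqi(x, y):
    """Universal image quality index (correlation x luminance x contrast)."""
    mx, my, vx, vy, cxy = _moments(x, y)
    denom = (vx + vy) * (mx ** 2 + my ** 2)
    if denom == 0:
        raise ValueError("UIQI is undefined for these inputs")
    return 4 * cxy * mx * my / denom


def ssim(x, y, c1=6.50, c2=58.52):
    """Structural similarity index with the stabilizing constants from the text."""
    mx, my, vx, vy, cxy = _moments(x, y)
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


pixels = [10.0, 20.0, 30.0, 40.0]
identical_uiqi = uiqi(pixels, pixels)          # identical images
inverted_uiqi = uiqi(pixels, pixels[::-1])     # perfectly anti-correlated
identical_ssim = ssim(pixels, pixels)
```

Identical inputs give the maximum value for both indices, while reversing the toy intensity ramp (same mean and variance, opposite correlation) drives the UIQI to its minimum.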

3D Segmentation Dataset Preparation
The 60 volumes used for training and testing the 3D segmentation framework (20 from each device, balanced with respect to pathology) were manually segmented (slice-wise) by an expert observer (SD) using Amira (version 6, FEI, Hillsboro, OR). The following classes of tissues were segmented ( . Noise (all regions below the choroid-sclera interface; in grey) and vitreous humor (black) were also isolated. We were unable to obtain a full-thickness segmentation of the LC due to limited visibility [59]. We also excluded the peripapillary sclera due to its poor visibility and the extreme subjectivity of its boundaries especially in Cirrus and RTVue volumes. To optimize computational speed, the volumes (baseline OCT + labels) for all three devices were resized (in voxels) to 112 (height) x 88 (width) x 48 (number of B-scans).

Deep Learning Based 3D Segmentation of the ONH
Recent studies have demonstrated that 3D CNNs can improve the reliability of automated segmentation [63,20,98,76,38,64,75,25], and even out-perform their 2D variants [20]. This is because 3D CNNs not only harness the information from each image, but also effectively combine it with the depth-wise spatial information from adjacent images. Despite its tremendous potential, the application of 3D CNNs in ophthalmology is still in its infancy [1,28,26,49,66,56], and has not yet been explored for the segmentation of the ONH tissues.
Further, there exist discrepancies in the delineation of ambiguous regions (e.g., choroid-sclera boundary, LC boundary) even among different well-trained DL models, depending upon the type and complexity of architecture/feature extraction, learning method, etc., causing variability in the automated measurements. To address this, recent DL studies have explored ensemble learning [69,8,18,31,45,52,55,73,95,48], a meta-learning approach that synergizes (combines and fine-tunes) [55] the predictions from multiple networks to offer a single prediction that is closest to the ground truth. Specifically, ensemble learning has been shown to generalize better and to increase the robustness of segmentations in OCT [69,18] and other medical imaging modalities [31,45,52,95].
In this study, we designed and validated ONH-Net, a 3D segmentation framework inspired by the popular 3D U-Net [98] to isolate six ONH tissues from OCT volumes. The ONH-Net consisted of three segmentation networks (3D CNNs) and one 3D CNN for ensemble learning (referred to as the ensembler). Each of the three segmentation CNNs offered an equally plausible segmentation, which were then synergized by the ensembler to yield the final 3D segmentation of the ONH tissues.

3D Segmentation Network Description
The design of the three segmentation CNNs was based on the 3D U-Net [98] and its variants [28]. Each of the three segmentation CNNs consisted of an encoder segment that extracted contextual features (i.e. spatial arrangement of tissues), and a decoder segment that extracted the local information (i.e. tissue texture). The encoder segment sequentially downsampled the feature maps using 3D max-pooling layers (stride=2,2,2), while the decoder segment sequentially upsampled them using 3D transposed convolutional layers (stride=2,2,2; filter size: 3x3x3; no of filters: 48).
The latent space, implemented using residual blocks similar to our earlier study [23], transferred the extracted features from the encoder to the decoder segment. The use of residual learning improved the flow of gradient information through the network. Skip connections [74] between the encoder and decoder segments helped the DL network to jointly learn the contextual and local information, and the relationships between them.
Also, as implemented in our earlier study [23], we used multi-scale hierarchical feature extraction to improve the delineation of tissue boundaries. The feature maps obtained from multi-scale hierarchical feature extraction were then added to the output of the decoder segment. The Type 1 and Type 2 FE (Figure 2 D; Types 1-2) units had a similar design, except that the input was pre-activated by the elu activation [16] in Type 2 FE.
In both the FE units, the input was passed through three parallel pathways: (1) the identity pathway; (2) the planar pathway; and (3) the volumetric pathway. The identity pathway, implemented using a 1x1x1 3D convolutional layer, allowed the unimpeded flow of gradient information throughout the network. In the planar pathway, the information from any two dimensions was extracted by the network at once (filter size: 3x3x1 [height x width]; 3x1x3 [height x depth]; 1x3x3 [width x depth]; 48 filters each). The volumetric pathway exploited the depth-wise spatially related and continuous information from all three dimensions at once (i.e., tissue morphology) using three 3D convolutional layers (filter size: 3x3x3; no of filters: 48). Finally, the feature maps from all the three pathways were added, batch normalized [39], and elu activated [16].
In the Type 3 FE (Figure 2 D) unit, the input was elu activated and passed on to three sets of simple residual blocks with 48, 96, and 144 filters, respectively. In each residual block, one 3D convolutional layer (filter size: 3x3x3) extracted the features, while a 1x1x1 3D convolution layer was used as the identity connection [44]. The feature maps were then added, elu activated, and passed on to the next block. Finally, the feature maps were batch normalized and elu activated.
For all three segmentation CNNs, the pre-final output feature maps (decoder output + multi-scale hierarchical feature extraction) were passed through a 3D convolutional layer (filter size: 1x1x1; no of filters: 8 [number of classes; 6 tissues + noise + vitreous humor]) and softmax activated to obtain the tissue-wise probability for each pixel. For each pixel, the tissue class of the highest probability was then assigned.
The ensembler (Figure 2 E) was then implemented using three sets of 3D convolutional layers (specifications for each set; filter size [no of filters]: 3x3x3 [48]; 3x3x3 [96]; 3x3x3 [192]). A dropout [81] of 0.50 was used between each set to reduce overfitting and improve the generalizability of the DL network. The feature maps were then passed through two dense layers of 64 and 8 units (number of classes) respectively, that were separated by a dropout layer (0.50). Finally, a softmax activation was applied to obtain the pixel-wise predictions.
Each of the three segmentation CNNs was first trained end-to-end with the same labeled dataset. The ONH-Net was then assembled by using the three trained CNNs as parallel input pipelines to the ensembler network (Figure 2 F). Finally, we trained the ONH-Net (ensembler weights: trainable; segmentation CNN weights: frozen) end-to-end using the same aforementioned labeled dataset. During this process, each segmentation CNN provided equally plausible segmentation feature maps (obtained from the last 3D convolution layer), which were then concatenated and fed to the ensembler for fine-tuning. The ONH-Net was trained separately for each device.
All the DL networks (segmentation CNNs, ONH-Net) were trained with the stochastic gradient descent (SGD; learning rate: 0.01; Nesterov momentum: 0.05 [85]) optimizer, and the Jaccard distance was used as the loss function [22]. We empirically observed that the SGD optimizer with Nesterov momentum offered better generalizability and faster convergence than the Adam optimizer [24] for OCT segmentation problems that typically use limited data, while Adam performed better for image-to-image translation problems (i.e., enhancement [23]) that use much larger datasets. However, we are as yet unable to explain this theoretically for our case. Given the limitations in hardware, all the DL networks were trained with a batch size of 1. To circumvent the scarcity of data, all the DL networks used custom data augmentation techniques (B-scan-wise) as in our earlier study [22]. We ensured that the same data augmentation was used for each B-scan in a given volume.
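For readers unfamiliar with the Jaccard distance as a training loss, the following is a minimal sketch of a common "soft" formulation for a single tissue class. The exact formulation used in [22] is not reproduced here, so the smoothing term, the function name `jaccard_distance`, and the flat-list inputs are assumptions for illustration.

```python
def jaccard_distance(predicted, target, smooth=1e-7):
    """Soft Jaccard distance between predicted probabilities and a binary mask.

    predicted: per-voxel probabilities in [0, 1] for one tissue class.
    target:    per-voxel ground-truth labels (0 or 1) for the same class.
    smooth:    small constant that avoids division by zero when both are empty.
    """
    if len(predicted) != len(target):
        raise ValueError("inputs must have equal length")
    intersection = sum(p * t for p, t in zip(predicted, target))
    union = sum(predicted) + sum(target) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)


# One of two predicted voxels overlaps the single ground-truth voxel:
# intersection = 1, union = 2, so the distance is close to 0.5.
partial = jaccard_distance([1.0, 1.0, 0.0, 0.0], [1, 0, 0, 0])
perfect = jaccard_distance([1.0, 0.0, 0.0, 0.0], [1, 0, 0, 0])
```

Because the loss is a ratio of overlap to union, it is less sensitive than a voxel-wise error to the strong class imbalance between thin tissue layers and the background.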

3D Segmentation Training and Testing
We used a five-fold cross-validation approach (for each device) to train and test the performance of ONH-Net. In this process, the labeled dataset (20 OCT volumes + manual segmentations) was split into five equal parts. One part (left-out set; 4 OCT volumes + manual segmentations) was used as the testing dataset, while the remaining four parts (16 OCT volumes + manual segmentations) were used as the training dataset. The entire process was repeated five times, each with a different left-out testing dataset (and corresponding training dataset). In total, for each device, the segmentation performance was assessed on 20 OCT volumes (4 per fold; five-fold cross-validation).
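The split described above can be sketched as follows. How the volumes were assigned to folds (e.g. shuffling, or balancing by diagnosis) is not stated, so the contiguous assignment and the function name `five_fold_splits` are assumptions made purely for illustration.

```python
def five_fold_splits(volume_ids, n_folds=5):
    """Return (train_ids, test_ids) pairs, one per fold, with equal-sized folds."""
    if len(volume_ids) % n_folds != 0:
        raise ValueError("number of volumes must be divisible by n_folds")
    fold_size = len(volume_ids) // n_folds
    splits = []
    for k in range(n_folds):
        start, stop = k * fold_size, (k + 1) * fold_size
        test_ids = volume_ids[start:stop]                    # left-out set
        train_ids = volume_ids[:start] + volume_ids[stop:]   # remaining folds
        splits.append((train_ids, test_ids))
    return splits


# 20 labeled volumes for one device -> 5 folds of 16 training + 4 testing.
splits = five_fold_splits(list(range(20)))
```

Every volume appears in exactly one testing set, which is why performance can be reported on all 20 volumes per device.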

3D Segmentation Performance Qualitative Analysis
The segmentations obtained from the trained ONH-Net on unseen data were manually reviewed by expert observers (S.D. and T.P.H) and compared against their corresponding manual segmentations.

3D Segmentation Performance Quantitative Analysis
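We used the following metrics to quantitatively assess the segmentation performance: (1) Dice coefficient (DC); (2) specificity (Sp); and (3) sensitivity (Sn). The Dice coefficient, which measures the spatial overlap between two segmentations, was defined for each tissue as:

$$DC = \frac{2 \, |D \cap M|}{|D| + |M|}$$

where $D$ and $M$ were the voxels that represented the chosen tissue in the DL segmented and the corresponding manually segmented volumes. Specificity (Sp) was used to assess the true negative rate of the segmentation framework, while sensitivity (Sn) was used to assess the true positive rate. They were defined as:

$$Sp = \frac{|\bar{D} \cap \bar{M}|}{|\bar{M}|}, \qquad Sn = \frac{|D \cap M|}{|M|}$$

where $\bar{D}$ represented the voxels that did not belong to the chosen tissue in the DL segmented volume, while $\bar{M}$ represented the same in the corresponding manually segmented volume.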
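A minimal sketch of how these three metrics can be computed for one tissue from voxel-wise labels is given below. The flat label lists, the integer class codes, and the function name `tissue_metrics` are hypothetical; the sketch uses the standard overlap, true-positive-rate, and true-negative-rate definitions.

```python
def tissue_metrics(dl_labels, manual_labels, tissue):
    """Dice coefficient, sensitivity, and specificity for one tissue class.

    dl_labels, manual_labels: flat sequences of per-voxel class labels.
    tissue: the class label of the tissue being evaluated.
    """
    if len(dl_labels) != len(manual_labels):
        raise ValueError("label sequences must have equal length")
    tp = fp = fn = tn = 0
    for d, m in zip(dl_labels, manual_labels):
        in_dl, in_manual = d == tissue, m == tissue
        if in_dl and in_manual:
            tp += 1          # voxel in both D and M
        elif in_dl:
            fp += 1          # in D only
        elif in_manual:
            fn += 1          # in M only
        else:
            tn += 1          # in neither
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("tissue must be present and absent somewhere in M")
    dice = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return dice, sensitivity, specificity


dl = [1, 1, 0, 0, 2, 2]
manual = [1, 0, 0, 0, 2, 1]
dice, sn, sp = tissue_metrics(dl, manual, tissue=1)
```

In this toy case the tissue has one true positive, one false positive, one false negative, and three true negatives.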

3D Segmentation Performance Effect of Image Enhancement
To assess if image enhancement had an effect on segmentation performance, we trained and tested ONH-Net on the baseline and the DL-enhanced datasets. For both datasets, ONH-Net was trained on any one device (Spectralis/Cirrus/RTVue), but tested on all the three devices (Spectralis, Cirrus, and RTVue). Paired t-tests were used to compare the differences (means) in the segmentation performance (Dice coefficients, sensitivities, specificities; mean of all tissues) for both cases.
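The paired comparison can be sketched as below. Which values were paired is assumed here to be the per-volume scores of the same test volumes under the two conditions (baseline vs. DL-enhanced); the function name `paired_t_statistic` is hypothetical. The sketch returns only the t statistic and degrees of freedom; a p-value would additionally require the t distribution (for example via `scipy.stats.ttest_rel`, which is not used here so that the example stays within the standard library).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for a paired t-test."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy numbers only (not study data): differences are 1, 2, 3.
t_stat, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Pairing removes the between-volume variability, so the test is sensitive to a consistent shift in performance even when individual volumes differ widely in difficulty.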

3D Segmentation Performance Device Independency
When tested on a given device (Spectralis/Cirrus/RTVue), paired t-tests were used to assess the differences (Spectralis vs. Cirrus; Cirrus vs. RTVue; RTVue vs. Spectralis) in the segmentation performance depending on the device used for training ONH-Net. The process was performed with both baseline and DL-enhanced datasets.

3D Segmentation Clinical Reliability Automated Parameter Extraction
Upon obtaining the DL segmentations, two clinically relevant structural parameters that are crucial for the diagnosis of glaucoma: the (1) peripapillary RNFL thickness (p-RNFLT); and the (2) peripapillary GCC thickness (p-GCCT) were automatically extracted as in our earlier works.
For each volume in the testing dataset, a circular scan of diameter 3.4 mm centered around the ONH [30] was obtained. The p-RNFLT (global) was computed as the distance between the inner limiting membrane and the posterior RNFL boundary (mean of 360° measures). The p-GCCT (global) was computed as the distance between the posterior RNFL boundary and the inner plexiform layer boundary (mean of 360° measures).
The intraclass correlation coefficients (ICCs) were obtained to compare the measurements computed from the DL and their corresponding manual segmentations for all cases.

Results
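The global thickness computation can be sketched as follows. The data layout is an assumption: each boundary is represented by a hypothetical callable that returns its depth (in mm) at an en-face position (x, y) in mm, and one sample is taken per degree around the circle. The actual extraction procedure from the authors' earlier works is not reproduced here.

```python
import math


def circle_points(center_xy_mm, diameter_mm=3.4, n_angles=360):
    """En-face sample positions, one per degree, on a circle around the ONH."""
    cx, cy = center_xy_mm
    radius = diameter_mm / 2
    return [(cx + radius * math.cos(math.radians(a)),
             cy + radius * math.sin(math.radians(a)))
            for a in range(n_angles)]


def mean_layer_thickness(anterior_depth_mm, posterior_depth_mm,
                         center_xy_mm, diameter_mm=3.4, n_angles=360):
    """Mean distance between two boundaries along the circular scan.

    anterior_depth_mm, posterior_depth_mm: hypothetical callables mapping an
    (x, y) position in mm to the boundary depth in mm at that position.
    """
    points = circle_points(center_xy_mm, diameter_mm, n_angles)
    thickness = [posterior_depth_mm(x, y) - anterior_depth_mm(x, y)
                 for x, y in points]
    return sum(thickness) / len(thickness)


# Toy boundaries: a flat anterior surface and a posterior surface 0.1 mm deeper.
global_thickness = mean_layer_thickness(
    anterior_depth_mm=lambda x, y: 0.0,
    posterior_depth_mm=lambda x, y: 0.1,
    center_xy_mm=(0.0, 0.0),
)
scan_points = circle_points((0.0, 0.0))
```

The same routine applies to either parameter by swapping the pair of boundaries passed in.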

Image Enhancement Qualitative Analysis
The enhancer was tested on a total of 1,440 (480 from each device) unseen baseline B-scans. In the DL-enhanced B-scans from all three devices (Figure 3, Column 3), the ONH tissue boundaries appeared sharper with a uniformly enhanced intensity profile (compared to the respective baseline B-scans). The blood vessel shadows were also reduced, with improved deep tissue (choroid-scleral interface, LC) visibility. In all cases, the DL-enhanced B-scans were consistently similar to their corresponding digitally-enhanced B-scans (Figure 3, Column 2), with no DL-induced artifacts.

3D Segmentation Performance Qualitative Analysis
When trained and tested on the baseline volumes from the same device (Figure 4 -Figure 6; 4th column), ONH-Net successfully isolated all ONH layers. Further, the DL segmentations appeared consistent with their respective manual segmentations (Figure 4 -Figure 6; 3rd column), with no difference in the segmentation performance between glaucoma and healthy OCT volumes. A representative case of the segmentations in 3D when trained on Spectralis and tested on the other three devices is shown in Figure 7.

3D Segmentation Performance Effect of Image Enhancement and Device Independency
Without image enhancement (baseline dataset), ONH-Net trained with one device was unable to segment even a single ONH tissue reliably on the other two devices (Figure 4; Rows 2,3,5,6; Column 4; similarly for Figures 5-6). In all cases, Dice coefficients were always lower than 0.65, sensitivities lower than 0.77, and specificities lower than 0.80.
However, with image enhancement (DL-enhanced dataset), ONH-Net trained with one device was able to accurately segment all tissue layers on the other two devices with mean Dice coefficients and sensitivities > 0.92 (Figures 4-6, Column 5). In addition, ONH-Net performed better for several ONH layers (p < 0.05) when it was tested on the same device that it was trained on. The tissue-wise quantitative metrics for the aforementioned cases can be found in Tables 2-4.
Further, when trained and tested with the DL-enhanced OCT volumes, irrespective of the device used for training, there were no significant differences (p > 0.05) in the segmentation performance for all tissues (Figures 8-10), except for the LC. The tissue-wise quantitative metrics for the individual cases can be found in Tables 5-7.

3D Segmentation Clinical Reliability Automated Parameter Extraction
When trained and tested (same device) on the baseline OCT volumes, the ICCs were always greater than 0.99 for both the p-RNFLT and the p-GCCT. However, when tested on the other two devices, since ONH-Net was unable to segment even a single tissue reliably, we did not extract the p-RNFLT and the p-GCCT for these cases.
When the same analysis was repeated with the DL-enhanced volumes, irrespective of the device used for training, the ICCs were always greater than 0.98 for all cases, indicating excellent reliability.

Discussion
In this study, we proposed a 3D segmentation framework (ONH-Net) that is easily translatable across OCT devices in a label-free manner (i.e. without the need to manually re-segment data for each device). Specifically, we developed 2 sets of DL networks. The first (referred to as the enhancer) was able to enhance OCT image quality from 3 OCT devices, and harmonized image-characteristics across these devices. The second performed 3D segmentation of 6 important ONH tissue layers. We found that the use of the enhancer was critical for our segmentation network to achieve device independency. In other words, our 3D segmentation network trained on any of 3 devices successfully segmented ONH tissue layers from the other two devices with high performance.
Our work suggests that it is possible to automatically segment OCT volumes from a new OCT device without having to re-train ONH-Net with manual segmentations from that device. Besides existing commercial SD-OCT manufacturers, the democratization and emergence of OCT as the clinical gold standard for in vivo ophthalmic examinations [29] have encouraged the entry of several new manufacturers to the market as well. Further, owing to advancements in imaging technology, there has been a rise of next-generation devices: swept-source [91], polarization-sensitive [17], and adaptive-optics [70] based OCTs. Given that preparing reliable manual segmentations (training data) for OCT-based DL algorithms requires months of training for a skilled technician, and that it would take more than 8 hours of manual work to accurately segment just a single 3D volume for just a limited number of tissue layers (here 6), it will soon become practically infeasible to perform manual segmentations for all OCT brands, device models, generations, and applications. Furthermore, only a few research groups have successfully managed to exploit DL to fully isolate ocular structures from 3D OCT images [21,22,27,54,77,86], and only for a very limited number of devices. There is therefore a strong need for a single DL segmentation framework that can easily be translated across all existing and future OCT devices, thus eliminating the excruciating task of preparing training datasets manually. Our approach provides a high-performing solution to that problem. Eventually, we believe, this could open doors for multi-device glaucoma management.
In this study, we found that the use of the enhancer was crucial for ONH-Net to achieve device independency, in other words, the ability to segment OCT volumes from devices it had not been trained with earlier. This can be attributed to the design of the proposed DL networks that allowed a perception of visual information through a host of low-level (e.g. tissue boundaries) and high-level abstract features (e.g. speckle pattern, intensity, and contrast profile). When image enhancement was used as a pre-processing step, the enhancer not only improved the quality of low-level features, but also reduced differences in high-level abstract features across OCT devices, thus deceiving ONH-Net into perceiving volumes from all three devices similarly. This enabled ONH-Net trained on the DL-enhanced OCT volumes from one device to successfully isolate the ONH tissues from the other two devices with very high performance (mean Dice coefficients > 0.92). Note that such a performance is superior to that of our previous 2D segmentation framework, which also had the additional caveat that it only worked on a single device [22]. In addition, irrespective of the device used for training, there were no significant differences (p > 0.05) in segmentation performance. In all cases, our DL segmentations were deemed clinically reliable.
In a recent landmark study, De Fauw et al. [18] proposed the idea of using device-independent representations (segmentation maps) for the diagnosis of retinal pathologies from OCT images. However, the study was not truly device-independent, as, even though the diagnosis network was device-independent, the segmentation network was still trained with multiple devices. Similarly, our approach may not truly be considered device-independent. While ONH-Net is device-independent, the enhancer (on which ONH-Net relies) needs to be trained with data from all considered devices. However, this is still a very acceptable option, because the enhancer only requires unlabeled images (i.e. non-segmented; 100 OCT volumes) for any new device that is being considered, after which automated segmentation can be performed without ever needing manual segmentations for that new device. Such a task would require a few minutes rather than the several weeks/months needed for manual segmentations.
Finally, the proposed approach should not be confused with transfer learning [88], a DL technique gaining momentum in medical imaging [52,11,58,36,72]. In this technique, a DL network is first pre-trained on large-size datasets (e.g. ImageNet [19]), and when subsequently fine-tuned on a smaller dataset for the task of interest (e.g. segmentation), it re-uses the pre-trained knowledge (high-level representations [e.g. edges, shapes]) to generalize better. In our approach, the generalization of ONH-Net was achieved using the enhanced images, and not the actual knowledge of the enhancer network, thus keeping the learning of both the networks mutually exclusive, yet necessary.
There are several limitations to this study that warrant further discussion. First, we used only 20 volumes in total to test the segmentation performance for each device. Second, the study was performed only using spectral-domain OCT devices, but not swept-source. Third, although the enhancer simultaneously addressed multiple issues affecting image quality, we were unable to quantify the effect of each. Also, we were unable to quantify the extent to which the DL-enhanced B-scans were harmonized. Fourth, we observed slight differences in LC curvature and LC thickness when the LC was segmented using ONH-Net trained on different devices (Figure 8, Figure 9, Figure 10; 2nd and 4th rows). Given the significance of LC morphology in glaucoma [47], this subjectivity could affect glaucoma diagnosis. This has yet to be tested. Further, in a few B-scans (Figure 8, Figure 9, Figure 10; 6th column), we observed that the GCC segmentations were thicker when ONH-Net was trained on volumes from the RTVue device. This variability might limit truly multi-device glaucoma management. We are currently exploring the use of advanced DL concepts such as semi-supervised learning [9] to address these issues, which may have occurred as a result of limited training data.
Finally, although ONH-Net was invariant to volumes with glaucoma, it is unclear if the same will be true in the presence of other conditions such as cataract [35], peripapillary atrophy [42], and high myopia [90] that commonly co-exist with glaucoma.
In conclusion, we demonstrate as a proof of concept that it is possible to develop DL segmentation tools that are easily translatable across OCT devices without ever needing additional manual segmentation data. To the best of our knowledge, our work is the first of its kind to propose a framework that could increase the clinical adoption of DL tools and eventually simplify glaucoma management. Finally, we hope the proposed framework can facilitate the longitudinal follow-up of patients across multiple devices, and encourage multi-center glaucoma studies.